Re: [R] Removing words and initials with tm

2015-04-11 Thread Sun Shine
Hi Jim

The name's come up on my radar, but that's about it. I'll look into it.

Thanks for the reference.

All the best
S

On 10/04/15 23:36, Jim Lemon wrote:
 Hi Sun,
 No, I was thinking of something like hunspell, which seems to fit into 
 the sort of work that you are doing.

 Jim


 On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine phaedr...@gmail.com wrote:

 Thanks Jeff.

 I'll add that to the ever-growing list my current studies are
 generating daily. :-)

 Cheers
 S



 On 10/04/15 14:32, Jeff Newmiller wrote:

 I suspect that it might have something to do with regular
 expressions, but to be honest, I'm (currently) pretty crap
 with those.

 I cannot think of a better incentive to take action on this
 hole in your education and buckle down to learn regular
 expressions. There are many books and tutorials available.
 
 ---
 Jeff Newmiller
 DCN: jdnew...@dcn.davis.ca.us
 Research Engineer (Solar/Batteries/Software/Embedded Controllers)
 ---
 Sent from my phone. Please excuse my brevity.

 On April 10, 2015 3:19:51 AM PDT, Sun Shine phaedr...@gmail.com wrote:

 Hi list

 Using the tm package, part of the pre-processing work is to remove
 words, etc. from the corpus.

 I wish to remove people's names and also their initials, which are
 peppered throughout the corpus. The problem is that some people's
 initials are the same as parts of common words, so removing them
 mangles those words: e.g. removing 'am' turns 'became' into 'bec e',
 removing 'ec' turns 'because' into 'b ause', and removing 'ar' turns
 'arrival' into 'rival' (which has a completely different meaning).

 Is there any way of doing this without leaving a trail of nonsense
 half-terms behind? I suspect that it might have something to do with
 regular expressions, but to be honest, I'm (currently) pretty crap
 with those.

 Would it make a difference if I removed initials and names *prior* to
 converting all text to lower case, so that I remove 'AM' and, because
 'became' is lower case, it remains unaffected?

 Any recommendations on how best to proceed with this?

 Thanks as always.
 Sun

 __
 R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.





Re: [R] Removing words and initials with tm

2015-04-11 Thread Bob Green

Hello Sun,

The order of the tm transformations makes a lot of difference.
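
To illustrate the ordering point, here is a minimal sketch in base R rather than tm. The example sentence and the initials "AM" are made up for illustration, and the whitespace clean-up is just one way of tidying the gaps; it is not meant as the definitive recipe.

```r
# Hypothetical example text; "AM" stands in for a person's initials.
x <- "AM said I am late because AM became ill"

# Option 1: remove the upper-case initials first (case-sensitive,
# whole word only), then convert to lower case.
step1 <- gsub("\\bAM\\b", "", x, perl = TRUE)
step2 <- tolower(step1)
initials_first <- trimws(gsub("\\s+", " ", step2))  # collapse leftover gaps

# Option 2: convert to lower case first. The initials are now
# indistinguishable from the verb "am", so both get removed.
lowered <- tolower(x)
lower_first <- gsub("\\bam\\b", "", lowered, perl = TRUE)
lower_first <- trimws(gsub("\\s+", " ", lower_first))

print(initials_first)
print(lower_first)
```

With the first ordering the verb "am" survives; with the second it is lost along with the initials. In both cases "became" is untouched because the pattern only matches whole words.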

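It isn't a shortcut, but if you identify all the names you could create
your own stop words list:


# "name1" and "name2" are placeholders for the names you identified
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "name1", "name2"))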
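
The half-word problem described in the original question comes from matching substrings instead of whole words. Below is a rough base R sketch of whole-word removal driven by a custom list. The list `to_drop` and the example sentence are invented for illustration. As far as I know, tm's removeWords also matches on word boundaries, but treat that as an assumption and check it against your installed version.

```r
# Hypothetical initials and names to strip; replace with your own list.
to_drop <- c("am", "ec", "ar", "smith")

x <- "am and ec noted that smith became ill because of the arrival"

# Substring matching: removes the letters wherever they occur,
# which damages "became", "because" and "arrival".
plain <- gsub(paste(to_drop, collapse = "|"), "", x)

# Whole-word matching: \\b anchors each alternative at word boundaries,
# so only free-standing terms are removed.
pattern <- sprintf("\\b(%s)\\b", paste(to_drop, collapse = "|"))
whole <- gsub(pattern, "", x, perl = TRUE)
whole <- trimws(gsub("\\s+", " ", whole))  # collapse leftover gaps

print(plain)
print(whole)
```

The `\\b` anchors are the only difference between the two calls, which is why a little regular expression knowledge goes a long way here.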

In the case of York, Key Word in Context (KWIC) syntax could be used
to check how certain words are used. You could identify the word
usages you want to remove or retain, and rename the relevant
instances accordingly.


This is labour intensive, but Gries, in his Quantitative Corpus
Linguistics, notes that sometimes time spent on trying to refine code
might be better spent on manual analysis (p. 164). This book includes a
KWIC-type function (p. 127), but I haven't been able to work out
how to modify it to read more than six words either side of the
specified word. Six should be adequate for your purpose. Jockers' book
also includes a KWIC function, but I don't believe it searches the
entire corpus, only a specified text.


I recently checked and tm doesn't have a KWIC function, but for the
R-talented (which excludes me) it might be possible to write one. For
example, Jim Holtman once wrote a KWIC function to identify word use
in a csv file.


Bob



Re: [R] Removing words and initials with tm

2015-04-10 Thread Jim Lemon
Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into the
sort of work that you are doing.

Jim



