Re: [R] Removing words and initials with tm
Hi Jim

The name's come up on my radar, but that's about it. I'll look into it.

Thanks for the reference.

All the best
S

On 10/04/15 23:36, Jim Lemon wrote:

Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into the sort of work that you are doing.

Jim

On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine phaedr...@gmail.com wrote:

Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating daily. :-)

Cheers
S

On 10/04/15 14:32, Jeff Newmiller wrote:

I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those.

I cannot think of a better incentive to take action on this hole in your education and buckle down to learn regular expressions. There are many books and tutorials available.

---
Jeff Newmiller
DCN: jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.

On April 10, 2015 3:19:51 AM PDT, Sun Shine phaedr...@gmail.com wrote:

Hi list

Using the tm package, part of the pre-processing work is to remove words, etc. from the corpus.

I wish to remove people's names and also their initials which are peppered throughout the corpus. But, because some people's initials are the same as parts of common words - e.g. 'am' => 'became' => 'bec e' or 'ec' => 'because' => 'b ause' or 'ar' => 'arrival' => 'rival' (which has a completely different meaning).

Is there any way of doing this without leaving a trail of nonsense half-terms behind? I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those.

Would it make a difference if I removed initials and names *prior* to converting all text to lower case, so I remove 'AM' and because 'became' is lower case, it should remain unaffected?

Any recommendations on how best to proceed with this?

Thanks as always.

Sun

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
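The two ideas raised in the question (regular expressions, and removing upper-case initials before lower-casing) can be tried on a toy string. What follows is only a minimal base-R sketch, not something posted in the thread and not tm's own method: the sample sentence and the list of initials are invented for illustration. The `\\b` anchors restrict matches to whole words, and the match is case-sensitive because the text has not yet been lower-cased.

```r
# Hypothetical sample text and initials, for illustration only
text <- "AM said it became clear because EC delayed the arrival, AR noted."
initials <- c("AM", "EC", "AR")

# \\b marks a word boundary, so only standalone tokens can match
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Case-sensitive removal, done while the text still has its original case
no_initials <- gsub(pattern, "", text, perl = TRUE)

# Collapse the whitespace left behind, then convert to lower case
cleaned <- tolower(trimws(gsub("\\s+", " ", no_initials)))
print(cleaned)
```

In this sketch 'became', 'because' and 'arrival' survive intact, because none of them contains a standalone upper-case 'AM', 'EC' or 'AR'. Lower-casing first would make a genuine word such as 'am' indistinguishable from the initials 'AM', which is why the order of the steps matters here.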
Re: [R] Removing words and initials with tm
Hello Sun,

The order of the tm transformations makes a lot of difference. It isn't a shortcut, but if you identify all the names you could create your own stop words list:

corpus <- tm_map(corpus, removeWords, c(english, ))

In the case of York, Key Word in Context (KWIC) syntax could be used to check how certain words are used. You could identify the word usages you want to remove or retain and rename the relevant instances accordingly. This is labour intensive, but Gries, in his Quantitative Corpus Linguistics, notes that sometimes time spent on trying to refine code might be better spent on manual analysis (p. 164).

This book includes a KWIC-type function (p. 127), but I haven't been able to work out how to modify it to read more than six words either side of the specified word. Six should be adequate for your purpose. Jockers' book also includes a KWIC function, but I don't believe it searches the entire corpus, rather a specified text.

I recently checked and tm doesn't have a KWIC function, but for the R talented (which excludes me) it might be possible to write one. For example, Jim Holtman once wrote a KWIC function to identify word use in a csv file.

Bob
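As a rough illustration of the KWIC idea described above, here is a minimal base-R sketch. It is not the function from either book mentioned, nor Jim Holtman's; the name `kwic_sketch`, its arguments and the sample documents are all hypothetical. It assumes the corpus is available as a plain character vector, one element per document, and that splitting on whitespace is an adequate tokenisation.

```r
# Hypothetical helper: keyword in context over a character vector of documents
kwic_sketch <- function(docs, keyword, window = 6) {
  hits <- character(0)
  for (doc in docs) {
    # Split on whitespace; punctuation stays attached to the tokens shown
    tokens <- strsplit(doc, "\\s+")[[1]]
    # Match against a lower-cased, punctuation-stripped copy of each token
    bare <- tolower(gsub("[[:punct:]]", "", tokens))
    for (i in which(bare == tolower(keyword))) {
      from <- max(1, i - window)
      to <- min(length(tokens), i + window)
      hits <- c(hits, paste(tokens[from:to], collapse = " "))
    }
  }
  hits
}

# Invented sample documents
docs <- c("He moved to York last year.",
          "AM spoke before lunch, and I am sure of it.")

york_hits <- kwic_sketch(docs, "york", window = 2)
am_hits <- kwic_sketch(docs, "am", window = 1)
print(york_hits)
print(am_hits)
```

Because the `window` argument is an ordinary parameter, the context is not limited to six words either side, and because the loop runs over every element of `docs`, the search covers all documents rather than a single text. Matching ignores case, so the context lines are what let you tell the initials 'AM' apart from the word 'am' by eye.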
Re: [R] Removing words and initials with tm
Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into the sort of work that you are doing.

Jim