Hi again

I've now had the chance to try this out, and using scan() doesn't seem to work either.

This is what I used:

1) I generated a plain text file called stopDict.txt. The file contains a single comma-separated line of the form: "a, bunch, of, words, to, use"

2) I invoked scan(), like this:
> userStopList <- scan(text = '~/path/to/stopDict.txt', what = " ", sep = ",")

3) Then I used the externally generated list as stop words:
> docs <- tm_map(docs, removeWords, userStopList)

4) When I inspected the documents afterwards, at least two of the user-defined stop words were still present in the text.

Is there a further argument I should be passing to scan(), or is the stopDict.txt file not set up correctly? I tried separating the terms with spaces and with commas (e.g. 'all', 'the', 'text'), but that didn't work; nor does it work when the whole list is enclosed in quotes (e.g. "all, the, text").
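
For what it's worth, a sketch of what may be going wrong (assuming the file really is at that path): scan()'s 'text' argument treats the quoted string itself as the input to parse, not as a file name, so a path belongs in the 'file' argument instead; adding strip.white = TRUE also drops the blanks that follow each comma:

> userStopList <- scan(file = '~/path/to/stopDict.txt', what = "", sep = ",", strip.white = TRUE)
> str(userStopList)  # expect a plain character vector, one word per element
> docs <- tm_map(docs, removeWords, userStopList)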

While it's not critical to be able to read in an externally generated list, it sure would be helpful.

Thanks.

Sun


On 02/03/15 07:36, Sun Shine wrote:
Thanks Jim.

I thought that I was passing a vector, not realising I had converted this to a list object.

I haven't come across the scan() function so far, so this is good to know.

Good explanation - I'll give this a go when I can get back to that piece of work later today.

Thanks again.

Regards,

Sun


On 01/03/15 21:13, jim holtman wrote:
The 'read.table' call was creating a data.frame (not a vector), and applying
'c' to it converted it to a list.  You should always look at the object
you are creating.  You probably want to use 'scan'.

======================
testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
# reading in with read.table creates a data.frame
df_words <- read.table(text = testFile, sep = ',')
df_words  # not a vector
         V1   V2    V3      V4           V5 V6  V7 V8      V9
1 Although this query applies specifically to the tm package
c(df_words)  # this results in a list
$V1
[1] Although
Levels: Although
$V2
[1] this
Levels: this
$V3
[1] query
Levels: query
$V4
[1] applies
Levels: applies
$V5
[1] specifically
Levels: specifically
$V6
[1] to
Levels: to
$V7
[1] the
Levels: the
$V8
[1] tm
Levels: tm
$V9
[1] package
Levels: package
# now read with 'scan'
scan_words <- scan(text = testFile, what = '', sep = ',')
Read 9 items
scan_words
[1] "Although"     "this"         "query"        "applies"
"specifically" "to"
[7] "the"          "tm"           "package"

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedr...@gmail.com> wrote:
Hi list

Although this query applies specifically to the tm package, perhaps it's
something that others might be able to lend a thought to.

Using tm to do some initial text mining, I want to use an externally generated
(outside R) dictionary of words to be removed from the corpus.

I have created a comma-separated list of terms in " " marks in a plain UTF-8
file called stopList.txt. I want to read this into R, so I do:

stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')

When I want to use it with the removeWords function in tm, I do:

docs <- tm_map(docs, removeWords, stopDict)

which has no effect. Neither does:

docs <- tm_map(docs, removeWords, c(stopDict))

What am I not seeing or doing?

How do I pass a text file with pre-defined terms to the removeWords
transform of tm?
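
(An alternative sketch that avoids read.table altogether, assuming stopList.txt holds a single comma-separated line: read it as raw text and split it, stripping the quote marks and any stray spaces by hand.)

raw <- readLines('~/path/to/file/stopList.txt', warn = FALSE)
stopDict <- trimws(gsub('"', '', unlist(strsplit(raw, ','))))  # plain character vector
docs <- tm_map(docs, removeWords, stopDict)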

Thanks for any ideas.

Cheers

Sun

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

