Dear fellow R users,

I'm using the tm package for text mining and have a problem reading in a corpus from XML files. When I copy the example from "Introduction to the tm package", which uses the small Reuters subset "crude", everything works and I get a corpus with the required metadata. However, when I read in the entire Reuters-21578 corpus in XML format (or a self-created subset of it), the metadata is lost and the files are interpreted as plain text. I use the following command, where the indicated directory contains all Reuters-21578 documents as separate XML files:

> reuters21578 <- Corpus(DirSource("C:/Data/Reuters/preprocessed"), readerContol=list(reader=readReut21578XML))
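One possible cause worth ruling out is the argument name: the call above spells it readerContol, whereas tm documents the argument as readerControl. If Corpus() accepts extra arguments through `...` (an assumption about its signature, not verified here), a misspelled name would be silently ignored and the default reader used, which would match the plain-text symptom. The base-R sketch below illustrates only that mechanism; make_corpus is a hypothetical stand-in, not tm's code.

```r
# Hypothetical stand-in for a constructor with a default reader and `...`.
# It is NOT tm's Corpus(); it only mimics the assumed argument handling.
make_corpus <- function(source,
                        readerControl = list(reader = "plain"),
                        ...) {
  # Return the name of the reader that would actually be used.
  readerControl$reader
}

# Misspelled argument: absorbed by `...`, so the default reader is kept.
misspelled <- make_corpus("some/dir", readerContol = list(reader = "xml"))

# Correctly spelled argument: the requested reader is used.
correct <- make_corpus("some/dir", readerControl = list(reader = "xml"))

print(misspelled)
print(correct)
```

If this is the cause, respelling the argument as readerControl in the original call should be enough to check.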

I'm running R 2.15.0 under Windows XP.

Has anybody else encountered this problem and found a cause or solution?

Best regards,

-Ad Feelders

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
