20newsgroups example/TestClassifier code - bug/oddity?

Loek Cleophas Thu, 18 Feb 2010 02:12:57 -0800

Hi

While playing around some more with the 20newsgroups example code for the Bayes classifiers, I ran into an oddity and what I presume is a bug:

Instead of using (parts of) the 20 newsgroups data set, which was split nicely into one file per newsgroup, with the 'category, tab, tokens' line format, I generated such a file out of our company data set. What I did, though, was generate 1 file to train with and 1 to test with, so each file can contain lines with different categories, e.g.


cars    Ferrari red ....
animals cow cat dog ....

In training, this works fine. In testing, it crashes TestClassifier with a null pointer exception. I presume that is because either the file name does not match category.txt for some category name, or because there are multiple categories being used inside the single file - but I also presume that neither should crash the thing :) It also brings up the question: if the line format in the data files has the category in there, then why are the file names relevant at all? Seems like redundancy to me. Shouldn't TestClassifier merely take all .txt files in the input data directory and process their contents?
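
As a rough sketch of that last suggestion (plain Python, not Mahout code; the helper name read_labelled_lines and its behaviour are hypothetical and not part of TestClassifier), a reader could take the category from each line and ignore the file names entirely:

```python
import glob
import os
from collections import defaultdict


def read_labelled_lines(data_dir):
    """Group token lists by the category named on each line.

    Hypothetical helper: reads every .txt file in data_dir and never
    looks at the file names to decide the category.
    """
    docs_by_category = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(data_dir, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line.strip():
                    continue  # skip blank lines
                # Assumed format: category, a single tab, then the tokens.
                category, sep, text = line.partition("\t")
                if not sep:
                    continue  # no tab, so not a 'category<TAB>tokens' line
                docs_by_category[category].append(text.split())
    return dict(docs_by_category)
```

With a single test file holding the two example lines above, this would yield one entry for cars and one for animals, whatever the file happens to be called.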


Regards,
Loek

20newsgroups example/TestClassifier code - bug/oddity?
