I will look into this.

On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <[email protected]> wrote:
> Hi
>
> While playing around some more with the 20newsgroups example code for the
> Bayes classifiers, I ran into an oddity and a presumable bug:
>
> instead of using (parts of) the 20 newsgroups data set, which was split
> nicely into one file per newsgroup, with the 'category, tab, tokens' line
> format, I generated such a file out of our company data set. What I did
> though was generate 1 file to train, and 1 to test with - so both files
> could have different lines having different categories, e.g.
>
> cars Ferrari red ....
> animals cow cat dog ....
>
> In training, this works fine. In testing, it crashes TestClassifier with a
> null pointer exception. I presume that is because either the file name does
> not match category.txt for some category name, or because there's multiple
> categories being used inside the single file - but I also presume that
> neither should crash the thing :) It also brings up the question: if the
> line format in the data files has the category in there, then why are the
> file names relevant at all? Seems like redundancy to me. Shouldn't
> TestClassifier merely take all .txt files in the input data directory and
> process their contents?
>
> Regards,
> Loek
>
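To make the suggestion above concrete, here is a rough sketch of the behaviour Loek describes: take every .txt file in the input directory, read the category from each line instead of from the file name, and skip malformed lines rather than crash. This is not Mahout's TestClassifier code. The class LabeledLineReader and its methods are hypothetical names made up for illustration, and it assumes the 'category, tab, tokens' format with whitespace-separated tokens and UTF-8 input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class LabeledLineReader {

    private LabeledLineReader() {}

    /**
     * Parses one "category<TAB>tokens" line.
     * Returns {category, tokens}, or null if the line is malformed
     * (no tab, empty category, or no tokens).
     */
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return null; // no tab at all, or nothing before the tab
        }
        String category = line.substring(0, tab).trim();
        String tokens = line.substring(tab + 1).trim();
        if (category.isEmpty() || tokens.isEmpty()) {
            return null;
        }
        return new String[] {category, tokens};
    }

    /**
     * Reads every *.txt file in dir and groups the tokenized documents by
     * the category found on each line. File names are ignored, so a single
     * file may mix any number of categories.
     */
    static Map<String, List<List<String>>> readDirectory(Path dir) throws IOException {
        Map<String, List<List<String>>> byCategory = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String[] parsed = parseLine(line);
                    if (parsed == null) {
                        continue; // skip malformed lines instead of failing
                    }
                    byCategory
                        .computeIfAbsent(parsed[0], k -> new ArrayList<>())
                        .add(Arrays.asList(parsed[1].split("\\s+")));
                }
            }
        }
        return byCategory;
    }
}
```

With a reader along these lines, a test file containing both "cars" and "animals" lines would simply yield two map entries, and a category that never appeared in training could be reported by the caller rather than dereferenced.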
