2010/1/18 Robin Anil <robin.a...@gmail.com>:
> It's this kind of thing that forced me to move to sequence files instead of
> TextKeyValueInput format and other text-based/CSV-based formats. Kind of
> regretting the decision to go with a tab-separated format for the
> BayesClassifier, which I wrote 2 years ago. I will be modifying this to use
> sparse vectors or sequence files, whichever fits.
>
> My thought is that this kind of functionality should only be used by the
> format converters that convert to and from sequence files, and when
> storing to sequence files we just enforce the \n rule for line breaks.
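
For what it's worth, here is a minimal sketch of how a converter could enforce the "\n rule" before storing text, assuming the rule means rewriting Windows (\r\n) and old Mac (\r) line endings as a single \n. The class name LineBreakNormalizer and its normalize method are hypothetical, not existing Mahout or Hadoop code, and the sketch leaves out the actual sequence file writing:

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class LineBreakNormalizer {

    // The alternation tries \r\n first so a Windows line ending becomes
    // one \n rather than two.
    private static final Pattern FOREIGN_BREAKS = Pattern.compile("\r\n|\r");

    private LineBreakNormalizer() {}

    // Returns the text with every \r\n or lone \r replaced by \n.
    static String normalize(String text) {
        return FOREIGN_BREAKS.matcher(text).replaceAll("\n");
    }

    public static void main(String[] args) {
        String mixed = "first\r\nsecond\rthird\n";
        // Compare against the expected Unix-style text.
        System.out.println(normalize(mixed).equals("first\nsecond\nthird\n"));
    }
}
```

A converter would call normalize on each text value just before writing it out, so that everything downstream only ever sees \n.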

By the way, I tried to run the Bayesian classifier's feature extractor on the
following Wikipedia chunk:

s3://enwiki-pages-articles/enwiki-20090810-pages-articles/chunk-0001.xml

I got an EOFException in Hadoop-related classes (no Mahout classes in the
stack trace). I wonder whether this is related to the line-break issue, or
maybe to the Java serialization used in that step. The feature extractor
works on all the other chunks I tried, though. All those chunks were extracted
on a Linux machine.

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name