Re: Random thought: line separators

Olivier Grisel Mon, 18 Jan 2010 05:59:40 -0800

2010/1/18 Robin Anil <robin.a...@gmail.com>:
> It's this kind of thing that forced a move to sequence files instead of
> TextKeyValueInput format and other text-based/CSV-based formats. I kind of
> regret the decision to go with a tab-separated format for BayesClassifier,
> which I wrote 2 years ago. I will be modifying this to use sparse vectors
> or sequence files, whichever fits.
>
> My thought is that this kind of functionality should only be used by the
> format converters that convert to and from sequence files, and when
> storing to sequence files they should just enforce the \n rule for line breaks.
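
To make the "\n rule" concrete, here is a minimal sketch of what a converter could do to each text value before storing it. It assumes plain Java strings; the class name LineBreaks and the method normalize are hypothetical and are not part of Mahout or Hadoop.

```java
public class LineBreaks {
    // Hypothetical helper: rewrite Windows (\r\n) and bare \r line breaks
    // as \n so that stored text uses a single line-break convention.
    static String normalize(String text) {
        // Handle \r\n first so it does not turn into two \n characters.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\rc\n";
        // Prints "true": all three line breaks are now \n.
        System.out.println(normalize(mixed).equals("a\nb\nc\n"));
    }
}
```

A converter would call something like this once, on the way in, so that downstream jobs never need to guess the separator.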


By the way, I tried to run the Bayesian classifier's feature
extractor on the following Wikipedia chunk:

s3://enwiki-pages-articles/enwiki-20090810-pages-articles/chunk-0001.xml

And I got an EOFException in Hadoop-related classes (no Mahout classes
in the stack trace). I wonder if this is related, or maybe it is
related to the Java serialization used in that step.

The feature extractor works on all the other chunks I tried, though. All
those chunks were extracted on a Linux machine.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
