I had always thought we should be using Hadoop to number these features and create the vectors the way the Bayes classifier does it. In the Bayes classifier, I don't bother to number the features; instead, I use a String=>double mapping. I will see if the feature numbering could be done by a single map/reduce job. If that's the case, we can use the TfIdfDriver to generate the tf-idf scores and then convert the docs into array(int=>double) vectors. That way it would be done in a distributed manner.
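A minimal sketch of what that dictionary step might look like (class and method names here are hypothetical, not Mahout API): assign each term a stable integer id the first time it is seen, then remap a document's String=>double weight map into an int-keyed sparse vector. In Hadoop the dictionary pass would be its own map/reduce job; this is just a single in-memory pass to show the remapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Mahout code: numbers features and converts
// String=>double term weights into int=>double sparse vectors.
public class FeatureNumbering {
    // term -> integer feature id, assigned in first-seen order
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Return the feature id for a term, assigning the next free id if unseen.
    public int idFor(String term) {
        Integer id = dictionary.get(term);
        if (id == null) {
            id = dictionary.size();
            dictionary.put(term, id);
        }
        return id;
    }

    // Remap a term=>weight map into a feature-id=>weight sparse vector.
    public Map<Integer, Double> vectorize(Map<String, Double> termWeights) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Double> e : termWeights.entrySet()) {
            vector.put(idFor(e.getKey()), e.getValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        FeatureNumbering fn = new FeatureNumbering();
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("hadoop", 0.5);
        doc.put("mahout", 0.3);
        System.out.println(fn.vectorize(doc)); // ids assigned in first-seen order
    }
}
```

Distributed, the same idea would split into a job that emits distinct terms and assigns ids, and a second pass that joins documents against that dictionary to produce the int-indexed vectors.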
Robin

On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <[email protected]> wrote:
> First, we need to create a Lucene index from this text. Typically, the index
> size is close to 30% of the raw text. (Though I have seen cases
> where it could be as high as 45%.) The vectors take 25% of the index size
> (or roughly 10% of the original text).
>
> The space taken by the index can be reclaimed after creating the vectors.
>
> --shashi
>
> On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <[email protected]> wrote:
> >
> > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> >
> >> Might be of interest to all you Mahouts out there...
> >> http://bixolabs.com/datasets/public-terabyte-dataset-project/
> >>
> >> Would be cool to get this converted over to our vector format so that we
> >> can cluster, etc.
> >
> > How much additional space would be required for the vectors, in some optimal
> > compressed format? Say as a percentage of raw text size.
> >
> > I'm asking because I have some flexibility in the processing and associated
> > metadata I can store as part of the dataset.
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c w e b m i n i n g
