In the intermediate representation, it is worth keeping the String -> double mappings in some form. In memory, we probably need to split this into separate String -> index and index -> double representations so that we have flexibility in how each one is represented.
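To make that split concrete, here is a minimal sketch in Java. Every name in it (FeatureDictionary, FeatureVectorSketch, encode, indexFor) is hypothetical and made up for illustration; none of it is existing Mahout code. It assumes a single in-memory dictionary that hands out indexes in first-seen order, which is only one of several ways the numbering could be done.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical String -> index dictionary (illustration only, not a Mahout class).
final class FeatureDictionary {
    private final Map<String, Integer> indexOf = new HashMap<>();

    // Returns the index already assigned to the feature, or assigns the next free one.
    int indexFor(String feature) {
        Integer idx = indexOf.get(feature);
        if (idx == null) {
            idx = indexOf.size();
            indexOf.put(feature, idx);
        }
        return idx;
    }

    int size() {
        return indexOf.size();
    }
}

// Hypothetical helper that turns String -> double weights into index -> double.
final class FeatureVectorSketch {
    static Map<Integer, Double> encode(Map<String, Double> weights, FeatureDictionary dict) {
        // TreeMap keeps entries sorted by index; a different sparse or dense
        // layout could be swapped in here without touching the dictionary.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            vector.put(dict.indexFor(e.getKey()), e.getValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        FeatureDictionary dict = new FeatureDictionary();

        Map<String, Double> doc1 = new LinkedHashMap<>();
        doc1.put("hadoop", 0.7);
        doc1.put("vector", 1.2);

        Map<String, Double> doc2 = new LinkedHashMap<>();
        doc2.put("vector", 0.4);
        doc2.put("lucene", 2.0);

        // Both documents share one dictionary, so "vector" gets the same index in each.
        System.out.println(encode(doc1, dict));
        System.out.println(encode(doc2, dict));
    }
}
```

Because the dictionary and the vector are separate objects, the index -> double side can change representation later while the String -> index side stays as it is.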
I am not sure which you intended.

On Wed, Nov 4, 2009 at 1:16 AM, Robin Anil <[email protected]> wrote:

> I had always thought we should be using Hadoop to number these features and
> create the vector the way Bayes Classifier does it. In Bayes classifier, I
> don't bother to number the feature. Instead use String=>double mapping. I
> will see if feature numbering could be done by a single map/reduce job. If
> that's the case, we can use the TfIdfDriver to generate the tfidf scores and
> then convert the docs into array(int=>double) vectors. That way it would be
> done in a distributed manner.
>
> Robin
>
> On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <[email protected]> wrote:
>
> > First, we need to create lucene index from this text. Typically, index
> > size is close to 30% of the raw text. (Though, I have seen cases,
> > where it could be as high as 45%). The vectors take 25% of index size
> > (Or, roughly 10% of original text)
> >
> > The space taken by index could be reclaimed after creating the vectors.
> >
> > --shashi
> >
> > On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <[email protected]> wrote:
> > >
> > > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> > >
> > >> Might be of interest to all you Mahouts out there...
> > >> http://bixolabs.com/datasets/public-terabyte-dataset-project/
> > >>
> > >> Would be cool to get this converted over to our vector format so that we
> > >> can cluster, etc.
> > >
> > > How much additional space would be required for the vectors, in some optimal
> > > compressed format? Say as a percentage of raw text size.
> > >
> > > I'm asking because I have some flexibility in the processing and associated
> > > metadata I can store as part of the dataset.
> > >
> > > -- Ken
> > >
> > > --------------------------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://bixolabs.com
> > > e l a s t i c w e b m i n i n g

--
Ted Dunning, CTO
DeepDyve
