First, we need to create a Lucene index from this text. Typically, the index size is close to 30% of the raw text (though I have seen cases where it could be as high as 45%). The vectors take 25% of the index size, or roughly 10% of the original text.
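To make the arithmetic concrete, here is a small back-of-the-envelope sketch using the rough ratios above (index ~30% of raw text, vectors ~25% of the index). These ratios are estimates from this thread, not measured values; note that 0.25 x 0.30 works out to 7.5% of the raw text, which approaches the "roughly 10%" figure when the index runs toward the 45% upper bound.

```python
def estimate_sizes(raw_text_bytes, index_ratio=0.30, vector_ratio_of_index=0.25):
    """Estimate Lucene index and vector sizes from raw text size.

    Ratios are the rough figures from this thread and can be overridden,
    e.g. index_ratio=0.45 for the worst case mentioned above.
    """
    index_bytes = raw_text_bytes * index_ratio
    vector_bytes = index_bytes * vector_ratio_of_index
    return index_bytes, vector_bytes

# For a hypothetical 1 TB raw-text dataset:
tb = 1_000_000_000_000
index_b, vectors_b = estimate_sizes(tb)
print(index_b / tb)    # index as a fraction of raw text (0.30)
print(vectors_b / tb)  # vectors as a fraction of raw text (0.075)
```

Since the index can be discarded once the vectors exist, only the vector fraction is a lasting storage cost.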
The space taken by the index could be reclaimed after creating the vectors.

--shashi

On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <[email protected]> wrote:
>
> On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
>
>> Might be of interest to all you Mahouts out there...
>> http://bixolabs.com/datasets/public-terabyte-dataset-project/
>>
>> Would be cool to get this converted over to our vector format so that we
>> can cluster, etc.
>
> How much additional space would be required for the vectors, in some optimal
> compressed format? Say as a percentage of raw text size.
>
> I'm asking because I have some flexibility in the processing and associated
> metadata I can store as part of the dataset.
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
