I had always thought we should be using Hadoop to number these features and create the vectors the way the Bayes classifier does it. In the Bayes classifier, I don't bother to number the features; instead, I use a String=>double mapping. I will see if the feature numbering could be done by a single map/reduce job. If that's the case, we can use the TfIdfDriver to generate the tf-idf scores and then convert the docs into array(int=>double) vectors. That way it would be done in a distributed manner.
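A minimal sketch of what that dictionary step might look like (class and method names here are hypothetical, not Mahout API): assign each term a stable integer id the first time it is seen, then remap a document's String=>double weight map into an int-keyed sparse vector. In Hadoop the dictionary pass would be its own map/reduce job; this is just a single in-memory pass to show the remapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Mahout code: numbers features and converts
// String=>double term weights into int=>double sparse vectors.
public class FeatureNumbering {
    // term -> integer feature id, assigned in first-seen order
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Return the feature id for a term, assigning the next free id if unseen.
    public int idFor(String term) {
        Integer id = dictionary.get(term);
        if (id == null) {
            id = dictionary.size();
            dictionary.put(term, id);
        }
        return id;
    }

    // Remap a term=>weight map into a feature-id=>weight sparse vector.
    public Map<Integer, Double> vectorize(Map<String, Double> termWeights) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Double> e : termWeights.entrySet()) {
            vector.put(idFor(e.getKey()), e.getValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        FeatureNumbering fn = new FeatureNumbering();
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("hadoop", 0.5);
        doc.put("mahout", 0.3);
        System.out.println(fn.vectorize(doc)); // ids assigned in first-seen order
    }
}
```

Distributed, the same idea would split into a job that emits distinct terms and assigns ids, and a second pass that joins documents against that dictionary to produce the int-indexed vectors.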
Robin

On Wed, Nov 4, 2009 at 1:35 PM, Shashikant Kore <[email protected]> wrote:
> First, we need to create a Lucene index from this text. Typically, the index
> size is close to 30% of the raw text. (Though I have seen cases
> where it could be as high as 45%.) The vectors take 25% of the index size
> (or roughly 10% of the original text).
>
> The space taken by the index can be reclaimed after creating the vectors.
>
> --shashi
>
> On Tue, Nov 3, 2009 at 9:19 PM, Ken Krugler <[email protected]> wrote:
> >
> > On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
> >
> >> Might be of interest to all you Mahouts out there...
> >> http://bixolabs.com/datasets/public-terabyte-dataset-project/
> >>
> >> Would be cool to get this converted over to our vector format so that we
> >> can cluster, etc.
> >
> > How much additional space would be required for the vectors, in some optimal
> > compressed format? Say as a percentage of raw text size.
> >
> > I'm asking because I have some flexibility in the processing and associated
> > metadata I can store as part of the dataset.
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c w e b m i n i n g
