I am committing the first level of changes so that Drew can work on it. I
have updated the patch on the issue as a reference. Ted, please take a look
when you get time. The names will change correspondingly.

What I have right now is

4 main entry points:
DocumentProcessor - does SequenceFile => StringTuple (later replaced by
StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => tf vector
PartialVectorMerger - merges partial vectors based on their doc id; does
optional normalizing (used by both DictionaryVectorizer (no normalizing) and
TFIDFConverter (optional normalizing))
TFIDFConverter - converts a tf vector to a tf-idf vector, with optional
normalizing
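
To make the data flow concrete, here is a toy, self-contained sketch of what
these four stages do to a couple of documents (plain Java, no Hadoop; the
stage comments name the entry points above, but the code itself is only
illustrative -- the weighting and chunking details of the real jobs differ):

import java.util.*;

public class PipelineSketch {
  public static void main(String[] args) {
    // Stand-in for the input SequenceFile: doc id => raw text.
    Map<String, String> docs = new LinkedHashMap<>();
    docs.put("doc1", "wheat prices rise on export news");
    docs.put("doc2", "wheat exports fall");

    // DocumentProcessor stage: text => token list (the StringTuple step).
    Map<String, List<String>> tokenized = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : docs.entrySet()) {
      tokenized.put(e.getKey(),
          Arrays.asList(e.getValue().toLowerCase().split("\\s+")));
    }

    // DictionaryVectorizer stage: build a term => index dictionary and emit
    // a sparse tf vector per doc id.  (In the real job the dictionary is
    // chunked, so PartialVectorMerger re-joins the partial vectors by doc
    // id; a single in-memory dictionary needs no merge.)
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    Map<String, Map<Integer, Double>> tf = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> e : tokenized.entrySet()) {
      Map<Integer, Double> vec = new HashMap<>();
      for (String term : e.getValue()) {
        int idx = dictionary.computeIfAbsent(term, t -> dictionary.size());
        vec.merge(idx, 1.0, Double::sum);
      }
      tf.put(e.getKey(), vec);
    }

    // TFIDFConverter stage: tf => tf-idf (simple log(N/df) idf here; the
    // real converter's weighting may differ), then the optional --norm 2.
    int numDocs = docs.size();
    int[] df = new int[dictionary.size()];
    for (Map<Integer, Double> vec : tf.values()) {
      for (int idx : vec.keySet()) {
        df[idx]++;
      }
    }
    for (Map.Entry<String, Map<Integer, Double>> e : tf.entrySet()) {
      Map<Integer, Double> tfidf = new HashMap<>();
      for (Map.Entry<Integer, Double> t : e.getValue().entrySet()) {
        double idf = Math.log((double) numDocs / df[t.getKey()]);
        tfidf.put(t.getKey(), t.getValue() * idf);
      }
      double norm = Math.sqrt(
          tfidf.values().stream().mapToDouble(v -> v * v).sum());
      if (norm > 0) {
        tfidf.replaceAll((k, v) -> v / norm);
      }
      System.out.println(e.getKey() + " => " + tfidf);
    }
  }
}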

An example that uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o
reuters-vectors -w (tfidf|tf) --norm 2   (--norm works only with tfidf for now)

Robin


On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Drew has an early code drop that should be posted shortly.  He has a
> generic
> AvroWritable that can serialize anything with an appropriate schema.  That
> changes your names and philosophy a bit.
>
> Regarding n-grams, I think that will be best combined with a non-dictionary
> based vectorizer because of the large implied vocabulary that would
> otherwise result.  Also, in many cases vectorization and n-gram generation
> are best done in the learning algorithm itself to avoid moving massive
> amounts of data.  As such, vectorization will probably need to be a library
> rather than a map-reduce program.
>
>
> On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Let's break it down into milestones. See if you agree on the following
> > (even the ClassNames?)
> >
> > On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > These are good questions.  I see the best course as answering these
> > > kinds of questions in phases.
> > >
> > > First, the only thing that is working right now is the current
> > > text => vector stuff.  We should continue to refine this with
> > > alternative forms of vectorization (random indexing, stochastic
> > > projection as well as the current dictionary approach).
> > >
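
For reference, the random indexing idea mentioned above in its simplest form
(a minimal sketch, not Mahout code): each term gets a fixed sparse +1/-1
vector derived from its hash, and a document vector is just the sum of its
terms' vectors, so no dictionary has to be built or shipped around:

import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative only: random indexing in its simplest form.  Each term is
// mapped (deterministically, via its hash) to a few +1/-1 positions in a
// fixed k-dimensional vector, and a document vector is the sum of its
// terms' vectors.
public class RandomIndexingSketch {
  static final int DIMENSIONS = 1000;   // k, fixed up front
  static final int SEEDS_PER_TERM = 4;  // non-zero entries per term vector

  static double[] termVector(String term) {
    double[] v = new double[DIMENSIONS];
    Random r = new Random(term.hashCode());  // same term => same vector
    for (int i = 0; i < SEEDS_PER_TERM; i++) {
      v[r.nextInt(DIMENSIONS)] += r.nextBoolean() ? 1.0 : -1.0;
    }
    return v;
  }

  static double[] documentVector(List<String> tokens) {
    double[] doc = new double[DIMENSIONS];
    for (String token : tokens) {
      double[] tv = termVector(token);
      for (int i = 0; i < DIMENSIONS; i++) {
        doc[i] += tv[i];
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    double[] doc = documentVector(Arrays.asList("wheat", "prices", "rise"));
    long nonZero = Arrays.stream(doc).filter(x -> x != 0).count();
    System.out.println("non-zero entries: " + nonZero + " of " + DIMENSIONS);
  }
}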
> > The input to all these vectorization jobs is the StructuredDocumentWritable
> > format, which you and Drew will work on (Avro based).
> >
> > To create the StructuredDocumentWritable format we have to write
> > MapReduces which will convert:
> > a) SequenceFile => single-field token array using an Analyzer
> >    (a rough sketch of this tokenization step follows below).
> >    In M1 I am going with simple Document =>
> >    StructuredDocumentWritable (encapsulating StringTuple);
> >    change it to the Avro-based StructuredDocumentWritable in M2.
> > b) Lucene repo => StructuredDocumentWritable                        M2
> > c) Structured XML => StructuredDocumentWritable                     M2
> > d) Other formats/data sources (RDBMS) => StructuredDocumentWritable M3
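
A rough sketch of the tokenization at the core of (a): this is just the
Lucene Analyzer loop such a mapper would wrap (shown with the current
CharTermAttribute idiom; the field name and class names are illustrative,
not the actual DocumentProcessor code):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeSketch {
  // In the real job this sits inside map(): the SequenceFile key (doc id)
  // passes through unchanged, and the token list is emitted as a
  // StringTuple (later wrapped in StructuredDocumentWritable).
  static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        tokens.add(term.toString());
      }
      ts.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(
        tokenize(new StandardAnalyzer(), "Wheat prices rise on export news."));
  }
}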
> >
> > Jobs using StructuredDocumentWritable:
> > a) DictionaryVectorizer -> makes VectorWritable                   M1
> > b) nGram generator -> makes ngrams (see the sketch after this list) ->
> >    1) appends to the dictionary -> creates partial vectors ->
> >       merges with the vectors from the DictionaryVectorizer to
> >       create ngram-based vectors                                  M1
> >    2) appends to other vectorizers (random indexing, stochastic)  M1? or M2
> > c) Random Indexing job -> makes VectorWritable                    M1? or M2
> > d) StochasticProjection job -> makes VectorWritable               M1? or M2
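
A minimal sketch of the n-gram generation in (b), assuming plain word
n-grams (shingles) over the token stream; the class and method names are
illustrative, not the actual job:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: word n-grams over the tokens that the
// DocumentProcessor produces.  In the job described above, each n-gram
// would be appended to the dictionary and emitted into a partial vector
// that PartialVectorMerger later joins, by doc id, with the unigram vector
// from the DictionaryVectorizer.
public class NGramSketch {
  static List<String> ngrams(List<String> tokens, int n) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      out.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("wheat", "prices", "rise", "sharply");
    System.out.println(ngrams(tokens, 2));
    // [wheat prices, prices rise, rise sharply]
  }
}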
> >
> >
> > How does this sound?  Feel free to edit/reorder them.
> >
> >
> >
> > > A second step is to be able to store and represent more general
> > > documents similar to what is possible with Lucene.  This is critically
> > > important for some of the things that I want to do where I need to
> > > store and segregate title, publisher, authors, abstracts and body text
> > > (and many other characteristics ... we probably have >100 of them).
> > > It is also critically important if we want to embrace the dualism
> > > between recommendation and search.  Representing documents can be done
> > > without discarding the simpler approach we have now and it can be done
> > > in advance of good vectorization of these complex documents.
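
One possible shape for such a structured document, sketched with Avro's
generic API; the schema and field names here are placeholders drawn from the
fields mentioned above, not the schema Drew and Ted are actually designing:

import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Purely illustrative: a multi-field document record with placeholder
// fields (docId, title, authors, abstractText, body).
public class StructuredDocumentSketch {
  static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"StructuredDocument\",\"fields\":["
      + "{\"name\":\"docId\",\"type\":\"string\"},"
      + "{\"name\":\"title\",\"type\":\"string\"},"
      + "{\"name\":\"authors\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
      + "{\"name\":\"abstractText\",\"type\":\"string\"},"
      + "{\"name\":\"body\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    GenericRecord doc = new GenericData.Record(schema);
    doc.put("docId", "reuters-0001");
    doc.put("title", "Wheat prices rise");
    doc.put("authors", Arrays.asList("Reuters"));
    doc.put("abstractText", "Wheat prices rose sharply on export news.");
    doc.put("body", "Wheat prices rose sharply today on strong export demand.");
    System.out.println(doc);
  }
}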
> > >
> > > A third step is to define advanced vectorization for complex
> > > documents.  As an interim step, we can simply vectorize using the
> > > dictionary and alternative vectorizers that we have now, but applied
> > > to a single field of the document.  Shortly, though, we should be able
> > > to define cross-occurrence features for a multi-field vectorization.
> > >
> > > The only dependencies here are that the third step depends on the
> > > first and second.
> > >
> > > You have been working on the Dictionary vectorizer.  I did a bit of
> > > work on stochastic projection with some cooccurrence.
> > >
> > > In parallel Drew and I have been working on building an Avro document
> > > schema.  This is driving forward on step 2.  I think that will
> > > actually bear some fruit quickly.  Once that is done, we should merge
> > > capabilities.  I am hoping that the good momentum that you have
> > > established on (1) will mean that merging your vectorization with the
> > > complex documents will be relatively easy.
> > >
> > > Is that a workable idea?
> > >
> > > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com>
> > > wrote:
> > >
> > > > And how does it work with our sequence file format (string docid =>
> > > > string document)?  All we have is text => text?
> > > > And finally it's all vectors.  How does the same word in two
> > > > different fields translate into a vector?
> > > >
> > > > If you have a clear plan let's do it, or let's do the first version
> > > > with just:
> > > >
> > > > document -> analyzer -> token array -> vector
> > > >                                     |-> ngram -> vector
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
