I am committing the first level of changes so that Drew can work on it. I
have updated the patch on the issue as a reference. Ted, please take a look
when you get time. The names will change correspondingly.
What I have right now is four main entry points:

DocumentProcessor - SequenceFile => StringTuple (later replaced by
StructuredDocumentWritable backed by AvroWritable)

DictionaryVectorizer - StringTuple of documents => TF vector

PartialVectorMerger - merges partial vectors based on their doc id, with
optional normalizing (used by both DictionaryVectorizer (no normalizing)
and TFIDFConverter (optional normalizing))

TFIDFConverter - converts TF vectors to TF-IDF vectors, with optional
normalizing

An example which uses all of them:

hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.text.SparseVectorsFromSequenceFiles \
  -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2

(--norm 2 works only with tfidf for now)
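To make the flow concrete, here is a rough sketch of how the driver chains
the jobs (PartialVectorMerger runs inside the other two). The method names
and signatures below are illustrative only -- they track my working copy
and will change along with the class names:

// Illustrative driver; signatures are approximate, see the patch on the
// issue for the real ones.
Configuration conf = new Configuration();
Path input = new Path("reuters-seqfiles");        // SequenceFile: docid => raw text
Path tokenized = new Path("reuters-tokenized");   // docid => StringTuple of tokens
Path tfDir = new Path("reuters-vectors-tf");      // docid => TF VectorWritable
Path tfidfDir = new Path("reuters-vectors-tfidf");

// 1. DocumentProcessor: raw text => token StringTuple via a Lucene Analyzer.
DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, tokenized);

// 2. DictionaryVectorizer: builds the dictionary, emits partial TF vectors,
//    and calls PartialVectorMerger to stitch them together on doc id
//    (no normalizing at this stage).
DictionaryVectorizer.createTermFrequencyVectors(tokenized, tfDir, conf);

// 3. TFIDFConverter: TF => TF-IDF, with optional Lp normalizing;
//    2.0f is what --norm 2 selects (the Euclidean norm).
TFIDFConverter.processTfIdf(tfDir, tfidfDir, conf, 2.0f);

The weighting in step 3 is the usual one: term frequency damped by a
log-scaled inverse document frequency, after which the optional norm is
applied.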
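Also, on the generic AvroWritable Ted mentions in his mail below: I have not
seen Drew's drop yet, so the following is only my guess at its shape,
written against a current Avro API. The real thing will differ:

import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.Writable;

/**
 * Guess at a generic Avro-backed Writable: any record that conforms to the
 * schema rides through Hadoop serialization unchanged. Real use would also
 * need a no-arg constructor plus a way to recover the schema (e.g. from the
 * job configuration).
 */
public class AvroWritable implements Writable {
  private final Schema schema;
  private GenericRecord record;

  public AvroWritable(Schema schema) {
    this.schema = schema;
  }

  public void set(GenericRecord record) { this.record = record; }
  public GenericRecord get() { return record; }

  @Override
  public void write(DataOutput out) throws IOException {
    // Avro encoders expect an OutputStream and DataOutput is not one, so
    // buffer the record and length-prefix the raw bytes.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(buffer, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    byte[] bytes = buffer.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    record = new GenericDatumReader<GenericRecord>(schema).read(record, decoder);
  }
}

If that is roughly right, StructuredDocumentWritable becomes this class plus
the document schema, which is why the StringTuple version is only a stopgap.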
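And for the document => analyzer => token array step in the earlier thread
quoted below, the mapper side of DocumentProcessor boils down to something
like this (sketched against a current Lucene API; the analyzer class is
pluggable):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Turns one field of a document into the token list the vectorizers eat. */
public final class FieldTokenizer {
  private FieldTokenizer() { }

  public static List<String> tokenize(Analyzer analyzer, String field,
                                      String text) throws IOException {
    List<String> tokens = new ArrayList<String>();
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      tokens.add(term.toString());
    }
    stream.end();
    stream.close();
    return tokens;
  }
}

On my earlier question about the same word in two different fields: one
simple answer is to prefix each token with its field name ("title:mahout"
vs "body:mahout") before it hits the dictionary, so the two uses become two
distinct features.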
Robin

On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Drew has an early code drop that should be posted shortly. He has a
> generic AvroWritable that can serialize anything with an appropriate
> schema. That changes your names and philosophy a bit.
>
> Regarding n-grams, I think that will be best combined with a
> non-dictionary based vectorizer because of the large implied vocabulary
> that would otherwise result. Also, in many cases vectorization and n-gram
> generation is best done in the learning algorithm itself to avoid moving
> massive amounts of data. As such, vectorization will probably need to be
> a library rather than a map-reduce program.
>
> On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Let's break it down into milestones. See if you agree on the following
> > (even ClassNames?)
> >
> > On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > These are good questions. I see the best course as answering these
> > > kinds of questions in phases.
> > >
> > > First, the only thing that is working right now is the current
> > > text => vector stuff. We should continue to refine this with
> > > alternative forms of vectorization (random indexing, stochastic
> > > projection, as well as the current dictionary approach).
> >
> > The input to all these vectorization jobs is the
> > StructuredDocumentWritable format, which you and Drew will work on
> > (Avro based).
> >
> > To create the StructuredDocumentWritable format we have to write
> > MapReduce jobs which will convert:
> >   a) SequenceFile => single-field token array using an Analyzer.
> >      I am going with simple Document => StructuredDocumentWritable
> >      (encapsulating StringTuple) in M1; change it to the full
> >      StructuredDocumentWritable in M2
> >   b) Lucene repo => StructuredDocumentWritable  M2
> >   c) Structured XML => StructuredDocumentWritable  M2
> >   d) Other formats/data sources (RDBMS) => StructuredDocumentWritable  M3
> >
> > Jobs using StructuredDocumentWritable:
> >   a) DictionaryVectorizer -> makes VectorWritable  M1
> >   b) nGram generator -> makes ngrams ->
> >      1) appends to the dictionary -> creates partial vectors ->
> >         merges with vectors from DictionaryVectorizer to create
> >         ngram-based vectors  M1
> >      2) appends to other vectorizers (random indexing, stochastic)
> >         M1? or M2
> >   c) Random indexing job -> makes VectorWritable  M1? or M2
> >   d) Stochastic projection job -> makes VectorWritable  M1? or M2
> >
> > How does this sound? Feel free to edit/reorder them.
> >
> > > A second step is to be able to store and represent more general
> > > documents similar to what is possible with Lucene. This is critically
> > > important for some of the things that I want to do, where I need to
> > > store and segregate title, publisher, authors, abstracts and body
> > > text (and many other characteristics ... we probably have >100 of
> > > them). It is also critically important if we want to embrace the
> > > dualism between recommendation and search. Representing documents
> > > can be done without discarding the simpler approach we have now, and
> > > it can be done in advance of good vectorization of these complex
> > > documents.
> > >
> > > A third step is to define advanced vectorization for complex
> > > documents. As an interim step, we can simply vectorize using the
> > > dictionary and alternative vectorizers that we have now, but applied
> > > to a single field of the document. Shortly, though, we should be
> > > able to define cross-occurrence features for a multi-field
> > > vectorization.
> > >
> > > The only dependencies here are that the third step depends on the
> > > first and second.
> > >
> > > You have been working on the DictionaryVectorizer. I did a bit of
> > > work on stochastic projection with some cooccurrence.
> > >
> > > In parallel, Drew and I have been working on building an Avro
> > > document schema. This is driving forward on step 2. I think that
> > > will actually bear some fruit quickly. Once that is done, we should
> > > merge capabilities. I am hoping that the good momentum that you have
> > > established on (1) will mean that merging your vectorization with
> > > the complex documents will be relatively easy.
> > >
> > > Is that a workable idea?
> > >
> > > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil <robin.a...@gmail.com>
> > > wrote:
> > >
> > > > And how does it work with our sequence file format (string docid =>
> > > > string document)? All we have is text => text?
> > > > And finally it's all vectors. How does the same word in two
> > > > different fields translate into a vector?
> > > >
> > > > If you have a clear plan let's do it, or let's do the first version
> > > > with just:
> > > >
> > > > document -> analyzer -> token array -> vector
> > > >                                    |-> ngram -> vector
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
>
> --
> Ted Dunning, CTO
> DeepDyve