[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: MAHOUT-237-tfidf.patch

Four main entry points:

DocumentProcessor - converts SequenceFile => StringTuple (later to be replaced by StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - converts StringTuples of documents => TF vectors
PartialVectorMerger - merges partial vectors by their doc id, with optional normalizing (used by both DictionaryVectorizer (no normalizing) and TFIDFConverter (optional normalizing))
TFIDFConverter - converts TF vectors to TF-IDF vectors, with optional normalizing

An example which uses all of them:

hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2

(--norm works only when tfidf is enabled, not with tf)

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, SparseVector-VIntWritable.patch
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into vectors of a fixed size and sum them up to get the document vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input documents are in SequenceFile<Text,Text>, with key = docid and value = content.
> First, Map/Reduce over the document collection and generate the feature counts (a sketch of this pass is given after the quoted description below).
> Second, a sequential pass reads the output of that Map/Reduce and converts it to SequenceFile<Text,LongWritable> where key = feature and value = unique id. This stage should create shards of features of a given split size.
> Third, Map/Reduce over the document collection, using each shard, and create partial SparseVectors (containing only the features of the given shard).
> Fourth, Map/Reduce over the partial vectors, grouping by docid, to create the full document vector.
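For reference, here is a minimal, hypothetical sketch of the first pass described above (feature counting over the SequenceFile<Text,Text> document collection), written against the standard Hadoop MapReduce API. The class name, whitespace tokenization, and paths are illustrative assumptions, not the code in the attached patch.

// A hypothetical sketch only: counts feature (token) frequencies over a
// SequenceFile<Text,Text> collection where key = docid and value = content.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FeatureCountSketch {

  // Mapper: for each document, emit (token, 1) for every whitespace-delimited token.
  public static class TokenMapper extends Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(Text docId, Text content, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokenizer = new StringTokenizer(content.toString());
      while (tokenizer.hasMoreTokens()) {
        token.set(tokenizer.nextToken());
        context.write(token, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each feature.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text feature, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable count : counts) {
        sum += count.get();
      }
      total.set(sum);
      context.write(feature, total);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "feature-count-sketch");
    job.setJarByClass(FeatureCountSketch.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. reuters-seqfiles
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. feature-counts (hypothetical)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner is safe here because per-feature counts are simply summed; building the feature dictionary, sharding it, and producing the partial and merged document vectors happen in the later passes described in the issue.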