Map/Reduce Implementation of Document Vectorizer
------------------------------------------------

                 Key: MAHOUT-237
                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
             Project: Mahout
          Issue Type: New Feature
    Affects Versions: 0.3
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.3


The current vectorizer uses a Lucene index to convert documents into SparseVectors.
Ted is working on a hash-based vectorizer that maps features into vectors of a
fixed size and sums them up to get the document vector.
This issue proposes a pure bag-of-words vectorizer written in Map/Reduce.
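For contrast with the dictionary-based approach below, the hashing idea mentioned above can be sketched in a few lines of plain Python (not Ted's actual implementation; the CRC-based hash and the vector size of 16 are illustrative assumptions):

```python
import zlib

def hashed_vectorize(tokens, size=16):
    """Hash each feature into a fixed-size vector and sum up the counts.

    No dictionary pass is needed: the hash function decides the index.
    The hash (crc32) and size are assumptions for illustration only.
    """
    vector = [0] * size
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % size
        vector[index] += 1
    return vector
```

The trade-off is that unrelated features can collide in the same slot, which the dictionary-based pipeline below avoids by assigning each feature a unique id.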

The input documents are in a SequenceFile<Text, Text>, with key = docid and
value = document content.
First, Map/Reduce over the document collection to generate the feature counts.
Second, a sequential pass reads the Map/Reduce output and converts it to a
SequenceFile<Text, LongWritable> where key = feature and value = a unique id.
This stage should also create shards of features of a given split size.
Third, Map/Reduce over the document collection, using each shard, to create
partial SparseVectors (each containing only the features of the given shard).
Fourth, Map/Reduce over the partial vectors, grouping by docid, to create the
full document vectors.
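Ignoring the Hadoop and SequenceFile plumbing, the four passes above can be simulated in plain Python (the sample documents and the split size of 2 are hypothetical, chosen only to make the sharding visible):

```python
from collections import Counter

# key = docid, value = content (stand-in for SequenceFile<Text, Text>)
docs = {"doc1": "the cat sat", "doc2": "the dog sat down"}

# Pass 1 (Map/Reduce): generate feature counts over the collection.
feature_counts = Counter()
for content in docs.values():
    feature_counts.update(content.split())

# Pass 2 (sequential): assign each feature a unique id, then shard the
# feature dictionary by a given split size (2 is illustrative).
dictionary = {feature: i for i, feature in enumerate(sorted(feature_counts))}
SPLIT_SIZE = 2
features = sorted(dictionary)
shards = [features[i:i + SPLIT_SIZE] for i in range(0, len(features), SPLIT_SIZE)]

# Pass 3 (Map/Reduce per shard): emit partial sparse vectors containing
# only the features of the given shard, keyed by docid.
partials = []  # list of (docid, {feature_id: count})
for shard in shards:
    shard_set = set(shard)
    for docid, content in docs.items():
        counts = Counter(t for t in content.split() if t in shard_set)
        if counts:
            partials.append((docid, {dictionary[f]: c for f, c in counts.items()}))

# Pass 4 (Map/Reduce): group the partial vectors by docid and merge them
# into the full document vector.
full_vectors = {}
for docid, partial in partials:
    full_vectors.setdefault(docid, {}).update(partial)
```

Because each partial vector covers a disjoint shard of feature ids, the merge in pass 4 is a plain union per docid, which is exactly what grouping by key in a reduce gives for free.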

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
