[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Robin Anil (JIRA) Tue, 02 Feb 2010 13:04:48 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: MAHOUT-237-tfidf.patch

Added an IDF job that takes a sequence file of doc-id => Vector and calculates 
TF-IDF using the TFIDF class (which internally uses the Lucene DefaultSimilarity 
class; this is not yet modifiable).
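
For illustration, here is a minimal Python sketch of the weighting step. It 
assumes the classic Lucene DefaultSimilarity formulas (tf = sqrt(freq), 
idf = ln(numDocs / (docFreq + 1)) + 1); the function names and the dict-based 
vectors are hypothetical and are not taken from the patch.

```python
import math

def tf(freq):
    # Assumed DefaultSimilarity-style term-frequency factor.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Assumed DefaultSimilarity-style inverse document frequency.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def tfidf(term_freq, doc_freq, num_docs):
    return tf(term_freq) * idf(doc_freq, num_docs)

def weight_vector(tf_vector, df, num_docs):
    # tf_vector: feature id -> raw count for one document (hypothetical layout)
    # df:        feature id -> number of documents containing the feature
    return {fid: tfidf(freq, df[fid], num_docs) for fid, freq in tf_vector.items()}
```

A feature that occurs 4 times in a document and appears in 9 of 100 documents 
would get the weight sqrt(4) * (ln(100 / 10) + 1) under these assumptions.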

It has options similar to the Lucene driver (minDf, maxDfPercent). 
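
A rough sketch of how such pruning options could behave is shown below. The 
inclusive bounds, the default values, and the function name are assumptions 
made for illustration; the patch may define them differently.

```python
def prune_features(df, num_docs, min_df=1, max_df_percent=99):
    # df: feature id -> document frequency (hypothetical layout)
    # Keep features that occur in at least min_df documents and in at most
    # max_df_percent percent of all documents (bounds assumed inclusive).
    max_df = num_docs * max_df_percent / 100.0
    return {fid for fid, count in df.items() if min_df <= count <= max_df}
```

With 100 documents, minDf = 2 and maxDfPercent = 90, a feature seen in one 
document or in all 100 documents would be dropped, while a feature seen in 50 
documents would be kept.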

Purely Map/Reduce solution. It chunks the document frequency sequence file and 
runs multiple Map/Reduce passes over the input vectors, as specified by the 
chunk size.
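
The chunked passes can be sketched in memory as follows. This is only an 
illustration: each loop iteration stands in for one Map/Reduce pass, the 
chunk size counts dictionary entries rather than bytes, and the weighting 
formula (sqrt(tf) * (ln(numDocs / (df + 1)) + 1)) is an assumed 
DefaultSimilarity-style formula. All names are hypothetical.

```python
import math
from itertools import islice

def chunk_df(df, chunk_size):
    # Yield the document-frequency table in pieces of at most chunk_size entries.
    items = iter(sorted(df.items()))
    while True:
        chunk = dict(islice(items, chunk_size))
        if not chunk:
            return
        yield chunk

def tfidf_in_passes(tf_vectors, df, num_docs, chunk_size):
    # One pass over all input vectors per chunk; each pass only weights the
    # features whose document frequency is held in the current chunk.
    result = {doc_id: {} for doc_id in tf_vectors}
    passes = 0
    for chunk in chunk_df(df, chunk_size):
        passes += 1
        for doc_id, vec in tf_vectors.items():
            for fid, freq in vec.items():
                if fid in chunk:
                    idf = math.log(num_docs / (chunk[fid] + 1)) + 1.0
                    result[doc_id][fid] = math.sqrt(freq) * idf
    return result, passes
```

A smaller chunk size keeps less of the document frequency table in memory per 
pass, at the cost of more passes over the input vectors.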


It seems that the Text field holding the Vector class name (e.g. 
RandomAccessSparseVector) is taking most of the space in the sequence file. 
Can't we compact it (with an integer id and a factory)?
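
One way the integer-id-plus-factory idea could look is sketched below. The 
registry, the id values, and the stand-in classes are hypothetical; this is 
not Mahout's serialization format, only an illustration of replacing a class 
name string with a one-byte type id.

```python
class RandomAccessSparseVector(dict):
    """Stand-in for a sparse vector type (hypothetical)."""

class DenseVector(list):
    """Stand-in for a dense vector type (hypothetical)."""

# Registry mapping a small integer id to a vector class (ids are assumptions).
VECTOR_TYPES = {0: RandomAccessSparseVector, 1: DenseVector}
TYPE_IDS = {cls: type_id for type_id, cls in VECTOR_TYPES.items()}

def write_header(vector):
    # Store a single byte instead of the class name string.
    return bytes([TYPE_IDS[type(vector)]])

def create_vector(header):
    # Factory: look up the class by id and instantiate it.
    return VECTOR_TYPES[header[0]]()
```

The header shrinks from the length of the class name to one byte per record, 
but readers and writers must then agree on the id registry.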



> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, 
> SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into Vectors 
> of fixed size and sum them up to get the document Vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce. 
> The input documents are in a SequenceFile<Text,Text> with key = docid, value = 
> content.
> First: Map/Reduce over the document collection to generate the feature counts.
> Second: a sequential pass reads the output of the Map/Reduce and converts it 
> to a SequenceFile<Text, LongWritable> where key = feature, value = unique id. 
>     The second stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard, to create 
> partial SparseVectors (containing the features of the given shard). 
> Fourth: Map/Reduce over the partial shards, grouping by docid, to create the 
> full document Vector.
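
The four stages described above can be sketched as a small in-memory Python 
pipeline. This is an illustration under stated assumptions, not the patch 
itself: whitespace tokenization with lowercasing, shard size measured in 
number of features, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def count_features(docs):
    # Stage 1: feature counts over the whole collection.
    counts = Counter()
    for text in docs.values():
        counts.update(text.lower().split())
    return counts

def build_dictionary_shards(counts, shard_size):
    # Stage 2: assign each feature a unique id, then split into shards.
    entries = [(feature, i) for i, feature in enumerate(sorted(counts))]
    return [dict(entries[i:i + shard_size])
            for i in range(0, len(entries), shard_size)]

def partial_vectors(docs, shard):
    # Stage 3: per document, count only the features present in this shard.
    out = []
    for doc_id, text in docs.items():
        vec = Counter(shard[t] for t in text.lower().split() if t in shard)
        if vec:
            out.append((doc_id, dict(vec)))
    return out

def merge_partials(partials):
    # Stage 4: group partial vectors by docid into full document vectors.
    full = defaultdict(dict)
    for doc_id, vec in partials:
        full[doc_id].update(vec)
    return dict(full)

def vectorize(docs, shard_size):
    shards = build_dictionary_shards(count_features(docs), shard_size)
    partials = [p for shard in shards for p in partial_vectors(docs, shard)]
    return merge_partials(partials)
```

Because the shards hold disjoint feature ids, merging the partial vectors in 
the last stage is a plain union of their entries.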

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
