[
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robin Anil updated MAHOUT-237:
------------------------------
Attachment: MAHOUT-237-tfidf.patch
4 main entry points:
DocumentProcessor - converts SequenceFile => StringTuple (to be replaced later
by StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - converts StringTuple of documents => TF vector
PartialVectorMerger - merges partial vectors based on their doc id, with
optional normalizing (used by both DictionaryVectorizer (no normalizing) and
TFIDFConverter (optional normalizing))
TFIDFConverter - converts TF vector to TF-IDF vector with optional normalizing
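
To make the TF => TF-IDF step concrete, here is a minimal in-memory sketch in
plain Python. It is not the code in the patch: the function names, the dict
representation of sparse vectors, and the idf formula log(N / df) are all
assumptions chosen for illustration, and the patch may weight terms differently.

```python
import math
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for each feature id, how many documents contain it."""
    df = Counter()
    for vec in tf_vectors.values():
        df.update(vec.keys())
    return df


def to_tfidf(tf_vectors):
    """Reweight sparse TF vectors ({feature_id: count}) as TF-IDF.

    Assumption: idf = log(N / df). The patch may use a different formula.
    """
    n_docs = len(tf_vectors)
    df = document_frequencies(tf_vectors)
    return {
        doc_id: {f: tf * math.log(n_docs / df[f]) for f, tf in vec.items()}
        for doc_id, vec in tf_vectors.items()
    }


# Hypothetical toy input: doc id -> sparse TF vector
tf_vectors = {
    "doc1": {0: 2, 1: 1},
    "doc2": {0: 1, 2: 3},
}
tfidf_vectors = to_tfidf(tf_vectors)
```

Under this formula, feature 0 occurs in every document and so gets weight 0,
while features that occur in only one document keep a positive weight.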
An example that uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o
reuters-vectors -w (tfidf|tf) --norm 2
(--norm works only when tfidf is enabled, not with tf)
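
As a rough illustration of the normalizing option, the sketch below scales a
sparse vector to unit p-norm in plain Python. It assumes that "--norm 2" selects
the Euclidean (L2) norm; the function name and the dict representation are
hypothetical and not taken from the patch.

```python
def normalize(vector, p=2.0):
    """Scale a sparse vector ({feature_id: weight}) to unit p-norm."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    if norm == 0.0:
        return dict(vector)  # leave an all-zero vector unchanged
    return {f: w / norm for f, w in vector.items()}


# Hypothetical weights: the L2 norm of (3, 4) is 5
unit = normalize({0: 3.0, 1: 4.0}, p=2.0)
```

After normalizing, documents of different lengths have vectors of the same
magnitude, which is usually what distance-based algorithms want.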
> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
> Key: MAHOUT-237
> URL: https://issues.apache.org/jira/browse/MAHOUT-237
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.3
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch,
> SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into
> SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into Vectors
> of fixed size and sum them up to get the document Vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text,Text> with key = docid,
> value = content.
> First, a Map/Reduce pass over the document collection generates the feature
> counts.
> Second, a sequential pass reads the output of the Map/Reduce and converts it
> to a SequenceFile<Text, LongWritable> where key = feature, value = unique id.
> This second stage should create shards of features of a given split size.
> Third, a Map/Reduce pass over the document collection, using each shard,
> creates partial SparseVectors (containing only the features of the given
> shard).
> Fourth, a Map/Reduce pass over the partial vectors groups them by docid and
> creates the full document Vector.
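
The four stages in the description above can be sketched in memory as follows.
This is plain Python, not Map/Reduce and not the code in the patch: whitespace
tokenization, the function names, and the use of dicts in place of SequenceFiles
and SparseVectors are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_features(docs):
    """Stage 1: count each feature (token) over the whole collection."""
    counts = Counter()
    for text in docs.values():
        counts.update(text.split())  # assumption: whitespace tokenization
    return counts


def build_dictionary_shards(counts, shard_size):
    """Stage 2: give each feature a unique id and split the ids into shards."""
    features = sorted(counts)
    dictionary = {f: i for i, f in enumerate(features)}
    return [
        {f: dictionary[f] for f in features[start:start + shard_size]}
        for start in range(0, len(features), shard_size)
    ]


def partial_vectors(docs, shard):
    """Stage 3: per document, a sparse TF vector restricted to one shard."""
    result = {}
    for doc_id, text in docs.items():
        vec = Counter(shard[t] for t in text.split() if t in shard)
        if vec:
            result[doc_id] = dict(vec)
    return result


def merge_partials(partials):
    """Stage 4: group partial vectors by doc id and combine them."""
    merged = defaultdict(dict)
    for part in partials:
        for doc_id, vec in part.items():
            merged[doc_id].update(vec)  # shards hold disjoint feature ids
    return dict(merged)


# Hypothetical toy collection: doc id -> content
docs = {"d1": "apple banana apple", "d2": "banana cherry"}
shards = build_dictionary_shards(count_features(docs), shard_size=2)
vectors = merge_partials(partial_vectors(docs, s) for s in shards)
```

Sharding the dictionary means each pass only needs one shard's worth of
features in memory; the final merge is cheap because shards never share a
feature id.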