[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831432#action_12831432 ]
Jake Mannix commented on MAHOUT-237:
------------------------------------

Yeah, well, I need the count, and I'm also modifying the vectorizer and its tf-idf version to a) create vectors of the proper dimension (after createDictionaryChunks() runs, we know what the dimension of the output vectors is bound to be), and b) take a "--sequentialAccessOutput" command-line flag that allows the final reducer to convert the vectors from their mutation-friendly RandomAccessSparseVector form and seal them up in the not-very-mutation-friendly, but zippily faster for some tasks, SequentialAccessSparseVector form. I'll put up a patch with all my DistributedLanczosSolver work, because that's where this is needed (plus it helps anyone who wants finite-dimensional vectors, sometimes in the SequentialAccess-optimized form, like the clusterers).

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch,
> SparseVector-VIntWritable.patch
>
>
> The current vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based vectorizer that can map features into vectors
> of fixed size and sum them up to get the document vector.
> This is a pure bag-of-words vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text, Text>, with key = docid,
> value = content.
> First: Map/Reduce over the document collection and generate the feature counts.
> Second: a sequential pass reads the output of the first Map/Reduce and converts
> it to a SequenceFile<Text, LongWritable> where key = feature, value = unique id.
> This stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard, and create
> partial SparseVectors (containing only the features of the given shard).
> Fourth: Map/Reduce over the partial shards, group by docid, and create the full
> document vector.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
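The conversion the comment describes, from a mutation-friendly random-access sparse vector to a sealed sequential-access one, can be sketched in plain Java. This is a hypothetical, simplified illustration of the underlying idea only, not Mahout's actual RandomAccessSparseVector/SequentialAccessSparseVector classes: the hash-backed form is cheap to mutate during vectorization, and sorting the entries once into parallel arrays at "sealing" time buys fast ordered iteration (e.g. for dot products) afterwards.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch (not Mahout's API): a sparse vector "sealed" into
// sorted parallel arrays, analogous in spirit to what a sequential-access
// representation buys over a hash-backed, random-access one.
final class SequentialSparse {
    final int[] indices;   // nonzero indices, sorted ascending
    final double[] values; // values[i] pairs with indices[i]

    // Build from a mutation-friendly random-access form (index -> value).
    SequentialSparse(Map<Integer, Double> randomAccess) {
        TreeMap<Integer, Double> sorted = new TreeMap<>(randomAccess);
        indices = new int[sorted.size()];
        values = new double[sorted.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            indices[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
    }

    // Dot product walks both sorted arrays in lockstep: no per-element
    // hashing, which is why this form is faster for iteration-heavy tasks.
    double dot(SequentialSparse other) {
        double sum = 0.0;
        int a = 0, b = 0;
        while (a < indices.length && b < other.indices.length) {
            if (indices[a] == other.indices[b]) {
                sum += values[a++] * other.values[b++];
            } else if (indices[a] < other.indices[b]) {
                a++;
            } else {
                b++;
            }
        }
        return sum;
    }
}
```

The trade-off mirrors the one in the comment: the map form supports cheap in-place updates while the reducers accumulate counts, and the sealed form is the one you want to hand to downstream consumers like the clusterers or the Lanczos solver.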