[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799625#action_12799625 ]
Jake Mannix commented on MAHOUT-237:
------------------------------------

Looking at this a little: is there a reason why the termFrequency map needs to exist? What if, instead of:

{code}
SparseVector vector;
Map<String,MutableInt> termFrequency = new HashMap<String,MutableInt>();
token = new Token();
ts.reset();
while ((token = ts.next(token)) != null) {
  String tk = new String(token.termBuffer(), 0, token.termLength());
  if (!dictionary.containsKey(tk)) continue;
  if (!termFrequency.containsKey(tk)) {
    count += tk.length() + 1;
    termFrequency.put(tk, new MutableInt(0));
  }
  termFrequency.get(tk).increment();
}
// second pass: copy the accumulated counts into the vector
vector = new SparseVector(key.toString(), Integer.MAX_VALUE, termFrequency.size());
for (Map.Entry<String,MutableInt> pair : termFrequency.entrySet()) {
  String tk = pair.getKey();
  if (!dictionary.containsKey(tk)) continue;
  vector.setQuick(dictionary.get(tk).intValue(), pair.getValue().doubleValue());
}
{code}

we just built the counts up on the vector itself:

{code}
String valueStr = value.toString();
// rough guess at the initial number of distinct terms
vector = new SparseVector(key.toString(), Integer.MAX_VALUE, valueStr.length() / 5);
token = new Token();
ts.reset();
while ((token = ts.next(token)) != null) {
  String tk = new String(token.termBuffer(), 0, token.termLength());
  if (!dictionary.containsKey(tk)) continue;
  int tokenKey = dictionary.get(tk);
  vector.setQuick(tokenKey, vector.getQuick(tokenKey) + 1);
}
{code}

At least when I micro-benchmark this, it's about 10% faster. Not much, but it's also simpler code.

> Map/Reduce Implementation of Document Vectorizer
> -------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into vectors of fixed size and sum them up to get the document vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text,Text>, with key = docid and value = content.
> First: Map/Reduce over the document collection and generate the feature counts.
> Second: a sequential pass reads the output of the first Map/Reduce and converts it to a SequenceFile<Text,LongWritable> where key = feature and value = unique id. This stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard, and create partial SparseVectors (each containing only the features of its shard).
> Fourth: Map/Reduce over the partial vectors, group by docid, and create the full document vector.
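For reference, a minimal standalone sketch of this kind of micro-benchmark. The class name and synthetic data are made up, and plain HashMaps stand in for Mahout's SparseVector and a String[] for the Lucene TokenStream, so absolute timings will differ from the real code; it only illustrates the one-pass vs. two-pass accumulation being compared:

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class VectorBuildBench {

  public static void main(String[] args) {
    // Build a synthetic dictionary and token stream.
    Random rnd = new Random(42);
    Map<String,Integer> dictionary = new HashMap<String,Integer>();
    for (int i = 0; i < 10000; i++) {
      dictionary.put("term" + i, i);
    }
    String[] tokens = new String[500000];
    for (int i = 0; i < tokens.length; i++) {
      tokens[i] = "term" + rnd.nextInt(12000); // some tokens miss the dictionary
    }

    // Warm up, then time each strategy.
    for (int round = 0; round < 5; round++) {
      long t0 = System.nanoTime();
      Map<Integer,Double> viaMap = buildViaFrequencyMap(dictionary, tokens);
      long t1 = System.nanoTime();
      Map<Integer,Double> direct = buildDirectly(dictionary, tokens);
      long t2 = System.nanoTime();
      System.out.printf("round %d: via map %.1f ms, direct %.1f ms, equal=%b%n",
          round, (t1 - t0) / 1e6, (t2 - t1) / 1e6, viaMap.equals(direct));
    }
  }

  // Strategy 1: count into an intermediate term-frequency map, then copy
  // the counts into the "vector" (mirrors the patch's two-pass approach).
  static Map<Integer,Double> buildViaFrequencyMap(Map<String,Integer> dictionary,
                                                  String[] tokens) {
    Map<String,Integer> termFrequency = new HashMap<String,Integer>();
    for (String tk : tokens) {
      if (!dictionary.containsKey(tk)) continue;
      Integer count = termFrequency.get(tk);
      termFrequency.put(tk, count == null ? 1 : count + 1);
    }
    Map<Integer,Double> vector = new HashMap<Integer,Double>(termFrequency.size());
    for (Map.Entry<String,Integer> e : termFrequency.entrySet()) {
      vector.put(dictionary.get(e.getKey()), e.getValue().doubleValue());
    }
    return vector;
  }

  // Strategy 2: accumulate counts directly on the "vector" in one pass.
  static Map<Integer,Double> buildDirectly(Map<String,Integer> dictionary,
                                           String[] tokens) {
    Map<Integer,Double> vector = new HashMap<Integer,Double>();
    for (String tk : tokens) {
      Integer tokenKey = dictionary.get(tk);
      if (tokenKey == null) continue;
      Double current = vector.get(tokenKey);
      vector.put(tokenKey, current == null ? 1.0 : current + 1.0);
    }
    return vector;
  }
}
{code}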
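And a minimal sketch of the second, sequential pass described in the issue (assign each distinct feature a unique id). The class and method names are made up, in-memory collections stand in for the SequenceFile<Text,LongWritable> input/output, and the sharding by split size is omitted:

{code}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryAssignmentSketch {

  // Read the features emitted by the first Map/Reduce and hand each
  // distinct feature the next sequential id.
  static Map<String,Long> assignIds(Iterable<String> features) {
    Map<String,Long> dictionary = new LinkedHashMap<String,Long>();
    long nextId = 0;
    for (String feature : features) {
      if (!dictionary.containsKey(feature)) {
        dictionary.put(feature, nextId++);
      }
    }
    return dictionary;
  }

  public static void main(String[] args) {
    System.out.println(assignIds(Arrays.asList("the", "quick", "fox", "the")));
    // prints: {the=0, quick=1, fox=2}
  }
}
{code}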