[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799625#action_12799625 ]

Jake Mannix commented on MAHOUT-237:
------------------------------------

Looking at this a little:

Is there a reason why the termFrequency map needs to exist?

What if, instead of:

{code}

      SparseVector vector;
      Map<String,MutableInt> termFrequency = new HashMap<String,MutableInt>();

      token = new Token();
      ts.reset();
      while ((token = ts.next(token)) != null) {
        String tk = new String(token.termBuffer(), 0, token.termLength());
        if(dictionary.containsKey(tk) == false) continue;
        if (termFrequency.containsKey(tk) == false) {
          count += tk.length() + 1;
          termFrequency.put(tk, new MutableInt(0));
        }
        termFrequency.get(tk).increment();
      }

      vector =
          new SparseVector(key.toString(), Integer.MAX_VALUE, termFrequency.size());

      for (Map.Entry<String,MutableInt> pair : termFrequency.entrySet()) {
        String tk = pair.getKey();
        if (dictionary.containsKey(tk) == false) continue;
        vector.setQuick(dictionary.get(tk).intValue(), pair.getValue()
            .doubleValue());
      }
{code}
 
we just built it up on the vector itself:

{code}
      String valueStr = value.toString();
      vector =
          new SparseVector(key.toString(), Integer.MAX_VALUE,
              valueStr.length() / 5); // guess at initial size

      token = new Token();
      ts.reset();
      while ((token = ts.next(token)) != null) {
        String tk = new String(token.termBuffer(), 0, token.termLength());
        Integer tokenKey = dictionary.get(tk); // single lookup instead of containsKey + get
        if (tokenKey == null) continue;
        vector.setQuick(tokenKey, vector.getQuick(tokenKey) + 1);
      }
{code}

At least when I micro-benchmark this, it's about 10% faster this way.  Not 
much, but it's also simpler code.
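For reference, the gist of the two approaches can be sketched in self-contained Java, with a plain double[] standing in for SparseVector and a String[] standing in for the Lucene token stream (the class and method names here are hypothetical, not from the patch):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two accumulation strategies. double[] stands in for
// Mahout's SparseVector; the dictionary maps token -> index, as in the patch.
public class AccumulationSketch {

  // Original approach: accumulate counts in an intermediate map, then copy.
  static double[] viaMap(String[] tokens, Map<String, Integer> dictionary, int dim) {
    Map<String, Integer> termFrequency = new HashMap<>();
    for (String tk : tokens) {
      if (!dictionary.containsKey(tk)) continue; // skip out-of-dictionary tokens
      termFrequency.merge(tk, 1, Integer::sum);
    }
    double[] vector = new double[dim];
    for (Map.Entry<String, Integer> pair : termFrequency.entrySet()) {
      vector[dictionary.get(pair.getKey())] = pair.getValue();
    }
    return vector;
  }

  // Proposed approach: increment the vector directly, no intermediate map.
  static double[] direct(String[] tokens, Map<String, Integer> dictionary, int dim) {
    double[] vector = new double[dim];
    for (String tk : tokens) {
      Integer tokenKey = dictionary.get(tk);
      if (tokenKey == null) continue; // skip out-of-dictionary tokens
      vector[tokenKey] += 1;
    }
    return vector;
  }
}
```

Both produce identical counts; the second simply avoids allocating and populating the HashMap, which is where the modest speedup comes from.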

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene Index to convert documents into SparseVectors.
> Ted is working on a Hash based Vectorizer which can map features into Vectors 
> of fixed size and sum them up to get the document Vector.
> This is a pure bag-of-words based Vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text, Text>, with key = docid, 
> value = content.
> First: Map/Reduce over the document collection and generate the feature counts.
> Second: a sequential pass reads the output of the map/reduce and converts it 
> to a SequenceFile<Text, LongWritable> where key = feature, value = unique id. 
> This stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard, and create 
> partial SparseVectors (containing only the features of the given shard).
> Fourth: Map/Reduce over the partial shards, grouping by docid, to create the 
> full document Vector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.