Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Jake Mannix Tue, 02 Feb 2010 13:13:56 -0800

You volunteering to port to Avro, Ted?  Awesome! :)

  -jake

On Feb 2, 2010 1:10 PM, "Ted Dunning (JIRA)" <j...@apache.org> wrote:

    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828763#action_12828763 ]

Ted Dunning commented on MAHOUT-237:
------------------------------------

{quote}
Seems like the Text field Vector Class Name (i.e. RandomAccessSparseVector
etc.) is taking most of the space in the sequencefile. Can't we compact
it (with an integer id and a factory)?
{quote}

What about switching to Avro to avoid this?
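
For comparison, the integer-id idea from the quote could look roughly like the sketch below. This is only an illustration, not the patch: the class VectorTypeIds, its methods, and the id assignments are hypothetical, and DataOutputStream.writeUTF merely stands in for however the class name is actually written into the sequencefile. A real factory would hand back a vector constructor for each id instead of just the name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: replace a per-record class-name string with a one-byte id.
public class VectorTypeIds {
    // The position in this list is the integer id (an assumption for this sketch).
    // Ids must stay stable once files are written, so only ever append entries.
    static final List<String> TYPE_NAMES = Arrays.asList(
        "RandomAccessSparseVector",
        "SequentialAccessSparseVector",
        "DenseVector");

    // Look up the id registered for a type name.
    public static int idOf(String typeName) {
        int id = TYPE_NAMES.indexOf(typeName);
        if (id < 0) {
            throw new IllegalArgumentException("unregistered type: " + typeName);
        }
        return id;
    }

    // Reverse lookup; a real factory would construct the vector here.
    public static String nameOf(int id) {
        return TYPE_NAMES.get(id);
    }

    // Type tag written as a string: 2-byte length prefix plus the encoded name.
    public static byte[] tagAsName(String typeName) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(typeName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Type tag written as a single byte holding the registered id.
    public static byte[] tagAsId(String typeName) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(idOf(typeName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String name = "RandomAccessSparseVector";
        System.out.println("name tag bytes: " + tagAsName(name).length);
        System.out.println("id tag bytes:   " + tagAsId(name).length);
        System.out.println("round trip:     " + nameOf(tagAsId(name)[0]));
    }
}
```

The saving is per record, so it adds up when the vectors themselves are small. The cost is that the id table becomes part of the file format and has to be kept compatible, which is the kind of bookkeeping a schema-based format would take over.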

> Map/Reduce Implementation of Document Vectorizer
> ----------------------------------------------...
>         Attachments: DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> MAHOUT-237-tfidf.patch, SparseVector-VIntWritable.patch
>
> Current Vectorizer uses Lucene Index to convert documents into SparseVectors
> Ted is working ...
