[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831413#action_12831413 ]

Jake Mannix commented on MAHOUT-237:
------------------------------------

I think of it as a flag as well, but consider what happens when doing 
decompositions of matrices: you have a matrix with a bunch (N = numRows == 
10^{8+}) of sparse vectors, each of which has some dimension (M = numCols).  
Your goal is to find a matrix which has some (k = desiredRank = 100's) *dense* 
vectors, each of cardinality M.  If M is Integer.MAX_VALUE, you need a couple 
TB of RAM just to construct this final product.  You most certainly do not 
want eigenvectors represented as sparse vectors, because that is a) horribly 
inefficient storage for them (they're dense) and b) horribly inefficient 
CPU-wise (ditto).
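
To put a number on that (assuming 8-byte doubles and k = 100):

    100 dense vectors x 2^31 entries x 8 bytes/entry ~= 1.7 TB

and that is before the solver allocates any working copies.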

We could certainly *use* Vector.size() == Integer.MAX_VALUE as an effective 
"flag" that it's unbounded, but the important thing is what we do with that 
information: when you create a DenseVector of this size, the key would be to 
*not* allocate the backing array at that length, but instead initialize it to 
some small size and set an internal flag so that whenever someone does a 
get/getQuick outside of the allocated range, it returns 0, and whenever 
someone does a set/setQuick outside of that range, the vector automagically 
resizes itself to cover that index and then sets the value.  So it becomes a 
"grow as needed" dense vector of effectively infinite dimension.
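
A minimal sketch of that behavior (a hypothetical standalone class, not the 
existing Mahout DenseVector; bounds and overflow checks omitted for brevity):

    import java.util.Arrays;

    // "Grow as needed" dense vector: logically of cardinality `size`,
    // but only allocating storage up to the highest index ever set.
    public class GrowableDenseVector {
      private double[] values = new double[16]; // start small
      private final int size; // logical cardinality, e.g. Integer.MAX_VALUE

      public GrowableDenseVector(int size) {
        this.size = size;
      }

      public double getQuick(int index) {
        // Reads beyond the allocated range are implicitly zero.
        return index < values.length ? values[index] : 0.0;
      }

      public void setQuick(int index, double value) {
        if (index >= values.length) {
          // Grow geometrically so repeated sets stay amortized O(1).
          values = Arrays.copyOf(values,
              Math.max(index + 1, values.length * 2));
        }
        values[index] = value;
      }

      public int size() {
        return size;
      }
    }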

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, 
> SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into 
> SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into Vectors 
> of fixed size and sum them up to get the document Vector.
> This one is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text, Text>, with key = docid and 
> value = content.
> First: Map/Reduce over the document collection and generate the feature counts 
> (sketched below the quoted description).
> Second: a sequential pass reads the output of that Map/Reduce and converts it 
> to a SequenceFile<Text, LongWritable> where key = feature and value = a unique 
> id.  This stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection using each shard, and create 
> partial SparseVectors (each containing only the features of its shard).
> Fourth: Map/Reduce over the partial vectors, grouping by docid, to create the 
> full document Vector.
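
As a concrete reference for the first pass in the quoted plan, here is a 
minimal feature-counting job against Hadoop's org.apache.hadoop.mapreduce API. 
The class names and whitespace tokenization are hypothetical illustrations, 
not the code in the attached patches:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for each <docid, content> pair, emit (feature, 1) per token.
    class FeatureCountMapper extends Mapper<Text, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      private final Text feature = new Text();

      @Override
      protected void map(Text docId, Text content, Context context)
          throws IOException, InterruptedException {
        // Whitespace tokenization stands in for whatever analyzer is used.
        StringTokenizer tokens = new StringTokenizer(content.toString());
        while (tokens.hasMoreTokens()) {
          feature.set(tokens.nextToken());
          context.write(feature, ONE);
        }
      }
    }

    // Reduce phase: sum the partial counts emitted for each feature.
    class FeatureCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text feature, Iterable<LongWritable> counts,
          Context context) throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable count : counts) {
          sum += count.get();
        }
        context.write(feature, new LongWritable(sum));
      }
    }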

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
