[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831413#action_12831413 ]
Jake Mannix commented on MAHOUT-237:
------------------------------------

I think of it as that kind of flag as well, but when doing decompositions of matrices, you will have a matrix with a bunch (N = numRows == 10^{8+}) of sparse vectors, each of which has some dimension (M = numCols). Your goal is to find a matrix which has some (k = desiredRank = 100's) *dense* vectors, each of cardinality M. If M is Integer.MAX_VALUE, you need a couple TB of RAM just to construct this final product (k * M * 8 bytes is already ~1.7 TB at k = 100). You most certainly do not want eigenvectors represented as sparse vectors, because that is a) horribly inefficient storage for them (they're dense) and b) horribly inefficient CPU-wise (ditto).

We could certainly *use* Vector.size() == Integer.MAX_VALUE as an effective "flag" that it's unbounded, but the important thing is what we do with that information: when you create a DenseVector of this size, the key would be to *not* initialize the backing array at that size, but instead initialize it to some small size and set an internal flag so that whenever someone does a get/getQuick outside the allocated range, it returns 0, and whenever someone does a set/setQuick outside the allocated range, the vector automagically resizes itself internally to cover that index and then sets the value. So it becomes a "grow as needed" dense vector of infinite dimension (see the sketch below the quoted description).

> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
>                 Key: MAHOUT-237
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.3
>
>         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch,
>                      DictionaryVectorizer.patch, DictionaryVectorizer.patch,
>                      DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch,
>                      MAHOUT-237-tfidf.patch, SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which can map features into Vectors of fixed size and sum them up to get the document Vector.
> This is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input documents are in a SequenceFile<Text, Text>, with key = docid and value = content.
> First, Map/Reduce over the document collection and generate the feature counts (a mapper sketch for this pass follows below).
> Second, a sequential pass reads the output of that Map/Reduce and converts it to a SequenceFile<Text, LongWritable>, where key = feature and value = unique id. This second stage should create shards of features of a given split size.
> Third, Map/Reduce over the document collection using each shard, and create partial SparseVectors (containing only the features of the given shard).
> Fourth, Map/Reduce over the partial vectors, group by docid, and create the full document Vectors.
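For concreteness, here is a minimal sketch of the "grow as needed" behavior described in the comment above. The class name GrowableDenseVector and its members are illustrative assumptions, not Mahout's actual DenseVector API: reads beyond the allocated range return 0, and writes beyond it grow the backing array before setting.

{code:java}
// Hypothetical sketch, not Mahout code: a dense vector with a huge nominal
// cardinality but a small backing array that grows only when written to.
import java.util.Arrays;

public class GrowableDenseVector {
  private final int cardinality;   // nominal size, e.g. Integer.MAX_VALUE
  private double[] values;         // allocated lazily, much smaller than cardinality

  public GrowableDenseVector(int cardinality, int initialCapacity) {
    this.cardinality = cardinality;
    this.values = new double[initialCapacity];
  }

  public double getQuick(int index) {
    // Outside the allocated range the value is implicitly 0.0.
    return index < values.length ? values[index] : 0.0;
  }

  public void setQuick(int index, double value) {
    if (index >= values.length) {
      // Grow at least to the written index; doubling keeps repeated writes cheap.
      int newLength = Math.max(index + 1, values.length * 2);
      values = Arrays.copyOf(values, newLength);
    }
    values[index] = value;
  }

  public int size() {
    return cardinality;   // size() == Integer.MAX_VALUE acts as the "unbounded" flag
  }
}
{code}

With something like this, new GrowableDenseVector(Integer.MAX_VALUE, 1024) reports size() == Integer.MAX_VALUE but only allocates memory proportional to the highest index actually written.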
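And a hypothetical sketch of the first pass from the issue description: a Hadoop mapper that tokenizes document content and emits per-feature counts, which a summing reducer would then aggregate. The class name and the naive whitespace tokenization are assumptions for illustration, not the code in the attached patches (which would use a Lucene Analyzer).

{code:java}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads SequenceFile<Text, Text> records (key = docid, value = content)
// and emits (feature, 1) pairs for the feature-count pass.
public class FeatureCountMapper extends Mapper<Text, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text feature = new Text();

  @Override
  protected void map(Text docId, Text content, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(content.toString());
    while (tokenizer.hasMoreTokens()) {
      feature.set(tokenizer.nextToken());
      context.write(feature, ONE);
    }
  }
}
{code}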