[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Mannix updated MAHOUT-322:
-------------------------------

    Fix Version/s:     (was: 0.3)

Pulling this out of the track for 0.3.

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. The implementation, however, assumes that a
> SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such a
> corpus, it makes sense not to limit the user by forcing the document "key" to
> be an integer. Instead, users should be able to use Text keys (document name
> or id) or keys of any other arbitrary class. One may even argue that
> requiring a WritableComparable key is too limiting, and that a plain Writable
> key should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, so that user-specific classes (unknown to Mahout)
> can be used as opaque keys even when their libraries are not available at
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)", but the
> reader has methods to query just the value, avoiding key deserialization
> altogether.
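
The last paragraph of the description proposes reading only the value of each record and treating the key as opaque. The sketch below illustrates that idea in plain Java. It does not use Hadoop or Mahout and is not the SequenceFile on-disk format: the record layout (a length-prefixed key followed by a vector of doubles), the class OpaqueKeyDemo and its two methods are all hypothetical, invented for this example. The point is only that a reader can step over the key bytes without knowing the key's class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Hadoop's SequenceFile format and NOT Mahout code.
class OpaqueKeyDemo {

    // Assumed layout per record: [keyLength][keyBytes][valueCount][doubles...].
    static byte[] writeRecords(String[] keys, double[][] rows) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int r = 0; r < keys.length; r++) {
                byte[] key = keys[r].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(rows[r].length);
                for (double d : rows[r]) {
                    out.writeDouble(d);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads every row vector while skipping the key bytes entirely,
    // so the key's class never has to be known or loaded.
    static List<double[]> readValuesOnly(byte[] data) {
        List<double[]> rows = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                int skipped = 0;
                while (skipped < keyLength) {
                    int n = in.skipBytes(keyLength - skipped);
                    if (n <= 0) {
                        throw new EOFException("truncated key");
                    }
                    skipped += n;
                }
                double[] row = new double[in.readInt()];
                for (int i = 0; i < row.length; i++) {
                    row[i] = in.readDouble();
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] data = writeRecords(
                new String[] {"doc-a", "doc-b"},
                new double[][] {{1.0, 2.0}, {3.0}});
        List<double[]> rows = readValuesOnly(data);
        System.out.println("rows read without touching keys: " + rows.size());
    }
}
```

One consequence visible in this sketch: once keys are ignored, a row can only be identified by its position in the file. Whether that is acceptable for DistributedRowMatrix, which currently reads an integer key alongside each vector, is a design question the issue leaves open.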