[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841158#action_12841158 ]
Danny Leshem commented on MAHOUT-322:
-------------------------------------

As a side note, if the change is implemented, the decomposers could output a SequenceFile<EigenStatusWritable, VectorWritable> instead of a SequenceFile<IntWritable, VectorWritable>. This would eliminate the need to encode each eigenvector's information inside the vector's name, as is currently done. The current EigenStatus class might not be perfect for this, but a similar class could be constructed to encode the relevant information.

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Priority: Minor
>             Fix For: 0.3
>
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix
> states that the matrix lives in SequenceFile<WritableComparable,
> VectorWritable>. The implementation, however, assumes that a
> SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD
> package, mainly to perform PCA on a massive document corpus. Given such a
> corpus, it makes sense not to limit the user by forcing the document "key" to
> be an integer. Instead, users should be able to use Text keys (document name
> or id) or keys of any other arbitrary class. One may even argue that
> requiring a WritableComparable key is too limiting, and that a plain Writable
> key should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, so that user-specific classes (unknown to Mahout)
> can be used as opaque keys even when their libraries are not available at
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)", but the
> reader has methods to query just the value, avoiding key deserialization
> altogether.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
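
To make the "opaque key" idea from the issue description concrete, here is a minimal, self-contained sketch in plain Java. It does not use Hadoop or Mahout: the record layout ([key length][key bytes][value count][values]) and the names OpaqueKeyRows, writeRow, readRowsIgnoringKeys and skipFully are assumptions made up for this illustration, not the SequenceFile format or its API. The point is only that a reader which knows the key's byte length can step over the key and still recover every row vector, so it never needs the key's class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Hadoop's SequenceFile layout.
final class OpaqueKeyRows {

    /** Appends one record: [keyLength][keyBytes][valueCount][values...]. */
    static void writeRow(DataOutputStream out, byte[] keyBytes, double[] row)
            throws IOException {
        out.writeInt(keyBytes.length);
        out.write(keyBytes);
        out.writeInt(row.length);
        for (double x : row) {
            out.writeDouble(x);
        }
    }

    /** Reads every row vector, skipping each key's bytes without interpreting them. */
    static List<double[]> readRowsIgnoringKeys(byte[] data) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                skipFully(in, keyLength); // the key stays opaque: never deserialized
                int count = in.readInt();
                double[] row = new double[count];
                for (int i = 0; i < count; i++) {
                    row[i] = in.readDouble();
                }
                rows.add(row);
            }
        }
        return rows;
    }

    /** Skips exactly n bytes, failing if the stream ends first. */
    static void skipFully(DataInputStream in, int n) throws IOException {
        int remaining = n;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("record truncated inside key");
            }
            remaining -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        // One text-like key and one integer-like key; the reader treats both the same.
        writeRow(out, "doc-17".getBytes(StandardCharsets.UTF_8), new double[] {1.0, 0.0, 2.5});
        writeRow(out, new byte[] {0, 0, 0, 42}, new double[] {0.5, 3.0, 0.0});
        out.flush();

        List<double[]> rows = readRowsIgnoringKeys(buffer.toByteArray());
        System.out.println(rows.size() + " rows read without decoding any key");
    }
}
```

Whether the real SequenceFile reader can be driven this way without the key class on the classpath would need to be checked against the Hadoop version in use; the sketch only shows why the row data itself does not depend on the key type.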