If the Vector->MSCanopy pre-job outputs all of its canopies, then each of those canopies would contain the generated canopyId, and its canopy center would contain the original vector with its docId. It seems one could use that data set to get the membership information in a separate post-processing step. Certainly the post-processing job should come later, after the List<Vector> -> List<canopyId> optimization.
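
To make the idea concrete, here is a rough sketch of that post-processing join in plain Java, not Mahout code. It assumes the pre-job output can be read as a map of initial canopyId to docId, and that after the optimization each final canopy carries the list of initial canopyIds it absorbed. The class name MembershipJoin, the method docToCanopy, and both map shapes are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Mahout.
class MembershipJoin {

    // initialDocIds: initial canopyId -> docId of the vector that seeded it
    //                (assumed shape of the pre-job output).
    // boundIds:      final canopyId -> initial canopyIds merged into it
    //                (assumed shape after the List<canopyId> optimization).
    // Returns docId -> final canopyId.
    static Map<String, Integer> docToCanopy(Map<Integer, String> initialDocIds,
                                            Map<Integer, List<Integer>> boundIds) {
        Map<String, Integer> membership = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : boundIds.entrySet()) {
            for (Integer initialId : entry.getValue()) {
                String docId = initialDocIds.get(initialId);
                // Skip ids that have no record in the pre-job output.
                if (docId != null) {
                    membership.put(docId, entry.getKey());
                }
            }
        }
        return membership;
    }

    public static void main(String[] args) {
        Map<Integer, String> initialDocIds = new TreeMap<>();
        initialDocIds.put(0, "doc-a");
        initialDocIds.put(1, "doc-b");
        initialDocIds.put(2, "doc-c");

        Map<Integer, List<Integer>> boundIds = new TreeMap<>();
        boundIds.put(0, Arrays.asList(0, 2)); // final canopy 0 absorbed 0 and 2
        boundIds.put(1, Arrays.asList(1));    // final canopy 1 absorbed only 1

        System.out.println(docToCanopy(initialDocIds, boundIds));
    }
}
```

In an M/R setting the same join would presumably be keyed on the initial canopyId, but that part is left out here.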

Robin Anil (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/MAHOUT-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-304:
------------------------------

    Attachment: MAHOUT-304.patch

Jeff, MeanShift uses only ids generated by the mapper to keep vector membership.
I don't yet see how you can get the membership information, i.e. Vector docId =>
canopyId. Isn't that job missing? Maybe for later, in 0.4?

MeanShift doesn't read from VectorWritable
------------------------------------------

                Key: MAHOUT-304
                URL: https://issues.apache.org/jira/browse/MAHOUT-304
            Project: Mahout
         Issue Type: Improvement
         Components: Clustering
   Affects Versions: 0.3
           Reporter: Robin Anil
           Assignee: Robin Anil
            Fix For: 0.3

        Attachments: MAHOUT-304.patch, MAHOUT-304.patch, MAHOUT-304.patch


Need an M/R job to convert a sequence file containing VectorWritable to MeanShiftCanopy before the MeanShift M/R job runs.

