Looking at KMeansClusterer.outputPointWithClusterInfo, it seems this code will have to change in the patch, but I haven't yet looked:

    String name = point.getName();
    String key = (name != null) && (name.length() != 0) ? name : point.asFormatString();
    output.collect(new Text(key), new Text(String.valueOf(nearestCluster.getId())));

Seems to me we need to rethink this step anyway if we are going to implement the CDbw cluster evaluation algorithm. For that we need a job step that outputs [clusterId:Vector_as_Writable] sequence files so that we can iterate over them to find representative points. Is anybody using the current format who would be impacted by such a change?
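
A minimal sketch of the proposed [clusterId:Vector] output, in plain Java with no Hadoop or Mahout types so it runs on its own. Everything here is an assumption for illustration: the class name ClusterPointsSketch, the helper names, the use of double[] in place of a Writable vector, and squared Euclidean distance as the measure. In a real job step the grouped pairs would be written to sequence files rather than held in a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a job step that emits (clusterId, vector) pairs
// instead of (vectorName, clusterId) text pairs.
public class ClusterPointsSketch {

    // Squared Euclidean distance; assumes both vectors have the same length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the id of the closest center; ties go to the first id iterated.
    static int nearestClusterId(double[] point, Map<Integer, double[]> centers) {
        if (centers.isEmpty()) {
            throw new IllegalArgumentException("no cluster centers given");
        }
        int bestId = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, double[]> entry : centers.entrySet()) {
            double distance = squaredDistance(point, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                bestId = entry.getKey();
            }
        }
        return bestId;
    }

    // Groups the vectors themselves under their nearest cluster id, so a later
    // pass can iterate over each cluster's points (e.g. to pick representatives).
    static Map<Integer, List<double[]>> groupByCluster(
            List<double[]> points, Map<Integer, double[]> centers) {
        Map<Integer, List<double[]>> grouped = new TreeMap<>();
        for (double[] point : points) {
            int clusterId = nearestClusterId(point, centers);
            grouped.computeIfAbsent(clusterId, id -> new ArrayList<>()).add(point);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> centers = new TreeMap<>();
        centers.put(0, new double[] {0.0, 0.0});
        centers.put(1, new double[] {10.0, 10.0});

        List<double[]> points = Arrays.asList(
                new double[] {1.0, 1.0},
                new double[] {9.0, 9.0},
                new double[] {0.0, 2.0});

        // Print each cluster id followed by the vectors assigned to it.
        for (Map.Entry<Integer, List<double[]>> entry
                : groupByCluster(points, centers).entrySet()) {
            for (double[] vector : entry.getValue()) {
                System.out.println(entry.getKey() + ":" + Arrays.toString(vector));
            }
        }
    }
}
```

Keying on the cluster id rather than the vector name is what lets a later step walk one cluster's points at a time; the vector name, if still needed, would have to travel with the vector.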

Jeff

On 4/17/10 8:14 AM, Robin Anil (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153 ]

Robin Anil commented on MAHOUT-379:
-----------------------------------

If the id is removed from the vector, I believe it will affect all clustering
algorithms. The final stage generates the (vector_id, cluster_id) pair. I will
have to verify that this change doesn't affect that step.

