Looking at KMeansClusterer.outputPointWithClusterInfo, it seems this
code will have to change in the patch, but I haven't looked yet:
String name = point.getName();
String key = (name != null) && (name.length() != 0) ? name : point.asFormatString();
output.collect(new Text(key), new Text(String.valueOf(nearestCluster.getId())));
Seems to me we need to rethink this step anyway if we are going to
implement the CDbw cluster evaluation algorithm. For that we need a job
step that outputs [clusterId:Vector_as_Writable] sequence files so that
we can iterate over them to find representative points. Is anybody using
the current format who would be impacted by such a change?
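
To make the proposed [clusterId:Vector] layout concrete, here is a minimal
plain-Java sketch. It has no Hadoop or Mahout dependencies: the class and
method names (ClusterIdKeyedOutput, nearestClusterId, assign) are hypothetical,
double[] stands in for a Vector Writable, the array index of a center stands in
for nearestCluster.getId(), and the in-memory map stands in for the sequence
file output. It is only meant to show the shape of the data, not how the job
step would actually be written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterIdKeyedOutput {

    /** Squared Euclidean distance between two dense vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the nearest center (assumption: stands in for nearestCluster.getId()). */
    static int nearestClusterId(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Emits (clusterId, vector) pairs instead of (vectorName, clusterId) pairs.
     * Grouping by clusterId is what lets a later step iterate over the points
     * of one cluster, e.g. to pick representative points.
     */
    static Map<Integer, List<double[]>> assign(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> out = new TreeMap<>();
        for (double[] p : points) {
            int id = nearestClusterId(p, centers);
            out.computeIfAbsent(id, k -> new ArrayList<>()).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{1.0, 1.0}, {9.0, 9.5}, {0.5, -0.5}};
        assign(points, centers).forEach(
            (id, vectors) -> System.out.println(id + " -> " + vectors.size() + " points"));
    }
}
```

Keying on the cluster id means the vector itself travels as the value, so the
point name or id is no longer needed to build the key.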
Jeff
On 4/17/10 8:14 AM, Robin Anil (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153 ]
Robin Anil commented on MAHOUT-379:
-----------------------------------
If the id from the vector is removed, I believe it will affect all clustering
algorithms. The final stage generates the (vector_id, cluster_id) pair, so we
will have to verify that this change doesn't affect that step.