Yeah, that's what I changed -- now the key is point.asFormatString(). And it almost works, except the serialized state in this format string includes lengthSquared, and a mismatch in that value before/after makes the key comparison fail.
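To make the failure mode concrete, here is a minimal sketch of how a key built from serialized state can drift. SketchVector, its lazily cached lengthSquared field, and stableKey() are all hypothetical stand-ins for illustration, not Mahout's actual classes or format. The assumption is only that the format string includes a cached value that changes once it is computed.

```java
import java.util.Arrays;

public class FormatKeyDemo {

    // Hypothetical stand-in for a vector; NOT Mahout's actual Vector class.
    static final class SketchVector {
        private final double[] values;
        // Assumed lazily computed cache; -1 means "not computed yet".
        private double lengthSquared = -1.0;

        SketchVector(double... values) {
            this.values = values.clone();
        }

        // Computes and caches the squared length on first use.
        double getLengthSquared() {
            if (lengthSquared < 0) {
                double sum = 0.0;
                for (double v : values) {
                    sum += v * v;
                }
                lengthSquared = sum;
            }
            return lengthSquared;
        }

        // Mimics a format string that serializes the cached field too.
        String asFormatString() {
            return "{values:" + Arrays.toString(values)
                + ",lengthSquared:" + lengthSquared + "}";
        }

        // A key derived only from the data, so the cache cannot affect it.
        String stableKey() {
            return Arrays.toString(values);
        }
    }

    public static void main(String[] args) {
        SketchVector point = new SketchVector(3.0, 4.0);

        String keyBefore = point.asFormatString();
        String stableBefore = point.stableKey();

        // Any distance computation might populate the cache as a side effect.
        point.getLengthSquared();

        System.out.println("format keys equal: "
            + keyBefore.equals(point.asFormatString()));
        System.out.println("stable keys equal: "
            + stableBefore.equals(point.stableKey()));
    }
}
```

In this sketch the same point yields two different format-string keys depending on whether the cache was filled, while the data-only key stays the same.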
It may fail more significantly in the real world versus tests and we should be cautious about that. Though I'd hope this is a reason to perhaps look at ways to reengineer those jobs rather than rely on tucking an ID into the vector? You could always roll your own writable that decorates with this.

I wouldn't really want to check this in if it breaks the fuzzy k-means test; what's the way forward then?

On Sat, Apr 17, 2010 at 5:05 PM, Jeff Eastman <[email protected]> wrote:
> Looking at the KMeansClusterer.outputPointWithClusterInfo it seems this code
> will have to change in the patch but I haven't yet looked:
>
> String name = point.getName();
> String key = (name != null) && (name.length() != 0) ? name :
> point.asFormatString();
> output.collect(new Text(key), new
> Text(String.valueOf(nearestCluster.getId())));
>
> Seems to me we need to rethink this step anyway if we are going to implement
> the CDbw cluster evaluation algorithm. For that we need a job step that
> outputs [clusterId:Vector_as_Writable] sequence files so that we can iterate
> over them to find representative points. Is anybody using the current format
> who would be impacted by such a change?
>
> Jeff
>
> On 4/17/10 8:14 AM, Robin Anil (JIRA) wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153
>> ]
>>
>> Robin Anil commented on MAHOUT-379:
>> -----------------------------------
>>
>> If the id from the vector is removed, I believe it will affect all
>> clustering algorithms. The final stage is generating the vector_id,
>> cluster_id pair. will have to verify if this doesn't affect that step
