Yeah, that's what I changed -- now the key is point.asFormatString().

And it almost works, except the serialized state in this format string
includes lengthSquared, and a mismatch there before/after makes this
fail.
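
To illustrate the kind of mismatch I mean, here is a minimal sketch. CachedPoint
is a hypothetical stand-in, not the real vector class, and the "-1 means not yet
computed" convention is an assumption made only for the example. The point is
that a key built from a format string containing cached state can change for
the same logical point, while a key built from the values alone does not.

```java
import java.util.Arrays;

// Hypothetical stand-in for a vector; NOT the actual Mahout class.
final class CachedPoint {
    private final double[] values;
    // Assumed convention for this sketch: negative means "not computed yet".
    private double lengthSquared = -1.0;

    CachedPoint(double... values) {
        this.values = values.clone();
    }

    // Lazily computes and caches the squared length.
    double getLengthSquared() {
        if (lengthSquared < 0) {
            double sum = 0.0;
            for (double v : values) {
                sum += v * v;
            }
            lengthSquared = sum;
        }
        return lengthSquared;
    }

    // Format string that includes the cached field, so it changes
    // once getLengthSquared() has been called.
    String asFormatString() {
        return "{values:" + Arrays.toString(values)
            + ",lengthSquared:" + lengthSquared + "}";
    }

    // Key derived only from the values, independent of cached state.
    String stableKey() {
        return Arrays.toString(values);
    }
}
```

If something along these lines is what is happening, a key that ignores the
cached field would sidestep the mismatch.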

It may fail more significantly in the real world than in tests, and we
should be cautious about that. Though I'd hope this is a reason to
look at ways to reengineer those jobs rather than rely on tucking an
ID into the vector? You could always roll your own writable that
decorates the vector with an ID.
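
A rough sketch of what such a decorator could look like. IdentifiedVector and
its roundTrip helper are hypothetical names, and to keep the example
self-contained it only mirrors the write/readFields shape of Hadoop's Writable
using java.io, rather than implementing the real interface or wrapping the real
vector class (a plain double[] stands in for the vector).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical decorator: carries an ID alongside the vector data so the
// vector itself does not need a name field.
final class IdentifiedVector {
    private String id;
    private double[] values;

    IdentifiedVector() {
        this("", new double[0]);
    }

    IdentifiedVector(String id, double[] values) {
        this.id = id;
        this.values = values.clone();
    }

    String getId() { return id; }
    double[] getValues() { return values.clone(); }

    // Same shape as Writable.write(DataOutput): ID first, then the values.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    // Same shape as Writable.readFields(DataInput): reads what write() wrote.
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        int n = in.readInt();
        values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readDouble();
        }
    }

    // Convenience for the example: serialize to bytes and read back a copy.
    static IdentifiedVector roundTrip(IdentifiedVector original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IdentifiedVector copy = new IdentifiedVector();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A job could then key its output on getId() instead of on anything derived
from the vector's own serialized form.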

I wouldn't really want to check this in if it breaks the fuzzy k-means
test; what's the way forward then?

On Sat, Apr 17, 2010 at 5:05 PM, Jeff Eastman
<j...@windwardsolutions.com> wrote:
> Looking at the KMeansClusterer.outputPointWithClusterInfo it seems this code
> will have to change in the patch but I haven't yet looked:
>
>    String name = point.getName();
>    String key = (name != null) && (name.length() != 0) ? name :
> point.asFormatString();
>    output.collect(new Text(key), new
> Text(String.valueOf(nearestCluster.getId())));
>
> Seems to me we need to rethink this step anyway if we are going to implement
> the CDbw cluster evaluation algorithm. For that we need a job step that
> outputs [clusterId:Vector_as_Writable] sequence files so that we can iterate
> over them to find representative points. Is anybody using the current format
> who would be impacted by such a change?
>
> Jeff
>
> On 4/17/10 8:14 AM, Robin Anil (JIRA) wrote:
>>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153
>> ]
>>
>> Robin Anil commented on MAHOUT-379:
>> -----------------------------------
>>
>> If the id from the vector is removed, I believe it will affect all
>> clustering algorithms. The final stage is generating the vector_id,
>> cluster_id pair.  will have to verify if this doesn't affect that step
>>
>>
>
>
