[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883774#comment-13883774
 ] 

Suneel Marthi commented on MAHOUT-1030:
---------------------------------------

Andrew, I'm looking at this issue now, and you're right that it's not a 
showstopper, but it needs to be fixed nevertheless. The mistake is in the 
distance calculation, where we always use 'squaredDistance' instead of the 
DistanceMeasure provided on the CLI.

The following snippet should fix the issue (it needs to be added in both 
ClusterClassificationMapper and ClusterClassificationDriver, where the distance 
is calculated):

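{Code}
    // Use the measure the cluster was built with, not a hard-coded squared distance.
    DistanceMeasureCluster distanceMeasureCluster = (DistanceMeasureCluster) cluster;
    DistanceMeasure distanceMeasure = distanceMeasureCluster.getMeasure();
    // Distance from the cluster center to the point being classified.
    double d = distanceMeasure.distance(cluster.getCenter(), vw.get());
{Code}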

With this change and CosineDistanceMeasure, the distances are now in the range 
[0,1].  
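
To illustrate why the choice of measure changes the range of the stored 
distances, here is a minimal stand-alone sketch. It does not use any Mahout 
classes; the class name DistanceSketch and both helper methods are hypothetical 
and written only for this illustration, and the cosine formula (1 minus the 
cosine similarity) is an assumption about what the configured measure computes. 
For non-negative vectors such as term-frequency vectors, that value stays within 
[0,1], while the squared Euclidean distance is unbounded.

```java
// Hypothetical stand-alone sketch; not Mahout code.
final class DistanceSketch {

    // Squared Euclidean distance: grows with vector magnitude, so it is unbounded.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Cosine distance, assumed here to be 1 - cosine similarity.
    // For vectors with non-negative components the result lies in [0,1].
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = Math.sqrt(normA) * Math.sqrt(normB);
        if (denominator == 0.0) {
            // Assumption for this sketch: two zero vectors are identical,
            // a zero vector and a non-zero vector are maximally distant.
            return (normA == 0.0 && normB == 0.0) ? 0.0 : 1.0;
        }
        return 1.0 - dot / denominator;
    }

    public static void main(String[] args) {
        double[] center = {3.0, 4.0};
        double[] sameDirection = {6.0, 8.0};   // same direction, twice the length
        double[] orthogonalA = {1.0, 0.0};
        double[] orthogonalB = {0.0, 1.0};

        System.out.println("squared (same direction): " + squaredDistance(center, sameDirection));
        System.out.println("cosine  (same direction): " + cosineDistance(center, sameDirection));
        System.out.println("squared (orthogonal):     " + squaredDistance(orthogonalA, orthogonalB));
        System.out.println("cosine  (orthogonal):     " + cosineDistance(orthogonalA, orthogonalB));
    }
}
```

The point of the sketch is that two vectors pointing the same way have a cosine 
distance of zero regardless of their lengths, whereas the squared distance 
between them can be arbitrarily large.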

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, 
> MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
