[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Matt Molek (JIRA) Mon, 22 Oct 2012 08:22:14 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481433#comment-13481433
 ]


Matt Molek commented on MAHOUT-1103:
------------------------------------

Yes, I am clustering on ssvd output. I will try again with the vectors directly 
from seq2sparse and update once I'm done.

I was just reading up on how the HashPartitioner works, though, and I do 
think it is part of the issue. HashPartitioner uses the following logic to 
determine which partition a key belongs to (with two reducers, as in my k=2 
run, the modulus is 2):

    int partition = (key.hashCode() & Integer.MAX_VALUE) % 2;

That yields a partition of 0 for both VL-3742464 and VL-3742466, because both 
hash codes are even and masking off the sign bit does not change that. If, 
however, they were named VL-0 and VL-1, they would be properly split up by the 
HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, then 
there would not be an issue. The problem seems to come from this odd naming 
scheme (whose origin I don't know, since I'm not familiar with the inner 
workings of kmeans).
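
To make the arithmetic easy to check without a cluster, here is a minimal 
standalone sketch. It does not use Hadoop itself: textHash is a hypothetical 
helper that assumes Text.hashCode() reduces to the byte-wise 31 * hash + byte 
loop in WritableComparator.hashBytes, and partition simply applies the 
expression quoted above. The class and method names are made up for 
illustration.

```java
import java.nio.charset.StandardCharsets;

public class PartitionSketch {

    // Assumption: mirrors the loop in Hadoop's WritableComparator.hashBytes,
    // which is what Text.hashCode() is believed to delegate to.
    static int textHash(String name) {
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        int hash = 1;
        for (byte b : bytes) {
            hash = 31 * hash + b; // int overflow wraps, as in Java generally
        }
        return hash;
    }

    // The HashPartitioner expression quoted above, with the number of
    // reducers passed in instead of hard-coded to 2.
    static int partition(String name, int numReduceTasks) {
        return (textHash(name) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] names = {"VL-3742464", "VL-3742466", "VL-0", "VL-1"};
        for (String name : names) {
            System.out.println(name + " hash=" + textHash(name)
                    + " partition=" + partition(name, 2));
        }
    }
}
```

If the assumption about Text.hashCode() holds, the two real cluster names 
should reproduce the hashes from the issue description and land in the same 
partition, while VL-0 and VL-1 should land in different ones.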
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}
> bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
> {noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
