[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481433#comment-13481433 ]
Matt Molek edited comment on MAHOUT-1103 at 10/22/12 3:23 PM:
--------------------------------------------------------------

Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to:

int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner.

I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.

      was (Author: mmolek):
    Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to:

int partition = (key.hashCode() & Integer.MAX_VALUE) % 2;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner.

I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.
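To make the partitioning argument concrete, here is a minimal sketch that reproduces the arithmetic without any Hadoop dependency. The class name PartitionSketch and its two helper methods are hypothetical. textHash is an assumption about how Hadoop's Text hashes its bytes (start at 1, then hash = 31 * hash + byte for each UTF-8 byte); it is not copied from the Hadoop source. partition uses the same expression quoted in the comment, with numReduceTasks assumed to be 2 as in the k=2 run.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: standalone arithmetic, not the Hadoop or Mahout implementation.
class PartitionSketch {

    // Assumption: mirrors the byte-wise hash used by Hadoop's Text
    // (hash starts at 1, then hash = 31 * hash + byte over the UTF-8 bytes).
    static int textHash(String key) {
        int hash = 1;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Same expression as quoted in the comment: clear the sign bit,
    // then take the remainder by the number of reduce tasks.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // assumed: one reducer per cluster for k=2
        String[] names = {"VL-3742464", "VL-3742466", "VL-0", "VL-1"};
        for (String name : names) {
            int h = textHash(name);
            System.out.println(name + " -> hash " + h
                    + " -> partition " + partition(h, numReduceTasks));
        }
    }
}
```

Under these assumptions the two real cluster names differ only in their last character, by 2, so their hashes differ by 2 and share the same parity; with two reducers both land in partition 0. VL-0 and VL-1 differ by 1 in the last character, so their hashes have opposite parity and they are sent to different partitions.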

> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that
> fails to produce directories there is an empty part-r-* file in the output
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o
> 2clusters/pca-clusters -dm
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by
> the default hadoop hash partitioner. The hashes of these two clusters aren't
> identical, but they are close. Putting both cluster names into a Text and
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.