[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Molek updated MAHOUT-1103:
-------------------------------
    Description:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:

{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}
{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}
{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner.

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:

bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner.

> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
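
To illustrate the hash partitioner hypothesis from the description, here is a minimal standalone sketch. It reuses the formula from Hadoop's default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and rests on two assumptions that the report does not confirm: that the key clusterpp emits hashes to the integer cluster id itself (as an IntWritable would), and that the job runs one reducer per cluster. The class and method names (PartitionSketch, partitionFor, assign) are hypothetical and not part of Mahout or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Same arithmetic as Hadoop's HashPartitioner.getPartition.
    // ASSUMPTION: keyHash is the integer cluster id, i.e. the key's hashCode()
    // returns the id itself (true for IntWritable, not verified for clusterpp).
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups cluster ids by the reducer index they would be routed to.
    // Every reducer gets an entry, so idle reducers show up with an empty list.
    static Map<Integer, List<Integer>> assign(int[] clusterIds, int numReduceTasks) {
        Map<Integer, List<Integer>> byReducer = new TreeMap<>();
        for (int r = 0; r < numReduceTasks; r++) {
            byReducer.put(r, new ArrayList<>());
        }
        for (int id : clusterIds) {
            byReducer.get(partitionFor(id, numReduceTasks)).add(id);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Cluster ids from the k=2 clusterdump output (VL-3742464, VL-3742466).
        int[] clusterIds = {3742464, 3742466};
        // ASSUMPTION: one reducer per cluster, so 2 reducers for k=2.
        System.out.println(assign(clusterIds, 2));
        // prints {0=[3742464, 3742466], 1=[]}
    }
}
```

Under these assumptions both ids are even, so both land on reducer 0 and reducer 1 receives no input, which would match the single cluster directory plus one empty part-r-* file seen in the k=2 run. This is only consistent with the hypothesis and does not confirm it.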