[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672718#comment-13672718
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-----------------------------------------

The code assumes that each cluster id ends up in a different part file: the 
number of reducers is set to the number of clusters, which is supposed to mean 
one output part file per reducer (i.e. per cluster id). That isn't happening, 
at least in the simple testing I'm doing in pseudo M/R mode using data 
generated from.  Can someone test this on a real Hadoop cluster, as I don't 
have access to one right at the moment?  At least in the non-cluster env, the 
workaround is to run in sequential mode.


{quote}
bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 25 
-cd 5 -t1 50 -t2 10 -dm 
org.apache.mahout.common.distance.EuclideanDistanceMeasure -i 
/path/content/synthetic_control.data  -ow -o output -cl
{quote}
and
{quote}
... 
org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver
 -i output -o output/postMR
{quote}
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
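
A minimal sketch of why the two ids from the k=2 run could land on the same reducer, assuming the job uses Hadoop's default HashPartitioner, which picks a partition as (hashCode & Integer.MAX_VALUE) % numReduceTasks. The class name PartitionSketch is hypothetical, and the hash values are copied from the issue description above rather than recomputed from a Text object:

```java
public class PartitionSketch {

    // Same arithmetic as the default HashPartitioner: clear the sign bit,
    // then take the remainder by the number of reduce tasks.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Cluster ids and Text hash codes as reported in the issue description.
        String[] ids = {"VL-3742464", "VL-3742466"};
        int[] hashes = {-685560454, -685560452};
        int numReducers = 2; // one reducer per cluster, as the code assumes

        for (int i = 0; i < ids.length; i++) {
            System.out.println(ids[i] + " -> partition "
                    + partition(hashes[i], numReducers));
        }
    }
}
```

Both hash codes are even, so with two reducers both ids map to partition 0 and the other reducer receives no keys. That would be consistent with the empty part-r-* file described above, though it does not by itself confirm the cause for larger k.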

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
