[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481546#comment-13481546
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1103 at 10/22/12 5:52 PM:
--------------------------------------------------------------------

bq. Since it's not working for even two clusters, I don't see any problem due to 
the Partitioner. The input here looks like the output of SSVD. There have been 
problems reported earlier also, where SSVD output was creating problems in 
clustering.


What issues? Can you be more specific? 

The only discussion I am aware of was with Pat, who was having problems using 
the embedded style and with the fact that no --USigma output was available 
at that time. 

I am not aware of any issues in HEAD. He should be able to use the --pca true 
option, and if it retains enough variance (i.e. fairly rapid spectrum decay in 
the first 100-ish values) he should be fine, at least with clustering on 
Euclidean coordinates. 

If he is looking at cosine similarities for topical clustering (aka LSA), he 
doesn't need the --pca option.

Either way, it is not a problem with SSVD but a problem with the approach.
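
To make the "retains enough variance" check concrete, here is a minimal sketch. It assumes the singular values computed by SSVD are already available as a plain list; the numbers below are made up for illustration, and the function name is hypothetical. Note that the fractions are relative to the computed singular values only, not to the full spectrum of the input.

```python
def cumulative_energy(singular_values):
    """Return the running fraction of squared singular values.

    The fraction is relative to the values passed in, so with a truncated
    decomposition it only describes decay within the computed spectrum.
    """
    squares = [s * s for s in singular_values]
    total = sum(squares)
    if total == 0:
        raise ValueError("need at least one non-zero singular value")
    fractions = []
    running = 0.0
    for sq in squares:
        running += sq
        fractions.append(running / total)
    return fractions


# Hypothetical spectrum with rapid decay (illustration only, not real data).
sigmas = [8.0, 4.0, 2.0, 1.0]
for rank, frac in enumerate(cumulative_energy(sigmas), start=1):
    print(rank, round(frac, 3))
```

If the running fraction approaches 1 within the first hundred or so values, the spectrum decays quickly in the sense described above.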

                
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}
> bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
> {noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
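
For reference, the partition assignment for the two reported hash codes can be reproduced with a small sketch. It mirrors the formula used by Hadoop's default HashPartitioner, `(hashCode & Integer.MAX_VALUE) % numReduceTasks`, and takes the hash values from the report as given rather than recomputing `Text.hashCode()`. The reducer counts and function names are assumptions for illustration; the sketch shows only where the keys would land, not that the partitioner is the cause of the missing directories.

```python
# Hash codes as reported in the issue description (not recomputed here).
REPORTED_HASHES = {"VL-3742464": -685560454, "VL-3742466": -685560452}

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hash_partition(hash_code, num_reducers):
    """Mirror (hashCode & Integer.MAX_VALUE) % numReduceTasks."""
    return (hash_code & INT_MAX) % num_reducers


def assignments(num_reducers):
    """Map each reported cluster name to its reducer index."""
    return {name: hash_partition(h, num_reducers)
            for name, h in REPORTED_HASHES.items()}


# Reducer counts chosen for illustration; the issue does not state them.
for n in (2, 3, 4):
    print(n, assignments(n))
```

Because the two hash codes differ by exactly 2, both keys fall into the same partition whenever the number of reducers is 2, while other reducer counts can separate them.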

