[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-08 Thread Grant Ingersoll (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1103:


Attachment: MAHOUT-1103.patch

Matt, can you check this iteration of your patch? That said, it doesn't work
for me when running the MR job locally on a small data set. It would be nice
to get this self-contained somehow in a small unit test.

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Grant Ingersoll
  Labels: clusterpp
 Fix For: 0.8

 Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch


 After running kmeans clustering on a set of ~3M points, clusterpp fails to 
 populate directories for some clusters, no matter what k is.
 I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
 Even with k=2 only one cluster directory was created. For each reducer that 
 fails to produce directories there is an empty part-r-* file in the output 
 directory.
 Here is my command sequence for the k=2 run:
 {noformat}
 bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
 bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
 bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
 {noformat}
 The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
 containing 2585843 and 1156624 points respectively.
 Discussion on the user mailing list suggested that this might be caused by 
 the default hadoop hash partitioner. The hashes of these two clusters aren't 
 identical, but they are close. Putting both cluster names into a Text and 
 calling hashCode() gives:
 VL-3742464 - -685560454
 VL-3742466 - -685560452
 Finally, when running with -xm sequential, everything performs as expected.
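 As a rough illustration of the partitioner hypothesis in the report: Hadoop's
 default HashPartitioner assigns a key to reducer
 (hashCode & Integer.MAX_VALUE) % numReduceTasks. The sketch below applies that
 arithmetic to the two hash values quoted in the report, without depending on
 Hadoop. The class and method names are made up for the example, and the
 reducer count of 2 is an assumption (the report does not state it).

```java
public class HashPartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.getPartition():
    // clear the sign bit, then take the remainder by the reducer count.
    static int partition(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Assumption: two reducers, matching k = 2 in the report.
        int numReduceTasks = 2;
        // Hash values copied from the report, not recomputed here.
        String[] names = {"VL-3742464", "VL-3742466"};
        int[] hashes = {-685560454, -685560452};
        for (int i = 0; i < hashes.length; i++) {
            System.out.println(names[i] + " -> partition "
                + partition(hashes[i], numReduceTasks));
        }
    }
}
```

 Both quoted hashes are even, so under this assumption both keys map to
 partition 0 and the second reducer receives nothing, which would be consistent
 with the empty part-r-* file described in the report.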

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-06 Thread Matt Molek (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Matt Molek updated MAHOUT-1103:
---

Attachment: MAHOUT-1103.patch

I've been held up with some local problems with running tests. When building
mahout with testing enabled, I'm getting lots of out of memory errors that I
haven't figured out yet. This is happening to me on a clean checkout of the
trunk, so it's nothing I've modified. It must just be something weird with my
local environment.

So, apologies for not being able to fully test this. It does build with
-DskipTests=true though, and it worked fine when testing it on some real data
just now.

As I was typing this up I just remembered that I changed the keys from Texts to
IntWritables, since int is the only type of ID a ClusterWritable can have. That
probably makes the map/reduce implementation inconsistent with the way the
sequential method does it though. To get identical output to the sequential
method, the reducer just needs to output a Text with the cluster id, instead of
an IntWritable with the cluster id, as it does in my patch.

clusterpp is not writing directories for all clusters
-

Key: MAHOUT-1103
URL: https://issues.apache.org/jira/browse/MAHOUT-1103
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Grant Ingersoll
Labels: clusterpp
Fix For: 0.8

Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-01 Thread Grant Ingersoll (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1103:


Fix Version/s: 0.8

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Paritosh Ranjan
  Labels: clusterpp
 Fix For: 0.8

 Attachments: MAHOUT-1103.patch



[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2012-10-24 Thread Paritosh Ranjan (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Paritosh Ranjan updated MAHOUT-1103:

Attachment: MAHOUT-1103.patch

Dmitriy - yes, I think it was the same error.

Matt - I have created a partitioner and applied it in
ClusterOutputPostProcessorDriver, assuming that the valid cluster IDs are the
most recent ones and are sequential, i.e. the IDs will be VL-8543 to VL-8563 if
there are 20 unique clusters. The attached test case demonstrates that it works
for this scenario.

If you want, you can try this patch on trunk, and check whether it works or
not. I am not sure about it, as I still need to figure out the nomenclature of
relevant cluster Ids.
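
A minimal sketch of the general idea behind such a partitioner, not the code in
the attached patch: if the set of final cluster IDs is known up front, each ID
can be given its own reducer index, so no two clusters share a partition
however their hashes fall. The class below is plain Java with hypothetical
names. A real implementation would extend Hadoop's Partitioner and would need
some way to obtain the IDs, which is not shown. It uses a lookup table rather
than assuming the IDs are sequential.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ClusterIdPartitionSketch {
    // Maps each known cluster ID to a dense index 0, 1, 2, ...
    private final Map<Integer, Integer> partitionByClusterId = new HashMap<>();

    ClusterIdPartitionSketch(int[] clusterIds) {
        int[] sorted = clusterIds.clone();
        Arrays.sort(sorted);
        for (int id : sorted) {
            // The next free index is the current map size; duplicates are skipped.
            partitionByClusterId.putIfAbsent(id, partitionByClusterId.size());
        }
    }

    int getPartition(int clusterId, int numReduceTasks) {
        Integer index = partitionByClusterId.get(clusterId);
        if (index == null) {
            throw new IllegalArgumentException("Unknown cluster ID: " + clusterId);
        }
        // Distinct IDs get distinct partitions when numReduceTasks >= number of clusters.
        return index % numReduceTasks;
    }

    public static void main(String[] args) {
        // Cluster IDs taken from the issue description (VL-3742464, VL-3742466).
        ClusterIdPartitionSketch sketch =
            new ClusterIdPartitionSketch(new int[] {3742464, 3742466});
        System.out.println(sketch.getPartition(3742464, 2));
        System.out.println(sketch.getPartition(3742466, 2));
    }
}
```

With the two IDs from the issue description and two reducers, the clusters land
in partitions 0 and 1 respectively.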


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2012-10-22 Thread Matt Molek (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Molek updated MAHOUT-1103:
---

Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl}}

{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt}}

{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. 

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. 


 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
  Labels: clusterpp


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2012-10-22 Thread Matt Molek (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Molek updated MAHOUT-1103:
---

Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. 

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl}}

{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt}}

{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. 


 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
  Labels: clusterpp


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2012-10-22 Thread Matt Molek (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Molek updated MAHOUT-1103:
---

Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. The hashes of these two clusters aren't 
identical, but they are close. Putting both cluster names into a Text and 
calling hashCode() gives:
VL-3742464 - -685560454
VL-3742466 - -685560452

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. 


 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
  Labels: clusterpp


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2012-10-22 Thread Matt Molek (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Molek updated MAHOUT-1103:
---

Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. The hashes of these two clusters aren't 
identical, but they are close. Putting both cluster names into a Text and 
calling hashCode() gives:
VL-3742464 - -685560454
VL-3742466 - -685560452

Finally, when running with -xm sequential, everything performs as expected.

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to 
populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that 
fails to produce directories there is an empty part-r-* file in the output 
directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
2clusters/pca-clusters -dm 
org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
-cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the 
default hadoop hash partitioner. The hashes of these two clusters aren't 
identical, but they are close. Putting both cluster names into a Text and 
calling hashCode() gives:
VL-3742464 - -685560454
VL-3742466 - -685560452


 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
  Labels: clusterpp

