Re: bottom up clustering
Hi Rajesh,

Streaming k-means clusters Vectors (read from <*, VectorWritable> sequence files) and outputs <IntWritable, CentroidWritable> sequence files. A Centroid is the same as a Vector, with the addition of an index and a weight; you can call getVector() on a Centroid to get its Vector.

On Mon, Jun 3, 2013 at 2:49 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> You should be able to feed arff.vectors to Streaming kmeans [...]
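The Centroid/Vector relationship described above can be sketched conceptually. This is an illustrative Python model of the idea (an index plus a weight on top of a plain vector), not the actual org.apache.mahout.math.Centroid Java class:

```python
# Conceptual sketch of a Mahout-style Centroid: a vector plus an index
# (the IntWritable key in the output) and a weight (how many points the
# centroid summarizes). Illustrative only; the real class is Java.

class Centroid:
    def __init__(self, index, vector, weight=1.0):
        self.index = index          # cluster key
        self.vector = list(vector)  # the underlying coordinates
        self.weight = weight        # number of points represented

    def get_vector(self):
        # Mirrors the getVector() idea: drop index/weight, keep coordinates
        return self.vector

    def update(self, point):
        # Fold a new point into the running weighted mean
        w = self.weight
        self.vector = [(c * w + p) / (w + 1) for c, p in zip(self.vector, point)]
        self.weight = w + 1
```

Folding two points into one centroid this way yields their mean with weight 2, which is exactly what the weight column in the seqdumper output counts.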
Re: bottom up clustering
Hi Suneel,

I have used seqdirectory followed by seq2sparse on the 20newsgroups set, then ran streamingkmeans with the following command to get 40 clusters:

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/news-vectors/tf-vectors/ \
  -o /user/hadoop/news-stream-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 40 \
  -km 190 \
  -testp 0.3 \
  -mi 10 \
  -ow

I dumped the output using seqdumper from /user/hadoop/news-stream-kmeans/part-r-0. In the dumped file the centroids look like this:

Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 0, weight = 1.00, vector = {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
Key: 1: Value: key = 1, weight = 3.00, vector = {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
Key: 2: Value: key = 2, weight = 105.00, vector = {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
Key: 3: Value: key = 28, weight = 259.00, vector = {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
-- more --

I have also tried using arff.vector to convert ARFF files to vectors, but I don't know how to convert that output to the tf-idf vector format expected by streaming kmeans.

Thanks,
Rajesh

On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> Hi Suneel, Thanks a lot for detailed steps! [...]
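Each centroid record in the dump above encodes the cluster key, a weight (the number of points the centroid summarizes), and a sparse vector of term-index:value pairs. A small parser for such records might look like this (parse_centroid_line is a hypothetical helper, not part of Mahout, and it assumes a complete record; the dumped lines shown above are truncated mid-vector):

```python
import re

def parse_centroid_line(line):
    """Parse one seqdumper centroid line into (key, weight, sparse_vector).

    Hypothetical helper for lines shaped like:
    Key: 1: Value: key = 1, weight = 3.00, vector = {1297:1.0,1421:0.0}
    """
    m = re.search(r"key = (\d+), weight = ([\d.]+), vector = \{([^}]*)\}", line)
    if m is None:
        raise ValueError("not a complete centroid record: %r" % line)
    key, weight, body = int(m.group(1)), float(m.group(2)), m.group(3)
    vector = {}
    for pair in body.split(","):
        if ":" in pair:
            idx, val = pair.split(":")
            vector[int(idx)] = float(val)
    return key, weight, vector
```

This makes it easy to, for example, sort clusters by weight to see which ones absorbed the most documents.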
Re: bottom up clustering
You should be able to feed arff.vectors to Streaming kmeans (I have not tried that myself; I never had to work with ARFF). I used tfidf-vectors as an example; you should be good with ARFF. Give it a try and let us know.

From: Rajesh Nikam <rajeshni...@gmail.com>
To: user@mahout.apache.org; Suneel Marthi <suneel_mar...@yahoo.com>
Cc: Ted Dunning <ted.dunn...@gmail.com>
Sent: Monday, June 3, 2013 4:30 AM
Subject: Re: bottom up clustering

> Hi Suneel, I have used seqdirectory followed by seq2sparse on the 20newsgroups set. [...]
Re: bottom up clustering
I tried the commands below:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.vectors.arff.Driver \
  --input /mnt/cluster/t/input-set.arff \
  --output /user/hadoop/t/input-set-vector/ \
  --dictOut /mnt/cluster/t/input-set-dict

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/t/input-set-vector \
  -o /user/hadoop/t/skmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 4 \
  -km 12 \
  -testp 0.3 \
  -mi 10 \
  -ow

and dumped the result with seqdumper:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.SequenceFileDumper \
  -i /user/hadoop/t/skmeans/part-r-0 \
  -o /mnt/cluster/t/skmeans-cluster-points.txt

The dump contains the centroids for the clusters.

This was a small test set for which I could guess the number of clusters. Since streaming kmeans requires -k to be specified, how do I choose it when the sample set is big? It also gives an error when k was specified as 40:

-k 40 \
-km 190 \

java.lang.IllegalArgumentException: Must have more datapoints [4] than clusters [40]
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)

Also, how do I use these centroids for further clustering? I am not understanding their use.

Thanks,
Rajesh

On Mon, Jun 3, 2013 at 6:19 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> You should be able to feed arff.vectors to Streaming kmeans [...]
Re: bottom up clustering
I am having 1500 points, and I am computing km as k * log n.

On Jun 3, 2013 8:53 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> How many datapoints do you have in your input? How are you computing the value of -km?
>
> On Monday, June 3, 2013 9:55 AM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > I tried with the commands below [...]
Re: bottom up clustering
From the exception, it looks like you were trying to cluster 4 datapoints into 40 (-k) clusters, hence what you are seeing.

So if your datapoints n = 1500 and clusters k = 40:

-km = k * log n = 292 (rounded to the nearest integer); note it's a natural log.

Does that match your inputs? Sorry, I am having a busy day and may not be of as much help as I would have liked. Dan, could you jump in?

From: Rajesh Nikam <rajeshni...@gmail.com>
To: user@mahout.apache.org
Cc: Ted Dunning <ted.dunn...@gmail.com>
Sent: Monday, June 3, 2013 11:51 AM
Subject: Re: bottom up clustering

> I am having 1500 points. Using km: k * log n [...]
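Suneel's sizing rule above can be written down directly. This is a sketch of the rule of thumb from this thread, not Mahout code; for k = 40 and n = 1500, k * ln(n) is about 292.5, so nearest-integer rounding gives 293 while truncation gives 292:

```python
import math

def estimate_km(k, n):
    """Rule of thumb for -km in streaming k-means: k * ln(n), rounded.

    Sketch of the formula discussed in this thread, not Mahout's code.
    """
    if n <= k:
        # Mirrors the precondition behind the IllegalArgumentException
        # seen earlier: there must be more datapoints than clusters.
        raise ValueError("Must have more datapoints [%d] than clusters [%d]" % (n, k))
    return round(k * math.log(n))
```

The guard also reproduces the failure mode from the earlier message: asking for 40 clusters from 4 datapoints is rejected before any clustering happens.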
Re: bottom up clustering
Hi Suneel,

Thanks a lot for the detailed steps! I will try them out. Thanks, Ted, for pointing this out!

Thanks,
Rajesh

On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> To add to Ted's reply, streaming k-means was recently added to Mahout (thanks to Dan and Ted). Here's the reference paper that talks about Streaming k-means: http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>
> You have to be working off of trunk to use this; it's not available as part of any release yet. The steps for using Streaming k-means (I don't think it's been documented yet):
>
> 1. Generate sparse vectors via seq2sparse (you have this already).
> 2. mahout streamingkmeans -i <path to tfidf-vectors> -o <output path> --tempDir <temp folder path> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sc org.apache.mahout.math.neighborhood.FastProjectionSearch -k <no. of clusters> -km <see below>
>
> where -k = number of clusters, and -km = k * log(n), with k = number of clusters and n = number of datapoints to cluster; round this to the nearest integer.
>
> You have the option of using FastProjectionSearch, ProjectionSearch or LocalitySensitiveHashSearch for the -sc parameter.
>
> On Thursday, May 30, 2013, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > Rajesh, the streaming k-means implementation is very much like what you are asking for. [...]
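For reference, the CosineDistanceMeasure recommended in the steps above treats distance as 1 minus cosine similarity, so two documents pointing in the same direction are at distance 0 and orthogonal ones at distance 1. A minimal sketch of that idea (illustrative Python, not the Mahout implementation):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two dense vectors.

    Conceptual sketch of what a cosine distance measure computes;
    not Mahout's CosineDistanceMeasure class itself.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on direction, it is a common choice for tf-idf document vectors, where document length should not dominate the comparison.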
Re: bottom up clustering
The input to canopy is your vectors from seq2sparse, not cluster centroids (as you had it), hence the error message you are seeing. The output of canopy can be fed into kmeans as input centroids.

From: Rajesh Nikam <rajeshni...@gmail.com>
To: user@mahout.apache.org
Sent: Thursday, May 30, 2013 10:56 AM
Subject: bottom up clustering

Hi,

I want to do bottom-up (hierarchical) clustering rather than the top-down approach described in https://cwiki.apache.org/MAHOUT/top-down-clustering.html (kmeans, then clusterdump, then clusterpp, and then kmeans on each cluster).

How do I take the centroids from the first phase of canopy and use them for the next level, of course with correct t1 and t2? I have tried using 'canopy', which gives centroids as output. How do I apply one more level of clustering on these centroids?

/user/hadoop/t/canopy-centroids/clusters-0-final is the output of the first level of canopy.

mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o /user/hadoop/t/hclust -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02 -ow

It gave the following error:

13/05/30 20:21:38 INFO mapred.JobClient: Task Id : attempt_201305231030_0519_m_00_0, Status : FAILED
java.lang.ClassCastException: org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to org.apache.mahout.math.VectorWritable

Thanks,
Rajesh
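For context, canopy clustering uses two thresholds, t1 (loose) and t2 (tight), with t1 > t2: points within t1 of a canopy center join that canopy, and points within t2 are removed from consideration as future centers. A minimal single-machine sketch (illustrative only, not the Mahout MapReduce implementation):

```python
def canopy(points, t1, t2, distance):
    """Single-pass canopy clustering sketch. Requires t1 > t2.

    Returns a list of (center, members) pairs. Illustrative toy version
    of the canopy idea, not Mahout's distributed implementation.
    """
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        survivors = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)       # loosely within this canopy
            if d >= t2:
                survivors.append(p)     # may still seed another canopy
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

The canopy centers (typically averaged into centroids) are then handed to kmeans as its initial clusters, which is the handoff Suneel describes above.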
Re: bottom up clustering
Hello Suneel,

I got it. The next step after canopy is to feed these centroids to kmeans and cluster. However, what I want is to take the centroids from these clusters and cluster them in turn, so as to find related clusters.

Thanks,
Rajesh
Re: bottom up clustering
Rajesh,

The streaming k-means implementation is very much like what you are asking for. The first pass clusters the data into many, many clusters, and then clusters those clusters.

Sent from my iPhone
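The two-pass idea Ted describes can be sketched outside Mahout: over-cluster first, then run weighted k-means on the resulting centroids, using each rough centroid's accumulated weight when computing the final means. This is a toy Python sketch under those assumptions (plain Lloyd iterations on synthetic data; Mahout's streaming pass is considerably more sophisticated than this):

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeans(points, weights, k, iters=25, seed=0):
    """Plain Lloyd iterations on weighted points; returns (centers, center_weights)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        acc = [[0.0] * len(points[0]) for _ in range(k)]
        wts = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda i: sqdist(p, centers[i]))
            wts[j] += w
            for d, x in enumerate(p):
                acc[j][d] += w * x
        # weighted mean of each cluster; keep the old center if it went empty
        centers = [[x / wts[j] for x in acc[j]] if wts[j] > 0 else centers[j]
                   for j in range(k)]
    return centers, wts

# Two clumps of points around (0,0) and (10,10).
rng = random.Random(1)
data = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(50)] + \
       [(rng.gauss(10, 0.3), rng.gauss(10, 0.3)) for _ in range(50)]

# Pass 1: over-cluster into many small clusters.
rough, rough_w = weighted_kmeans(data, [1.0] * len(data), k=10)
# Pass 2: cluster the weighted rough centroids down to the final k.
final, _ = weighted_kmeans(rough, rough_w, k=2)
# The two final centers should land near (0,0) and (10,10).
print(sorted(round(c[0]) for c in final))
```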
Re: bottom up clustering
To add to Ted's reply: streaming k-means was recently added to Mahout (thanks to Dan and Ted). Here's the reference paper that describes streaming k-means: http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf

You have to be working off of trunk to use this; it's not available as part of any release yet.

The steps for using streaming k-means (I don't think it has been documented yet):

1. Generate sparse vectors via seq2sparse (you have this already).
2. Run:

mahout streamingkmeans -i <path to tfidf-vectors> -o <output path> \
  --tempDir <temp folder path> -ow \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k <no. of clusters> -km <see below for the math>

-k = number of clusters
-km = k * log(n), where k = number of clusters and n = number of data points to cluster; round this to the nearest integer.

You have the option of using FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch for the -sc parameter.
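The -km sizing rule above is easy to compute up front. A small helper, assuming the natural logarithm (the thread does not state which log base is intended):

```python
import math

def km_param(k, n):
    """-km = k * log(n), rounded to the nearest integer (natural log assumed)."""
    return round(k * math.log(n))

# e.g. 40 clusters over 1000 documents
print(km_param(40, 1000))  # 40 * ln(1000) ~ 276
```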