Re: bottom up clustering

2013-06-04 Thread Dan Filimon
Hi Rajesh,

Streaming k-means clusters Vectors (that are in <*, VectorWritable>
sequence files) and outputs <IntWritable, CentroidWritable> sequence files.
A Centroid is the same as a Vector with the addition of an index and a
weight. You can getVector() a Centroid to get its Vector.
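
For illustration, a minimal sketch (hypothetical paths, classic Hadoop
SequenceFile API, and assuming CentroidWritable exposes a getCentroid()
accessor) that reads the <IntWritable, CentroidWritable> output and rewrites
it as <IntWritable, VectorWritable>, so the centroids can be fed into another
clustering pass:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.VectorWritable;

public class CentroidsToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("/user/hadoop/news-stream-kmeans/part-r-00000");           // hypothetical
    Path out = new Path("/user/hadoop/news-stream-kmeans-vectors/part-m-00000");  // hypothetical

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, VectorWritable.class);

    IntWritable key = new IntWritable();
    CentroidWritable value = new CentroidWritable();
    while (reader.next(key, value)) {
      Centroid c = value.getCentroid();                       // index + weight + vector
      writer.append(key, new VectorWritable(c.getVector()));  // keep just the Vector
    }
    reader.close();
    writer.close();
  }
}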




On Mon, Jun 3, 2013 at 2:49 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 You should be able to feed arff.vectors to Streaming kmeans (I have not
 tried that myself; never had to work with arff).
 I had tfidf-vectors as an example; you should be good with arff.

 Give it a try and let us know.

Re: bottom up clustering

2013-06-03 Thread Rajesh Nikam
Hi Suneel,

I have used seqdirectory followed by seq2sparse on the 20newsgroups set.

Then used the following command to run streamingkmeans to get 40 clusters:

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/news-vectors/tf-vectors/ \
  -o /user/hadoop/news-stream-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 40 \
  -km 190 \
  -testp 0.3 \
  -mi 10 \
  -ow

Dumped output using seqdumper from
/user/hadoop/news-stream-kmeans/part-r-0.

In the dumped file I see the centroids dumped like:

Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 0, weight = 1.00, vector =
{1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
Key: 1: Value: key = 1, weight = 3.00, vector =
{1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
Key: 2: Value: key = 2, weight = 105.00, vector =
{794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
Key: 3: Value: key = 28, weight = 259.00, vector =
{1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
--- more ---

I have tried using arff.vector to convert arff to vectors, but I don't know
how to convert them to the tf-idf vector format expected by streaming kmeans.

Thanks
Rajesh

Re: bottom up clustering

2013-06-03 Thread Suneel Marthi
You should be able to feed arff.vectors to Streaming kmeans (I have not tried
that myself; never had to work with arff).
I had tfidf-vectors as an example; you should be good with arff.

Give it a try and let us know.

Re: bottom up clustering

2013-06-03 Thread Rajesh Nikam
I tried with the commands below:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.utils.vectors.arff.Driver \
  --input /mnt/cluster/t/input-set.arff \
  --output /user/hadoop/t/input-set-vector/ \
  --dictOut /mnt/cluster/t/input-set-dict

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/t/input-set-vector \
  -o /user/hadoop/t/skmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 4 \
  -km 12 \
  -testp 0.3 \
  -mi 10 \
  -ow

and dumped with seqdumper:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar \
  org.apache.mahout.utils.SequenceFileDumper \
  -i /user/hadoop/t/skmeans/part-r-0 \
  -o /mnt/cluster/t/skmeans-cluster-points.txt

The dump contains the centroids for the clusters.

==

This was a small test set for which I could guess the number of clusters.
Since streaming kmeans requires -k to be specified, how do I do the same when
the sample set is big?

It also gives an error like the following when k was specified as 40 to streamingkmeans:

-k 40 \
-km 190 \

java.lang.IllegalArgumentException: Must have more datapoints [4] than
clusters [40]
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)

==

How do I use these centroids for clustering? I am not understanding their use.

Thanks,
Rajesh

Re: bottom up clustering

2013-06-03 Thread Rajesh Nikam
I have 1500 points, and am computing km as k * log n.
On Jun 3, 2013 8:53 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 How many datapoints do you have in your input? How are you computing the
 value of -km?

Re: bottom up clustering

2013-06-03 Thread Suneel Marthi
From the exception it seems like you were trying to cluster 4 datapoints into
40 (-k) clusters, hence what you are seeing.

So if your datapoints n = 1500
and clusters k = 40,
-km = k * log n = 293 (rounded to the nearest integer) - it's a natural log.

Does that match with your inputs?
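
In code form, a quick sketch of that computation (Java's Math.log is the
natural log):

    int k = 40;     // number of clusters
    int n = 1500;   // number of datapoints
    long km = Math.round(k * Math.log(n));  // 40 * ln(1500) = 292.53... -> 293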

Sorry man, I am having a busy day and may not be of as much help as I would
have liked.

Dan, could you jump in?

Re: bottom up clustering

2013-05-31 Thread Rajesh Nikam
Hi Suneel,

Thanks a lot for the detailed steps!
I will try out the steps.

Thanks, Ted, for pointing this out!

Thanks,
Rajesh


Re: bottom up clustering

2013-05-30 Thread Suneel Marthi
The input to canopy is your vectors from seq2sparse and not cluster centroids
(as you had it), hence the error message you are seeing.

The output of canopy could be fed into kmeans as input centroids.
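
For illustration, a sketch of that two-step pipeline with hypothetical paths
(-c seeds kmeans with initial centroids; -x caps the iterations; -cl runs the
final cluster assignment):

# canopy over the seq2sparse vectors, not over cluster centroids
mahout canopy -i /user/hadoop/t/tfidf-vectors \
  -o /user/hadoop/t/canopy-centroids \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  -t1 0.01 -t2 0.02 -ow

# kmeans seeded with the canopy output, clustering the same vectors
mahout kmeans -i /user/hadoop/t/tfidf-vectors \
  -c /user/hadoop/t/canopy-centroids/clusters-0-final \
  -o /user/hadoop/t/kmeans-clusters \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  -x 10 -cl -ow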





From: Rajesh Nikam rajeshni...@gmail.com
To: user@mahout.apache.org
Sent: Thursday, May 30, 2013 10:56 AM
Subject: bottom up clustering

Hi,

I want to do bottom-up clustering (that is, hierarchical clustering) rather
than top-down clustering as mentioned in

https://cwiki.apache.org/MAHOUT/top-down-clustering.html
(kmeans -> clusterdump -> clusterpp, and then kmeans on each cluster)

How can I use the centroids from the first phase of canopy for the next level,
of course with correct t1 and t2?

I have tried using 'canopy', which gives centroids as output. How do I apply
one more level of clustering on these centroids?

/user/hadoop/t/canopy-centroids/clusters-0-final is the output of the first
level of canopy.

mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final \
  -o /user/hadoop/t/hclust \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  -t1 0.01 -t2 0.02 -ow

It gave the following error:

  13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
attempt_201305231030_0519_m_00_0, Status : FAILED
java.lang.ClassCastException:
org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
org.apache.mahout.math.VectorWritable

Thanks
Rajesh

Re: bottom up clustering

2013-05-30 Thread Rajesh Nikam
Hello Suneel,

I got it. The next step after canopy is to feed these centroids to kmeans and
cluster.

However, what I want is to use the centroids from these clusters and do
clustering on them, so as to find related clusters.

Thanks
Rajesh


Re: bottom up clustering

2013-05-30 Thread Ted Dunning
Rajesh

The streaming k-means implementation is very much like what you are asking for.
The first pass is to cluster into many, many clusters and then cluster those
clusters.

Sent from my iPhone


Re: bottom up clustering

2013-05-30 Thread Suneel Marthi
To add to Ted's reply, streaming k-means was recently added to Mahout (thanks 
to Dan and Ted).

Here's the reference paper that talks about Streaming k-means: 

http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf

You have to be working off of trunk to use this; it's not available as part of
any release yet.

The steps for using Streaming k-means (I don't think it's been documented yet):

1.  Generate sparse vectors via seq2sparse (you have this already).

2.  mahout streamingkmeans -i <path to tfidf-vectors> -o <output path>
    --tempDir <temp folder path> -ow
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
    -k <no. of clusters> -km <see below for the math>

-k = no. of clusters
-km = k * log(n), where k = no. of clusters and n = no. of datapoints to
cluster; round this to the nearest integer (log here is the natural log).

You have the option of using FastProjectionSearch, ProjectionSearch, or
LocalitySensitiveHashSearch for the -sc parameter.
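
Putting the steps together, a hypothetical invocation for n = 1500 datapoints
and k = 40 clusters (so -km = round(40 * ln 1500) = 293) might look like:

mahout streamingkmeans -i /user/hadoop/news-vectors/tfidf-vectors \
  -o /user/hadoop/news-stream-kmeans \
  --tempDir /user/hadoop/tmp -ow \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 40 -km 293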
