How many datapoints do you have in your input? How are you computing the value of -km?
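For reference, the rule of thumb further down this thread is -km = k * log(n), rounded to the nearest integer. A minimal sketch of that calculation follows; it assumes a natural log, and the k and n values are purely illustrative placeholders (substitute the actual number of datapoints in your input):

    # sketch only: -km ~= round(k * log(n)), per the rule of thumb in this thread
    # log base is assumed to be natural here
    k=40
    n=20000      # hypothetical; replace with your real datapoint count
    awk -v k="$k" -v n="$n" 'BEGIN { printf "%d\n", k * log(n) + 0.5 }'    # prints 396

Note also that the IllegalArgumentException quoted below reports only 4 datapoints, so with -k 40 the precondition fails regardless of what -km is set to.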
________________________________
From: Rajesh Nikam <rajeshni...@gmail.com>
To: Suneel Marthi <suneel_mar...@yahoo.com>
Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Ted Dunning <ted.dunn...@gmail.com>
Sent: Monday, June 3, 2013 9:55 AM
Subject: Re: bottom up clustering

I tried the commands below:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.vectors.arff.Driver \
  --input /mnt/cluster/t/input-set.arff \
  --output /user/hadoop/t/input-set-vector/ \
  --dictOut /mnt/cluster/t/input-set-dict

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/t/input-set-vector \
  -o /user/hadoop/t/skmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 4 \
  -km 12 \
  -testp 0.3 \
  -mi 10 \
  -ow

and dumped the result with seqdumper:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.SequenceFileDumper \
  -i /user/hadoop/t/skmeans/part-r-00000 \
  -o /mnt/cluster/t/skmeans-cluster-points.txt

The dump contains the centroids for the clusters.

==>> This was a small test set for which I could guess the number of clusters. Since streaming k-means requires -k to be specified, how do I do the same when the sample set is big?

It also gives an error when k is specified as 40 to streamingkmeans:

  -k 40 \
  -km 190 \

java.lang.IllegalArgumentException: Must have more datapoints [4] than clusters [40]
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)

==>> How do I use these centroids for clustering? I don't understand their use.

Thanks,
Rajesh

On Mon, Jun 3, 2013 at 6:19 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> You should be able to feed arff.vectors to Streaming k-means (I have not
> tried that myself, never had to work with arff).
> I had tfidf-vectors as an example; you should be good with arff.
>
> Give it a try and let us know.
>
>
> ------------------------------
> *From:* Rajesh Nikam <rajeshni...@gmail.com>
> *To:* "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_mar...@yahoo.com>
> *Cc:* Ted Dunning <ted.dunn...@gmail.com>
> *Sent:* Monday, June 3, 2013 4:30 AM
> *Subject:* Re: bottom up clustering
>
> Hi Suneel,
>
> I used seqdirectory followed by seq2sparse on the 20newsgroups set.
>
> Then I used the following command to run streamingkmeans to get 40 clusters:
>
> hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
>   -i /user/hadoop/news-vectors/tf-vectors/ \
>   -o /user/hadoop/news-stream-kmeans \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>   -k 40 \
>   -km 190 \
>   -testp 0.3 \
>   -mi 10 \
>   -ow
>
> and dumped the output using seqdumper from
> /user/hadoop/news-stream-kmeans/part-r-00000.
>
> In the dumped file I see the centroids are dumped like:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 0, weight = 1.00, vector = {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
> Key: 1: Value: key = 1, weight = 3.00, vector = {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
> Key: 2: Value: key = 2, weight = 105.00, vector = {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
> Key: 3: Value: key = 28, weight = 259.00, vector = {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
> -- more --
>
> I have tried using arff.vector to convert arff to vectors, but I don't know
> how to convert them to the tf-idf vector format expected by streaming k-means?
>
> Thanks
> Rajesh
>
>
> On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
>
> Hi Suneel,
>
> Thanks a lot for the detailed steps!
> I will try out the steps.
>
> Thanks, Ted, for pointing this out!
>
> Thanks,
> Rajesh
>
>
> On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
> To add to Ted's reply, streaming k-means was recently added to Mahout
> (thanks to Dan and Ted).
>
> Here's the reference paper that talks about streaming k-means:
>
> http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>
> You have to be working off of trunk to use this; it's not available as part
> of any release yet.
>
> The steps for using streaming k-means (I don't think it's been documented yet):
>
> 1. Generate sparse vectors via seq2sparse (you have this already).
>
> 2. mahout streamingkmeans -i <path to tfidf-vectors> -o <output path>
>    --tempDir <temp folder path> -ow
>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
>    -k <no. of clusters> -km <see below for the math>
>
> -k = no. of clusters
> -km = k * log(n), where k = no. of clusters and n = no. of datapoints to
> cluster; round this to the nearest integer.
>
> You have the option of using FastProjectionSearch, ProjectionSearch or
> LocalitySensitiveHashSearch for the -sc parameter.
>
>
> ________________________________
> From: Ted Dunning <ted.dunn...@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_mar...@yahoo.com>
> Sent: Thursday, May 30, 2013 12:03 PM
> Subject: Re: bottom up clustering
>
> Rajesh,
>
> The streaming k-means implementation is very much like what you are asking
> for. The first pass is to cluster into many, many clusters and then
> cluster those clusters.
>
> Sent from my iPhone
>
> On May 30, 2013, at 11:20, Rajesh Nikam <rajeshni...@gmail.com> wrote:
>
> > Hello Suneel,
> >
> > I got it. The next step after canopy is to feed these centroids to kmeans and
> > cluster.
> >
> > However, what I want is to use the centroids from these clusters and do
> > clustering on them so as to find related clusters.
> >
> > Thanks
> > Rajesh
> >
> >
> > On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> The input to canopy is your vectors from seq2sparse and not cluster
> >> centroids (as you had it), hence the error message you are seeing.
> >>
> >> The output of canopy could be fed into kmeans as input centroids.
> >>
> >> ________________________________
> >> From: Rajesh Nikam <rajeshni...@gmail.com>
> >> To: "user@mahout.apache.org" <user@mahout.apache.org>
> >> Sent: Thursday, May 30, 2013 10:56 AM
> >> Subject: bottom up clustering
> >>
> >> Hi,
> >>
> >> I want to do bottom-up clustering (i.e. hierarchical clustering) rather
> >> than top-down as described in
> >> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
> >> (kmeans -> clusterdump -> clusterpp and then kmeans on each cluster).
> >>
> >> How do I take the centroids from the first phase of canopy and use them
> >> for the next level, of course with the correct t1 and t2?
> >>
> >> I have tried using 'canopy', which gives centroids as output. How do I
> >> apply one more level of clustering on these centroids?
> >>
> >> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
> >> first level of canopy.
> >>
> >> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o /user/hadoop/t/hclust \
> >>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02 -ow
> >>
> >> It gave the following error:
> >>
> >> 13/05/30 20:21:38 INFO mapred.JobClient: Task Id : attempt_201305231030_0519_m_000000_0, Status : FAILED
> >> java.lang.ClassCastException: org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
> >> org.apache.mahout.math.VectorWritable
> >>
> >> Thanks
> >> Rajesh
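To make Suneel's suggestion above concrete (feed the seq2sparse vectors, not the centroids, to canopy, then hand canopy's clusters to kmeans via -c), here is a minimal sketch. The paths and the t1/t2 values are hypothetical placeholders, and the flags shown are the standard canopy and kmeans driver options as I understand them for this generation of Mahout; adjust everything to your own data and distance measure:

    # sketch only: hypothetical paths, placeholder t1/t2
    mahout canopy -i /user/hadoop/t/tfidf-vectors \
      -o /user/hadoop/t/canopy-centroids \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
      -t1 0.5 -t2 0.3 -ow

    # use the canopy centroids as the initial clusters for kmeans (-c),
    # then assign points to clusters (-cl)
    mahout kmeans -i /user/hadoop/t/tfidf-vectors \
      -c /user/hadoop/t/canopy-centroids/clusters-0-final \
      -o /user/hadoop/t/kmeans-clusters \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
      -x 10 -cl -ow

The point of the -c flag is that kmeans starts from the canopy centroids instead of random seeds, which is the "canopy output fed into kmeans as input centroids" step described above; the ClassCastException in the previous message came from running canopy itself on ClusterWritable centroids instead of VectorWritable vectors.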