Thanks!

Got the -km logic corrected using the natural log.

How do I get the number of clusters for a big data set? Canopy could be used,
since it gives centroids. What then is the use of streaming k-means, and what
is its role in hierarchical clustering?
On Jun 3, 2013 9:34 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

> From the exception, it seems you were trying to cluster 4 datapoints into
> 40 (-k) clusters, hence the error you are seeing.
>
> So if your datapoints n = 1500
> and clusters k = 40, then
> -km = k * ln(n) = 40 * ln(1500) ≈ 292.5, i.e. ~293 rounded to the nearest
> integer (note it's a natural log).
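>
> A quick way to sanity-check that arithmetic (a bc one-liner; with -l, bc's
> l() is the natural log):
>
>     echo "40 * l(1500)" | bc -l   # prints 292.52...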
>
> Does that match your inputs?
>
> Sorry, I'm having a busy day and may not be able to help as much as I would
> have liked.
>
> Dan, could you jump in?
>
>
>
>
> ________________________________
>  From: Rajesh Nikam <rajeshni...@gmail.com>
> To: user@mahout.apache.org
> Cc: Ted Dunning <ted.dunn...@gmail.com>
> Sent: Monday, June 3, 2013 11:51 AM
> Subject: Re: bottom up clustering
>
>
> I have 1500 points, and am computing -km as k * log(n).
> On Jun 3, 2013 8:53 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
> > How many datapoints do you have in your input? How are you computing the
> > value of -km?
> >
> >
> >
> >
> > ________________________________
> >  From: Rajesh Nikam <rajeshni...@gmail.com>
> > To: Suneel Marthi <suneel_mar...@yahoo.com>
> > Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Ted Dunning <
> > ted.dunn...@gmail.com>
> > Sent: Monday, June 3, 2013 9:55 AM
> > Subject: Re: bottom up clustering
> >
> >
> > I tried the commands below:
> >
> > hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar
> > org.apache.mahout.utils.vectors.arff.Driver --input
> > /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/
> > --dictOut /mnt/cluster/t/input-set-dict
> >
> > hadoop jar mahout-core-0.8-SNAPSHOT-job.jar
> > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
> >   -i /user/hadoop/t/input-set-vector \
> >   -o /user/hadoop/t/skmeans \
> >   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >   -k 4 \
> >   -km 12 \
> >   -testp 0.3 \
> >   -mi 10 \
> >   -ow
> >
> > and dumped the output with seqdumper:
> >
> > hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar
> > org.apache.mahout.utils.SequenceFileDumper -i
> > /user/hadoop/t/skmeans/part-r-00000 -o
> > /mnt/cluster/t/skmeans-cluster-points.txt
> >
> > The dump contains the cluster centroids.
> >
> > ==>>
> >
> > This was a small test set for which I could guess the number of clusters.
> > Since streaming k-means requires -k to be specified, how do I do the same
> > when the sample set is big?
> >
> > It also gives an error when k is specified as 40 to streamingkmeans:
> >
> > -k 40 \
> > -km 190 \
> >
> > java.lang.IllegalArgumentException: Must have more datapoints [4] than
> > clusters [40]
> >         at
> > com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
> >
> > ==>>
> >
> > How can I use these centroids for clustering? I am not understanding their
> > use.
> >
> > Thanks,
> > Rajesh
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Jun 3, 2013 at 6:19 PM, Suneel Marthi <suneel_mar...@yahoo.com>
> > wrote:
> >
> > > You should be able to feed arff.vectors to streaming k-means (I have not
> > > tried that myself; I've never had to work with ARFF).
> > > I used tfidf-vectors as an example; you should be good with ARFF.
> > >
> > > Give it a try and let us know.
> > >
> > >
> > >   ------------------------------
> > >  *From:* Rajesh Nikam <rajeshni...@gmail.com>
> > > *To:* "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi
> > > <suneel_mar...@yahoo.com>
> > > *Cc:* Ted Dunning <ted.dunn...@gmail.com>
> > > *Sent:* Monday, June 3, 2013 4:30 AM
> > >
> > > *Subject:* Re: bottom up clustering
> > >
> > > Hi Suneel,
> > >
> > > I have used seqdirectory followed by seq2sparse on the 20newsgroups set.
> > >
> > > Then I used the following command to run streamingkmeans to get 40 clusters:
> > >
> > > hadoop jar mahout-core-0.8-SNAPSHOT-job.jar
> > > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
> > >   -i /user/hadoop/news-vectors/tf-vectors/ \
> > >   -o /user/hadoop/news-stream-kmeans \
> > >   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> > >   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> > >   -k 40 \
> > >   -km 190 \
> > >   -testp 0.3 \
> > >   -mi 10 \
> > >   -ow
> > >
> > > I dumped the output using seqdumper from
> > > /user/hadoop/news-stream-kmeans/part-r-00000.
> > >
> > > In the dumped file I see the centroids, like:
> > >
> > > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > > org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> > > Key: 0: Value: key = 0, weight = 1.00, vector =
> > > {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
> > > Key: 1: Value: key = 1, weight = 3.00, vector =
> > > {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
> > > Key: 2: Value: key = 2, weight = 105.00, vector =
> > > {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
> > > Key: 3: Value: key = 28, weight = 259.00, vector =
> > > {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
> > > --
> > > ... more of the dump follows ...
> > > --
> > >
> > > I have tried using arff.vector to convert ARFF to vectors, but I don't
> > > know how to convert them to the tf-idf vector format expected by streaming
> > > k-means.
> > >
> > > Thanks
> > > Rajesh
> > >
> > >
> > >
> > > On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshni...@gmail.com>
> > > wrote:
> > >
> > > Hi Suneel,
> > >
> > > Thanks a lot for the detailed steps!
> > > I will try out the steps.
> > >
> > > Thanks, Ted for pointing this out!
> > >
> > > Thanks,
> > > Rajesh
> > >
> > >
> > > On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_mar...@yahoo.com>
> > > wrote:
> > >
> > > To add to Ted's reply, streaming k-means was recently added to Mahout
> > > (thanks to Dan and Ted).
> > >
> > > Here's the reference paper that talks about Streaming k-means:
> > >
> > > http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
> > >
> > > You have to be working off of trunk to use this; it's not available as
> > > part of any release yet.
> > >
> > > The steps for using streaming k-means (I don't think it's been documented
> > > yet):
> > >
> > > 1. Generate sparse vectors via seq2sparse (you have this already).
> > >
> > > 2. mahout streamingkmeans -i <path to tfidf-vectors> -o <output path>
> > >    --tempDir <temp folder path> -ow
> > >    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> > >    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
> > >    -k <No. of clusters> -km <see below for the math>
> > >
> > > -k = no. of clusters
> > > -km = k * ln(n), where k = no. of clusters and n = no. of datapoints to
> > > cluster; round this to the nearest integer (it's a natural log).
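> > >
> > > (For example, with n = 100,000 datapoints and k = 40 clusters,
> > > -km = 40 * ln(100000) ≈ 461.)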
> > >
> > > You have the option of using FastProjectionSearch, ProjectionSearch, or
> > > LocalitySensitiveHashSearch for the -sc parameter.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ________________________________
> > >  From: Ted Dunning <ted.dunn...@gmail.com>
> > > To: "user@mahout.apache.org" <user@mahout.apache.org>
> > > Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <
> > > suneel_mar...@yahoo.com>
> > > Sent: Thursday, May 30, 2013 12:03 PM
> > > Subject: Re: bottom up clustering
> > >
> > >
> > > Rajesh
> > >
> > > The streaming k-means implementation is very much like what you are
> > > asking for. The first pass is to cluster the data into many, many
> > > clusters, and the second pass is to cluster those clusters.
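> > >
> > > (To make that concrete, using the k * ln(n) sizing discussed above: with,
> > > say, n = 1,000,000 points and k = 40, the first pass might reduce the data
> > > to roughly 40 * ln(1000000) ≈ 553 weighted sketch centroids, and the
> > > second pass would then cluster those few hundred centroids down to the
> > > final 40.)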
> > >
> > > Sent from my iPhone
> > >
> > > On May 30, 2013, at 11:20, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > >
> > > > Hello Suneel,
> > > >
> > > > I got it. The next step after canopy is to feed these centroids to
> > > > kmeans and cluster.
> > > >
> > > > However, what I want is to use the centroids from these clusters and do
> > > > clustering on them, so as to find related clusters.
> > > >
> > > > Thanks
> > > > Rajesh
> > > >
> > > >
> > > > On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_mar...@yahoo.com>
> > > > wrote:
> > > >
> > > >> The input to canopy is your vectors from seq2sparse, not cluster
> > > >> centroids (as you had it), hence the error message you are seeing.
> > > >>
> > > >> The output of canopy could then be fed into kmeans as the input
> > > >> centroids, as in the sketch below.
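> > > >>
> > > >> (A minimal sketch of that flow, assuming tf-idf vectors from seq2sparse;
> > > >> the paths, t1/t2 values, convergence delta, and iteration count are
> > > >> placeholders to adapt:
> > > >>
> > > >> mahout canopy -i /path/to/tfidf-vectors -o /path/to/canopy-centroids \
> > > >>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
> > > >>   -t1 0.5 -t2 0.3 -ow
> > > >>
> > > >> # feed the canopy centroids to kmeans via -c; -cl also assigns the
> > > >> # input points to the final clusters
> > > >> mahout kmeans -i /path/to/tfidf-vectors \
> > > >>   -c /path/to/canopy-centroids/clusters-0-final \
> > > >>   -o /path/to/kmeans-clusters \
> > > >>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
> > > >>   -cd 0.5 -x 10 -ow -cl
> > > >>
> > > >> Note that t1 > t2 for canopy.)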
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> ________________________________
> > > >> From: Rajesh Nikam <rajeshni...@gmail.com>
> > > >> To: "user@mahout.apache.org" <user@mahout.apache.org>
> > > >> Sent: Thursday, May 30, 2013 10:56 AM
> > > >> Subject: bottom up clustering
> > > >>
> > > >>
> > > >> Hi,
> > > >>
> > > >> I want to do bottom-up clustering (that is, hierarchical clustering)
> > > >> rather than top-down as mentioned in
> > > >> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
> > > >> (kmeans -> clusterdump -> clusterpp, and then kmeans on each cluster).
> > > >>
> > > >> How can I take the centroids from the first phase of canopy and use them
> > > >> for the next level, of course with the correct t1 and t2?
> > > >>
> > > >> I have tried using 'canopy', which gives centroids as output. How can I
> > > >> apply one more level of clustering on these centroids?
> > > >>
> > > >> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
> > > >> first level of canopy.
> > > >>
> > > >> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final \
> > > >>   -o /user/hadoop/t/hclust \
> > > >>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
> > > >>   -t1 0.01 -t2 0.02 -ow
> > > >>
> > > >> It gave the following error:
> > > >>
> > > >> 13/05/30 20:21:38 INFO mapred.JobClient: Task Id :
> > > >> attempt_201305231030_0519_m_000000_0, Status : FAILED
> > > >> java.lang.ClassCastException:
> > > >> org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
> > > >> org.apache.mahout.math.VectorWritable
> > > >>
> > > >> Thanks
> > > >> Rajesh
> > > >>
> > >
> > >
> > >
> > >
> > >
> > >
