How many datapoints do you have in your input? How are you computing the value of -km?
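For reference, the rule of thumb further down this thread is -km = k * log(n), rounded to the nearest integer. A minimal sketch of that calculation follows; it assumes a natural log, and the k and n values are purely illustrative placeholders (substitute the actual number of datapoints in your input):

    # sketch only: -km ~= round(k * log(n)), per the rule of thumb in this thread
    # log base is assumed to be natural here
    k=40
    n=20000      # hypothetical; replace with your real datapoint count
    awk -v k="$k" -v n="$n" 'BEGIN { printf "%d\n", k * log(n) + 0.5 }'    # prints 396

Note also that the IllegalArgumentException quoted below reports only 4 datapoints, so with -k 40 the precondition fails regardless of what -km is set to.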
________________________________
From: Rajesh Nikam <rajeshni...@gmail.com>
To: Suneel Marthi <suneel_mar...@yahoo.com>
Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Ted Dunning <ted.dunn...@gmail.com>
Sent: Monday, June 3, 2013 9:55 AM
Subject: Re: bottom up clustering

I tried the commands below:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.vectors.arff.Driver \
  --input /mnt/cluster/t/input-set.arff \
  --output /user/hadoop/t/input-set-vector/ \
  --dictOut /mnt/cluster/t/input-set-dict

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
  -i /user/hadoop/t/input-set-vector \
  -o /user/hadoop/t/skmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 4 \
  -km 12 \
  -testp 0.3 \
  -mi 10 \
  -ow

and dumped the result with seqdumper:

hadoop jar mahout-examples-0.8-SNAPSHOT-job.jar org.apache.mahout.utils.SequenceFileDumper \
  -i /user/hadoop/t/skmeans/part-r-00000 \
  -o /mnt/cluster/t/skmeans-cluster-points.txt

The dump contains the centroids for the clusters.

==>> This was a small test set for which I could guess the number of clusters. Since streaming k-means requires -k to be specified, how do I do the same when the sample set is big?

It also gives an error when k is specified as 40 to streamingkmeans:

  -k 40 \
  -km 190 \

java.lang.IllegalArgumentException: Must have more datapoints [4] than clusters [40]
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)

==>> How do I use these centroids for clustering? I don't understand their use.

Thanks,
Rajesh

On Mon, Jun 3, 2013 at 6:19 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> You should be able to feed arff.vectors to Streaming k-means (I have not
> tried that myself, never had to work with arff).
> I had tfidf-vectors as an example; you should be good with arff.
>
> Give it a try and let us know.
>
>
> ------------------------------
> *From:* Rajesh Nikam <rajeshni...@gmail.com>
> *To:* "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_mar...@yahoo.com>
> *Cc:* Ted Dunning <ted.dunn...@gmail.com>
> *Sent:* Monday, June 3, 2013 4:30 AM
> *Subject:* Re: bottom up clustering
>
> Hi Suneel,
>
> I used seqdirectory followed by seq2sparse on the 20newsgroups set.
>
> Then I used the following command to run streamingkmeans to get 40 clusters:
>
> hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver \
>   -i /user/hadoop/news-vectors/tf-vectors/ \
>   -o /user/hadoop/news-stream-kmeans \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>   -k 40 \
>   -km 190 \
>   -testp 0.3 \
>   -mi 10 \
>   -ow
>
> and dumped the output using seqdumper from
> /user/hadoop/news-stream-kmeans/part-r-00000.
>
> In the dumped file I see the centroids are dumped like:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 0, weight = 1.00, vector = {1421:1.0,2581:1.0,5911:1.0,7854:3.0,7855:3.0,10022:2.0,11141:1.0,11188:1.0,11533:1.0,
> Key: 1: Value: key = 1, weight = 3.00, vector = {1297:1.0,1421:0.0,1499:1.0,2581:0.0,5899:1.0,5911:0.0,6322:2.0,6741:1.0,6869:1.0,7854
> Key: 2: Value: key = 2, weight = 105.00, vector = {794:0.09090909090909091,835:0.045454545454545456,1120:0.045454545454545456,1297:0.0
> Key: 3: Value: key = 28, weight = 259.00, vector = {1:0.030303030303030297,8:0.0101010101010101,12:0.0202020202020202,18:0.02020202020
> -- more --
>
> I have tried using arff.vector to convert arff to vectors, but I don't know
> how to convert them to the tf-idf vector format expected by streaming k-means?
>
> Thanks
> Rajesh
>
>
> On Fri, May 31, 2013 at 7:23 PM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
>
> Hi Suneel,
>
> Thanks a lot for the detailed steps!
> I will try out the steps.
>
> Thanks, Ted, for pointing this out!
>
> Thanks,
> Rajesh
>
>
> On Thu, May 30, 2013 at 9:50 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
> To add to Ted's reply, streaming k-means was recently added to Mahout
> (thanks to Dan and Ted).
>
> Here's the reference paper that talks about streaming k-means:
>
> http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
>
> You have to be working off of trunk to use this; it's not available as part
> of any release yet.
>
> The steps for using streaming k-means (I don't think it's been documented yet):
>
> 1. Generate sparse vectors via seq2sparse (you have this already).
>
> 2. mahout streamingkmeans -i <path to tfidf-vectors> -o <output path>
>    --tempDir <temp folder path> -ow
>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch
>    -k <no. of clusters> -km <see below for the math>
>
> -k = no. of clusters
> -km = k * log(n), where k = no. of clusters and n = no. of datapoints to
> cluster; round this to the nearest integer.
>
> You have the option of using FastProjectionSearch, ProjectionSearch or
> LocalitySensitiveHashSearch for the -sc parameter.
>
>
> ________________________________
> From: Ted Dunning <ted.dunn...@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Cc: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi <suneel_mar...@yahoo.com>
> Sent: Thursday, May 30, 2013 12:03 PM
> Subject: Re: bottom up clustering
>
> Rajesh,
>
> The streaming k-means implementation is very much like what you are asking
> for. The first pass is to cluster into many, many clusters and then
> cluster those clusters.
>
> Sent from my iPhone
>
> On May 30, 2013, at 11:20, Rajesh Nikam <rajeshni...@gmail.com> wrote:
>
> > Hello Suneel,
> >
> > I got it. The next step after canopy is to feed these centroids to kmeans and
> > cluster.
> >
> > However, what I want is to use the centroids from these clusters and do
> > clustering on them so as to find related clusters.
> >
> > Thanks
> > Rajesh
> >
> >
> > On Thu, May 30, 2013 at 8:38 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> The input to canopy is your vectors from seq2sparse and not cluster
> >> centroids (as you had it), hence the error message you are seeing.
> >>
> >> The output of canopy could be fed into kmeans as input centroids.
> >>
> >> ________________________________
> >> From: Rajesh Nikam <rajeshni...@gmail.com>
> >> To: "user@mahout.apache.org" <user@mahout.apache.org>
> >> Sent: Thursday, May 30, 2013 10:56 AM
> >> Subject: bottom up clustering
> >>
> >> Hi,
> >>
> >> I want to do bottom-up clustering (i.e. hierarchical clustering) rather
> >> than top-down as described in
> >> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
> >> (kmeans -> clusterdump -> clusterpp and then kmeans on each cluster).
> >>
> >> How do I take the centroids from the first phase of canopy and use them
> >> for the next level, of course with the correct t1 and t2?
> >>
> >> I have tried using 'canopy', which gives centroids as output. How do I
> >> apply one more level of clustering on these centroids?
> >>
> >> /user/hadoop/t/canopy-centroids/clusters-0-final is the output of the
> >> first level of canopy.
> >>
> >> mahout canopy -i /user/hadoop/t/canopy-centroids/clusters-0-final -o /user/hadoop/t/hclust \
> >>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -t1 0.01 -t2 0.02 -ow
> >>
> >> It gave the following error:
> >>
> >> 13/05/30 20:21:38 INFO mapred.JobClient: Task Id : attempt_201305231030_0519_m_000000_0, Status : FAILED
> >> java.lang.ClassCastException: org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to
> >> org.apache.mahout.math.VectorWritable
> >>
> >> Thanks
> >> Rajesh
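To make Suneel's suggestion above concrete (feed the seq2sparse vectors, not the centroids, to canopy, then hand canopy's clusters to kmeans via -c), here is a minimal sketch. The paths and the t1/t2 values are hypothetical placeholders, and the flags shown are the standard canopy and kmeans driver options as I understand them for this generation of Mahout; adjust everything to your own data and distance measure:

    # sketch only: hypothetical paths, placeholder t1/t2
    mahout canopy -i /user/hadoop/t/tfidf-vectors \
      -o /user/hadoop/t/canopy-centroids \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
      -t1 0.5 -t2 0.3 -ow

    # use the canopy centroids as the initial clusters for kmeans (-c),
    # then assign points to clusters (-cl)
    mahout kmeans -i /user/hadoop/t/tfidf-vectors \
      -c /user/hadoop/t/canopy-centroids/clusters-0-final \
      -o /user/hadoop/t/kmeans-clusters \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
      -x 10 -cl -ow

The point of the -c flag is that kmeans starts from the canopy centroids instead of random seeds, which is the "canopy output fed into kmeans as input centroids" step described above; the ClassCastException in the previous message came from running canopy itself on ClusterWritable centroids instead of VectorWritable vectors.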