Re: Mahout-279/kmeans++

Ted Dunning Thu, 30 Aug 2012 11:49:42 -0700

The input to BallKmeans is actually not a matrix.  It is an
Iterable<MatrixSlice>.  This can be a matrix, since a matrix implements
this interface.

So one way to deal with this is to build your own Iterable and put
NamedVectors into it.  A NamedVector retains its label, which is what you want.
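
A rough sketch of that idea, using plain JDK types so it runs without Mahout. The classes `LabeledRow` and `LabeledRows` are hypothetical stand-ins, not Mahout classes: `LabeledRow` plays the role of a NamedVector, and `LabeledRows` is the hand-built Iterable. It assumes the label column holds integer IDs. With Mahout on the classpath, each row would presumably be wrapped in a NamedVector instead.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a named vector: a label plus the numeric values. */
final class LabeledRow {
    final String name;
    final double[] values;

    LabeledRow(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

/** Iterates over the rows of a matrix, using one column as each row's name. */
final class LabeledRows implements Iterable<LabeledRow> {
    private final List<LabeledRow> rows = new ArrayList<>();

    LabeledRows(double[][] matrix, int labelColumn) {
        for (double[] row : matrix) {
            // Copy every column except the label column into the feature values.
            double[] values = new double[row.length - 1];
            int k = 0;
            for (int j = 0; j < row.length; j++) {
                if (j != labelColumn) {
                    values[k++] = row[j];
                }
            }
            // Assumption: labels are integer IDs stored as doubles.
            String name = String.valueOf((long) row[labelColumn]);
            rows.add(new LabeledRow(name, values));
        }
    }

    @Override
    public Iterator<LabeledRow> iterator() {
        return rows.iterator();
    }
}
```

The label column is removed from the values, so it travels with the row as a name without being treated as a clustering feature.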

On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mwhit...@harris.com> wrote:

> I need to be using the matrices for BallKmeans.  Can matrices be named? By
> this I mean can I assign a column of my matrix to be the "name" of each row?
>
> Thanks!
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Wednesday, August 29, 2012 12:17 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Yes.  The ball k-means implementation does use weights to indicate multiple
> vectors.
>
> The implementation is definitely ready to test.  I would be slightly
> surprised if it has absolutely zero issues, but your feedback on such
> issues would help them get fixed much sooner than others.
>
> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>
> > I re-ran the canopy-kmeans analytic, this time with unique names, and lost
> > more points in the resulting clusters (total points in the clusters =
> > 745490, vs. previously 1599154 for v0.7 and 45901885 for v0.5).  The total
> > number of data points fed into the algorithm is 53365862, so even v0.5 is
> > missing 14% of the data.
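
The percentages behind those counts can be checked with a few lines of arithmetic. This is only a sketch. `CoverageCheck` and `missingPercent` are hypothetical names introduced here, and the inputs are the totals quoted in the message above.

```java
/** Hypothetical helper for checking how much of the input reached a cluster. */
final class CoverageCheck {
    /** Percentage of the input points that do not appear in any output cluster. */
    static double missingPercent(long clustered, long total) {
        return 100.0 * (total - clustered) / total;
    }
}
```

Calling `CoverageCheck.missingPercent(45901885L, 53365862L)` gives roughly 14 for v0.5. The same call with 1599154 (v0.7) gives roughly 97, and with 745490 (v0.7 with unique names) roughly 98.6.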
> >
> > I'm thinking that if I weight these dense vectors with a weight equal to
> > the number of identical vectors in the set, that could work; Ball Kmeans
> > seems to do this.  Is this a correct interpretation of how to use weights
> > in Ball Kmeans, and is Ball Kmeans ready enough to be used/tested?
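
One way to picture the weighting idea is to collapse identical points into a single point whose weight is the number of occurrences. The sketch below uses plain JDK collections. `WeightedPoint` and `DuplicateCollapser` are hypothetical names, and how such weights would be passed to BallKmeans is not shown here because it depends on that implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder: one distinct point plus the number of times it occurred. */
final class WeightedPoint {
    final double[] values;
    final int weight;

    WeightedPoint(double[] values, int weight) {
        this.values = values;
        this.weight = weight;
    }
}

final class DuplicateCollapser {
    /** Collapses identical points into one weighted point each, in first-seen order. */
    static List<WeightedPoint> collapse(List<double[]> points) {
        // Arrays compare by identity, so key on a List<Double> to compare by contents.
        Map<List<Double>, Integer> counts = new LinkedHashMap<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>(p.length);
            for (double v : p) {
                key.add(v);
            }
            counts.merge(key, 1, Integer::sum);
        }
        List<WeightedPoint> result = new ArrayList<>();
        for (Map.Entry<List<Double>, Integer> e : counts.entrySet()) {
            double[] values = new double[e.getKey().size()];
            for (int i = 0; i < values.length; i++) {
                values[i] = e.getKey().get(i);
            }
            result.add(new WeightedPoint(values, e.getValue()));
        }
        return result;
    }
}
```

The weights always sum to the number of input points, which gives a simple check that no data was lost while collapsing.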
> >
> > Thanks
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:pran...@xebia.com]
> > Sent: Thursday, August 23, 2012 12:34 PM
> > To: user@mahout.apache.org
> > Subject: Re: Mahout-279/kmeans++
> >
> > clusterDump works in memory, and there are no plans yet to make it
> > distributed (or not in memory). See
> > https://issues.apache.org/jira/browse/MAHOUT-940
> >
> > clusterpp has an option for distributed processing, so you can process any
> > amount of data with it.
> >
> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
> > > Yes, unique names will be my next plan -- I just can't kick off that job
> > > until after the weekend.  If this makes no difference I will also try the
> > > noise idea, and I'll follow up about both.
> > >
> > > My next question is regarding clusterDump.  Is there a way to run this
> > > in parallel? I have found some code to execute in Java (the preferable
> > > method for me) but I would like the method to be faster and not in memory.
> > > Is this a possibility? Or in the works?
> > >
> > > Thanks!
> > >
> > > -----Original Message-----
> > > From: Paritosh Ranjan [mailto:pran...@xebia.com]
> > > Sent: Wednesday, August 22, 2012 9:09 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: Mahout-279/kmeans++
> > >
> > > Can you also try to provide distinct names to vectors and then cluster?
> > > It should not have any effect, but it would be good to know the behavior.
> > >
> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
> > >> Yes, I have data which is exactly the same.  If I give every vector a
> > >> name which is distinct (albeit the data point is the same as other points
> > >> in the set), will this keep the algorithm from dropping non-distinct
> > >> vectors/data points (which is what I THINK, but have yet to verify, is
> > >> going on)?
> > >>
> > >> Thanks,
> > >>
> > >> Mattie
> > >>
> > >> -----Original Message-----
> > >> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > >> Sent: Wednesday, August 22, 2012 1:18 PM
> > >> To: user@mahout.apache.org
> > >> Subject: Re: Mahout-279/kmeans++
> > >>
> > >> Just an off thought, do you have duplicate input points?
> > >>
> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
> > >>
> > >>> ... I have also verified by running canopy multiple times with 0.5 and
> > >>> 0.7 that there is a continual discrepancy between the two clustering
> > >>> versions.  The max/min vectors in a cluster using 0.5 is 19192158/215,
> > >>> and using 0.7 is 921998/5.  They should not necessarily be the same,
> > >>> since I am using canopy clustering to find initial centroids; however, I
> > >>> would think they would have the same sum, which they do not (45901885
> > >>> vs 1599154).
> > >>>
> > >>> Here is the method I am running:
> > >>>
> > >>> public static void KmeansClusteringCanopy(String outputDir, String T,
> > >>>         String itMax)
> > >>>         throws IOException, InterruptedException, ClassNotFoundException,
> > >>>         InstantiationException, IllegalAccessException {
> > >>>
> > >>>     Configuration conf = new Configuration();
> > >>>
> > >>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
> > >>>
> > >>>     Path vectorsFolder = new Path(outputDir, "vectors");
> > >>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> > >>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> > >>>
> > >>>     // create canopies instead of initial vectors
> > >>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> > >>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> > >>>
> > >>>     // kmeans cluster operation
> > >>>     KMeansDriver.run(conf, vectorsFolder,
> > >>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> > >>>             clusterOutput, measure, 0.01,
> > >>>             Integer.parseInt(itMax), true, 0.0, false);
> > >>>
> > >>>     // post-process by putting completed clusters into their own files
> > >>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
> > >>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> > >>> }
> > >>>
> > >>> What do you think?
> > >>>
> > >>> On another but related note: Is there a plan to have a method -- say
> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> > >>> within clusters as well as a separate folder containing pruned outliers?
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Mattie
> > >>>
> > >>> -----Original Message-----
> > >>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
> > >>> Sent: Friday, August 17, 2012 12:16 PM
> > >>> To: user@mahout.apache.org
> > >>> Subject: Re: Mahout-279/kmeans++
> > >>>
> > >>> The clustering algorithm has also changed internally. So, expect the
> > >>> results to be different ( and better ).
> > >>>
> > >>> I can think of one reason for this behavior. Maybe lots of clusters
> > >>> have only one vector inside them, and, AFAIK, clusterdumper will not
> > >>> output any cluster with a single vector.
> > >>> So, I think, it's clusterdumper which is doing the invisible "pruning"
> > >>> (by not outputting clusters with single vectors).
> > >>>
> > >>> Can you cross-check the output once with
> > >>> ClusterOutputPostProcessorDriver?
> > >>>
> > >>> No, no tool can output the pruned vectors. The only way to see all
> > >>> vectors assigned to any cluster is to set clusterClassificationThreshold
> > >>> to 0.
> > >>>
> > >>> If you still face the problem, then please provide the parameters with
> > >>> which you are calling kmeans.
> > >>>
> > >>> Regarding "I should also mention I have vectors which are exactly the
> > >>> same (even their names), perhaps they are the ones being pruned, is that
> > >>> possible? "
> > >>>
> > >>> The name of the vector has nothing to do with clustering. I am not sure
> > >>> whether it will have any effect when clusterdumper is in action, so
> > >>> cross-checking with ClusterOutputPostProcessorDriver will answer this.
> > >>>
> > >>> Good luck.
> > >>> Paritosh
> > >>>
> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
> > >>>> (v0.5) when I did a clusterdump the total number of vectors within the
> > >>>> resultant clusters was the same as the total number fed to the algorithm.
> > >>>> I wish this to be the case when clustering with v0.7.  The only change in
> > >>>> the algorithm is clusterClassificationThreshold; I set this value to 0
> > >>>> so that it will in fact cluster all vectors in the dataset.
> > >>>>
> > >>>> My logic here was that no vector should have a probability of being in
> > >>>> some cluster less than 0, and therefore all vectors should cluster.
> > >>>>
> > >>>> However, after running a clusterdump I find that vectors (roughly 1/3)
> > >>>> have been pruned.
> > >>>>
> > >>>> Is this a bug, or me just not understanding the new capabilities?
> > >>>>
> > >>>> I should also mention I have vectors which are exactly the same (even
> > >>>> their names), perhaps they are the ones being pruned, is that possible?
> > >>>>
> > >>>> Another question if I may: I will eventually want to use the pruning
> > >>>> capabilities.  Does the ClusterOutputPostProcessorDriver method (or a
> > >>>> similar method) have the capability of outputting the pruned vectors
> > >>>> into a folder?
> > >>>>
> > >>>> Thanks! Please let me know if I'm still not being clear enough.
> > >>>>
> > >>>> Mattie
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
> > >>>> Sent: Friday, August 17, 2012 11:20 AM
> > >>>> To: user@mahout.apache.org
> > >>>> Subject: Re: Mahout-279/kmeans++
> > >>>>
> > >>>> clusterClassificationThreshold is for outlier removal, and this is the
> > >>>> way it should be used.
> > >>>>
> > >>>> Can you provide some more information about your job and the way you are
> > >>>> calling it?
> > >>>>
> > >>>> And if I look at the code, the vector should be clustered even if the
> > >>>> pdf is 0. The method which decides whether the vector should be assigned
> > >>>> to a particular cluster or not:
> > >>>>
> > >>>> /**
> > >>>>  * Decides whether the vector should be classified or not based on the
> > >>>>  * max pdf value of the clusters and threshold value.
> > >>>>  *
> > >>>>  * @return whether the vector should be classified or not.
> > >>>>  */
> > >>>> private static boolean shouldClassify(Vector pdfPerCluster,
> > >>>>         Double clusterClassificationThreshold) {
> > >>>>     return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> > >>>> }
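
The same comparison can be tried on a plain JDK with a minimal stand-in. This is a sketch only: `ThresholdCheck` is a hypothetical class, and a `double[]` replaces Mahout's Vector, so it mirrors the quoted logic rather than the library code.

```java
/** Hypothetical stand-in for the quoted check, using double[] instead of Vector. */
final class ThresholdCheck {
    /** Returns true when the largest per-cluster pdf reaches the threshold. */
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double p : pdfPerCluster) {
            max = Math.max(max, p);
        }
        return max >= threshold;
    }
}
```

With a threshold of 0.0, `ThresholdCheck.shouldClassify(new double[] {0.0, 0.0}, 0.0)` is true, which matches the statement that a vector is kept even when its pdf is 0. With a threshold of 0.5 and pdfs of 0.2 and 0.1, the result is false.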
> > >>>>
> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
> > >>>>
> > >>>>> Hi Ted,
> > >>>>>
> > >>>>> Yes this is great!  I hope to start working with this algorithm in the
> > >>>>> next couple of weeks.
> > >>>>>
> > >>>>> I have a question about the 0.7 implementation of kmeans and the
> > >>>>> clusterClassificationThreshold.  I have this value set at zero, but the
> > >>>>> output is still showing that about 1/3 of my data is not assigned to a
> > >>>>> cluster.  Am I using this value incorrectly?  I did a
> > >>>>> KMeansDriver.run with the 0.5 and 0.7 API, and had the data pruned
> > >>>>> despite the clusterClassificationThreshold = 0.
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Mattie
> > >>>>>
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
> > >>>>> To: user@mahout.apache.org
> > >>>>> Subject: Re: Mahout-279/kmeans++
> > >>>>>
> > >>>>> Mattie,
> > >>>>>
> > >>>>> Would this help?
> > >>>>>
> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> > >>>>>
> > >>>>> and
> > >>>>>
> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> > >>>>>
> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
> > >>>>>> Hi!
> > >>>>>>
> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch
> > >>>>>> like that described in Mahout-279, since I want only 10 vectors out of
> > >>>>>> a set of more than 100,000,000.  I have been using canopy clustering
> > >>>>>> for better results, but still need to do a few passes of kmeans to
> > >>>>>> determine my T, and the random seed does take a long time.
> > >>>>>>
> > >>>>>> The comments say that you are working on a kmeans++.  I searched around
> > >>>>>> but couldn't confirm any more information about it.  Is a scalable
> > >>>>>> kmeans++ in the works? (I know research on the subject is quite new.)
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Mattie Whitmore
> > >>>>>> Mathematician/IR&D Software Engineer
> > >>>>>> HARRIS  Corporation - Advanced Information Solutions
> > >>>>>> 301.837.5278
> > >>>>>> mwhit...@harris.com
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >
> >
> >
> >
>
