Re: Mahout-279/kmeans++

Ted Dunning Thu, 30 Aug 2012 11:52:57 -0700

But columns aren't what I would expect you to want labeled.  I think that
row labels might be nicer.  Happily, each named vector has a name for the
entire vector as well.
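
A minimal sketch of the row-labeling idea, in plain Java so it runs without Mahout on the classpath. `LabeledRow` and `LabeledRows.fromTable` are hypothetical stand-ins, not Mahout classes; in Mahout the role of `LabeledRow` would be played by a NamedVector, and the result would still have to be wrapped as an Iterable<MatrixSlice>, which this sketch does not model. It only shows how one column of a table can be split off as the name of each row while the other columns become the vector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a named vector: a label plus the numeric values.
final class LabeledRow {
    final String name;
    final double[] values;

    LabeledRow(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

final class LabeledRows {
    // Uses column labelColumn as the row name and keeps the remaining
    // columns, in order, as the numeric vector for that row.
    static List<LabeledRow> fromTable(String[][] table, int labelColumn) {
        List<LabeledRow> rows = new ArrayList<>();
        for (String[] record : table) {
            double[] values = new double[record.length - 1];
            int k = 0;
            for (int j = 0; j < record.length; j++) {
                if (j != labelColumn) {
                    values[k++] = Double.parseDouble(record[j]);
                }
            }
            rows.add(new LabeledRow(record[labelColumn], values));
        }
        // A List is already an Iterable, so callers can loop over it directly.
        return Collections.unmodifiableList(rows);
    }
}
```

Because the name travels with each row, whatever consumes the Iterable can report results by label rather than by row index.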


On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> The input to the BallKmeans is actually not a matrix.  It is an
> Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
> this.
>
> So one way to deal with this is to build your own Iterable and put
> NamedVectors into it.  NamedVectors retain labels as you want.
>
>
> On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>
>> I need to be using the matrices for BallKmeans.  Can matrices be named?
>> By this I mean can I assign a column of my matrix to be the "name" of each
>> row?
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Wednesday, August 29, 2012 12:17 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> Yes.  The ball k-means implementation does use weights to indicate multiple
>> vectors.
>>
>> The implementation is definitely ready to test.  I would be slightly
>> surprised if it has absolutely zero issues, but your feedback on such
>> issues would help them get fixed much sooner than others.
>>
>> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>>
>> > I re-ran the canopy-kmeans analytic, this time with unique names, and lost
>> > more points in the resulting clusters (total points in the clusters =
>> > 745490, vs. previously 1599154 for v0.7 and 45901885 for v0.5).  The total
>> > number of data points fed into the algorithm is 53365862, so even v0.5 is
>> > missing 14% of the data.
>> >
>> > I'm thinking that if I weight these dense vectors with a weight equal to the
>> > number of identical vectors in the set, that could work; Ball Kmeans seems
>> > to do this.  Is this a correct interpretation of how to use weights in Ball
>> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
>> >
>> > Thanks
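
The weighting idea in the message above can be sketched in plain Java: collapse identical points into one entry whose weight is the number of occurrences, then use the weights when averaging. The class and method names here are hypothetical, and this is not Ball Kmeans or any Mahout API; it only illustrates that a weighted mean over the collapsed points equals the plain mean over the original duplicates. Points are matched by exact equality of their coordinates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeightedDuplicates {
    // Collapses identical points into one entry each; the map value is the
    // number of times that point occurred, which serves as its weight.
    static Map<List<Double>, Integer> collapse(double[][] points) {
        Map<List<Double>, Integer> weights = new LinkedHashMap<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>(p.length);
            for (double x : p) {
                key.add(x);
            }
            weights.merge(key, 1, Integer::sum);
        }
        return weights;
    }

    // Weighted centroid: sum(weight * point) / sum(weight).
    static double[] weightedCentroid(Map<List<Double>, Integer> weights, int dim) {
        double[] sum = new double[dim];
        double total = 0;
        for (Map.Entry<List<Double>, Integer> e : weights.entrySet()) {
            int w = e.getValue();
            for (int i = 0; i < dim; i++) {
                sum[i] += w * e.getKey().get(i);
            }
            total += w;
        }
        for (int i = 0; i < dim; i++) {
            sum[i] /= total;
        }
        return sum;
    }
}
```

With three copies of (1, 1) and one copy of (5, 5), the collapsed set has two entries and the weighted centroid is (2, 2), the same as averaging all four original points.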
>> >
>> > -----Original Message-----
>> > From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > Sent: Thursday, August 23, 2012 12:34 PM
>> > To: user@mahout.apache.org
>> > Subject: Re: Mahout-279/kmeans++
>> >
>> > clusterDump works in memory, and there are no plans yet to make it
>> > distributed (or not in memory). See https://issues.apache.org/jira/browse/MAHOUT-940
>> >
>> > clusterpp has an option for distributed processing, so you can process any
>> > amount of data with it.
>> >
>> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
>> > > Yes, unique names will be my next plan -- I just can't kick off that job
>> > > until after the weekend.  If this makes no difference I will also try the
>> > > noise idea, and I'll follow up about both.
>> > >
>> > > My next question is regarding clusterDump.  Is there a way to run this
>> > > in parallel? I have found some code to execute in Java (the preferable
>> > > method for me), but I would like the method to be faster and not in memory.
>> > > Is this a possibility? Or in the works?
>> > >
>> > > Thanks!
>> > >
>> > > -----Original Message-----
>> > > From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > > Sent: Wednesday, August 22, 2012 9:09 PM
>> > > To: user@mahout.apache.org
>> > > Subject: Re: Mahout-279/kmeans++
>> > >
>> > > Can you also try to provide distinct names to vectors and then cluster?
>> > > It should not have any effect, but it would be good to know the behavior.
>> > >
>> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> > >> Yes, I have data which is exactly the same.  If I give every vector a
>> > >> name which is distinct (albeit the data point is the same as other points
>> > >> in the set), will this keep the algorithm from dropping non-distinct
>> > >> vectors/data points (which is what I THINK, but have yet to verify, is
>> > >> going on)?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Mattie
>> > >>
>> > >> -----Original Message-----
>> > >> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> > >> Sent: Wednesday, August 22, 2012 1:18 PM
>> > >> To: user@mahout.apache.org
>> > >> Subject: Re: Mahout-279/kmeans++
>> > >>
>> > >> Just an off thought, do you have duplicate input points?
>> > >>
>> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>> > >>
>> > >>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> > >>> that there is a continual discrepancy between the two clustering versions.
>> > >>> The max/min vectors in a cluster using 0.5 is 19192158/215, and using 0.7 is
>> > >>> 921998/5.  They should not necessarily be the same, since I am using canopy
>> > >>> clustering to find initial centroids; however, I would think they would have
>> > >>> the same sum, which they do not (45901885 vs 1599154).
>> > >>>
>> > >>> Here is the method I am running:
>> > >>>
>> > >>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>> > >>>         throws IOException, InterruptedException, ClassNotFoundException,
>> > >>>                InstantiationException, IllegalAccessException {
>> > >>>
>> > >>>     Configuration conf = new Configuration();
>> > >>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>> > >>>
>> > >>>     Path vectorsFolder = new Path(outputDir, "vectors");
>> > >>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>> > >>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>> > >>>
>> > >>>     // create canopies instead of initial vectors
>> > >>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>> > >>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>> > >>>
>> > >>>     // kmeans cluster operation
>> > >>>     KMeansDriver.run(conf, vectorsFolder,
>> > >>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>> > >>>             clusterOutput, measure, 0.01,
>> > >>>             Integer.parseInt(itMax), true, 0.0, false);
>> > >>>
>> > >>>     // post process by putting completed clusters into their own files.
>> > >>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>> > >>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>> > >>> }
>> > >>>
>> > >>> What do you think?
>> > >>>
>> > >>> On another but related note: Is there a plan to have a method -- say
>> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> > >>> within clusters as well as a separate folder containing pruned outliers?
>> > >>>
>> > >>> Thanks!
>> > >>>
>> > >>> Mattie
>> > >>>
>> > >>> -----Original Message-----
>> > >>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > >>> Sent: Friday, August 17, 2012 12:16 PM
>> > >>> To: user@mahout.apache.org
>> > >>> Subject: Re: Mahout-279/kmeans++
>> > >>>
>> > >>> The clustering algorithm has also changed internally. So, expect the
>> > >>> results to be different (and better).
>> > >>>
>> > >>> I can think of one reason for this behavior. Maybe lots of clusters
>> > >>> have only one vector inside them and, AFAIK, clusterdumper will not
>> > >>> output any cluster with a single vector.
>> > >>> So, I think it's clusterdumper which is doing the invisible "pruning"
>> > >>> (by not outputting clusters with single vectors).
>> > >>>
>> > >>> Can you cross-check the output once with ClusterOutputPostProcessorDriver?
>> > >>>
>> > >>> No, no tool can output the pruned vectors. The only way to see all
>> > >>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> > >>> to 0.
>> > >>>
>> > >>> If you still face the problem, then please provide the parameters with
>> > >>> which you are calling kmeans.
>> > >>>
>> > >>> Regarding "I should also mention I have vectors which are exactly the
>> > >>> same (even their names), perhaps they are the ones being pruned, is that
>> > >>> possible? "
>> > >>>
>> > >>> The name of the vector has nothing to do with clustering; I am not sure
>> > >>> whether it will have any effect when clusterdumper is in action. So,
>> > >>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>> > >>> Good luck.
>> > >>> Paritosh
>> > >>>
>> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>> > >>>> (v0.5) when I did a clusterdump the total number of vectors within the
>> > >>>> resultant clusters was the same as the total number fed to the algorithm.
>> > >>>> I wish this to be the case when clustering with v0.7.  The only change in
>> > >>>> the algorithm is clusterClassificationThreshold; I set this value to be 0
>> > >>>> so that it will in fact cluster all vectors in the dataset.
>> > >>>>
>> > >>>> My logic here was that no vector should have a probability of being in some
>> > >>>> cluster less than 0, and therefore all vectors should cluster.
>> > >>>>
>> > >>>> However, after running a clusterdump I find that vectors (1/3 roughly)
>> > >>>> have been pruned.
>> > >>>>
>> > >>>> Is this a bug, or me just not understanding the new capabilities?
>> > >>>>
>> > >>>> I should also mention I have vectors which are exactly the same (even
>> > >>>> their names), perhaps they are the ones being pruned, is that possible?
>> > >>>>
>> > >>>> Another question, if I may: I will eventually want to use the pruning
>> > >>>> capabilities. Does the ClusterOutputPostProcessorDriver method (or a
>> > >>>> similar method) have the capability of outputting the pruned vectors into a
>> > >>>> folder?
>> > >>>>
>> > >>>> Thanks! Please let me know if I'm still not being clear enough.
>> > >>>>
>> > >>>> Mattie
>> > >>>>
>> > >>>> -----Original Message-----
>> > >>>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > >>>> Sent: Friday, August 17, 2012 11:20 AM
>> > >>>> To: user@mahout.apache.org
>> > >>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>
>> > >>>> clusterClassificationThreshold is for outlier removal, and this is the
>> > >>>> way it should be used.
>> > >>>>
>> > >>>> Can you provide some more information about your job and the way you are
>> > >>>> calling it?
>> > >>>>
>> > >>>> And if I look at the code, the vector should be clustered even if the
>> > >>>> pdf is 0. The method which decides whether the vector should be assigned to
>> > >>>> a particular cluster or not:
>> > >>>> /**
>> > >>>>  * Decides whether the vector should be classified or not based on the max pdf
>> > >>>>  * value of the clusters and threshold value.
>> > >>>>  *
>> > >>>>  * @return whether the vector should be classified or not.
>> > >>>>  */
>> > >>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>> > >>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>> > >>>> }
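
The comparison in the quoted method can be tried in isolation. The sketch below is a stand-in that uses a plain double[] instead of a Mahout Vector; `ThresholdCheck` is a hypothetical name and the loop is an assumption about what taking the maximum amounts to, not Mahout's implementation. It shows that with a threshold of 0, any vector whose best per-cluster pdf is 0 or more passes the check.

```java
final class ThresholdCheck {
    // Same comparison as the quoted shouldClassify, on a plain array: keep
    // the vector when its largest per-cluster pdf reaches the threshold.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double p : pdfPerCluster) {
            max = Math.max(max, p); // Math.max returns NaN if either argument is NaN
        }
        return max >= threshold;
    }
}
```

In this sketch a NaN entry makes the comparison false even at threshold 0, because any >= comparison with NaN is false in Java. Whether such a value can occur in the real pdf vector is not something this thread establishes.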
>> > >>>>
>> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>> > >>>>
>> > >>>>> Hi Ted,
>> > >>>>>
>> > >>>>> Yes this is great!  I hope to start working with this algorithm in the
>> > >>>>> next couple weeks.
>> > >>>>>
>> > >>>>> I have a question about the 0.7 implementation of kmeans and the
>> > >>>>> clusterClassificationThreshold.  I have this value set at zero, but the
>> > >>>>> output is still showing that about 1/3 of my data is not assigned to a
>> > >>>>> cluster in my output.  Am I using this value incorrectly?  I did a
>> > >>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>> > >>>>> the clusterClassificationThreshold = 0.
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>>
>> > >>>>> Mattie
>> > >>>>>
>> > >>>>>
>> > >>>>> -----Original Message-----
>> > >>>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>> > >>>>> To: user@mahout.apache.org
>> > >>>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>>
>> > >>>>> Mattie,
>> > >>>>>
>> > >>>>> Would this help?
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>> > >>>>>
>> > >>>>> and
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>> > >>>>>
>> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>> > >>>>>
>> > >>>>>> Hi!
>> > >>>>>>
>> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>> > >>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
>> > >>>>>> more than 100,000,000.  I have been using canopy clustering for better
>> > >>>>>> results, but still need to do a few passes of kmeans to determine my T, and
>> > >>>>>> the random seed does take a long time.
>> > >>>>>>
>> > >>>>>> The comments say that you are working on a kmeans++, I searched around but
>> > >>>>>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
>> > >>>>>> the works? (I know research on the subject is quite new)
>> > >>>>>>
>> > >>>>>> Thanks!
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Mattie Whitmore
>> > >>>>>> Mathematician/IR&D Software Engineer
>> > >>>>>> HARRIS  Corporation - Advanced Information Solutions
>> > >>>>>> 301.837.5278
>> > >>>>>> mwhit...@harris.com<mailto:tiffany.fork...@harris.com>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >
>> >
>> >
>> >
>>
>
>
