But columns aren't what I would expect you to want labeled. I think that row labels might be nicer. Happily, each named vector has a name for the entire vector as well.
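For what it's worth, here is a minimal sketch of the row-label idea. It deliberately uses stand-in types so that it runs without Mahout on the classpath: `LabeledRows`, `Row` and `fromRecords` are hypothetical names made up for this example, with `Row` playing the role a NamedVector would play. The assumption is that one column of each raw record holds the label and the remaining columns are numeric.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: none of these types are Mahout classes.
final class LabeledRows {

    // Stand-in for a named vector: one label for the whole row plus its values.
    static final class Row {
        final String name;
        final double[] values;

        Row(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Use column `labelColumn` of every record as the row name and parse
    // the remaining columns as the numeric part of the row.
    static List<Row> fromRecords(String[][] records, int labelColumn) {
        List<Row> rows = new ArrayList<>();
        for (String[] record : records) {
            double[] values = new double[record.length - 1];
            int k = 0;
            for (int j = 0; j < record.length; j++) {
                if (j != labelColumn) {
                    values[k++] = Double.parseDouble(record[j]);
                }
            }
            rows.add(new Row(record[labelColumn], values));
        }
        return rows;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"row-a", "1.0", "2.0"},
            {"row-b", "3.5", "4.5"},
        };
        // Any Iterable can act as the container; a List is the simplest choice.
        Iterable<Row> input = fromRecords(records, 0);
        for (Row row : input) {
            System.out.println(row.name + " has " + row.values.length + " values");
        }
    }
}
```

With the real library the container would presumably hold named vectors wrapped in whatever element type the clustering code expects; that wiring is not shown here because I have not checked it against the current API.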
On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> The input to the BallKmeans is actually not a matrix. It is an
> Iterable<MatrixSlice>. This can be a matrix since a matrix implements
> this.
>
> So one way to deal with this is to build your own Iterable and put
> NamedVectors into it. NamedVectors retain labels as you want.
>
> On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>
>> I need to be using the matrices for BallKmeans. Can matrices be named?
>> By this I mean can I assign a column of my matrix to be the "name" of each
>> row?
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Wednesday, August 29, 2012 12:17 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> Yes. The ball k-means implementation does use weights to indicate
>> multiple vectors.
>>
>> The implementation is definitely ready to test. I would be slightly
>> surprised if it has absolutely zero issues, but your feedback on such
>> issues would help them get fixed much sooner than others.
>>
>> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>>
>> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
>> > more points in the resulting clusters (total points in the clusters =
>> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5). The total
>> > number of data points fed into the algorithm is 53365862 -- so even v0.5 is
>> > missing 14% of the data.
>> >
>> > I'm thinking if I weight these dense vectors with a weight equal to the
>> > number of identical vectors in the set that could work -- Ball Kmeans seems
>> > to do this. Is this a correct interpretation of how to use weights in Ball
>> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
>> >
>> > Thanks
>> >
>> > -----Original Message-----
>> > From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > Sent: Thursday, August 23, 2012 12:34 PM
>> > To: user@mahout.apache.org
>> > Subject: Re: Mahout-279/kmeans++
>> >
>> > clusterDump works in memory, and there are no plans yet to make it
>> > distributed (or not in memory). See this:
>> > https://issues.apache.org/jira/browse/MAHOUT-940
>> >
>> > clusterpp has an option for distributed processing, so you can process any
>> > amount of data with it.
>> >
>> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
>> > > Yes, unique names will be my next plan -- I just can't kick off that job
>> > > until after the weekend. If this makes no difference I will also try the
>> > > noise idea, and I'll follow up about both.
>> > >
>> > > My next question is regarding clusterDump. Is there a way to run this
>> > > in parallel? I have found some code to execute in java (the preferable
>> > > method for me) but I would like the method to be faster and not in memory.
>> > > Is this a possibility? Or in the works?
>> > >
>> > > Thanks!
>> > >
>> > > -----Original Message-----
>> > > From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > > Sent: Wednesday, August 22, 2012 9:09 PM
>> > > To: user@mahout.apache.org
>> > > Subject: Re: Mahout-279/kmeans++
>> > >
>> > > Can you also try to provide distinct names to vectors and then cluster?
>> > > It should not have any effect, but would be good to know the behavior.
>> > >
>> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> > >> Yes, I have data which is exactly the same. If I give every vector a
>> > >> name which is distinct (albeit the data point is the same as other points
>> > >> in the set) will this keep the algorithm from dropping non-distinct
>> > >> vectors/data points (which is what I THINK but have yet to verify is what
>> > >> is going on)?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Mattie
>> > >>
>> > >> -----Original Message-----
>> > >> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> > >> Sent: Wednesday, August 22, 2012 1:18 PM
>> > >> To: user@mahout.apache.org
>> > >> Subject: Re: Mahout-279/kmeans++
>> > >>
>> > >> Just an off thought, do you have duplicate input points?
>> > >>
>> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>> > >>
>> > >>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> > >>> that there is a continual discrepancy between the two clustering versions.
>> > >>> The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is:
>> > >>> 921998/5. They should not necessarily be the same, since I am using canopy
>> > >>> clustering to find initial centroids, however I would think they would have
>> > >>> the same sum, which they do not (45901885 vs 1599154).
>> > >>>
>> > >>> Here is the method I am running:
>> > >>>
>> > >>> public static void KmeansClusteringCanopy(String outputDir, String T,
>> > >>>     String itMax)
>> > >>>     throws IOException, InterruptedException, ClassNotFoundException,
>> > >>>     InstantiationException, IllegalAccessException {
>> > >>>
>> > >>>   Configuration conf = new Configuration();
>> > >>>
>> > >>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>> > >>>
>> > >>>   Path vectorsFolder = new Path(outputDir, "vectors");
>> > >>>   Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>> > >>>   Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>> > >>>
>> > >>>   // create canopies instead of initial vectors
>> > >>>   CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>> > >>>       Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>> > >>>
>> > >>>   // kmeans cluster operation
>> > >>>   KMeansDriver.run(conf, vectorsFolder,
>> > >>>       new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>> > >>>       clusterOutput, measure, 0.01,
>> > >>>       Integer.parseInt(itMax), true, 0.0, false);
>> > >>>
>> > >>>   // post process by putting completed clusters into their own files.
>> > >>>   ClusterOutputPostProcessorDriver.run(clusterOutput,
>> > >>>       new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>> > >>> }
>> > >>>
>> > >>> What do you think?
>> > >>>
>> > >>> On another but related note: Is there a plan to have a method -- say
>> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> > >>> within clusters as well as a separate folder containing pruned outliers?
>> > >>>
>> > >>> Thanks!
>> > >>>
>> > >>> Mattie
>> > >>>
>> > >>> -----Original Message-----
>> > >>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > >>> Sent: Friday, August 17, 2012 12:16 PM
>> > >>> To: user@mahout.apache.org
>> > >>> Subject: Re: Mahout-279/kmeans++
>> > >>>
>> > >>> The clustering algorithm has also changed internally. So, expect the
>> > >>> results to be different (and better).
>> > >>>
>> > >>> I can think of one reason for this behavior. Maybe lots of clusters are
>> > >>> having only one vector inside it, and, AFAIK, clusterdumper will not
>> > >>> output any cluster with single vector.
>> > >>> So, I think, it's clusterdumper which is doing the invisible "pruning"
>> > >>> (by not outputting clusters with single vectors).
>> > >>>
>> > >>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>> > >>>
>> > >>> No, no tool can output the pruned vectors. The only way to see all
>> > >>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> > >>> to 0.
>> > >>>
>> > >>> If you still face the problem, then please provide the parameters with
>> > >>> which you are calling kmeans.
>> > >>>
>> > >>> Regarding "I should also mention I have vectors which are exactly the
>> > >>> same (even their names), perhaps they are the ones being pruned, is that
>> > >>> possible? "
>> > >>>
>> > >>> The name of the vector has nothing to do with clustering, I am not sure
>> > >>> whether it will have any effect when clusterdumper is in action. So,
>> > >>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>> > >>>
>> > >>> Good luck.
>> > >>> Paritosh
>> > >>>
>> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans. Previously
>> > >>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>> > >>>> resultant clusters was the same as the total amount fed to the algorithm.
>> > >>>> I wish this to be the case when clustering with v0.7. The only change in
>> > >>>> the algorithm is clusterClassificationThreshold, I set this value to be 0
>> > >>>> so that it will in fact cluster all vectors in the dataset.
>> > >>>> My logic here was no vector should have a probability of being in some
>> > >>>> cluster less than 0 and therefore all vectors should cluster.
>> > >>>> However after running a clusterdump I find that vectors (1/3 roughly)
>> > >>>> have been pruned.
>> > >>>> Is this a bug, or me just not understanding the new capabilities?
>> > >>>>
>> > >>>> I should also mention I have vectors which are exactly the same (even
>> > >>>> their names), perhaps they are the ones being pruned, is that possible?
>> > >>>> Another question if I may: I will eventually want to use the pruning
>> > >>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>> > >>>> similar method) have the capability of outputting the pruned vectors into a
>> > >>>> folder?
>> > >>>> Thanks! Please let me know if I'm still not being clear enough.
>> > >>>>
>> > >>>> Mattie
>> > >>>>
>> > >>>> -----Original Message-----
>> > >>>> From: Paritosh Ranjan [mailto:pran...@xebia.com]
>> > >>>> Sent: Friday, August 17, 2012 11:20 AM
>> > >>>> To: user@mahout.apache.org
>> > >>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>
>> > >>>> clusterClassificationThreshold is for outlier removal, and this is the
>> > >>>> way it should be used.
>> > >>>> Can you provide some more information about your job and the way you are
>> > >>>> calling it?
>> > >>>> And if I look at the code, the vector should be clustered even if the
>> > >>>> pdf is 0. The method which decides whether the vector should be assigned to
>> > >>>> a particular cluster or not -
>> > >>>> /**
>> > >>>>  * Decides whether the vector should be classified or not based on
>> > >>>>  * the max pdf value of the clusters and threshold value.
>> > >>>>  *
>> > >>>>  * @return whether the vector should be classified or not.
>> > >>>>  */
>> > >>>> private static boolean shouldClassify(Vector pdfPerCluster,
>> > >>>>     Double clusterClassificationThreshold) {
>> > >>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>> > >>>> }
>> > >>>>
>> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>> > >>>>
>> > >>>>> Hi Ted,
>> > >>>>>
>> > >>>>> Yes this is great! I hope to start working with this algorithm in the
>> > >>>>> next couple weeks.
>> > >>>>> I have a question about the 0.7 implementation of kmeans and the
>> > >>>>> clusterClassificationThreshold, I have this value set at zero, but the
>> > >>>>> output is still showing that about 1/3 of my data is not assigned to a
>> > >>>>> cluster in my output. Am I using this value incorrectly? I did a
>> > >>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>> > >>>>> the clusterClassificationThreshold = 0.
>> > >>>>> Thanks,
>> > >>>>>
>> > >>>>> Mattie
>> > >>>>>
>> > >>>>> -----Original Message-----
>> > >>>>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>> > >>>>> To: user@mahout.apache.org
>> > >>>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>>
>> > >>>>> Mattie,
>> > >>>>>
>> > >>>>> Would this help?
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>> > >>>>>
>> > >>>>> and
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>> > >>>>>
>> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhit...@harris.com> wrote:
>> > >>>>>> Hi!
>> > >>>>>>
>> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>> > >>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
>> > >>>>>> more than 100,000,000. I have been using canopy clustering for better
>> > >>>>>> results, but still need to do a few passes of kmeans to determine my T, and
>> > >>>>>> the random seed does take a long time.
>> > >>>>>>
>> > >>>>>> The comments say that you are working on a kmeans++, I searched around but
>> > >>>>>> couldn't confirm any more information about it. Is a scalable kmeans++ in
>> > >>>>>> the works? (I know research on the subject is quite new)
>> > >>>>>>
>> > >>>>>> Thanks!
>> > >>>>>>
>> > >>>>>> Mattie Whitmore
>> > >>>>>> Mathematician/IR&D Software Engineer
>> > >>>>>> HARRIS Corporation - Advanced Information Solutions
>> > >>>>>> 301.837.5278
>> > >>>>>> mwhit...@harris.com