Yeah... very useful. Clearly the adaptive limit on the number of surrogate points is much too restrictive.
On Fri, Dec 7, 2012 at 1:21 AM, Dan Filimon <dangeorge.fili...@gmail.com>wrote: > I took the plunge and rendered a few plots in R with how the > parameters of streaming-k-means evolve. > Here's the link [1]. > > [1] https://github.com/dfilimon/knn/wiki/skm-visualization > > On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > Still not that odd if several clusters are getting squashed. This can > > happen if the threshold increases too high or if the searcher is unable > to > > resolve the cube properly. By its nature, the cube is hard to reduce to > a > > smaller dimension. > > > > > > On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon < > dangeorge.fili...@gmail.com> > > wrote: > >> > >> But the weight referred to is the distance between a centroid and the > >> mean of a distribution (a cube vertice). > >> This should still be very small (also BallKMeans gets it right). > >> > >> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <ted.dunn...@gmail.com> > wrote: > >> > IN order to succeed here, SKM will need to have maxClusters set to > >> > 20,000 or > >> > so. > >> > > >> > The maximum distance between clusters on a 10d hypercube is sqrt(10) = > >> > 3.1 > >> > or so. If three clusters get smashed together, then you have a > >> > threshold of > >> > 1.4 or so. > >> > > >> > > >> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon > >> > <dangeorge.fili...@gmail.com> > >> > wrote: > >> >> > >> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the > >> >> radius is in fact 0.01. > >> >> > >> >> What's happening is that for 10 dimension, I was expecting ~1024 > >> >> clusters (or at least have small distances) but StreamingKMeans fails > >> >> on both accounts. > >> >> BallKMeans does in fact get the clusters. > >> >> > >> >> So, yes, it's probably a bug of some kind since I end up with > anywhere > >> >> between 400 and 1000 clusters (based on the searcher used) but the > >> >> distances are still wrong. > >> >> > >> >> Here's how many clusters I get and the searchers I get them with [1]. > >> >> As you can see, the number of clusters is all over the place. > >> >> > >> >> The distance too is also super huge. The assert said that all > >> >> distances should be less than 0.05. > >> >> Here is where it fails [2]. > >> >> And here is the corresponding GitHub issue (no info yet) [3]. > >> >> > >> >> [1] https://gist.github.com/4220406 > >> >> [2] > >> >> > >> >> > https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104 > >> >> [3] https://github.com/dfilimon/knn/issues/1 > >> >> > >> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <ted.dunn...@gmail.com> > >> >> wrote: > >> >> > How many clusters are you talking about? > >> >> > > >> >> > If you pick a modest number then streaming k-means should work well > >> >> > if > >> >> > it > >> >> > has several times more surrogate points than there are clusters. > >> >> > > >> >> > Also, typically a hyper-cube test works with very small cluster > >> >> > radius. > >> >> > Try > >> >> > 0.1 or 0.01. Otherwise, your clusters overlap and the theoretical > >> >> > guarantees go out the window. Without the guarantees, it is hard > to > >> >> > interpret test results. With small radii, and a modest number of > >> >> > clusters, > >> >> > what should happen is that the threshold in streaming k-means > quickly > >> >> > adapts > >> >> > but stays << 1 which is the minimum distance between clusters. > That > >> >> > guarantees that we will have at least 1 surrogate in each real > >> >> > cluster. > >> >> > > >> >> > Failure modes I can imagine could include: > >> >> > > >> >> > a) threshold gets very big and the number of surrogates drops to 1 > >> >> > due > >> >> > to a > >> >> > bug. > >> >> > > >> >> > b) unit test has exponentially many clusters (all corners = 2^d). > >> >> > This > >> >> > will > >> >> > cause the threshold to be increased to 1 or larger and will cause > us > >> >> > to > >> >> > try > >> >> > to cover many clusters with a single surrogate. > >> >> > > >> >> > c) something else (always possible) > >> >> > > >> >> > > >> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon > >> >> > <dangeorge.fili...@gmail.com> > >> >> > wrote: > >> >> >> > >> >> >> Okay, please disregard the previous e-mail. > >> >> >> That hypothesis is toast; clustering works just fine with ball > >> >> >> k-means. > >> >> >> > >> >> >> So, the problem lies in streaming k-means somewhere. > >> >> >> > >> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon > >> >> >> <dangeorge.fili...@gmail.com> wrote: > >> >> >> > Hi, > >> >> >> > > >> >> >> > One of the most basic tests for streaming k-means (and k-means > in > >> >> >> > general) is whether it works well for points that are > >> >> >> > multi-normally > >> >> >> > distributed around the vertices of a unit cube. > >> >> >> > > >> >> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating > >> >> >> > thousands of points should cluster them in those 8 clusters and > >> >> >> > they > >> >> >> > should be relatively close to the means of these multinormal > >> >> >> > distributions. > >> >> >> > > >> >> >> > I decided to generalize it to more than 3 dimensions, and see > how > >> >> >> > it > >> >> >> > works for hypercubes with n dimensions and 2^n vertices. > >> >> >> > > >> >> >> > Not well it turns out. > >> >> >> > > >> >> >> > The clusters become less balanced as the number of dimensions > >> >> >> > increases. > >> >> >> > I'm not sure if this is to be expected. I understand that in > high > >> >> >> > dimensional spaces, it becomes more likely for distances to be > >> >> >> > equal > >> >> >> > and vectors to be orthogonal, but I'm seeing issues starting at > 5 > >> >> >> > dimensions and this doesn't seem like a particularly high number > >> >> >> > of > >> >> >> > dimension to me. > >> >> >> > > >> >> >> > Is this normal? > >> >> >> > Should the hypercube no longer have all sides equal to 1? The > >> >> >> > variance > >> >> >> > of the multinormals is also 1. > >> >> >> > > >> >> >> > Thanks! > >> >> > > >> >> > > >> > > >> > > > > > >