Yeah... very useful.  Clearly the adaptive limit on the number of surrogate
points is much too restrictive.
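For anyone trying to reproduce this, the test setup described downthread (2^d multinormal clusters, one around each vertex of the unit d-cube, with radius around 0.01) can be sketched as below. The class and method names here are placeholders of mine, not the actual ones in the knn repo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the hypercube test data: one Gaussian cluster per
// vertex of the unit d-dimensional hypercube, 2^d clusters in total.
public class HypercubeData {

    // Generate pointsPerVertex samples around each of the 2^d vertices,
    // with per-coordinate Gaussian noise of standard deviation `radius`.
    static List<double[]> generate(int d, int pointsPerVertex, double radius, long seed) {
        Random rng = new Random(seed);
        List<double[]> points = new ArrayList<>();
        for (int v = 0; v < (1 << d); v++) {          // each corner of the cube
            for (int p = 0; p < pointsPerVertex; p++) {
                double[] x = new double[d];
                for (int i = 0; i < d; i++) {
                    // bit i of v is the vertex coordinate (0 or 1)
                    x[i] = ((v >> i) & 1) + radius * rng.nextGaussian();
                }
                points.add(x);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // 10 dimensions -> 1024 vertices; the diagonal of the unit 10-cube is
        // sqrt(10) ~ 3.16, the maximum inter-cluster distance mentioned below.
        List<double[]> data = generate(10, 20, 0.01, 42L);
        System.out.println(data.size() + " points around " + (1 << 10) + " vertices");
    }
}
```

With radius 0.01 every point sits within a few hundredths of its vertex, so a centroid landing more than ~0.05 from the nearest vertex (the failing assertion discussed below) points at the clustering, not at noisy data.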

On Fri, Dec 7, 2012 at 1:21 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> I took the plunge and rendered a few plots in R with how the
> parameters of streaming-k-means evolve.
> Here's the link [1].
>
> [1] https://github.com/dfilimon/knn/wiki/skm-visualization
>
> On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > Still not that odd if several clusters are getting squashed.  This can
> > happen if the threshold increases too high or if the searcher is unable to
> > resolve the cube properly.  By its nature, the cube is hard to reduce to a
> > smaller dimension.
> >
> >
> > On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon <dangeorge.fili...@gmail.com>
> > wrote:
> >>
> >> But the weight referred to is the distance between a centroid and the
> >> mean of a distribution (a cube vertex).
> >> This should still be very small (BallKMeans also gets it right).
> >>
> >> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> > In order to succeed here, SKM will need to have maxClusters set to
> >> > 20,000 or so.
> >> >
> >> > The maximum distance between clusters on a 10d hypercube is sqrt(10) =
> >> > 3.16 or so.  If three clusters get smashed together, then you have a
> >> > threshold of 1.4 or so.
> >> >
> >> >
> >> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <dangeorge.fili...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the
> >> >> radius is in fact 0.01.
> >> >>
> >> >> What's happening is that for 10 dimensions, I was expecting ~1024
> >> >> clusters (or at least small distances), but StreamingKMeans fails
> >> >> on both counts.
> >> >> BallKMeans does in fact get the clusters.
> >> >>
> >> >> So, yes, it's probably a bug of some kind since I end up with anywhere
> >> >> between 400 and 1000 clusters (depending on the searcher used) but the
> >> >> distances are still wrong.
> >> >>
> >> >> Here's how many clusters I get and the searchers I get them with [1].
> >> >> As you can see, the number of clusters is all over the place.
> >> >>
> >> >> The distances are also far too large.  The assert says that all
> >> >> distances should be less than 0.05.
> >> >> Here is where it fails [2].
> >> >> And here is the corresponding GitHub issue (no info yet) [3].
> >> >>
> >> >> [1] https://gist.github.com/4220406
> >> >> [2] https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
> >> >> [3] https://github.com/dfilimon/knn/issues/1
> >> >>
> >> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <ted.dunn...@gmail.com>
> >> >> wrote:
> >> >> > How many clusters are you talking about?
> >> >> >
> >> >> > If you pick a modest number then streaming k-means should work well
> >> >> > if it has several times more surrogate points than there are
> >> >> > clusters.
> >> >> >
> >> >> > Also, typically a hyper-cube test works with a very small cluster
> >> >> > radius.  Try 0.1 or 0.01.  Otherwise, your clusters overlap and the
> >> >> > theoretical guarantees go out the window.  Without the guarantees, it
> >> >> > is hard to interpret test results.  With small radii and a modest
> >> >> > number of clusters, what should happen is that the threshold in
> >> >> > streaming k-means quickly adapts but stays << 1, which is the minimum
> >> >> > distance between clusters.  That guarantees that we will have at
> >> >> > least one surrogate in each real cluster.
> >> >> >
> >> >> > Failure modes I can imagine include:
> >> >> >
> >> >> > a) the threshold gets very big and the number of surrogates drops to
> >> >> > 1 due to a bug.
> >> >> >
> >> >> > b) the unit test has exponentially many clusters (all corners =
> >> >> > 2^d).  This will cause the threshold to be increased to 1 or larger
> >> >> > and will cause us to try to cover many clusters with a single
> >> >> > surrogate.
> >> >> >
> >> >> > c) something else (always possible)
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
> >> >> > <dangeorge.fili...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Okay, please disregard the previous e-mail.
> >> >> >> That hypothesis is toast; clustering works just fine with ball
> >> >> >> k-means.
> >> >> >>
> >> >> >> So, the problem lies in streaming k-means somewhere.
> >> >> >>
> >> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> >> >> >> <dangeorge.fili...@gmail.com> wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > One of the most basic tests for streaming k-means (and k-means in
> >> >> >> > general) is whether it works well for points that are
> >> >> >> > multi-normally distributed around the vertices of a unit cube.
> >> >> >> >
> >> >> >> > So, for a cube, there'd be 8 vertices in 3d space.  Generating
> >> >> >> > thousands of points should cluster them into those 8 clusters,
> >> >> >> > and the centroids should be relatively close to the means of
> >> >> >> > these multinormal distributions.
> >> >> >> >
> >> >> >> > I decided to generalize it to more than 3 dimensions, and see how
> >> >> >> > it works for hypercubes with n dimensions and 2^n vertices.
> >> >> >> >
> >> >> >> > Not well, it turns out.
> >> >> >> >
> >> >> >> > The clusters become less balanced as the number of dimensions
> >> >> >> > increases.  I'm not sure if this is to be expected.  I understand
> >> >> >> > that in high dimensional spaces, it becomes more likely for
> >> >> >> > distances to be equal and vectors to be orthogonal, but I'm seeing
> >> >> >> > issues starting at 5 dimensions and this doesn't seem like a
> >> >> >> > particularly high number of dimensions to me.
> >> >> >> >
> >> >> >> > Is this normal?
> >> >> >> > Should the hypercube no longer have all sides equal to 1? The
> >> >> >> > variance of the multinormals is also 1.
> >> >> >> >
> >> >> >> > Thanks!
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
