This is actually something I could use a little expert Hadoop assistance on.
The general idea is that the points that are clustered in eigenspace have a
1-to-1 correspondence with the original points (which is how you get your
cluster assignments), but this back-mapping after clustering isn't
explicitly implemented yet, since that's the core of the IO issue.

What's blocking me is that I don't understand whether the ordering of the
points changes between when they are projected into eigenspace
(the Lanczos solver) and when K-means makes its cluster assignments. On a
one-node setup the original ordering appears to be preserved through all the
operations, so the labels of the original points can be assigned by giving
original_point[i] the label of projected_point[i], hence the cluster
assignments are easy to determine. For multi-node setups, however, I simply
don't know if this heuristic holds.

But I believe the immediate issue here is that we're feeding the projected
points to the display, when it should be the original points *annotated*
with the cluster assignments from the corresponding projected points. The
question is how to shift those assignments over robustly; right now it's
just a hack job in the SpectralKMeansDriver...or maybe (hopefully!) it's
just the version I have locally :o)
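
To make "robustly" a bit more concrete, here is a minimal sketch of what I
have in mind, assuming every projected point carries the row index of the
original point it came from, so labels are joined by key instead of by
position. I haven't verified that the Lanczos and K-means jobs actually
preserve such a key; the class and method names below are hypothetical and
are not existing Mahout APIs.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: transfers cluster labels from
// projected points back to the original points by row key, not by position.
final class LabelBackMapping {

    /**
     * rowIds[j] : index of the original point that projected point j came from
     *             (assumed to survive the eigenspace projection and K-means)
     * labels[j] : cluster label assigned to projected point j
     *
     * Returns result[i] = label of original point i, or -1 if no projected
     * point carried key i.
     */
    static int[] labelsForOriginals(int[] rowIds, int[] labels, int numOriginals) {
        if (rowIds.length != labels.length) {
            throw new IllegalArgumentException("rowIds and labels must have equal length");
        }
        int[] result = new int[numOriginals];
        Arrays.fill(result, -1); // -1 marks an original point with no label yet
        for (int j = 0; j < rowIds.length; j++) {
            // The order in which projected points arrive does not matter here.
            result[rowIds[j]] = labels[j];
        }
        return result;
    }
}
```

With rowIds = {2, 0, 3, 1} and labels = {1, 0, 1, 0}, this gives {0, 0, 1, 1}
for four original points, no matter what order the reducers emit the
projected points in. If the keys turn out not to survive the pipeline, the
positional trick above is all we have and this doesn't help.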

On Tue, May 24, 2011 at 2:13 PM, Jeff Eastman <jeast...@narus.com> wrote:

> Yes, I expect it is pilot error on my part. The original implementation was
> failing in this manner because I was requesting 5 eigenvectors (clusters). I
> changed it to 2 and now it displays something but it is not even close to
> correct. I think this is because I have not transformed back from eigen
> space to vector space. This all relates to the IO issue for the spectral
> clustering code which I don't grok.
>
> The display driver begins with the sample points and generates the affinity
> matrix using a distance measure. Not clear this is even a correct
> interpretation of that matrix. Then spectral kmeans runs and produces 2
> clusters which I display directly. Seems like this number should be more
> like the k in kmeans, and 5 was more realistic given the data. I believe
> there is a missing output transformation to recover the clusters from the
> eigenvectors but I don't know how to do that.
>
> I bet you do :)
>
> -----Original Message-----
> From: Shannon Quinn (JIRA) [mailto:j...@apache.org]
> Sent: Tuesday, May 24, 2011 8:07 AM
> To: dev@mahout.apache.org
> Subject: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
>
>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038608#comment-13038608]
>
> Shannon Quinn commented on MAHOUT-524:
> --------------------------------------
>
> +1, I'm on it.
>
> I'm a little unclear as to the context of the initial Hudson comment: the
> display method is expecting 2D vectors, but getting 5D ones?
>
> > DisplaySpectralKMeans example fails
> > -----------------------------------
> >
> >                 Key: MAHOUT-524
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
> >             Project: Mahout
> >          Issue Type: Bug
> >          Components: Clustering
> >    Affects Versions: 0.4, 0.5
> >            Reporter: Jeff Eastman
> >            Assignee: Jeff Eastman
> >              Labels: clustering, k-means, visualization
> >             Fix For: 0.6
> >
> >         Attachments: aff.txt, raw.txt, spectralkmeans.png
> >
> >
> > I've committed a new display example that attempts to push the standard
> > mixture of models data set through spectral k-means. After some tweaking of
> > configuration arguments and a bug fix in EigenCleanupJob it runs spectral
> > k-means to completion. The display example is expecting 2-d clustered points
> > and the example is producing 5-d points. Additional I/O work is needed
> > before this will play with the rest of the clustering algorithms.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
