Re: Selectively discarding EigenVerification results and clustering assignments

Jake Mannix Thu, 24 Jun 2010 11:07:44 -0700

On Thu, Jun 24, 2010 at 5:49 PM, Shannon Quinn <[email protected]> wrote:
>
>
> I figured that was the case...I just found it strange that, if
> saveCleanEigens() was saving them somewhere else that was hard-coded into
> the method, that that output path wasn't specified by the caller, nor was it
> returned by the callee, so I just assumed the next logical (what seemed
> logical to me, anyway :P) pattern: that the original eigenvectors were
> overwritten.



Sensible, and you raise a valid point: having the output path be a parameter
which is transparently used to, well, put the output, is a good change if
you want to incorporate that into a patch.  Using a hardcoded subdirectory
isn't necessary in this case, it's just following the typical pattern of
having $output/dictionary, $output/someModelSubDir, $output/someOtherData,
in many of our jobs.  In this case, there's really only one subdirectory.
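
To make the suggestion concrete, here is a minimal sketch in plain Java (not
actual Mahout code) of letting the caller pass the output path, with the
$output/<subdir> convention kept only as a fallback. The class name, method
name and the "cleanEigens" default are all hypothetical, chosen just for
illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout: decides where cleaned
// eigenvectors would be written.
final class CleanEigensOutput {
    // Assumed default subdirectory name, used only when the caller gives no path.
    static final String DEFAULT_SUBDIR = "cleanEigens";

    // Returns the caller-supplied path if present, otherwise jobOutput/DEFAULT_SUBDIR.
    static Path resolve(Path jobOutput, Path explicitOutput) {
        return explicitOutput != null ? explicitOutput : jobOutput.resolve(DEFAULT_SUBDIR);
    }

    public static void main(String[] args) {
        // Falls back to the conventional subdirectory.
        System.out.println(resolve(Paths.get("output"), null));
        // Caller-specified location wins.
        System.out.println(resolve(Paths.get("output"), Paths.get("elsewhere")));
    }
}
```

Returning the resolved path to the caller would also address the "nor was it
returned by the callee" part of the complaint.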


> Sorry, it was definitely unclear. Since I'm running Kmeans clustering on
> the matrix of eigenvectors as a proxy for running Kmeans on the actual data
> (where each component of the eigenvectors represents one of the original
> data points),


Let me make sure I understand this: you take the matrix of eigenvectors,
which has desiredRank rows, of originalSize columns each, and take the
*columns* of this matrix (all originalSize of them, each of which has
desiredRank entries) and cluster them with KMeans, right?
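
In other words, something like the following sketch, which uses plain Java
arrays instead of Mahout's matrix classes (the class and method names are
made up for illustration). Each column of the desiredRank x originalSize
eigenvector matrix becomes one desiredRank-dimensional point to cluster.

```java
// Illustrative only: plain arrays stand in for Mahout's matrix types.
final class EigenColumns {
    // eigenvectors[r][c]: entry c of eigenvector r (desiredRank rows, originalSize columns).
    // Returns originalSize points, each with desiredRank coordinates.
    static double[][] columnsAsPoints(double[][] eigenvectors) {
        int desiredRank = eigenvectors.length;
        int originalSize = desiredRank == 0 ? 0 : eigenvectors[0].length;
        double[][] points = new double[originalSize][desiredRank];
        for (int r = 0; r < desiredRank; r++) {
            for (int c = 0; c < originalSize; c++) {
                // Point c collects the c-th component of every eigenvector.
                points[c][r] = eigenvectors[r][c];
            }
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] eigenvectors = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};
        double[][] points = columnsAsPoints(eigenvectors);
        System.out.println(points.length + " points of dimension " + points[0].length);
    }
}
```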


> I need to, in effect, "transfer" the clustering assignments that Kmeans
> gives on the eigenvectors back to the original data. And then output those
> assignments, ideally in exactly the same format as Kmeans, or any of the
> other clustering algorithms. I looked into the Kmeans unit tests and feel
> like I can easily read off the clustering assignments and correlate them to
> the original data, but then I'm not sure how to output these correlations,
> since the clustering was done on the eigenvector components.
>

Well, the nice thing at this point is that KMeans gives assignments keyed
on the original keys of the input matrix (I think!), and produces a
SequenceFile<IntWritable,WeightedVectorWritable>, so this basically *should*
already be correlated with your original data, directly. You don't really
want to just be using clusterdumper; that's just for seeing stuff on the
command line output...
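
If the keys do line up that way, carrying the assignments back over is just
an index lookup. A rough sketch, assuming the assignments have already been
read into a map from point key to cluster id (the reading of the
SequenceFile itself is left out, and all names here are hypothetical), and
assuming key i corresponds to row i of the original data:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: plain collections stand in for the KMeans output.
final class AssignmentJoin {
    // labels[i] is the cluster id for original data point i, or -1 if
    // no assignment was found for that key.
    static int[] labelsByOriginalIndex(int originalSize, Map<Integer, Integer> clusterByPointKey) {
        int[] labels = new int[originalSize];
        Arrays.fill(labels, -1);
        for (Map.Entry<Integer, Integer> e : clusterByPointKey.entrySet()) {
            int key = e.getKey();
            // Ignore keys that do not correspond to an original row.
            if (key >= 0 && key < originalSize) {
                labels[key] = e.getValue();
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> assignments = new HashMap<>();
        assignments.put(0, 2);
        assignments.put(2, 1);
        System.out.println(Arrays.toString(labelsByOriginalIndex(3, assignments)));
    }
}
```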

Can someone more familiar with the KMeansDriver chime in here, maybe?

  -jake
