On Thu, Jun 24, 2010 at 5:49 PM, Shannon Quinn <[email protected]> wrote:

> I figured that was the case...I just found it strange that, if
> saveCleanEigens() was saving them somewhere else that was hard-coded into
> the method, that that output path wasn't specified by the caller, nor was it
> returned by the callee, so I just assumed the next logical (what seemed
> logical to me, anyway :P) pattern: that the original eigenvectors were
> overwritten.

Sensible, and you raise a valid point: having the output path be a parameter
which is transparently used to, well, put the output, is a good change if you
want to incorporate that into a patch. Using a hardcoded subdirectory isn't
necessary in this case; it's just following the typical pattern of having
$output/dictionary, $output/someModelSubDir, $output/someOtherData, in many of
our jobs. In this case, there's really only one subdirectory.

> Sorry, it was definitely unclear. Since I'm running Kmeans clustering on
> the matrix of eigenvectors as a proxy for running Kmeans on the actual data
> (where each component of the eigenvectors represents one of the original
> data points),

Let me be clear in understanding this: you take the matrix of eigenvectors,
which has desiredRank rows of originalSize columns each, and take the
*columns* of this matrix (all originalSize of them, each of which has
desiredRank entries) and cluster them with KMeans, right?

> I need to, in effect, "transfer" the clustering assignments that Kmeans
> gives on the eigenvectors back to the original data. And then output those
> assignments, ideally in exactly the same format as Kmeans, or any of the
> other clustering algorithms. I looked into the Kmeans unit tests and feel
> like I can easily read off the clustering assignments and correlate them to
> the original data, but then I'm not sure how to output these correlations,
> since the clustering was done on the eigenvector components.
>

Well, the nice thing at this point is that the output of KMeans is to give
assignments keyed on the original keys of the input matrix (I think!), and it
produces a SequenceFile<IntWritable,WeightedVectorWritable>, so this basically
*should* already be correlated with your original data, directly. You don't
really want to just be using clusterdumper; that's just for seeing stuff on
the command line output...

Can someone more familiar with the KMeansDriver chime in here, maybe?

-jake
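
To make the "cluster the columns" idea concrete, here is a minimal sketch in
plain Java. It deliberately does not use the Mahout or Hadoop APIs: the class
ColumnAssignments, its methods, the centroids and all the numbers are
hypothetical, and a single nearest-centroid pass stands in for a full KMeans
run. The only point it illustrates is that when column j of the eigenvector
matrix is keyed by j, the assignment for column j is already the assignment
for original data point j.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: plain arrays stand in for the
// eigenvector matrix and for the centroids a KMeans run would produce.
final class ColumnAssignments {

    // Returns column j of a (desiredRank x originalSize) matrix as a vector
    // with desiredRank entries.
    static double[] column(double[][] eigenvectors, int j) {
        double[] col = new double[eigenvectors.length];
        for (int i = 0; i < eigenvectors.length; i++) {
            col[i] = eigenvectors[i][j];
        }
        return col;
    }

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // assignment[j] is the index of the centroid nearest to column j, so the
    // array index j doubles as the key of original data point j.
    static int[] assign(double[][] eigenvectors, double[][] centroids) {
        int originalSize = eigenvectors[0].length;
        int[] assignment = new int[originalSize];
        for (int j = 0; j < originalSize; j++) {
            double[] col = column(eigenvectors, j);
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(col, centroids[c])
                        < squaredDistance(col, centroids[best])) {
                    best = c;
                }
            }
            assignment[j] = best;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // desiredRank = 2 rows, originalSize = 4 columns (made-up numbers).
        double[][] eigenvectors = {
            {0.9, 0.8, -0.7, -0.9},
            {0.1, 0.2, 0.1, 0.0}
        };
        // Made-up centroids, one per cluster, each with desiredRank entries.
        double[][] centroids = {{0.85, 0.15}, {-0.8, 0.05}};
        System.out.println(Arrays.toString(assign(eigenvectors, centroids)));
    }
}
```

With these made-up inputs, original points 0 and 1 land in cluster 0 and
points 2 and 3 in cluster 1. If the real KMeans output is keyed the same way,
no separate "transfer" step should be needed, only a lookup by key.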
