Hi Jake,
> I don't think that the EigenVerificationJob *modifies* any SequenceFiles -
> that's a big no-no in Hadoop-land (data is write-once). The output path for
> the cleaned eigenvectors is "${mapred.output.dir}/largestCleanEigens/" -
> look in EigenVerificationJob.saveCleanEigens(). It will give you as many
> cleaned eigenvectors as it can get out of the ones that you gave it (i.e.,
> every eigenvector which has error less than maxError and eigenvalue greater
> than minEigenvalue will be kept).
I figured that was the case... I just found it strange that, if
saveCleanEigens() saves them to a path that is hard-coded into the
method, the output path is neither specified by the caller nor returned
by the callee. So I assumed the next most logical pattern (logical to
me, anyway :P): that the original eigenvectors were overwritten.
> If you wanted to add a parameter to that job, "maxEigensToKeep", which
> would prune off the eigenvectors with the smallest eigenvalues from the
> remaining cleaned set and keep only that many, it would be a nice addition.
Ahhhh. Yes. This had crossed my mind, but I was curious whether there
was anything I could do, short of modifying the EigenVerifier itself, to
prune out unwanted vectors. Will do.
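
Here is a rough sketch of the pruning rule as I understand it: keep every
eigenvector with error below maxError and eigenvalue above minEigenvalue,
then keep only the maxEigensToKeep largest of those. The EigenPruning and
EigenMeta classes are hypothetical stand-ins made up for illustration; they
are not Mahout types, and this is not how EigenVerificationJob is written.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: none of these types come from Mahout. */
final class EigenPruning {

    /** Hypothetical record of one verified eigenvector. */
    static final class EigenMeta {
        final int index;         // position of the eigenvector in the input set
        final double eigenvalue; // eigenvalue estimate from verification
        final double error;      // verification error for this vector

        EigenMeta(int index, double eigenvalue, double error) {
            this.index = index;
            this.eigenvalue = eigenvalue;
            this.error = error;
        }
    }

    /** Applies the error/eigenvalue cut, then keeps the largest maxEigensToKeep. */
    static List<EigenMeta> prune(List<EigenMeta> all, double maxError,
                                 double minEigenvalue, int maxEigensToKeep) {
        List<EigenMeta> kept = new ArrayList<>();
        for (EigenMeta e : all) {
            // Same rule as the "cleaning" step: low error, large enough eigenvalue.
            if (e.error < maxError && e.eigenvalue > minEigenvalue) {
                kept.add(e);
            }
        }
        // Largest eigenvalues first, so the smallest ones are the ones cut off.
        kept.sort(Comparator.comparingDouble((EigenMeta e) -> e.eigenvalue).reversed());
        int n = Math.max(0, Math.min(maxEigensToKeep, kept.size()));
        return new ArrayList<>(kept.subList(0, n));
    }
}
```

The returned list carries the original indices, so the caller can decide
which of the stored vectors to keep without touching the input files.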
> I'm not exactly sure what you're asking about the cluster dumping...
Sorry, it was definitely unclear. I'm running k-means clustering on
the matrix of eigenvectors as a proxy for running k-means on the actual
data (where each component of the eigenvectors represents one of the
original data points). So I need to, in effect, "transfer" the clustering
assignments that k-means gives on the eigenvectors back to the original
data, and then output those assignments, ideally in exactly the same
format as k-means or any of the other clustering algorithms. I looked
into the k-means unit tests and feel like I can easily read off the
clustering assignments and correlate them to the original data, but
I'm not sure how to output these correlations, since the clustering was
done on the eigenvector components.
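
To make the "transfer" concrete, here is a minimal sketch. It assumes that
row i of the clustered eigenvector matrix corresponds to original data
point i, and that the assignments have already been read into an int array.
The AssignmentTransfer class and its method are hypothetical names, not
Mahout API, and the grouped map is just a stand-in for whatever output
format the clustering jobs actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: not part of Mahout. */
final class AssignmentTransfer {

    /**
     * Groups original data point ids by the cluster assigned to the
     * matching row of the eigenvector matrix (row i maps to point i).
     */
    static Map<Integer, List<String>> groupByCluster(int[] rowToCluster,
                                                     List<String> originalIds) {
        if (rowToCluster.length != originalIds.size()) {
            throw new IllegalArgumentException(
                "need exactly one cluster assignment per original data point");
        }
        // TreeMap keeps cluster ids in ascending order for stable output.
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < rowToCluster.length; i++) {
            clusters.computeIfAbsent(rowToCluster[i], k -> new ArrayList<>())
                    .add(originalIds.get(i));
        }
        return clusters;
    }
}
```

The part I'm still missing is how to write a map like this back out in the
format the clustering jobs produce.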
Please let me know if that's still not clear. Thanks again!
Shannon