Hi all, I have a question that's a bit more rooted in theory. My clustering algorithm depends on the Markovian properties of the graph representation of the data, using the eigenvectors/eigenvalues of the Markov transition matrix to determine where the cluster boundaries lie. The transition matrix is dense, since it represents the full distribution of a random walk over the entire data set.
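To make the setup concrete, here's a small local NumPy sketch of what I mean (not Mahout code, and not the EigenCuts implementation itself) — synthetic two-cluster data, a Gaussian affinity kernel, row-normalization into a Markov transition matrix, and a full dense eigendecomposition; the kernel width and data are my own assumed parameters:

```python
import numpy as np

# Synthetic toy data: two well-separated clusters (assumed, for illustration).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)),
                    rng.normal(3, 0.1, (5, 2))])

# Dense affinity matrix from a Gaussian kernel (sigma is an assumed parameter).
sigma = 1.0
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
affinity = np.exp(-sq_dists / (2 * sigma ** 2))

# Row-normalize to get the Markov transition matrix P: each row is the
# one-step distribution of a random walk starting at that point.
P = affinity / affinity.sum(axis=1, keepdims=True)

# Full eigendecomposition of the dense matrix. The leading eigenvalue is 1;
# the gap after the first few eigenvalues and the sign structure of the
# corresponding eigenvectors are what reveal the cluster boundaries.
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]
```

With two well-separated clusters, the second eigenvalue also sits very close to 1, which is the spectral signal the clustering exploits.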
In the thesis behind this algorithm (link: http://dl.dropbox.com/u/1377610/eigencuts_thesis.pdf ), the author dedicates an entire chapter (Chapter 8) to using successively lower-rank sparse approximations of the full Markov transition matrix to speed up the eigendecomposition, which is a bottleneck because the decomposition is performed at every iterative step of the clustering process.

My question is whether this optimization is necessary on a framework like Mahout. I don't doubt it would improve performance, but would the improvement be noticeable across a distributed framework? If the Markov transition matrix is already a DistributedRowMatrix, and I'm already using tools like the DistributedLanczosSolver, is an extra map/reduce step for building a lower-rank Markov matrix really necessary? I'm very eager to hear other folks' thoughts on this.

Thanks!
Regards,
Shannon
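For anyone who wants to see the trade-off in miniature, here is a rough local sketch of the Chapter 8 idea — not the thesis's exact construction, and the threshold is my own assumption. It drops near-zero transition probabilities to make the matrix sparse and checks how much the leading eigenvalues (the ones the clustering depends on) actually move; in a real pipeline a Lanczos-type solver would then compute only the top-k eigenpairs of the sparse matrix:

```python
import numpy as np

# Synthetic two-cluster data (assumed, for illustration only).
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.1, (20, 2)),
                    rng.normal(3, 0.1, (20, 2))])
sq = np.sum((points[:, None] - points[None, :]) ** 2, axis=-1)
A = np.exp(-sq / 2.0)
P = A / A.sum(axis=1, keepdims=True)             # dense Markov matrix

# Sparsify: zero out tiny transition probabilities (1e-3 is an assumed
# threshold), then re-normalize rows so the matrix stays stochastic.
P_sparse = np.where(P < 1e-3, 0.0, P)
P_sparse /= P_sparse.sum(axis=1, keepdims=True)

# Compare the top-k eigenvalues of the dense and sparsified matrices.
def top(M, k):
    return np.sort(np.linalg.eigvals(M).real)[::-1][:k]

print(top(P, 2))         # leading eigenvalues of the dense matrix
print(top(P_sparse, 2))  # leading eigenvalues after sparsification
```

On data like this the leading eigenvalues barely move, which is exactly why the thesis can get away with the approximation — my question is whether the wall-clock win survives once everything is already distributed.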
