Re: Spectral Clustering
@ShannonQuinn ?? On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote: Hi Team, I am new to Mahout and working on a project where I need to cluster json documents. I went through the documentation but didn't get enough insights about this. Could you please help me on how the start the implementation in Java and how shall my input file looks like? Thanks, Sugam
Re: Spectral Clustering
Shannon would be the right guy to answer this. On Thu, May 7, 2015 at 1:52 PM, sugam bahl sugamb...@yahoo.co.in wrote: What do we mean by ShannonQuinn?? Thanks, Sugam On Thursday, 7 May 2015 10:49 AM, Suneel Marthi smar...@apache.org wrote: @ShannonQuinn ?? On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote: Hi Team, I am new to Mahout and working on a project where I need to cluster json documents. I went through the documentation but didn't get enough insights about this. Could you please help me on how the start the implementation in Java and how shall my input file looks like? Thanks, Sugam
Re: Spectral Clustering
What do we mean by ShannonQuinn?? Thanks, Sugam On Thursday, 7 May 2015 10:49 AM, Suneel Marthi smar...@apache.org wrote: @ShannonQuinn ?? On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote: Hi Team, I am new to Mahout and working on a project where I need to cluster json documents. I went through the documentation but didn't get enough insights about this. Could you please help me on how the start the implementation in Java and how shall my input file looks like? Thanks, Sugam
Re: Spectral clustering
Hi Sugam, To clarify, the RBF I mentioned for computing affinities is the radial basis function, linked in mahout's spectral clustering documentation: http://en.wikipedia.org/wiki/RBF_kernel The basic layout is to compare documents pairwise, use RBF to compute their similiarity, and set the entries in the affinity matrix corresponding to the two documents to the output of the RBF. On 5/7/15 4:59 PM, Shannon Quinn wrote: Hi Sugam, This is in response to your original thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201505.mbox/%3C1412053714.1387791.1431020729309.JavaMail.yahoo%40mail.yahoo.com%3E The first thing you need to do is build the graph affinity matrix yourself. That's the input to the map-reduce spectral clustering algorithm, and what is described in the documentation (the i, j, value part). Basically you'll consider each document as a single node in a graph, and weight the connections between nodes. i and j are the pair of nodes you're considering, and value is the similarity / affinity, usually between 0 (completely dissimilar) and 1 (identical). Typically you use RBF to compute affinities. Once you have the data in this format, then you can feed it to the spectral clustering algorithm. Having the Mahout package compute the affinities is at the top of my to-do list for the next version (though there are still some questions that have to be addressed), so in theory you could just submit the documents as you would to any other algorithm in Mahout, but for now you have to compute the affinities yourself. Let me know if anything still isn't clear. Shannon
Re: Spectral clustering - a bundle of issues
I've had a chance to look at some of these. Using your dataset and your command line arguments: 1) This was a problem for me on a pseudo-distributed Ubuntu machine; however, on a single-node setup on my OS X laptop, I didn't run into this problem. Can anyone with more experience with the setJarByClass tickets elucidate? 2) Newlines and comments would be a nice thing to add, absolutely. Since I'll be changing how the input is done to the spectral algorithms, this is good to keep in mind. 3) Not sure on this one. I'll have to drum up some more creative test cases to figure out what's going on here, but you're probably right in that it's a consequence of the aforementioned newlines and comments. 4) I think this is actually an off-by-one error: you specify 37 dimensions, but unfortunately I designed the input m/r job to be 0-indexed (another caveat/artifact that will vanish once raw input is acceptable), so you'll either need to re-index your affinity text file, or simply make the number of dimensions 38 and put up with an first row and column made entirely of 0's. Even so, running this job with --dimensions 38 causes the Lanczos solver path errors you and I have been seeing a lot of lately: Exception in thread main java.lang.IllegalStateException: java.io.FileNotFoundException: File file:/data/output/calculations/laplacian-80/tmp/data does not exist. at org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:222) at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104) ... The postfix data after tmp is strange to me; my guess is that the voodoo magic I did with the Paths in the SpectralKMeansDriver last summer is indeed rotting at this point and needs to be reworked. I'll start there. Thanks for helping with this, Dan. Shannon On 9/7/11 8:45 AM, Dan Brickley wrote: Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html ... seems perhaps some code rot? Can anyone else report success with Spectral clustering against recent trunk? Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37 ...with the small example affinity file we discussed yesterday, I hit a series of problems. data: http://danbri.org/2011/mahout/afftest.txt 1. As I mentioned in comments in http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/ (both for local pseudo-cluster, and a real one) I had to patch in calls to job.setJarByClass before job.waitForCompletion. This problem occured for others elsewhere in Mahout, e.g. MAHOUT-428 and MAHOUT-197, but I presume it can't be hitting everyone. From grepping around, this might not be the only component missing setJarByClass calls. Or is this just me, somehow? 2. Newlines in the input data made it fail, but the associated warning from AffinityMatrixInputMapper was very vague. I'd suggest allowing those and #-comments, but maybe not a good idea to make per-component syntax designs? Suggest also it's worth printing the problem line (see patch below) when complaining. 3. Failing to load the affinity matrix (surely a requirement for further progress?) does not seem to halt the job, I see exceptions mixed in with ongoing processing (until a later problem hits us). Transcript: https://gist.github.com/1200455 ... actually it wasn't clear if the newline problem was more of a warning, and other rows from the input data were accepted. In which case, reporting them as java.io.IOException seems a bit draconian. So maybe bits of the input file were in fact loaded. It would be great to clarify what expected behaviour is. 4. After all that, the job still fails. Full transcript here: https://gist.github.com/1200428 Excerpt: (I've added a bit more reporting output in a few places) 11/09/07 14:25:06 INFO common.VectorCache: Loading vector from: specout/calculations/diagonal/part-r-0 Exception in thread main java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152) at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121) However that file does exist in hdfs, and seqdumper seems to accept it; it just seems empty: Input Path: specout/calculations/diagonal/part-r-0 Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable Count: 0 I've posted an informal composite patch at https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt ... if you can confirm the above issues and a breakdown into JIRAs, I'll attach cleaner patches where appropriate. Looking forward to getting this running, cheers, Dan
Re: Spectral clustering - a bundle of issues
Cool! Also, DisplaySpectralClustering does not work. It has some problems with the data directory names. I did not succeed in tracking these names via eclipse. https://issues.apache.org/jira/browse/MAHOUT-524 On Wed, Sep 7, 2011 at 5:45 AM, Dan Brickley dan...@danbri.org wrote: Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html ... seems perhaps some code rot? Can anyone else report success with Spectral clustering against recent trunk? Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37 ...with the small example affinity file we discussed yesterday, I hit a series of problems. data: http://danbri.org/2011/mahout/afftest.txt 1. As I mentioned in comments in http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/ (both for local pseudo-cluster, and a real one) I had to patch in calls to job.setJarByClass before job.waitForCompletion. This problem occured for others elsewhere in Mahout, e.g. MAHOUT-428 and MAHOUT-197, but I presume it can't be hitting everyone. From grepping around, this might not be the only component missing setJarByClass calls. Or is this just me, somehow? 2. Newlines in the input data made it fail, but the associated warning from AffinityMatrixInputMapper was very vague. I'd suggest allowing those and #-comments, but maybe not a good idea to make per-component syntax designs? Suggest also it's worth printing the problem line (see patch below) when complaining. 3. Failing to load the affinity matrix (surely a requirement for further progress?) does not seem to halt the job, I see exceptions mixed in with ongoing processing (until a later problem hits us). Transcript: https://gist.github.com/1200455 ... actually it wasn't clear if the newline problem was more of a warning, and other rows from the input data were accepted. In which case, reporting them as java.io.IOException seems a bit draconian. So maybe bits of the input file were in fact loaded. It would be great to clarify what expected behaviour is. 4. After all that, the job still fails. Full transcript here: https://gist.github.com/1200428 Excerpt: (I've added a bit more reporting output in a few places) 11/09/07 14:25:06 INFO common.VectorCache: Loading vector from: specout/calculations/diagonal/part-r-0 Exception in thread main java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152) at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121) However that file does exist in hdfs, and seqdumper seems to accept it; it just seems empty: Input Path: specout/calculations/diagonal/part-r-0 Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable Count: 0 I've posted an informal composite patch at https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt ... if you can confirm the above issues and a breakdown into JIRAs, I'll attach cleaner patches where appropriate. Looking forward to getting this running, cheers, Dan -- Lance Norskog goks...@gmail.com
Re: Spectral Clustering, EigenCuts and Affinity Matrix
2011/4/11 Daniel McEnnis dmcen...@gmail.com: Lance, My reading of it is that it is not a Wikipedia affinity matrix but a distance matrix. Representing a distance matrix as a graph, clustering, then projecting back to a vector space is common for bleeding edge (circa 2008) clustering algorithms. It sounds like what your describing in the email. The distance matrix is the opposite of an affinity matrix. See the following inline replies. From the Mahout wiki: https://cwiki.apache.org/MAHOUT/spectral-clustering.html This n by n comparison of all objects with all others forms the affinity matrix, which can be intuitively thought of as a rough representation of an underlying undirected, weighted, and fully-connected graph whose edges express the relative relationships, or affinities, between each pair of objects in the original data. This definition is correct. Affinity matrix is a common term for spectral clustering algorithms. Similarity matrix is a synonym in this context as pointed by Ted. There are several ways to build an affinity matrix A: - take the heat kernel of the distances by applying the following gaussian transform component-wise on the distance matrix D a_i,j = exp(- d_i,j^2 / (sigma * mean(D)^2)) This representation is dense (as D is), hence not memory efficient unless you threshold small values (but it's not trivial to find a good threashold). Furthermore the clustering performance can be sensitive to the value of sigma. - compute the k nearest neighburs (e.g. k=10) of each sample vector and define A as the adjacency (connectivity) matrix of the KNN graph. This representation is sparse, hence more efficient for large-scale problems (once you have extracted the kNN graph). http://en.wikipedia.org/wiki/Affinity_%28mathematics%29 Wikipedia defines 'affinity matrix' as an affine transform. This is a completely unrelated meaning of the word affinity. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
Re: Spectral Clustering, EigenCuts and Affinity Matrix
The heat kernel definitions I find are about diffusion. This formula does not diffuse anything. Does this algorithm merely use the core heat kernel formula to compress the dynamic range of the matrix? Lance On Sat, Apr 23, 2011 at 8:09 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2011/4/11 Daniel McEnnis dmcen...@gmail.com: Lance, My reading of it is that it is not a Wikipedia affinity matrix but a distance matrix. Representing a distance matrix as a graph, clustering, then projecting back to a vector space is common for bleeding edge (circa 2008) clustering algorithms. It sounds like what your describing in the email. The distance matrix is the opposite of an affinity matrix. See the following inline replies. From the Mahout wiki: https://cwiki.apache.org/MAHOUT/spectral-clustering.html This n by n comparison of all objects with all others forms the affinity matrix, which can be intuitively thought of as a rough representation of an underlying undirected, weighted, and fully-connected graph whose edges express the relative relationships, or affinities, between each pair of objects in the original data. This definition is correct. Affinity matrix is a common term for spectral clustering algorithms. Similarity matrix is a synonym in this context as pointed by Ted. There are several ways to build an affinity matrix A: - take the heat kernel of the distances by applying the following gaussian transform component-wise on the distance matrix D a_i,j = exp(- d_i,j^2 / (sigma * mean(D)^2)) This representation is dense (as D is), hence not memory efficient unless you threshold small values (but it's not trivial to find a good threashold). Furthermore the clustering performance can be sensitive to the value of sigma. - compute the k nearest neighburs (e.g. k=10) of each sample vector and define A as the adjacency (connectivity) matrix of the KNN graph. This representation is sparse, hence more efficient for large-scale problems (once you have extracted the kNN graph). http://en.wikipedia.org/wiki/Affinity_%28mathematics%29 Wikipedia defines 'affinity matrix' as an affine transform. This is a completely unrelated meaning of the word affinity. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Lance Norskog goks...@gmail.com
Re: Spectral Clustering, EigenCuts and Affinity Matrix
That is a different concept. I would have used the term similarity matrix. See http://en.wikipedia.org/wiki/Similarity_matrix On Sun, Apr 10, 2011 at 4:09 PM, Lance Norskog goks...@gmail.com wrote: http://en.wikipedia.org/wiki/Affinity_%28mathematics%29 Wikipedia defines 'affinity matrix' as an affine transform. Does the affinity matrix in Eigencuts match the Wikipedia entry, or is it a completely different concept that happens to be called an 'affinity matrix'?
Re: Spectral Clustering, EigenCuts and Affinity Matrix
Lance, My reading of it is that it is not a Wikipedia affinity matrix but a distance matrix. Representing a distance matrix as a graph, clustering, then projecting back to a vector space is common for bleeding edge (circa 2008) clustering algorithms. It sounds like what your describing in the email. Daniel. On Sun, Apr 10, 2011 at 7:09 PM, Lance Norskog goks...@gmail.com wrote: From the Mahout wiki: https://cwiki.apache.org/MAHOUT/spectral-clustering.html This n by n comparison of all objects with all others forms the affinity matrix, which can be intuitively thought of as a rough representation of an underlying undirected, weighted, and fully-connected graph whose edges express the relative relationships, or affinities, between each pair of objects in the original data. http://en.wikipedia.org/wiki/Affinity_%28mathematics%29 Wikipedia defines 'affinity matrix' as an affine transform. Does the affinity matrix in Eigencuts match the Wikipedia entry, or is it a completely different concept that happens to be called an 'affinity matrix'? -- Lance Norskog goks...@gmail.com