Re: Spectral Clustering

2015-05-07 Thread Suneel Marthi
@ShannonQuinn ??

On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote:

 Hi Team,

 I am new to Mahout and working on a project where I need to cluster json
 documents. I went through the documentation but didn't get enough insights
 about this. Could you please help me on how the start the implementation in
 Java and how shall my input file looks like?

 Thanks,
 Sugam



Re: Spectral Clustering

2015-05-07 Thread Suneel Marthi
Shannon would be the right guy to answer this.

On Thu, May 7, 2015 at 1:52 PM, sugam bahl sugamb...@yahoo.co.in wrote:

 What do we mean by ShannonQuinn??

 Thanks,
 Sugam



 On Thursday, 7 May 2015 10:49 AM, Suneel Marthi smar...@apache.org
 wrote:
 @ShannonQuinn ??


 On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote:

  Hi Team,
 
  I am new to Mahout and working on a project where I need to cluster json
  documents. I went through the documentation but didn't get enough
 insights
  about this. Could you please help me on how the start the implementation
 in
  Java and how shall my input file looks like?
 
  Thanks,
  Sugam
 



Re: Spectral Clustering

2015-05-07 Thread sugam bahl
What do we mean by ShannonQuinn??

Thanks,
Sugam



On Thursday, 7 May 2015 10:49 AM, Suneel Marthi smar...@apache.org wrote:
@ShannonQuinn ??


On Thu, May 7, 2015 at 1:45 PM, sugam bahl sugamb...@yahoo.co.in wrote:

 Hi Team,

 I am new to Mahout and working on a project where I need to cluster json
 documents. I went through the documentation but didn't get enough insights
 about this. Could you please help me on how the start the implementation in
 Java and how shall my input file looks like?

 Thanks,
 Sugam



Re: Spectral clustering

2015-05-07 Thread Shannon Quinn

Hi Sugam,

To clarify, the RBF I mentioned for computing affinities is the radial 
basis function, linked in mahout's spectral clustering documentation: 
http://en.wikipedia.org/wiki/RBF_kernel


The basic layout is to compare documents pairwise, use RBF to compute 
their similiarity, and set the entries in the affinity matrix 
corresponding to the two documents to the output of the RBF.


On 5/7/15 4:59 PM, Shannon Quinn wrote:

Hi Sugam,

This is in response to your original thread: 
http://mail-archives.apache.org/mod_mbox/mahout-user/201505.mbox/%3C1412053714.1387791.1431020729309.JavaMail.yahoo%40mail.yahoo.com%3E


The first thing you need to do is build the graph affinity matrix 
yourself. That's the input to the map-reduce spectral clustering 
algorithm, and what is described in the documentation (the i, j, 
value part). Basically you'll consider each document as a single node 
in a graph, and weight the connections between nodes. i and j are 
the pair of nodes you're considering, and value is the similarity / 
affinity, usually between 0 (completely dissimilar) and 1 (identical). 
Typically you use RBF to compute affinities.


Once you have the data in this format, then you can feed it to the 
spectral clustering algorithm. Having the Mahout package compute the 
affinities is at the top of my to-do list for the next version (though 
there are still some questions that have to be addressed), so in 
theory you could just submit the documents as you would to any other 
algorithm in Mahout, but for now you have to compute the affinities 
yourself.


Let me know if anything still isn't clear.

Shannon




Re: Spectral clustering - a bundle of issues

2011-09-09 Thread Shannon Quinn
I've had a chance to look at some of these. Using your dataset and your 
command line arguments:


1) This was a problem for me on a pseudo-distributed Ubuntu machine; 
however, on a single-node setup on my OS X laptop, I didn't run into 
this problem. Can anyone with more experience with the setJarByClass 
tickets elucidate?


2) Newlines and comments would be a nice thing to add, absolutely. Since 
I'll be changing how the input is done to the spectral algorithms, this 
is good to keep in mind.


3) Not sure on this one. I'll have to drum up some more creative test 
cases to figure out what's going on here, but you're probably right in 
that it's a consequence of the aforementioned newlines and comments.


4) I think this is actually an off-by-one error: you specify 37 
dimensions, but unfortunately I designed the input m/r job to be 
0-indexed (another caveat/artifact that will vanish once raw input is 
acceptable), so you'll either need to re-index your affinity text file, 
or simply make the number of dimensions 38 and put up with an first row 
and column made entirely of 0's.


Even so, running this job with --dimensions 38 causes the Lanczos solver 
path errors you and I have been seeing a lot of lately:


Exception in thread main java.lang.IllegalStateException: 
java.io.FileNotFoundException: File 
file:/data/output/calculations/laplacian-80/tmp/data does not exist.
at 
org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:222)
at 
org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)

...

The postfix data after tmp is strange to me; my guess is that the 
voodoo magic I did with the Paths in the SpectralKMeansDriver last 
summer is indeed rotting at this point and needs to be reworked. I'll 
start there.


Thanks for helping with this, Dan.

Shannon

On 9/7/11 8:45 AM, Dan Brickley wrote:

Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
... seems perhaps some code rot?

Can anyone else report success with Spectral clustering against recent trunk?

Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter
10 --dimensions 37

...with the small example affinity file we discussed yesterday, I hit
a series of problems.

data: http://danbri.org/2011/mahout/afftest.txt

1. As I mentioned in comments in
http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
(both for local pseudo-cluster, and a real one) I had to patch in
calls to job.setJarByClass before job.waitForCompletion. This problem
occured for others elsewhere in Mahout, e.g. MAHOUT-428 and
MAHOUT-197, but I presume it can't be hitting everyone. From grepping
around, this might not be the only component missing setJarByClass
calls. Or is this just me, somehow?

2. Newlines in the input data made it fail, but the associated warning
from AffinityMatrixInputMapper was very vague. I'd suggest allowing
those and #-comments, but maybe not a good idea to make per-component
syntax designs? Suggest also it's worth printing the problem line (see
patch below) when complaining.

3. Failing to load the affinity matrix (surely a requirement for
further progress?) does not seem to halt the job, I see exceptions
mixed in with ongoing processing (until a later problem hits us).
Transcript: https://gist.github.com/1200455 ... actually it wasn't
clear if the newline problem was more of a warning, and other rows
from the input data were accepted. In which case, reporting them as
java.io.IOException seems a bit draconian. So maybe bits of the input
file were in fact loaded. It would be great to clarify what expected
behaviour is.


4. After all that, the job still fails. Full transcript here:
https://gist.github.com/1200428

Excerpt: (I've added a bit more reporting output in a few places)

11/09/07 14:25:06 INFO common.VectorCache: Loading vector from:
specout/calculations/diagonal/part-r-0
Exception in thread main java.util.NoSuchElementException
at 
com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
at 
org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

However that file does exist in hdfs, and seqdumper seems to accept
it; it just seems empty:

Input Path: specout/calculations/diagonal/part-r-0
Key class: class org.apache.hadoop.io.NullWritable Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

I've posted an informal composite patch at
https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
  ... if you can confirm the above issues and a breakdown into JIRAs,
I'll attach cleaner patches where appropriate.

Looking forward to getting this running,

cheers,

Dan




Re: Spectral clustering - a bundle of issues

2011-09-07 Thread Lance Norskog
Cool! Also, DisplaySpectralClustering does not work. It has some problems
with the data directory names.  I did not succeed in tracking these names
via eclipse.

https://issues.apache.org/jira/browse/MAHOUT-524

On Wed, Sep 7, 2011 at 5:45 AM, Dan Brickley dan...@danbri.org wrote:

 Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
 ... seems perhaps some code rot?

 Can anyone else report success with Spectral clustering against recent
 trunk?

 Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter
 10 --dimensions 37

 ...with the small example affinity file we discussed yesterday, I hit
 a series of problems.

 data: http://danbri.org/2011/mahout/afftest.txt

 1. As I mentioned in comments in
 http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
 (both for local pseudo-cluster, and a real one) I had to patch in
 calls to job.setJarByClass before job.waitForCompletion. This problem
 occured for others elsewhere in Mahout, e.g. MAHOUT-428 and
 MAHOUT-197, but I presume it can't be hitting everyone. From grepping
 around, this might not be the only component missing setJarByClass
 calls. Or is this just me, somehow?

 2. Newlines in the input data made it fail, but the associated warning
 from AffinityMatrixInputMapper was very vague. I'd suggest allowing
 those and #-comments, but maybe not a good idea to make per-component
 syntax designs? Suggest also it's worth printing the problem line (see
 patch below) when complaining.

 3. Failing to load the affinity matrix (surely a requirement for
 further progress?) does not seem to halt the job, I see exceptions
 mixed in with ongoing processing (until a later problem hits us).
 Transcript: https://gist.github.com/1200455 ... actually it wasn't
 clear if the newline problem was more of a warning, and other rows
 from the input data were accepted. In which case, reporting them as
 java.io.IOException seems a bit draconian. So maybe bits of the input
 file were in fact loaded. It would be great to clarify what expected
 behaviour is.


 4. After all that, the job still fails. Full transcript here:
 https://gist.github.com/1200428

 Excerpt: (I've added a bit more reporting output in a few places)

 11/09/07 14:25:06 INFO common.VectorCache: Loading vector from:
 specout/calculations/diagonal/part-r-0
 Exception in thread main java.util.NoSuchElementException
at
 com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
at
 org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

 However that file does exist in hdfs, and seqdumper seems to accept
 it; it just seems empty:

 Input Path: specout/calculations/diagonal/part-r-0
 Key class: class org.apache.hadoop.io.NullWritable Value Class: class
 org.apache.mahout.math.VectorWritable
 Count: 0

 I've posted an informal composite patch at

 https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
  ... if you can confirm the above issues and a breakdown into JIRAs,
 I'll attach cleaner patches where appropriate.

 Looking forward to getting this running,

 cheers,

 Dan




-- 
Lance Norskog
goks...@gmail.com


Re: Spectral Clustering, EigenCuts and Affinity Matrix

2011-04-23 Thread Olivier Grisel
2011/4/11 Daniel McEnnis dmcen...@gmail.com:
 Lance,

 My reading of it is that it is not a Wikipedia affinity matrix but a
 distance matrix.  Representing a distance matrix as a graph,
 clustering, then projecting back to a vector space is common for
 bleeding edge (circa 2008) clustering algorithms.  It sounds like what
 your describing in the email.

The distance matrix is the opposite of an affinity matrix. See the
following inline replies.

 From the Mahout wiki:
 https://cwiki.apache.org/MAHOUT/spectral-clustering.html

 This n by n comparison of all objects with all others forms the
 affinity matrix, which can be intuitively thought of as a rough
 representation of an underlying undirected, weighted, and
 fully-connected graph whose edges express the relative relationships,
 or affinities, between each pair of objects in the original data. 

This definition is correct. Affinity matrix is a common term for
spectral clustering algorithms. Similarity matrix is a synonym in this
context as pointed by Ted.

There are several ways to build an affinity matrix A:

- take the heat kernel of the distances by applying the following
gaussian transform component-wise on the distance matrix D

   a_i,j = exp(- d_i,j^2 / (sigma * mean(D)^2))

This representation is dense (as D is), hence not memory efficient
unless you threshold small values (but it's not trivial to find a good
threashold). Furthermore the clustering performance can be sensitive
to the value of sigma.

- compute the k nearest neighburs (e.g. k=10) of each sample vector
and define A as the adjacency (connectivity) matrix of the KNN graph.

This representation is sparse, hence more efficient for large-scale
problems (once you have extracted the kNN graph).

 http://en.wikipedia.org/wiki/Affinity_%28mathematics%29
 Wikipedia defines 'affinity matrix' as an affine transform.

This is a completely unrelated meaning of the word affinity.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel


Re: Spectral Clustering, EigenCuts and Affinity Matrix

2011-04-23 Thread Lance Norskog
The heat kernel definitions I find are about diffusion. This formula
does not diffuse anything. Does this algorithm merely use the core
heat kernel formula to compress the dynamic range of the matrix?

Lance

On Sat, Apr 23, 2011 at 8:09 AM, Olivier Grisel
olivier.gri...@ensta.org wrote:
 2011/4/11 Daniel McEnnis dmcen...@gmail.com:
 Lance,

 My reading of it is that it is not a Wikipedia affinity matrix but a
 distance matrix.  Representing a distance matrix as a graph,
 clustering, then projecting back to a vector space is common for
 bleeding edge (circa 2008) clustering algorithms.  It sounds like what
 your describing in the email.

 The distance matrix is the opposite of an affinity matrix. See the
 following inline replies.

 From the Mahout wiki:
 https://cwiki.apache.org/MAHOUT/spectral-clustering.html

 This n by n comparison of all objects with all others forms the
 affinity matrix, which can be intuitively thought of as a rough
 representation of an underlying undirected, weighted, and
 fully-connected graph whose edges express the relative relationships,
 or affinities, between each pair of objects in the original data. 

 This definition is correct. Affinity matrix is a common term for
 spectral clustering algorithms. Similarity matrix is a synonym in this
 context as pointed by Ted.

 There are several ways to build an affinity matrix A:

 - take the heat kernel of the distances by applying the following
 gaussian transform component-wise on the distance matrix D

   a_i,j = exp(- d_i,j^2 / (sigma * mean(D)^2))

 This representation is dense (as D is), hence not memory efficient
 unless you threshold small values (but it's not trivial to find a good
 threashold). Furthermore the clustering performance can be sensitive
 to the value of sigma.

 - compute the k nearest neighburs (e.g. k=10) of each sample vector
 and define A as the adjacency (connectivity) matrix of the KNN graph.

 This representation is sparse, hence more efficient for large-scale
 problems (once you have extracted the kNN graph).

 http://en.wikipedia.org/wiki/Affinity_%28mathematics%29
 Wikipedia defines 'affinity matrix' as an affine transform.

 This is a completely unrelated meaning of the word affinity.

 --
 Olivier
 http://twitter.com/ogrisel - http://github.com/ogrisel




-- 
Lance Norskog
goks...@gmail.com


Re: Spectral Clustering, EigenCuts and Affinity Matrix

2011-04-10 Thread Ted Dunning
That is a different concept.

I would have used the term similarity matrix.  See
http://en.wikipedia.org/wiki/Similarity_matrix


On Sun, Apr 10, 2011 at 4:09 PM, Lance Norskog goks...@gmail.com wrote:

 http://en.wikipedia.org/wiki/Affinity_%28mathematics%29
 Wikipedia defines 'affinity matrix' as an affine transform. Does the
 affinity matrix in Eigencuts match the Wikipedia entry, or is it a
 completely different concept that happens to be called an 'affinity
 matrix'?


Re: Spectral Clustering, EigenCuts and Affinity Matrix

2011-04-10 Thread Daniel McEnnis
Lance,

My reading of it is that it is not a Wikipedia affinity matrix but a
distance matrix.  Representing a distance matrix as a graph,
clustering, then projecting back to a vector space is common for
bleeding edge (circa 2008) clustering algorithms.  It sounds like what
your describing in the email.

Daniel.

On Sun, Apr 10, 2011 at 7:09 PM, Lance Norskog goks...@gmail.com wrote:
 From the Mahout wiki:
 https://cwiki.apache.org/MAHOUT/spectral-clustering.html

 This n by n comparison of all objects with all others forms the
 affinity matrix, which can be intuitively thought of as a rough
 representation of an underlying undirected, weighted, and
 fully-connected graph whose edges express the relative relationships,
 or affinities, between each pair of objects in the original data. 

 http://en.wikipedia.org/wiki/Affinity_%28mathematics%29
 Wikipedia defines 'affinity matrix' as an affine transform. Does the
 affinity matrix in Eigencuts match the Wikipedia entry, or is it a
 completely different concept that happens to be called an 'affinity
 matrix'?



 --
 Lance Norskog
 goks...@gmail.com