[ 
https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108353#comment-13108353
 ] 

Dan Brickley commented on MAHOUT-524:
-------------------------------------

Re the job jar error, see MAHOUT-428 and MAHOUT-197.

draft patch: 
https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt


I posted a patch that got me past those errors in the recent mailing list 
thread 'Spectral clustering - a bundle of issues'. I'll paste the relevant 
chunk of my email below. See 
http://comments.gmane.org/gmane.comp.apache.mahout.user/9319


-----

Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
... seems perhaps some code rot?

Can anyone else report success with Spectral clustering against recent trunk?

Trying:

bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37

...with the small example affinity file we discussed yesterday, I hit
a series of problems.

data: http://danbri.org/2011/mahout/afftest.txt

1. As I mentioned in comments in
http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
(both for local pseudo-cluster, and a real one) I had to patch in
calls to job.setJarByClass before job.waitForCompletion. This problem
occurred for others elsewhere in Mahout, e.g. MAHOUT-428 and
MAHOUT-197, but I presume it can't be hitting everyone. From grepping
around, this might not be the only component missing setJarByClass
calls. Or is this just me, somehow?

2. Newlines in the input data made it fail, but the associated warning
from AffinityMatrixInputMapper was very vague. I'd suggest allowing
those and #-comments, though it may not be a good idea to design
syntax per component. I'd also suggest printing the problem line (see
patch below) when complaining.

3. Failing to load the affinity matrix (surely a requirement for
further progress?) does not seem to halt the job; I see exceptions
mixed in with ongoing processing (until a later problem hits us).
Transcript: https://gist.github.com/1200455 ... actually it wasn't
clear whether the newline problem was more of a warning, with the
other rows from the input data still accepted. In which case,
reporting them as java.io.IOException seems a bit draconian. So maybe
bits of the input file were in fact loaded. It would be great to
clarify what the expected behaviour is.


4. After all that, the job still fails. Full transcript here:
https://gist.github.com/1200428

Excerpt: (I've added a bit more reporting output in a few places)

11/09/07 14:25:06 INFO common.VectorCache: Loading vector from: specout/calculations/diagonal/part-r-00000
Exception in thread "main" java.util.NoSuchElementException
       at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
       at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

However that file does exist in hdfs, and seqdumper seems to accept
it; it just seems empty:

Input Path: specout/calculations/diagonal/part-r-00000
Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0
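
The NoSuchElementException above is what an iterator's next() throws when
there is nothing left to read, which fits a file with Count: 0. As a rough
sketch only (plain Java, not the actual VectorCache code; the class and
method names below are made up for illustration), a guard like this would
turn the bare exception into a message that names the empty file:

```java
import java.util.Iterator;

/** Hypothetical helper for illustration; not part of Mahout. */
final class FirstValueLoader {

    /**
     * Returns the first element of the iterator, or fails with a message
     * that names the source instead of a bare NoSuchElementException.
     */
    static <T> T firstOrFail(Iterator<T> values, String source) {
        if (!values.hasNext()) {
            // Nothing to read: say which file was empty.
            throw new IllegalStateException(
                "No vector found in " + source + " (file exists but has no entries)");
        }
        return values.next();
    }
}
```

Called with an iterator over the values read from
specout/calculations/diagonal/part-r-00000, this would point straight at the
empty output rather than at the iterator internals.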

I've posted an informal composite patch at
https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
If you can confirm the above issues and suggest a breakdown into JIRAs,
I'll attach cleaner patches where appropriate.
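
To make the suggestion in points 2 and 3 concrete, here is a minimal sketch
of the lenient parsing I have in mind. It is plain Java, not the patch linked
above and not the real AffinityMatrixInputMapper. It assumes the "newlines"
are blank lines and that each data line is a comma-separated row,col,value
triple; the class and method names are hypothetical. Blank lines and
#-comments are skipped, and anything else that fails to parse is reported
with its line number and the offending text:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone parser for illustration; not Mahout code. */
final class AffinityLineParser {

    /** One parsed entry: row index, column index, affinity value. */
    static final class Entry {
        final int row;
        final int col;
        final double value;

        Entry(int row, int col, double value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    /**
     * Parses one line. Returns null for blank lines and #-comments;
     * throws with the offending line in the message for malformed input.
     */
    static Entry parseLine(String line, int lineNumber) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null; // tolerated, carries no data
        }
        String[] parts = trimmed.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected 'row,col,value' at line " + lineNumber + " but got: '" + line + "'");
        }
        try {
            return new Entry(Integer.parseInt(parts[0].trim()),
                             Integer.parseInt(parts[1].trim()),
                             Double.parseDouble(parts[2].trim()));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Unparseable number at line " + lineNumber + ": '" + line + "'", e);
        }
    }

    /** Parses all lines in order, skipping blank lines and comments. */
    static List<Entry> parseAll(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Entry entry = parseLine(lines.get(i), i + 1); // 1-based line numbers
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }
}
```

Whether a malformed row should abort the whole load (as parseAll does here by
letting the exception propagate) or just be logged and skipped is exactly the
expected-behaviour question raised in point 3; the sketch picks the strict
option only to keep it short.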


> DisplaySpectralKMeans example fails
> -----------------------------------
>
>                 Key: MAHOUT-524
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Jeff Eastman
>            Assignee: Shannon Quinn
>              Labels: clustering, k-means, visualization
>             Fix For: 0.6
>
>         Attachments: EclipseLog_20110918.txt, 
> SpectralKMeans_fail_20110919.txt, aff.txt, raw.txt, spectralkmeans.png
>
>
> I've committed a new display example that attempts to push the standard 
> mixture of models data set through spectral k-means. After some tweaking of 
> configuration arguments and a bug fix in EigenCleanupJob it runs spectral 
> k-means to completion. The display example is expecting 2-d clustered points 
> and the example is producing 5-d points. Additional I/O work is needed before 
> this will play with the rest of the clustering algorithms. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
