[ https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108353#comment-13108353 ]
Dan Brickley commented on MAHOUT-524:
-------------------------------------

Re the job jar error, see MAHOUT-428 and MAHOUT-197.

Draft patch: https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt

I posted a patch that got me past those errors in the recent mailing list thread 'Spectral clustering - a bundle of issues'. I'll paste the relevant chunk of my email below. See http://comments.gmane.org/gmane.comp.apache.mahout.user/9319

-----

Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html ... seems perhaps some code rot? Can anyone else report success with Spectral clustering against recent trunk?

Trying

  bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37

...with the small example affinity file we discussed yesterday, I hit a series of problems.

data: http://danbri.org/2011/mahout/afftest.txt

1. As I mentioned in comments in http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/ (both for a local pseudo-cluster and a real one), I had to patch in calls to job.setJarByClass before job.waitForCompletion. This problem occurred for others elsewhere in Mahout, e.g. MAHOUT-428 and MAHOUT-197, but I presume it can't be hitting everyone. From grepping around, this might not be the only component missing setJarByClass calls. Or is this just me, somehow?

2. Newlines in the input data made it fail, but the associated warning from AffinityMatrixInputMapper was very vague. I'd suggest allowing blank lines and #-comments, but maybe it's not a good idea to design input syntax per component? I'd also suggest printing the problem line (see patch below) when complaining.

3. Failing to load the affinity matrix (surely a requirement for further progress?) does not seem to halt the job; I see exceptions mixed in with ongoing processing (until a later problem hits us). Transcript: https://gist.github.com/1200455 ...
Actually, it wasn't clear whether the newline problem was more of a warning, with the other rows of the input data still accepted. In that case, reporting them as java.io.IOException seems a bit draconian. So maybe parts of the input file were in fact loaded. It would be great to clarify what the expected behaviour is.

4. After all that, the job still fails. Full transcript here: https://gist.github.com/1200428

Excerpt (I've added a bit more reporting output in a few places):

11/09/07 14:25:06 INFO common.VectorCache: Loading vector from: specout/calculations/diagonal/part-r-00000
Exception in thread "main" java.util.NoSuchElementException
        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
        at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

However, that file does exist in HDFS, and seqdumper seems to accept it; it just seems empty:

Input Path: specout/calculations/diagonal/part-r-00000
Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0

I've posted an informal composite patch at https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt ... if you can confirm the above issues and a breakdown into JIRAs, I'll attach cleaner patches where appropriate.

> DisplaySpectralKMeans example fails
> -----------------------------------
>
>                 Key: MAHOUT-524
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4, 0.5
>            Reporter: Jeff Eastman
>            Assignee: Shannon Quinn
>              Labels: clustering, k-means, visualization
>             Fix For: 0.6
>
>         Attachments: EclipseLog_20110918.txt, SpectralKMeans_fail_20110919.txt, aff.txt, raw.txt, spectralkmeans.png
>
>
> I've committed a new display example that attempts to push the standard
> mixture of models data set through spectral k-means.
> After some tweaking of configuration arguments and a bug fix in
> EigenCleanupJob it runs spectral k-means to completion. The display example
> is expecting 2-d clustered points and the example is producing 5-d points.
> Additional I/O work is needed before this will play with the rest of the
> clustering algorithms.
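
On point 2 of the comment (blank lines and #-comments in the affinity input, and printing the problem line), here is a minimal sketch of what more lenient parsing could look like. It assumes each data line has the form row,col,value separated by commas, which this thread does not spell out. The class and method names are hypothetical; this is not the code in AffinityMatrixInputMapper and not the patch linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class AffinityLineParser {

  /** One parsed entry of the affinity matrix (assumed format: row,col,value). */
  static final class Entry {
    final int row;
    final int col;
    final double value;

    Entry(int row, int col, double value) {
      this.row = row;
      this.col = col;
      this.value = value;
    }
  }

  /**
   * Parses a single input line. Blank lines and lines starting with '#'
   * yield an empty Optional; anything else must be row,col,value.
   * On failure the exception message quotes the offending line.
   */
  static Optional<Entry> parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return Optional.empty();
    }
    String[] parts = trimmed.split(",");
    if (parts.length != 3) {
      throw new IllegalArgumentException(
          "Expected row,col,value but got: '" + line + "'");
    }
    try {
      return Optional.of(new Entry(
          Integer.parseInt(parts[0].trim()),
          Integer.parseInt(parts[1].trim()),
          Double.parseDouble(parts[2].trim())));
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Malformed number in line: '" + line + "'", e);
    }
  }

  /** Parses many lines, silently skipping blanks and comments. */
  static List<Entry> parseAll(Iterable<String> lines) {
    List<Entry> entries = new ArrayList<>();
    for (String line : lines) {
      parse(line).ifPresent(entries::add);
    }
    return entries;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("# test affinities", "0,1,0.5", "", "1,0,0.5");
    System.out.println(parseAll(input).size() + " entries parsed");
  }
}
```

Whether a bad line should abort the whole job or just be counted and skipped is the open question from point 3; this sketch throws, so the caller has to decide.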
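
On point 4, the NoSuchElementException comes from calling next() on an iterator over a file that exists but holds zero records. A minimal sketch of a guard that fails with a more useful message is below. The helper name and signature are hypothetical, and this is not the actual VectorCache.load code; it only illustrates checking hasNext() first and naming the source path in the error.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration only; not the actual VectorCache code. */
final class FirstElement {

  /**
   * Returns the first element of the iterator, or fails with a message
   * that names where the (empty) data came from.
   */
  static <T> T firstOrFail(Iterator<T> it, String source) {
    if (!it.hasNext()) {
      throw new IllegalStateException(
          "No records found in " + source + " (file exists but is empty?)");
    }
    return it.next();
  }

  public static void main(String[] args) {
    List<String> empty = Collections.emptyList();
    try {
      firstOrFail(empty.iterator(), "specout/calculations/diagonal/part-r-00000");
    } catch (IllegalStateException e) {
      // The message names the path instead of a bare NoSuchElementException.
      System.out.println(e.getMessage());
    }
  }
}
```

This only improves the error report. It does not explain why the diagonal file was written empty, which is presumably tied to the earlier failure to load the affinity matrix.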