[ https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-978.
------------------------------------

    Resolution: Won't Fix

I'd say, won't fix, as there is a workaround. Please re-open if there is a specific patch.

> spectralkmeans utility fails when input filename begins with leading underscore
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-978
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-978
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: Tested on a real Linux-based cluster running Hadoop 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop 0.20.203.0 running 16 Feb trunk build.
>            Reporter: Dan Brickley
>            Priority: Minor
>         Attachments: jira-underscore-spectral-log.txt
>
>
> The commandline 'bin/mahout spectralkmeans' utility fails with NoSuchElementException after "Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000" when input data in hdfs has filename beginning with a leading underscore.
>
> This was partially reported in comments for MAHOUT-524 but I believe identified now as a distinct issue (thanks to Shannon for help diagnosing). I have not investigated if there is an equivalent problem for API-based use of this piece of Mahout.
>
> Steps to reproduce:
>
> 1. put affinity file into hdfs, following https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs count from zero etc. Name your file with a leading underscore. For example, try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in spectral/input/_topic_skm.csv
>
> (I'll leave that example input file in place unchanged for others to try. It is built from dbpedia data, encoding associations from Wikipedia pages to categories. Whether it is a good use of spectral clustering I'm not sure, but I'd at least hope the job would run to completion.)
>
> 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o spectral/output/results1'
>
> 3. Wait for it to fail just after printing "Loading vector from: spectral/output/results1/calculations/diagonal/part-r-00000", with java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
>
> 4. Rename the file in hdfs to eliminate the leading underscore. Re-run the command (give a different results dir or cleanup from the first run, to avoid mixing the tests). This attempt should succeed and you'll see it proceed deeper into the job, i.e. something like
>
> 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: spectral/output/results2/calculations/diagonal/part-r-00000
> 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
> 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
> 12/02/19 14:38:45 INFO mapred.JobClient: map 0% reduce 0%
> 12/02/19 14:39:31 INFO mapred.JobClient: map 1% reduce 0%
>
> (5. You might get a memory-based failure some time later; that is a separate problem.)
>
> I'll attach a more detailed transcript. I've made no attempt to diagnose internals yet, but did make some other tests and can confirm that it does not seem to matter whether the commandline invocation names the file explicitly, or by directory name only. Also trailing slash does not seem to be an issue.
>
> Finally, a related 'gotcha': make sure the results directory is not inside the input directory when testing.
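
The rename workaround from step 4 of the report could be prepared with a small helper like the sketch below. It rests on an assumption that the report itself does not confirm: that the failure comes from Hadoop's default input filtering, which treats file names beginning with '_' or '.' as hidden and skips them. The function names (is_hidden, safe_name, rename_commands) are hypothetical and are not part of Mahout or Hadoop; the script only prints 'hadoop fs -mv' command lines for review and does not touch HDFS itself.

```python
import posixpath

# Prefixes that Hadoop's default FileInputFormat filter treats as hidden.
# Assumption: this filter is what leaves the job without input; the
# original report does not diagnose the cause.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(path):
    """Return True if the final path component starts with '_' or '.'."""
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


def safe_name(path):
    """Return the path with leading '_' and '.' stripped from the file name."""
    directory, name = posixpath.split(path.rstrip("/"))
    stripped = name.lstrip("_.")
    if not stripped:
        raise ValueError("file name has no characters besides '_' or '.': %r" % path)
    return posixpath.join(directory, stripped)


def rename_commands(paths):
    """Build 'hadoop fs -mv' command lines for every hidden-looking path."""
    return ["hadoop fs -mv %s %s" % (p, safe_name(p)) for p in paths if is_hidden(p)]


if __name__ == "__main__":
    # Example input path taken from step 1 of the report.
    for command in rename_commands(["spectral/input/_topic_skm.csv"]):
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be inspected and then pasted into a shell on the cluster before re-running spectralkmeans with a fresh results directory.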