[ https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-978.
------------------------------------

    Resolution: Won't Fix

Resolving as Won't Fix, since there is a workaround: rename the input file so 
that it does not begin with an underscore.  Please re-open if you have a 
specific patch.
                
> spectralkmeans utility fails when input filename begins with leading 
> underscore
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-978
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-978
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: Tested on a real Linux-based cluster running Hadoop 
> 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop 
> 0.20.203.0 running 16 Feb trunk build.
>            Reporter: Dan Brickley
>            Priority: Minor
>         Attachments: jira-underscore-spectral-log.txt
>
>
> The command-line 'bin/mahout spectralkmeans' utility fails with a 
> NoSuchElementException after "Loading vector from: 
> spectral/output/results2/calculations/diagonal/part-r-00000" when the input 
> data in HDFS has a filename beginning with an underscore.
> This was partially reported in the comments for MAHOUT-524, but I believe it 
> has now been identified as a distinct issue (thanks to Shannon for help 
> diagnosing). I have not investigated whether there is an equivalent problem 
> for API-based use of this piece of Mahout.
> Steps to reproduce: 
> 1. Put the affinity file into HDFS, following 
> https://cwiki.apache.org/MAHOUT/spectral-clustering.html (note that node IDs 
> count from zero, etc.). Name your file with a leading underscore. For example, 
> try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in 
> spectral/input/_topic_skm.csv
> (I'll leave that example input file in place unchanged for others to try. It 
> is built from dbpedia data, encoding associations from Wikipedia pages to 
> categories. Whether it is a good use of spectral clustering I'm not sure, but 
> I'd at least hope the job would run to completion.)
> 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o 
> spectral/output/results1'
> 3. Wait for it to fail just after printing "Loading vector from: 
> spectral/output/results1/calculations/diagonal/part-r-00000", with 
> java.util.NoSuchElementException at 
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
> 4. Rename the file in HDFS to eliminate the leading underscore. Re-run the 
> command (give a different results dir, or clean up from the first run, to 
> avoid mixing the tests). This attempt should succeed and you'll see it 
> proceed deeper into the job, i.e. something like 
> 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: 
> spectral/output/results2/calculations/diagonal/part-r-00000
> 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
> 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
> 12/02/19 14:38:45 INFO mapred.JobClient:  map 0% reduce 0%
> 12/02/19 14:39:31 INFO mapred.JobClient:  map 1% reduce 0%
> (5. You might get a memory-based failure some time later; that is a separate 
> problem.)
> I'll attach a more detailed transcript. I've made no attempt to diagnose 
> internals yet, but I did run some other tests and can confirm that it does 
> not seem to matter whether the command-line invocation names the file 
> explicitly or by directory name only. Also, a trailing slash does not seem to 
> be an issue. Finally, a related 'gotcha': make sure the results directory is 
> not inside the input directory when testing.
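
A possible explanation for the behaviour described in the report, offered as 
an assumption rather than a diagnosis confirmed in this issue: Hadoop's 
FileInputFormat treats files whose names begin with "_" or "." as hidden and 
leaves them out of a job's input. If the spectralkmeans jobs read the affinity 
file through that mechanism, an underscore-prefixed file would be skipped and 
later stages would have nothing to load. The sketch below is plain Java that 
only mirrors that naming convention; the class and method names are 
hypothetical and it does not call Hadoop or Mahout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HiddenInputFilterSketch {

    // Mirrors the hidden-file naming convention of Hadoop's FileInputFormat:
    // names starting with "_" or "." are not treated as job input.
    // This is an illustrative stand-in, not the Hadoop API itself.
    static boolean isVisibleInput(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    // Returns only the file names that such a filter would let through.
    static List<String> visibleInputs(List<String> fileNames) {
        return fileNames.stream()
                .filter(HiddenInputFilterSketch::isVisibleInput)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The first name is the one from the report; the second is the
        // same file after the rename described in step 4.
        List<String> names = Arrays.asList("_topic_skm.csv", "topic_skm.csv");
        System.out.println(visibleInputs(names)); // prints [topic_skm.csv]
    }
}
```

Under this assumption, the rename in step 4 works because the new name no 
longer matches the hidden-file pattern.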

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
