Hi Jeff,

On Fri, Jun 10, 2011 at 7:38 PM, Jeff Eastman <jeast...@narus.com> wrote:
> The first run on MapR:
> MAHOUT_LOCAL is set, running locally
[...]
> INFO: Command line arguments: {--charset=UTF-8, --chunkSize=5,
> --endPhase=2147483647,
> --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
> --input=mahout-work/reuters-out, --keyPrefix=,
> --output=mahout-work/reuters-out-seqdir, --startPhase=0, --tempDir=temp}
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> maprfs
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)

I force seqdirectory to run locally because it is considerably more efficient to
copy its output up to HDFS than to copy the output of the prior step (the Reuters
extraction) up to HDFS and then run seqdirectory on the cluster. When seqdirectory
runs locally we are simply calling java with the classpath set up by bin/mahout,
and I think it is unlikely that this classpath includes the MapR classes. However,
we are pointing at a Hadoop configuration that references a maprfs filesystem.
That configuration is loaded and, because the MapR classes are absent, maprfs is
not understood as a valid scheme -- hence this error. bin/mahout should/could
slurp in the classpath appropriate to the Hadoop installation somehow; I'm not
sure of the best way to do this (rough sketch at the end of this mail).

> And then, after changing HADOOP_HOME & HADOOP_CONF_DIR to CDH3 on a fresh
> untar/install of 0.5:
[..]
> INFO: Command line arguments: {--charset=UTF-8, --chunkSize=5,
> --endPhase=2147483647,
> --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
> --input=mahout-work/reuters-out, --keyPrefix=,
> --output=mahout-work/reuters-out-seqdir, --startPhase=0, --tempDir=temp}
> Exception in thread "main" java.io.IOException: Call to
> hadoop1.eng.narus.com/172.31.2.200:8020 failed on local exception:
> java.io.EOFException
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)

I wonder if this is a similar case, where the Hadoop classes packaged with Mahout
are being used to talk to a CDH3 Hadoop cluster and we're bumping against protocol
incompatibilities? (A quick way to check is in the postscript below.) Although
MAHOUT_LOCAL is set in this case, seqdirectory is clearly trying to reach out to
HDFS as part of the filesystem setup process.

All in all, it seems that rejiggering the classpath in bin/mahout so that the
classes specific to the Hadoop environment it is executing within appear first
may be the right way to resolve this issue. Do I vaguely recall another discussion
about classpath order popping up on the list recently?

Jeff, I really appreciate you putting this through its paces on the wide variety
of environments you have access to -- thanks!

Drew
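
P.S. Here is the rough sketch I mentioned of what bin/mahout could do. This is only
an assumption about how it might be wired in (I haven't tried it against the actual
script), and the HADOOP_CLASSPATH_LOCAL variable name is made up:

  # If an external Hadoop installation is configured, put its conf dir and
  # jars at the front of the classpath so they shadow whatever Hadoop bits
  # ship with Mahout (FileSystem implementations, RPC version, etc.).
  if [ -n "$HADOOP_HOME" ] && [ -d "$HADOOP_HOME" ]; then
    HADOOP_CLASSPATH_LOCAL="$HADOOP_CONF_DIR"
    for jar in "$HADOOP_HOME"/hadoop-*.jar "$HADOOP_HOME"/lib/*.jar; do
      HADOOP_CLASSPATH_LOCAL="$HADOOP_CLASSPATH_LOCAL:$jar"
    done
    CLASSPATH="$HADOOP_CLASSPATH_LOCAL:$CLASSPATH"
  fi

With something along those lines, the class implementing the maprfs scheme would be
picked up from the installation's own jars instead of going missing when the
configuration is loaded.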
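
P.P.S. One quick way to test the protocol-mismatch theory on the CDH3 cluster would
be to compare the Hadoop jar we bundle with the one the installation actually runs;
the paths below are only a guess at the usual layout, so adjust as needed:

  # Hadoop jar(s) shipped in the Mahout 0.5 distribution
  ls "$MAHOUT_HOME"/lib/hadoop-*.jar
  # Hadoop core jar of the CDH3 installation the cluster is running
  ls "$HADOOP_HOME"/hadoop-core-*.jar

If those versions differ, an EOFException on the IPC call to port 8020 is about what
you'd expect.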