Hi, everybody. I'm running into some difficulty getting needed libraries to my map/reduce tasks via the distributed cache.
I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement from the client, so more current versions are not really viable options. The code I've inherited is Java, which sets up and runs the MR job. There's currently some nontrivial pre- and post-processing, so it would take a large refactoring before I could just run bare MR jobs rather than starting them through Java.

Further complicating matters: in practice the Java jobs are launched by Oozie, which of course does so by wrapping each one in an MR shell. The upshot is that I don't have any control over which "local" filesystem the Java job is run from, though if local files are absolutely needed I can make my Java wrappers copy stuff back from HDFS to the Java job's local filesystem.

So here's the problem: the mappers and/or reducers need class Needed, which is contained in needed-1.0.jar, which is in HDFS:

    hdfs://.../libdir/distributed/needed-1.0.jar

The Java program executes:

    DistributedCache.addFileToClassPath(
        new Path("hdfs://.../libdir/distributed/needed-1.0.jar"),
        job.getConfiguration());

Inspecting the Job object, I find the file has been added to the cache files as expected:

    job.conf.overlay[...]    = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
    job.conf.properties[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar

And the class seems to show up in the internal ClassLoader:

    job.conf.classLoader.classes[...] = "class my.class.package.Needed"

though this may just be inherited from the ClassLoader of the Java process itself (which also uses Needed).

And yet as soon as I get into the mapreduce job itself, I start getting:

    2011-05-25 17:22:56,080 INFO JobClient - Task Id : attempt_201105251330_0037_r_000043_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed

Up until this point we've run things by keeping a directory on each node that contains all the libraries we need and including that directory in the Hadoop classpath. We have no such control in this deployment scenario, so the program itself has to hand the needed libraries to the map and reduce nodes via the distributed cache classpath.

Thanks in advance for any insight or assistance you can offer.
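For reference, here's roughly what the driver code boils down to. This is a minimal sketch, not our actual source: the class name NeededJobDriver and the job name are invented for illustration, the real HDFS path is still elided, and the mapper/reducer/input/output wiring is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class NeededJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "job-needing-Needed"); // hypothetical job name
            job.setJarByClass(NeededJobDriver.class);
            // mapper/reducer/input/output configuration elided...

            // Ship the jar containing my.class.package.Needed to the tasks
            // via the distributed cache classpath (real HDFS path elided):
            DistributedCache.addFileToClassPath(
                new Path("hdfs://.../libdir/distributed/needed-1.0.jar"),
                job.getConfiguration());

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

As far as I can tell from the 0.20.2 API, addFileToClassPath both adds the file to mapred.cache.files and records it in mapred.job.classpath.files, which is why I expected the tasks to see it.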
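And in case it helps with diagnosis, here's the sort of probe I can drop into a task to dump what actually reaches the task JVM. DiagnosticMapper is a made-up name, and the classloader dump assumes the task's context ClassLoader is a URLClassLoader, which may not hold everywhere:

    import java.io.IOException;
    import java.net.URL;
    import java.net.URLClassLoader;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DiagnosticMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // What the framework was told to put on the task classpath:
            System.err.println("mapred.job.classpath.files = "
                + context.getConfiguration().get("mapred.job.classpath.files"));
            System.err.println("mapred.cache.files = "
                + context.getConfiguration().get("mapred.cache.files"));

            // What the task JVM can actually see:
            ClassLoader cl = Thread.currentThread().getContextClassLoader();
            if (cl instanceof URLClassLoader) {
                for (URL url : ((URLClassLoader) cl).getURLs()) {
                    System.err.println("task classpath entry: " + url);
                }
            }
        }
    }

If mapred.job.classpath.files comes back empty in the task even though mapred.cache.files is populated, that would at least narrow down where the jar is getting dropped.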