So I think I've figured out how to fix my problem with putting files on
the distributed classpath by digging through the code Hadoop uses to
process -libjars.

If I say

DistributedCache.addFileToClassPath(hdfsFile, conf);

then hdfsFile is added to the distributed cache, but it doesn't show up on
the classpath of the mappers or reducers, leading to crashing and burning.

But if I say

DistributedCache.addFileToClassPath(new Path(hdfsFile.toUri().getPath()), conf);

then hdfsFile *does* show up on the classpath, and my jobs run fine.
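For anyone who wants to reproduce this, here's a minimal, self-contained
sketch of the two variants side by side.  The hdfs:// literal and jar path
are made-up placeholders; substitute however your own hdfsFile actually
gets built:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class ClasspathDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Placeholder path; stand-in for however hdfsFile really gets built.
    Path hdfsFile = new Path("hdfs://namenode:8020/user/me/lib/deps.jar");

    // Variant 1: the jar lands in the distributed cache but never
    // shows up on the task classpath (left commented out here).
    // DistributedCache.addFileToClassPath(hdfsFile, conf);

    // Variant 2: round-trip through the URI to strip the scheme and
    // authority; the jar then shows up on the task classpath.
    DistributedCache.addFileToClassPath(new Path(hdfsFile.toUri().getPath()), conf);

    // ... configure and submit the job as usual ...
  }
}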

So here's the thing: if I stop the code at this point in a debugger and
check

new Path(hdfsFile.toUri().getPath()).equals(hdfsFile)

I *always* get true.  That is, according to Path.equals(), hdfsFile and
new Path(hdfsFile.toUri().getPath()) are semantically identical.
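Here's the same check as a standalone helper, for anyone who wants to poke
at it without a debugger:

import org.apache.hadoop.fs.Path;

public class PathEqualsCheck {
  // The same comparison I'm evaluating in the debugger's watch window.
  static void dump(Path hdfsFile) {
    Path stripped = new Path(hdfsFile.toUri().getPath());
    System.out.println("original: " + hdfsFile);
    System.out.println("stripped: " + stripped);
    // Prints true in every run I've tried, even though the two Paths
    // clearly behave differently in addFileToClassPath().
    System.out.println("equal?   " + stripped.equals(hdfsFile));
  }
}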

So while I'm glad that I've got something working (and hopefully this will
help anyone stuck in a similar place in the future), I'm incredibly
confused as to *why* it works.  Any ideas?
