On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." <gordoslo...@gmail.com> wrote:
> Not 100% clear on what you meant. You are saying I should put the file into
> my HDFS cluster or should I use DistributedCache? If you suggest the latter,
> could you address my original question?
I mean that you can certainly get away with putting the file in a known place on HDFS and loading it in each mapper or reducer, but that may become very inefficient as your problem scales up. Mostly I was responding to Shi Yu's question about why the DistributedCache is worth using at all.

As to your question, here's how I do it, which I think I basically lifted from an example in The Definitive Guide. There may be better ways, though.

In my setup, I put files into the DistributedCache by getting Path objects (which should be able to reference either HDFS or local filesystem files, though I always have my files on HDFS to start) and calling

    DistributedCache.addCacheFile(path.toUri(), conf);

Then within my mapper or reducer I retrieve all the cached files with

    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

IIRC, this is what you were doing. The catch is that this returns *all* the cached files, although by now they've been copied into a working directory on the local filesystem. Luckily, I know the filename of the file I want, so I iterate:

    for (Path cachePath : cacheFiles) {
      if (cachePath.getName().equals(cachedFilename)) {
        return cachePath;
      }
    }

That gives me the path to the local filesystem copy of the file I want, and I can do whatever I want with it.

hth
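For what it's worth, that lookup loop is easy to sanity-check on its own, outside Hadoop. Below is a small standalone sketch where plain java.nio paths stand in for org.apache.hadoop.fs.Path (Hadoop's getName() returns the base name much like java.nio's getFileName() does); the class and method names here are just for illustration, not anything from the Hadoop API:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CacheLookup {
        // Mirrors the loop from the post: given the array that
        // DistributedCache.getLocalCacheFiles(conf) would hand back,
        // pick out the one file we want by its base name.
        static Path findByName(Path[] cacheFiles, String wanted) {
            for (Path p : cacheFiles) {
                if (p.getFileName().toString().equals(wanted)) {
                    return p;
                }
            }
            return null; // not in the cache; caller should treat as an error
        }

        public static void main(String[] args) {
            // Fake "local cache" paths, shaped like what the task tracker
            // materializes on the local filesystem.
            Path[] cached = {
                Paths.get("/tmp/mapred/local/archive/lookup.dat"),
                Paths.get("/tmp/mapred/local/archive/stopwords.txt")
            };
            System.out.println(findByName(cached, "stopwords.txt"));
        }
    }

The point is just that the match is on the base name only, so if you cache two files with the same name from different directories, this loop will return whichever one comes first.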