Hmm, I must have really missed an important piece somewhere. This is from
the MapRed tutorial text:

"DistributedCache is a facility provided by the Map/Reduce framework to
cache files (text, archives, jars and so on) needed by applications.

Applications specify the files to be cached via urls (hdfs://) in the
JobConf. *The DistributedCache assumes that the files specified via hdfs://
urls are already present on the FileSystem.*

*The framework will copy the necessary files to the slave node before any
tasks for the job are executed on that node*. Its efficiency stems from the
fact that the files are only copied once per job and the ability to cache
archives which are un-archived on the slaves."


After some close reading, the two bolded pieces seem to contradict each
other. I'd always assumed that addCacheFile() would perform the 2nd bolded
statement. If that sentence is true, then I still don't have an explanation
for why our job didn't correctly push out new versions of the cache files
upon the startup and execution of JobConfiguration. We deleted them before
our job started, not during.
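
For reference, our setup is essentially the stock addCacheFile pattern; the
sketch below uses a placeholder job class and HDFS path, not our real ones:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch of the driver; OurJob and the lookup path are placeholders.
    JobConf conf = new JobConf(OurJob.class);
    conf.setJobName("lookup-join");
    // The file is assumed to already exist on HDFS, per the tutorial text.
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode/user/us/lookup/terms.txt"), conf);
    JobClient.runJob(conf);
    // (Inside the tasks, the local copies come back via
    // DistributedCache.getLocalCacheFiles(conf).)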

On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Meng Mao,
>
> The way the distributed cache is currently written, it does not verify the
> integrity of the cache files at all after they are downloaded.  It just
> assumes that if they were downloaded once, they are still there and in the
> proper shape.  It might be good to file a JIRA to add in some sort of check.
> Another thing to note is that the distributed cache also includes the time
> stamp of the original file, just in case you delete the file and then use a
> different version.  So if you want to force a download again, you can copy
> the file, delete the original, and then move the copy back to where it was.
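>
> (In API terms, that sequence would be roughly the following sketch; the
> path is just an illustration, using org.apache.hadoop.conf.Configuration
> and the org.apache.hadoop.fs classes:)
>
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     Path orig = new Path("/user/us/lookup/terms.txt");      // illustrative
>     Path copy = new Path("/user/us/lookup/terms.txt.copy");
>     FileUtil.copy(fs, orig, fs, copy, false, conf);  // copy it (fresh timestamp)
>     fs.delete(orig, false);                          // delete the original
>     fs.rename(copy, orig);                           // move the copy back in place
>     // The file keeps the copy's newer timestamp, so the cache should
>     // re-download it for the next job.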
>
> --Bobby Evans
>
> On 9/23/11 1:57 AM, "Meng Mao" <meng...@gmail.com> wrote:
>
> We use the DistributedCache class to distribute a few lookup files for our
> jobs. We have been aggressively deleting failed task attempts' leftover
> data, and our script accidentally deleted the path to our distributed cache
> files.
>
> Our task attempt leftover data was here [per node]:
> /hadoop/hadoop-metadata/cache/mapred/local/
> and our distributed cache path was:
> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/<nameNode>
> We deleted this path by accident.
>
> Does this latter path look normal? I'm not that familiar with
> DistributedCache but I'm up right now investigating the issue so I thought
> I'd ask.
>
> After that deletion, the first 2 jobs to run (which use the addCacheFile
> method to distribute their files) didn't seem to push the files out to the
> cache path, except on one node. Is this expected behavior? Shouldn't
> addCacheFile check to see if the files are missing, and if so, repopulate
> them as needed?
>
> I'm trying to get a handle on whether it's safe to delete the distributed
> cache path when the grid is quiet and no jobs are running. That is, whether
> addCacheFile is designed to be robust against the files it's caching not
> being present at each job start.
>
>
