Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
That is correct. However, it is a bit more complicated than that. The TaskTracker's in-memory index of the distributed cache is keyed off of the path of the file and the HDFS creation time of the file. So if you delete the original file off of HDFS, and then recreate it with a new time stamp
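
To make the keying described above concrete, here is a toy Python model of an index keyed on (path, HDFS creation time). It is only a sketch of the idea, not Hadoop code: `CacheKey`, `CacheIndex`, the paths, and the timestamps are all hypothetical names and values chosen for illustration.

```python
from typing import Dict, NamedTuple, Optional


class CacheKey(NamedTuple):
    """Hypothetical index key: HDFS path plus the file's HDFS creation time."""
    hdfs_path: str
    hdfs_creation_time: int


class CacheIndex:
    """Toy in-memory index mapping a key to a local path. Not Hadoop code."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, str] = {}

    def record(self, key: CacheKey, local_path: str) -> None:
        self._entries[key] = local_path

    def lookup(self, key: CacheKey) -> Optional[str]:
        # Returns the recorded local path, or None when the key is unknown.
        return self._entries.get(key)


index = CacheIndex()

# First version of the file, as seen when it was first localized.
old = CacheKey("/data/lookup.txt", 1317000000)
index.record(old, "/local/cache/lookup.txt")

# Same path deleted and recreated on HDFS: the timestamp differs, so the key differs.
new = CacheKey("/data/lookup.txt", 1317100000)

hit_old = index.lookup(old)
hit_new = index.lookup(new)
```

In this model the recreated file misses the index even though its path is unchanged, because the timestamp is part of the key. What the real TaskTracker does on such a miss is not shown here.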

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
So the proper description of how DistributedCache normally works is:
1. Have files to be cached sitting around in HDFS.
2. Run Job A, which specifies those files to be put into DistributedCache space. Each worker node copies the to-be-cached files from HDFS to local disk, but more importantly, the

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
Yes, all of the state for the task tracker is in memory. It never looks at the disk to see what is there; it only maintains the state in memory. --bobby Evans On 9/27/11 1:00 PM, "Meng Mao" wrote: I'm not concerned about disk space usage -- the script we used that deleted the taskTracker cac

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
I'm not concerned about disk space usage -- the script we used that deleted the taskTracker cache path has been fixed not to do so. I'm curious about the exact behavior of jobs that use DistributedCache files. Again, it seems safe from your description to delete files between completed runs. How c

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
If you are never going to use that file again for any map/reduce task in the future, then yes, you can delete it, but I would not recommend it. If you want to reduce the amount of space used by the distributed cache, there is a config parameter for that, "local.cache.size"; it is the

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
From that interpretation, it then seems like it would be safe to delete the files between completed runs? How could it distinguish between the files having been deleted and their not having been downloaded from a previous run? On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans wrote: > addCacheFile

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
addCacheFile sets a config value in your jobConf that indicates which files your particular job depends on. When the TaskTracker is assigned to run part of your job (map task or reduce task), it will download your jobConf, read it in, and then download the files listed in the conf, if it has no
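
The flow described above (the job records its cache files in its conf; the tracker reads the conf and downloads what it needs) can be sketched as a toy Python model. This is not Hadoop code. The key name `mapred.cache.files` is what I recall from Hadoop of that era and should be treated as an assumption, and the "download only on an index miss" rule is my reading of the truncated sentence. `add_cache_file`, `localize`, and the paths are hypothetical.

```python
from typing import Dict, List

CACHE_FILES_KEY = "mapred.cache.files"  # assumed key name, see note above


def add_cache_file(job_conf: Dict[str, str], uri: str) -> None:
    """Append a URI to the comma-separated list stored in the job conf."""
    existing = job_conf.get(CACHE_FILES_KEY)
    job_conf[CACHE_FILES_KEY] = uri if not existing else existing + "," + uri


def localize(job_conf: Dict[str, str], index: Dict[str, str],
             downloaded: List[str]) -> List[str]:
    """Return local paths for the job's cache files, fetching only index misses."""
    local_paths = []
    for uri in filter(None, job_conf.get(CACHE_FILES_KEY, "").split(",")):
        if uri not in index:
            # Index miss: pretend to copy from HDFS and remember the local path.
            index[uri] = "/local/cache/" + uri.rsplit("/", 1)[-1]
            downloaded.append(uri)
        local_paths.append(index[uri])
    return local_paths


tracker_index: Dict[str, str] = {}   # the tracker's in-memory state
downloaded: List[str] = []           # log of simulated downloads

job_a: Dict[str, str] = {}
add_cache_file(job_a, "hdfs:///data/a.txt")
add_cache_file(job_a, "hdfs:///data/b.txt")
paths_a = localize(job_a, tracker_index, downloaded)

job_b: Dict[str, str] = {}
add_cache_file(job_b, "hdfs:///data/a.txt")
paths_b = localize(job_b, tracker_index, downloaded)
```

In this model Job B triggers no new download, because `a.txt` is already in the tracker's index from Job A.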

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans wrote: > The problem is the step 4 in the breaking sequence. Currently the > TaskTracker never looks

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
The problem is step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know if a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself then the file is still there in its original form. It does
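
The failure mode described above can be reproduced with a small Python toy: a tracker that trusts its in-memory index and never checks the disk. This is a sketch under that single assumption, not the TaskTracker's actual code. `ToyTaskTracker` and the file name are hypothetical.

```python
import os
import tempfile


class ToyTaskTracker:
    """Toy model: remembers what it downloaded and never re-checks the disk."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = cache_dir
        self.index = {}          # HDFS path -> local path, in memory only
        self.download_count = 0

    def get_local_path(self, hdfs_path: str) -> str:
        if hdfs_path not in self.index:
            # Index miss: simulate the HDFS copy by writing a local file.
            local = os.path.join(self.cache_dir, os.path.basename(hdfs_path))
            with open(local, "w") as f:
                f.write("contents of " + hdfs_path)
            self.index[hdfs_path] = local
            self.download_count += 1
        # Index hit: the path is returned without checking that it still exists.
        return self.index[hdfs_path]


with tempfile.TemporaryDirectory() as cache_dir:
    tracker = ToyTaskTracker(cache_dir)

    first = tracker.get_local_path("/data/lookup.txt")
    exists_after_download = os.path.exists(first)

    os.remove(first)  # someone cleans the cache directory by hand

    second = tracker.get_local_path("/data/lookup.txt")
    exists_after_delete = os.path.exists(second)
    downloads = tracker.download_count
```

After the manual delete, the second lookup still returns the old local path and no second download happens, so a task handed that path would find nothing there.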

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-26 Thread Meng Mao
Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed. Normal sequence:
1. Have files to be cached in HDFS.
2. Run Job A, which specifies those files to be put into DistributedCache space.
3.

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Meng Mao
Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text: "DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. Applications specify the files to be cached via urls (hd

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-23 Thread Robert Evans
Meng Mao, The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once they are still there and in the proper shape. It might be good to file a JIRA to add in some so