That is correct. However, it is a bit more complicated than that. The
TaskTracker's in-memory index of the distributed cache is keyed off of the path
of the file and the HDFS creation time of the file. So if you delete the
original file off of HDFS and then recreate it with a new timestamp, the index
entry no longer matches and the file will be downloaded again.
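To make the keying concrete, here is a rough sketch of it (illustrative only,
not the real TaskTracker code; the path is a placeholder): the cache key pairs
the HDFS path with the file's timestamp, so a deleted-and-recreated file
produces a different key.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheKeyDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/example/lookup.dat");  // placeholder path
            FileStatus status = fs.getFileStatus(p);
            // Recreating the file changes getModificationTime(), and with it the key.
            String key = p.toString() + "@" + status.getModificationTime();
            System.out.println(key);
        }
    }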
So the proper description of how DistributedCache normally works is:
1. Have files to be cached sitting around in HDFS.
2. Run Job A, which specifies those files to be put into DistributedCache
space. Each worker node copies the to-be-cached files from HDFS to local
disk, but more importantly, the TaskTracker records each file in its
in-memory index, keyed by the file's path and timestamp (a minimal sketch
of step 2 follows below).
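Here is the minimal "Job A" sketch, using the classic (0.20.x-era) API; the
path and class names are placeholders:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class JobA {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JobA.class);
            conf.setJobName("job-a");
            // The file must already exist in HDFS; addCacheFile only records
            // the URI in the job configuration.
            DistributedCache.addCacheFile(new URI("/user/example/lookup.dat"), conf);
            // ... set mapper/reducer, input/output paths, then submit ...
        }
    }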
Yes, all of the state for the task tracker is in memory. It never looks at the
disk to see what is there; it only maintains the state in memory.
--bobby Evans
On 9/27/11 1:00 PM, "Meng Mao" wrote:
I'm not concerned about disk space usage -- the script we used that deleted
the taskTracker cache path has been fixed not to do so.
I'm curious about the exact behavior of jobs that use DistributedCache
files. Again, it seems safe from your description to delete files between
completed runs. How could the TaskTracker tell the difference?
If you are never ever going to use that file again for any map/reduce task in
the future, then yes, you can delete it, but I would not recommend it. If you
want to reduce the amount of space that is used by the distributed cache, there
is a config parameter for that: "local.cache.size". It is the maximum size, in
bytes, that the local cache is allowed to grow to before the TaskTracker starts
cleaning out old entries.
From that interpretation, it then seems like it would be safe to delete the
files between completed runs? How could it distinguish between the files
having been deleted and their not having been downloaded from a previous
run?
On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans wrote:
> addCacheFile
addCacheFile sets a config value in your jobConf that indicates which files
your particular job depends on. When the TaskTracker is assigned to run part
of your job (a map task or reduce task), it will download your jobConf, read it
in, and then download the files listed in the conf, if it has not already
downloaded them according to its in-memory index.
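As a quick illustration that addCacheFile is just jobConf bookkeeping (the
property name below is the 0.20.x-era key, and the path is a placeholder):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class ShowCacheConf {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            DistributedCache.addCacheFile(new URI("/user/example/lookup.dat"), conf);
            // Prints the comma-separated list of cache file URIs that the
            // TaskTrackers will later read back out of the conf and download.
            System.out.println(conf.get("mapred.cache.files"));
        }
    }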
Who is in charge of getting the files there for the first time? The
addCacheFile call in the mapreduce job? Or a manual setup by the
user/operator?
On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans wrote:
> The problem is step 4 in the breaking sequence. Currently the
> TaskTracker never looks
The problem is step 4 in the breaking sequence. Currently the TaskTracker
never looks at the disk to know if a file is in the distributed cache or not.
It assumes that if it downloaded the file and did not delete that file itself,
then the file is still there in its original form. It does not re-check the
local filesystem at all.
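A rough sketch of that behavior (illustrative only, not the real TaskTracker
code; the helper and class names are made up): presence of an entry in the
in-memory map is trusted, and the local filesystem is never consulted.

    import java.util.HashMap;
    import java.util.Map;

    final class DistCacheIndex {
        // key: HDFS path + "@" + modification time; value: local path
        private final Map<String, String> localized = new HashMap<>();

        String localize(String hdfsPath, long mtime) {
            String key = hdfsPath + "@" + mtime;
            String local = localized.get(key);
            if (local != null) {
                // Trusted blindly: no File.exists() check here, so a copy
                // deleted out-of-band is still reported as present.
                return local;
            }
            local = download(hdfsPath);  // hypothetical helper
            localized.put(key, local);
            return local;
        }

        private String download(String hdfsPath) {
            // ... copy from HDFS into the local cache directory ...
            return "/local/cache/path";
        }
    }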
Let's frame the issue in another way. I'll describe a sequence of Hadoop
operations that I think should work, and then I'll get into what we did and
how it failed.
Normal sequence:
1. have files to be cached in HDFS
2. Run Job A, which specifies those files to be put into DistributedCache
space
3. Job A's tasks run, and each worker node copies those files from HDFS to
its local disk
4. later runs that specify the same files reuse the local copies without
downloading them again
Hmm, I must have really missed an important piece somewhere. This is from
the MapRed tutorial text:
"DistributedCache is a facility provided by the Map/Reduce framework to
cache files (text, archives, jars and so on) needed by applications.
Applications specify the files to be cached via urls (hdfs://) in the JobConf."
Meng Mao,
The way the distributed cache is currently written, it does not verify the
integrity of the cache files at all after they are downloaded. It just assumes
that if they were downloaded once, they are still there and in the proper shape.
It might be good to file a JIRA to add in some sort of verification.
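For what it's worth, here is a hypothetical example of the kind of check such a
JIRA might propose, verifying a cached file before reuse (the class name and
the length-based check are my own invention, not anything in the codebase):

    import java.io.File;

    final class CacheEntry {
        final String localPath;
        final long expectedLength;  // recorded when the file was localized

        CacheEntry(String localPath, long expectedLength) {
            this.localPath = localPath;
            this.expectedLength = expectedLength;
        }

        // Re-validate against the disk instead of trusting the in-memory
        // index: the file must still exist and have the recorded length.
        boolean stillValid() {
            File f = new File(localPath);
            return f.isFile() && f.length() == expectedLength;
        }
    }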