Hi mapreduce gurus -

Today while looking into a few TaskTrackers with full disks I came
across the following directory using 207GB:

/data/disk3/mapred/local/taskTracker/archive/namenode.foo.com/tmp/temp-2024081/tmp942425908/tmp942425908

Digging a bit further, I found that a job did indeed reference a
200+GB directory in HDFS via the distributed cache, and the
TaskTracker appears to have copied the whole thing locally until the
disk filled up. Looking at the cluster config, we don't explicitly set
local.cache.size, so the 10GB default limit should be in effect.
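
For reference (and assuming I'm reading the config docs right), an
explicit override would look something like this in mapred-site.xml,
where 10737418240 bytes is the 10GB default:

  <property>
    <name>local.cache.size</name>
    <!-- max total size in bytes of the TaskTracker's local distributed cache -->
    <value>10737418240</value>
  </property>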

Is anyone familiar with how the distributed cache behaves when a
dataset larger than the total cache size is referenced? I've disabled
the job that caused this, but I'm wondering whether I can configure
things more defensively.

Thanks!
Travis
