Hi mapreduce gurus - Today, while looking into a few TaskTrackers with full disks, I came across the following directory using 207GB:
/data/disk3/mapred/local/taskTracker/archive/namenode.foo.com/tmp/temp-2024081/tmp942425908/tmp942425908

Digging a bit further, a job did indeed reference a 200+GB directory in HDFS via the distributed cache, and it appears to have copied the whole thing locally until the disk filled up. Looking at the cluster config, we don't explicitly set local.cache.size, so the 10GB default limit should be in effect.

Is anyone familiar with how the distributed cache behaves when a dataset larger than the total cache size is referenced? I've disabled the job that caused this situation, but I'm wondering whether I can configure things more defensively.

Thanks!
Travis
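
P.S. To frame the question a bit: the kind of defensive change I had in mind was explicitly capping the cache in mapred-site.xml, something like the snippet below. The value is in bytes, so 5368709120 is just a hypothetical 5GB cap, and this assumes local.cache.size is still the right property name on our Hadoop version:

  <property>
    <name>local.cache.size</name>
    <!-- total space the TaskTracker may use for localized distributed cache files, in bytes (5GB here) -->
    <value>5368709120</value>
  </property>

Though given that the 10GB default apparently didn't stop a 200+GB copy, I'm not sure a lower cap alone would help, hence the question.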