I have a 1-node pseudo-distributed cluster with plenty of RAM and 5 hard disks. As an experiment, I set mapreduce.cluster.local.dir to point to a ram disk. For this experiment I am running an 8GB terasort, so I made a 9GB ram disk. This change sped up the job's run time by ~16% versus pointing mapreduce.cluster.local.dir at a comma-separated list of directories on the 5 hard disks.
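
In case it is useful, the change looks roughly like this: the ram disk is just a RAM-backed mount (I used something along the lines of mount -t tmpfs -o size=9g tmpfs /mnt/ramdisk; the mount point and size are only examples), and mapred-site.xml points the local dir at it instead of at the disks:

  <!-- mapred-site.xml: mapreduce.cluster.local.dir takes a comma-separated
       list of directories; here it is a single directory on the ram disk -->
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/mnt/ramdisk/mapred/local</value>
  </property>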

I have two questions about this:

- Will this work in a cluster situation where, say, I have a 12GB ram disk per cluster node and I am running a 128GB terasort, or does the cluster.local.dir free space per node have to be big enough to hold all of that node's intermediate results? My hunch is that it does (rough back-of-envelope after these questions), but I am not sure.

- From googling I found very little about people using ram disks with Hadoop in this way, so it seems like there is a technical reason not to, perhaps the size-related issue I mentioned above. Are there other gotchas with using a ram disk like this? It seems like a quick-and-dirty way to get some extra performance.
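
For the first question, my rough back-of-envelope, assuming terasort's intermediate (map output) data is about the same size as the input and is spread evenly across the nodes:

  128 GB intermediate data / 12 GB ramdisk per node  ->  at least ~11 nodes,
  and realistically more, to leave headroom for spill and merge files.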

Thanks,
Eric

