Hi all,

I've set up a Hadoop 2.4.1 cluster and found that MapReduce
applications parallelize differently depending on what kind of
filesystem the input data is on.

With input on HDFS, a MapReduce job spawns enough containers to use all
of the available memory. For example, on a 3-node cluster with 172GB of
total memory and 2GB allocated per map task, about 86 application
containers are created.
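
For reference, the container count here is governed by settings along
these lines (the values below are illustrative, not my exact ones):

    <!-- yarn-site.xml: memory each NodeManager offers to containers;
         roughly 57GB per node x 3 nodes is about the 172GB above -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>58720</value>
    </property>

    <!-- mapred-site.xml: memory requested per map task container -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>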

With input on a filesystem that isn't HDFS (NFS, or in my case a
parallel filesystem), the same job only spawns a fraction of the
containers the cluster could support (on the same 3-node cluster, about
25-40 containers are created). Since I'm using a parallel filesystem,
I'm not too worried about the I/O bottlenecks one would hit with NFS.
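
My working guess about the difference (which may well be wrong): the
number of map containers follows the number of input splits, and
FileInputFormat sizes splits from the block size the underlying
filesystem reports, roughly

    splitSize = max(minSize, min(maxSize, blockSize))

so a filesystem that reports a large block size would yield fewer,
larger splits and hence fewer containers. I haven't verified that this
is what's happening here, though.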

Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml) setting
that would let me maximize resource utilization in the non-HDFS case?
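
For example, would capping the split size with something like the
following in mapred-site.xml be the right approach? (The 128MB value is
just a guess on my part.)

    <!-- cap split size so FileInputFormat produces more, smaller
         splits (and therefore more map containers) on the non-HDFS
         filesystem -->
    <property>
      <name>mapreduce.input.fileinputformat.split.maxsize</name>
      <value>134217728</value>
    </property>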

Thanks,
Calvin
