Hi all,

I've set up a Hadoop 2.4.1 cluster and found that MapReduce applications parallelize differently depending on the kind of filesystem the input data resides on.
With HDFS as the input source, a MapReduce job spawns enough containers to make use of all available memory. For example, on a 3-node cluster with 172GB of total memory and 2GB allocated per map task, roughly 86 application containers are created. On a filesystem that isn't HDFS (such as NFS or, in my case, a parallel filesystem), the same job only spawns a fraction of that (e.g., on the same 3-node cluster, roughly 25-40 containers).

Since I'm using a parallel filesystem, I'm not as concerned about the I/O bottlenecks one would hit with NFS. Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml) setting that would let me effectively maximize resource utilization? My memory-related settings are pasted below for reference.

Thanks,
Calvin
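The memory settings on each node look roughly like this (property names are the standard YARN/MapReduce ones; the values shown are approximate, reconstructed from the numbers above rather than copied verbatim):

<!-- yarn-site.xml: memory each NodeManager advertises to the ResourceManager
     (~57GB per node x 3 nodes = ~172GB cluster-wide) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>58368</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>58368</value>
</property>

<!-- mapred-site.xml: 2GB per map task, so up to ~86 concurrent map containers cluster-wide -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>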