I've looked into this problem a bit more, and from what another person has written, HDFS is tuned to scale appropriately [1] given the number of input splits, etc.
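If I'm reading the 2.4.1 source correctly, org.apache.hadoop.mapreduce.lib.input.FileInputFormat#getSplits computes each split size as max(minSize, min(maxSize, blockSize)), so on a non-HDFS filesystem the split count hinges on whatever block size the FileSystem implementation reports. As a sketch of what I've been experimenting with, the mapred-site.xml snippet below caps splits explicitly; the values are illustrative guesses for my cluster, not tested recommendations:

  <!-- mapred-site.xml: illustrative, untested values -->
  <property>
    <!-- cap each split at 64 MB so a large reported block size
         doesn't collapse the job into a handful of map tasks -->
    <name>mapreduce.input.fileinputformat.split.maxsize</name>
    <value>67108864</value>
  </property>
  <property>
    <!-- keep splits from shrinking below 16 MB -->
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>16777216</value>
  </property>

With splits capped like this, the map-task count should be driven by the input size rather than by the filesystem's reported block size.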
In the case of utilizing the local filesystem (which is really a network share on a parallel filesystem), the settings might be set conservatively so as not to thrash the local disks or create a bottleneck in processing. Since neither is a concern in my setup, I'd rather tune the settings to use the local filesystem efficiently.

Are there any pointers to where in the source code I could look in order to tweak such parameters? (For reference, the memory settings currently in play are in the P.S. below the quoted mail.)

Thanks,
Calvin

[1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems

On Tue, Aug 12, 2014 at 12:29 PM, Calvin <iphcal...@gmail.com> wrote:
> Hi all,
>
> I've instantiated a Hadoop 2.4.1 cluster, and I've found that
> MapReduce applications will parallelize differently depending on what
> kind of filesystem the input data is on.
>
> Using HDFS, a MapReduce job will spawn enough containers to maximize
> use of all available memory. For example, on a 3-node cluster with 172 GB
> of memory and each map task allocating 2 GB, about 86 application
> containers will be created.
>
> On a filesystem that isn't HDFS (like NFS or, in my use case, a
> parallel filesystem), a MapReduce job will only allocate a subset of
> the available tasks (e.g., with the same 3-node cluster, only about 25-40
> containers are created). Since I'm using a parallel filesystem, I'm
> not as concerned with the bottlenecks one would find if one were to
> use NFS.
>
> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
> configuration that will let me effectively maximize resource
> utilization?
>
> Thanks,
> Calvin
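P.S. In case it's useful, here are the memory settings currently in play, with values matching the cluster described in the quoted mail (roughly 172 GB across 3 nodes, 2 GB per map task); the per-node figure assumes an even spread and is illustrative:

  <!-- yarn-site.xml: memory each NodeManager offers to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>57344</value>  <!-- ~56 GB per node; 3 nodes ~ 168-172 GB total -->
  </property>

  <!-- mapred-site.xml: per-map-task allocation -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>  <!-- 2 GB per map task, as in the quoted mail -->
  </property>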