I've looked into this problem a bit more, and from what another
person has written, HDFS is tuned to scale appropriately [1] given the
number of input splits, etc.
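
For what it's worth, my (possibly shaky) understanding of the
mechanics is that the number of map containers tracks the number of
input splits, and the split size falls out of the block size reported
by the underlying FileSystem. Paraphrasing the Hadoop 2.x
FileInputFormat source, the computation looks roughly like this (the
block and input sizes below are made-up numbers for illustration):

    // Rough paraphrase of
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat (Hadoop 2.x);
    // not a verbatim copy, just the shape of the computation.
    public class SplitSizeSketch {

        static long computeSplitSize(long blockSize, long minSize,
                                     long maxSize) {
            // blockSize is whatever the underlying FileSystem reports, so
            // a non-HDFS filesystem can yield very different splits.
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long minSize = 1L;                  // effective minimum, I believe
            long maxSize = Long.MAX_VALUE;      // split.maxsize default
            long blockSize = 128L * 1024 * 1024;          // e.g. 128 MB block
            long inputBytes = 50L * 1024 * 1024 * 1024L;  // say, 50 GB input

            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            // Map task count is roughly total input / split size.
            System.out.println("approx. splits = "
                + (inputBytes / splitSize));
        }
    }

If that's accurate, the container count I'm seeing is mostly a
function of whatever block size the non-HDFS filesystem reports, plus
the min/max split size settings.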

When using the local filesystem (which in my case is really a network
share on a parallel filesystem), the defaults might be set
conservatively so as not to thrash local disks or create a processing
bottleneck.

Since that concern doesn't really apply to a parallel filesystem, I'd
rather tune the settings to utilize the local filesystem more
efficiently.
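
Concretely, the knobs I have in mind are the split min/max sizes and
the block size the local filesystem reports. A sketch of what I mean
(these are the property names I believe apply in 2.x, and the byte
values are placeholders rather than recommendations):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class TuneSplitsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // My assumption: the block size reported for the local/shared
            // filesystem is governed by fs.local.block.size, which then
            // feeds the split computation sketched above.
            conf.setLong("fs.local.block.size", 128L * 1024 * 1024);

            Job job = Job.getInstance(conf, "split-tuning-sketch");

            // Same effect as setting
            // mapreduce.input.fileinputformat.split.minsize / .maxsize
            // in mapred-site.xml; the byte values are placeholders.
            FileInputFormat.setMinInputSplitSize(job, 1L);
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            // ... mapper/reducer classes, input/output paths, etc. ...
        }
    }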

Are there any pointers to where in the source code I should look in
order to tweak these parameters?

Thanks,
Calvin

[1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems

On Tue, Aug 12, 2014 at 12:29 PM, Calvin <iphcal...@gmail.com> wrote:
> Hi all,
>
> I've instantiated a Hadoop 2.4.1 cluster and I've found that running
> MapReduce applications will parallelize differently depending on what
> kind of filesystem the input data is on.
>
> Using HDFS, a MapReduce job will spawn enough containers to maximize
> use of all available memory. For example, on a 3-node cluster with
> 172GB of memory and each map task allocating 2GB, about 86
> application containers will be created.
>
> On a filesystem that isn't HDFS (like NFS or, in my use case, a
> parallel filesystem), a MapReduce job will only use a fraction of the
> available capacity (e.g., with the same 3-node cluster, only about
> 25-40 containers are created). Since I'm using a parallel filesystem,
> I'm not as concerned with the bottlenecks one would find if one were
> to use NFS.
>
> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
> configuration that will allow me to effectively maximize resource
> utilization?
>
> Thanks,
> Calvin
