15K mappers on a 4-node system will definitely crash it unless you have
tuned YARN (RM, NM) well. That many mappers reading data off a few disks in
parallel can create a disk storm, and disk can also turn out to be your
bottleneck. Pig creates 1 map per 128MB of data (the default value of
pig.maxCombinedSplitSize), so 15K mappers means you are reading about 1.9 TB
of data. Based on the memory capacity you have, you can reduce the number of
mappers. For example: if you have 44G per node available for tasks (assuming
48G RAM with some memory taken by the NodeManager, DataNode, etc.) and you
run mappers with a 1G heap (mapreduce.map.java.opts) and 1.5G containers
(mapreduce.map.memory.mb), that is 4 nodes x 44G = 176G in total, which at
1.5G per container means you can run about 117 containers in parallel.
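If you go with that sizing, a minimal sketch of the corresponding settings
in the Pig script (these are standard Hadoop properties; the 1G heap / 1.5G
container values are just the example numbers above, not a recommendation
for your cluster):

-- sketch only: heap/container sizes taken from the example above
set mapreduce.map.java.opts '-Xmx1024m'
set mapreduce.map.memory.mb 1536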

set pig.maxCombinedSplitSize 10737418240

The above setting will make each map process 10G of data, which will result
in about 190 maps, and you should be able to run without bringing down your
cluster.
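
That set statement just goes at the top of the script, before the LOAD (a
sketch using the LOAD from your mail; the rest of the script stays the same):

set pig.maxCombinedSplitSize 10737418240

data = LOAD 'mypath/data_2015*'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
-- FILTER and the rest of the script unchanged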

Regards,
Rohini

On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <olaf.colli...@gmail.com>
wrote:

> I am using a Cloudera Hadoop system with only 4 nodes, but a lot of disk
> space (*200TB*).
>
> In my pig script, I load several monthly files that are about *200Gb* in
> size each.
>
> I load my data like this
>
> data = LOAD 'mypath/data_2015*'
>
>           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
>
> then I FILTER the data, which removes about 80% of it.
>
> I noticed that if I load about one year of data in my Pig script, Pig
> creates about 15k mappers and the whole process takes approximately 3 hours
> (including the reduce step).
>
> Instead, if I load 2 years of data, then Pig creates about 30k mappers and
> basically all the nodes become unhealthy after processing for more than 15
> hours.
>
> Am I hitting some kind of bottleneck here? Or are there some default options
> I should play with?
>
> Many thanks!
>
