Hello Rohini

Super helpful, thanks!
I was able to get the exact characteristics of my cluster. Here they are:

Block size 128MB, 300TB of raw storage (100TB usable once you account for
replication), and each of the 4 nodes has 384GB of RAM
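
If I redo your arithmetic with these numbers (a rough check on my side, keeping your
1.5GB container size and assuming roughly 4GB per node stays reserved for the node
manager, data node, etc.):

(384GB - 4GB) / 1.5GB per container ~ 253 containers per node
253 containers x 4 nodes ~ 1000 containers in parallel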

Does that change your answer?

Thanks again!!

On 27 May 2016 at 17:09, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
> 15K mappers on a 4 node system will definitely crash it unless you have
> tuned YARN (RM, NM) well. That many mappers reading data off a few disks in
> parallel can create a disk storm, and disk I/O can also turn out to be your
> bottleneck. Pig creates 1 map per 128MB of data (the pig.maxCombinedSplitSize
> default value), so 15K mappers means you are reading about 1.9TB of data. Based
> on the memory capacity you have, you can reduce the number of mappers. For
> example, if you have 44GB of heap per node for tasks (assuming 48GB RAM with
> some memory taken by the node manager, data node, etc.) and you are running
> mappers with a 1GB heap (mapreduce.map.java.opts) and 1.5GB
> (mapreduce.map.memory.mb) container size, you can run about 117 containers in
> parallel across the cluster.
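>
> For reference, a minimal sketch of the per-job memory settings from that example,
> expressed as Pig set statements (the values are just the example numbers above; the
> per-node container memory itself is cluster-side config in yarn-site.xml, not a job
> setting):
>
> set mapreduce.map.memory.mb 1536
> set mapreduce.map.java.opts '-Xmx1024m'
> -- cluster side, in yarn-site.xml: yarn.nodemanager.resource.memory-mb = 45056 (44GB)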
>
> set pig.maxCombinedSplitSize 10737418240
>
> The pig.maxCombinedSplitSize setting above will make each map process 10GB of
> data, which for ~1.9TB works out to roughly 190 maps, and you should be able
> to run the job without bringing down your cluster.
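>
> Put together with the load statement from your script, the top of the script would
> look something like this (same setting as above, just shown in context):
>
> set pig.maxCombinedSplitSize 10737418240
> data = LOAD 'mypath/data_2015*'
>        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedload');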
>
> Regards,
> Rohini
>
> On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <olaf.colli...@gmail.com>
> wrote:
>
>> I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
>> space (*200TB*).
>>
>> In my Pig script, I load several monthly files that are each about *200GB* in
>> size.
>>
>> I load my data like this:
>>
>> data = LOAD 'mypath/data_2015*'
>>        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedload');
>>
>> Then I FILTER the data, removing roughly 80% of it in that step.
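>>
>> Just to illustrate the shape of that step (the field name below is made up, not my
>> real one; the actual condition is on one of the JSON fields):
>>
>> -- hypothetical field name, for illustration only
>> filtered = FILTER data BY $0#'event_type' == 'purchase';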
>>
>> I noticed that if I load about one year of data in my Pig script, Pig creates
>> about 15k mappers and the whole process takes approximately 3 hours (including
>> the reduce step).
>>
>> By contrast, if I load 2 years of data, Pig creates about 30k mappers and
>> basically all the nodes become unhealthy after processing for more than 15
>> hours.
>>
>> Am I hitting some kind of bottleneck here? Or are there some default options
>> I should play with?
>>
>> Many thanks!
>>
