Hello Rohini,

Super helpful, thanks! I was able to get the exact characteristics of my
cluster. Here they are:
Block size 128MB, 300TB of raw data storage (100TB usable if you account for
replication), and each of the 4 nodes has 384GB RAM.

Does that change your answer? I also took a stab at redoing your container
math for these specs below the quoted messages. Thanks again!!

On 27 May 2016 at 17:09, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:

> 15K mappers on a 4 node system will definitely crash it unless you have
> tuned YARN (RM, NM) well. That many mappers reading data off a few disks in
> parallel can create a disk storm, and disk can also turn out to be your
> bottleneck. Pig creates 1 map per 128MB (the pig.maxCombinedSplitSize
> default value) of data, so 15K mappers means you are reading 1.9 TB of
> data. Based on the memory capacity you have, you can reduce the number of
> mappers. For example: if you have 44G of heap per node for tasks (assuming
> 48G RAM, with some memory taken for the node manager, data node, etc.) and
> you are running mappers with 1G heap (mapreduce.map.java.opts) and 1.5G
> (mapreduce.map.memory.mb) container sizes, you can run 117 containers in
> parallel.
>
>     set pig.maxCombinedSplitSize 10737418240
>
> The above setting will make each map process 10G of data, which will
> create about ~190 maps, and you should be able to run without bringing
> down your cluster.
>
> Regards,
> Rohini
>
> On Wed, May 11, 2016 at 10:17 AM, Olaf Collider <olaf.colli...@gmail.com>
> wrote:
>
>> I am using a Cloudera Hadoop system with only 4 nodes, but a lot of disk
>> space (*200TB*).
>>
>> In my Pig script, I load several monthly files that are each about
>> *200GB* in size.
>>
>> I load my data like this:
>>
>>     data = LOAD 'mypath/data_2015*'
>>         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedload');
>>
>> Then I FILTER the data, which removes probably 80% of it.
>>
>> I noticed that if I load about one year of data, Pig creates about 15k
>> mappers and the whole process takes approximately 3 hours (including the
>> reduce step).
>>
>> If I instead load 2 years of data, Pig creates about 30k mappers and
>> basically all the nodes become unhealthy after processing for more than
>> 15 hours.
>>
>> Am I hitting some kind of bottleneck here? Or are there some default
>> options I should play with?
>>
>> Many thanks!
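
P.S. I tried redoing your container math for these specs (just my own
back-of-the-envelope estimate, assuming roughly 380G of each node's 384G is
left for YARN containers after the node manager, data node, etc., and keeping
your 1.5G mapreduce.map.memory.mb container size). That gives about
380 / 1.5 = ~253 containers per node, or roughly 1000 containers across the 4
nodes, so the ~190 maps from the 10G split size should fit in a single wave.
Here is where I plan to put the setting in my script:

    -- apply the suggested 10G combined split size before the LOAD
    set pig.maxCombinedSplitSize 10737418240;

    data = LOAD 'mypath/data_2015*'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedload');

    -- the FILTER and the rest of the script stay unchanged

Does that look right to you?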