I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk space (*200TB*).
In my Pig script, I load several monthly files that are each about *200GB* in size. I load the data like this: `data = LOAD 'mypath/data_2015*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');` and then I FILTER the data, which removes roughly 80% of it.

I noticed that if I load about one year of data, Pig creates about 15k mappers and the whole process takes approximately 3 hours (including the reduce step). If instead I load 2 years of data, Pig creates about 30k mappers and basically all the nodes become unhealthy after processing for more than 15 hours.

Am I hitting some kind of bottleneck here? Or are there some default options I should play with? Many thanks!
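To make it concrete, here is a simplified sketch of what the script does (the field names, filter condition, aggregation, and output path are just placeholders; the real filter and reduce logic are more involved):

```pig
-- register the elephant-bird jars beforehand; load monthly JSON files as a map per record
data = LOAD 'mypath/data_2015*'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- filter step: keeps only the records of interest, dropping roughly 80% of the input
filtered = FILTER data BY (chararray)(json#'status') == 'active';

-- placeholder for the downstream grouping / reduce step
grouped = GROUP filtered BY (chararray)(json#'user_id');
counts  = FOREACH grouped GENERATE group, COUNT(filtered);

STORE counts INTO 'mypath/output/counts_2015';
```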