I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
space (*200TB*).
In my Pig script, I load several monthly files that are each about *200GB*
in size.
I load my data like this:

    data = LOAD 'mypath/data_2015*'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
Then I FILTER the data, which removes roughly 80% of the records.
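The filter step looks roughly like this (the field name and condition are
just placeholders for what my script actually does):

```pig
-- hypothetical filter keeping ~20% of the records;
-- 'type' is an assumed field in the JSON map
filtered = FILTER data BY (json#'type' == 'purchase');
```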
I noticed that if I load about one year of data, Pig creates about 15k
mappers and the whole process takes approximately 3 hours (including the
reduce step).
If instead I load 2 years of data, Pig creates about 30k mappers, and
basically all the nodes become unhealthy after processing for more than 15
hours.
Am I hitting some kind of bottleneck here? Or are there some default
options I should tune?
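For what it's worth, I have not changed any of the split-related defaults.
My understanding is that Pig can combine small input splits to reduce the
number of mappers with properties like the following (the 1 GB value below
is just an example, not something I have tested):

```pig
-- assumed defaults; I have NOT set these explicitly in my script
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 1073741824;  -- target ~1 GB of input per mapper
```

Would tuning these be the right direction, or is the problem elsewhere?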
Many thanks!