I am using a Cloudera Hadoop cluster with only 4 nodes, but a lot of disk
space (*200TB*).

In my Pig script, I load several monthly files that are each about *200GB*
in size.

I load the data like this:

data = LOAD 'mypath/data_2015*'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Then I FILTER the data, which removes roughly 80% of the records.
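
The rest of the script is basically a filter followed by a store, along
these lines (the condition here is only a placeholder, my real filter is
more involved):

-- '-nestedLoad' gives one map per record; the condition below is
-- just a placeholder for the real (more involved) filter
filtered = FILTER data BY (chararray)($0#'status') == 'active';

STORE filtered INTO 'mypath/output';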

I noticed that if I load about one year of data, Pig creates about 15k
mappers and the whole process takes approximately 3 hours (including the
reduce step).

If instead I load 2 years of data, Pig creates about 30k mappers and
basically all of the nodes become unhealthy after more than 15 hours of
processing.

Am I hitting some kind of bottleneck here? Or are there some default
options I should play with?
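
For example, I was wondering whether combining input splits would bring the
mapper count down, something along these lines (I'm not sure these are the
right knobs, or whether split combination even kicks in with the
elephant-bird loader):

-- split combination defaults to true, shown here only to be explicit
SET pig.splitCombination true;
-- let Pig pack small splits together, up to ~1 GB of input per mapper
SET pig.maxCombinedSplitSize 1073741824;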

Many thanks!
