Any idea why I get a lot of GC pauses when I use more containers?

2015-02-05 8:59 GMT+01:00 Guillermo Ortiz <>:
> I'm not caching the data. By "each iteration" I mean each 128 MB block
> that an executor has to process.
> The code is pretty simple:
> final Conversor c = new Conversor(null, null, null, longFields,typeFields);
> SparkConf conf = new SparkConf().setAppName("Simple Application");
> JavaSparkContext sc = new JavaSparkContext(conf);
> JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
> // Parse each fixed-length binary record into a line of text.
> JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
>     @Override
>     public String call(byte[] arg0) throws Exception {
>         return c.parse(arg0).toString();
>     }
> });
> rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis()+ "/");
> The parse function just takes an array of bytes and applies some
> transformations, for example:
> [0..3] an integer, [4..20] a String, [21..27] another String, and so on.
> It's just test code; I'd like to understand what is happening.
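For reference, a minimal sketch of this kind of fixed-width parsing. The real Conversor class is not shown in this thread, so everything here is hypothetical: the class name FixedWidthRecord, the big-endian integer, the US-ASCII space-padded strings, and the comma-separated output are all assumptions. Only the byte offsets come from the description above. Decoding each field straight from its byte range creates one String per field, with no intermediate substring copies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the parse step; not the Conversor from the thread.
final class FixedWidthRecord {
    // Offsets follow the description: [0..3] int, [4..20] String, [21..27] String.
    static final int RECORD_LENGTH = 28;

    static String parse(byte[] record) {
        if (record.length < RECORD_LENGTH) {
            throw new IllegalArgumentException("record too short: " + record.length);
        }
        // Bytes 0..3: integer, assumed big-endian (ByteBuffer's default order).
        int id = ByteBuffer.wrap(record, 0, 4).getInt();
        // Bytes 4..20 (17 bytes) and 21..27 (7 bytes): assumed US-ASCII, space-padded.
        String first = new String(record, 4, 17, StandardCharsets.US_ASCII).trim();
        String second = new String(record, 21, 7, StandardCharsets.US_ASCII).trim();
        return new StringBuilder(48)
                .append(id).append(',').append(first).append(',').append(second)
                .toString();
    }

    // Test helper: builds a space-padded record in the assumed layout.
    static byte[] build(int id, String first, String second) {
        byte[] record = new byte[RECORD_LENGTH];
        Arrays.fill(record, (byte) ' ');
        ByteBuffer.wrap(record).putInt(id);
        byte[] f = first.getBytes(StandardCharsets.US_ASCII);
        byte[] s = second.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(f, 0, record, 4, Math.min(f.length, 17));
        System.arraycopy(s, 0, record, 21, Math.min(s.length, 7));
        return record;
    }

    public static void main(String[] args) {
        System.out.println(parse(build(42, "hello", "world")));
    }
}
```

Whether this allocates less than the real parse depends on what Conversor actually does, so treat it only as a baseline to compare against.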
> 2015-02-04 18:57 GMT+01:00 Sandy Ryza <>:
>> Hi Guillermo,
>> What exactly do you mean by "each iteration"?  Are you caching data in
>> memory?
>> -Sandy
>> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <>
>> wrote:
>>> I'm running a Spark job that processes an 80 GB file in HDFS.
>>> I have 5 slaves:
>>> (32 cores / 256 GB / 7 physical disks) x 5
>>> I have been trying many different configurations with YARN.
>>> yarn.nodemanager.resource.memory-mb 196Gb
>>> yarn.nodemanager.resource.cpu-vcores 24
>>> I have tried running the job with different numbers of executors and
>>> different amounts of executor memory (1-4g).
>>> With 20 executors, each iteration (128 MB) takes 25s and tasks never
>>> spend a really long time waiting on GC.
>>> With around 60 executors, the processing time is about 45s and
>>> some tasks take up to one minute because of GC.
>>> I have no idea why GC kicks in more when I run more executors
>>> simultaneously.
>>> The other question is why each block takes longer to process.
>>> My theory is that there are only 7 physical disks per node, and
>>> 5 processes writing is not the same as 20.
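To put rough numbers on the disk theory, here is a small sketch of the arithmetic. It assumes YARN spreads the executors evenly across the 5 nodes, which is not guaranteed, and the class and method names are made up for illustration. The node, disk, and executor counts are the ones given in this message.

```java
// Hypothetical helper for the back-of-the-envelope numbers; assumes an even spread.
final class ExecutorDensity {
    // Executors on the busiest node, using ceiling division.
    static int executorsPerNode(int executors, int nodes) {
        return (executors + nodes - 1) / nodes;
    }

    // Executors sharing each physical disk on that node.
    static double executorsPerDisk(int executors, int nodes, int disksPerNode) {
        return (double) executorsPerNode(executors, nodes) / disksPerNode;
    }

    public static void main(String[] args) {
        int nodes = 5;
        int disksPerNode = 7;
        for (int executors : new int[] {20, 60}) {
            System.out.printf("%d executors: %d per node, %.2f per disk%n",
                    executors,
                    executorsPerNode(executors, nodes),
                    executorsPerDisk(executors, nodes, disksPerNode));
        }
    }
}
```

Under that assumption, 20 executors means 4 per node (fewer executors than disks), while 60 means 12 per node (more than one executor per disk), which is at least consistent with slower writes at the higher count.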
>>> The code is pretty simple: it's just a map function that parses a line
>>> and writes the output to HDFS. There are a lot of substring calls inside
>>> the function, which could be causing the GC.
>>> Any theories?
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
