It seems there are too many distinct groups processed in a single task, which triggers the problem.
How many files does your dataset have, and how large is each file? It seems your query will be executed in two stages: table scan and map-side aggregation in the first stage, and the final round of reduce-side aggregation in the second stage. Can you take a look at the number of tasks launched in these two stages?

Thanks,

Yin

On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas <johngou...@gmail.com> wrote:

> Hi there, I set the executor memory to 8g but it didn't help
>
> On 18 March 2015 at 13:59, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> You should probably increase executor memory by setting
>> "spark.executor.memory".
>>
>> The full list of available configurations can be found here:
>> http://spark.apache.org/docs/latest/configuration.html
>>
>> Cheng
>>
>> On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:
>>
>>> Hi there,
>>>
>>> I was trying the new DataFrame API with some basic operations on a
>>> Parquet dataset.
>>> I have 7 nodes of 12 cores each and 8GB RAM allocated to each worker
>>> in standalone cluster mode.
>>> The code is the following:
>>>
>>> val people = sqlContext.parquetFile("/data.parquet")
>>> val res = people.groupBy("name", "date")
>>>   .agg(sum("power"), sum("supply"))
>>>   .take(10)
>>> System.out.println(res)
>>>
>>> The dataset consists of 16 billion entries.
>>> The error I get is: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> My configuration is:
>>>
>>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>> spark.driver.memory 6g
>>> spark.executor.extraJavaOptions -XX:+UseCompressedOops
>>> spark.shuffle.manager sort
>>>
>>> Any idea how I can work around this?
>>>
>>> Thanks a lot
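Pulling the thread's advice together: the suggested remedies are raising "spark.executor.memory" and, given the diagnosis that too many distinct groups land in a single task, increasing the number of shuffle partitions so the reduce-side aggregation is spread over more tasks. A minimal spark-defaults.conf sketch along those lines is below; the specific values (8g, 400 partitions) are assumptions for illustration, not settings confirmed to fix this case, and should be tuned to the 8GB-per-worker cluster described above.

```properties
# Settings already in use, from the original post
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.driver.memory             6g
spark.executor.extraJavaOptions -XX:+UseCompressedOops
spark.shuffle.manager           sort

# Suggested in the thread: give each executor more heap
# (the value 8g is an example; it must fit within the worker's 8GB allocation)
spark.executor.memory           8g

# Addresses the "too many distinct groups per task" diagnosis:
# more shuffle partitions means fewer (name, date) groups per reduce task.
# The default in Spark 1.3 is 200; 400 is an illustrative guess.
spark.sql.shuffle.partitions    400
```

Alternatively, the same shuffle-partition setting can be applied per session with sqlContext.sql("SET spark.sql.shuffle.partitions=400") before running the aggregation.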