No, I never really resolved the problem, except by increasing the PermGen
space, which only partially solved it. I still have to restart the job
multiple times to make the whole job complete (it stores intermediate
results).
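
For reference, this is roughly how the space is being increased at the moment
(the sizes below are just the values currently being tried, not a
recommendation): the driver gets -XX:MaxPermSize through --driver-java-options
on spark-submit, and the executors through the executor Java options, along
these lines:

    // Sketch only: executor PermGen is set via the Spark conf, while the
    // driver's MaxPermSize has to go on the spark-submit command line,
    // since the driver JVM is already running when this code executes.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)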

The parquet data sources have about 70 columns, and yes Cheng, it works
fine when loading only a small sample of the data.
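
To be concrete, by a small sample I mean something like the snippet below,
which runs without any PermGen trouble (the path is made up, and sqlContext
is the usual SQLContext from the shell):

    // A single source, and only a slice of it: no problems at this scale.
    val sample = sqlContext.read
      .format("org.apache.spark.sql.parquet")
      .load("/data/sources/source_0001")   // placeholder path
      .limit(10000)
    sample.count()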

Thankful for any hints,
Anders

On Wed, Jul 22, 2015 at 5:29 PM Cheng Lian <lian.cs....@gmail.com> wrote:

>  How many columns are there in these Parquet files? Could you load a small
> portion of the original large dataset successfully?
>
> Cheng
>
>
> On 6/25/15 5:52 PM, Anders Arpteg wrote:
>
> Yes, both the driver and the executors. It works a little bit better with
> more space, but there is still a leak that will cause failure after a number
> of reads. There are about 700 different data sources that need to be loaded,
> lots of data...
>
>  On Thu, 25 Jun 2015 08:02, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>
>> Did you try increasing the perm gen for the driver?
>>
>> Regards
>> Sab
>>
>> On 24-Jun-2015 4:40 pm, "Anders Arpteg" <arp...@spotify.com> wrote:
>>
> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
>>> parquet reader (the org.apache.spark.sql.parquet format), the following
>>> exceptions are thrown:
>>>
>>> Exception in thread "task-result-getter-0"
>>> Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "task-result-getter-0"
>>> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: PermGen space
>>> Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError: PermGen space
>>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: PermGen space
>>>
>>
>>>  and many more like these from different threads. I've tried increasing
>>> the PermGen space using the -XX:MaxPermSize VM setting, but even after
>>> tripling the space, the same errors occur. I've also tried storing
>>> intermediate results, and am able to get the full job completed by running
>>> it multiple times and restarting from the last successful intermediate result.
>>> There seems to be some memory leak in the parquet format. Any hints on how
>>> to fix this problem?
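>>>
>>> In rough outline, the intermediate-result logic looks something like the
>>> sketch below (paths and the source-list helper are invented for the example,
>>> and sc/sqlContext are the usual ones): each source is written to its own
>>> output directory, and on a restart the sources that already have output are
>>> skipped.
>>>
>>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>>
>>>     val fs = FileSystem.get(sc.hadoopConfiguration)
>>>     val sources: Seq[String] = loadSourceList()   // hypothetical helper, ~700 input paths
>>>     sources.foreach { src =>
>>>       val out = s"/tmp/intermediate/${new Path(src).getName}"   // made-up output location
>>>       if (!fs.exists(new Path(out))) {   // skip sources finished in an earlier run
>>>         val df = sqlContext.read.format("org.apache.spark.sql.parquet").load(src)
>>>         df.write.parquet(out)   // stand-in for the real per-source processing
>>>       }
>>>     }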
>>>
>>>  Thanks,
>>> Anders
>>>
>>
