Hi Anders,

Did you ever get to the bottom of this issue? I'm encountering it too, but only in "yarn-cluster" mode running on Spark 1.4.0. I was thinking of trying 1.4.1 today.
Michael

On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg <arp...@spotify.com> wrote:
> Yes, both the driver and the executors. Works a little bit better with
> more space, but still a leak that will cause failure after a number of
> reads. There are about 700 different data sources that need to be loaded,
> lots of data...
>
> Thu 25 Jun 2015 08:02 Sabarish Sasidharan <sabarish.sasidha...@manthan.com>
> wrote:
>
>> Did you try increasing the perm gen for the driver?
>>
>> Regards
>> Sab
>> On 24-Jun-2015 4:40 pm, "Anders Arpteg" <arp...@spotify.com> wrote:
>>
>>> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
>>> parquet reader (the org.apache.spark.sql.parquet format), the following
>>> exceptions are thrown:
>>>
>>> Exception in thread "task-result-getter-0"
>>> Exception: java.lang.OutOfMemoryError thrown from the
>>> UncaughtExceptionHandler in thread "task-result-getter-0"
>>> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError:
>>> PermGen space
>>> Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError:
>>> PermGen space
>>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError:
>>> PermGen space
>>>
>>> and many more like these from different threads. I've tried increasing
>>> the PermGen space using the -XX:MaxPermSize VM setting, but even after
>>> tripling the space, the same errors occur. I've also tried storing
>>> intermediate results, and am able to get the full job completed by running
>>> it multiple times and restarting from the last successful intermediate
>>> result. There seems to be some memory leak in the parquet format. Any
>>> hints on how to fix this problem?
>>>
>>> Thanks,
>>> Anders
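For anyone else landing on this thread: the suggestion above (raising PermGen on both the driver and the executors) can be passed through spark-submit. A sketch of what that might look like in yarn-cluster mode follows; the 512m value and the jar name are illustrative placeholders, not something Anders confirmed, and he reported that even a tripled PermGen did not cure the leak, so treat this as mitigation rather than a fix:

```shell
# Sketch: raise PermGen for both the driver and the executors.
# The sizes and jar name are illustrative assumptions -- tune for your job.
# (PermGen only exists on Java 7 and earlier; it was removed in Java 8.)
spark-submit \
  --master yarn-cluster \
  --driver-java-options "-XX:MaxPermSize=512m" \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512m" \
  your-job.jar
```

In yarn-cluster mode the driver runs inside the YARN application master, so setting the flag only in a local shell session would not reach it; that is one reason the problem can appear in yarn-cluster mode but not in client mode.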