Yes, both the driver and the executors. It works a little better with more space, but there is still a leak that causes failure after a number of reads. There are about 700 different data sources that need to be loaded, so lots of data...
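For reference, a minimal sketch of how the setting can be applied to both sides (the 512m value is just illustrative):

    import org.apache.spark.SparkConf

    // Executor JVMs pick this up from the Spark config:
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")

    // The driver JVM is already running by the time SparkConf is read,
    // so its PermGen has to be set at launch time instead, e.g.:
    //   spark-submit --driver-java-options "-XX:MaxPermSize=512m" ...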
Thu 25 Jun 2015 08:02 Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:

> Did you try increasing the perm gen for the driver?
>
> Regards
> Sab
>
> On 24-Jun-2015 4:40 pm, "Anders Arpteg" <arp...@spotify.com> wrote:
>
>> When reading large (and many) datasets with the Spark 1.4.0 DataFrames
>> parquet reader (the org.apache.spark.sql.parquet format), the following
>> exceptions are thrown:
>>
>> Exception in thread "task-result-getter-0"
>> Exception: java.lang.OutOfMemoryError thrown from the
>> UncaughtExceptionHandler in thread "task-result-getter-0"
>> Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError:
>> PermGen space
>> Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError:
>> PermGen space
>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError:
>> PermGen space
>>
>> and many more like these from different threads. I've tried increasing
>> the PermGen space using the -XX:MaxPermSize VM setting, but even after
>> tripling the space, the same errors occur. I've also tried storing
>> intermediate results, and am able to get the full job completed by
>> running it multiple times and restarting from the last successful
>> intermediate result. There seems to be some memory leak in the parquet
>> format. Any hints on how to fix this problem?
>>
>> Thanks,
>> Anders
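In case it helps anyone else, the "store intermediate results and resume" workaround mentioned above looks roughly like the sketch below; the helper name, source list, and output paths are made up for illustration (Spark 1.4 DataFrame API):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SQLContext

    // Hypothetical resume loop: persist each source's result to its own
    // parquet directory, and skip sources already completed by an earlier run.
    def loadAll(sqlContext: SQLContext, sources: Seq[String], outDir: String): Unit = {
      val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
      for (src <- sources) {
        val out = s"$outDir/${src.replaceAll("[^A-Za-z0-9]", "_")}"
        // The _SUCCESS marker is only written once the parquet output has
        // committed cleanly, so it is a safe "already done" check on rerun.
        if (!fs.exists(new Path(s"$out/_SUCCESS"))) {
          sqlContext.read.parquet(src).write.parquet(out)
        }
      }
    }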