When reading many large datasets with the Spark 1.4.0 DataFrames Parquet reader (the org.apache.spark.sql.parquet format), the following exceptions are thrown:
Exception in thread "task-result-getter-0" Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "task-result-getter-0"
Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: PermGen space
Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError: PermGen space
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: PermGen space

and many more like these, from different threads.

I've tried increasing the PermGen space with the -XX:MaxPermSize JVM option, but even after tripling it the same errors occur. I've also tried storing intermediate results; the full job does complete if I run it several times, restarting each run from the last successful intermediate result. There seems to be a memory leak in the Parquet reader. Any hints on how to fix this problem?

Thanks, Anders
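Edit: in case it helps, below is a minimal sketch of how I'm passing the flag. The 512m size, app name, and input path are just placeholders. Note that the task-result-getter threads belong to the driver, and that spark.driver.extraJavaOptions set through SparkConf only takes effect in cluster deploy mode; in client mode the driver JVM is already running, so the flag has to go through spark-submit --driver-java-options or spark-defaults.conf instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Raise PermGen on both JVMs; the size and path below are placeholders.
val conf = new SparkConf()
  .setAppName("parquet-read")
  // Executor-side PermGen (picked up when the executors launch).
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
  // Driver-side PermGen: only effective if set before the driver JVM
  // starts, i.e. in cluster deploy mode. In client mode use
  //   spark-submit --driver-java-options "-XX:MaxPermSize=512m"
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read one of the Parquet datasets that triggers the errors (placeholder path).
val df = sqlContext.read.parquet("/path/to/parquet")
```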