When reading many large datasets with the Spark 1.4.0 DataFrames Parquet
reader (the org.apache.spark.sql.parquet format; the read code is sketched at
the end of this message), the following exceptions are thrown:

Exception in thread "task-result-getter-0"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "task-result-getter-0"
Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError:
PermGen space
Exception in thread "task-result-getter-1" java.lang.OutOfMemoryError:
PermGen space
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError:
PermGen space

and many more like these from different threads. I've tried increasing the
PermGen space using the -XX:MaxPermSize JVM setting (the submit command is
sketched below), but even after tripling the space the same errors occur. I've
also tried persisting intermediate results, and I can get the full job to
complete by running it multiple times and restarting from the last successful
intermediate result. There seems to be a memory leak somewhere in the Parquet
format support. Any hints on how to fix this problem?
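
For reference, the reads look roughly like this (a simplified sketch; the
path is a placeholder and the surrounding job logic is omitted):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is the existing SparkContext

// Spark 1.4 DataFrameReader; the real job reads many such datasets
// in the same application before aggregating them.
val df = sqlContext.read.parquet("/path/to/one/of/the/datasets")
df.count()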
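
And this is roughly how I pass the PermGen setting to the driver and the
executors (class name, jar and sizes are placeholders, not the exact values
I used):

# placeholder class name and jar; adjust sizes as needed
spark-submit \
  --class com.example.MyJob \
  --driver-java-options "-XX:MaxPermSize=512m" \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512m" \
  my-job-assembly.jar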

Thanks,
Anders
