Hi all,

Doing some simple column transformations (e.g. trimming strings) on a
DataFrame using UDFs. This DataFrame is in Avro format and being loaded off
HDFS. The job has about 16,000 parts/tasks.
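Roughly what we're running (simplified; the paths and column names here are placeholders, not the real ones):

  import org.apache.spark.sql.functions.udf

  // simple cleanup UDF that trims leading/trailing whitespace
  val trimStr = udf((s: String) => if (s == null) null else s.trim)

  // load the Avro data off HDFS (via the spark-avro package)
  val df = sqlContext.read.format("com.databricks.spark.avro").load("/my/input/path")

  // apply the UDF to a couple of string columns and write back out as Parquet
  df.withColumn("name", trimStr(df("name")))
    .withColumn("address", trimStr(df("address")))
    .write.parquet("/my/path")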

About halfway through, the job fails with this message:

org.apache.spark.SparkException: Job aborted due to stage failure: Total
size of serialized results of 6843 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)

This seems strange, as we are saving the DataFrame using the
df.write.parquet("/my/path") API.

As far as I understand, the code that performs this check,
https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L595
is only called from one place,
https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L47
in TaskResultGetter.
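
If I'm reading it right, that check just keeps a running total of the
serialized result size reported for every finished task, and aborts the stage
once the total crosses spark.driver.maxResultSize. Paraphrasing from memory
(not a verbatim quote of the 1.6 source), it looks roughly like:

  // in TaskSetManager; `size` is the serialized size of one task's result
  def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
    totalResultSize += size
    calculatedTasks += 1
    if (maxResultSize > 0 && totalResultSize > maxResultSize) {
      abort(s"Total size of serialized results of $calculatedTasks tasks " +
        s"($totalResultSize) is bigger than spark.driver.maxResultSize ($maxResultSize)")
      false
    } else {
      true
    }
  }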

So it looks like, even though we are not doing a .collect on the driver, it is
still returning metadata about the successful tasks. I would assume it is
returning the result value as a DirectTaskResult with a BoxedUnit as the
return type, which should be tiny. However, it seems this is not the case.

This code is being run from the Spark 1.6 shell.

Any ideas what could be happening here?

Cheers,
~N
