[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029993#comment-16029993 ]

Jeffrey Quinn commented on SPARK-20925:
---------------------------------------

Hi Sean,

Sorry for not providing adequate information. Thank you for responding so 
quickly.

The error message we get indicates that YARN is killing the containers; it is 
the executors that are running out of memory, not the driver.

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 184 in stage 74.0 failed 4 times, most recent failure: Lost task 184.3 in
stage 74.0 (TID 19110, ip-10-242-15-251.ec2.internal, executor 14):
ExecutorLostFailure (executor 14 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 21.5 GB of 20.9
GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

We tried a full parameter sweep, including enabling dynamic allocation and 
setting executor memory as high as 20 GB. The result was the same each time: 
the job failed with lost executors after YARN killed their containers.

I attempted to trace through the source code to see how `partitionBy` is 
implemented. I was surprised to see it cause this problem, since it seems like 
it should not require a shuffle, but unfortunately I am not experienced enough 
with the DataFrame source code to figure out what is going on. For now we are 
reimplementing the partitioning ourselves as a workaround (sketched below), but 
we are very curious to know what could have been happening here. My next step 
was going to be to obtain a full heap dump and poke around in it with my 
profiler; does that sound like a reasonable approach?
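
For reference, the manual partitioning we are prototyping as a workaround 
looks roughly like the sketch below. The details are still in flux: 
`inputSchema`, `expandedInputPath`, `columnMap`, and `outputPath` are the same 
values as in the original workflow quoted below, and collecting the distinct 
partition values on the driver assumes the partition column has fairly low 
cardinality.

```
// Rough sketch of the workaround: write each partition value separately
// instead of relying on DataFrameWriter#partitionBy.
val df = sparkSession.sqlContext
  .read
  .schema(inputSchema)
  .json(expandedInputPath)
  .select(columnMap: _*)

// Collect the distinct partition values on the driver (assumes low cardinality).
val partitionValues = df
  .select("partition_by_column")
  .distinct()
  .collect()
  .map(_.get(0))

// Write one directory per partition value, mimicking the Hive-style layout
// that partitionBy would have produced.
partitionValues.foreach { value =>
  df.filter(df("partition_by_column") === value)
    .drop("partition_by_column")
    .write
    .parquet(s"$outputPath/partition_by_column=$value")
}
```

The obvious downside is one pass over the input per partition value, so this 
only makes sense while the partition column has low cardinality.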

Thanks!

Jeff



> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20925
>                 URL: https://issues.apache.org/jira/browse/SPARK-20925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads data in JSON 
> format stored on S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
>         .read
>         .schema(inputSchema)
>         .json(expandedInputPath)
>         .select(columnMap:_*)
>         .write.partitionBy("partition_by_column")
>         .parquet(outputPath)
> ```
> Unfortunately, for larger inputs, this job consistently fails with containers 
> running out of memory. We observed containers with as much as 20 GB of memory 
> OOMing, which is surprising because the input data itself is only 15 GB 
> compressed and maybe 100 GB uncompressed.
> We were able to bisect the problem down to `partitionBy` by progressively 
> removing/commenting out parts of our workflow. Once we reached the state shown 
> above, removing `partitionBy` alone made the job succeed with no OOM.


