[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029984#comment-16029984 ]

Sean Owen commented on SPARK-20925:
-----------------------------------

Not enough info here -- is the JVM running out of memory? Is YARN killing it? 
Is the driver or the executor running out of memory?
All of those are typically matters of setting memory config properly, and not a 
Spark issue, so I am not sure this stands as a JIRA.
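
For reference, a minimal sketch of the kind of memory settings usually in play 
here (the config keys are standard Spark 2.1 / YARN settings; the values are 
placeholders, not a tuning recommendation):

```
// Minimal sketch, assuming a YARN client-mode job like the one reported.
// Values are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-to-partitioned-parquet")
  // Heap for each executor JVM.
  .config("spark.executor.memory", "8g")
  // Off-heap headroom YARN adds to the container size (MB); YARN kills the
  // container, not the JVM, when the container exceeds this total.
  .config("spark.yarn.executor.memoryOverhead", "2048")
  .getOrCreate()

// Note: spark.driver.memory has to be passed at submit time (e.g. via
// spark-submit --driver-memory), since in client mode the driver JVM is
// already running by the time this code executes.
```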

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20925
>                 URL: https://issues.apache.org/jira/browse/SPARK-20925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jeffrey Quinn
>
> Observed under the following conditions:
> Spark Version: Spark 2.1.0
> Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
> spark.submit.deployMode = client
> spark.master = yarn
> spark.driver.memory = 10g
> spark.shuffle.service.enabled = true
> spark.dynamicAllocation.enabled = true
> The job we are running is very simple: our workflow reads JSON data stored on 
> S3 and writes out partitioned Parquet files to HDFS.
> As a one-liner, the whole workflow looks like this:
> ```
> sparkSession.sqlContext
>         .read
>         .schema(inputSchema)
>         .json(expandedInputPath)
>         .select(columnMap:_*)
>         .write.partitionBy("partition_by_column")
>         .parquet(outputPath)
> ```
> Unfortunately, for larger inputs this job consistently fails with containers 
> running out of memory. We observed containers of up to 20 GB OOMing, which is 
> surprising because the input data itself is only 15 GB compressed and maybe 
> 100 GB uncompressed.
> We were able to bisect the problem down to `partitionBy` by progressively 
> removing/commenting out parts of our workflow. Once we reach the state shown 
> above, removing `partitionBy` lets the job succeed with no OOM.
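
A pattern sometimes tried to reduce per-task memory pressure in this situation 
is to cluster the data by the partition column before the write, so each task 
writes to fewer distinct partitions at once. A hedged sketch against the 
reporter's one-liner (identifiers reused from the quoted snippet; the 
repartition step is the only addition and is not a fix confirmed in this 
ticket):

```
// Illustrative variant only; the repartition step is an assumption, not a
// change proposed in SPARK-20925.
import org.apache.spark.sql.functions.col

sparkSession.sqlContext
        .read
        .schema(inputSchema)
        .json(expandedInputPath)
        .select(columnMap: _*)
        // Group rows with the same partition value into the same task before
        // writing, which tends to cut the number of files (and open Parquet
        // writers) each task produces.
        .repartition(col("partition_by_column"))
        .write.partitionBy("partition_by_column")
        .parquet(outputPath)
```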


