[ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey Quinn updated SPARK-20925:
----------------------------------
    Description: 
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true
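
For illustration only, a minimal sketch of applying the same settings when building the session programmatically (in practice they may well have been supplied via spark-submit or the EMR cluster configuration instead):

```
import org.apache.spark.sql.SparkSession

// Sketch only: mirrors the configuration listed above. Deploy mode and
// driver memory must be set before the driver JVM starts (e.g. via
// spark-submit --deploy-mode client --driver-memory 10g), so only the
// remaining settings are applied here.
val sparkSession = SparkSession.builder()
  .master("yarn")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true")
  .getOrCreate()
```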

The job we are running is very simple: our workflow reads JSON-formatted data 
stored on S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
// Read schema'd JSON from S3, project the desired columns, and write
// Parquet to HDFS partitioned by a single column.
sparkSession.sqlContext
        .read
        .schema(inputSchema)
        .json(expandedInputPath)
        .select(columnMap: _*)
        .write
        .partitionBy("partition_by_column")
        .parquet(outputPath)
```
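
With `partitionBy("partition_by_column")`, the output is laid out as one 
sub-directory per distinct value of that column 
(outputPath/partition_by_column=<value>/part-*.parquet), so the number of 
output directories written scales with the column's cardinality.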

Unfortunately, for larger inputs this job consistently fails with containers 
running out of memory. We have observed containers of up to 20 GB OOMing, 
which is surprising because the input data itself is only about 15 GB 
compressed and perhaps 100 GB uncompressed.

We were able to narrow the problem down to `partitionBy` by progressively 
removing/commenting out parts of our workflow. Once the workflow was reduced 
to the one-liner above, removing the `partitionBy` call made the job succeed 
with no OOM.
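
For concreteness, the variant that succeeds on the same input is identical except that the single `partitionBy` call is dropped:

```
// Same job with partitionBy removed: this version completes without OOM
// on the same input.
sparkSession.sqlContext
        .read
        .schema(inputSchema)
        .json(expandedInputPath)
        .select(columnMap: _*)
        .write
        .parquet(outputPath)
```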

> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20925
>                 URL: https://issues.apache.org/jira/browse/SPARK-20925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jeffrey Quinn


