[ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomohiro Tanaka updated SPARK-30735:
------------------------------------
    Description: 
h1. New functionality for {{partitionBy}}

To improve the performance of {{partitionBy}}, it helps to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}, to {{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems occur because rows with the same values in the specified columns are spread across many partitions.
 * A Spark application that uses {{partitionBy}} takes much longer (for example, [python - partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage is much higher than without it (tested with Spark 2.4.3). Please check the attachment: the left figure shows "not using repartition based on columns before partitionBy", and the right one shows "using repartition".

h2. How to use?

It's very simple. If you want to run {{repartition}} before {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} as the first argument of {{partitionBy}}.

Example:
{code:java}
val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}
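For reference, the same effect can be achieved with the current API by calling {{repartition}} on the partition columns explicitly before writing. A minimal sketch (the column name {{country}} and the paths are hypothetical examples):
{code:java}
import org.apache.spark.sql.functions.col

val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)

// Co-locate rows with the same partition-column value before writing,
// so each task writes to only a few partition directories instead of all of them.
df.repartition(col("country"))
  .write.format("json")
  .partitionBy("country")
  .save(<OUTPUT_PATH>)
{code}
The proposed boolean flag wraps this pattern so that callers do not have to repeat the column list in both {{repartition}} and {{partitionBy}}.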
 

 

  was:
h1. New functionality for {{partitionBy}}

To improve the performance of {{partitionBy}}, it helps to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}, to {{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems occur because rows with the same values in the specified columns are spread across many partitions.
 * A Spark application that uses {{partitionBy}} takes much longer (for example, [python - partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage is much higher than without it (tested with Spark 2.4.3).
 ** Not using repartition before partitionBy:
 ** Using repartition before partitionBy:

h2. How to use?

It's very simple. If you want to run {{repartition}} before {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} as the first argument of {{partitionBy}}.

Example:
{code:java}
val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}
 

 


> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.3, 2.4.4
>         Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>             Fix For: 3.0.0, 3.1.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
