[jira] [Commented] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

Dongjoon Hyun (Jira) Wed, 05 Feb 2020 13:48:07 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031065#comment-17031065
 ]


Dongjoon Hyun commented on SPARK-30735:
---------------------------------------

Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR.
Since it seems to be your first time, I want to give you some information.

- https://spark.apache.org/contributing.html

According to the above guideline, `Fix Version` is set only when a change is 
finally merged, so you should leave it empty. Also, we don't allow backporting 
of new features. Your contribution will land in Apache Spark 3.1 if it's 
merged, so you should use `3.1.0` for `Affected Version`. In other words, new 
improvements and features cannot affect old versions.

I'll adjust the fields appropriately. Thanks.

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.3, 2.4.4
>         Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To improve write performance with {{partitionBy}}, it helps to call 
> {{repartition}} on the partition columns before calling {{partitionBy}}. I 
> added a new variant of {{partitionBy}}: 
> {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}.
> When using {{partitionBy}}, the following problems occur because rows with 
> the same values in the columns specified in {{partitionBy}} are scattered 
> across partitions.
>  * A Spark application that uses {{partitionBy}} takes much longer to run 
> (for example, [python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage is much higher than when not 
> using {{partitionBy}} (I tested this with Spark 2.4.3).
>  * Additional information about the effect of {{partitionBy}} on memory 
> usage: please check the attachment (the left figure shows "using 
> partitionBy", the other shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want to call {{repartition}} before 
> {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:scala}
> val df  = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
> df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}
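>
> For comparison, a similar effect can be sketched with the existing API by 
> calling {{repartition}} on the same columns explicitly before 
> {{partitionBy}}. This is only a minimal sketch, not the proposed change 
> itself: the paths and the column names {{year}} and {{month}} are 
> hypothetical placeholders.
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark = SparkSession.builder()
>   .appName("repartition-before-partitionBy")
>   .getOrCreate()
>
> // Hypothetical paths and partition columns, for illustration only.
> val inputPath = "/tmp/input"
> val outputPath = "/tmp/output"
> val columns = Seq("year", "month")
>
> val df = spark.read.format("csv").option("header", true).load(inputPath)
>
> // Shuffle rows so that rows sharing the same partition-column values end up
> // in the same in-memory partition, then write one directory per value.
> df.repartition(columns.map(c => col(c)): _*)
>   .write.format("json")
>   .partitionBy(columns: _*)
>   .save(outputPath){code}
> Note that the explicit {{repartition}} adds a shuffle, so whether it helps 
> depends on the data and on how many distinct partition values there are.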
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
