[jira] [Commented] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

Tomohiro Tanaka (Jira) Wed, 05 Feb 2020 14:51:40 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031099#comment-17031099
 ]


Tomohiro Tanaka commented on SPARK-30735:
-----------------------------------------

Hello, [~dongjoon]! Thanks for your checking this JIRA and PR.

Sorry for my wrong inputs, and I understand guidelines about "versions" based 
on your comments.

Also, I update the title of my PR from [SPARK-30735][CORE] to 
[SPARK-30735][SQL] (If it's no need to be changed, please let me know).

 

Thank you very much for your kind help.

 

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance using partitionBy , calling {{repartition}} method 
> based on columns is much good before calling {{partitionBy}}. I added new 
> function: {color:#0747a6}{{partitionBy(<True | False>, columns>}}{color} to 
> {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}.
> When using {{paritionBy}}, following problems happen because of specified 
> columns in {{partitionBy}} are located separately.
>  * The spark application which includes {{partitionBy}} takes much longer 
> (for example, [[python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]])]
>  * When using {{partitionBy}}, memory usage increases much high compared with 
> not using {{partitionBy}} (as follows I tested with Spark ver.2.4.3).
>  * Additional information about memory usage affection by partitionBy: Please 
> check the attachment (the left figure shows "using partitionBy", the other 
> shows "not using partitionBy)".
> h2. How to use?
> It's very simple. If you want to use repartition method before 
> {{partitionBy}}, just you specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
> df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

Reply via email to