[ https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-30735:
----------------------------------
    Component/s:     (was: Spark Core)
                     SQL

> Improving writing performance by adding repartition based on columns to
> partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: * Spark-3.0.0
> * Scala: version 2.12.10
> * sbt 0.13.18, script ver: 1.3.7 (built using sbt)
> * Java: 1.8.0_231
> ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
> ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To improve write performance when using {{partitionBy}}, it helps to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. This issue adds a new overload, {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}, to {{partitionBy}}.
>
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems occur because the rows for the columns specified in {{partitionBy}} are scattered across partitions:
> * An application that uses {{partitionBy}} takes much longer (for example, [partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
> * Memory usage is much higher with {{partitionBy}} than without it (tested with Spark 2.4.3, as follows).
> * Additional information about the memory-usage impact of {{partitionBy}}: please see the attachment (the left figure shows "using partitionBy", the other shows "not using partitionBy").
>
> h2. How to use?
> It's very simple. If you want the {{repartition}} method to be applied before {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} as the first argument of {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
> df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
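For reference, the same effect can already be obtained with the existing public API by repartitioning on the partition columns before the write, which is what the proposed overload would do internally. This is a minimal sketch, not the patch itself; the column names ("year", "month") and paths are hypothetical, and an active SparkSession is required:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical example: "year" and "month" stand in for the real partition columns.
val df = spark.read.format("csv").option("header", true).load("/path/to/input")

df.repartition(col("year"), col("month"))  // co-locate all rows of each partition value in one task
  .write
  .format("json")
  .partitionBy("year", "month")            // each partition directory is then written by a single task
  .save("/path/to/output")
```

Repartitioning first means each task holds open writers for only one (or a few) partition values instead of all of them, which is why it reduces both the write time and the memory pressure described above.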