[ https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031065#comment-17031065 ]
Dongjoon Hyun edited comment on SPARK-30735 at 2/5/20 9:49 PM:
---------------------------------------------------------------

Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR. Since it seems to be your first time, I want to give you some information.
- https://spark.apache.org/contributing.html

According to the above guideline, we set `Fix Version` only when we finally merge, so you should keep it empty. Also, we don't allow backporting of new features; your contribution will land in Apache Spark 3.1 if it's merged, so you should use `3.1.0` for `Affected Version`. In other words, new improvements and features cannot affect old versions. Finally, `Target Version` is reserved for committers, so please keep it empty, too. I'll adjust the fields appropriately. Thanks.


> Improving writing performance by adding repartition based on columns to
> partitionBy for DataFrameWriter
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: * Spark-3.0.0
> * Scala: version 2.12.10
> * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
> * Java: 1.8.0_231
> ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
> ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance write performance with {{partitionBy}}, it helps a lot to call the {{repartition}} method on the partition columns before calling {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(<true | false>, columns)}}{color}, to {{partitionBy}}.
>
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because rows with the same values of the columns specified in {{partitionBy}} are scattered across many tasks:
> * A Spark application that uses {{partitionBy}} takes much longer (for example, [python - partitionBy taking too long while saving a dataset on S3 using Pyspark - Stack Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
> * When using {{partitionBy}}, memory usage increases much more than without {{partitionBy}} (I tested this with Spark 2.4.3).
> * For more detail on how {{partitionBy}} affects memory usage, please check the attachment (the left figure shows "using partitionBy", the other shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want {{repartition}} to be called before {{partitionBy}}, you just specify {color:#0747a6}{{true}}{color} as the first argument of {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load(<INPUT_PATH>)
> df.write.format("json").partitionBy(true, columns).save(<OUTPUT_PATH>)
> {code}
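> For comparison, a similar effect can already be obtained in stock Spark by calling the existing {{Dataset.repartition(Column*)}} explicitly before the write. The following is a minimal sketch of that workaround (the column name {{year}} and the paths are hypothetical, and {{spark}} is an existing {{SparkSession}}):
> {code:java}
> import org.apache.spark.sql.functions.col
>
> val df = spark.read.format("csv").option("header", true).load("/tmp/input") // hypothetical path
>
> // Shuffle rows so that each value of "year" is handled by as few tasks as
> // possible before partitionBy splits the output into per-value directories.
> df.repartition(col("year"))
>   .write
>   .format("json")
>   .partitionBy("year")
>   .save("/tmp/output") // hypothetical path
> {code}
> The proposed {{partitionBy(true, columns)}} would fold this explicit {{repartition}} step into the writer itself.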