[
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sujit Das updated SPARK-32966:
--
Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5 (was: EMR -
5.30.0; Hadoop -2.8.5; Spark- 2.4.5)
> Spark: partitionBy is taking a long time to process
> -
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5
> Reporter: Sujit Das
> Priority: Major
> Labels: AWS, pyspark, spark-conf
>
> 1. When I write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; the
> job ran for more than 50 min before I force-terminated the EMR cluster.
> The partitions had been created and the data files were present, but in
> the EMR cluster the process still showed as running, whereas the Spark
> history server showed no running or pending process.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 3. I then added a new conf, spark.sql.shuffle.partitions=3; it took
> 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 4. I then disabled that conf and ran a plain write with partitioning; it
> took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all of the scenarios above is
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100
> My goal is to reduce the time of a write with partitionBy. Is there
> anything I am missing?
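> For reference, the confs tried across the scenarios above can be consolidated into a single submission. This is a sketch only: the script name and paths are placeholders, and note that spark.sql.adaptive.coalescePartitions.initialPartitionNum only has an effect when adaptive query execution is enabled:
>
> ```shell
> # Sketch: consolidates the confs mentioned above.
> # my_partitioned_write.py is a placeholder for the job script.
> spark-submit \
>   --conf spark.sql.sources.partitionOverwriteMode=dynamic \
>   --conf spark.sql.shuffle.partitions=3 \
>   --conf spark.sql.adaptive.enabled=true \
>   --conf spark.sql.adaptive.coalescePartitions.initialPartitionNum=100 \
>   my_partitioned_write.py
> ```
>
> Timing each conf in isolation, as in the scenarios above, is still the clearer way to identify which setting dominates the write time.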
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org