[ https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200571#comment-17200571 ]
Takeshi Yamamuro commented on SPARK-32966:
------------------------------------------

Is this a question? At the least, I think you need to provide more information (e.g., a complete query that reproduces the issue).

> Spark | PartitionBy is taking long time to process
> ---------------------------------------------------
>
>                 Key: SPARK-32966
>                 URL: https://issues.apache.org/jira/browse/SPARK-32966
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: EMR 5.30.0; Hadoop 2.8.5; Spark 2.4.5
>            Reporter: Sujit Das
>            Priority: Major
>              Labels: AWS, pyspark, spark-conf
>
> 1. When I write without any partitioning, it takes 8 minutes:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; the job ran for more than 50 minutes before I force-terminated the EMR cluster. I observed that the partitions had been created and the data files were present, but the EMR cluster still showed the step as running, whereas the Spark history server showed no running or pending jobs.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 3. I then set spark.sql.shuffle.partitions=3; the write took 24 minutes:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 4. I disabled that conf again and ran a plain partitioned write; it took 30 minutes:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all of the scenarios above is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of the write with partitionBy. Is there anything I am missing?
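For reference, here is a minimal sketch of the write pattern being discussed, not the reporter's actual job. It assumes a SparkSession built from scratch, a hypothetical S3 source path, and a hypothetical destination path; only the config keys, the coalesce, and the partitionBy call mirror what the report describes.

{code:python}
# Minimal sketch of a dynamic-partition-overwrite write with coalesce.
# Paths and the app name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioned-write-sketch")
    # Overwrite only the partitions present in the incoming data,
    # instead of truncating the whole destination directory.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

df2_merge = spark.read.parquet("s3://example-bucket/input/")  # hypothetical source

(
    df2_merge.coalesce(3)            # fewer, larger output files per partition
    .write
    .mode("overwrite")
    .partitionBy("posted_on")        # one directory per posted_on value
    .parquet("s3://example-bucket/output/")  # hypothetical destination
)
{code}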