[jira] [Updated] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Sujit Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujit Das updated SPARK-32966:
------------------------------
Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5  (was: EMR - 
5.30.0; Hadoop -2.8.5; Spark- 2.4.5)

> Spark| PartitionBy is taking long time to process
> -------------------------------------------------
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5
>Reporter: Sujit Das
>Priority: Major
>  Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partition, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; the write ran much longer (more than 50 min before I force-terminated the EMR cluster). I observed that the partitions had been created and the data files were present, but the EMR cluster still showed the process as running, whereas the Spark history server showed no running or pending jobs:
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; the write took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> 4. I then disabled that conf and ran a plain write with partitionBy again; it took 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all the scenarios above is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of a write with partitionBy. Is there anything I am missing?
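[Editorial note on the steps above, not from the reporter: coalesce(3) caps the partitioned write at three tasks, and on an S3-backed path the dynamic partition overwrite commit in Spark 2.4 is rename-heavy, which is consistent with the long hang described in step 2. A commonly suggested alternative, untested against this particular job, is to repartition by the partition column, e.g. df2_merge.repartition("posted_on"), before the partitionBy() write, and to set the relevant confs explicitly at submit time. A configuration sketch, where "job.py" is a placeholder for the reporter's script and the values are illustrative rather than tuned:]

```shell
# Sketch only; assumes Spark 2.4.5 on EMR 5.30.0. "job.py" is hypothetical.
# partitionOverwriteMode=dynamic rewrites only the partitions present in the
# incoming data; shuffle.partitions controls post-shuffle parallelism and is
# a better lever than coalesce(3), which throttles the whole write stage.
spark-submit \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  --conf spark.sql.shuffle.partitions=200 \
  job.py
```

[Also worth noting: spark.sql.adaptive.coalescePartitions.initialPartitionNum is a Spark 3.x adaptive-execution option; Spark 2.4.5 does not read it, so it should have no effect in any of the scenarios above.]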



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32966) Spark| PartitionBy is taking long time to process

2020-09-22 Thread Sujit Das (Jira)
Sujit Das created SPARK-32966:
------------------------------

 Summary: Spark| PartitionBy is taking long time to process
 Key: SPARK-32966
 URL: https://issues.apache.org/jira/browse/SPARK-32966
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.5
 Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
Reporter: Sujit Das


[Issue body as quoted in the update above.]