[GitHub] [spark] Tagar commented on pull request #29120: [SPARK-32291][SQL] COALESCE should not reduce the child parallelism if it contains a Join

GitBox Thu, 29 Oct 2020 18:26:31 -0700


Tagar commented on pull request #29120:
URL: https://github.com/apache/spark/pull/29120#issuecomment-719114594



   I've seen Spark users do this all the time to save to a single file. Great 
improvement.
   
   Would this change cover `.repartition(1)` too, and not just `.coalesce(1)`?
   
   Thanks
   
   P.S. Fun fact: some time back I wrote a small tool to work around this very 
issue 
   
https://github.com/Tagar/abalon/blob/v2.3.3/abalon/spark/sparkutils.py#L444-L445
   It used HDFS API calls to coalesce the output files together without 
affecting Spark join parallelism 
   https://github.com/Tagar/abalon/blob/v2.3.3/abalon/spark/sparkutils.py#L340
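
   For readers unfamiliar with the idea, here is a rough sketch of merging 
part files after the job has written them at full parallelism. It is not the 
abalon code: the helper name `merge_part_files` is hypothetical, and it uses 
the local filesystem as a stand-in for HDFS API calls.

```python
import shutil
from pathlib import Path


def merge_part_files(src_dir: str, dest_file: str) -> int:
    """Concatenate part-* files from src_dir into dest_file, in name order.

    Hypothetical helper: the local filesystem stands in for HDFS here.
    Returns the number of part files that were merged.
    """
    # Only files named part-* are merged; markers such as _SUCCESS and
    # hidden checksum files (.part-*.crc) do not match this pattern.
    parts = sorted(p for p in Path(src_dir).glob("part-*") if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes so large part files are not loaded in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

   The point of the approach is that the job writes with its normal number of 
partitions, so the join keeps its parallelism, and the single file is produced 
afterwards as a cheap file-level step. Plain concatenation like this only suits 
formats that tolerate it (for example, headerless text or CSV).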


