[GitHub] [iceberg] arunb2w opened a new issue, #6928: Merge-on-read vs copy-on-write behavior during merge into

via GitHub Fri, 24 Feb 2023 03:36:38 -0800


arunb2w opened a new issue, #6928:
URL: https://github.com/apache/iceberg/issues/6928


   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I am running a merge-into query and wanted to see how it behaves with 
'write.merge.mode'='copy-on-write' versus 'merge-on-read' for the same input on the 
same cluster config.
   From the Spark UI, the merge SQL took only **2.5 minutes with copy-on-write**, 
while for the same load **merge-on-read took 11 minutes.**
   Attaching the images to show the stage-level behavior, in which we can see a huge 
shuffle write with merge-on-read, whereas it is very minimal with copy-on-write.
   On further analysing the SQL tab of the Spark UI, we found that 
dynamic pruning does not happen with MoR, whereas we could see that filter with 
CoW.
   My understanding is that we should prefer MoR for faster writes, but in this 
case MoR is actually performing worse than CoW.
   
   Questions:
   Why is dynamic pruning not happening with MoR when running a merge-into query?
   Why is the shuffle write so large when using MoR for the same input batch?
   How can MoR performance be optimized?
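
For context, the mode switch compared above can be sketched roughly as follows. This is a minimal illustration, not the exact statements used in the report: the table name `db.events` and the helper `merge_mode_ddl` are hypothetical, and only the `write.merge.mode` property and its two values come from the description above.

```python
# Hypothetical helper: builds the Spark SQL statement that sets the
# Iceberg table property 'write.merge.mode' to one of the two modes.
ALLOWED_MODES = ("copy-on-write", "merge-on-read")


def merge_mode_ddl(table: str, mode: str) -> str:
    """Return an ALTER TABLE statement that sets write.merge.mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported merge mode: {mode!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.merge.mode'='{mode}')"
    )


# 'db.events' is an assumed table name used only for illustration.
cow_stmt = merge_mode_ddl("db.events", "copy-on-write")
mor_stmt = merge_mode_ddl("db.events", "merge-on-read")

# With an existing SparkSession named `spark` (assumed), each statement
# would be applied before re-running the same MERGE INTO:
#   spark.sql(cow_stmt)
#   spark.sql(mor_stmt)
```

Running the same merge-into after each statement, on the same input and cluster config, is the comparison the timings above refer to.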


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

