RussellSpitzer commented on pull request #4395:
URL: https://github.com/apache/iceberg/pull/4395#issuecomment-1079565485


   So one big issue I have here, and why I kind of gave up on this for the 
moment, is that we actually know all of this information on the driver before 
any executors run. Not only do we have it there, but we can also eliminate 
duplicate counts for both delete files and data files. 
   
   In the current code we would get incorrect results for the number of data 
files if our split size is smaller than our file size. Imagine we have a 512 MB 
file and a 128 MB split size (and row group size): this will generate 4 separate 
tasks for the same data file, counting the file 4 times.
   
   On the driver we have already materialized every single FileScanTask, so we 
know this information and can do the dedupe; our only problem is that Spark 
currently doesn't allow us to populate custom metrics at the Source itself on 
the driver. I think we should push for that capability in Spark rather than 
doing this at the executor level (or do it in conjunction with the executor 
metrics).
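   To illustrate the overcounting and the driver-side fix: a minimal, self-contained sketch, not the actual Iceberg API. `ScanTask` here is a hypothetical stand-in for Iceberg's `FileScanTask` (one split of one data file); real tasks carry much more, but deduping by the underlying file path is the idea.

```java
import java.util.List;

public class DriverSideDedup {
  // Hypothetical stand-in for a FileScanTask: one split of one data file.
  record ScanTask(String filePath, long start, long length) {}

  // Counting tasks overcounts files when a file is split across several tasks;
  // dedupe by file path on the driver instead.
  static long uniqueDataFiles(List<ScanTask> tasks) {
    return tasks.stream().map(ScanTask::filePath).distinct().count();
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // A 512 MB file with a 128 MB split size yields 4 tasks for the same file.
    List<ScanTask> tasks = List.of(
        new ScanTask("s3://bucket/data/file-a.parquet", 0, 128 * mb),
        new ScanTask("s3://bucket/data/file-a.parquet", 128 * mb, 128 * mb),
        new ScanTask("s3://bucket/data/file-a.parquet", 256 * mb, 128 * mb),
        new ScanTask("s3://bucket/data/file-a.parquet", 384 * mb, 128 * mb),
        new ScanTask("s3://bucket/data/file-b.parquet", 0, 100 * mb));
    // Per-task counting would report 5 data files; deduped it is 2.
    System.out.println("tasks=" + tasks.size()
        + " uniqueFiles=" + uniqueDataFiles(tasks));
  }
}
```

   Since the driver materializes every task during planning anyway, this dedupe costs one pass over the task list; the same trick applies to delete files referenced by multiple tasks.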


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
