wypoon commented on pull request #4395:
URL: https://github.com/apache/iceberg/pull/4395#issuecomment-1079613533


   > So one big issue I have here, and why I kind of gave up on this for the 
moment, is that we actually know all of this information on the driver before 
any executors run. Not only do we have it there but we can also eliminate 
duplicate counts for both delete files and for data files.
   > 
   Yes, I realize that we know the number of files before the tasks are run. It 
would certainly be better if Spark had driver-side custom metrics. But this is 
something we can implement with what is in Spark 3.2 now. It's not the best way, 
but I think it's still useful to provide this metric, and it can be implemented 
in a better way once we get driver-side custom metrics in Spark.
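   To make the Spark 3.2 approach concrete, here is a minimal sketch of an executor-side file-count metric. The two interfaces are local stand-ins mirroring the shape of Spark 3.2's DSv2 metric API (`org.apache.spark.sql.connector.metric.CustomMetric` and `CustomTaskMetric`), defined inline so the sketch compiles without Spark on the classpath; the metric name `numSplits` is illustrative, not from the PR:

```java
public class FileCountMetricSketch {
  // Local stand-ins mirroring Spark 3.2's DSv2 metric interfaces
  // (org.apache.spark.sql.connector.metric.CustomMetric / CustomTaskMetric).
  interface CustomMetric {
    String name();
    String description();
    String aggregateTaskMetrics(long[] taskMetrics);
  }

  interface CustomTaskMetric {
    String name();
    long value();
  }

  // Driver side: Spark collects the per-task values and calls
  // aggregateTaskMetrics to produce the value shown in the SQL UI.
  static class NumSplitsMetric implements CustomMetric {
    public String name() { return "numSplits"; }
    public String description() { return "number of file splits processed"; }
    public String aggregateTaskMetrics(long[] taskMetrics) {
      long sum = 0;
      for (long v : taskMetrics) {
        sum += v;
      }
      return String.valueOf(sum);
    }
  }

  // Executor side: each task reports how many splits it read.
  static class TaskNumSplits implements CustomTaskMetric {
    private final long count;
    TaskNumSplits(long count) { this.count = count; }
    public String name() { return "numSplits"; }
    public long value() { return count; }
  }

  public static void main(String[] args) {
    // Three tasks reporting 1, 1, and 2 splits aggregate to 4.
    CustomTaskMetric[] tasks = {
        new TaskNumSplits(1), new TaskNumSplits(1), new TaskNumSplits(2)};
    long[] values = new long[tasks.length];
    for (int i = 0; i < tasks.length; i++) {
      values[i] = tasks[i].value();
    }
    System.out.println(new NumSplitsMetric().aggregateTaskMetrics(values));
  }
}
```

   Because the aggregation runs over per-task values, the driver only ever sees a sum of what executors report, which is exactly why driver-side knowledge (total file counts, deduplication) cannot be expressed this way today.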
   
   > In the current code we would get incorrect results for number of data 
files if our split size is smaller than our file size. Imagine we have a 512 mb 
file and 128 mb split size (and row group) this will generate 4 separate tasks 
for the same Data file. Counting the file 4 times.
   > 
   That is a good point. Even so, such a count is still a useful metric, in 
that it is a measure of the amount of work to be done, although "number of 
files" is not the correct name for it. Of course, the actual number of files is 
a metric we're interested in as well.
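   The overcounting in the quoted example can be sketched with a few lines of arithmetic: a 512 MB file planned with a 128 MB split size produces four tasks, so naive per-task counting sees the same file four times, while deduplicating by file path recovers the true count (the split-planning rule and the file name `data-00001.parquet` are illustrative assumptions, not code from the PR):

```java
import java.util.HashSet;
import java.util.Set;

public class SplitOvercountSketch {
  // Assumed planning rule: one read task per split (ceiling division).
  static long plannedTasks(long fileSize, long splitSize) {
    return (fileSize + splitSize - 1) / splitSize;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    long tasks = plannedTasks(512 * mb, 128 * mb);  // 512 MB file, 128 MB splits

    // Naive metric: each task counts the file it reads, so one file is
    // counted once per split.
    long naiveFileCount = tasks;

    // Deduplicating by file path (hypothetical name) recovers the number of
    // distinct files, which only the driver can do today.
    Set<String> distinctFiles = new HashSet<>();
    for (long i = 0; i < tasks; i++) {
      distinctFiles.add("data-00001.parquet");
    }
    System.out.println("splits counted: " + naiveFileCount
        + ", distinct files: " + distinctFiles.size());
  }
}
```

   This is why the per-task count is better understood as "number of splits" (a work measure) than as "number of files".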
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


