[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

GitBox Wed, 14 Dec 2022 08:56:56 -0800


RussellSpitzer commented on issue #6424:
URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351768402


   > now if total row count of a split/file = (scannedFileFraction * 
file().recordCount())
   
   I think this is the source of the confusion: we are trying to determine how 
many rows are in this split specifically, because we are summing over all splits. 
   
   
https://github.com/apache/iceberg/blob/7fd9ded0a119c050746d765bd90c59fef93506b1/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L143
   
   A file may be turned into multiple FileScanTasks, so we can only count the 
number of rows contributed by the tasks we are currently looking at. Let's say 
we do a scan which only ends up touching FileA, but FileA is divided into 
multiple FileScanTasks (or now ContentScanTasks) for read parallelism: A1, 
A2, and A3.
   
   If they have lengths 2/5, 2/5, and 1/5 of FileA's total length, then we do 
the following math:
   Sum over all tasks (2/5 * Total A Rows + 2/5 * Total A Rows + 1/5 * 
Total A Rows) = Total A Rows
   This should be the total number of rows in the scan.
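
The arithmetic above can be sketched in Java. This is only an illustration, not the SparkScan code itself: the `SplitRowEstimate` and `Task` classes and their field names are hypothetical stand-ins, and the file size (500 bytes) and row count (1000) are made-up numbers chosen so the tasks cover 2/5, 2/5, and 1/5 of the file.

```java
import java.util.Arrays;
import java.util.List;

public class SplitRowEstimate {

  /** Hypothetical stand-in for one scan task: a byte range cut from a data file. */
  public static final class Task {
    final long length;          // bytes of the file covered by this task
    final long fileSizeInBytes; // total size of the underlying file
    final long fileRecordCount; // total rows in the underlying file

    public Task(long length, long fileSizeInBytes, long fileRecordCount) {
      this.length = length;
      this.fileSizeInBytes = fileSizeInBytes;
      this.fileRecordCount = fileRecordCount;
    }
  }

  /** Rows attributed to a single task: the fraction of the file it covers times the file's rows. */
  public static double estimatedRows(Task task) {
    double scannedFileFraction = (double) task.length / task.fileSizeInBytes;
    return scannedFileFraction * task.fileRecordCount;
  }

  /** Sum of the per-task estimates over every task in the scan. */
  public static long totalEstimatedRows(List<Task> tasks) {
    double sum = 0.0;
    for (Task task : tasks) {
      sum += estimatedRows(task);
    }
    return Math.round(sum);
  }

  public static void main(String[] args) {
    // FileA: 500 bytes, 1000 rows, split into A1, A2, A3 (2/5, 2/5, 1/5 of the file).
    List<Task> tasks = Arrays.asList(
        new Task(200, 500, 1000),
        new Task(200, 500, 1000),
        new Task(100, 500, 1000));
    System.out.println(totalEstimatedRows(tasks)); // prints 1000
  }
}
```

Each task contributes only its own share (400, 400, and 200 rows here), so the sum over the tasks recovers FileA's row count without counting the file more than once.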
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


