RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351768402
> now if total row count of a split/file = (scannedFileFraction * file().recordCount())

This is, I think, the confusion: we are attempting to determine how many rows are in this split specifically, because we are summing over all splits.

https://github.com/apache/iceberg/blob/7fd9ded0a119c050746d765bd90c59fef93506b1/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L143

A file may be turned into multiple FileScanTasks, so we can only count the number of rows contributed by the tasks we are currently looking at.

Let's say we do a scan which only ends up touching File A, but File A is divided into multiple FileScanTasks (or now ContentScanTasks) for read parallelism: A1, A2, and A3.

If they have lengths of 2/5, 2/5, and 1/5 of A, then we do the following math:

Sum over all tasks: (2/5 * Total A Rows) + (2/5 * Total A Rows) + (1/5 * Total A Rows) = Total A Rows

This should be the total number of rows in the scan.
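
To make the arithmetic concrete, here is a minimal Python sketch of the per-task estimate summed over a scan. It is not the Iceberg implementation: `SplitTask`, its fields, and the byte and row counts are hypothetical values chosen to match the A1/A2/A3 example, and exact fractions are used only so the totals come out without rounding.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SplitTask:
    """Hypothetical stand-in for one scan task, i.e. one slice of a data file."""
    length: int             # bytes of the file covered by this task
    file_size: int          # total bytes in the underlying file
    file_record_count: int  # total rows in the underlying file


def estimated_rows(task):
    # Fraction of the file this task covers, times the file's total row count.
    return Fraction(task.length, task.file_size) * task.file_record_count


def total_rows(tasks):
    # The scan-level estimate is the sum of the per-task estimates.
    return sum(estimated_rows(t) for t in tasks)


# Assumed numbers: File A is 500 bytes with 1000 rows, split into
# A1, A2, A3 covering 2/5, 2/5 and 1/5 of the file.
a1 = SplitTask(length=200, file_size=500, file_record_count=1000)
a2 = SplitTask(length=200, file_size=500, file_record_count=1000)
a3 = SplitTask(length=100, file_size=500, file_record_count=1000)

full_scan = total_rows([a1, a2, a3])   # all tasks of File A
partial_scan = total_rows([a1, a3])    # only some tasks are in this scan

print(full_scan, partial_scan)
```

When every task of File A is in the scan, the fractions add up to 1 and the estimate equals the file's row count. When only some tasks are in the scan, the estimate covers only the rows those tasks contribute, which is why the per-task fraction is needed.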
