rdblue commented on a change in pull request #4446:
URL: https://github.com/apache/iceberg/pull/4446#discussion_r839927551



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java
##########
@@ -145,10 +145,12 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     }
 
     long numRows = 0L;
+    long defaultRowSizeInBytes = readSchema().defaultSize();
 
     for (CombinedScanTask task : tasks()) {
       for (FileScanTask file : task.files()) {
-        numRows += file.file().recordCount();
+        // TODO: if possible, take deletes also into consideration.
+        numRows += Math.min(file.length() / defaultRowSizeInBytes, file.file().recordCount());

Review comment:
      We always want to base the estimate on the number of records in the file, not on the file length. Estimates derived from file length tend to be quite inaccurate in practice because Parquet compression ratios vary widely from file to file.

   If you want to estimate how many of the file's records will actually be scanned by this task, you could do something different: find the proportion of the file covered by the task's byte range and multiply that by the record count.
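   The proportional approach could be sketched roughly like this. The helper below and its parameters are illustrative only, not Iceberg API; in `SparkScan` the inputs would come from the task's split offset/length and the file's metadata:

```java
// Hypothetical sketch: estimate rows scanned by a task that covers a byte
// range of a data file, by scaling the file's record count by the fraction
// of the file the task touches. Not actual Iceberg code.
public class RowEstimate {

  /**
   * @param scanLength  number of bytes of the file this task reads
   * @param fileLength  total length of the file in bytes
   * @param recordCount total records in the file (from file metadata)
   * @return estimated records read by this task, capped at recordCount
   */
  static long estimateTaskRows(long scanLength, long fileLength, long recordCount) {
    if (fileLength <= 0) {
      // degenerate metadata; fall back to the full record count
      return recordCount;
    }
    double coveredFraction = (double) scanLength / fileLength;
    // round up so a task touching any bytes estimates at least one row
    long estimate = (long) Math.ceil(coveredFraction * recordCount);
    return Math.min(estimate, recordCount);
  }

  public static void main(String[] args) {
    // a split covering half of a 1,000-record file
    System.out.println(estimateTaskRows(512, 1024, 1000));  // 500
    // a split covering the whole file
    System.out.println(estimateTaskRows(1024, 1024, 1000)); // 1000
  }
}
```

   Unlike dividing file length by the schema's default row size, this keeps the record count from file metadata as the source of truth and only scales it down for partial scans.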




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


