[PR] Spark: Use actual file sizes instead of schema-based estimates for table statistics [iceberg]

via GitHub Thu, 19 Mar 2026 19:52:24 -0700


majian1998 opened a new pull request, #15693:
URL: https://github.com/apache/iceberg/pull/15693


   ## What
   SparkScan.estimateStatistics() currently estimates table size by multiplying 
StructType.defaultSize() (hardcoded per-type constants, e.g. STRING=54 bytes) 
by the total row count. The result can differ greatly from the actual on-disk 
size, so Spark may pick a suboptimal join strategy (e.g. skipping a 
BroadcastHashJoin for a small table, or broadcasting a table that's too large).
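
   To make the mismatch concrete, here is a minimal sketch of the schema-based arithmetic described above. The class and method names, the three-column schema, the row count, and the 8 MiB on-disk size are all hypothetical; the 54-byte STRING constant is the figure quoted in this description, and 10 MiB is Spark's default for spark.sql.autoBroadcastJoinThreshold. This is an illustration of the calculation, not the code in SparkScan.

```java
final class SchemaEstimateSketch {
  // Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MiB.
  static final long BROADCAST_THRESHOLD_BYTES = 10L * 1024 * 1024;

  // Schema-based estimate: sum of per-column default sizes times the row count.
  static long schemaBasedEstimate(int[] columnDefaultSizes, long rowCount) {
    long rowSize = 0L;
    for (int size : columnDefaultSizes) {
      rowSize += size;
    }
    return rowSize * rowCount;
  }

  // A relation is a broadcast candidate when its reported size fits the threshold.
  static boolean wouldBroadcast(long sizeBytes) {
    return sizeBytes >= 0 && sizeBytes <= BROADCAST_THRESHOLD_BYTES;
  }

  public static void main(String[] args) {
    // Hypothetical schema: two STRING columns (54 bytes each, per the figure
    // quoted in the description) and one 8-byte LONG column.
    int[] defaults = {54, 54, 8};
    long rows = 1_000_000L;

    long estimated = schemaBasedEstimate(defaults, rows);
    long actualOnDisk = 8L * 1024 * 1024; // hypothetical compressed file size

    System.out.println("estimated bytes: " + estimated
        + ", broadcast: " + wouldBroadcast(estimated));
    System.out.println("actual bytes:    " + actualOnDisk
        + ", broadcast: " + wouldBroadcast(actualOnDisk));
  }
}
```

   With these assumed numbers the schema-based estimate exceeds the threshold while the on-disk size fits under it, which is the "missed broadcast" case mentioned above.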
   
   ## Changes
   Replace the type-based estimation with real file size data that Iceberg 
already tracks:
   - Partitioned tables (no filters): read total-files-size from the snapshot summary
   - All other paths: sum ScanTaskGroup.sizeBytes(), which reflects the actual fileSizeInBytes from manifests
   - Applies to both SparkScan and SparkChangelogScan

   This makes Iceberg-based table statistics consistent with how Spark's native 
Parquet source reports size (using actual file sizes on disk), so the same data 
produces the same join strategy regardless of the source.
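
   The selection logic listed above can be sketched as follows. This is a simplified stand-alone model, not the PR's diff: the class, method, and parameter names are hypothetical, task group sizes are passed in as plain longs rather than ScanTaskGroup objects, and falling back to the task-group sum when the summary lacks total-files-size is an assumption made here for illustration.

```java
import java.util.List;
import java.util.Map;

final class FileSizeStatsSketch {
  // Snapshot summary key named in the PR description.
  static final String TOTAL_FILES_SIZE = "total-files-size";

  static long estimateSizeBytes(
      boolean partitioned,
      boolean hasFilters,
      Map<String, String> snapshotSummary,
      List<Long> taskGroupSizes) {
    // Partitioned table with no filters: use the snapshot-level total.
    if (partitioned && !hasFilters) {
      String total = snapshotSummary.get(TOTAL_FILES_SIZE);
      if (total != null) {
        return Long.parseLong(total);
      }
      // Assumption: if the summary has no total, fall through to the sum below.
    }

    // All other paths: sum the sizes of the planned task groups.
    long sum = 0L;
    for (long size : taskGroupSizes) {
      sum += size;
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, String> summary = Map.of(TOTAL_FILES_SIZE, "4096");
    List<Long> groups = List.of(1000L, 500L);

    // Unfiltered partitioned scan reads the summary value.
    System.out.println(estimateSizeBytes(true, false, summary, groups));
    // Filtered scan sums only the task groups that survived planning.
    System.out.println(estimateSizeBytes(true, true, summary, groups));
  }
}
```

   Summing task group sizes on the filtered path means the reported size shrinks with partition and file pruning, whereas the snapshot summary always describes the whole table.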
   
   Related issue: https://github.com/apache/iceberg/issues/15684


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


