Re: [PR] feat(datafusion): Expose DataFusion statistics on an IcebergTableScan [iceberg-rust]

via GitHub Wed, 08 Jan 2025 21:05:23 -0800


liurenjie1024 commented on code in PR #880:
URL: https://github.com/apache/iceberg-rust/pull/880#discussion_r1908170767



##########
crates/integrations/datafusion/src/table/mod.rs:
##########
@@ -41,16 +42,21 @@ pub struct IcebergTableProvider {
     table: Table,
     /// Table snapshot id that will be queried via this provider.
     snapshot_id: Option<i64>,
+    /// Statistics for the table; row count and null count/min-max values per column.
+    /// If not present defaults to `None`.
+    statistics: Option<Statistics>,
     /// A reference-counted arrow `Schema`.
     schema: ArrowSchemaRef,
 }
 
 impl IcebergTableProvider {
-    pub(crate) fn new(table: Table, schema: ArrowSchemaRef) -> Self {
+    pub(crate) async fn new(table: Table, schema: ArrowSchemaRef) -> Self {
+        let statistics = compute_statistics(&table, None).await.ok();

Review Comment:
   I don't see why statistics matter for joins. I think you are referring to the 
join reordering algorithm in the query optimizer? In my experience, complex table 
statistics don't help much with join reordering. For example, if the joined 
table has many filters, how would you estimate correct statistics after 
filtering? A histogram may help for a single-column filter, but not for complex 
filters. Also, cardinality estimation for joins doesn't work well.
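
To make the point about complex filters concrete, here is a minimal, self-contained sketch. It is not code from this PR or from DataFusion: the `Row` type, the column names, and the data are all hypothetical. It assumes an optimizer that combines exact per-column selectivities under an independence assumption, which is a common simplification, and shows how far the estimate can drift when two filtered columns are correlated.

```rust
/// Hypothetical toy row: `country_id` is fully determined by `city_id`.
struct Row {
    city_id: u32,
    country_id: u32,
}

/// Builds 1000 rows: 100 distinct cities, 10 cities per country.
fn build_rows() -> Vec<Row> {
    (0..1000u32)
        .map(|i| {
            let city_id = i % 100;
            Row { city_id, country_id: city_id / 10 }
        })
        .collect()
}

/// Exact fraction of rows matching a single-column predicate,
/// i.e. what a perfect per-column histogram would report.
fn selectivity(rows: &[Row], pred: impl Fn(&Row) -> bool) -> f64 {
    rows.iter().filter(|r| pred(r)).count() as f64 / rows.len() as f64
}

/// Estimated row count for `city_id = 42 AND country_id = 4`,
/// multiplying per-column selectivities (independence assumption).
fn estimated_rows(rows: &[Row]) -> f64 {
    let s_city = selectivity(rows, |r| r.city_id == 42);
    let s_country = selectivity(rows, |r| r.country_id == 4);
    rows.len() as f64 * s_city * s_country
}

/// True row count for the same conjunctive filter.
fn actual_rows(rows: &[Row]) -> usize {
    rows.iter()
        .filter(|r| r.city_id == 42 && r.country_id == 4)
        .count()
}

fn main() {
    let rows = build_rows();
    println!("estimated: {:.1}", estimated_rows(&rows));
    println!("actual: {}", actual_rows(&rows));
}
```

Even with exact single-column statistics, the estimate here is about 1 row while the filter actually keeps 10, and an error of that kind would then feed into any join cardinality estimate built on top of it.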



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

