xinlifoobar commented on issue #8698:
URL: https://github.com/apache/datafusion/issues/8698#issuecomment-2182112425

   Hi @alamb 
      
   I am looking into this issue and have a proposal for a follow-up PR. 
Substrait's support for statistics is quite basic compared to DataFusion's. 
For example, Substrait only supports table-level statistics, with no notion of 
precision, and does not support column-level statistics. Here is the current 
Substrait `Stats` message:  
      
   ```protobuf  
   message Stats {  
     double row_count = 1;  
     double record_size = 2;  
     substrait.extensions.AdvancedExtension advanced_extension = 10;  
   }  
   ```  
      
   In contrast, DataFusion's statistics are more detailed:  
      
   ```rust  
   #[derive(Debug, Clone, PartialEq, Eq)]  
   pub struct Statistics {  
     /// The number of table rows.  
     pub num_rows: Precision<usize>,  
     /// Total bytes of the table rows.  
     pub total_byte_size: Precision<usize>,  
     /// Statistics on a column level. It contains a [`ColumnStatistics`] for  
     /// each field in the schema of the table to which the [`Statistics`] refer.  
     pub column_statistics: Vec<ColumnStatistics>,  
   }  
   ```  
      
   To carry DataFusion's richer statistics through Substrait, I propose using 
the `advanced_extension` field of the `Stats` message. `AdvancedExtension` is 
defined as follows:  
      
   ```protobuf  
   // A generic object that can be used to embed additional extension  
   // information into the serialized Substrait plan.  
   message AdvancedExtension {  
     // An optimization is helpful information that doesn't influence  
     // semantics. May be ignored by a consumer.  
     google.protobuf.Any optimization = 1;  
       
     // An enhancement alters semantics. Cannot be ignored by a consumer.  
     google.protobuf.Any enhancement = 2;  
   }  
   ```  
      
   I would add a new message type to `datafusion-proto` that carries all the 
necessary fields:  
      
   ```protobuf  
   message DatafusionStatsExtension {  
     // The version of the extension.  
     int32 version = 1;  
       
     // The statistics.  
     datafusion_common.Statistics statistics = 6;  
   }  
   ```  
      
   On the producer side, the `DatafusionStatsExtension` message will be encoded 
and attached to the `AdvancedExtension` as an `optimization`.  
      
   On the consumer side, the `AdvancedExtension` message will be parsed, the 
`DatafusionStatsExtension` message extracted, and the statistics updated 
accordingly. If the `DatafusionStatsExtension` message is not present, the 
consumer will treat `num_rows` and `total_byte_size` as table-level statistics 
with `Precision::Exact`.  
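
To make the producer and consumer behaviour concrete, here is a rough sketch of the flow described above. It is not actual DataFusion or Substrait code: every type below is a simplified stand-in defined only for this example, the names `produce_extension`, `consume_stats`, and `SUPPORTED_VERSION` are hypothetical, and protobuf encoding into `google.protobuf.Any` is left out. Two details are my assumptions rather than part of the proposal: an extension with an unknown `version` is ignored just like a missing one (it is only an optimization), and the fallback derives the total byte size as `row_count * record_size`.

```rust
// Simplified stand-ins for illustration only; these are NOT the real
// DataFusion or Substrait types.

/// Same idea as DataFusion's `Precision`: exact, estimated, or unknown.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Table-level part of `Statistics` (column statistics omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Precision<usize>,
    total_byte_size: Precision<usize>,
}

/// The two scalar fields of Substrait's `Stats` message.
struct SubstraitStats {
    row_count: f64,
    record_size: f64,
}

/// Hypothetical decoded form of the proposed `DatafusionStatsExtension`.
struct DatafusionStatsExtension {
    version: i32,
    statistics: Statistics,
}

/// Assumption: the only extension version this consumer understands.
const SUPPORTED_VERSION: i32 = 1;

/// Producer side: wrap the full statistics in the versioned extension.
fn produce_extension(statistics: &Statistics) -> DatafusionStatsExtension {
    DatafusionStatsExtension {
        version: SUPPORTED_VERSION,
        statistics: statistics.clone(),
    }
}

/// Consumer side: prefer the extension, otherwise fall back to the plain
/// Substrait fields and mark them as exact.
fn consume_stats(
    stats: &SubstraitStats,
    ext: Option<&DatafusionStatsExtension>,
) -> Statistics {
    match ext {
        Some(e) if e.version == SUPPORTED_VERSION => e.statistics.clone(),
        // Missing extension, or (assumption) a version we do not know.
        _ => {
            let rows = stats.row_count.max(0.0);
            // Assumption: total bytes = rows * bytes per record.
            let bytes = rows * stats.record_size.max(0.0);
            Statistics {
                num_rows: Precision::Exact(rows as usize),
                total_byte_size: Precision::Exact(bytes as usize),
            }
        }
    }
}

fn main() {
    let stats = SubstraitStats { row_count: 100.0, record_size: 8.0 };

    // No extension attached: table-level values are treated as exact.
    println!("{:?}", consume_stats(&stats, None));

    // Extension attached: the original precision survives the round trip.
    let original = Statistics {
        num_rows: Precision::Inexact(90),
        total_byte_size: Precision::Absent,
    };
    let ext = produce_extension(&original);
    println!("{:?}", consume_stats(&stats, Some(&ext)));
}
```

Keeping the fallback in one place means a plan produced by a non-DataFusion system still yields usable table-level statistics, while plans produced by DataFusion keep their precision and, in the real message, their column-level statistics.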
   
   What do you think about this idea?  
      
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

