[PR] GH-43553: [Format] Add the specification for statistics schema [arrow]

via GitHub Tue, 17 Dec 2024 23:12:34 -0800


kou opened a new pull request, #45058:
URL: https://github.com/apache/arrow/pull/45058


   ### Rationale for this change
   
   Statistics are useful for fast query processing. Many query engines
   use statistics to optimize their query plan.
   
   The Apache Arrow format doesn't have statistics, but other formats that
   can be read as Apache Arrow data may have them. For example, Apache
   Parquet C++ can read an Apache Parquet file as Apache Arrow data, and
   an Apache Parquet file may have statistics.
   
   One of the Apache Arrow C streaming interface use cases is the following:
   
   1. Module A reads an Apache Parquet file as Apache Arrow data
   2. Module A passes the read Apache Arrow data to module B through the
      Arrow C data interface
   3. Module B processes the passed Apache Arrow data
   
   If module A can pass the statistics associated with the Apache Parquet
   file to module B, module B can use the statistics to optimize its
   query plan.
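
   As a rough illustration of why this helps, the sketch below shows module B
   using a maximum-value statistic received from module A to decide whether a
   filter needs to scan the data at all. The `ColumnStatistics` class and the
   `can_skip_greater_than` function are hypothetical names made up for this
   example. They are not the statistics schema specified by this PR, and the
   sketch assumes the statistics are exact rather than approximate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnStatistics:
    """Hypothetical per-column statistics; not the schema defined by this PR."""

    min_value: Optional[int] = None
    max_value: Optional[int] = None


def can_skip_greater_than(stats: ColumnStatistics, threshold: int) -> bool:
    """Return True when no value in the column can satisfy `value > threshold`."""
    # Without a known maximum, module B has to scan the data.
    if stats.max_value is None:
        return False
    # Assumes max_value is exact; an approximate maximum would not be safe here.
    return stats.max_value <= threshold


# Module A: statistics it might have taken from the Parquet file metadata.
stats_from_a = ColumnStatistics(min_value=3, max_value=42)

# Module B: decide whether each filter needs to touch the data.
skip_scan = can_skip_greater_than(stats_from_a, threshold=100)
must_scan = not can_skip_greater_than(stats_from_a, threshold=10)
```

   Here `skip_scan` is true because the maximum (42) rules out any value above
   100, while `must_scan` is true because values above 10 may exist.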
   
   ### What changes are included in this PR?
   
   We standardize how to represent statistics as an Apache Arrow array
   for easy exchange.
   
   We don't standardize how to pass the statistics array. You can use any
   interface for it. For example, you can use the Apache Arrow C data interface.
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes.


