tustvold commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2102419130

   Sorry I am a bit late to the party here. Correctly interpreting the 
statistics requires more than just 
[Statistics](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html),
 as there is additional information that specifies things like sort order, 
truncation, and logical types. It is very likely that the existing logic in 
DataFusion (DF) is incorrect, which is fine, but we shouldn't commit to an API 
here that prevents us from doing this correctly.
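
   To illustrate why the raw min/max bytes are not enough on their own, here is 
a minimal sketch using only the Rust standard library. It does not use the 
`parquet` crate, and `cmp_unsigned`, `cmp_signed`, `Bound` and 
`all_values_equal` are hypothetical names made up for this example. It assumes 
a reader that sees only the stored bytes and has to be told separately which 
ordering applies and whether a bound was truncated.

```rust
use std::cmp::Ordering;

/// Compare two byte strings treating every byte as unsigned (0..=255).
fn cmp_unsigned(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}

/// Compare the same byte strings treating every byte as signed (-128..=127).
fn cmp_signed(a: &[u8], b: &[u8]) -> Ordering {
    a.iter().map(|&x| x as i8).cmp(b.iter().map(|&x| x as i8))
}

/// Hypothetical bound: the stored bytes plus a flag saying whether they are
/// an actual column value (`exact`) or only a truncated bound.
struct Bound<'a> {
    bytes: &'a [u8],
    exact: bool,
}

/// "Every value in the chunk is identical" may only be inferred when both
/// bounds are exact; a truncated bound is not necessarily a real value.
fn all_values_equal(min: &Bound, max: &Bound) -> bool {
    min.exact && max.exact && min.bytes == max.bytes
}
```

   With these definitions, `cmp_unsigned(&[0x7f], &[0x80])` is `Less` while 
`cmp_signed(&[0x7f], &[0x80])` is `Greater`, so the same pair of stored bytes 
is a valid (min, max) under one ordering and inverted under the other. 
Likewise, two equal but truncated bounds say nothing about whether all values 
in the chunk are equal.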
   
   Additionally, the API also needs to handle the [Page 
Index](https://docs.rs/parquet/latest/parquet/file/page_index/index/struct.PageIndex.html),
 which exposes slightly different information from what is encoded in the file 
metadata.
   
   I don't mean to discourage you, but this is one of the most arcane and 
subtle areas of parquet, and I wonder if it might be worth starting out with 
something a little simpler as a first contribution? I'd recommend any of the 
issues marked "good first issue". As it stands, this ticket needs extensive 
research and design work from someone with a good deal of knowledge about 
parquet before even getting started on what will likely be pretty complex 
code. There are still ongoing discussions on parquet-format about correctly 
interpreting statistics, as the standard under-specifies a number of key things 
:sweat_smile:.

