Re: [I] Treat truncated parquet stats as inexact [datafusion]
alamb closed issue #15976: Treat truncated parquet stats as inexact URL: https://github.com/apache/datafusion/issues/15976 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Treat truncated parquet stats as inexact [datafusion]
nssalian commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-3095182816 take
Re: [I] Treat truncated parquet stats as inexact [datafusion]
CookiePieWw commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2925153931 Thanks for your feedback! I found that `ValueStatistics` already has [`max_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.max_is_exact) and [`min_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.min_is_exact), so it seems we can use them directly. I've drafted a PR at https://github.com/apache/arrow-rs/pull/7574 :)
Re: [I] Treat truncated parquet stats as inexact [datafusion]
alamb commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2920485084

> But I didn't find a method to access the `..exact` flags in `StatisticsConverter`, so my plan is to first add functions similar to `row_group_mins` to the converter to extract the flags, which requires a change to `arrow-rs` first, and then collect and pass the extracted boolean array of flags to `get_col_stats` to decide which one to use, `Precision::Exact` or `Precision::Inexact`.

I think the first thing needed in arrow-rs is to expose the [`is_max_value_exact`](https://docs.rs/parquet/latest/parquet/format/struct.Statistics.html#structfield.is_max_value_exact) and `is_min_value_exact` fields in the corresponding Rust structs (`ValueStatistics`):

https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html
https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html
Re: [I] Treat truncated parquet stats as inexact [datafusion]
CookiePieWw commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2920132245

Hi :) I've spent some time on this and found the problem in `get_col_stats`: https://github.com/apache/datafusion/blob/2c2f225926958b6abf06b01fcfb594017531043c/datafusion/datasource-parquet/src/file_format.rs#L1101-L1107

Here we always wrap the stats in `Precision::Exact`, but we actually need to respect the `is_max_value_exact` and `is_min_value_exact` flags in the column metadata. The max and min values are extracted at https://github.com/apache/datafusion/blob/2c2f225926958b6abf06b01fcfb594017531043c/datafusion/datasource-parquet/src/file_format.rs#L1112-L1139

However, I didn't find a method to access the `..exact` flags in `StatisticsConverter`. My plan is to first add a function similar to `row_group_mins` to the converter to extract the flags, which requires a change to `arrow-rs`, and then collect the extracted boolean array of flags and pass it to `get_col_stats` to decide whether to use `Precision::Exact` or `Precision::Inexact`.

Please let me know if this direction makes sense.
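To illustrate the plan above, here is a minimal, self-contained sketch of how an exactness flag could select the wrapper. It does not use the real DataFusion or arrow-rs types: the `Precision` enum is a simplified local stand-in, and `RowGroupMax` and `to_precision` are hypothetical names introduced only for this example.

```rust
/// Simplified local stand-in for DataFusion's `Precision` enum
/// (an assumption for illustration, not the real type).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical per-row-group statistic: a max value plus the flag from the
/// column metadata saying whether the value was stored without truncation.
#[derive(Debug, Clone)]
struct RowGroupMax {
    value: Option<String>,
    is_exact: bool,
}

/// Wrap one bound according to its exactness flag instead of always
/// assuming `Exact`.
fn to_precision(stat: &RowGroupMax) -> Precision<String> {
    match (&stat.value, stat.is_exact) {
        (Some(v), true) => Precision::Exact(v.clone()),
        (Some(v), false) => Precision::Inexact(v.clone()),
        (None, _) => Precision::Absent,
    }
}

fn main() {
    let stats = vec![
        RowGroupMax { value: Some("apple".to_string()), is_exact: true },
        RowGroupMax { value: Some("bao".to_string()), is_exact: false },
        RowGroupMax { value: None, is_exact: false },
    ];
    for s in &stats {
        println!("{:?}", to_precision(s));
    }
}
```

In the real code, the flags would presumably come from the column metadata through the converter rather than from a hand-built struct like this one.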
Re: [I] Treat truncated parquet stats as inexact [datafusion]
CookiePieWw commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2901868240 take
[I] Treat truncated parquet stats as inexact [datafusion]
robert3005 opened a new issue, #15976: URL: https://github.com/apache/datafusion/issues/15976

### Describe the bug

When reading Parquet files with truncated statistics, DataFusion reports the min/max as exact even though the metadata in the file indicates that the min/max has been truncated.

### To Reproduce

Create a Parquet file with truncated statistics and read the file. The statistics on the table are reported as `Exact`.

### Expected behavior

The statistics should be `Absent` or `Inexact`.

### Additional context

_No response_
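For readers unfamiliar with why a truncated statistic cannot be exact, the following sketch shows one common way a writer might shorten byte-array bounds. It is an assumption-laden illustration, not the code of any particular Parquet writer: `truncate_min` and `truncate_max` are hypothetical helpers, and the returned boolean plays the role of the exactness flag.

```rust
/// Truncate a min value to at most `len` bytes. A prefix always sorts at or
/// before the original, so it remains a valid lower bound.
/// Returns the bound and whether it is still exact.
fn truncate_min(value: &[u8], len: usize) -> (Vec<u8>, bool) {
    if value.len() <= len {
        (value.to_vec(), true) // untouched, so still exact
    } else {
        (value[..len].to_vec(), false)
    }
}

/// Truncate a max value to at most `len` bytes, then increment the last byte
/// that is not 0xFF so the result sorts after the original.
/// Falls back to the original (exact) value when no such byte exists.
fn truncate_max(value: &[u8], len: usize) -> (Vec<u8>, bool) {
    if value.len() <= len {
        return (value.to_vec(), true);
    }
    let mut prefix = value[..len].to_vec();
    while let Some(last) = prefix.pop() {
        if last < u8::MAX {
            prefix.push(last + 1);
            return (prefix, false);
        }
    }
    // Every byte in the prefix was 0xFF: no shorter upper bound exists.
    (value.to_vec(), true)
}

fn main() {
    let (min, min_exact) = truncate_min(b"banana", 3);
    let (max, max_exact) = truncate_max(b"banana", 3);
    // prints: min = ban (exact: false)
    println!("min = {} (exact: {})", String::from_utf8_lossy(&min), min_exact);
    // prints: max = bao (exact: false)
    println!("max = {} (exact: {})", String::from_utf8_lossy(&max), max_exact);
}
```

Neither `ban` nor `bao` appears in the data, which is why a reader that reports them as `Exact` is overstating what the file guarantees.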
