Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-08-18 Thread via GitHub


alamb closed issue #15976: Treat truncated parquet stats as inexact
URL: https://github.com/apache/datafusion/issues/15976


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-07-20 Thread via GitHub


nssalian commented on issue #15976:
URL: https://github.com/apache/datafusion/issues/15976#issuecomment-3095182816

   take





Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-31 Thread via GitHub


CookiePieWw commented on issue #15976:
URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2925153931

   Thanks for your feedback! I found that `ValueStatistics` already has 
[`max_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.max_is_exact)
 and 
[`min_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.min_is_exact),
 so it seems we can use them directly. I've drafted a PR at 
https://github.com/apache/arrow-rs/pull/7574 :)





Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-29 Thread via GitHub


alamb commented on issue #15976:
URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2920485084

   > But I didn't find a method to access the ..exact flags in 
StatisticsConverter, so my plan is to first add functions similar to 
row_group_mins to the converter to extract the flags, which requires a change 
to arrow-rs first, and then collect and pass the extracted boolean array of 
flags to get_col_stats to decide whether to use Precision::Exact or 
Precision::Inexact.
   
   I think the first step needed in arrow-rs is to expose the 
[`is_max_value_exact`](https://docs.rs/parquet/latest/parquet/format/struct.Statistics.html#structfield.is_max_value_exact)
 and `is_min_value_exact` fields on the corresponding Rust struct 
(`ValueStatistics`):
   
   https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html
   
https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html





Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-29 Thread via GitHub


CookiePieWw commented on issue #15976:
URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2920132245

   Hi :) I've spent some time on this and found the problem in `get_col_stats`
   
https://github.com/apache/datafusion/blob/2c2f225926958b6abf06b01fcfb594017531043c/datafusion/datasource-parquet/src/file_format.rs#L1101-L1107
   Here we always wrap the stats in `Precision::Exact`, but we actually need 
to respect the `is_max_value_exact` and `is_min_value_exact` flags in the 
column metadata.
   
   The max and min values are extracted at 
   
https://github.com/apache/datafusion/blob/2c2f225926958b6abf06b01fcfb594017531043c/datafusion/datasource-parquet/src/file_format.rs#L1112-L1139
   But I didn't find a method to access the `..exact` flags in 
`StatisticsConverter`. My plan is to first add a function similar to 
`row_group_mins` to the converter to extract the flags, which requires a change 
to `arrow-rs`, and then collect the extracted boolean array of flags and pass 
it to `get_col_stats` so it can choose between `Precision::Exact` and 
`Precision::Inexact`.
   
   Please let me know if this direction makes sense.
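
A minimal sketch of the selection step described above, assuming the exactness flag has already been extracted per column. The `Precision` enum here is a local stand-in that only mirrors the variant names of DataFusion's type, and `wrap_stat` is a hypothetical helper, not an existing function in DataFusion or arrow-rs. It also assumes the flag is a plain `bool`; in the Parquet metadata the flag is optional, and how a missing flag should be treated is left open here.

```rust
/// Local stand-in for DataFusion's `Precision` enum (an assumption for this
/// sketch, redefined here so the example compiles without the crate).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: wrap a min or max value according to its exactness
/// flag instead of unconditionally using `Precision::Exact`.
fn wrap_stat<T>(value: Option<T>, is_exact: bool) -> Precision<T> {
    match (value, is_exact) {
        // The writer stored the true min/max.
        (Some(v), true) => Precision::Exact(v),
        // The writer truncated the value, so it is only a bound.
        (Some(v), false) => Precision::Inexact(v),
        // No statistic was recorded at all.
        (None, _) => Precision::Absent,
    }
}
```

With this shape, `wrap_stat(Some(10), true)` yields `Precision::Exact(10)`, while a truncated value such as `wrap_stat(Some("ab"), false)` yields `Precision::Inexact("ab")`.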





Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-22 Thread via GitHub


CookiePieWw commented on issue #15976:
URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2901868240

   take





[I] Treat truncated parquet stats as inexact [datafusion]

2025-05-07 Thread via GitHub


robert3005 opened a new issue, #15976:
URL: https://github.com/apache/datafusion/issues/15976

   ### Describe the bug
   
   When reading parquet files with truncated stats, DataFusion reports the 
min/max as exact even though metadata in the file indicates that the min/max 
has been truncated.
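
To illustrate why a truncated min/max is only a bound, here is a rough sketch of byte-wise truncation. Both `truncate_min` and `truncate_max` are hypothetical helpers written for this example; they are not the parquet crate's implementation, and a real writer may handle UTF-8 boundaries and other cases differently.

```rust
/// Hypothetical helper: keep at most `len` bytes of a minimum value.
/// A prefix always sorts at or below the original, so it remains a valid
/// lower bound, but it is no longer the actual minimum.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: keep at most `len` bytes of a maximum value and
/// increment the last byte that is not 0xFF, so the result sorts above the
/// original. Returns `None` when no such upper bound fits in `len` bytes.
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Nothing to truncate: the value is still exact.
        return Some(value.to_vec());
    }
    let mut out = value[..len].to_vec();
    // Drop trailing 0xFF bytes, then bump the first byte that can be bumped.
    while let Some(last) = out.pop() {
        if last < u8::MAX {
            out.push(last + 1);
            return Some(out);
        }
    }
    None
}
```

For the value `datafusion` and a limit of 4 bytes, the truncated minimum is `data` and the truncated maximum is `datb`. Neither value occurs in the column, which is why reporting them as exact is misleading.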
   
   ### To Reproduce
   
   Create a parquet file with truncated statistics and read it. The 
statistics on the table are reported as `Exact`.
   
   ### Expected behavior
   
   The statistics should be `Absent` or `Inexact`.
   
   ### Additional context
   
   _No response_

