[ https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace reassigned ARROW-12264:
-----------------------------------

    Assignee: Weston Pace

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> -------------------------------------------------------------------
>
>                 Key: ARROW-12264
>                 URL: https://issues.apache.org/jira/browse/ARROW-12264
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Antoine Pitrou
>            Assignee: Weston Pace
>            Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of
> floating-point statistics:
> {code}
>    * (*) Because the sorting order is not specified properly for floating
>    *     point values (relations vs. total ordering) the following
>    *     compatibility rules should be applied when reading statistics:
>    * - If the min is a NaN, it should be ignored.
>    * - If the max is a NaN, it should be ignored.
>    * - If the min is +0, the row group may contain -0 values as well.
>    * - If the max is -0, the row group may contain +0 values as well.
>    * - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
>       return and_(greater_equal(field_expr, literal(min)),
>                   less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if
> either value is NaN.
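
For illustration only, here is a minimal standalone sketch of the NaN-tolerant range check suggested in the issue. It does not use Arrow's expression API. The names GreaterEqualOrNan, LessEqualOrNan, FloatStats and MayContain are hypothetical and exist only in this example. The sketch assumes double-valued statistics and treats any comparison involving a NaN as "cannot rule this row group out", which is one way to satisfy the compatibility rules quoted from parquet.thrift.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not an Arrow function): true if a >= b, or if either
// operand is NaN, in which case the comparison carries no information.
bool GreaterEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a >= b;
}

// Hypothetical counterpart for the upper bound.
bool LessEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a <= b;
}

// Hypothetical stand-in for the min/max statistics of one column chunk.
struct FloatStats {
  double min;
  double max;
};

// Returns false only when the statistics prove that `value` cannot occur in
// the column chunk. A NaN min or max is ignored, and a NaN `value` is never
// ruled out, since min and max say nothing about NaNs.
bool MayContain(const FloatStats& stats, double value) {
  // Sanity check: bounds must not be inverted (always passes if one is NaN).
  assert(!(stats.min > stats.max));
  return GreaterEqualOrNan(value, stats.min) &&
         LessEqualOrNan(value, stats.max);
}
```

With this check, MayContain(FloatStats{1.0, 5.0}, std::nan("")) returns true, whereas the plain and_(greater_equal, less_equal) form would reject the NaN. The signed-zero rules need no special case here, because IEEE 754 comparisons already treat -0 and +0 as equal.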