[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738143#comment-17738143 ]
ASF GitHub Bot commented on PARQUET-2249: ----------------------------------------- JFinis commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770 ########## src/main/thrift/parquet.thrift: ########## @@ -966,6 +985,23 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: optional list<i64> null_counts + + /** + * A list of Boolean values to determine pages that contain only NaNs. Only + * present for columns of type FLOAT and DOUBLE. If true, all non-null + * values in a page are NaN. Writers are suggested to set the corresponding + * entries in min_values and max_values to NaN, so that all lists have the same + * length and contain valid values. If false, then either all values in the + * page are null or there is at least one non-null non-NaN value in the page. + * As readers are supposed to ignore all NaN values in bounds, legacy readers + * who do not consider nan_pages yet are still able to use the column index + * but are not able to skip only-NaN pages. + */ + 6: optional list<bool> nan_pages Review Comment: @gszadovszky Any writer reader/writer pair who writes NaNs into column indexes (and other places like page headers) and expects them to be there (and otherwise yields wrong results while reading) *is and never was* spec conforming. In older releases of the spec where NaN wasn't mentioned yet, such a writer was at least not violating the spec directly but even then NaN handling was basically "undefined behavior", as the spec never mentioned how to treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was already back then a non-portable assumption. Even today, a reader relying on one specific NaN semantics would already yield erroneous results when reading spec conforming Parquet files. E.g., if they search for NaNs and expect them to be in min/max, then they might filter Pages containing NaNs that don't have NaNs in their min/max. Consequently, such a reader is already broken; yes, writing [-Inf,Inf] into the column index would break such a reader more, but all bets are off here anyway already. It currently is just not possible to handle NaNs correctly in a portable way (that's what this PR is all about in the first place). So TBH backward compatibility to such a broken (or at least non-portable) reader/writer pair seems like an absolute non-goal to me. @pitrou A legacy reader who doesn't handle the new NaN semantics doesn't need to distinguish here. All they need to know is whether they should skip the page or shouldn't. A page with [-Inf,+Inf] can never be skipped, so regardless of whether the bounds are there due to NaNs or real infinities, a legacy reader would not skip the page and therefore yield correct results. A new reader that implements this PR can do the distinction via the nan_pages or value_counts computation. Note that actually *any* bounds are, mathematically speaking, correct for a page containing only NaNs (and will yield correct results on spec-conforming readers!). Note that min/max values in the column index don't need to be tight, according to the spec. So the only condition that must hold is that there is no value outside of the bounds (NaNs excluded). As an only-NaN page has no values, any bounds satisfy the condition, as there are no values that need to lie inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. Both would yield correct results on spec conforming readers. Actually the tighter the bounds, the more queries can skip the page. > Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs > ----------------------------------------------------------------------- > > Key: PARQUET-2249 > URL: https://issues.apache.org/jira/browse/PARQUET-2249 > Project: Parquet > Issue Type: Bug > Components: parquet-format > Reporter: Jan Finis > Priority: Major > > Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is > inconsistent, leading to cases where it is impossible to create a parquet > file that is conforming to the spec. > The problem is with double/float columns if a page contains only NaN values. > The spec mentions that NaN values should not be included in min/max bounds, > so a page consisting of only NaN values has no defined min/max bound. To > quote the spec: > {noformat} > * When writing statistics the following rules should be followed: > * - NaNs should not be written to min or max statistics > fields.{noformat} > However, the comments in the ColumnIndex on the null_pages member states the > following: > {noformat} > struct ColumnIndex { > /** > * A list of Boolean values to determine the validity of the corresponding > * min and max values. If true, a page contains only null values, and > writers > * have to set the corresponding entries in min_values and max_values to > * byte[0], so that all lists have the same length. If false, the > * corresponding entries in min_values and max_values must be valid. > */ > 1: required list<bool> null_pages{noformat} > For a page with only NaNs, we now have a problem. The page definitly does > *not* only contain null values, so {{null_pages}} should be {{false}} for > this page. However, in this case the spec requires valid min/max values in > {{min_values}} and {{max_values}} for this page. As the only value in the > page is NaN, the only valid min/max value we could enter here is NaN, but as > mentioned before, NaNs should never be written to min/max values. > Thus, no writer can currently create a parquet file that conforms to this > specification as soon as there is a only-NaN column and column indexes are to > be written. > I see three possible solutions: > 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's > null_pages entry set to {*}true{*}. > 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has > {{byte[0]}} as min/max, even though the null_pages entry is set to > {*}false{*}. > 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have > NaN as min & max in the column index. > None of the solutions is perfect. But I guess solution 3. is the best of > them. It gives us valid min/max bounds, makes null_pages compatible with > this, and gives us a way to determine only-Nan pages (min=max=NaN). > As a general note: I would say that it is a shortcoming that Parquet doesn't > track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't > have this inconsistency. In a future version, NaN counts could be introduced, > but that doesn't help for backward compatibility, so we do need a solution > for now. > Any of the solutions is better than the current situation where engines > writing such a page cannot write a conforming parquet file and will randomly > pick any of the solutions. > Thus, my suggestion would be to update parquet.thrift to use solution 3. > I.e., rewrite the comments saying that NaNs shouldn't be included in min/max > bounds by adding a clause stating that "if a page contains only NaNs or a > mixture of NaNs and NULLs, then NaN should be written as min & max". > -- This message was sent by Atlassian Jira (v8.20.10#820010)