[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788926#comment-17788926 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

etseidl commented on code in PR #221:
URL: https://github.com/apache/parquet-format/pull/221#discussion_r1402657927

## src/main/thrift/parquet.thrift: ##

@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
 struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BINARY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes (see LogicalTypes.md)

Review Comment:
'must encode' or 'must be encoded as'?

## src/main/thrift/parquet.thrift: ##

@@ -962,15 +967,19 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following

Review Comment:
This line threw me (at least while using my phone 😉...on my computer I can see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used for floating..." or "If this type-defined ordering..."

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788868#comment-17788868 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1823348321

Okay, finally done. As the new solution (total order) does not share a single line with the current solution and this PR gets quite long and contrived, I created a new PR: https://github.com/apache/parquet-format/pull/221

I hope this is fine. If you would rather have me continue in this PR, let me know; then I'll close the other one and update this one instead. Otherwise, let's continue the discussion about total order in the new PR :).

@tustvold @mapleFU @wgtmac @crepererum @etseidl @gszadovszky @pitrou FYI

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
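The contradiction described in the quoted issue can be reproduced with a small sketch. This is only an illustration under stated assumptions: `column_index_entry` and `is_spec_conforming` are hypothetical helper names, `None` stands in for a null value, and a missing bound is modelled as `None` rather than as an actual `byte[0]` entry.

```python
import math
from typing import Optional, Sequence, Tuple

Entry = Tuple[bool, Optional[float], Optional[float]]


def column_index_entry(page: Sequence[Optional[float]]) -> Entry:
    """Return (null_page, min, max) for one page under the current rules.

    None models a null value. NaNs are skipped when computing the bounds,
    as the statistics rules quoted in the issue require.
    """
    non_null = [v for v in page if v is not None]
    null_page = len(non_null) == 0
    candidates = [v for v in non_null if not math.isnan(v)]
    if not candidates:
        # Either the page is all-null (null_page is True, which is fine) or
        # it holds only NaNs and nulls (null_page is False, no bound exists).
        return null_page, None, None
    return null_page, min(candidates), max(candidates)


def is_spec_conforming(entry: Entry) -> bool:
    """null_pages == false requires valid min and max values."""
    null_page, lo, hi = entry
    return null_page or (lo is not None and hi is not None)


only_nan_page = [float("nan"), None, float("nan")]
entry = column_index_entry(only_nan_page)
conforming = is_spec_conforming(entry)
```

For `only_nan_page` the entry has `null_page` set to false but no bounds, so `conforming` is false, which is the situation the three proposed solutions try to resolve.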
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788866#comment-17788866 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis opened a new pull request, #221:
URL: https://github.com/apache/parquet-format/pull/221

This commit adds a new column order `IEEE754TotalOrder`, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16).

The advantage of the new order is a well-defined ordering between -0, +0, and the various possible bit patterns of NaNs. Thus, every possible bit pattern of a floating point value now has a well-defined position in the order, so two implementations can no longer apply different orders when the new column order is used.

With the default column order, there were many problems w.r.t. NaN values, which led to reading engines not being able to use statistics of floating point columns for scan pruning, even in the case where no NaNs were in the data set. The problems are discussed in detail in the next section.

This solution to the problem is the result of the extended discussion in https://github.com/apache/parquet-format/pull/196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as `nan_counts`, which was proposed in that PR). Please refer to the discussion in that PR for the details of why this solution was chosen over various design alternatives.

Note that this solution is fully backward compatible and should break neither old readers nor old writers, as a new column order is added. Legacy writers can continue to write the default type-defined order instead of the new order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand, and should therefore simply not use the statistics for columns ordered using the new order.
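To make the claim about a well-defined order for every bit pattern concrete, here is a minimal sketch of an IEEE 754 total-order sort key for DOUBLE values. It is not taken from the PR; `total_order_key` and `from_bits` are hypothetical helper names, and the bit trick (flip all bits of negative values, set the sign bit of non-negative ones) is one common way to realise the total order.

```python
import struct


def from_bits(bits: int) -> float:
    """Build a double from its raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x: float) -> int:
    """Map a double to an unsigned integer whose natural order matches
    the IEEE 754 total order of the original values."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits >> 63:
        # Negative sign: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFF_FFFF_FFFF_FFFF
    # Non-negative sign: set the top bit so these sort after all negatives.
    return bits | (1 << 63)


POS_NAN = from_bits(0x7FF8_0000_0000_0000)  # quiet NaN, sign bit clear
NEG_NAN = from_bits(0xFFF8_0000_0000_0000)  # quiet NaN, sign bit set

values = [POS_NAN, 1.0, 0.0, -0.0, float("-inf"), NEG_NAN, float("inf"), -1.0]
ordered = sorted(values, key=total_order_key)

# Under this order, min/max statistics are defined for every page,
# including pages that contain NaNs.
page_min = min(values, key=total_order_key)
page_max = max(values, key=total_order_key)
```

With these inputs, `ordered` runs from the negative NaN through -inf, -1.0, -0.0, +0.0, 1.0, and +inf to the positive NaN, so -0 and +0 are distinguished and NaNs sit at the two ends of the order.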
The remainder of this message explains in detail what the problems are and how the proposed solution fixes them.

Problem Description
===================

Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely, the following problems exist:

Statistics don't tell whether NaNs are present
----------------------------------------------

As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might not seem too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs, and most of the systems do order NaNs, either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that NaN is to be treated as greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0, the engine would not be able to skip the page, as the page might contain NaNs, which would need to be included in the query result.

Likewise, the predicate x < 1.0 should include NaN if NaN is treated as less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped by the reader in this case.

Thus, even if a user doesn't query for NaN explicitly, they might use other predicates that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem at first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as stated, that can be many seemingly innocuous greater-than or less-than predicates in most database systems. Conversely, it would be nice if Parquet enabled scan pruning in these cases, regardless of whether the reader and writer agree on whether NaN is smaller than, greater than, or incomparable to all other values.

Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the far more common case of NaNs not being present.

Handling NaNs in a ColumnIndex
------------------------------

There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) value
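The x > 1.0 example from the section "Statistics don't tell whether NaNs are present" can be sketched as a page-skipping decision. This is a simplified illustration, not code from any reader implementation: `NanOrder`, `can_skip_page_for_greater_than`, and the `page_may_contain_nan` flag are all hypothetical names introduced here.

```python
from enum import Enum


class NanOrder(Enum):
    """How the reading engine orders NaN relative to other values."""
    GREATEST = "greatest"  # NaN is greater than any other value
    LEAST = "least"        # NaN is less than any other value


def can_skip_page_for_greater_than(
    page_max: float,
    threshold: float,
    nan_order: NanOrder,
    page_may_contain_nan: bool,
) -> bool:
    """Decide whether a page can be skipped for the predicate x > threshold.

    page_max is the max statistic, which excludes NaNs under the current
    rules, so NaNs have to be accounted for separately.
    """
    if page_max > threshold:
        return False  # some non-NaN value may satisfy the predicate
    if nan_order is NanOrder.GREATEST and page_may_contain_nan:
        return False  # a hidden NaN would satisfy x > threshold
    return True


# With today's statistics a reader cannot rule out NaNs, so the flag is True.
skip_if_nan_greatest = can_skip_page_for_greater_than(0.0, 1.0, NanOrder.GREATEST, True)
skip_if_nan_least = can_skip_page_for_greater_than(0.0, 1.0, NanOrder.LEAST, True)
```

An engine that treats NaN as greatest cannot skip the page with max = 0.0, while one that treats NaN as least can. If the reader could prove the absence of NaNs (the flag set to false), both engines could skip it.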
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784917#comment-17784917 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805653152

Congratulations! Take all the time you need, there is no urgency on this from my end, just wanted to avoid things stalling out
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784693#comment-17784693 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805207632

I hate to not stick to my word, but I won't be able to create the PR today, as my wife is going into labor and I'll have to drive her to the clinic soon 😅.

I pushed the status I have so far to my fork. You can already have a look if you want: https://github.com/jfinis/parquet-format/tree/totalorder

The commit is basically done, I just wanted to proof read everything and write a descriptive message for the commit and the PR. I'll find some time once we're back from the hospital, i.e., in a few days. But for now, I first need to deliver something else 👶.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784578#comment-17784578 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1804343753

@tustvold I actually already have the change ready in my local repo. I was just distracted by other work and it seemed there was little interest so far in advancing this quickly, so I didn't update it on github, yet. I can update the PR tomorrow :).
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784386#comment-17784386 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1803642293

Just coming back to this as it has come up a bit downstream, the approach described in https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697 makes a lot of sense to me. Would it help move this forward if I were to raise a separate PR proposing it?

> parquet-mr can efficiently implement this sort order

Provided Java provides some mechanism to interpret a float as an integer, it is just a case of some bit operations - https://doc.rust-lang.org/src/core/num/f64.rs.html#1336

> Total ordering is nice if the goal is to order the data
> If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless

Why would filter predicates not also need a well-defined order? FWIW arrow-rs uses total order for **all** floating point comparison.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741481#comment-17741481 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1628384514

To support old readers with the statistics, we can choose to write `TypeDefinedOrder` for FP values when there are no `NaN` values in the data.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is inconsistent, leading to cases where it is impossible to create a parquet file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values. The spec mentions that NaN values should not be included in min/max bounds, so a page consisting of only NaN values has no defined min/max bound. To quote the spec:
> {noformat}
>  * When writing statistics the following rules should be followed:
>  * - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does *not* only contain null values, so {{null_pages}} should be {{false}} for this page. However, in this case the spec requires valid min/max values in {{min_values}} and {{max_values}} for this page. As the only value in the page is NaN, the only valid min/max value we could enter here is NaN, but as mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this specification as soon as there is an only-NaN column and column indexes are to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has {{byte[0]}} as min/max, even though the null_pages entry is set to {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of them. It gives us valid min/max bounds, makes null_pages compatible with this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't have this inconsistency. In a future version, NaN counts could be introduced, but that doesn't help for backward compatibility, so we do need a solution for now.
> Any of the solutions is better than the current situation where engines writing such a page cannot write a conforming parquet file and will randomly pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. I.e., rewrite the comments saying that NaNs shouldn't be included in min/max bounds by adding a clause stating that "if a page contains only NaNs or a mixture of NaNs and NULLs, then NaN should be written as min & max".

--
This message was sent by Atlassian Jira (v8.20.10#820010)
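The contradiction described in the issue can be made concrete with a small sketch. It is only an illustration: the function name `column_index_entry` is hypothetical, values are kept as Python floats rather than encoded bytes, and the logic simply applies the two quoted rules (no NaNs in min/max; valid min/max whenever `null_pages` is false) as literally as possible.

```python
import math


def column_index_entry(page):
    """Return (null_page, min_value, max_value) for one page.

    Hypothetical helper that follows the two quoted spec rules literally.
    `page` is a list of floats, with None standing in for null.
    """
    non_null = [v for v in page if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and min/max are empty (byte[0]).
        return True, b"", b""

    # Rule 1: NaNs must not be written to min or max.
    comparable = [v for v in non_null if not math.isnan(v)]
    if not comparable:
        # Rule 2 demands valid min/max because null_pages is false,
        # but rule 1 leaves nothing that could be written.
        raise ValueError(
            "only-NaN page: null_pages must be false, "
            "yet no non-NaN min/max exists"
        )
    return False, min(comparable), max(comparable)
```

A page such as `[float("nan"), None]` ends up in the `ValueError` branch, which is the case the three proposed solutions try to resolve.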
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741310#comment-17741310 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1627373091

The new `IEEE754TotalOrder` looks elegant to me, though a single NaN value may still ruin the page index. Another challenge is how parquet-mr can efficiently implement this sort order.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741127#comment-17741127 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

westonpace commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625662823

> CON: NaNs will be used in min/max bounds, even for not only-NaN pages. This makes them less effective for filtering (as they are the widest possible bounds) but @crepererum made a good point that this "special case for NaN" is quite arbitrary and we could also special case INT_MAX for integer columns, e.g. I see the argument that keeping the architecture simple might be preferable. Also NaNs are not widely used, so this will not be detrimental to many data sets.

I agree this is a con. Total ordering is nice if the goal is to order the data. If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless. However, I also agree with @crepererum that this is a slippery slope and I agree with @JFinis that NaNs are not widely used and simpler is better. I don't entirely agree the solution can always be to replace NaN/Infinity with NULL but the cases where it can't are probably very rare. Besides, the penalty here is only a performance loss and not incorrect results, so it's more manageable.

So, on balance, I'd say I'm neutral. If there are other advantages to this approach then the disadvantages to dataset filtering are probably not enough to outweigh them. We might want to add a small sentence to some kind of pyarrow or parquet documentation somewhere so that users can be aware of this.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741077#comment-17741077 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625546137

@JFinis Thanks a lot! I agree that makes sense. The main problem IMHO is that old readers wouldn't support page filtering on such files. That said, we have to move forward somehow.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741072#comment-17741072 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697

> > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?
>
> It's difficult to say without understanding the implications. Say a data page contains some NaNs, what happens?

@pitrou

On the write path:
* The writing library would set the `ColumnOrder` for this column to the new option, let's call it `IEEE754TotalOrder`.
* The writing library would use IEEE 754 total order for all order / sorting related tasks. I.e., it would compute the min and max values of the page using that total order. That order has a defined place for NaN. The writer would *not* have special logic for NaN. It would just order everything using total order. E.g., in case of a page containing a positive NaN, this would be chosen as the max value, as NaN > everything else in the total order.

On the read path:
* A reading engine encountering the new `IEEE754TotalOrder` column order would either
  a) (legacy reader) not understand it and in this case not use any statistics, as it doesn't understand the semantics of the order relation.
  b) (new reader) understand it and assume that all order is in IEEE 754 total order, which again has a defined place for NaNs. Depending on the NaN semantics of the reading engine, it would need to make sure to align the values it sees in min/max with its own semantics. What this alignment looks like depends on the semantics of the engine. (I can give more detailed examples for different engine semantics if necessary.)

Ramifications:
* PRO: Due to the new column order, legacy readers are guarded. They don't need to understand the new order. Even if they ignore the column order, if they see NaNs in min and max they will just ignore them and assume the statistics aren't usable. So we have two layers of protection to make sure legacy readers don't misunderstand the ordering.
* PRO: No special fields for NaNs. No `nan_counts`, no `nan_pages`. Instead, NaNs are just treated as defined in the total ordering.
* PRO: Simple standardized handling of floats instead of special handling of NaNs. I guess this was the main point of @tustvold and @crepererum.
* PRO: Engines only need to understand total ordering and don't need any special NaN handling code anymore (unless their semantics are different, in which case they need to map their semantics from / to total ordering).
* CON: NaNs *will* be used in min/max bounds, even for not only-NaN pages. This makes them less effective for filtering (as they are the widest possible bounds) but @crepererum made a good point that this "special case for NaN" is quite arbitrary and we could also special case INT_MAX for integer columns, e.g. I see the argument that keeping the architecture simple might be preferable. Also NaNs are not widely used, so this will not be detrimental to many data sets.
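The read path described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the order names are plain strings, the function `page_may_match_greater_than` is hypothetical, and the engine is assumed to use semantics where a NaN never satisfies `x > threshold`. Other engine semantics would need a different alignment step.

```python
import math

# Column orders each kind of reader understands (names taken from the thread).
LEGACY_READER_ORDERS = frozenset({"TypeDefinedOrder"})
NEW_READER_ORDERS = frozenset({"TypeDefinedOrder", "IEEE754TotalOrder"})


def page_may_match_greater_than(column_order, known_orders,
                                min_value, max_value, threshold):
    """Return True if the page has to be read for the filter `x > threshold`.

    Assumes an engine in which NaN never satisfies a comparison.
    """
    if column_order not in known_orders:
        # Legacy reader: unknown order, so statistics are ignored entirely.
        return True

    if column_order == "IEEE754TotalOrder":
        if math.isnan(max_value):
            # +NaN sorts above +inf, so the real upper bound is unknown.
            # A -NaN max means the page holds only negative NaNs,
            # none of which can match under the assumed semantics.
            return math.copysign(1.0, max_value) > 0
        return max_value > threshold

    # TypeDefinedOrder: NaN bounds carry no usable information.
    if math.isnan(min_value) or math.isnan(max_value):
        return True
    return max_value > threshold
```

The `(1.0, NaN)` case shows the cost raised in the CON item: a single positive NaN in a page forces the page to be read, which costs performance but does not produce wrong results.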
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741049#comment-17741049 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625464720

> @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?

It's difficult to say without understanding the implications. Say a data page contains some NaNs, what happens?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741022#comment-17741022 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625392665

> I guess this can also be implemented in each language by "bit casting" the float bits to integer bits and doing an integer comparison, correct?

It's a bit more than a simple bit cast, but broadly speaking yes. https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
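To show why a plain bit cast is not quite enough, here is a minimal Python sketch of the usual trick for 64-bit floats: reinterpret the bits as a signed integer and, for values with the sign bit set, flip the remaining 63 bits. The helper names `raw_bits` and `total_order_key` are made up for this example; the approach mirrors the idea behind the linked Rust code but is not a transcription of it.

```python
import struct


def raw_bits(x: float) -> int:
    """Reinterpret the IEEE 754 binary64 bits of x as a signed 64-bit int."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    return bits


def total_order_key(x: float) -> int:
    """Integer key whose ordering matches IEEE 754 total order for doubles.

    Non-negative floats already sort correctly as signed integers.
    For floats with the sign bit set, a larger magnitude has larger raw
    bits but must sort lower, so the 63 non-sign bits are flipped.
    """
    bits = raw_bits(x)
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits
```

With the plain cast, `raw_bits(-1.0) < raw_bits(-2.0)` holds, which is the wrong way round; `total_order_key` fixes this and also places -NaN below -inf, -0.0 below +0.0, and +NaN above +inf.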
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741009#comment-17741009 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625354736

@tustvold @crepererum Do I interpret your answer correctly in that your suggestion would be to

* Create a new `ColumnOrder` for floats that is simply defined as IEEE 754 total order, if we need such a new order for backward compatibility (which we probably do, as apparently parquet-mr will otherwise perform filtering incorrectly).
* When that order is used, don't handle NaNs explicitly. Instead, just use the total order relation for ordering and min/max computation (which will result in NaNs being written as max and -NaNs being written as min if they exist).

Did I get this right?

I guess this can also be implemented in each language by "bit casting" the float bits to integer bits and doing an integer comparison, correct? So even if the underlying language doesn't have native support for total ordering, it should still be possible to implement this.

I do see a certain beauty in this approach in that it is "simple". As always, I'm happy to adapt my PR to this approach if we can get consensus that we want this.

@mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* contain only null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are
> to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation, where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".
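
The conflict described in the issue can be reproduced with a small sketch. This is only an illustration under stated assumptions: `page_stats` is a hypothetical helper (not part of any Parquet library), a page is modelled as a Python list where `None` stands for null, and `None` bounds stand for "no valid min/max".

```python
import math
from typing import List, Optional, Tuple

Stats = Tuple[bool, Optional[float], Optional[float]]


def page_stats(values: List[Optional[float]]) -> Stats:
    """Hypothetical helper: return (null_page, min, max) for one page.

    Follows the current wording of the spec: nulls and NaNs never
    enter the min/max bounds.
    """
    non_null = [v for v in values if v is not None]
    candidates = [v for v in non_null if not math.isnan(v)]
    null_page = len(non_null) == 0
    if not candidates:
        # No value is allowed to serve as a bound.
        return null_page, None, None
    return null_page, min(candidates), max(candidates)


mixed = page_stats([1.5, float("nan"), None, -2.0])
only_null = page_stats([None, None])
only_nan = page_stats([float("nan"), None])

print(mixed)      # ordinary page: null_page is False and bounds exist
print(only_null)  # null page: null_page is True, empty bounds are allowed
print(only_nan)   # the problem case: null_page is False, yet no bounds exist
```

For the only-NaN page the flag is `False`, so the spec demands valid bounds, but the NaN rule leaves nothing to write. The three solutions listed in the issue are different ways of resolving exactly this last case.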
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741001#comment-17741001 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625347116

Okay, `[-NaN, +NaN]` as min-max would be ignored in C++ Statistics. I'm OK with these solutions.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740984#comment-17740984 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625304128

> I think we already have type-defined order

Indeed, and what I am suggesting is that, rather than layering on more complexity to work around the problems of such an approach, we just remove this complexity.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740976#comment-17740976 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625277042

I think we already have a type-defined order, and already exclude +inf and -inf. And now, if a page is all `NaN`, the page would be excluded.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740966#comment-17740966 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

crepererum commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625220272

I agree w/ @tustvold's standpoint. Some thoughts on top of what he wrote:

IMHO this is leaking application details into the storage format. If you start to differentiate NaN from "all normal values" and NULL, you may do the same for +/-Inf, because it also acts as a poison value in most computations. But you may also do that for "nearly Inf", because someone divided by "nearly zero" and these super big values are equally nonsensical.

This whole discussion isn't even specific to floats. Why do boolean stats not count true/false separately? What about empty strings and byte arrays? Or empty lists in general?

My point is: this is opening a can of worms and the complexity isn't worth the gain. The better alternative is: let the user cast invalid values to NULL if they want to exclude them from their data, because this is exactly what missing values were invented for. If they still want to store broken data and want to have some niche understanding of statistics, provide a way to attach application-defined stats to parquet (this extends to a number of histogram types or counts of other "special" values).

Keep the storage format baseline simple. IEEE total ordering is well defined and universally agreed upon. I think the world doesn't need yet another special floating point treatment.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739052#comment-17739052 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614549513

@mapleFU, as I've written before, that's why we initiated [ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863): to make the format open to specifying orderings. I don't know how the other implementations use this already. In the current parquet-mr (since we introduced `ColumnOrder`) there is logic that drops any statistics if the defined column order is not known. So we can safely initiate a new one.

We can say that if the min/max value would contain a NaN, then we would write the new `IEEE_754` column order, otherwise `TYPE_ORDER`. In this case we can simply skip the additional lists for marking all-NaN pages and write the NaN values into the statistics instead. The question is how older readers of the other implementations would handle an unknown `ColumnOrder`.

It is an implementation detail that the NaN handling in Java is different from what IEEE 754 says. Java has only one NaN bit pattern. So handling this ordering will require additional work. I hope it can be implemented in a performant way.
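
The writer and reader behaviour sketched in this comment could look roughly as follows. This is a hedged illustration, not parquet-mr code: the column orders are modelled as plain strings, and `choose_column_order`, `usable_stats` and `KNOWN_ORDERS` are hypothetical names introduced only for this example.

```python
import math
from typing import Iterable, Optional, Set

# Assumption: what a legacy reader understands. Real readers work with
# the Thrift ColumnOrder union, not with strings.
KNOWN_ORDERS: Set[str] = {"TYPE_ORDER"}


def choose_column_order(values: Iterable[Optional[float]]) -> str:
    """Writer side: use the new order only if a NaN would reach min/max."""
    has_nan = any(v is not None and math.isnan(v) for v in values)
    return "IEEE_754" if has_nan else "TYPE_ORDER"


def usable_stats(column_order: str, stats: dict,
                 known: Set[str] = KNOWN_ORDERS) -> Optional[dict]:
    """Reader side: drop statistics written under an unknown column order."""
    return stats if column_order in known else None


plain_order = choose_column_order([1.0, None, 3.5])
nan_order = choose_column_order([1.0, float("nan")])

# A legacy reader keeps the stats it understands and ignores the rest.
kept = usable_stats(plain_order, {"min": 1.0, "max": 3.5})
dropped = usable_stats(nan_order, {"min": 1.0, "max": float("nan")})
```

The point of the design is that a reader which drops statistics for unknown orders never misinterprets NaN bounds; it only loses the ability to prune on those columns.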
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739036#comment-17739036 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614520051

> Currently the arrow-rs implementation uses the totalOrder predicate as defined by the IEEE 754 (2008 revision) floating point standard to order floats, this can be very efficiently implemented using some bit-twiddling and at least appears to define the standardised way to handle this.

So arrow-rs handles float/double comparisons nicely. I guess we only need to make sure that the new data will not be broken by a stale parquet-mr reader? @gszadovszky
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739025#comment-17739025 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748

> I wonder for PageIndex pruning in Rust implementations

Currently the arrow-rs implementation uses the totalOrder predicate as defined by the IEEE 754 (2008 revision) floating point standard to order floats. This can be implemented very efficiently using some bit-twiddling, and at least appears to define the standardised way to handle this. I believe DataFusion is using these same comparison kernels for evaluating pruning predicates, and so I would expect it to have similar behaviour with regard to NaNs.

From the [Rust docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):

> The values are ordered in the following sequence:
>
> - negative quiet NaN
> - negative signaling NaN
> - negative infinity
> - negative numbers
> - negative subnormal numbers
> - negative zero
> - positive zero
> - positive subnormal numbers
> - positive numbers
> - positive infinity
> - positive signaling NaN
> - positive quiet NaN.

> would it matter for adding [-inf, +inf] as min-max for all nan and null pages

I haven't read the full backscroll, but the original PR's suggestion of just writing a NaN for a page only containing NaN seems perfectly logical to me, unlikely to cause compatibility issues, and significantly less surprising than writing a value that doesn't actually appear in the data...

> Let's cc some of the maintainers of [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):

I don't really know enough about the history of floating point comparison to weigh in on what the best solution is with any degree of authority. However, my 2 cents is that the totalOrder predicate is the standardised way to handle this. Whilst I do agree that the behaviour of aggregate statistics containing NaNs might be unfortunate for some workloads, I'm not sure that special-casing them is beneficial. Aside from the non-trivial additional complexity associated with special-casing them, if you don't include NaNs in statistics it is unclear to me how you can push down a comparison predicate, as you have no way to know if the page contains NaNs. Perhaps that is what this PR seeks to address, but I do wonder if the simple solution might be worth considering...

Also tagging @crepererum, who may have further thoughts.
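
The "bit-twiddling" mentioned above can be sketched in a few lines. This is a minimal illustration for 64-bit doubles, not the arrow-rs implementation; `total_order_key` and `from_bits` are hypothetical helper names. The idea is to reinterpret the float's bits as a signed integer and, for values with the sign bit set, flip the remaining 63 bits so that larger magnitudes sort lower.

```python
import struct

MASK_63 = 0x7FFFFFFFFFFFFFFF  # every bit except the sign bit


def total_order_key(x: float) -> int:
    """Integer key whose natural order follows IEEE 754 totalOrder."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))  # signed 64-bit view
    # Sign bit set: flip the other 63 bits so that -inf < -1.0 < -0.0.
    return bits ^ MASK_63 if bits < 0 else bits


def from_bits(pattern: int) -> float:
    """Build a double from an explicit bit pattern (used for signed NaNs)."""
    (value,) = struct.unpack("<d", struct.pack("<Q", pattern))
    return value


neg_nan = from_bits(0xFFF8000000000000)  # negative quiet NaN
pos_nan = from_bits(0x7FF8000000000000)  # positive quiet NaN

values = [pos_nan, 1.0, 0.0, -0.0, -1.0,
          float("inf"), float("-inf"), neg_nan]
ordered = sorted(values, key=total_order_key)

# NaN != NaN, so inspect the result via bit patterns instead of equality.
ordered_hex = [struct.pack(">d", v).hex() for v in ordered]
print(ordered_hex)
```

Under this key, `min` and `max` over a page fall out of a plain integer comparison: a negative NaN becomes the minimum and a positive NaN the maximum, which is the behaviour discussed in this thread. Signaling NaNs are left out of the example because some platforms quiet them when the value is loaded.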
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739008#comment-17739008 ]

ASF GitHub Bot commented on PARQUET-2249:

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247688333

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Trying to catch up on the discussion. I like the idea of writing either [-inf, +inf] or [-0.0, +0.0] for NaN-only pages. As NaN values do not have a well-defined order across systems, simply leveraging page min/max values to filter NaN does not make any sense. Therefore I think this design can break such misuses.
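The proposal in this comment can be sketched from the writer's side. This is an illustrative Python sketch only; `IndexEntry` and `build_entry` are hypothetical names, and `None` stands in for the `byte[0]` placeholder that the spec requires for null pages. It assumes the `[-inf, +inf]` variant of the idea together with the `nan_pages` flag from the diff above.

```python
import math
from typing import NamedTuple, Optional, Sequence


class IndexEntry(NamedTuple):
    """One page's worth of column index data (hypothetical layout)."""
    null_page: bool
    nan_page: bool
    min_value: Optional[float]  # None stands in for byte[0]
    max_value: Optional[float]


def build_entry(values: Sequence[Optional[float]]) -> IndexEntry:
    non_null = [v for v in values if v is not None]
    if not non_null:
        # All values are null: null_pages is true, bounds are placeholders.
        return IndexEntry(True, False, None, None)
    ordered = [v for v in non_null if not math.isnan(v)]
    if not ordered:
        # All non-null values are NaN: flag the page and write non-NaN bounds.
        return IndexEntry(False, True, -math.inf, math.inf)
    # Regular page: NaNs are ignored when computing bounds.
    return IndexEntry(False, False, min(ordered), max(ordered))


print(build_entry([float("nan"), None]))
print(build_entry([3.0, float("nan"), 1.0]))
```

The point of the sketch is that every entry now carries bounds that are valid, non-NaN values, so a reader that knows nothing about `nan_pages` never has to compare against NaN.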
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738996#comment-17738996 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614407651

https://github.com/apache/parquet-format/pull/196#discussion_r1237381221

@alamb @tustvold Hi, for PageIndex pruning in the Rust implementation, would it matter if `[-inf, +inf]` were added as min/max for all-NaN and null pages? Would it harm the column pruning for `IS_NAN` or other operations?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".

-- This message was sent by Atlassian Jira (v8.20.10#820010)
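Solution 3 in the issue description relies on readers recognising `min = max = NaN`. As a hedged illustration in Python (the function name `is_only_nan_page` is hypothetical), the check cannot be written as an equality test, because NaN never compares equal to itself; it needs an explicit NaN test on both bounds.

```python
import math


def is_only_nan_page(min_value: float, max_value: float) -> bool:
    """Detect the min=max=NaN marker that solution 3 would write."""
    return math.isnan(min_value) and math.isnan(max_value)


nan = float("nan")
print(nan == nan)                  # False: equality cannot find the marker
print(is_only_nan_page(nan, nan))  # True
print(is_only_nan_page(1.0, 2.0))  # False
```

A reader that skips the explicit test and compares the bounds directly is the kind of existing reader the later comments in this thread worry about.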
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738993#comment-17738993 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247671677

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
I think `[-inf, +inf]` is OK. For now, I guess only the Rust implementation and parquet-mr have the potential problem.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738941#comment-17738941 ]

ASF GitHub Bot commented on PARQUET-2249:

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247575939

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@mapleFU, I did not think about any specific implementation. (TBH, I only have experience with parquet-mr.) This is mentioned in the PR description. Maybe we do not have any such implementations.

@JFinis, I agree we should not care about the potential systems already writing NaN values into column indexes. I also agree that writing NaN values to min/max is risky for existing systems. So we need to write valid non-NaN values to min/max for all-NaN pages. (And of course mark them with either `nan_pages` or `value_counts`.)
The more we narrow the range, the higher the chance that the page will be dropped during filtering, which is good because we should not search for NaN values based on the spec anyway. What do you think about `[-Inf, -Inf]`? The worst case is that we read the page of all-NaN values instead of dropping it. In this very case we would not drop it for `< x`-like cases.
(This turned out to be the rephrasing and summary of your previous comments. :smile: )
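The effect of narrowing the range can be made concrete with a small pruning sketch. This is an assumption-laden illustration in Python, not the filtering code of parquet-mr or any other reader; `page_may_match` is a hypothetical helper that decides, from bounds alone, whether a page could contain a value satisfying a comparison against a non-NaN literal `x`.

```python
import math


def page_may_match(op: str, x: float, lo: float, hi: float) -> bool:
    """Bounds-only check: could some value v in [lo, hi] satisfy `v op x`?"""
    if op == "<":
        return lo < x
    if op == ">":
        return hi > x
    if op == "==":
        return lo <= x <= hi
    raise ValueError(f"unsupported operator: {op}")


ops = ("<", ">", "==")
wide = (-math.inf, math.inf)     # [-Inf, +Inf] for an all-NaN page
narrow = (-math.inf, -math.inf)  # [-Inf, -Inf] for an all-NaN page

# Operators for which the page must still be read when filtering against 0.0.
print([op for op in ops if page_may_match(op, 0.0, *wide)])    # ['<', '>', '==']
print([op for op in ops if page_may_match(op, 0.0, *narrow)])  # ['<']
```

With the wide bounds the all-NaN page is never dropped. With `[-Inf, -Inf]` it is dropped for `>` and `==` and only survives `< x`, which matches the worst case named in the comment.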
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738171#comment-17738171 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245383453

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
> there was an argument that some writers already write NaN values into column indexes. Hence, they try to filter on NaN values. Now, we start writing `[-Inf,+Inf]` for NaN only pages. NaN is probably out of `[-Inf,+Inf]` interval so that reader would drop the only NaN page while searching for a NaN.

Hi Gabor, which implementation does that? I checked the C++ implementation, and it doesn't do this. Maybe we can do a check here? I guess `[-inf, +inf]` sounds OK.
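The quoted concern, that NaN falls outside `[-Inf, +Inf]`, follows from IEEE 754 comparison semantics, where every ordered comparison involving NaN is false. A minimal Python sketch, assuming a hypothetical reader that searches for NaN with a plain range check (`in_bounds` is an illustrative name, not an API of any implementation):

```python
import math


def in_bounds(x: float, lo: float, hi: float) -> bool:
    """Plain range check, as a reader unaware of NaN semantics might write it."""
    return lo <= x <= hi


nan = float("nan")
# Every ordered comparison with NaN is false, so NaN is "outside" any interval.
print(in_bounds(nan, -math.inf, math.inf))     # False
print(in_bounds(1.0e308, -math.inf, math.inf)) # True
```

Such a reader would drop an only-NaN page whose bounds are `[-Inf, +Inf]` when looking for NaN. Whether any existing implementation behaves this way is exactly the open question in this comment.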
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738169#comment-17738169 ]

ASF GitHub Bot commented on PARQUET-2249:

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1611623088

Let's cc some of the maintainers of [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet): @adamgs @tustvold @alamb
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738163#comment-17738163 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@pitrou writing anything but NaN into min/max was one of my suggestions to circumvent the problem that [parquet-mr doesn't seem to check for NaN values in min/max while reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931) and therefore would probably yield wrong results once we start writing NaNs into these values.

This would only work if we go back to maintaining either `nan_pages` or `value_counts`, though. Otherwise, as you correctly pointed out, we wouldn't have a way to distinguish only-NaN pages from pages with real infinities.
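The need for `nan_pages` (or `value_counts`) alongside non-NaN bounds can be sketched from the reader's side. This is an illustrative Python sketch with a hypothetical `describe_page` helper; it assumes the page is not a null page and that a missing flag is represented as `None`.

```python
import math
from typing import Optional


def describe_page(lo: float, hi: float, nan_page: Optional[bool]) -> str:
    """Classify a non-null page from its bounds plus an optional nan_pages flag."""
    if nan_page is True:
        return "only-NaN"      # bounds are placeholders; skip for comparisons
    if nan_page is False:
        return "not only-NaN"  # bounds describe real, ordered values
    # No flag available: [-Inf, +Inf] could be placeholders or real infinities.
    if lo == -math.inf and hi == math.inf:
        return "ambiguous"
    return "not only-NaN"


print(describe_page(-math.inf, math.inf, True))   # only-NaN
print(describe_page(-math.inf, math.inf, False))  # not only-NaN
print(describe_page(-math.inf, math.inf, None))   # ambiguous
```

Two pages with identical `[-Inf, +Inf]` bounds are indistinguishable without the flag, which is the distinction between only-NaN pages and real infinities raised in the comment.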
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738162#comment-17738162 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253

##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@pitrou Writing anything but NaN into min/max was my suggestion to circumvent the problem that [parquet-mr doesn't seem to check for NaN values in min/max while reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931) and would therefore probably yield wrong results once we start writing NaNs into these fields.

This only works if we go back to maintaining either `nan_pages` or `value_counts`, though; otherwise, as you correctly pointed out, we would have no way to distinguish only-NaN pages from pages with real infinities.

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738146#comment-17738146 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245293430

##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
> A new reader that implements this PR can do the distinction via the nan_pages or value_counts computation.

Wait... I thought the `[-Inf, +Inf]` convention was meant to avoid a new `nan_pages` or `value_counts` field? If not, then what's the point?

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738141#comment-17738141 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770

##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@gszadovszky Any reader/writer pair that writes NaNs into column indexes (and other places like page headers) and expects them to be there (and otherwise yields wrong results while reading) *is not and never was* spec-conforming. In older releases of the spec, where NaN wasn't mentioned yet, such a writer was at least not violating the spec directly, but even then NaN handling was basically "undefined behavior", as the spec never said how to treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was a non-portable assumption even back then.

Even today, a reader relying on one specific NaN semantics already yields erroneous results when reading spec-conforming Parquet files. E.g., if it searches for NaNs and expects them to be in min/max, then it might filter out pages that contain NaNs but don't have NaNs in their min/max. Consequently, such a reader is already broken; yes, writing [-Inf,+Inf] into the column index would break such a reader more, but all bets are off here anyway. It is currently just not possible to handle NaNs correctly in a portable way (that's what this PR is all about in the first place). So TBH, backward compatibility with such a broken (or at least non-portable) reader/writer pair seems like an absolute non-goal to me.

@pitrou A legacy reader that doesn't handle the new NaN semantics doesn't need to distinguish here. All it needs to know is whether or not it should skip the page. A page with [-Inf,+Inf] can never be skipped, so regardless of whether the bounds are there due to NaNs or real infinities, a legacy reader would not skip the page and would therefore yield correct results. A new reader that implements this PR can make the distinction via the nan_pages or value_counts computation.

Note that actually *any* bounds are, mathematically speaking, correct for a page containing only NaNs (and will yield correct results on spec-conforming readers!). According to the spec, min/max values in the column index don't need to be tight. So the only condition that must hold is that there is no value outside of the bounds (NaNs excluded). As an only-NaN page has no non-NaN values, any bounds satisfy the condition, as there are no values that need to lie inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. Both would yield correct results on spec-conforming readers. Actually, the tighter the bounds, the more queries can skip the page.

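
The skipping argument in JFinis's comment above can be illustrated with a small model of a reader that prunes pages for a range predicate. This is a sketch under stated assumptions, not parquet-mr's logic: `can_skip` is a hypothetical helper and the predicate is a closed interval over non-NaN values.

```python
import math


def can_skip(page_min, page_max, pred_lo, pred_hi):
    """Hypothetical pruning check for the predicate pred_lo <= x <= pred_hi.

    A page may be skipped only if its [page_min, page_max] bounds do not
    overlap the predicate interval.
    """
    return page_max < pred_lo or page_min > pred_hi


# Bounds of [-Inf, +Inf] overlap every interval, so the page is always read.
never_skipped = not can_skip(-math.inf, math.inf, 0.0, 10.0)

# Arbitrary bounds such as [42, 1337] let the reader skip an only-NaN page.
skipped = can_skip(42.0, 1337.0, 0.0, 10.0)

# Skipping is still correct: no NaN satisfies the range predicate anyway.
only_nan_page = [math.nan, math.nan]
matches = [v for v in only_nan_page if 0.0 <= v <= 10.0]
```

Here `never_skipped` and `skipped` are both true and `matches` is empty, which is the sense in which any bounds are safe for an only-NaN page as long as the reader excludes NaNs from range predicates.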
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737850#comment-17737850 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244244531

##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
> Now, we start writing `[-Inf,+Inf]` for NaN only pages.

Also, how does the reader distinguish these from pages that contain actual infinity values?

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737848#comment-17737848 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244236273

##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@JFinis, there was an argument that some writers already write NaN values into column indexes. Hence, the corresponding readers try to filter on NaN values. If we now start writing `[-Inf,+Inf]` for only-NaN pages, NaN probably falls outside the `[-Inf,+Inf]` interval, so such a reader would drop the only-NaN page while searching for a NaN.

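
gszadovszky's concern above can be made concrete with a model of a reader whose comparator places NaN above +Inf, as Java's `Double.compare` does. Whether any existing Parquet reader behaves this way is an assumption of this sketch; `nan_last_key` and `page_may_contain` are hypothetical names.

```python
import math


def nan_last_key(x):
    """Sort key for a total order in which NaN is greater than +Inf.

    Assumption: this mimics a comparator like Java's Double.compare; it is
    not taken from any Parquet implementation.
    """
    return (1, 0.0) if math.isnan(x) else (0, x)


def page_may_contain(page_min, page_max, needle):
    """Hypothetical equality filter: is `needle` inside [page_min, page_max]?"""
    return nan_last_key(page_min) <= nan_last_key(needle) <= nan_last_key(page_max)


# A reader that expects NaN bounds finds the page when min = max = NaN ...
found_with_nan_bounds = page_may_contain(math.nan, math.nan, math.nan)

# ... but drops the same only-NaN page once the writer emits [-Inf, +Inf].
found_with_inf_bounds = page_may_contain(-math.inf, math.inf, math.nan)
```

In this model `found_with_nan_bounds` is true and `found_with_inf_bounds` is false, so a search for NaN would miss the page. Ordinary values are unaffected: `page_may_contain(-math.inf, math.inf, 1.5)` is still true.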
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737794#comment-17737794 ]

ASF GitHub Bot commented on PARQUET-2249:
---

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244075748

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I think `[-Inf, +Inf]` is OK.

   > Add in comments that in the page index, all nan pages can be checked by having nan_count > 0 && min is NaN && max is NaN

   The previous design used `[NaN, NaN]`, which I think is bad. `[-Inf, +Inf]` should be handled well and does not introduce any ambiguity. I'm +1 on this.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be
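The conflict described in the issue can be reproduced with a small sketch. This is not code from parquet-format or any Parquet implementation; `page_bounds` and `is_null_page` are hypothetical helpers that only mirror the two quoted rules (NaNs are excluded from min/max, and `null_pages` is true only for all-null pages).

```python
import math
from typing import Optional, Sequence, Tuple


def page_bounds(values: Sequence[Optional[float]]) -> Optional[Tuple[float, float]]:
    """Min/max over non-null, non-NaN values, or None if there are none.

    Hypothetical helper mirroring the rule "NaNs should not be written
    to min or max statistics fields".
    """
    usable = [v for v in values if v is not None and not math.isnan(v)]
    if not usable:
        return None
    return min(usable), max(usable)


def is_null_page(values: Sequence[Optional[float]]) -> bool:
    """True only if every value in the page is null (the null_pages rule)."""
    return all(v is None for v in values)


only_nan_page = [float("nan"), None, float("nan")]

# null_pages must be false for this page, so the spec demands valid bounds...
print(is_null_page(only_nan_page))
# ...but excluding NaNs leaves nothing to build a bound from.
print(page_bounds(only_nan_page))
```

For the only-NaN page, `is_null_page` is false while `page_bounds` has nothing to return, which is exactly the state no conforming writer can encode.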
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737712#comment-17737712 ]

ASF GitHub Bot commented on PARQUET-2249:
---

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243910046

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > your idea sounds good but it is not that easy, unfortunately. Since no total ordering is specified NaN values can get before negative infinity or after positive infinity. An implementation that currently writes NaN values to column indexes will break in this scenario.

   @gszadovszky I don't fully understand your argument here. We just want to make sure that a legacy reader that doesn't know the new semantics yet will definitely *never filter* an only-NaN page. By using min=-Infinity and max=+Infinity, we write bounds that are as wide as they can possibly be, so no legacy implementation should ever filter this page, which is the goal for correctness. Could you elaborate on how you think an implementation would break? Maybe with an example?
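A minimal sketch of the argument above, under the assumption that a legacy reader prunes pages with a plain range-overlap test. Both predicates are hypothetical; they stand for two equally plausible ways such a test might have been written, not for any actual reader.

```python
import math


def skip_if_disjoint(page_min: float, page_max: float, lo: float, hi: float) -> bool:
    """Hypothetical legacy check: skip when the page range is provably outside [lo, hi]."""
    return page_max < lo or page_min > hi


def skip_unless_overlapping(page_min: float, page_max: float, lo: float, hi: float) -> bool:
    """Logically equivalent formulation for ordinary numbers, negated the other way."""
    return not (page_max >= lo and page_min <= hi)


query = (10.0, 20.0)
wide_bounds = (-math.inf, math.inf)   # bounds proposed for only-NaN pages
nan_bounds = (math.nan, math.nan)     # bounds of the earlier [NaN, NaN] design

for name, bounds in (("[-Inf, +Inf]", wide_bounds), ("[NaN, NaN]", nan_bounds)):
    print(name,
          skip_if_disjoint(*bounds, *query),
          skip_unless_overlapping(*bounds, *query))
```

With infinite bounds both formulations keep the page. With NaN bounds they disagree, because every comparison against NaN is false, so the outcome depends on how the reader happened to phrase its test.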
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737704#comment-17737704 ]

ASF GitHub Bot commented on PARQUET-2249:
---

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243891389

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou, sorry, `BoundaryOrder` was a typo. I was talking about [ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863). The only order currently specified is `TypeDefinedOrder`. We were thinking about adding a `ColumnOrder` for FLOAT/DOUBLE that defines a total ordering covering NaN, -0.0, and +0.0 values.

   Maybe you're right that a system for which the default string ordering is not enough should write its own indices. But one idea behind ColumnOrder was to possibly implement collations to support those systems.
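To illustrate what a total ordering that includes NaN, -0.0, and +0.0 could look like, here is a sketch based on the IEEE 754 totalOrder idea, implemented through the bit pattern of a double. The comment above does not say which total order was under consideration, so the choice of totalOrder and the helper name `total_order_key` are assumptions.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order is a total order.

    Assumed ordering (IEEE 754 totalOrder style):
    -NaN < -Inf < negative numbers < -0.0 < +0.0 < positive numbers < +Inf < +NaN
    """
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    if bits < 0:
        # Negative floats sort in reverse bit order: flip everything but the sign bit.
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


values = [1.5, float("inf"), -0.0, 0.0, float("-inf"), -2.0]
print(sorted(values, key=total_order_key))
```

Unlike the ordinary `<` operator, this key distinguishes -0.0 from +0.0 and gives every NaN a fixed position, so min and max are well defined for any page.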
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737691#comment-17737691 ]

ASF GitHub Bot commented on PARQUET-2249:
---

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243836277

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > I've brought up boundary order because that was our original answer to the problems of these ordering issues.

   Hmm, how is it an answer? It only seems to be a redundant piece of information about `min_values` and `max_values`.

   > E.g. how should we order internationalized UTF-8 strings?

   Byte-wise (i.e. code-unit-wise) lexicographic ordering and character-wise (i.e. code-point-wise) lexicographic ordering should give identical results, AFAIR. They are also technically "natural". If a query system needs a more sophisticated ordering, then it should certainly synthesize its own index.

   I also don't understand what that has to do with the presence or absence of `boundary_order`.
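The claim about UTF-8 can be checked with a short sketch. Python compares `str` values code point by code point, so sorting the strings directly and sorting their UTF-8 encodings should produce the same order; the sample words are arbitrary.

```python
words = ["zebra", "Ångström", "日本語", "😀", "apple", "\uffff", "éclair"]

# Python's str comparison is lexicographic over Unicode code points.
by_code_point = sorted(words)

# bytes comparison is lexicographic over unsigned byte values,
# which is how the thrift file orders BYTE_ARRAY columns.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)
```

This works because UTF-8 was designed so that byte order preserves code point order. It says nothing about locale-aware collation, which is the "more sophisticated ordering" case mentioned above.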
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737684#comment-17737684 ]

ASF GitHub Bot commented on PARQUET-2249:
---

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243814135

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @JFinis, your idea sounds good but it is not that easy, unfortunately. Since no total ordering is specified, NaN values can end up before negative infinity or after positive infinity. An implementation that currently writes NaN values to column indexes will break in this scenario.

   @pitrou, I've brought up boundary order because that was our original answer to these ordering issues. NaN values are not the only potential issue around ordering. E.g. how should we order internationalized UTF-8 strings?

   I agree that the current parquet-mr implementation of handling NaN values in column indexes is not correct. But it also means we cannot make this change without breaking older parquet-mr readers. Boundary order would solve this from the parquet-mr point of view, but if it is not used by other implementations it is not a good choice on its own either.

   If there are parquet files with column indexes containing NaN values and we consider them valid, then we need to fix this issue in parquet-mr, and it is unrelated to this format change. However, whether they are really valid is not an easy question. Are both min and max NaN? If not, what is the total ordering in the system that writes these files? Can this format change be compatible with that system?
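One way to see why files that already contain NaN in their column index are hard to interpret: without a specified total order, a writer that computes min/max with ordinary comparisons produces results that depend on where the NaN sits in the page. The function below is a hypothetical naive writer loop, not the logic of parquet-mr or any other implementation.

```python
import math
from typing import Sequence, Tuple


def naive_min_max(values: Sequence[float]) -> Tuple[float, float]:
    """Hypothetical writer loop using plain < and > with no NaN handling."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi


# Same multiset of values, different position of the NaN.
nan_first = naive_min_max([math.nan, 1.0, 2.0])
nan_later = naive_min_max([1.0, math.nan, 2.0])

print(nan_first)
print(nan_later)
```

When the NaN comes first it sticks as both bounds, because no later value compares less or greater than it. When it comes later it is silently ignored. A reader cannot tell from the file which behavior the writer had, which is the compatibility question raised above.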
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737658#comment-17737658 ]

ASF GitHub Bot commented on PARQUET-2249:
---

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243710422

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou That shouldn't be a problem. That's why this approach would require alternative (2) or (3). In these alternatives, nan_pages / value_counts would be used to find only-NaN pages. If these indicate that the page is only NaN, the min/max can be ignored and a reader can assume that the only values in the page are NaNs. Old readers that don't understand these new fields yet would simply treat the page as "maximum value range; cannot filter".
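A sketch of the reader side described here, assuming a reader that knows about `nan_pages` and evaluates a range predicate `lo <= x <= hi`. `PageEntry` is a hypothetical, simplified view of one column-index entry with decoded float bounds; it is not a type from any Parquet library.

```python
import math
from dataclasses import dataclass


@dataclass
class PageEntry:
    """Hypothetical decoded column-index entry for one page of a DOUBLE column."""
    null_page: bool
    nan_page: bool
    min_value: float
    max_value: float


def can_skip(entry: PageEntry, lo: float, hi: float) -> bool:
    """Can the page be skipped for the predicate lo <= x <= hi?"""
    if entry.null_page or entry.nan_page:
        # Neither null nor NaN can satisfy a range comparison,
        # so the bounds do not need to be consulted at all.
        return True
    return entry.max_value < lo or entry.min_value > hi


only_nan = PageEntry(null_page=False, nan_page=True,
                     min_value=-math.inf, max_value=math.inf)
ordinary = PageEntry(null_page=False, nan_page=False,
                     min_value=1.0, max_value=5.0)

print(can_skip(only_nan, 0.0, 3.0))
print(can_skip(ordinary, 0.0, 3.0))
```

A reader that ignores `nan_page` falls through to the bounds check and, with maximally wide bounds, keeps the page, which matches the "cannot filter" behavior described for old readers.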
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737595#comment-17737595 ]

ASF GitHub Bot commented on PARQUET-2249:
---

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243576356

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > d) Stick with nan_pages (or value_counts) (i.e., alternatives (2) or (3)) and write min=-Infinity and max=+Infinity into the bounds in the column index for only-NaN pages.

   What if a page contains actual infinity values?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737568#comment-17737568 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243496427

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Oh, actually there is yet another option:

d) Stick with `nan_pages` (or `value_counts`), i.e., alternatives (2) or (3), and write min=-Infinity and max=+Infinity into the column index bounds for only-NaN pages. This way, new readers could use `nan_pages` (or `value_counts`) to detect only-NaN pages. Legacy readers would simply never filter such a page because of the maximally wide bounds.

My heart is bleeding a bit while writing this: it is obviously a patch solution that feels wrong (the bounds are simply not correct) and exists only to retrofit old implementations by bending the spec. Still, it would fulfill the requirements and preserve backward compatibility while enabling filtering of only-NaN pages.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is
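
Option d) from the comment above can be illustrated with a small sketch. It is not taken from the PR or from any Parquet implementation: the function names are hypothetical, pages are plain Python lists with `None` standing in for nulls, and the predicate is assumed to be an equality test against a non-NaN value.

```python
import math

def page_bounds(values):
    """Return (min, max, nan_page) for one page; None entries are nulls.

    Hypothetical writer-side helper following option d).
    """
    non_null = [v for v in values if v is not None]
    comparable = [v for v in non_null if not math.isnan(v)]
    if not non_null:
        # All-null page: this is what null_pages = true would cover.
        return None, None, False
    if not comparable:
        # Only-NaN page: maximally wide bounds, flagged via nan_pages.
        return -math.inf, math.inf, True
    return min(comparable), max(comparable), False

def legacy_can_skip(lo, hi, target):
    """Legacy reader: knows nothing about nan_pages, compares bounds only."""
    return target < lo or target > hi

def new_can_skip(lo, hi, nan_page, target):
    """New reader: an only-NaN page cannot contain a non-NaN target."""
    return nan_page or legacy_can_skip(lo, hi, target)
```

With these definitions a legacy reader never skips an only-NaN page (the bounds admit every value), while a reader that understands `nan_pages` can skip it for a non-NaN target.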
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737566#comment-17737566 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243489728

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
What is a good path forward then? I see the following options:

a) Ship this change but exclude the handling of only-NaN pages in the column index and handle only the other cases. We could then at least specify how to handle NaNs in the column index when no only-NaN page exists, so those cases would be well defined. Only-NaN pages are probably an edge case, so this would already allow us to filter in 99% of all cases and get us almost to the goal.

b) Add ColumnOrder to this proposal (again, happy to do that). It would be a good occasion to start using the ColumnOrder enum. This would also give us the opportunity to define `boundary_order` explicitly for this column order, so we could even assume an ordering.

c) Drop this altogether and live with the fact that float / double columns are basically unfilterable in many cases.

@gszadovszky Side note: I think that the current read behavior in parquet-mr, as you describe it, does not adhere to the spec and is dangerous at best. I have seen Parquet files in the wild that have NaN in these bounds (I don't know who wrote them). The mandate not to write NaNs to these bounds has been in the spec only for a while ([introduced here](https://github.com/apache/parquet-format/commit/92ae9a3187d7673c9a40f81f40886faa20807722)), so older writers were perfectly spec-conforming when writing NaN into these bounds. Files with NaNs here therefore adhere to (an older version of) the spec, and the parquet-mr read code should be robust enough to handle them.
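
The side note about files that carry NaN in their bounds suggests a defensive check on the read path. The following is only a sketch of that idea, not parquet-mr code; `can_skip_page` is a hypothetical name, bounds are assumed to be already decoded to Python floats (or `None` when absent), and the predicate is an equality test against a non-NaN target.

```python
import math

def can_skip_page(lo, hi, target):
    """Decide whether a page can be skipped for the predicate `col == target`.

    Bounds that are missing or NaN are treated as "no usable statistics",
    so the page is always read in that case.
    """
    if lo is None or hi is None:
        return False
    if math.isnan(lo) or math.isnan(hi):
        # Possibly written by an older, then-conforming writer: do not trust it.
        return False
    return target < lo or target > hi
```

The conservative choice here is to give up on both bounds as soon as either one is NaN; a reader could be less strict, but that would need a precise rule in the spec.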
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737561#comment-17737561 ]

ASF GitHub Bot commented on PARQUET-2249:

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243478036

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
> Since no total ordering is defined `boundary_order` shall not be either `ASCENDING` or `DESCENDING` if there is any NaN page.

Hmm. I am not theoretically against this (as in: the underlying concern is reasonable), but I'm worried that some corners of the Parquet format are increasingly becoming a smattering of special cases that implementations must be extra careful to implement correctly.

That said, it should also be easy for an implementation to ignore `boundary_order` entirely and instead detect any existing ordering from `min_values` and `max_values` (this should be fast given that there is one value per page). It might even be useful to deprecate `boundary_order` and encourage implementations to derive the information themselves?
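
Deriving the ordering from `min_values` and `max_values`, as suggested above, could look roughly like the sketch below. It makes several assumptions: the bounds are already decoded to Python floats, entries for all-null pages are `None` and are skipped, and the string results merely mirror the `ASCENDING` / `DESCENDING` names quoted in the comment plus an `UNORDERED` fallback. The function name is hypothetical.

```python
import math

def derive_boundary_order(min_values, max_values, null_pages):
    """Detect the ordering of per-page bounds in one linear pass."""
    pages = [
        (lo, hi)
        for lo, hi, is_null in zip(min_values, max_values, null_pages)
        if not is_null
    ]
    # NaN is not comparable, so any NaN bound rules out a claimed ordering.
    if any(math.isnan(lo) or math.isnan(hi) for lo, hi in pages):
        return "UNORDERED"
    steps = list(zip(pages, pages[1:]))
    if all(a[0] <= b[0] and a[1] <= b[1] for a, b in steps):
        return "ASCENDING"
    if all(a[0] >= b[0] and a[1] >= b[1] for a, b in steps):
        return "DESCENDING"
    return "UNORDERED"
```

The cost is one comparison pair per page, which matches the remark that this should be fast given one value per page.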
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737550#comment-17737550 ]

ASF GitHub Bot commented on PARQUET-2249:

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243416912

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
It would also require rewriting a couple of parts of the spec to support a new `ColumnOrder`. Because we have not introduced any new `ColumnOrder` since the idea was introduced, it might require a vote as well. We also need to investigate whether the other implementations already read this value and handle a potentially "unknown" value there. parquet-mr does.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737545#comment-17737545 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243411903

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
`1: TypeDefinedOrder TYPE_ORDER;`

Hmm, maybe my wording was confusing. What I meant is: should a new order be defined here to indicate that?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737543#comment-17737543 ]

ASF GitHub Bot commented on PARQUET-2249:

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243409667

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
No, `ColumnOrder` is specified per column in the footer and is universally valid for any min/max statistics. See [here](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1057) for details.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737525#comment-17737525 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243337924

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
So, maybe a new `ColumnOrder` enum would be added here?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a pa
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737487#comment-17737487 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243234931

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@mapleFU, it seems to me that NaN is only checked for column indexes on the write path in parquet-mr. (In this case the column index is considered invalid and is not written to the file.) On the read path, though, there is no such check. This means that legacy readers can produce incorrect results when using FLOAT/DOUBLE column indexes once we start writing NaN values. (Sorry for the late conclusion; I had thought this check was implemented in both directions.)

The only way I can think of for backward-compatible NaN handling is to define a [ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863) for FP values that includes NaNs as well. In that case we would also add support for row-group-level statistics with NaNs. parquet-mr currently skips all kinds of min/max statistics for columns with unsupported column orders.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ------------------------------------------------------------------------
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max boun
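
The inconsistency described in the quoted issue can be made concrete with a small Python sketch of a writer computing per-page bounds under the current spec wording. The helper name `page_bounds` and its return convention are illustrative assumptions; this is not how parquet-mr or any other writer is actually structured.

```python
import math
from typing import Optional, Sequence, Tuple

Bounds = Optional[Tuple[float, float]]

def page_bounds(values: Sequence[Optional[float]]) -> Tuple[bool, Bounds]:
    """Return (null_page, bounds) for one page (hypothetical helper).

    NaNs are left out of min/max, as the current spec text asks.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and min/max are written as byte[0].
        return True, None
    ordered = [v for v in non_null if not math.isnan(v)]
    if not ordered:
        # Only NaNs (possibly mixed with nulls): null_pages must be false,
        # so valid bounds are required, yet no non-NaN value exists.
        return False, None
    return False, (min(ordered), max(ordered))
```

The second `None` return is the case the issue is about: the writer has nothing it is allowed to put into `min_values` and `max_values`.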
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737415#comment-17737415 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243057594

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
In that case I'm OK with (1), thanks!
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737289#comment-17737289 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242530201

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
@mapleFU From just reading the spec, I don't think we should have a backward compatibility problem, as legacy readers are already compelled to ignore NaNs if they find them anywhere. Thus, a legacy reader would ignore the NaN it finds in the column index and just not filter that page.

Also note that regardless of whether we do (1), (2), or (3), [we basically **have to** write NaN into min and max](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773). We have to write a valid value, and if a page contains only NaNs, every value except NaN would simply be wrong. The approaches would differ only in what we write **in addition**, so to a legacy reader that doesn't read any new fields, the three approaches would be equivalent.
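
The reader behavior assumed in the comment above (ignore NaN bounds and simply do not filter the page) could look roughly like the following Python sketch. The function names are hypothetical and the equality predicate is only one example; the snippet shows the intended reading of the spec, not what any existing reader actually does.

```python
import math
from typing import Optional, Tuple

def usable_bounds(min_value: float, max_value: float) -> Optional[Tuple[float, float]]:
    """Return bounds that are safe to prune with, or None if they must be ignored."""
    if math.isnan(min_value) or math.isnan(max_value):
        return None
    return min_value, max_value

def may_contain_equal(min_value: float, max_value: float, target: float) -> bool:
    """Decide whether a page may hold a value equal to `target`."""
    bounds = usable_bounds(min_value, max_value)
    if bounds is None:
        # NaN in the bounds: no pruning decision is possible, so read the page.
        return True
    lo, hi = bounds
    return lo <= target <= hi
```

A reader written this way never skips a page whose bounds are NaN, which is why the three approaches look the same to it.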
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737283#comment-17737283 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242521314

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
I'm OK with (1), and I guess Java and Rust implementors should check whether they prune pages via the page index without checking for NaN.

@gszadovszky @pitrou do we need to:
1. check backward compatibility for NaN and pruning?
2. or just first check that the Parquet version is OK?
3. or regard a reader that doesn't handle NaN in min/max as buggy?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737281#comment-17737281 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242503072

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Assume that a legacy reader uses the page index and `min == max == NaN`: do we need to make sure that it will not prune the page now? If not, (1) is OK for me, because it doesn't introduce any redundant data.
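
Why the question about legacy readers is hard to answer from the spec alone can be seen in a short Python sketch: two range tests that agree on ordinary bounds give opposite answers once the bounds are NaN, because every IEEE 754 comparison with NaN is false. Both function names are made up for illustration and are not taken from any Parquet reader.

```python
def overlaps_inclusive(lo: float, hi: float, target: float) -> bool:
    # "Keep the page if target lies inside [lo, hi]."
    return lo <= target <= hi

def overlaps_by_exclusion(lo: float, hi: float, target: float) -> bool:
    # "Keep the page unless target lies outside [lo, hi]."
    return not (target < lo or target > hi)

nan = float("nan")
# Ordinary bounds: both formulations agree.
same_on_numbers = overlaps_inclusive(1.0, 3.0, 2.0) == overlaps_by_exclusion(1.0, 3.0, 2.0)
# NaN bounds: the first prunes the page, the second keeps it.
pruned_by_first = not overlaps_inclusive(nan, nan, 2.0)
kept_by_second = overlaps_by_exclusion(nan, nan, 2.0)
```

Whether pruning is harmful depends on the predicate being evaluated; the sketch only shows that a reader without an explicit NaN check has no well-defined behavior for `min == max == NaN`.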
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737261#comment-17737261 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242476403

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
I'm okay with both (1) and (2), even though (2) sounds more generally useful.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause
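To make solution 3 from the issue description concrete, here is a minimal writer-side sketch for one page of a DOUBLE column. It assumes nulls are represented as `None` and that bounds are serialized as 8-byte little-endian IEEE 754 values; the function name `column_index_entry` is hypothetical and not part of parquet-format or any Parquet library.

```python
import math
import struct
from typing import Optional, Sequence, Tuple


def column_index_entry(page: Sequence[Optional[float]]) -> Tuple[bool, bytes, bytes]:
    """Return (null_page, min_value, max_value) for one page (hypothetical helper)."""
    non_null = [v for v in page if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and min/max are byte[0].
        return True, b"", b""
    ordered = [v for v in non_null if not math.isnan(v)]
    if not ordered:
        # Only NaNs (possibly mixed with nulls): solution 3 writes NaN as both bounds.
        nan = struct.pack("<d", math.nan)
        return False, nan, nan
    # Otherwise NaNs are left out of the bounds, as the existing rule asks.
    return False, struct.pack("<d", min(ordered)), struct.pack("<d", max(ordered))
```

With this rule every page whose null_pages entry is false gets a valid pair of bounds, which is the property the current wording of the spec cannot guarantee.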
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737256#comment-17737256 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Thank you all for your sentiments. It looks like we have two votes for (1) and one for (3). Given that (1) would mean even fewer fields (and therefore faster decoding/encoding), I guess it would also solve the possible problem of a performance degradation due to this.

Given that the majority is for (1), I would draft an update showing what this would look like. Basically:
* Remove mentions of nan_pages
* Add in comments that in the page index, all-NaN pages can be checked by having nan_count > 0 && min is NaN && max is NaN
* Add comments about boundary order, as mentioned by @gszadovszky

I'll provide an update in the next few days.

@mapleFU would this be okay with you? You mentioned you would also be okay with the others.
@pitrou Would (1) be okay for you as well?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737258#comment-17737258 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Thank you all for your sentiments. It looks like we have two votes for (1) and one for (3). Given that (1) would mean even fewer fields (and therefore faster decoding/encoding), I guess it would also solve the possible problem of a performance degradation due to more fields to decode/encode.

Given that the majority is for (1), I would draft an update showing what this would look like. Basically:
* Remove mentions of nan_pages
* Add in comments that in the page index, all-NaN pages can be checked by having nan_count > 0 && min is NaN && max is NaN
* Add comments about boundary order, as mentioned by @gszadovszky

I'll provide an update in the next few days.

@mapleFU would this be okay with you? You mentioned you would also be okay with the others.
@pitrou Would (1) be okay for you as well?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737255#comment-17737255 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Yes, the number of rows in the offset index isn't enough, due to repeated values.

Apart from this, the suggestions seem to be going in circles a bit now. Note that all suggestions in this thread were already mentioned in [my earlier post where I depicted our possible options for the column index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).

@pitrou what you mentioned was my Option 2. I personally would prefer this, as it feels like a useful thing to have anyway. Having said that, others rightfully pointed out that it would cost a few bytes even for non-float columns.

The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. With these values, one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be obtained at all through statistics; instead, one has to decode pages and count. For example, the SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered with such a value_counts field.

@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you [earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450) were actually against writing NaNs and preferred the nan_pages approach:

> Personally speaking, apart from adding a nan_count to the statistics, I would go with the option 3: adding a nan_pages bool list to the column index. I am not in favor of writing any NaN to min/max bounds.

Is your argument that, if we now need to write the NaNs anyway, we should just use them instead of adding nan_pages? I do agree that this would save the extra field, and I personally see nothing wrong with doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determine min=max=NaN <=> all values are NaN.

As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them, and I think all of them will fulfill the requirements. In my personal opinion, I actually like the current approach with nan_pages the least, as it seems redundant if we have to write NaN values anyway, and I see no problem in using NaN values for the "all values NaN" check.

I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well; I feel the few extra bytes will be so negligible compared to the actual data that no one will ever care. It would also enable us to do the check for all values NaN the same way in page statistics and in the column index.

So we're back at the three options I proposed:
1. Drop nan_pages and use my initial approach of "min=max=NaN && nan_counts > 0 <=> all values are NaN" in the column index
2. Drop nan_pages and instead add value_counts so we can use value_counts-null_counts==nan_counts to determine whether all values are NaN. (My personal favorite)
3. Retain the current state and use `nan_pages`

@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite yet (you just pointed out that we have to write some valid value).
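The three options above differ only in how a reader decides that a page holds nothing but NaNs (and possibly nulls). A minimal reader-side sketch, assuming the per-page values have already been decoded into Python numbers and booleans; the function names are hypothetical, and the `nan_count > 0` guard in the second function is an added assumption so that an all-null page is not reported as an all-NaN page.

```python
import math


def only_nan_option1(min_value: float, max_value: float, nan_count: int) -> bool:
    """Option 1: both bounds are NaN and at least one NaN was counted."""
    return nan_count > 0 and math.isnan(min_value) and math.isnan(max_value)


def only_nan_option2(value_count: int, null_count: int, nan_count: int) -> bool:
    """Option 2: every non-null value is a NaN (value_counts - null_counts == nan_counts)."""
    return nan_count > 0 and value_count - null_count == nan_count


def only_nan_option3(nan_page: bool) -> bool:
    """Option 3: the writer states it directly in a nan_pages entry."""
    return nan_page
```

Option 1 needs no new list in the column index, option 2 needs value_counts, and option 3 needs nan_pages, which matches the trade-off discussed in the comment.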
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737254#comment-17737254 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Thank you all for your sentiment. It looks like we have two votes for (1) and one for (3). Given that (1) would mean even fewer fields (and therefore faster decoding/encoding), I guess it would also solve the possible problem of a performance degradation due to this.

Given that the majority is for (1), I would draft an update showing what this would look like. Basically:
* Remove mentions of nan_pages
* Add in comments that in the page index, all-NaN pages can be checked by having nan_count > 0 && min is NaN && max is NaN
* Add comments about boundary order, as mentioned by @gszadovszky

I'll provide an update in the next few days.

@mapleFU would this be okay with you? You mentioned you would also be okay with the others.
@pitrou Would (1) be okay for you as well?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736186#comment-17736186 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1238741604

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
+1 for (1) as I have explained in this comment: https://github.com/apache/parquet-format/pull/196#discussion_r1231982021
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735891#comment-17735891 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237651846

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
I would vote for (1) because it would not store redundant data. I think `nan_pages` is not necessary.

Meanwhile, we have to take care of the [boundary_order](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L976) as well. Since no total ordering is defined, `boundary_order` shall be neither `ASCENDING` nor `DESCENDING` if there is any NaN page.
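The boundary_order constraint in the comment above can be sketched as follows. This is a minimal illustration, assuming each non-null page is given as a decoded `(min, max)` pair of floats; the `boundary_order` function is hypothetical, while the enum values mirror the `BoundaryOrder` enum in parquet.thrift.

```python
import math
from enum import Enum
from typing import List, Tuple


class BoundaryOrder(Enum):
    UNORDERED = 0
    ASCENDING = 1
    DESCENDING = 2


def boundary_order(bounds: List[Tuple[float, float]]) -> BoundaryOrder:
    """Pick a boundary order for per-page (min, max) pairs (hypothetical helper)."""
    # Any NaN bound means no total order is available, so fall back to UNORDERED.
    if any(math.isnan(lo) or math.isnan(hi) for lo, hi in bounds):
        return BoundaryOrder.UNORDERED
    pairs = list(zip(bounds, bounds[1:]))
    if all(a[0] <= b[0] and a[1] <= b[1] for a, b in pairs):
        return BoundaryOrder.ASCENDING
    if all(a[0] >= b[0] and a[1] >= b[1] for a, b in pairs):
        return BoundaryOrder.DESCENDING
    return BoundaryOrder.UNORDERED
```

A single only-NaN page is therefore enough to force `UNORDERED` for the whole column chunk, even if all other pages are sorted.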
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735855#comment-17735855 ] ASF GitHub Bot commented on PARQUET-2249: - JFinis commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221 ## src/main/thrift/parquet.thrift: ## @@ -966,6 +985,23 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: optional list null_counts + + /** + * A list of Boolean values to determine pages that contain only NaNs. Only + * present for columns of type FLOAT and DOUBLE. If true, all non-null + * values in a page are NaN. Writers are suggested to set the corresponding + * entries in min_values and max_values to NaN, so that all lists have the same + * length and contain valid values. If false, then either all values in the + * page are null or there is at least one non-null non-NaN value in the page. + * As readers are supposed to ignore all NaN values in bounds, legacy readers + * who do not consider nan_pages yet are still able to use the column index + * but are not able to skip only-NaN pages. + */ + 6: optional list nan_pages Review Comment: Yes, number of rows in the offset index isn't enough due to repeated values. Apart from this, the suggestions seem to turn a bit in circles now. Note that all suggestions in this thread were already mentioned in [my earlier post where I depicted our possible options for the column index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762). @pitrou what you mentioned was my Option 2. I personally would prefer this as it feels like a useful thing to have anyway. Having said that, others pointed rightfully out that it would cost a few bytes even for non float columns. The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. 
By having these values, one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be obtained at all through statistics; instead one has to decode pages and count. For example, the SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered with such a value_counts field.

@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you [earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450) actually were against writing NaNs and rather preferred the nan_pages approach:

> Personally speaking, apart from adding a nan_count to the statistics, I would go with the option 3: adding a nan_pages bool list to the column index. I am not in favor of writing any NaN to min/max bounds.

Is your argument that if we now need to write the NaNs anyway, we should in this case just use them instead of adding nan_pages? I do agree that this would save the extra field and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determine min=max=NaN <=> all values are NaN.

As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them and I think all of them will fulfill the requirements. In my personal opinion, I actually like the current approach with nan_pages the least, as it seems redundant if we have to write NaN values anyway and I see no problem in using NaN values for the "all values NaN" check.

I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well. I feel the few extra bytes will be so negligible in contrast to the actual data that no one will ever care. Also, it would enable us to do the check for all values NaN the same way in page statistics and in the column index.

So we're back at the three options I proposed:

1. Drop nan_pages and use my initial approach of "min=max=NaN <=> all values are NaN" in the column index.
2. Drop nan_pages and instead add value_counts so we can use value_counts-null_counts==nan_counts to determine whether all non-null values are NaN. (My personal favorite)
3. Retain the current state and use `nan_pages`.

@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite yet (you just pointed out that we have to write some valid value).

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs > -
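To make the three options in the comment above easier to compare, here is a minimal Python sketch of how a reader might detect an only-NaN page under each of them. The `PageStats` class and its field names are hypothetical stand-ins for one page's entries in the column index, not the actual Thrift-generated types, and the `non_null > 0` guard in option 2 is an assumption added so that an all-null page is not reported as only-NaN (mirroring the wording of the proposed `nan_pages` comment).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical view of one page's column-index entries (illustration only)."""
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    value_count: Optional[int] = None  # option 2: entry of a value_counts list
    nan_count: Optional[int] = None    # entry of a nan_counts list
    nan_page: Optional[bool] = None    # option 3: entry of the nan_pages list


def all_nan_option1(p: PageStats) -> bool:
    """Option 1: min = max = NaN <=> all non-null values are NaN."""
    return (
        p.min_value is not None
        and p.max_value is not None
        and math.isnan(p.min_value)
        and math.isnan(p.max_value)
    )


def all_nan_option2(p: PageStats) -> Optional[bool]:
    """Option 2: value_count - null_count == nan_count."""
    if p.value_count is None or p.nan_count is None:
        return None  # counts not written, nothing can be deduced
    non_null = p.value_count - p.null_count
    # Assumption: an all-null page (non_null == 0) is not an only-NaN page.
    return non_null > 0 and non_null == p.nan_count


def all_nan_option3(p: PageStats) -> Optional[bool]:
    """Option 3: read the dedicated nan_pages flag (None if it is absent)."""
    return p.nan_page
```

All three checks agree on a page that carries every field. They differ in what a writer has to store and in what a legacy reader can still deduce.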
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735847#comment-17735847 ]

ASF GitHub Bot commented on PARQUET-2249:
-
mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237392250

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
 * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
 *
 * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
 * - If the min is a NaN, it should be ignored.
 * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
 * - If the min is +0, the row group may contain -0 values as well.
 * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
 *
 * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where a value has to be written incase of

Review Comment:
Maybe I misunderstood the word "except"; it seems to mean that "min-max" should be taken into account. I have no further questions about that now.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comment on the null_pages member in ColumnIndex states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735846#comment-17735846 ]

ASF GitHub Bot commented on PARQUET-2249:
-
mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237388278

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 /** A list containing the number of null values for each page **/
 5: optional list<i64> null_counts
+
+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages

Review Comment:
Personally I like (3): parquet-format changes so slowly that a newly added `value_count` or similar field would not be used for a long time. The other options seem OK to me too; maybe I can write a benchmark to check whether these extra bytes make the PageIndex larger and slower to decode.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735841#comment-17735841 ]

ASF GitHub Bot commented on PARQUET-2249:
-
JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237384626

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
 * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
 *
 * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
 * - If the min is a NaN, it should be ignored.
 * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
 * - If the min is +0, the row group may contain -0 values as well.
 * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
 *
 * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where a value has to be written incase of

Review Comment:
I don't fully understand your question. We have to write nan_pages and nan_counts *and* we also have to write NaN values to the actual min and max in the column index: the bounds must hold a valid double value, and NaN is the only correct double value when all values are NaN, as pointed out by @gszadovszky [here](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735840#comment-17735840 ]

ASF GitHub Bot commented on PARQUET-2249:
-
JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237383257

## README.md: ##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed zeros. The details are documented in the [Thrift definition](src/main/thrift/parquet.thrift) in the
- `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
Indeed, as was [requested in this issue](https://github.com/apache/parquet-format/pull/196#discussion_r1151335207). I do agree that not duplicating it is better.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comment on the null_pages member in ColumnIndex states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".

-- This message was sent by Atlassian Jira (v8.20.10#820010)
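The conflict described in the issue, and the way solution 3 resolves it, can be sketched from the writer's side. The following Python function is a minimal illustration, not parquet-mr or any other real writer: the name `column_index_entry` is hypothetical, `None` stands in for the `byte[0]` placeholder, and the handling of -0/+0 is deliberately left out.

```python
import math
from typing import List, Optional, Tuple


def column_index_entry(
    values: List[Optional[float]],
) -> Tuple[bool, Optional[float], Optional[float]]:
    """Return (null_page, min, max) for one page, following solution 3.

    None in `values` models a null; None in the result models byte[0].
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and the bounds are placeholders.
        return True, None, None

    non_nan = [v for v in non_null if not math.isnan(v)]
    if not non_nan:
        # Only NaNs (possibly mixed with nulls): null_pages must be false,
        # so a valid bound is required, and NaN is the only candidate.
        return False, math.nan, math.nan

    # Regular page: NaNs are kept out of the bounds, as the spec asks.
    return False, min(non_nan), max(non_nan)
```

Under the original wording of the spec the middle branch has no conforming result, which is exactly the inconsistency the issue reports. A reader that follows the existing rule "if the min or max is a NaN, it should be ignored" can still consume such an entry; it simply cannot prune that page.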
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735837#comment-17735837 ]

ASF GitHub Bot commented on PARQUET-2249:
-
JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381809

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
 * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
 *
 * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
 * - If the min is a NaN, it should be ignored.
 * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
 * - If the min is +0, the row group may contain -0 values as well.
 * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
 *
 * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where a value has to be written incase of

Review Comment:
I'll update this with my next revision once we have [decided on this issue](https://github.com/apache/parquet-format/pull/196#discussion_r1237381221).
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735838#comment-17735838 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237382160

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
  * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  *
  * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
  * - If the min is a NaN, it should be ignored.
  * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
  * - If the min is +0, the row group may contain -0 values as well.
  * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
  *
  * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except

Review Comment:
I'll update this with my next revision once we have [decided on this issue](https://github.com/apache/parquet-format/pull/196#discussion_r1237381221).

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>  * When writing statistics the following rules should be followed:
>  * - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e.,
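
The reader-side compatibility rules quoted in the `ColumnOrder` diff can be sketched in a few lines. This is only an illustration under stated assumptions, not code from parquet-mr or any other implementation: the function names `usable_bounds` and `all_non_null_are_nan` are hypothetical, bounds are assumed to be already decoded to Python floats, and an ignored or missing bound is represented as `None`.

```python
import math
from typing import Optional, Tuple


def usable_bounds(
    min_value: Optional[float], max_value: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Apply the reader rules for FLOAT/DOUBLE to a decoded (min, max) pair.

    Hypothetical helper: None means "no usable bound".
    """
    lo, hi = min_value, max_value
    # Rule: a NaN min or max should be ignored.
    if lo is not None and math.isnan(lo):
        lo = None
    if hi is not None and math.isnan(hi):
        hi = None
    # Rule: a min of +0 may hide -0 values, so widen the lower bound to -0.
    if lo is not None and lo == 0.0:
        lo = -0.0
    # Rule: a max of -0 may hide +0 values, so widen the upper bound to +0.
    if hi is not None and hi == 0.0:
        hi = 0.0
    return lo, hi


def all_non_null_are_nan(
    num_values: int, null_count: Optional[int], nan_count: Optional[int]
) -> Optional[bool]:
    """Evaluate nan_count + null_count == num_values when both counts are set.

    Returns None when a count is missing, because nothing can be deduced then.
    Note that the equality also holds for a chunk that contains only nulls.
    """
    if null_count is None or nan_count is None:
        return None
    return nan_count + null_count == num_values
```

Returning `None` rather than `False` for a missing `nan_count` keeps "unknown" distinct from "known to contain non-NaN values", which matters for files written before the field existed.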
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735836#comment-17735836 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Yes, the number of rows in the offset index isn't enough due to repeated values.

Apart from this, the suggestions seem to be going in circles now. Note that all suggestions in this thread were already mentioned in [my earlier post where I depicted our possible options for the column index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).

@pitrou what you mentioned was my Option 2. I personally would prefer this, as it feels like a useful thing to have anyway. Having said that, others rightfully pointed out that it would cost a few bytes even for non-float columns. The field might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. By having these values, one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be obtained at all through statistics; instead one has to decode pages and count. For example, the SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered with such a value_counts field.

@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you [earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450) were actually against writing NaNs and preferred the nan_pages approach:

> Personally speaking, apart from adding a nan_count to the statistics, I would go with the option 3: adding a nan_pages bool list to the column index. I am not in favor of writing any NaN to min/max bounds.

Is your argument that, if we now need to write the NaNs anyway, we should just use them instead of adding nan_pages? I do agree that this would save the extra field, and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determine min=max=NaN <=> all values are NaN.

As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them, and I think all of them will fulfill the requirements. In my personal opinion, I actually like the current approach with nan_pages the least, as it seems redundant if we have to write NaN values anyway, and I see no problem in using NaN values for the "all values NaN" check. I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns), and I would love to add it as well; I feel the few extra bytes will be so negligible compared to the actual data that no one will ever care. Also, it would enable us to do the check for all values NaN the same way in page statistics and in the column index.

So we're back at the three options I proposed:

1. Drop nan_pages and use my initial approach of "min=max=NaN <=> all values are NaN" in the column index.
2. Drop nan_pages and instead add value_counts so we can use value_counts-null_counts==nan_counts to determine whether all non-null values are NaN. (My personal favorite)
3. Retain the current state and use `nan_pages`.

@wgtmac @mapleFU @gszadovszky could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite yet (you just pointed out that we have to write some valid value).
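
Option 2 from the comment above can be illustrated with a short sketch. It assumes a proposed per-page `value_counts` list next to `null_counts` and `nan_counts`; none of the function names below come from the PR, and the extra `nan > 0` condition is an assumption added here so that all-null pages are not reported as only-NaN.

```python
from typing import List


def only_nan_pages(
    value_counts: List[int], null_counts: List[int], nan_counts: List[int]
) -> List[bool]:
    """Per page: value_counts - null_counts == nan_counts, with at least one NaN.

    Hypothetical helper. The `nan > 0` guard is not part of the quoted formula;
    it excludes pages that contain only nulls, which satisfy the bare equality.
    """
    return [
        nan > 0 and values - nulls == nan
        for values, nulls, nan in zip(value_counts, null_counts, nan_counts)
    ]


def count_nested_values(value_counts: List[int]) -> int:
    """Total number of values in a column chunk, without decoding any page."""
    return sum(value_counts)


# Three pages: only NaNs plus one null, only nulls, and a mixed page.
flags = only_nan_pages([4, 3, 2], [1, 3, 0], [3, 0, 1])
total = count_nested_values([4, 3, 2])
```

The same three lists would serve both the only-NaN check and the `count(*)` use case mentioned in the comment, which is the argument for a general `value_counts` field over a float-specific flag.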
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733409#comment-17733409 ]

ASF GitHub Bot commented on PARQUET-2249:

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1231982021

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
As `nan_counts` will be set only after this proposal, could we simply deduce a NaN page by checking `null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN`? If that is true, we can safely remove the definition of the `nan_pages` list.

## README.md: ##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed
   zeros. The details are documented in the
   [Thrift definition](src/main/thrift/parquet.thrift) in the
-  `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
Yes, this looks reasonable.

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
  * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  *
  * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
  * - If the min is a NaN, it should be ignored.
  * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
  * - If the min is +0, the row group may contain -0 values as well.
  * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
  *
  * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except

Review Comment:
I would expect this to explicitly state that `NaN values should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of an only-NaN page`.
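
The check proposed in the first review comment above, `null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN`, needs some care in real code, because a NaN never compares equal to anything, itself included. The following is a rough sketch for a DOUBLE column. It assumes that column index bounds are stored as 8-byte little-endian IEEE 754 values; the helper names `decode_double` and `is_nan_page` are hypothetical.

```python
import math
import struct
from typing import List


def decode_double(raw: bytes) -> float:
    """Decode one bound, assuming an 8-byte little-endian IEEE 754 double."""
    return struct.unpack("<d", raw)[0]


def is_nan_page(
    i: int,
    null_pages: List[bool],
    nan_counts: List[int],
    min_values: List[bytes],
    max_values: List[bytes],
) -> bool:
    """Deduce an only-NaN page from existing column index fields."""
    # Bounds of an all-null page are byte[0] and must not be decoded.
    if null_pages[i] or nan_counts[i] <= 0:
        return False
    # `x == NaN` is always false, so the test has to use isnan instead.
    return math.isnan(decode_double(min_values[i])) and math.isnan(
        decode_double(max_values[i])
    )


nan_bytes = struct.pack("<d", float("nan"))
one = struct.pack("<d", 1.0)
two = struct.pack("<d", 2.0)

null_pages = [False, False, True]
nan_counts = [2, 1, 0]
min_values = [nan_bytes, one, b""]
max_values = [nan_bytes, two, b""]

result = [
    is_nan_page(i, null_pages, nan_counts, min_values, max_values)
    for i in range(3)
]
```

A FLOAT column would use `"<f"` and 4-byte bounds instead; the structure of the check stays the same.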
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732617#comment-17732617 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229883979

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Personally I think `optional list value_counts` is more common, but I think null already has `null_counts`, and `value_counts` might consume more bytes for every leaf column.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732614#comment-17732614 ]

ASF GitHub Bot commented on PARQUET-2249:

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229881617

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
  * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  *
  * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
- * compatibility rules should be applied when reading statistics:
+ * point values (relations vs. total ordering), the following compatibility
+ * rules should be applied when reading statistics:
  * - If the min is a NaN, it should be ignored.
  * - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-NULL
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
  * - If the min is +0, the row group may contain -0 values as well.
  * - If the max is -0, the row group may contain +0 values as well.
- * - When looking for NaN values, min and max should be ignored.
  *
  * When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count fields for FLOAT and
+ DOUBLE columns.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where a value has to be written incase of

Review Comment:
```suggestion
 * in the column index, where a value has to be written in case of
```
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732613#comment-17732613 ]

ASF GitHub Bot commented on PARQUET-2249:

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229880276

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
That said, if we do need an additional list (because of repeated columns?), it might be more worthwhile to add an `optional list value_counts` instead, as it would then benefit all column types.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732610#comment-17732610 ]

ASF GitHub Bot commented on PARQUET-2249:

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229878920

## src/main/thrift/parquet.thrift: ##
@@ -966,6 +985,23 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
Is this necessary? We already know:
* the NaN count for each page (in `nan_counts`)
* the null count for each page (in `null_counts`)
* the number of rows for each page (from the OffsetIndex)

It seems this might be enough to infer whether a page is all-NaN (except perhaps if there are repetition levels?).
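
The caveat about repetition levels in the comment above can be made concrete with a small sketch. It assumes that per-page row counts are derived from consecutive `first_row_index` entries of the OffsetIndex plus the row group's row count. The function names are hypothetical and the page contents are made up for illustration.

```python
from typing import List


def rows_per_page(first_row_indexes: List[int], num_rows: int) -> List[int]:
    """Row count of each page from OffsetIndex-style first_row_index values."""
    starts = list(first_row_indexes) + [num_rows]
    return [starts[i + 1] - starts[i] for i in range(len(first_row_indexes))]


def looks_all_nan(count: int, null_count: int, nan_count: int) -> bool:
    """nan_count + null_count == count, requiring at least one NaN."""
    return nan_count > 0 and nan_count + null_count == count


# Flat column: one value per row, so the row count equals the value count.
# Page holds [NaN, null, NaN]: 3 rows, 1 null, 2 NaNs.
flat_ok = looks_all_nan(3, 1, 2)

# Repeated column: a page with 3 rows holding the lists
# [NaN, 1.0], [NaN], [NaN], which is 4 values, 0 nulls, 3 NaNs.
by_rows = looks_all_nan(3, 0, 3)    # true, although 1.0 is present
by_values = looks_all_nan(4, 0, 3)  # false, which is the correct answer
```

With repetition, the row count and the value count diverge, so only a per-page value count (or an explicit flag) gives a reliable answer.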
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731626#comment-17731626 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1226712152

## src/main/thrift/parquet.thrift: ##
@@ -886,16 +891,25 @@ union ColumnOrder {
  *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  *
  * (*) Because the sorting order is not specified properly for floating
- *     point values (relations vs. total ordering) the following
- *     compatibility rules should be applied when reading statistics:
+ *     point values (relations vs. total ordering), the following compatibility
+ *     rules should be applied when reading statistics:
  *     - If the min is a NaN, it should be ignored.
  *     - If the max is a NaN, it should be ignored.
+ *     - If the nan_count field is set, a reader can compute
+ *       nan_count + null_count == num_values to deduce whether all non-NULL
+ *       values are NaN.
+ *     - When looking for NaN values, min and max should be ignored.
+ *       If the nan_count field is set, it can be used to check whether
+ *       NaNs are present.
  *     - If the min is +0, the row group may contain -0 values as well.
  *     - If the max is -0, the row group may contain +0 values as well.
- *     - When looking for NaN values, min and max should be ignored.
  *
  * When writing statistics the following rules should be followed:
- *     - NaNs should not be written to min or max statistics fields.
+ *     - It is suggested to always set the nan_count fields for FLOAT and
+       DOUBLE columns.
+ *     - NaNs should not be written to min or max statistics fields except
+ *       in the column index, where a value has to be written incase of

Review Comment:
```
NaNs should not be written to min or max statistics fields except in the column index, where a value has to be written incase of
```
Does this mean `nan_pages` and `nan_count` in this patch?

## README.md: ##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and signed
   zeros. The details are documented in the
   [Thrift definition](src/main/thrift/parquet.thrift) in the
-  `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
So this part is removed and unified into `parquet.thrift`?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs an
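
The reader-side rules quoted in the diff of this message can be sketched in a few lines. This is only an illustration of the rules as proposed in this revision of PR #196, not code from any Parquet implementation; the function names are hypothetical, and `None` stands for a statistics field that was not written.

```python
import math
from typing import Optional


def all_non_null_are_nan(nan_count: Optional[int],
                         null_count: Optional[int],
                         num_values: int) -> Optional[bool]:
    """Apply nan_count + null_count == num_values.

    Returns None when either count is missing, because the reader
    then has no way to decide.
    """
    if nan_count is None or null_count is None:
        return None
    return nan_count + null_count == num_values


def usable_bound(value: Optional[float]) -> Optional[float]:
    """A NaN min or max is ignored, i.e. treated as if it were absent."""
    if value is None or math.isnan(value):
        return None
    return value


def may_contain_nan(nan_count: Optional[int]) -> bool:
    """min/max say nothing about NaNs; only nan_count can rule them out."""
    return nan_count is None or nan_count > 0
```

A reader searching for NaN values would consult `may_contain_nan` only and never the bounds, which matches the rule that min and max should be ignored in that case.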
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731560#comment-17731560 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1587083232

I finally have time to continue on this. Sorry for the long wait.

As @gszadovszky has highlighted, we have to store a valid double/float value in the min/max bounds of the column index to be compatible with legacy readers. So the initial proposal to write NaN into min/max in this case would actually work. But so far not everyone was happy with readers using these NaNs to see whether we have an only-NaN page. Therefore, the suggestion was to also add `nan_pages` to the column options (favored by @wgtmac and @mapleFU).

I have updated the PR to this suggestion: we would still write NaNs into min/max in the column index if a page has only NaNs, but advise readers not to use these values (as readers are already advised today) and instead to use only `nan_pages` to check for only-NaN pages. This way, we don't need to worry about the semantics of NaN comparisons, and readers can continue to ignore all NaN values they find in bounds.

I have not updated the PR description yet to reflect this new design; only the files themselves have been updated.

@wgtmac @mapleFU @gszadovszky Please review and let me know if you agree with this design. Then I will update the PR description accordingly.
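
The writer-side behavior described in this comment can be sketched as follows. This is a minimal illustration under stated assumptions: `page_index_entry` is a hypothetical helper, `None` in the input stands for NULL, and a `None` bound stands for the `byte[0]` placeholder of an all-null page. It reflects the design proposed at this point in the thread, not a final specification.

```python
import math
from typing import List, Optional, Tuple

NAN = float("nan")

# (null_page, nan_page, min, max) for one page of a FLOAT/DOUBLE column.
IndexEntry = Tuple[bool, bool, Optional[float], Optional[float]]


def page_index_entry(values: List[Optional[float]]) -> IndexEntry:
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only NULLs: null_pages is true and the bounds are placeholders.
        return True, False, None, None

    comparable = [v for v in non_null if not math.isnan(v)]
    if not comparable:
        # Only NaNs (possibly mixed with NULLs): NaN is written so that
        # legacy readers still find a valid float, and nan_pages is set so
        # that newer readers never have to interpret the NaN bounds.
        return False, True, NAN, NAN

    # Ordinary page: NaNs are excluded from the bounds.
    return False, False, min(comparable), max(comparable)
```

With this shape a reader checks `nan_page` first and only then looks at the bounds, so the semantics of comparing NaNs never come into play.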
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731558#comment-17731558 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1226466358

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-    `+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-    `-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading statistics:

Review Comment:
I have removed the duplicate explanation.

## src/main/thrift/parquet.thrift: ##
@@ -223,6 +223,8 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */

Review Comment:
Done.
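
The signed-zero write rules listed in the README lines of the diff above (write `-0.0` for a zero min and `+0.0` for a zero max) can be illustrated with a short sketch. The helper name is hypothetical and the function is only an example of the rule, not code from a Parquet writer.

```python
import math
from typing import Tuple


def normalize_zero_bounds(computed_min: float,
                          computed_max: float) -> Tuple[float, float]:
    """Widen zero bounds so that both -0.0 and +0.0 fall inside [min, max].

    In Python, -0.0 == 0.0 is True, so a single comparison catches
    both signs of zero.
    """
    lo = -0.0 if computed_min == 0.0 else computed_min
    hi = 0.0 if computed_max == 0.0 else computed_max
    return lo, hi
```

The sign of a zero can be inspected with `math.copysign(1.0, x)`, which yields `-1.0` for `-0.0` and `1.0` for `+0.0`.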
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721586#comment-17721586 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1543226234

@JFinis Do you have a plan to revive this?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707253#comment-17707253 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773

Thank you, @JFinis, for working on this. This is not an easy topic.

I am afraid we cannot avoid encoding NaN values into the column index min/max lists for the sake of backward compatibility. There is no such thing as a "missing value" in the list: we encode actual primitive values, so we need to store something there for each page. That's why we have `null_pages`: to indicate whether the values encoded for the corresponding page are valid. The only way I can think of to stay backward compatible is to store NaN values in min/max; otherwise we would confuse older readers.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706189#comment-17706189 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1151335207

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-    `+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-    `-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading statistics:

Review Comment:
Not relevant to this PR: it is odd that the explanation is duplicated here. It would be better to consolidate it by referring to the Thrift definition only.

## src/main/thrift/parquet.thrift: ##
@@ -223,6 +223,8 @@ struct Statistics {
     */
    5: optional binary max_value;
    6: optional binary min_value;
+   /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */

Review Comment:
```suggestion
   /** count of NaN values in the column; only present if physical type is FLOAT or DOUBLE */
```
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705930#comment-17705930 ]

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150378596

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-    `+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-    `-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading statistics:
+    * If the nan_count field is set to > 0 and both min and max are

Review Comment:
Yes, maybe you are right. My point is that if we write nan_count, or even the record count, the program would work well. However, pages of non-floating-point columns would then carry some size overhead. Personally, I'd like to use `list`, because it's easy to implement and also lightweight. Let's also hear what others think.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705907#comment-17705907 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150299691

## src/main/thrift/parquet.thrift:
@@ -952,6 +961,9 @@ struct ColumnIndex {
   * Such more compact values must still be valid values within the column's
   * logical type. Readers must make sure that list entries are populated before
   * using them by inspecting null_pages.
+  * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
I would say let's discuss this once we have settled that we do want to have NaN values. If we go with one of the other [alternatives outlined here](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762), we don't need to discuss it. (We can mandate a specific bit pattern or allow any NaN. I guess both would be okay; note that we also don't mandate a specific bit pattern for values in a column. But I'd say let's postpone the discussion.)

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that conforms to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".

-- This message was sent by Atl
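The contradiction described in the issue can be made concrete with a small checker. The following is only a sketch under stated assumptions: `spec_violations` is a hypothetical helper (not part of any Parquet library), pages are modelled as Python lists with `None` for null, and `None` as a bound stands in for the empty `byte[0]` entry. It encodes just the three rules quoted in the issue text and is not a full validator.

```python
import math

def spec_violations(page_values, null_page, min_value, max_value):
    """Return the quoted spec rules that a ColumnIndex entry would break.

    Hypothetical helper for illustration only.
    page_values: floats, with None standing in for null.
    min_value / max_value: float, or None standing in for byte[0].
    """
    non_null = [v for v in page_values if v is not None]
    violations = []
    # Rule from the null_pages comment: true means the page holds only nulls.
    if null_page != (len(non_null) == 0):
        violations.append("null_pages is true only for all-null pages")
    # Rule from the null_pages comment: false requires valid bounds.
    if not null_page and (min_value is None or max_value is None):
        violations.append("min/max must be valid when null_pages is false")
    # Rule from the statistics comment: no NaN in min or max.
    if any(b is not None and math.isnan(b) for b in (min_value, max_value)):
        violations.append("NaN must not be written to min/max")
    return violations

nan = float("nan")
only_nan_page = [nan, None, nan]
candidates = {
    "solution 1": (True, None, None),   # pretend the page is all-null
    "solution 2": (False, None, None),  # empty bounds, null_pages false
    "solution 3": (False, nan, nan),    # NaN as min and max
}
for name, entry in candidates.items():
    print(name, spec_violations(only_nan_page, *entry))
```

Under these assumptions each of the three solutions breaks exactly one of the quoted rules, which is the sense in which no conforming entry exists for an only-NaN page.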
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705900#comment-17705900 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150282767

## src/main/thrift/parquet.thrift:
@@ -952,6 +961,9 @@ struct ColumnIndex {
   * Such more compact values must still be valid values within the column's
   * logical type. Readers must make sure that list entries are populated before
   * using them by inspecting null_pages.
+  * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
By the way, in your design, for a `NaN` writer, which number should be written here? Should it be a specific NaN?

-- This message was sent by Atlassian Jira (v8.20.10#820010)
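The question about "a specific NaN" arises because IEEE 754 doubles have many NaN encodings. A minimal sketch, assuming DOUBLE bounds are stored as 8 little-endian bytes (as in Parquet's PLAIN encoding); the names `NAN_BITS`, `encodings` and `is_nan_bound` are illustrative and not taken from the PR:

```python
import math
import struct

# Three different quiet-NaN bit patterns for a 64-bit IEEE 754 double.
NAN_BITS = [0x7FF8000000000000, 0x7FF8000000000001, 0xFFF8000000000000]

# The corresponding 8-byte little-endian encodings.
encodings = [struct.pack("<Q", bits) for bits in NAN_BITS]

def is_nan_bound(raw):
    """Decode an 8-byte little-endian DOUBLE and test it for NaN by value."""
    return math.isnan(struct.unpack("<d", raw)[0])

for raw in encodings:
    print(raw.hex(), is_nan_bound(raw))
```

All three byte strings differ, yet each decodes to a NaN. A reader that wants to recognise a NaN bound would therefore test the decoded value rather than compare bytes against one fixed pattern, unless the spec were to mandate a single pattern.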
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705868#comment-17705868 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762

The gist of all open issues is the question of how to encode pages/column chunks that contain only NaNs. This is actually only an issue for the `ColumnIndex`. For statistics in the `ColumnMetaData` or the page, we can find only-NaN pages/column chunks by computing `num_values - null_count - nan_count == 0`. The `ColumnIndex` doesn't have `num_values`, so we can't perform this computation.

I see three alternatives to handle the problem in the `ColumnIndex`:

* My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
* Adding `num_values` to the ColumnIndex, to make it symmetric with Statistics in pages & `ColumnMetaData` and to enable the computation `num_values - null_count - nan_count == 0`
* Adding a `nan_pages` bool list to the column index, which indicates whether a page contains only NaNs

**I'm fine with any of these, so I would like us to reach a consensus on one of the alternatives here; then I can update my PR to reflect the decision. As this is my first contribution to parquet, I don't know the decision processes here. Do we vote? Is there a single decision maker or a group of decision makers? Please let me know how to come to a conclusion here.**

To help with the decision, here again are the PROs and CONs of the three alternatives:

* My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
  * **PRO:** Fully backward compatible
  * **PRO:** Needs no further lists in the ColumnIndex
  * **CON:** People are uneasy about storing NaNs in bounds, because the many existing NaN bit patterns make the semantics a bit fuzzy.
* Adding `num_values` to the ColumnIndex, to make it symmetric with Statistics in pages & `ColumnMetaData` and to enable the computation `num_values - null_count - nan_count == 0`
  * **PRO:** No NaNs in bounds, no encoding/bit-pattern fuzziness
  * **PRO:** Makes the `ColumnIndex` symmetric to other statistics (and to Apache Iceberg)
  * **PRO:** The `num_values` would also be viable for other purposes. It feels weirdly asymmetric to not have this field in the column index. For example, this would help to gauge the number of nested values in a nested column.
  * **CON:** The extra `num_values` list would be in each column index, even for non-FLOAT/DOUBLE columns, thereby adding space consumption and encoding/decoding overhead.
  * **CON:** Would make `null_pages` redundant, as `null_pages[i] == (num_values[i] - null_count[i] == 0)`
  * **CON:** In theory not 100% backward compatible, but probably not relevant in practice*
* Adding a `nan_pages` bool list to the column index, which indicates whether a page contains only NaNs
  * **PRO:** No NaN encoding/bit-pattern fuzziness
  * **PRO:** Less space consumption than `num_values`. The list would only be present for FLOAT/DOUBLE columns
  * **PRO:** Along the lines of `null_pages`, so it follows an existing pattern in the column index
  * **CON:** In theory not 100% backward compatible, but probably not relevant in practice*

\* Explanation of "in theory not 100% backward compatible": Today, min and max in a column index have to have a valid value unless `null_pages` of the respective page is true. This would no longer hold if we decide to encode only-NaN pages through empty min/max + `nan_pages` or empty min/max + `num_values`. Thus, a legacy reader that doesn't know the new lists could come to the conclusion that the missing bounds constitute an invalid ColumnIndex and therefore might deem the whole Parquet file invalid. I doubt that this is a problem in practice, as readers are written leniently. I.e., if a missing bound in a column index is encountered, the index might not be used (which would already happen today in case of an only-NaN page, so not a regression) or just that page might be treated as "has to be scanned". I don't know of a reader that would reject the whole Parquet file in this case. Therefore, this is likely not relevant in practice.
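The lenient reader behaviour described in the footnote ("that page might be treated as has to be scanned") can be sketched as follows. This is an assumption-laden illustration, not code from any existing reader: `page_may_contain` is a hypothetical function, the column is assumed to be DOUBLE with 8-byte little-endian bounds, and the predicate is a simple closed range.

```python
import struct

def page_may_contain(min_bytes, max_bytes, null_page, lo, hi):
    """Conservative pruning for a DOUBLE column and the predicate lo <= x <= hi.

    Hypothetical sketch: returns False only when the page can be skipped safely.
    """
    if null_page:
        return False  # an all-null page cannot satisfy a value predicate
    if len(min_bytes) != 8 or len(max_bytes) != 8:
        return True   # missing bound: do not reject the file, just scan the page
    page_min = struct.unpack("<d", min_bytes)[0]
    page_max = struct.unpack("<d", max_bytes)[0]
    if page_min != page_min or page_max != page_max:
        return True   # NaN bound (NaN != NaN): ignore the bounds and scan
    return page_min <= hi and lo <= page_max

pack = lambda x: struct.pack("<d", x)
print(page_may_contain(b"", b"", False, 0.0, 1.0))              # scanned
print(page_may_contain(pack(5.0), pack(9.0), False, 0.0, 1.0))  # skipped
```

The design choice is that any bound the reader cannot interpret degrades to "scan the page", which costs performance but never correctness.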
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704402#comment-17704402 ]

ASF GitHub Bot commented on PARQUET-2249:

zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1482138488

@JFinis Thanks for your reply. I just realized that the page value count is stored in the page header, not in the column index. I overlooked your comments above before asking the question, sorry for that.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704198#comment-17704198 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481370812

@zhongyujiang (as I can't answer your comment directly): Here is the problem with your suggestion of checking `nanCount == valueCount` to detect only-NaN pages:

> @mapleFU To your general comment (I can't answer there)
>
> > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by `null_count + nan_count == num_values`?
>
> The problem is that the ColumnIndex does not have the `num_values` field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in the alternatives and give the column index a `num_values` list. Then this would indeed work everywhere, but at the cost of an additional list.
>
> So I see we have the following options:
>
> * Do what I did here, i.e., use min/max to determine whether there are only NaNs
> * Add a `num_values` list to the ColumnIndex
> * Accept the fact that the column index cannot detect only-NaN pages (might lead to fishy semantics)
> * Tell readers to use the `min==max==NaN` reasoning only in the column index, and use `null_count + nan_count == num_values` for the statistics.
>
> Which one would you suggest here?
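The last option in the list above (different reasoning for statistics and for the column index) could look roughly like this on the reader side. Both function names are hypothetical, the inputs are already-decoded Python values, and the extra `nan_count > 0` guard is an assumption added here so that an all-null chunk is not reported as only-NaN.

```python
import math

def chunk_is_only_nan(num_values, null_count, nan_count):
    """Statistics path: num_values is available, so counts are enough."""
    return nan_count > 0 and null_count + nan_count == num_values

def index_page_is_only_nan(null_page, min_value, max_value):
    """Column index path: no num_values, so fall back to min == max == NaN.

    Only meaningful if the writer follows the proposed rule of writing
    NaN bounds exclusively for pages whose non-null values are all NaN.
    """
    if null_page:
        return False
    return math.isnan(min_value) and math.isnan(max_value)

print(chunk_is_only_nan(num_values=10, null_count=3, nan_count=7))
print(index_page_is_only_nan(False, float("nan"), float("nan")))
```

The asymmetry is visible in the signatures: the statistics check needs three counters, while the column index check has to infer the same fact from the bounds.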
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704194#comment-17704194 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146342719

## README.md:
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
@mapleFU Yes, we could also add a `nan_pages` bool list in the column index. That would work as well. My gut feeling is that one day having a `value_counts` list would be more useful than boolean lists. We already have `null_pages` and `null_counts`, and we would then also have `nan_pages` and `nan_counts`; both `null_pages` and `nan_pages` would be obsolete if there were `value_counts`. Yes, storing one integer (`value_counts`) likely takes more space than storing two booleans (`null_pages` & `nan_pages`), but knowing the number of values in a page could also be helpful for other purposes. But yes, we could drop the testing of `min=max=NaN` if we had a `nan_pages` list in the column index.
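The claim that a `value_counts` list would make both boolean lists obsolete follows from simple arithmetic per page. A sketch under the assumption that the column index carried three parallel integer lists; `value_counts` and `nan_counts` are the names floated in the comment, and `derive_page_flags` is a hypothetical helper:

```python
def derive_page_flags(value_counts, null_counts, nan_counts):
    """Recompute null_pages / nan_pages from per-page counters.

    Hypothetical: assumes three parallel lists with one entry per page.
    """
    # A page is all-null when every value in it is null.
    null_pages = [v == n for v, n in zip(value_counts, null_counts)]
    # A page is only-NaN when it has NaNs and nothing but NaNs and nulls.
    nan_pages = [
        nan > 0 and v == n + nan
        for v, n, nan in zip(value_counts, null_counts, nan_counts)
    ]
    return null_pages, nan_pages

null_pages, nan_pages = derive_page_flags(
    value_counts=[4, 3, 5],
    null_counts=[4, 1, 0],
    nan_counts=[0, 2, 1],
)
print(null_pages, nan_pages)
```

Here the first page is all-null, the second holds one null and two NaNs, and the third contains ordinary values next to one NaN.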
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704193#comment-17704193 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146342719

## README.md:
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
@mapleFU Yes, we could also add a `nan_pages` bool list in the column index. That would work as well. My gut feeling is that one day having a `value_counts` list would be more useful than boolean lists. We already have `null_pages` and `null_counts`, and we would then also have `nan_pages` and `nan_counts`; both `null_pages` and `nan_pages` would be obsolete if there were `value_counts`. Yes, storing one integer likely takes more space than storing two booleans, but knowing the number of values in a page could also be helpful for other purposes. But yes, we could drop the testing of `min=max=NaN` if we had a `nan_pages` list in the column index.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704158#comment-17704158 ]

ASF GitHub Bot commented on PARQUET-2249:

zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481237476

> Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior:
>
> Once a writer writes the nan_count/nan_counts fields, they have to:
> never write NaN into min/max if there are non-NaN non-Null values and
> always write min=max=NaN if the only non-null values in a page are NaN
>
> A reader observing that nan_count/nan_counts field was written can then rely on that if min or max are NaN, then both have to be NaN and this means that the only non-NULL values are NaN.

Instead of writing min and max as NaN when there are only NaN values, and then letting the reader check whether the NaN min/max are credible by evaluating whether naNCounts is empty, wouldn't it be much simpler to leave the evaluation of isNaN and notNaN to the reader? A reader can always conclude that a page / column is all NaN when the value count of the field == the NaN count of the field (when valueCounts and naNCounts both exist). This is Iceberg's current way of [evaluating isNaN](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java#L486). Is there anything wrong with doing this in Parquet?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    *   When writing statistics the following rules should be followed:
>    *   - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a confo
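The writer-side rule quoted from PR #196 in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not parquet-mr code: `page_min_max` is a hypothetical helper, `None` stands in for a null, and the separate -0.0/+0.0 handling from the README is deliberately left out.

```python
import math
from typing import Optional, Sequence, Tuple


def page_min_max(
    values: Sequence[Optional[float]],
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (min, max, nan_count) for one page (hypothetical helper).

    Follows the proposed rule: never put NaN into min/max while non-NaN,
    non-null values exist; write min = max = NaN if the only non-null
    values are NaN. An all-null page yields (None, None, 0).
    """
    non_null = [v for v in values if v is not None]
    nan_count = sum(1 for v in non_null if math.isnan(v))
    ordered = [v for v in non_null if not math.isnan(v)]

    if ordered:
        # NaNs are counted but excluded from the bounds.
        return min(ordered), max(ordered), nan_count
    if nan_count > 0:
        # Only NaNs (and possibly nulls): signal this with min = max = NaN.
        return math.nan, math.nan, nan_count
    # Only nulls: no bounds at all (null_pages would be true).
    return None, None, 0
```

With these bounds a reader that sees `nan_count` written and a NaN in min or max could infer that the page holds no other non-null values, which is the inference the quoted text describes.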
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704120#comment-17704120 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146114852

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
I got it. I think using both min and max is backward-compatible and can represent "all-data-is-nan".

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944

Can we import a status like that?
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704117#comment-17704117 ]

ASF GitHub Bot commented on PARQUET-2249:

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146105914

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
Personally I think https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L752 can together decide the status here.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704114#comment-17704114 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146095493

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
TBH: I would actually love to have a `num_values` list in the column index. We have the same in the statistics, Iceberg does the same, and not needing min=max=NaN for only-NaN checking would actually be much more elegant IMHO.

I just didn't want to suggest adding another list to each column index for the added space cost. However, given that these indexes are negligibly small in comparison to the data, I think actually no one would mind that extra space.

If the consensus is that this is preferable, I'm happy to adapt the commit to that.
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704111#comment-17704111 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145999358

## src/main/thrift/parquet.thrift: ##
@@ -952,6 +961,9 @@ struct ColumnIndex {
 * Such more compact values must still be valid values within the column's
 * logical type. Readers must make sure that list entries are populated before
 * using them by inspecting null_pages.
+ * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
Don't we then have the same problem already for the NaN values stored in the actual columns? We do already serialize NaN to binary values in the columns themselves. There we also do not mandate a specific bit pattern. The spec does define float/double to be IEEE compliant:

```
 * FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
 * DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
```

So if I see it correctly, any conforming reader implementation has to be able to handle all NaN bit patterns that IEEE allows. Otherwise they could not read the actual data in the columns.

As you mention Java: Java has a defined way of reading IEEE bits into Java floats: `Float.intBitsToFloat` (and the respective method for double). Here it is guaranteed that all valid NaN bit patterns produce a Java NaN. From [the documentation](https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html):

> If the argument is any value in the range 0x7f800001 through 0x7fffffff or in the range 0xff800001 through 0xffffffff, the result is a NaN.

This method is used by parquet-mr, so we should be fine here.

So, to generalize, as I see it, the following holds:
* Parquet defines FLOAT/DOUBLE to be IEEE without further mandating any bit patterns.
* If a reader cannot handle all NaN bit patterns, it is not conforming to the spec.
* Also, such a reader would already malfunction today, as there can be NaNs with any bit patterns in columns already.
* All prominent programming languages (C++, Java, Python, Go, ...) have IEEE compliant binary to float conversions, so this also sounds like a rather theoretical problem.
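The point about NaN bit patterns in the comment above can be checked with a short sketch. It uses Python's standard `struct` module as a stand-in for Java's `Float.intBitsToFloat`; the helper names `float_from_bits` and `in_documented_nan_range` are made up for this illustration, and the ranges are the ones given in the Java documentation for 32-bit floats.

```python
import math
import struct


def float_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a little-endian IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def in_documented_nan_range(bits: int) -> bool:
    """True if the pattern lies in one of the two NaN ranges of a 32-bit float."""
    return (0x7F800001 <= bits <= 0x7FFFFFFF) or (0xFF800001 <= bits <= 0xFFFFFFFF)


# A few patterns on and around the range boundaries.
samples = [
    0x7F800000,  # +infinity, just below the first NaN range
    0x7F800001,  # smallest positive-sign NaN payload
    0x7FC00000,  # the usual quiet NaN
    0x7FFFFFFF,  # largest positive-sign NaN payload
    0xFF800000,  # -infinity
    0xFF800001,  # smallest negative-sign NaN payload
    0xFFFFFFFF,  # largest negative-sign NaN payload
    0x3F800000,  # 1.0
]

for bits in samples:
    # Every pattern in the documented ranges decodes to NaN, and nothing else does.
    assert math.isnan(float_from_bits(bits)) == in_documented_nan_range(bits)
```

A reader built on an IEEE compliant conversion like this does not need to care which NaN payload a writer chose for min/max.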
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704110#comment-17704110 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
@mapleFU To your general comment (I can't answer there)

> The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by `null_count + nan_count == num_values`?

The problem is that the ColumnIndex does not have the `num_values` field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a `num_values` list. Then this would indeed work everywhere but at the cost of an additional list.

So I see we have the following options:
* Do what I did here, i.e., use min/max to determine whether there are only NaNs
* Add a `num_values` list to the ColumnIndex
* Accept the fact that the column index cannot detect only-NaN pages (might lead to fishy semantics)
* Tell readers to use the `min==max==NaN` reasoning only in the column index, and use the `null_count + nan_count == num_values` for the statistics.

Which one would you suggest here?
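The two ways of detecting only-NaN data that this comment weighs against each other can be put side by side in a small sketch. Both functions are hypothetical illustrations: `nan_count` is the field proposed in PR #196 rather than part of the released format, and `None` is used here to model "field not written".

```python
import math
from typing import Optional


def only_nan_by_counts(num_values: int, null_count: int, nan_count: int) -> bool:
    """Count-based check; needs num_values, which Statistics consumers have
    but the ColumnIndex does not carry."""
    return nan_count > 0 and null_count + nan_count == num_values


def only_nan_by_min_max(
    min_value: float, max_value: float, nan_count: Optional[int]
) -> bool:
    """Bounds-based check; works without num_values, so it is usable for
    the ColumnIndex as well."""
    if nan_count is None:
        # Legacy writer: a NaN in min/max carries no guaranteed meaning.
        return False
    return nan_count > 0 and math.isnan(min_value) and math.isnan(max_value)
```

The count-based variant keeps min/max free of NaN entirely, while the bounds-based variant avoids adding another list to every column index; that is the trade-off the options listed in the comment are about.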
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704109#comment-17704109 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146080282

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
To this suggestion:

> Seems it's a little strict here? Just ignore min-max seems ok?

Note that the line you mentioned here just tells a reader that they *can* rely on this information, and therefore could, e.g., skip this page if a predicate like `x = 12.34` was used. They can of course also opt to ignore this information and not skip but rather scan the page. If we removed this, a reader couldn't do the skip here.

I guess this is related to your general suggestion: How do we detect only-NaN pages? Depending on what we do for that, this line will be adapted accordingly.
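The page skip mentioned in this comment, for a predicate like `x = 12.34`, might look like the following on the reader side. This is a sketch under assumptions: `can_skip_page_for_equals` is a hypothetical function, `nan_count` is the field proposed in PR #196 (with `None` meaning it was not written), and predicates that themselves search for NaN are not handled.

```python
import math
from typing import Optional


def can_skip_page_for_equals(
    literal: float, min_value: float, max_value: float, nan_count: Optional[int]
) -> bool:
    """Decide whether a page can be skipped for the predicate `x = literal`."""
    if math.isnan(literal):
        # Looking for NaN needs different reasoning; never skip here.
        return False

    min_nan, max_nan = math.isnan(min_value), math.isnan(max_value)
    if nan_count is not None and nan_count > 0 and min_nan and max_nan:
        # Proposed rule: the only non-null values are NaN, so no row can match.
        return True
    if min_nan or max_nan:
        # NaN bounds without the guarantee: ignore them and scan the page.
        return False

    # Ordinary range check on trustworthy bounds.
    return literal < min_value or literal > max_value
```

As the comment says, a reader is free to ignore the first shortcut and scan such a page anyway; dropping the guarantee from the spec would only take away the option to skip.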
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704107#comment-17704107 ] ASF GitHub Bot commented on PARQUET-2249: - JFinis commented on code in PR #196: URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533 ## README.md: ## @@ -163,18 +163,25 @@ following rules: [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union. They are summarized here but the Thrift definition is considered authoritative: - * NaNs should not be written to min or max statistics fields. - * If the computed max value is zero (whether negative or positive), -`+0.0` should be written into the max statistics field. - * If the computed min value is zero (whether negative or positive), -`-0.0` should be written into the min statistics field. - - For backwards compatibility when reading files: - * If the min is a NaN, it should be ignored. - * If the max is a NaN, it should be ignored. - * If the min is +0, the row group may contain -0 values as well. - * If the max is -0, the row group may contain +0 values as well. - * When looking for NaN values, min and max should be ignored. + * The following compatibility rules should be applied when reading statistics: +* If the nan_count field is set to > 0 and both min and max are Review Comment: @mapleFU To your general comment (I can't answer there) > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by `null_count + nan_count == num_values`? The problem is that the ColumnIndex does not have the `num_values` field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a `num_values` list. Then this would indeed work everywhere but at the cost of an additional list. 
So I see we have the following options:
* Do what I did here, i.e., use min/max to determine whether there are only NaNs
* Add a `num_values` list to the ColumnIndex
* Accept the fact that the column index cannot detect only-NaN columns
* Tell readers to use the `min==max==NaN` reasoning only in the column index, and use the `null_count + nan_count == num_values` for the statistics.

Which one would you suggest here?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    * When writing statistics the following rules should be followed:
>    * - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*
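
The two detection strategies weighed in the review comment above (`null_count + nan_count == num_values` versus `min==max==NaN`) can be sketched roughly as follows. This is only an illustration under stated assumptions: `PageSummary` and both helper functions are hypothetical names, not part of parquet.thrift or any Parquet library, and `num_values` is modelled as optional to mirror the point that the ColumnIndex has no such field.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSummary:
    """Hypothetical, simplified stand-in for per-page metadata.

    This is NOT the Thrift Statistics or ColumnIndex struct; it only
    carries the fields discussed in the review comment.
    """
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    nan_count: int
    # None models the ColumnIndex case, where no num_values is available.
    num_values: Optional[int] = None


def only_nans_by_counts(p: PageSummary) -> bool:
    """Count-based check: needs num_values, so it cannot be used
    where that field is missing (the ColumnIndex case)."""
    if p.num_values is None:
        raise ValueError("num_values is required for the count-based check")
    return p.nan_count > 0 and p.null_count + p.nan_count == p.num_values


def only_nans_by_min_max(p: PageSummary) -> bool:
    """min==max==NaN check: works without num_values, assuming the
    writer stores NaN in both bounds for pages with no non-NaN values."""
    return (
        p.nan_count > 0
        and p.min_value is not None
        and p.max_value is not None
        and math.isnan(p.min_value)
        and math.isnan(p.max_value)
    )
```

With these assumptions, a page holding three NaNs and two nulls is recognized by either check when `num_values` is known, while an index entry without `num_values` can only be handled by the min/max variant.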
[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
[ https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704106#comment-17704106 ]

ASF GitHub Bot commented on PARQUET-2249:

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533

## README.md: ##
@@ -163,18 +163,25 @@ following rules:
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union. They are summarized here but the Thrift definition is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
+ * The following compatibility rules should be applied when reading statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:

> The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by `null_count + nan_count == num_values`?

@mapleFU The problem is that the ColumnIndex does not have the `num_values` field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a `num_values` list. Then this would indeed work everywhere but at the cost of an additional list.
So I see we have the following options:
* Do what I did here, i.e., use min/max to determine whether there are only NaNs
* Add a `num_values` list to the ColumnIndex
* Accept the fact that the column index cannot detect only-NaN columns
* Tell readers to use the `min==max==NaN` reasoning only in the column index, and use the `null_count + nan_count == num_values` for the statistics.

Which one would you suggest here?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>    * When writing statistics the following rules should be followed:
>    * - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is an only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture o