subject:"\[jira\] \[Commented\] \(PARQUET\-2249\) Parquet spec \(parquet.thrift\) is inconsistent w.r.t. ColumnIndex \+ NaNs"

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788926#comment-17788926
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

etseidl commented on code in PR #221:
URL: https://github.com/apache/parquet-format/pull/221#discussion_r1402657927


##
src/main/thrift/parquet.thrift:
##
@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
 struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BINARY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes 
(see LogicalTypes.md)

Review Comment:
   'must encode' or 'must be encoded as'?



##
src/main/thrift/parquet.thrift:
##
@@ -962,15 +967,19 @@ union ColumnOrder {
*   BYTE_ARRAY - unsigned byte-wise comparison
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
-   * (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   * point types due to underspecified handling of NaN and -0/+0,
+   * it is recommended that writers use IEEE_754_TOTAL_ORDER
+   * for these types.
+   *
+   * If TYPE_ORDER is used for floating point types, then the following

Review Comment:
   This line threw me (at least while using my phone 😉...on my computer I can 
see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used 
for floating..." or "If this type-defined ordering..."



##
src/main/thrift/parquet.thrift:
##
@@ -962,15 +967,19 @@ union ColumnOrder {
*   BYTE_ARRAY - unsigned byte-wise comparison
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
-   * (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   * point types due to underspecified handling of NaN and -0/+0,
+   * it is recommended that writers use IEEE_754_TOTAL_ORDER
+   * for these types.
+   *
+   * If TYPE_ORDER is used for floating point types, then the following

Review Comment:
   This line threw me (at least while using my phone 😉...on my computer I can 
see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used 
for floating..." or "If this type-defined ordering..."





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788868#comment-17788868
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1823348321

   Okay, finally done. As the new solution (total order) does not share a 
single line with the current solution and this PR gets quite long and 
contrived, I created a new PR: https://github.com/apache/parquet-format/pull/221
   
   I hope this is fine. If you rather want me to continue in this PR, let me 
know, then I'll close the other one and instead update this one. Otherwise, 
let's continue the discussion about total order in the new PR :).
   
   @tustvold @mapleFU @wgtmac @crepererum @etseidl @gszadovszky @pitrou FYI




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)

[
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788866#comment-17788866
]

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis opened a new pull request, #221:
URL: https://github.com/apache/parquet-format/pull/221

This commit adds a new column order `IEEE754TotalOrder`, which can be used
for floating point types (FLOAT, DOUBLE, FLOAT16).

The advantage of the new order is a well-defined ordering between -0,+0 and
the various possible bit patterns of NaNs. Thus, every single possible bit
pattern of a floating point value has a well-defined order now, so there are no
possibilities where two implementations might apply different orders when the
new column order is used.

With the default column order, there were many problems w.r.t. NaN values
which lead to reading engines not being able to use statistics of floating
point columns for scan pruning even in the case where no NaNs were in the data
set. The problems are discussed in detail in the next section.

This solution to the problem is the result of the extended discussion in
https://github.com/apache/parquet-format/pull/196, which ended with the
consensus that IEEE 754 total ordering is the best approach to solve the
problem in a simple manner without introducing special fields for floating
point columns (such as `nan_counts`, which was proposed in that PR). Please
refer to the discussion in that PR for all the details why this solution was
chosen over various design alternatives.

Note that this solution is fully backward compatible and should not break
neither old readers nor writers, as a new column order is added. Legacy writers
can continue not writing this new order and instead writing the default type
defined order. Legacy readers should avoid using any statistics on columns that
have a column order they do not understand and therefore should just not use
the statistics for columns ordered using the new order.

The remainder of this message explains in detail what the problems are and
how the proposed solution fixes them.

Problem Description
===

Currently, the way NaN values are to be handled in statistics inhibits most
scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely
the following problems exist:

Statistics don't tell whether NaNs are present
--

As NaN values are not to be incorporated in min/max bounds, a reader cannot
know whether NaN values are present. This might seem to be not too problematic,
as most queries will not filter for NaNs. However, NaN is ordered in most
database systems. For example, Postgres, DB2, and Oracle treat NaN as greater
than any other value, while MSSQL and MySQL treat it as less than any other
value. An overview over what different systems are doing can be found here. The
gist of it is that different systems with different semantics exist w.r.t. NaNs
and most of the systems do order NaNs; either less than or greater than all
other values.

For example, if the semantics of the reading query engine mandate that NaN
is to be treated greater than all other values, the predicate x > 1.0 should
include NaN values. If a page has max = 0.0 now, the engine would not be able
to skip the page, as the page might contain NaNs which would need to be
included in the query result.

Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be
less than all other values by the reading engine. Again, a page with min = 2.0
couldn't be skipped in this case by the reader.

Thus, even if a user doesn't query for NaN explicitly, they might use other
predictes that need to filter or retain NaNs in the semantics of the reading
engine, so the fact that we currently can't know whether a page or row group
contains NaN is a bigger problem than it might seem on first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max
bounds in Parquet and therefore cannot be used for scan pruning at all. And as
state, that can be many seemingly innocuous greater than or less than
predicates in most databases systems. Conversely, it would be nice if Parquet
would enable scan pruning in these cases, regardless of whether the reader and
writer agree upon whether NaN is smaller, greater, or incomparable to all other
values.

Note that the problem exists especially if the Parquet file doesn't include
any NaNs, so this is not only a problem in the edge case where NaNs are
present; it is a problem in the way more common case of NaNs not being present.

Handling NaNs in a ColumnIndex
--

There is currently no well-defined way to write a spec-conforming
ColumnIndex once a page has only NaN (and possibly null) value

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784917#comment-17784917
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805653152

   Congratulations! Take all the time you need, there is no urgency on this 
from my end, just wanted to avoid things stalling out




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784693#comment-17784693
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805207632

   I hate to not stick to my word, but I won't be able to create the PR today, 
as my wife is going into labor and I'll have to drive her to the clinic soon 😅.
   
   I pushed the status I have so far to my fork. You can already have a look if 
you want: https://github.com/jfinis/parquet-format/tree/totalorder
   
   The commit is basically done, I just wanted to proof read everything and 
write a descriptive message for the commit and the PR. I'll find some time once 
we're back from the hospital, i.e., in a few days. But for now, I first need to 
deliver something else 👶 .




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784578#comment-17784578
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1804343753

   @tustvold I actually already have the change ready in my local repo. I was 
just distracted by other work and it seemed there was little interest so far in 
advancing this quickly, so I didn't update it on github, yet. I can update the 
PR tomorrow :).




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784386#comment-17784386
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1803642293

   Just coming back to this as it has come up a bit downstream, the approach 
described in 
https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697 makes 
a lot of sense to me. Would it help move this forward if I were to raise a 
separate PR proposing it?
   
   > parquet-mr can efficiently implement this sort order
   
   Provided Java provides some mechanism to interpret a float as an integer, it 
is just a case of some bit operations - 
https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
   
   > Total ordering is nice if the goal is to order the data
   > If the goal is to filter the data then I think any consideration of 
NaN/null/infinity is meaningless
   
   Why would filter predicates not also need a well-defined order? FWIW 
arrow-rs uses total order for **all** floating point comparison.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-10 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741481#comment-17741481
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1628384514

   To support old readers with the statistics we can choose to write 
`TypeDefinedOrder` for FP values in case there are no `NaN` values in the data.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-08 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741310#comment-17741310
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1627373091

   The new `IEEE754TotalOrder` looks elegant to me, though a single NaN value 
may still ruin the page index. Another challenge is how parquet-mr can 
efficiently implement this sort order.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741127#comment-17741127
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

westonpace commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625662823

   > CON: NaNs will be used in min/max bounds, even for not only-NaN pages. 
This makes them less effective for filtering (as they are the widest possible 
bounds) but @crepererum made a good point that this "special case for NaN" is 
quite arbitrary and we could also special case INT_MAX for integer columns, 
e.g.. I see the argument that keeping the architecture simple might be 
preferrable. Also NaNs are not widely used, so this will not be determinental 
to many data sets.
   
   I agree this is a con.  Total ordering is nice if the goal is to order the 
data.  If the goal is to filter the data then I think any consideration of 
NaN/null/infinity is meaningless.
   
   However, I also agree with @crepererum that this is a slippery slope and I 
agree with @JFinis that NaNs are not widely used and simpler is better.  I 
don't entirely agree the solution can always be to replace NaN/Infinity with 
NULL but the cases where it can't are probably very rare.  Besides, the penalty 
here is only a performance loss and not incorrect results so it's more 
manageable.
   
   So, on the balance, I'd say I'm neutral.  If there are other advantages to 
this approach then the disadvantages to dataset filtering are probably not 
enough outweigh them.  We might want to add a small sentence to some kind of 
pyarrow or parquet documentation somewhere so that users can be aware of this.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741077#comment-17741077
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625546137

   @JFinis Thanks a lot! I agree that makes sense. The main problem IMHO is 
that old readers wouldn't support page filtering on such files. That said, we 
have to move forward somehow.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741072#comment-17741072
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697

   > > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this 
proposal?
   > 
   > It's difficult to say without understanding the implications. Say a data 
page contains some NaNs, what happens?
   
   @pitrou On the write path:
   
   * The writing library would set the `ColumnOrder` for this column to the new 
option, let's call it `IEEE754TotalOrder`.
   * The writing library would use IEEE754 total order for all order / sorting 
related tasks. I.e., it would compute the min and max values of the page using 
that total order. That order has a defined place for NaN. The writer would 
*not* have special logic for NaN. It would just order everything using total 
order. E.g., in case of a page containing a positive NaN, this would be chosen 
as the max value, as Nan > everything else in the total order.
   
   On the read path:
   * A reading engine encountering the new `IEEE754TotalOrder` column order 
would either
 a) (legacy reader) not understand it and in this case not use any 
statistics, as it doesn't understand the semantics of the order relation.
 b) (new reader) understand it and assume that all order is in IEEE 754 
total order, which again has a defined place for NaNs. Depending on the NaN 
semantics of the reading engine, it would need to make sure to align the values 
it sees in min/max with its own semantics. How this alignment would look like 
would depend on the semantics of the engine. (I can give more detailed examples 
for different engine semantics if necessary)
 
   Ramifications:
   * PRO: Due to the new column order, legacy readers are guarded. They don't 
need to understand the new order. Even if they ignore the column order, if they 
see NaNs in min and max they will just ignore them and assume the statistics 
aren't usable. So we have two layers of protection to make sure legacy readers 
don't misunderstand the ordering.
   * PRO: No special fields for NaNs. No `nan_counts`, no `nan_pages`. Instead, 
NaNs are just treated as defined in the total ordering.
   * PRO: Simple standardized handling of floatsinstead of special handling of 
NaNs. I guess this was the main point of @tustvold and @crepererum.
   * PRO: Engines only need to understand total ordering and don't need any 
special NaN handling code anymore (unless their semantics is different, in 
which case they need to map their semantics from / to total ordering).
   * CON: NaNs *will* be used in min/max bounds, even for not only-NaN pages. 
This makes them less effective for filtering (as they are the widest possible 
bounds) but @crepererum made a good point that this "special case for NaN" is 
quite arbitrary and we could also special case INT_MAX for integer columns, 
e.g.. I see the argument that keeping the architecture simple might be 
preferrable. Also NaNs are not widely used, so this will not be determinental 
to many data sets.

 
 




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741049#comment-17741049
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625464720

   > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this 
proposal?
   
   It's difficult to say without understanding the implications. Say a data 
page contains some NaNs, what happens?




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741022#comment-17741022
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625392665

   > I guess this can also be implemented in each language by "bit casting" the 
float bits to integer bits and doing an integer comparison, correct
   
   Its a bit more than a simple bit cast, but broadly speaking yes.
   
   https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741009#comment-17741009
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625354736

   @tustvold @crepererum Do I interpret your answer correctly in that your 
suggestion would be to
   
   * Create a new `ColumnOrder` for floats that simply is defined as IEEE 754 
total order, if we need such new order for backward compatibility (which we 
probably need, as apparently parquet-mr will otherwise perform filtering 
incorrectly)
   * When that order is used, don't handle NaNs explicitly. Instead, just use 
the total order relation for ordering and min/max computation (which will 
result in NaNs being written as max and -NaNs being written as min if they 
exist).
   
   Did I get this right?
   
   I guess this can also be implemented in each language by "bit casting" the 
float bits to integer bits and doing an integer comparison, correct? So even if 
the underlying language doesn't have native support for total ordering, it 
should still be possible to implement this.
   
   I do see a certain beauty in this approach in it being "simple". As always, 
I'm happy to adapt my PR to this approach, if we can get consensus that we want 
this.
   
   @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741001#comment-17741001
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625347116

   Okay, `[-NaN, +NaN]` as min-max would be ignored in C++ Statistics. I'm ok 
for these solutions.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740984#comment-17740984
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625304128

   > I think we already have type-defined order
   
   Indeed, and what I am suggesting is rather than layering on more complexity 
to workaround the problems of such an approach, how about we just remove this 
complexity?




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740976#comment-17740976
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625277042

   I think we already have type-defined order, and already exclude +inf and 
-inf. And not when if a page is all `NaN`, the page would be excluded




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740966#comment-17740966
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

crepererum commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625220272

   I agree w/ @tustvold's standpoint. Some thoughts on top of what he wrote:
   
   IMHO this is leaking application details into the storage format. If you 
start to differentiate NaN from "all normal values" and NULL you may do the 
same for +/-Inf, because it also acts as a poison value in most computations. 
But you may also do that for "nearly Inf" because someone divided by "nearly 
zero" and these super big values are equally nonsensical. This whole discussion 
isn't even specific to floats. Why do boolean stats not count true/false 
separately? What about empty strings and byte arrays? Or empty lists in 
general? My point is: this is opening a can of worms and the complexity isn't 
worth the gain.
   
   The better alternative is: let the user cast invalid values to NULL if they 
wanna exclude them from their data, because this is exactly what missing values 
were invented for. If they still want to store broken data and want to have 
some niche understanding of statistics, provide a way to attach 
application-defined stats to parquet (this extends to a number of histogram 
types or counts of other "special" values). Keep the storage format baseline 
simple. IEEE total ordering is well defined and universally agreed upon. I 
think the world doesn't need yet another special floating point treatment.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engin

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739052#comment-17739052
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614549513

   @mapleFU, as I've written before that's why we initiated 
[ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863)
 to make the format open to specify orderings. I don't know how the other 
implementations use this already. In the current parquet-mr (since we 
introduced `ColumnOrder`) there is a logic that drops any statistics if the 
defined column order is not known. So we can safely initiate a new one. We can 
say that if the min/max value would contain a NaN, then we would write the new 
`IEEE_754` column order otherwise `TYPE_ORDER`. In this case we can simple skip 
the additional lists for marking all-NaN pages and write the NaN values into 
the statistics instead. The question is how older readers of the other 
implementations would handle an unknown `ColumnOrder`.
   It is an implementation detail that the NaN handling is java is different 
from what IEEE 754 says. Java has only one NaN bitmap. So handling this 
ordering will require additional work. I hope it can be implemented in a 
performant way.
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739036#comment-17739036
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614520051

   > Currently the arrow-rs implementation uses the totalOrder predicate as 
defined by the IEEE 754 (2008 revision) floating point standard to order 
floats, this can be very efficiently implemented using some bit-twiddling and 
at least appears to define the standardised way to handle this. 
   
   So arrow-rs has a nice handing on float/double comparings, I guess we only 
need to consider that the new data will not broken by stale parquet-mr reader? 
@gszadovszky 




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739025#comment-17739025
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748

   > I wonder for PageIndex pruning in Rust implementions
   
   Currently the arrow-rs implementation uses the totalOrder predicate as 
defined by the IEEE 754 (2008 revision) floating point standard to order 
floats, this can be very efficiently implemented using some bit-twiddling and 
at least appears to define the standardised way to handle this. I believe 
DataFusion is using these same comparison kernels for evaluating pruning 
predicates, and so I would expect it to have similar behaviour with regards to 
NaNs.
   
   From the [Rust 
docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):
   
   > The values are ordered in the following sequence:
   > 
   > negative quiet NaN
   > negative signaling NaN
   > negative infinity
   > negative numbers
   > negative subnormal numbers
   > negative zero
   > positive zero
   > positive subnormal numbers
   > positive numbers
   > positive infinity
   > positive signaling NaN
   > positive quiet NaN.
   
   > would it matter for adding [-inf, +inf] as min-max for all nan and null 
pages
   
   I haven't read the full backscroll, but the original PR's suggestion of just 
writing a NaN for a page only containing NaN seems perfectly logical to me, 
unlikely to cause compatibility issues, and significantly less surprising than 
writing a value that doesn't actually appear in the data...
   
   > Let's cc some of the maintainers of 
[parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):
   
   I don't really know enough about the history of floating point comparison to 
weigh in on what the best solution is with any degree of authority, however, my 
2 cents is that the totalOrder predicate is the standardised way to handle this.
   
   Whilst I do agree that the behaviour of aggregate statistics containing NaNs 
might be unfortunate for some workloads, I'm not sure that special casing them 
is beneficial. Aside from the non-trivial additional complexity associated with 
special-casing them, if you don't include NaNs in statistics it is unclear to 
me how you can push down a comparison predicate as you have no way to know if 
the page contains NaNs? Perhaps that is what this PR seeks to address, but I do 
wonder if the simple solution might be worth considering...
   
   Also tagging @crepererum who may have further thoughts




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739008#comment-17739008
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247688333


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   Trying to catching up the discussion. I like the idea to write either [-inf, 
+inf] or [-0.0, +0.0] for NaN-only pages.
   
   As NaN value does not have a well-defined order across systems, simply 
leveraging page min/max values to filter NaN does not make any sense. Therefore 
I think this design can break such misuses.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738996#comment-17738996
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614407651

   https://github.com/apache/parquet-format/pull/196#discussion_r1237381221 
   @alamb @tustvold Hi, for PageIndex pruning in Rust implementions, would it 
matter for adding `[-inf, +inf]` as min-max for all nan and null pages? Would 
it harm the column pruning for `IS_NAN` or other operations?
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738993#comment-17738993
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247671677


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   I think `[-inf, +inf]` it's ok. Now I guess only Rust impl and parquet-mr 
has the potential problem.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> b

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738941#comment-17738941
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247575939


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   @mapleFU, I did not think about any specific implementation. (TBH, I only 
have experince with parquet-mr.) This is mentioned in the PR description. 
Maybe, we do not have any implementations as such.
   
   @JFinis, I agree we should not care about the potential systems already 
writing NaN values into column indexes. Also agree that writing NaN values to 
min/max is risky for existing systems. So we need to write non-NaN valid values 
to min/max for all-NaN pages. (And of course mark them with either `nan_pages` 
or `value_counts`.)
   
   The more we narrow the range the higher the chance the page will be dropped 
during filtering which is good because we should not search for NaN values 
based on the spec anyway. What do you think about `[-Inf, -Inf]`? The worst 
case is we will read the page of all NaN values instead of dropping. In this 
very case we would not drop it for `< x` like cases. (This turned out to be the 
rephrasing and summary of your previous comments. :smile: )





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column in

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738171#comment-17738171
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245383453


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   > there was an argument that some writers already write NaN values into 
column indexes. Hence, they try to filter on NaN values. Now, we start writing 
`[-Inf,+Inf]` for NaN only pages. NaN is probably out of `[-Inf,+Inf]` interval 
so that reader would drop the only NaN page while searching for a NaN.
   
   Hi gabor, which implemention has do like that? I check C++ implemention but 
it doesn't do this. Maybe we can do a check here? Since I guess `[-inf, +inf]` 
sounds ok





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738169#comment-17738169
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1611623088

   Let's cc some of the maintainers of 
[parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet): @adamgs 
@tustvold @alamb




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738163#comment-17738163
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   @pitrou writing anything but NaN into min/max was one of my suggestions to 
circumvent the problem that [parquet-mr doesn't seem to check for NaN values in 
min/max while 
reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931)
 and therefore would probably yield wrong results once we start writing NaNs 
into these values.
   
   This would only work if we go back to maintaining either `nan_pages` or 
`value_counts` though, as otherwise, as you correctly pointed out, we wouldn't 
have a way to draw the distinction between only-NaN and real infinities.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738162#comment-17738162
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   @pitrou writing anything but NaN into min/max was my suggestion to 
circumvent the problem that [parquet-mr doesn't seem to check for NaN values in 
min/max while 
reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931)
 and therefore would probably yield wrong results once we start writing NaNs 
into these values.
   
   This would only work if we go back to maintaining either `nan_pages` or 
`value_counts` though, as otherwise, as you correctly pointed out, we wouldn't 
have a way to draw the distinction between only-NaN and real infinities.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738146#comment-17738146
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245293430


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   > A new reader that implements this PR can do the distinction via the 
nan_pages or value_counts computation.
   
   Wait... I thought the `[-Inf, +Inf]` convention was meant to avoid a new 
`nan_pages` or `value_counts` field? If not, then what's the point?





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> T

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738141#comment-17738141
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   @gszadovszky Any writer reader/writer pair who writes NaNs into column 
indexes (and other places like page headers) and expects them to be there (and 
otherwise yields wrong results while reading) *is and never was* spec 
conforming. In older releases of the spec where NaN wasn't mentioned yet, such 
a writer was at least not violating the spec directly but even then NaN 
handling was basically "undefined behavior", as the spec never mentioned how to 
treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was already 
back then a non-portable assumption.
   
   Even today, a reader relying on one specific NaN semantics would already 
yield erroneous results when reading spec conforming Parquet files. E.g., if 
they search for NaNs and expect them to be in min/max, then they might filter 
Pages containing NaNs that don't have NaNs in their min/max. Consequently, such 
a reader is already broken; yes, writing [-Inf,Inf] into the column index would 
break such a reader more, but all bets are off here anyway already. It 
currently is just not possible to handle NaNs correctly in a portable way 
(that's what this PR is all about in the first place).
   
   So TBH backward compatibility to such a broken (or at least non-portable) 
reader/writer pair seems like an absolute non-goal to me.
   
   @pitrou A legacy reader who doesn't handle the new NaN semantics doesn't 
need to distinguish here. All they need to know is whether they should skip the 
page or shouldn't. A page with [-Inf,+Inf] can never be skipped, so regardless 
of whether the bounds are there due to NaNs or real infinities, a legacy reader 
would yield correct results. A new reader that implements this PR can do the 
distinction via the nan_pages or value_counts computation.
   
   Note that actually *any* bounds are, mathematically speaking, correct for a 
page containing only NaNs (and will yield correct results on spec-conforming 
readers!). Note that min/max values in the column index don't need to be tight, 
according to the spec. So the only condition that must hold is that there is no 
value outside of the bounds (NaNs excluded). As an only-NaN page has no values, 
any bounds satisfy the condition, as there are no values that need to lie 
inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. 
Both would yield correct results on spec conforming readers. Actually the 
tighter the bounds, the more queries can skip the page.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738143#comment-17738143
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list nan_pages

Review Comment:
   @gszadovszky Any writer reader/writer pair who writes NaNs into column 
indexes (and other places like page headers) and expects them to be there (and 
otherwise yields wrong results while reading) *is and never was* spec 
conforming. In older releases of the spec where NaN wasn't mentioned yet, such 
a writer was at least not violating the spec directly but even then NaN 
handling was basically "undefined behavior", as the spec never mentioned how to 
treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was already 
back then a non-portable assumption.
   
   Even today, a reader relying on one specific NaN semantics would already 
yield erroneous results when reading spec conforming Parquet files. E.g., if 
they search for NaNs and expect them to be in min/max, then they might filter 
Pages containing NaNs that don't have NaNs in their min/max. Consequently, such 
a reader is already broken; yes, writing [-Inf,Inf] into the column index would 
break such a reader more, but all bets are off here anyway already. It 
currently is just not possible to handle NaNs correctly in a portable way 
(that's what this PR is all about in the first place).
   
   So TBH backward compatibility to such a broken (or at least non-portable) 
reader/writer pair seems like an absolute non-goal to me.
   
   @pitrou A legacy reader who doesn't handle the new NaN semantics doesn't 
need to distinguish here. All they need to know is whether they should skip the 
page or shouldn't. A page with [-Inf,+Inf] can never be skipped, so regardless 
of whether the bounds are there due to NaNs or real infinities, a legacy reader 
would not skip the page and therefore yield correct results. A new reader that 
implements this PR can do the distinction via the nan_pages or value_counts 
computation.
   
   Note that actually *any* bounds are, mathematically speaking, correct for a 
page containing only NaNs (and will yield correct results on spec-conforming 
readers!). Note that min/max values in the column index don't need to be tight, 
according to the spec. So the only condition that must hold is that there is no 
value outside of the bounds (NaNs excluded). As an only-NaN page has no values, 
any bounds satisfy the condition, as there are no values that need to lie 
inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. 
Both would yield correct results on spec conforming readers. Actually the 
tighter the bounds, the more queries can skip the page.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {

1 2 >

1 - 100 of 121 matches

Mail list logo