[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788926#comment-17788926
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

etseidl commented on code in PR #221:
URL: https://github.com/apache/parquet-format/pull/221#discussion_r1402657927


##
src/main/thrift/parquet.thrift:
##
@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
 struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BINARY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes (see LogicalTypes.md)

Review Comment:
   'must encode' or 'must be encoded as'?



##
src/main/thrift/parquet.thrift:
##
@@ -962,15 +967,19 @@ union ColumnOrder {
*   BYTE_ARRAY - unsigned byte-wise comparison
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
-   * (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   * point types due to underspecified handling of NaN and -0/+0,
+   * it is recommended that writers use IEEE_754_TOTAL_ORDER
+   * for these types.
+   *
+   * If TYPE_ORDER is used for floating point types, then the following

Review Comment:
   This line threw me (at least while using my phone 😉...on my computer I can 
see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used 
for floating..." or "If this type-defined ordering..."






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788868#comment-17788868
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1823348321

   Okay, finally done. As the new solution (total order) does not share a 
single line with the current solution and this PR gets quite long and 
contrived, I created a new PR: https://github.com/apache/parquet-format/pull/221
   
   I hope this is fine. If you would rather I continue in this PR, let me 
know; I'll then close the other one and update this one instead. Otherwise, 
let's continue the discussion about total order in the new PR :).
   
   @tustvold @mapleFU @wgtmac @crepererum @etseidl @gszadovszky @pitrou FYI




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
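The contradiction described in the issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `page_index_entry` is a hypothetical helper, `None` stands for a null value, and an empty byte string stands for `byte[0]`. It is not code from any Parquet implementation.

```python
import math

def page_index_entry(values):
    """Return (null_page, min, max) for one page under the current rules.

    Hypothetical helper for illustration only. `values` is a list of
    floats in which None stands for a null value.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and min/max are empty (byte[0]).
        return (True, b"", b"")
    # Statistics rule: NaNs must not be written to min or max.
    candidates = [v for v in non_null if not math.isnan(v)]
    if not candidates:
        # null_pages has to be false, yet no valid min/max value exists.
        raise ValueError("only-NaN page: no conforming min/max")
    return (False, min(candidates), max(candidates))
```

A page such as `[float("nan"), None]` reaches the `raise`, which is the case the issue says no writer can handle in a conforming way. Solution 3 would correspond to returning NaN as both bounds at that point instead.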



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788866#comment-17788866
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis opened a new pull request, #221:
URL: https://github.com/apache/parquet-format/pull/221

   This commit adds a new column order `IEEE754TotalOrder`, which can be used 
for floating point types (FLOAT, DOUBLE, FLOAT16).
   
   The advantage of the new order is a well-defined ordering between -0,+0 and 
the various possible bit patterns of NaNs. Thus, every single possible bit 
pattern of a floating point value has a well-defined order now, so there are no 
possibilities where two implementations might apply different orders when the 
new column order is used.
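
   As a rough illustration of the ambiguity that a total order removes, the following Python sketch (not part of the PR; `running_min` is a hypothetical helper) shows that ordinary relational comparison gives order-dependent results once NaN is involved, and that it cannot tell -0 from +0 even though the bit patterns differ.

```python
import math
import struct

nan = float("nan")

def running_min(values):
    """Naive minimum using only the < operator (hypothetical helper)."""
    best = values[0]
    for v in values[1:]:
        if v < best:
            best = v
    return best

# Every comparison with NaN is False, so the result depends on input order.
a = running_min([nan, 1.0])  # stays NaN, because 1.0 < nan is False
b = running_min([1.0, nan])  # stays 1.0, because nan < 1.0 is False

# -0.0 and +0.0 compare equal, yet their encodings are different.
neg_zero_bits = struct.pack(">d", -0.0)
pos_zero_bits = struct.pack(">d", 0.0)
zeros_equal = (-0.0 == 0.0)
```

   Two writers that both "take the minimum" can therefore end up with different statistics for the same data, which is the situation the new column order is meant to rule out.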
   
   With the default column order, there were many problems w.r.t. NaN values 
which led to reading engines not being able to use statistics of floating 
point columns for scan pruning even in the case where no NaNs were in the data 
set. The problems are discussed in detail in the next section.
   
   This solution to the problem is the result of the extended discussion in 
https://github.com/apache/parquet-format/pull/196, which ended with the 
consensus that IEEE 754 total ordering is the best approach to solve the 
problem in a simple manner without introducing special fields for floating 
point columns (such as `nan_counts`, which was proposed in that PR). Please 
refer to the discussion in that PR for all the details why this solution was 
chosen over various design alternatives.
   
   Note that this solution is fully backward compatible and should break 
neither old readers nor writers, as a new column order is added. Legacy writers 
can continue not writing this new order and instead writing the default type 
defined order. Legacy readers should avoid using any statistics on columns that 
have a column order they do not understand and therefore should just not use 
the statistics for columns ordered using the new order.
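
   The reader-side rule in the paragraph above can be sketched as follows. This is a minimal Python illustration, not code from any Parquet reader: the string labels and the names `LEGACY_READER_ORDERS`, `NEW_READER_ORDERS` and `can_use_statistics` are assumptions made up for the example.

```python
# Hypothetical labels for the column orders a reader implements.
LEGACY_READER_ORDERS = {"TYPE_ORDER"}
NEW_READER_ORDERS = {"TYPE_ORDER", "IEEE_754_TOTAL_ORDER"}

def can_use_statistics(column_order, known_orders):
    """Return True only if the reader understands the column's order.

    Statistics written under an unknown order are ignored, so the
    reader falls back to scanning the data instead of pruning.
    """
    return column_order in known_orders

legacy_ok = can_use_statistics("IEEE_754_TOTAL_ORDER", LEGACY_READER_ORDERS)
new_ok = can_use_statistics("IEEE_754_TOTAL_ORDER", NEW_READER_ORDERS)
```

   A legacy reader loses pruning on such columns but still returns correct results, which is why the addition is described as backward compatible.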
   
   The remainder of this message explains in detail what the problems are and 
how the proposed solution fixes them.
   
   Problem Description
   ===
   
   Currently, the way NaN values are to be handled in statistics inhibits most 
scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely 
the following problems exist:
   
   Statistics don't tell whether NaNs are present
   --
   
   As NaN values are not to be incorporated in min/max bounds, a reader cannot 
know whether NaN values are present. This might seem to be not too problematic, 
as most queries will not filter for NaNs. However, NaN is ordered in most 
database systems. For example, Postgres, DB2, and Oracle treat NaN as greater 
than any other value, while MSSQL and MySQL treat it as less than any other 
value. An overview over what different systems are doing can be found here. The 
gist of it is that different systems with different semantics exist w.r.t. NaNs 
and most of the systems do order NaNs; either less than or greater than all 
other values.
   
   For example, if the semantics of the reading query engine mandate that NaN 
is to be treated greater than all other values, the predicate x > 1.0 should 
include NaN values. If a page has max = 0.0 now, the engine would not be able 
to skip the page, as the page might contain NaNs which would need to be 
included in the query result.
   
   Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be 
less than all other values by the reading engine. Again, a page with min = 2.0 
couldn't be skipped in this case by the reader.
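
   The `x > 1.0` case can be written down as a small Python sketch. The function `may_skip_page` and its parameters are hypothetical and only model the reasoning in the two paragraphs above.

```python
def may_skip_page(page_max, lower_bound, nan_is_greatest, page_may_contain_nan):
    """Decide whether a page can be skipped for the predicate x > lower_bound.

    Hypothetical model: `nan_is_greatest` describes the engine's NaN
    semantics, `page_may_contain_nan` describes what the statistics reveal.
    """
    if nan_is_greatest and page_may_contain_nan:
        # NaN would satisfy the predicate, and max says nothing about NaNs.
        return False
    return page_max <= lower_bound

# With the current statistics a reader never knows whether NaNs are
# present, so it has to pass page_may_contain_nan=True.
skip_today = may_skip_page(0.0, 1.0, nan_is_greatest=True, page_may_contain_nan=True)
skip_if_known = may_skip_page(0.0, 1.0, nan_is_greatest=True, page_may_contain_nan=False)
```

   The page with max = 0.0 cannot be skipped today, although it could be if the statistics showed that no NaN is present.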
   
   Thus, even if a user doesn't query for NaN explicitly, they might use other 
predicates that need to filter or retain NaNs in the semantics of the reading 
engine, so the fact that we currently can't know whether a page or row group 
contains NaN is a bigger problem than it might seem at first sight.
   
   Currently, any predicate that needs to retain NaNs cannot use min and max 
bounds in Parquet and therefore cannot be used for scan pruning at all. And as 
stated, that can be many seemingly innocuous greater-than or less-than 
predicates in most database systems. Conversely, it would be nice if Parquet 
would enable scan pruning in these cases, regardless of whether the reader and 
writer agree upon whether NaN is smaller, greater, or incomparable to all other 
values.
   
   Note that the problem exists especially if the Parquet file doesn't include 
any NaNs, so this is not only a problem in the edge case where NaNs are 
present; it is a problem in the way more common case of NaNs not being present.
   
   Handling NaNs in a ColumnIndex
   --
   
   There is currently no well-defined way to write a spec-conforming 
ColumnIndex once a page has only NaN (and possibly null) value

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784917#comment-17784917
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805653152

   Congratulations! Take all the time you need, there is no urgency on this 
from my end, just wanted to avoid things stalling out






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784693#comment-17784693
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1805207632

   I hate to not stick to my word, but I won't be able to create the PR today, 
as my wife is going into labor and I'll have to drive her to the clinic soon 😅.
   
   I pushed the status I have so far to my fork. You can already have a look if 
you want: https://github.com/jfinis/parquet-format/tree/totalorder
   
   The commit is basically done, I just wanted to proofread everything and 
write a descriptive message for the commit and the PR. I'll find some time once 
we're back from the hospital, i.e., in a few days. But for now, I first need to 
deliver something else 👶 .






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784578#comment-17784578
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1804343753

   @tustvold I actually already have the change ready in my local repo. I was 
just distracted by other work and it seemed there was little interest so far in 
advancing this quickly, so I didn't update it on github, yet. I can update the 
PR tomorrow :).






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-11-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784386#comment-17784386
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1803642293

   Just coming back to this as it has come up a bit downstream, the approach 
described in 
https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697 makes 
a lot of sense to me. Would it help move this forward if I were to raise a 
separate PR proposing it?
   
   > parquet-mr can efficiently implement this sort order
   
   Provided Java provides some mechanism to interpret a float as an integer, it 
is just a case of some bit operations - 
https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
   
   > Total ordering is nice if the goal is to order the data
   > If the goal is to filter the data then I think any consideration of 
NaN/null/infinity is meaningless
   
   Why would filter predicates not also need a well-defined order? FWIW 
arrow-rs uses total order for **all** floating point comparison.
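
   For reference, the bit operations mentioned above can be sketched in Python (used here only for brevity; in Java the raw bits would come from `Double.doubleToRawLongBits`). The approach follows the same idea as Rust's `f64::total_cmp`; `total_order_key` is a hypothetical name, and the sketch assumes IEEE 754 binary64 floats.

```python
import math
import struct

def total_order_key(x):
    """Map a float to a signed 64-bit integer that sorts in IEEE 754 total order."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    if bits < 0:
        # Sign bit set: flip the other 63 bits so that larger magnitudes
        # sort lower among the negative values.
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits

pos_nan = math.copysign(float("nan"), 1.0)
values = [pos_nan, 1.0, 0.0, -0.0, float("inf"), float("-inf")]
ordered = sorted(values, key=total_order_key)
# ordered is: -inf, -0.0, 0.0, 1.0, inf, nan
```

   Comparing the integer keys gives one well-defined answer for every bit pattern, including -0 versus +0 and NaNs of either sign.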






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741481#comment-17741481
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1628384514

   To support old readers with the statistics we can choose to write 
`TypeDefinedOrder` for FP values in case there are no `NaN` values in the data.
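
A minimal sketch of this fallback, assuming the writer can scan the column's values before deciding. `choose_column_order` is a hypothetical helper, not parquet-mr API, and the returned strings only stand in for the `TypeDefinedOrder` and `IEEE754TotalOrder` column orders named in this thread.

```python
import math
from typing import Iterable, Optional


def choose_column_order(values: Iterable[Optional[float]]) -> str:
    """Pick a column order name for a floating point column.

    Hypothetical helper: keeps the legacy order (readable by old readers)
    unless at least one NaN is present in the data.
    """
    # None stands for a null value; nulls never influence the choice.
    has_nan = any(v is not None and math.isnan(v) for v in values)
    return "IEEE754TotalOrder" if has_nan else "TypeDefinedOrder"
```

Since the column order applies to the whole column, a real writer would presumably have to make this decision before the first statistics are written.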




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
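
The inconsistency described in the issue can be reproduced with a small sketch. It assumes a page is given as a list of Python floats with `None` for null; `page_index_entry` is a hypothetical helper that applies the two quoted spec rules literally and is not taken from any Parquet implementation.

```python
import math
from typing import Optional, Sequence, Tuple


def page_index_entry(
    values: Sequence[Optional[float]],
) -> Tuple[bool, Optional[float], Optional[float]]:
    """Return (null_page, min, max) for one page under the current wording."""
    non_null = [v for v in values if v is not None]
    # null_pages is true only if the page contains nothing but nulls.
    null_page = len(non_null) == 0
    # "NaNs should not be written to min or max statistics fields."
    candidates = [v for v in non_null if not math.isnan(v)]
    if not candidates:
        # No value is left to serve as a bound.
        return null_page, None, None
    return null_page, min(candidates), max(candidates)
```

For a page holding only NaNs (and possibly nulls) this yields `null_page == False` together with no usable min/max, which is exactly the combination the `null_pages` comment forbids.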





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741310#comment-17741310
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1627373091

   The new `IEEE754TotalOrder` looks elegant to me, though a single NaN value 
may still ruin the page index. Another challenge is how parquet-mr can 
efficiently implement this sort order.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741127#comment-17741127
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

westonpace commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625662823

   > CON: NaNs will be used in min/max bounds, even for pages that are not 
NaN-only. This makes them less effective for filtering (as they are the widest 
possible bounds) but @crepererum made a good point that this "special case for 
NaN" is quite arbitrary and we could also special case INT_MAX for integer 
columns, for example. I see the argument that keeping the architecture simple 
might be preferable. Also NaNs are not widely used, so this will not be 
detrimental to many data sets.
   
   I agree this is a con.  Total ordering is nice if the goal is to order the 
data.  If the goal is to filter the data then I think any consideration of 
NaN/null/infinity is meaningless.
   
   However, I also agree with @crepererum that this is a slippery slope and I 
agree with @JFinis that NaNs are not widely used and simpler is better.  I 
don't entirely agree the solution can always be to replace NaN/Infinity with 
NULL but the cases where it can't are probably very rare.  Besides, the penalty 
here is only a performance loss and not incorrect results so it's more 
manageable.
   
   So, on balance, I'd say I'm neutral.  If there are other advantages to 
this approach then the disadvantages to dataset filtering are probably not 
enough to outweigh them.  We might want to add a short sentence to some kind of 
pyarrow or parquet documentation somewhere so that users can be aware of this.





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741077#comment-17741077
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625546137

   @JFinis Thanks a lot! I agree that makes sense. The main problem IMHO is 
that old readers wouldn't support page filtering on such files. That said, we 
have to move forward somehow.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741072#comment-17741072
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625537697

   > > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this 
proposal?
   > 
   > It's difficult to say without understanding the implications. Say a data 
page contains some NaNs, what happens?
   
   @pitrou On the write path:
   
   * The writing library would set the `ColumnOrder` for this column to the new 
option, let's call it `IEEE754TotalOrder`.
   * The writing library would use IEEE 754 total order for all ordering and 
sorting tasks. I.e., it would compute the min and max values of the page using 
that total order, which has a defined place for NaN. The writer would *not* 
have special logic for NaN; it would just order everything using the total 
order. E.g., for a page containing a positive NaN, that NaN would be chosen as 
the max value, as a positive NaN > everything else in the total order.
   
   On the read path:
   * A reading engine encountering the new `IEEE754TotalOrder` column order 
would either
     a) (legacy reader) not understand it and therefore not use any 
statistics, as it doesn't understand the semantics of the order relation, or
     b) (new reader) understand it and assume that all ordering is IEEE 754 
total order, which again has a defined place for NaNs. Depending on its own NaN 
semantics, the reading engine would need to align the values it sees in min/max 
with those semantics. What this alignment looks like depends on the engine. (I 
can give more detailed examples for different engine semantics if necessary.)
   
   Ramifications:
   * PRO: Due to the new column order, legacy readers are guarded. They don't 
need to understand the new order. Even if they ignore the column order, if they 
see NaNs in min and max they will just ignore them and assume the statistics 
aren't usable. So we have two layers of protection to make sure legacy readers 
don't misunderstand the ordering.
   * PRO: No special fields for NaNs. No `nan_counts`, no `nan_pages`. Instead, 
NaNs are just treated as defined in the total ordering.
   * PRO: Simple standardized handling of floats instead of special handling of 
NaNs. I guess this was the main point of @tustvold and @crepererum.
   * PRO: Engines only need to understand total ordering and don't need any 
special NaN handling code anymore (unless their semantics are different, in 
which case they need to map their semantics from / to total ordering).
   * CON: NaNs *will* be used in min/max bounds, even for pages that are not 
NaN-only. This makes them less effective for filtering (as they are the widest 
possible bounds) but @crepererum made a good point that this "special case for 
NaN" is quite arbitrary and we could also special case INT_MAX for integer 
columns, for example. I see the argument that keeping the architecture simple 
might be preferable. Also NaNs are not widely used, so this will not be 
detrimental to many data sets.
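
To make the read-path alignment concrete, here is a minimal sketch for one possible engine semantics. It assumes that the page min was computed under IEEE 754 total order, that the engine treats every comparison with NaN as false, and that the filter constant `c` is not NaN. `can_skip_less_than` is a hypothetical name, not part of any Parquet reader.

```python
import math


def can_skip_less_than(page_min: float, c: float) -> bool:
    """Return True if no value in the page can satisfy `x < c`.

    Assumption: in this engine NaN never matches a comparison predicate.
    """
    if math.isnan(page_min):
        # Negative NaN sorts below everything, so it hides the smallest
        # real value and the page cannot be pruned.
        # Positive NaN sorts above everything, so a positive NaN as min
        # means the page holds only positive NaNs and can be pruned.
        return math.copysign(1.0, page_min) > 0
    # Ordinary number or infinity: every non-NaN value is >= page_min.
    return page_min >= c
```

The negative-NaN branch is where the CON above shows up: the bound carries no information about the real values, so the page has to be read.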

 
 





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741049#comment-17741049
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625464720

   > @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this 
proposal?
   
   It's difficult to say without understanding the implications. Say a data 
page contains some NaNs, what happens?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741022#comment-17741022
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625392665

   > I guess this can also be implemented in each language by "bit casting" the 
float bits to integer bits and doing an integer comparison, correct?
   
   It's a bit more than a simple bit cast, but broadly speaking yes.
   
   https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
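
As an illustration of the "a bit more than a simple bit cast" remark, here is a minimal Python sketch of the usual trick for 64-bit floats: reinterpret the bits as a signed integer and, for negative values, flip the lower 63 bits. To my understanding this is the same transformation Rust's `f64::total_cmp` applies; `total_order_key` is a hypothetical name chosen for this example.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a float64 to an int whose ordering follows IEEE 754 total order."""
    # Reinterpret the 64 bits of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    if bits < 0:
        # Sign bit set: flip the lower 63 bits so that larger magnitudes
        # sort lower. Non-negative values are already in the right order.
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


pos_nan = math.copysign(float("nan"), 1.0)
neg_nan = math.copysign(float("nan"), -1.0)
values = [pos_nan, 1.0, -0.0, 0.0, float("-inf"), float("inf"), neg_nan, -1.0]
# Sorted order: -NaN, -inf, -1.0, -0.0, +0.0, 1.0, +inf, +NaN
ordered = sorted(values, key=total_order_key)
```

With such a key, page min and max are simply `min(page, key=total_order_key)` and `max(page, key=total_order_key)`, with no NaN-specific branch.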
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
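
To make the contradiction described in the issue concrete, here is a minimal Python sketch of a writer that follows both quoted rules literally. The function name `page_index_entry` and the use of `None` for "no valid bound" are illustrative assumptions, not part of any Parquet implementation.

```python
import math
from typing import Optional, Sequence, Tuple


def page_index_entry(
    values: Sequence[Optional[float]],
) -> Tuple[bool, Optional[Tuple[float, float]]]:
    """Return (null_page, min_max) for one page under the quoted rules."""
    non_null = [v for v in values if v is not None]
    # null_pages rule: true only if the page contains nothing but nulls.
    null_page = not non_null
    # Statistics rule: NaNs must not be written as min or max.
    candidates = [v for v in non_null if not math.isnan(v)]
    min_max = (min(candidates), max(candidates)) if candidates else None
    return null_page, min_max


# An only-NaN page: null_page is False, so valid bounds are required,
# yet no non-NaN value exists to supply them.
print(page_index_entry([float("nan"), None, float("nan")]))  # (False, None)
```

The `(False, None)` result is exactly the state the issue calls unrepresentable: the entry is not a null page, but there is no bound that both rules allow.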



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741009#comment-17741009
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625354736

   @tustvold @crepererum Do I interpret your answer correctly in that your 
suggestion would be to
   
   * Create a new `ColumnOrder` for floats that simply is defined as IEEE 754 
total order, if we need such new order for backward compatibility (which we 
probably need, as apparently parquet-mr will otherwise perform filtering 
incorrectly)
   * When that order is used, don't handle NaNs explicitly. Instead, just use 
the total order relation for ordering and min/max computation (which will 
result in NaNs being written as max and -NaNs being written as min if they 
exist).
   
   Did I get this right?
   
   I guess this can also be implemented in each language by "bit casting" the 
float bits to integer bits and doing an integer comparison, correct? So even if 
the underlying language doesn't have native support for total ordering, it 
should still be possible to implement this.
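
As a rough illustration of the bit-casting idea (a sketch in Python, not code from any Parquet implementation), the helper below reinterprets a 64-bit float as a signed integer and flips the value bits of negative numbers, so that plain integer comparison matches the IEEE 754 total order. The names `total_order_key` and `from_bits` are hypothetical.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a float64 to an int whose ordering matches IEEE 754 totalOrder."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # For negative floats, flip the 63 non-sign bits so that larger
    # magnitudes compare as smaller integers.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


def from_bits(pattern: int) -> float:
    """Build a float64 from an unsigned 64-bit pattern (helper for the demo)."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]


neg_nan = from_bits(0xFFF8_0000_0000_0000)  # quiet NaN with the sign bit set
pos_nan = from_bits(0x7FF8_0000_0000_0000)  # quiet NaN with the sign bit clear

page = [1.5, pos_nan, -0.0, 0.0, neg_nan, float("-inf")]
page_min = min(page, key=total_order_key)  # the negative NaN
page_max = max(page, key=total_order_key)  # the positive NaN
```

With this key, a page containing both NaN signs ends up with -NaN as min and +NaN as max, which is the behaviour described in the second bullet above.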
   
   I do see a certain beauty in this approach in it being "simple". As always, 
I'm happy to adapt my PR to this approach, if we can get consensus that we want 
this.
   
   @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal?
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to 

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741001#comment-17741001
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625347116

   Okay, `[-NaN, +NaN]` as min-max would be ignored in C++ Statistics. I'm OK 
with these solutions.
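
A reader-side sketch of what "ignored" could mean here. This is an illustration in Python with hypothetical names, not the actual C++ Statistics code: if either bound is NaN, the bounds are treated as unknown and the page is never pruned.

```python
import math
from typing import Optional


def can_skip_page_for_greater_than(
    threshold: float, page_min: Optional[float], page_max: Optional[float]
) -> bool:
    """True if no value in the page can satisfy `value > threshold`."""
    if page_min is None or page_max is None:
        return False  # no statistics at all: the page must be read
    if math.isnan(page_min) or math.isnan(page_max):
        return False  # NaN bounds carry no usable ordering: keep the page
    return page_max <= threshold
```

Under this assumption an old reader stays correct when it meets NaN bounds; it only loses the chance to skip the page.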






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740984#comment-17740984
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625304128

   > I think we already have type-defined order
   
   Indeed, and what I am suggesting is: rather than layering on more complexity 
to work around the problems of such an approach, how about we just remove this 
complexity?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740976#comment-17740976
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625277042

   I think we already have a type-defined order, and already exclude +inf and 
-inf. And now, if a page is all `NaN`, the page would be excluded.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-07-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740966#comment-17740966
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

crepererum commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1625220272

   I agree w/ @tustvold's standpoint. Some thoughts on top of what he wrote:
   
   IMHO this is leaking application details into the storage format. If you 
start to differentiate NaN from "all normal values" and NULL you may do the 
same for +/-Inf, because it also acts as a poison value in most computations. 
But you may also do that for "nearly Inf" because someone divided by "nearly 
zero" and these super big values are equally nonsensical. This whole discussion 
isn't even specific to floats. Why do boolean stats not count true/false 
separately? What about empty strings and byte arrays? Or empty lists in 
general? My point is: this is opening a can of worms and the complexity isn't 
worth the gain.
   
   The better alternative is: let the user cast invalid values to NULL if they 
wanna exclude them from their data, because this is exactly what missing values 
were invented for. If they still want to store broken data and want to have 
some niche understanding of statistics, provide a way to attach 
application-defined stats to parquet (this extends to a number of histogram 
types or counts of other "special" values). Keep the storage format baseline 
simple. IEEE total ordering is well defined and universally agreed upon. I 
think the world doesn't need yet another special floating point treatment.
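
The "cast invalid values to NULL" alternative can be sketched on the application side as follows. This is only an illustration in Python: the function name `invalid_to_null` and the pluggable `is_invalid` predicate are assumptions, since what counts as "invalid" is left to the application.

```python
import math
from typing import Callable, List, Optional


def invalid_to_null(
    values: List[Optional[float]],
    is_invalid: Callable[[float], bool] = math.isnan,
) -> List[Optional[float]]:
    """Replace application-defined invalid values with None before writing."""
    return [
        None if v is not None and is_invalid(v) else v
        for v in values
    ]


column = [1.0, float("nan"), None, float("inf")]
print(invalid_to_null(column))  # [1.0, None, None, inf]
```

Passing a different predicate, for example `lambda v: not math.isfinite(v)`, would also null out the infinities, so the storage format itself never has to know about these choices.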





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739052#comment-17739052
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614549513

   @mapleFU, as I've written before, that's why we initiated 
[ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863)
 to make the format open to specifying orderings. I don't know how the other 
implementations use this already. In the current parquet-mr (since we 
introduced `ColumnOrder`) there is logic that drops any statistics if the 
defined column order is not known. So we can safely initiate a new one. We can 
say that if the min/max value would contain a NaN, then we would write the new 
`IEEE_754` column order, otherwise `TYPE_ORDER`. In this case we can simply skip 
the additional lists for marking all-NaN pages and write the NaN values into 
the statistics instead. The question is how older readers of the other 
implementations would handle an unknown `ColumnOrder`.
   It is an implementation detail that the NaN handling in Java is different 
from what IEEE 754 says. Java has only one NaN bit pattern. So handling this 
ordering will require additional work. I hope it can be implemented in a 
performant way.
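
The scheme described above can be sketched roughly as follows. This is a hypothetical Python illustration, not parquet-mr code: the string constants stand in for `ColumnOrder` variants, and `IEEE_754` is only the name proposed in this comment.

```python
import math
from typing import Optional, Set, Tuple


def choose_column_order(min_value: float, max_value: float) -> str:
    """Writer side: use the new order only when a bound would be NaN."""
    if math.isnan(min_value) or math.isnan(max_value):
        return "IEEE_754"
    return "TYPE_ORDER"


def usable_stats(
    column_order: str,
    min_value: float,
    max_value: float,
    known_orders: Set[str],
) -> Optional[Tuple[float, float]]:
    """Reader side: drop statistics written under an unknown column order."""
    if column_order not in known_orders:
        return None
    return (min_value, max_value)


# A hypothetical older reader that only knows TYPE_ORDER.
old_reader = {"TYPE_ORDER"}
order = choose_column_order(float("nan"), float("nan"))
print(order)                                      # IEEE_754
print(usable_stats(order, 0.0, 1.0, old_reader))  # None
```

Files without NaN bounds keep `TYPE_ORDER` and remain fully usable by old readers, while a reader with the drop-if-unknown logic simply sees no statistics for the rest.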
   





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739036#comment-17739036
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614520051

   > Currently the arrow-rs implementation uses the totalOrder predicate as 
defined by the IEEE 754 (2008 revision) floating point standard to order 
floats, this can be very efficiently implemented using some bit-twiddling and 
at least appears to define the standardised way to handle this. 
   
   So arrow-rs handles float/double comparisons nicely. I guess we only need to 
make sure that the new data will not be broken by a stale parquet-mr reader? 
@gszadovszky 




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
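
The contradiction described above, and the effect of solution 3, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `column_index_entry` and its `nan_as_bounds` flag are hypothetical names, a page is modelled as a list of floats with `None` for nulls, and `None` bounds stand in for `byte[0]`.

```python
import math
from typing import Optional, Sequence, Tuple

Entry = Tuple[bool, Optional[float], Optional[float]]  # (null_page, min, max)


def column_index_entry(page: Sequence[Optional[float]], nan_as_bounds: bool) -> Entry:
    """Compute one (null_pages, min_values, max_values) entry for a page.

    Hypothetical helper: nan_as_bounds=False follows the current wording
    (NaNs never written to min/max), nan_as_bounds=True follows solution 3.
    """
    non_null = [v for v in page if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and the bounds are byte[0] (None here).
        return True, None, None

    non_nan = [v for v in non_null if not math.isnan(v)]
    if non_nan:
        # Ordinary page: NaNs are ignored when computing the bounds.
        return False, min(non_nan), max(non_nan)

    # Only NaNs (possibly mixed with nulls).
    if nan_as_bounds:
        # Solution 3: NaN is written as both min and max.
        return False, math.nan, math.nan
    # Current wording: null_pages must be false, so valid bounds are required,
    # but the only candidate value is NaN, which must not be written.
    raise ValueError("no conforming entry exists for an only-NaN page")
```

With `nan_as_bounds=True`, a reader could recognise an only-NaN page by checking that both bounds are NaN.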



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739025#comment-17739025
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

tustvold commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614476748

   > I wonder for PageIndex pruning in Rust implementations
   
   Currently the arrow-rs implementation orders floats using the totalOrder 
predicate defined by the IEEE 754 (2008 revision) floating point standard. This 
can be implemented very efficiently with some bit-twiddling, and it at least 
appears to be the standardised way to handle this. I believe DataFusion uses 
these same comparison kernels for evaluating pruning predicates, so I would 
expect it to behave similarly with regard to NaNs.
   
   From the [Rust 
docs](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp):
   
   > The values are ordered in the following sequence:
   > 
   > negative quiet NaN
   > negative signaling NaN
   > negative infinity
   > negative numbers
   > negative subnormal numbers
   > negative zero
   > positive zero
   > positive subnormal numbers
   > positive numbers
   > positive infinity
   > positive signaling NaN
   > positive quiet NaN.
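
The bit-twiddling behind this ordering can be sketched as follows. This is a minimal Python illustration of the general technique, not the arrow-rs code: `total_order_key` is a hypothetical helper that reinterprets a double as a signed 64-bit integer and flips the non-sign bits of negative values, so that plain integer comparison follows the sequence listed above.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose order matches IEEE 754 totalOrder."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set, so they are negative integers.
    # Flipping their remaining 63 bits makes larger magnitudes sort lower.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


values = [float("nan"), 1.0, -0.0, 0.0, float("-inf"), float("inf"), -1.0]
ordered = sorted(values, key=total_order_key)
```

Here `ordered` places `-0.0` before `0.0` and the (positive) NaN after `+inf`, which a plain `sorted(values)` does not guarantee because NaN comparisons are always false.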
   
   > would it matter for adding [-inf, +inf] as min-max for all nan and null 
pages
   
   I haven't read the full backscroll, but the original PR's suggestion of just 
writing a NaN for a page only containing NaN seems perfectly logical to me, 
unlikely to cause compatibility issues, and significantly less surprising than 
writing a value that doesn't actually appear in the data...
   
   > Let's cc some of the maintainers of 
[parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet):
   
   I don't really know enough about the history of floating point comparison to 
weigh in on what the best solution is with any degree of authority. However, my 
2 cents is that the totalOrder predicate is the standardised way to handle this.
   
   Whilst I do agree that the behaviour of aggregate statistics containing NaNs 
might be unfortunate for some workloads, I'm not sure that special casing them 
is beneficial. Aside from the non-trivial additional complexity associated with 
special-casing them, if you don't include NaNs in statistics it is unclear to 
me how you can push down a comparison predicate as you have no way to know if 
the page contains NaNs? Perhaps that is what this PR seeks to address, but I do 
wonder if the simple solution might be worth considering...
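
The pushdown concern can be made concrete with a small Python sketch. It assumes, purely for illustration, an engine in which NaN compares greater than every number (as positive NaN does under totalOrder); `page_stats` and `can_skip_greater_than` are hypothetical helpers, not part of any Parquet implementation.

```python
import math


def page_stats(values, include_nan):
    """Return (min, max) of a non-empty page, with NaN ordered above all numbers."""
    key = lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v)
    kept = [v for v in values if include_nan or not math.isnan(v)]
    if not kept:
        return None, None  # only-NaN page with NaNs excluded: no bounds at all
    return min(kept, key=key), max(kept, key=key)


def can_skip_greater_than(page_max, literal):
    """Decide whether a page can be skipped for the predicate `x > literal`."""
    if math.isnan(page_max):
        return False  # NaN sorts above every number, so the page may match
    return page_max <= literal


page = [1.0, 2.0, math.nan]
_, max_without_nan = page_stats(page, include_nan=False)
_, max_with_nan = page_stats(page, include_nan=True)
```

With NaNs excluded, `can_skip_greater_than(max_without_nan, 5.0)` is true and the page is dropped even though, under the assumed semantics, its NaN row satisfies `x > 5.0`. With NaNs included the page is kept.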
   
   Also tagging @crepererum who may have further thoughts





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739008#comment-17739008
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247688333


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Trying to catch up on the discussion. I like the idea of writing either 
[-inf, +inf] or [-0.0, +0.0] for NaN-only pages.
   
   As NaN values do not have a well-defined order across systems, simply 
relying on page min/max values to filter NaNs does not make sense. Therefore 
I think this design can break such misuse.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738996#comment-17738996
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614407651

   https://github.com/apache/parquet-format/pull/196#discussion_r1237381221 
   @alamb @tustvold Hi, for PageIndex pruning in the Rust implementation, would 
it matter if we add `[-inf, +inf]` as min/max for pages containing only NaNs 
and nulls? Would it harm the column pruning for `IS_NAN` or other operations?
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738993#comment-17738993
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247671677


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I think `[-inf, +inf]` is OK. I guess only the Rust implementation and 
parquet-mr potentially have this problem.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738941#comment-17738941
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247575939


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @mapleFU, I did not think about any specific implementation. (TBH, I only 
have experience with parquet-mr.) This is mentioned in the PR description. 
Maybe we do not have any such implementations.
   
   @JFinis, I agree we should not care about the potential systems already 
writing NaN values into column indexes. Also agree that writing NaN values to 
min/max is risky for existing systems. So we need to write non-NaN valid values 
to min/max for all-NaN pages. (And of course mark them with either `nan_pages` 
or `value_counts`.)
   
   The more we narrow the range the higher the chance the page will be dropped 
during filtering which is good because we should not search for NaN values 
based on the spec anyway. What do you think about `[-Inf, -Inf]`? The worst 
case is we will read the page of all NaN values instead of dropping. In this 
very case we would not drop it for `< x` like cases. (This turned out to be the 
rephrasing and summary of your previous comments. :smile: )
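
The trade-off between the candidate placeholder bounds can be sketched in Python. This is an illustration only: `may_match` is a hypothetical reader-side check using ordinary float comparison, and the literal `5.0` is an arbitrary example value.

```python
import math


def may_match(page_min, page_max, op, literal):
    """Return True if a page with these bounds must be read for `x <op> literal`."""
    if op == "<":
        return page_min < literal
    if op == ">":
        return page_max > literal
    if op == "==":
        return page_min <= literal <= page_max
    raise ValueError(f"unsupported operator: {op}")


# Candidate bounds to write for an only-NaN page.
PLACEHOLDERS = {
    "[-Inf, +Inf]": (-math.inf, math.inf),
    "[-Inf, -Inf]": (-math.inf, -math.inf),
}


def pages_read(bounds, literal=5.0):
    """For each operator, report whether the only-NaN page would still be read."""
    return {op: may_match(*bounds, op, literal) for op in ("<", ">", "==")}
```

Under these assumptions, `[-Inf, +Inf]` causes the only-NaN page to be read for all three predicates, whereas `[-Inf, -Inf]` lets the reader drop it for `>` and `==` and only reads it for `< x`-like predicates.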






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738171#comment-17738171
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245383453


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > there was an argument that some writers already write NaN values into 
column indexes. Hence, they try to filter on NaN values. Now, we start writing 
`[-Inf,+Inf]` for NaN only pages. NaN is probably out of `[-Inf,+Inf]` interval 
so that reader would drop the only NaN page while searching for a NaN.
   
   Hi Gabor, which implementation does that? I checked the C++ implementation 
and it doesn't do this. Maybe we can check here? I guess `[-inf, +inf]` 
sounds OK.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738169#comment-17738169
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1611623088

   Let's cc some of the maintainers of 
[parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet): @adamgs 
@tustvold @alamb




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
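The contradiction described in the issue can be reproduced with a minimal sketch. The helper `page_bounds` below is hypothetical (it is not part of any Parquet library); it only assumes the two quoted rules: nulls and NaNs are skipped when computing min/max, and `null_pages` is true only if every value is null.

```python
import math
from typing import Optional, Sequence, Tuple


def page_bounds(
    values: Sequence[Optional[float]],
) -> Tuple[bool, Optional[float], Optional[float]]:
    """Return (null_page, min, max) for one page.

    Hypothetical helper: nulls (None) and NaNs are excluded from the
    bounds, as the quoted statistics rule requires.
    """
    non_null = [v for v in values if v is not None]
    null_page = len(non_null) == 0
    comparable = [v for v in non_null if not math.isnan(v)]
    if not comparable:
        # Nothing left to bound: either all nulls or only NaNs (and nulls).
        return null_page, None, None
    return null_page, min(comparable), max(comparable)


# An only-NaN page: null_pages must be false, yet no valid min/max exists.
print(page_bounds([math.nan, None, math.nan]))  # (False, None, None)
```

The result `(False, None, None)` is exactly the state the current wording forbids: `null_pages` is false, so valid bounds are required, but the only candidate value is NaN.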



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738163#comment-17738163
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou writing anything but NaN into min/max was one of my suggestions to 
circumvent the problem that [parquet-mr doesn't seem to check for NaN values in 
min/max while 
reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931)
 and therefore would probably yield wrong results once we start writing NaNs 
into these values.
   
   This would only work if we go back to maintaining either `nan_pages` or 
`value_counts` though, as otherwise, as you correctly pointed out, we wouldn't 
have a way to draw the distinction between only-NaN and real infinities.
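To illustrate the distinction mentioned above, here is a rough sketch of how a reader aware of a per-page `nan_pages` flag might treat two pages that carry identical `[-Inf, +Inf]` bounds. The function name and argument layout are assumptions made for this example, not an existing reader API.

```python
import math

INF = math.inf


def can_skip_for_range(
    min_v: float, max_v: float, nan_page: bool, lo: float, hi: float
) -> bool:
    """Decide whether a page can be skipped for the predicate lo <= x <= hi.

    Hypothetical reader logic: nan_page mirrors one entry of the proposed
    nan_pages list.
    """
    if nan_page:
        # All non-null values are NaN, and NaN never satisfies a range predicate.
        return True
    # Ordinary interval test on the stored bounds.
    return max_v < lo or min_v > hi


# Identical bounds; only the flag tells the two pages apart.
only_nan_page = (-INF, INF, True)
real_inf_page = (-INF, INF, False)
print(can_skip_for_range(*only_nan_page, 0.0, 10.0))  # True
print(can_skip_for_range(*real_inf_page, 0.0, 10.0))  # False
```

Without the flag (or an equivalent count), both pages look the same and neither can be skipped.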






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738162#comment-17738162
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245364253


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou writing anything but NaN into min/max was my suggestion to 
circumvent the problem that [parquet-mr doesn't seem to check for NaN values in 
min/max while 
reading](https://github.com/apache/parquet-format/pull/196#discussion_r1243234931)
 and therefore would probably yield wrong results once we start writing NaNs 
into these values.
   
   This would only work if we go back to maintaining either `nan_pages` or 
`value_counts` though, as otherwise, as you correctly pointed out, we wouldn't 
have a way to draw the distinction between only-NaN and real infinities.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738146#comment-17738146
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245293430


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > A new reader that implements this PR can do the distinction via the 
nan_pages or value_counts computation.
   
   Wait... I thought the `[-Inf, +Inf]` convention was meant to avoid a new 
`nan_pages` or `value_counts` field? If not, then what's the point?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738141#comment-17738141
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @gszadovszky Any reader/writer pair that writes NaNs into column 
indexes (and other places like page headers) and expects them to be there (and 
otherwise yields wrong results while reading) *is not and never was* spec 
conforming. In older releases of the spec where NaN wasn't mentioned yet, such 
a writer was at least not violating the spec directly, but even then NaN 
handling was basically "undefined behavior", as the spec never mentioned how to 
treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was already 
a non-portable assumption back then.
   
   Even today, a reader relying on one specific NaN semantics would already 
yield erroneous results when reading spec-conforming Parquet files. E.g., if 
they search for NaNs and expect them to be in min/max, then they might filter 
out pages containing NaNs that don't have NaNs in their min/max. Consequently, such 
a reader is already broken; yes, writing [-Inf,+Inf] into the column index would 
break such a reader more, but all bets are off here anyway. It 
currently is just not possible to handle NaNs correctly in a portable way 
(that's what this PR is all about in the first place).
   
   So TBH backward compatibility to such a broken (or at least non-portable) 
reader/writer pair seems like an absolute non-goal to me.
   
   @pitrou A legacy reader who doesn't handle the new NaN semantics doesn't 
need to distinguish here. All they need to know is whether they should skip the 
page or shouldn't. A page with [-Inf,+Inf] can never be skipped, so regardless 
of whether the bounds are there due to NaNs or real infinities, a legacy reader 
would yield correct results. A new reader that implements this PR can do the 
distinction via the nan_pages or value_counts computation.
   
   Note that actually *any* bounds are, mathematically speaking, correct for a 
page containing only NaNs (and will yield correct results on spec-conforming 
readers!). Note that min/max values in the column index don't need to be tight, 
according to the spec. So the only condition that must hold is that there is no 
value outside of the bounds (NaNs excluded). As an only-NaN page has no values, 
any bounds satisfy the condition, as there are no values that need to lie 
inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. 
Both would yield correct results on spec conforming readers. Actually the 
tighter the bounds, the more queries can skip the page.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738143#comment-17738143
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @gszadovszky Any reader/writer pair that writes NaNs into column 
indexes (and other places like page headers) and expects them to be there (and 
otherwise yields wrong results while reading) *is not and never was* spec 
conforming. In older releases of the spec where NaN wasn't mentioned yet, such 
a writer was at least not violating the spec directly, but even then NaN 
handling was basically "undefined behavior", as the spec never mentioned how to 
treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was already 
a non-portable assumption back then.
   
   Even today, a reader relying on one specific NaN semantics would already 
yield erroneous results when reading spec-conforming Parquet files. E.g., if 
they search for NaNs and expect them to be in min/max, then they might filter 
out pages containing NaNs that don't have NaNs in their min/max. Consequently, such 
a reader is already broken; yes, writing [-Inf,+Inf] into the column index would 
break such a reader more, but all bets are off here anyway. It 
currently is just not possible to handle NaNs correctly in a portable way 
(that's what this PR is all about in the first place).
   
   So TBH backward compatibility to such a broken (or at least non-portable) 
reader/writer pair seems like an absolute non-goal to me.
   
   @pitrou A legacy reader who doesn't handle the new NaN semantics doesn't 
need to distinguish here. All they need to know is whether they should skip the 
page or shouldn't. A page with [-Inf,+Inf] can never be skipped, so regardless 
of whether the bounds are there due to NaNs or real infinities, a legacy reader 
would not skip the page and therefore yield correct results. A new reader that 
implements this PR can do the distinction via the nan_pages or value_counts 
computation.
   
   Note that actually *any* bounds are, mathematically speaking, correct for a 
page containing only NaNs (and will yield correct results on spec-conforming 
readers!). Note that min/max values in the column index don't need to be tight, 
according to the spec. So the only condition that must hold is that there is no 
value outside of the bounds (NaNs excluded). As an only-NaN page has no values, 
any bounds satisfy the condition, as there are no values that need to lie 
inside them. So instead of [-Inf,+Inf] we could also choose [0,0] or [42,1337]. 
Both would yield correct results on spec conforming readers. Actually the 
tighter the bounds, the more queries can skip the page.
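The "any bounds are correct" argument can be checked with a small sketch. Both helpers are hypothetical and only encode what the comment states: bounds are valid if no non-null, non-NaN value lies outside them, and a reader skips a page when the query range does not overlap the bounds.

```python
import math


def bounds_are_valid(values, min_v, max_v):
    """True if every non-null, non-NaN value lies inside [min_v, max_v]."""
    return all(
        min_v <= v <= max_v
        for v in values
        if v is not None and not math.isnan(v)
    )


def can_skip(min_v, max_v, lo, hi):
    """True if the query range [lo, hi] cannot overlap [min_v, max_v]."""
    return max_v < lo or min_v > hi


only_nan_page = [math.nan, None, math.nan]
candidates = [(-math.inf, math.inf), (0.0, 0.0), (42.0, 1337.0)]

for min_v, max_v in candidates:
    # Vacuously valid: the page has no value that could violate the bounds.
    valid = bounds_are_valid(only_nan_page, min_v, max_v)
    # Tighter bounds allow the page to be skipped for the query [2000, 3000].
    skipped = can_skip(min_v, max_v, 2000.0, 3000.0)
    print((min_v, max_v), valid, skipped)
```

All three candidates pass the validity check, but only the two finite intervals let the example query skip the page, which matches the remark that tighter bounds help more queries.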






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737850#comment-17737850
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244244531


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > Now, we start writing `[-Inf,+Inf]` for NaN only pages.
   
   Also, how does the reader distinguish these from pages that contain actual 
infinity values?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737848#comment-17737848
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244236273


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @JFinis, there was an argument that some writers already write NaN values 
into column indexes. Hence, those implementations also try to filter on NaN 
values. If we now start writing `[-Inf, +Inf]` for only-NaN pages, NaN probably 
falls outside the `[-Inf, +Inf]` interval, so such a reader would drop the 
only-NaN page while searching for NaN.
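
The concern can be illustrated with a minimal sketch. It assumes a hypothetical legacy reader that keeps a page only when the search value lies within the stored bounds; `page_may_contain` is an illustrative name, not a parquet-mr API. Because every ordered comparison with NaN is false under IEEE 754 semantics, a search for NaN drops the page:

```python
import math


def page_may_contain(value: float, min_value: float, max_value: float) -> bool:
    """Hypothetical legacy check: keep the page only if value is within bounds."""
    return min_value <= value <= max_value


# Bounds a writer would store for an only-NaN page under the [-Inf, +Inf] proposal.
lo, hi = -math.inf, math.inf

keep_for_nan = page_may_contain(math.nan, lo, hi)  # comparisons with NaN are false
keep_for_one = page_may_contain(1.0, lo, hi)       # finite values are inside the bounds
print(keep_for_nan, keep_for_one)
```

Whether a real reader behaves this way depends on how it compares floating point values; the sketch only shows the plain IEEE comparison case.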





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parque

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737794#comment-17737794
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1244075748


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I think `[-Inf, +Inf]` is ok.
   
   > Add in comments that in the page index, all nan pages can be checked by 
having nan_count > 0 && min is NaN && max is NaN
   
   The previous design used `[NaN, NaN]`, which I think is bad. `[-Inf, +Inf]` 
should be handled well and does not introduce any ambiguity. I'm +1 on this.
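
The quoted check can be written down as a small sketch. It assumes a per-page `nan_count` is available to the reader, which is a proposal in this thread and not an existing ColumnIndex field; `is_only_nan_page` is a hypothetical helper name. Note that the NaN test has to use `math.isnan`, because `x == NaN` is always false:

```python
import math


def is_only_nan_page(nan_count: int, min_value: float, max_value: float) -> bool:
    """Hypothetical check from the quoted suggestion:
    nan_count > 0 && min is NaN && max is NaN."""
    return nan_count > 0 and math.isnan(min_value) and math.isnan(max_value)


only_nan = is_only_nan_page(3, math.nan, math.nan)  # bounds written as [NaN, NaN]
mixed = is_only_nan_page(3, 1.0, 2.0)               # NaNs present, but real bounds exist
no_nan = is_only_nan_page(0, 1.0, 2.0)              # no NaNs at all
print(only_nan, mixed, no_nan)
```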






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737712#comment-17737712
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243910046


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > your idea sounds good but it is not that easy, unfortunately. Since no 
total ordering is specified NaN values can get before negative infinity or 
after positive infinity. An implementation that currently writes NaN values to 
column indexes will break in this scenario.
   
   @gszadovszky I don't fully understand your argument here. We just want to 
make sure that a legacy reader that doesn't know the new semantics yet will 
definitely *never filter* an only-NaN page. By using min=-Infinity and 
max=Infinity, we write bounds that are as wide as they can get, so no legacy 
implementation should ever filter this page, which is the goal for correctness.
   
   Could you elaborate on how you think an implementation would break? Maybe 
with an example?
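
The "never filter" argument can be sketched for ordinary predicates. The pruning function below is a hypothetical stand-in for a legacy reader's min/max logic, not code from any Parquet implementation. For finite literals, a page with bounds `[-Infinity, Infinity]` is never skipped, which is conservative: the page is read even though its NaNs cannot match.

```python
import math


def can_skip_page(op: str, literal: float, min_value: float, max_value: float) -> bool:
    """Hypothetical min/max pruning: True if no value in [min, max]
    can satisfy `x <op> literal`."""
    if op == "<":
        return not (min_value < literal)
    if op == ">":
        return not (max_value > literal)
    if op == "==":
        return not (min_value <= literal <= max_value)
    raise ValueError(f"unsupported operator: {op}")


# An only-NaN page written with the widest possible bounds.
lo, hi = -math.inf, math.inf

skipped = [
    (op, lit)
    for op in ("<", ">", "==")
    for lit in (-1e308, 0.0, 42.0, 1e308)
    if can_skip_page(op, lit, lo, hi)
]
print(skipped)  # empty list: no finite literal lets the reader skip the page
```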






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737704#comment-17737704
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243891389


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou, sorry, `BoundaryOrder` was a typo. I was talking about 
[ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863).
 The only order currently specified is `TypeDefinedOrder`. We were thinking 
about adding a `ColumnOrder` for FLOAT/DOUBLE that defines a total ordering 
including NaN, -0.0, and +0.0 values.
   Maybe you're right that a system for which the default string ordering is 
not enough should write its own indexes. But one idea behind ColumnOrder was to 
possibly implement collations to support those systems.
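
To make the idea of a total ordering over NaN, -0.0, and +0.0 concrete, here is a minimal sketch of the well-known bit trick behind IEEE 754 `totalOrder`-style comparison. It is an illustration only, not the ordering the spec defined at this point in the thread, and `total_order_key` is a hypothetical helper name.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order is a total order:
    -NaN < -Inf < ... < -0.0 < +0.0 < ... < +Inf < +NaN."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # For negative floats (sign bit set), flip the 63 magnitude bits so that
    # a larger magnitude sorts lower.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


pos_nan = math.copysign(math.nan, 1.0)
neg_nan = math.copysign(math.nan, -1.0)
values = [pos_nan, 1.0, -0.0, math.inf, 0.0, -math.inf, -1.0, neg_nan]
ordered = sorted(values, key=total_order_key)
print(ordered)
```

With such a key, min and max are well defined for every page, including pages that contain NaNs or both zeros.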






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737691#comment-17737691
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243836277


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > I've brought up boundary order because that was our original answer to the 
problems of these ordering issues.
   
   Hmm, how is it an answer? It only seems to be a redundant piece of 
information about `min_values` and `max_values`.
   
   > E.g. how should we order internationalized UTF-8 strings?
   
   Byte-wise (i.e. codeunit-wise) lexicographic ordering and character-wise 
(i.e. codepoint-wise) lexicographic ordering should give identical results 
AFAIR. They are also technically "natural".
   
   If a query system needs a more sophisticated ordering, then it should 
certainly synthesize its own index.
   
   I also don't understand what that has to do with the presence or absence of 
`boundary_order`?
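
The claim about UTF-8 can be checked with a short sketch. Python compares `str` values by code point and `bytes` values by unsigned byte, so sorting the same strings both ways should agree. The sample strings are arbitrary and chosen to include a character outside the Basic Multilingual Plane; the UTF-16 comparison is added only as a contrast.

```python
words = ["z", "é", "日本", "😀", "a", "Z", "\uffff"]

# Code-point-wise lexicographic order (Python's native str comparison).
by_code_point = sorted(words)

# Unsigned byte-wise lexicographic order of the UTF-8 encoding.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# For contrast: UTF-16 code-unit order can differ, because characters outside
# the BMP are encoded with surrogates (0xD800-0xDFFF) that sort below U+FFFF.
by_utf16_units = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)
print(by_code_point == by_utf16_units)
```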
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737684#comment-17737684
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243814135


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @JFinis, your idea sounds good, but unfortunately it is not that easy. Since 
no total ordering is specified, NaN values can sort before negative infinity or 
after positive infinity. An implementation that currently writes NaN values to 
column indexes will break in this scenario.
   @pitrou, I brought up boundary order because that was our original answer 
to these ordering issues. NaN values are not the only potential issue around 
ordering. E.g., how should we order internationalized UTF-8 strings?
   I agree that the current parquet-mr handling of NaN values in column indexes 
is not correct. But that also means we cannot make this change without breaking 
older parquet-mr readers. Boundary order would solve this from the parquet-mr 
point of view, but if it is not used by other implementations, it is not a good 
choice on its own either.
   
   If there are parquet files with column indexes containing NaN values and we 
consider them valid, then we need to fix this issue in parquet-mr, and that is 
unrelated to this format change. However, whether they are really valid is not 
an easy question. Are both min and max NaN? If not, what is the total ordering 
in the system that writes these files? Can this format change be compatible 
with that system?
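
Why files written without a specified total ordering are hard to interpret can be seen from a small sketch. It uses Python's built-in `min` and `max`, which rely on plain `<` and `>` comparisons, as a stand-in for a writer that does not special-case NaN; the result depends on where the NaN appears in the page.

```python
import math

nan = math.nan

# NaN first: every comparison against NaN is false, so NaN is never replaced.
min_nan_first = min([nan, 1.0, 2.0])
max_nan_first = max([nan, 1.0, 2.0])

# NaN later: NaN never compares lower or higher, so it is silently ignored.
min_nan_later = min([1.0, nan, 2.0])
max_nan_later = max([1.0, nan, 2.0])

print(min_nan_first, max_nan_first)  # both NaN
print(min_nan_later, max_nan_later)  # 1.0 and 2.0
```

The same set of values therefore yields either `[NaN, NaN]` or `[1.0, 2.0]` as bounds, which is one way such column indexes can end up inconsistent.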






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737658#comment-17737658
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243710422


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @pitrou That shouldn't be a problem. That's why this approach would require 
alternative (2) or (3). In these alternatives, nan_pages / value_counts would be 
used to find only-NaN pages. If these indicate that the page is only NaN, the 
min/max can be ignored and a reader can assume that the only values in the page 
are NaNs. Old readers that don't understand these new fields yet would simply 
treat the page as "maximum value range; cannot filter".






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737595#comment-17737595
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243576356


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > d) Stick with nan_pages (or value_counts) (i.e., alternatives (2) or (3)) 
and write min=-Infinity and max=+Infinity into the bounds in the column index 
for only-NaN pages.
   
   What if a page contains actual infinity values?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737568#comment-17737568
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243496427


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Oh, actually there is yet another option:
   
   d) Stick with nan_pages (or value_counts) (i.e., alternatives (2) or (3)) 
and write min=-Infinity and max=+Infinity into the bounds in the column index 
for only-NaN pages. This way, new readers could use nan_pages (or value_counts) 
to detect only-NaN pages. Legacy readers would simply never filter this page 
due to the maximally wide bounds. My heart is bleeding a bit while writing 
this, as this is obviously a patch solution that feels wrong (the bounds are 
just not correct) and only serves to retroactively patch old implementations by 
bending the spec, but it would fulfill the requirements and allow backward 
compatibility while enabling support for filtering only-NaN pages.
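On the writer side, option d) could look roughly like the sketch below. It is an illustration only: `page_index_entry` is a hypothetical helper, values are assumed to be Python floats with `None` standing in for null, and `None` bounds stand in for the `byte[0]` entries of a null page.

```python
import math
from typing import Iterable, Optional, Tuple

IndexEntry = Tuple[bool, bool, Optional[float], Optional[float]]


def page_index_entry(values: Iterable[Optional[float]]) -> IndexEntry:
    """Return (null_page, nan_page, min, max) for one page under option d)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True, False, None, None            # only nulls
    comparable = [v for v in non_null if not math.isnan(v)]
    if not comparable:
        return False, True, -math.inf, math.inf   # only NaNs: widest bounds
    return False, False, min(comparable), max(comparable)
```

Note that a page holding real `-inf` and `+inf` values produces the same bounds as an only-NaN page, so under this sketch only the `nan_page` flag tells the two apart.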






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737566#comment-17737566
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243489728


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   
   What is a good path forward then? I see the following options:
   
   a) Ship this change but exclude the handling of only-NaN pages in the column 
index and handle only the other cases. We could then at least specify how to 
handle NaNs in the column index when no only-NaN page exists, so those cases 
would be well defined. Only-NaN pages are probably an edge case, so this would 
already allow us to filter in 99% of all cases and get us almost to the goal.
   b) Add ColumnOrder to this proposal (again, happy to do that). 
   It would be a good occasion to start using the ColumnOrder union. This would 
also give us the opportunity to define `boundary_order` explicitly for this 
column order, so we could even assume an ordering.
   c) Drop this altogether and live with the fact that float / double columns 
are basically unfilterable in many cases.
   
   @gszadovszky 
   Side note: I think that the current read behavior in parquet-mr, as you 
describe it, does not adhere to the spec and is dangerous at best. I have seen 
Parquet files in the wild that have NaN in these bounds (I don't know who wrote 
them). Since the mandate not to write NaNs to these bounds has been in the spec 
only for a while ([introduced here](https://github.com/apache/parquet-format/commit/92ae9a3187d7673c9a40f81f40886faa20807722)), 
older writers would have been perfectly spec-conforming when writing NaN into 
these bounds. Files with NaNs here therefore adhere to (an older version of) 
the spec, and the parquet-mr read code should be robust enough to handle these 
cases.
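A reader that is robust against NaN bounds from older writers might decode them defensively, along these lines. This is a sketch, not parquet-mr code: `decode_double_bound` is a hypothetical name, and it assumes the bound of a DOUBLE column is stored as 8 little-endian IEEE 754 bytes (plain encoding).

```python
import math
import struct
from typing import Optional


def decode_double_bound(raw: bytes) -> Optional[float]:
    """Decode one min/max entry of a DOUBLE column.

    Returns None when the bound cannot be used for filtering, i.e. when it is
    missing (wrong length, such as byte[0]) or when it decodes to NaN.
    """
    if len(raw) != 8:
        return None
    (value,) = struct.unpack("<d", raw)
    return None if math.isnan(value) else value
```

A caller would treat `None` as "no usable bound" and keep the page instead of filtering it.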






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737561#comment-17737561
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243478036


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   > Since no total ordering is defined `boundary_order` shall not be either 
`ASCENDING` or `DESCENDING` if there is any NaN page.
   
   Hmm. I am not theoretically against this (as in: the underlying concern is 
reasonable), but I'm worried that some corners of the Parquet format are more 
and more becoming a smattering of special cases that implementations must be 
extra careful to implement correctly.
   
   That said, it should also be easy for an implementation to entirely ignore 
`boundary_order`, and instead detect any existing ordering from the 
`min_values` and `max_values` (this should be fast given that there is one 
value per page). It might even be useful to deprecate `boundary_order` and 
encourage implementations to derive the information themselves?
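Deriving the ordering from the bounds, as suggested above, could be sketched like this. The function name and the string results are illustrative assumptions; bounds are assumed to be decoded floats, null pages are skipped, and any NaN bound is treated as making the pages unordered.

```python
import math
from typing import Optional, Sequence


def derive_boundary_order(min_values: Sequence[Optional[float]],
                          max_values: Sequence[Optional[float]],
                          null_pages: Sequence[bool]) -> str:
    """Detect ASCENDING / DESCENDING / UNORDERED from per-page bounds."""
    # Keep only pages that carry real bounds.
    pairs = [(lo, hi)
             for lo, hi, is_null in zip(min_values, max_values, null_pages)
             if not is_null]
    # NaN has no place in a relational ordering, so give up on ordering.
    if any(math.isnan(lo) or math.isnan(hi) for lo, hi in pairs):
        return "UNORDERED"
    neighbours = list(zip(pairs, pairs[1:]))
    if all(a[0] <= b[0] and a[1] <= b[1] for a, b in neighbours):
        return "ASCENDING"
    if all(a[0] >= b[0] and a[1] >= b[1] for a, b in neighbours):
        return "DESCENDING"
    return "UNORDERED"
```

The scan is linear in the number of pages, which matches the point that this should be cheap with one value per page.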
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737550#comment-17737550
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243416912


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   It would also require rewriting a couple of parts of the spec to support a 
new `ColumnOrder`. Because we have not introduced any new `ColumnOrder` since 
the idea was introduced, it might require a vote as well. And we need to 
investigate whether the other implementations already read this value and 
handle a potentially "unknown" value there. parquet-mr does.
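Handling an "unknown" order on the read side could be as simple as the sketch below. It is purely illustrative: the column order is modelled as a plain string (an assumption made here; in the Thrift definition it is a union), and `can_use_min_max` is a hypothetical helper, not an API of any implementation.

```python
from typing import Optional

# Orders this hypothetical reader knows how to compare with.
KNOWN_COLUMN_ORDERS = frozenset({"TYPE_ORDER"})


def can_use_min_max(column_order: Optional[str]) -> bool:
    """Trust min/max bounds only when the column order is one we understand."""
    return column_order in KNOWN_COLUMN_ORDERS
```

With such a check, a reader that predates a newly added order would ignore the bounds for that column rather than misinterpret them.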






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737545#comment-17737545
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243411903


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   `1: TypeDefinedOrder TYPE_ORDER;`
   
   Hmm, maybe my wording was confusing. What I meant is: should a new order be 
defined here to indicate that?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737543#comment-17737543
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243409667


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   No, `ColumnOrder` is specified per column in the footer, and it is
   universally valid for any min/max statistics. See
   [here](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1057) for details.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comment on the null_pages member of ColumnIndex states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737525#comment-17737525
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243337924


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   So, maybe a new `ColumnOrder` enum would be added here?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737487#comment-17737487
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243234931


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @mapleFU, it seems to me that NaN is only checked for column indexes on the 
write path in parquet-mr. (In that case the column index is considered invalid 
and is not written to the file.) For the read path, though, there is no such 
check. This means that legacy readers can produce incorrect results when using 
FLOAT/DOUBLE column indexes once we start writing NaN values. (Sorry for the 
late conclusion; I had thought this check was implemented in both directions.)
   The only way I can think of to handle NaNs in a backward-compatible manner is to 
define a 
[ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863)
 for FP values that includes NaNs as well. In that case, we would also add support 
for row-group level statistics with NaNs. parquet-mr currently skips all kinds of 
min/max statistics for columns with unsupported column orders.
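
To make the read-path concern concrete, here is a minimal sketch (not parquet-mr code) of how a reader without a NaN check could behave when it sees a NaN bound. The function names are hypothetical and exist only for this illustration. Two formulations of the same pruning test for the predicate `x > v` agree on ordinary numbers but diverge once the page maximum is NaN, because every ordered comparison with NaN evaluates to false. Whether the resulting skip is an incorrect result depends on the NaN semantics of the engine, which is an assumption left open here.

```python
import math

def can_skip_gt_direct(page_max: float, v: float) -> bool:
    """Prune for `x > v` when the page maximum is at most v."""
    return page_max <= v

def can_skip_gt_negated(page_max: float, v: float) -> bool:
    """Logically equivalent for ordinary numbers: prune unless max exceeds v."""
    return not (page_max > v)

def can_skip_gt_nan_aware(page_min: float, page_max: float, v: float) -> bool:
    """Treat NaN bounds as 'no usable statistics' and never prune on them."""
    if math.isnan(page_min) or math.isnan(page_max):
        return False
    return page_max <= v

nan = float("nan")
print(can_skip_gt_direct(nan, 5.0))          # False: the page is read
print(can_skip_gt_negated(nan, 5.0))         # True: the page is pruned
print(can_skip_gt_nan_aware(nan, nan, 5.0))  # False: NaN bounds are ignored
```

The point of the sketch is only that a reader lacking an explicit NaN check has no guaranteed behaviour on NaN bounds; it depends on how its comparisons happen to be written.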






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737415#comment-17737415
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243057594


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   If so, I'm OK with (1), thanks!






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737289#comment-17737289
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242530201


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @mapleFU From just reading the spec, I don't think we should have a backward 
compatibility problem, as legacy readers are already compelled to ignore NaNs 
if they find them anywhere. Thus, a legacy reader would ignore the NaN it finds 
in the column index and just not filter that page.
   
   Also note that regardless of whether we do (1), (2), or (3) [we basically 
**have to** write NaN into min and 
max](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
 We have to write a valid value, and every value except NaN would simply be 
wrong if a page contains only NaNs. The approaches would differ only in what 
we write **in addition**, so to a legacy reader that wouldn't read any new 
fields, the three approaches would be equal.
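
The writer-side rule discussed here can be sketched as follows. This is an illustration under stated assumptions, not code from any Parquet implementation: `page_index_entry` is a hypothetical helper, Python `None` stands for a null value, and a `None` bound stands for the empty `byte[0]` placeholder that the spec prescribes for null pages. NaNs are left out of the bounds whenever the page has at least one non-NaN value, and NaN is written as both min and max only when every non-null value is NaN.

```python
import math
from typing import Optional, Sequence, Tuple

Bounds = Tuple[bool, Optional[float], Optional[float]]

def page_index_entry(values: Sequence[Optional[float]]) -> Bounds:
    """Return (null_page, min, max) for one page of a FLOAT/DOUBLE column."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages entry is true, bounds are placeholders.
        return True, None, None
    non_nan = [v for v in non_null if not math.isnan(v)]
    if not non_nan:
        # Only NaNs (possibly mixed with nulls): NaN is the only valid bound.
        return False, math.nan, math.nan
    # Ordinary page: NaNs are excluded from the bounds.
    return False, min(non_nan), max(non_nan)

print(page_index_entry([None, None]))                   # (True, None, None)
print(page_index_entry([1.0, float("nan"), -2.0]))      # (False, -2.0, 1.0)
print(page_index_entry([float("nan"), None]))           # (False, nan, nan)
```

Any additional marker for only-NaN pages would be written on top of these bounds, which is why a legacy reader sees the same min and max under all three approaches.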






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737283#comment-17737283
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242521314


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I'm OK with (1), and I guess Java and Rust implementors should check whether 
they prune pages via the page index without checking for NaN. @gszadovszky @pitrou do we need 
to:
   
   1. check the backward compatibility of pruning with NaN?
   2. or just first check that the parquet version is OK?
   3. or regard a reader that doesn't handle min/max NaN as buggy?
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737281#comment-17737281
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242503072


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Assume that a legacy reader uses the page index and `min == max == NaN`. Do we 
need to make sure that it will not prune the page? If not, (1) is OK for me, because 
it doesn't introduce any redundant data.
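
For reference, a reader-side sketch of the behaviour in question. It assumes the `nan_pages` list proposed in the hunk above and uses a hypothetical helper, `candidate_pages_eq`, that selects pages for an equality predicate `x = v` with a non-NaN `v`. When `nan_pages` is absent, a page whose bounds are NaN is kept rather than pruned, which is what a spec-conforming legacy reader that ignores NaN bounds would do. When `nan_pages` is present, only-NaN pages can be skipped.

```python
import math
from typing import List, Optional

def candidate_pages_eq(
    v: float,
    null_pages: List[bool],
    min_values: List[Optional[float]],
    max_values: List[Optional[float]],
    nan_pages: Optional[List[bool]] = None,  # proposed field, may be absent
) -> List[int]:
    """Return indices of pages that may contain a value equal to v (v not NaN)."""
    keep = []
    for i, is_null_page in enumerate(null_pages):
        if is_null_page:
            continue  # only nulls, cannot match
        if nan_pages is not None and nan_pages[i]:
            continue  # only NaNs (and nulls), cannot match a non-NaN v
        lo, hi = min_values[i], max_values[i]
        if math.isnan(lo) or math.isnan(hi):
            keep.append(i)  # NaN bounds carry no information: do not prune
        elif lo <= v <= hi:
            keep.append(i)
    return keep

nan = float("nan")
null_pages = [False, True, False, False]
mins = [nan, None, 1.0, 4.0]
maxs = [nan, None, 3.0, 9.0]
print(candidate_pages_eq(2.0, null_pages, mins, maxs))  # [0, 2]
print(candidate_pages_eq(2.0, null_pages, mins, maxs,
                         nan_pages=[True, False, False, False]))  # [2]
```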





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use s

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737276#comment-17737276
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242503072


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Assume a legacy reader uses the page index and sees `min == max == NaN`. Do we 
need to make sure that it will not prune the page? If not, (1) is OK for me, because 
it doesn't introduce any redundant data.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737277#comment-17737277
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242503072


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Assume a legacy reader uses the page index and sees `min == max == NaN`. Do we 
need to make sure that it will not prune the page? If not, (1) is OK for me, because 
it doesn't introduce any redundant data.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737261#comment-17737261
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242476403


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I'm okay with both (1) and (2), even though (2) sounds more generally useful.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause 

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737256#comment-17737256
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Thank you all for your input. It looks like we have two votes for (1) 
and one for (3). Given that (1) would mean even fewer fields (and therefore 
faster decoding/encoding), I guess it would also solve the possible problem of a 
performance degradation due to more fields to decode/encode.
   
   Given that the majority is for (1), I will draft an update showing how this 
would look. Basically:
   * Remove mentions of nan_pages
   * Add in comments that in the page index, all-NaN pages can be detected by 
checking nan_count > 0 && min is NaN && max is NaN
   * Add comments about boundary order, as mentioned by @gszadovszky 
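
   A minimal sketch of the check from the second bullet, assuming the bounds are stored as PLAIN-encoded DOUBLE values (8 bytes, little-endian). The helper names `decode_double` and `is_all_nan_page` are hypothetical and only illustrate the rule; they are not part of the spec or of any reader.

```python
import math
import struct


def decode_double(raw: bytes) -> float:
    """Decode an 8-byte little-endian IEEE 754 double."""
    (value,) = struct.unpack("<d", raw)
    return value


def is_all_nan_page(min_raw: bytes, max_raw: bytes, nan_count: int) -> bool:
    """nan_count > 0 && min is NaN && max is NaN."""
    return (
        nan_count > 0
        and math.isnan(decode_double(min_raw))
        and math.isnan(decode_double(max_raw))
    )


nan_bytes = struct.pack("<d", float("nan"))
one_bytes = struct.pack("<d", 1.0)
print(is_all_nan_page(nan_bytes, nan_bytes, nan_count=3))  # all-NaN page
print(is_all_nan_page(one_bytes, one_bytes, nan_count=0))  # ordinary page
```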
   
   I'll provide an update in the next few days.
   
   @mapleFU would this be okay with you? You mentioned you would also be okay 
with the others.
   @pitrou Would (1) be okay for you as well?
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737258#comment-17737258
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Thank you all for your input. It looks like we have two votes for (1) 
and one for (3). Given that (1) would mean even fewer fields (and therefore 
faster decoding/encoding), I guess it would also solve the possible problem of a 
performance degradation due to more fields to decode/encode.
   
   Given that the majority is for (1), I will draft an update showing how this 
would look. Basically:
   * Remove mentions of nan_pages
   * Add in comments that in the page index, all-NaN pages can be detected by 
checking nan_count > 0 && min is NaN && max is NaN
   * Add comments about boundary order, as mentioned by @gszadovszky 
   
   I'll provide an update in the next few days.
   
   @mapleFU would this be okay with you? You mentioned you would also be okay 
with the others.
   @pitrou Would (1) be okay for you as well?
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737255#comment-17737255
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Yes, the number of rows in the offset index isn't enough due to repeated values.
   
   Apart from this, the suggestions seem to be going in circles a bit now. Note 
that all suggestions in this thread were already mentioned in [my earlier post 
where I laid out our possible options for the column 
index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).
   
   @pitrou what you mentioned was my Option 2. I personally would prefer this 
as it feels like a useful thing to have anyway. Having said that, others 
rightly pointed out that it would cost a few bytes even for non-float 
columns. The field might be valuable for other tasks as well. For example, it 
could be used to quickly check how many nested values are in a page. With 
these values, one could compute the number of nested values per column chunk by 
adding up all the value counts. This number currently cannot be obtained at all 
through statistics; instead one has to decode pages and count. For example, the 
SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered 
with such a value_counts field.
   
   @wgtmac your proposal was my Option 1 and actually my initial proposal (see 
previous commit). Note that you 
[earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450)
 actually were against writing NaNs and rather preferred the nan_pages approach:
   
   > Personally speaking, apart from adding a nan_count to the statistics, I 
would go with the option 3: adding a nan_pages bool list to the column index. I 
am not in favor of writing any NaN to min/max bounds.
   
   Is your argument that, if we now need to write the NaNs anyway, we 
should just use them instead of adding nan_pages? I do agree that 
this would save the extra field and I personally see nothing wrong with doing 
this. Readers need to be able to detect NaN values anyway (to ignore them), so 
readers should be able to use the same logic to determine min=max=NaN <=> all 
values are NaN.
   
   As mentioned in my previous post where I compared the three approaches, I am 
happy to implement any of them and I think all of them will fulfill the 
requirements. Personally, I actually like the current approach with 
nan_pages the least, as it seems redundant if we have to write NaN 
values anyway, and I see no problem in using NaN values for the "all values 
NaN" check.
   
   I also like the option of adding a value_counts field to the column index of 
all columns. It feels like a useful and missing field (one that is not subsumed 
by offset index row counts for nested columns), and I would love to add it as 
well. I feel the few extra bytes will be so negligible compared to the actual 
data that no one will ever care. It would also enable us to do the "all values 
NaN" check the same way in page statistics and in the column index.
   
   So we're back at the three options I proposed:
   
   1. Drop nan_pages and use my initial approach of "min=max=NaN && nan_counts > 0 
<=> all values are NaN" in the column index
   2. Drop nan_pages and instead add value_counts so we can use 
value_counts-null_counts==nan_counts to determine whether all values are NaN. 
(My personal favorite)
   3. Retain the current state and use `nan_pages`
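
   To make Option 2 concrete, here is a minimal sketch that works on per-page lists. The field names `value_counts` and `nan_counts` are the ones proposed in this thread (they are not in the current spec), `all_nan_pages` is a hypothetical helper, and the extra `nans > 0` guard is my assumption to keep all-null pages from matching.

```python
def all_nan_pages(value_counts, null_counts, nan_counts):
    """Per page: True if every non-null value is NaN (and there is at least one)."""
    return [
        nans > 0 and values - nulls == nans
        for values, nulls, nans in zip(value_counts, null_counts, nan_counts)
    ]


# Three hypothetical pages: all-NaN with nulls, mixed, all-null.
value_counts = [10, 8, 5]
null_counts = [2, 0, 5]
nan_counts = [8, 3, 0]

print(all_nan_pages(value_counts, null_counts, nan_counts))  # [True, False, False]
# The same list also answers "how many values are in the column chunk":
print(sum(value_counts))  # 23
```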
   
   @wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? 
I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't 
mentioned your favorite yet (you just pointed out that we have to write some 
valid value).
   
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737254#comment-17737254
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Thank you all for your feedback. It looks like we have two votes for (1) 
and one for (3). Given that (1) would mean even fewer fields (and therefore 
faster decoding/encoding), I guess it would also solve the possible problem of a 
performance degradation due to this.
   
   Given that the majority is for (1), I will draft an update showing what this 
would look like. Basically:
   * Remove mentions of nan_pages
   * Add in comments that in the page index, all-NaN pages can be detected by 
checking nan_count > 0 && min is NaN && max is NaN
   * Add comments about boundary order, as mentioned by @gszadovszky 
   
   I'll provide an update in the next few days.
   
   @mapleFU would this be okay with you? You mentioned you would also be okay 
with the others.
   @pitrou Would (1) be okay for you as well?
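The check listed above (nan_count > 0 && min is NaN && max is NaN) could look roughly like this on the reader side. This is only a sketch: it assumes DOUBLE bounds are stored as 8-byte little-endian IEEE 754 values, and the helper names are hypothetical.

```python
import math
import struct


def is_nan_bound(raw: bytes) -> bool:
    """True if a raw DOUBLE bound decodes to NaN (assumes 8-byte little-endian)."""
    return len(raw) == 8 and math.isnan(struct.unpack("<d", raw)[0])


def only_nan_pages(null_pages, min_values, max_values, nan_counts):
    """Per-page flag: all non-null values in the page are NaN (option 1)."""
    return [
        (not is_null) and nans > 0 and is_nan_bound(lo) and is_nan_bound(hi)
        for is_null, lo, hi, nans in zip(null_pages, min_values, max_values, nan_counts)
    ]
```

Null pages keep their empty `byte[0]` bounds and are never reported as only-NaN pages, because `null_pages` is checked first.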
   





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine on

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736186#comment-17736186
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1238741604


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   +1 for (1) as I have explained in this comment: 
https://github.com/apache/parquet-format/pull/196#discussion_r1231982021





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be inclu

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735891#comment-17735891
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237651846


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   I would vote for (1) because it would not store redundant data. I think 
`nan_pages` is not necessary. Meanwhile, we have to take care of the 
[boundary_order](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L976)
 as well. Since no total ordering is defined, `boundary_order` must be neither 
`ASCENDING` nor `DESCENDING` if there is any NaN page.
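A rough sketch of the writer-side consequence for `boundary_order`, assuming the bounds of the non-null pages have already been decoded to floats. The function name and the string return values are illustrative only; they stand in for the `BoundaryOrder` enum.

```python
import math


def boundary_order(mins, maxs):
    """Pick a boundary order for the decoded bounds of all non-null pages."""
    # Any NaN bound means there is an only-NaN page; without a total
    # ordering, neither ASCENDING nor DESCENDING can be claimed.
    if any(math.isnan(v) for v in mins + maxs):
        return "UNORDERED"

    def non_decreasing(xs):
        return all(a <= b for a, b in zip(xs, xs[1:]))

    def non_increasing(xs):
        return all(a >= b for a, b in zip(xs, xs[1:]))

    if non_decreasing(mins) and non_decreasing(maxs):
        return "ASCENDING"
    if non_increasing(mins) and non_increasing(maxs):
        return "DESCENDING"
    return "UNORDERED"
```

The explicit NaN check matters because comparisons with NaN are always false, so a naive ordering test could otherwise give a misleading answer.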






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735855#comment-17735855
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Yes, number of rows in the offset index isn't enough due to repeated values.
   
   Apart from this, the suggestions seem to be going in circles a bit now. Note 
that all suggestions in this thread were already mentioned in [my earlier post 
where I depicted our possible options for the column 
index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).
   
   @pitrou what you mentioned was my Option 2. I personally would prefer this 
as it feels like a useful thing to have anyway. Having said that, others 
rightly pointed out that it would cost a few bytes even for non-float 
columns. The field might be valuable for other tasks as well. For example, it 
could be used to quickly check how many nested values are in a page. With 
these values, one could sum up the nested values per column chunk by adding up 
all the value counts. This is currently a value that cannot be obtained at all 
through statistics; instead one has to decode pages and count. For example, the 
SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered 
with such a value_counts field.
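As a sketch of that use case, assuming the proposed per-page `value_counts` lists (not part of the spec) had already been read from the column index of every column chunk of the nested column; the function name is hypothetical.

```python
def count_nested_values(value_counts_per_chunk):
    """Total number of nested values across all column chunks.

    value_counts_per_chunk holds one list per column chunk, each containing
    the proposed per-page value counts.
    """
    return sum(sum(page_counts) for page_counts in value_counts_per_chunk)
```

No page would need to be decoded for this, which is the difference from the current situation described above.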
   
   @wgtmac your proposal was my Option 1 and actually my initial proposal (see 
previous commit). Note that you 
[earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450)
 actually were against writing NaNs and rather preferred the nan_pages approach:
   
   > Personally speaking, apart from adding a nan_count to the statistics, I 
would go with the option 3: adding a nan_pages bool list to the column index. I 
am not in favor of writing any NaN to min/max bounds.
   
   Is your argument that if we now need to write the NaNs anyway, we 
should in this case just use them instead of adding nan_pages? I do agree that 
this would save the extra field and I personally see nothing wrong in doing 
this. Readers need to be able to detect NaN values anyway (to ignore them), so 
readers should be able to use the same logic to determine min=max=NaN <=> all 
values are NaN.
   
   As mentioned in my previous post where I compared the three approaches, I am 
happy to implement any of them and I think all of them will fulfill the 
requirements. In my personal opinion, I like the current approach with 
nan_pages actually the least, as it seems redundant if we have to write NaN 
values anyway and I see no problem in using NaN values for the "all values NaN 
check".
   
   I also like the option of adding a value_counts field to the column index of 
all columns. It feels like a useful and missing field (that is not subsumed by 
offset index row counts for nested columns) and I would love to add it as well 
and I feel the few extra bytes will be so negligible in contrast to the actual 
data that no-one will ever care. Also it would enable us to do the check for 
all values NaN the same way in page statistics and in the column index.
   
   So we're back at the three options I proposed:
   
   1. Drop nan_pages and use my initial approach of "min=max=NaN <=> all values 
are NaN" in the column index
   2. Drop nan_pages and instead add value_counts so we can use 
value_counts-null_counts==nan_counts to determine whether all values are NaN. 
(My personal favorite)
   3. Retain the current state and use `nan_pages`
   
   @wgtmac @mapleFU @gszadovszky @pitrou  could we arrive at a consensus here? 
I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't 
mentioned your favorite, yet (you just pointed out that we have to write some 
valid value).
   
   





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735847#comment-17735847
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237392250


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   Maybe I misunderstood the word "except"; it seems to mean that "min-max" 
should be taken into account. I have no further questions about that now.
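The reader-side rules from the diff above (nan_count + null_count == num_values, and using nan_count to check for NaNs) can be sketched as follows. The function names are hypothetical, and `None` stands for a statistics field that the writer did not set.

```python
def all_non_null_values_are_nan(num_values, null_count, nan_count):
    """True/False when the counts allow a decision, None when they are unset."""
    if null_count is None or nan_count is None:
        return None
    return nan_count + null_count == num_values


def may_contain_nan(nan_count):
    """min/max say nothing about NaNs, so without nan_count assume they may exist."""
    if nan_count is None:
        return True
    return nan_count > 0
```

A reader that gets `None` back would fall back to the older behaviour of not using the statistics to reason about NaNs.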






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735846#comment-17735846
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237388278


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Personally I like (3), because I think parquet-format changes so slowly 
that a newly added `value_count` or similar field will not be used for a long 
time. But the other options seem OK to me; maybe I can write a benchmark to 
check whether these extra bytes make the PageIndex larger and slower to decode.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735841#comment-17735841
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237384626


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   I don't fully understand your question.
   
   We have to write nan_pages and nan_counts *and* we also have to write NaN 
values to the actual min and max in the column index, as we have to write a 
valid double value to the bounds and NaN is the only correct double value in 
case all values are NaN, as pointed out by @gszadovszky 
[here](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
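A minimal writer-side sketch of what this means for a single page of a FLOAT/DOUBLE column, assuming the page values are available as Python floats with `None` for nulls. The function name and the tuple layout are illustrative, not part of any Parquet implementation.

```python
import math


def page_bounds(values):
    """Return (min, max, nan_count) for the column index entry of one page."""
    non_null = [v for v in values if v is not None]
    nan_count = sum(1 for v in non_null if math.isnan(v))
    ordered = [v for v in non_null if not math.isnan(v)]

    if not non_null:
        # Null page: null_pages is true and the bounds are written as byte[0].
        return None, None, 0
    if not ordered:
        # Only NaNs (and maybe nulls): NaN is the only valid value to write.
        return float("nan"), float("nan"), nan_count
    # Regular page: NaNs are excluded from the bounds but still counted.
    return min(ordered), max(ordered), nan_count
```

A reader following the compatibility rules would ignore the NaN bounds of such a page, which is why legacy readers are not broken by this.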






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735843#comment-17735843
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237384626


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   I don't fully understand your question.
   
   We have to write nan_pages and nan_counts ***and*** we also have to write 
NaN values to the actual min and max in the column index, as we have to write a 
valid double value to the bounds and NaN is the only correct double value in 
case all values are NaN, as pointed out by @gszadovszky 
[here](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
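
The writer-side rule described in this comment can be sketched as follows. This is a minimal illustration and not code from the PR: `column_index_entry` is a hypothetical helper, pages are modelled as Python lists with `None` for nulls, and the returned tuple only mirrors the `null_pages`, `min_values`, `max_values` and `nan_counts` entries for one page.

```python
import math
from typing import Optional, Sequence, Tuple

# Hypothetical helper for illustration; not an API of any Parquet library.
def column_index_entry(
    page: Sequence[Optional[float]],
) -> Tuple[bool, Optional[float], Optional[float], int]:
    """Return (null_page, min, max, nan_count) for one page of doubles."""
    non_null = [v for v in page if v is not None]
    nan_count = sum(1 for v in non_null if math.isnan(v))
    ordered = [v for v in non_null if not math.isnan(v)]

    if not non_null:
        # Only nulls: null_pages is true and the bounds stay empty (byte[0]).
        return True, None, None, 0
    if not ordered:
        # Only NaNs (plus possibly nulls): NaN is the only valid double bound.
        return False, math.nan, math.nan, nan_count
    # Mixed page: NaNs are counted but kept out of min and max.
    return False, min(ordered), max(ordered), nan_count
```

Under these assumptions, only the all-NaN branch ever writes NaN into a bound; every other page keeps NaN out of min and max.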






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735840#comment-17735840
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237383257


##
README.md:
##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
   signed zeros.   The details are documented in the
   [Thrift definition](src/main/thrift/parquet.thrift) in the
-  `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
   Indeed, as was [requested in this issue](https://github.com/apache/parquet-format/pull/196#discussion_r1151335207). 
I do agree that not duplicating it is better.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
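
Solution 3 depends on readers recognising a NaN bound in the raw `min_values` and `max_values` entries. A minimal sketch, assuming a DOUBLE column whose bounds are stored as 8-byte little-endian IEEE 754 values; `decode_double` and `bounds_are_nan` are hypothetical names used only for illustration:

```python
import math
import struct

# Hypothetical helpers; bounds are assumed to be 8-byte little-endian doubles.
def decode_double(raw: bytes) -> float:
    """Decode one bound of a DOUBLE column from its raw bytes."""
    return struct.unpack("<d", raw)[0]

def bounds_are_nan(min_raw: bytes, max_raw: bytes) -> bool:
    """True if both bounds decode to NaN, i.e. min=max=NaN under solution 3."""
    if len(min_raw) != 8 or len(max_raw) != 8:
        # Empty bounds (byte[0]) belong to null pages, not to only-NaN pages.
        return False
    return math.isnan(decode_double(min_raw)) and math.isnan(decode_double(max_raw))
```

The check uses `math.isnan` rather than an equality comparison, since NaN never compares equal to itself.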



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735837#comment-17735837
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381809


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   I'll update this with my next revision once we have [decided on this 
issue](https://github.com/apache/parquet-format/pull/196#discussion_r1237381221).






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735838#comment-17735838
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237382160


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except

Review Comment:
   I'll update this with my next revision once we have [decided on this 
issue](https://github.com/apache/parquet-format/pull/196#discussion_r1237381221).






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735836#comment-17735836
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237381221


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Yes, number of rows in the offset index isn't enough due to repeated values.
   
   Apart from this, the suggestions seem to be going in circles a bit now. Note 
that all suggestions in this thread were already mentioned in [my earlier post 
where I laid out our possible options for the column index](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762).
   
   @pitrou what you mentioned was my Option 2. I personally would prefer this 
as it feels like a useful thing to have anyway. Having said that, others 
rightfully pointed out that it would cost a few bytes even for non-float 
columns. The value might be useful for other tasks as well. For example, it 
could be used to quickly check how many nested values are in a page. With 
these values, one could sum up the nested values per column chunk by adding up 
all the value counts. This is currently a value that cannot be obtained at all 
through statistics; instead one has to decode pages and count. For example, the 
SQL query `SELECT count(*) FROM some_nested_column;` could be fully answered 
with such a value_counts field.
   
   @wgtmac your proposal was my Option 1 and actually my initial proposal (see 
previous commit). Note that you 
[earlier](https://github.com/apache/parquet-format/pull/196#pullrequestreview-1362171450) 
were actually against writing NaNs and preferred the nan_pages approach:
   
   > Personally speaking, apart from adding a nan_count to the statistics, I 
would go with the option 3: adding a nan_pages bool list to the column index. I 
am not in favor of writing any NaN to min/max bounds.
   
   Is your argument that, if we now need to write the NaNs anyway, we 
should in this case just use them instead of adding nan_pages? I do agree that 
this would save the extra field and I personally see nothing wrong in doing 
this. Readers need to be able to detect NaN values anyway (to ignore them), so 
readers should be able to use the same logic to determine min=max=NaN <=> all 
values are NaN.
   
   As mentioned in my previous post where I compared the three approaches, I am 
happy to implement any of them and I think all of them will fulfill the 
requirements. In my personal opinion, I actually like the current approach with 
nan_pages the least, as it seems redundant if we have to write NaN 
values anyway and I see no problem in using NaN values for the "all values NaN" 
check.
   
   I also like the option of adding a value_counts field to the column index of 
all columns. It feels like a useful and missing field (that is not subsumed by 
offset index row counts for nested columns) and I would love to add it as well 
and I feel the few extra bytes will be so negligible in contrast to the actual 
data that no-one will ever care. Also it would enable us to do the check for 
all values NaN the same way in page statistics and in the column index.
   
   So we're back at the three options I proposed:
   
   1. Drop nan_pages and use my initial approach of "min=max=NaN <=> all values 
are NaN" in the column index
   2. Drop nan_pages and instead add value_counts so we can use 
value_counts-null_counts==nan_counts to determine whether all non-null values 
are NaN. (My personal favorite)
   3. Retain the current state and use `nan_pages`
   
   @wgtmac @mapleFU @gszadovszky could we arrive at a consensus here? I'm happy 
to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned 
your favorite, yet (you just pointed out that we have to write some valid 
value).
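
For comparison, Option 2 can be sketched as below. This is only an illustration under stated assumptions: `value_counts` is the proposed field and does not exist in the current spec, the three lists are assumed to be per-page and of equal length, and the `non_null > 0` guard (so that all-null pages are not reported as only-NaN) is an addition made here rather than something stated in the thread.

```python
from typing import List, Sequence

# Hypothetical sketch of Option 2; `value_counts` is a proposed field only.
def only_nan_pages(
    value_counts: Sequence[int],
    null_counts: Sequence[int],
    nan_counts: Sequence[int],
) -> List[bool]:
    """Per page: are all non-null values NaN?"""
    result = []
    for values, nulls, nans in zip(value_counts, null_counts, nan_counts):
        non_null = values - nulls
        # value_counts - null_counts == nan_counts, excluding all-null pages.
        result.append(non_null > 0 and nans == non_null)
    return result

def total_values(value_counts: Sequence[int]) -> int:
    """Number of (nested) values in the column chunk, without decoding pages."""
    return sum(value_counts)
```

The second helper corresponds to the `SELECT count(*)` use case mentioned above: summing the per-page counts gives the value count for the column chunk.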
   
   






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733409#comment-17733409
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1231982021


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   As `nan_counts` will be set only after this proposal, could we simply deduce 
a NaN page by checking `null_pages[i] == false && nan_counts[i] > 0 && 
min_values[i] == NaN && max_values[i] == NaN`? If that is true, we can safely 
remove the definition of the `nan_pages` list.
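
A minimal sketch of that check, assuming the column index entries have already been decoded into Python values (bounds as `float`, or `None` for null pages). `is_only_nan_page` is a hypothetical name. The comparison against NaN is written with `math.isnan`, because under IEEE 754 `x == NaN` is false for every `x`, including NaN itself.

```python
import math
from typing import Optional, Sequence

# Hypothetical helper; inputs are assumed to be already-decoded column index lists.
def is_only_nan_page(
    i: int,
    null_pages: Sequence[bool],
    nan_counts: Sequence[int],
    min_values: Sequence[Optional[float]],
    max_values: Sequence[Optional[float]],
) -> bool:
    """True if page i has no non-null values other than NaN."""
    if null_pages[i]:
        # Null pages carry empty bounds, so there is nothing to inspect.
        return False
    return (
        nan_counts[i] > 0
        and math.isnan(min_values[i])
        and math.isnan(max_values[i])
    )
```

The early return on `null_pages[i]` also keeps `math.isnan` from being called on an empty bound.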



##
README.md:
##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
   signed zeros.   The details are documented in the
   [Thrift definition](src/main/thrift/parquet.thrift) in the
-  `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
   Yes, this looks reasonable.



##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
*   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
* 
* When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except

Review Comment:
   I would expect this to explicitly state that `NaN values should not be written to 
min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and 
ColumnMetaData. But it is suggested to write NaN to min_values and max_values 
fields in the ColumnIndex where a value has to be written in case of an only-NaN 
page`.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732617#comment-17732617
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229883979


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Personally I think `optional list value_counts` is more general, but 
nulls are already covered by `null_counts`, and `value_counts` might consume 
more bytes for every leaf column.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment on the null_pages member of ColumnIndex states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* contain only null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., 

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732614#comment-17732614
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229881617


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
    * - If the min is a NaN, it should be ignored.
    * - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
    * - If the min is +0, the row group may contain -0 values as well.
    * - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
    *
    * When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   ```suggestion
  *   in the column index, where a value has to be written in case of
   ```
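
For illustration, the reader-side compatibility rules quoted in the hunk above could be sketched roughly as follows. This is only a sketch in Python, not code from the PR or from any Parquet implementation; the function names `usable_bounds` and `may_contain_nan` are made up for this example, and the bounds are assumed to be already decoded into Python floats.

```python
import math
from typing import Optional, Tuple


def usable_bounds(
    min_v: Optional[float], max_v: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Apply the reader rules to decoded min/max statistics.

    Hypothetical helper: None means "no usable bound".
    """
    # A NaN bound carries no ordering information and is ignored.
    if min_v is not None and math.isnan(min_v):
        min_v = None
    if max_v is not None and math.isnan(max_v):
        max_v = None
    # A min of +0 may hide -0 values, so widen it to -0.
    if min_v == 0.0:
        min_v = -0.0
    # A max of -0 may hide +0 values, so widen it to +0.
    if max_v == 0.0:
        max_v = 0.0
    return min_v, max_v


def may_contain_nan(nan_count: Optional[int]) -> bool:
    """min/max never answer this question; only nan_count can rule NaNs out."""
    if nan_count is None:
        return True  # unknown, so NaNs cannot be excluded
    return nan_count > 0
```

Without `nan_count`, a reader searching for NaN values has to assume they may be present, which is why the sketch returns `True` in that case.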






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732613#comment-17732613
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229880276


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   That said, if we do need an additional list (because of repeated columns?), 
it might be more worthwhile to add an `optional list value_counts` 
instead, as it would then benefit all column types.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732610#comment-17732610
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

pitrou commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1229878920


##
src/main/thrift/parquet.thrift:
##
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Is this necessary? We already know:
   * the NaN count for each page (in `nan_counts`)
   * the null count for each page (in `null_counts`)
   * the number of rows for each page (from the OffsetIndex)
   
   It seems this might be enough to infer whether a page is all-NaN (except 
perhaps if there are repetition levels?).
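
As a rough illustration of the inference suggested above, the following Python sketch derives per-page row counts from the `first_row_index` values of the OffsetIndex and combines them with `null_counts` and `nan_counts`. The function names are hypothetical and nothing here is taken from an existing Parquet library. It also assumes a flat (non-repeated) column, where the number of rows in a page equals the number of values; with repetition levels that assumption no longer holds, which is the caveat raised in the comment.

```python
from typing import List, Optional


def rows_per_page(first_row_indexes: List[int], row_group_num_rows: int) -> List[int]:
    """Row count of each page, from the first_row_index of every page location."""
    bounds = list(first_row_indexes) + [row_group_num_rows]
    return [bounds[i + 1] - bounds[i] for i in range(len(first_row_indexes))]


def page_is_all_nan(
    num_values: int, null_count: int, nan_count: Optional[int]
) -> Optional[bool]:
    """True if every non-null value is NaN, None if nan_count is not available.

    Assumes num_values equals the page's row count (flat column only).
    """
    if nan_count is None:
        return None
    non_null = num_values - null_count
    # An all-null page is not an all-NaN page.
    return non_null > 0 and nan_count == non_null
```

For a page with 100 rows, 20 nulls, and 80 NaNs, `page_is_all_nan(100, 20, 80)` is true; a page of 100 nulls yields false, matching the `nan_pages` semantics in the hunk.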






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731626#comment-17731626
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1226712152


##
src/main/thrift/parquet.thrift:
##
@@ -886,16 +891,25 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   * point values (relations vs. total ordering) the following
-   * compatibility rules should be applied when reading statistics:
+   * point values (relations vs. total ordering), the following compatibility
+   * rules should be applied when reading statistics:
    * - If the min is a NaN, it should be ignored.
    * - If the max is a NaN, it should be ignored.
+   * - If the nan_count field is set, a reader can compute
+   *   nan_count + null_count == num_values to deduce whether all non-NULL
+   *   values are NaN.
+   * - When looking for NaN values, min and max should be ignored.
+   *   If the nan_count field is set, it can be used to check whether
+   *   NaNs are present.
    * - If the min is +0, the row group may contain -0 values as well.
    * - If the max is -0, the row group may contain +0 values as well.
-   * - When looking for NaN values, min and max should be ignored.
    *
    * When writing statistics the following rules should be followed:
-   * - NaNs should not be written to min or max statistics fields.
+   * - It is suggested to always set the nan_count fields for FLOAT and
+   DOUBLE columns.
+   * - NaNs should not be written to min or max statistics fields except
+   *   in the column index, where a value has to be written incase of

Review Comment:
   ```
   NaNs should not be written to min or max statistics fields except
   in the column index, where a value has to be written incase of
   ```
   
   Does this mean `nan_pages` and `nan_count` in this patch?



##
README.md:
##
@@ -161,21 +161,7 @@ following rules:
 * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
   signed zeros.   The details are documented in the
   [Thrift definition](src/main/thrift/parquet.thrift) in the
-  `ColumnOrder` union. They are summarized here but the Thrift definition

Review Comment:
   So this part is removed and consolidated into `parquet.thrift`?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731560#comment-17731560
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1587083232

   I finally have time to continue on this. Sorry for the long wait.
   
   As @gszadovszky has highlighted, we have to store a valid double/float value 
into the min/max bounds of the column index to be compatible with legacy 
readers. So the initial proposal to write NaN into min/max in this case would 
actually work.
   
   But so far not everyone was happy with readers using these NaNs to detect 
an only-NaN page. Therefore, the suggestion was to also add 
`nan_pages` to the column index (favored by @wgtmac and @mapleFU). I have 
updated the PR to this suggestion: we would still write NaNs into min/max in 
the column index if a page has only NaNs, but advise readers not to use these 
values (as readers are already advised today) and instead to use only `nan_pages` 
to check for only-NaN pages. This way, we don't need to worry about the 
semantics of NaN comparisons, and readers can continue to ignore all NaN values 
they find in bounds.
   
   I have not updated the PR description yet to reflect this new design; only 
the files themselves have been updated. @wgtmac @mapleFU @gszadovszky Please 
review and let me know if you agree with this design. Then I will update the PR 
description accordingly.
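
To make the proposed design concrete, here is a small writer-side sketch in Python. It is an illustration under stated assumptions rather than the PR's implementation: the function name `column_index_entry` is invented for this example, the column is assumed to be DOUBLE with bounds stored as 8-byte little-endian IEEE 754 values, and the signed-zero rules are left out for brevity.

```python
import math
import struct
from typing import List, Optional, Tuple


def column_index_entry(
    values: List[Optional[float]],
) -> Tuple[bool, bool, bytes, bytes]:
    """Return (null_page, nan_page, min_bytes, max_bytes) for one DOUBLE page."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls: null_pages is true and the bounds are empty byte strings.
        return True, False, b"", b""
    non_nan = [v for v in non_null if not math.isnan(v)]
    if not non_nan:
        # Only NaNs (and possibly nulls): write NaN so the bounds stay valid
        # doubles for legacy readers, and flag the page in nan_pages.
        nan = struct.pack("<d", math.nan)
        return False, True, nan, nan
    # Regular page: NaNs are excluded from the bounds.
    return False, False, struct.pack("<d", min(non_nan)), struct.pack("<d", max(non_nan))
```

A reader that knows about `nan_pages` would consult only that flag to detect only-NaN pages, while a legacy reader would see NaN bounds and ignore them as it does today.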
   





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-06-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731558#comment-17731558
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1226466358


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading statistics:

Review Comment:
   I have removed the duplicate explanation.



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,8 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */

Review Comment:
   Done.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solu

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-05-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721586#comment-17721586
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1543226234

   @JFinis Do you have a plan to revive this?




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
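
The contradiction described in the issue can be reproduced with a few lines of Python. This is only an illustrative sketch: `page_min_max` and `is_null_page` are hypothetical helper names, not part of any Parquet library, and a page is modelled as a plain list in which `None` stands for null.

```python
import math
from typing import Optional, Sequence, Tuple


def page_min_max(values: Sequence[Optional[float]]) -> Optional[Tuple[float, float]]:
    """Return (min, max) over the non-null, non-NaN values, or None if there are none.

    Hypothetical helper: it follows the quoted rule that NaNs should not be
    written to min or max statistics fields.
    """
    usable = [v for v in values if v is not None and not math.isnan(v)]
    if not usable:
        return None
    return min(usable), max(usable)


def is_null_page(values: Sequence[Optional[float]]) -> bool:
    """A page counts as a null page only if every value is null."""
    return all(v is None for v in values)


only_nan_page = [float("nan"), None, float("nan")]

bounds = page_min_max(only_nan_page)      # no usable value, so no bound exists
null_page = is_null_page(only_nan_page)   # the page does contain non-null values

# null_pages must be false, which requires valid bounds, but none can be produced.
conflict = (not null_page) and bounds is None
```

For the only-NaN page, `conflict` is true: the writer has to set `null_pages` to false but has no value it is allowed to store in `min_values` and `max_values`.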



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707253#comment-17707253
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

gszadovszky commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773

   Thank you, @JFinis, for working on this. This is not an easy topic.
   I am afraid we cannot avoid encoding NaN values into the column index min/max 
lists for the sake of backward compatibility: there is no such thing as a 
"missing value" in the lists. We encode actual primitive values, and we need to 
store something there for each page. That's why we have `null_pages`: to 
indicate whether the values encoded for the corresponding page are valid or not. 
The only way I can think of to stay backward compatible is to store NaN values in 
min/max; otherwise we would confuse older readers. 
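
To make the reader side of this concrete, here is a minimal sketch of how a reader might consult `null_pages` before using a bound, and fall back to scanning when the bound is NaN. It assumes DOUBLE bounds are stored as 8-byte little-endian IEEE 754 values; `page_may_match_gt` is a hypothetical function name and not an API of any Parquet implementation.

```python
import math
import struct
from typing import List


def decode_double(raw: bytes) -> float:
    """Decode an 8-byte little-endian IEEE 754 double (assumed bound encoding)."""
    return struct.unpack("<d", raw)[0]


def page_may_match_gt(null_pages: List[bool], max_values: List[bytes],
                      page: int, threshold: float) -> bool:
    """Could page `page` contain a value greater than `threshold`?"""
    if null_pages[page]:
        # Only nulls: the bound entry is byte[0] and must not be decoded.
        return False
    max_value = decode_double(max_values[page])
    if math.isnan(max_value):
        # A NaN bound carries no ordering information, so the page must be scanned.
        return True
    return max_value > threshold


nan_bytes = struct.pack("<d", float("nan"))
null_pages = [True, False, False]
max_values = [b"", struct.pack("<d", 5.0), nan_bytes]
```

With these lists, page 0 is skipped without decoding anything, page 1 is pruned for thresholds of 5.0 or more, and page 2 is always scanned because its bound is NaN.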






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706189#comment-17706189
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1151335207


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:

Review Comment:
   Not relevant to this PR: it is weird that we have duplicated the explanation 
here. It would be better to consolidate this by referring to the thrift only. 
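
For reference, the writer-side rules listed in the removed README lines above (skip NaNs, write `-0.0` for a zero min and `+0.0` for a zero max) can be sketched as follows. `stats_bounds` is a hypothetical name used only for illustration.

```python
import math
from typing import Optional, Sequence, Tuple


def stats_bounds(values: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) for statistics under the quoted writer rules."""
    usable = [v for v in values if not math.isnan(v)]  # NaNs are never written
    if not usable:
        return None  # nothing that may be written as a bound
    lo, hi = min(usable), max(usable)
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi


zero_bounds = stats_bounds([0.0, float("nan")])
```

Because `-0.0 == 0.0` in floating-point comparison, the sign has to be inspected with something like `math.copysign` to see the effect of the normalization.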



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,8 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** count of NaN values in the column; only present if type is FLOAT or 
DOUBLE */

Review Comment:
   ```suggestion
  /** count of NaN values in the column; only present if physical type is 
FLOAT or DOUBLE */
   ```






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705930#comment-17705930
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150378596


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   Yes, maybe you are right. My point is that, if we write nan_count or even 
record count, the program would work well. However, non-floating-point pages would 
have some size overhead. Personally, I'd like to use `list`, because it's 
easy to implement, and also lightweight. We can also hear others' ideas.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705907#comment-17705907
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150299691


##
src/main/thrift/parquet.thrift:
##
@@ -952,6 +961,9 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated 
before
* using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   I would say let's discuss this once we have settled that we do want to have NaN 
values. If we go with one of the other [alternatives outlined 
here](https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762),
 we don't need to discuss it. 
   
   (We can mandate a specific bit pattern or allow any NaN. I guess both would 
be okay (note that we also don't mandate a specific bit pattern for values in a 
column). But I'd say let's postpone the discussion)
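
As background for the bit-pattern question: IEEE 754 single precision treats every value with an all-ones exponent and a non-zero mantissa as NaN, so there are many distinct encodings. A small sketch (the helper name `is_nan_f32_bits` and the chosen patterns are illustrative only):

```python
import struct


def is_nan_f32_bits(bits: int) -> bool:
    """True if the 32-bit pattern encodes a NaN (exponent all ones, mantissa non-zero)."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0xFF and mantissa != 0


# Four different NaN encodings: different payloads, sign bits, quiet and signaling.
patterns = [0x7FC00000, 0x7FC00001, 0xFFC00000, 0x7F800001]

# Little-endian byte strings as they would appear if stored verbatim in a bound.
encoded = [struct.pack("<I", bits) for bits in patterns]
```

All four patterns are NaN, yet their byte encodings differ, which is why a spec either has to mandate one pattern or tell readers to accept any of them.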






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705900#comment-17705900
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1150282767


##
src/main/thrift/parquet.thrift:
##
@@ -952,6 +961,9 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated 
before
* using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   By the way, in your design, for a `NaN` writer, which number should be 
written here? Should it be a specific NaN?







[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705868#comment-17705868
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762

   The gist of all open issues is the question of how to encode pages/column 
chunks that contain only NaNs. 
   
   This is actually only an issue for the `ColumnIndex`. For statistics in the 
`ColumnMetaData` or the page, we can find only-NaN pages/column chunks by 
computing `num_values - null_count - nan_count == 0`. The `ColumnIndex` doesn't 
have `num_values`, so we can't perform this computation.
   
   I see three alternatives to handle the problem in the `ColumnIndex`:
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
   
   **I'm fine with either of these, so I would like us to reach a consensus for 
one of the alternatives here; then I can update my PR to reflect the decision. 
As this is my first contribution to parquet, I don't know the decision 
processes here. Do we vote? Is there a single or group of decision makers? 
Please let me know how to come to a conclusion here.**
   
   As a help for the decision: Here are again the PROs and CONs of the three 
alternatives:
   
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
     * **PRO:** Fully backward compatible
     * **PRO:** Needs no further lists in the ColumnIndex
     * **CON:** People are uneasy about storing NaNs in bounds, because the many 
existing NaN bit patterns make the semantics a bit fuzzy.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
     * **PRO:** No NaNs in bounds, no encoding/bit-pattern fuzziness
     * **PRO:** Makes the `ColumnIndex` symmetric to other statistics (and to 
Apache Iceberg)
     * **PRO:** The `num_values` would also be viable for other purposes. It 
feels weirdly asymmetric to not have this field in the column index. For 
example, this would help to gauge the number of nested values in a nested 
column.
     * **CON:** The extra `num_values` list would be in each column index, even 
for non FLOAT/DOUBLE columns, thereby adding space consumption and 
encoding/decoding overhead.
     * **CON:** Would make `null_pages` redundant, as `null_pages[i] == 
(num_values[i] - null_count[i] == 0)`
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
     * **PRO:** No NaN encoding/bit-pattern fuzziness
     * **PRO:** Less space consumption than `num_values`. The list would only 
be present for FLOAT/DOUBLE columns
     * **PRO:** Along the lines of `null_pages` so following an existing 
pattern in the column index
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
 
   \* Explanation of "in theory not 100% backward compatible": Today, min and 
max in a column index must hold a valid value unless `null_pages` is true for 
the respective page. This would no longer hold if we decide to encode 
only-NaN pages through empty min/max + `nan_pages` or empty min/max + 
`num_values`. Thus, a legacy reader that doesn't know the new lists could 
conclude that the missing bounds constitute an invalid ColumnIndex and 
therefore might deem the whole Parquet file invalid. I doubt that this is a 
problem in practice, as readers are written leniently: if a missing bound 
in a column index is encountered, either the index is not used (which would 
already happen today for an only-NaN page, so not a regression) or just 
that page is treated as "has to be scanned". I don't know of a reader that 
would reject the whole Parquet file in this case. Therefore, this is likely not 
relevant in practice.
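
A minimal Python sketch of the redundancy argument for the `num_values` alternative may help here. It assumes a hypothetical in-memory record `PageCounts` holding the per-page `num_values`, `null_count`, and `nan_count` entries; these names and helpers are illustrative only and are not part of parquet-format or of any Parquet library.

```python
from dataclasses import dataclass


@dataclass
class PageCounts:
    """Hypothetical per-page counters (illustration only, not a Parquet API)."""
    num_values: int   # total values in the page, nulls included
    null_count: int   # number of null values
    nan_count: int    # number of NaN values (FLOAT/DOUBLE columns only)


def is_null_page(p: PageCounts) -> bool:
    # With num_values available, null_pages[i] becomes derivable.
    return p.num_values - p.null_count == 0


def is_nan_only_page(p: PageCounts) -> bool:
    # At least one non-null value, and every non-null value is NaN.
    # The "non_null > 0" guard is an added assumption: it keeps all-null
    # pages from also being reported as only-NaN.
    non_null = p.num_values - p.null_count
    return non_null > 0 and non_null - p.nan_count == 0
```

With such counters a reader would need neither `null_pages` nor `nan_pages`, which is the trade-off against the extra list in every column index.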
   
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to cr

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704402#comment-17704402
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1482138488

   @JFinis Thanks for your reply. I just realized that the page value count is 
stored in the page header, not in the column index. I overlooked your comments 
above before I asked the question, sorry for that.




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member state the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  
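
The contradiction in the issue description can be checked mechanically. The following Python sketch is an illustration under stated assumptions: the function `violations` and its three rules are a simplified, hypothetical model of the spec comments quoted above, not code from parquet-format. A single `bound` argument stands in for both min and max, and `None` models the empty `byte[0]` entry.

```python
import math
from typing import List, Optional


def violations(only_nulls: bool, null_page: bool,
               bound: Optional[float]) -> List[str]:
    """List the quoted spec rules that one column-index entry would break."""
    problems = []
    if null_page and not only_nulls:
        problems.append("null_pages is true although the page has non-null values")
    if not null_page and bound is None:
        problems.append("min/max must be valid when null_pages is false")
    if bound is not None and math.isnan(bound):
        problems.append("NaN must not be written to min/max")
    return problems


# The three candidate encodings of a NaN-only page (only_nulls=False).
candidates = {
    "solution 1": violations(False, True, None),
    "solution 2": violations(False, False, None),
    "solution 3": violations(False, False, math.nan),
}
```

In this model each of the three proposed solutions breaks exactly one rule, which matches the claim that no writer can currently produce a conforming entry for such a page.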



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704198#comment-17704198
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481370812

   @zhongyujiang (as I can't answer your comment directly): here is the problem 
with your suggestion of checking `nanCount == valueCount` to detect only-NaN 
pages:
   
   > @mapleFU To your general comment (I can't answer there)
   > 
   > > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   > 
   > The problem is that the ColumnIndex does not have the `num_values` field, 
so using this computation to derive whether there are only NaNs would only be 
applicable to Statistics, not to the column index. Of course, we could do what 
I suggested in alternatives and give the column index a `num_values` list. Then 
this would indeed work everywhere but at the cost of an additional list.
   > 
   > So I see we have the following options:
   > 
   > * Do what I did here, i.e., use min/max to determine whether there are 
only NaNs
   > * Add a `num_values` list to the ColumnIndex
   > * Accept the fact that the column index cannot detect only-NaN pages 
(might lead to fishy semantics)
   > * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and use the `null_count + nan_count == num_values` for the statistics.
   > 
   > Which one would you suggest here?
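
The difference between the last option and the count-based check can be sketched in Python. This is a hedged illustration, not a Parquet API: both function names are hypothetical, and `num_values=None` is used here to model the column index, which has no such field.

```python
import math
from typing import Optional


def only_nans_via_counts(num_values: Optional[int], null_count: int,
                         nan_count: int) -> Optional[bool]:
    """Count-based check, usable only where num_values is known (Statistics)."""
    if num_values is None:
        # Column index case: no num_values list, so the check is undecidable.
        return None
    # Expression as written in the thread; note it is also true for a page
    # that contains only nulls.
    return null_count + nan_count == num_values


def only_nans_via_bounds(min_value: float, max_value: float) -> bool:
    """Bounds-based check: min == max == NaN signals an only-NaN page."""
    return math.isnan(min_value) and math.isnan(max_value)
```

A reader following the last option would call the bounds-based check for column index entries and the count-based check for page or column chunk statistics.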
   
   





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704194#comment-17704194
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146342719


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   @mapleFU Yes, we could also add a `nan_pages` bool list in the column index. 
That would work as well.
   
   My gut feeling is that one day having a `value_counts` list would be more 
useful than boolean lists. We already have `null_pages` and `null_counts`, and 
we would then also have `nan_pages` and `nan_counts`; both `null_pages` and 
`nan_pages` would be obsolete if there were `value_counts`. Yes, storing one 
integer (`value_counts`) likely takes more space than storing two booleans 
(`null_pages` & `nan_pages`), but knowing the number of values in a page could 
also be helpful for other purposes.
   
   But yes, we could drop the testing of `min=max=NaN` if we had a `nan_pages` 
list in the column index.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704193#comment-17704193
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146342719


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   @mapleFU Yes, we could also add a `nan_pages` bool list in the column index. 
That would work as well.
   
   My gut feeling is that one day having a `value_counts` list would be more 
useful than boolean lists. We already have `null_pages` and `null_counts`, and 
we would then also have `nan_pages` and `nan_counts`; both `null_pages` and 
`nan_pages` would be obsolete if there were `value_counts`. Yes, storing one 
integer likely takes more space than storing two booleans, but knowing the 
number of values in a page could also be helpful for other purposes.
   
   But yes, we could drop the testing of `min=max=NaN` if we had a `nan_pages` 
list in the column index.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704158#comment-17704158
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481237476

   > Thus, to solve the problem of only-NaN pages, the comments in the spec are 
extended to mandate the following behavior:
   > 
   > Once a writer writes the nan_count/nan_counts fields, they have to:
   > never write NaN into min/max if there are non-NaN non-Null values and
   > always write min=max=NaN if the only non-null values in a page are NaN
   > A reader observing that nan_count/nan_counts field was written can then 
rely on that if min or max are NaN, then both have to be NaN and this means 
that the only non-NULL values are NaN.
   
   Instead of writing min and max as NaN when there are only NaN values and 
then letting the reader check whether a NaN min/max is credible by evaluating 
whether naNCounts is empty, wouldn't it be much simpler if we just left the 
evaluation of isNaN and notNaN to the reader?
   A reader can always conclude that a page / column is all NaN when the value 
count of the field == the NaN count of the field (when valueCounts and 
naNCounts both exist). This is Iceberg's current way of [evaluating 
isNaN](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java#L486).
  Is there anything wrong with doing this in Parquet?
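
For reference, the writer-side rule quoted at the top of this comment can be sketched in Python. This is an illustration under assumptions, not the PR's implementation: `page_bounds` is a hypothetical helper, nulls are modeled as `None`, and the separate -0/+0 handling from the spec is deliberately left out.

```python
import math
from typing import Iterable, Optional, Tuple


def page_bounds(
    values: Iterable[Optional[float]],
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (min, max, nan_count) for one page under the quoted rule."""
    non_null = [v for v in values if v is not None]
    ordered = [v for v in non_null if not math.isnan(v)]
    nan_count = len(non_null) - len(ordered)
    if ordered:
        # Non-NaN values exist: NaN never enters the bounds.
        return min(ordered), max(ordered), nan_count
    if nan_count > 0:
        # The only non-null values are NaN: write min = max = NaN.
        return math.nan, math.nan, nan_count
    # All-null (or empty) page: no bounds at all.
    return None, None, 0
```

A reader that sees a written `nan_count` together with a NaN bound could then infer that the page holds no non-NaN, non-null values, without needing a value count.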
   
   
   
   





[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704120#comment-17704120
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146114852


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   I got it, I think using both min-max is backward-compatible and can represent 
"all-data-is-nan". 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944
 can we import a status like that?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704117#comment-17704117
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

mapleFU commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146105914


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   Personally, I think the definition at 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L752
 can be used together with this to decide the status here.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics fields.
> {noformat}
> However, the comment on the null_pages member of ColumnIndex states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages
> {noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* contain only null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect, but I guess solution 3 is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such 

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704114#comment-17704114
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146095493


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   TBH: I would actually love to have a `num_values` list in the column index. 
We have the same in the statistics, Iceberg does the same, and not needing 
min=max=NaN for only-NaN checking would be much more elegant IMHO. 
   
   I just didn't want to suggest adding another list to each column index because 
of the added space cost. However, given that these indexes are negligibly small in 
comparison to the data, I think no one would actually mind that extra space. If 
the consensus is that this is preferable, I'm happy to adapt the commit 
accordingly.






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704111#comment-17704111
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145999358


##
src/main/thrift/parquet.thrift:
##
@@ -952,6 +961,9 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated 
before
* using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   Don't we then have the same problem already for the NaN values stored in the 
actual columns? We already serialize NaN to binary values in the columns 
themselves, and there we also do not mandate a specific bit pattern. The spec does 
define FLOAT and DOUBLE to be IEEE compliant:
   ```
  * FLOAT - 4 bytes per value.  IEEE. Stored as little-endian.
  * DOUBLE - 8 bytes per value.  IEEE. Stored as little-endian.
   ```
   So if I see it correctly, any conforming reader implementation has to be 
able to handle all NaN bit patterns that IEEE allows. Otherwise they could not 
read the actual data in the columns.
   
   As you mention Java: Java has a defined way of reading IEEE bits into Java 
floats: `Float.intBitsToFloat` (and the respective method for double). Here it 
is guaranteed that all valid NaN bit patterns produce a Java NaN. From [the 
documentation](https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html):
   
   > If the argument is any value in the range 0x7f800001 through 0x7fffffff or 
in the range 0xff800001 through 0xffffffff, the result is a NaN.
   
   This method is used by parquet-mr, so we should be fine here.
   
   So, to generalize, as I see it, the following holds:
   * Parquet defines FLOAT/DOUBLE to be IEEE without further mandating any bit 
patterns.
     * If a reader cannot handle all NaN bit patterns, it does not conform 
to the spec.
     * Also, such a reader would already malfunction today, as there can be NaNs 
with any bit pattern in columns already.
   * All prominent programming languages (C++, Java, Python, Go, ...) have IEEE 
compliant binary to float conversions, so this also sounds like a rather 
theoretical problem.
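
The bit-pattern argument above can be checked with a minimal Python sketch. Python is used here only for illustration, and the helper name `decode_float_le` is hypothetical, not part of any Parquet library. The sketch decodes several little-endian FLOAT encodings whose exponent bits are all set and whose mantissa is non-zero, and checks that each one reads back as NaN:

```python
import math
import struct


def decode_float_le(raw: bytes) -> float:
    """Decode 4 little-endian bytes as an IEEE 754 single-precision value."""
    (value,) = struct.unpack("<f", raw)
    return value


# Boundary and interior bit patterns from the two NaN ranges quoted above:
# 0x7f800001..0x7fffffff and 0xff800001..0xffffffff.
NAN_BIT_PATTERNS = [
    0x7F800001, 0x7FC00000, 0x7FFFFFFF,
    0xFF800001, 0xFFC00000, 0xFFFFFFFF,
]

decoded = [decode_float_le(bits.to_bytes(4, "little")) for bits in NAN_BIT_PATTERNS]
all_nan = all(math.isnan(v) for v in decoded)

# The patterns just below each range have a zero mantissa and are infinities.
pos_inf = decode_float_le((0x7F800000).to_bytes(4, "little"))
neg_inf = decode_float_le((0xFF800000).to_bytes(4, "little"))
```

A reader that decodes FLOAT and DOUBLE values through a conversion like this needs no special handling for particular NaN payloads, which is the same property the comment relies on for `Float.intBitsToFloat`.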






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704110#comment-17704110
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   @mapleFU To your general comment (I can't answer there)
   
   > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   
   The problem is that the ColumnIndex does not have the `num_values` field, 
so using this computation to derive whether there are only NaNs would only be 
applicable to Statistics, not to the column index. Of course, we could do what 
I suggested in alternatives and give the column index a `num_values` list. Then 
this would indeed work everywhere but at the cost of an additional list.
   
   So I see we have the following options:
   * Do what I did here, i.e., use min/max to determine whether there are only 
NaNs
   * Add a `num_values` list to the ColumnIndex
   * Accept the fact that the column index cannot detect only-NaN pages (might 
lead to fishy semantics)
   * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and use the `null_count + nan_count == num_values` for the statistics.
   
   Which one would you suggest here?
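
The two detection strategies in the options above can be sketched as follows. This is a minimal illustration in Python, not parquet-mr code: `PageStats`, `only_nan_by_counts`, and `only_nan_by_bounds` are hypothetical names, `nan_count` is the field proposed in this PR, and min/max are assumed to be little-endian DOUBLE encodings.

```python
import math
import struct
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical stand-in for statistics that carry a value count."""
    num_values: int
    null_count: int
    nan_count: int


def only_nan_by_counts(stats: PageStats) -> bool:
    # Count-based check: every value is either null or NaN, and at least
    # one NaN exists (otherwise the page would be all-null instead).
    return stats.nan_count > 0 and stats.null_count + stats.nan_count == stats.num_values


def only_nan_by_bounds(min_bytes: bytes, max_bytes: bytes, nan_count: int) -> bool:
    # Bounds-based check for a structure without num_values: the writer
    # stored NaN as both min and max, and nan_count confirms NaNs exist.
    if nan_count <= 0 or len(min_bytes) != 8 or len(max_bytes) != 8:
        return False
    (lo,) = struct.unpack("<d", min_bytes)
    (hi,) = struct.unpack("<d", max_bytes)
    return math.isnan(lo) and math.isnan(hi)
```

The count-based variant needs `num_values`, which is why it only applies to the column index if an extra list is added there; the bounds-based variant works with the existing lists but relies on writers storing NaN as min and max.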






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704109#comment-17704109
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146080282


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   To this suggestion: 
   
   > Seems it's a little strict here? Just ignore min-max seems ok?
   
   Note that the line you mentioned here just tells a reader that they *can* 
rely on this information, and therefore could, e.g., skip this page if a 
predicate like `x = 12.34` was used. They can of course also opt to ignore this 
information and scan the page rather than skip it. If we removed this, a reader 
couldn't do the skip here. 
   
   I guess this is related to your general suggestion: How do we detect 
only-NaN pages? Depending on what we do for that, this line will be adapted 
accordingly.
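
To make the skipping argument concrete, here is a minimal Python sketch of how a reader might evaluate an equality predicate such as `x = 12.34` against page bounds. The function name `can_skip_page_for_equals` is hypothetical, min/max are assumed to be little-endian DOUBLE encodings, and the NaN rule reflects the wording proposed in this PR rather than a finalized spec:

```python
import math
import struct


def can_skip_page_for_equals(literal: float, min_bytes: bytes,
                             max_bytes: bytes, nan_count: int) -> bool:
    """Return True only if the page provably holds no value equal to literal."""
    (lo,) = struct.unpack("<d", min_bytes)
    (hi,) = struct.unpack("<d", max_bytes)
    if math.isnan(lo) and math.isnan(hi):
        # Assumed rule: nan_count > 0 with NaN as both bounds means the page
        # holds only NaNs (and possibly nulls), so no non-NaN literal matches.
        return nan_count > 0 and not math.isnan(literal)
    if math.isnan(lo) or math.isnan(hi):
        # Bounds are unusable; the conservative choice is to scan the page.
        return False
    # Ordinary range check; a NaN literal compares false and is never skipped.
    return literal < lo or literal > hi
```

Returning `False` is always safe because it only means the page is scanned; the rule under discussion merely gives the reader permission to return `True` in the only-NaN case.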






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704107#comment-17704107
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   @mapleFU To your general comment (I can't answer there)
   
   > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   
   The problem is that the ColumnIndex does not have the `num_values` field, 
so using this computation to derive whether there are only NaNs would only be 
applicable to Statistics, not to the column index. Of course, we could do what 
I suggested in alternatives and give the column index a `num_values` list. Then 
this would indeed work everywhere but at the cost of an additional list.
   
   So I see we have the following options:
   * Do what I did here, i.e., use min/max to determine whether there are only 
NaNs
   * Add a `num_values` list to the ColumnIndex
   * Accept the fact that the column index cannot detect only-NaN columns
   * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and use the `null_count + nan_count == num_values` for the statistics.
   
   Which one would you suggest here?






[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704106#comment-17704106
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-

JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1146076533


##
README.md:
##
@@ -163,18 +163,25 @@ following rules:
   [Thrift definition](src/main/thrift/parquet.thrift) in the
   `ColumnOrder` union. They are summarized here but the Thrift definition
   is considered authoritative:
-  * NaNs should not be written to min or max statistics fields.
-  * If the computed max value is zero (whether negative or positive),
-`+0.0` should be written into the max statistics field.
-  * If the computed min value is zero (whether negative or positive),
-`-0.0` should be written into the min statistics field.
-
-  For backwards compatibility when reading files:
-  * If the min is a NaN, it should be ignored.
-  * If the max is a NaN, it should be ignored.
-  * If the min is +0, the row group may contain -0 values as well.
-  * If the max is -0, the row group may contain +0 values as well.
-  * When looking for NaN values, min and max should be ignored.
+  * The following compatibility rules should be applied when reading 
statistics:
+* If the nan_count field is set to > 0 and both min and max are

Review Comment:
   > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   
   @mapleFU The problem is that the ColumnIndex does not have the `num_values` 
field, so using this computation to derive whether there are only NaNs would 
only be applicable to Statistics, not to the column index. Of course, we could 
do what I suggested in alternatives and give the column index a `num_values` 
list. Then this would indeed work everywhere but at the cost of an additional 
list.
   
   So I see we have the following options:
   * Do what I did here, i.e., use min/max to determine whether there are only 
NaNs
   * Add a `num_values` list to the ColumnIndex
   * Accept the fact that the column index cannot detect only-NaN columns
   * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and use the `null_count + nan_count == num_values` for the statistics.
   
   Which one would you suggest here?





