[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

ASF GitHub Bot (Jira) Thu, 23 Mar 2023 02:46:47 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704022#comment-17704022
 ]


ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

wgtmac commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145920913


##########
src/main/thrift/parquet.thrift:
##########
@@ -952,6 +961,9 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   There is an issue that IEEE754 has defined many NaNs but implementation 
varies among different programming languages. For example, Java only has one 
NaN. It would be difficult to serialize NaN to binary values and inter-operate 
between languages. 



##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +888,23 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
-   *     - If the min is a NaN, it should be ignored.
-   *     - If the max is a NaN, it should be ignored.
+   *     point values (relations vs. total ordering), the following 
compatibility
+   *     rules should be applied when reading statistics:
+   *     - If the nan_count field is set to > 0 and both min and max are
+   *       NaN, a reader can rely on that all non-NULL values are NaN
+   *     - Otherwise, if the min or the max is a NaN, it should be ignored.
+   *     - When looking for NaN values, min and max should be ignored;
+   *       if the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - The nan_count fields should always be set for FLOAT and DOUBLE 
columns.

Review Comment:
   Why should we enforce it?



##########
src/main/thrift/parquet.thrift:
##########
@@ -952,6 +961,9 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   Same rationale for not writing NaN to `statistics` above.





> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

Reply via email to