[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705868#comment-17705868
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762

   The gist of all opened issues is the question how to encode pages/column 
chunks that contain only NaNs. 
   
   This is actually only an issue for the `ColumnIndex`. For statistics in the 
`ColumnMetaData` or the page, we can find only-Nan pages/columnChunks by 
computing `num_values - null_count - nan_count == 0`. The `ColumnIndex` doesn't 
have `num_values`, so we can't perform this computation.
   
   I see three alternatives to handle the problem in the `ColumnIndex`:
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
   
   **I'm fine with either of these, so I would like us to reach a consensus for 
one of the alternatives here; then I can update my PR to reflect the decision. 
As this is my first contribution to parquet, I don't know the decision 
processes here. Do we vote? Is there a single or group of decision makers? 
Please let me know how to come to a conclusion here.**
   
   As a help for the decision: Here are again the PROs and CONs of the three 
alternativs:
   
   * My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
     * **PRO:** Fully backward compatible
     * **PRO:** Needs no further lists in the ColumnIndex
     * **CON:** people are uneasy with storing NaNs in bounds, due to many 
existing bit patterns and therefore a bit fuzzy semantics.
   * Adding `num_values` to the ColumnIndex, to make it symmetric with 
Statistics in pages & `ColumnMetaData` and to enable the computation 
`num_values - null_count - nan_count == 0`
     * **PRO:** No NaNs in bounds, no encoding/bit-pattern fuzzyness
     * **PRO:** Makes the `ColumnIndex` symmetric to other statistics (and to 
Apache Iceberg)
     * **PRO:** The `num_values` would also be viable for other purposes. It 
feels weirdly asymmetric to not have this field in the column index. For 
example, this would help to gauge the number of nested values in a nested 
column.
     * **CON:** The extra `num_values` list would be in each column index, even 
for non FLOAT/DOUBLE columns, thereby adding space consumption and 
encoding/decoding overhead.
     * **CON:** Would make `null_pages` redundant, as `null_pages[i] == 
(num_values[i] - null_count[i] == 0)`
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
   * Adding a `nan_pages` bool list to the column index, which indicates 
whether a page contains only NaNs
     * **PRO:** No NaN encoding fuzzyness, no encoding/bit-pattern fuzzyness
     * **PRO:** Less space consumption than `num_values`. The list would only 
be present for FLOAT/DOUBLE columns
     * **PRO:** Along the lines of `null_pages` so following an existing 
pattern in the column index
     * **CON:** In theory not 100% backward compatible, but probably not 
relevant in practice*
     
   \* Explanation of "in theory not 100% backward compatible": Today, min and 
max in a column index have to have a valid value unless `null_pages` of the 
respective page is true. This would no longer hold if we decide to encode 
only-NaN pages through empty min/max + `nan_pages` or empty min/max + 
`num_values`. Thus, a legacy reader, who doesn't know the new lists, could come 
to the conclusion that the missing bounds constitute an invalid ColumnIndex and 
therefore might deem the whole Parquet file as invalid. I doubt that this is a 
problem in practice, as readers are written leniently. I.e., if a missing bound 
in a column index is encountered, the index might not be used (what would 
already happen today in case of an only-NaN page, so not a regression) or just 
that page might be treated as "has to be scanned". I don't know a reader that 
would reject the whole Parquet file in this case. Therefore, this is likely not 
relevant in practice.
   
   
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is a only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to