[
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788866#comment-17788866
]
ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------
JFinis opened a new pull request, #221:
URL: https://github.com/apache/parquet-format/pull/221
This commit adds a new column order `IEEE754TotalOrder`, which can be used
for floating point types (FLOAT, DOUBLE, FLOAT16).
The advantage of the new order is a well-defined ordering between -0,+0 and
the various possible bit patterns of NaNs. Thus, every single possible bit
pattern of a floating point value has a well-defined order now, so there are no
possibilities where two implementations might apply different orders when the
new column order is used.
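For illustration, the total order for doubles can be realized with a well-known bit-manipulation trick: map the raw bit pattern to an integer key whose natural ordering matches IEEE 754 totalOrder. This is a sketch in Python, not part of the spec or any Parquet API:

```python
import struct

def total_order_key(x: float) -> int:
    """Integer key whose ordering matches IEEE 754 totalOrder:
    -NaN < -Inf < ... < -0 < +0 < ... < +Inf < +NaN."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]  # raw 64-bit pattern
    if bits >> 63:                      # sign bit set: negatives and -NaNs
        return (~bits) & (2**64 - 1)    # flip all bits to reverse their order
    return bits | (1 << 63)             # positives: set sign bit to sort above
```

Sorting by this key gives every possible bit pattern, including every NaN payload, a well-defined position.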
With the default column order, there were many problems w.r.t. NaN values
that prevented reading engines from using statistics of floating point
columns for scan pruning, even when no NaNs were present in the data set.
The problems are discussed in detail in the next section.
This solution to the problem is the result of the extended discussion in
https://github.com/apache/parquet-format/pull/196, which ended with the
consensus that IEEE 754 total ordering is the best approach to solve the
problem in a simple manner without introducing special fields for floating
point columns (such as `nan_counts`, which was proposed in that PR). Please
refer to the discussion in that PR for all the details why this solution was
chosen over various design alternatives.
Note that this solution is fully backward compatible and should break
neither old readers nor old writers, as it merely adds a new column order.
Legacy writers can simply continue writing the default type-defined order
instead of the new one. Legacy readers should avoid using statistics on
columns whose column order they do not understand, and will therefore simply
not use the statistics for columns ordered with the new order.
The remainder of this message explains in detail what the problems are and
how the proposed solution fixes them.
Problem Description
===================
Currently, the way NaN values are to be handled in statistics inhibits most
scan pruning once NaN values are present in DOUBLE or FLOAT columns.
Concretely, the following problems exist:
Statistics don't tell whether NaNs are present
----------------------------------------------
As NaN values are not to be incorporated in min/max bounds, a reader cannot
know whether NaN values are present. This might not seem too problematic, as
most queries will not filter for NaNs. However, NaN is ordered in most
database systems. For example, Postgres, DB2, and Oracle treat NaN as greater
than any other value, while MSSQL and MySQL treat it as less than any other
value. An overview of how different systems handle NaN can be found here. The
gist of it is that different systems with different NaN semantics exist, and
most of them do order NaNs, either less than or greater than all other
values.
For example, if the semantics of the reading query engine mandate that NaN
is to be treated greater than all other values, the predicate x > 1.0 should
include NaN values. If a page has max = 0.0 now, the engine would not be able
to skip the page, as the page might contain NaNs which would need to be
included in the query result.
Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be
less than all other values by the reading engine. Again, a page with min = 2.0
couldn't be skipped in this case by the reader.
Thus, even if a user doesn't query for NaN explicitly, they might use other
predicates that need to filter or retain NaNs under the semantics of the
reading engine, so the fact that we currently can't know whether a page or
row group contains NaN is a bigger problem than it might seem at first sight.
Currently, any predicate that needs to retain NaNs cannot use min and max
bounds in Parquet and therefore cannot be used for scan pruning at all. And
as stated above, that can be many seemingly innocuous greater-than or
less-than predicates in most database systems. Conversely, it would be nice
if Parquet enabled scan pruning in these cases, regardless of whether the
reader and writer agree on whether NaN is smaller than, greater than, or
incomparable to all other values.
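To make the issue concrete, here is a small sketch (in Python; the function names are hypothetical, not any engine's API) of the pruning decision for a reader that treats NaN as greater than all other values, first under the current default order and then under a total order where NaN participates in max:

```python
import math

def can_skip_page_gt(page_max: float, threshold: float) -> bool:
    """Can a page be skipped for predicate x > threshold under the default
    column order, where NaNs are never reflected in min/max statistics?"""
    # Under NaN-greater-than-all semantics, x > threshold also matches NaN.
    # The default-order stats say nothing about NaN presence, so qualifying
    # NaNs may hide even when page_max <= threshold: skipping is never safe.
    return False

def can_skip_page_gt_total_order(page_max: float, threshold: float) -> bool:
    """Same decision with total-order statistics, where NaN is included in
    max and sorts above every real number."""
    # A non-NaN max <= threshold now proves the page holds no NaNs and no
    # value above the threshold, so it can safely be skipped.
    return not math.isnan(page_max) and page_max <= threshold
```

The first function captures the current pessimization; the second shows the pruning that total-order statistics would enable.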
Note that the problem exists especially if the Parquet file doesn't include
any NaNs, so this is not only a problem in the edge case where NaNs are
present; it is a problem in the way more common case of NaNs not being present.
Handling NaNs in a ColumnIndex
------------------------------
There is currently no well-defined way to write a spec-conforming
ColumnIndex once a page has only NaN (and possibly null) values. NaN values
should not be included in min/max bounds, but if a page contains only NaN
values, then there is no other value to put into the min/max bounds. However,
bounds in a ColumnIndex are non-optional, so we have to put something in here.
The spec does not describe what engines should do in this case. Parquet-mr
takes the safe route and does not write a column index once NaNs are present.
But this is a huge pessimization, as a single page containing NaNs will prevent
writing a column index for the column chunk containing that page, so even pages
in that chunk that don't contain NaNs will not be indexed.
It would be nice if there was a defined way of writing the ColumnIndex when
NaNs (and especially only-NaN pages) are present.
Handling only-NaN pages & column chunks
---------------------------------------
Note: Hereinafter, whenever the term only-NaN is used, it refers to a page
or column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is
allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN
non-null values.
The Statistics objects stored in page headers and in the file footer have a
similar, albeit smaller, problem: min_value and max_value are optional here,
so it is easier to not include NaNs in the min/max in case of an only-NaN
page or column chunk: simply omit these optional fields. However, this
introduces a semantic ambiguity, as it is now unclear whether the min/max
values weren't written because there were only NaNs, or simply because the
writing engine decided to omit them for whatever other reason, which is
allowed by the spec as the fields are optional.
Consequently, a reader cannot know whether missing min_value and max_value
means "only NaNs, you can skip this page if you are looking for only non-NaN
values" or "no stats written, you have to read this page as it is undefined
what values it contains".
It would be nice if we could handle NaNs in a way that would allow scan
pruning for these only-NaN pages.
Solution
========
IEEE 754 total order solves all the mentioned problems. As NaNs now have a
defined place in the ordering, they can be incorporated into min and max
bounds. In fact, in contrast to the default ordering, they do not need any
special casing anymore, so all the remarks about how readers and writers
should special-case NaNs and -0/+0 no longer apply to the new ordering.
As NaNs are incorporated into min and max, a reader can now see through the
statistics whether NaNs are contained. Thus, a reading engine just has to
map its NaN semantics to the NaN semantics of total ordering. For example,
if the semantics of the reading engine treat all NaNs (including -NaNs) as
greater than all other values, a reading engine with a predicate `x > 5.0`
(which should include NaNs) must not filter out any pages / row groups whose
min or max is (+/-)NaN.
Only-NaN pages can now also be included in the column index, as they are no
longer a special case.
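Under the new order, computing page statistics needs no NaN special-casing at all. A sketch (Python, with hypothetical helper names; not parquet-mr code) of page-level min/max that naturally covers only-NaN pages:

```python
import struct

def _total_order_key(x: float) -> int:
    # IEEE 754 totalOrder key via bit manipulation (illustrative helper):
    # flip all bits for negatives, set the sign bit for positives.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return (~bits) & (2**64 - 1) if bits >> 63 else bits | (1 << 63)

def page_min_max(values):
    """min/max over non-null values under totalOrder; NaNs participate
    like any other value. An only-NaN page yields min = max = NaN."""
    non_null = [v for v in values if v is not None]
    return (min(non_null, key=_total_order_key),
            max(non_null, key=_total_order_key))
```

A reader seeing min = max = NaN can then skip the page for any predicate that excludes NaN.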
In conclusion, all mentioned problems are solved by using IEEE 754 total
ordering.
Make sure you have checked _all_ steps below.
### Jira
- [x] My PR addresses the following [Parquet
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with
the [ASF 3rd Party License
Policy](https://www.apache.org/legal/resolved.html#category-x).
### Commits
- [x] My commits all reference Jira issues in their subject lines. In
addition, my commits follow the guidelines from "[How to write a good git
commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds documentation that describes
how to use it.
- All the public functions and the classes in the PR contain Javadoc that
explains what they do
> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
>  * When writing statistics the following rules should be followed:
>  * - NaNs should not be written to min or max statistics fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is a only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3 is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation, where engines
> writing such a page cannot write a conforming parquet file and will
> arbitrarily pick one of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)