[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703622#comment-17703622
 ] 

ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------

JFinis opened a new pull request, #196:
URL: https://github.com/apache/parquet-format/pull/196

   This commit proposes an improvement for handling of NaN values in FLOAT and 
DOUBLE type columns. The goal is to allow reading engines, regardless of how 
they order NaN w.r.t. other values, to efficiently use statistics for scan 
pruning while NaN values are present, which currently is impossible in most 
cases. This is to be accomplished in a fully backward compatible manner, so 
that existing readers and writers do not need to be updated immediately but can 
migrate over time to make use of the improved semantics.
   
   There was already [work on improving the handling of float and double 
columns](https://github.com/apache/parquet-format/pull/185), which laid good 
groundwork for backward compatible improvements, but it wasn't sufficient to 
fix all the problems with NaN values, which are laid out hereinafter.
   
   Problem Description
   ===================
   
   Currently, the way NaN values are to be handled in statistics inhibits most 
scan pruning once NaN values are present in DOUBLE or FLOAT columns.  
Concretely the following problems exist:
   
   Statistics don't tell whether NaNs are present
   ----------------------------------------------
   
   As NaN values are not to be incorporated in min/max bounds, a reader cannot 
know whether NaN values are present. This might not seem too problematic, as 
most queries will not filter for NaNs. However, NaN is ordered in most 
database systems. For example, Postgres, DB2, and Oracle treat NaN as greater 
than any other value, while MSSQL and MySQL treat it as less than any other 
value. An overview of how different systems handle NaN can be found 
[here](https://github.com/apache/arrow-rs/issues/264#issuecomment-833629444). 
The gist is that different systems have different semantics w.r.t. NaNs, and 
most of them do order NaNs, either less than or greater than all other values.
   
   For example, if the semantics of the reading query engine mandate that NaN 
is to be treated greater than all other values, the predicate `x > 1.0` 
*should* include NaN values. If a page has `max = 0.0` now, the engine would 
*not* be able to skip the page, as the page might contain NaNs which would need 
to be included in the query result.
   
   Likewise, the predicate `x < 1.0` should include NaN if NaN is treated to be 
less than all other values by the reading engine. Again, a page with `min = 
2.0` couldn't be skipped in this case by the reader.
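   To make this concrete, here is a minimal sketch (in Python, using a 
hypothetical `PageStats` model rather than any real Parquet reader API) of the 
conservative decision an engine that orders NaN above all other values is 
forced to make today:
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class PageStats:
       # Hypothetical per-page statistics as written today: NaNs are never
       # included in min/max, and nothing records their presence.
       min_value: Optional[float]
       max_value: Optional[float]
   
   def can_skip_for_gt(stats: PageStats, literal: float) -> bool:
       """Skip check for `x > literal` in an engine that treats NaN as
       greater than all other values."""
       if stats.max_value is None:
           return False  # no stats written, must read the page
       # max <= literal would normally allow skipping, but the page may
       # still contain NaNs (which match `x > literal` in this engine),
       # so the reader must always scan it.
       return False
   
   # A page with max = 0.0 still cannot be skipped for `x > 1.0`:
   print(can_skip_for_gt(PageStats(min_value=-1.0, max_value=0.0), 1.0))
   ```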
   
   Thus, even if a user doesn't query for NaN explicitly, they might use other 
predicates that need to filter or retain NaNs under the semantics of the 
reading engine, so the fact that we currently can't know whether a page or row 
group contains NaN is a bigger problem than it might seem at first sight.
   
   Currently, any predicate that needs to retain NaNs cannot use min and max 
bounds in Parquet and therefore cannot be used for scan pruning at all. And as 
stated, that includes many seemingly innocuous greater-than or less-than 
predicates in most database systems. Conversely, it would be nice if Parquet 
enabled scan pruning in these cases, regardless of whether the reader and 
writer agree on whether NaN is smaller than, greater than, or incomparable to 
all other values.
   
   Note that this problem especially hurts Parquet files that *don't* contain 
any NaNs: since a reader cannot prove their absence, pruning is inhibited not 
only in the edge case where NaNs are present, but also in the far more common 
case of NaNs *not* being present.
   
   Handling NaNs in a ColumnIndex
   ------------------------------
   
   There is currently no well-defined way to write a spec-conforming 
ColumnIndex once a page has only NaN (and possibly null) values.  NaN values 
should not be included in min/max bounds, but if a page contains only NaN 
values, then there is no other value to put into the min/max bounds. However, 
bounds in a ColumnIndex are non-optional, so we *have to* put something in 
here.  The spec does not describe what engines should do in this case.  
Parquet-mr takes the safe route and does not write a column index once NaNs are 
present. But this is a huge pessimization, as a single page containing NaNs 
will prevent writing a column index for the column chunk containing that page, 
so even pages in that chunk that don't contain NaNs will not be indexed.
   
   It would be nice if there was a defined way of writing the `ColumnIndex` 
when NaNs (and especially only-NaN pages) are present.
   
   Handling only-NaN pages & column chunks
   ---------------------------------------
   
   Note: Hereinafter, whenever the term *only-NaN* is used, it refers to a 
page or column chunk whose only non-null values are NaNs. E.g., an only-NaN 
page may contain a mixture of null values and NaNs, or only NaNs, but no 
non-NaN non-null values.
   
   The `Statistics` objects stored in page headers and in the file footer have 
a similar, albeit smaller, problem: `min_value` and `max_value` are optional 
here, so it is easier to not include NaNs in the min/max in case of an 
only-NaN page or column chunk: simply omit these optional fields. However, 
this introduces a semantic ambiguity, as it is now unclear whether the min/max 
values weren't written because there were only NaNs, or simply because the 
writing engine decided to omit them for whatever other reason, which the spec 
allows since the fields are optional.
   
   Consequently, a reader cannot know whether missing `min_value` and 
`max_value` means "only NaNs, you can skip this page if you are looking for 
only non-NaN values" or "no stats written, you have to read this page as it is 
undefined what values it contains".
   
   It would be nice if we could handle NaNs in a way that would allow scan 
pruning for these only-NaN pages.
   
   Proposed solution
   =================
   
   Adding NaN counts
   -----------------
   
   The proposed solution for handling NaNs in statistics is akin to what 
[Apache Iceberg](https://iceberg.apache.org/spec/) does: add an *optional* 
`nan_count` field to `Statistics` and an *optional* `nan_counts` list to 
`ColumnIndex`.  This way, all places where statistics are being retained can 
specify whether NaN values are present. This already solves the first problem, 
as now queries wanting to retain NaNs can check whether the count is > 0 to see 
whether a page or column chunk contains NaNs.
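   As a sketch of how a reader could exploit the new field (again a 
hypothetical Python model of the extended `Statistics` object, not actual 
parquet-format bindings), assuming an engine that orders NaN above all other 
values:
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class Statistics:
       # Hypothetical model; nan_count is the proposed new optional field.
       min_value: Optional[float]
       max_value: Optional[float]
       nan_count: Optional[int]
   
   def can_skip_for_gt(stats: Statistics, literal: float) -> bool:
       """Skip check for `x > literal` in an engine that treats NaN as
       greater than all other values."""
       if stats.nan_count is None:
           return False  # legacy writer: NaN presence unknown, must read
       if stats.nan_count > 0:
           return False  # NaNs present; they match `x > literal` here
       # No NaNs in this page, so the max bound alone decides.
       return stats.max_value is not None and stats.max_value <= literal
   
   print(can_skip_for_gt(Statistics(-1.0, 0.0, nan_count=0), 1.0))  # True
   print(can_skip_for_gt(Statistics(-1.0, 0.0, nan_count=3), 1.0))  # False
   ```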
   
   Handling NaN-only pages & column chunks
   ---------------------------------------
   
   Adding `nan_count`/`nan_counts` fields does not yet solve the problem of 
only-NaN pages. But since we have a new optional field in both the 
`Statistics` object and the `ColumnIndex` object, we can tie stricter 
semantics to the existence of this field. I.e., we can mandate that writers 
who write this optional field have to treat NaNs in a specific way.
   
   We basically have two options for treating only-NaN pages or column chunks:
   * to write NaN as min/max in this case.
   * to write nothing, i.e., 
     * omit the `min_value` / `max_value` in `Statistics` 
     * write `byte[0]` in the `min_values`/`max_values` list entry of the 
`ColumnIndex`
   
   I propose to go with the first option of writing NaN as min/max in case of 
only-NaN pages & column chunks. A section laying out the reasoning behind this 
decision follows below.
   
   Thus, to solve the problem of only-NaN pages, the comments in the spec are 
extended to mandate the following behavior (a writer-side sketch follows the 
list):
   
   * Once a writer writes the `nan_count`/`nan_counts` fields, they have to:
     * never write NaN into min/max if there are non-NaN non-null values **and**
     * always write min=max=NaN if the only non-null values in a page are NaN
   * A reader observing that the `nan_count`/`nan_counts` fields were written 
can then rely on the fact that if min or max is NaN, then both have to be NaN, 
and this means that the only non-null values are NaN.
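   A minimal writer-side sketch of this contract (a hypothetical helper for 
one page of a FLOAT/DOUBLE column, not parquet-mr code):
   
   ```python
   import math
   from typing import Iterable, Optional
   
   def page_stats(values: Iterable[Optional[float]]):
       """Return (min, max, nan_count, null_count) for one page under the
       proposed contract: NaNs are excluded from min/max unless they are
       the only non-null values, in which case min = max = NaN."""
       non_nan, nan_count, null_count = [], 0, 0
       for v in values:
           if v is None:
               null_count += 1
           elif math.isnan(v):
               nan_count += 1
           else:
               non_nan.append(v)
       if non_nan:  # normal case: NaNs never enter the bounds
           return min(non_nan), max(non_nan), nan_count, null_count
       if nan_count > 0:  # only-NaN page: mandated min = max = NaN
           return math.nan, math.nan, nan_count, null_count
       return None, None, nan_count, null_count  # only-null page
   
   print(page_stats([1.0, math.nan, 2.0, None]))  # (1.0, 2.0, 1, 1)
   print(page_stats([math.nan, None]))            # (nan, nan, 1, 1)
   ```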
   
   Should we write NaN or nothing for only-NaN pages & column chunks?
   ------------------------------------------------------------------
   
   Here are the cons of each approach and how to mitigate them:
   
   CONs for writing NaN in this case:
   * Writing NaN breaks with the "don't write NaN into min and max bounds" 
rule. 
     * However, one could argue that breaking the rule in this edge case is 
okay: if NaN is the only value in a page, then it doesn't matter where NaN 
sorts w.r.t. other values, as there are no other values. It is the only value 
in the page, so it is the min and max of the page.
     * Breaking this rule has no consequences for existing readers, as they 
should ignore NaN anyway, i.e., treat it as if it wasn't written, so legacy 
readers treat both cases the same.
   * There might be existing writers that have written NaN as min & max for 
pages that do not only contain NaN but also other values, so a reader couldn't 
rely on min=max=NaN to mean that the only non-null value is NaN. 
     * However, as specified, we enforce stricter semantics once the 
`nan_count` field is written: we define that once a writing engine writes this 
field, it must never write NaN into min/max if there are non-NaN non-null 
values and must always write min=max=NaN if the only non-null values in a page 
are NaN. Consequently, readers can rely on these semantics once they observe 
that the `nan_count` field is written, and this becomes a non-issue.
   * NaNs take more space than not writing the field or writing byte[0] in the 
column index. This space overhead should usually be negligible.
   
   In conclusion, there is no big con for writing NaN. It can be implemented 
in a fully backward compatible way that still allows future writers and 
readers to apply stricter semantics.
   
   CONs for writing nothing in this case:
   * Writing byte[0] to a ColumnIndex might break older readers who expect the 
`min_values`/`max_values` entries to be values of correct length unless 
`null_pages` is true for the entry.
     * Although readers should be lenient enough to handle wrongly sized 
min/max values gracefully by ignoring them, we cannot be sure this is the case 
for every reader. Thus, certain legacy spec-conforming readers might reject 
the new Parquet file, which is bad.
   * Omitting the `min_value` / `max_value` fields in `Statistics` is 
suboptimal, as it is indistinguishable from the writing engine not writing 
these fields for whatever other reason. If we did this, the page could never 
be pruned by a reader, as the reader couldn't tell whether the omission means 
"there are only NaNs" or "the writer hasn't written any min/max for this 
page".
     * We could define that a writer may not omit min/max if they write the 
`nan_count` field and must omit them only if a page has only NaNs, but this 
seems quite fishy, semantically.
   
   In conclusion, the cons for the NaN approach have mitigations and can be 
handled in a backward compatible way, while the cons for writing nothing might 
be non-backward-compatible. Therefore, I propose to write NaN as min/max for 
only-NaN pages & column chunks.
   
   Considerations
   ==============
   
   Backward compatibility
   ----------------------
   
   The suggested change is fully backward compatible both in the read and write 
direction:
   
   Older readers not supporting `nan_count`/`nan_counts` yet can stay as is.  
As the fields are optional, readers not supporting them will simply ignore them.
   
   The spec already today mandates that if a reader sees `NaN` in min or max 
fields they should ignore it. They can continue doing so.
   
   No older reader will suffer regressed performance; any page that an older 
reader would have skipped before can still be skipped.
   
   Older writers can continue writing files without 
`nan_count`/`nan_counts`. Even if an old writer sets min=max=NaN for a page 
that does contain non-NaN values, readers supporting the new semantics will 
not misinterpret these bounds, as such a writer will not write 
`nan_count`/`nan_counts`, so the new, stricter semantics do not apply when 
reading.
   
   As `nan_count`/`nan_counts` are local to the scopes where they apply (column 
index, page, column chunk), even stitching together row groups from a writer 
that didn't write them and a writer that does write them works. This would 
result in a file where some pages / column indexes / column chunks would have 
them written while others wouldn't.
   
   Versioning
   ----------
   
   This proposal definitely does not require a *major* version bump, as it is 
fully backward compatible.
   
   I do not fully understand the versioning policy of Parquet, so I don't know 
whether this change would require a minor version bump. One could argue that 
it is not necessary, as the mere existence of the `nan_count`/`nan_counts` 
fields would be the "feature flag" indicating whether a writer supports this 
change. No version check would be necessary in a reader; it would just need to 
check for the existence of the `nan_count`/`nan_counts` fields.
   
   No Unnecessary Overhead
   -----------------------
   
   As thrift encodes missing optional fields with zero bytes in the compact 
protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the new 
optional fields.
   
   Design alternatives and why they were not chosen
   ================================================
   
   Ordering NaN before or after all other values
   ---------------------------------------------
   
   We could simply define NaN to be smaller or greater than all other values 
and then allow NaN in the respective bound.
   
   This however has many drawbacks:
   * NaN is the widest bound possible, so adding NaNs to min and max isn't very 
useful, as it makes pruning for non-NaN values almost impossible in the 
respective direction.
   * As mentioned, not all systems agree on whether NaN is larger or smaller 
than all other values. If we decided on one, systems choosing the other 
semantics couldn't filter effectively.
   * This contradicts the current spec of not writing NaN to min/max, so it 
would make older readers no longer skip pages they could skip before.
   
   Adding a new ColumnOrder
   ------------------------
   
   We could add a new ColumnOrder that specifies NaN to be smaller or greater 
than all other values, or even supports -NaN and +NaN ordering them as smaller 
and larger than all values, respectively. For example, Iceberg mandates the 
following sort order:
   
   -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN
   
   Once we define such an order, we could allow NaN (and potentially -NaN) in 
bounds again.
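   For illustration, such a total order (matching the IEEE 754 totalOrder 
predicate, which the Iceberg order above resembles) can be obtained by 
comparing the bit patterns of the doubles. A sketch, not code from either 
spec:
   
   ```python
   import math
   import struct
   
   def total_order_key(x: float) -> int:
       """Map a double to an integer whose natural order is
       -NaN < -Inf < -value < -0 < 0 < value < Inf < NaN."""
       (bits,) = struct.unpack("<q", struct.pack("<d", x))
       # Positive floats already order by their raw bits; for negative
       # floats (sign bit set) the magnitude bits must be flipped.
       return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF
   
   vals = [math.nan, math.inf, 1.5, 0.0, -0.0, -1.5, -math.inf]
   print(sorted(vals, key=total_order_key))
   # [-inf, -1.5, -0.0, 0.0, 1.5, inf, nan]
   ```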
   
   This however has the following drawbacks:
   * As with the previous alternative: NaN is the widest bound possible, so 
adding NaNs to min and max isn't very useful, as it makes pruning for non-NaN 
values almost impossible in the respective direction. If we even allow -NaN and 
+NaN, a page containing both would have no meaningful min and max and wouldn't 
allow any pruning at all.
   * Most systems don't support -NaN, as mathematically speaking, it is 
nonsense. The only reason it exists is that the physical representation of 
floats has a sign bit that is also present in NaN representations.
   * How unhelpful NaNs are in min/max bounds is underscored by the fact that 
even though Iceberg has such a well-defined sort order, it still does what is 
proposed here and does *not* include -NaN/NaN in min/max bounds, instead 
tracking them through nan_counts.
   
   All in all, any alternative putting NaN into min/max bounds (except for 
only-NaN-pages) suffers from the problem that NaN bounds are too wide and 
therefore not useful for pruning.
   
   Writing a second `value_counts` list in the ColumnIndex
   -------------------------------------------------------
   
   The column index already allows `byte[0]` values in case a page contains 
only nulls. We could enable the same for only-NaN pages by storing not only 
the `nan_counts` but also the `value_counts` of a page. Then, one could check 
whether a page in the column index contains only NaNs by checking 
`nan_count + null_count = value_count` (see the sketch below). However, this 
would mean yet another list in the column index, making the column index 
bigger and more expensive to deserialize. And while the `nan_counts` list only 
exists for FLOAT/DOUBLE columns, the `value_counts` list would exist for *all* 
columns and therefore take up considerably more space. Also, this would not be 
backward compatible, as an older reader wouldn't know of the new lists, so it 
would see a `byte[0]` and would have to treat it as invalid.
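   The check this alternative would have enabled looks roughly as follows 
(hypothetical per-page counts):
   
   ```python
   def page_is_only_nan(nan_count: int, null_count: int,
                        value_count: int) -> bool:
       # Under this rejected alternative, a reader would detect an
       # only-NaN page arithmetically instead of via min = max = NaN.
       return nan_count > 0 and nan_count + null_count == value_count
   
   print(page_is_only_nan(nan_count=7, null_count=3, value_count=10))  # True
   print(page_is_only_nan(nan_count=7, null_count=2, value_count=10))  # False
   ```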
   
   All in all, the extra list doesn't seem to add enough value for its cost 
(especially also cost on non-float columns) and reduced backward compatibility.
   
   Do nothing
   ----------
   
   As long as we don't do anything, columns containing NaNs will be almost 
useless for scan pruning. The problems outlined will persist, making double 
columns almost unprunable for some predicates. That is not satisfactory.
   
   Why wasn't sort order tackled?
   ==============================
   
   Even with this improvement, which fixes the semantics of NaN values in 
statistics, NaN values are still a problem in some other places, as there is 
still no defined order for them, so the `boundary_order` in a column index and 
the `SortingColumn` would still have undefined placement for `NaN`.
   
   This mainly wasn't tackled for two reasons:
   * It is an orthogonal issue. This improvement is about enabling NaNs in 
statistics, so after this change all statistics can handle NaN in a 
well-defined way. Sort ordering of columns to leverage `boundary_order` or 
`SortingColumn` needs to be solved in a different way anyway, as the mere 
information about whether (only or some) NaNs are present isn't enough here; 
it needs to be defined whether they come before or after all other values.
     * Even though we could fix both statistics and sort order by just 
defining NaN to be smaller or greater than all other values, doing so for 
statistics is *not* a good idea, as having NaN in bounds makes the bounds too 
wide to be helpful for filtering.
     * If sort ordering is fixed by a different commit one day, the design of 
this commit shouldn't have a (negative) influence on that future design, as 
keeping NaN counts and excluding NaNs from bounds is a good thing to do 
anyway.
   * While fixing statistics with NaN counts is pretty uncontested w.r.t. 
design alternatives (see the respective section for a discussion why), the 
design to be chosen for sort orders isn't that clear:
     * We could define a new `ColumnOrder` with well-defined NaN ordering. 
This would be the cleanest fix, but also a "very big gun", as it would be the 
first non-default column order in existence. Also, we would have to decide 
whether we want different sort orders for NaNs first, NaNs last, and +/-NaN 
allowed.
     * We could define a `nans_first` field which tells whether NaNs are 
sorted before or after other values, akin to the already existing 
`nulls_first` field. This would be a less invasive change, but it would be 
less clean IMHO, as there is a tool for defining column ordering--the 
`ColumnOrder`--and not using that tool but working around it feels hacky.
   
   Thus, sort ordering of NaNs wasn't tackled in this commit. It can be 
tackled by a follow-up change if necessary.
   
   Epilogue
   ========
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
     - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
>                 Key: PARQUET-2249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2249
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Jan Finis
>            Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are 
> to be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  


