Hi all,

As this discussion has been open for more than two years, I’d like to bump
this thread again to share a progress update and collect feedback.

*Background*
• Today Parquet’s min/max stats and page index omit NaNs entirely.
• Engines can’t safely prune floating-point values because the stats tell
them nothing about NaNs.
• Column index is disabled if any page contains only NaNs.
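
To illustrate why NaNs are awkward for min/max statistics, here is a small
Python sketch. It is an illustration only, not Parquet code: it just shows
that ordinary float comparison gives no consistent answer once a NaN is
involved, so a naive min depends on where the NaN happens to sit.

```python
import math

nan = math.nan

# Every ordered comparison involving NaN is False, and NaN is not equal to itself.
assert not (nan < 1.0) and not (nan > 1.0) and nan != nan

# Python's built-in min keeps the first element unless a later one compares
# smaller, so the result depends on the position of the NaN.
nan_first = min([nan, 1.0, 2.0])  # the NaN is kept
nan_last = min([1.0, 2.0, nan])   # the NaN is silently skipped
```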

There are two active proposals:

*Proposal A - IEEE754TotalOrder* (from the PR [1])
• Define a new ColumnOrder to include +0, –0 and all NaN bit‐patterns.
• Stats and column index store NaNs if they appear.
• Three PoC impls are ready: arrow-rs [2], duckdb [3] and parquet-java [4].
• For more context on this approach, please refer to the discussion in [5].
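
As a rough illustration of what a total order over doubles looks like, here
is a minimal Python sketch of the IEEE 754 totalOrder predicate for binary64
values. The helper name `total_order_key` is made up for this example and is
not taken from the PR or any of the PoCs; the sketch also assumes the
proposed ColumnOrder follows the standard predicate.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a binary64 float to an int whose natural order matches totalOrder."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats sort in reverse by magnitude, so flip their lower 63 bits.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


# Build NaNs with explicit sign bits so the example does not rely on defaults.
pos_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8_0000_0000_0000))[0]
neg_nan = struct.unpack("<d", struct.pack("<Q", 0xFFF8_0000_0000_0000))[0]

values = [1.5, pos_nan, 0.0, -math.inf, neg_nan, -0.0, math.inf, -2.0]

# Total order: -NaN < -inf < -2.0 < -0.0 < +0.0 < 1.5 < +inf < +NaN
ordered = sorted(values, key=total_order_key)

# Min/max under this order are well defined even when NaNs are present.
stat_min = min(values, key=total_order_key)
stat_max = max(values, key=total_order_key)
```

With such a key, min/max no longer depend on the order of the input, and
-0.0 and +0.0 become distinguishable in the stats.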

*Proposal B - add nan_count* (from a comment [6] to [1])
• Add `nan_count` to stats and a `nan_counts` list to column index.
• For all‐NaNs cases, write NaN to min/max and use nan_count to distinguish.
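
A minimal Python sketch of how a writer and a reader might use `nan_count`
under this proposal. Only `nan_count` comes from the proposal; the
`PageStats` class and the `compute_page_stats` and `is_all_nan` helpers are
hypothetical names for illustration, and the handling of an empty page is my
own assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageStats:
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    nan_count: int


def compute_page_stats(values: Sequence[Optional[float]]) -> PageStats:
    non_null = [v for v in values if v is not None]
    comparable = [v for v in non_null if not math.isnan(v)]
    null_count = len(values) - len(non_null)
    nan_count = len(non_null) - len(comparable)
    if comparable:
        # Usual case: min/max cover only the non-NaN values.
        lo, hi = min(comparable), max(comparable)
    elif nan_count > 0:
        # All-NaN case: write NaN to min/max; nan_count disambiguates it.
        lo = hi = math.nan
    else:
        # Assumption: no stats for a page without non-null values.
        lo = hi = None
    return PageStats(lo, hi, null_count, nan_count)


def is_all_nan(stats: PageStats) -> bool:
    # A reader can tell an all-NaN page from NaN min/max plus a non-zero count.
    return (
        stats.nan_count > 0
        and stats.min_value is not None
        and math.isnan(stats.min_value)
    )


mixed = compute_page_stats([1.0, math.nan, 3.0, None])
only_nan = compute_page_stats([math.nan, math.nan])
```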

Both proposals have pros and cons, but either is far better than the status
quo. Please share your thoughts on the two proposals above, or suggest a
better alternative. We need to reach consensus on one proposal so we can
move forward.

[1] https://github.com/apache/parquet-format/pull/221
[2] https://github.com/apache/arrow-rs/pull/7408
[3] https://github.com/duckdb/duckdb/compare/main...Mytherin:duckdb:ieeeorder
[4] https://github.com/apache/parquet-java/pull/3191
[5] https://github.com/apache/parquet-format/pull/196
[6] https://github.com/apache/parquet-format/pull/221#issuecomment-2931376077

Best,
Gang

On Tue, Mar 28, 2023 at 4:22 PM Jan Finis <[email protected]> wrote:

> Dear contributors,
>
> My PR has now gathered comments for a week and the gist of all open issues
> is the question of how to encode pages/column chunks that contain only
> NaNs. There are different suggestions and I don't see one common favorite
> yet.
>
> I have outlined three alternatives of how we can handle these and I want us
> to reach a conclusion here, so I can update my PR accordingly and move on
> with it. As this is my first contribution to parquet, I don't know the
> decision processes here. Do we vote? Is there a single or group of decision
> makers? *Please let me know how to come to a conclusion here; what are the
> next steps?*
>
> For reference, here are the three alternatives I pointed out. You can find
> a detailed description of their PROs and CONs in my comment:
> https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762
>
> 1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
> 2. Adding `num_values` to the ColumnIndex, to make it symmetric with
> Statistics in pages & `ColumnMetaData` and to enable the computation
> `num_values - null_count - nan_count == 0`
> 3. Adding a `nan_pages` bool list to the column index, which indicates
> whether a page contains only NaNs
>
>
> Cheers
> Jan Finis
>
