Dear contributors,

My PR has now gathered comments for a week and the gist of all open issues
is the question of how to encode pages/column chunks that contain only
NaNs. There are different suggestions and I don't see one common favorite
yet.

I have outlined three alternatives for how we can handle these, and I want
us to reach a conclusion here so I can update my PR accordingly and move on
with it. As this is my first contribution to parquet, I don't know the
decision process here. Do we vote? Is there a single decision maker or a
group of decision makers? *Please let me know how to come to a conclusion
here; what are the next steps?*

For reference, here are the three alternatives I pointed out. You can find
a detailed description of their pros and cons in my comment:
https://github.com/apache/parquet-format/pull/196#issuecomment-1486416762

1. My initial proposal, i.e., encoding only-NaN pages by min=max=NaN.
2. Adding `num_values` to the ColumnIndex, to make it symmetric with
Statistics in pages & `ColumnMetaData` and to enable the computation
`num_values - null_count - nan_count == 0`
3. Adding a `nan_pages` bool list to the ColumnIndex, which indicates for
each page whether it contains only NaNs
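
To make the comparison concrete, here is a rough Python sketch of how a reader could detect an only-NaN page under each of the three alternatives. It is only an illustration: `PageStats` and the three helper functions are hypothetical names, not the actual Thrift structs or any parquet API, and the `nan_count > 0` guard in alternative 2 is my own assumption to tell only-NaN pages apart from all-null pages.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical per-page summary; fields mirror the names used in the
    proposals, not the real parquet-format structs."""
    num_values: int
    null_count: int
    nan_count: int
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    nan_page: Optional[bool] = None  # one entry of the proposed `nan_pages` list


def only_nan_alt1(page: PageStats) -> bool:
    # Alternative 1: an only-NaN page is written with min = max = NaN.
    if page.min_value is None or page.max_value is None:
        return False
    return math.isnan(page.min_value) and math.isnan(page.max_value)


def only_nan_alt2(page: PageStats) -> bool:
    # Alternative 2: with `num_values` available, check that no value is
    # left once nulls and NaNs are subtracted.
    # Assumption: `nan_count > 0` excludes pages that contain only nulls.
    remaining = page.num_values - page.null_count - page.nan_count
    return page.nan_count > 0 and remaining == 0


def only_nan_alt3(page: PageStats) -> bool:
    # Alternative 3: read the per-page flag directly.
    return bool(page.nan_page)


# A page with 4 NaNs and 1 null, encoded so that all three alternatives apply.
nan_only = PageStats(num_values=5, null_count=1, nan_count=4,
                     min_value=float("nan"), max_value=float("nan"),
                     nan_page=True)
# A page that mixes NaNs with regular values.
mixed = PageStats(num_values=5, null_count=0, nan_count=2,
                  min_value=1.0, max_value=3.0, nan_page=False)
```

All three checks agree on these two example pages; the alternatives differ in what has to be stored and in how older readers interpret it, which is what the linked comment discusses.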


Cheers
Jan Finis
