[
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788926#comment-17788926
]
ASF GitHub Bot commented on PARQUET-2249:
-----------------------------------------
etseidl commented on code in PR #221:
URL: https://github.com/apache/parquet-format/pull/221#discussion_r1402657927
##########
src/main/thrift/parquet.thrift:
##########
@@ -288,7 +288,7 @@ struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BINARY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
(see LogicalTypes.md)
Review Comment:
'must encode' or 'must be encoded as'?
##########
src/main/thrift/parquet.thrift:
##########
@@ -962,15 +967,19 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
- * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
+ * (*) Because the precise sorting order is ambiguous for floating
+ * point types due to underspecified handling of NaN and -0/+0,
+ * it is recommended that writers use IEEE_754_TOTAL_ORDER
+ * for these types.
+ *
+ * If TYPE_ORDER is used for floating point types, then the following
Review Comment:
This line threw me (at least while using my phone 😉...on my computer I can
see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used
for floating..." or "If this type-defined ordering..."
##########
src/main/thrift/parquet.thrift:
##########
@@ -962,15 +967,19 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
- * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
+ * (*) Because the precise sorting order is ambiguous for floating
+ * point types due to underspecified handling of NaN and -0/+0,
+ * it is recommended that writers use IEEE_754_TOTAL_ORDER
+ * for these types.
+ *
+ * If TYPE_ORDER is used for floating point types, then the following
Review Comment:
This line threw me (at least while using my phone 😉...on my computer I can
see `TYPE_ORDER` below). Maybe this could instead say "If this ordering is used
for floating..." or "If this type-defined ordering..."
> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> -----------------------------------------------------------------------
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Jan Finis
> Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is
> inconsistent, leading to cases where it is impossible to create a parquet
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values.
> The spec mentions that NaN values should not be included in min/max bounds,
> so a page consisting of only NaN values has no defined min/max bound. To
> quote the spec:
> {noformat}
> Â Â * Â Â When writing statistics the following rules should be followed:
> Â Â * Â Â - NaNs should not be written to min or max statistics
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the
> following:
> {noformat}
> struct ColumnIndex {
> Â /**
> Â Â * A list of Boolean values to determine the validity of the corresponding
> Â Â * min and max values. If true, a page contains only null values, and
> writers
> Â Â * have to set the corresponding entries in min_values and max_values to
> Â Â * byte[0], so that all lists have the same length. If false, the
> Â Â * corresponding entries in min_values and max_values must be valid.
> Â Â */
> Â 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitly does
> *not* only contain null values, so {{null_pages}} should be {{false}} for
> this page. However, in this case the spec requires valid min/max values in
> {{min_values}} and {{max_values}} for this page. As the only value in the
> page is NaN, the only valid min/max value we could enter here is NaN, but as
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this
> specification as soon as there is a only-NaN column and column indexes are to
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has it's
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has
> {{byte[0]}} as min/max, even though the null_pages entry is set to
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of
> them. It gives us valid min/max bounds, makes null_pages compatible with
> this, and gives us a way to determine only-Nan pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't
> have this inconsistency. In a future version, NaN counts could be introduced,
> but that doesn't help for backward compatibility, so we do need a solution
> for now.
> Any of the solutions is better than the current situation where engines
> writing such a page cannot write a conforming parquet file and will randomly
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3.
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max
> bounds by adding a clause stating that "if a page contains only NaNs or a
> mixture of NaNs and NULLs, then NaN should be written as min & max".
> Â
--
This message was sent by Atlassian Jira
(v8.20.10#820010)