[jira] [Updated] (PARQUET-2345) The Parquet Spec doesn't specify whether multiple columns are allowed to have the same name.

2023-09-08 Thread Jan Finis (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Finis updated PARQUET-2345:
---
Description: The Parquet format specification doesn't say whether a Parquet 
file that has columns with the same name (in the same group node, so really 
exactly the same name) is valid. I.e., say I have a Parquet file with two 
columns, both called x. Is this file a valid Parquet file?  (was: The 
parquet format specification doesn't say whether a Parquet file having columns 
with the same name (in the same struct, so really exactly the same name) is 
valid. I.e., say I have a Parquet file with two columns. Both are called x. Is 
this file a valid Parquet file?)

> The Parquet Spec doesn't specify whether multiple columns are allowed to have 
> the same name.
> 
>
> Key: PARQUET-2345
> URL: https://issues.apache.org/jira/browse/PARQUET-2345
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: 1.13.1
>Reporter: Jan Finis
>Priority: Minor
>
> The parquet format specification doesn't say whether a Parquet file having 
> columns with the same name (in the same group node, so really exactly the 
> same name) is valid. I.e., say I have a Parquet file with two columns. Both 
> are called x. Is this file a valid Parquet file?
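
For illustration, here is a minimal sketch of how such a schema could be built 
with parquet-mr's schema parser (the class name and schema string are made up 
for this example); whether a schema with two sibling fields named x should be 
accepted at all is exactly the question this issue raises.
{noformat}
// Illustrative sketch only (class name and schema string are hypothetical):
// build a message type whose root group contains two fields both named "x".
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DuplicateNameProbe {
  public static void main(String[] args) {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc { required int32 x; required int32 x; }");
    // Whether writers/readers should reject this schema is what the spec
    // currently leaves unspecified.
    System.out.println(schema);
  }
}{noformat}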



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2345) The Parquet Spec doesn't specify whether multiple columns are allowed to have the same name.

2023-09-08 Thread Jan Finis (Jira)
Jan Finis created PARQUET-2345:
--

 Summary: The Parquet Spec doesn't specify whether multiple columns 
are allowed to have the same name.
 Key: PARQUET-2345
 URL: https://issues.apache.org/jira/browse/PARQUET-2345
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Affects Versions: 1.13.1
Reporter: Jan Finis


The parquet format specification doesn't say whether a Parquet file having 
columns with the same name (in the same struct, so really exactly the same 
name) is valid. I.e., say I have a Parquet file with two columns. Both are 
called x. Is this file a valid Parquet file?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-03-22 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703624#comment-17703624
 ] 

Jan Finis commented on PARQUET-2249:


[~wgtmac] [~mwish] I have created the pull request. Have a look at it :).

Should I advertise the pull request somewhere else?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-21 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691525#comment-17691525
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/21/23 11:19 AM:
--

[~wgtmac] my proposal would be backward compatible. It would only add optional 
fields, so that legacy readers that don't implement the logic could still read 
the file, and legacy writers could keep writing files without these fields and 
without any code changes.

[~mwish] I agree. Treating NaN as larger or smaller than any other value 
doesn't fit the semantics of all engines. Therefore, my fix would support both 
of those semantics (and a third one, where NaN is treated as neither larger 
nor smaller).

I guess I'll just create a pull request; we can then start the discussion based 
on that. Does that sound good?


was (Author: jfinis):
[~wgtmac] my proposal would be backward compatible. It would only add optional 
fields, so that legacy readers that don't implement the logic could still read 
the file, and legacy writers could keep writing files without these fields and 
without any code changes.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-21 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691525#comment-17691525
 ] 

Jan Finis commented on PARQUET-2249:


[~wgtmac] my proposal would be backward compatible. It would only add optional 
fields, so that legacy readers that don't implement the logic could still read 
the file, and legacy writers could keep writing files without these fields and 
without any code changes.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:50 PM:
-

I would be willing to propose a fixing commit for this, but I'm not yet part of 
the ASF process, so I don't know exactly how to get that going. I could start a 
PR on the parquet-format GitHub repo. Is that the right place to suggest 
changes to the spec/parquet.thrift?

Side note: NaN being larger than all other values is also:
 * The semantics that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found Parquet 
files that have NaN written as max_value in row group statistics.

However, treating NaN as something separate by maintaining NaN counts would 
allow engines to incorporate whatever NaN semantics they wish to use.
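
For reference (a standalone illustration, not parquet-mr code), Java's own 
total order over doubles already treats NaN as larger than every other value, 
which matches the behavior described above:
{noformat}
// Standalone illustration: Double.compare and Arrays.sort place NaN above
// every other double value, i.e. NaN behaves as the maximum.
import java.util.Arrays;

public class NaNOrdering {
  public static void main(String[] args) {
    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // true
    System.out.println(Double.compare(Double.NaN, 42.0) > 0);                     // true

    double[] page = {3.0, Double.NaN, -1.0};
    Arrays.sort(page); // NaN sorts last: [-1.0, 3.0, NaN]
    System.out.println(Arrays.toString(page));
  }
}{noformat}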


was (Author: jfinis):
I would be willing to propose a fixing commit for this, but I'm not yet part of 
the ASF process, so I don't know exactly how to get that going. I could start a 
PR on the parquet-format GitHub repo. Is that the right place to suggest 
changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
 * The semantic that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> 

[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:48 PM:
-

I would be willing to propose a fixing commit for this, but I'm not yet part of 
the ASF process, so I don't know exactly how to get that going. I could start a 
PR on the parquet-format GitHub repo. Is that the right place to suggest 
changes to the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
 * The semantic that SQL has for NaN
 * What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.


was (Author: jfinis):
I would be willing to suggest a fix for this, but I'm not yet part of the ASF 
process, so I don't know exactly how to get that going. I could start a PR on 
the parquet-format GitHub repo. Is that the right place to suggest changes to 
the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
* The semantic that SQL has for NaN
* What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".

[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 1:48 PM:
-

I would be willing to suggest a fix for this, but I'm not yet part of the ASF 
process, so I don't know exactly how to get that going. I could start a PR on 
the parquet-format GitHub repo. Is that the right place to suggest changes to 
the spec/parquet.thrift?

Note that NaN being larger than all other values is also:
* The semantic that SQL has for NaN
* What parquet-mr seems to be doing right now. At least, I have found parquet 
files that have NaN written as max_value in row group statistics.


was (Author: jfinis):
I would be willing to suggest a fix for this, but I'm not yet part of the ASF 
process, so I don't know exactly how to get that going. I could start a PR on 
the parquet-mr GitHub repo. Is that the right place to suggest changes to the 
spec/parquet.thrift?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691184#comment-17691184
 ] 

Jan Finis commented on PARQUET-2249:


I would be willing to suggest a fix for this, but I'm not yet part of the ASF 
process, so I don't know exactly how to get that going. I could start a PR on 
the parquet-mr GitHub repo. Is that the right place to suggest changes to the 
spec/parquet.thrift?

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691108#comment-17691108
 ] 

Jan Finis edited comment on PARQUET-2249 at 2/20/23 9:55 AM:
-

[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for the pages in the same column chunk 
that don't contain NaNs. It would be a shame if a single NaN made a whole 
column chunk non-indexable. It might be a good interim solution, but it's not 
too satisfying.

The whole topic of NaN handling in Parquet currently seems lacking and somewhat 
inconsistent, making columns with NaNs mostly unusable for scan pruning. Maybe 
the semantics should be redefined in a new version, so that columns with NaNs 
can be used for indexing like any other column. As mentioned, Iceberg has 
solved this problem by tracking NaN counts.
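
If solution 3 from the issue description were adopted, a reader could keep 
pruning on such pages by special-casing min = max = NaN. The sketch below is 
purely hypothetical (the helper is not an existing parquet-mr API) and shows 
the idea for a predicate of the form x < threshold.
{noformat}
// Hypothetical reader-side sketch (not an existing parquet-mr API): under
// "solution 3", min = max = NaN marks a page containing only NaNs (and nulls).
public class PagePruningSketch {

  /** Could any value in the page satisfy x < threshold? */
  static boolean pageCanMatchLessThan(double min, double max,
                                      boolean nullPage, double threshold) {
    if (nullPage) {
      return false; // page holds only nulls; the predicate can never match
    }
    if (Double.isNaN(min) && Double.isNaN(max)) {
      return false; // only NaNs (and nulls); NaN < threshold is always false
    }
    return min < threshold; // ordinary page: prune on the real lower bound
  }

  public static void main(String[] args) {
    System.out.println(pageCanMatchLessThan(Double.NaN, Double.NaN, false, 10.0)); // false
    System.out.println(pageCanMatchLessThan(1.5, 7.0, false, 10.0));               // true
  }
}{noformat}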


was (Author: jfinis):
[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for the pages in the same column chunk 
that don't contain NaNs. It would be a shame if a single NaN made a whole 
column chunk non-indexable. It might be a good interim solution, but it's not 
too satisfying.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Jan Finis (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691108#comment-17691108
 ] 

Jan Finis commented on PARQUET-2249:


[~wgtmac] True, not writing a column index in this case is also a solution. 
Note though that this is a pessimization for the pages in the same column chunk 
that don't contain NaNs. It would be a shame if a single NaN made a whole 
column chunk non-indexable. It might be a good interim solution, but it's not 
too satisfying.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comment in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
> 1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-19 Thread Jan Finis (Jira)
Jan Finis created PARQUET-2249:
--

 Summary: Parquet spec (parquet.thrift) is inconsistent w.r.t. 
ColumnIndex + NaNs
 Key: PARQUET-2249
 URL: https://issues.apache.org/jira/browse/PARQUET-2249
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Jan Finis


Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
inconsistent, leading to cases where it is impossible to create a parquet file 
that is conforming to the spec.

The problem is with double/float columns if a page contains only NaN values. 
The spec mentions that NaN values should not be included in min/max bounds, so 
a page consisting of only NaN values has no defined min/max bound. To quote the 
spec:


{noformat}
   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.{noformat}

However, the comment in the ColumnIndex on the null_pages member states the 
following:


{noformat}
struct ColumnIndex {
  /**
   * A list of Boolean values to determine the validity of the corresponding
   * min and max values. If true, a page contains only null values, and writers
   * have to set the corresponding entries in min_values and max_values to
   * byte[0], so that all lists have the same length. If false, the
   * corresponding entries in min_values and max_values must be valid.
   */
  1: required list<bool> null_pages{noformat}

For a page with only NaNs, we now have a problem. The page definitely does *not* 
only contain null values, so {{null_pages}} should be {{false}} for this page. 
However, in this case the spec requires valid min/max values in {{min_values}} 
and {{max_values}} for this page. As the only value in the page is NaN, the 
only valid min/max value we could enter here is NaN, but as mentioned before, 
NaNs should never be written to min/max values.

Thus, no writer can currently create a parquet file that conforms to this 
specification as soon as there is an only-NaN column and column indexes are to 
be written.

I see three possible solutions:
1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
null_pages entry set to {*}true{*}.
2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
{{byte[0]}} as min/max, even though the null_pages entry is set to {*}false{*}.
3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
NaN as min & max in the column index.

None of the solutions is perfect. But I guess solution 3. is the best of them. 
It gives us valid min/max bounds, makes null_pages compatible with this, and 
gives us a way to determine only-NaN pages (min=max=NaN).

As a general note: I would say that it is a shortcoming that Parquet doesn't 
track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
have this inconsistency. In a future version, NaN counts could be introduced, 
but that doesn't help for backward compatibility, so we do need a solution for 
now.

Any of the solutions is better than the current situation where engines writing 
such a page cannot write a conforming parquet file and will randomly pick any 
of the solutions.

Thus, my suggestion would be to update parquet.thrift to use solution 3. I.e., 
rewrite the comments saying that NaNs shouldn't be included in min/max bounds 
by adding a clause stating that "if a page contains only NaNs or a mixture of 
NaNs and NULLs, then NaN should be written as min & max".
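
To make solution 3 concrete, here is a sketch of how a writer might accumulate 
per-page statistics: NaNs never enter min/max, a NaN count is tracked on the 
side (in the spirit of what Iceberg does), and an all-NaN page falls back to 
min = max = NaN. All names are illustrative; this is not parquet-mr code.
{noformat}
// Illustrative writer-side sketch (names are made up, not parquet-mr APIs).
public class PageStatsSketch {
  private double min = Double.POSITIVE_INFINITY;
  private double max = Double.NEGATIVE_INFINITY;
  private long nanCount, nullCount, valueCount;

  void add(Double v) {
    if (v == null) { nullCount++; return; }
    valueCount++;
    if (Double.isNaN(v)) { nanCount++; return; } // NaNs never enter min/max
    min = Math.min(min, v);
    max = Math.max(max, v);
  }

  boolean onlyNulls() { return valueCount == 0; }          // -> null_pages = true
  boolean onlyNaNs()  { return valueCount > 0 && nanCount == valueCount; }

  // Under "solution 3", an all-NaN page reports NaN as both bounds.
  double minForColumnIndex() { return onlyNaNs() ? Double.NaN : min; }
  double maxForColumnIndex() { return onlyNaNs() ? Double.NaN : max; }
}{noformat}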

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2238) Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding

2023-02-04 Thread Jan Finis (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Finis updated PARQUET-2238:
---
Description: 
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to 
no longer write this encoding for FIXED_LEN_BYTE_ARRAY.

I guess changing the spec is more prudent, given that 
a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY
and
b) there might already be countless files written with this encoding / type 
combination.

  was:
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to 
no longer write this encoding for FIXED_LEN_BYTE_ARRAY.


> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> -
>
> Key: PARQUET-2238
> URL: https://issues.apache.org/jira/browse/PARQUET-2238
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Jan Finis
>Priority: Minor
>
> The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
> for the physical type 
> BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
>  Yet, [parquet-mr also uses it to encode 
> FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
> supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed 
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.
> I guess changing the spec is more prudent, given that 
> a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY
> and
> b) there might already be countless files written with this encoding / type 
> combination.
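
As a side note on point a): DELTA_BYTE_ARRAY stores, for each value, the length 
of the prefix it shares with the previous value plus the remaining suffix, and 
that works just as well when every value has the same fixed length (e.g. 
16-byte UUIDs or fixed-size decimals). The sketch below only illustrates the 
prefix computation; it is not parquet-mr's writer.
{noformat}
// Standalone sketch (not parquet-mr's writer): the core of DELTA_BYTE_ARRAY is
// "shared-prefix length + suffix" per value, which applies to fixed-length
// values just as well as to variable-length ones.
import java.nio.charset.StandardCharsets;

public class PrefixDeltaSketch {
  static int sharedPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length), i = 0;
    while (i < n && a[i] == b[i]) i++;
    return i;
  }

  public static void main(String[] args) {
    // Three values of identical length (as FIXED_LEN_BYTE_ARRAY would require).
    byte[][] values = {
        "2023-02-04-a".getBytes(StandardCharsets.UTF_8),
        "2023-02-04-b".getBytes(StandardCharsets.UTF_8),
        "2023-02-05-a".getBytes(StandardCharsets.UTF_8),
    };
    byte[] prev = new byte[0];
    for (byte[] v : values) {
      int p = sharedPrefixLength(prev, v);
      System.out.println("prefixLength=" + p + ", suffixLength=" + (v.length - p));
      prev = v;
    }
  }
}{noformat}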



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2238) Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding

2023-02-04 Thread Jan Finis (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Finis updated PARQUET-2238:
---
Description: 
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed to 
no longer write this encoding for FIXED_LEN_BYTE_ARRAY.

  was:
The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LENGTH_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated or the code should be changed to no 
longer write this encoding for FIXED_LENGTH_BYTE_ARRAY.


> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> -
>
> Key: PARQUET-2238
> URL: https://issues.apache.org/jira/browse/PARQUET-2238
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Jan Finis
>Priority: Minor
>
> The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
> for the physical type 
> BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
>  Yet, [parquet-mr also uses it to encode 
> FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
> supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed 
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2238) Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding

2023-02-04 Thread Jan Finis (Jira)
Jan Finis created PARQUET-2238:
--

 Summary: Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
 Key: PARQUET-2238
 URL: https://issues.apache.org/jira/browse/PARQUET-2238
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format, parquet-mr
Reporter: Jan Finis


The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
for the physical type 
BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
 Yet, [parquet-mr also uses it to encode 
FIXED_LENGTH_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].

So, I guess the spec should be updated or the code should be changed to no 
longer write this encoding for FIXED_LENGTH_BYTE_ARRAY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)