[jira] [Created] (PARQUET-2356) Fix typo in DeltaBinaryPackingValuesWriter

2023-09-29 Thread Xuwei Fu (Jira)
Xuwei Fu created PARQUET-2356:
-

 Summary: Fix typo in DeltaBinaryPackingValuesWriter
 Key: PARQUET-2356
 URL: https://issues.apache.org/jira/browse/PARQUET-2356
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.13.1
Reporter: Xuwei Fu
Assignee: Xuwei Fu
 Fix For: 1.13.1


DeltaBinaryPackingValuesWriter has a typo: `bitwith` should be `bitwidth`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-14 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732652#comment-17732652
 ] 

Xuwei Fu commented on PARQUET-2222:
---

Yes, this answers my question. I think Arrow's Parquet C++ is the same on this. In 
C++, users can write the RLE encoding, but I only introduced it a few months ago, so 
users probably haven't used it and should not use it in the v1 format.

As for RL/DL, I guess RLE is used.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730992#comment-17730992
 ] 

Xuwei Fu commented on PARQUET-2222:
---

[~gszadovszky] Hi Gabor, maybe I was being misleading.

1. RLE for Boolean was only implemented a few months ago, by me. I guess the legacy 
writer cannot write RLE.
2. For RL/DL, in parquet-cpp, only RLE can be used.

In arrow-rs, I found the code below
{code}
/// RLE/Bit-Packing hybrid encoding for values.
/// Currently is used only for data pages v2 and supports boolean types.
pub struct RleValueEncoder<T: DataType> {
    // Buffer with raw values that we collect,
    // when flushing buffer they are encoded using RLE encoder
    encoder: Option<RleEncoder>,
    _phantom: PhantomData<T>,
}
{code}
I guess it only supports page v2 and format v2. [~tustvold], am I right?
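For reference, here is a minimal sketch (mine, not parquet-mr's or arrow-rs's actual 
writer) of what a single RLE run looks like on the wire for booleans per 
Encodings.md: the header is the ULEB128 varint of (run length << 1), with the low 
bit cleared to mark an RLE run, and the repeated value takes ceil(bit-width / 8) = 1 
byte:

{code}
import java.io.ByteArrayOutputStream;

// Sketch only: emit one RLE run of a boolean value using the hybrid
// encoding from Encodings.md. Header = varint(runLength << 1); a cleared
// low bit marks an RLE run (a set bit would mark a bit-packed run).
static void writeBooleanRleRun(ByteArrayOutputStream out, boolean value, int runLength) {
  long header = ((long) runLength) << 1;   // low bit 0 -> RLE run
  while ((header & ~0x7FL) != 0) {         // ULEB128 varint encoding
    out.write((int) ((header & 0x7F) | 0x80));
    header >>>= 7;
  }
  out.write((int) header);
  out.write(value ? 1 : 0);                // repeated value, one byte
}
{code}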

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730983#comment-17730983
 ] 

Xuwei Fu commented on PARQUET-2222:
---

I think in cpp, as of 12.0.0, even for format V1 the writer can only write RLE 
for RL/DL.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701577#comment-17701577
 ] 

Xuwei Fu commented on PARQUET-2256:
---

[~gszadovszky] Yes, I'd like to. I think having compression in the standard doesn't 
mean we always need to compress. We can do it only when the original BloomFilter 
occupies a lot of space and compression can actually save a lot of it.
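To make that policy concrete, a hypothetical writer-side sketch (none of this is 
parquet-mr API; `bitset` is simply the serialized filter bytes, and Deflate stands 
in for whatever codec the format would allow):

{code}
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical policy: try to compress the Bloom filter bitset and keep
// the compressed form only if it is actually smaller; otherwise fall back
// to writing it uncompressed.
static byte[] maybeCompress(byte[] bitset) {
  Deflater deflater = new Deflater();
  deflater.setInput(bitset);
  deflater.finish();
  byte[] buf = new byte[bitset.length];
  int n = deflater.deflate(buf);
  boolean fits = deflater.finished();  // false -> output was not smaller
  deflater.end();
  return (fits && n < bitset.length) ? Arrays.copyOf(buf, n) : bitset;
}
{code}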

> Adding Compression for BloomFilter
> --
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.9.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>
> In current Parquet implementations, if the BloomFilter's ndv is not set, most 
> implementations will guess 1M as the ndv and use it with the fpp. So, if fpp is 
> 0.01, the BloomFilter size may grow to 2M for each column, which is really 
> huge. Should we support compression for BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-13 Thread Xuwei Fu (Jira)
Xuwei Fu created PARQUET-2256:
-

 Summary: Adding Compression for BloomFilter
 Key: PARQUET-2256
 URL: https://issues.apache.org/jira/browse/PARQUET-2256
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Affects Versions: format-2.9.0
Reporter: Xuwei Fu


In current Parquet implementations, if the BloomFilter's ndv is not set, most 
implementations will guess 1M as the ndv and use it with the fpp. So, if fpp is 
0.01, the BloomFilter size may grow to 2M for each column, which is really 
huge. Should we support compression for BloomFilter, like:

 

```
/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
+2: CompressionCodec COMPRESSION;
}
```
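For a sense of scale: the classic Bloom filter sizing formula m = -n·ln(p)/(ln 2)² 
already puts 1M ndv at fpp 0.01 around 1.2 MB, and if the implementation then rounds 
the bitset up (e.g. to a power of two), that lands near the 2M per column mentioned 
above. A back-of-the-envelope helper, used here only as an approximation of the 
split-block filter's real sizing:

{code}
// Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
// This approximates, but is not identical to, Parquet's split-block formula.
static long approxBloomFilterBits(long ndv, double fpp) {
  return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
}

// approxBloomFilterBits(1_000_000, 0.01) ~= 9.6e6 bits ~= 1.2 MB per column,
// before any rounding the implementation may apply.
{code}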



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2255) BloomFilter and floating point is ambiguous

2023-03-13 Thread Xuwei Fu (Jira)
Xuwei Fu created PARQUET-2255:
-

 Summary: BloomFilter and floating point is ambiguous
 Key: PARQUET-2255
 URL: https://issues.apache.org/jira/browse/PARQUET-2255
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Xuwei Fu
 Fix For: format-2.9.0


Currently, Parquet can use a BloomFilter for any physical type. However, when a 
BloomFilter is applied to floats:
 # What do +0 and -0 mean? Are they equal?
 # Should qNaN and sNaN be written to the BloomFilter? Are they equal?
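A small demo of why this is ambiguous for a filter that hashes raw bits; the 
canonicalization at the end is just one possible strategy, not something the spec 
mandates:

{code}
public class FloatKeyAmbiguity {
  // Assumed strategy: collapse -0.0 to +0.0 and every NaN bit pattern to
  // the canonical NaN before hashing, so equal values probe equal slots.
  static long canonicalBits(double d) {
    if (Double.isNaN(d)) return Double.doubleToLongBits(Double.NaN);
    return Double.doubleToLongBits(d + 0.0);  // -0.0 + 0.0 == +0.0
  }

  public static void main(String[] args) {
    System.out.println(0.0 == -0.0);                     // true
    System.out.println(Double.doubleToRawLongBits(0.0)
        == Double.doubleToRawLongBits(-0.0));            // false: raw bits differ
    System.out.println(canonicalBits(0.0) == canonicalBits(-0.0));  // true
  }
}
{code}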

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-27 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693874#comment-17693874
 ] 

Xuwei Fu commented on PARQUET-2222:
---

OK, I got it. Previously I found the `RLE` format requires a length here:

{code}
rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
{code}

So I thought the encoded data already carries a `length`, but actually, in data 
page v2's rep-levels and def-levels, the length is not included?
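In other words, a sketch of my understanding (not parquet-mr's actual reader): a v1 
page's levels stream carries its own 4-byte little-endian length prefix, while for a 
v2 page the lengths come from the repetition_levels_byte_length / 
definition_levels_byte_length fields of the page header:

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: slice the RLE-encoded levels out of the page buffer. v1 pages
// prefix the stream with its byte length; v2 pages do not, so the length
// must come from the DataPageHeaderV2 fields instead.
static ByteBuffer rleLevels(ByteBuffer page, boolean isV1, int v2LengthFromHeader) {
  int len = isV1
      ? page.order(ByteOrder.LITTLE_ENDIAN).getInt()  // v1: prefixed uint32
      : v2LengthFromHeader;                           // v2: from page header
  ByteBuffer levels = page.slice();
  levels.limit(len);
  return levels;
}
{code}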

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-26 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693798#comment-17693798
 ] 

Xuwei Fu commented on PARQUET-2222:
---

I don't understand. Isn't the length part of the encoding in the spec?

And it seems that DataPageV2 in parquet-mr is not used in production?

 

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691257#comment-17691257
 ] 

Xuwei Fu commented on PARQUET-2249:
---

I guess maybe we can take a look at:
 # [https://github.com/apache/arrow-rs/issues/264]
 # https://issues.apache.org/jira/browse/PARQUET-1222

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691256#comment-17691256
 ] 

Xuwei Fu commented on PARQUET-2249:
---

I guess NaN is not always treated as larger than all other values:
 # Postgres, DB2 and Oracle put NULL higher than any other value.
 # MSSQL and MySQL go the other way.

SQL may have `NULLS FIRST` for this case. Still, I think your idea is OK.
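A quick check of the Java behavior the Iceberg spec refers to (note that Java 
collapses all NaN bit patterns into one value that sorts last, so the {{-NaN}} 
endpoint isn't actually distinguished):

{code}
import java.util.Arrays;

public class NanOrdering {
  public static void main(String[] args) {
    Double[] vals = {Double.NaN, 0.0, -0.0,
        Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY};
    Arrays.sort(vals);  // boxed Double uses the total order of Double.compare
    System.out.println(Arrays.toString(vals));
    // prints: [-Infinity, -0.0, 0.0, Infinity, NaN]
  }
}
{code}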

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691179#comment-17691179
 ] 

Xuwei Fu edited comment on PARQUET-2249 at 2/20/23 1:30 PM:


The problem seems to be that floating point is so widely used, yet it has only a 
"partial order". It seems that Iceberg provides NaN counts, and its min-max is 
unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 ([https://ieeexplore.ieee.org/document/8766229]) and the C++ standard 
support a "totalOrder", but regarding min/max as a total order seems strange to me, 
so min-max as `byte[0]` is weird for me. I think the Iceberg style looks great.

If I'm wrong, please correct me.


was (Author: JIRAUSER294084):
It seems that Iceberg provides NaN counts, and its min-max is unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 (https://ieeexplore.ieee.org/document/8766229) and the C++ standard 
support a "totalOrder", but regarding min/max as a total order seems strange to me, 
so min-max as `byte[0]` is weird for me. I think the Iceberg style looks great.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.

[jira] [Commented] (PARQUET-2249) Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs

2023-02-20 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691179#comment-17691179
 ] 

Xuwei Fu commented on PARQUET-2249:
---

It seems that Iceberg provides NaN counts, and its min-max is unrelated to NaN.

For sorting, Iceberg requires that:

> Sorting floating-point numbers should produce the following behavior: 
> {{-NaN}} < {{-Infinity}} < {{-value}} < {{-0}} < {{0}} < {{value}} < 
> {{Infinity}} < {{NaN}}. This aligns with the implementation of Java 
> floating-point types comparisons.

I think (1) is bad, because NaN is never equal to NULL.

IEEE 754 (https://ieeexplore.ieee.org/document/8766229) and the C++ standard 
support a "totalOrder", but regarding min/max as a total order seems strange to me, 
so min-max as `byte[0]` is weird for me. I think the Iceberg style looks great.

> Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs
> ---
>
> Key: PARQUET-2249
> URL: https://issues.apache.org/jira/browse/PARQUET-2249
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Jan Finis
>Priority: Major
>
> Currently, the specification of {{ColumnIndex}} in {{parquet.thrift}} is 
> inconsistent, leading to cases where it is impossible to create a parquet 
> file that is conforming to the spec.
> The problem is with double/float columns if a page contains only NaN values. 
> The spec mentions that NaN values should not be included in min/max bounds, 
> so a page consisting of only NaN values has no defined min/max bound. To 
> quote the spec:
> {noformat}
>    *     When writing statistics the following rules should be followed:
>    *     - NaNs should not be written to min or max statistics 
> fields.{noformat}
> However, the comments in the ColumnIndex on the null_pages member states the 
> following:
> {noformat}
> struct ColumnIndex {
>   /**
>    * A list of Boolean values to determine the validity of the corresponding
>    * min and max values. If true, a page contains only null values, and 
> writers
>    * have to set the corresponding entries in min_values and max_values to
>    * byte[0], so that all lists have the same length. If false, the
>    * corresponding entries in min_values and max_values must be valid.
>    */
>   1: required list<bool> null_pages{noformat}
> For a page with only NaNs, we now have a problem. The page definitely does 
> *not* only contain null values, so {{null_pages}} should be {{false}} for 
> this page. However, in this case the spec requires valid min/max values in 
> {{min_values}} and {{max_values}} for this page. As the only value in the 
> page is NaN, the only valid min/max value we could enter here is NaN, but as 
> mentioned before, NaNs should never be written to min/max values.
> Thus, no writer can currently create a parquet file that conforms to this 
> specification as soon as there is an only-NaN column and column indexes are to 
> be written.
> I see three possible solutions:
> 1. A page consisting only of NaNs (or a mixture of NaNs and nulls) has its 
> null_pages entry set to {*}true{*}.
> 2. A page consisting of only NaNs (or a mixture of NaNs and nulls) has 
> {{byte[0]}} as min/max, even though the null_pages entry is set to 
> {*}false{*}.
> 3. A page consisting of only NaNs (or a mixture of NaNs and nulls) does have 
> NaN as min & max in the column index.
> None of the solutions is perfect. But I guess solution 3. is the best of 
> them. It gives us valid min/max bounds, makes null_pages compatible with 
> this, and gives us a way to determine only-NaN pages (min=max=NaN).
> As a general note: I would say that it is a shortcoming that Parquet doesn't 
> track NaN counts. E.g., Iceberg does track NaN counts and therefore doesn't 
> have this inconsistency. In a future version, NaN counts could be introduced, 
> but that doesn't help for backward compatibility, so we do need a solution 
> for now.
> Any of the solutions is better than the current situation where engines 
> writing such a page cannot write a conforming parquet file and will randomly 
> pick any of the solutions.
> Thus, my suggestion would be to update parquet.thrift to use solution 3. 
> I.e., rewrite the comments saying that NaNs shouldn't be included in min/max 
> bounds by adding a clause stating that "if a page contains only NaNs or a 
> mixture of NaNs and NULLs, then NaN should be written as min & max".
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-08 Thread Xuwei Fu (Jira)
Xuwei Fu created PARQUET-2241:
-

 Summary: ByteStreamSplitDecoder broken in presence of nulls
 Key: PARQUET-2241
 URL: https://issues.apache.org/jira/browse/PARQUET-2241
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Affects Versions: format-2.8.0
Reporter: Xuwei Fu
 Fix For: format-2.10.0


 
This problem is shown in this issue: 
[https://github.com/apache/arrow/issues/15173]
Let me talk about it briefly:
* The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, 
but uses "num_values" as the stride in BYTE_STREAM_SPLIT.
* When decoding, for DATA_PAGE_V2 the decoder can know the num_values and num_nulls 
of the page; however, in DATA_PAGE_V1, without statistics, we have to read the 
def-levels and rep-levels to get the real number of values. And without the number 
of values, we aren't able to decode BYTE_STREAM_SPLIT correctly.
 
The bug-reproducing code is in the issue.
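To see why the count matters, here is a sketch of the decode side (an illustration, 
not the Arrow C++ code), assuming the little-endian byte layout Parquet uses for 
floats: byte b of value i lives at offset b * numValues + i, so without numValues 
(the non-null count) the decoder cannot even locate the streams:

{code}
// Sketch of BYTE_STREAM_SPLIT decoding for FLOAT. The page payload is four
// concatenated byte streams, each numValues bytes long; numValues is the
// non-null value count, which a v1 page only reveals after decoding levels.
static float[] decodeByteStreamSplit(byte[] data, int numValues) {
  float[] out = new float[numValues];
  for (int i = 0; i < numValues; i++) {
    int bits = 0;
    for (int b = 0; b < 4; b++) {
      bits |= (data[b * numValues + i] & 0xFF) << (8 * b);
    }
    out[i] = Float.intBitsToFloat(bits);
  }
  return out;
}
{code}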



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678055#comment-17678055
 ] 

Xuwei Fu commented on PARQUET-1622:
---

[~gszadovszky] [~martinradev] 

Hi all, I met a problem here: [https://github.com/apache/arrow/issues/15173]

Would you mind taking a look? It seems we don't have a "non-null value count" here.

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transform is "byte stream splitting", which creates K 
> streams of length N where K is the number of bytes in the data type (4 for 
> floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
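The transform itself is tiny; a sketch for floats (again assuming the little-endian 
layout of the PLAIN representation), which shows why similar bytes, e.g. the 
exponent bytes, end up adjacent and compress better afterwards:

{code}
// Byte stream splitting for FLOAT: scatter byte b of each value into
// stream b, then concatenate the K = 4 streams. No compression happens
// here; the payoff comes when a general-purpose codec runs afterwards.
static byte[] byteStreamSplit(float[] values) {
  int n = values.length;
  byte[] out = new byte[4 * n];
  for (int i = 0; i < n; i++) {
    int bits = Float.floatToIntBits(values[i]);
    for (int b = 0; b < 4; b++) {
      out[b * n + i] = (byte) (bits >>> (8 * b));
    }
  }
  return out;
}
{code}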



--
This message was sent by Atlassian Jira
(v8.20.10#820010)