Jorge Leitão created PARQUET-2029:
-------------------------------------

             Summary: Delta-encoding seems incorrect
                 Key: PARQUET-2029
                 URL: https://issues.apache.org/jira/browse/PARQUET-2029
             Project: Parquet
          Issue Type: Bug
            Reporter: Jorge Leitão


This assumes that spark==3.1.1 uses the java reference implementation. Consider

{code:java}
import pyspark.sql
from pyspark.sql.types import *

spark = (
    pyspark.sql.SparkSession.builder.config(
        "spark.sql.parquet.compression.codec",
        "uncompressed",
    )
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

schema = StructType([StructField("label", IntegerType(), False)])

df = spark.createDataFrame([[1], [2], [3], [4], [5]], schema).coalesce(1)

df.write.parquet("bla.parquet", mode="overwrite")
{code}

This has no def or rep levels, is encoded with DELTA_BINARY_PACKED = 5, and 
results in the following page buffer:

{code}
buffer: [
            128,
            1,
            4,
            5,
            2,
            2,
            0,
            0,
            0,
            0,
        ],
{code}

Let's use notation
{code}
(encoded <=u> decoded via uleb128)
(encoded <=z> decoded via uleb128 zig-zag)
{code}

Let's decode the above buffer manually according [to the 
specification|https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5].

{code}
byte 0: 128 <=u> 128, the <block size in values>
byte 1: 1 <=u> 1, the <number of miniblocks in a block>
byte 2: 4 <=u> 4, the <total value count>
byte 3: 5 <=z> -3, the <first value>
byte 4: 2 <=z> 1, the <min delta>
byte 5: 2 <=u> 2, the bit width of the (only) miniblock
{code}

I think that this is incorrect: the first value should be 1, not -3; the total 
count should be 5, not 4, the bit width of the miniblock should be 0, not 2.

Note that if byte 2 was removed, everything would be correct:

{code}
byte 0: 128 <=u> 128, the <block size in values>
byte 1: 1 <=u> 1, the <number of miniblocks in a block>
byte 2: 5 <=u> 5, the <total value count>
byte 3: 2 <=z> 1, the <first value>
byte 4: 2 <=z> 1, the <min delta>
byte 5: 0 <=u> 0, the bit width of the (only) miniblock
{code}

which corresponds to the delta-bitpack encoding of [1,2,3,4,5]: 

* 5 elements
* first value is 1
* the min delta is 1
* the relative differences are zero (bit_width=0) and thus do not require a 
buffer (writing 0 is also fine)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to