Jorge Leitão created PARQUET-2029:
-------------------------------------
Summary: Delta-encoding seems incorrect
Key: PARQUET-2029
URL: https://issues.apache.org/jira/browse/PARQUET-2029
Project: Parquet
Issue Type: Bug
Reporter: Jorge Leitão
This assumes that spark==3.1.1 uses the Java reference implementation. Consider:
{code:python}
import pyspark.sql
from pyspark.sql.types import IntegerType, StructField, StructType

# Disable compression and select the v2 writer.
spark = (
    pyspark.sql.SparkSession.builder.config(
        "spark.sql.parquet.compression.codec",
        "uncompressed",
    )
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

# A single non-nullable integer column holding the values 1..5.
schema = StructType([StructField("label", IntegerType(), False)])
df = spark.createDataFrame([[1], [2], [3], [4], [5]], schema).coalesce(1)
df.write.parquet("bla.parquet", mode="overwrite")
{code}
The resulting data page has no definition or repetition levels, is encoded with
DELTA_BINARY_PACKED (encoding id 5), and has the following page buffer:
{code}
buffer: [
128,
1,
4,
5,
2,
2,
0,
0,
0,
0,
],
{code}
Let's use the following notation:
{code}
(encoded <=u> decoded via uleb128)
(encoded <=z> decoded via uleb128 zig-zag)
{code}
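To make the two notations concrete, here is a minimal Python sketch of both decoders. The function names (`uleb128_decode`, `zigzag_decode`, `zigzag_uleb128_decode`) are hypothetical helpers written for this illustration, not part of any Parquet library; they follow the usual definitions of unsigned LEB128 and zig-zag encoding.

```python
def uleb128_decode(buf, pos=0):
    """Decode one unsigned LEB128 integer starting at buf[pos].

    Returns (value, next_pos). Each byte contributes its low 7 bits;
    a set high bit (0x80) means another byte follows.
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def zigzag_decode(n):
    """Map an unsigned zig-zag integer back to a signed one (0, -1, 1, -2, ...)."""
    return (n >> 1) ^ -(n & 1)


def zigzag_uleb128_decode(buf, pos=0):
    """Decode one zig-zag encoded unsigned LEB128 integer. Returns (value, next_pos)."""
    raw, pos = uleb128_decode(buf, pos)
    return zigzag_decode(raw), pos
```

For example, `zigzag_decode(5)` gives -3 and `zigzag_decode(2)` gives 1. Note that under this definition a byte of 128 or more has its continuation bit set and is not a complete value on its own: `uleb128_decode(bytes([128, 1]))` consumes both bytes and returns the single value 128. This may be worth checking against a byte-by-byte reading of the buffer.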
Let's decode the above buffer manually according to
[the specification|https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5]:
{code}
byte 0: 128 <=u> 128, the <block size in values>
byte 1: 1 <=u> 1, the <number of miniblocks in a block>
byte 2: 4 <=u> 4, the <total value count>
byte 3: 5 <=z> -3, the <first value>
byte 4: 2 <=z> 1, the <min delta>
byte 5: 2 <=u> 2, the bit width of the (only) miniblock
{code}
I think that this is incorrect: the first value should be 1, not -3; the total
count should be 5, not 4; and the bit width of the miniblock should be 0, not 2.
Note that if byte 2 (the value 4) were removed and the remaining bytes shifted
down by one position, everything would be correct:
{code}
byte 0: 128 <=u> 128, the <block size in values>
byte 1: 1 <=u> 1, the <number of miniblocks in a block>
byte 2: 5 <=u> 5, the <total value count>
byte 3: 2 <=z> 1, the <first value>
byte 4: 2 <=z> 1, the <min delta>
byte 5: 0 <=u> 0, the bit width of the (only) miniblock
{code}
which corresponds to the delta-bitpack encoding of [1,2,3,4,5]:
* 5 elements
* first value is 1
* the min delta is 1
* the relative differences are zero (bit_width=0) and thus do not require a
buffer (writing 0 is also fine)
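The expected fields listed above can be derived with a short Python sketch. `delta_summary` is a hypothetical helper written for this illustration; it is a simplification that treats the whole input as a single block with one miniblock and requires at least two values, so it only reproduces the logical fields, not the exact byte layout.

```python
def delta_summary(values):
    """Compute the logical DELTA_BINARY_PACKED fields for a short list of ints.

    Simplifying assumption: all deltas belong to one block and one miniblock.
    """
    first_value = values[0]
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # Deltas are stored relative to the minimum delta, so they are non-negative.
    relative = [d - min_delta for d in deltas]
    # Bits needed for the largest relative delta (0 when all are zero).
    bit_width = max(relative).bit_length()
    return {
        "count": len(values),
        "first_value": first_value,
        "min_delta": min_delta,
        "relative_deltas": relative,
        "bit_width": bit_width,
    }


summary = delta_summary([1, 2, 3, 4, 5])
```

For `[1, 2, 3, 4, 5]` this yields a count of 5, a first value of 1, a min delta of 1, and relative deltas that are all zero, so the bit width is 0.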