Re: Write a parquet file with delta encoding enable

Wes McKinney Mon, 23 Mar 2020 17:17:45 -0700

These encodings are not available for use in the Parquet C++ library
yet -- partially implemented but not thoroughly tested or exposed in
the public API -- so it's not possible to generate them from Python. I
don't know about Java; you may want to ask on the Parquet mailing list.


On Mon, Mar 23, 2020 at 2:30 AM Omega Gamage <om...@bigstream.co> wrote:
>
> I was trying to write a parquet file with delta encoding. This page
> <https://github.com/apache/parquet-format/blob/master/Encodings.md>, states
> that parquet supports three types of delta encodings:
>
>     (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).
>
> Since spark, pyspark, and pyarrow do not allow us to specify the encoding
> method, I was curious how one can write a file with delta encoding enabled.
>
> However, I found on the internet that if a column has the Timestamp
> type, parquet will use delta encoding. So I used the following code in
> *Scala* to create a parquet file, but the resulting encoding is not delta.
>
>
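>     // Imports needed outside spark-shell; assumes a SparkSession named `spark`.
>     import org.apache.spark.sql.functions.col
>     import spark.implicits._
>
>     val df = Seq(("2018-05-01"),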
>                 ("2018-05-02"),
>                 ("2018-05-03"),
>                 ("2018-05-04"),
>                 ("2018-05-05"),
>                 ("2018-05-06"),
>                 ("2018-05-07"),
>                 ("2018-05-08"),
>                 ("2018-05-09"),
>                 ("2018-05-10")
>             ).toDF("Id")
>     val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
>     val df3 = df2.withColumn("Date", (col("Id").cast("date")))
>
>     df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")
>
> parquet-tools shows the following information regarding the written parquet
> file.
>
> file schema: spark_schema
> --------------------------------------------------------------------------------
> Id:          OPTIONAL BINARY L:STRING R:0 D:1
> Timestamp:   OPTIONAL INT96 R:0 D:1
> Date:        OPTIONAL INT32 L:DATE R:0 D:1
>
> row group 1: RC:31 TS:1100 OFFSET:4
> --------------------------------------------------------------------------------
> Id:           BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
> Timestamp:    INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
> Date:         INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
>
> As you can see, no column has used delta encoding.
>
> My questions are:
>
> 1) How can I write a parquet file with delta encoding? (If you can provide
> example code in Scala or Python, that would be great.)
> 2) How do I decide which delta encoding (DELTA_BINARY_PACKED,
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?
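
On the second question, the Encodings.md page linked above ties each delta
encoding to a physical type: DELTA_BINARY_PACKED applies to INT32 and INT64
columns, while DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY apply to
BYTE_ARRAY columns. The plain-Scala sketch below only illustrates the idea
behind two of them, using the dates from the example. It is not the on-disk
format (the real encodings add blocks, miniblocks, and bit packing), and the
names DeltaSketch, deltas, and prefixSuffix are made up for this illustration.

```scala
import java.time.LocalDate

// Hypothetical helper for illustration only; not part of Parquet, Spark, or Arrow.
object DeltaSketch {

  // Idea behind DELTA_BINARY_PACKED: keep the first value, then store each
  // later value as the difference from its predecessor.
  def deltas(values: Seq[Long]): Seq[Long] =
    (0L +: values).zip(values).map { case (prev, cur) => cur - prev }

  // Idea behind DELTA_BYTE_ARRAY: for each string, store how many leading
  // characters it shares with the previous string, plus the remaining suffix.
  // (The real encoding works on bytes; for ASCII strings the counts match.)
  def prefixSuffix(values: Seq[String]): Seq[(Int, String)] =
    ("" +: values).zip(values).map { case (prev, cur) =>
      val shared = prev.zip(cur).takeWhile { case (a, b) => a == b }.length
      (shared, cur.drop(shared))
    }

  def main(args: Array[String]): Unit = {
    val ids  = Seq("2018-05-01", "2018-05-02", "2018-05-03")
    // A DATE column is stored as INT32 days since the Unix epoch.
    val days = ids.map(s => LocalDate.parse(s).toEpochDay)
    println(deltas(days))      // first value, then the day-to-day differences
    println(prefixSuffix(ids)) // (shared prefix length, suffix) per string
  }
}
```

Consecutive dates produce a delta of 1 after the first value, which is why an
integer delta encoding suits sorted or slowly changing INT32/INT64 data, while
strings that share long prefixes are the case DELTA_BYTE_ARRAY targets.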
