[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

Raphael Taylor-Davies (Jira) Fri, 09 Jun 2023 08:16:03 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731001#comment-17731001
 ]


Raphael Taylor-Davies commented on PARQUET-2222:
------------------------------------------------

Within arrow-rs, level data is always written using RLE, with the length 
prepended only for v1 data pages. We also support reading bit-packed level 
data, but we never write it.

We additionally support reading and writing boolean data using RLE. This is 
the default for writer version 2, but theoretically a v1 writer could opt in 
to it. The boolean RLE encoder always prepends the length when flushing a data 
page.

Finally, for dictionary indices we do not prepend the length.

Not sure if that answers your question
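
To make the three cases above concrete, here is a rough Python sketch of the 
framing only. It is not arrow-rs code: frame_rle, read_rle_payload and the 
PREPENDS_LENGTH table are hypothetical names made up for illustration, and the 
table just restates the behaviour described in this comment (assuming the 
dictionary index rule applies to both page versions). The prefix is the 4-byte 
little-endian length from the spec grammar quoted below.

```python
import struct


def frame_rle(encoded: bytes, prepend_length: bool) -> bytes:
    """Return an RLE/bit-packed hybrid payload, optionally length-prefixed."""
    if prepend_length:
        # 4-byte little-endian unsigned length of the encoded data, in bytes.
        return struct.pack("<I", len(encoded)) + encoded
    return encoded


def read_rle_payload(buf: bytes, has_length_prefix: bool) -> bytes:
    """Inverse of frame_rle: strip the prefix if the caller expects one."""
    if has_length_prefix:
        (n,) = struct.unpack_from("<I", buf, 0)
        return buf[4:4 + n]
    # Without a prefix the caller has to know the extent some other way,
    # e.g. the level byte lengths carried in the v2 data page header.
    return buf


# (what is encoded, data page version) -> is the length prepended?
# Restates the comment above; not taken from the arrow-rs source.
PREPENDS_LENGTH = {
    ("levels", 1): True,
    ("levels", 2): False,
    ("boolean_values", 1): True,
    ("boolean_values", 2): True,
    ("dictionary_indices", 1): False,
    ("dictionary_indices", 2): False,
}

# One RLE run: the value 1 repeated 8 times at bit width 1.
# Header is varint(8 << 1) = 0x10, followed by the value in one byte.
run = b"\x10\x01"

v1_levels = frame_rle(run, PREPENDS_LENGTH[("levels", 1)])
v2_levels = frame_rle(run, PREPENDS_LENGTH[("levels", 2)])
```

With this, v1_levels is the six bytes 02 00 00 00 10 01 while v2_levels is 
just 10 01, which is the discrepancy with the spec text that this issue is 
about.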

> [Format] RLE encoding spec incorrect for v2 data pages
> ------------------------------------------------------
>
>                 Key: PARQUET-2222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Gang Wu
>            Priority: Critical
>             Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
