[jira] [Commented] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

Micah Kornfield (Jira) Tue, 21 Nov 2023 11:25:07 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788519#comment-17788519
 ]


Micah Kornfield commented on PARQUET-2221:
------------------------------------------

I agree with [~wgtmac] here.  I think we should probably use language like 
we have in previous cases, e.g. "for maximum compatibility", but then say that 
any mix of page encodings is valid as long as the ordering is valid.

 

In terms of mixing dictionary encodings with others, it does make things a 
little bit harder, but I don't think we should disallow it (though we should 
point out the potential benefits of a unified encoding).

> [Format] Encoding spec incorrect for dictionary fallback
> --------------------------------------------------------
>
>                 Key: PARQUET-2221
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2221
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Priority: Critical
>
> The spec for DICTIONARY_ENCODING states that:
> bq. If the dictionary grows too big, whether in size or number of distinct 
> values, the encoding will fall back to the plain encoding. 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
> However, the parquet-mr implementation was deliberately changed to a 
> different fallback mechanism in 
> https://issues.apache.org/jira/browse/PARQUET-52
> I'm assuming the parquet-mr implementation is authoritative here. But then 
> the spec is incorrect and should be fixed to reflect expected behavior.
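
For illustration only, the fallback described in the quoted spec text can be sketched as below. This is a minimal sketch under stated assumptions, not the parquet-mr behavior and not the mechanism introduced in PARQUET-52: the function name `encode_column_chunk`, the `max_dict_entries` threshold, and the tuple-based page representation are all hypothetical, and a real writer would also limit the dictionary by byte size and RLE/bit-pack the indices.

```python
from typing import Any, List, Sequence, Tuple

Page = Tuple[str, List[Any]]  # (encoding name, payload); hypothetical layout


def encode_column_chunk(
    pages: Sequence[Sequence[Any]], max_dict_entries: int = 4
) -> Tuple[List[Any], List[Page]]:
    """Dictionary-encode pages until the dictionary grows too big, then
    emit every remaining page as PLAIN (the fallback the spec text describes)."""
    dictionary: dict = {}       # value -> index, in first-seen order
    encoded: List[Page] = []
    fell_back = False

    for page in pages:
        if not fell_back:
            # Try extending the dictionary with this page's values.
            trial = dict(dictionary)
            for value in page:
                if value not in trial:
                    trial[value] = len(trial)
            if len(trial) <= max_dict_entries:
                dictionary = trial
                encoded.append(("RLE_DICTIONARY", [dictionary[v] for v in page]))
                continue
            # Dictionary would exceed the limit: stop using it from here on.
            fell_back = True
        encoded.append(("PLAIN", list(page)))

    return list(dictionary), encoded


dict_values, out_pages = encode_column_chunk(
    [["a", "b"], ["a", "c"], ["d", "e", "f"], ["a"]]
)
for encoding, payload in out_pages:
    print(encoding, payload)
```

In this sketch the column chunk ends up with dictionary-encoded pages followed by plain pages, which is one example of the mixed page encodings discussed in the comment above.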



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
