Re: Fallback Encoding for Very Sparse or Sorted Datasets

Gang Wu Wed, 01 Mar 2023 07:37:25 -0800

>  What are the reasons for forcing the dictionary to be the first page?


This is by design. I believe it benefits sequential scans: the dictionary
page is read first, followed by the data pages that hold its encoded
indices. If the dictionary could appear later in the column chunk, the
reader would have to seek to it before it could decode those pages.
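
As a rough illustration of the sequential-scan argument, here is a minimal
Python sketch. It is not parquet-mr code: the page model (a tuple of kind
and payload) and the name scan_column_chunk are hypothetical, and real
column chunks may also contain plain-encoded data pages after a fallback.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical page model: ("dict", values) or ("data", dictionary indices).
Page = Tuple[str, List]


def scan_column_chunk(pages: Iterable[Page]) -> Iterator:
    """Decode a dictionary-encoded column chunk in one forward pass."""
    dictionary = None
    for kind, payload in pages:
        if kind == "dict":
            # The dictionary arrives first, so it is available to every data page.
            dictionary = payload
        elif dictionary is None:
            # A data page before the dictionary cannot be decoded without a seek.
            raise ValueError("data page seen before dictionary page")
        else:
            for idx in payload:
                yield dictionary[idx]


pages = [("dict", ["a", "b"]), ("data", [0, 1, 1]), ("data", [0])]
print(list(scan_column_chunk(pages)))
```

With the dictionary first, the reader never has to move backwards or jump
ahead in the file.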

> can this be changed to allow for better encoding decisions in the
> scenario I described?

I think that is possible, but it requires a code change to buffer the input
data and postpone the encoding decision until the writer has sufficient
knowledge of the data.
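
To make the idea concrete, here is a minimal Python sketch of a writer that
defers the decision. It is only an illustration, not how parquet-mr works:
the class DeferredEncodingWriter and the parameters min_values and
max_distinct_ratio are hypothetical, and the distinct-ratio rule is a
stand-in for whatever heuristic a real writer would use.

```python
from typing import List, Optional


class DeferredEncodingWriter:
    """Toy column writer that postpones the dictionary/plain choice."""

    def __init__(self, page_rows: int = 20000, min_values: int = 2,
                 max_distinct_ratio: float = 0.5) -> None:
        self.page_rows = page_rows                    # rows per page check
        self.min_values = min_values                  # hypothetical threshold
        self.max_distinct_ratio = max_distinct_ratio  # hypothetical heuristic
        self.def_levels: List[int] = []  # 1 = value present, 0 = null
        self.values: List[str] = []      # buffered non-null values only
        self.encoding: Optional[str] = None

    def write(self, value: Optional[str]) -> None:
        self.def_levels.append(0 if value is None else 1)
        if value is not None:
            self.values.append(value)
        # Re-check at each page boundary until a decision has been made.
        if self.encoding is None and len(self.def_levels) % self.page_rows == 0:
            self._decide(force=False)

    def _decide(self, force: bool) -> None:
        if not self.values:
            if force:
                self.encoding = "plain"  # nothing but nulls to encode
            return
        if len(self.values) < self.min_values and not force:
            return  # too few values seen so far: keep postponing
        ratio = len(set(self.values)) / len(self.values)
        self.encoding = ("dictionary" if ratio <= self.max_distinct_ratio
                         else "plain")

    def close(self) -> str:
        if self.encoding is None:
            self._decide(force=True)  # end of input: decide with what we have
        return self.encoding


w = DeferredEncodingWriter(page_rows=4)
for v in [None] * 4:
    w.write(v)
print(w.encoding)  # the all-null page did not force a decision
for v in ["x", "x", "x", "y"]:
    w.write(v)
print(w.close())
```

The cost of this approach is the extra buffering: values have to be held in
memory until the decision is made.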

Best,
Gang




On Wed, Mar 1, 2023 at 6:33 PM Patrick Hansert <hans...@informatik.uni-kl.de>
wrote:

> Hi Gang,
>
> thanks for your reply.
>
> On 01.03.23 03:09, Gang Wu wrote:
> > If at least one record in the beginning 20000 rows is not null, then the
> > encoded size will be much better.
> That is the workaround I have been using for the past weeks, although my
> tests show that at least two values are required.
>
> > 3. If dictionary encoding is in effect, the first page must be a
> > dictionary page followed by a set of data pages that are only indices of
> > the dictionary.
> > [...]
> > 5. By default, the parquet-mr implementation has to decide the encoding
> > of a page when it reaches 20000 records.
>
> I agree that this is at the core of the problem; the question is, can
> this be changed to allow for better encoding decisions in the scenario I
> described? An all-null page contains just definition and (possibly)
> repetition levels, no value entries, so there is no need to choose their
> encoding yet. What are the reasons for forcing the dictionary to be the
> first page?
>
> Kind Regards
>
> Patrick
>
>
