Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Brian Hulette Sun, 09 Feb 2020 14:55:52 -0800

> It seems we should potentially disallow dictionaries to contain null
values?
+1 - I've always thought it was odd you could encode null values in two
different places for dictionary encoded columns.
You could argue it's more efficient to encode the nulls in the dictionary,
but I think if we're going to allow that we should go further - we know
there should only be _one_ index with the NULL value in a dictionary, why
encode an entire validity buffer? Maybe this is one place where a sentinel
value makes sense.



The mailing list thread where I brought up the idea of nested dictionaries
[1] is useful context for item 2. I still think this is a good idea, but
I've changed jobs since then and the use-case I described is no longer
motivating me to actually implement it.

> It seems simpler to keep dictionary encoding at the leafs of the schema.
Do we need to go that far? I think we could still allow dictionary encoding
at any level of a hierarchy, and just disallow nested dictionaries.

Brian

[1]
https://lists.apache.org/thread.html/37c0480c4c7a48dd298e8459938444afb901bf01dcebd5f8c5f1dee6%40%3Cdev.arrow.apache.org%3E

On Sat, Feb 8, 2020 at 10:53 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I'd like to understand if any one is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This become error prone for things like parquet.  We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when converting
> the dictionary array back to dense, so I think the values written could get
> out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries to contain null
> values?
>
> 2.  Dictionaries can nested columns which are in turn dictionary encoded
> columns.
>
> - Again we aren't handling this in Parquet today, and I'm wondering if it
> worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test
> but it doesn't look like anyone has done follow-up work to make enable
> this/make it pass.
>
> It seems simpler to keep dictionary encoding at the leafs of the schema.
>
> Of the two I'm a little more worried that Option #1 will break people if we
> decide to disallow it.
>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
>
> https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848
>

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Reply via email to