Hi Micah,

It seems like the null and nested issues really come up when trying to translate from one dictionary encoding scheme to another. Being able to write dictionary-encoded data directly to the Parquet format is beneficial, but it doesn't seem like we should let the constraints of Parquet's dictionary compression push back into the Arrow specification. I agree that it does add complexity to implementing the specification.
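To make the translation problem concrete, here is a minimal sketch (using pyarrow; the example is mine and purely illustrative, not taken from the existing Parquet code paths). A null can live either in the indices or in the dictionary values, so a writer that derives Parquet def levels only from the indices' validity bitmap would miss the second case:

    import pyarrow as pa

    # (a) null encoded in the indices; the dictionary itself has no nulls
    indices_a = pa.array([0, None, 1], type=pa.int32())
    arr_a = pa.DictionaryArray.from_arrays(indices_a, pa.array(["x", "y"]))

    # (b) null stored as a dictionary value; every index is valid
    indices_b = pa.array([0, 2, 1], type=pa.int32())
    arr_b = pa.DictionaryArray.from_arrays(indices_b, pa.array(["x", "y", None]))

    # Both decode to ["x", null, "y"], but the top-level null count differs:
    print(arr_a.null_count)  # 1 -- null visible in the validity bitmap
    print(arr_b.null_count)  # 0 -- null hidden inside the dictionary values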
Some other thoughts on the null issue:

* Evaluating an expression like SUM(ISNULL($field)) is more semantically ambiguous (you have to check more things) when $field is a dictionary-encoded type and the values of the dictionary could be null.
* There may be situations where you want to consider null to be a value.
* Some other systems with dictionary encoding allow null as a dictionary value [1]. Sometimes this is used to determine whether to include the "null group" in tabulations / group aggregations.

Others might have more opinions.

[1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes and Brian,
>
> Thanks for the feedback. My intent in raising these issues is that they make the spec harder to work with/implement (i.e., we have existing bugs, etc.). I'm wondering if we should take the opportunity to simplify before things are set in stone. If we think things are already set, then I'm OK with that as well.
>
> Thanks,
> Micah
>
> On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > I'd like to understand if anyone is making use of the following features and if we should revisit them before 1.0.
>> >
>> > 1. Dictionaries can encode null values.
>> > - This becomes error prone for things like Parquet. We seem to be calculating the definition level solely based on the null bitmap.
>> >
>> > I might have missed something, but it appears that we only check whether a dictionary contains nulls on the optimized path [1], not when converting the dictionary array back to dense, so I think the values written could get out of sync with the rep/def levels?
>> >
>> > It seems we should potentially disallow dictionaries to contain null values?
>>
>> Are you talking about the direct DictionaryArray encoding path?
>>
>> Since both nulls and duplicates are allowed in Arrow dictionaries, I think that Parquet will need to check for the null-in-dictionary case in the equivalent of ArrowColumnWriter::Write [1]. You could start by checking and erroring out.
>>
>> [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
>>
>> >
>> > 2. Dictionaries can be nested columns which are in turn dictionary-encoded columns.
>> >
>> > - Again, we aren't handling this in Parquet today, and I'm wondering if it is worth the effort. There was a PR merged a while ago [2] to add a "skipped" integration test, but it doesn't look like anyone has done follow-up work to enable this/make it pass.
>>
>> I started looking at this the other day (on the integration tests); I'd like to get the C++ side fully working with integration tests.
>>
>> As far as Parquet goes, it doesn't seem worth the effort to try to implement a faithful roundtrip. Since we are serializing the Arrow types into the file metadata, we could at least cast back, but perhaps lose the ordering.
>>
>> >
>> > It seems simpler to keep dictionary encoding at the leaves of the schema.
>> >
>> > Of the two, I'm a little more worried that Option #1 will break people if we decide to disallow it.
>>
>> I think the ship has sailed on disallowing this at the format level.
>>
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> > [2] https://github.com/apache/arrow/pull/1848