Hi Micah,

It seems like the null and nested issues really come up when trying to
translate from one dictionary encoding scheme to another. Being able to
write dictionary-encoded data directly to Parquet is beneficial, but it
doesn't seem like we should let the constraints of Parquet's dictionary
compression push back into the Arrow specification. I agree that it does
add complexity to implementing the specification.

Some other thoughts about the null issue:

* evaluating an expression like SUM(ISNULL($field)) is more
semantically ambiguous (you have to check more things) when $field is
a dictionary-encoded type and the dictionary values themselves can be
null (see the sketch after this list)
* there may be situations where you want to consider null to be a value
* some other systems with dictionary encoding allow null as a
dictionary value [1]. Sometimes this is used to determine whether to
include the "null group" in tabulations / group aggregations
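
To make the first point concrete, here is a small pyarrow sketch
(illustrative only) showing that a null can live either in the indices
(covered by the validity bitmap) or in the dictionary values themselves:

    import pyarrow as pa

    # A null can appear in two places in a dictionary-encoded array:
    #   1. in the indices, masked by the array's validity bitmap, or
    #   2. as a value stored inside the dictionary itself.
    dictionary = pa.array(["a", None, "b"])              # null in the dictionary
    indices = pa.array([0, 1, 2, None], type=pa.int8())  # null in the indices
    arr = pa.DictionaryArray.from_arrays(indices, dictionary)

    # Only the index-level null is counted via the validity bitmap, so anything
    # that derives nullness from the bitmap alone (e.g. Parquet definition
    # levels) misses the slot that points at the null dictionary value.
    print(arr.null_count)   # 1, even though decoding yields two logical nulls
    print(arr.to_pylist())  # ['a', None, 'b', None]

So an ISNULL over such an array has to consult both the bitmap and the
dictionary values, which is the extra checking I mean above.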

Others might have more opinions.

[1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes and Brian,
>
> Thanks for the feedback.  I raised these issues because they make the spec
> harder to work with/implement (i.e., we have existing bugs, etc.).
> I'm wondering if we should take the opportunity to simplify before things are 
> set in stone. If we think things are already set, then I'm OK with that as 
> well.
>
> Thanks,
> Micah
>
> On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>
>>
>> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> 
>> wrote:
>> >
>> > I'd like to understand if anyone is making use of the following features
>> > and if we should revisit them before 1.0.
>> >
>> > 1. Dictionaries can encode null values.
>> > - This becomes error prone for things like Parquet.  We seem to be
>> > calculating the definition level solely based on the null bitmap.
>> >
>> > I might have missed something but it appears that we only check if a
>> > dictionary contains nulls on the optimized path [1] but not when converting
>> > the dictionary array back to dense, so I think the values written could get
>> > out of sync with the rep/def levels?
>> >
>> > It seems we should potentially disallow dictionaries from containing null
>> > values?
>>
>> Are you talking about the direct DictionaryArray encoding path?
>>
>> Since both nulls and duplicates are allowed in Arrow dictionaries, I think 
>> that Parquet will need to check for the null-in-dictionary case in the 
>> equivalent of ArrowColumnWriter::Write [1]. You could start by checking and 
>> erroring out.
>>
>> [1]: 
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
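
A rough sketch of what such a guard could look like, in pyarrow for brevity
rather than the actual C++ writer path (the helper name is made up):

    import pyarrow as pa

    def check_no_null_in_dictionary(arr: pa.DictionaryArray) -> None:
        # Hypothetical pre-write guard: reject dictionaries that contain
        # nulls, since definition levels are derived from the index-level
        # validity bitmap only.
        if arr.dictionary.null_count > 0:
            raise ValueError(
                "dictionary contains null values; re-encode so nulls live "
                "only in the indices before writing to Parquet")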
>>
>> >
>> > 2.  Dictionaries can encode nested columns which are in turn
>> > dictionary-encoded columns.
>> >
>> > - Again, we aren't handling this in Parquet today, and I'm wondering if it
>> > is worth the effort.
>> > There was a PR merged a while ago [2] to add a "skipped" integration test,
>> > but it doesn't look like anyone has done follow-up work to enable this /
>> > make it pass.
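
For reference, this is the nested shape in question, sketched with pyarrow
type constructors only (whether such arrays can actually be built and
written end to end is exactly what is unclear):

    import pyarrow as pa

    # Dictionary encoding that is not at a leaf of the schema: a dictionary
    # whose value type is a nested type containing another dictionary-encoded
    # child, i.e. dictionary<list<dictionary<string>>>.
    inner = pa.dictionary(pa.int8(), pa.string())    # dictionary<string>
    nested_values = pa.list_(inner)                  # list<dictionary<string>>
    outer = pa.dictionary(pa.int32(), nested_values)
    print(outer)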
>>
>> I started looking at this the other day (on the integration tests); I'd like
>> to get the C++ side fully working with the integration tests.
>>
>> As far as Parquet goes, it doesn't seem worth the effort to try to implement a
>> faithful roundtrip. Since we are serializing the Arrow types into the file
>> metadata, we could at least cast back, though we would perhaps lose the ordering.
>>
>> >
>> > It seems simpler to keep dictionary encoding at the leaves of the schema.
>> >
>> > Of the two I'm a little more worried that Option #1 will break people if we
>> > decide to disallow it.
>>
>> I think the ship has sailed on disallowing this at the format level.
>>
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1]
>> > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> > [2] https://github.com/apache/arrow/pull/1848
