Hi Micah,

It seems like the null and nested issues really come up when trying to translate from one dictionary encoding scheme to another. Being able to write dictionary-encoded data directly to the Parquet format is beneficial, but it doesn't seem like we should let the constraints of Parquet's dictionary compression push back into the Arrow specification. I agree that it does add complexity to implementing the specification.
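To make the translation problem concrete, here is a minimal sketch (using pyarrow; the example is mine and purely illustrative, not taken from the existing Parquet code paths). A null can live either in the indices or in the dictionary values, so a writer that derives Parquet def levels only from the indices' validity bitmap would miss the second case:

    import pyarrow as pa

    # (a) null encoded in the indices; the dictionary itself has no nulls
    indices_a = pa.array([0, None, 1], type=pa.int32())
    arr_a = pa.DictionaryArray.from_arrays(indices_a, pa.array(["x", "y"]))

    # (b) null stored as a dictionary value; every index is valid
    indices_b = pa.array([0, 2, 1], type=pa.int32())
    arr_b = pa.DictionaryArray.from_arrays(indices_b, pa.array(["x", "y", None]))

    # Both decode to ["x", null, "y"], but the top-level null count differs:
    print(arr_a.null_count)  # 1 -- null visible in the validity bitmap
    print(arr_b.null_count)  # 0 -- null hidden inside the dictionary values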
Some other thoughts on the null issue:

* Evaluating an expression like SUM(ISNULL($field)) is more semantically ambiguous (you have to check more things) when $field is a dictionary-encoded type and the values of the dictionary could be null.
* There may be situations where you want to consider null to be a value.
* Some other systems with dictionary encoding allow null as a dictionary value [1]. Sometimes this is used to determine whether to include the "null group" in tabulations / group aggregations.

Others might have more opinions.

[1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes and Brian,
>
> Thanks for the feedback. My intent in raising these issues is that they make the spec harder to work with/implement (i.e., we have existing bugs, etc.). I'm wondering if we should take the opportunity to simplify before things are set in stone. If we think things are already set, then I'm OK with that as well.
>
> Thanks,
> Micah
>
> On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > I'd like to understand if anyone is making use of the following features and if we should revisit them before 1.0.
>> >
>> > 1. Dictionaries can encode null values.
>> > - This becomes error prone for things like Parquet. We seem to be calculating the definition level solely based on the null bitmap.
>> >
>> > I might have missed something, but it appears that we only check whether a dictionary contains nulls on the optimized path [1], not when converting the dictionary array back to dense, so I think the values written could get out of sync with the rep/def levels?
>> >
>> > It seems we should potentially disallow dictionaries to contain null values?
>>
>> Are you talking about the direct DictionaryArray encoding path?
>>
>> Since both nulls and duplicates are allowed in Arrow dictionaries, I think that Parquet will need to check for the null-in-dictionary case in the equivalent of ArrowColumnWriter::Write [1]. You could start by checking and erroring out.
>>
>> [1]: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
>>
>> >
>> > 2. Dictionaries can be nested columns which are in turn dictionary-encoded columns.
>> >
>> > - Again, we aren't handling this in Parquet today, and I'm wondering if it is worth the effort. There was a PR merged a while ago [2] to add a "skipped" integration test, but it doesn't look like anyone has done follow-up work to enable this/make it pass.
>>
>> I started looking at this the other day (on the integration tests); I'd like to get the C++ side fully working with integration tests.
>>
>> As far as Parquet goes, it doesn't seem worth the effort to try to implement a faithful roundtrip. Since we are serializing the Arrow types into the file metadata, we could at least cast back, but perhaps lose the ordering.
>>
>> >
>> > It seems simpler to keep dictionary encoding at the leaves of the schema.
>> >
>> > Of the two, I'm a little more worried that Option #1 will break people if we decide to disallow it.
>>
>> I think the ship has sailed on disallowing this at the format level.
>>
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> > [2] https://github.com/apache/arrow/pull/1848