Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
On Tue, Feb 18, 2020 at 2:01 AM Micah Kornfield wrote:
>
> >> * evaluating an expression like SUM(ISNULL($field)) is more
> >> semantically ambiguous (you have to check more things) when $field is
> >> a dictionary-encoded type and the values of the dictionary could be
> >> null
>
> It is this type of thing that I'm worried about (Parquet just happens to be
> where I'm working now). Having corner cases that even experts in Arrow
> have a hard time getting right seems not great, but I don't have solutions
> either.
>
> Another example of this is the assumption that null lists have zero length
> (at least in the Arrow-to-Parquet code, I'm not seeing where we handle the
> condition where the assumption is false).
>
> >> * some other systems with dictionary encoding allow null as a
> >> dictionary value [1]. Sometimes this is used to determine whether to
> >> include the "null group" in tabulations / group aggregations
>
> This seems like a good reason to keep it. Does R support the same double
> nullness that Arrow has in its spec? I wonder if an update to the spec
> saying only one mechanism should be used for nullness would be helpful,
> or if even that is too big of a change?

It seems that when R sees NA in the factor levels, it is treated as a
separate value and not considered null/NA (FWIW, NA and NULL in R are not
the same thing, but that's a different topic that we can avoid for our
purposes):

> a <- factor(c("a", "a", "a", "b", "b"))
> a <- factor(c("a", "a", "a", "b", "b", NA))
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> table(a)
a
a b
3 2
> a <- addNA(a)
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> a
[1] a    a    a    b    b    <NA>
Levels: a b <NA>
> table(a)
a
   a    b <NA>
   3    2    1

If you look at the implementation of functions like "table" you can see a
good amount of code having to do with handling this case (the factor levels /
dictionary values contain a null, so the NA result needs to be appended to
the result).
In practice it seems that the semantics may be application-determined -- a
null in the dictionary values versus a null in the dictionary indices could
have different semantic meanings (or not, it's hard to say).

> Thanks,
> Micah
>
> On Tue, Feb 11, 2020 at 2:36 PM Wes McKinney wrote:
>>
>> hi Micah,
>>
>> It seems like the null and nested issues really come up when trying to
>> translate from one dictionary encoding scheme to another. That we are
>> able to directly write dictionary-encoded data to Parquet format is
>> beneficial, but it doesn't seem like we should let the constraints of
>> Parquet's dictionary compression push back into the Arrow
>> specification. I agree that it does add complexity to implementing the
>> specification.
>>
>> Some other thoughts about the null issue
>>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>> * there may be situations where you want to consider null to be a value
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>>
>> Others might have more opinions
>>
>> [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
>>
>> On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield wrote:
>> >
>> > Hi Wes and Brian,
>> >
>> > Thanks for the feedback. My intent in raising these issues is that they
>> > make the spec harder to work with/implement (i.e. we have existing bugs,
>> > etc). I'm wondering if we should take the opportunity to simplify before
>> > things are set in stone. If we think things are already set, then I'm OK
>> > with that as well.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney wrote:
>> >>
>> >> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
>> >> >
>> >> > I'd like to understand if anyone is making use of the following
>> >> > features and if we should revisit them before 1.0.
>> >> >
>> >> > 1. Dictionaries can encode null values.
>> >> > - This becomes error-prone for things like Parquet. We seem to be
>> >> > calculating the definition level solely based on the null bitmap.
>> >> >
>> >> > I might have missed something but it appears that we only check if a
>> >> > dictionary contains nulls on the optimized path [1] but not when
>> >> > converting the dictionary array back to dense, so I think the values
>> >> > written could get out of sync with the rep/def levels?
>> >> >
>> >> > It seems we should potentially disallow dictionaries to contain null
>> >> > values?
>> >>
>> >> Are you talking about the direct DictionaryArray encoding path?
>> >>
>> >> Since both nulls and duplicates are allowed in Arrow dictionaries, I
>> >> think that Parquet will need to check for the null
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null

It is this type of thing that I'm worried about (Parquet just happens to be
where I'm working now). Having corner cases that even experts in Arrow have
a hard time getting right seems not great, but I don't have solutions either.

Another example of this is the assumption that null lists have zero length
(at least in the Arrow-to-Parquet code, I'm not seeing where we handle the
condition where the assumption is false).

> * some other systems with dictionary encoding allow null as a
> dictionary value [1]. Sometimes this is used to determine whether to
> include the "null group" in tabulations / group aggregations

This seems like a good reason to keep it. Does R support the same double
nullness that Arrow has in its spec? I wonder if an update to the spec
saying only one mechanism should be used for nullness would be helpful, or
if even that is too big of a change?

Thanks,
Micah

On Tue, Feb 11, 2020 at 2:36 PM Wes McKinney wrote:
> hi Micah,
>
> It seems like the null and nested issues really come up when trying to
> translate from one dictionary encoding scheme to another. That we are
> able to directly write dictionary-encoded data to Parquet format is
> beneficial, but it doesn't seem like we should let the constraints of
> Parquet's dictionary compression push back into the Arrow
> specification. I agree that it does add complexity to implementing the
> specification.
>
> Some other thoughts about the null issue
>
> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null
> * there may be situations where you want to consider null to be a value
> * some other systems with dictionary encoding allow null as a
> dictionary value [1]. Sometimes this is used to determine whether to
> include the "null group" in tabulations / group aggregations
>
> Others might have more opinions
>
> [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
>
> On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield wrote:
> >
> > Hi Wes and Brian,
> >
> > Thanks for the feedback. My intent in raising these issues is that they
> > make the spec harder to work with/implement (i.e. we have existing bugs,
> > etc). I'm wondering if we should take the opportunity to simplify before
> > things are set in stone. If we think things are already set, then I'm OK
> > with that as well.
> >
> > Thanks,
> > Micah
> >
> > On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney wrote:
> >>
> >> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
> >> >
> >> > I'd like to understand if anyone is making use of the following
> >> > features and if we should revisit them before 1.0.
> >> >
> >> > 1. Dictionaries can encode null values.
> >> > - This becomes error-prone for things like Parquet. We seem to be
> >> > calculating the definition level solely based on the null bitmap.
> >> >
> >> > I might have missed something but it appears that we only check if a
> >> > dictionary contains nulls on the optimized path [1] but not when
> >> > converting the dictionary array back to dense, so I think the values
> >> > written could get out of sync with the rep/def levels?
> >> >
> >> > It seems we should potentially disallow dictionaries to contain null
> >> > values?
> >>
> >> Are you talking about the direct DictionaryArray encoding path?
> >>
> >> Since both nulls and duplicates are allowed in Arrow dictionaries, I
> >> think that Parquet will need to check for the null-in-dictionary case
> >> in the equivalent of ArrowColumnWriter::Write [1]. You could start by
> >> checking and erroring out.
> >>
> >> [1]:
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
> >>
> >> >
> >> > 2. Dictionaries can encode nested columns which are in turn
> >> > dictionary-encoded columns.
> >> >
> >> > - Again we aren't handling this in Parquet today, and I'm wondering
> >> > if it's worth the effort.
> >> > There was a PR merged a while ago [2] to add a "skipped" integration
> >> > test but it doesn't look like anyone has done follow-up work to
> >> > enable this/make it pass.
> >>
> >> I started looking at this the other day (on the integration tests); I'd
> >> like to get the C++ side fully working with integration tests.
> >>
> >> As far as Parquet, it doesn't seem worth the effort to try to implement
> >> a faithful roundtrip. Since we are serializing the Arrow types into the
> >> file metadata we could at least cast back, but perhaps lose the
> >> ordering.
> >>
> >> >
> >> > It seems simpler to keep dictionary encoding at the leaves of the
> >> > schema.
> >> >
> >> > Of the two I'm a little more worried that Option #1 will break people
> >> > if we decide to disa
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
hi Micah,

It seems like the null and nested issues really come up when trying to
translate from one dictionary encoding scheme to another. That we are able
to directly write dictionary-encoded data to Parquet format is beneficial,
but it doesn't seem like we should let the constraints of Parquet's
dictionary compression push back into the Arrow specification. I agree that
it does add complexity to implementing the specification.

Some other thoughts about the null issue:

* evaluating an expression like SUM(ISNULL($field)) is more semantically
ambiguous (you have to check more things) when $field is a
dictionary-encoded type and the values of the dictionary could be null
* there may be situations where you want to consider null to be a value
* some other systems with dictionary encoding allow null as a dictionary
value [1]. Sometimes this is used to determine whether to include the "null
group" in tabulations / group aggregations

Others might have more opinions.

[1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

On Mon, Feb 10, 2020 at 2:57 PM Micah Kornfield wrote:
>
> Hi Wes and Brian,
>
> Thanks for the feedback. My intent in raising these issues is that they
> make the spec harder to work with/implement (i.e. we have existing bugs,
> etc). I'm wondering if we should take the opportunity to simplify before
> things are set in stone. If we think things are already set, then I'm OK
> with that as well.
>
> Thanks,
> Micah
>
> On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney wrote:
>>
>> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
>> >
>> > I'd like to understand if anyone is making use of the following
>> > features and if we should revisit them before 1.0.
>> >
>> > 1. Dictionaries can encode null values.
>> > - This becomes error-prone for things like Parquet. We seem to be
>> > calculating the definition level solely based on the null bitmap.
>> >
>> > I might have missed something but it appears that we only check if a
>> > dictionary contains nulls on the optimized path [1] but not when
>> > converting the dictionary array back to dense, so I think the values
>> > written could get out of sync with the rep/def levels?
>> >
>> > It seems we should potentially disallow dictionaries to contain null
>> > values?
>>
>> Are you talking about the direct DictionaryArray encoding path?
>>
>> Since both nulls and duplicates are allowed in Arrow dictionaries, I
>> think that Parquet will need to check for the null-in-dictionary case in
>> the equivalent of ArrowColumnWriter::Write [1]. You could start by
>> checking and erroring out.
>>
>> [1]:
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
>>
>> >
>> > 2. Dictionaries can encode nested columns which are in turn
>> > dictionary-encoded columns.
>> >
>> > - Again we aren't handling this in Parquet today, and I'm wondering if
>> > it's worth the effort.
>> > There was a PR merged a while ago [2] to add a "skipped" integration
>> > test but it doesn't look like anyone has done follow-up work to enable
>> > this/make it pass.
>>
>> I started looking at this the other day (on the integration tests); I'd
>> like to get the C++ side fully working with integration tests.
>>
>> As far as Parquet, it doesn't seem worth the effort to try to implement a
>> faithful roundtrip. Since we are serializing the Arrow types into the
>> file metadata we could at least cast back, but perhaps lose the ordering.
>>
>> >
>> > It seems simpler to keep dictionary encoding at the leaves of the
>> > schema.
>> >
>> > Of the two I'm a little more worried that Option #1 will break people
>> > if we decide to disallow it.
>>
>> I think the ship has sailed on disallowing this at the format level.
>>
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1]
>> > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
>> > [2] https://github.com/apache/arrow/pull/1848
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
Hi Wes and Brian,

Thanks for the feedback. My intent in raising these issues is that they make
the spec harder to work with/implement (i.e. we have existing bugs, etc.).
I'm wondering if we should take the opportunity to simplify before things
are set in stone. If we think things are already set, then I'm OK with that
as well.

Thanks,
Micah

On Mon, Feb 10, 2020 at 12:40 PM Wes McKinney wrote:
>
> On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
> >
> > I'd like to understand if anyone is making use of the following features
> > and if we should revisit them before 1.0.
> >
> > 1. Dictionaries can encode null values.
> > - This becomes error-prone for things like Parquet. We seem to be
> > calculating the definition level solely based on the null bitmap.
> >
> > I might have missed something but it appears that we only check if a
> > dictionary contains nulls on the optimized path [1] but not when
> > converting the dictionary array back to dense, so I think the values
> > written could get out of sync with the rep/def levels?
> >
> > It seems we should potentially disallow dictionaries to contain null
> > values?
>
> Are you talking about the direct DictionaryArray encoding path?
>
> Since both nulls and duplicates are allowed in Arrow dictionaries, I think
> that Parquet will need to check for the null-in-dictionary case in the
> equivalent of ArrowColumnWriter::Write [1]. You could start by checking
> and erroring out.
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
>
> >
> > 2. Dictionaries can encode nested columns which are in turn
> > dictionary-encoded columns.
> >
> > - Again we aren't handling this in Parquet today, and I'm wondering if
> > it's worth the effort.
> > There was a PR merged a while ago [2] to add a "skipped" integration
> > test but it doesn't look like anyone has done follow-up work to enable
> > this/make it pass.
>
> I started looking at this the other day (on the integration tests); I'd
> like to get the C++ side fully working with integration tests.
>
> As far as Parquet, it doesn't seem worth the effort to try to implement a
> faithful roundtrip. Since we are serializing the Arrow types into the file
> metadata we could at least cast back, but perhaps lose the ordering.
>
> >
> > It seems simpler to keep dictionary encoding at the leaves of the
> > schema.
> >
> > Of the two I'm a little more worried that Option #1 will break people
> > if we decide to disallow it.
>
> I think the ship has sailed on disallowing this at the format level.
>
> >
> > Thoughts?
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> > [2] https://github.com/apache/arrow/pull/1848
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
>
> I'd like to understand if anyone is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This becomes error-prone for things like Parquet. We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when
> converting the dictionary array back to dense, so I think the values
> written could get out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries to contain null
> values?

Are you talking about the direct DictionaryArray encoding path?

Since both nulls and duplicates are allowed in Arrow dictionaries, I think
that Parquet will need to check for the null-in-dictionary case in the
equivalent of ArrowColumnWriter::Write [1]. You could start by checking and
erroring out.

[1]:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324

>
> 2. Dictionaries can encode nested columns which are in turn
> dictionary-encoded columns.
>
> - Again we aren't handling this in Parquet today, and I'm wondering if
> it's worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test
> but it doesn't look like anyone has done follow-up work to enable
> this/make it pass.

I started looking at this the other day (on the integration tests); I'd
like to get the C++ side fully working with integration tests.

As far as Parquet, it doesn't seem worth the effort to try to implement a
faithful roundtrip. Since we are serializing the Arrow types into the file
metadata we could at least cast back, but perhaps lose the ordering.

>
> It seems simpler to keep dictionary encoding at the leaves of the schema.
>
> Of the two I'm a little more worried that Option #1 will break people if
> we decide to disallow it.
I think the ship has sailed on disallowing this at the format level.

>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
> It seems we should potentially disallow dictionaries to contain null
> values?

+1 - I've always thought it was odd you could encode null values in two
different places for dictionary-encoded columns. You could argue it's more
efficient to encode the nulls in the dictionary, but I think if we're going
to allow that we should go further - we know there should only be _one_
index with the NULL value in a dictionary, so why encode an entire validity
buffer? Maybe this is one place where a sentinel value makes sense.

The mailing list thread where I brought up the idea of nested dictionaries
[1] is useful context for item 2. I still think this is a good idea, but
I've changed jobs since then and the use case I described is no longer
motivating me to actually implement it.

> It seems simpler to keep dictionary encoding at the leaves of the schema.

Do we need to go that far? I think we could still allow dictionary encoding
at any level of a hierarchy, and just disallow nested dictionaries.

Brian

[1] https://lists.apache.org/thread.html/37c0480c4c7a48dd298e8459938444afb901bf01dcebd5f8c5f1dee6%40%3Cdev.arrow.apache.org%3E

On Sat, Feb 8, 2020 at 10:53 PM Micah Kornfield wrote:
> I'd like to understand if anyone is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This becomes error-prone for things like Parquet. We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when
> converting the dictionary array back to dense, so I think the values
> written could get out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries to contain null
> values?
>
> 2. Dictionaries can encode nested columns which are in turn
> dictionary-encoded columns.
>
> - Again we aren't handling this in Parquet today, and I'm wondering if
> it's worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test
> but it doesn't look like anyone has done follow-up work to enable
> this/make it pass.
>
> It seems simpler to keep dictionary encoding at the leaves of the schema.
>
> Of the two I'm a little more worried that Option #1 will break people if
> we decide to disallow it.
>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848
[Format] Dictionary edge cases (encoding nulls and nested dictionaries)
I'd like to understand if anyone is making use of the following features and
if we should revisit them before 1.0.

1. Dictionaries can encode null values.
- This becomes error-prone for things like Parquet. We seem to be
calculating the definition level solely based on the null bitmap.

I might have missed something, but it appears that we only check if a
dictionary contains nulls on the optimized path [1] but not when converting
the dictionary array back to dense, so I think the values written could get
out of sync with the rep/def levels?

It seems we should potentially disallow dictionaries to contain null values?

2. Dictionaries can encode nested columns which are in turn
dictionary-encoded columns.

- Again we aren't handling this in Parquet today, and I'm wondering if it's
worth the effort. There was a PR merged a while ago [2] to add a "skipped"
integration test, but it doesn't look like anyone has done follow-up work to
enable this/make it pass.

It seems simpler to keep dictionary encoding at the leaves of the schema.

Of the two, I'm a little more worried that Option #1 will break people if we
decide to disallow it.

Thoughts?

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
[2] https://github.com/apache/arrow/pull/1848