Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-18 Thread Wes McKinney
On Tue, Feb 18, 2020 at 2:01 AM Micah Kornfield wrote:
>
>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>
> It is this type of t

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-18 Thread Micah Kornfield
> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null

It is this type of thing that I'm worried about (parquet just happens to be where I'm w
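
To make the quoted ambiguity concrete, here is a minimal sketch in plain Python (not the Arrow API). It models a dictionary-encoded column as a list of indices plus a list of dictionary values, with `None` standing in for a null in either place. The function name `logical_null_count` and the sample data are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence


def logical_null_count(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[Optional[str]],
) -> int:
    """Count slots that are logically null in a dictionary-encoded column."""
    count = 0
    for idx in indices:
        if idx is None:
            # Null recorded on the index itself (the validity bitmap case).
            count += 1
        elif dictionary[idx] is None:
            # Valid index that points at a null dictionary value.
            count += 1
    return count


# Hypothetical column: dictionary entry 1 is null, and one index is null.
dictionary = ["a", None, "b"]
indices = [0, 1, None, 2, 1]

# Looking only at the indices misses the nulls stored in the dictionary.
index_only_nulls = sum(i is None for i in indices)
all_nulls = logical_null_count(indices, dictionary)
```

Under these assumptions, an ISNULL-style count that consults only the index validity and one that also consults the dictionary values give different answers, which is the extra checking the quoted text refers to.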

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-11 Thread Wes McKinney
hi Micah,

It seems like the null and nested issues really come up when trying to translate from one dictionary encoding scheme to another. That we are able to directly write dictionary-encoded data to Parquet format is beneficial, but it doesn't seem like we should let the constraints of Parquet's

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-10 Thread Micah Kornfield
Hi Wes and Brian,

Thanks for the feedback. My intent in raising these issues is that they make the spec harder to work with/implement (i.e. we have existing bugs, etc). I'm wondering if we should take the opportunity to simplify before things are set in stone. If we think things are already set,

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-10 Thread Wes McKinney
On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield wrote:
>
> I'd like to understand if anyone is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This becomes error-prone for things like parquet. We seem to be
> calcula

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-09 Thread Brian Hulette
> It seems we should potentially disallow dictionaries to contain null values?

+1 - I've always thought it was odd you could encode null values in two different places for dictionary encoded columns. You could argue it's more efficient to encode the nulls in the dictionary, but I think if we're goi

[Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-08 Thread Micah Kornfield
I'd like to understand if anyone is making use of the following features and if we should revisit them before 1.0.

1. Dictionaries can encode null values.
- This becomes error-prone for things like parquet. We seem to be calculating the definition level solely based on the null bitmap. I might h
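
The definition-level concern in point 1 can be sketched in plain Python (this is not the Arrow or Parquet writer code). The sketch assumes a flat, optional column, where a definition level of 0 means null and 1 means present. The function names and sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def def_levels_from_bitmap(indices: Sequence[Optional[int]]) -> List[int]:
    """Definition levels derived only from index validity (the null bitmap)."""
    return [0 if idx is None else 1 for idx in indices]


def def_levels_logical(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[Optional[str]],
) -> List[int]:
    """Definition levels that also treat null dictionary values as null."""
    return [
        0 if idx is None or dictionary[idx] is None else 1
        for idx in indices
    ]


# Hypothetical column whose dictionary contains a null at position 1.
dictionary = ["x", None]
indices = [0, 1, None, 0]

bitmap_levels = def_levels_from_bitmap(indices)
logical_levels = def_levels_logical(indices, dictionary)
```

In this sketch the second slot is marked as present when only the bitmap is consulted, even though it resolves to a null dictionary value, so the two level lists disagree at that position.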