For those following along, I've proposed a workaround that loosens the restriction on non-nullable children in <https://github.com/apache/arrow-rs/pull/3244>. In particular, non-nullable children are now allowed to contain nulls, so long as they don't introduce any nulls not already present in their parent.
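
To make the rule concrete, here is a minimal sketch in plain Python rather than the arrow-rs code itself. It assumes validity masks are modelled as lists of booleans (True means valid) and that a missing mask means "no nulls"; the function name child_nulls_covered_by_parent is hypothetical and not part of any Arrow API.

```python
from typing import List, Optional


def child_nulls_covered_by_parent(
    parent_validity: Optional[List[bool]],
    child_validity: Optional[List[bool]],
) -> bool:
    """Return True if every null in the child is also null in the parent.

    A mask of None stands for "no null mask", i.e. every slot is valid.
    """
    if child_validity is None:
        return True  # the child has no nulls at all
    if parent_validity is None:
        # the parent has no nulls, so the child may not have any either
        return all(child_validity)
    if len(parent_validity) != len(child_validity):
        raise ValueError("parent and child must have the same length")
    # a child null (False) is only acceptable where the parent is null too
    return all(c or not p for p, c in zip(parent_validity, child_validity))
```

For example, a child mask of [True, False, True] is acceptable under a parent mask of [True, False, True], whereas [False, False, True] is not, because slot 0 would be a new null.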

I think this preserves the semantics of nullability, whilst retaining the ability to correctly interpret, validate and construct child arrays in the absence of their parents.

Please do let me know if you foresee any issue with this approach, or have any insights to share on how other implementations are handling this.

Kind Regards,

Raphael Taylor-Davies

On 29/11/2022 20:47, Jacob Quinn wrote:
I was just looking into a related issue last night where it seems pandas
complains if there are _any_ nulls in the dictionary and we were
considering not allowing nulls in the dictionary values at all. But it's a
little tangled up at the moment because we've already allowed it. Ref:
https://github.com/apache/arrow-julia/issues/360

-Jacob

On Tue, Nov 29, 2022 at 8:06 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid>  wrote:

Hi All,

I am not sure if it is intentional, but a common property of all Arrow
layouts is that the value at a given index is defined, even if for a
null it may contain an arbitrary value. This is true everywhere except
for the dictionary layout, where the key in the null slot may contain an
arbitrary value, and consequently the value at that index is undefined.

This has been a repeated nuisance in the Rust implementation, but so far
I've managed to find workarounds for most issues. However, I'm unsure
how to handle StructArrays containing non-nullable, dictionary-encoded
children. As the children are non-nullable, they cannot contain a null
mask, but without a null mask the child dictionary array is ill-formed.
I'm not really sure how best to handle this.
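
As an illustration of the problem, the following is a rough sketch in plain Python, not the Rust implementation. It assumes a dictionary array can be modelled as a list of integer keys, a list of values and an optional validity mask; decode_dictionary is a hypothetical helper. Once the mask is dropped, the arbitrary key in the null slot is looked up and falls outside the values.

```python
from typing import List, Optional


def decode_dictionary(
    keys: List[int],
    values: List[Optional[str]],
    validity: Optional[List[bool]] = None,
) -> List[Optional[str]]:
    """Expand dictionary keys into values, consulting the null mask first."""
    out = []
    for i, key in enumerate(keys):
        if validity is not None and not validity[i]:
            out.append(None)  # null slot: the key is never looked at
        else:
            out.append(values[key])  # IndexError if the key is out of range
    return out


values = ["a", "b"]
keys = [0, 7, 1]  # slot 1 is null, so its key (7) is arbitrary
validity = [True, False, True]

with_mask = decode_dictionary(keys, values, validity)

try:
    decode_dictionary(keys, values)  # mask dropped, as for a non-nullable child
    failed_without_mask = False
except IndexError:
    failed_without_mask = True
```

With the mask the array decodes cleanly; without it the same keys can no longer be interpreted, which is the sense in which the child is ill-formed on its own.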

One option might be to require that all dictionary keys, even those for
null slots, are a valid index into the child values array. As the child
values array can itself contain nulls, this is always possible.
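
A minimal sketch of what this option could look like, again in plain Python with lists standing in for the buffers. The helper make_keys_valid is hypothetical: it points every null slot at a null entry in the values, appending one if none exists, so that every key is in bounds.

```python
from typing import List, Optional, Tuple


def make_keys_valid(
    keys: List[int],
    values: List[Optional[str]],
    validity: List[bool],
) -> Tuple[List[int], List[Optional[str]]]:
    """Rewrite the keys in null slots so that every key is in bounds."""
    new_values = list(values)
    # reuse an existing null dictionary entry if there is one
    null_key = next((i for i, v in enumerate(new_values) if v is None), None)
    new_keys = []
    for key, valid in zip(keys, validity):
        if valid:
            new_keys.append(key)
            continue
        if null_key is None:
            # no null entry yet: append one and remember its position
            new_values.append(None)
            null_key = len(new_values) - 1
        new_keys.append(null_key)
    return new_keys, new_values
```

After this rewrite the array can be decoded without its null mask, since looking up a null slot simply yields the null dictionary entry.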

My questions are therefore:

* How are other implementations handling this case?

* Is requiring all dictionary keys to be a valid index into the child
values acceptable? We already do something similar for offsets.

* What is the motivation for dictionaries having two levels of
nullability, in both the keys and the values? UnionArray, by contrast,
only encodes nullability in its children.

Any help would be much appreciated.

Kind Regards,

Raphael Taylor-Davies
