In principle I'm in favor of #2 -- the only question is what kinds of
problems it might pose for forward compatibility.

Note:

* This is completely backward compatible (any data conforming to the
letter of the spec will continue to conform).
* It is also forward compatible at the protocol level, but code that
assumes the offsets are monotonic will break.

Since the offset effectively acts as a dictionary index, this doesn't
strike me as being so harmful, but I'm interested in the opinions of
others.
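
To make the "dictionary index" reading concrete, here is a minimal
pure-Python sketch. It is not taken from any Arrow implementation: the
function names and sample data are hypothetical, and it assumes for
simplicity that each type id is directly the index of its child array.
It shows that lookups work the same whether or not the offsets are
increasing, and what kind of check a consumer relying on monotonicity
would implicitly be making.

```python
from typing import Any, List, Sequence


def resolve_dense_union(
    type_ids: Sequence[int],
    offsets: Sequence[int],
    children: Sequence[Sequence[Any]],
) -> List[Any]:
    """Return the logical values: slot i is children[type_ids[i]][offsets[i]].

    Assumption: a type id is used directly as the child index.
    """
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def offsets_non_decreasing_per_child(
    type_ids: Sequence[int], offsets: Sequence[int]
) -> bool:
    """Check that, for each child, its offsets never go backwards.

    Repeated offsets are accepted (non-strict ordering).
    """
    last_seen = {}
    for t, o in zip(type_ids, offsets):
        if t in last_seen and o < last_seen[t]:
            return False
        last_seen[t] = o
    return True


# Hypothetical data: child 0 holds ints, child 1 holds strings.
children = [[10, 20, 30], ["a", "b"]]
type_ids = [0, 1, 0, 0, 1]

increasing_offsets = [0, 0, 1, 2, 1]  # conforms to the current spec wording
arbitrary_offsets = [2, 1, 0, 2, 0]   # only valid if the spec is relaxed

values_increasing = resolve_dense_union(type_ids, increasing_offsets, children)
values_arbitrary = resolve_dense_union(type_ids, arbitrary_offsets, children)

print(values_increasing)
print(values_arbitrary)
print(offsets_non_decreasing_per_child(type_ids, increasing_offsets))
print(offsets_non_decreasing_per_child(type_ids, arbitrary_offsets))
```

Plain lookups as in resolve_dense_union are indifferent to the order of
the offsets; only code that bakes in something like the second function
(for example, to scan each child sequentially) would break under #2.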

On Tue, Nov 17, 2020 at 5:28 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hello,
>
> The format spec and the C++ implementation disagree on one point:
>
> * The spec says that dense union offsets should be increasing:
> """The respective offsets for each child value array must be in order /
> increasing."""
>
> (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
>
> * The C++ implementation has long had some tests that used deliberately
> non-increasing (even descending) dense union offsets.
>
> (see https://issues.apache.org/jira/browse/ARROW-10580)
>
> I don't know what other implementations, especially Java, expect.
>
> There are obviously two possible solutions:
>
> 1) Fix the C++ implementation and its tests to conform to the format
> spec (which may break compatibility for code producing / consuming dense
> unions with non-increasing offsets)
>
> 2) Relax the format spec to allow arbitrary offsets (which could make
> dense union more like a polymorphic dictionary).
>
> If the first solution is chosen, then another question arises: must the
> offsets be strictly increasing?  Or can a given offset appear several
> times in a row?
> (the latter is currently exploited by the C++ implementation: when
> appending several nulls to a DenseUnionBuilder, only one child null slot
> is added and the same offset is appended multiple times)
>
> Regards
>
> Antoine.
