In principle I'm in favor of #2 -- the only question is what kinds of problems it might pose for forward compatibility.
Note:

* This is completely backward compatible (any data conforming to the spec to the letter will continue to be conforming)
* It is also forward compatible at a protocol level, but code that makes assumptions about the monotonicity of the offsets will break

Since the offset acts effectively as a dictionary index, this doesn't strike me as being so harmful, but I'm interested in the opinions of others.

On Tue, Nov 17, 2020 at 5:28 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hello,
>
> The format spec and the C++ implementation disagree on one point:
>
> * The spec says that dense union offsets should be increasing:
> """The respective offsets for each child value array must be in order /
> increasing."""
>
> (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
>
> * The C++ implementation has long had some tests that used deliberately
> non-increasing (even descending) dense union offsets.
>
> (see https://issues.apache.org/jira/browse/ARROW-10580)
>
> I don't know what other implementations, especially Java, expect.
>
> There are obviously two possible solutions:
>
> 1) Fix the C++ implementation and its tests to conform to the format
> spec (which may break compatibility for code producing / consuming dense
> unions with non-increasing offsets)
>
> 2) Relax the format spec to allow arbitrary offsets (which could make
> dense union more like a polymorphic dictionary).
>
> If the first solution is chosen, then another question arises: must the
> offsets be strictly increasing? Or can a given offset appear several
> times in a row?
> (the latter is currently exploited by the C++ implementation: when
> appending several nulls to a DenseUnionBuilder, only one child null slot
> is added and the same offset is appended multiple times)
>
> Regards
>
> Antoine.
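
To make the compatibility point concrete, here is a minimal sketch in plain Python. It is not the Arrow API: the helper names (`child_offsets`, `offsets_are_non_decreasing`, `resolve`) are hypothetical, and it assumes for simplicity that each type id directly names its child array. It shows that looking up values works with arbitrary offsets, while a consumer that checks per-child ordering would reject descending offsets.

```python
from collections import defaultdict


def child_offsets(type_ids, offsets):
    """Group the offsets by child type id, preserving logical order."""
    per_child = defaultdict(list)
    for type_id, offset in zip(type_ids, offsets):
        per_child[type_id].append(offset)
    return dict(per_child)


def offsets_are_non_decreasing(type_ids, offsets):
    """True if, for every child, its offsets never go backwards.

    Repeated offsets are accepted, which corresponds to the
    non-strict reading of "increasing".
    """
    return all(
        all(a <= b for a, b in zip(seq, seq[1:]))
        for seq in child_offsets(type_ids, offsets).values()
    )


def resolve(type_ids, offsets, children):
    """Resolve each logical slot: the offset indexes into the chosen child."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]


# Two hypothetical children: child 0 holds floats, child 1 holds strings.
children = {0: [1.5, 2.5], 1: ["a", "b"]}
type_ids = [0, 1, 0, 1]

increasing = [0, 0, 1, 1]   # conforms to the current spec wording
descending = [1, 1, 0, 0]   # what option 2 would additionally allow
repeated = [0, 0, 0, 0]     # same child slot referenced several times

values_increasing = resolve(type_ids, increasing, children)
values_descending = resolve(type_ids, descending, children)
values_repeated = resolve(type_ids, repeated, children)
```

Lookup treats the offset purely as an index, so all three layouts resolve without error; only code that relies on `offsets_are_non_decreasing` holding would be affected by relaxing the spec.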