On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 24/06/2020 at 16:57, Wes McKinney wrote:
> > hi folks,
> >
> > As discussed on the recent GitHub PR [1], as a means of reconciling
> > the long-standing cross-implementation incompatibilities with Union
> > types, it's been proposed to remove the top-level validity bitmap from
> > the Union data layout and let validity be determined exclusively by
> > the child arrays of the union. So the only additional data needed to
> > form a union are the type ids (and for the dense union, the offsets).
> >
> > I do not think this change meaningfully alters the semantics of Union
> > types and I think it also simplifies their construction, so I would be
> > in favor of making it for 1.0.0.
>
> So it sounds like this may break compatibility with existing uses
> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> Arrow); not only on the API side, but on the data side.

Right. However, I don't think these changes will be very disruptive,
and we always knew that this disruption was possible because of the
hitherto unreconciled issues with Unions. The applications that I'm
aware of that use Union serialization (e.g. Ray) use it only for
ephemeral serialization.
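
To make the proposed layout concrete, here is a minimal pure-Python
sketch of a dense union with no top-level validity bitmap. It is only an
illustration, not the Arrow C++ implementation: the helper names are
hypothetical, children are modeled as plain lists with None standing in
for a null slot, and type ids are assumed to equal child indices.

```python
from typing import Any, List, Sequence


def dense_union_get(type_ids: Sequence[int], offsets: Sequence[int],
                    children: Sequence[Sequence[Any]], i: int) -> Any:
    """Return the logical value at slot i, or None if the child slot is null.

    Assumption: type_ids[i] is directly the index of the child array.
    """
    child = children[type_ids[i]]
    return child[offsets[i]]


def dense_union_is_valid(type_ids: Sequence[int], offsets: Sequence[int],
                         children: Sequence[Sequence[Any]], i: int) -> bool:
    """Validity comes only from the selected child; the union has no bitmap."""
    return dense_union_get(type_ids, offsets, children, i) is not None


# Two children: an int child with one null, and a string child.
ints: List[Any] = [1, None, 3]
strs: List[Any] = ["a", "b"]
children = [ints, strs]

# The only union-level data: type ids and (for dense unions) offsets.
type_ids = [0, 1, 0, 1, 0]
offsets = [0, 0, 1, 1, 2]

values = [dense_union_get(type_ids, offsets, children, i)
          for i in range(len(type_ids))]
validity = [dense_union_is_valid(type_ids, offsets, children, i)
            for i in range(len(type_ids))]
```

Slot 2 is null here only because the int child is null at offset 1;
nothing at the union level records that.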

In general, I think that we should be bumping the metadata version [1]
for 1.0.0 to create a forcing function for upgrade to the
format-stable line of libraries. The C++/Python libraries could have a
"compatibility mode" (like the "write_legacy_ipc_format" options) that
writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
(like unions -- which are not needed for Spark for example) disabled.

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
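
As a rough sketch of what such a compatibility mode could look like (all
names here are hypothetical, including "V5" as a label for the bumped
version and the "legacy_v4" flag; this is not an existing API), the
writer would pick the metadata version up front and refuse to write
unions when asked for V4:

```python
from enum import Enum
from typing import Iterable


class MetadataVersion(Enum):
    # Labels only; "V5" is an assumed name for the bumped version.
    V4 = "V4"
    V5 = "V5"


# Hypothetical type names for features disabled in compatibility mode.
UNION_TYPES = {"dense_union", "sparse_union"}


def choose_metadata_version(field_types: Iterable[str],
                            legacy_v4: bool = False) -> MetadataVersion:
    """Pick the metadata version to write for a schema.

    With legacy_v4=True, write V4 for older readers, but reject schemas
    that contain unions, since their layout differs between versions.
    """
    if not legacy_v4:
        return MetadataVersion.V5
    unions = [t for t in field_types if t in UNION_TYPES]
    if unions:
        raise ValueError(
            "cannot write %s with MetadataVersion::V4" % ", ".join(unions))
    return MetadataVersion.V4
```

A schema with only primitive fields would still be writable as V4, so
consumers that never use unions would not be affected.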

> Regards
>
> Antoine.
