Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Jacques Nadeau Tue, 20 Mar 2018 10:05:51 -0700

>
> I may have missed something, but I'm not remembering either the points
> re: JavaScript or decimals. My understanding is that we have been
> discussing how to handle a union-of-complex-types -- the Union
> implementation in Java does not support this. Could you clarify or
> refer to prior mailing list threads?
>


Sorry, let me clarify.

The original thinking was that unions could follow one of two behaviors: a
non-collapsing intermediate node behavior and an intermediate node
collapsing behavior (a.k.a. single-primitive behavior). For example, if we
have the following records and types (imagine two different sensor
generations):

sensor_gen1: {
  ts: <timestamp(nanos)>,
  info: {
    metric: <utf8>,
    value: <double>,
    variance: <double>
  }
}

(a.k.a. struct<
  ts:timestamp(nanos),
  info: struct<
    metric: utf8,
    value: double,
    variance: double
  >
>)

sensor_gen2: {
  ts: <timestamp(nanos)>,
  info: {
    metric: <utf8>,
    value: <int64>,
    tolerance: <double>
  }
}

(a.k.a. struct<
  ts:timestamp(nanos),
  info: struct<
    metric: utf8,
    value: int64,
    tolerance: double
  >
>)


We have two possible unions that could be created:

The non-collapsing behavior:

struct<
  ts:timestamp(nanos),
  info: union<
    struct<
      metric: utf8,
      value: double,
      variance: double
    >,
    struct<
      metric: utf8,
      value: int64,
      tolerance: double
    >
  >
>

Or the collapsing behavior:

struct<
  ts:timestamp(nanos),
  info: struct<
    metric: utf8,
    value: union<double, int64>,
    tolerance: double,
    variance: double
  >
>
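
As a rough illustration of the collapsing behavior, here is a minimal
Python sketch. The representation is an assumption made only for this
example (structs as dicts, leaf types as strings, a union as a tuple
tagged "union"), and collapse() is a hypothetical helper, not part of any
Arrow library. It merges exactly two types and does not flatten nested
unions.

```python
# Hypothetical type representation for illustration only:
#   struct -> dict of field name to type
#   leaf   -> string such as "utf8" or "int64"
#   union  -> tuple ("union", left, right)

def collapse(a, b):
    """Merge two types, pushing unions down to the nodes that differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = {}
        # Keep the fields of `a` in order, then the fields only `b` has.
        for name in list(a) + [n for n in b if n not in a]:
            if name in a and name in b:
                merged[name] = collapse(a[name], b[name])
            elif name in a:
                merged[name] = a[name]
            else:
                merged[name] = b[name]
        return merged
    if a == b:
        return a
    # The two types differ and are not both structs: union them here.
    return ("union", a, b)


sensor_gen1 = {
    "ts": "timestamp(nanos)",
    "info": {"metric": "utf8", "value": "double", "variance": "double"},
}
sensor_gen2 = {
    "ts": "timestamp(nanos)",
    "info": {"metric": "utf8", "value": "int64", "tolerance": "double"},
}

merged = collapse(sensor_gen1, sensor_gen2)
# merged["info"]["value"] is ("union", "double", "int64")
```

Only the value field ends up as a union; metric, variance and tolerance
stay directly addressable without picking a union branch.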

For generalized data processing (e.g. a SQL system), I consider the latter
to be optimal as it allows analysts to deal with sameness without having to
dereference to a particular union branch. To that end, we talked about
introducing a "single-primitive" (a.k.a. "JavaScript") union behavior that
would operate this way. It would be defined by only allowing one of each
variety of type at any intermediate node of the hierarchy. In other words, a
struct could never contain two structs or two lists. (It also couldn't
contain two int64s or two int32s.) This is how the Java library behaves. The
format simplification that is then possible would be that these names would
be directly mapped to known positions (e.g. struct is always in position 1
and list is always in position 2, etc.). The Java library doesn't try to do
the latter at the moment (it used to, but the definition wasn't clear).

The single-primitive behavior in general works very well. It also doesn't
prevent a user from having a set of multiple unions that they want to
dereference, but it does require that each of those branches is named via a
struct rather than using positions in unions. In other words, it doesn't
allow for positional union dereferencing. The one place where it becomes
challenging is when a leaf node is not simple. For example, decimal(30,2)
combined with decimal(30,4). In this case, what should the behavior be?
Following a single-primitive model would suggest that this is only possible
if you named them, e.g. struct<dec30_2: decimal(30,2), dec30_4:
decimal(30,4)>, but that seems arbitrary since I can also create
union<int32,int64> (which feels very much the same). The problem compounds
as we have added more information to other leaf types (e.g.
timestamp(millis) and timestamp(nanos)).
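
To make the decimal problem concrete, here is a small Python sketch of a
single-primitive check. It assumes, purely for illustration, that the
"variety" of a type is its name with any parameters in parentheses
stripped off; kind() and single_primitive_conflicts() are hypothetical
helpers, not an existing API.

```python
def kind(type_name):
    """Assumed notion of type variety: the name before any '(' parameters."""
    return type_name.split("(", 1)[0].strip()


def single_primitive_conflicts(members):
    """Return pairs of union members that share the same variety."""
    first_seen = {}
    conflicts = []
    for member in members:
        k = kind(member)
        if k in first_seen:
            conflicts.append((first_seen[k], member))
        else:
            first_seen[k] = member
    return conflicts


ok = single_primitive_conflicts(["int32", "int64"])
decimals = single_primitive_conflicts(["decimal(30,2)", "decimal(30,4)"])
timestamps = single_primitive_conflicts(
    ["timestamp(millis)", "timestamp(nanos)"]
)
```

Under that assumption union<int32,int64> passes, while the two decimals
and the two timestamps are reported as conflicts, which is exactly the
inconsistency described above.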

So, my suggestion that started the thread was that this single-primitive
behavior not be part of the format but be a choice of the implementation.
In terms of the way to expose the union of structs scenario in Java, I
propose that we implement that as named structs for now and enhance the
behavior if people have use cases that need alternative APIs (and are
willing to invest in an arbitrary approach without disrupting the existing
APIs).


> >   - Interval Day to Seconds: 8 bytes representing number of
> >     milliseconds.
> >     - Interval Year to Months: 4 bytes representing number of months.
>
> Yes, I'm supportive of this. The one addition is that we need to add a
> "unit" field to the metadata to support finer granularity than
> milliseconds -- the idea is that we should support the same units as
> Timestamp so that a difference of timestamps produces an interval (aka
> timedelta). We have this data arising already in Python, for example,
> but we cannot represent it in Arrow at the moment, so this has been a
> rough edge for users.
>
>
Agree on units.
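
A minimal Python sketch of why the unit matters, using only the standard
library. The unit names ("s", "ms", "us", "ns") and encode_interval() are
assumptions for illustration and not the proposed metadata. The idea is
that a timestamp difference is stored as an integer count of the chosen
unit, and a unit that is too coarse cannot hold it without loss.

```python
from datetime import datetime

# Assumed unit names, for illustration only.
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def encode_interval(delta, unit):
    """Return `delta` as an integer count of `unit`; raise if that is lossy."""
    # Exact microsecond count, avoiding float rounding in total_seconds().
    total_us = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds
    scaled = total_us * UNITS_PER_SECOND[unit]
    if scaled % 10**6 != 0:
        raise ValueError(f"{delta!r} is not a whole number of {unit}")
    return scaled // 10**6


# A difference of two timestamps that are 1500 microseconds apart.
delta = datetime(2018, 3, 20, 10, 5, 51, 1500) - datetime(2018, 3, 20, 10, 5, 51)

as_micros = encode_interval(delta, "us")
as_nanos = encode_interval(delta, "ns")
```

Calling encode_interval(delta, "ms") raises ValueError here, since 1.5
milliseconds does not fit a millisecond-only representation.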
