Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Paul Taylor Tue, 20 Mar 2018 12:20:55 -0700

Jumping in b/c I did the JS Union implementations. I inferred the behavior from 
what I understood the C++ and Java to be doing, so I may have misunderstood how 
they should work.


> To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. 


Just to clarify, Jacques: are you referencing how the ArrowJS Unions work 
today, or using JavaScript as an adjective to describe the behavior you'd like 
to see?

If the former, I may have misunderstood the distinction between Dense and 
Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a 
valueOffsets buffer to also map idx -> child_idx). I'm happy to review the 
implementations if this behavior is incorrect.
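
To make the two mappings concrete, here is a minimal sketch of the lookup as described above. The interfaces and function names are hypothetical (they are not the ArrowJS classes), and it assumes each type id can be used directly as an index into the children array.

```typescript
// Hypothetical, simplified model of the two union layouts (not the ArrowJS API).
type Child = ReadonlyArray<unknown>;

interface SparseUnionSketch {
  typeIds: Int8Array; // idx -> child_id
  children: Child[];  // every child is as long as typeIds
}

interface DenseUnionSketch {
  typeIds: Int8Array;       // idx -> child_id
  valueOffsets: Int32Array; // idx -> child_idx
  children: Child[];        // each child holds only its own values
}

// Sparse: the row index is reused as the index into the selected child.
function getSparse(u: SparseUnionSketch, idx: number): unknown {
  return u.children[u.typeIds[idx]][idx];
}

// Dense: valueOffsets supplies the index into the selected child.
function getDense(u: DenseUnionSketch, idx: number): unknown {
  return u.children[u.typeIds[idx]][u.valueOffsets[idx]];
}

// The same logical column [1, "a", 2, "b"] in both layouts.
const sparse: SparseUnionSketch = {
  typeIds: Int8Array.of(0, 1, 0, 1),
  children: [[1, null, 2, null], [null, "a", null, "b"]],
};
const dense: DenseUnionSketch = {
  typeIds: Int8Array.of(0, 1, 0, 1),
  valueOffsets: Int32Array.of(0, 0, 1, 1),
  children: [[1, 2], ["a", "b"]],
};
```

Both `getSparse(sparse, i)` and `getDense(dense, i)` return the same logical value for each row; only the child storage differs.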

> It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves.


One way we use the JS Union implementation at Graphistry is representing a 
heterogeneous Struct of IPv4/6 address + port number combinations:

> interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
> interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
> 
> type IPAddresses = DenseUnion<IPv4 | IPv6>
> type IPsAndPorts = Struct<[IPAddresses, Int32 /* <- nullable port vector */]>

In this case, we benefit from the ability to compact the IP addresses into 
dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an 
implicit Dictionary encoding -- useful when representing 200k events on an 
internal network of, say, ~200 IPs.
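
A rough sketch of the "implicit Dictionary encoding" idea follows. Everything in it is hypothetical: `encodeIps`, `decodeIp`, and the interfaces are illustration-only names, addresses are plain strings rather than binary values, and it assumes several rows may share one valueOffsets entry, which this thread does not confirm as valid for the format.

```typescript
// Hypothetical event shape; real data would hold binary addresses.
interface IpEvent { version: 4 | 6; address: string; }

interface DenseIpUnionSketch {
  typeIds: Int8Array;             // row -> child (0 = IPv4, 1 = IPv6)
  valueOffsets: Int32Array;       // row -> slot inside that child
  children: [string[], string[]]; // distinct addresses only
}

// Store each distinct address once; repeated rows reuse the same slot.
function encodeIps(events: IpEvent[]): DenseIpUnionSketch {
  const typeIds = new Int8Array(events.length);
  const valueOffsets = new Int32Array(events.length);
  const children: [string[], string[]] = [[], []];
  const seen: [Map<string, number>, Map<string, number>] = [new Map(), new Map()];
  events.forEach((ev, row) => {
    const child = ev.version === 4 ? 0 : 1;
    let slot = seen[child].get(ev.address);
    if (slot === undefined) {
      slot = children[child].length;
      children[child].push(ev.address);
      seen[child].set(ev.address, slot);
    }
    typeIds[row] = child;
    valueOffsets[row] = slot;
  });
  return { typeIds, valueOffsets, children };
}

// Reading a row back goes through both buffers.
function decodeIp(u: DenseIpUnionSketch, row: number): string {
  return u.children[u.typeIds[row]][u.valueOffsets[row]];
}

const encoded = encodeIps([
  { version: 4, address: "10.0.0.1" },
  { version: 6, address: "::1" },
  { version: 4, address: "10.0.0.1" },
  { version: 4, address: "10.0.0.2" },
]);
// Three IPv4 rows are backed by only two stored IPv4 values.
```

With 200k events over ~200 addresses, the children would stay at roughly 200 entries while typeIds and valueOffsets grow with the row count.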

Would the "single-primitive" proposal restrict the IPAddresses type from 
containing two child Binary Vectors?
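
To illustrate why the question comes up: read literally, "only one of each variety of type" could amount to a check like the one below. This is only a sketch of one possible reading; the `Kind` list and `duplicateKinds` are hypothetical, and whether child metadata such as ipVersion would count as a distinguishing "variety" is exactly what is unclear.

```typescript
// Hypothetical: only the kind of each union child is considered, not its metadata.
type Kind = "binary" | "utf8" | "int32" | "int64" | "double" | "struct" | "list";

// Returns the kinds that appear more than once among a node's children.
function duplicateKinds(children: Kind[]): Kind[] {
  const seen = new Set<Kind>();
  const dups = new Set<Kind>();
  for (const kind of children) {
    if (seen.has(kind)) {
      dups.add(kind);
    } else {
      seen.add(kind);
    }
  }
  return Array.from(dups);
}

// DenseUnion<IPv4 | IPv6>: both children are Binary under this reading.
const ipAddressKinds: Kind[] = ["binary", "binary"];
const ipViolations = duplicateKinds(ipAddressKinds); // ["binary"]

// union<double, int64> has no repeated kind.
const valueViolations = duplicateKinds(["double", "int64"]); // []
```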


> On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <[email protected]> wrote:
> 
>> 
>> I may have missed something, but I'm not remembering either the points
>> re: JavaScript or decimals. My understanding is that we have been
>> discussing how to handle a union-of-complex-types -- the Union
>> implementation in Java does not support this. Could you clarify or
>> refer to prior mailing list threads?
>> 
> 
> Sorry, let me clarify.
> 
> The original thinking was that there is a non-collapsing intermediate node
> behavior and an intermediate node collapsing behavior (a.k.a.
> single-primitive behavior) for unions. For example, if we have the
> following records and types (imagine two different sensor generations):
> 
> sensor_gen1: {
>  ts: <timestamp(nanos)>,
>  info: {
>    metric: <utf8>,
>    value: <double>,
>    variance: <double>
>  }
> }
> 
> (a.k.a. struct<
>  ts:timestamp(nanos),
>  info: struct<
>    metric: utf8,
>    value: double,
>    variance: double
>  >
> >)
> 
> sensor_gen2: {
>  ts: <timestamp(nanos)>,
>  info: {
>    metric: <utf8>,
>    value: <int64>,
>    tolerance: <double>
>  }
> }
> 
> (a.k.a. struct<
>  ts:timestamp(nanos),
>  info: struct<
>    metric: utf8,
>    value: int64,
>    tolerance: double
>  >
> >)
> 
> 
> We have two possible unions that could be created:
> 
> the non-node-collapsing behavior:
> struct<
>  ts:timestamp(nanos),
>  info: union<
>    struct<
>      metric: utf8,
>      value: double,
>      variance: double
>    >,
>    struct<
>      metric: utf8,
>      value: int64,
>      tolerance: double
>    >
>  >
> >
> 
> Or the collapsing behavior:
> 
> struct<
>  ts:timestamp(nanos),
>  info: struct<
>    metric: utf8,
>    value: union<double, int64>,
>    tolerance: double,
>    variance: double
>  >
> >
> 
> For generalized data processing (e.g. a sql system), I consider the latter
> to be optimal as it allows analysts to deal with sameness without having to
> dereference to a particular union branch. To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves. The
> format simplification that is then possible would be that these names would
> be directly mapped to known positions (e.g. struct is always in position 1
> and list is always in position 2, etc.). The Java library doesn't try to do
> the latter at the moment (it used to, but the definition wasn't clear).
> 
> The single-primitive behavior in general works very well. It also doesn't
> limit a user from having a set of multiple unions that they want to
> dereference but does require that each of those branches is named via a
> struct rather than using positions in unions. In other words, it doesn't
> allow for positional union dereferencing. The one place where it becomes
> challenging is when a leaf node is not simple. For example, decimal(30,2)
> combined with decimal(30,4). In this case, what should the behavior be?
> Following a single-primitive model would suggest that this is only possible
> if you named them, e.g., struct<dec30_2: decimal(30,2), dec30_4:
> decimal(30,4)> but that seems arbitrary since I can also create
> union<int32,int64> (which feels very much the same). The problem compounds
> as we have added more information at other leaf types (e.g.
> timestamp(millis) and timestamp(nanos)).
> 
> So, my suggestion that started the thread was that this single-primitive
> behavior not be part of the format but be a choice of the implementation.
> In terms of the way to expose the union of structs scenario in Java, I
> propose that we implement that as named structs for now and enhance the
> behavior if people have use cases that need alternative APIs (and are
> willing to invest in an arbitrary approach without disrupting the existing
> APIs).
> 
> 
>>>  - Interval Day to Seconds: 8 bytes representing number of
>>>    milliseconds.
>>>  - Interval Year to Months: 4 bytes representing number of months.
>> 
>> Yes, I'm supportive of this. The one addition is that we need to add a
>> "unit" field to the metadata to support finer granularity than
>> milliseconds -- the idea is that we should support the same units as
>> Timestamp so that a difference of timestamps produces an interval (aka
>> timedelta). We have this data arising already in Python, for example,
>> but we cannot represent it in Arrow at the moment, so this has been a
>> rough edge for users.
>> 
>> 
> Agree on units.
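
As an illustration of the "difference of timestamps produces an interval" point, here is a minimal sketch under stated assumptions. The type and function names are hypothetical, the unit names mirror the four Timestamp granularities, and picking the finer of the two input units for the result is my assumption rather than anything decided in this thread. Plain numbers are used for brevity; real 8-byte values would need 64-bit integers.

```typescript
// Hypothetical unit tag, matching the Timestamp granularities.
type TimeUnit = "second" | "milli" | "micro" | "nano";

const TICKS_PER_SECOND: Record<TimeUnit, number> = {
  second: 1,
  milli: 1e3,
  micro: 1e6,
  nano: 1e9,
};

interface TimestampSketch { value: number; unit: TimeUnit; }
interface IntervalSketch { value: number; unit: TimeUnit; }

// Convert a tick count from one unit to another (coarse -> fine is exact).
function rescale(value: number, from: TimeUnit, to: TimeUnit): number {
  return value * (TICKS_PER_SECOND[to] / TICKS_PER_SECOND[from]);
}

// Subtract two timestamps; the result carries the finer of the two units.
function timestampDiff(a: TimestampSketch, b: TimestampSketch): IntervalSketch {
  const unit = TICKS_PER_SECOND[a.unit] >= TICKS_PER_SECOND[b.unit] ? a.unit : b.unit;
  return {
    value: rescale(a.value, a.unit, unit) - rescale(b.value, b.unit, unit),
    unit,
  };
}

const delta = timestampDiff({ value: 1500, unit: "milli" }, { value: 1, unit: "second" });
// delta is { value: 500, unit: "milli" }
```

Without a unit on the interval, a nanosecond-level difference like the one Python produces would have to be truncated to milliseconds.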
