Jumping in b/c I did the JS Union implementations. I inferred the behavior from
what I understood the C++ and Java to be doing, so I may have misunderstood how
they should work.
> To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way.
Just to clarify, Jacques: are you referencing how the ArrowJS Unions work
today, or using JavaScript as an adjective to describe the behavior you'd like
to see?
If the former, I may have misunderstood the distinction between Dense and
Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a
valueOffsets buffer to also map idx -> child_idx). I'm happy to review the
implementations if this behavior is incorrect.
> It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves.
One way we use the JS Union implementation at Graphistry is representing a
heterogenous Struct of IPv4/6 address + port number combinations:
> interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
> interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
>
> type IPAddresses = DenseUnion<IPv4 | IPv6>
> type IPsAndPorts = Struct<[IPAddress, Int32 /* <- nullable port vector */]>
In this case, we benefit from the ability to compact the IP addresses into a
dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an
implicit Dictionary encoding -- useful when representing 200k events on an
internal network of say, ~200 IPs.
Would the "single-primitive" proposal restrict the IPAddresses type from
containing two child Binary Vectors?
> On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <[email protected]> wrote:
>
>>
>> I may have missed something, but I'm not remembering either the points
>> re: JavaScript or decimals. My understanding is that we have been
>> discussing how to handle a union-of-complex-types -- the Union
>> implementation in Java does not support this. Could you clarify or
>> refer to prior mailing list threads?
>>
>
> Sorry, let me clarify.
>
> The original thinking was that there is a non-collapsing intermediate node
> behavior and an intermediate node collapsing behavior (a.k.a
> single-primitive behavior) for unions. For example, if we have the
> following records and types (imagine two different sensors generations):
>
> sensor_gen1: {
> ts: <timestamp(nanos)>,
> info: {
> metric: <utf8>,
> value: <double>,
> variance: <double>
> }
> }
>
> (a.k.a. struct<
> ts:timestamp(nanos),
> info: struct<
> metric: utf8,
> value: double,
> variance: double
>>
>> )
>
> sensor_gen2: {
> ts: <timestamp(nanos)>
> info: {
> metric: <utf8>
> value: <int64>
> tolerance: <double>
> }
> }
>
> (a.k.a struct<
> ts:timestamp(nanos),
> info: struct<
> metric: utf8,
> value: int64,
> tolerance: double
>>
>> )
>
>
> We have two possible unions that could be created:
>
> the non-node-collapsing behavior:
> struct<
> ts:timestamp(nanos),
> info: union<
> struct<
> metric: utf8,
> value: double,
> variance: double
>> ,
> struct<
> metric: utf8,
> value: int64,
> tolerance: double
>>
>>
>>
>
> Or the collapsing behavior
>
> struct<
> ts:timestamp(nanos),
> info: union<
> info: struct<
> metric: utf8,
> value: union<double, int64>,
> tolerance: double
> variance: double
>>
>>
>
> For generalized data processing (e.g. a sql system), I consider the latter
> to be optimal as it allows analysts to deal with sameness without having to
> dereference to a particular union branch. To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves. The
> format simplification that is then possible would be that these names would
> be directly mapped to known positions (e.g. struct is always in position 1
> and list is always in position 2, etc.). The java library doesn't try to do
> the latter at the moment (it used to but the definition wasn't clear).
>
> The single-primitive behavior in general works very well. It also doesn't
> limit a user from having a set of multiple unions that they want to
> dereference but does require that each of those branches are named via a
> struct rather than using positions in unions. In other words, it doesn't
> allow for positional union dereferencing. The one place where it becomes
> challenging is when a leaf node is not simple. For example decimal(30,2)
> combined with decimal(30,4). In this case, what should the behavior be?
> Following a simple-primitive model would suggest that this is only possible
> if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
> decimal(30,4)> but that seems arbitrary since I can also create
> union<int32,int64> (which feels very much the same). The problem compounds
> as we have added more information at other leaf types (e.g.
> timestamp(millis) and timestamp(nanos)).
>
> So, my suggestion that started the thread was that this single-primitive
> behavior not be part of the format but be a choice of the implementation.
> In terms of the way to expose the union of structs scenario in Java, I
> propose that we implement that as named structs for now and enhance the
> behavior if people have use cases that need alternative apis (and are
> willing to invest in an arbitrary approach without disrupting the existing
> apis).
>
>
>>> - Interval Day to Seconds: 8 bytes representing number of
>>> milliseconds.
>>> - Interval Year to Months: 4 bytes representing number of months.
>>
>> Yes, I'm supportive of this. The one addition is that we need to add a
>> "unit" field to the metadata to support finer granularity than
>> milliseconds -- the idea is that we should support the same units as
>> TImestamp so that a difference of timestamps produces an interval (aka
>> timedelta). We have this data arising already in Python, for example,
>> but we cannot represent it in Arrow at the moment, so this has been a
>> rough edge for users.
>>
>>
> Agree on units.