Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

2018-03-31 Thread Wes McKinney
Sorry for adding to the confusion with my e-mail. I now understand that
when you said "JavaScript" you meant "JSON".

It sounds like in Java you will want to have a specialized union that
cannot have nested types as its children. Perhaps this could implement
a more generic union API, but I will leave that up to the Java users.

Limiting oneself to unions of primitives only really makes sense
when you have tight control over the data ingest path and the schema
creation. For example, suppose we have two streams of JSON records:

[{arg: 0}, {arg: 1}, {arg: 2}]
[{arg: "foo"}, {arg: "bar"}, {arg: "baz"}]

If you have a JSON record reader that is turning these into Arrow
columnar, then you can make the schema be:

Struct<
  arg: Union<Int64, String>
>

If these records originated independently, and were later spliced
together to create a merged data structure, you might have:

Union<
  f0: Struct<arg: Int64>,
  f1: Struct<arg: String>
>

It may not be feasible, in general, to write a schema as the first
case. For a particular JSON reader, it may be that you exclusively
want unions to contain only primitive types. It may be helpful to
implement conversion functions to "rewrite" data structures to push
down unions into the leaves -- it might be that a particular data
processing engine isn't able to handle unions of nested types.

I'm not sure how you currently handle cases like:

[{arg: 1}, {arg: [1, 2, 3]}]

Here the union is Union<Int64, List<Int64>>
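
As a rough illustration of "pushing unions down into the leaves", here is
a minimal Python sketch that infers a type from decoded JSON records and
merges record types field by field. The type descriptors (plain strings
and tuples) and the functions infer, merge and infer_stream are
hypothetical names made up for this example; they are not part of any
Arrow library, and nullability of missing fields is ignored.

```python
def kind(t):
    """Return the kind of a type descriptor: a primitive name, or 'list'/'struct'/'union'."""
    return t if isinstance(t, str) else t[0]


def merge(a, b):
    """Merge two type descriptors, keeping at most one member of each kind in a union."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    ka, kb = kind(a), kind(b)
    if ka == kb == "struct":
        # Merge field by field, so a union ends up on the field, not on the struct.
        fa, fb = dict(a[1]), dict(b[1])
        names = sorted(set(fa) | set(fb))
        return ("struct", tuple((n, merge(fa.get(n), fb.get(n))) for n in names))
    if ka == kb == "list":
        return ("list", merge(a[1], b[1]))
    # Different kinds: build a flat union, merging members that share a kind.
    by_kind = {}
    for t in (a, b):
        for m in (t[1] if kind(t) == "union" else (t,)):
            by_kind[kind(m)] = merge(by_kind.get(kind(m)), m)
    return ("union", tuple(by_kind[k] for k in sorted(by_kind)))


def infer(value):
    """Infer a type descriptor for one decoded JSON value."""
    if isinstance(value, bool):  # bool must be tested before int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        elem = None  # element type of an empty list stays unknown (None)
        for v in value:
            elem = merge(elem, infer(v))
        return ("list", elem)
    if isinstance(value, dict):
        return ("struct", tuple((k, infer(v)) for k, v in sorted(value.items())))
    raise TypeError(f"unsupported JSON value: {value!r}")


def infer_stream(records):
    """Infer one merged type for a sequence of records."""
    t = None
    for record in records:
        t = merge(t, infer(record))
    return t


print(infer_stream([{"arg": 0}, {"arg": 1}, {"arg": "foo"}]))
# ('struct', (('arg', ('union', ('int', 'string'))),))
print(infer_stream([{"arg": 1}, {"arg": [1, 2, 3]}]))
# ('struct', (('arg', ('union', ('int', ('list', 'int')))),))
```

Because two structs are always merged field by field, records spliced
together from independent sources still come out as a struct with a union
field, not as a union of structs. Whether that rewrite is desirable depends
on the reader and the processing engine, as discussed above.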

So it sounds like the work we have before us is:

* Reconcile Java and C++ union implementations, at least. Implement
Union integration tests
* Change interval implementation, implement in C++ and Java, add
integration tests
* Add integration tests for FixedSizeList in C++

I also don't think we are testing schema and field-level metadata in
the integration tests yet. It would be nice to see that happen.

I hope to see this work proceeding -- it's been a lot of slaving over
small details to bring the project to the current level of
completeness and compatibility. We're close to getting over the finish
line to be able to declare a period of binary stability with a 1.0.0
release. I would suggest the following versioning scheme:

1 FORMAT
0 MAJOR
0 MINOR

If in the future we need to do true patch releases, then we could add
a fourth version number.
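
A minimal Python sketch of how such a version string might be parsed,
assuming the three components are written dot-separated in the order
FORMAT, MAJOR, MINOR, with an optional fourth patch component. The names
ArrowVersion and parse_version are hypothetical and exist only for this
illustration.

```python
from typing import NamedTuple, Optional


class ArrowVersion(NamedTuple):
    format_version: int
    major: int
    minor: int
    patch: Optional[int] = None  # only set if a fourth component is present


def parse_version(text: str) -> ArrowVersion:
    """Parse 'FORMAT.MAJOR.MINOR' or 'FORMAT.MAJOR.MINOR.PATCH'."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 components, got {len(parts)}")
    return ArrowVersion(*parts)


print(parse_version("1.0.0"))
# ArrowVersion(format_version=1, major=0, minor=0, patch=None)
```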

- Wes

On Wed, Mar 21, 2018 at 11:01 AM, Jacques Nadeau  wrote:
> I'm using javascript as an adjective, sorry about the confusion Paul. And
> maybe JSON would be a better adjective (but neither is good).
>
> With your example of two Binary vectors that have different metadata, yes
> the single-primitive model would argue that they should either be a single
> binary vector or a struct that indicates the type. However the counter I
> outlined above is that you could say single-primitive is really
> single-container and unique leaf--e.g. int32,int64 is allowed but
> int32,int32 is not and thus binary(md1), binary(md2) would be allowed. In
> this context we could treat metadata as part of the leaf node type
> signature.
>
> Random aside, given your description I'm wondering if you're using
> DenseUnion because you really need a DenseStruct.
>
>
>
> On Tue, Mar 20, 2018 at 12:18 PM, Paul Taylor  wrote:
>
>> Jumping in b/c I did the JS Union implementations. I inferred the behavior
>> from what I understood the C++ and Java to be doing, so I may have
>> misunderstood how they should work.
>>
>> > To that end, we talked about introducing a "single-primitive"
>> > (a.k.a. "javascript") union behavior that would operate this way.
>>
>>
>> Just to clarify, Jacques: are you referencing how the ArrowJS Unions work
>> today, or using JavaScript as an adjective to describe the behavior you'd
>> like to see?
>>
>> If the former, I may have misunderstood the distinction between Dense and
>> Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a
>> valueOffsets buffer to also map idx -> child_idx). I'm happy to review the
>> implementations if this behavior is incorrect.
>>
>> > It would be defined by only allowing one of each
>> > variety of type at any intermediate node of hierarchy. In other words, a
>> > struct could never contain two structs or two lists. (It also couldn't
>> > contain two int64 or int32). This is how the Java library behaves.
>>
>>
>> One way we use the JS Union implementation at Graphistry is to represent a
>> heterogeneous Struct of IPv4/6 address + port number combinations:
>>
>> > interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
>> > interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
>> >
>> > type IPAddresses = DenseUnion<[IPv4, IPv6]>
>> > type IPsAndPorts = Struct<[IPAddresses, Int32 /* <- nullable port vector */]>
>>
>> In this case, we benefit from the ability to compact the IP addresses into
>> dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an
>> implicit Dictionary encoding -- useful when representing 200k events on an
>> internal network of, say, ~200 IPs.
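
The buffer roles described in the quoted message (typeIds mapping
idx -> child_id, and valueOffsets mapping idx -> child_idx for the dense
case) can be sketched in Python as follows. This is only an illustration
under those stated assumptions: the DenseUnion class, sparse_get and
build_dense_union are hypothetical names, not the ArrowJS API, and the
sketch does not settle whether the format specification allows offsets to
repeat the way the dictionary-like usage does.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class DenseUnion:
    children: List[list]      # one compact value list per child type
    type_ids: List[int]       # idx -> child_id
    value_offsets: List[int]  # idx -> child_idx (position inside that child)

    def __len__(self) -> int:
        return len(self.type_ids)

    def __getitem__(self, idx: int) -> Any:
        return self.children[self.type_ids[idx]][self.value_offsets[idx]]


def sparse_get(children: List[list], type_ids: List[int], idx: int) -> Any:
    """Sparse layout: every child is as long as the union, so idx is reused directly."""
    return children[type_ids[idx]][idx]


def build_dense_union(values: list, child_of: Callable[[Any], int],
                      num_children: int) -> DenseUnion:
    """Store each distinct value once per child and point repeated values at it."""
    children: List[list] = [[] for _ in range(num_children)]
    seen: List[dict] = [{} for _ in range(num_children)]
    type_ids, value_offsets = [], []
    for v in values:
        c = child_of(v)
        if v not in seen[c]:
            seen[c][v] = len(children[c])
            children[c].append(v)
        type_ids.append(c)
        value_offsets.append(seen[c][v])
    return DenseUnion(children, type_ids, value_offsets)


# Child 0 holds 4-byte (IPv4) addresses, child 1 holds 16-byte (IPv6) addresses.
addrs = [bytes([10, 0, 0, 1]), bytes(16), bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])]
union = build_dense_union(addrs, lambda b: 0 if len(b) == 4 else 1, 2)
print(union.type_ids)       # [0, 1, 0, 0]
print(union.value_offsets)  # [0, 0, 0, 1]
```

With many events and few distinct addresses, the children stay small while
type_ids and value_offsets grow with the number of events, which is the
compaction effect the message describes.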

Re: arrow-dev gmane mirror?

2018-03-31 Thread Wes McKinney
+1

On Sat, Mar 31, 2018 at 2:39 PM, Antoine Pitrou  wrote:
>
> Hello,
>
> I'd like to propose we create a gmane mirror of this mailing-list
> (arrow-dev).  gmane.org is a service providing a NNTP gateway for
> mailing-lists.  There are already gmane mirrors for many Apache
> mailing-lists such as Maven and httpd, living under the
> "gmane.comp.apache" hierarchy.  The newsgroup name for this mailing-list
> would therefore probably be "gmane.comp.apache.arrow.devel".
>
> If nobody opposes, I'll submit the creation of a mirror in a few days.
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-2374) [Rust] Add support for array of List

2018-03-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2374:
-

 Summary: [Rust] Add support for array of List
 Key: ARROW-2374
 URL: https://issues.apache.org/jira/browse/ARROW-2374
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andy Grove


Add support for List in Array types. Look at Utf8 which wraps List to 
see how this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


arrow-dev gmane mirror?

2018-03-31 Thread Antoine Pitrou

Hello,

I'd like to propose we create a gmane mirror of this mailing-list
(arrow-dev).  gmane.org is a service providing a NNTP gateway for
mailing-lists.  There are already gmane mirrors for many Apache
mailing-lists such as Maven and httpd, living under the
"gmane.comp.apache" hierarchy.  The newsgroup name for this mailing-list
would therefore probably be "gmane.comp.apache.arrow.devel".

If nobody opposes, I'll submit the creation of a mirror in a few days.

Regards

Antoine.


[jira] [Created] (ARROW-2373) [Rust] Investigate alternate ways to achieve byte-aligned memory

2018-03-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2373:
-

 Summary: [Rust] Investigate alternate ways to achieve byte-aligned 
memory
 Key: ARROW-2373
 URL: https://issues.apache.org/jira/browse/ARROW-2373
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove


The current Rust implementation of Buffer relies on libc and unsafe code to 
allocate byte-aligned buffers.

It would be nice to find a safer way to do this.

Rust 1.25 added support for byte-aligned structs, but that doesn't help.

I have started a discussion in the Rust internals forum about adding support for 
byte-aligned Vec (https://internals.rust-lang.org/t/byte-aligned-vec/7200).

RawVec has been suggested too, but that requires Rust nightly, which doesn't 
seem great to me.





Re: How nested can List types be?

2018-03-31 Thread Wes McKinney
hi Andy,

In List<T>, the type T can be any other type, including other lists.
In the C++ library we limit nesting to 64 levels by default, but this
is a failsafe parameter to prevent accidental creation of deeply nested
schemas (see arrow/ipc/message.h).
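
A minimal Python sketch of such a failsafe, assuming list types are
modelled as nested ("list", value_type) tuples. The helper names list_of
and list_depth are hypothetical, the default of 64 is taken from the
paragraph above, and this is not how the C++ library implements the check.

```python
def list_of(value_type, levels=1):
    """Wrap value_type in `levels` layers of list nesting."""
    t = value_type
    for _ in range(levels):
        t = ("list", t)
    return t


def list_depth(t, max_depth=64):
    """Return the list nesting depth of t, raising if it exceeds max_depth."""
    depth = 0
    while isinstance(t, tuple) and t[0] == "list":
        depth += 1
        if depth > max_depth:
            raise ValueError(f"list nesting exceeds {max_depth} levels")
        t = t[1]
    return depth


print(list_depth(list_of("int32", 3)))  # 3
```

The check is iterative so that a pathologically deep type fails with a
clear error instead of exhausting the call stack.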

HTH
Wes

On Sat, Mar 31, 2018 at 11:40 AM, Andy Grove  wrote:
> The layout document gives examples of List and List but it
> wasn't clear to me from this document if the intent is to support nesting
> at deeper levels such as List>.
>
> Could someone clear that up for me?
>
> Thanks,
>
> Andy.


How nested can List types be?

2018-03-31 Thread Andy Grove
The layout document gives examples of List and List but it
wasn't clear to me from this document if the intent is to support nesting
at deeper levels such as List>.

Could someone clear that up for me?

Thanks,

Andy.