Re: Add a UUID type to the Arrow format

2018-02-15 Thread Wes McKinney
hi Uwe,

This seems like a good idea to me given the widespread use of UUIDs,
and it would make working with them more natural for application developers.

- Wes

On Tue, Feb 13, 2018 at 10:03 AM, Uwe L. Korn wrote:
> Hello,
>
> I just opened https://issues.apache.org/jira/browse/ARROW-2152 to start the
> discussion about adding a UUID type to the Arrow format specification. In
> essence a UUID is simply a 128-bit array, but there are often special classes
> used for it, e.g. java.util.UUID in Java and uuid.UUID in Python. These
> provide special functions for UUIDs, and the knowledge that a column holds
> UUIDs can sometimes be beneficial during computations. Other data systems
> like Postgres or Parquet also have a special UUID type.
>
> While there is only a small difference from a 128-bit fixed-size binary
> array, I think providing the respective object model accessor is already a
> good benefit.
>
> Uwe
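
To make the "128-bit fixed-size binary plus an object model accessor" idea
concrete, here is a minimal sketch using only the Python standard library. It
does not use any Arrow API; the helper names pack_uuids and unpack_uuid are
hypothetical and only illustrate how a column of 16-byte values could be read
back as uuid.UUID objects.

```python
import uuid

# A UUID is 16 bytes (128 bits), so a column of UUIDs can be modelled as
# fixed-size 16-byte binary values stored back to back.
UUID_WIDTH = 16


def pack_uuids(values):
    """Pack uuid.UUID objects into one contiguous bytes buffer (hypothetical helper)."""
    return b"".join(v.bytes for v in values)


def unpack_uuid(buffer, index):
    """Read element `index` back as a uuid.UUID, i.e. the object-model accessor."""
    start = index * UUID_WIDTH
    return uuid.UUID(bytes=buffer[start:start + UUID_WIDTH])


ids = [uuid.UUID(int=1), uuid.UUID("12345678-1234-5678-1234-567812345678")]
buf = pack_uuids(ids)
second = unpack_uuid(buf, 1)
```

The storage is identical to plain fixed-size binary; the only thing a
dedicated type adds in this sketch is that readers get a uuid.UUID back
instead of raw bytes.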


Re: indicating an error during a streaming response

2018-02-15 Thread Wes McKinney
hi Ben,

The stream writer and reader classes that exist now are fairly
primitive. We've been discussing a richer RPC protocol using HTTP/2 or
gRPC that would permit more nuanced control messages, errors, and
the like. This would be a great area for new members of the dev community
to make some valuable contributions. Since Dremio uses Arrow heavily,
there may be some tech that can be repurposed or at least designs
reused to streamline the process.

Bottom line is that this is a use case that we want to be
well-supported out of the box in the Arrow libraries.

Would be great to hear from some others. Copying Nong Li, who was
looking into gRPC integration sometime last year.

best,
Wes

On Mon, Feb 12, 2018 at 7:41 PM, Ben Duffield wrote:
> Hello there,
>
> I have a java server (jetty, jersey) which produces a streaming arrow
> response (via a javax.ws.rs.core.StreamingOutput).
>
> The stream of arrow is produced ‘on the fly’ and occasionally something goes
> wrong once we’ve already begun streaming. In this case we throw an exception
> which propagates up to jersey, which then closes the response normally
> (there seems to be no good way to get jersey to hand the exception back up
> to jetty to cause an abort on the response stream). We’re sending whole
> arrow batches, so the client just sees a regular arrow stream (i.e. the
> stream up to and including the last entire arrow row batch we sent) and so
> has no way to detect that an error occurred.
>
> Maybe there is actually a way to get jetty to hang up the connection without
> proper termination (which the client can then detect) – although this does
> mean we have no ability to indicate why the error happened – but we have so
> far been unable to figure out how to do this.
>
> Another thought was having some way in a streaming arrow response to
> indicate this kind of error inline in the stream/be able to ‘poison’ an
> arrow stream – to indicate to the client that something exceptional happened
> part way through the streaming response. Maybe some way of sending an EOS
> with an error flag set.
>
> There are probably other, arguably more elegant ways to handle this kind of
> thing e.g. if we can assume http2 everywhere something like
> https://http2.github.io/http2-spec/#StreamErrorHandler but it seems
> currently the tooling isn’t really there/at any rate it could be nice to
> have a transport-independent way of indicating an error.
>
> I was wondering if anyone has dealt with anything similar to this/has any
> pointers/thoughts on whether adding to the streaming format makes sense
> here?
>
> Best,
>
> Ben
>
>
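
The "EOS with an error flag" idea can be sketched independently of any
transport. The framing below is purely hypothetical and is NOT part of the
Arrow streaming format: each frame carries a kind byte and a length, and the
stream must end with either a clean end-of-stream frame or an error frame
that carries a message. A stream that simply stops is then distinguishable
from one that finished normally.

```python
import io
import struct

# Hypothetical frame layout (an assumption, not the Arrow format):
#   1 byte kind, 4 byte little-endian payload length, then the payload.
KIND_BATCH, KIND_EOS_OK, KIND_EOS_ERROR = 0, 1, 2
HEADER = struct.Struct("<BI")


def write_frame(out, kind, payload=b""):
    """Write one frame: header followed by the payload bytes."""
    out.write(HEADER.pack(kind, len(payload)))
    out.write(payload)


def read_frames(src):
    """Yield batch payloads; raise if the stream is poisoned or truncated."""
    while True:
        header = src.read(HEADER.size)
        if len(header) < HEADER.size:
            raise EOFError("stream ended without an end-of-stream frame")
        kind, length = HEADER.unpack(header)
        payload = src.read(length)
        if len(payload) < length:
            raise EOFError("truncated frame")
        if kind == KIND_BATCH:
            yield payload
        elif kind == KIND_EOS_OK:
            return
        elif kind == KIND_EOS_ERROR:
            raise RuntimeError(payload.decode("utf-8"))
        else:
            raise ValueError(f"unknown frame kind {kind}")


# Server side: two batches succeed, then something goes wrong mid-stream.
stream = io.BytesIO()
write_frame(stream, KIND_BATCH, b"batch-0")
write_frame(stream, KIND_BATCH, b"batch-1")
write_frame(stream, KIND_EOS_ERROR, b"upstream query failed")
stream.seek(0)

# Client side: keep the batches that arrived and capture the error reason.
received, error = [], None
try:
    for batch in read_frames(stream):
        received.append(batch)
except RuntimeError as exc:
    error = str(exc)
```

In this sketch the client keeps every complete batch it received and still
learns why the stream stopped, which is the property the connection hang-up
approach cannot provide.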


Re: Decimal NaNs

2018-02-15 Thread Wes McKinney
hey Phillip,

Replying so we have a record on the mailing list about this. For the
user's use case, it seems that having reasonable null support for
decimals incoming from Python would be sufficient.

I agree it's probably not worth supporting NaN decimals in the Arrow
format for now given the sporadic support across the ecosystem.

- Wes

On Tue, Feb 13, 2018 at 8:50 PM, Phillip Cloud wrote:
> Recently someone opened ARROW-2145 asking for support for non-finite
> values, such as NaN and infinity.
> It may seem like a “no-brainer” to implement this, but there’s no real
> consistency on how to implement it or *even to implement it at all*:
>
> - Java BigDecimal: raises an exception for NaN or inf, as per the docs
> - Boost multiprecision supports it, but not for fixed-precision decimal
>   numbers (cpp_bin_float/cpp_dec_float, which are arbitrary-precision
>   floating point, not fixed point)
> - Python supports it using flags and special string exponents (and it
>   supports both signaling and quiet NaNs)
> - Impala doesn’t support it (returns null when you try to perform
>   CAST(CAST('NaN' AS DOUBLE) AS DECIMAL))
> - Postgres supports it with its numeric type by using the sign member of
>   the C struct backing numeric values
> - MySQL: doesn’t even support NaN/inf!
>
> The lack of support for these values across languages likely stems from the
> fact that fixed precision arithmetic by definition must happen on finite
> values, and nan/inf are not finite values therefore they are not supported.
>
> We could go down this rabbit hole in the name of providing support for
> Python decimal.Decimal(), but I’m not sure how useful it is.
>
> No other system except in-memory C++ arrow arrays would be able to operate
> on these values (I suppose we could add a wrapper around BigDecimal that
> has the desired behavior).
>
> For example, writing arrow arrays containing Decimal128 values (with nans
> or infs) to a parquet file seems untenable.
>
> Additionally, if we decided to implement it, we’d likely have to take
> something like the flag approach, which would require a change to the
> metadata (not necessarily a bad thing) that would add two bitmaps to arrow
> Decimal arrays: one for indicating nan-ness and one for indicating inf-ness
> (that’s a ton of overhead IMO when I think it’s likely that most values are
> always finite).
>
> I’m skeptical about whether we should support this.
>
> Thoughts?
>
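
For reference, the Python behaviour described in the list above, and the
"treat non-finite values as null" alternative suggested in the reply, can be
sketched with the standard decimal module alone. The helper to_nullable is a
hypothetical name and this is not how any Arrow library implements the
conversion; it only shows that non-finite decimals are easy to detect and map
to None before they reach a fixed-precision column.

```python
from decimal import Decimal


def to_nullable(values):
    """Replace non-finite decimals (NaN, sNaN, +/-Infinity) with None."""
    return [v if v.is_finite() else None for v in values]


raw = [Decimal("1.50"), Decimal("NaN"), Decimal("-Infinity"), Decimal("sNaN")]
clean = to_nullable(raw)

# Python marks special values with string exponents in the tuple form:
# 'n' for quiet NaN, 'N' for signaling NaN, 'F' for infinity.
exponents = [v.as_tuple().exponent for v in raw]
```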


Re: Making the next JavaScript Arrow release

2018-02-15 Thread Brian Hulette

Wes,

We're still working on generated API documentation (ARROW-951), but that 
doesn't need to hold up the release. I also just opened one more small 
PR, but once we merge the two open JS PRs I think we're ready for a 
release vote.


Brian


On 02/15/2018 04:53 PM, Wes McKinney wrote:

> hi folks,
>
> With all the work that's happened in JavaScript, my understanding from
> Slack and GitHub conversations is that we'd like to do a JS-0.3.0 and
> get updates published thereafter to NPM. I see there's an open PR at
> the moment -- anything else that needs to get done before I kick off a
> release vote?
>
> Thanks,
> Wes




Making the next JavaScript Arrow release

2018-02-15 Thread Wes McKinney
hi folks,

With all the work that's happened in JavaScript, my understanding from
Slack and GitHub conversations is that we'd like to do a JS-0.3.0 and
get updates published thereafter to NPM. I see there's an open PR at
the moment -- anything else that needs to get done before I kick off a
release vote?

Thanks,
Wes


[jira] [Created] (ARROW-2165) enhance AllocatorListener to listen for child allocator addition and removal

2018-02-15 Thread Vanco Buca (JIRA)
Vanco Buca created ARROW-2165:
-

   Summary: enhance AllocatorListener to listen for child allocator addition and removal
       Key: ARROW-2165
       URL: https://issues.apache.org/jira/browse/ARROW-2165
   Project: Apache Arrow
Issue Type: Improvement
Components: Java - Memory
  Reporter: Vanco Buca


AllocatorListener allows a BaseAllocator to track when users allocate memory. 
However, the structure of the users' allocators can't be tracked easily. 

I suggest an enhancement to the AllocatorListener: onNewChild(BufferAllocator) 
and onChildClosed(BufferAllocator)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow for MATLAB?

2018-02-15 Thread Wes McKinney
hi Nick,

I would strongly encourage you to focus your energies on supporting
the general Arrow streaming and file formats as described in
http://arrow.apache.org/docs/ipc.html. It's fine to support Feather,
so long as you restrict yourself to the API defined in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/feather.h

As soon as we have reasonable R bindings for the Arrow C++ libraries,
I would like to replace the internal details of the Feather format
with the Arrow IPC format, deprecating the metadata defined in
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/feather.fbs.

Please let us know how things go and if there's anything we can do to help.

Thanks,
Wes

On Thu, Feb 15, 2018 at 9:48 AM, Nick Haddad wrote:
> Uwe and Joris,
>
> I work for MathWorks. We plan to contribute Feather read/write capability in 
> the near future. Interesting idea about MATLAB and Arrow. We’ll take a look 
> as well.
>
> Thanks
> -Nick
>
>
> On 2/14/18, 3:39 AM, "Joris Peeters" wrote:
>
> Yeah, pinging Mathworks might be worth it. It feels like this is something
> that could be of great value to them. Aside from HDF5 & mat files (both on
> disk), it can be really tedious to efficiently share data from anywhere
> with MATLAB, so it's becoming increasingly isolated.
> I'm going to play around for a few days with Arrow & the Matlab C++ API,
> getting a bit more familiar with it and maybe hacking a small prototype
> together. Currently it's just out-of-hours, though, so don't expect major
> magic. :)
>
> Thanks,
> -J
>
> On Tue, Feb 13, 2018 at 2:33 PM, Phillip Cloud wrote:
>
> > The MathWorks is in the process of starting to contribute. I spoke with
> > them a couple weeks ago about this and they were excited about it. I can
> > ping them to see if they are still interested.
> >
> > On Tue, Feb 13, 2018, 09:24 Uwe L. Korn wrote:
> >
> > > Hello Joris,
> > >
> > > this is only due to lack of someone doing it and probably due to lack
> > > of people that have the experience to do that. I had a short look at
> > > Matlab's C++ API and the interfaces seem to be promising enough
> > > https://de.mathworks.com/help/matlab/matlab-data-array.html that once
> > > someone attempts it, it should not be hard to build.
> > >
> > > If you want to try to take a shot, we are happy to help if there are
> > > problems with the Arrow side of things.
> > >
> > > Uwe
> > >
> > > On Tue, Feb 13, 2018, at 2:41 PM, Joris Peeters wrote:
> > > > Hello,
> > > >
> > > > Is anyone aware of plans (or concrete projects) to add MATLAB
> > > > bindings for Arrow? I'm interested in exchanging data between Java,
> > > > Python, ..., and MATLAB - and Arrow sounds like a great solution.
> > > >
> > > > I couldn't find any pre-existing effort, though, so curious if that
> > > > is due to a lack of interest or because there might be underlying
> > > > reasons that would make this very hard to achieve.
> > > >
> > > > Best,
> > > > -Joris.
> > >
> >
>
>


[jira] [Created] (ARROW-2164) [C++] Clean up unnecessary decimal module refs

2018-02-15 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2164:


         Summary: [C++] Clean up unnecessary decimal module refs
             Key: ARROW-2164
             URL: https://issues.apache.org/jira/browse/ARROW-2164
         Project: Apache Arrow
      Issue Type: Improvement
      Components: C++
Affects Versions: 0.8.0
        Reporter: Phillip Cloud
        Assignee: Phillip Cloud
         Fix For: 0.9.0


See this comment: 
https://github.com/apache/arrow/pull/1610#discussion_r168533239





[jira] [Created] (ARROW-2163) Install apt dependencies separate from built-in Travis commands, retry on flakiness

2018-02-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2163:
---

   Summary: Install apt dependencies separate from built-in Travis commands, retry on flakiness
       Key: ARROW-2163
       URL: https://issues.apache.org/jira/browse/ARROW-2163
   Project: Apache Arrow
Issue Type: Improvement
  Reporter: Wes McKinney


This would also allow us to run the detect-changes script before
installing apt dependencies, so unnecessary builds will terminate faster.





Re: Arrow for MATLAB?

2018-02-15 Thread Nick Haddad
Uwe and Joris, 

I work for MathWorks. We plan to contribute Feather read/write capability in 
the near future. Interesting idea about MATLAB and Arrow. We’ll take a look as 
well.

Thanks
-Nick


On 2/14/18, 3:39 AM, "Joris Peeters" wrote:

Yeah, pinging Mathworks might be worth it. It feels like this is something
that could be of great value to them. Aside from HDF5 & mat files (both on
disk), it can be really tedious to efficiently share data from anywhere
with MATLAB, so it's becoming increasingly isolated.
I'm going to play around for a few days with Arrow & the Matlab C++ API,
getting a bit more familiar with it and maybe hacking a small prototype
together. Currently it's just out-of-hours, though, so don't expect major
magic. :)

Thanks,
-J

On Tue, Feb 13, 2018 at 2:33 PM, Phillip Cloud wrote:

> The MathWorks is in the process of starting to contribute. I spoke with
> them a couple weeks ago about this and they were excited about it. I can
> ping them to see if they are still interested.
>
> On Tue, Feb 13, 2018, 09:24 Uwe L. Korn wrote:
>
> > Hello Joris,
> >
> > this is only due to lack of someone doing it and probably due to lack of
> > people that have the experience to do that. I had a short look at
> > Matlab's C++ API and the interfaces seem to be promising enough
> > https://de.mathworks.com/help/matlab/matlab-data-array.html that once
> > someone attempts it, it should not be hard to build.
> >
> > If you want to try to take a shot, we are happy to help if there are
> > problems with the Arrow side of things.
> >
> > Uwe
> >
> > On Tue, Feb 13, 2018, at 2:41 PM, Joris Peeters wrote:
> > > Hello,
> > >
> > > Is anyone aware of plans (or concrete projects) to add MATLAB bindings
> > > for Arrow? I'm interested in exchanging data between Java, Python, ...,
> > > and MATLAB - and Arrow sounds like a great solution.
> > >
> > > I couldn't find any pre-existing effort, though, so curious if that is
> > > due to a lack of interest or because there might be underlying reasons
> > > that would make this very hard to achieve.
> > >
> > > Best,
> > > -Joris.
> >
>