Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Yibo Cai
Did a quick test. For random bitmaps and my trivial test code, the branch-less code is 3.5x faster than the branching one. https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro On 6/23/21 11:21 PM, Wes McKinney wrote: One project I was interested in getting to but haven't had the time was introducing

Arrow sync call June 23 at 12:00 US/Eastern, 16:00 UTC

2021-06-23 Thread Neal Richardson
I didn't send a reminder beforehand, but we did meet today as usual. Attendees: Nate Bauernfeind, Nic Crane, Ian Cook, Rémi Dettai, James Duong, Alenka F, Ray Lum, Rok Mihevc, Prudhvi Porandla, Neal Richardson. Discussion: * 5.0.0 release coming up in 3+ weeks, see previous ML discussion *

Re: [ANNOUNCE] Official media types (MIME types) for Apache Arrow formats

2021-06-23 Thread Wes McKinney
Congratulations to Kou and Weston (and everyone else who contributed) for pushing this initiative through to completion! On Wed, Jun 23, 2021 at 8:48 PM Sutou Kouhei wrote: > > Hi, > > The official media types (MIME types) for Apache Arrow > formats are registered to IANA: > > * >

[ANNOUNCE] Official media types (MIME types) for Apache Arrow formats

2021-06-23 Thread Sutou Kouhei
Hi, The official media types (MIME types) for Apache Arrow formats are registered to IANA: * https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file * https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream We recommend ".arrow" for IPC file

Re: [C++] Maximum type code for union types

2021-06-23 Thread Ying Zhou
Thanks! Issue filed. I will work on it. https://issues.apache.org/jira/browse/ARROW-13154 > On Jun 20, 2021, at 5:26 PM, Wes McKinney wrote: > > UnionType::kMaxTypeCode is 127, so we intend to have codes from 0 to > 127. If there is code preventing things from going up to and including > 127

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
Thanks for chiming in - I've replied in the doc. Scoping it to just schema evolution would be preferable, but I'm not sure whether Gosh's use case requires more flexibility than that. Again, though, given that 1) gRPC recycles a connection, so repeated calls aren't necessarily expensive and

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Nate Bauernfeind
Thanks for writing this up! I added a few general comments, but have a question on the approach because it's not quite what I was expecting. I am slightly concerned that the proposal looks more like support for "multiplexing" IPC streams into a single RPC stream rather than support for a changing

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
Ah to be clear, the API is indeed inconsistent - DoExchange was added some time later (and by its nature returning a FlightDataStream would not have been possible, since it's meant to be able to interleave reading/writing). But really, DoGet is indeed the odd one out in the C++ API and it may

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David, Got you. In fact I was looking at this more from the point of view of consistency of the API in terms of "inputs" and thought DoExchange is kind of a DoGet+ so might make sense to have the same classes being utilized in both places. But again, I might be missing something and I get the

Re: [Python] Drop Python 3.6 and Numpy 1.16 support?

2021-06-23 Thread Micah Kornfield
Could we postpone dropping Python 3.6 support to be in line with the Python core maintainers' deadline? Or at least until the Arrow 6 release? Thanks, Micah On Wed, Jun 23, 2021 at 10:36 AM Wes McKinney wrote: > This seems reasonable to me. > > On Wed, Jun 23, 2021 at 11:39 AM Antoine

Re: [ANNOUNCE] New Arrow PMC member: David M Li

2021-06-23 Thread Kazuaki Ishizaki
Congrats! Bryan Cutler wrote on 2021/06/23 16:00:32: > From: Bryan Cutler > To: dev > Date: 2021/06/23 16:00 > Subject: [EXTERNAL] Re: [ANNOUNCE] New Arrow PMC member: David M Li > > Congrats David! > > On Tue, Jun 22, 2021, 7:24 PM Micah Kornfield wrote: > > > Congrats David! > > > > On

Re: [Python] Drop Python 3.6 and Numpy 1.16 support?

2021-06-23 Thread Wes McKinney
This seems reasonable to me. On Wed, Jun 23, 2021 at 11:39 AM Antoine Pitrou wrote: > > Hello, > > In https://issues.apache.org/jira/browse/ARROW-12706 it was proposed to > drop support for the aforementioned Python and Numpy versions. The > rationale is that they have ceased to be supported

[Python] Drop Python 3.6 and Numpy 1.16 support?

2021-06-23 Thread Antoine Pitrou
Hello, In https://issues.apache.org/jira/browse/ARROW-12706 it was proposed to drop support for the aforementioned Python and Numpy versions. The rationale is that they have ceased to be supported by Numpy, which is a mandatory dependency of PyArrow. Besides, Pandas (an optional

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Niranda Perera
Hi Wes, I am interested in this. In this PR [1] we are exposing BitmapWordReader/Writer [2] to the outside, which may help the 'batch-at-a-time' scenario. [1] https://github.com/apache/arrow/pull/10487 [2]

[C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Wes McKinney
One project I was interested in getting to but haven't had the time was introducing branch-free code into vector_selection.cc and reducing the use of if-statements to try to improve performance. One way to do this is to take code that looks like this: if (BitUtil::GetBit(filter_data_,
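The snippet above is cut off before the code, so the following is only a minimal sketch of the general branching versus branch-free idea, not the code from the message or from vector_selection.cc. All names here (GetBit, FilterBranching, FilterBranchFree) are hypothetical stand-ins, and the bitmap is assumed to be LSB-first. The branch-free variant always writes the candidate value and then advances the output position by the filter bit, so the output buffer is assumed to have room for n elements.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the Arrow API): read bit i of an LSB-first bitmap.
inline bool GetBit(const uint8_t* bits, std::size_t i) {
  return ((bits[i >> 3] >> (i & 7)) & 1) != 0;
}

// Branching version: one conditional per input element.
std::size_t FilterBranching(const int32_t* values, const uint8_t* filter,
                            std::size_t n, int32_t* out) {
  std::size_t out_pos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (GetBit(filter, i)) {
      out[out_pos++] = values[i];
    }
  }
  return out_pos;
}

// Branch-free version: write unconditionally, then advance by 0 or 1.
// A rejected value is simply overwritten by the next write.
// Assumption: out has capacity for n elements, since out_pos <= i here.
std::size_t FilterBranchFree(const int32_t* values, const uint8_t* filter,
                             std::size_t n, int32_t* out) {
  std::size_t out_pos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    out[out_pos] = values[i];
    out_pos += static_cast<std::size_t>(GetBit(filter, i));
  }
  return out_pos;
}
```

Both functions return the number of selected elements and fill the first that many slots of out identically. Whether the branch-free form is actually faster depends on how predictable the filter bits are, so it is worth benchmarking on representative data.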

Re: [GitHub] Pull Request 10305

2021-06-23 Thread Wes McKinney
I just left some comments. Thank you for working on this! On Wed, Jun 23, 2021 at 8:54 AM Sarah Gilmore wrote: > > Hi all, > > David Li suggested I email the mailing list to see if anyone would be > interested in reviewing this pull > request for the

[GitHub] Pull Request 10305

2021-06-23 Thread Sarah Gilmore
Hi all, David Li suggested I email the mailing list to see if anyone would be interested in reviewing this pull request for the MATLAB interface to Arrow. If anyone has time to look at it that would be great, and if there's anything we can do to

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2021 07:37:09 -0500 Wes McKinney wrote: > On Wed, Jun 23, 2021 at 3:03 AM Antoine Pitrou wrote: > > > > On Tue, 22 Jun 2021 19:04:49 -0500 > > Wes McKinney wrote: > > > Some on this list might be interested in a new paper out of CMU/MIT > > > about the use of selection vectors

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
It's mostly a quirk of implementation (and just for clarification, they're all nearly identical on the format/protocol level). DoGet is conceptualized as your application returning a readable stream of batches, instead of your application imperatively writing batches to the client. (This is

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David, Going through the ArrowFlight API, I got a bit confused by the DoGet and DoPut/DoExchange APIs: the former expects a FlightDataStream, which talks in terms of already-serialized messages, while the latter accept FlightMessageReader/Writer, which expect the user to pass in RecordBatches etc. Is

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Wes McKinney
On Wed, Jun 23, 2021 at 3:03 AM Antoine Pitrou wrote: > > On Tue, 22 Jun 2021 19:04:49 -0500 > Wes McKinney wrote: > > Some on this list might be interested in a new paper out of CMU/MIT > > about the use of selection vectors and bitmaps for handling the > > intermediate results of filters: > >

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Antoine Pitrou
On Tue, 22 Jun 2021 19:04:49 -0500 Wes McKinney wrote: > Some on this list might be interested in a new paper out of CMU/MIT > about the use of selection vectors and bitmaps for handling the > intermediate results of filters: > > https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf > > The

Re: [ANNOUNCE] New Arrow PMC member: David M Li

2021-06-23 Thread Bryan Cutler
Congrats David! On Tue, Jun 22, 2021, 7:24 PM Micah Kornfield wrote: > Congrats David! > > On Tue, Jun 22, 2021 at 7:13 PM Fan Liya wrote: > > > Congratulations David! > > > > Best, > > Liya Fan > > > > > > On Wed, Jun 23, 2021 at 9:44 AM Yibo Cai wrote: > > > > > Congrats David! > > > > > >

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Jorge Cardoso Leitão
Thank you for sharing, Wes, an interesting paper indeed. In Rust we currently use a different strategy. We build an iterator over ranges [a_i, b_i[ to be selected from the filter bitmap, and filter the array based on those ranges. For a single filter, the ranges are iterated as they are being
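The range-based strategy described above could be sketched roughly as follows. This is an illustration only, written in C++ to match the rest of this listing rather than in Rust, and it is not the Rust implementation's code. The names BitAt, SetBitRanges and FilterByRanges are hypothetical, the bitmap is assumed to be LSB-first, and the ranges are materialized into a vector here for simplicity instead of being produced lazily by an iterator.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [start, end)

// Hypothetical helper: read bit i of an LSB-first bitmap.
inline bool BitAt(const uint8_t* bits, std::size_t i) {
  return ((bits[i >> 3] >> (i & 7)) & 1) != 0;
}

// Collect the maximal runs of consecutive set bits as half-open ranges.
std::vector<Range> SetBitRanges(const uint8_t* bits, std::size_t n) {
  std::vector<Range> ranges;
  std::size_t i = 0;
  while (i < n) {
    while (i < n && !BitAt(bits, i)) ++i;  // skip unset bits
    if (i == n) break;
    const std::size_t start = i;
    while (i < n && BitAt(bits, i)) ++i;   // consume the run of set bits
    ranges.emplace_back(start, i);
  }
  return ranges;
}

// Filter by copying each selected range in one contiguous chunk.
std::vector<int32_t> FilterByRanges(const std::vector<int32_t>& values,
                                    const std::vector<Range>& ranges) {
  std::vector<int32_t> out;
  for (const Range& r : ranges) {
    out.insert(out.end(),
               values.begin() + static_cast<std::ptrdiff_t>(r.first),
               values.begin() + static_cast<std::ptrdiff_t>(r.second));
  }
  return out;
}
```

Copying whole ranges tends to pay off when the filter contains long runs of set bits, because each run becomes a single contiguous copy; for highly fragmented filters the per-range overhead may dominate.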