Re: Apache Arrow Cookbook

2021-07-07 Thread Matt Topol
Personally I'd love to see a cookbook where a recipe is accompanied by examples of how to accomplish it in multiple languages rather than having separate cookbooks for each language. Though that may just be me wanting to see more love for the Golang implementation. On Wed, Jul 7, 2021, 8:57

Re: Arrow sync call July 7 at 12:00 US/Eastern, 16:00 UTC

2021-07-07 Thread Ian Cook
Thanks to all who attended. Meeting notes: Attendees: Nate Bauernfeind Ian Cook Nic Crane Alenka Frim Micah Kornfield Jorge Leitao Alessandro Molina Weston Pace Eduardo Ponce Pol Santamaria Discussion: - Arrow 5.0.0 release - Goal: release end of next week or worst case the third week of

Distributing the Arrow C++ library through vcpkg

2021-07-07 Thread Ian Cook
Hi Arrow devs, Since 2017, it has been possible to install the Arrow C++ library using the vcpkg package manager[1], but until recently, the Arrow vcpkg port ("port" is their term for a package) was maintained by community members, not by core Arrow devs. This led to a pattern of irregular

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Weston Pace
Good question. I'll take a stab at answering some of it. C++ has the same passthru / interoperability concerns. Python is significant as its built-in datetime module distinguishes between "local" and "instant" datetimes (which it calls naive and non-naive). In addition, pandas which has a very
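
The "local" versus "instant" distinction mentioned in this message can be sketched with Python's standard-library `datetime` module alone (the official documentation calls the two kinds "naive" and "aware"). The variable names below are illustrative and do not come from the thread:

```python
from datetime import datetime, timedelta, timezone

# "Local" / naive: wall-clock fields only, no UTC offset attached.
naive = datetime(2021, 7, 7, 12, 0, 0)

# "Instant" / aware: the same wall-clock fields pinned to a UTC offset.
aware_utc = datetime(2021, 7, 7, 12, 0, 0, tzinfo=timezone.utc)

# The same instant expressed with a +02:00 offset.
aware_plus2 = aware_utc.astimezone(timezone(timedelta(hours=2)))

print(naive.tzinfo)              # None
print(aware_utc.utcoffset())     # 0:00:00
print(aware_utc == aware_plus2)  # True: same instant, different wall clock

# Ordering comparisons between naive and aware values raise TypeError.
try:
    naive < aware_utc
    mixed_comparison_ok = True
except TypeError:
    mixed_comparison_ok = False
```

Two aware values compare by the instant they denote, which is why the UTC and +02:00 values are equal even though their hour fields differ.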

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Evan Chan
Thanks everyone for their input; Interoperability would be the biggest issue; how much does C++ do with the timezone string? -Evan > On Jul 7, 2021, at 1:33 PM, Weston Pace wrote: > > I don't know about removal but you could probably ignore the timezone > string and it's not clear the issues

Re: Apache Arrow Cookbook

2021-07-07 Thread Eduardo Ponce
Here is additional food for thought. The cookbook currently contains examples for C++, R, and Python. Is there a plan (or wish) to eventually extend a single cookbook to include examples from other languages (e.g., Rust, Java)? If so, then putting the cookbook into its own (language agnostic) repo

Re: Apache Arrow Cookbook

2021-07-07 Thread Eduardo Ponce
Great work! I would recommend having the cookbook in its own repo so that its updates are not constrained by the timeline used for updating the public Arrow documentation. This will allow users that are not involved in Arrow development to contribute or provide suggestions to the cookbook fairly

Re: Apache Arrow Cookbook

2021-07-07 Thread Rares Vernica
Awesome! We would find C++ versions of these recipes very useful. From our experience the C++ API is much much harder to deal with and error prone than the R/Python one. Cheers, Rares On Wed, Jul 7, 2021 at 9:07 AM Alessandro Molina < alessan...@ursacomputing.com> wrote: > Yes, that was mostly

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-07-07 Thread Niranda Perera
I have created a PR for the changes we discussed. https://github.com/apache/arrow/pull/10679 It would be great if you guys could go through it. I'm still benchmarking the results. And I also have some ideas to reduce the branches in the main while loop in the Primitive Types implementation. I

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Weston Pace
I don't know about removal but you could probably ignore the timezone string and it's not clear the issues would be that significant. If Rust never produces a non-null non-UTC timestamp then I don't see that as an issue. If you are consuming data with a timestamp string other than UTC it isn't

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread David Li
To summarize so far, it sounds like schema evolution is neither sufficient nor necessary for either Gosh or Nate's use-cases here? It could be useful for FlightSQL but even there I don't think it's a requirement. For Nate - it almost sounds like what you need is some way to slice up a record

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Nate Bauernfeind
> Flatbuffers does not support modifying structs > in any forwards or backwards compatible way > (only tables support evolution). Bah. I did not realize that. To reiterate the feature that would be ideal: I realize the specific feature I am missing is the ability to encode that a field (i.e. its

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Joris Van den Bossche
On Wed, 7 Jul 2021 at 18:46, Jorge Cardoso Leitão wrote: > Hi, > > AFAIK timezone is part of the spec. And for reference, the current spec (Schema flatbuffer file) for timestamp is at https://github.com/apache/arrow/blob/6c8d30ea8fd2750b999840872d3f6cbdc8f8/format/Schema.fbs#L217-L247.

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Jorge Cardoso Leitão
Hi, AFAIK timezone is part of the spec. In Python, that would be [1] import pyarrow as pa dt1 = pa.timestamp("ms", "+00:10") dt2 = pa.timestamp("ms") arrow-rs is not very consistent with how it handles it. imo that is an artifact of being currently difficult (API wise) to create an array with a

[Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Evan Chan
Hi folks, Some of us are having a discussion about a direction change for Rust Arrow timestamp types, which currently support both a resolution field (Ns, Micros, Ms, Seconds) similar to the other language implementations, but also optionally a timezone string field. I believe the timezone

Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
Yes, that was mostly what I meant when I wrote that the next step is opening a PR against the apache/arrow repository itself :D We moved forward in a separate repository initially to be able to cycle more quickly, but we reached a point where we think we can start integrating the cookbook with the

Re: Arrow sync call July 7 at 12:00 US/Eastern, 16:00 UTC

2021-07-07 Thread Nate Bauernfeind
Is this still happening today? On Tue, Jul 6, 2021 at 11:07 AM Ian Cook wrote: > Hi all, > > Our biweekly sync call is tomorrow at > https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes > will be shared with the mailing list afterward. > > Ian > --

Re: Arrow sync call July 7 at 12:00 US/Eastern, 16:00 UTC

2021-07-07 Thread Ian Cook
Update: For the meeting starting now, please use this Google Meet URL: https://meet.google.com/ebp-tczo-xjn Ian On Tue, Jul 6, 2021 at 12:07 PM Ian Cook wrote: > > Hi all, > > Our biweekly sync call is tomorrow at > https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes > will be

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Micah Kornfield
> > Might there be interest in adding a "field_id" to the FieldNode (which is > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible > upgrade (by either keying off of 0, or explicitly set the field default to > -1) which would allow the sender to "skip" fields that have 1)

Re: Apache Arrow Cookbook

2021-07-07 Thread Wes McKinney
What do you think about developing this cookbook in an Apache Arrow repository (it could be something like apache/arrow-cookbook, if not part of the main development repo)? Creating expanded documentation resources for learning how to use Apache Arrow to solve problems seems certainly within the

Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
We finally have a first preview of the cookbook available for R and Python, for anyone interested the two versions are visible at http://ursacomputing.com/arrow-cookbook/py/index.html and http://ursacomputing.com/arrow-cookbook/r/index.html A new version of the cookbook is automatically published

[DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Micah Kornfield
Retitling and forking the discussion to talk about key value pairs. What is the byte cost of an empty list? Another option would be to introduce a new BinaryKeyValue table and add binary metadata. On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind < natebauernfe...@deephaven.io> wrote: >

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Nate Bauernfeind
Deephaven and I are very supportive of "upgrading" the value half of the kv pair to a byte vector. What is the best way to find out if there is sufficient interest? I've been stewing on the ideas here around schema evolution, and I realize the specific feature I am missing is the ability to

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Wes McKinney
On Wed, Jul 7, 2021 at 2:53 PM David Li wrote: > > From the Flatbuffers internals doc[1] it appears they are the same: "Strings > are simply a vector of bytes, and are always null-terminated." I see. I took a look at flatbuffers.h, and it appears that changing this field from string to [byte]

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread David Li
From the Flatbuffers internals doc[1] it appears they are the same: "Strings are simply a vector of bytes, and are always null-terminated." [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html -David On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote: > On Tue, Jul 6, 2021 at
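
The quoted statement can be illustrated with a rough sketch of the two encodings. It assumes the layout described in the Flatbuffers internals doc: a 32-bit little-endian element count followed by the elements, with strings adding a trailing NUL that the count does not include. The helper names are hypothetical, and alignment padding and offsets are ignored; this is not the Flatbuffers library API:

```python
import struct

def encode_byte_vector(data: bytes) -> bytes:
    # Hypothetical helper: uint32 little-endian element count, then raw bytes.
    return struct.pack("<I", len(data)) + data

def encode_string(text: str) -> bytes:
    # Hypothetical helper: same length prefix (terminator not counted),
    # the UTF-8 bytes, then a single trailing NUL byte.
    raw = text.encode("utf-8")
    return struct.pack("<I", len(raw)) + raw + b"\x00"

vec = encode_byte_vector(b"abc")
s = encode_string("abc")

# Under these assumptions, the only difference is the trailing NUL.
print(s[:-1] == vec)
print(s[-1])
```

Under these assumptions a reader that honors the length prefix sees the same payload in both cases, which is the sense in which the two representations are "the same".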

Re: Improving PR workload management for Arrow maintainers

2021-07-07 Thread Weston Pace
I investigated the cpython approach and the PR labelling is a part of the existing bedevere bot which does a number of things (not all relevant to Arrow). Yesterday I created a standalone GitHub action[1] dedicated to this task roughly based on my previous email. It will apply "awaiting-review"

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Wes McKinney
On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield wrote: > > > > > Right, I had wanted to focus the discussion on Flight as I think schema > > evolution or multiplexing streams (more so the latter) is a property of the > > transport and not the stream format itself. If we are leaning towards just >