Re: [Rust] [Discuss] proposal to redesign Arrow crate to resolve safety violations

2021-07-09 Thread Jorge Cardoso Leitão
Thanks a lot Wes,

I am not sure how to proceed from here:

1. How do we generate the HTML from the XML? E.g.
https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
2. How do I trigger the process to start? Can I just email the
incubator with the proposal?

Best,
Jorge



On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney  wrote:

> Great, thanks for the update and pushing this forward. Let us know if
> you need help with anything.
>
> On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > Wes and Neils,
> >
> > Thank you for your feedback and offer. I have created the two .xml
> reports:
> >
> >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> >
> http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> >
> > I based them on the report for Ballista. I also requested, on the PRs
> > [1,2], clarification regarding each contributor's contributions to each.
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> >
> >
> >
> > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney 
> wrote:
> >
> > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks a lot for your feedback. I agree with all the arguments put
> > > forward,
> > > > including Andrew's point about the large change.
> > > >
> > > > I tried a gradual migration 4 months ago, but it was really difficult
> > > > and I gave up.
> > > > I estimate that the work involved is half the work of writing
> parquet2
> > > and
> > > > arrow2 in the first place. The internal dependency on ArrayData (the
> main
> > > > culprit of the unsafe) in arrow-rs is so prevalent that all core
> > > components
> > > > need to be re-written from scratch (IPC, FFI, IO, array/transform/*,
> > > > compute, SIMD). I personally do not have the motivation to do it,
> though.
> > > >
> > > > Jed, the public API changes are small for end users. A typical
> migration
> > > is
> > > > [1]. I agree that we can further reduce the change-set by keeping
> legacy
> > > > interfaces available.
> > > >
> > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > >
> > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > memory (-m): 332.9, 239.6
> > > > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > > > parquet format: 1316.1, 930.7
> > > > tbl format: 5297.3, 5383.1
> > > >
> > > > i.e. I am observing some improvements. Queries with joins are still
> > > slower.
> > > > The pruning of parquet groups and pages based on stats is not yet
> > > > there; I am working on it.
> > > >
> > > > I agree that this should go through IP clearance. I will start this
> > > > process. My thinking would be to create two empty repos on apache/*,
> and
> > > > create 2 PRs from the main branches of each of my repos to those
> repos,
> > > and
> > > > only merge them once IP is cleared. Would that be a reasonable
> process,
> > > Wes?
> > >
> > > This sounds plenty fine to me — I'm happy to assist with the IP
> > > clearance process having done it several times in the past. I don't
> > > have an opinion about the names, but having experimental- in the name
> > > sounds in line with the previous discussion we had about this.
> > >
> > > > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1]
> > > >
> > >
> https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > > > [2] https://github.com/apache/arrow-datafusion/pull/68
> > > >
> > > >
> > > > On Fri, May 28, 2021 at 5:22 AM Josh Taylor wrote:
> > > >
> > > > > I played around with it, for my use case I really like the new way
> of
> > > > > writing CSVs, it's much more obvious. I love the
> `read_stream_metadata`
> > > > > function as well.
> > > > >
> > > > > I'm seeing a very slight speed improvement (~8ms) on my end, but I
> > > read a
> > > > > bunch of files in a directory and spit out a CSV, the bottleneck
> is the
> > > > > parsing of lots of files, but it's pretty quick per file.
> > > > >
> > > > > old:
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_0
> > > 120224
> > > > > bytes took 1ms
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_1
> > > 123144
> > > > > bytes took 1ms
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_10
> > > > > 17127928 bytes took 159ms
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_11
> > > > > 17127144 bytes took 160ms
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_0_12
> > > > > 17130352 bytes took 158ms
> > > > > /home/josh/staging/019c4715-3200-48fa--4105000cd71e/data_0_

Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Neal Richardson
Congrats Weston!

On Fri, Jul 9, 2021 at 11:53 AM Micah Kornfield 
wrote:

> Congrats!
>
> On Fri, Jul 9, 2021 at 7:56 AM Benjamin Kietzman 
> wrote:
>
> > Congrats!
> >
> > On Fri, Jul 9, 2021, 08:48 Wes McKinney  wrote:
> >
> > > On behalf of the Arrow PMC, I'm happy to announce that Weston has
> > accepted
> > > an
> > > invitation to become a committer on Apache Arrow. Welcome, and thank
> you
> > > for your contributions!
> > >
> > > Wes
> > >
> >
>


Re: [Rust] Preparing for the 5.0.0 release and the 5.x release line

2021-07-09 Thread Andrew Lamb
I have created a `release-blocker` label we can use to label issues:
https://github.com/apache/arrow-rs/issues?q=is%3Aissue+is%3Aopen+label%3Arelease-blocker

Adam Lippai pointed out https://github.com/apache/arrow-rs/issues/458 as
another one that would be nice to fix

On Fri, Jul 9, 2021 at 7:55 AM Andrew Lamb  wrote:

> I plan to create a 5.0.0 arrow-rs release candidate (from master) early
> next week. The 5.0 release will include breaking API changes.
>
> Here is the list of issues I think are blockers:
> * https://github.com/apache/arrow-rs/issues/529
> * https://github.com/apache/arrow-rs/issues/463 (improved documentation)
>
> Please let me know if there are any additional issues you think should be
> included.
>
> After the 5.0.0 release, we'll begin releasing 5.x every other week as
> normal.
>
> Andrew
>


Re: Apache Arrow Cookbook

2021-07-09 Thread Wes McKinney
Some benefits of separating the cookbook from the documentation would
be to decouple its release / publication from Arrow releases, so you
can roll out new content to the published version as soon as it's
merged into the repository, whereas we might not want to publish
inter-release changes to the documentation. You could
also have a separate entry point to increase navigability (since the
documentation is intended to be more of a reference book).

Given that the Rust projects have decoupled into multiple
repositories, a "cookbook" repository could also be a place to collect
recipes related to DataFusion.

Either option is plenty reasonable, though, so feel free to choose
what makes the most sense to you.

On Thu, Jul 8, 2021 at 12:09 PM Alessandro Molina
 wrote:
>
> Thinking about it, I think that having the cookbook in its own repository
> (apache/arrow-cookbook) might lower the barrier for contributors. You only
> need to clone the cookbook, and running `make` also takes care of
> installing the required dependencies, so in theory you don't even need to
> care too much about setting up your environment. But we can surely improve
> the README in the repo further to ease contributions.
>
> I think we can also preserve the benefit that Nic mentioned of making sure
> that on each Arrow build the recipes are verified by triggering a build of
> the cookbook repository on each new arrow master change. Worst case, have a
> nightly build for the cookbook that clones the latest arrow master branch.
>
> Having a cookbook for C++ is a very good idea; that might be the next step
> once we finish the Python and R versions. If people want to contribute
> cookbook versions for more languages that would be greatly appreciated too.
>
> On the other hand, while we want to keep the cookbooks in the same
> repository and share the same infrastructure to keep a low entry barrier
> (make py/r/X will just compile the cookbook for the language you picked), I
> feel that keeping the cookbooks separate per language is a good idea. While
> it's cool to be able to compare the solution between languages, in general
> developers look for the solution in their target language and might
> perceive the other implementations as noise.
> For example, we received similar feedback for the Arrow documentation too,
> that as a Python developer it's hard to find what you are looking for
> because it's mixed with the "format" and "C++" documentation and there are
> a few links back and forth between them.
>
>
>
>
>
> On Thu, Jul 8, 2021 at 11:39 AM Nic  wrote:
>
> > One of the possible aims for the cookbook is having interlinked
> > documentation between function docs and the cookbook, and both the R and
> > Python docs include tests that all of the outputs are as expected. Including
> > these tests means that we can immediately see if any code changes render
> > any recipes incorrect.  Therefore the decoupling between cookbook updates
> > and docs updates may not be necessary.
> >
> > That said, there has been mention of having versions of the cookbook tied
> > to released versions of Arrow, which sounds like a great idea.
> >
> > The repo also includes a Makefile which takes care of all the relevant setup, so
> > hopefully that should simplify things for users.  The R cookbook uses
> > bookdown, which has a feature where a reader can click an 'edit' button and
> > it automatically creates a fork where they can edit the cookbook and submit
> > a PR directly from GitHub.
> >
> > It'd be great to see a lot of recipes in multiple languages, but in the
> > document of possible recipes circulated previously, we identified slightly
> > different needs for recipes for R/Python, and this may be further
> > complicated by writing for slightly different audiences (from what I
> > understand, the pyarrow implementation may be more geared towards people
> > building on top of the low-level bindings, whereas in R, we have both that
> > audience as well as folks who just want to make their dplyr code run faster
> > without needing to know that much about the details of Arrow).
> >
> > I wonder, though, if we could still achieve that by having an additional
> > page that points to the recipes that *are* common between each cookbook.
> >
> > On Thu, 8 Jul 2021 at 10:07, Antoine Pitrou  wrote:
> >
> > >
> > > Hi Rares,
> > >
> > > Documentation bugs and improvement requests are welcome, feel free to
> > > file them on the JIRA!
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 08/07/2021 at 01:45, Rares Vernica wrote:
> > > > Awesome! We would find C++ versions of these recipes very useful. From
> > > our
> > > > experience the C++ API is much, much harder to deal with and more
> > > > error-prone than the R/Python one.
> > > >
> > > > Cheers,
> > > > Rares
> > > >
> > > > On Wed, Jul 7, 2021 at 9:07 AM Alessandro Molina <
> > > > alessan...@ursacomputing.com> wrote:
> > > >
> > > >> Yes, that was mostly what I meant when I wrote that

Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Micah Kornfield
Congrats!

On Fri, Jul 9, 2021 at 7:56 AM Benjamin Kietzman 
wrote:

> Congrats!
>
> On Fri, Jul 9, 2021, 08:48 Wes McKinney  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Weston has
> accepted
> > an
> > invitation to become a committer on Apache Arrow. Welcome, and thank you
> > for your contributions!
> >
> > Wes
> >
>


Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-09 Thread Wes McKinney
The cost of an empty vector in Flatbuffers appears to be 4 bytes.
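
For reference, the table as it stands today in Schema.fbs is simply:

    table KeyValue {
      key: string;
      value: string;
    }

and a hypothetical BinaryKeyValue table along the lines Micah suggests
below (the name and layout here are illustrative, not a concrete
proposal) could be as small as:

    table BinaryKeyValue {
      key: string;
      value: [ubyte];   // raw bytes; no base64 encoding, no null terminator
    }

Since strings and [ubyte] vectors share the same wire representation
apart from the null terminator, a reader treating the value as raw bytes
would see no difference between the two.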

On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield  wrote:
>
> Retitling and forking the discussion to talk about key value pairs.
>
> What is the byte cost of an empty list?  Another option would be to
> introduce a new BinaryKeyValue table and add binary metadata.
>
> On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> natebauernfe...@deephaven.io> wrote:
>
> > Deephaven and I are very supportive of "upgrading" the value half of the kv
> > pair to a byte vector. What is the best way to find out if there is
> > sufficient interest?
> >
> >
> > I've been stewing on the ideas here around schema evolution, and I realize
> > the specific feature I am missing is the ability to encode that a field
> > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> > empty/has-no-data in O(0) cost (yes; for free).
> >
> > Might there be interest in adding a "field_id" to the FieldNode (which is
> > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> > upgrade (by either keying off of 0, or explicitly set the field default to
> > -1) which would allow the sender to "skip" fields that have 1) FieldNode
> > length of zero, and 2) all Buffer's associated at that level (and further
> > nested) are also equally empty (i.e. Buffer length is zero).
> >
> > I understand this concept slightly interferes with RecordBatch's `length`
> > field, and that many implementations use that length to resize the
> > root-level FieldNodes. The use-case I have in mind has different logical
> > lengths per field node; current implementations require sending a
> > RecordBatch length of the max length across all root level field nodes. I
> > believe this requires a copy of data whenever a field node is too short; I
> > don't know if there is a decent solution to this slight inefficiency. I am
> > bringing it up because if "skipping a field node when it is empty" is a
> > feature, then we may not want to allocate space for those nodes given that
> > the record batch length will likely be greater than zero.
> >
> > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney  wrote:
> >
> > > On Wed, Jul 7, 2021 at 2:53 PM David Li  wrote:
> > > >
> > > > From the Flatbuffers internals doc[1] it appears they are the same:
> > > "Strings are simply a vector of bytes, and are always null-terminated."
> > >
> > > I see. I took a look at flatbuffers.h, and it appears that changing
> > > this field from string to [byte] would be backward-compatible and
> > > forward-compatible except with code that expects a null terminator.
> > > This is something we could discuss separately if there were enough
> > > interest.
> > >
> > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> > > >
> > > > -David
> > > >
> > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > > schema
> > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > property of the
> > > > > > > transport and not the stream format itself. If we are leaning
> > > towards just
> > > > > > > schema evolution then maybe it makes sense to discuss it for the
> > > IPC stream
> > > > > > > format and leverage that in Flight. I'd be interested in what
> > > others think.
> > > > > >
> > > > > > I tend to agree, I think stream multiplexing is likely a transport
> > > level
> > > > > > issue.  IMO schema evolution should be consistent with the IPC
> > > > > > stream format and Flight.
> > > > > >
> > > > > >
> > > > > > > Nate: it may be worth starting a separate discussion about more
> > > general
> > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > metadata was
> > > > > > > chosen/if opaque bytes were considered in the past.
> > > > > >
> > > > > >
> > > > > > I think  this was an unfortunate design of the key value metadata
> > in
> > > > > > Schema.fbs, but I don't think I was around when this decision was
> > > made.
> > > > >
> > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > > > string for the value in the KeyValue metadata — I think this was more
> > > > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > > > to require binary data to be base64-encoded — this is something that
> > > > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > > > for Parquet, for example). Is the binary representation of [byte]
> > > > > different from string?
> > > > >
> > > > >
> > > > >
> > > > > > Side Question: Why isn't the IPC stream format a series of the
> > flight
> > > > > > > protobufs?
> > > > > >
> > > > > > In addition to what David said, protobufs can't be read directly
> > > from a
> > > > > > memory-mapped file (they need decoding).  This was one 

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-09 Thread Wes McKinney
It sounds like we may want to discuss some potential evolutions of the
Arrow binary protocol (for example: new Message types). Certainly a
can of worms but rather than trying to bolt some new functionality
onto the existing structures, it might be better to support the new
use cases through some new structures which will be more clear cut
from a forward compatibility standpoint.
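
As a concrete reference point, the envelope in Message.fbs is already a
union, and a union can gain new members without disturbing the existing
wire format (abridged):

    union MessageHeader {
      Schema, DictionaryBatch, RecordBatch, Tensor, SparseTensor
    }

    table Message {
      version: MetadataVersion;
      header: MessageHeader;
      bodyLength: long;
      custom_metadata: [ KeyValue ];
    }

so a new Message type for, say, schema deltas could be added there;
older readers would simply see a header type they do not recognize and
could reject or skip it explicitly.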

On Wed, Jul 7, 2021 at 8:31 PM David Li  wrote:
>
> To summarize so far, it sounds like schema evolution is neither sufficient 
> nor necessary for either Gosh or Nate's use-cases here? It could be useful 
> for FlightSQL but even there I don't think it's a requirement.
>
> For Nate - it almost sounds like what you need is some way to slice up a 
> record batch and send columns individually, which isn't really a concept in 
> IPC (and hence Flight). Or rather, record batch is almost the wrong 
> abstraction for your use case (when you're sending per-column deltas), even 
> though you could model it as a record batch with 'empty' columns or as a 
> stream with constantly shifting schema (neither of which are perfect 
> encodings).
>
> -David
>
> On Wed, Jul 7, 2021, at 13:24, Nate Bauernfeind wrote:
> > > Flatbuffers does not support modifying structs
> > > in any forwards or backwards compatible way
> > > (only tables support evolution).
> >
> > Bah. I did not realize that.
> >
> > To reiterate the feature that would be ideal:
> > I realize the specific feature I am missing is the ability to encode that a
> > field (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> > empty/has-no-data in O(0) cost (yes; for free).
> >
> > Since RecordBatch is a table, might there be interest in adding a field
> > that is a bitset (formatted as a byte vector) to indicate the subset of
> > root FieldNodes that are included in the RecordBatch's field-node-list
> > (noting that the buffer list must be appropriately filtered, too)? If the
> > bitset is omitted/empty we can be forwards-compatible by assuming every
> > field is included (seems fair; because why send a record batch without a
> > payload?). This does bring down the cost from 16 bytes (field node) + 32
> > bytes (2x buffer node minimum; validity buffer node + payload buffer node)
> > to 1 bit per field node -- that's a compression ratio of >= 384:1 --
> > although it's not free, it's a lot closer.
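> >
> > Concretely, since RecordBatch is a table in Message.fbs, the change could
> > be as small as one optional field (abridged; the field name here is
> > illustrative, not a concrete proposal):
> >
> >     table RecordBatch {
> >       length: long;
> >       nodes: [FieldNode];
> >       buffers: [Buffer];
> >       // hypothetical: bit i set => root field i is present in `nodes`
> >       included_fields: [ubyte];
> >     }
> >
> > Older readers would simply ignore the unknown table field, which is where
> > the forwards compatibility would come from.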
> >
> > Are there any other alternatives the Arrow community might consider?
> >
> >
> > > This sounds a lot like what DenseUnion provides though?
> >
> > The use case is as follows: We send multiple record batches that aggregate
> > / accumulate into a single logical update. The first set of record batches
> > contain payload for rows that were added since the last logical update
> > (this is a set of updates to accommodate that not every implementation
> > supports 32-bit lengths). A field node for every column and every added row
> > will be sent. For this half of the logical update the RecordBatch length
> > matches the root FieldNode lengths. The second set of record batches
> > contain payload for only rows, and fields, that were actually modified. If
> > only one field changed, we send only that payload for that row. There is
> > additional metadata that allows the client to understand which existing row
> > is to be replaced by any given row for a given Field/Column. In this
> > context, each root field node has a different mapping to minimize total
> > payload size.
> >
> > We might be able to use DenseUnion but it certainly feels like folding the
> > entire data-source into a dense union makes the design less useful and less
> > accessible. I will spend some time sleeping on your suggestion, but I'm not
> > immediately excited about it. At this moment, I suspect I will continue to
> > lie and state that the RecordBatch's length is the max length across all
> > root field node lengths (and be content that it's not ideal from a
> > copy/allocation perspective).
> > -
> >
> > On Wed, Jul 7, 2021 at 10:57 AM Micah Kornfield 
> > wrote:
> >
> > > >
> > > > Might there be interest in adding a "field_id" to the FieldNode (which 
> > > > is
> > > > encoded on the RecordBatch flatbuffer)? I see a simple 
> > > > forward-compatible
> > > > upgrade (by either keying off of 0, or explicitly set the field default
> > > to
> > > > -1) which would allow the sender to "skip" fields that have 1) FieldNode
> > > > length of zero, and 2) all Buffer's associated at that level (and 
> > > > further
> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > >
> > >
> > > FieldNode is a struct in Message.fbs.   Flatbuffers does not support
> > > modifying structs in any forwards or backwards compatible way (only tables
> > > support evolution).  I think there was originally more metadata in
> > > FieldNode and it was stripped down due to size concerns.
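> > >
> > > For reference, the struct in question is just:
> > >
> > >     struct FieldNode {
> > >       length: long;
> > >       null_count: long;
> > >     }
> > >
> > > Structs are stored inline with a fixed size and layout, which is why a
> > > new field_id member would change the size and offset of every FieldNode
> > > on the wire.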
> > >
> > > I understand this concept slightly interferes with RecordBatch's `length`
> > > > field, and that many implementati

Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Benjamin Kietzman
Congrats!

On Fri, Jul 9, 2021, 08:48 Wes McKinney  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted
> an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
>
> Wes
>


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Yibo Cai
Congrats Weston!


From: Wes McKinney 
Sent: Friday, July 9, 2021 8:47 PM
To: dev 
Subject: [ANNOUNCE] New Arrow committer: Weston Pace

On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
invitation to become a committer on Apache Arrow. Welcome, and thank you
for your contributions!

Wes


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Rok Mihevc
Congrats Weston!

On Fri, Jul 9, 2021 at 3:44 PM Nic  wrote:
>
> Congrats Weston! :)
>
> On Fri, 9 Jul 2021 at 14:43, Eduardo Ponce  wrote:
>
> > Congratulations Weston and thanks for your hard work!
> >
> > ~Eduardo
> >
> > 
> > From: David Li 
> > Sent: Friday, July 9, 2021 9:14:19 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Weston Pace
> >
> > Congrats Weston!
> >
> > On Fri, Jul 9, 2021, at 08:47, Wes McKinney wrote:
> > > On behalf of the Arrow PMC, I'm happy to announce that Weston has
> > accepted an
> > > invitation to become a committer on Apache Arrow. Welcome, and thank you
> > > for your contributions!
> > >
> > > Wes
> > >
> >


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Nic
Congrats Weston! :)

On Fri, 9 Jul 2021 at 14:43, Eduardo Ponce  wrote:

> Congratulations Weston and thanks for your hard work!
>
> ~Eduardo
>
> 
> From: David Li 
> Sent: Friday, July 9, 2021 9:14:19 AM
> To: dev@arrow.apache.org 
> Subject: Re: [ANNOUNCE] New Arrow committer: Weston Pace
>
> Congrats Weston!
>
> On Fri, Jul 9, 2021, at 08:47, Wes McKinney wrote:
> > On behalf of the Arrow PMC, I'm happy to announce that Weston has
> accepted an
> > invitation to become a committer on Apache Arrow. Welcome, and thank you
> > for your contributions!
> >
> > Wes
> >
>


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Eduardo Ponce
Congratulations Weston and thanks for your hard work!

~Eduardo


From: David Li 
Sent: Friday, July 9, 2021 9:14:19 AM
To: dev@arrow.apache.org 
Subject: Re: [ANNOUNCE] New Arrow committer: Weston Pace

Congrats Weston!

On Fri, Jul 9, 2021, at 08:47, Wes McKinney wrote:
> On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
>
> Wes
>


Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread David Li
Congrats Weston!

On Fri, Jul 9, 2021, at 08:47, Wes McKinney wrote:
> On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
> invitation to become a committer on Apache Arrow. Welcome, and thank you
> for your contributions!
> 
> Wes
> 

[ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Wes McKinney
On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
invitation to become a committer on Apache Arrow. Welcome, and thank you
for your contributions!

Wes


Re: Distributing the Arrow C++ library through vcpkg

2021-07-09 Thread Wes McKinney
hi Ian,

Thank you for driving this effort — I agree that having the vcpkg
installation path maintained by the Arrow community will help yield a
more consistent experience for users of the libraries, similar to
Homebrew and other packages we maintain.

As an aside, and speaking of documenting the various install paths, I
wonder if we should move the installation information into the Sphinx
project to make this information more cohesive with the rest of the
documentation.

Thanks,
Wes

On Thu, Jul 8, 2021 at 4:24 AM Ian Cook  wrote:
>
> Hi Arrow devs,
>
> Since 2017, it has been possible to install the Arrow C++ library
> using the vcpkg package manager[1], but until recently, the Arrow
> vcpkg port ("port" is their term for a package) was maintained by
> community members, not by core Arrow devs. This led to a pattern of
> irregular updates that left vcpkg users sometimes stuck with very old
> versions of Arrow.
>
> Following the Arrow 4.0.0 release, I performed the task of updating
> the arrow vcpkg port to 4.0.0, to see what this process would entail
> [2]. Thanks to Tanguy Fautré at G-Research and to the vcpkg
> maintainers at Microsoft for help with this.
>
> The arrow vcpkg port[3] is fairly simple: it consists of a JSON
> manifest with port metadata and dependency information, a
> vcpkg-flavored CMake script, and a patch file to apply some necessary
> fixes. There are no binary assets; the CMake script downloads the
> source release.
>
> The process of updating the vcpkg port consists of opening a PR to
> modify these files, committing fixes as needed to make the CI green,
> then getting the PR merged by a vcpkg maintainer. The PR to update the
> arrow port to 4.0.0[4] was more complicated because the port
> previously used a legacy format to specify metadata and dependencies,
> and I updated it to use the current JSON manifest format. Going
> forward, I anticipate that it should be much more straightforward. I
> expect that the chief difficulties will be updating the patch file as
> needed when the Arrow CMake scripts it patches have changed and
> resolving vcpkg CI failures when they occur in the PR.
>
> I propose that the core Arrow devs take ownership of this task to
> update the vcpkg port following each Arrow release. For the
> foreseeable future, I intend to volunteer to perform this task
> following each release. Please reply with any objections, questions,
> or discussion you might have regarding this proposal.
>
> If there are no objections, I will add vcpkg port update instructions
> to the Arrow release management guide, maintain copies of the vcpkg
> port files in the apache/arrow repo, and perform the next arrow vcpkg
> port update following the upcoming 5.0.0 release.
>
> Thank you,
> Ian
>
> [1] https://vcpkg.io
> [2] https://issues.apache.org/jira/browse/ARROW-11581
> [3] https://github.com/microsoft/vcpkg/tree/master/ports/arrow
> [4] https://github.com/microsoft/vcpkg/pull/17975


[Rust] Preparing for the 5.0.0 release and the 5.x release line

2021-07-09 Thread Andrew Lamb
I plan to create a 5.0.0 arrow-rs release candidate (from master) early
next week. The 5.0 release will include breaking API changes.

Here is the list of issues I think are blockers:
* https://github.com/apache/arrow-rs/issues/529
* https://github.com/apache/arrow-rs/issues/463 (improved documentation)

Please let me know if there are any additional issues you think should be
included.

After the 5.0.0 release, we'll begin releasing 5.x every other week as
normal.

Andrew