Re: Policy on access to ursacomputing/crossbow?

2024-05-29 Thread Jonathan Keane
Thanks for the quick reply and action Raúl. I'm also very happy to help
craft such a document. I know that the need and use of being on the
ursacomputing org is limited, since the comment bot generally works well,
but it still would be nice to have something like that.

Thank you again!

-Jon


On Tue, May 28, 2024 at 11:22 AM Raúl Cumplido wrote:

> Hi Jon,
>
> From my understanding we currently don't have a written policy for
> accessing the crossbow repository, but PMC members should be allowed to
> request access for themselves and/or committers.
>
> I had to ask what happened with your access. It seems it was a mistake
> when someone was doing some cleanup on some user accesses.
>
> To keep this from happening again, I will work on a proposal for a
> written policy covering how to request access and when access is
> removed.
>
> Regards,
> Raúl
>
> On Sat, May 25, 2024 at 2:02, Jonathan Keane wrote:
> >
> > Over my time with the project I've had access to the github repository
> > ursacomputing/crossbow to be able to manually trigger crossbow jobs. I
> > find it incredibly helpful when working on the extended R CI to be able
> > to iterate more quickly than waiting for the comment bot.
> >
> > But also over the time I've used it I've been removed and then had to ask
> > to be readded to the organization at least twice now.
> >
> > I was recently (15 May) removed from the organization once again. One, is
> > it possible to be added back to the repository? And two: what is the
> > policy around who has access and when they get removed?
> >
> > -Jon
>


Policy on access to ursacomputing/crossbow?

2024-05-24 Thread Jonathan Keane
Over my time with the project I've had access to the github repository
ursacomputing/crossbow to be able to manually trigger crossbow jobs. I find
it incredibly helpful when working on the extended R CI to be able to
iterate more quickly than waiting for the comment bot.

But also over the time I've used it I've been removed and then had to ask
to be readded to the organization at least twice now.

I was recently (15 May) removed from the organization once again. One, is
it possible to be added back to the repository? And two: what is the policy
around who has access and when they get removed?

-Jon


Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Jonathan Keane
Congrats and welcome, Bryce.

-Jon


On Mon, Mar 18, 2024 at 6:47 AM Antoine Pitrou  wrote:

>
> Congratulations Bryce, and keep up the good work!
>
> Regards
>
> Antoine.
>
> On 18/03/2024 at 03:21, Nic Crane wrote:
> > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> > and thank you for your contributions!
> >
> > Nic
> >
>


Re: New tag for releases for R-universe

2024-02-10 Thread Jonathan Keane
Thanks for this Nic.

And just to clarify: the latest here is the latest _release_ of Apache
Arrow with this new setup. Prior to this, the builds available on R-universe
were effectively dev builds (commits to main), but with this new tag,
R-universe will only have (or at least default to having) the latest
release.

-Jon


On Sat, Feb 10, 2024 at 2:18 PM Nic Crane  wrote:

> Hi folks,
>
> The Arrow R package is distributed via a few different methods, one of
> which is R-universe[1].
>
> In order for r-universe to track the latest version of the R package, we
> have started using the tag "r-universe-release" to indicate the commit
> which represents the latest version of the R package (which is also
> submitted to CRAN).
>
> I'm mentioning it here just to be transparent about this - it doesn't make
> any changes to the current release process as it's still the version based
> off the release candidate and this is just an additional step for the R
> package which follows a successful main project release.
>
> Hope this all sounds OK - if not, happy to take feedback for changes etc on
> this.
>
> Thanks,
>
> Nic
>
>
> [1] https://r-universe.dev/
>


Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Jonathan Keane
Congratulations and welcome!

-Jon


Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-23 Thread Jonathan Keane
Thank you all for the kind words. Working with this community is a joy and
I hope to continue for a long time to come.

-Jon


On Mon, Oct 16, 2023 at 11:16 AM Vibhatha Abeykoon 
wrote:

> Congratulations Jon!
>
>
> On Mon, Oct 16, 2023 at 9:28 PM Kevin Gurney wrote:
>
> > Congratulations, Jonathan!
> >
> > 
> > From: Dane Pitkin 
> > Sent: Monday, October 16, 2023 11:52 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane
> >
> > Congrats Jon!!
> >
> > On Mon, Oct 16, 2023 at 7:04 AM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:
> >
> > > Congrats Jon!
> > >
> > > On Mon, Oct 16, 2023 at 11:20 AM Alenka Frim wrote:
> > > >
> > > > Yay, congratulations Jon!!
> > > >
> > > > On Mon, Oct 16, 2023 at 10:27 AM vin jake wrote:
> > > >
> > > > > Congrats Jon!
> > > > >
> > > > > On Sun, Oct 15, 2023 at 1:25 AM Andrew Lamb wrote:
> > > > >
> > > > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > > > Jonathan Keane to become a PMC member and we are pleased to announce
> > > > > > that Jonathan Keane has accepted.
> > > > > >
> > > > > > Congratulations and welcome!
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > >
> > >
> >
>


Re: Help regarding setting up the r package in arrow apache

2023-10-20 Thread Jonathan Keane
tting a working R dev setup on
> > Docker.
> >
> > I'd recommend instead looking at the article mentioned by me, Bryce, and
> > Jon [1].  Happy to answer any questions if any issues come up with those
> > instructions, as they could potentially be made more clear, and it's
> > always useful to get feedback on docs like these.
> >
> > Nic
> >
> > [1] https://arrow.apache.org/docs/r/articles/developers/docker.html
> >
> > On Fri, 20 Oct 2023 at 08:13, Divyansh Khatri <divyanshkhatri...@gmail.com> wrote:
> >
> > > please see this and help me resolve the issue
> > > https://gist.github.com/Divyansh200102/3ba4f5e391d8e62307f8b584a5a659d8
> > >
> > > On Wed, 18 Oct 2023 at 19:14, Jonathan Keane wrote:
> > >
> > > > For development of the R package with docker containers, the link [1]
> > > > that Nic sent in this same thread is the place to go. In addition to
> > > > that docker-focused one, there are a handful of others that might prove
> > > > useful to you in getting your development environment setup [2].
> > > >
> > > > If you run into any issues, feel free to post here, but it's helpful to
> > > > do so with debugging mode on (i.e. set the env var ARROW_DEV to true)
> > > > and to provide the exact commands you sent along with the output you're
> > > > seeing so we can help diagnose what's going wrong.
> > > >
> > > > [1] – https://arrow.apache.org/docs/r/articles/developers/docker.html
> > > > [2] – https://arrow.apache.org/docs/r/articles/index.html#developer-guides
> > > > -Jon
> > > >
> > > >
> > > > On Wed, Oct 18, 2023 at 2:48 AM Divyansh Khatri <divyanshkhatri...@gmail.com> wrote:
> > > >
> > > > > I am trying to contribute to the Arrow project, so I am trying to set
> > > > > up the project locally.
> > > > >
> > > > > On Tue, 17 Oct 2023 at 05:14, Bryce Mecum wrote:
> > > > >
> > > > > > That error makes it look like you're running `docker compose up`
> > > > > > from the root of the Arrow source tree which is likely not what you
> > > > > > want. Are you trying to use the Arrow R package in a Docker
> > > > > > container or are you trying to contribute to it by developing
> > > > > > inside of a Docker container? Nic's link [1] is a good starting
> > > > > > point.
> > > > > >
> > > > > > [1] https://arrow.apache.org/docs/r/articles/developers/docker.html
> > > > > >
> > > > > > On Mon, Oct 16, 2023 at 4:31 AM Divyansh Khatri wrote:
> > > > > > >
> > > > > > > Hi, so I am basically running the Docker command 'docker compose
> > > > > > > up -d' with the docker-compose.yml, but I am encountering this
> > > > > > > error (Error response from daemon: manifest for
> > > > > > > amd64/maven:3.5.4-eclipse-temurin-8 not found: manifest unknown:
> > > > > > > manifest unknown), so I am not sure how to proceed from here.
> > > > > > >
> > > > > > > On Mon, 16 Oct 2023 at 14:17, Nic Crane wrote:
> > > > > > >
> > > > > > > > Hi Divyansh,
> > > > > > > >
> > > > > > > > There are instructions for creating an R package dev setup here:
> > > > > > > > https://arrow.apache.org/docs/r/articles/developers/setup.html
> > > > > > > >
> > > > > > > > If you can explain a bit more about what you've tried so far and
> > > > > > > > what's not working, we may be able to advise.
> > > > > > > >
> > > > > > > > Best wishes,
> > > > > > > >
> > > > > > > > Nic
> > > > > > > >
> > > > > > > > On Mon, 16 Oct 2023 at 06:02, Divyansh Khatri <divyanshkhatri...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I am having problems setting up the Apache Arrow R package
> > > > > > > > > using Docker. Can you give me the step-by-step process for
> > > > > > > > > setting up the R package in my VS Code system using Docker?
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Jonathan Keane
+1

-Jon


On Wed, Oct 18, 2023 at 2:26 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> +1
>
> On Wed, Oct 18, 2023 at 2:49 PM Dewey Dunnington
>  wrote:
>
> > +1!
> >
> > On Wed, Oct 18, 2023 at 2:14 PM Matt Topol wrote:
> > >
> > > +1
> > >
> > > On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou wrote:
> > >
> > > > +1
> > > >
> > > > On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> > > > > Hello all,
> > > > >
> > > > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > > > BinaryView types in the Arrow C data interface [1].
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 - I'm in favor of these new C data format strings
> > > > > [ ] +0
> > > > > [ ] -1 - I'm against adding these new format strings because
> > > > >
> > > > > Ben Kietzman
> > > > >
> > > > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > > > >
> > > >
> >
>


Re: Help regarding setting up the r package in arrow apache

2023-10-18 Thread Jonathan Keane
For development of the R package with docker containers, the link [1] that
Nic sent in this same thread is the place to go. In addition to that
docker-focused one, there are a handful of others that might prove useful
to you in getting your development environment setup [2].

If you run into any issues, feel free to post here, but it's helpful to do
so with debugging mode on (i.e. set the env var ARROW_DEV to true) and to
provide the exact commands you sent along with the output you're seeing so
we can help diagnose what's going wrong.

[1] – https://arrow.apache.org/docs/r/articles/developers/docker.html
[2] – https://arrow.apache.org/docs/r/articles/index.html#developer-guides

-Jon
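
As a rough illustration of the advice above, the following Python sketch
builds an environment with ARROW_DEV set to true and formats the exact
command together with its output for pasting into a reply. The helper names
(arrow_debug_env, format_report) and the sample command are hypothetical and
are not part of Arrow; only the ARROW_DEV variable comes from the message.

```python
import os


def arrow_debug_env(base_env=None):
    """Return a copy of the environment with ARROW_DEV switched on."""
    env = dict(os.environ if base_env is None else base_env)
    # ARROW_DEV=true is the debugging switch named in the message above.
    env["ARROW_DEV"] = "true"
    return env


def format_report(command, output):
    """Bundle the exact command and its output into a paste-ready report."""
    return "$ {}\n{}".format(" ".join(command), output.rstrip())


# Hypothetical command, for illustration only; substitute what you ran.
command = ["R", "CMD", "INSTALL", "."]
env = arrow_debug_env({"PATH": "/usr/bin"})
print(env["ARROW_DEV"])
print(format_report(command, "...output copied from the terminal...\n"))
```

To capture real output, the command could be run with
subprocess.run(command, env=env, capture_output=True, text=True) and the
resulting stdout and stderr passed to format_report.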


On Wed, Oct 18, 2023 at 2:48 AM Divyansh Khatri 
wrote:

> I am trying to contribute to the Arrow project, so I am trying to set up
> the project locally.
>
> On Tue, 17 Oct 2023 at 05:14, Bryce Mecum  wrote:
>
> > That error makes it look like you're running `docker compose up` from
> > the root of the Arrow source tree which is likely not what you want.
> > Are you trying to use the Arrow R package in a Docker container or are
> > you trying to contribute to it by developing inside of a Docker
> > container? Nic's link [1] is a good starting point.
> >
> > [1] https://arrow.apache.org/docs/r/articles/developers/docker.html
> >
> > On Mon, Oct 16, 2023 at 4:31 AM Divyansh Khatri
> >  wrote:
> > >
> > > Hi, so I am basically running the Docker command 'docker compose up -d'
> > > with the docker-compose.yml, but I am encountering this error (Error
> > > response from daemon: manifest for amd64/maven:3.5.4-eclipse-temurin-8
> > > not found: manifest unknown: manifest unknown), so I am not sure how to
> > > proceed from here.
> > >
> > > On Mon, 16 Oct 2023 at 14:17, Nic Crane  wrote:
> > >
> > > > Hi Divyansh,
> > > >
> > > > There are instructions for creating an R package dev setup here:
> > > > https://arrow.apache.org/docs/r/articles/developers/setup.html
> > > >
> > > > If you can explain a bit more about what you've tried so far and
> > > > what's not working, we may be able to advise.
> > > >
> > > > Best wishes,
> > > >
> > > > Nic
> > > >
> > > > On Mon, 16 Oct 2023 at 06:02, Divyansh Khatri <divyanshkhatri...@gmail.com> wrote:
> > > >
> > > > > I am having problems setting up the Apache Arrow R package using
> > > > > Docker. Can you give me the step-by-step process for setting up the
> > > > > R package in my VS Code system using Docker?
> > > > >
> > > >
> >
>


Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Jonathan Keane
+1

-Jon


On Sat, Oct 7, 2023 at 3:54 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> +1
>
> On Sat, 7 Oct 2023 at 10:44, Antoine Pitrou  wrote:
> >
> >
> > +1 from me.
> >
> > But I also reiterate my plea that these existing parsers get fixed so as
> > to entirely validate the format string instead of stopping early.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 06/10/2023 at 23:26, Felipe Oliveira Carvalho wrote:
> > > Hello,
> > >
> > > I'm writing to propose "+vl" and "+vL" as format strings for list-view
> > > and large list-view arrays passing through the Arrow C data interface
> > > [1].
> > >
> > > The previous proposal was considered a bad idea because existing parsers
> > > of these format strings might be looking at only the first `l` (or `L`)
> > > after the `+` and assuming the classic list format from that alone, so
> > > now I'm proposing we start with a `+v` as this prefix is not shared with
> > > any other existing type so far.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of this new C Data Format string
> > > [ ] +0
> > > [ ] -1 - I'm against adding this new format string because
> > >
> > > Thanks everyone!
> > >
> > > --
> > > Felipe
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
>
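
The parser concern raised in the proposal above, and Antoine's request that
parsers validate the whole format string, can be sketched in Python as
follows. Both parsers are hypothetical illustrations rather than code from
any Arrow implementation, the lookup table is deliberately incomplete, and
"+lx" is a made-up string used only to show what an early-stopping parser
would do with an unknown type that puts `l` right after the `+`.

```python
# Format strings from the votes in this archive plus the classic list
# types; this table is intentionally tiny, not a complete list.
KNOWN_FORMATS = {
    "+l": "list",
    "+L": "large_list",
    "+vl": "list_view",
    "+vL": "large_list_view",
    "vu": "utf8_view",
    "vz": "binary_view",
}


def naive_parse(fmt):
    """Hypothetical early-stopping parser: checks only the char after '+'."""
    if fmt.startswith("+") and len(fmt) > 1:
        if fmt[1] == "l":
            return "list"
        if fmt[1] == "L":
            return "large_list"
    raise ValueError("unrecognized format string: {!r}".format(fmt))


def strict_parse(fmt):
    """Accepts a format string only if the entire string is known."""
    try:
        return KNOWN_FORMATS[fmt]
    except KeyError:
        raise ValueError("unrecognized format string: {!r}".format(fmt)) from None


def outcome(parser, fmt):
    """Return the parsed type name, or 'rejected' when the parser raises."""
    try:
        return parser(fmt)
    except ValueError:
        return "rejected"


for fmt in ["+l", "+vl", "+vL", "+lx"]:
    print(fmt, outcome(naive_parse, fmt), outcome(strict_parse, fmt))
```

Under these assumptions the early-stopping parser rejects "+vl" and "+vL"
outright instead of silently treating them as classic lists, which is the
property the `+v` prefix is meant to provide, while it still misreads the
made-up "+lx" as a list. The strict parser avoids both problems.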


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Jonathan Keane
> I would understand this objection more if DuckDB hasn't been relying on
> being able to pass PyArrow expressions for 18 months now [1]. Unless, do
> we just think this isn't widely used enough that we don't care?

This isn't a pro or a con of specifically adopting the PyArrow expression
semantics as is / with a warning about changing / not at all, but having
some kind of standardization in this interface would be very nice. This
even came up while collaborating with the DuckDB folks that using some of
the expression bits here (and in the R equivalents) was a little bit odd
and having something like a proper API for that would have made that
more natural (and likely that would have been used had it existed 18 months
ago :))

-Jon


On Wed, Jun 28, 2023 at 1:17 PM David Li  wrote:

> That wouldn't remove the feature from DuckDB, would it? It would just mean
> that we recognize that PyArrow expressions don't have well-defined
> semantics that we are committing to at this time. As long as we have
> `**kwargs` everywhere, we can in the future introduce a
> `substrait_filter_expression` or similar argument, while allowing current
> implementors to handle `filter` if possible. (As a compromise, we could
> reserve `filter` and existing arguments and note that PyArrow Expression
> semantics are subject to change without notice?)
>
> On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > Hi Ian,
> >
> >
> >> I favor option 2 out of concern that option 1 could create a
> >> temptation for users of this protocol to depend on a feature that we
> >> intend to deprecate.
> >>
> >
> > I would understand this objection more if DuckDB hasn't been relying on
> > being able to pass PyArrow expressions for 18 months now [1]. Unless, do
> > we just think this isn't widely used enough that we don't care?
> >
> > Best,
> > Will
> >
> > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> >
> > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook  wrote:
> >
> >> > I think there's three routes we can go here:
> >> >
> >> > 1. We keep PyArrow expressions in the API initially, but once we have
> >> > Substrait-based alternatives we deprecate the PyArrow expression
> >> > support. This is what I intended with the current design, and I think
> >> > it provides the most obvious migration paths for existing producers
> >> > and consumers.
> >> > 2. We keep the overall dataset API, but don't introduce the filter and
> >> > projection arguments until we have Substrait support. I'm not sure
> >> > what the migration path looks like for producers and consumers, but I
> >> > think this just implicitly becomes the same as (1), but with worse
> >> > documentation.
> >> > 3. We write a protocol completely from scratch, that doesn't try to
> >> > describe the existing dataset API. Producers and consumers would then
> >> > migrate to use the new protocol and deprecate their existing dataset
> >> > integrations. We could introduce a dunder method in that API (sort of
> >> > like __arrow_array__) that would make the migration seamless from the
> >> > end-user perspective.
> >> >
> >> > *Which do you all think is the best path forward?*
> >>
> >> I favor option 2 out of concern that option 1 could create a
> >> temptation for users of this protocol to depend on a feature that we
> >> intend to deprecate. I think option 2 also creates a stronger
> >> motivation to complete the Substrait expression integration work,
> >> which is underway in https://github.com/apache/arrow/pull/34834.
> >>
> >> Ian
> >>
> >>
> >> On Fri, Jun 23, 2023 at 1:25 PM Weston Pace wrote:
> >> >
> >> > > The trouble is that Dataset was not designed to serve as a
> >> > > general-purpose unmaterialized dataframe. For example, the PyArrow
> >> > > Dataset constructor [5] exposes options for specifying a list of
> >> > > source files and a partitioning scheme, which are irrelevant for
> >> > > many of the applications that Will anticipates. And some work is
> >> > > needed to reconcile the methods of the PyArrow Dataset object [6]
> >> > > with the methods of the Table object. Some methods like filter() are
> >> > > exposed by both and behave lazily on Datasets and eagerly on Tables,
> >> > > as a user might expect. But many other Table methods are not
> >> > > implemented for Dataset though they potentially could be, and it is
> >> > > unclear where we should draw the line between adding methods to
> >> > > Dataset vs. encouraging new scanner implementations to expose
> >> > > options controlling what lazy operations should be performed as they
> >> > > see fit.
> >> >
> >> > In my mind there is a distinction between the "compute domain" (e.g. a
> >> > pandas dataframe or something like ibis or SQL) and the "data domain"
> >> > (e.g. pyarrow datasets).  I think, in a perfect world, you could push
> >> > any and all compute up and down the chain as far as possible.
> >> > However, in practice, I think there is a healthy set of tools and
> >> > libraries that say 

Re: [VOTE] Move issue tracking to GitHub Issues

2022-10-26 Thread Jonathan Keane
+1, I'm very glad to see what will hopefully be a _slightly smoother_
experience for new contributors + issue reporters

-Jon


On Wed, Oct 26, 2022 at 7:05 PM David Li  wrote:

> +1
>
> On Wed, Oct 26, 2022, at 20:01, Andy Grove wrote:
> > +1
> >
> > On Wed, Oct 26, 2022 at 5:50 PM L. C. Hsieh  wrote:
> >
> >> +1
> >>
> >>
> >>
> >> On Wed, Oct 26, 2022 at 4:08 PM Raphael Taylor-Davies
> >>  wrote:
> >> >
> >> > +1
> >> >
> >> > On 27/10/2022 12:02, Neal Richardson wrote:
> >> > > I propose that we move issue tracking from the ASF's Jira to GitHub
> >> > > Issues. This has been discussed on [1] and [2] and there seems to be
> >> > > consensus. A number of Arrow subprojects already use GitHub Issues;
> >> > > this moves the issue tracking for `apache/arrow` into GitHub along
> >> > > with the source code.
> >> > >
> >> > > The vote will be open for at least 72 hours.
> >> > >
> >> > > [ ] +1 Leave ASF Jira and move to GitHub Issues
> >> > > [ ] +0
> >> > > [ ] -1 Remain in Jira because...
> >> > >
> >> > > My vote: +1
> >> > >
> >> > > Neal
> >> > >
> >> > >
> >> > > [1]: https://lists.apache.org/thread/l545m95xmf3w47oxwqxvg811or7b93tb
> >> > > [2]: https://lists.apache.org/thread/0vwj8gdo55jly5zn16wksrotyqqm0zqr
> >> > >
> >>
>


Re: [ANNOUNCE] New Arrow PMC member: Nicola Crane

2022-10-25 Thread Jonathan Keane
Congratulations! Your contributions to the project have been immeasurable.

-Jon


On Tue, Oct 25, 2022 at 8:12 PM Vibhatha Abeykoon 
wrote:

> Congrats Nic!
>
> On Wed, Oct 26, 2022 at 5:30 AM Ashish  wrote:
>
> > Congrats !
> >
> > On Wednesday, October 26, 2022, Anja  wrote:
> >
> > > Congrats!!
> > >
> > > On Tue, 25 Oct 2022 at 15:45, Rok Mihevc  wrote:
> > >
> > > > Congrats Nic!
> > > >
> > > > Rok
> > > >
> > > > On Tue, Oct 25, 2022 at 11:16 PM Will Jones wrote:
> > > >
> > > > > Congrats Nic!
> > > > >
> > > > > On Tue, Oct 25, 2022 at 2:14 PM David Li wrote:
> > > > >
> > > > > > Congrats & welcome Nic!
> > > > > >
> > > > > > On Tue, Oct 25, 2022, at 17:07, Matt Topol wrote:
> > > > > > > Congrats!!
> > > > > > >
> > > > > > > On Tue, Oct 25, 2022 at 5:06 PM Sutou Kouhei <k...@clear-code.com> wrote:
> > > > > > >
> > > > > > >> The Project Management Committee (PMC) for Apache Arrow has
> > > > > > >> invited Nicola Crane to become a PMC member and we are pleased
> > > > > > >> to announce that Nicola Crane has accepted.
> > > > > > >>
> > > > > > >> Congratulations and welcome!
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > thanks
> > ashish
> >
> --
> Vibhatha Abeykoon
>


Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Jonathan Keane
+1 (non binding)

-Jon

On Wed, Jun 8, 2022 at 4:52 PM Jorge Cardoso Leitão
 wrote:
>
> Sorry, I got a bit confused on what we were voting on. Thank you for the
> clarification.
>
> +1
>
> Best,
> Jorge
>
>
> On Wed, Jun 8, 2022 at 9:53 PM Antoine Pitrou  wrote:
>
> >
> > On 08/06/2022 at 20:55, Jorge Cardoso Leitão wrote:
> > > 0 (binding) - imo there is some unclarity over what is expected to be
> > > passed over the C streaming interface - an Array or a StructArray.
> > >
> > > I think the spec claims the former, but the C++ implementation (which I
> > > assume is the reference here) expects the latter [1].
> >
> > It is definitely the former, despite any limitation in the C++
> > implementation.
> >
> > > Would it be possible to clarify this on either end so we do not clone the
> > > spec and/or reference implementation with this unclarity?
> >
> > For the record, there is no reference implementation. The spec is the
> > reference.
> >
> > Regards
> >
> > Antoine.
> >


Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-03 Thread Jonathan Keane
cc Hannes Mühleisen from DuckDB Labs

-Jon


On Tue, May 31, 2022 at 5:03 PM Wes McKinney  wrote:

> I'm also supportive of having a small vendorable C/C++ "Arrow
> middleware" that provides:
>
> * Schemas and types
> * Columnar data structures and minimal APIs to build them and iterate over
> them
> * C data interface
> * Minimal validation (at the level of Validate but not ValidateFull)
>
> I don't think it's going to be practical to try to refactor parts of
> the existing Arrow C++ core to be vendorable since there are many
> features / requirements (e.g. an extensible buffer and device API)
> that these C++ classes include that aren't needed in this
> limited-feature middleware library.
>
> This also relates to the "Improving Arrow's database support" project
> that David Li raised some time ago [1]. If we want to encourage
> database driver libraries to add new APIs that emit the Arrow C
> interface, we need to make it easier to generate the C interface
> without requiring a new library dependency.
>
> [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>
> On Mon, May 30, 2022 at 11:31 AM Jonathan Keane  wrote:
> >
> > Thanks for working on this. I've heard people asking about something
> > like this from a number of different fronts on top of the obvious use
> > case in geoarrow | other geospatial libraries. I think a minimal piece
> > of Arrow that other packages could depend on without needing to bring
> > in all of arrow would be super valuable in building the bridges we
> > want across other systems.
> >
> > Do you have any (design) documentation that describes the scope of
> > what you're thinking? I know there have been others floating around
> > [1] [2] that were in a similar spirit.
> >
> > A few more questions I hope will spark more conversation: How do the
> > header files you linked in [3] overlap with these other efforts? Are
> > those headers something we could|should "just" PR into apache/arrow
> > and write up how to use them? If not what is the work to make them so
> > that they could be (the answer of course could be design something
> > else entirely and PR that!)?
> >
> > [1] https://github.com/paleolimbot/narrow
> > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
> > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >
> > -Jon
> >
> >
> > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington wrote:
> > >
> > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
> > > reading/exporting Arrow C Data interface structures. My use-case is
> > > building Arrow geospatial support in R [1], and while the set of
> > > helpers I've been using [2] has served the purpose of me writing about
> > > the opportunities for Arrow + geospatial [3], I would like to rewrite
> > > the prototype based on something developed by/with the Arrow community.
> > >
> > > Does a set of C/C++ helpers for Arrow C Data interface structures
> > > already exist? *Should* it exist?
> > >
> > > If it doesn't, what should the name/scope of that library be? The names
> > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in
> > > my limited discussion of this so far. For the purpose of starting the
> > > discussion, I'll posit that the library should include helpers to
> > > allocate/destroy C Data interface structures, a schema metadata
> > > encoder/decoder, validation of a schema/array pair, and something like
> > > the ArrayBuilder C++ class.
> > >
> > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
>


Re: [Discuss][Java] macOS minimum requirements

2022-06-01 Thread Jonathan Keane
This isn't Java related directly, but for the R bindings we have to
support at least 10.13.6 to be on CRAN, so bumping up to 10.13 would
be fine for that too.

-Jon

On Wed, Jun 1, 2022 at 9:24 AM Antoine Pitrou  wrote:
>
>
> Sorry, I put "C++" in the title but this really affects Java via JNI.
>
>
> On 01/06/2022 at 16:22, Antoine Pitrou wrote:
> >
> >
> > Hello,
> >
> > The topic came up recently of bumping up our minimal macOS requirements
> > from 10.11 to 10.13 (*).  Do people have any particular concerns about this?
> >
> > (*) https://github.com/apache/arrow/pull/13157#issuecomment-1143670152
> >
> > Regards
> >
> > Antoine.


Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-05-30 Thread Jonathan Keane
Thanks for working on this. I've heard people asking about something
like this from a number of different fronts on top of the obvious use
case in geoarrow | other geospatial libraries. I think a minimal piece
of Arrow that other packages could depend on without needing to bring
in all of arrow would be super valuable in building the bridges we
want across other systems.

Do you have any (design) documentation that describes the scope of
what you're thinking? I know there have been others floating around
[1] [2] that were in a similar spirit.

A few more questions I hope will spark more conversation: How do the
header files you linked in [3] overlap with these other efforts? Are
those headers something we could|should "just" PR into apache/arrow
and write up how to use them? If not what is the work to make them so
that they could be (the answer of course could be design something
else entirely and PR that!)?

[1] https://github.com/paleolimbot/narrow
[2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
[3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp

-Jon


On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington  wrote:
>
> I'm writing to gauge interest in a set of helpers in C and/or C++ for
> reading/exporting Arrow C Data interface structures. My use-case is
> building Arrow geospatial support in R [1], and while the set of helpers
> I've been using [2] has served the purpose of me writing about the
> opportunities for Arrow + geospatial [3], I would like to rewrite the
> prototype based on something developed by/with the Arrow community.
>
> Does a set of C/C++ helpers for Arrow C Data interface structures already
> exist? *Should* it exist?
>
> If it doesn't, what should the name/scope of that library be? The names
> 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in my
> limited discussion of this so far. For the purpose of starting the
> discussion, I'll posit that the library should include helpers to
> allocate/destroy C Data interface structures, a schema metadata
> encoder/decoder, validation of a schema/array pair, and something like the
> ArrayBuilder C++ class.
>
> [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> [2]
> https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> [3]
> https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
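
To make the "schema metadata encoder/decoder" item in the quoted proposal
concrete, here is a minimal Python sketch. It assumes the metadata layout as
I understand it from the C data interface specification: a native-endian
int32 pair count, then for each pair an int32 key length, the key bytes, an
int32 value length, and the value bytes. That layout should be checked
against the specification before relying on it. The function names and the
sample value "example.point" are hypothetical and do not come from any of
the libraries named in the thread.

```python
import struct


def encode_metadata(pairs):
    """Encode a {bytes: bytes} mapping using the assumed metadata layout."""
    out = [struct.pack("=i", len(pairs))]  # number of key/value pairs
    for key, value in pairs.items():
        out.append(struct.pack("=i", len(key)))
        out.append(key)
        out.append(struct.pack("=i", len(value)))
        out.append(value)
    return b"".join(out)


def _take(buf, offset, length):
    """Slice exactly `length` bytes, failing loudly on truncated input."""
    if length < 0 or offset + length > len(buf):
        raise ValueError("metadata buffer is truncated or corrupt")
    return bytes(buf[offset:offset + length]), offset + length


def _read_int32(buf, offset):
    """Read one native-endian int32 and return it with the new offset."""
    chunk, offset = _take(buf, offset, 4)
    return struct.unpack("=i", chunk)[0], offset


def decode_metadata(buf):
    """Decode a buffer produced by encode_metadata back into a dict."""
    count, offset = _read_int32(buf, 0)
    pairs = {}
    for _ in range(count):
        key_len, offset = _read_int32(buf, offset)
        key, offset = _take(buf, offset, key_len)
        value_len, offset = _read_int32(buf, offset)
        value, offset = _take(buf, offset, value_len)
        pairs[key] = value
    return pairs


# Illustrative value only; "example.point" is not a registered extension.
sample = {b"ARROW:extension:name": b"example.point"}
encoded = encode_metadata(sample)
print(len(encoded), decode_metadata(encoded) == sample)
```

A C helper library would do the same walk over a `const char*` buffer; the
explicit bounds check in _take stands in for the kind of validation the
proposal also lists.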


Re: DISCUSS: Stabilize Arrow C Stream Interface?

2022-05-26 Thread Jonathan Keane
I too am +1 (nonbinding) to marking it as stable

-Jon


On Thu, May 26, 2022 at 1:05 PM Neal Richardson wrote:

> +1 from me too to mark it as stable. De facto it is stable: there have been
> no modifications to
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/abi.h since
> the stream was added in 2020.
>
> Neal
>
> On Thu, May 26, 2022 at 12:32 PM Dewey Dunnington wrote:
>
> > I'm fairly new to this but have worked on the DuckDB--R bindings
> > integration and used it in geospatial prototyping for a few things. I
> > would love to see the ArrowArrayStream declared as stable to promote its
> > adoption (or start the process of finalizing its definition if there is
> > pending feedback that hasn't yet been incorporated).
> >
> > On Wed, May 25, 2022 at 6:59 PM Will Jones wrote:
> >
> > > The Arrow C Stream Interface is still listed as experimental [1],
> > > though it was introduced about 20 months ago [2]. It's being used in
> > > the well-advertised integration between PyArrow/R arrow and DuckDB [3].
> > > Support was added to both Rust implementations [4][5]. It was discussed
> > > in today's sync meeting that additional systems have been experimenting
> > > with it [6].
> > >
> > > Should we stabilize the API now? Or are there any changes we need to
> > > contemplate based on the experience of our early adopters?
> > >
> > > [1] https://arrow.apache.org/docs/7.0/format/CStreamInterface.html
> > > [2] https://github.com/apache/arrow/pull/8052
> > > [3] https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
> > > [4] https://github.com/apache/arrow-rs/pull/1384
> > > [5] https://github.com/jorgecarleitao/arrow2/pull/857
> > > [6] https://lists.apache.org/thread/x03nck8rpmyd8td6vpz1ctqgno1cbf10
> > >
> >
>


Re: [VOTE] Release Apache Arrow 7.0.0 - RC8

2022-01-27 Thread Jonathan Keane
+0 most things validate, though I haven't been able to run the C++
tests successfully

Thank you for the huge effort Krisztián.

I verified the signature + checksums on [3].
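
For anyone who wants to script the checksum half of that step, a rough sketch follows. It assumes each artifact has a sibling checksum file whose first whitespace-separated field is the hex digest (the layout that tools like shasum and sha512sum write). The function names are illustrative only, and this does not cover the GPG signature check.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so a large release tarball need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, checksum_path, algorithm="sha512"):
    """Compare a file against a checksum file whose first field is the digest."""
    with open(checksum_path, "r", encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return file_digest(path, algorithm) == expected
```

Calling checksum_matches("some.tar.gz", "some.tar.gz.sha512") returns True only when the digests agree; the file names here are placeholders.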

I've run the following (on macOS 12.1):

The binary verification — successful.

I've also run the source verification on:
* C++ — 1 test failure:
BitUtilTests.TestCopyAndReverseBitmapPreAllocated. Is this flaky? It
fails each time I try C++ (or any of the packages that depend on C++)
* JS — had to install yarn (should we add this to the release
verification instructions for macos [13]?) but successful
* Go — successful
* csharp — complained about a dotnet version mismatch (I didn't dig
too deeply on this one)


[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc8
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates

-Jon






On Wed, Jan 26, 2022 at 7:24 AM Krisztián Szűcs
 wrote:
>
> Hi,
>
> I would like to propose the following release candidate (RC8) of Apache
> Arrow version 7.0.0. This is a release consisting of 618
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 400b5d989dd3a654bc1061d19a5ae3e95972e5eb [2]
>
> The source release rc8 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 7.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 7.0.0 because...
>
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
> [2]: 
> https://github.com/apache/arrow/tree/400b5d989dd3a654bc1061d19a5ae3e95972e5eb
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc8
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc8
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc8
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc8
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]: 
> https://github.com/apache/arrow/blob/400b5d989dd3a654bc1061d19a5ae3e95972e5eb/CHANGELOG.md
> [13]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-17 Thread Jonathan Keane
This doesn't address the large number of row groups ticket that was
raised, but for some visibility: there is some work to change the row
group sizing based on the size of data instead of a static number of
rows [1] as well as exposing a few more knobs to tune [2]

There is a bit of prior art in the R implementation for attempting to
get a reasonable row group size based on the shape of the data
(basically, aims to have row groups that have 250 Million cells in
them). [3]
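
As a rough illustration of that idea (this is a sketch of the arithmetic, not a copy of the R code at [3]), one could pick the row group length from the shape of the data like so. The function name and the default target are assumptions taken from the description above.

```python
import math


def rows_per_group(n_rows, n_cols, target_cells=250_000_000):
    """Pick a row-group length so each group holds about target_cells cells."""
    if n_rows <= 0 or n_cols <= 0:
        return 0
    total_cells = n_rows * n_cols
    # Number of groups needed to stay at or under the target, at least one.
    n_groups = max(1, math.ceil(total_cells / target_cells))
    # Spread the rows evenly across those groups.
    return math.ceil(n_rows / n_groups)
```

With these defaults a 1,000,000 x 10 table stays in a single group of 1,000,000 rows, while a 1,000,000,000 x 10 table is split into groups of 25,000,000 rows.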

[1] https://issues.apache.org/jira/browse/ARROW-4542
[2] https://issues.apache.org/jira/browse/ARROW-14426 and
https://issues.apache.org/jira/browse/ARROW-14427
[3] 
https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222

-Jon

On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
 wrote:
>
> In addition, would it be useful to be able to change this max_row_group_length
> from Python?
> Currently that writer property can't be changed from Python, you can only
> specify the row_group_size (chunk_size in C++)
> when writing a table, but that's currently only useful to set it to
> something that is smaller than the max_row_group_length.
>
> Joris
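
A small sketch of the interaction Joris describes, assuming the writer simply caps the requested row_group_size at max_row_group_length. The function itself is hypothetical and only models the resulting group lengths.

```python
def plan_row_groups(n_rows, row_group_size, max_row_group_length):
    """Return the row-group lengths a writer would emit if the requested
    size is capped by the writer-level maximum (an assumption of this sketch)."""
    effective = min(row_group_size, max_row_group_length)
    if effective <= 0:
        raise ValueError("row group sizes must be positive")
    full, remainder = divmod(n_rows, effective)
    return [effective] * full + ([remainder] if remainder else [])
```

Under that assumption, asking for a row_group_size above the maximum has no effect, which is why being able to raise max_row_group_length from Python would be useful.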


Re: Arrow sync call November 10 at 12:00 US/Eastern, 17:00 UTC

2021-11-10 Thread Jonathan Keane
Meeting notes:

# Participants
 Nic
 Weston
 David
 Eduardo
 Benson
 Rok
 Antoine
 Alenka
 James
 Matt
 Micah

# 6.0.1 patch release
The RC1 for 6.0.1 is on its way and will have a vote shortly

# Flight SQL
David wanted to talk about Flight SQL from Dremio. We are close, would
like someone to review the C++, but we should be able to have a vote
soon.

Micah: we should mark this as experimental, yeah? Not yet having
forwards/backwards compatibility. Maybe we should clarify this in the
vote?

We will clarify this as part of the vote and as part of the pull
request too. We don't _plan_ to make big changes

-Jon

On Wed, Nov 10, 2021 at 9:27 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian


Re: [DISCUSS] Deprecate user@ in favor for github issues/discussions

2021-09-29 Thread Jonathan Keane
I am also +1 for all of the same reasons both Neal and Philip mention.
Lowering that barrier to participation for getting help + having that
information more easily findable will make it easiest for folks to use
and adopt Arrow. I will add that, personally, I didn't realize I already
do this with other pieces of software that I use: when I bump into
something that is not totally clear, my first course of action is
"lemme search the GH issues / PRs..."

This would also be a good complement to the cookbook work, which
contains more general solutions; this GH space would likely have more
discussion about specifics (and maybe even would provide fodder for
continuing to expand the cookbook if particular issues come up
repeatedly).

I don't have a strong opinion on GitHub Issues versus Discussions.
I've not used GitHub discussions at all, but am happy to try that out.
This was mentioned on the call as well, but Apache Airflow uses GitHub
Discussions, so it should be something we can have enabled.

-Jon


On Wed, Sep 29, 2021 at 1:39 PM Neal Richardson
 wrote:
>
> +1 from me too. More and more developers seem to be accustomed to using
> GitHub Issues to ask for help, and redirecting them to a mailing list adds
> a barrier to participation.
>
> Neal
>
> On Wed, Sep 29, 2021 at 2:32 PM Phillip Cloud  wrote:
>
> > I am +1 on steering users towards GitHub issues for support questions. I
> > think there's a lot of value in someone being able to use a search engine
> > to potentially find an answer to their problem.
> >
> > On Wed, Sep 29, 2021 at 2:16 PM Micah Kornfield 
> > wrote:
> >
> > > We discussed briefly on the sync this morning, but I was wondering what
> > > people thought about removing the user@ mailing list in favor of either
> > > Github issues or discussions.  We can try to mirror issues to an
> > > appropriate mailing list if archiving for posterity.
> > >
> > > Off the top of my head here are some pros/cons to approaches:
> > >
> > > Pros:
> > > - Github focuses on SEO which makes answers to one-off questions easier
> > > to find.
> > > -  Github issues seem to have roughly the same traffic as the user@
> > > mailing.  It would likely have more if we didn't steer people to user@.
> > >
> > > Cons:
> > > -  This decentralizes user issues across the Arrow repos.
> > >
> > >
> > > - This is NOT a proposal to use github issues in place of JIRA (for the
> > > languages that are currently using JIRA).
> > > - This is NOT a proposal to make any modification to the dev@ mailing
> > > list (I think centralization here is important).
> > >
> > > Thoughts?
> > >
> > > Thanks,
> > > Micah
> > >
> >


Re: [VOTE] Restart the Julia implementation with new repository and process

2021-09-27 Thread Jonathan Keane
+1

-Jon

On Mon, Sep 27, 2021 at 2:26 PM Mauricio Vargas
 wrote:
>
> +1
>
> On Mon, Sep 27, 2021 at 3:18 PM Neal Richardson 
> wrote:
>
> > +1 (binding)
> >
> > Neal
> >
> > On Mon, Sep 27, 2021 at 6:54 AM Andrew Lamb  wrote:
> >
> > > +1 (binding)
> > >
> > > On Mon, Sep 27, 2021 at 12:17 AM Andy Grove 
> > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Sun, Sep 26, 2021 at 9:11 PM Benjamin Kietzman
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Sun, Sep 26, 2021, 23:08 Micah Kornfield 
> > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Sunday, September 26, 2021, Sutou Kouhei 
> > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This vote is to determine if the Arrow PMC is in favor of
> > > > > > > the Julia community moving the Julia implementation of
> > > > > > > Apache Arrow out of apache/arrow into apache/arrow-julia.
> > > > > > >
> > > > > > > The Julia community uses a process like the Rust community
> > > > > > > uses [1][2].
> > > > > > >
> > > > > > > Here is a summary of the process:
> > > > > > >
> > > > > > >   1. Use GitHub instead of JIRA for issue management platform
> > > > > > >
> > > > > > >  Note: Contributors will be required to write issues for
> > > > > > >  planned features and bug fixes so that we have
> > > > > > >  visibility and opportunities for collaboration before a
> > > > > > >  PR shows up.
> > > > > > >
> > > > > > >  (This is for the Apache way.)
> > > > > > >
> > > > > > >  [1]
> > > > > > >
> > > > > > >   2. Release on demand
> > > > > > >
> > > > > > >  Like DataFusion.
> > > > > > >
> > > > > > >  Release for apache/arrow doesn't include the Julia
> > > > > > >  implementation.
> > > > > > >
> > > > > > >  The Julia implementation uses separated version
> > > > > > >  scheme. (apache/arrow uses 6.0.0 as the next version
> > > > > > >  but the next Julia implementation release doesn't use
> > > > > > >  6.0.0.)
> > > > > > >
> > > > > > >  [2]
> > > > > > >
> > > > > > > We'll create apache/arrow-julia and start IP clearance
> > > > > > > process to import JuliaData/Arrow.jl to apache/arrow after
> > > > > > > the vote is passed. (We don't use julia/arrow/ in
> > > > > > > apache/arrow.)
> > > > > > >
> > > > > > > See also discussions about this: [3][4]
> > > > > > >
> > > > > > >
> > > > > > > Please vote whether to accept the proposal and allow the
> > > > > > > Julia community to proceed with the work.
> > > > > > >
> > > > > > > The vote will be open for at least 72 hours.
> > > > > > >
> > > > > > > [ ] +1 : Accept the proposal
> > > > > > > [ ] 0 : No opinion
> > > > > > > [ ] -1 : Reject proposal because...
> > > > > > >
> > > > > > >
> > > > > > > [1] https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit
> > > > > > > [2] https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md
> > > > > > > [3] https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
> > > > > > > [4] https://lists.apache.org/x/thread.html/r0df7f44f7e1ed7f6e4352d34047d53076208aa78aad308e30b58f83a%40%3Cdev.arrow.apache.org%3E
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > --
> > > > > > > kou
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> —
> *Mauricio 'Pachá' Vargas Sepúlveda*
> Site: pacha.dev
> Blog: pacha.dev/blog


Re: Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2021-08-04 Thread Jonathan Keane
Notes for the meeting, which was relatively short and sparsely attended
this fortnight:

Attendees:
* David Li
* Jonathan Keane
* Nic Crane
* Neal Richardson

Topics discussed
* Compute IR proposal: There's been some discussion, check it out
* CRAN resubmission, we have the fixes we need, will send the
resubmission shortly

Thanks to all who attended, see y'all in a fortnight.

-Jon

On Tue, Aug 3, 2021 at 4:20 PM Jonathan Keane  wrote:
>
> Hello everyone,
>
> Our biweekly sync call is tomorrow (3 August) at 12:00 noon Eastern time.
>
> For today's call, let's please use this Google Meet URL (different from the
> usual one):
> https://meet.google.com/vbq-yufg-zwr?authuser=0
>
> All are welcome to join. Notes will be shared with the mailing list
> afterward.
>
> Thanks,
> -Jon


Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2021-08-03 Thread Jonathan Keane
Hello everyone,

Our biweekly sync call is tomorrow (3 August) at 12:00 noon Eastern time.

For today's call, let's please use this Google Meet URL (different from the
usual one):
https://meet.google.com/vbq-yufg-zwr?authuser=0

All are welcome to join. Notes will be shared with the mailing list
afterward.

Thanks,
-Jon


[Discuss] If and how we should integrate geospatial data (specs) in Arrow

2021-06-25 Thread Jonathan Keane
Hello,

There is an emerging spec[1] for how to store geospatial data in Arrow
+ pass through parquet files in the geopandas world. There is even a
new R package that implements a wrapper to do the same in R[2]. These
both define a serialization[3] for storing geospatial data as an Arrow
table (and thus also when saving to parquet with Arrow).

I could see a number of ways that we might interact with standards
like these, and for any of these that we pursue it would be good to
clarify that in our docs:

1. Point to the standard — we could mention that this standard exists
and that if someone is building a geospatial data aware application,
they _could_ refer to this standard if they want to.
2. Adopt a/this standard — this could range from stating that we've
adopted it as the way that spatial data _ought_ to be stored to asking
the creators if maintaining it within the Arrow project itself would
be better (either by adopting it or creating a fork — of course
communication with the folks working on it now would be critical!)
3. Create extension type(s) for geospatial data — this would require
adopting a standard like the one linked, but on top of that providing
an extension type within Arrow itself that the various clients could
implement as they saw fit.
4. Create new, fully separate type(s) for geospatial data — again,
this would require adopting a standard of some sort, but we would
implement it as a specific type and presumably support it in all of
the clients as we could.

There are of course pros and cons to all of these. This type of data
*is* somewhat specialized and I don't think we want to have a huge
profusion of types for all of the possible specialized data types out
there. But, at a minimum we should acknowledge (or adopt) a standard
if it exists and encourage implementations that use Arrow to follow
that standard (like sfarrow does to be compatible with geopandas) so
that some level of interoperability is there + people aren't needing
to reinvent the wheel each time they store spatial data.

Thoughts? Are there other projects out there that already do something
like this with Arrow that we should consider?

[1] https://github.com/geopandas/geo-arrow-spec/pull/2
[2] https://github.com/wcjochem/sfarrow
[3] for now they create a binary WKB column + attach a bit of metadata
to the schema noting that that's what happened, though there are other
ways one could encode this and the spec might include other way(s) to
store this data in the future.
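
To give a feel for what that serialization amounts to, here is a minimal sketch: points encoded as little-endian WKB plus a small metadata entry recording which column holds geometry. The WKB point layout (byte-order flag, geometry type 1, two float64 values) is standard; the metadata key and field names are placeholders for illustration, not the names used by the spec in [1].

```python
import json
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB: byte-order flag 1,
    geometry type 1 (Point), then x and y as float64."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf):
    """Decode the subset of WKB produced by point_to_wkb."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled here")
    return x, y


# A binary column of WKB values, as it might sit in an Arrow table.
geometry_column = [point_to_wkb(-64.36, 45.09), point_to_wkb(0.0, 0.0)]

# Hypothetical schema-level metadata saying "this column is WKB geometry".
schema_metadata = {
    b"geo": json.dumps(
        {"primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB"}}}
    ).encode("utf-8")
}
```

A reader that understands the metadata can decode the column back into geometries, while one that does not still sees an ordinary binary column, which is most of the appeal of this approach.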

-Jon


Re: [VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-25 Thread Jonathan Keane
+1

-Jon

On Fri, Jun 25, 2021 at 5:30 AM Rok Mihevc  wrote:
>
> +1 (non-binding)
>
> On Fri, Jun 25, 2021 at 11:21 AM Eduardo Ponce  wrote:
>
> > +1 (non-binding)
> >
> > On Fri, Jun 25, 2021 at 4:31 AM Joris Peeters 
> > wrote:
> >
> > > +1
> > >
> > > On Fri, Jun 25, 2021 at 9:29 AM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > > On Thu, 24 Jun 2021 at 21:21, Micah Kornfield 
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Thu, Jun 24, 2021 at 12:17 PM Weston Pace 
> > > > > wrote:
> > > > >
> > > > > > The discussion in [1] led to the following proposal which I would
> > > > > > like to submit for a vote.
> > > > > >
> > > > > > ---
> > > > > > Arrow allows a timestamp column to omit the time zone property.
> > > > > > This has caused confusion because some people have interpreted a
> > > > > > timestamp without a time zone to be an Instant while others have
> > > > > > interpreted it to be a LocalDateTime.
> > > > > >
> > > > > > This proposal is to clarify the Arrow schema (via comments) and
> > > > > > assert that a timestamp without a time zone should be interpreted
> > > > > > as LocalDateTime.
> > > > > >
> > > > > > Note: For definitions of Instant and LocalDateTime (and a
> > > > > > discussion on the semantics) please refer to [3]
> > > > > > ---
> > > > > >
> > > > > > For sample arguments for/against see [2].  For a summary of some of
> > > > > > the discussion in [1] and a detailed discussion about the different
> > > > > > temporal concepts see [3].  A related straw poll (and eventual
> > > > > > vote) will be sent regarding treatment of instants as potential
> > > > > > Arrow types.
> > > > > >
> > > > > > The vote will be open for at least 72 hours.
> > > > > >
> > > > > > [ ] +1 Update comments in schema.fbs to assert the above
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not make any change
> > > > > >
> > > > > > [1]: https://lists.apache.org/thread.html/r8216e5de3efd2935e3907ad9bd20ce07e430952f84de69b36337e5eb%40%3Cdev.arrow.apache.org%3E
> > > > > > [2]: https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing
> > > > > > [3]: https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > > > >
> > > > >
> > > >
> > >
> >
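
For readers less familiar with the Instant vs LocalDateTime distinction in the proposal quoted above, Python's aware and naive datetimes make a reasonable analogy. This is only an illustration using the standard library, not Arrow code.

```python
from datetime import datetime, timedelta, timezone

# LocalDateTime: wall-clock fields only, with no time zone attached.
local = datetime(2021, 6, 24, 12, 17)

# Instant: a point on the global timeline; the zone only affects display.
instant = datetime(2021, 6, 24, 12, 17, tzinfo=timezone.utc)
tokyo = timezone(timedelta(hours=9))

same_instant_in_tokyo = instant.astimezone(tokyo)
assert same_instant_in_tokyo == instant   # still the same instant
assert same_instant_in_tokyo.hour == 21   # but a different wall clock

# Attaching a zone to a LocalDateTime is an interpretation, and different
# zones turn the same wall-clock fields into different instants.
as_utc = local.replace(tzinfo=timezone.utc)
as_tokyo = local.replace(tzinfo=tokyo)
assert as_utc - as_tokyo == timedelta(hours=9)
```

Reading a zone-less timestamp as an Instant would silently pick one of those interpretations, which is the confusion the proposal aims to remove.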


Re: [C++][Discuss] Switch to C++17

2021-06-11 Thread Jonathan Keane
e on the same xcode version they use
> (or wait for them to eventually upgrade their machines).
> 3. What users can install on their systems. In the enterprise context,
> users don't always get to upgrade R freely, nor can they always install
> newer compilers. I acknowledge that raising this is FUD, but we just don't
> know how significant this is.
> 4. What other R packages require. Because of #3, maintainers of major R
> packages in the ecosystem generally try to support the last 4-5 releases so
> that users who are stuck unable to upgrade R are not left behind. This
> means 3 versions of R (and, given yearly releases, a 3 year lag) beyond
> what CRAN requires. This is not to say that we have to do the same, just
> that if we don't, then that limits the chances that one of those
> maintainers would view arrow as something they can depend on. (That said, I
> don't think there's high likelihood that these packages would take a hard
> dependency on arrow; optional dependency ("Suggests", in R-speak) is more
> likely, regardless of C++ standard, due to other reasons (size, FUD, etc.).)
>
> Neal
>
> [1]:
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-C_002b_002b-code
>
> On Wed, Jun 9, 2021 at 10:26 AM Eduardo Ponce  wrote:
>
> > After the discussion in today's Arrow sync call, I do think it would be
> > beneficial to come up with a formal process for deciding when is a "right
> > time" for upgrading Arrow to a newer C++ standard. I suggest we could
> > consider a set of general metrics/criteria that try to summarize the
> > benefits and drawbacks of such change. Some metrics will be measurable but
> > others will be qualitative. For the latter, we can use a consensus-based
> > scale rating (1-5 with a meaning attached to each value). I am curious what
> > approach other major C++ projects have used to resolve decisions on
> > selecting a C++ standard (aside from critically required
> > features)?
> >
> > The criteria used to evaluate newer C++ standards need to fairly consider
> > people with different roles with regards to the Arrow project, such as
> > developers, contributors, C++ users, other language users (R, Python), and
> > maintainers.
> > Here is a possible (and likely incomplete) set of metrics:
> >
> > Measurable metrics:
> > * code size (source and binary) - measured in bytes
> > * compilation time (consider each major Arrow component)
> > * runtime - what are the performance changes? (consider each major Arrow
> > component)
> > * systems/OS/tools supported and deprecated
> > * ...
> >
> > Qualitative metrics:
> > * code structure/maintainability - how would it improve development?
> > * code readability - ease of understanding details for new/current
> > contributors?
> > * ...
> >
> > I do think this approach will give us a better standpoint for deciding on
> > when to upgrade to a newer C++ standard.
> > Nevertheless, there are complexities for implementing such an approach:
> > * selecting the "correct" metrics
> > * designing the scale rating
> > * How do we get the community to provide their opinion for the qualitative
> > metrics? What is a "good enough" coverage?
> > * How do we summarize the results into a binary decision: upgrade vs not
> > upgrade?
> > * ...
> >
> > In the end, it might not be worthwhile to go through all this work, I am
> > simply expressing an idea.
> >
> > ~Eduardo
> >
> >
> > On Wed, Jun 9, 2021 at 9:40 AM Antoine Pitrou  wrote:
> >
> > > On Tue, 8 Jun 2021 17:37:30 -0500
> > > Jonathan Keane  wrote:
> > > > I've been digging a bit to try and put numbers on those users that Neal
> > > > mentions. Specifically, we know that requiring C++17 will mean that R
> > > > users on windows using versions of R before 4.0.0 will not be able to
> > > > compile/install arrow. Although R version 3.6 is no longer supported
> > > > by CRAN [1], many people hang on to older versions for an extended
> > > > period of time.
> > > >
> > > > We are still working on getting more solid numbers about how many
> > > > people might still be on these old versions, but here is what I have
> > > > so far:
> > > >
> > > > Using Rstudio's cran mirror logs of package installations [2] (and
> > > > with the help of Arrow datasets to process/filter these files) for
> > > > the period from 2021-05-18 [3] to today, for the installations that
> > > > have an r version reported ap

Re: [C++][Discuss] Switch to C++17

2021-06-08 Thread Jonathan Keane
I've been digging a bit to try and put numbers on those users that Neal
mentions. Specifically, we know that requiring C++17 will mean that R
users on windows using versions of R before 4.0.0 will not be able to
compile/install arrow. Although R version 3.6 is no longer supported
by CRAN [1], many people hang on to older versions for an extended
period of time.

We are still working on getting more solid numbers about how many
people might still be on these old versions, but here is what I have
so far:

Using Rstudio's cran mirror logs of package installations [2] (and
with the help of Arrow datasets to process/filter these files) for
the period from 2021-05-18 [3] to today, for the installations that
have an r version reported approximately 27% of the windows package
installs are on versions before 4.0.0 (and therefore would be unable
to install arrow if we require C++17 right now).

There are a number of caveats about this data, however:
* the "that have an r version reported" is very important: only ~17%
of the installations provide an R version. It's possible (and very
likely) that the installations that don't include this information are
not distributed like those that do. This is the biggest problem with
this dataset/analysis and we're trying to see if others have better
information here.
* This is limited to one of many cran repositories. There's no
indication that folks using this repository are more likely to be
using older versions (if anything it is probably the opposite), but we
don't have that information directly.
* There isn't a way to filter out CI and other automated installations
that aren't representative of real-world use cases.
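
The calculation described above can be sketched roughly as follows. The column names (r_version, r_os), the "NA" marker for missing values and the "mingw32" label for Windows are assumptions about the log format, and the sample rows are made up purely to show the arithmetic.

```python
import csv
import io

# Made-up sample rows; real logs would be read from the downloaded files.
SAMPLE_LOG = """r_version,r_os,package
3.6.3,mingw32,dplyr
4.0.2,mingw32,arrow
NA,NA,ggplot2
4.1.0,mingw32,arrow
4.0.5,linux-gnu,arrow
NA,NA,arrow
"""


def version_tuple(text):
    """Turn "3.6.3" into (3, 6, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def summarize(log_text, os_name="mingw32", cutoff="4.0.0"):
    """Share of installs reporting an R version, and share of those on
    os_name that are older than cutoff."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    reported = [r for r in rows if r["r_version"] != "NA"]
    on_os = [r for r in reported if r["r_os"] == os_name]
    old = [r for r in on_os if version_tuple(r["r_version"]) < version_tuple(cutoff)]
    return {
        "share_reporting_version": len(reported) / len(rows) if rows else 0.0,
        "share_below_cutoff": len(old) / len(on_os) if on_os else 0.0,
    }
```

The second number is only meaningful relative to the first: the denominator is installs that reported a version, so a low reporting share (like the ~17% mentioned above) limits how far the result can be generalized.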

If we get a more reliable dataset for this I will update these
numbers. I'm not sure where the threshold is for this impacting too
many people (or whether these numbers are above it), but I wanted to
get this information out here for us to think about. Additionally, it
might be useful to think about how quickly we cut off support for
client languages: if we release on our typical schedule (in July),
people who installed R 1.25 years ago (on windows) would be required
to upgrade R in order to install arrow. That might be long enough, or
the benefits of C++17 might outweigh this, but as Neal mentions, the
people likely to run into this are probably not on this list.


[1] - the last release in the 3.6 line (3.6.3) was released on
2020-02-29, and was superseded by 4.0.0 on 2020-04-24
[2] - http://cran-logs.rstudio.com
[3] - this is the day that R 4.1.0 was released and 3.6.0 stopped
being supported by CRAN

-Jon

On Tue, Jun 8, 2021 at 4:39 PM Neal Richardson
 wrote:
>
> I'm guessing there hasn't been opposition on this thread because the users
> that this might affect aren't following this mailing list.
>
> I'd be interested to see which other major C++ projects out there have
> bumped their requirement to C++17, and how that experience was for
> everyone--the user community as well as the developers. Do you know of good
> examples? I just checked on CRAN today, and of the 17,694 R packages there,
> only 3 require C++17 (none of which have wide adoption) and only 20 require
> C++14.
>
> Neal
>
> On Tue, Jun 8, 2021 at 6:17 AM Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > Note the change in the message topic :-)
> > We now have a draft PR up to switch the C++ standard level to C++17.
> > This allows very nice simplifications in the code, especially the use
> > of elegant constructs that can replace some cumbersome uses of
> > std::enable_if, SFINAE and other pain points.
> >
> > https://github.com/apache/arrow/pull/10414
> >
> > It seems we were finally able to overcome the main platform
> > compatibility (CI) hurdles, though some effort will probably be
> > necessary to squash all regressions in that area.
> >
> > I haven't seen any opposition previously in this thread, so you are
> > really concerned by this, it would be better to speak up quickly, as
> > otherwise we may decide to move forward with the change.
> >
> > Best regards
> >
> > Antoine.
> >
> >
> > On Thu, 27 May 2021 10:03:03 +0200
> > Antoine Pitrou  wrote:
> > > Hello,
> > >
> > > It seems the only two platforms that constrained us to C++11 will not be
> > > supported anymore (those platforms are RTools 3.5 for R packages, and
> > > manylinux1 for Python packages).
> > >
> > > It would be beneficial to bump our C++ requirement to C++14.  There is
> > > an issue open listing benefits:
> > > https://issues.apache.org/jira/browse/ARROW-12816
> > >
> > > An additional benefit is that some useful third-party libraries for us
> > > may or will require C++14, including in their headers.
> > >
> > > Is anyone opposed to doing the switch?  Please speak up.
> > >
> > > Best regards
> > >
> > > Antoine.
> > >
> >
> >
> >
> >


Re: [NIGHTLY] Arrow Build Report for Job nightly-2021-06-06-0

2021-06-07 Thread Jonathan Keane
Yes, I absolutely agree that more triaging, visibility, and info into
these would be massively helpful for tracking some of these down.

The conda-osx-py* builds all seem to be related to the LLVM mismatch in
https://issues.apache.org/jira/browse/ARROW-12738; I've added more
detail on that ticket.

The two r checking builds (valgrind + sanitizer) are both already ticketed at:
https://issues.apache.org/jira/browse/ARROW-12708
https://issues.apache.org/jira/browse/ARROW-12896

-Jon

On Mon, Jun 7, 2021 at 8:41 AM Joris Van den Bossche
 wrote:
>
> The three "test-ubuntu-18.04-cpp" failing builds are due to a Gandiva test
> case failure, for which I opened
> https://issues.apache.org/jira/browse/ARROW-12987 (and Anthony is already
> fixing it).
>
> Looking into the "kartothek" failures (for which I opened
> https://issues.apache.org/jira/browse/ARROW-12988, and
> https://github.com/JDASoftwareGroup/kartothek/issues/475 on the kartothek
> side). This might be due to an incompatibility of kartothek with dask,
> probably not of our concern (I can add a skip for the one failing test in
> our CI).
>
> It would be good if we have an overview for each of the failures if they
> are new / need to be investigated or are waiting on action or an
> up/downstream fix. Because now it is a bit difficult to know where to start.
>
> On Sun, 6 Jun 2021 at 18:28, Neal Richardson 
> wrote:
>
> > Folks, I count 28 failing nightly builds. This is not good. Has moving the
> > nightly build report to a separate mailing list allowed us to ignore the
> > failures more easily?
> >
> > Leaving aside any questions of improving our nightly build monitoring,
> > which I know are ongoing: could you please take a look at the failures,
> > particularly the newer ones (I know there are some persistent ones here
> > that have open JIRA issues already) and see if they can be fixed?
> >
> > Thanks,
> > Neal
> >
> >
> > On Sun, Jun 6, 2021 at 3:17 AM Crossbow  wrote:
> >
> > >
> > > Arrow Build Report for Job nightly-2021-06-06-0
> > >
> > > All tasks:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0
> > >
> > > Failed Tasks:
> > > - centos-8-amd64:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-centos-8-amd64
> > > - centos-8-arm64:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-travis-centos-8-arm64
> > > - conda-osx-clang-py36-r36:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-conda-osx-clang-py36-r36
> > > - conda-osx-clang-py37-r40:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-conda-osx-clang-py37-r40
> > > - conda-osx-clang-py38:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-conda-osx-clang-py38
> > > - conda-osx-clang-py39:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-conda-osx-clang-py39
> > > - debian-bullseye-arm64:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-travis-debian-bullseye-arm64
> > > - debian-buster-arm64:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-travis-debian-buster-arm64
> > > - java-jars:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-java-jars
> > > - test-conda-python-3.7-kartothek-latest:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-test-conda-python-3.7-kartothek-latest
> > > - test-conda-python-3.7-kartothek-master:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-test-conda-python-3.7-kartothek-master
> > > - test-conda-python-3.7-turbodbc-latest:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-test-conda-python-3.7-turbodbc-latest
> > > - test-conda-python-3.7-turbodbc-master:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-test-conda-python-3.7-turbodbc-master
> > > - test-conda-python-3.8-spark-master:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-github-test-conda-python-3.8-spark-master
> > > - test-r-linux-valgrind:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-test-r-linux-valgrind
> > > - test-r-without-arrow:
> > >   URL:
> > >
> > https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-06-06-0-azure-test-r-without-arrow
> > > - test-ubuntu-18.04-cpp-release:
> > >   URL:
> > >
> > 

Re: Moving automated nightly build e-mails to a separate mailing list

2021-05-24 Thread Jonathan Keane
I also very much agree with all of the sentiments above.

One of the things that I'm hoping this new site/dashboard/whatever we
come up with will have is some more information / context around the
failures that hopefully will help make them less overwhelming and have
a higher signal-to-noise ratio. For example, there's
currently an R valgrind build that fails, and it's been (mostly)
isolated + we have a Jira to fix it [1], and that's in progress/in
someone's work queue. Being able to tie that information to the
continued failure will (hopefully) help us both prioritize fixing it
and help us identify which of the failures are new + novel and need
debugging attention.

Reliability over the past 7 or 30 days would be fantastic (I created
[2] to track that), along with information about when the first
(sustained) failure was (which also already has a ticket [3]). Those
two would help a ton!
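
As a rough illustration of the two numbers mentioned above (trailing
reliability and first sustained failure), here is a minimal Python sketch.
It assumes, hypothetically, that each nightly result is available as a
(date, build name, passed) record; crossbow does not necessarily expose its
data in this shape, and the function names are made up for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def trailing_reliability(results, today, window_days=7):
    """Return {build: pass_rate} over the trailing window ending at `today`.

    `results` is an iterable of (run_date, build_name, passed) tuples; this
    record shape is an assumption for illustration, not crossbow's format.
    """
    cutoff = today - timedelta(days=window_days)
    passed = defaultdict(int)
    total = defaultdict(int)
    for run_date, build, ok in results:
        if cutoff < run_date <= today:
            total[build] += 1
            passed[build] += int(ok)
    return {build: passed[build] / total[build] for build in total}


def first_sustained_failure(results, build):
    """Return the date the current unbroken failure streak began, or None."""
    runs = sorted((d, ok) for d, b, ok in results if b == build)
    streak_start = None
    for run_date, ok in runs:
        if ok:
            streak_start = None  # a pass ends any earlier streak
        elif streak_start is None:
            streak_start = run_date  # first failure of a new streak
    return streak_start
```

With something like this, a weekly summary could list each build next to
its 7-day and 30-day pass rate and the date its current streak of failures
started.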

[1] https://issues.apache.org/jira/browse/ARROW-12708
[2] https://issues.apache.org/jira/browse/ARROW-12862
[3] https://issues.apache.org/jira/browse/ARROW-12821

-Jon

On Mon, May 24, 2021 at 3:54 PM Wes McKinney  wrote:
>
> I agree strongly with this sentiment. The idea of e-mailing only when
> there are failed builds was to motivate us to fix the builds. In
> practice this is not happening unfortunately. Having a website
> dashboard showing build health over time along with a ~ weekly e-mail
> to dev@ indicating currently broken builds and the reliability of each
> build over the trailing 7 or 30 days would be useful. Knowing that a
> particular build is only passing 20% of the time would help steer our
> efforts.
>
> On Mon, May 24, 2021 at 2:49 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > I agree that the current nightly build report is
> > noisy and that reporting as HTML on a website would improve
> > readability. But I don't think those are the real problem.
> > The real problem is that we always have failed
> > nightly build jobs. If nightly build jobs failed only
> > occasionally and nightly build report e-mails were sent only
> > when one or more nightly builds failed, the nightly build
> > report would not be noisy.
> >
> > Constant nightly build failures aren't healthy. They may block
> > a new release. Can we work on keeping the nightly builds green? I
> > think that we need to fix failing jobs and remove
> > unmaintained failing jobs to get there.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: Moving automated nightly build e-mails to a separate mailing list" 
> > on Mon, 24 May 2021 11:22:46 -0700,
> >   Wes McKinney  wrote:
> >
> > > OK. bui...@arrow.apache.org is now ready to use, so can someone please
> > > take care of moving the nightly e-mails there and confirm here when
> > > that's completed?
> > >
> > > https://lists.apache.org/list.html?bui...@arrow.apache.org
> > >
> > > On Mon, May 24, 2021 at 10:40 AM Mauricio Vargas
> > >  wrote:
> > >>
> > > >> Yes, I strongly agree with this idea.
> > > >> I was preparing a static site last week that I plan to show on Wednesday.
> > >>
> > >> On Mon, May 24, 2021 at 11:13 AM Krisztián Szűcs 
> > >> 
> > >> wrote:
> > >>
> > >> > We could generate a static website on the crossbow github page
> > >> > including more details about the failures - this would keep crossbow
> > >> > self-contained.
> > >> > I'd suggest still sending the failing builds on the new mailing list
> > >> > (including the static page's link), so we get notified by the
> > >> > failures.
> > >> >
> > >> > On Mon, May 24, 2021 at 4:01 PM Wes McKinney  
> > >> > wrote:
> > >> > >
> > > >> > > Frankly I think we should move these reports to a real website rather
> > > >> > > than sending these emails. The emails to me were always a stopgap to
> > > >> > > add some visibility where there previously was little (you had to go
> > > >> > > digging on crossbow).
> > >> > >
> > >> > > On Mon, May 24, 2021 at 2:01 AM Krisztián Szűcs <
> > >> > szucs.kriszt...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > On Sun, May 23, 2021 at 7:30 PM Wes McKinney 
> > >> > wrote:
> > >> > > > >
> > >> > > > > I just requested builds@ to be created on the self-service 
> > >> > > > > platform.
> > >> > > > Hi,
> > >> > > >
> > >> > > > Having a separate list sounds like a good idea to me.
> > >> > > > Could we enable HTML emails on that list?
> > >> > > > See jira https://issues.apache.org/jira/browse/ARROW-12822
> > >> > > > >
> > >> > > > >
> > >> > > > > On Sun, May 23, 2021 at 10:12 AM Mauricio Vargas
> > >> > > > >  wrote:
> > >> > > > > >
> > >> > > > > > hi
> > >> > > > > >
> > > >> > > > > > I agree with this, and being the person who has been digging
> > > >> > > > > > into nightly errors, I can move this to a weekly email to
> > > >> > > > > > builds@
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Sun, May 23, 2021 at 11:14 AM Wes McKinney 
> > >> > > > > >  > >> > >
> > >> > > > wrote:
> > >> > > > > >
> > >> > > > > > > hi folks,
> > >> > > > > > 

Re: Nightly Builds Repors 2021-05-17

2021-05-18 Thread Jonathan Keane
Thanks for the comments + tickets, Krisztián; all of those sound like
good enhancements to this process.

On the point of:
>>  Error type: Internal
> I find it really useful to categorize the errors, especially if we
> have an error out of our direct reach.
> I can't think of an easy way to automate this though.

It seems that there is a need for some extremely lightweight content
storage linked to these that could provide this kind of info:
something like a JSON or YAML file (or almost any text format that we
are comfortable with) in the crossbow repository where someone can
write information like this. That file can then be parsed and its
content used in the report, so that information about external issues
like this can be kept and surfaced even if we can't automate them.
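
To make the idea concrete, here is a minimal Python sketch. The JSON layout
(keyed by build name, with `error_type`, `ticket`, and `comment` fields) is
purely hypothetical and borrowed from the headings in Mauricio's report; no
such file exists in the crossbow repository as far as this thread says, and
`annotate_failures` is a made-up name.

```python
import json

# Hypothetical annotations file content; in practice this would live in the
# crossbow repository and be edited by hand.
ANNOTATIONS_JSON = """
{
  "azure-test-r-rhub-ubuntu-gcc-release-latest": {
    "error_type": "External",
    "ticket": "ARROW-12795",
    "comment": "Needs a PR to R-Hub to fix bit64 installation."
  }
}
"""


def annotate_failures(failed_builds, annotations_text):
    """Return one report line per failed build, adding notes when present."""
    annotations = json.loads(annotations_text)
    lines = []
    for build in failed_builds:
        note = annotations.get(build)
        if note is None:
            # No human-written context yet: flag it as needing attention.
            lines.append(f"- {build}: (no annotation yet)")
        else:
            lines.append(
                f"- {build}: {note['error_type']}, {note['ticket']}: {note['comment']}"
            )
    return lines
```

Builds without an entry stand out as new failures that nobody has looked at
yet, which is the part that can't easily be automated.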


On Tue, May 18, 2021 at 8:16 AM Krisztián Szűcs
 wrote:
>
> Thanks Mauricio for collecting the issues!
>
> I'm placing a couple of automation ideas inline:
>
> On Tue, May 18, 2021 at 3:32 AM Mauricio Vargas
>  wrote:
> >
> > *NIGHTLY BUILDS REPORT*
> >
> > 2021-05-17
> Could we send this email as a response to the original nightly report
> so we keep them organized in the same thread?
> >
> >
> > *New reported errors*
> >
> >
> > *GitHub*
> I find the report a bit hard to read due to the line breaks and
> asterisks. I think the CI service is not meaningful in the report's
> context.
>
> We could consider sending the nightly build report in both HTML and
> plaintext formats. This is not the preferred format for the dev
> mailing list but we could make an exception for the nightly reports to
> make them less noisy. Created a JIRA to track it:
> https://issues.apache.org/jira/browse/ARROW-12822
> >
> >
> > *Build: *github-test-conda-python-3.8-spark-master
> > 
> The Github statuses/checks API should return context about the build,
> optionally including the build URL (though it depends on the CI
> service integration).
> We could possibly automate this; I created a Jira ticket to track it:
> https://issues.apache.org/jira/browse/ARROW-12819
> >
> > Error type: Internal
> I find it really useful to categorize the errors, especially if we
> have an error out of our direct reach.
> I can't think of an easy way to automate this though.
> >
> > Progress: No work has yet been done on this issue.
> It clearly highlights that someone should take a look at the ticket below.
> >
> > First time issued: 2021-05-13 (4 days ago)
> Again, really useful! This is something we can automate as well,
> though we need to be careful to avoid API rate limiting.
> Created a ticket to track this:
> https://issues.apache.org/jira/browse/ARROW-12821
> >
> > Ticket: ARROW-12817
> We could identify streaks of failures by using a jira field (url to
> the crossbow branch perhaps), though it may not be worth the effort at
> the moment.
> >
> >
> > *Persisting errors*
> We could infer this from the first occurrence of the failure based on
> https://issues.apache.org/jira/browse/ARROW-12821
>
> The additional context you provide for the failures will help a lot to
> maintain the nightly builds, thanks again for collecting it!
> >
> >
> > *Azure*
> >
> >
> > *Build: *azure-test-r-rhub-ubuntu-gcc-release-latest
> > 
> >
> > *Error type:* External
> >
> > *Progress**:* No work has yet been logged on this issue.
> >
> > *First time issued:* 2021-05-14 (3 days ago)
> >
> > *Ticket:* ARROW-12795 
> >
> > *Comment:* *I need to send a PR to R-Hub and fix bit64 installation on the
> > Docker image.*
> >
> >
> > *Build: *azure-test-r-rstudio-r-base-3.6-opensuse42
> > 
> >
> > *Error type:* External
> >
> > *Progress**:* A PR was sent to RStudio, we’ll wait for them to change ICU
> > build in RSPM.
> >
> > *First time issued:* 2021-05-13 (4 days ago)
> >
> > *Ticket:* ARROW-12786 
> >
> > *Comment:* *This error shall persist until **RSPM binaries are changed**.*
> >
> >
> > *Build:* azure-conda-osx-clang-py36-r36
> > 
> >
> > *Error type:* Internal
> >
> > *Progress**:* No work has yet been done on this issue.
> >
> > *First time issued:* 2021-05-13 (4 days ago)
> >
> > *Ticket:* ARROW-12782 
> >
> > *Related errors:** azure-conda-osx-clang-py36-r40*
> >
> >
> > *Build:* azure-test-ubuntu-20.10-docs
> > 

Re: String reverse kernel

2021-05-17 Thread Jonathan Keane
Yeah, piggybacking on what Weston said: the question is where we want to
draw the line: code points, combining character sequences, or graphemes [1].
IME, most people would want/assume that combining characters would stay
combined in reversals (using Weston's example: "tréma" becoming "amért"
rather than "aḿert"), though this specific character "é" has both a
combining version (e + U+0301) and a single code point é, and for many
diacritics from different writing systems there is only the combining version.

But whatever division we choose, documentation + links to explanations are
great.

[1]
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.notion_of_character
there's also discussion at https://unicode.org/reports/tr29/, though the
first link I found much clearer.
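
To illustrate the difference between the first two options, here is a small
Python sketch (plain standard library, not the Arrow kernel; the function
names are made up for the example). It only keeps combining marks attached
to their base character, so it is not full grapheme segmentation as
described in UAX #29.

```python
import unicodedata


def reverse_codepoints(text):
    """Reverse code point by code point (combining marks get detached)."""
    return text[::-1]


def reverse_combining_sequences(text):
    """Reverse while keeping each base character with its combining marks.

    This handles only combining marks (canonical combining class != 0); it
    is not full UAX #29 grapheme segmentation, so ZWJ emoji sequences and
    flags would still be split.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "tre\u0301ma"  # "tréma" spelled with e + U+0301 COMBINING ACUTE ACCENT
```

Reversing `word` by code point moves the accent onto the "m", while the
second function keeps it on the "e".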

On Mon, May 17, 2021 at 10:46 AM Weston Pace  wrote:

> FWIW, combining marks were not actually added to support emojis.  Emojis
> are just one of the more popular uses of the feature.  Combining marks is a
> standard Unicode feature necessary to represent single “characters” in some
> complex situations (e.g. when it is necessary to distinguish between tréma
> and umlaut, or to represent certain characters in Navajo).
>
> That being said I agree with the conclusions.  It’s ok to leave out for now
> and no need to link to any docs.
>
> On Mon, May 17, 2021 at 5:31 AM Antoine Pitrou  wrote:
>
> >
> > I'm fine with pointing out that the function operates on codepoints.
> >
> > Linking to the Unicode documentation for emojis sounds entirely like a
> > distraction, though.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 17/05/2021 à 17:28, Ian Cook a écrit :
> > > +1 for clarifying this in the kernel documentation, referring to these
> > > multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > >
> > > Ian
> > >
> > >
> > > On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>
> > >> Le 17/05/2021 à 17:17, David Li a écrit :
> > >>> A little clarification on my point: it's not that a single codepoint
> > >>> gets encoded with more than four bytes, it's that a grapheme
> > >>> cluster/human-delimited 'character' might be multiple codepoints, so
> > >>> reversing the individual codepoints may produce an unexpected
> > >>> result. For instance a flag emoji is actually two codepoints (two
> > >>> special 'letter' codepoints that represent the country code), so
> > >>> reversing a US flag naively will give you an odd '[SU]' instead.
> > >>
> > >> This sounds like saying that reversing a valid French word does not
> > >> produce a valid French word (well, in most cases). The kernel
> > >> documentation can't contain an entire tutorial about Unicode
> characters
> > >> and what to expect from them, IMHO.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> >
>


Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-22 Thread Jonathan Keane
+1 (non-binding)

Verified wheels, sources, and binaries on macOS 11.2 using the verification
script (except for Java Integration, Glib, and Ruby). Like Antoine I ran
into the same issue with Ruby.

I also installed Arrow and the R package locally + ran some ad hoc tests
using some of our benchmarks.

I've also confirmed the RC (1 and 3) work using the R distribution setup we
have [1]. I will update those to point to the release when it is ready.

[1]:  https://github.com/r-windows/rtools-packages/pull/197 (and other
linked PRs)

On Thu, Apr 22, 2021 at 9:23 AM David Li  wrote:

> +1 (non-binding)
>
> Verified wheels, sources, and apt binaries on Ubuntu 18.04.
>
> Best,
> David
>
> On 2021/04/21 21:30:33, Krisztián Szűcs  wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC3) of Apache
> > Arrow version 4.0.0. This is a release consisting of 719
> > resolved JIRA issues[1].
> >
> > This release candidate is based on commit:
> > f959141ece4d660bce5f7fa545befc0116a7db79 [2]
> >
> > The source release rc3 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 4.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 4.0.0 because...
> >
> > [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%204.0.0
> > [2]:
> https://github.com/apache/arrow/tree/f959141ece4d660bce5f7fa545befc0116a7db79
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-4.0.0-rc3
> > [4]: https://bintray.com/apache/arrow/centos-rc/4.0.0-rc3
> > [5]: https://bintray.com/apache/arrow/debian-rc/4.0.0-rc3
> > [6]: https://bintray.com/apache/arrow/python-rc/4.0.0-rc3
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/4.0.0-rc3
> > [8]:
> https://github.com/apache/arrow/blob/f959141ece4d660bce5f7fa545befc0116a7db79/CHANGELOG.md
> > [9]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
>


Re: [VOTE] Release Apache Arrow 4.0.0 - RC1

2021-04-20 Thread Jonathan Keane
I've also got a PR up for https://issues.apache.org/jira/browse/ARROW-12485
which sets the mimalloc default we thought we had already set.

On Tue, Apr 20, 2021 at 7:41 PM Weston Pace  wrote:

> Hmm, I wasn't actually running any script, just trying to access
> things manually.  When I ran the script I got errors because dnf
> wasn't installed and then when I installed dnf I find that dnf does
> not like the AWS mirrors...
>
> Repository u'amzn2-core': Error parsing config: Error parsing
> "mirrorlist =
> u'$awsproto://$amazonlinux.$awsregion.$awsdomain/2/$product/$target/x86_64/mirror.list'":
> URL must be http, ftp, file or https not ""
> Repository u'amzn2-core-source': Error parsing config: Error parsing
> "mirrorlist =
> u'$awsproto://$amazonlinux.$awsregion.$awsdomain/2/$product/$target/SRPMS/mirror.list'":
> URL must be http, ftp, file or https not ""
> Repository u'amzn2-core-debuginfo': Error parsing config: Error
> parsing "mirrorlist =
>
> u'$awsproto://$amazonlinux.$awsregion.$awsdomain/2/$product/$target/debuginfo/x86_64/mirror.list'":
> URL must be http, ftp, file or https not ""
>
> From reading the script it looks like...
>
> https://apache.jfrog.io/artifactory/arrow/centos-rc/7/x86_64/
>
> ...should be roughly equivalent to
>
>
> https://bintray.com/apache/arrow/centos-rc/3.0.0-rc2#files/centos-rc/7/x86_64
>
> However, the jfrog URL is missing the `repodata` folder.  I was able
> to fix the URL in the RPM to point at centos-rc but then I still get
> the error...
>
> (base) [ec2-user@ip-172-31-21-225 arrow]$ sudo yum update
> Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
> amzn2-core
>
>   | 3.7 kB  00:00:00
> amzn2extra-docker
>
>   | 3.0 kB  00:00:00
> amzn2extra-epel
>
>   | 3.0 kB  00:00:00
>
> https://apache.jfrog.io/artifactory/arrow/centos-rc/7/x86_64/repodata/repomd.xml
> :
> [Errno 14] HTTPS Error 404 - Not Found
> Trying other mirror.
> epel/x86_64/metalink
>
>   | 4.3 kB  00:00:00
> 214 packages excluded due to repository priority protections
> No packages marked for update
>
> On Tue, Apr 20, 2021 at 2:08 PM Sutou Kouhei  wrote:
> >
> > Hi Weston,
> >
> > It seems that you use old verification script. Could you
> > confirm that you use the verification script on master?
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [VOTE] Release Apache Arrow 4.0.0 - RC1" on Tue, 20 Apr 2021
> 11:23:25 -1000,
> >   Weston Pace  wrote:
> >
> > > I'm not sure if it is blocking (and it might even be expected given
> > > the current status of jfrog) but I attempted to install the CentOS 7
> > > RPM and got the following error when I ran `sudo yum update` after
> > > installing the arrow repo rpm.
> > >
> > >
> https://apache.jfrog.io/artifactory/arrow/centos/7/x86_64/repodata/repomd.xml
> :
> > > [Errno 14] HTTPS Error 404 - Not Found
> > >
> > > On Tue, Apr 20, 2021 at 8:49 AM Jonathan Keane 
> wrote:
> > >>
> > >> I'm still working on my verification, but as part of that noticed that
> > >> https://issues.apache.org/jira/browse/ARROW-12316 which we thought
> changed
> > >> the default memory allocator didn't fully accomplish that. Nothing is
> > >> broken per se, but jemalloc is still the default on macOS. I've made
> > >> https://issues.apache.org/jira/browse/ARROW-12485 as a follow on if
> there
> > >> is a need for another RC, that should definitely go in it.
> > >>
> > >> On Tue, Apr 20, 2021 at 2:44 AM Yibo Cai  wrote:
> > >>
> > >> > 'gandiva-decimal-test' hangs on my machine, not sure if it's a
> blocker
> > >> > issue.
> > >> > Details at https://issues.apache.org/jira/browse/ARROW-12476
> > >> > Test command "TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1
> > >> > dev/release/verify-release-candidate.sh source 4.0.0 1"
> > >> >
> > >> > On 4/19/21 10:50 PM, Krisztián Szűcs wrote:
> > >> > > Hi,
> > >> > >
> > >> > > I would like to propose the following release candidate (RC1) of
> Apache
> > >> > > Arrow version 4.0.0. This is a release consisting of 703
> > >> > > resolved JIRA issues[1].
> > >> > >
> > >> > > This re

Re: [VOTE] Release Apache Arrow 4.0.0 - RC1

2021-04-20 Thread Jonathan Keane
I'm still working on my verification, but as part of that noticed that
https://issues.apache.org/jira/browse/ARROW-12316 which we thought changed
the default memory allocator didn't fully accomplish that. Nothing is
broken per se, but jemalloc is still the default on macOS. I've made
https://issues.apache.org/jira/browse/ARROW-12485 as a follow-on; if there
is a need for another RC, that should definitely go in it.

On Tue, Apr 20, 2021 at 2:44 AM Yibo Cai  wrote:

> 'gandiva-decimal-test' hangs on my machine, not sure if it's a blocker
> issue.
> Details at https://issues.apache.org/jira/browse/ARROW-12476
> Test command "TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1
> dev/release/verify-release-candidate.sh source 4.0.0 1"
>
> On 4/19/21 10:50 PM, Krisztián Szűcs wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of Apache
> > Arrow version 4.0.0. This is a release consisting of 703
> > resolved JIRA issues[1].
> >
> > This release candidate is based on commit:
> > 9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e [2]
> >
> > The source release rc1 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 4.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 4.0.0 because...
> >
> > [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%204.0.0
> > [2]:
> https://github.com/apache/arrow/tree/9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-4.0.0-rc1
> > [4]: https://bintray.com/apache/arrow/centos-rc/4.0.0-rc1
> > [5]: https://bintray.com/apache/arrow/debian-rc/4.0.0-rc1
> > [6]: https://bintray.com/apache/arrow/python-rc/4.0.0-rc1
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/4.0.0-rc1
> > [8]:
> https://github.com/apache/arrow/blob/9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e/CHANGELOG.md
> > [9]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
>


Re: Setting Affects Version in Arrow Jira bug issues

2021-04-07 Thread Jonathan Keane
I think this proposal is great and will help a lot when scanning
through Jira issues.

I wonder if it's possible to automate this? I'm thinking something
along the lines of: If it's a Type = Bug, could have a yes/no or
checkbox where we ask "is this a bug reproducible in the most recent
arrow release?" and then fill in the Affects Version automatically if
the answer to that is yes. I'm certain that is possible in Jira,
though we might not be able to customize the Apache instance to do
this just for us.
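
The rule itself is simple enough to write down. Here is a tiny Python sketch
of the decision, independent of any Jira API (the function and parameter
names are hypothetical):

```python
def affects_version(issue_type, reproducible_in_latest_release, latest_release):
    """Return the Affects Version value to set, or None to leave it empty."""
    if issue_type == "Bug" and reproducible_in_latest_release:
        return latest_release
    return None
```

Whether the Apache Jira instance allows a custom field or automation rule to
drive this is a separate question.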

Jon

On Wed, Apr 7, 2021 at 1:16 PM Ian Cook  wrote:
>
> Hi all,
>
> In discussion with Wes, Neal, and Krisztián, we had the following idea
> for how to make better use of the Affects Version field when creating
> issues with Type = Bug in Jira:
>
> (1) When reporting a bug, set Affects Version to the most recent
> released version of Arrow if and only if the bug is reproducible in
> that released version of Arrow.
>
> (2) When reporting a bug in code that was added/modified since the
> most recent Arrow release, leave the Affects Version field empty.
>
> Adherence to this convention will help Arrow contributors to more
> efficiently diagnose and reproduce bugs. Some community members are
> already using the Affects Version field like this, and we hope that
> others who are creating and grooming Jira issues will follow suit.
>
> Feedback is welcome.
>
> Thanks,
> Ian


Re: Arrow sync call March 31 at 12:00 US/Eastern, 16:00 UTC

2021-03-31 Thread Jonathan Keane
Thank you everyone who attended, here are the notes.

Attendees:

Jonathan Keane

Colin Alworth

David Sanders

Micah Kornfield

Rok Mihevc

Projjal Chanda

Eduardo Ponce

Kirill Lykov


Discussion:

   - 4.0 release
      - zstd compression for the java library (has a PR that is approved
        but still needs to be merged)
      - One issue with parquet that might be good to get resolved before
        4.0 (https://issues.apache.org/jira/browse/ARROW-11629)
   - Formalize change for minor PRs
      - Will be merged Friday if there are no objections
   - Regex kernel - is someone working on this? (yes:
     https://issues.apache.org/jira/browse/ARROW-12134)
   - Discussion of jira search and how to locate where work is planned

On Wed, Mar 31, 2021 at 11:11 AM Antoine Pitrou  wrote:

>
> I'm fine with Zoom.  But doesn't it need a host as well?
>
>
> Le 31/03/2021 à 18:09, Wes McKinney a écrit :
> > The Google Meet link is on dremio.com, so there must not be someone
> > from the org to let people in. What do folks think about moving to
> > Zoom for future meetings (which shouldn't have this problem)?
> >
> > On Wed, Mar 31, 2021 at 11:07 AM Jonathan Keane 
> wrote:
> >>
> >> I'm experiencing the same here.
> >>
> >> On Wed, Mar 31, 2021 at 11:06 AM Kirill Lykov 
> >> wrote:
> >>
> >>> Hi,
> >>>
> > >>> I don't know about the others but I cannot join because someone needs
> > >>> to let me in.
> > >>> Might this be a problem for other people as well?
> >>>
> >>> On Tue, Mar 30, 2021 at 5:53 PM Neal Richardson <
> >>> neal.p.richard...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi all,
> >>>> Our biweekly call is coming up tomorrow at
> >>>> https://meet.google.com/vtm-teks-phx. All are welcome to join. I
> won't
> >>> be
> >>>> able to attend this week, but hopefully someone else will share notes
> >>> with
> >>>> the mailing list afterward.
> >>>>
> >>>> Neal
> >>>>
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>> Kirill Lykov
> >>>
>


Re: Arrow sync call March 31 at 12:00 US/Eastern, 16:00 UTC

2021-03-31 Thread Jonathan Keane
I'm experiencing the same here.

On Wed, Mar 31, 2021 at 11:06 AM Kirill Lykov 
wrote:

> Hi,
>
> I don't know about the others but I cannot join because someone needs to
> let me in.
> Might this be a problem for other people as well?
>
> On Tue, Mar 30, 2021 at 5:53 PM Neal Richardson <
> neal.p.richard...@gmail.com>
> wrote:
>
> > Hi all,
> > Our biweekly call is coming up tomorrow at
> > https://meet.google.com/vtm-teks-phx. All are welcome to join. I won't
> be
> > able to attend this week, but hopefully someone else will share notes
> with
> > the mailing list afterward.
> >
> > Neal
> >
>
>
> --
> Best regards,
> Kirill Lykov
>


[jira] [Created] (ARROW-8734) [R] Compilation error on macOS

2020-05-07 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-8734:
-

 Summary: [R] Compilation error on macOS
 Key: ARROW-8734
 URL: https://issues.apache.org/jira/browse/ARROW-8734
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane


I've tried to install / build from source (both from a git checkout and using
the built-in `install_arrow()`), and when compiling I'm reliably getting the
following error during the auto brew process:

{code:bash}
 x System command 'R' failed, exit status: 1, stdout + stderr:
E> * checking for file ‘/Users/jkeane/Dropbox/arrow/r/DESCRIPTION’ ... OK
E> * preparing ‘arrow’:
E> * checking DESCRIPTION meta-information ... OK
E> * cleaning src
E> * running ‘cleanup’
E> * installing the package to build vignettes
E>   ---
E> * installing *source* package ‘arrow’ ...
E> ** using staged installation
E> *** Generating code with data-raw/codegen.R
E> There were 27 warnings (use warnings() to see them)
E> *** > 375 functions decorated with [[arrow|s3::export]]
E> *** > generated file `src/arrowExports.cpp`
E> *** > generated file `R/arrowExports.R`
E> *** Downloading apache-arrow
E>  Using local manifest for apache-arrow
E> Thu May  7 13:13:42 CDT 2020: Auto-brewing apache-arrow in 
/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T//build-apache-arrow...
E> ==> Tapping autobrew/core from https://github.com/autobrew/homebrew-core
E> Tapped 2 commands and 4639 formulae (4,888 files, 12.7MB).
E> lz4
E> openssl
E> thrift
E> snappy
E> ==> Downloading 
https://homebrew.bintray.com/bottles/lz4-1.8.3.mojave.bottle.tar.gz
E> Already downloaded: 
/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/b4158ef68d619dbf78935df6a42a70b8339a65bc8876cbb4446355ccd40fa5de--lz4-1.8.3.mojave.bottle.tar.gz
E> ==> Pouring lz4-1.8.3.mojave.bottle.tar.gz
E> ==> Skipping post_install step for autobrew...
E>   
/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/lz4/1.8.3:
 22 files, 512.7KB
E> ==> Downloading 
https://homebrew.bintray.com/bottles/openssl-1.0.2p.mojave.bottle.tar.gz
E> Already downloaded: 
/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/fbb493745981c8b26c0fab115c76c2a70142bfde9e776c450277e9dfbbba0bb2--openssl-1.0.2p.mojave.bottle.tar.gz
E> ==> Pouring openssl-1.0.2p.mojave.bottle.tar.gz
E> ==> Skipping post_install step for autobrew...
E> ==> Caveats
E> openssl is keg-only, which means it was not symlinked into 
/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow,
E> because Apple has deprecated use of OpenSSL in favor of its own TLS and 
crypto libraries.
E> 
E> If you need to have openssl first in your PATH run:
E>   echo 'export 
PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/bin:$PATH"'
 >> ~/.zshrc
E> 
E> For compilers to find openssl you may need to set:
E>   export 
LDFLAGS="-L/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib"
E>   export 
CPPFLAGS="-I/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/include"
E> 
E> For pkg-config to find openssl you may need to set:
E>   export 
PKG_CONFIG_PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib/pkgconfig"
E> 
E> ==> Summary
E>   
/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/openssl/1.0.2p:
 1,793 files, 12MB
E> ==> Downloading 
https://homebrew.bintray.com/bottles/thrift-0.11.0.mojave.bottle.tar.gz
E> Already downloaded: 
/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/7e05ea11a9f7f924dd7f8f36252ec73a24958b7f214f71e3752a355e75e589bd--thrift-0.11.0.mojave.bottle.tar.gz
E> ==> Pouring thrift-0.11.0.mojave.bottle.tar.gz
E> ==> Skipping post_install step for autobrew...
E> ==> Caveats
E> To install Ruby binding:
E>   gem install thrift
E> ==> Summary
E>   
/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/thrift/0.11.0:
 102 files, 7MB
E> ==> Downloading 
https://homebrew.bintray.com/bottles/snappy-1.1.7_1.mojave.bottle.tar.gz
E> Already downloaded: 
/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/1f09938804055499d1dd951b13b26d80c56eae359aa051284bf4f51d109a9f73--snappy-1.1.7_1.mojave.bottle.tar.gz
E> ==> Pouring snappy-1.1.7_1.mojave.bottle.tar.gz
E> ==> Skipping post_install step for autobrew...
E>   
/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/snappy/1.1.7_1:
 18 files, 115.8KB
E> ==> Downloading 
https://au

[jira] [Created] (ARROW-8726) segfault with a mis-specified partition

2020-05-06 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-8726:
-

 Summary: segfault with a mis-specified partition
 Key: ARROW-8726
 URL: https://issues.apache.org/jira/browse/ARROW-8726
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane


Calling filter + collect on a dataset with a mis-specified partitioning causes 
a segfault. Though this is clearly input error, it would be nice if there was 
some guidance that something was wrong with the partitioning.

{code:r}
library(arrow)
library(dplyr)

dir.create("multi_mtcars/one", recursive = TRUE)
dir.create("multi_mtcars/two", recursive = TRUE)
write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")

ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))

# the following will segfault
ds %>%
  filter(cyl > 8) %>% 
  collect()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)