[NIGHTLY] Arrow Build Report for Job nightly-2021-05-17-0

2021-05-17 Thread Crossbow


Arrow Build Report for Job nightly-2021-05-17-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0

Failed Tasks:
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-osx-clang-py39
- conda-win-vs2017-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-win-vs2017-py38
- conda-win-vs2017-py39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-win-vs2017-py39
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-spark-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-conda-python-3.8-spark-master
- test-r-devdocs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-r-devdocs
- test-r-linux-valgrind:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-test-r-linux-valgrind
- test-r-rhub-ubuntu-gcc-release-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-test-r-rhub-ubuntu-gcc-release-latest
- test-r-rstudio-r-base-3.6-opensuse42:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-test-r-rstudio-r-base-3.6-opensuse42
- test-r-versions:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-test-r-versions
- test-ubuntu-20.10-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-test-ubuntu-20.10-docs

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-github-centos-8-amd64
- centos-8-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-travis-centos-8-arm64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-clean
- conda-linux-gcc-py36-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py36-arm64
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py37-arm64
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py38-arm64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py39-arm64
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-17-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gc

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Krisztián Szűcs
On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
 wrote:
>
> Hi,
>
> I have started collecting commits to the maint branch [1]. The exact
> commands I used:
>
> git clone g...@github.com:apache/arrow.git
> cd arrow/dev
> python3 -m venv venv
> source venv/bin/activate
> pip install -e archery
> pip install GitPython jira semver jinja2
> archery release cherry-pick 4.0.1
> # ran the commands it printed one by one
>
> There is a commit that does not apply cleanly. Could someone from C++ merge
> it? What to do:
>
> Run `git fetch upstream && git checkout maint-4.0.x && git cherry-pick
> ce2861713472818eea264957de4cc83d5a2c567c`
>
> This will trigger a merge conflict. Resolve and push to maint-4.0.x on
> apache/arrow.
Hi,

I've recreated the maintenance branch and resolved the conflicts.
According to the release curation script [1], we have 4 issues without
available patches:
- https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
submitted a PR)
- https://issues.apache.org/jira/browse/ARROW-12619
- https://issues.apache.org/jira/browse/ARROW-12604
- https://issues.apache.org/jira/browse/ARROW-12603

[1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
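
For readers following along, the maintenance-branch workflow described in this thread (check out maint-4.0.x, then cherry-pick the commits tagged for 4.0.1 in the order they were merged to master) can be sketched in Python as below. This only illustrates the ordering step and is not what `archery release cherry-pick` does internally; the function name, the second SHA and both timestamps are made up for the example.

```python
from datetime import datetime


def cherry_pick_plan(commits, branch="maint-4.0.x"):
    """Return git commands that replay `commits` onto `branch`.

    `commits` is a list of (sha, merged_at) pairs. Commits are replayed in
    the order they were merged to master, oldest first.
    """
    ordered = sorted(commits, key=lambda pair: pair[1])
    commands = ["git fetch upstream", f"git checkout {branch}"]
    commands += [f"git cherry-pick {sha}" for sha, _ in ordered]
    return commands


# Hypothetical input: the first SHA is the one quoted in this thread; the
# second SHA and both timestamps are invented for illustration.
example = [
    ("ce2861713472818eea264957de4cc83d5a2c567c", datetime(2021, 5, 10)),
    ("0000000000000000000000000000000000000000", datetime(2021, 5, 3)),
]
plan = cherry_pick_plan(example)
for command in plan:
    print(command)
```

A commit that does not apply cleanly still has to be resolved by hand; the plan only fixes the order in which the picks are attempted.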
>
> Thanks,
> Jorge
>
> [1] https://github.com/apache/arrow/tree/maint-4.0.x
>
>
>
> On Sat, May 15, 2021 at 1:23 AM Neal Richardson 
> wrote:
>
> > Thanks for taking this on. Krisztián can confirm the details (or point you
> > to where this is documented), but based on past patch releases, I believe
> > you would make a `maint-4.0.x` branch off of the existing `release-4.0.0`
> > branch, cherry-pick the commits associated with the JIRAs tagged for 4.0.1
> > (I believe there are utility scripts to help with this), and run the
> > release script that bumps the versions.
> >
> > Neal
> >
> >
> > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Just to make sure: the goal is to cherry-pick all changes targeted for
> > > 4.0.1 into a branch and release from there? If that is the case, then I
> > > will create a branch and start cherry-picking the changes in order they
> > > were merged in master.
> > >
> > > I see 5 issues on the list still open. I subscribed to them and will be
> > > cherry-picking them as they get merged.
> > >
> > > On the Rust side, we can either bump to 4.1.0 or cherry-pick for a 4.0.1. I
> > > suggest 4.0.1 to keep parity as we recently agreed, but let me know if
> > > others disagree.
> > >
> > > [1] https://github.com/apache/arrow-rs/pull/289
> > >
> > >
> > >
> > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > neal.p.richard...@gmail.com>
> > > wrote:
> > >
> > > > Thanks, Jorge!
> > > >
> > > > > If anyone else has bugfixes that they'd like included in a patch
> > > > > release, please tag them with the 4.0.1 Fix Version. Perhaps we can do
> > > > > a roundup and start a vote early next week?
> > > >
> > > > Neal
> > > >
> > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > Addressing these accumulated issues in a patch release sounds like a
> > > > > good idea to me.
> > > > >
> > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > >  wrote:
> > > > > >
> > > > > > I agree. Segfaults are not nice.
> > > > > >
> > > > > > I can take it. I would possibly need some guidance.
> > > > > >
> > > > > > Best,
> > > > > > Jorge
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, May 13, 2021 at 12:52 AM Neal Richardson <
> > > > > > neal.p.richard...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > > As discussed at the biweekly sync call, I wanted to gauge interest in
> > > > > > > > doing a 4.0.1 patch release.
> > > > > > > >
> > > > > > > > There currently are 14 issues in JIRA tagged with 4.0.1 [1]. There are 3
> > > > > > > > segfaults, including one that a cuDF maintainer raised yesterday [2] in
> > > > > > > > requesting a patch release.
> > > > > > > >
> > > > > > > > I don't want to bias the discussion by giving my opinion (yet). I will
> > > > > > > > say that the question is whether someone (or multiple people) wants to
> > > > > > > > step up and drive a release--if releases were costless, this would be
> > > > > > > > much different. We did decide to allow for a simpler patch release
> > > > > > > > process (source vote only, not on binary artifacts), so this could be a
> > > > > > > > test for whether that does simplify matters and/or lets us better
> > > > > > > > distribute the work of producing binary artifacts.
> > > > > > > >
> > > > > > > > Any thoughts--especially from those who could/would be release manager?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Neal
> > > > > > >
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20fixVersion%20%3D%204.0.1
> > > > > >

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread David Li
I'll provide a backport for ARROW-12603 - it's a duplicate of another
issue but the change there would pull in a lot of unrelated changes.

Best,
David

On 2021/05/17 11:42:11, Krisztián Szűcs  wrote: 
> On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > I have started collecting commits to the maint branch [1]. The exact
> > commands I used:
> >
> > git clone g...@github.com:apache/arrow.git
> > cd arrow/dev
> > python3 -m venv venv
> > source venv/bin/activate
> > pip install -e archery
> > pip install GitPython jira semver jinja2
> > archery release cherry-pick 4.0.1
> > # ran the commands it printed one by one
> >
> > There is a commit that does not apply cleanly. Could someone from C++ merge
> > it? What to do:
> >
> > Run `git fetch upstream && git checkout maint-4.0.x && git cherry-pick
> > ce2861713472818eea264957de4cc83d5a2c567c`
> >
> > This will trigger a merge conflict. Resolve and push to maint-4.0.x on
> > apache/arrow.
> Hi,
> 
> I've recreated the maintenance branch and resolved the conflicts.
> According to the release curation script [1], we have 4 issues without
> available patches:
> - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> submitted a PR)
> - https://issues.apache.org/jira/browse/ARROW-12619
> - https://issues.apache.org/jira/browse/ARROW-12604
> - https://issues.apache.org/jira/browse/ARROW-12603
> 
> [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> >
> > Thanks,
> > Jorge
> >
> > [1] https://github.com/apache/arrow/tree/maint-4.0.x
> >
> >
> >
> > On Sat, May 15, 2021 at 1:23 AM Neal Richardson 
> > 
> > wrote:
> >
> > > Thanks for taking this on. Krisztián can confirm the details (or point you
> > > to where this is documented), but based on past patch releases, I believe
> > > you would make a `maint-4.0.x` branch off of the existing `release-4.0.0`
> > > branch, cherry-pick the commits associated with the JIRAs tagged for 4.0.1
> > > (I believe there are utility scripts to help with this), and run the
> > > release script that bumps the versions.
> > >
> > > Neal
> > >
> > >
> > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > > > Just to make sure: the goal is to cherry-pick all changes targeted for
> > > > 4.0.1 into a branch and release from there? If that is the case, then I
> > > > will create a branch and start cherry-picking the changes in order they
> > > > were merged in master.
> > > >
> > > > I see 5 issues on the list still open. I subscribed to them and will be
> > > > cherry-picking them as they get merged.
> > > >
> > > > On the Rust side, we can either bump to 4.1.0 or cherry-pick for a 4.0.1. I
> > > > suggest 4.0.1 to keep parity as we recently agreed, but let me know if
> > > > others disagree.
> > > >
> > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > >
> > > >
> > > >
> > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > neal.p.richard...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks, Jorge!
> > > > >
> > > > > > If anyone else has bugfixes that they'd like included in a patch
> > > > > > release, please tag them with the 4.0.1 Fix Version. Perhaps we can do
> > > > > > a roundup and start a vote early next week?
> > > > >
> > > > > Neal
> > > > >
> > > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Addressing these accumulated issues in a patch release sounds like a
> > > > > > good idea to me.
> > > > > >
> > > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > > >  wrote:
> > > > > > >
> > > > > > > I agree. Segfaults are not nice.
> > > > > > >
> > > > > > > I can take it. I would possibly need some guidance.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jorge
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 12:52 AM Neal Richardson <
> > > > > > > neal.p.richard...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > > As discussed at the biweekly sync call, I wanted to gauge interest in
> > > > > > > > > doing a 4.0.1 patch release.
> > > > > > > > >
> > > > > > > > > There currently are 14 issues in JIRA tagged with 4.0.1 [1]. There are 3
> > > > > > > > > segfaults, including one that a cuDF maintainer raised yesterday [2] in
> > > > > > > > > requesting a patch release.
> > > > > > > > >
> > > > > > > > > I don't want to bias the discussion by giving my opinion (yet). I will
> > > > > > > > > say that the question is whether someone (or multiple people) wants to
> > > > > > > > > step up and drive a release--if releases were costless, this would be
> > > > > > > > > much different. We did decide to allow for a simpler patch release
> > > > > > > > > process (source vote only, not on binary artifacts), so this could be a
> > > > > > > > > test for
> > > 

Re: Long title on github page

2021-05-17 Thread Wes McKinney
It's probably best for the description to limit mentions of specific
features. There are some high-level features mentioned in the
description now ("computational libraries and zero-copy streaming
messaging and interprocess communication"), but in 2021, now that the
project has grown so much, they could leave people with a limited view
of what they might find here.

On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
 wrote:
>
> How about
> 'Apache Arrow is a cross-language development platform for in-memory data.
> It enables systems to process and transport data efficiently, providing a
> simple and fast library for partitioning of large tables'?
>
> Sorry for the delay, long election day
>
> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind 
> wrote:
>
> > Suggestion: faster -> more efficiently
> >
> > "Apache Arrow is a cross-language development platform for in-memory
> > data. It enables systems to process and transport data more efficiently."
> >
> > On Sun, May 16, 2021 at 11:35 AM Wes McKinney  wrote:
> >
> > > Here's what's there now:
> > >
> > > "Apache Arrow is a cross-language development platform for in-memory
> > > data. It specifies a standardized language-independent columnar memory
> > > format for flat and hierarchical data, organized for efficient
> > > analytic operations on modern hardware. It also provides computational
> > > libraries and zero-copy streaming messaging and interprocess
> > > communication…"
> > >
> > > How about something shorter like
> > >
> > > "Apache Arrow is a cross-language development platform for in-memory
> > > data. It enables systems to process and transport data faster."
> > >
> > > Suggestions / refinements from others welcome
> > >
> > >
> > > On Sat, May 15, 2021 at 9:12 PM Dominik Moritz  wrote:
> > > >
> > > > > Super minor issue but could someone make the description on GitHub
> > > > > shorter?
> > > > >
> > > > > GitHub puts the description into the title of the page and makes it
> > > > > hard to find it in URL autocomplete.
> > > >
> > >
> >
> >
> > --
> >


[DISCUSS] Parquet/Arrow/Flight as distributed persistence service

2021-05-17 Thread Gary Pennington
Hi,

(NB: I first floated this question in the arrow-rust slack channel and Jorge 
Leitao suggested I should ask here.)

I’m starting a project to provide functionality based on 
Parquet/Arrow/Flight, implemented in Rust. The primary goal of the project is 
to provide a mechanism for storing/retrieving large quantities of 
column-oriented data across different types of storage (S3, filesystem, 
etc.). Initially, at least, the Flight/Arrow/Parquet stack looks to be a great 
fit for what I’m doing.

I’ve done some prototyping and so far I’ve made good progress. I have a simple 
flight service (written in rust: arrow 4.0.0 stack) which is happy to 
send/receive data to/from a very simple flight client (written in python).

I’ve encountered a few rough edges, and before proceeding further I thought I’d 
see what other people think of the idea of using Flight/Arrow to provide a 
persistence service (Parquet) for large quantities of column-oriented data.

One of my questions is about the use of Flight. Flight seems to be primarily 
oriented around streams of data (which is cool), but has anyone else considered 
using it as the basis for a distributed storage framework? do_get would read 
Parquet data and send it as Arrow (read_parquet/send_arrow), and do_put would 
receive Arrow data and write it as Parquet (receive_arrow/write_parquet). Or 
should persistence perhaps be a separate, new action?
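
To make the put/get mapping above concrete, here is a minimal standard-library sketch of a persistence service with a pluggable storage backend. It only models the shape of the idea and does not use the Flight API: `StorageBackend`, `FilesystemBackend` and `PersistenceService` are hypothetical names, and raw bytes stand in for the Arrow record batches and Parquet files a real service would handle.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical storage abstraction (filesystem, S3, ...)."""

    @abstractmethod
    def write(self, key: str, payload: bytes) -> None: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...


class FilesystemBackend(StorageBackend):
    """Stores each dataset as one file under a root directory."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, key: str) -> str:
        # Simplification: flatten the key into a single file name.
        return os.path.join(self.root, key.replace("/", "_") + ".parquet")

    def write(self, key: str, payload: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(payload)

    def read(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()


class PersistenceService:
    """Models the do_put/do_get mapping; this is not the Flight API itself."""

    def __init__(self, backend: StorageBackend):
        self.backend = backend

    def do_put(self, descriptor: str, payload: bytes) -> None:
        # A real service would receive Arrow batches and encode them as Parquet.
        self.backend.write(descriptor, payload)

    def do_get(self, ticket: str) -> bytes:
        # A real service would read Parquet and stream Arrow batches back.
        return self.backend.read(ticket)


with tempfile.TemporaryDirectory() as root:
    service = PersistenceService(FilesystemBackend(root))
    service.do_put("sales/2021", b"stand-in for encoded data")
    round_trip = service.do_get("sales/2021")
```

Keeping the backend behind a small interface is one way to support several storage types without touching the put/get handlers; an S3 backend would implement the same two methods.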

Another question is around schema evolution. Are there any gotchas with this 
approach? Do I need to think about a separate schema registry, and how would I 
evolve data against that registry?

For now, forget about authn/authz issues; I think the handshake mechanism will 
probably suffice, but if not I can roll extensions using the action mechanism.

Has anyone else done anything like this? Does it seem like a reasonable use of 
the tooling? Are there any gotchas I should be worrying about?

Cheers,

Gary



[RUST] Request for Comment / Check proposed release process

2021-05-17 Thread Andrew Lamb
I need help verifying the proposed source tarball format for the Arrow Rust
releases.

Specifically, can someone please:
1. Download the example files and ensure they can successfully validate the
signatures
2. Ensure that the contents of this tarball could be used to publish to
crates.io
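
As a rough aid for step 1, the checksum half of the verification can be sketched with the Python standard library. This assumes each `.sha256` / `.sha512` file starts with the hex digest (the layout `shasum` writes), which is an assumption about the files and not something confirmed in this email; the helper names are made up. It does not cover the `.asc` signature, which needs a tool such as gpg.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, algorithm: str) -> str:
    """Hex digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball: str, algorithm: str) -> bool:
    """Compare `tarball` against the sibling `<tarball>.<algorithm>` file.

    Assumes the checksum file begins with the hex digest.
    """
    with open(f"{tarball}.{algorithm}") as f:
        expected = f.read().split()[0]
    return file_digest(tarball, algorithm) == expected.lower()


# Self-contained demo with a stand-in file instead of the real tarball.
with tempfile.TemporaryDirectory() as tmp:
    tarball = os.path.join(tmp, "example.tar.gz")
    with open(tarball, "wb") as f:
        f.write(b"not a real tarball")
    with open(tarball + ".sha256", "w") as f:
        f.write(hashlib.sha256(b"not a real tarball").hexdigest())
        f.write("  example.tar.gz\n")
    ok = checksum_matches(tarball, "sha256")
    with open(tarball, "ab") as f:
        f.write(b"tampered")
    tampered_ok = checksum_matches(tarball, "sha256")
```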

Background: I have been working on the new release process for
arrow-rs (updates in [2]).  The contents and changelog in this example
release tarball are from [3] and were created using the scripts /
instructions in [1].

[1] https://github.com/apache/arrow-rs/pull/299
[2] https://github.com/apache/arrow-rs/issues/292
[3] https://github.com/apache/arrow-rs/pull/305

Here is an example output (including Vote Email) generated by the script in [1]:

```
cd /Users/alamb/Software/arrow-rs/ && ./dev/release/create-tarball.sh  0.0.3
Attempting to create
/Users/alamb/Software/arrow-rs/dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz
from tag 0.0.3
Draft email for dev@arrow.apache.org mailing list

-
To: dev@arrow.apache.org
Subject: [VOTE][RUST] Release Apache Arrow

Hi,

I would like to propose a release of Apache Arrow Rust
Implementation, version 0.0.3.

This release candidate is based on commit:
f3959f59a6119dab23818e6eef87e0d7b58c820e [1]

The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:
https://github.com/apache/arrow-rs/tree/f3959f59a6119dab23818e6eef87e0d7b58c820e
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-0.0.3
[3]:
https://github.com/apache/arrow-rs/blob/f3959f59a6119dab23818e6eef87e0d7b58c820e/CHANGELOG.md
-
Running rat license checker on
/Users/alamb/Software/arrow-rs/dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz
OK
No unapproved licenses
Signing tarball and creating checksums
Uploading to apache dist/dev to
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-0.0.3
Checked out revision 47764.
A dev/dist/apache-arrow-rs-0.0.3
A  (bin)
 dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz
A
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.sha512
A
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.asc
A
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.sha256
Adding dev/dist/apache-arrow-rs-0.0.3
Adding  (bin)
 dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz
Adding
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.asc
Adding
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.sha256
Adding
dev/dist/apache-arrow-rs-0.0.3/apache-arrow-rs-0.0.3-f3959f59a.tar.gz.sha512
Transmitting file data done
Committing transaction...
Committed revision 47765.

```


Re: Long title on github page

2021-05-17 Thread Brian Hulette
Thank you for bringing this up, Dominik. I sampled some of the descriptions
for other Apache projects I frequent; the ones with a meaningful
description have a single sentence:

github.com/apache/spark - Apache Spark - A unified analytics engine for
large-scale data processing
github.com/apache/beam - Apache Beam is a unified programming model for
Batch and Streaming
github.com/apache/avro - Apache Avro is a data serialization system

Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache "
as the description.

+1 for Nate's suggestion "Apache Arrow is a cross-language development
platform for in-memory data. It enables systems to process and transport
data more efficiently."

On Mon, May 17, 2021 at 5:23 AM Wes McKinney  wrote:

> It's probably best for the description to limit mentions of specific
> features. There are some high level features mentioned in the
> description now ("computational libraries and zero-copy streaming
> messaging and interprocess communication"), but now in 2021 since the
> project has grown so much, it could leave people with a limited view
> of what they might find here.
>
> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
>  wrote:
> >
> > How about
> > > 'Apache Arrow is a cross-language development platform for in-memory
> > > data. It enables systems to process and transport data efficiently,
> > > providing a simple and fast library for partitioning of large tables'?
> >
> > > Sorry for the delay, long election day
> >
> > On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> natebauernfe...@deephaven.io>
> > wrote:
> >
> > > Suggestion: faster -> more efficiently
> > >
> > > "Apache Arrow is a cross-language development platform for in-memory
> > > data. It enables systems to process and transport data more
> efficiently."
> > >
> > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney 
> wrote:
> > >
> > > > > Here's what's there now:
> > > > >
> > > > > "Apache Arrow is a cross-language development platform for in-memory
> > > > > data. It specifies a standardized language-independent columnar
> > > > > memory format for flat and hierarchical data, organized for efficient
> > > > > analytic operations on modern hardware. It also provides
> > > > > computational libraries and zero-copy streaming messaging and
> > > > > interprocess communication…"
> > > >
> > > > How about something shorter like
> > > >
> > > > "Apache Arrow is a cross-language development platform for in-memory
> > > > data. It enables systems to process and transport data faster."
> > > >
> > > > Suggestions / refinements from others welcome
> > > >
> > > >
> > > > On Sat, May 15, 2021 at 9:12 PM Dominik Moritz 
> wrote:
> > > > >
> > > > > > Super minor issue but could someone make the description on GitHub
> > > > > > shorter?
> > > > > >
> > > > > > GitHub puts the description into the title of the page and makes it
> > > > > > hard to find it in URL autocomplete.
> > > > >
> > > >
> > >
> > >
> > > --
> > >
>


Re: [DISCUSS] Parquet/Arrow/Flight as distributed persistence service

2021-05-17 Thread David Li
Hey Gary,

Sounds like an interesting project!

To speak a bit to the Flight question: I don't think you need a new
action; using DoGet/DoPut as you describe makes sense for
persistence. There's no required semantics for Flight - it certainly
suggests certain patterns (GetFlightInfo -> DoGet for instance) but
none of that is formally specified/required, nor is there a generic
client that expects to be able to talk to any Flight server.

And indeed, you can search the archives of this list for the FlightSQL
proposal, which is somewhat similar to your project in spirit (but
oriented towards traditional relational databases).

As for schema evolution - I think you are not talking about schema
evolution during a single Flight RPC call (not (yet) supported), but
rather evolving the schema of a stored dataset between reads? (Just to
clarify whether this is a question about Flight or not.)
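
If the question is indeed about a stored dataset whose schema changes between reads, one common rule (assumed here for illustration; neither Arrow nor Flight prescribes it) is to accept only additive changes: every existing field keeps its name and type, and new fields may be appended. A minimal sketch, with schemas modelled as plain dicts and hypothetical function names:

```python
def is_additive_evolution(old: dict, new: dict) -> bool:
    """True if `new` keeps every field of `old` with an unchanged type.

    Schemas are modelled as {field name: type name} dicts.
    """
    return all(new.get(name) == dtype for name, dtype in old.items())


def added_fields(old: dict, new: dict) -> list:
    """Names present in `new` but not in `old`, sorted for stable output."""
    return sorted(set(new) - set(old))


# Hypothetical dataset schemas at two points in time.
v1 = {"id": "int64", "price": "float64"}
v2 = {"id": "int64", "price": "float64", "currency": "string"}
v3 = {"id": "string", "price": "float64"}

additive = is_additive_evolution(v1, v2)   # a field was appended
breaking = is_additive_evolution(v1, v3)   # "id" changed type
```

Under this rule, readers of older files would need a default (for example null) for the appended fields; whether that check lives in the service or in a separate schema registry is a design decision the sketch does not settle.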

Best,
David

On 2021/05/17 13:21:09, Gary Pennington  
wrote: 
> Hi,
> 
> (NB: I first floated this question in the arrow-rust slack channel and Jorge 
> Leitao suggested I should ask here.)
> 
> I’m cranking up a project to provide functionality based on: 
> parquet/arrow/flight implemented in rust. The primary goals of the project 
> are to provide a mechanism for storing/retrieving large quantities of column 
> oriented data across different types of storage mechanism, (S3, filesystem, 
> etc..). Initially, at least, the flight/arrow/parquet stack looks to be a 
> great fit for what I’m doing.
> 
> I’ve done some prototyping and so far I’ve made good progress. I have a 
> simple flight service (written in rust: arrow 4.0.0 stack) which is happy to 
> send/receive data to/from a very simple flight client (written in python).
> 
> I’ve encountered a few rough edges and before proceeding further I thought 
> I’d see what other people think of the idea of using flight/arrow to provide 
> a persistence service (parquet) for large quantities of column oriented data.
> 
> One of my questions is about the use of flight. Flight seems to be primarily 
> oriented around streams of data (which is cool), but has anyone else 
> considered using that as the basis for a distributed storage framework? 
> do_get would read_parquet/send_arrow parquet data and do_put would 
> receive_arrow/write_parquet it. Or perhaps separate persistence as a new 
> action?
> 
> Another question is around schema evolution. Any gotchas with this approach. 
> Do I need to think about a separate schema registry and how would I evolve 
> data against that registry?
> 
> For now, forget about authn/authz issues, I think the handshake mechanism 
> will probably suffice, but if not I can roll extensions using the action 
> mechanism.
> 
> Has anyone else done anything like this? Does it seem like a reasonable use 
> of the tooling. Any gotchas I should be worrying about?
> 
> Cheers,
> 
> Gary
> 
> 


String reverse kernel

2021-05-17 Thread Niranda Perera
Hi all,

This is regarding the string reverse kernel [1][2]. Even though it is a
seemingly trivial exercise, I would like to clarify a few things.

In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
like to get some feedback for the following points.

1. For ASCII reverse, I am throwing an error if a non-ASCII char is
encountered. Should we throw this error, or return garbage output (e.g.
a\xD1b --> b\xD1a)?
2. For UTF8 reverse, I am returning some garbage output when malformed UTF8
buffers are present, but the algorithm guarantees that it returns the
same buffer sizes as the input. IMO, the current algorithm works
efficiently for valid UTF8 chars.
#1 and #2 are inconsistent, and I'd like to know the best way to
handle malformed/invalid chars.
3. As @DavidLi pointed out in the PR, user-perceived characters can span
several code points and so go beyond 4 bytes (e.g. emoji sequences, combining
pairs, etc.), and currently these are not handled.
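
To make the three points easier to discuss, here is a small Python sketch of the two behaviours. It is an illustration written for this thread, not the C++ code in the PR; the function names are made up, and treating an invalid or stray byte as a one-byte unit is an assumption of the sketch.

```python
def ascii_reverse(data: bytes) -> bytes:
    """Reverse byte-wise, raising if any byte is outside ASCII (point 1)."""
    if any(b >= 0x80 for b in data):
        raise ValueError("non-ASCII byte in input")
    return data[::-1]


def utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence announced by a lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid lead byte: treat as one byte


def utf8_reverse(data: bytes) -> bytes:
    """Reverse code point by code point without decoding (point 2).

    The output always has the same length as the input; malformed input
    yields malformed output but never a size change.
    """
    n = len(data)
    out = bytearray(n)
    i = 0
    while i < n:
        size = min(utf8_char_len(data[i]), n - i)
        # Copy the whole sequence to its mirrored position.
        out[n - i - size : n - i] = data[i : i + size]
        i += size
    return bytes(out)


reversed_word = utf8_reverse("héllo".encode("utf-8")).decode("utf-8")
malformed = utf8_reverse(b"a\xd1b")  # same size, still malformed

# Point 3: "e" followed by a combining acute accent (U+0301) is one
# user-perceived character; reversing by code point moves the accent
# to the front of the string.
combining = utf8_reverse("e\u0301".encode("utf-8")).decode("utf-8")
```

Handling point 3 properly would mean reversing by grapheme cluster, which needs Unicode segmentation data and is outside what this sketch attempts.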

Look forward to hearing from you.

Best

[1] https://github.com/apache/arrow/pull/10317
[2] https://issues.apache.org/jira/browse/ARROW-12713

-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: Long title on github page

2021-05-17 Thread Eduardo Ponce
I agree with Nate's and Brian's suggestions, but would like to add that we
can make it a one-liner for more conciseness and consistency with other
Apache projects.
Apologies if it seems I am going around the suggestions loop again.

"Apache Arrow is a cross-language development platform enabling efficient
in-memory data processing and transport."




On Mon, May 17, 2021 at 10:11 AM Brian Hulette  wrote:

> Thank you for bringing this up Dominik. I sampled some of the descriptions
> for other Apache projects I frequent, the ones with a meaningful
> description have a single sentence:
>
> github.com/apache/spark - Apache Spark - A unified analytics engine for
> large-scale data processing
> github.com/apache/beam - Apache Beam is a unified programming model for
> Batch and Streaming
> github.com/apache/avro - Apache Avro is a data serialization system
>
> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache "
> as the description.
>
> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> platform for in-memory data. It enables systems to process and transport
> data more efficiently."
>
> On Mon, May 17, 2021 at 5:23 AM Wes McKinney  wrote:
>
> > It's probably best for the description to limit mentions of specific
> > features. There are some high level features mentioned in the
> > description now ("computational libraries and zero-copy streaming
> > messaging and interprocess communication"), but now in 2021 since the
> > project has grown so much, it could leave people with a limited view
> > of what they might find here.
> >
> > On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> >  wrote:
> > >
> > > How about
> > > > 'Apache Arrow is a cross-language development platform for in-memory
> > > > data. It enables systems to process and transport data efficiently,
> > > > providing a simple and fast library for partitioning of large tables'?
> > >
> > > > Sorry for the delay, long election day
> > >
> > > On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> > natebauernfe...@deephaven.io>
> > > wrote:
> > >
> > > > Suggestion: faster -> more efficiently
> > > >
> > > > "Apache Arrow is a cross-language development platform for in-memory
> > > > data. It enables systems to process and transport data more
> > efficiently."
> > > >
> > > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney 
> > wrote:
> > > >
> > > > > > Here's what's there now:
> > > > > >
> > > > > > "Apache Arrow is a cross-language development platform for
> > > > > > in-memory data. It specifies a standardized language-independent
> > > > > > columnar memory format for flat and hierarchical data, organized
> > > > > > for efficient analytic operations on modern hardware. It also
> > > > > > provides computational libraries and zero-copy streaming messaging
> > > > > > and interprocess communication…"
> > > > >
> > > > > How about something shorter like
> > > > >
> > > > > "Apache Arrow is a cross-language development platform for
> in-memory
> > > > > data. It enables systems to process and transport data faster."
> > > > >
> > > > > Suggestions / refinements from others welcome
> > > > >
> > > > >
> > > > > On Sat, May 15, 2021 at 9:12 PM Dominik Moritz 
> > wrote:
> > > > > >
> > > > > > Super minor issue but could someone make the description on
> GitHub
> > > > > shorter?
> > > > > >
> > > > > >
> > > > > >
> > > > > > GitHub puts the description into the title of the page and makes
> it
> > > > hard
> > > > > to find it in URL autocomplete.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> >
>


Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou



Le 17/05/2021 à 16:28, Niranda Perera a écrit :
> Hi all,
>
> This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly
> trivial exercise, I would like to clarify a few things.
>
> In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
> like to get some feedback for the following points.
>
> 1. For ASCII reverse, I am throwing an error if a non-ascii char is
> encountered. Should we throw this error? or return a garbage output (ex:
> a\xD1b --> b\x1D\a)

Since this is taking valid UTF8 input, it should not produce invalid
output, so an error should be emitted (IMHO).

> 2. For UTF8 reverse, I am returning some garbage output when malformed utf8
> buffers are present but the algorithm guarantees that it would return the
> same buffer sizes as the input. IMO, the current algorithm works
> efficiently for valid UTF8 chars.

Since this is taking invalid UTF8 input, we don't care that the output
is invalid as well.

> 3. As @DavidLi pointed out in the PR, UTF8 chars can go beyond 4 bytes (ex:
> emojis, utf-8 pairs, etc). and currently these are not handled.

I'm not aware of that.  Encodings beyond 4 bytes are invalid.
See for example the IETF RFC for UTF-8:
  https://datatracker.ietf.org/doc/html/rfc3629#section-4
or the Unicode standard (chapter 3, p. 124, Table 3-7. Well-Formed UTF-8
Byte Sequences):
  https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf

Regards

Antoine.
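
To make points 1 and 2 concrete, here is a minimal Python sketch. It is not the code from the PR: `reverse_ascii` is a hypothetical helper, and the only assumption is that the ASCII kernel rejects any byte >= 0x80. It also shows why reversing valid UTF-8 byte by byte, without such a check, yields invalid UTF-8.

```python
def reverse_ascii(data: bytes) -> bytes:
    """Reverse a byte string, refusing anything outside 7-bit ASCII.

    Hypothetical helper for illustration only, not the Arrow kernel.
    """
    if any(b >= 0x80 for b in data):
        raise ValueError("non-ASCII byte encountered")
    return data[::-1]


# A valid UTF-8 string containing one 2-byte character (U+00D1).
valid_utf8 = "a\u00d1b".encode("utf-8")  # b'a\xc3\x91b'

# Byte-wise reversal without the check puts the continuation byte first.
naive = valid_utf8[::-1]

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False
```

With the check in place the caller gets an error instead of a silently corrupted buffer, which matches the "valid in, valid out" reasoning above.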


Re: String reverse kernel

2021-05-17 Thread David Li
A little clarification on my point: it's not that a single codepoint
gets encoded with more than four bytes, it's that a grapheme
cluster/human-delimited 'character' might be multiple codepoints, so
reversing the individual codepoints may produce an unexpected
result. For instance a flag emoji is actually two codepoints (two
special 'letter' codepoints that represent the country code), so
reversing a US flag naively will give you an odd '[SU]' instead.

Not that this needs to be handled per se right now - but we should
perhaps point it out in the kernel documentation so people know what
to expect.

-David
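
A small Python illustration of the flag example, as a sketch only (Python's `str` slicing works on codepoints, which is assumed here to mirror what a codepoint-level kernel would do):

```python
# The US flag is two codepoints: REGIONAL INDICATOR SYMBOL LETTER U and S.
us_flag = "\U0001F1FA\U0001F1F8"

# Reversing by codepoint swaps the two indicators, giving the pair "S", "U".
reversed_flag = us_flag[::-1]

# Both strings are valid Unicode; only the meaning of the pair changes.
codepoints_before = [hex(ord(c)) for c in us_flag]
codepoints_after = [hex(ord(c)) for c in reversed_flag]
```

The result is still well-formed UTF-8 once encoded; it just no longer represents the same flag.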

On 2021/05/17 14:48:52, Antoine Pitrou  wrote: 
> 
> Le 17/05/2021 à 16:28, Niranda Perera a écrit :
> > Hi all,
> > 
> > This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly
> > trivial exercise, I would like to clarify a few things.
> > 
> > In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd
> > like to get some feedback for the following points.
> > 
> > 1. For ASCII reverse, I am throwing an error if a non-ascii char is
> > encountered. Should we throw this error? or return a garbage output (ex:
> > a\xD1b --> b\x1D\a)
> 
> Since this is taking valid UTF8 input, it should not produce invalid 
> output, so an error should be emitted (IMHO).
> 
> > 2. For UTF8 reverse, I am returning some garbage output when malformed utf8
> > buffers are present but the algorithm guarantees that it would return the
> > same buffer sizes as the input. IMO, the current algorithm works
> > efficiently for valid UTF8 chars.
> 
> Since this is taking invalid UTF8 input, we don't care that the output 
> is invalid as well.
> 
> > 3. As @DavidLi pointed out in the PR, UTF8 chars can go beyond 4 bytes (ex:
> > emojis, utf-8 pairs, etc). and currently these are not handled.
> 
> I'm not aware of that.  Encodings beyond 4 bytes are invalid.
> See for example the IETF RFC for UTF-8:
>   https://datatracker.ietf.org/doc/html/rfc3629#section-4
> or the Unicode standard (chapter 3, p. 124, Table 3-7. Well-Formed UTF-8
> Byte Sequences):
>   https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf
> 
> Regards
> 
> Antoine.
> 


Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou



Le 17/05/2021 à 17:17, David Li a écrit :
> A little clarification on my point: it's not that a single codepoint
> gets encoded with more than four bytes, it's that a grapheme
> cluster/human-delimited 'character' might be multiple codepoints, so
> reversing the individual codepoints may produce an unexpected
> result. For instance a flag emoji is actually two codepoints (two
> special 'letter' codepoints that represent the country code), so
> reversing a US flag naively will give you an odd '[SU]' instead.


This sounds like saying that reversing a valid French word does not 
produce a valid French word (well, in most cases). The kernel 
documentation can't contain an entire tutorial about Unicode characters 
and what to expect from them, IMHO.


Regards

Antoine.


Re: [DISCUSS] Parquet/Arrow/Flight as distributed persistence service

2021-05-17 Thread Gary Pennington
Hi David,

Thanks for the feedback. I’m reassured that you don’t think the idea is too
crazy. 😊

I’ll take a look at the FlightSQL proposal you mention. There is actually a
project related to the one I’m working on which will need a more structured
approach to data storage. Maybe not SQL-like, more graph-like I think, but
the ideas are still likely to be applicable.

Regarding schema evolution – I am not talking about evolution during a call, 
but rather over time between gets/puts. I can think of ways to manage that over 
time, but I wondered if any best practices have started to emerge in this space.

Cheers,

Gary

From: David Li 
Date: Monday, 17 May 2021 at 15:25
To: dev@arrow.apache.org 
Subject: Re: [DISCUSS] Parquet/Arrow/Flight as distributed persistence service
Hey Gary,

Sounds like an interesting project!

To speak a bit to the Flight question: I don't think you need a new
action; using DoGet/DoPut as you describe makes sense for
persistence. There's no required semantics for Flight - it certainly
suggests certain patterns (GetFlightInfo -> DoGet for instance) but
none of that is formally specified/required, nor is there a generic
client that expects to be able to talk to any Flight server.

And indeed, you can search the archives of this list for the FlightSQL
proposal, which is somewhat similar to your project in spirit (but
oriented towards traditional relational databases).

As for schema evolution - I think you are not talking about schema
evolution during a single Flight RPC call (not (yet) supported), but
rather evolving the schema of a stored dataset between reads? (Just to
clarify whether this is a question about Flight or not.)

Best,
David

On 2021/05/17 13:21:09, Gary Pennington  
wrote:
> Hi,
>
> (NB: I first floated this question in the arrow-rust slack channel and Jorge 
> Leitao suggested I should ask here.)
>
> I’m cranking up a project to provide functionality based on: 
> parquet/arrow/flight implemented in rust. The primary goals of the project 
> are to provide a mechanism for storing/retrieving large quantities of column 
> oriented data across different types of storage mechanism, (S3, filesystem, 
> etc..). Initially, at least, the flight/arrow/parquet stack looks to be a 
> great fit for what I’m doing.
>
> I’ve done some prototyping and so far I’ve made good progress. I have a 
> simple flight service (written in rust: arrow 4.0.0 stack) which is happy to 
> send/receive data to/from a very simple flight client (written in python).
>
> I’ve encountered a few rough edges and before proceeding further I thought 
> I’d see what other people think of the idea of using flight/arrow to provide 
> a persistence service (parquet) for large quantities of column oriented data.
>
> One of my questions is about the use of flight. Flight seems to be primarily 
> oriented around streams of data (which is cool), but has anyone else 
> considered using that as the basis for a distributed storage framework? 
> do_get would read_parquet/send_arrow parquet data and do_put would 
> receive_arrow/write_parquet it. Or perhaps separate persistence as a new 
> action?
>
> Another question is around schema evolution. Any gotchas with this approach?
> Do I need to think about a separate schema registry, and how would I evolve
> data against that registry?
>
> For now, forget about authn/authz issues, I think the handshake mechanism 
> will probably suffice, but if not I can roll extensions using the action 
> mechanism.
>
> Has anyone else done anything like this? Does it seem like a reasonable use
> of the tooling? Any gotchas I should be worrying about?
>
> Cheers,
>
> Gary
>
>


Re: String reverse kernel

2021-05-17 Thread David Li
Sure, that is a fair point. But in this case Unicode defines both codepoint and 
(extended) grapheme cluster, so I felt it might be worth including a quick note 
about which one is being reversed (though to be fair, nearly every language 
picks codepoint except maybe Swift, IIUC).

In either case it's not something I feel very strongly about.

-David

On 2021/05/17 15:20:57, Antoine Pitrou  wrote: 
> 
> Le 17/05/2021 à 17:17, David Li a écrit :
> > A little clarification on my point: it's not that a single codepoint
> > gets encoded with more than four bytes, it's that a grapheme
> > cluster/human-delimited 'character' might be multiple codepoints, so
> > reversing the individual codepoints may produce an unexpected
> > result. For instance a flag emoji is actually two codepoints (two
> > special 'letter' codepoints that represent the country code), so
> > reversing a US flag naively will give you an odd '[SU]' instead.
> 
> This sounds like saying that reversing a valid French word does not 
> produce a valid French word (well, in most cases). The kernel 
> documentation can't contain an entire tutorial about Unicode characters 
> and what to expect from them, IMHO.
> 
> Regards
> 
> Antoine.
> 


Re: String reverse kernel

2021-05-17 Thread Ian Cook
+1 for clarifying this in the kernel documentation, referring to these
multi-emoji glyphs as "emoji ZWJ sequences," and linking to
https://unicode.org/emoji/charts/emoji-zwj-sequences.html

Ian


On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou  wrote:
>
>
> Le 17/05/2021 à 17:17, David Li a écrit :
> > A little clarification on my point: it's not that a single codepoint
> > gets encoded with more than four bytes, it's that a grapheme
> > cluster/human-delimited 'character' might be multiple codepoints, so
> > reversing the individual codepoints may produce an unexpected
> > result. For instance a flag emoji is actually two codepoints (two
> > special 'letter' codepoints that represent the country code), so
> > reversing a US flag naively will give you an odd '[SU]' instead.
>
> This sounds like saying that reversing a valid French word does not
> produce a valid French word (well, in most cases). The kernel
> documentation can't contain an entire tutorial about Unicode characters
> and what to expect from them, IMHO.
>
> Regards
>
> Antoine.


Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou



I'm fine with pointing out that the function operates on codepoints.

Linking to the Unicode documentation for emojis sounds entirely like a 
distraction, though.


Regards

Antoine.


Le 17/05/2021 à 17:28, Ian Cook a écrit :
> +1 for clarifying this in the kernel documentation, referring to these
> multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> https://unicode.org/emoji/charts/emoji-zwj-sequences.html
>
> Ian
>
>
> On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou  wrote:
>>
>>
>> Le 17/05/2021 à 17:17, David Li a écrit :
>>> A little clarification on my point: it's not that a single codepoint
>>> gets encoded with more than four bytes, it's that a grapheme
>>> cluster/human-delimited 'character' might be multiple codepoints, so
>>> reversing the individual codepoints may produce an unexpected
>>> result. For instance a flag emoji is actually two codepoints (two
>>> special 'letter' codepoints that represent the country code), so
>>> reversing a US flag naively will give you an odd '[SU]' instead.
>>
>> This sounds like saying that reversing a valid French word does not
>> produce a valid French word (well, in most cases). The kernel
>> documentation can't contain an entire tutorial about Unicode characters
>> and what to expect from them, IMHO.
>>
>> Regards
>>
>> Antoine.


Re: String reverse kernel

2021-05-17 Thread Niranda Perera
Thank you very much for your inputs, guys. So, based on the discussion, I
will make the following changes.

1. ASCII reverse would throw an error when a non-ASCII (valid/invalid
utf8) byte is observed (no change)
2. UTF8 kernel would return a garbage output when an invalid utf8 char is
observed but  (no change)
Thank you @antoine for the clarification.
3. Edit documentation to clarify that the kernel works on code-point level
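
For readers of the archive, the codepoint-level behaviour in points 2 and 3 can be sketched in Python as below. This is not the algorithm from the PR: `utf8_char_len` and `reverse_utf8` are hypothetical names, and treating a stray continuation byte or an invalid lead byte as a 1-byte unit is an assumption made here so that the output always has the same size as the input.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence announced by a lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    # Continuation byte or invalid lead byte: assumed to be a 1-byte unit.
    return 1


def reverse_utf8(data: bytes) -> bytes:
    """Reverse a buffer codepoint by codepoint, keeping each sequence intact."""
    n = len(data)
    out = bytearray(n)
    i = 0
    while i < n:
        # Never read past the end of a truncated (malformed) buffer.
        step = min(utf8_char_len(data[i]), n - i)
        # Copy the whole sequence to its mirrored position in the output.
        out[n - i - step : n - i] = data[i : i + step]
        i += step
    return bytes(out)
```

Valid input stays valid, while malformed input may come out as garbage of the same length, which is the trade-off agreed on in the thread.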

On Mon, May 17, 2021 at 11:31 AM Antoine Pitrou  wrote:

>
> I'm fine with pointing out that the function operates on codepoints.
>
> Linking to the Unicode documentation for emojis sounds entirely like a
> distraction, though.
>
> Regards
>
> Antoine.
>
>
> Le 17/05/2021 à 17:28, Ian Cook a écrit :
> > +1 for clarifying this in the kernel documentation, referring to these
> > multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> >
> > Ian
> >
> >
> > On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> Le 17/05/2021 à 17:17, David Li a écrit :
> >>> A little clarification on my point: it's not that a single codepoint
> >>> gets encoded with more than four bytes, it's that a grapheme
> >>> cluster/human-delimited 'character' might be multiple codepoints, so
> >>> reversing the individual codepoints may produce an unexpected
> >>> result. For instance a flag emoji is actually two codepoints (two
> >>> special 'letter' codepoints that represent the country code), so
> >>> reversing a US flag naively will give you an odd '[SU]' instead.
> >>
> >> This sounds like saying that reversing a valid French word does not
> >> produce a valid French word (well, in most cases). The kernel
> >> documentation can't contain an entire tutorial about Unicode characters
> >> and what to expect from them, IMHO.
> >>
> >> Regards
> >>
> >> Antoine.
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 


Re: String reverse kernel

2021-05-17 Thread Weston Pace
FWIW, combining marks were not actually added to support emojis.  Emojis
are just one of the more popular uses of the feature.  Combining marks are a
standard Unicode feature necessary to represent single “characters” in some
complex situations (e.g. when it is necessary to distinguish between tréma
and umlaut, or to represent certain characters in Navajo).

That being said I agree with the conclusions.  It’s ok to leave out for now
and no need to link to any docs.

On Mon, May 17, 2021 at 5:31 AM Antoine Pitrou  wrote:

>
> I'm fine with pointing out that the function operates on codepoints.
>
> Linking to the Unicode documentation for emojis sounds entirely like a
> distraction, though.
>
> Regards
>
> Antoine.
>
>
> Le 17/05/2021 à 17:28, Ian Cook a écrit :
> > +1 for clarifying this in the kernel documentation, referring to these
> > multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> >
> > Ian
> >
> >
> > On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou 
> wrote:
> >>
> >>
> >> Le 17/05/2021 à 17:17, David Li a écrit :
> >>> A little clarification on my point: it's not that a single codepoint
> >>> gets encoded with more than four bytes, it's that a grapheme
> >>> cluster/human-delimited 'character' might be multiple codepoints, so
> >>> reversing the individual codepoints may produce an unexpected
> >>> result. For instance a flag emoji is actually two codepoints (two
> >>> special 'letter' codepoints that represent the country code), so
> >>> reversing a US flag naively will give you an odd '[SU]' instead.
> >>
> >> This sounds like saying that reversing a valid French word does not
> >> produce a valid French word (well, in most cases). The kernel
> >> documentation can't contain an entire tutorial about Unicode characters
> >> and what to expect from them, IMHO.
> >>
> >> Regards
> >>
> >> Antoine.
>


Re: String reverse kernel

2021-05-17 Thread Jonathan Keane
Yeah, piggybacking on what Weston said: is the line that we want to draw at
code points, combining character sequences, or graphemes [1]? IME, most
people would want/assume that combining characters would stay combined in
reversals (using Weston's example: "tréma" should become "amért", not
"aḿert"; though this specific character "é" has both a combining version,
e + U+0301, and a single code point é, whereas for many diacritics from
different writing systems there is only the combining version).

But whatever division we choose, documentation + links to explanations are
great.

[1]
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.notion_of_character
there's also discussion at https://unicode.org/reports/tr29/, though the
first link I found much clearer.
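
To illustrate the difference, here is a hedged Python sketch using the standard `unicodedata` module. `reverse_keep_combining` is a hypothetical helper that only keeps combining marks attached to their base character; it does not implement full grapheme cluster segmentation as described in UAX #29 (flags and ZWJ sequences are not handled).

```python
import unicodedata


def reverse_keep_combining(text: str) -> str:
    """Reverse text while keeping combining marks with their base character."""
    clusters = []
    for ch in text:
        # A non-zero combining class means the codepoint modifies the
        # preceding character, so keep it in the same cluster.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


decomposed = "tre\u0301ma"            # "e" followed by COMBINING ACUTE ACCENT
naive = decomposed[::-1]              # the accent now follows "m"
combined = reverse_keep_combining(decomposed)
```

The naive codepoint reversal moves the accent onto the "m", while the cluster-aware version keeps it on the "e".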

On Mon, May 17, 2021 at 10:46 AM Weston Pace  wrote:

> FWIW, combining marks were not actually added to support emojis.  Emojis
> are just one of the more popular uses of the feature.  Combining marks is a
> standard Unicode feature necessary to represent single “characters” in some
> complex situations (e.g. when it is necessary to distinguish between tréma
> and umlaut, or to represent certain characters in Navajo).
>
> That being said I agree with the conclusions.  It’s ok to leave out for now
> and no need to link to any docs.
>
> On Mon, May 17, 2021 at 5:31 AM Antoine Pitrou  wrote:
>
> >
> > I'm fine with pointing out that the function operates on codepoints.
> >
> > Linking to the Unicode documentation for emojis sounds entirely like a
> > distraction, though.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 17/05/2021 à 17:28, Ian Cook a écrit :
> > > +1 for clarifying this in the kernel documentation, referring to these
> > > multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > >
> > > Ian
> > >
> > >
> > > On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>
> > >> Le 17/05/2021 à 17:17, David Li a écrit :
> > >>> A little clarification on my point: it's not that a single codepoint
> > >>> gets encoded with more than four bytes, it's that a grapheme
> > >>> cluster/human-delimited 'character' might be multiple codepoints, so
> > >>> reversing the individual codepoints may produce an unexpected
> > >>> result. For instance a flag emoji is actually two codepoints (two
> > >>> special 'letter' codepoints that represent the country code), so
> > >>> reversing a US flag naively will give you an odd '[SU]' instead.
> > >>
> > >> This sounds like saying that reversing a valid French word does not
> > >> produce a valid French word (well, in most cases). The kernel
> > >> documentation can't contain an entire tutorial about Unicode
> characters
> > >> and what to expect from them, IMHO.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> >
>


Re: Long title on github page

2021-05-17 Thread Julian Hyde
I think that “cross-language development platform for” is noise. (I’m sure
that JPEG developers think that JPEG is a “cross-language development platform”
too. But it isn’t. It is an image format.)

"Apache Arrow is a data format for efficient in-memory processing."

I’ll note that, in marketing speak, we are developing a high-concept pitch [1]
here. Every company needs a name, a brand, a high-concept pitch, and a 3- or
4-sentence description. But every Apache project needs these too. It’s worth
spending the time on the description as well, and then using them in all the
places that we describe Arrow.

Julian

[1] https://www.growthink.com/content/whats-your-high-concept-pitch



> On May 17, 2021, at 7:38 AM, Eduardo Ponce  wrote:
> 
> I agree with Nate's and Brian's suggestions, but would like to add that we
> can make it a one-liner for more conciseness and consistency with other
> Apache projects.
> Apologies if it seems I am going around the suggestions loop again.
> 
> "Apache Arrow is a cross-language development platform enabling efficient
> in-memory data processing and transport."
> 
> 
> 
> 
> On Mon, May 17, 2021 at 10:11 AM Brian Hulette  wrote:
> 
>> Thank you for bringing this up Dominik. I sampled some of the descriptions
>> for other Apache projects I frequent, the ones with a meaningful
>> description have a single sentence:
>> 
>> github.com/apache/spark - Apache Spark - A unified analytics engine for
>> large-scale data processing
>> github.com/apache/beam - Apache Beam is a unified programming model for
>> Batch and Streaming
>> github.com/apache/avro - Apache Avro is a data serialization system
>> 
>> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache "
>> as the description.
>> 
>> +1 for Nate's suggestion "Apache Arrow is a cross-language development
>> platform for in-memory data. It enables systems to process and transport
>> data more efficiently."
>> 
>> On Mon, May 17, 2021 at 5:23 AM Wes McKinney  wrote:
>> 
>>> It's probably best for description to limit mentions of specific
>>> features. There are some high level features mentioned in the
>>> description now ("computational libraries and zero-copy streaming
>>> messaging and interprocess communication"), but now in 2021 since the
>>> project has grown so much, it could leave people with a limited view
>>> of what they might find here.
>>> 
>>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
>>>  wrote:
 
>>>> How about
>>>> 'Apache Arrow is a cross-language development platform for in-memory data.
>>>> It enables systems to process and transport data efficiently, providing a
>>>> simple and fast library for partitioning of large tables'?
>>>>
>>>> Sorry for the delay, long election day
>>>>
>>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <natebauernfe...@deephaven.io>
>>>> wrote:
>>>>
>>>>> Suggestion: faster -> more efficiently
>>>>>
>>>>> "Apache Arrow is a cross-language development platform for in-memory
>>>>> data. It enables systems to process and transport data more efficiently."
>>>>>
>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney  wrote:
>>>>>
>>>>>> Here's what's there now:
>>>>>>
>>>>>> "Apache Arrow is a cross-language development platform for in-memory
>>>>>> data. It specifies a standardized language-independent columnar memory
>>>>>> format for flat and hierarchical data, organized for efficient
>>>>>> analytic operations on modern hardware. It also provides computational
>>>>>> libraries and zero-copy streaming messaging and interprocess
>>>>>> communication…"
>>>>>>
>>>>>> How about something shorter like
>>>>>>
>>>>>> "Apache Arrow is a cross-language development platform for in-memory
>>>>>> data. It enables systems to process and transport data faster."
>>>>>>
>>>>>> Suggestions / refinements from others welcome
>>>>>>
>>>>>>
>>>>>> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz  wrote:
>>>>>>>
>>>>>>> Super minor issue but could someone make the description on GitHub
>>>>>>> shorter?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> GitHub puts the description into the title of the page and makes it hard
>>>>>>> to find it in URL autocomplete.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>> 
>> 



Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Jorge Cardoso Leitão
Thanks, Krisztián!

I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked, so we
are 2 to go:

- https://issues.apache.org/jira/browse/ARROW-12604
- https://issues.apache.org/jira/browse/ARROW-12603

Best,
Jorge



On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs 
wrote:

> On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
>  wrote:
> >
> > Hi,
> >
> > I have started collecting commits to the maint branch [1]. The exact
> > commands I used:
> >
> > git clone g...@github.com:apache/arrow.git
> > cd arrow/dev
> > python3 -m venv venv
> > source venv/bin/activate
> > pip install -e archery
> > pip install GitPython jira semver jinja2
> > archery release cherry-pick 4.0.1
> > # ran the commands it printed one by one
> >
> > There is a commit that does not apply cleanly. Could someone from C++
> merge
> > it? What to do:
> >
> > Run `git fetch upstream && git checkout maint-4.0.x && git cherry-pick
> > ce2861713472818eea264957de4cc83d5a2c567c`
> >
> > This will trigger a merge conflict. Resolve and push to maint-4.0.x on
> > apache/arrow.
> Hi,
>
> I've recreated the maintenance branch and resolved the conflicts.
> According to the release curation script [1], we have 4 issues without
> available patches:
> - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> submitted a PR)
> - https://issues.apache.org/jira/browse/ARROW-12619
> - https://issues.apache.org/jira/browse/ARROW-12604
> - https://issues.apache.org/jira/browse/ARROW-12603
>
> [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> >
> > Thanks,
> > Jorge
> >
> > https://github.com/apache/arrow/tree/maint-4.0.x
> >
> >
> >
> > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> neal.p.richard...@gmail.com>
> > wrote:
> >
> > > Thanks for taking this on. Krisztián can confirm the details (or point
> you
> > > to where this is documented), but based on past patch releases, I
> believe
> > > you would make a `maint-4.0.x` branch off of the existing
> `release-4.0.0`
> > > branch, cherry-pick the commits associated with the JIRAs tagged for
> 4.0.1
> > > (I believe there are utility scripts to help with this), and run the
> > > release script that bumps the versions.
> > >
> > > Neal
> > >
> > >
> > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > > > Just to make sure: the goal is to cherry-pick all changes targeted
> for
> > > > 4.0.1 into a branch and release from there? If that is the case,
> then I
> > > > will create a branch and start cherry-picking the changes in order
> they
> > > > were merged in master.
> > > >
> > > > I see 5 issues on the list still open. I subscribed to them and will
> be
> > > > cherry-picking them as they get merged.
> > > >
> > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> 4.0.1. I
> > > > suggest 4.0.1 to keep parity as we recently agreed, but let me know
> if
> > > > others disagree.
> > > >
> > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > >
> > > >
> > > >
> > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > neal.p.richard...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks, Jorge!
> > > > >
> > > > > If anyone else has bugfixes that they'd like included in a patch
> > > release,
> > > > > please tag them with the 4.0.1 Fix Version. Perhaps we can do a
> roundup
> > > > and
> > > > > start a vote early next week?
> > > > >
> > > > > Neal
> > > > >
> > > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Addressing these accumulated issues in a patch release sounds
> like a
> > > > > > good idea to me.
> > > > > >
> > > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > > >  wrote:
> > > > > > >
> > > > > > > I agree. Segfaults are not nice.
> > > > > > >
> > > > > > > I can take it. I would possibly need some guidance.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jorge
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 12:52 AM Neal Richardson <
> > > > > > > neal.p.richard...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > As discussed at the biweekly sync call, I wanted to gauge
> > > interest
> > > > in
> > > > > > doing
> > > > > > > > a 4.0.1 patch release.
> > > > > > > >
> > > > > > > > There currently are 14 issues in JIRA tagged with 4.0.1 [1].
> > > There
> > > > > are
> > > > > > 3
> > > > > > > > segfaults, including one that a cuDF maintainer raised
> yesterday
> > > > [2]
> > > > > in
> > > > > > > > requesting a patch release.
> > > > > > > >
> > > > > > > > I don't want to bias the discussion by giving my opinion
> (yet). I
> > > > > will
> > > > > > say
> > > > > > > > that the question is whether someone (or multiple people)
> wants
> > > to
> > > > > > step up
> > > > > > > > and drive a release--if releases were costless, this would be
> > > much
> > > > > > > > different. We did decide to allow for a simpler patch release
> > > > process
> 

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Krisztián Szűcs
On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
 wrote:
>
> Thanks, Krisztián!
>
> I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked, so we
> are 2 to go:
>
> - https://issues.apache.org/jira/browse/ARROW-12604
Resolved now, but didn't require a patch on our side.
> - https://issues.apache.org/jira/browse/ARROW-12603
Just merged it.

The maintenance branch should be ready now.
>
> Best,
> Jorge
>
>
>
> On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs 
> wrote:
>
> > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> >  wrote:
> > >
> > > Hi,
> > >
> > > I have started collecting commits to the maint branch [1]. The exact
> > > commands I used:
> > >
> > > git clone g...@github.com:apache/arrow.git
> > > cd arrow/dev
> > > python3 -m venv venv
> > > source venv/bin/activate
> > > pip install -e archery
> > > pip install GitPython jira semver jinja2
> > > archery release cherry-pick 4.0.1
> > > # ran the commands it printed one by one
> > >
> > > There is a commit that does not apply cleanly. Could someone from C++
> > merge
> > > it? What to do:
> > >
> > > Run `git fetch upstream && git checkout maint-4.0.x && git cherry-pick
> > > ce2861713472818eea264957de4cc83d5a2c567c`
> > >
> > > This will trigger a merge conflict. Resolve and push to maint-4.0.x on
> > > apache/arrow.
> > Hi,
> >
> > I've recreated the maintenance branch and resolved the conflicts.
> > According to the release curation script [1], we have 4 issues without
> > available patches:
> > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > submitted a PR)
> > - https://issues.apache.org/jira/browse/ARROW-12619
> > - https://issues.apache.org/jira/browse/ARROW-12604
> > - https://issues.apache.org/jira/browse/ARROW-12603
> >
> > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > >
> > > Thanks,
> > > Jorge
> > >
> > > https://github.com/apache/arrow/tree/maint-4.0.x
> > >
> > >
> > >
> > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > neal.p.richard...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for taking this on. Krisztián can confirm the details (or point
> > you
> > > > to where this is documented), but based on past patch releases, I
> > believe
> > > > you would make a `maint-4.0.x` branch off of the existing
> > `release-4.0.0`
> > > > branch, cherry-pick the commits associated with the JIRAs tagged for
> > 4.0.1
> > > > (I believe there are utility scripts to help with this), and run the
> > > > release script that bumps the versions.
> > > >
> > > > Neal
> > > >
> > > >
> > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Just to make sure: the goal is to cherry-pick all changes targeted
> > for
> > > > > 4.0.1 into a branch and release from there? If that is the case,
> > then I
> > > > > will create a branch and start cherry-picking the changes in order
> > they
> > > > > were merged in master.
> > > > >
> > > > > I see 5 issues on the list still open. I subscribed to them and will
> > be
> > > > > cherry-picking them as they get merged.
> > > > >
> > > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> > 4.0.1. I
> > > > > suggest 4.0.1 to keep parity as we recently agreed, but let me know
> > if
> > > > > others disagree.
> > > > >
> > > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > > >
> > > > >
> > > > >
> > > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > > neal.p.richard...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks, Jorge!
> > > > > >
> > > > > > If anyone else has bugfixes that they'd like included in a patch
> > > > release,
> > > > > > please tag them with the 4.0.1 Fix Version. Perhaps we can do a
> > roundup
> > > > > and
> > > > > > start a vote early next week?
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney 
> > > > > wrote:
> > > > > >
> > > > > > > Addressing these accumulated issues in a patch release sounds
> > like a
> > > > > > > good idea to me.
> > > > > > >
> > > > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > I agree. Segfaults are not nice.
> > > > > > > >
> > > > > > > > I can take it. I would possibly need some guidance.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jorge
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, May 13, 2021 at 12:52 AM Neal Richardson <
> > > > > > > > neal.p.richard...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > As discussed at the biweekly sync call, I wanted to gauge
> > > > interest
> > > > > in
> > > > > > > doing
> > > > > > > > > a 4.0.1 patch release.
> > > > > > > > >
> > > > > > > > > There currently are 14 issues in JIRA tagged with 4.0.1 [1].
> > > > There
> > > > > > are
> > > > > > > 3
> > > > > > > > > segfaults, including one that a cuDF maintainer raised
> > yesterday
> > > > > [2]
>

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Micah Kornfield
Small logistical question: Jorge, do you have a PGP key in the Apache Web
of Trust [1]?

[1] https://infra.apache.org/release-signing.html#web-of-trust

On Mon, May 17, 2021 at 11:46 AM Krisztián Szűcs 
wrote:

> On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
>  wrote:
> >
> > Thanks, Krisztián!
> >
> > I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked, so
> we
> > are 2 to go:
> >
> > - https://issues.apache.org/jira/browse/ARROW-12604
> Resolved now, but didn't require a patch on our side.
> > - https://issues.apache.org/jira/browse/ARROW-12603
> Just merged it.
>
> The maintenance branch should be ready now.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > wrote:
> >
> > > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have started collecting commits to the maint branch [1]. The exact
> > > > commands I used:
> > > >
> > > > git clone g...@github.com:apache/arrow.git
> > > > cd arrow/dev
> > > > python3 -m venv venv
> > > > source venv/bin/activate
> > > > pip install -e archery
> > > > pip install GitPython jira semver jinja2
> > > > archery release cherry-pick 4.0.1
> > > > # ran the commands it printed one by one
> > > >
> > > > There is a commit that does not apply cleanly. Could someone from C++
> > > merge
> > > > it? What to do:
> > > >
> > > > Run `git fetch upstream && git checkout maint-4.0.x && git
> cherry-pick
> > > > ce2861713472818eea264957de4cc83d5a2c567c`
> > > >
> > > > This will trigger a merge conflict. Resolve and push to maint-4.0.x
> on
> > > > apache/arrow.
> > > Hi,
> > >
> > > I've recreated the maintenance branch and resolved the conflicts.
> > > According to the release curation script [1], we have 4 issues without
> > > available patches:
> > > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > > submitted a PR)
> > > - https://issues.apache.org/jira/browse/ARROW-12619
> > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > - https://issues.apache.org/jira/browse/ARROW-12603
> > >
> > > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > > >
> > > > Thanks,
> > > > Jorge
> > > >
> > > > https://github.com/apache/arrow/tree/maint-4.0.x
> > > >
> > > >
> > > >
> > > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > > neal.p.richard...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for taking this on. Krisztián can confirm the details (or
> point
> > > you
> > > > > to where this is documented), but based on past patch releases, I
> > > believe
> > > > > you would make a `maint-4.0.x` branch off of the existing
> > > `release-4.0.0`
> > > > > branch, cherry-pick the commits associated with the JIRAs tagged
> for
> > > 4.0.1
> > > > > (I believe there are utility scripts to help with this), and run
> the
> > > > > release script that bumps the versions.
> > > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > > jorgecarlei...@gmail.com> wrote:
> > > > >
> > > > > > Just to make sure: the goal is to cherry-pick all changes
> targeted
> > > for
> > > > > > 4.0.1 into a branch and release from there? If that is the case,
> > > then I
> > > > > > will create a branch and start cherry-picking the changes in
> order
> > > they
> > > > > > were merged in master.
> > > > > >
> > > > > > I see 5 issues on the list still open. I subscribed to them and
> will
> > > be
> > > > > > cherry-picking them as they get merged.
> > > > > >
> > > > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> > > 4.0.1. I
> > > > > > suggest 4.0.1 to keep parity as we recently agreed, but let me
> know
> > > if
> > > > > > others disagree.
> > > > > >
> > > > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > > > neal.p.richard...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks, Jorge!
> > > > > > >
> > > > > > > If anyone else has bugfixes that they'd like included in a
> patch
> > > > > release,
> > > > > > > please tag them with the 4.0.1 Fix Version. Perhaps we can do a
> > > roundup
> > > > > > and
> > > > > > > start a vote early next week?
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney <
> wesmck...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Addressing these accumulated issues in a patch release sounds
> > > like a
> > > > > > > > good idea to me.
> > > > > > > >
> > > > > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > I agree. Segfaults are not nice.
> > > > > > > > >
> > > > > > > > > I can take it. I would possibly need some guidance.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jorge
> > > > > > > > >
> > > > > > > > >
> > > > > 

Re: Long title on github page

2021-05-17 Thread Mauricio Vargas
Sorry to suggest a marketing-style title, but how about

github.com/apache/arrow - Apache Arrow is an efficient format for big data
processing and sharing
?

On Mon, May 17, 2021 at 1:15 PM Julian Hyde  wrote:

> I think that the “cross-language development platform for” is noise. (I’m
> sure that JPEG developers think that JPEG is a “cross-language development
> platform” too. But it isn’t. It is an image format.)
>
> “Apache Arrow is a data format for efficient in-memory processing.”
>
> I’ll note that in marketing speak, we are developing a high-concept pitch
> [1] here. Every company needs a name, a brand, a high-concept pitch, and a
> 3- or 4-sentence description. But every Apache project needs these too. It’s
> worth spending the time on the description, also, and then using them in all
> the places that we describe Arrow.
>
> Julian
>
> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
>
>
>
> > On May 17, 2021, at 7:38 AM, Eduardo Ponce  wrote:
> >
> > I agree with Nate's and Brian's suggestions, but would like to add that
> we
> > can make it a one-liner for more conciseness and consistency with other
> > Apache projects.
> > Apologies if it seems I am going around the suggestions loop again.
> >
> > "Apache Arrow is a cross-language development platform enabling efficient
> > in-memory data processing and transport."
> >
> >
> >
> >
> > On Mon, May 17, 2021 at 10:11 AM Brian Hulette 
> wrote:
> >
> >> Thank you for bringing this up Dominik. I sampled some of the
> descriptions
> >> for other Apache projects I frequent, the ones with a meaningful
> >> description have a single sentence:
> >>
> >> github.com/apache/spark - Apache Spark - A unified analytics engine for
> >> large-scale data processing
> >> github.com/apache/beam - Apache Beam is a unified programming model for
> >> Batch and Streaming
> >> github.com/apache/avro - Apache Avro is a data serialization system
> >>
> >> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> "
> >> as the description.
> >>
> >> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> >> platform for in-memory data. It enables systems to process and transport
> >> data more efficiently."
> >>
> >> On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> wrote:
> >>
> >>> It's probably best for description to limit mentions of specific
> >>> features. There are some high level features mentioned in the
> >>> description now ("computational libraries and zero-copy streaming
> >>> messaging and interprocess communication"), but now in 2021 since the
> >>> project has grown so much, it could leave people with a limited view
> >>> of what they might find here.
> >>>
> >>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> >>>  wrote:
> 
>  How about
>  'Apache Arrow is a cross-language development platform for in-memory
> >>> data.
>  It enables systems to process and transport data efficiently,
> >> providing a
>  simple and fast library for partitioning of large tables'?
> 
>  Sorry for the delay, long election day
> 
>  On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> >>> natebauernfe...@deephaven.io>
>  wrote:
> 
> > Suggestion: faster -> more efficiently
> >
> > "Apache Arrow is a cross-language development platform for in-memory
> > data. It enables systems to process and transport data more
> >>> efficiently."
> >
> > On Sun, May 16, 2021 at 11:35 AM Wes McKinney 
> >>> wrote:
> >
> >> Here's what's there now:
> >>
> >> "Apache Arrow is a cross-language development platform for
> >> in-memory
> >> data. It specifies a standardized language-independent columnar
> >>> memory
> >> format for flat and hierarchical data, organized for efficient
> >> analytic operations on modern hardware. It also provides
> >>> computational
> >> libraries and zero-copy streaming messaging and interprocess
> >> communication…"
> >>
> >> How about something shorter like
> >>
> >> "Apache Arrow is a cross-language development platform for
> >> in-memory
> >> data. It enables systems to process and transport data faster."
> >>
> >> Suggestions / refinements from others welcome
> >>
> >>
> >> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz 
> >>> wrote:
> >>>
> >>> Super minor issue but could someone make the description on
> >> GitHub
> >> shorter?
> >>>
> >>>
> >>>
> >>> GitHub puts the description into the title of the page and makes
> >> it
> > hard
> >> to find it in URL autocomplete.
> >>>
> >>
> >
> >
> > --
> >
> >>>
> >>
>
>


Re: Long title on github page

2021-05-17 Thread Wes McKinney
I think less is better in the description, but unfortunately the
association of Arrow as being "just a data format" has been actively
harmful in some ways to community growth. We have a data format, yes,
but we are also creating a computational platform to go hand-in-hand
with the data format to make it easier to build fast applications that
use the data format. So the description needs to capture both of these
ideas.

On Mon, May 17, 2021 at 12:15 PM Julian Hyde  wrote:
>
> I think that the “cross-language development platform for” is noise. (I’m 
> sure that JPEG developers think that JPEG is a “cross-language development 
> platform” too. But it isn’t. It is an image format.)
>
> “Apache Arrow is a data format for efficient in-memory processing.”
>
> I’ll note that in marketing speak, we are developing a high-concept pitch [1]
> here. Every company needs a name, a brand, a high-concept pitch, and a 3- or
> 4-sentence description. But every Apache project needs these too. It’s worth
> spending the time on the description, also, and then using them in all the
> places that we describe Arrow.
>
> Julian
>
> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
>
>
>
> > On May 17, 2021, at 7:38 AM, Eduardo Ponce  wrote:
> >
> > I agree with Nate's and Brian's suggestions, but would like to add that we
> > can make it a one-liner for more conciseness and consistency with other
> > Apache projects.
> > Apologies if it seems I am going around the suggestions loop again.
> >
> > "Apache Arrow is a cross-language development platform enabling efficient
> > in-memory data processing and transport."
> >
> >
> >
> >
> > On Mon, May 17, 2021 at 10:11 AM Brian Hulette  wrote:
> >
> >> Thank you for bringing this up Dominik. I sampled some of the descriptions
> >> for other Apache projects I frequent, the ones with a meaningful
> >> description have a single sentence:
> >>
> >> github.com/apache/spark - Apache Spark - A unified analytics engine for
> >> large-scale data processing
> >> github.com/apache/beam - Apache Beam is a unified programming model for
> >> Batch and Streaming
> >> github.com/apache/avro - Apache Avro is a data serialization system
> >>
> >> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache "
> >> as the description.
> >>
> >> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> >> platform for in-memory data. It enables systems to process and transport
> >> data more efficiently."
> >>
> >> On Mon, May 17, 2021 at 5:23 AM Wes McKinney  wrote:
> >>
> >>> It's probably best for description to limit mentions of specific
> >>> features. There are some high level features mentioned in the
> >>> description now ("computational libraries and zero-copy streaming
> >>> messaging and interprocess communication"), but now in 2021 since the
> >>> project has grown so much, it could leave people with a limited view
> >>> of what they might find here.
> >>>
> >>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> >>>  wrote:
> 
>  How about
>  'Apache Arrow is a cross-language development platform for in-memory
> >>> data.
>  It enables systems to process and transport data efficiently,
> >> providing a
>  simple and fast library for partitioning of large tables'?
> 
>  Sorry for the delay, long election day
> 
>  On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> >>> natebauernfe...@deephaven.io>
>  wrote:
> 
> > Suggestion: faster -> more efficiently
> >
> > "Apache Arrow is a cross-language development platform for in-memory
> > data. It enables systems to process and transport data more
> >>> efficiently."
> >
> > On Sun, May 16, 2021 at 11:35 AM Wes McKinney 
> >>> wrote:
> >
> >> Here's what's there now:
> >>
> >> "Apache Arrow is a cross-language development platform for
> >> in-memory
> >> data. It specifies a standardized language-independent columnar
> >>> memory
> >> format for flat and hierarchical data, organized for efficient
> >> analytic operations on modern hardware. It also provides
> >>> computational
> >> libraries and zero-copy streaming messaging and interprocess
> >> communication…"
> >>
> >> How about something shorter like
> >>
> >> "Apache Arrow is a cross-language development platform for
> >> in-memory
> >> data. It enables systems to process and transport data faster."
> >>
> >> Suggestions / refinements from others welcome
> >>
> >>
> >> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz 
> >>> wrote:
> >>>
> >>> Super minor issue but could someone make the description on
> >> GitHub
> >> shorter?
> >>>
> >>>
> >>>
> >>> GitHub puts the description into the title of the page and makes
> >> it
> > hard
> >> to find it in URL autocomplete.
> >>>
> >>
> >
> >
> > --
> >
> >>>
> >>
>


Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Krisztián Szűcs
I think your GPG key hasn't been configured yet; at least, it is not in
the KEYS file [1].
The source release tarball must be signed by the release manager.

Do you have an Apache code signing key?
If not, then it might be better if either Kou or I were the
release manager.

[1]: https://dist.apache.org/repos/dist/dev/arrow/KEYS
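
For anyone who wants to check this locally, here is a minimal sketch of looking for a user ID in a KEYS-style file. It assumes the file contains the human-readable `uid` lines that `gpg` prints ahead of each armored block, which may not hold for every entry. The function names and the sample text below are hypothetical and only for illustration; in practice you would download the KEYS file from [1] and pass its contents in.

```python
import re


def uids_in_keys(keys_text):
    """Return the user-ID strings found on `uid` lines of a KEYS-style file."""
    uids = []
    for line in keys_text.splitlines():
        # A uid line looks like: "uid   [ultimate] Name <email>".
        # The bracketed trust marker is optional in older listings.
        match = re.match(r"^uid\s+(?:\[[^\]]*\]\s*)?(.+)$", line)
        if match:
            uids.append(match.group(1).strip())
    return uids


def key_listed(keys_text, needle):
    """Case-insensitive check for a name or email among the listed uids."""
    needle = needle.lower()
    return any(needle in uid.lower() for uid in uids_in_keys(keys_text))


# Hypothetical sample, not taken from the real Arrow KEYS file.
SAMPLE = """\
pub   rsa4096 2020-01-01 [SC]
      0123456789ABCDEF0123456789ABCDEF01234567
uid           [ultimate] Example Manager <manager@example.org>
sub   rsa4096 2020-01-01 [E]
"""

print(key_listed(SAMPLE, "manager@example.org"))       # True
print(key_listed(SAMPLE, "someone.else@example.org"))  # False
```

A match here only says that a uid is listed; it does not verify any signature, so it is not a substitute for the release verification steps.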

On Mon, May 17, 2021 at 8:45 PM Krisztián Szűcs
 wrote:
>
> On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
>  wrote:
> >
> > Thanks, Krisztián!
> >
> > I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked, so we
> > are 2 to go:
> >
> > - https://issues.apache.org/jira/browse/ARROW-12604
> Resolved now, but didn't require a patch on our side.
> > - https://issues.apache.org/jira/browse/ARROW-12603
> Just merged it.
>
> The maintenance branch should be ready now.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs 
> > wrote:
> >
> > > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have started collecting commits to the maint branch [1]. The exact
> > > > commands I used:
> > > >
> > > > git clone g...@github.com:apache/arrow.git
> > > > cd arrow/dev
> > > > python3 -m venv venv
> > > > source venv/bin/activate
> > > > pip install -e archery
> > > > pip install GitPython jira semver jinja2
> > > > archery release cherry-pick 4.0.1
> > > > # ran the commands it printed one by one
> > > >
> > > > There is a commit that does not apply cleanly. Could someone from C++
> > > merge
> > > > it? What to do:
> > > >
> > > > Run `git fetch upstream && git checkout maint-4.0.x && git cherry-pick
> > > > ce2861713472818eea264957de4cc83d5a2c567c`
> > > >
> > > > This will trigger a merge conflict. Resolve and push to maint-4.0.x on
> > > > apache/arrow.
> > > Hi,
> > >
> > > I've recreated the maintenance branch and resolved the conflicts.
> > > According to the release curation script [1], we have 4 issues without
> > > available patches:
> > > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > > submitted a PR)
> > > - https://issues.apache.org/jira/browse/ARROW-12619
> > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > - https://issues.apache.org/jira/browse/ARROW-12603
> > >
> > > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > > >
> > > > Thanks,
> > > > Jorge
> > > >
> > > > https://github.com/apache/arrow/tree/maint-4.0.x
> > > >
> > > >
> > > >
> > > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > > neal.p.richard...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for taking this on. Krisztián can confirm the details (or point
> > > you
> > > > > to where this is documented), but based on past patch releases, I
> > > believe
> > > > > you would make a `maint-4.0.x` branch off of the existing
> > > `release-4.0.0`
> > > > > branch, cherry-pick the commits associated with the JIRAs tagged for
> > > 4.0.1
> > > > > (I believe there are utility scripts to help with this), and run the
> > > > > release script that bumps the versions.
> > > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > > jorgecarlei...@gmail.com> wrote:
> > > > >
> > > > > > Just to make sure: the goal is to cherry-pick all changes targeted
> > > for
> > > > > > 4.0.1 into a branch and release from there? If that is the case,
> > > then I
> > > > > > will create a branch and start cherry-picking the changes in order
> > > they
> > > > > > were merged in master.
> > > > > >
> > > > > > I see 5 issues on the list still open. I subscribed to them and will
> > > be
> > > > > > cherry-picking them as they get merged.
> > > > > >
> > > > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> > > 4.0.1. I
> > > > > > suggest 4.0.1 to keep parity as we recently agreed, but let me know
> > > if
> > > > > > others disagree.
> > > > > >
> > > > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > > > neal.p.richard...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks, Jorge!
> > > > > > >
> > > > > > > If anyone else has bugfixes that they'd like included in a patch
> > > > > release,
> > > > > > > please tag them with the 4.0.1 Fix Version. Perhaps we can do a
> > > roundup
> > > > > > and
> > > > > > > start a vote early next week?
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 8:20 AM Wes McKinney 
> > > > > > wrote:
> > > > > > >
> > > > > > > > Addressing these accumulated issues in a patch release sounds
> > > like a
> > > > > > > > good idea to me.
> > > > > > > >
> > > > > > > > On Wed, May 12, 2021 at 6:18 PM Jorge Cardoso Leitão
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > I agree. Segfaults are not nice.
> > > > > > > > >
> > > > > > > > > I can take it. I would possibly need some guidance.
> > >

Re: Long title on github page

2021-05-17 Thread Eduardo Ponce
One more suggestion for the bucket:
"Apache Arrow is a computational platform for efficient in-memory data
representation and processing."

On Mon, May 17, 2021 at 2:49 PM Wes McKinney  wrote:

> I think less is better in the description, but unfortunately the
> association of Arrow as being "just a data format" has been actively
> harmful in some ways to community growth. We have a data format, yes,
> but we are also creating a computational platform to go hand-in-hand
> with the data format to make it easier to build fast applications that
> use the data format. So the description needs to capture both of these
> ideas.
>
> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> wrote:
> >
> > I think that the “cross-language development platform for” is noise.
> > (I’m sure that JPEG developers think that JPEG is a “cross-language
> > development platform” too. But it isn’t. It is an image format.)
> >
> > “Apache Arrow is a data format for efficient in-memory processing.”
> >
> > I’ll note that in marketing speak, we are developing a high-concept
> > pitch [1] here. Every company needs a name, a brand, a high-concept pitch,
> > and a 3- or 4-sentence description. But every Apache project needs these too.
> > It’s worth spending the time on the description, also, and then using them in
> > all the places that we describe Arrow.
> >
> > Julian
> >
> > [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> >
> >
> >
> > > On May 17, 2021, at 7:38 AM, Eduardo Ponce 
> wrote:
> > >
> > > I agree with Nate's and Brian's suggestions, but would like to add
> that we
> > > can make it a one-liner for more conciseness and consistency with other
> > > Apache projects.
> > > Apologies if it seems I am going around the suggestions loop again.
> > >
> > > "Apache Arrow is a cross-language development platform enabling
> efficient
> > > in-memory data processing and transport."
> > >
> > >
> > >
> > >
> > > On Mon, May 17, 2021 at 10:11 AM Brian Hulette 
> wrote:
> > >
> > >> Thank you for bringing this up Dominik. I sampled some of the
> descriptions
> > >> for other Apache projects I frequent, the ones with a meaningful
> > >> description have a single sentence:
> > >>
> > >> github.com/apache/spark - Apache Spark - A unified analytics engine
> for
> > >> large-scale data processing
> > >> github.com/apache/beam - Apache Beam is a unified programming model
> for
> > >> Batch and Streaming
> > >> github.com/apache/avro - Apache Avro is a data serialization system
> > >>
> > >> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> "
> > >> as the description.
> > >>
> > >> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> > >> platform for in-memory data. It enables systems to process and
> transport
> > >> data more efficiently."
> > >>
> > >> On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> wrote:
> > >>
> > >>> It's probably best for description to limit mentions of specific
> > >>> features. There are some high level features mentioned in the
> > >>> description now ("computational libraries and zero-copy streaming
> > >>> messaging and interprocess communication"), but now in 2021 since the
> > >>> project has grown so much, it could leave people with a limited view
> > >>> of what they might find here.
> > >>>
> > >>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > >>>  wrote:
> > 
> >  How about
> >  'Apache Arrow is a cross-language development platform for in-memory
> > >>> data.
> >  It enables systems to process and transport data efficiently,
> > >> providing a
> >  simple and fast library for partitioning of large tables'?
> > 
> >  Sorry for the delay, long election day
> > 
> >  On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> > >>> natebauernfe...@deephaven.io>
> >  wrote:
> > 
> > > Suggestion: faster -> more efficiently
> > >
> > > "Apache Arrow is a cross-language development platform for
> in-memory
> > > data. It enables systems to process and transport data more
> > >>> efficiently."
> > >
> > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney  >
> > >>> wrote:
> > >
> > >> Here's what's there now:
> > >>
> > >> "Apache Arrow is a cross-language development platform for
> > >> in-memory
> > >> data. It specifies a standardized language-independent columnar
> > >>> memory
> > >> format for flat and hierarchical data, organized for efficient
> > >> analytic operations on modern hardware. It also provides
> > >>> computational
> > >> libraries and zero-copy streaming messaging and interprocess
> > >> communication…"
> > >>
> > >> How about something shorter like
> > >>
> > >> "Apache Arrow is a cross-language development platform for
> > >> in-memory
> > >> data. It enables systems to process and transport data faster."
> > >>
> > >> Suggestions / refinements from others welcome
> > >>
> > >>
> > >> On Sat, May 15, 2021 at 9:12 PM Do

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Neal Richardson
How does one get their key in the Web of Trust? We do need to be able to
add people to that so that it's not just the same handful of individuals
who can be release manager, and now seems like a great time to add Jorge.

Neal

On Mon, May 17, 2021 at 11:52 AM Krisztián Szűcs 
wrote:

> I think your GPG key hasn't been configured yet, at least it is not in
> the KEYS file [1].
> The source release tarball must be signed by the release manager.
>
> Do you have an Apache Code Signing key?
> If not, then it could be better if either Kou or I would be the
> release manager.
>
> [1]: https://dist.apache.org/repos/dist/dev/arrow/KEYS
>
> On Mon, May 17, 2021 at 8:45 PM Krisztián Szűcs
>  wrote:
> >
> > On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
> >  wrote:
> > >
> > > Thanks, Krisztián!
> > >
> > > I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked,
> so we
> > > are 2 to go:
> > >
> > > - https://issues.apache.org/jira/browse/ARROW-12604
> > Resolved now, but didn't require a patch on our side.
> > > - https://issues.apache.org/jira/browse/ARROW-12603
> > Just merged it.
> >
> > The maintenance branch should be ready now.
> > >
> > > Best,
> > > Jorge
> > >
> > >
> > >
> > > On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > > wrote:
> > >
> > > > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I have started collecting commits to the maint branch [1]. The
> exact
> > > > > commands I used:
> > > > >
> > > > > git clone g...@github.com:apache/arrow.git
> > > > > cd arrow/dev
> > > > > python3 -m venv venv
> > > > > source venv/bin/activate
> > > > > pip install -e archery
> > > > > pip install GitPython jira semver jinja2
> > > > > archery release cherry-pick 4.0.1
> > > > > # ran the commands it printed one by one
> > > > >
> > > > > There is a commit that does not apply cleanly. Could someone from
> C++
> > > > merge
> > > > > it? What to do:
> > > > >
> > > > > Run `git fetch upstream && git checkout maint-4.0.x && git
> cherry-pick
> > > > > ce2861713472818eea264957de4cc83d5a2c567c`
> > > > >
> > > > > This will trigger a merge conflict. Resolve and push to
> maint-4.0.x on
> > > > > apache/arrow.
> > > > Hi,
> > > >
> > > > I've recreated the maintenance branch and resolved the conflicts.
> > > > According to the release curation script [1], we have 4 issues
> without
> > > > available patches:
> > > > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > > > submitted a PR)
> > > > - https://issues.apache.org/jira/browse/ARROW-12619
> > > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > > - https://issues.apache.org/jira/browse/ARROW-12603
> > > >
> > > > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > > > >
> > > > > Thanks,
> > > > > Jorge
> > > > >
> > > > > https://github.com/apache/arrow/tree/maint-4.0.x
> > > > >
> > > > >
> > > > >
> > > > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > > > neal.p.richard...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks for taking this on. Krisztián can confirm the details (or
> point
> > > > you
> > > > > > to where this is documented), but based on past patch releases, I
> > > > believe
> > > > > > you would make a `maint-4.0.x` branch off of the existing
> > > > `release-4.0.0`
> > > > > > branch, cherry-pick the commits associated with the JIRAs tagged
> for
> > > > 4.0.1
> > > > > > (I believe there are utility scripts to help with this), and run
> the
> > > > > > release script that bumps the versions.
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > >
> > > > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > > > jorgecarlei...@gmail.com> wrote:
> > > > > >
> > > > > > > Just to make sure: the goal is to cherry-pick all changes
> targeted
> > > > for
> > > > > > > 4.0.1 into a branch and release from there? If that is the
> case,
> > > > then I
> > > > > > > will create a branch and start cherry-picking the changes in
> order
> > > > they
> > > > > > > were merged in master.
> > > > > > >
> > > > > > > I see 5 issues on the list still open. I subscribed to them
> and will
> > > > be
> > > > > > > cherry-picking them as they get merged.
> > > > > > >
> > > > > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> > > > 4.0.1. I
> > > > > > > suggest 4.0.1 to keep parity as we recently agreed, but let me
> know
> > > > if
> > > > > > > others disagree.
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow-rs/pull/289
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 13, 2021 at 5:54 PM Neal Richardson <
> > > > > > > neal.p.richard...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks, Jorge!
> > > > > > > >
> > > > > > > > If anyone else has bugfixes that they'd like included in a
> patch
> > > > > > release,
> > > > > > > > please tag them with the 4.0.1 Fix Version. Perhaps we can
> do

Re: Long title on github page

2021-05-17 Thread Adam Lippai
Hi,

I'm 100% behind Wes.
Being not just a file format, but also providing compute and libraries, is
Arrow's best selling point.
It shouldn't be reduced to "a file format and its utils", as the ecosystem
is at least as important.
This is something we have to emphasize constantly.

Best regards,
Adam Lippai

On Mon, May 17, 2021 at 8:49 PM Wes McKinney wrote:

> I think less is better in the description, but unfortunately the
> association of Arrow as being "just a data format" has been actively
> harmful in some ways to community growth. We have a data format, yes,
> but we are also creating a computational platform to go hand-in-hand
> with the data format to make it easier to build fast applications that
> use the data format. So the description needs to capture both of these
> ideas.
>
> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> wrote:
> >
> > I think that the “cross-language development platform for” is noise.
> (I’m sure that JPEG developers think that JPEG is a “cross-language
> development platform” too. But it isn’t. It is an image format.)
> >
> > "Apache Arrow is data format for efficient in-memory processing.”
> >
> > I’ll note that In marketing speak, we are developing a high-concept
> pitch [1] here. Every company needs a name, a brand, a high-concept pitch,
> and 3- or 4-sentence description. But every Apache project needs these too.
> It’s worth spending the time on the description, also, and then use them in
> all the places that we describe Arrow.
> >
> > Julian
> >
> > [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> >
> >
> >
> > > On May 17, 2021, at 7:38 AM, Eduardo Ponce 
> wrote:
> > >
> > > I agree with Nate's and Brian's suggestions, but would like to add
> that we
> > > can make it a one-liner for more conciseness and consistency with other
> > > Apache projects.
> > > Apologies if it seems I am going around the suggestions loop again.
> > >
> > > "Apache Arrow is a cross-language development platform enabling
> efficient
> > > in-memory data processing and transport."
> > >
> > >
> > >
> > >
> > > On Mon, May 17, 2021 at 10:11 AM Brian Hulette 
> wrote:
> > >
> > >> Thank you for bringing this up Dominik. I sampled some of the
> descriptions
> > >> for other Apache projects I frequent, the ones with a meaningful
> > >> description have a single sentence:
> > >>
> > >> github.com/apache/spark - Apache Spark - A unified analytics engine
> for
> > >> large-scale data processing
> > >> github.com/apache/beam - Apache Beam is a unified programming model
> for
> > >> Batch and Streaming
> > >> github.com/apache/avro - Apache Avro is a data serialization system
> > >>
> > >> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> "
> > >> as the description.
> > >>
> > >> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> > >> platform for in-memory data. It enables systems to process and
> transport
> > >> data more efficiently."
> > >>
> > >> On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> wrote:
> > >>
> > >>> It's probably best for description to limit mentions of specific
> > >>> features. There are some high level features mentioned in the
> > >>> description now ("computational libraries and zero-copy streaming
> > >>> messaging and interprocess communication"), but now in 2021 since the
> > >>> project has grown so much, it could leave people with a limited view
> > >>> of what they might find here.
> > >>>
> > >>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > >>>  wrote:
> > >>>>
> > >>>> How about
> > >>>> 'Apache Arrow is a cross-language development platform for in-memory
> > >>> data.
> > >>>> It enables systems to process and transport data efficiently,
> > >> providing a
> > >>>> simple and fast library for partitioning of large tables'?
> > >>>>
> > >>>> Sorry the delay, long election day
> > >>>>
> > >>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> > >>> natebauernfe...@deephaven.io>
> > >>>> wrote:
> > >>>>
> > > Suggestion: faster -> more efficiently
> > >
> > > "Apache Arrow is a cross-language development platform for
> in-memory
> > > data. It enables systems to process and transport data more
> > >>> efficiently."
> > >
> > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney  >
> > >>> wrote:
> > >
> > >> Here's what there now:
> > >>
> > >> "Apache Arrow is a cross-language development platform for
> > >> in-memory
> > >> data. It specifies a standardized language-independent columnar
> > >>> memory
> > >> format for flat and hierarchical data, organized for efficient
> > >> analytic operations on modern hardware. It also provides
> > >>> computational
> > >> libraries and zero-copy streaming messaging and interprocess
> > >> communication…"
> > >>
> > >> How about something shorter like
> > >>
> > >> "Apache Arrow is a cross-language development platform for
> > >> in-memory
> > >> data. It enables systems to proces

Re: Long title on github page

2021-05-17 Thread Julian Hyde
Alright, well, whatever it is, it must fit into one breath. If the high-concept 
pitch is successful, people will stick around for the full pitch.

Words such as “platform” and “enable” are noise. You say “platform”, they start 
to say “what exactly do you mean by platform”, the elevator doors open, and 
they’re gone.

“Apache Arrow is a format and compute kernel for in-memory data”


> On May 17, 2021, at 12:03 PM, Eduardo Ponce  wrote:
> 
> One more suggestion for the bucket:
> "Apache Arrow is a computational platform for efficient in-memory data
> representation and processing."
> 
> On Mon, May 17, 2021 at 2:49 PM Wes McKinney  wrote:
> 
>> I think less is better in the description, but unfortunately the
>> association of Arrow as being "just a data format" has been actively
>> harmful in some ways to community growth. We have a data format, yes,
>> but we are also creating a computational platform to go hand-in-hand
>> with the data format to make it easier to build fast applications that
>> use the data format. So the description needs to capture both of these
>> ideas.
>> 
>> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
>> wrote:
>>> 
>>> I think that the “cross-language development platform for” is noise.
>> (I’m sure that JPEG developers think that JPEG is a “cross-language
>> development platform” too. But it isn’t. It is an image format.)
>>> 
>>> "Apache Arrow is data format for efficient in-memory processing.”
>>> 
>>> I’ll note that In marketing speak, we are developing a high-concept
>> pitch [1] here. Every company needs a name, a brand, a high-concept pitch,
>> and 3- or 4-sentence description. But every Apache project needs these too.
>> It’s worth spending the time on the description, also, and then use them in
>> all the places that we describe Arrow.
>>> 
>>> Julian
>>> 
>>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
>>> 
>>> 
>>> 
>>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce
>> wrote:
>>>>
>>>> I agree with Nate's and Brian's suggestions, but would like to add
>> that we
>>>> can make it a one-liner for more conciseness and consistency with other
>>>> Apache projects.
>>>> Apologies if it seems I am going around the suggestions loop again.
>>>>
>>>> "Apache Arrow is a cross-language development platform enabling
>> efficient
>>>> in-memory data processing and transport."
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette
>> wrote:
>>>>
> Thank you for bringing this up Dominik. I sampled some of the
>> descriptions
> for other Apache projects I frequent, the ones with a meaningful
> description have a single sentence:
> 
> github.com/apache/spark - Apache Spark - A unified analytics engine
>> for
> large-scale data processing
> github.com/apache/beam - Apache Beam is a unified programming model
>> for
> Batch and Streaming
> github.com/apache/avro - Apache Avro is a data serialization system
> 
> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
>> "
> as the description.
> 
> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> platform for in-memory data. It enables systems to process and
>> transport
> data more efficiently."
> 
> On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
>> wrote:
> 
>> It's probably best for description to limit mentions of specific
>> features. There are some high level features mentioned in the
>> description now ("computational libraries and zero-copy streaming
>> messaging and interprocess communication"), but now in 2021 since the
>> project has grown so much, it could leave people with a limited view
>> of what they might find here.
>> 
>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
>>  wrote:
>>> 
>>> How about
>>> 'Apache Arrow is a cross-language development platform for in-memory
>> data.
>>> It enables systems to process and transport data efficiently,
> providing a
>>> simple and fast library for partitioning of large tables'?
>>> 
>>> Sorry the delay, long election day
>>> 
>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
>> natebauernfe...@deephaven.io>
>>> wrote:
>>> 
>>>>>>>> Suggestion: faster -> more efficiently
>>>>>>>>
>>>>>>>> "Apache Arrow is a cross-language development platform for
>> in-memory
>>>>>>>> data. It enables systems to process and transport data more
>> efficiently."
>>>>>>>>
>>>>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney
>> wrote:
>>>>>>>>
> Here's what there now:
> 
> "Apache Arrow is a cross-language development platform for
> in-memory
> data. It specifies a standardized language-independent columnar
>> memory
> format for flat and hierarchical data, organized for efficient
> analytic operations on modern hardware. It also provides
>> computational
> libraries an

Re: Long title on github page

2021-05-17 Thread Mauricio Vargas
A few ideas:

github.com/apache/arrow - Apache Arrow is an efficient library for big data
processing and sharing

github.com/apache/arrow - Apache Arrow is a computational tool for
processing, storing and sharing large datasets

github.com/apache/arrow - Apache Arrow is a fast and simple library for
big data analytics

*github.com/apache/arrow - Apache Arrow is
a powerful workhorse for analytic operations on modern hardware*


On Mon, May 17, 2021 at 3:13 PM Julian Hyde wrote:

> Alright, well, whatever it is, it must fit into one breath. If the
> high-concept pitch is successful, people will stick around for the full
> pitch.
>
> Words such as “platform” and “enable” are noise. You say “platform”, they
> start to say “what exactly do you mean by platform”, the elevator doors
> open, and they’re gone.
>
> “Apache Arrow is a format and compute kernel for in-memory data”
>
>
> > On May 17, 2021, at 12:03 PM, Eduardo Ponce  wrote:
> >
> > One more suggestion for the bucket:
> > "Apache Arrow is a computational platform for efficient in-memory data
> > representation and processing."
> >
> > On Mon, May 17, 2021 at 2:49 PM Wes McKinney 
> wrote:
> >
> >> I think less is better in the description, but unfortunately the
> >> association of Arrow as being "just a data format" has been actively
> >> harmful in some ways to community growth. We have a data format, yes,
> >> but we are also creating a computational platform to go hand-in-hand
> >> with the data format to make it easier to build fast applications that
> >> use the data format. So the description needs to capture both of these
> >> ideas.
> >>
> >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> >> wrote:
> >>>
> >>> I think that the “cross-language development platform for” is noise.
> >> (I’m sure that JPEG developers think that JPEG is a “cross-language
> >> development platform” too. But it isn’t. It is an image format.)
> >>>
> >>> "Apache Arrow is data format for efficient in-memory processing.”
> >>>
> >>> I’ll note that In marketing speak, we are developing a high-concept
> >> pitch [1] here. Every company needs a name, a brand, a high-concept
> pitch,
> >> and 3- or 4-sentence description. But every Apache project needs these
> too.
> >> It’s worth spending the time on the description, also, and then use
> them in
> >> all the places that we describe Arrow.
> >>>
> >>> Julian
> >>>
> >>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> >>>
> >>>
> >>>
> >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce
> >> wrote:
> >>>>
> >>>> I agree with Nate's and Brian's suggestions, but would like to add
> >> that we
> >>>> can make it a one-liner for more conciseness and consistency with
> other
> >>>> Apache projects.
> >>>> Apologies if it seems I am going around the suggestions loop again.
> >>>>
> >>>> "Apache Arrow is a cross-language development platform enabling
> >> efficient
> >>>> in-memory data processing and transport."
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette
> >> wrote:
> >>>>
> > Thank you for bringing this up Dominik. I sampled some of the
> >> descriptions
> > for other Apache projects I frequent, the ones with a meaningful
> > description have a single sentence:
> >
> > github.com/apache/spark - Apache Spark - A unified analytics engine
> >> for
> > large-scale data processing
> > github.com/apache/beam - Apache Beam is a unified programming model
> >> for
> > Batch and Streaming
> > github.com/apache/avro - Apache Avro is a data serialization system
> >
> > Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> >> "
> > as the description.
> >
> > +1 for Nate's suggestion "Apache Arrow is a cross-language
> development
> > platform for in-memory data. It enables systems to process and
> >> transport
> > data more efficiently."
> >
> > On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> >> wrote:
> >
> >> It's probably best for description to limit mentions of specific
> >> features. There are some high level features mentioned in the
> >> description now ("computational libraries and zero-copy streaming
> >> messaging and interprocess communication"), but now in 2021 since
> the
> >> project has grown so much, it could leave people with a limited view
> >> of what they might find here.
> >>
> >> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> >>  wrote:
> >>>
> >>> How about
> >>> 'Apache Arrow is a cross-language development platform for
> in-memory
> >> data.
> >>> It enables systems to process and transport data efficiently,
> > providing a
> >>> simple and fast library for partitioning of large tables'?
> >>>
> >>> Sorry the delay, long election day
> >>>
> >>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> >> natebauernfe...@deephaven.io>
> >>> wrote:
> >>>

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Wes McKinney
I would suggest Krisztian or someone in the web of trust have a video
call with Jorge to confirm his identity (and GPG fingerprint) and then
commit his code signing key to KEYS. I don't think it's necessary to
be extremely paranoid about this.
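
As a rough illustration of the fingerprint check on such a call, here is a
minimal Python sketch. It is not an Apache procedure: the helper names
`normalize_fingerprint` and `fingerprints_match` are hypothetical, the
fingerprints are assumed to be 40-hex-digit OpenPGP v4 fingerprints, and the
sample value in the usage note is made up.

```python
import hmac


def normalize_fingerprint(text: str) -> str:
    """Drop whitespace and an optional 0x prefix, then upper-case the hex digits.

    Assumes a 40-hex-digit (OpenPGP v4) fingerprint; anything else is rejected.
    """
    cleaned = "".join(text.split()).upper()
    if cleaned.startswith("0X"):
        cleaned = cleaned[2:]
    if len(cleaned) != 40 or any(c not in "0123456789ABCDEF" for c in cleaned):
        raise ValueError("expected a 40-hex-digit fingerprint")
    return cleaned


def fingerprints_match(read_aloud: str, from_key_file: str) -> bool:
    """Compare the fingerprint read out on the call with the one in the key file."""
    return hmac.compare_digest(
        normalize_fingerprint(read_aloud),
        normalize_fingerprint(from_key_file),
    )
```

For example, `fingerprints_match("0123 4567 89AB CDEF 0123  4567 89AB CDEF 0123 4567",
"0123456789abcdef0123456789abcdef01234567")` is true, because spacing and
letter case are ignored while every hex digit still has to agree.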

On Mon, May 17, 2021 at 2:06 PM Neal Richardson
 wrote:
>
> How does one get their key in the Web of Trust? We do need to be able to
> add people to that so that it's not just the same handful of individuals
> who can be release manager, and now seems like a great time to add Jorge.
>
> Neal
>
> On Mon, May 17, 2021 at 11:52 AM Krisztián Szűcs 
> wrote:
>
> > I think your GPG key hasn't been configured yet, at least it is not in
> > the KEYS file [1].
> > The source release tarball must be signed by the release manager.
> >
> > Do you have an Apache Code Signing key?
> > If not, then it could be better if either Kou or I would be the
> > release manager.
> >
> > [1]: https://dist.apache.org/repos/dist/dev/arrow/KEYS
> >
> > On Mon, May 17, 2021 at 8:45 PM Krisztián Szűcs
> >  wrote:
> > >
> > > On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Thanks, Krisztián!
> > > >
> > > > I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked,
> > so we
> > > > are 2 to go:
> > > >
> > > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > Resolved now, but didn't require a patch on our side.
> > > > - https://issues.apache.org/jira/browse/ARROW-12603
> > > Just merged it.
> > >
> > > The maintenance branch should be ready now.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > > >
> > > > On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > > > wrote:
> > > >
> > > > > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have started collecting commits to the maint branch [1]. The
> > exact
> > > > > > commands I used:
> > > > > >
> > > > > > git clone g...@github.com:apache/arrow.git
> > > > > > cd arrow/dev
> > > > > > python3 -m venv venv
> > > > > > source venv/bin/activate
> > > > > > pip install -e archery
> > > > > > pip install GitPython jira semver jinja2
> > > > > > archery release cherry-pick 4.0.1
> > > > > > # ran the commands it printed one by one
> > > > > >
> > > > > > There is a commit that does not apply cleanly. Could someone from
> > C++
> > > > > merge
> > > > > > it? What to do:
> > > > > >
> > > > > > Run `git fetch upstream && git checkout maint-4.0.x && git
> > cherry-pick
> > > > > > ce2861713472818eea264957de4cc83d5a2c567c`
> > > > > >
> > > > > > This will trigger a merge conflict. Resolve and push to
> > maint-4.0.x on
> > > > > > apache/arrow.
> > > > > Hi,
> > > > >
> > > > > I've recreated the maintenance branch and resolved the conflicts.
> > > > > According to the release curation script [1], we have 4 issues
> > without
> > > > > available patches:
> > > > > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > > > > submitted a PR)
> > > > > - https://issues.apache.org/jira/browse/ARROW-12619
> > > > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > > > - https://issues.apache.org/jira/browse/ARROW-12603
> > > > >
> > > > > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > > > > >
> > > > > > Thanks,
> > > > > > Jorge
> > > > > >
> > > > > > https://github.com/apache/arrow/tree/maint-4.0.x
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > > > > neal.p.richard...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for taking this on. Krisztián can confirm the details (or
> > point
> > > > > you
> > > > > > > to where this is documented), but based on past patch releases, I
> > > > > believe
> > > > > > > you would make a `maint-4.0.x` branch off of the existing
> > > > > `release-4.0.0`
> > > > > > > branch, cherry-pick the commits associated with the JIRAs tagged
> > for
> > > > > 4.0.1
> > > > > > > (I believe there are utility scripts to help with this), and run
> > the
> > > > > > > release script that bumps the versions.
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > > > > jorgecarlei...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Just to make sure: the goal is to cherry-pick all changes
> > targeted
> > > > > for
> > > > > > > > 4.0.1 into a branch and release from there? If that is the
> > case,
> > > > > then I
> > > > > > > > will create a branch and start cherry-picking the changes in
> > order
> > > > > they
> > > > > > > > were merged in master.
> > > > > > > >
> > > > > > > > I see 5 issues on the list still open. I subscribed to them
> > and will
> > > > > be
> > > > > > > > cherry-picking them as they get merged.
> > > > > > > >
> > > > > > > > On the Rust side; we can either bump 4.1.0 or cherry-pick for a
> > > > > 4.0.1. I
> > > > > > > > suggest 4.0.1 to kee

Re: [DISCUSS] 4.0.1 patch release?

2021-05-17 Thread Krisztián Szűcs
On Mon, May 17, 2021 at 9:05 PM Neal Richardson
 wrote:
>
> How does one get their key in the Web of Trust? We do need to be able to
> add people to that so that it's not just the same handful of individuals
> who can be release manager, and now seems like a great time to add Jorge.
Totally agree, we should have more Arrow PMC members in the web of
trust, though the process is a bit more involved.

The release signing procedure is documented at [1]; the web of trust is
explained at [2] and [3].
In short, meeting in person for key signing is preferred because it keeps
the web of trust strong, though a secure channel (like an online meeting)
is more likely in the current circumstances.

[1]: https://infra.apache.org/release-signing.html
[2]: https://infra.apache.org/release-signing.html#web-of-trust
[3]: https://infra.apache.org/openpgp.html#apache-wot
>
> Neal
>
> On Mon, May 17, 2021 at 11:52 AM Krisztián Szűcs 
> wrote:
>
> > I think your GPG key hasn't been configured yet, at least it is not in
> > the KEYS file [1].
> > The source release tarball must be signed by the release manager.
> >
> > Do you have an Apache Code Signing key?
> > If not, then it could be better if either Kou or I would be the
> > release manager.
> >
> > [1]: https://dist.apache.org/repos/dist/dev/arrow/KEYS
> >
> > On Mon, May 17, 2021 at 8:45 PM Krisztián Szűcs
> >  wrote:
> > >
> > > On Mon, May 17, 2021 at 8:30 PM Jorge Cardoso Leitão
> > >  wrote:
> > > >
> > > > Thanks, Krisztián!
> > > >
> > > > I saw that ARROW-12769 and ARROW-12619 were also just cherry-picked,
> > so we
> > > > are 2 to go:
> > > >
> > > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > Resolved now, but didn't require a patch on our side.
> > > > - https://issues.apache.org/jira/browse/ARROW-12603
> > > Just merged it.
> > >
> > > The maintenance branch should be ready now.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > > >
> > > > On Mon, May 17, 2021 at 1:42 PM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > > > wrote:
> > > >
> > > > > On Sat, May 15, 2021 at 7:44 AM Jorge Cardoso Leitão
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have started collecting commits to the maint branch [1]. The
> > exact
> > > > > > commands I used:
> > > > > >
> > > > > > git clone g...@github.com:apache/arrow.git
> > > > > > cd arrow/dev
> > > > > > python3 -m venv venv
> > > > > > source venv/bin/activate
> > > > > > pip install -e archery
> > > > > > pip install GitPython jira semver jinja2
> > > > > > archery release cherry-pick 4.0.1
> > > > > > # ran the commands it printed one by one
> > > > > >
> > > > > > There is a commit that does not apply cleanly. Could someone from
> > C++
> > > > > merge
> > > > > > it? What to do:
> > > > > >
> > > > > > Run `git fetch upstream && git checkout maint-4.0.x && git
> > cherry-pick
> > > > > > ce2861713472818eea264957de4cc83d5a2c567c`
> > > > > >
> > > > > > This will trigger a merge conflict. Resolve and push to
> > maint-4.0.x on
> > > > > > apache/arrow.
> > > > > Hi,
> > > > >
> > > > > I've recreated the maintenance branch and resolved the conflicts.
> > > > > According to the release curation script [1], we have 4 issues
> > without
> > > > > available patches:
> > > > > - https://issues.apache.org/jira/browse/ARROW-12769 (Joris has just
> > > > > submitted a PR)
> > > > > - https://issues.apache.org/jira/browse/ARROW-12619
> > > > > - https://issues.apache.org/jira/browse/ARROW-12604
> > > > > - https://issues.apache.org/jira/browse/ARROW-12603
> > > > >
> > > > > [1]: https://gist.github.com/kszucs/ee55942138caf14845fdecf43edb3ecc
> > > > > >
> > > > > > Thanks,
> > > > > > Jorge
> > > > > >
> > > > > > https://github.com/apache/arrow/tree/maint-4.0.x
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, May 15, 2021 at 1:23 AM Neal Richardson <
> > > > > neal.p.richard...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for taking this on. Krisztián can confirm the details (or
> > point
> > > > > you
> > > > > > > to where this is documented), but based on past patch releases, I
> > > > > believe
> > > > > > > you would make a `maint-4.0.x` branch off of the existing
> > > > > `release-4.0.0`
> > > > > > > branch, cherry-pick the commits associated with the JIRAs tagged
> > for
> > > > > 4.0.1
> > > > > > > (I believe there are utility scripts to help with this), and run
> > the
> > > > > > > release script that bumps the versions.
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 14, 2021 at 12:32 PM Jorge Cardoso Leitão <
> > > > > > > jorgecarlei...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Just to make sure: the goal is to cherry-pick all changes
> > targeted
> > > > > for
> > > > > > > > 4.0.1 into a branch and release from there? If that is the
> > case,
> > > > > then I
> > > > > > > > will create a branch and start cherry-picking the changes in
> > order
> > > > > they
> > > > > > > > were merged in master.
> >

Re: Long title on github page

2021-05-17 Thread Weston Pace
> “Apache Arrow is a format and compute kernel for in-memory data”

I like this but no one ever knows what "in-memory" means (or they just
think 'data is always in memory').  How about...

"Apache Arrow is a format and compute kernel for zero-copy processing
and sharing of data."

or...

"Apache Arrow is a format and compute kernel for processing and
sharing data without serialization overhead."

Although marshalling[1] would probably be a more precise word, it is
not as well known.

[1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
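
To make the "zero-copy" versus "serialization overhead" wording concrete,
here is a small sketch that uses only the Python standard library, not Arrow
itself. The `array` column and the variable names are illustrative
assumptions; the point is only that a view shares memory with the original
buffer, while a serialize/deserialize round trip produces an independent copy.

```python
import pickle
from array import array

# A column of 64-bit integers stored contiguously in memory.
column = array("q", range(8))

# Zero-copy: a memoryview exposes the same memory, and slicing it copies nothing.
view = memoryview(column)
window = view[2:6]

# Serialization: pickle encodes the values into a new byte string, and loading
# decodes them into a second, independent array.
restored = pickle.loads(pickle.dumps(column))

# A write to the original is visible through the view but not in the copy.
column[3] = -1
shared_value = window[1]    # reads the element at index 3 of `column`
copied_value = restored[3]  # still the value from before the write
```

Here `shared_value` reflects the later write while `copied_value` does not,
which is the distinction the two proposed descriptions are trying to name.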

On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
 wrote:
>
> a few ideas
>
> github.com/apache/arrow - Apache Arrow is an efficient library for big data
> processing and sharing
>
> github.com/apache/arrow - Apache Arrow is a computational tool for
> processing, storing and sharing large datasets
>
> github.com/apache/arrow - Apache Arrow is a  fast and simple library for
> big data analytics
>
> *github.com/apache/arrow  - Apache Arrow is
> a powerful workhorse for analytic operations on modern hardware*
>
>
> On Mon, May 17, 2021 at 3:13 PM Julian Hyde  wrote:
>
> > Alright, well, whatever it is, it must fit into one breath. If the
> > high-concept pitch is successful, people will stick around for the full
> > pitch.
> >
> > Words such as “platform” and “enable” are noise. You say “platform”, they
> > start to say “what exactly do you mean by platform”, the elevator doors
> > open, and they’re gone.
> >
> > “Apache Arrow is a format and compute kernel for in-memory data”
> >
> >
> > > On May 17, 2021, at 12:03 PM, Eduardo Ponce  wrote:
> > >
> > > One more suggestion for the bucket:
> > > "Apache Arrow is a computational platform for efficient in-memory data
> > > representation and processing."
> > >
> > > On Mon, May 17, 2021 at 2:49 PM Wes McKinney 
> > wrote:
> > >
> > >> I think less is better in the description, but unfortunately the
> > >> association of Arrow as being "just a data format" has been actively
> > >> harmful in some ways to community growth. We have a data format, yes,
> > >> but we are also creating a computational platform to go hand-in-hand
> > >> with the data format to make it easier to build fast applications that
> > >> use the data format. So the description needs to capture both of these
> > >> ideas.
> > >>
> > >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> > >> wrote:
> > >>>
> > >>> I think that the “cross-language development platform for” is noise.
> > >> (I’m sure that JPEG developers think that JPEG is a “cross-language
> > >> development platform” too. But it isn’t. It is an image format.)
> > >>>
> > >>> "Apache Arrow is data format for efficient in-memory processing.”
> > >>>
> > >>> I’ll note that In marketing speak, we are developing a high-concept
> > >> pitch [1] here. Every company needs a name, a brand, a high-concept
> > pitch,
> > >> and 3- or 4-sentence description. But every Apache project needs these
> > too.
> > >> It’s worth spending the time on the description, also, and then use
> > them in
> > >> all the places that we describe Arrow.
> > >>>
> > >>> Julian
> > >>>
> > >>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> > >>>
> > >>>
> > >>>
> > >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce
> > >> wrote:
> > >>>>
> > >>>> I agree with Nate's and Brian's suggestions, but would like to add
> > >> that we
> > >>>> can make it a one-liner for more conciseness and consistency with
> > other
> > >>>> Apache projects.
> > >>>> Apologies if it seems I am going around the suggestions loop again.
> > >>>>
> > >>>> "Apache Arrow is a cross-language development platform enabling
> > >> efficient
> > >>>> in-memory data processing and transport."
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette
> > >> wrote:
> > >>>>
> > > Thank you for bringing this up Dominik. I sampled some of the
> > >> descriptions
> > > for other Apache projects I frequent, the ones with a meaningful
> > > description have a single sentence:
> > >
> > > github.com/apache/spark - Apache Spark - A unified analytics engine
> > >> for
> > > large-scale data processing
> > > github.com/apache/beam - Apache Beam is a unified programming model
> > >> for
> > > Batch and Streaming
> > > github.com/apache/avro - Apache Avro is a data serialization system
> > >
> > > Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> > >> "
> > > as the description.
> > >
> > > +1 for Nate's suggestion "Apache Arrow is a cross-language
> > development
> > > platform for in-memory data. It enables systems to process and
> > >> transport
> > > data more efficiently."
> > >
> > > On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> > >> wrote:
> > >
> > >> It's probably best for description to limit mentions of specific
> > >> features. There are some high level features mention

Re: Long title on github page

2021-05-17 Thread Wes McKinney
On Mon, May 17, 2021 at 4:58 PM Weston Pace wrote:
>
> > “Apache Arrow is a format and compute kernel for in-memory data”
>
> I like this but no one ever knows what "in-memory" means (or they just
> think 'data is always in memory').  How about...
>
> "Apache Arrow is a format and compute kernel for zero-copy processing
> and sharing of data."
>
> or...
>
> "Apache Arrow is a format and compute kernel for processing and
> sharing data without serialization overhead."

A few issues with this:

* The multiple-programming-language aspect is unclear (is it a single
piece of software, or multiple pieces of software?)
* The development-platform aspect is unclear

I see that some people don't like the word "platform". Some people
come to this project and want to find an end-to-end application,
rather than a developer toolkit that they can use to build
applications. Perhaps we should be more explicit and use
"computational development toolkit" instead of "platform".

> Although marshalling[1] would probably be a more precise word it is
> not as well known.
>
> [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
>
> On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
>  wrote:
> >
> > a few ideas
> >
> > github.com/apache/arrow - Apache Arrow is an efficient library for big data
> > processing and sharing
> >
> > github.com/apache/arrow - Apache Arrow is a computational tool for
> > processing, storing and sharing large datasets
> >
> > github.com/apache/arrow - Apache Arrow is a  fast and simple library for
> > big data analytics
> >
> > *github.com/apache/arrow  - Apache Arrow is
> > a powerful workhorse for analytic operations on modern hardware*
> >
> >
> > On Mon, May 17, 2021 at 3:13 PM Julian Hyde  wrote:
> >
> > > Alright, well, whatever it is, it must fit into one breath. If the
> > > high-concept pitch is successful, people will stick around for the full
> > > pitch.
> > >
> > > Words such as “platform” and “enable” are noise. You say “platform”, they
> > > start to say “what exactly do you mean by platform”, the elevator doors
> > > open, and they’re gone.
> > >
> > > “Apache Arrow is a format and compute kernel for in-memory data”
> > >
> > >
> > > > On May 17, 2021, at 12:03 PM, Eduardo Ponce  wrote:
> > > >
> > > > One more suggestion for the bucket:
> > > > "Apache Arrow is a computational platform for efficient in-memory data
> > > > representation and processing."
> > > >
> > > > On Mon, May 17, 2021 at 2:49 PM Wes McKinney 
> > > wrote:
> > > >
> > > >> I think less is better in the description, but unfortunately the
> > > >> association of Arrow as being "just a data format" has been actively
> > > >> harmful in some ways to community growth. We have a data format, yes,
> > > >> but we are also creating a computational platform to go hand-in-hand
> > > >> with the data format to make it easier to build fast applications that
> > > >> use the data format. So the description needs to capture both of these
> > > >> ideas.
> > > >>
> > > >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> > > >> wrote:
> > > >>>
> > > >>> I think that the “cross-language development platform for” is noise.
> > > >>> (I’m sure that JPEG developers think that JPEG is a “cross-language
> > > >>> development platform” too. But it isn’t. It is an image format.)
> > > >>>
> > > >>> "Apache Arrow is a data format for efficient in-memory processing.”
> > > >>>
> > > >>> I’ll note that in marketing speak, we are developing a high-concept
> > > >>> pitch [1] here. Every company needs a name, a brand, a high-concept pitch,
> > > >>> and 3- or 4-sentence description. But every Apache project needs these too.
> > > >>> It’s worth spending the time on the description, also, and then use them in
> > > >>> all the places that we describe Arrow.
> > > >>>
> > > >>> Julian
> > > >>>
> > > >>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> > > >>>
> > > >>>
> > > >>>
> > > >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce wrote:
> > > >>>>
> > > >>>> I agree with Nate's and Brian's suggestions, but would like to add that we
> > > >>>> can make it a one-liner for more conciseness and consistency with other
> > > >>>> Apache projects.
> > > >>>> Apologies if it seems I am going around the suggestions loop again.
> > > >>>>
> > > >>>> "Apache Arrow is a cross-language development platform enabling efficient
> > > >>>> in-memory data processing and transport."
> > > >>>>
> > > >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette wrote:
> > > >>>>
> > > >>>>> Thank you for bringing this up Dominik. I sampled some of the descriptions
> > > >>>>> for other Apache projects I frequent, the ones with a meaningful
> > > >>>>> description have a single sentence:
> > > >>>>>
> > > >>>>> github.com/apache/spark - Apache Spark - A unified analytics engine for
> > > >>>>> large-scale data processing
> > > >>>>> github.com/

Re: Long title on github page

2021-05-17 Thread Micah Kornfield
How about: "Apache Arrow is a collection of specifications, cross language
libraries and applications focused on efficient sharing and processing of
structured data."

On Mon, May 17, 2021 at 3:06 PM Wes McKinney  wrote:

> On Mon, May 17, 2021 at 4:58 PM Weston Pace  wrote:
> >
> > > “Apache Arrow is a format and compute kernel for in-memory data”
> >
> > I like this but no one ever knows what "in-memory" means (or they just
> > think 'data is always in memory').  How about...
> >
> > "Apache Arrow is a format and compute kernel for zero-copy processing
> > and sharing of data."
> >
> > or...
> >
> > "Apache Arrow is a format and compute kernel for processing and
> > sharing data without serialization overhead."
>
> A few issues with this:
>
> * Multiple PL aspect unclear (is a single piece of software, or
> multiple pieces of software?)
> * Development platform aspect unclear
>
> I see that some people don't like the word "platform". Some people
> come to this project and want to find an end-to-end application,
> rather than a developer toolkit that they can use to build
> applications. Perhaps we should be more explicit and use
> "computational development toolkit" instead of "platform".
>
> > Although marshalling[1] would probably be a more precise word it is
> > not as well known.
> >
> > [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)

Re: Long title on github page

2021-05-17 Thread Mauricio Vargas
more marketed:
How about: "Apache Arrow is a format and language-agnostic library focused
on efficient sharing and processing of structured data."

On Mon, May 17, 2021 at 6:25 PM Micah Kornfield 
wrote:

> How about: "Apache Arrow is a collection of specifications, cross language
> libraries and applications focused on efficient sharing and processing of
> structured data."

Re: Long title on github page

2021-05-17 Thread Weston Pace
I'd avoid the word "structured" as it is somewhat ill-defined.

On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
 wrote:
>
> more marketed:
> How about: "Apache Arrow is a format and language-agnostic library focused
> on efficient sharing and processing of structured data."

Language Silos and transpilers

2021-05-17 Thread Arun Sharma
Hello:

I just watched a video about Apache Arrow (
https://www.youtube.com/watch?v=-ZikPi2nmSI) that discussed Language Silos
and one of the questions towards the end was about being able to translate
automatically from one language to another.

I'm not aware of the specific requirements for one to be able to write a
query in, say, Python and push it down into a Rust-based query engine. But I
wanted to share what I've been working on for the last few months in the
hope that people in this community can give me feedback about the
usefulness of this work and any specific feature requests you may have.

The project is called py2many. It transpiles a small subset of Python to
Rust, C++, Julia and 4 other languages. More info here:

https://adsharma.github.io/py2many0.2.1/
https://github.com/adsharma/py2many
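
As a rough illustration of the kind of input a transpiler like this works with, the sketch below is a small, fully type-annotated Python function. It is an assumption that this style falls inside the supported subset; the function name `fib` and the chosen types are illustrative and not taken from the py2many documentation.

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number (fib(0) == 0, fib(1) == 1)."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


if __name__ == "__main__":
    print(fib(10))
```

Explicit annotations on parameters and return values give a translator the static type information that targets such as Rust or C++ require.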

 -Arun


Nightly Builds Report 2021-05-17

2021-05-17 Thread Mauricio Vargas
*NIGHTLY BUILDS REPORT*

2021-05-17


*New reported errors*


*GitHub*


*Build: *github-test-conda-python-3.8-spark-master


Error type: Internal

Progress: No work has yet been done on this issue.

First time issued: 2021-05-13 (4 days ago)

Ticket: ARROW-12817


*Persisting errors*


*Azure*


*Build: *azure-test-r-rhub-ubuntu-gcc-release-latest


*Error type:* External

*Progress:* No work has yet been logged on this issue.

*First time issued:* 2021-05-14 (3 days ago)

*Ticket:* ARROW-12795

*Comment:* I need to send a PR to R-Hub and fix bit64 installation on the
Docker image.


*Build: *azure-test-r-rstudio-r-base-3.6-opensuse42


*Error type:* External

*Progress:* A PR was sent to RStudio; we’ll wait for them to change the ICU
build in RSPM.

*First time issued:* 2021-05-13 (4 days ago)

*Ticket:* ARROW-12786

*Comment:* This error will persist until the RSPM binaries are changed.


*Build:* azure-conda-osx-clang-py36-r36


*Error type:* Internal

*Progress:* No work has yet been done on this issue.

*First time issued:* 2021-05-13 (4 days ago)

*Ticket:* ARROW-12782

*Related errors:* azure-conda-osx-clang-py36-r40


*Build:* azure-test-ubuntu-20.10-docs


*Error type:* External

*Progress:* No work has yet been done on this issue.

*First time issued:* 2021-05-13 (4 days ago)

*Ticket:* ARROW-12765

*Comment:* This is due to a bug in the newly released Sphinx 4.0. Last Friday
the idea was to pin to 3.5 and wait for a bug-fix release over the weekend. We
need to wait longer and this should go away.


*Build: *azure-conda-win-vs2017-py36-r36


*Error type:* Internal

*Progress:* Kristián opened PR 10322, which uses a crossbow command to
support multiple arguments instead of pattern options.

*First time issued:* 2021-05-12 (5 days ago)

*Ticket:* ARROW-12764

*Related errors:* azure-conda-win-vs2017-py37-r40,
azure-conda-win-vs2017-py38, and azure-conda-win-vs2017-py39

*Comment:* This error will persist until PR 10322 is validated by the
team and then merged.


*Build: *azure-test-r-linux-valgrind


*Error type:* Internal

*Progress:* No work has yet been logged on this issue.

*First time issued:* 2021-05-10 (7 days ago)

*Ticket:* ARROW-12708


*GitHub*


*Build: *github-test-conda-python-3.7-turbodbc-latest


*Error type:* Internal

*Progress:* No work has yet been done on this issue.

*First time issued:* 2021-05-12 (5 days ago)

*Ticket:* ARROW-12783

*Related errors:* github-test-conda-python-3.7-turbodbc-master

*Comment:* This error will persist until we fix it with a PR.


*Fixes*


From the errors mentioned in the report from 2021-05-14, these tickets were
solved:

   - ARROW-12785 - [CI] the r-devdocs build errors when brew installing gcc


Re: Python Flight example with query command

2021-05-17 Thread Tanveer Ahmad - EWI
Hi David,


Thank you for the reply.


I have found that the Arrow DataFusion project offers something similar to
what I am looking for. Do you think this project implements the FlightSQL
proposal?


Regards,
Tanveer Ahmad

From: David Li 
Sent: Saturday, May 15, 2021 3:10:53 PM
To: dev@arrow.apache.org
Subject: Re: Python Flight example with query command

Hey Tanveer,

Something like this should work:

$ python examples/flight/client.py put localhost:1234 foo.csv
File Name: foo.csv
Table rows= 1
   a  b
0  1  2
$ python examples/flight/client.py get localhost:1234 -p foo.csv
Ticket: 

   a  b
0  1  2

Note that Flight itself does not implement SQL query functionality or
anything of the sort. It is a common misconception, I think
exacerbated since Flight is often discussed in the context of products
like Dremio which implement such functionality on top of Flight. But
really, Flight itself is just a 'dumb pipe' for Arrow data for
building such systems.
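
To make the 'dumb pipe' point concrete, here is a minimal pure-Python sketch. It is a toy model and deliberately does not use the pyarrow.flight API: `ToyStore`, its methods, and the `head <path> <n>` command format are all hypothetical. It only illustrates that a path lookup needs no interpretation, while a command is opaque bytes that the application itself has to parse and execute.

```python
import csv
import io


class ToyStore:
    """Hypothetical in-memory stand-in for a Flight-like server (not pyarrow.flight)."""

    def __init__(self):
        self._tables = {}  # path -> list of row dicts

    def put(self, path, csv_text):
        """Store the rows of a CSV document under the given path."""
        self._tables[path] = list(csv.DictReader(io.StringIO(csv_text)))

    def get_by_path(self, path):
        """Path lookup: the transport only has to find the stored rows."""
        return self._tables[path]

    def get_by_command(self, command):
        """Command lookup: the bytes are opaque to the transport, so the
        application defines and interprets them itself.

        This toy only understands b"head <path> <n>".
        """
        verb, path, count = command.decode("utf-8").split()
        if verb != "head":
            raise ValueError(f"unsupported command: {verb!r}")
        return self._tables[path][: int(count)]


store = ToyStore()
store.put("foo.csv", "a,b\n1,2\n3,4\n")
print(store.get_by_path("foo.csv"))
print(store.get_by_command(b"head foo.csv 1"))
```

Supporting something like "select * from ? limit 10" would mean replacing the tiny `head` parser with a real query engine on the server side, which is the part Flight leaves to the system built on top of it.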

You may be interested in the FlightSQL proposal which defines at least
an interface for database systems to make themselves available over
Flight and for clients to generically query them. However that
proposal has been stalled for a while.

Best,
David

On 2021/05/15 12:15:26, Tanveer Ahmad - EWI  wrote:
> Hi all,
>
>
> For the Python Flight example, I can start the server (python server.py ->
> Serving on grpc+tcp://localhost:5005), and the client can put data (python
> client.py put localhost:5005 mycsv.csv) and get it back (python client.py
> get localhost:5005 -p mycsv.csv) with the -p (path) option.
>
>
> I am wondering how to query this data, which I had already put on the
> server through the put command, using the -c (command) option (like python
> client.py get localhost:5005 -c "select * from ? limit 10").
>
>
> Thanks.
>
> Regards,
> Tanveer Ahmad
>
>


Re: Long title on github page

2021-05-17 Thread Aldrin
"Apache Arrow is a data processing library that also provides a uniform,
efficient interface for data systems."

This probably still isn't quite right, I imagine the bit about "for data
systems" needs some addition (maybe "for transport between data systems")?

My primary motivators:

   - "A data processing library":
      - Arrow provides many language bindings, but ultimately they're all
        part of the same "library ecosystem", which I think is fine to
        capture in "library"
      - A main goal of Arrow is for processing to be fast, whatever that
        processing may be
   - "uniform, efficient interface for data systems":
      - Arrow provides (or tries to) a cohesive ("uniform") interface for
        data processing (although it has several APIs to do this)
      - Also, IMO, a motivation for Arrow was a format and library to
        facilitate processing, but that provided functions and interfaces
        to easily translate into optimized data formats used by disparate
        data systems (Cassandra, Hadoop, etc.).
      - Arrow tries to be transparently zero-copy, which is part of the
        interface for efficiency
   - Arrow certainly has a data format, but that format is the crux of the
     interface (IMO). However, it also makes using other formats easy (via
     filesystem API and parquet reader/writers, etc.). So, focusing on the data
     format seems unnecessary in such a terse description.
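
The zero-copy point can be illustrated without Arrow at all. The sketch below uses only Python's standard library (`array`, `memoryview`, `pickle`), not the Arrow API, to contrast sharing one buffer with paying for a serialized copy; the variable names are illustrative.

```python
import array
import pickle

values = array.array("q", [1, 2, 3, 4])  # contiguous 64-bit integers

# Zero-copy: the memoryview exposes the array's own buffer.
view = memoryview(values)
values[0] = 100
assert view[0] == 100  # the change is visible through the view

# Serialization: a pickle round trip produces an independent copy.
copy = pickle.loads(pickle.dumps(values))
values[1] = 200
assert copy[1] == 2  # the copy does not see later changes
```

A consumer holding the view pays nothing to see the producer's data, whereas the pickled copy costs an encode, a decode, and a second buffer.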


Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Mon, May 17, 2021 at 5:07 PM Weston Pace  wrote:

> I'd avoid the word "structured" as it is somewhat ill-defined.