Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
The VOTE carries with 4 binding +1 votes, 3 non-binding +1 votes and one binding +0 vote. I'm starting the post-release tasks; if anyone wants to help, please let me know. On Fri, Feb 7, 2020 at 12:25 AM Krisztián Szűcs wrote: > > So far we have the following votes: > > +0 (binding) > +1

[jira] [Created] (ARROW-7788) Add schema conversion support for map type

2020-02-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7788: -- Summary: Add schema conversion support for map type Key: ARROW-7788 URL: https://issues.apache.org/jira/browse/ARROW-7788 Project: Apache Arrow Issue

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
So far we have the following votes: +0 (binding) +1 (binding) +1 (non-binding) +1 (binding) +1 (non-binding) +1 (binding) +1 (non-binding) +1 (binding) 4 +1 (binding) 3 +1 (non-binding) I'm waiting for votes until tomorrow morning (UTC), then I'm closing the VOTE. Thanks everyone!

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
Testing on macOS Catalina Binaries: OK Wheels: OK Verified on macOS and on Linux. On Linux the verification script failed for Python 3.5 and manylinux2010 and manylinux2014 with an unsupported platform tag. I've manually checked these wheels in the python:3.5 Docker image, and the wheels were

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread David Li
Catching up on questions here... > Typically you can solve this by having enough IO concurrency at once :-) > I'm not sure having sophisticated global coordination (based on which > algorithms) would bring anything. Would you care to elaborate? We aren't proposing *sophisticated* global

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou wrote: > > > On 06/02/2020 at 20:20, Wes McKinney wrote: > >> Actually, on a more high-level basis, is the goal to prefetch for > >> sequential consumption of row groups? > >> > > > > Essentially yes. One "easy" optimization is to prefetch the

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 20:20, Wes McKinney wrote: >> Actually, on a more high-level basis, is the goal to prefetch for >> sequential consumption of row groups? >> > > Essentially yes. One "easy" optimization is to prefetch the entire > serialized row group. This is an evolution of that idea where

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou wrote: > > On 06/02/2020 at 19:40, Antoine Pitrou wrote: > > > > On 06/02/2020 at 19:37, Wes McKinney wrote: > >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou > wrote: > >> > >>> On 06/02/2020 at 16:26, Wes McKinney wrote: > > This

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:41 PM Antoine Pitrou wrote: > > On 06/02/2020 at 19:37, Wes McKinney wrote: > > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > > >> On 06/02/2020 at 16:26, Wes McKinney wrote: > >>> > >>> This seems useful, too. It becomes a question of where do you want to

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 19:40, Antoine Pitrou wrote: > > On 06/02/2020 at 19:37, Wes McKinney wrote: >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: >> >>> On 06/02/2020 at 16:26, Wes McKinney wrote: This seems useful, too. It becomes a question of where do you want to

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Neal Richardson
I re-verified the macOS wheels and they worked, but I had to hard-code `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the Cython error I reported previously. I tried to set that env var dynamically based on your current OS version but didn't succeed in getting it passed through to pytest, despite

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 19:37, Wes McKinney wrote: > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > >> On 06/02/2020 at 16:26, Wes McKinney wrote: >>> >>> This seems useful, too. It becomes a question of where do you want to >>> manage the cached memory segments, however you obtain them.

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > On 06/02/2020 at 16:26, Wes McKinney wrote: > > > > This seems useful, too. It becomes a question of where do you want to > > manage the cached memory segments, however you obtain them. I'm > > arguing that we should not have much custom

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 16:26, Wes McKinney wrote: > > This seems useful, too. It becomes a question of where do you want to > manage the cached memory segments, however you obtain them. I'm > arguing that we should not have much custom code in the Parquet > library to manage the prefetched segments

[jira] [Created] (ARROW-7786) [R] Wire up check_metadata in Table.Equals method

2020-02-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7786: -- Summary: [R] Wire up check_metadata in Table.Equals method Key: ARROW-7786 URL: https://issues.apache.org/jira/browse/ARROW-7786 Project: Apache Arrow

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On 06/02/2020 at 17:07, Wes McKinney wrote: > In case folks are interested in how some other systems deal with IO > management / scheduling, the comments in > > https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h > > and related files might be interesting Thanks.

[jira] [Created] (ARROW-7785) [C++] sparse_tensor.cc is extremely slow to compile

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7785: - Summary: [C++] sparse_tensor.cc is extremely slow to compile Key: ARROW-7785 URL: https://issues.apache.org/jira/browse/ARROW-7785 Project: Apache Arrow

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
In case folks are interested in how some other systems deal with IO management / scheduling, the comments in https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h and related files might be interesting On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney wrote: > > On Thu, Feb 6,

[jira] [Created] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7784: - Summary: [C++] diff.cc is extremely slow to compile Key: ARROW-7784 URL: https://issues.apache.org/jira/browse/ARROW-7784 Project: Apache Arrow Issue

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou wrote: > > On Wed, 5 Feb 2020 15:46:15 -0600 > Wes McKinney wrote: > > > > I'll comment in more detail on some of the other items in due course, > > but I think this should be handled by an implementation of > > RandomAccessFile (that wraps a naked

[jira] [Created] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7783: - Summary: [C++] ARROW_DATASET should enable ARROW_COMPUTE Key: ARROW-7783 URL: https://issues.apache.org/jira/browse/ARROW-7783 Project: Apache Arrow Issue

[jira] [Created] (ARROW-7782) Losing index information when using write_to_dataset with partition_cols

2020-02-06 Thread Ludwik Bielczynski (Jira)
Ludwik Bielczynski created ARROW-7782: - Summary: Losing index information when using write_to_dataset with partition_cols Key: ARROW-7782 URL: https://issues.apache.org/jira/browse/ARROW-7782

[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault

2020-02-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7781: Summary: [C++][Dataset] Filtering on a non-existent column gives a segfault Key: ARROW-7781 URL: https://issues.apache.org/jira/browse/ARROW-7781

[jira] [Created] (ARROW-7780) [Release] Fix Windows wheel RC verification script given lack of "m" ABI tag in Python 3.8

2020-02-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7780: -- Summary: [Release] Fix Windows wheel RC verification script given lack of "m" ABI tag in Python 3.8 Key: ARROW-7780 URL: https://issues.apache.org/jira/browse/ARROW-7780
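
For readers unfamiliar with the "m" ABI tag mentioned in the summary above: default CPython builds up to 3.7 carry an "m" flag in the wheel ABI tag (for example `cp37m`), and Python 3.8 dropped it (`cp38`). The snippet below is only a minimal sketch of how a verification script might compute the expected tag. The helper names `cpython_abi_tag` and `wheel_tag` are hypothetical and are not taken from the Arrow release scripts, and the logic assumes a default, non-debug CPython 3.5 or later build.

```python
def cpython_abi_tag(major: int, minor: int) -> str:
    """Return the default CPython ABI tag for a Python 3.5+ version.

    Hypothetical helper for illustration only. Assumes a default
    (non-debug) CPython build: versions before 3.8 carry the "m" flag,
    3.8 and later do not.
    """
    suffix = "m" if (major, minor) < (3, 8) else ""
    return f"cp{major}{minor}{suffix}"


def wheel_tag(major: int, minor: int, platform: str) -> str:
    """Build the '<python>-<abi>-<platform>' part of a wheel filename."""
    python_tag = f"cp{major}{minor}"
    return f"{python_tag}-{cpython_abi_tag(major, minor)}-{platform}"


if __name__ == "__main__":
    # Print the expected tags for the Windows wheels of two Python versions.
    for version in [(3, 7), (3, 8)]:
        print(wheel_tag(*version, "win_amd64"))
```

A script that hard-codes the "m" suffix for every version would look for a `cp38m` wheel that does not exist, which is the kind of mismatch the issue title describes.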

Re: [C++] Arrow added to OSS-Fuzz

2020-02-06 Thread Antoine Pitrou
Hello, A quick update: since Arrow C++ started being fuzzed in OSS-Fuzz, 41 issues (usually crashes) on invalid input have been found, 35 of which have already been corrected. We plan to expand the fuzzed areas to cover Parquet files, as well as serialized Tensor and SparseTensor data.

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 16:37:17 -0500 David Li wrote: > > As a separate step, prefetching/caching should also make use of a > global (or otherwise shared) IO thread pool, so that parallel reads of > different files implicitly coordinate work with each other as well. > Then, you could queue up reads

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 15:46:15 -0600 Wes McKinney wrote: > > I'll comment in more detail on some of the other items in due course, > but I think this should be handled by an implementation of > RandomAccessFile (that wraps a naked RandomAccessFile) with some > additional methods, rather than adding