[jira] [Created] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?

2020-04-30 Thread Anish Biswas (Jira)
Anish Biswas created ARROW-8642:
---

 Summary: Is there a good way to convert data types from numpy 
types to pyarrow DataType?
 Key: ARROW-8642
 URL: https://issues.apache.org/jira/browse/ARROW-8642
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Anish Biswas


Pretty much what the title says. Suppose I have a numpy array and it's of 
numpy.int8 type. How do I convert that to a pyarrow.DataType intuitively? I 
thought a dictionary lookup table might work, but perhaps there is a better 
way?

Why do I need this? I am trying to build pyarrow arrays with from_buffers(), whose 
first parameter is essentially a pyarrow.DataType. I have validity bitmaps as 
uint8 buffers, which is why I am using from_buffers() and not pyarrow.array().
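
For illustration, a minimal sketch of the mapping being asked about, assuming 
pyarrow.from_numpy_dtype() is available for the dtype conversion (the array 
values and buffer handling below are only an example, not part of the original 
report):

{code:python}
import numpy as np
import pyarrow as pa

np_arr = np.array([1, 2, 3], dtype=np.int8)

# Map the NumPy dtype to the corresponding Arrow type.
arrow_type = pa.from_numpy_dtype(np_arr.dtype)  # int8

# from_buffers() for a primitive type takes [validity_bitmap, data];
# passing None for the validity bitmap means "all values valid".
data_buf = pa.py_buffer(np_arr.tobytes())
arr = pa.Array.from_buffers(arrow_type, len(np_arr), [None, data_buf])
{code}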



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

2020-04-30 Thread Rémi Dettai
Hi!

Does your point 1 also apply to the AWS SDK dependency? Currently it seems
that it cannot be built in BUNDLED mode. As stated in
https://issues.apache.org/jira/browse/ARROW-8565, I struggled a lot to make
a static build with the S3 dependency activated! I would really like to
help on this because it is very important for my use case that we can
assemble compact builds of Arrow, but I'm still very uncomfortable with
CMake :-(

Thanks for your amazing work!

Remi

On Tue, Apr 28, 2020 at 16:22, Wes McKinney  wrote:

> hi folks,
>
> I would like to highlight some outstanding problems with our packages
>
> 1. Our Arrow C++ static libraries are generally unusable.
>
> Whenever -DARROW_JEMALLOC=ON or any dependency is built in BUNDLED
> mode, libarrow.a (or other static libraries) cannot be used for
> linking. That's because the static library has a dependency on the
> bundled static libraries, which are _not_ packaged with the Arrow static
> libraries.
>
> The preferred solution seems to be ARROW-7605. I demonstrated how this
> works in
>
> https://github.com/apache/arrow/pull/6220
>
> but I need someone to help with the PR to deal with other BUNDLED
> dependencies. I likely won't be able to complete the PR myself in time
> for the next release.
>
> 2. Our Python packages are unacceptably large
>
> On Linux, wheels are now 64MB and after installation take up 218MB.
> There is an immediate serious problem that has gone unresolved that is
> easier to fix and a separate structural problem that is more difficult
> to fix. See the directory listing
>
> https://gist.github.com/wesm/57bd99798a2fa23ef3cb5e4b18b5a248
>
> We're duplicating all of the shared libraries inside the wheel and on
> disk. It's unfortunate that we've allowed this problem for a whole
> year or more
>
> https://issues.apache.org/jira/browse/ARROW-5082
>
> I also recently opened
>
> https://issues.apache.org/jira/browse/ARROW-8518
>
> which describes a proposal to create some tools to assist with
> building "parent" and "child" Python packages. This would enable us to
> ship components like Flight and Gandiva as separate wheels. This is a
> large project but one that will ultimately be necessary for the
> long-term scalability and sustainability of the project.
>
> I am not able to personally work on either of these projects in the
> current release cycle, but I hope that some progress can be made on
> these since they have lingered on for a long time, and it would be
> good for us to "put our best foot forward" with the 1.0.0 release.
>
> Thanks,
> Wes
>


[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8643:


 Summary: [Python] Tests with pandas master failing due to freq 
assertion 
 Key: ARROW-8643
 URL: https://issues.apache.org/jira/browse/ARROW-8643
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Nightly pandas master tests are failing, e.g. 
https://circleci.com/gh/ursa-labs/crossbow/11858?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

This is caused by a change in pandas, see 
https://github.com/pandas-dev/pandas/pull/33815#issuecomment-620820134



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8644:


 Summary: [Python] Dask integration tests failing due to change in 
not including partition columns
 Key: ARROW-8644
 URL: https://issues.apache.org/jira/browse/ARROW-8644
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


In ARROW-3861 (https://github.com/apache/arrow/pull/7050), I "fixed" a bug where 
the partition columns were always included even when the user made a manual 
column selection.

But apparently, this behaviour was being relied upon by dask. See the failing 
nightly integration tests: 
https://circleci.com/gh/ursa-labs/crossbow/11854?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

So the best option might be to just keep the "old" behaviour for the legacy 
ParquetDataset; when using the new datasets API 
({{use_legacy_datasets=False}}), you get the new / correct behaviour.
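
A hedged sketch of the behaviour difference described above (the path and 
column names are made up, and it assumes the {{use_legacy_dataset}} keyword of 
{{pyarrow.parquet.read_table}}):

{code:python}
import pyarrow.parquet as pq

# Legacy ParquetDataset: the partition column comes back even though it was
# not part of the manual column selection (the behaviour dask relied on).
old = pq.read_table("partitioned_root/", columns=["value"],
                    use_legacy_dataset=True)

# New datasets API: the manual column selection is honoured as-is.
new = pq.read_table("partitioned_root/", columns=["value"],
                    use_legacy_dataset=False)
{code}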



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-29-0

2020-04-30 Thread Joris Van den Bossche
I opened issues to track the failing dask and pandas-master integration
tests:

https://issues.apache.org/jira/browse/ARROW-8643
https://issues.apache.org/jira/browse/ARROW-8644


On Wed, 29 Apr 2020 at 12:09, Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2020-04-29-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0
>
> Failed Tasks:
> - centos-6-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-6-amd64
> - centos-7-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-7-amd64
> - test-conda-python-3.6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-test-conda-python-3.6
> - test-conda-python-3.7-dask-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.7-dask-latest
> - test-conda-python-3.7-pandas-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.7-pandas-master
> - test-conda-python-3.8-dask-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.8-dask-master
> - test-conda-python-3.8-jpype:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.8-jpype
> - test-ubuntu-18.04-docs:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-ubuntu-18.04-docs
> - test-ubuntu-18.04-python-3:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-test-ubuntu-18.04-python-3
> - ubuntu-xenial-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-ubuntu-xenial-amd64
> - wheel-osx-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp35m
> - wheel-osx-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp36m
> - wheel-osx-cp38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp38
> - wheel-win-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp36m
> - wheel-win-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp37m
> - wheel-win-cp38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp38
>
> Succeeded Tasks:
> - centos-8-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-8-amd64
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py38
> - debian-buster-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-debian-buster-amd64
> - debian-stretch-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-debian-stretch-amd64
> - gandiva-jar-osx:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-gandiva-jar-xenial
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-homebrew-cpp
> - homebrew-r-autobrew:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-homebrew-r-autobrew
> - test-conda-cpp-valgrind:
>

[jira] [Created] (ARROW-8645) [C++] Missing gflags dependency for plasma

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8645:
--

 Summary: [C++] Missing gflags dependency for plasma
 Key: ARROW-8645
 URL: https://issues.apache.org/jira/browse/ARROW-8645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


The documentation build fails because gflags is not installed and CMake doesn't 
build the bundled version of it.

Introduced by 
https://github.com/apache/arrow/commit/dfc14ef24ed54ff757c10a26663a629ce5e8cebf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8646) Allow UnionListWriter to write null values

2020-04-30 Thread Thippana Vamsi Kalyan (Jira)
Thippana Vamsi Kalyan created ARROW-8646:


 Summary: Allow UnionListWriter to write null values
 Key: ARROW-8646
 URL: https://issues.apache.org/jira/browse/ARROW-8646
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Thippana Vamsi Kalyan


UnionListWriter has no provision to skip an index in order to write a null value 
into the list.

It should allow calling writeNull.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-04-30-0

2020-04-30 Thread Crossbow


Arrow Build Report for Job nightly-2020-04-30-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0

Failed Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-centos-6-amd64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-centos-7-amd64
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-win-vs2015-py37
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-ubuntu-18.04-docs
- test-ubuntu-18.04-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-test-ubuntu-18.04-python-3
- ubuntu-xenial-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-ubuntu-xenial-amd64
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-travis-wheel-osx-cp35m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-debian-buster-amd64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-debian-stretch-amd64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-azure-test-conda-python-3.6
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-29-0

2020-04-30 Thread Krisztián Szűcs
I suggest creating a GitHub Actions workflow to trigger these integration
tests on pull requests when the relevant modules have changed:
parquet.py, dataset.pyx, etc.

We have plenty of build failures, and I'm trying to go through them.
Given the regularly occurring nightly errors, we should move some
of the sensitive builds to run on each pull request; otherwise we
need to keep up with these post-merge issues.

On Thu, Apr 30, 2020 at 10:59 AM Joris Van den Bossche
 wrote:
>
> I opened issues to track the failing dask and pandas-master integration
> tests:
>
> https://issues.apache.org/jira/browse/ARROW-8643
> https://issues.apache.org/jira/browse/ARROW-8644
>
>
> On Wed, 29 Apr 2020 at 12:09, Crossbow  wrote:
>
> >
> > Arrow Build Report for Job nightly-2020-04-29-0
> >
> > All tasks:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0
> >
> > Failed Tasks:
> > - centos-6-amd64:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-6-amd64
> > - centos-7-amd64:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-7-amd64
> > - test-conda-python-3.6:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-test-conda-python-3.6
> > - test-conda-python-3.7-dask-latest:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.7-dask-latest
> > - test-conda-python-3.7-pandas-master:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.7-pandas-master
> > - test-conda-python-3.8-dask-master:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.8-dask-master
> > - test-conda-python-3.8-jpype:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-conda-python-3.8-jpype
> > - test-ubuntu-18.04-docs:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-circle-test-ubuntu-18.04-docs
> > - test-ubuntu-18.04-python-3:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-test-ubuntu-18.04-python-3
> > - ubuntu-xenial-amd64:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-ubuntu-xenial-amd64
> > - wheel-osx-cp35m:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp35m
> > - wheel-osx-cp36m:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp36m
> > - wheel-osx-cp38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-travis-wheel-osx-cp38
> > - wheel-win-cp36m:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp36m
> > - wheel-win-cp37m:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp37m
> > - wheel-win-cp38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-appveyor-wheel-win-cp38
> >
> > Succeeded Tasks:
> > - centos-8-amd64:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-centos-8-amd64
> > - conda-linux-gcc-py36:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py36
> > - conda-linux-gcc-py37:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py37
> > - conda-linux-gcc-py38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-linux-gcc-py38
> > - conda-osx-clang-py36:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py36
> > - conda-osx-clang-py37:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py37
> > - conda-osx-clang-py38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-osx-clang-py38
> > - conda-win-vs2015-py36:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py36
> > - conda-win-vs2015-py37:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py37
> > - conda-win-vs2015-py38:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-azure-conda-win-vs2015-py38
> > - debian-buster-amd64:
> >   URL:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-29-0-github-debian-buster-amd64
> > - debian-stretch-amd64:
> >   URL:
> > https

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Joris Van den Bossche
On Thu, 30 Apr 2020 at 04:06, Wes McKinney  wrote:

> On Wed, Apr 29, 2020 at 6:54 PM David Li  wrote:
> >
> > Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
> > guaranteed to download all the files in order, but with more control,
> > you can make this more likely. You can also prevent the case where due
> > to scheduling, file N+1 doesn't even start downloading until after
> > file N+2, which can happen if you just submit all reads to a thread
> > pool, as demonstrated in the linked trace.
> >
> > And again, with this level of control, you can also decide to reduce
> > or increase parallelism based on network conditions, memory usage,
> > other readers, etc. So it is both about improving/smoothing out
> > performance, and limiting resource consumption.
> >
> > Finally, I do not mean to propose that we necessarily build all of
> > this into Arrow, just that it we would like to make it possible to
> > build this with Arrow, and that Datasets may find this interesting for
> > its optimization purposes, if concurrent reads are a goal.
> >
> > >  Except that datasets are essentially unordered.
> >
> > I did not realize this, but that means it's not really suitable for
> > our use case, unfortunately.
>
> It would be helpful to understand things a bit better so that we do
> not miss out on an opportunity to collaborate. I don't know that the
> current mode of the some of the public Datasets APIs is a dogmatic
> view about how everything should always work, and it's possible that
> some relatively minor changes could allow you to use it. So let's try
> not to be closing any doors right now
>

Note that a Dataset itself is actually ordered, AFAIK. Meaning: the list of
Fragments it is composed of is an ordered vector. It's only when e.g.
consuming scan tasks that the result might not be ordered (this is
currently the case in ToTable, but see
https://issues.apache.org/jira/browse/ARROW-8447 for an issue about
potentially changing this).


> > Thanks,
> > David
> >
> > On 4/29/20, Antoine Pitrou  wrote:
> > >
> > > On 29/04/2020 at 23:30, David Li wrote:
> > >> Sure -
> > >>
> > >> The use case is to read a large partitioned dataset, consisting of
> > >> tens or hundreds of Parquet files. A reader expects to scan through
> > >> the data in order of the partition key. However, to improve
> > >> performance, we'd like to begin loading files N+1, N+2, ... N + k
> > >> while the consumer is still reading file N, so that it doesn't have to
> > >> wait every time it opens a new file, and to help hide any latency or
> > >> slowness that might be happening on the backend. We also don't want to
> > >> be in a situation where file N+2 is ready but file N+1 isn't, because
> > >> that doesn't help us (we still have to wait for N+1 to load).
> > >
> > > But depending on network conditions, you may very well get file N+2
> > > before N+1, even if you start loading it after...
> > >
> > >> This is why I mention the project is quite similar to the Datasets
> > >> project - Datasets likely covers all the functionality we would
> > >> eventually need.
> > >
> > > Except that datasets are essentially unordered.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Sure, and we are still interested in collaborating. The main use case
we have is scanning datasets in order of the partition key; it seems
ordering is the only missing thing from Antoine's comments. However,
from briefly playing around with the Python API, an application could
manually order the fragments if so desired, so that still works for
us, even if ordering isn't otherwise a guarantee.
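
A hedged sketch of that manual ordering (the path is hypothetical, and this
assumes Dataset.get_fragments(), a per-fragment path attribute, and
Fragment.to_table() are available in pyarrow.dataset):

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/partitioned_root/", format="parquet",
                     partitioning="hive")

# Sort the fragments by path so they are consumed in partition-key order,
# then scan them one by one and stitch the results back together.
fragments = sorted(dataset.get_fragments(), key=lambda f: f.path)
table = pa.concat_tables([frag.to_table() for frag in fragments])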

Performance-wise, we would want intra-file concurrency (coalescing)
and inter-file concurrency (buffering files in order, as described in
my previous messages). Even if Datasets doesn't directly handle this,
it'd be ideal if an application could achieve this if it were willing
to manage the details. I also vaguely remember seeing some interest in
things like being able to distribute a computation over a dataset via
Dask or some other distributed computation system, which would also be
interesting to us, though not a concrete requirement.

I'd like to reference the original proposal document, which has more
detail on our workloads and use cases:
https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit
As described there, we have a library that implements both a
datasets-like API (hand it a remote directory, get back an Arrow
Table) and several optimizations to make that library perform
acceptably. Our motivation here is to be able to have a path to
migrate to using and contributing to Arrow Datasets, which we see as a
cross-language, cross-filesystem library, without regressing in
performance. (We are limited to Python and S3.)

Best,
David

On 4/29/20, Wes McKinney  wrote:
> On Wed, Apr 29, 2020 at 6:54 PM David Li  wrote:
>>
>> Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
>> guaranteed to download all the files in order, but with more control,
>> you can make this more likely. You can also prevent the case where due
>> to scheduling, file N+1 doesn't even start downloading until after
>> file N+2, which can happen if you just submit all reads to a thread
>> pool, as demonstrated in the linked trace.
>>
>> And again, with this level of control, you can also decide to reduce
>> or increase parallelism based on network conditions, memory usage,
>> other readers, etc. So it is both about improving/smoothing out
>> performance, and limiting resource consumption.
>>
>> Finally, I do not mean to propose that we necessarily build all of
>> this into Arrow, just that it we would like to make it possible to
>> build this with Arrow, and that Datasets may find this interesting for
>> its optimization purposes, if concurrent reads are a goal.
>>
>> >  Except that datasets are essentially unordered.
>>
>> I did not realize this, but that means it's not really suitable for
>> our use case, unfortunately.
>
> It would be helpful to understand things a bit better so that we do
> not miss out on an opportunity to collaborate. I don't know that the
> current mode of the some of the public Datasets APIs is a dogmatic
> view about how everything should always work, and it's possible that
> some relatively minor changes could allow you to use it. So let's try
> not to be closing any doors right now
>
>> Thanks,
>> David
>>
>> On 4/29/20, Antoine Pitrou  wrote:
>> >
>> > On 29/04/2020 at 23:30, David Li wrote:
>> >> Sure -
>> >>
>> >> The use case is to read a large partitioned dataset, consisting of
>> >> tens or hundreds of Parquet files. A reader expects to scan through
>> >> the data in order of the partition key. However, to improve
>> >> performance, we'd like to begin loading files N+1, N+2, ... N + k
>> >> while the consumer is still reading file N, so that it doesn't have to
>> >> wait every time it opens a new file, and to help hide any latency or
>> >> slowness that might be happening on the backend. We also don't want to
>> >> be in a situation where file N+2 is ready but file N+1 isn't, because
>> >> that doesn't help us (we still have to wait for N+1 to load).
>> >
>> > But depending on network conditions, you may very well get file N+2
>> > before N+1, even if you start loading it after...
>> >
>> >> This is why I mention the project is quite similar to the Datasets
>> >> project - Datasets likely covers all the functionality we would
>> >> eventually need.
>> >
>> > Except that datasets are essentially unordered.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
Hello David,

I think that what you ask is achievable with the dataset API without
much effort. You'd have to insert the pre-buffering at
ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is
essentially a generator that looks like
flatmap(Iterator<Iterator<ScanTask>>). It consumes the
fragments in order. The application consuming the ScanTasks could
control the number of scheduled tasks by looking at the IO pool load.

OTOH, it would be good if we could make this format agnostic, e.g.
offer this via a ScanOptions toggle, e.g. "readahead_files", and this
would be applicable to all formats: CSV, IPC, ...

François
[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L383-L401

On Thu, Apr 30, 2020 at 8:20 AM David Li  wrote:
>
> Sure, and we are still interested in collaborating. The main use case
> we have is scanning datasets in order of the partition key; it seems
> ordering is the only missing thing from Antoine's comments. However,
> from briefly playing around with the Python API, an application could
> manually order the fragments if so desired, so that still works for
> us, even if ordering isn't otherwise a guarantee.
>
> Performance-wise, we would want intra-file concurrency (coalescing)
> and inter-file concurrency (buffering files in order, as described in
> my previous messages). Even if Datasets doesn't directly handle this,
> it'd be ideal if an application could achieve this if it were willing
> to manage the details. I also vaguely remember seeing some interest in
> things like being able to distribute a computation over a dataset via
> Dask or some other distributed computation system, which would also be
> interesting to us, though not a concrete requirement.
>
> I'd like to reference the original proposal document, which has more
> detail on our workloads and use cases:
> https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit
> As described there, we have a library that implements both a
> datasets-like API (hand it a remote directory, get back an Arrow
> Table) and several optimizations to make that library perform
> acceptably. Our motivation here is to be able to have a path to
> migrate to using and contributing to Arrow Datasets, which we see as a
> cross-language, cross-filesystem library, without regressing in
> performance. (We are limited to Python and S3.)
>
> Best,
> David
>
> On 4/29/20, Wes McKinney  wrote:
> > On Wed, Apr 29, 2020 at 6:54 PM David Li  wrote:
> >>
> >> Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
> >> guaranteed to download all the files in order, but with more control,
> >> you can make this more likely. You can also prevent the case where due
> >> to scheduling, file N+1 doesn't even start downloading until after
> >> file N+2, which can happen if you just submit all reads to a thread
> >> pool, as demonstrated in the linked trace.
> >>
> >> And again, with this level of control, you can also decide to reduce
> >> or increase parallelism based on network conditions, memory usage,
> >> other readers, etc. So it is both about improving/smoothing out
> >> performance, and limiting resource consumption.
> >>
> >> Finally, I do not mean to propose that we necessarily build all of
> >> this into Arrow, just that it we would like to make it possible to
> >> build this with Arrow, and that Datasets may find this interesting for
> >> its optimization purposes, if concurrent reads are a goal.
> >>
> >> >  Except that datasets are essentially unordered.
> >>
> >> I did not realize this, but that means it's not really suitable for
> >> our use case, unfortunately.
> >
> > It would be helpful to understand things a bit better so that we do
> > not miss out on an opportunity to collaborate. I don't know that the
> > current mode of the some of the public Datasets APIs is a dogmatic
> > view about how everything should always work, and it's possible that
> > some relatively minor changes could allow you to use it. So let's try
> > not to be closing any doors right now
> >
> >> Thanks,
> >> David
> >>
> >> On 4/29/20, Antoine Pitrou  wrote:
> >> >
> >> > On 29/04/2020 at 23:30, David Li wrote:
> >> >> Sure -
> >> >>
> >> >> The use case is to read a large partitioned dataset, consisting of
> >> >> tens or hundreds of Parquet files. A reader expects to scan through
> >> >> the data in order of the partition key. However, to improve
> >> >> performance, we'd like to begin loading files N+1, N+2, ... N + k
> >> >> while the consumer is still reading file N, so that it doesn't have to
> >> >> wait every time it opens a new file, and to help hide any latency or
> >> >> slowness that might be happening on the backend. We also don't want to
> >> >> be in a situation where file N+2 is ready but file N+1 isn't, because
> >> >> that doesn't help us (we still have to wait for N+1 to load).
> >> >
> >> > But depending on network conditions, you may very well get file N+2
> >> > before N+1, even if you sta

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
One more point,

It would seem beneficial if we could express this in a
`RandomAccessFile::ReadAhead(vector<ReadRange>)` method: no async
buffering/coalescing would be needed. In the case of Parquet, we'd get
the _exact_ ranges computed from the metadata. This method would also
possibly benefit other filesystems, since on Linux it can call
`readahead` and/or `madvise`.

François


On Thu, Apr 30, 2020 at 8:56 AM Francois Saint-Jacques
 wrote:
>
> Hello David,
>
> I think that what you ask is achievable with the dataset API without
> much effort. You'd have to insert the pre-buffering at
> ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is
> essentially a generator that looks like
> flatmap(Iterator<Iterator<ScanTask>>). It consumes the
> fragment in-order. The application consuming the ScanTask could
> control the number of scheduled tasks by looking at the IO pool load.
>
> OTOH, It would be good if we could make this format agnostic, e.g.
> offer this via a ScanOptions toggle, e.g. "readahead_files" and this
> would be applicable to all formats, CSV, ipc, ...
>
> François
> [1] 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L383-L401
>
> On Thu, Apr 30, 2020 at 8:20 AM David Li  wrote:
> >
> > Sure, and we are still interested in collaborating. The main use case
> > we have is scanning datasets in order of the partition key; it seems
> > ordering is the only missing thing from Antoine's comments. However,
> > from briefly playing around with the Python API, an application could
> > manually order the fragments if so desired, so that still works for
> > us, even if ordering isn't otherwise a guarantee.
> >
> > Performance-wise, we would want intra-file concurrency (coalescing)
> > and inter-file concurrency (buffering files in order, as described in
> > my previous messages). Even if Datasets doesn't directly handle this,
> > it'd be ideal if an application could achieve this if it were willing
> > to manage the details. I also vaguely remember seeing some interest in
> > things like being able to distribute a computation over a dataset via
> > Dask or some other distributed computation system, which would also be
> > interesting to us, though not a concrete requirement.
> >
> > I'd like to reference the original proposal document, which has more
> > detail on our workloads and use cases:
> > https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit
> > As described there, we have a library that implements both a
> > datasets-like API (hand it a remote directory, get back an Arrow
> > Table) and several optimizations to make that library perform
> > acceptably. Our motivation here is to be able to have a path to
> > migrate to using and contributing to Arrow Datasets, which we see as a
> > cross-language, cross-filesystem library, without regressing in
> > performance. (We are limited to Python and S3.)
> >
> > Best,
> > David
> >
> > On 4/29/20, Wes McKinney  wrote:
> > > On Wed, Apr 29, 2020 at 6:54 PM David Li  wrote:
> > >>
> > >> Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
> > >> guaranteed to download all the files in order, but with more control,
> > >> you can make this more likely. You can also prevent the case where due
> > >> to scheduling, file N+1 doesn't even start downloading until after
> > >> file N+2, which can happen if you just submit all reads to a thread
> > >> pool, as demonstrated in the linked trace.
> > >>
> > >> And again, with this level of control, you can also decide to reduce
> > >> or increase parallelism based on network conditions, memory usage,
> > >> other readers, etc. So it is both about improving/smoothing out
> > >> performance, and limiting resource consumption.
> > >>
> > >> Finally, I do not mean to propose that we necessarily build all of
> > >> this into Arrow, just that it we would like to make it possible to
> > >> build this with Arrow, and that Datasets may find this interesting for
> > >> its optimization purposes, if concurrent reads are a goal.
> > >>
> > >> >  Except that datasets are essentially unordered.
> > >>
> > >> I did not realize this, but that means it's not really suitable for
> > >> our use case, unfortunately.
> > >
> > > It would be helpful to understand things a bit better so that we do
> > > not miss out on an opportunity to collaborate. I don't know that the
> > > current mode of the some of the public Datasets APIs is a dogmatic
> > > view about how everything should always work, and it's possible that
> > > some relatively minor changes could allow you to use it. So let's try
> > > not to be closing any doors right now
> > >
> > >> Thanks,
> > >> David
> > >>
> > >> On 4/29/20, Antoine Pitrou  wrote:
> > >> >
> > >> > On 29/04/2020 at 23:30, David Li wrote:
> > >> >> Sure -
> > >> >>
> > >> >> The use case is to read a large partitioned dataset, consisting of
> > >> >> tens or hundreds of Parquet files. A reader expects to scan through
> > >> >> the data in order o

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Antoine Pitrou


If we want to discuss IO APIs, we should do that comprehensively.
There are various ways of expressing what we want to do (explicit
readahead, fadvise-like APIs, async APIs, etc.).

Regards

Antoine.


On 30/04/2020 at 15:08, Francois Saint-Jacques wrote:
> One more point,
> 
> It would seem beneficial if we could express this in
> `RandomAccessFile::ReadAhead(vector<ReadRange>)` method: no async
> buffering/coalescing would be needed. In the case of Parquet, we'd get
> the _exact_ ranges computed from the metadata. This method would also
> possibly benefit other filesystems since on linux it can call
> `readahead` and/or `madvise`.
> 
> François
> 
> 
> On Thu, Apr 30, 2020 at 8:56 AM Francois Saint-Jacques
>  wrote:
>>
>> Hello David,
>>
>> I think that what you ask is achievable with the dataset API without
>> much effort. You'd have to insert the pre-buffering at
>> ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is
>> essentially a generator that looks like
>> flatmap(Iterator<Iterator<ScanTask>>). It consumes the
>> fragment in-order. The application consuming the ScanTask could
>> control the number of scheduled tasks by looking at the IO pool load.
>>
>> OTOH, It would be good if we could make this format agnostic, e.g.
>> offer this via a ScanOptions toggle, e.g. "readahead_files" and this
>> would be applicable to all formats, CSV, ipc, ...
>>
>> François
>> [1] 
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L383-L401
>>
>> On Thu, Apr 30, 2020 at 8:20 AM David Li  wrote:
>>>
>>> Sure, and we are still interested in collaborating. The main use case
>>> we have is scanning datasets in order of the partition key; it seems
>>> ordering is the only missing thing from Antoine's comments. However,
>>> from briefly playing around with the Python API, an application could
>>> manually order the fragments if so desired, so that still works for
>>> us, even if ordering isn't otherwise a guarantee.
>>>
>>> Performance-wise, we would want intra-file concurrency (coalescing)
>>> and inter-file concurrency (buffering files in order, as described in
>>> my previous messages). Even if Datasets doesn't directly handle this,
>>> it'd be ideal if an application could achieve this if it were willing
>>> to manage the details. I also vaguely remember seeing some interest in
>>> things like being able to distribute a computation over a dataset via
>>> Dask or some other distributed computation system, which would also be
>>> interesting to us, though not a concrete requirement.
>>>
>>> I'd like to reference the original proposal document, which has more
>>> detail on our workloads and use cases:
>>> https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit
>>> As described there, we have a library that implements both a
>>> datasets-like API (hand it a remote directory, get back an Arrow
>>> Table) and several optimizations to make that library perform
>>> acceptably. Our motivation here is to be able to have a path to
>>> migrate to using and contributing to Arrow Datasets, which we see as a
>>> cross-language, cross-filesystem library, without regressing in
>>> performance. (We are limited to Python and S3.)
>>>
>>> Best,
>>> David
>>>
>>> On 4/29/20, Wes McKinney  wrote:
 On Wed, Apr 29, 2020 at 6:54 PM David Li  wrote:
>
> Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
> guaranteed to download all the files in order, but with more control,
> you can make this more likely. You can also prevent the case where due
> to scheduling, file N+1 doesn't even start downloading until after
> file N+2, which can happen if you just submit all reads to a thread
> pool, as demonstrated in the linked trace.
>
> And again, with this level of control, you can also decide to reduce
> or increase parallelism based on network conditions, memory usage,
> other readers, etc. So it is both about improving/smoothing out
> performance, and limiting resource consumption.
>
> Finally, I do not mean to propose that we necessarily build all of
> this into Arrow, just that it we would like to make it possible to
> build this with Arrow, and that Datasets may find this interesting for
> its optimization purposes, if concurrent reads are a goal.
>
>>  Except that datasets are essentially unordered.
>
> I did not realize this, but that means it's not really suitable for
> our use case, unfortunately.

 It would be helpful to understand things a bit better so that we do
 not miss out on an opportunity to collaborate. I don't know that the
 current mode of the some of the public Datasets APIs is a dogmatic
 view about how everything should always work, and it's possible that
 some relatively minor changes could allow you to use it. So let's try
 not to be closing any doors right now

> Thanks,
> David
>
> On 4/29/20, Antoine Pitrou  wrote:

[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8647:


 Summary: [C++][Dataset] Optionally encode partition field values 
as dictionary type
 Key: ARROW-8647
 URL: https://issues.apache.org/jira/browse/ARROW-8647
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


In the Python ParquetDataset implementation, the partition fields are returned 
as dictionary type columns. 

In the new Dataset API, we now use a plain type (integer or string when 
inferred). But you can already manually specify that the partition keys should 
be dictionary type by specifying the partitioning schema (in the {{Partitioning}} 
passed to the dataset factory). 

Since using a dictionary type can be more efficient (partition keys will 
typically be repeated values in the resulting table), it might be good to still 
have an option in the DatasetFactory to use dictionary types for the partition 
fields.
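
A minimal sketch of the manual workaround mentioned above (the field name, 
dictionary index/value types, and dataset path are made up for illustration):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition field as dictionary-typed in the partitioning schema,
# instead of letting discovery infer a plain integer/string type.
partitioning = ds.partitioning(
    pa.schema([("year", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive",
)
dataset = ds.dataset("data_root/", format="parquet", partitioning=partitioning)
{code}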



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

2020-04-30 Thread Wes McKinney
The proposal is for any BUNDLED dependency to be merged into
libarrow.a (or another one of the static libraries if the dependency
is only used in e.g. one subcomponent), so this applies to the AWS SDK
also

On Thu, Apr 30, 2020 at 3:02 AM Rémi Dettai  wrote:
>
> Hi!
>
> Does your point 1 also apply to the AWS SDK dependency ? Currently it seems
> that it cannot be built in BUNDLED mode. As stated in
> https://issues.apache.org/jira/browse/ARROW-8565 I struggled a lot to make
> a static build with the S3 dependency activated ! I would really like to
> help on this because it is very important for my usecase that we can
> assemble compact builds of Arrow, but I'm still very uncomfortable with
> CMake :-(
>
> Thanks for your amazing work !
>
> Remi
>
> On Tue, Apr 28, 2020 at 16:22, Wes McKinney  wrote:
>
> > hi folks,
> >
> > I would like to highlight some outstanding problems with our packages
> >
> > 1. Our Arrow C++ static libraries are generally unusable.
> >
> > Whenever -DARROW_JEMALLOC=ON or any dependency is built in BUNDLED
> > mode, libarrow.a (or other static libraries) cannot be used for
> > linking. That's because the static library has a dependency on the
> > bundled static libraries, which are _not_ packaged with the Arrow static
> > libraries.
> >
> > The preferred solution seems to be ARROW-7605. I demonstrated how this
> > works in
> >
> > https://github.com/apache/arrow/pull/6220
> >
> > but I need someone to help with the PR to deal with other BUNDLED
> > dependencies. I likely won't be able to complete the PR myself in time
> > for the next release.
> >
> > 2. Our Python packages are unacceptably large
> >
> > On Linux, wheels are now 64MB and after installation take up 218MB.
> > There is an immediate serious problem that has gone unresolved that is
> > easier to fix and a separate structural problem that is more difficult
> > to fix. See the directory listing
> >
> > https://gist.github.com/wesm/57bd99798a2fa23ef3cb5e4b18b5a248
> >
> > We're duplicating all of the shared libraries inside the wheel and on
> > disk. It's unfortunate that we've allowed this problem for a whole
> > year or more
> >
> > https://issues.apache.org/jira/browse/ARROW-5082
> >
> > I also recently opened
> >
> > https://issues.apache.org/jira/browse/ARROW-8518
> >
> > which describes a proposal to create some tools to assist with
> > building "parent" and "child" Python packages. This would enable us to
> > ship components like Flight and Gandiva as separate wheels. This is a
> > large project but one that will ultimately be necessary for the
> > long-term scalability and sustainability of the project.
> >
> > I am not able to personally work on either of these projects in the
> > current release cycle, but I hope that some progress can be made on
> > these since they have lingered on for a long time, and it would be
> > good for us to "put our best foot forward" with the 1.0.0 release.
> >
> > Thanks,
> > Wes
> >


[jira] [Created] (ARROW-8648) [Rust] Optimize Rust CI Build Times

2020-04-30 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8648:


 Summary: [Rust] Optimize Rust CI Build Times
 Key: ARROW-8648
 URL: https://issues.apache.org/jira/browse/ARROW-8648
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Mark Hildreth


In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build 
options used that are at odds with each other, resulting in multiple redundant 
builds where a smaller number could do the same job. The following tweaks, at 
a minimum, could reduce this and speed up build times:
 * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. 
Currently, it's only used for a single command (the {{build --all-targets}} in 
{{rust_build.sh}}). Subsequent runs of cargo will ignore this first build, 
since RUSTFLAGS has changed.
 * Don't run the examples in release mode, as that forces a new (and slower) 
rebuild when the examples have already been built in debug mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8649) [Java] [Website] Java documentation on website is hidden

2020-04-30 Thread Andy Grove (Jira)
Andy Grove created ARROW-8649:
-

 Summary: [Java] [Website] Java documentation on website is hidden
 Key: ARROW-8649
 URL: https://issues.apache.org/jira/browse/ARROW-8649
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andy Grove
 Fix For: 1.0.0


There is some excellent Java documentation on the website that is hard to find 
because the Java documentation link [1] goes straight to the generated 
javadocs.

 

 [1] https://arrow.apache.org/docs/java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8650) [Rust] [Website] Add documentation to Arrow website

2020-04-30 Thread Andy Grove (Jira)
Andy Grove created ARROW-8650:
-

 Summary: [Rust] [Website] Add documentation to Arrow website
 Key: ARROW-8650
 URL: https://issues.apache.org/jira/browse/ARROW-8650
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Website
Reporter: Andy Grove
 Fix For: 1.0.0


The documentation page [1] on the Arrow site has links for C, C++, Java, 
Python, JavaScript, and R. It would be good to add Rust here as well, even if 
the docs are brief and just link to the rustdocs on docs.rs [2] (which are 
currently broken due to ARROW-8536 [3]).

 

[1] [https://arrow.apache.org/docs/]

[2] https://docs.rs/crate/arrow/0.17.0

[3] https://issues.apache.org/jira/browse/ARROW-8536



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8651:


 Summary: [Python][Dataset] Support pickling of Dataset objects
 Key: ARROW-8651
 URL: https://issues.apache.org/jira/browse/ARROW-8651
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


We already made several parts of a Dataset serializable (the formats, the 
expressions, the filesystem). With those, it should also be possible to pickle 
FileFragments, and with that also a Dataset.
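
A sketch of the intended behaviour once this is supported (the path is 
hypothetical):

{code:python}
import pickle
import pyarrow.dataset as ds

dataset = ds.dataset("data_root/", format="parquet")

# Round-trip the Dataset through pickle and check that it still describes
# the same data.
restored = pickle.loads(pickle.dumps(dataset))
assert restored.schema.equals(dataset.schema)
{code}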



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8652:


 Summary: [Python] Test error message when discovering dataset with 
invalid files
 Key: ARROW-8652
 URL: https://issues.apache.org/jira/browse/ARROW-8652
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


There is a comment in test_parquet.py about the Dataset API needing a better 
error message for invalid files:

https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648

Although this seems to work now:

{code}
import tempfile
import pathlib
import pyarrow.dataset as ds

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "data.parquet"), 'wb') as f:
    pass

In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
...
OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': 
Invalid: Parquet file size is 0 bytes
{code}

So we need to update the test to actually test this instead of skipping it.

The only difference with the Python ParquetDataset implementation is that the 
datasets API raises an OSError and not an ArrowInvalid error.
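
A hedged sketch of what the updated test could look like (the test name is made 
up and {{tempdir}} is assumed to be the existing pytest fixture; this is not the 
actual test in test_parquet.py):

{code:python}
import pytest
import pyarrow.dataset as ds

def test_dataset_invalid_parquet_file(tempdir):
    path = str(tempdir / "data.parquet")
    with open(path, "wb"):
        pass  # a zero-byte file is not a valid Parquet file

    # Discovery should surface a clear error instead of being skipped in tests.
    with pytest.raises(OSError, match="Parquet file size is 0 bytes"):
        ds.dataset(path, format="parquet")
{code}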



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8653) [C++] Add support for gflags version detection

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8653:
--

 Summary: [C++] Add support for gflags version detection
 Key: ARROW-8653
 URL: https://issues.apache.org/jira/browse/ARROW-8653
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs


Missing functionality from FindgflagsAlt, follow-up for 
https://github.com/apache/arrow/pull/7067/files#diff-bc36ca94c3abd969dcdbaec7125fed65R18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files

2020-04-30 Thread Mike Macpherson (Jira)
Mike Macpherson created ARROW-8654:
--

 Summary: [Python] pyarrow 0.17.0 fails reading "wide" parquet files
 Key: ARROW-8654
 URL: https://issues.apache.org/jira/browse/ARROW-8654
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Mike Macpherson


{code:python}
import numpy as np
import pandas as pd

num_rows, num_cols = 1000, 45000

df = pd.DataFrame(
    np.random.randint(0, 256, size=(num_rows, num_cols)).astype(np.uint8)
)

outfile = "test.parquet"
df.to_parquet(outfile)
del df

df = pd.read_parquet(outfile)
{code}
Yields:
{noformat}
df = pd.read_parquet(outfile) 
File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 
310, in read_parquet 
return impl.read(path, columns=columns, kwargs) 
File "/jupyter/venv/lib/python3.6/site-packages/pandas/io/parquet.py", line 
125, in read 
path, columns=columns, kwargs 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1530, 
in read_table 
partitioning=partitioning) 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1189, 
in __init__ 
self.validate_schemas() 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1217, 
in validate_schemas 
self.schema = self.pieces[0].get_metadata().schema 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 662, 
in get_metadata 
f = self.open() 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 669, 
in open 
reader = self.open_file_func(self.path) 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 1040, 
in _open_dataset_file 
buffer_size=dataset.buffer_size 
File "/jupyter/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, 
in __init__ 
read_dictionary=read_dictionary, metadata=metadata) 
File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open 
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status 
OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
{noformat}
This is pandas 1.0.3 and pyarrow 0.17.0.

I tried this with pyarrow 0.16.0, and it works; 0.15.1 did as well.

I also tried with 40,000 columns instead of 45,000 as above, and that does work 
with 0.17.0.

Thanks for all your work on this project!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8655:


 Summary: [C++][Dataset][Python][R] Preserve partitioning 
information for a discovered Dataset
 Key: ARROW-8655
 URL: https://issues.apache.org/jira/browse/ARROW-8655
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} 
classes that describe the partitioning used in the discovery phase. But once a 
dataset object is created, it no longer knows about this; it only has 
partition expressions for the fragments. And the partition keys are added to 
the schema, but you can't directly tell which columns of the schema originated 
from the partitions.

However, there are use cases where it would be useful for a dataset to still 
"know" what kind of partitioning it was created from:

- The "read CSV, write back Parquet" use case, where the CSV was already 
partitioned and you want to automatically preserve that partitioning for 
Parquet (essentially round-tripping the partitioning on read/write).
- Converting the dataset to another representation, e.g. conversion to pandas, 
where it can be useful to know which columns were partition columns (e.g. for 
pandas, those columns might be good candidates to be set as the index of the 
pandas/dask DataFrame). I can imagine conversions to other systems can use 
similar information.
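
As a small illustration of the current gap (Python dataset API; the path here 
is hypothetical):

{code}
import pyarrow.dataset as ds

# discover a hive-partitioned directory of parquet files
dataset = ds.dataset("path/to/base_dir", format="parquet",
                     partitioning=ds.partitioning(flavor="hive"))

# the partition keys end up in the schema ...
print(dataset.schema)
# ... but nothing on the resulting dataset tells you afterwards which of those
# columns came from the HivePartitioning used during discovery
{code}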




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8656:
--

 Summary: [Python] Switch to VS2017 in the windows wheel builds
 Key: ARROW-8656
 URL: https://issues.apache.org/jira/browse/ARROW-8656
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Since the recent conda-forge compiler migrations, the wheel builds are failing:
https://mail.google.com/mail/u/0/#label/ARROW/FMfcgxwHNCsqSGKQRMZxGlWWsfmGpKdC



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2

2020-04-30 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-8657:
-

 Summary: Distinguish parquet version 2 logical type vs DataPageV2
 Key: ARROW-8657
 URL: https://issues.apache.org/jira/browse/ARROW-8657
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.17.0
Reporter: Pierre Belzile


With the recent release of 0.17, the ParquetVersion is used to define the 
logical type interpretation of fields and the selection of the DataPage format.

As a result, all parquet files that were created with ParquetVersion::V2 to get 
features such as unsigned int32s, timestamps with nanosecond resolution, etc. 
are now unreadable. That's TBs of data in my case.

Those two concerns should be separated. Given that DataPageV2 pages were not 
written prior to 0.17, and in order to allow reading existing files, the 
existing version property should continue to operate as in 0.16 and inform the 
logical type mapping.

Some consideration should be given to issue a release 0.17.1.
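
For reference, a minimal sketch of the kind of writer call that is affected 
(the file name is hypothetical):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("ns"))})

# version="2.0" is what we rely on for features such as nanosecond timestamps
# and unsigned integers; in 0.17.0 the same flag also switches the writer to
# DataPageV2 data pages, which pre-0.17 readers cannot decode
pq.write_table(table, "data_v2.parquet", version="2.0")
{code}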

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Micah Kornfield
This sounds like something we might want to do and issue a patch release.
It seems bad to default to a non-production version?

I can try to take a look tonight at a patch if no one gets to it before then.

Thanks,
Micah

On Wednesday, April 29, 2020, Wes McKinney  wrote:

> On Wed, Apr 29, 2020 at 6:15 PM Pierre Belzile 
> wrote:
> >
> > Wes,
> >
> > You used the words "forward compatible". Does this mean that 0.17 is able
> > to decode 0.16 datapagev2?
>
> 0.16 doesn't write DataPageV2 at all, the version flag only determines
> the type casting and metadata behavior I indicated in my email. The
> changes in
>
> https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
> a9da588516
>
> enabled the use of DataPageV2 and I/we didn't think about the forward
> compatibility issue (version=2.0 files written in 0.17.0 being
> unreadable in 0.16.0). We might actually want to revert this (just the
> toggle between DataPageV1/V2, not the whole patch).
>
>
>
> > Crossing my fingers...
> >
> > Pierre
> >
> > Le mer. 29 avr. 2020 à 19:05, Wes McKinney  a
> écrit :
> >
> > > Ah, so we have a slight mess on our hands because the patch for
> > > PARQUET-458 enabled the use of DataPageV2, which is not forward
> > > compatible with older version because the implementation was fixed
> > > (see the JIRA for more details)
> > >
> > >
> > > https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
> a9da588516
> > >
> > > Unfortunately, in Python the version='1.0' / version='2.0' flag is
> > > being used for two different purposes:
> > >
> > > * Expanded ConvertedType / LogicalType metadata, like unsigned types
> > > and nanosecond timestamps
> > > * DataPageV1 vs. DataPageV2 data pages
> > >
> > > I think we should separate these concepts and instead have a
> > > "compatibility mode" option regarding the ConvertedType/LogicalType
> > > annotations and the behavior around conversions when writing unsigned
> > > integers, nanosecond timestamps, and other types to Parquet V1 (which
> > > is the only "production" Parquet format).
> > >
> > > On Wed, Apr 29, 2020 at 5:56 PM Pierre Belzile <
> pierre.belz...@gmail.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > We've been using the parquet 2 format (mostly because of nanosecond
> > > > resolution). I'm getting crashes in the C++ parquet decoder, arrow
> 0.16,
> > > > when decoding a parquet 2 file created with pyarrow 0.17.0. Is this
> > > > expected? Would a 0.17 decode a 0.16?
> > > >
> > > > If that's not expected, I can put the debugger on it and see what is
> > > > happening. I suspect it's with string fields (regular, not large
> string).
> > > >
> > > > Cheers, Pierre
> > >
>


Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Wes McKinney
I'd be fine with a patch release addressing this so long as it's
binary-only (to save us all time).

On Thu, Apr 30, 2020, 12:30 PM Micah Kornfield 
wrote:

> This sounds like something we might want to do and issue a patch release.
> It seems bad to default to a non-production version?
>
> I can try to take a look tonight at a patch of no gets to it before.
>
> Thanks,
> Micah
>
> On Wednesday, April 29, 2020, Wes McKinney  wrote:
>
> > On Wed, Apr 29, 2020 at 6:15 PM Pierre Belzile  >
> > wrote:
> > >
> > > Wes,
> > >
> > > You used the words "forward compatible". Does this mean that 0.17 is
> able
> > > to decode 0.16 datapagev2?
> >
> > 0.16 doesn't write DataPageV2 at all, the version flag only determines
> > the type casting and metadata behavior I indicated in my email. The
> > changes in
> >
> > https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
> > a9da588516
> >
> > enabled the use of DataPageV2 and I/we didn't think about the forward
> > compatibility issue (version=2.0 files written in 0.17.0 being
> > unreadable in 0.16.0). We might actually want to revert this (just the
> > toggle between DataPageV1/V2, not the whole patch).
> >
> >
> >
> > > Crossing my fingers...
> > >
> > > Pierre
> > >
> > > Le mer. 29 avr. 2020 à 19:05, Wes McKinney  a
> > écrit :
> > >
> > > > Ah, so we have a slight mess on our hands because the patch for
> > > > PARQUET-458 enabled the use of DataPageV2, which is not forward
> > > > compatible with older version because the implementation was fixed
> > > > (see the JIRA for more details)
> > > >
> > > >
> > > >
> https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
> > a9da588516
> > > >
> > > > Unfortunately, in Python the version='1.0' / version='2.0' flag is
> > > > being used for two different purposes:
> > > >
> > > > * Expanded ConvertedType / LogicalType metadata, like unsigned types
> > > > and nanosecond timestamps
> > > > * DataPageV1 vs. DataPageV2 data pages
> > > >
> > > > I think we should separate these concepts and instead have a
> > > > "compatibility mode" option regarding the ConvertedType/LogicalType
> > > > annotations and the behavior around conversions when writing unsigned
> > > > integers, nanosecond timestamps, and other types to Parquet V1 (which
> > > > is the only "production" Parquet format).
> > > >
> > > > On Wed, Apr 29, 2020 at 5:56 PM Pierre Belzile <
> > pierre.belz...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > We've been using the parquet 2 format (mostly because of nanosecond
> > > > > resolution). I'm getting crashes in the C++ parquet decoder, arrow
> > 0.16,
> > > > > when decoding a parquet 2 file created with pyarrow 0.17.0. Is this
> > > > > expected? Would a 0.17 decode a 0.16?
> > > > >
> > > > > If that's not expected, I can put the debugger on it and see what
> is
> > > > > happening. I suspect it's with string fields (regular, not large
> > string).
> > > > >
> > > > > Cheers, Pierre
> > > >
> >
>


[jira] [Created] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

2020-04-30 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8658:
---

 Summary: [C++][Dataset] Implement subtree pruning for 
FileSystemDataset::GetFragments
 Key: ARROW-8658
 URL: https://issues.apache.org/jira/browse/ARROW-8658
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


This is a very handy optimization for large datasets with multiple partition 
fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a 
filter {{"a"_ == 2}}, none of its files or subdirectories need to be examined.

After ARROW-8318, FileSystemDataset stores only files, so subtree pruning (whose 
implementation depended on the presence of directories to represent subtrees) 
was disabled. It should be possible to reintroduce this without reference to 
directories, by examining partition expressions directly and extracting a tree 
structure from their subexpressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Francois,

Thanks for the pointers. I'll see if I can put together a
proof-of-concept; would that help the discussion? I agree it would be good
to make it format-agnostic. I'm also curious what thoughts you'd have
on how to manage cross-file parallelism (coalescing only helps within
a file). If we just naively start scanning fragments in parallel, we'd
still want some way to help ensure the actual reads get issued roughly
in order of file (to avoid the problem discussed above, where reads
for file B prevent reads for file A from getting scheduled, where B
follows A from the consumer's standpoint).

Antoine,

We would be interested in that as well. One thing we do want to
investigate is a better ReadAsync() implementation for S3File as
preliminary benchmarking on our side has shown it's quite inefficient
(the default implementation makes lots of memcpy()s).

Thanks,
David

On 4/30/20, Antoine Pitrou  wrote:
>
> If we want to discuss IO APIs we should do that comprehensively.
> There are various ways of expressing what we want to do (explicit
> readahead, fadvise-like APIs, async APIs, etc.).
>
> Regards
>
> Antoine.
>
>
> Le 30/04/2020 à 15:08, Francois Saint-Jacques a écrit :
>> One more point,
>>
>> It would seem beneficial if we could express this in
>> `RandomAccessFile::ReadAhead(vector)` method: no async
>> buffering/coalescing would be needed. In the case of Parquet, we'd get
>> the _exact_ ranges computed from the medata.This method would also
>> possibly benefit other filesystems since on linux it can call
>> `readahead` and/or `madvise`.
>>
>> François
>>
>>
>> On Thu, Apr 30, 2020 at 8:56 AM Francois Saint-Jacques
>>  wrote:
>>>
>>> Hello David,
>>>
>>> I think that what you ask is achievable with the dataset API without
>>> much effort. You'd have to insert the pre-buffering at
>>> ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is
>>> essentially a generator that looks like
>>> flatmap(Iterator<Iterator<ScanTask>>). It consumes the
>>> fragment in-order. The application consuming the ScanTask could
>>> control the number of scheduled tasks by looking at the IO pool load.
>>>
>>> OTOH, It would be good if we could make this format agnostic, e.g.
>>> offer this via a ScanOptions toggle, e.g. "readahead_files" and this
>>> would be applicable to all formats, CSV, ipc, ...
>>>
>>> François
>>> [1]
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L383-L401
>>>
>>> On Thu, Apr 30, 2020 at 8:20 AM David Li  wrote:

 Sure, and we are still interested in collaborating. The main use case
 we have is scanning datasets in order of the partition key; it seems
 ordering is the only missing thing from Antoine's comments. However,
 from briefly playing around with the Python API, an application could
 manually order the fragments if so desired, so that still works for
 us, even if ordering isn't otherwise a guarantee.

 Performance-wise, we would want intra-file concurrency (coalescing)
 and inter-file concurrency (buffering files in order, as described in
 my previous messages). Even if Datasets doesn't directly handle this,
 it'd be ideal if an application could achieve this if it were willing
 to manage the details. I also vaguely remember seeing some interest in
 things like being able to distribute a computation over a dataset via
 Dask or some other distributed computation system, which would also be
 interesting to us, though not a concrete requirement.

 I'd like to reference the original proposal document, which has more
 detail on our workloads and use cases:
 https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit
 As described there, we have a library that implements both a
 datasets-like API (hand it a remote directory, get back an Arrow
 Table) and several optimizations to make that library perform
 acceptably. Our motivation here is to be able to have a path to
 migrate to using and contributing to Arrow Datasets, which we see as a
 cross-language, cross-filesystem library, without regressing in
 performance. (We are limited to Python and S3.)

 Best,
 David

 On 4/29/20, Wes McKinney  wrote:
> On Wed, Apr 29, 2020 at 6:54 PM David Li 
> wrote:
>>
>> Ah, sorry, so I am being somewhat unclear here. Yes, you aren't
>> guaranteed to download all the files in order, but with more control,
>> you can make this more likely. You can also prevent the case where
>> due
>> to scheduling, file N+1 doesn't even start downloading until after
>> file N+2, which can happen if you just submit all reads to a thread
>> pool, as demonstrated in the linked trace.
>>
>> And again, with this level of control, you can also decide to reduce
>> or increase parallelism based on network conditions, memory usage,
>> other readers, etc. So it is both about improving/smoothing out
>

[RESULT] [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-30 Thread Wes McKinney
The vote carries with 7 binding +1 votes and 1 non-binding +1

On Fri, Apr 24, 2020 at 7:40 AM Francois Saint-Jacques
 wrote:
>
> +1 (binding)
>
> On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs
>  wrote:
> >
> > +1 (binding)
> >
> > On 2020. Apr 24., Fri at 1:51, Micah Kornfield 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > In 
> > > >   "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC
> > > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500,
> > > >   Wes McKinney  wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have proposed adding a simple RecordBatch IPC message body
> > > > > compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> > > > > protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> > > > > is distinct from separate discussions about adding in-memory encodings
> > > > > (like RLE-encoding) to the Arrow columnar format.
> > > > >
> > > > > This change is not forward compatible so it will not be safe to send
> > > > > compressed messages to old libraries, but since we are still pre-1.0.0
> > > > > the consensus is that this is acceptable. We may separately consider
> > > > > increasing the metadata version for 1.0.0 to require clients to
> > > > > upgrade.
> > > > >
> > > > > Please vote whether to accept the addition. The vote will be open for
> > > > > at least 72 hours.
> > > > >
> > > > > [ ] +1 Accept this addition to the IPC protocol
> > > > > [ ] +0
> > > > > [ ] -1 Do not accept the changes because...
> > > > >
> > > > > Here is my vote: +1
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/6707
> > > > > [2]:
> > > >
> > > https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
> > > >
> > >


[C++] Heads up about breaking API change with Interval types

2020-04-30 Thread Wes McKinney
Hi folks,

In https://github.com/apache/arrow/pull/7060 I proposed an
(unavoidable) C++ API change related to the two types of intervals
that are in the Arrow columnar format.

As context, in the C++ library in almost all cases we use different
Type enum values for each "subtype" that has a different in-memory
representation. So we have

Flatbuffers "Date" -> Type::DATE32 and Type::DATE64
Flatbuffers "Time" -> Type::TIME32 and Type::TIME64

There are two flavors of Interval, YEAR_MONTH (which is represented as
4-byte values) and DAY_TIME (which is represented as 8-byte values).
This means that we generally have to branch on the interval type to
select code paths, so you end up with code like

case Type::INTERVAL: {
  switch (checked_cast(*type).interval_type()) {
case IntervalType::MONTHS:
  res = std::make_shared>(type);
  break;
case IntervalType::DAY_TIME:
  res = std::make_shared(type);
  break;
default:
  return not_implemented();
}

This makes any kind of dynamic dispatch for intervals more complex
than other types (e.g. DATE32/64). My patch splits the enum into
INTERVAL_MONTHS and INTERVAL_DAY_TIME to make Interval work the same
as the other types which have different in-memory representations
based on their parameters (i.e. Date and Time).

Since this is a less traveled part of the codebase, the number of
downstream users impacted by the API change should not be large but
per discussion on the PR I wanted to make this change more visible in
case there was a concern.

Thanks
Wes


[jira] [Created] (ARROW-8659) ListBuilder and FixedSizeListBuilder capacity

2020-04-30 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-8659:


 Summary: ListBuilder and FixedSizeListBuilder capacity
 Key: ARROW-8659
 URL: https://issues.apache.org/jira/browse/ARROW-8659
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


Both ListBuilder and FixedSizeListBuilder accept a values_builder as a 
constructor argument and then set the capacity of their internal builders based 
on the length of this values_builder. Unfortunately, at construction time this 
values_builder is normally empty, and consequently programs spend an 
unnecessary amount of time reallocating memory.

 

This should be addressed by adding new constructor methods that allow 
specifying the desired capacity upfront.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8660:
---

 Summary: [C++][Gandiva] Reduce dependence on Boost
 Key: ARROW-8660
 URL: https://issues.apache.org/jira/browse/ARROW-8660
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Remove Boost usages aside from Boost.Multiprecision



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8661:
---

 Summary: [C++][Gandiva] Reduce number of files and headers
 Key: ARROW-8661
 URL: https://issues.apache.org/jira/browse/ARROW-8661
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


I feel that the Gandiva subpackage is more Java-like in its code organization 
than the rest of the Arrow codebase, and it might be easier to navigate and 
develop with closely related code condensed into some larger headers and 
compilation units.

Additionally, it's not necessary to have a header file for each component of 
the function registry -- the registration functions can be declared in 
function_registry.h or function_registry_internal.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Pyarrow building from source along with CPP Libraries to link to another Cython API

2020-04-30 Thread Vibhatha Abeykoon
Hi,

I am trying to integrate Arrow with an application that I am developing.
Here I build Arrow from source (C++) and use the API to develop some
custom functions that do a scientific calculation after the data is loaded
with the Arrow Table API. On top of this, I develop a Cython API to provide
a Python API.

At the current stage, I have a new requirement: I need to consume the Arrow
Cython API from my code.

It was hard to link the built libarrow.so.16 with the
libarrow_python.so.16 from the pyarrow installed separately via pip.
What I realised was that everything has to be built from the same source,
so I should install pyarrow from source in my virtual environment.

Before going deeper, I started by just building from source (C++) and then
moving towards installing pyarrow from source.

I tried to follow the guideline form here,

https://arrow.apache.org/docs/developers/python.html,

But when I ran into issues in the Python build, I followed this source
(still using a clone of master, not a released version):

https://gist.github.com/heavyinfo/04e1326bb9bed9cecb19c2d603c8d521

My environment variables are as follows,

python3 setup.py build_ext --inplace
running build_ext
-- Running cmake for pyarrow
cmake 
-DPYTHON_EXECUTABLE=/home/vibhatha/sandbox/arrow/repos/arrow/ENVARROW/bin/python3
 -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=off
-DPYARROW_BUILD_GANDIVA=off -DPYARROW_BUILD_DATASET=off
-DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on
-DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off
-DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off
-DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off
-DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on
-DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release
/home/vibhatha/sandbox/arrow/repos/arrow/python
-- System processor: x86_64
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Build output directory:
/home/vibhatha/sandbox/arrow/repos/arrow/python/build/temp.linux-x86_64-3.8/release
-- Arrow version: 0.18.0 (HOME:
/home/vibhatha/sandbox/arrow/repos/arrow/cpp/arrowmylibs)
-- Arrow SO and ABI version: 18
-- Arrow full SO version: 18.0.0
-- Found the Arrow core shared library:
/home/vibhatha/sandbox/arrow/repos/arrow/cpp/arrowmylibs/libarrow.so
-- Found the Arrow core import library:
/home/vibhatha/sandbox/arrow/repos/arrow/cpp/arrowmylibs/libarrow.so
-- Found the Arrow core static library:
/home/vibhatha/sandbox/arrow/repos/arrow/cpp/arrowmylibs/libarrow.a
CMake Error at 
/usr/local/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146
(message):
  Could NOT find ArrowPython (missing: ARROW_PYTHON_INCLUDE_DIR) (found
  version "0.18.0")
Call Stack (most recent call first):
  /usr/local/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393
(_FPHSA_FAILURE_MESSAGE)
  cmake_modules/FindArrowPython.cmake:76 (find_package_handle_standard_args)
  CMakeLists.txt:210 (find_package)


-- Configuring incomplete, errors occurred!
See also 
"/home/vibhatha/sandbox/arrow/repos/arrow/python/build/temp.linux-x86_64-3.8/CMakeFiles/CMakeOutput.log".
error: command 'cmake' failed with exit status 1


How to include arrow and parquet in another project's CMakeLists.txt

2020-04-30 Thread Zhuo Jia Dai
Hi all,

I am trying to write a Julia parquet writer by leveraging the C++ arrow
library. I can build arrow and arrow/parquet and can write out a parquet
file successfully. The next thing I need to do is use the [CxxWrap.jl](
https://github.com/JuliaInterop/CxxWrap.jl) Julia package to call the C++
functions I wrote. However, CxxWrap.jl's underlying JlCxx library only has
build examples in CMake, so I am having issues modifying the CMakeLists.txt
to add the arrow and arrow/parquet dependencies.

This is the error I get when I try to run their CMake process, so I should
be able to add something to the CMakeLists.txt file to tell it where to
find the parquet and arrow libraries, right? I am so new to C++ that I am
at a loss on how to do this. I have included the CMakeLists.txt file at the
very end! Thanks for any help and for helping me make a Julia parquet
writer a reality.


```
[ 50%] Building CXX object CMakeFiles/testlib.dir/testlib.cpp.o
/home/xiaodai/git/ParquetWriter.jl/test-include-julia/testlib.cpp:26:10:
fatal error: parquet/arrow/writer.h: No such file or directory
 #include "parquet/arrow/writer.h"
  ^~~~
compilation terminated.
CMakeFiles/testlib.dir/build.make:62: recipe for target
'CMakeFiles/testlib.dir/testlib.cpp.o' failed
make[2]: *** [CMakeFiles/testlib.dir/testlib.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/testlib.dir/all'
failed
make[1]: *** [CMakeFiles/testlib.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
```

The CMakeLists.txt I am using now

```
project(TestLib)

cmake_minimum_required(VERSION 2.8.12)
set(CMAKE_MACOSX_RPATH 1)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib")

find_package(JlCxx)
get_target_property(JlCxx_location JlCxx::cxxwrap_julia LOCATION)
get_filename_component(JlCxx_location ${JlCxx_location} DIRECTORY)
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/lib;${JlCxx_location}")

message(STATUS "Found JlCxx at ${JlCxx_location}")

add_library(testlib SHARED testlib.cpp)

target_link_libraries(testlib JlCxx::cxxwrap_julia)

install(TARGETS
  testlib
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib
RUNTIME DESTINATION lib)
```

Regards



-- 
ZJ

zhuojia@gmail.com


Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Micah Kornfield
Sorry I didn't get to this, will try again tomorrow.

On Thu, Apr 30, 2020 at 11:09 AM Wes McKinney  wrote:

> I'd be fine with a patch release addressing this so long as it's
> binary-only (to save us all time).
>
> On Thu, Apr 30, 2020, 12:30 PM Micah Kornfield 
> wrote:
>
>> This sounds like something we might want to do and issue a patch release.
>> It seems bad to default to a non-production version?
>>
>> I can try to take a look tonight at a patch of no gets to it before.
>>
>> Thanks,
>> Micah
>>
>> On Wednesday, April 29, 2020, Wes McKinney  wrote:
>>
>> > On Wed, Apr 29, 2020 at 6:15 PM Pierre Belzile <
>> pierre.belz...@gmail.com>
>> > wrote:
>> > >
>> > > Wes,
>> > >
>> > > You used the words "forward compatible". Does this mean that 0.17 is
>> able
>> > > to decode 0.16 datapagev2?
>> >
>> > 0.16 doesn't write DataPageV2 at all, the version flag only determines
>> > the type casting and metadata behavior I indicated in my email. The
>> > changes in
>> >
>> > https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
>> > a9da588516
>> >
>> > enabled the use of DataPageV2 and I/we didn't think about the forward
>> > compatibility issue (version=2.0 files written in 0.17.0 being
>> > unreadable in 0.16.0). We might actually want to revert this (just the
>> > toggle between DataPageV1/V2, not the whole patch).
>> >
>> >
>> >
>> > > Crossing my fingers...
>> > >
>> > > Pierre
>> > >
>> > > Le mer. 29 avr. 2020 à 19:05, Wes McKinney  a
>> > écrit :
>> > >
>> > > > Ah, so we have a slight mess on our hands because the patch for
>> > > > PARQUET-458 enabled the use of DataPageV2, which is not forward
>> > > > compatible with older version because the implementation was fixed
>> > > > (see the JIRA for more details)
>> > > >
>> > > >
>> > > >
>> https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162
>> > a9da588516
>> > > >
>> > > > Unfortunately, in Python the version='1.0' / version='2.0' flag is
>> > > > being used for two different purposes:
>> > > >
>> > > > * Expanded ConvertedType / LogicalType metadata, like unsigned types
>> > > > and nanosecond timestamps
>> > > > * DataPageV1 vs. DataPageV2 data pages
>> > > >
>> > > > I think we should separate these concepts and instead have a
>> > > > "compatibility mode" option regarding the ConvertedType/LogicalType
>> > > > annotations and the behavior around conversions when writing
>> unsigned
>> > > > integers, nanosecond timestamps, and other types to Parquet V1
>> (which
>> > > > is the only "production" Parquet format).
>> > > >
>> > > > On Wed, Apr 29, 2020 at 5:56 PM Pierre Belzile <
>> > pierre.belz...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > We've been using the parquet 2 format (mostly because of
>> nanosecond
>> > > > > resolution). I'm getting crashes in the C++ parquet decoder, arrow
>> > 0.16,
>> > > > > when decoding a parquet 2 file created with pyarrow 0.17.0. Is
>> this
>> > > > > expected? Would a 0.17 decode a 0.16?
>> > > > >
>> > > > > If that's not expected, I can put the debugger on it and see what
>> is
>> > > > > happening. I suspect it's with string fields (regular, not large
>> > string).
>> > > > >
>> > > > > Cheers, Pierre
>> > > >
>> >
>>
>