Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
The VOTE carries with 4 binding +1 votes, 3 non-binding +1 votes and
one binding +0 vote.

I'm starting the post-release tasks; if anyone wants to help, please let me know.

On Fri, Feb 7, 2020 at 12:25 AM Krisztián Szűcs
 wrote:
>
> So far we have the following votes:
>
> +0 (binding)
> +1 (binding)
> +1 (non-binding)
> +1 (binding)
> +1 (non-binding)
> +1 (binding)
> +1 (non-binding)
> +1 (binding)
> 
> 4 +1 (binding)
> 3 +1 (non-binding)
>
> I'm waiting for votes until tomorrow morning (UTC), then I'm closing the VOTE.
>
> Thanks everyone!
>
> - Krisztian
>
> On Fri, Feb 7, 2020 at 12:06 AM Krisztián Szűcs
>  wrote:
> >
> > Testing on macOS Catalina
> >
> > Binaries: OK
> >
> > Wheels: OK
> > Verified on macOS and on Linux.
> > On Linux the verification script failed for Python 3.5 and the manylinux2010
> > and manylinux2014 wheels with an unsupported platform tag. I've manually checked
> > these wheels in the python:3.5 docker image, and the wheels were good
> > (this is automatically checked by crossbow too [1]). All other wheels were
> > passing using the verification script.
> >
> > Source: OK
> > I had to revert the nvm path [2] to pass the js and integration tests and
> > force the glib test to use my system python instead of the conda one.
> >
> > I vote with +1 (binding)
> >
> > [1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568
> > [2]: 
> > https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2
> >
> >
> > On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson
> >  wrote:
> > >
> > > I re-verified the macOS wheels and they worked but I had to hard-code
> > > `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported
> > > previously. I tried to set that env var dynamically based on your current
> > > OS version but didn't succeed in getting it passed through to pytest,
> > > despite many attempts to `export` it; someone with better bash skills than
> > > I should probably add that to the script. FTR `defaults read loginwindow
> > > SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS
> > > version.
> > >
> > > Neal
> > >
> > > On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > I was able to verify the Windows wheels with the following patch applied
> > > >
> > > > https://github.com/apache/arrow/pull/6364
> > > >
> > > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > There were binary naming issues with the macosx and the win-cp38
> > > > > wheels.
> > > > > I've uploaded them, all of the wheels should be available now [1]
> > > > >
> > > > > Note that the newly built macosx wheels have 10_9 platform tag
> > > > > instead of 10_6, so the verification script must be updated [2] to
> > > > > verify the macosx wheels.
> > > > >
> > > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > > > > [2]
> > > > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658
> > > > >
> > > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs
> > > > >  wrote:
> > > > > >
> > > > > > The wheel was built successfully and is available under the crossbow
> > > > > > releases. Something must have gone wrong during the download/upload
> > > > > > to bintray. I'm re-uploading the wheels, waiting for the
> > > > > > network.
> > > > > >
> > > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > The Windows wheel RC script is broken
> > > > > > >
> > > > > > > wget --no-check-certificate -O
> > > > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > > > >
> > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1
> > > > > > > 6.0-cp38-cp38m-win_amd64.whl   || EXIT /B 1
> > > > > > > --2020-02-05 11:11:15--
> > > > > > >
> > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > > > > Resolving bintray.com (bintray.com)... 75.126.208.206
> > > > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443...
> > > > > > > connected.
> > > > > > > HTTP request sent, awaiting response... 404 Not Found
> > > > > > > 2020-02-05 11:11:15 ERROR 404: Not Found.
> > > > > > >
> > > > > > > I will try to fix
> > > > > > >
> > > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays
> > > > pull all dependencies from the system. Is there a known way to build & 
> > > > test
> > > > on OSX with the script and use conda for the requirements?
> > > > > > > >
> > > > > > > > Otherwise I probably need to invest some time to create such a way.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > > Uwe
> > > > > > > >
> > > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I've cherry-picked the 

[jira] [Created] (ARROW-7788) Add schema conversion support for map type

2020-02-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7788:
--

 Summary: Add schema conversion support for map type
 Key: ARROW-7788
 URL: https://issues.apache.org/jira/browse/ARROW-7788
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
Assignee: Micah Kornfield


There is also some other cleanup that is probably worth doing:

1. Adding "large types"

2. Adding a flag to support the Parquet spec's required naming for list types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
So far we have the following votes:

+0 (binding)
+1 (binding)
+1 (non-binding)
+1 (binding)
+1 (non-binding)
+1 (binding)
+1 (non-binding)
+1 (binding)

4 +1 (binding)
3 +1 (non-binding)

I'm waiting for votes until tomorrow morning (UTC), then I'm closing the VOTE.

Thanks everyone!

- Krisztian

On Fri, Feb 7, 2020 at 12:06 AM Krisztián Szűcs
 wrote:
>
> Testing on macOS Catalina
>
> Binaries: OK
>
> Wheels: OK
> Verified on macOS and on Linux.
> On Linux the verification script failed for Python 3.5 and the manylinux2010
> and manylinux2014 wheels with an unsupported platform tag. I've manually checked
> these wheels in the python:3.5 docker image, and the wheels were good
> (this is automatically checked by crossbow too [1]). All other wheels were
> passing using the verification script.
>
> Source: OK
> I had to revert the nvm path [2] to pass the js and integration tests and
> force the glib test to use my system python instead of the conda one.
>
> I vote with +1 (binding)
>
> [1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568
> [2]: 
> https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2
>
>
> On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson
>  wrote:
> >
> > I re-verified the macOS wheels and they worked but I had to hard-code
> > `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported
> > previously. I tried to set that env var dynamically based on your current
> > OS version but didn't succeed in getting it passed through to pytest,
> > despite many attempts to `export` it; someone with better bash skills than
> > I should probably add that to the script. FTR `defaults read loginwindow
> > SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS
> > version.
> >
> > Neal
> >
> > On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney  wrote:
> >
> > > +1 (binding)
> > >
> > > I was able to verify the Windows wheels with the following patch applied
> > >
> > > https://github.com/apache/arrow/pull/6364
> > >
> > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > There were binary naming issues with the macosx and the win-cp38
> > > > wheels.
> > > > I've uploaded them, all of the wheels should be available now [1]
> > > >
> > > > Note that the newly built macosx wheels have 10_9 platform tag
> > > > instead of 10_6, so the verification script must be updated [2] to
> > > > verify the macosx wheels.
> > > >
> > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > > > [2]
> > > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658
> > > >
> > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > The wheel was built successfully and is available under the crossbow
> > > > > releases. Something must have gone wrong during the download/upload
> > > > > to bintray. I'm re-uploading the wheels, waiting for the
> > > > > network.
> > > > >
> > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney 
> > > wrote:
> > > > > >
> > > > > > The Windows wheel RC script is broken
> > > > > >
> > > > > > wget --no-check-certificate -O
> > > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > > >
> > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1
> > > > > > 6.0-cp38-cp38m-win_amd64.whl   || EXIT /B 1
> > > > > > --2020-02-05 11:11:15--
> > > > > >
> > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > > > Resolving bintray.com (bintray.com)... 75.126.208.206
> > > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443...
> > > > > > connected.
> > > > > > HTTP request sent, awaiting response... 404 Not Found
> > > > > > 2020-02-05 11:11:15 ERROR 404: Not Found.
> > > > > >
> > > > > > I will try to fix
> > > > > >
> > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn  wrote:
> > > > > > >
> > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays
> > > pull all dependencies from the system. Is there a known way to build & 
> > > test
> > > on OSX with the script and use conda for the requirements?
> > > > > > >
> > > > > > > Otherwise I probably need to invest some time to create such a way.
> > > > > > >
> > > > > > > Cheers
> > > > > > > Uwe
> > > > > > >
> > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release
> > > tag,
> > > > > > > > re-built the wheels using crossbow [2], and uploaded them to
> > > > > > > > bintray [3] (also removed win-py38m).
> > > > > > > >
> > > > > > > > Anyone who has voted after verifying the wheels, please re-run
> > > > > > > > the verification script again for the wheels and re-vote.
> > > > > > > >
> > > > > > > > Thanks, Krisztian
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > 

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Krisztián Szűcs
Testing on macOS Catalina

Binaries: OK

Wheels: OK
Verified on macOS and on Linux.
On Linux the verification script failed for Python 3.5 and the manylinux2010
and manylinux2014 wheels with an unsupported platform tag. I've manually checked
these wheels in the python:3.5 docker image, and the wheels were good
(this is automatically checked by crossbow too [1]). All other wheels were
passing using the verification script.

Source: OK
I had to revert the nvm path [2] to pass the js and integration tests and
force the glib test to use my system python instead of the conda one.

I vote with +1 (binding)

[1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568
[2]: 
https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2


On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson
 wrote:
>
> I re-verified the macOS wheels and they worked but I had to hard-code
> `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported
> previously. I tried to set that env var dynamically based on your current
> OS version but didn't succeed in getting it passed through to pytest,
> despite many attempts to `export` it; someone with better bash skills than
> I should probably add that to the script. FTR `defaults read loginwindow
> SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS
> version.
>
> Neal
>
> On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney  wrote:
>
> > +1 (binding)
> >
> > I was able to verify the Windows wheels with the following patch applied
> >
> > https://github.com/apache/arrow/pull/6364
> >
> > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs
> >  wrote:
> > >
> > > There were binary naming issues with the macosx and the win-cp38
> > > wheels.
> > > I've uploaded them, all of the wheels should be available now [1]
> > >
> > > Note that the newly built macosx wheels have 10_9 platform tag
> > > instead of 10_6, so the verification script must be updated [2] to
> > > verify the macosx wheels.
> > >
> > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > > [2]
> > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658
> > >
> > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > The wheel was built successfully and is available under the crossbow
> > > > releases. Something must have gone wrong during the download/upload
> > > > to bintray. I'm re-uploading the wheels, waiting for the network.
> > > >
> > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney 
> > wrote:
> > > > >
> > > > > The Windows wheel RC script is broken
> > > > >
> > > > > wget --no-check-certificate -O
> > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > >
> > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1
> > > > > 6.0-cp38-cp38m-win_amd64.whl   || EXIT /B 1
> > > > > --2020-02-05 11:11:15--
> > > > >
> > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > > Resolving bintray.com (bintray.com)... 75.126.208.206
> > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443...
> > > > > connected.
> > > > > HTTP request sent, awaiting response... 404 Not Found
> > > > > 2020-02-05 11:11:15 ERROR 404: Not Found.
> > > > >
> > > > > I will try to fix
> > > > >
> > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn  wrote:
> > > > > >
> > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays
> > pull all dependencies from the system. Is there a known way to build & test
> > on OSX with the script and use conda for the requirements?
> > > > > >
> > > > > > Otherwise I probably need to invest some time to create such a way.
> > > > > >
> > > > > > Cheers
> > > > > > Uwe
> > > > > >
> > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release
> > tag,
> > > > > > > re-built the wheels using crossbow [2], and uploaded them to
> > > > > > > bintray [3] (also removed win-py38m).
> > > > > > >
> > > > > > > Anyone who has voted after verifying the wheels, please re-run
> > > > > > > the verification script again for the wheels and re-vote.
> > > > > > >
> > > > > > > Thanks, Krisztian
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > https://github.com/apache/arrow/commit/67e34c53b3be4c88348369f8109626b4a8a997aa
> > > > > > > [2]
> > https://github.com/ursa-labs/crossbow/branches/all?query=build-733
> > > > > > > [3] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > > > > > >
> > > > > > > On Tue, Feb 4, 2020 at 7:08 PM Wes McKinney 
> > wrote:
> > > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Some patches were required to the verification scripts but I
> > have run:
> > > > > > > >
> > > > > > > > * Full source verification on Ubuntu 18.04
> > > > > > > > * Linux binary verification
> > > > > > > > * Source verification on 

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread David Li
Catching up on questions here...

> Typically you can solve this by having enough IO concurrency at once :-)
> I'm not sure having sophisticated global coordination (based on which
> algorithms) would bring anything.  Would you care to elaborate?

We aren't proposing *sophisticated* global coordination, rather, just
using a global pool with a global limit, so that a user doesn't
unintentionally start hundreds of requests in parallel, and so that
you can adjust the resource consumption/performance tradeoff.

Essentially, what our library does is maintain two pools (for I/O):
- One pool produces I/O requests, by going through the list of files,
fetching the Parquet footers, and queuing up I/O requests on the main
pool. (This uses a pool so we can fetch and parse metadata from
multiple Parquet files at once.)
- One pool serves I/O requests, by fetching chunks and placing them in
buffers inside the file object implementation.

The global concurrency manager additionally limits the second pool by
not servicing I/O requests for a file until all of the I/O requests
for previous files have at least started. (By just having lots of
concurrency, you might end up starving yourself by reading data you
don't want quite yet.)

Additionally, the global pool could still be a win for non-Parquet
files - an implementation can at least submit, say, an entire CSV file
as a "chunk" and have it read in the background.
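
To make the "global pool with a global limit" idea concrete, here is a minimal
sketch (illustrative only; this is not the code of our library and not an Arrow
API). A single instance would be shared process-wide so every file draws from
the same bounded set of I/O workers:

// A process-wide I/O pool with a fixed number of workers, shared by all
// files, so the total number of in-flight requests stays bounded.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class GlobalIoPool {
 public:
  explicit GlobalIoPool(std::size_t num_workers) {
    for (std::size_t i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  ~GlobalIoPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) worker.join();
  }

  // Queue an I/O request, e.g. "fetch bytes [offset, offset + length) of file X".
  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // perform the actual read
    }
  }

  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool done_ = false;
};

The per-file ordering constraint described above would sit in front of Submit(),
deciding when a given file's requests are allowed to enter the queue.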

> Actually, on a more high-level basis, is the goal to prefetch for
> sequential consumption of row groups?

At least for us, our query pattern is to sequentially consume row
groups from a large dataset, where we select a subset of columns and a
subset of the partition key range (usually a time range). Prefetching
speeds this up substantially, as does pipelining discovery of
files, I/O, and deserialization more generally.

> There are no situations where you would want to consume a scattered
> subset of row groups (e.g. predicate pushdown)?

With coalescing, this "automatically" gets optimized. If you happen to
need column chunks from separate row groups that are adjacent or close
on-disk, coalescing will still fetch them in a single IO call.

We found that having large row groups was more beneficial than small
row groups, since when you combine small row groups with column
selection, you end up with a lot of small non-adjacent column chunks -
which coalescing can't help with. The exact tradeoff depends on the
dataset and workload, of course.
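
To make the coalescing step concrete, here is a rough sketch (ByteRange,
CoalesceRanges and hole_size_limit are made-up names for illustration, not
Arrow APIs): ranges closer together than hole_size_limit are merged so they
can be fetched with a single request, at the cost of reading a few unneeded
bytes in between:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge byte ranges whose gaps are at most hole_size_limit bytes wide.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit) {
  if (ranges.empty()) return ranges;
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) { return a.offset < b.offset; });
  std::vector<ByteRange> out;
  out.push_back(ranges[0]);
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    ByteRange& last = out.back();
    const int64_t gap = ranges[i].offset - (last.offset + last.length);
    if (gap <= hole_size_limit) {
      // Reading the small "hole" is cheaper than paying the latency of an
      // extra request, so extend the previous range to cover this one.
      last.length = std::max(last.length,
                             ranges[i].offset + ranges[i].length - last.offset);
    } else {
      out.push_back(ranges[i]);
    }
  }
  return out;
}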

> This seems like too much to try to build into RandomAccessFile. I would
> suggest a class that wraps a random access file and manages cached segments
> and their lifetimes through explicit APIs.

A wrapper class seems ideal, especially as the logic is agnostic to
the storage backend (except for some parameters which can either be
hand-tuned or estimated on the fly). It also keeps the scope of the
changes down.

> Where to put the "async multiple range request" API is a separate question,
> though. Probably makes sense to start writing some working code and sort it
> out there.

We haven't looked in this direction much. Our designs are based around
thread pools partly because we wanted to avoid modifying the Parquet
and Arrow internals, instead choosing to modify the I/O layer to "keep
Parquet fed" as quickly as possible.

Overall, I recall there's an issue open for async APIs in
Arrow...perhaps we want to move that to a separate discussion, or on
the contrary, explore some experimental APIs here to inform the
overall design.

Thanks,
David

On 2/6/20, Wes McKinney  wrote:
> On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou  wrote:
>>
>>
>> Le 06/02/2020 à 20:20, Wes McKinney a écrit :
>> >> Actually, on a more high-level basis, is the goal to prefetch for
>> >> sequential consumption of row groups?
>> >>
>> >
>> > Essentially yes. One "easy" optimization is to prefetch the entire
>> > serialized row group. This is an evolution of that idea where we want
>> > to
>> > prefetch only the needed parts of a row group in a minimum number of IO
>> > calls (consider reading the first 10 columns from a file with 1000
>> > columns
>> > -- so we want to do one IO call instead of 10 like we do now).
>>
>> There are no situations where you would want to consume a scattered
>> subset of row groups (e.g. predicate pushdown)?
>
> There are. If it can be demonstrated that there are performance gains
> resulting from IO optimizations involving multiple row groups then I
> see no reason not to implement them.
>
>> Regards
>>
>> Antoine.
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou  wrote:
>
>
> Le 06/02/2020 à 20:20, Wes McKinney a écrit :
> >> Actually, on a more high-level basis, is the goal to prefetch for
> >> sequential consumption of row groups?
> >>
> >
> > Essentially yes. One "easy" optimization is to prefetch the entire
> > serialized row group. This is an evolution of that idea where we want to
> > prefetch only the needed parts of a row group in a minimum number of IO
> > calls (consider reading the first 10 columns from a file with 1000 columns
> > -- so we want to do one IO call instead of 10 like we do now).
>
> There are no situations where you would want to consume a scattered
> subset of row groups (e.g. predicate pushdown)?

There are. If it can be demonstrated that there are performance gains
resulting from IO optimizations involving multiple row groups then I
see no reason not to implement them.

> Regards
>
> Antoine.


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou


Le 06/02/2020 à 20:20, Wes McKinney a écrit :
>> Actually, on a more high-level basis, is the goal to prefetch for
>> sequential consumption of row groups?
>>
> 
> Essentially yes. One "easy" optimization is to prefetch the entire
> serialized row group. This is an evolution of that idea where we want to
> prefetch only the needed parts of a row group in a minimum number of IO
> calls (consider reading the first 10 columns from a file with 1000 columns
> -- so we want to do one IO call instead of 10 like we do now).

There are no situations where you would want to consume a scattered
subset of row groups (e.g. predicate pushdown)?

Regards

Antoine.


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou  wrote:

>
> Le 06/02/2020 à 19:40, Antoine Pitrou a écrit :
> >
> > Le 06/02/2020 à 19:37, Wes McKinney a écrit :
> >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou 
> wrote:
> >>
> >>> Le 06/02/2020 à 16:26, Wes McKinney a écrit :
> 
>  This seems useful, too. It becomes a question of where do you want to
>  manage the cached memory segments, however you obtain them. I'm
>  arguing that we should not have much custom code in the Parquet
>  library to manage the prefetched segments (and providing the correct
>  buffer slice to each column reader when they need it), and instead
>  encapsulate this logic so it can be reused.
> >>>
> >>> I see, so RandomAccessFile would have some associative caching logic to
> >>> find whether the exact requested range was cached and then return it to
> >>> the caller?  That sounds doable.  How is lifetime handled then?  Are
> >>> cached buffers kept on the RandomAccessFile until they are requested,
> at
> >>> which point their ownership is transferred to the caller?
> >>>
> >>
> >> This seems like too much to try to build into RandomAccessFile. I would
> >> suggest a class that wraps a random access file and manages cached
> segments
> >> and their lifetimes through explicit APIs.
> >
> > So Parquet would expect to receive that class rather than
> > RandomAccessFile?  Or it would grow separate paths for it?
>
> Actually, on a more high-level basis, is the goal to prefetch for
> sequential consumption of row groups?
>

Essentially yes. One "easy" optimization is to prefetch the entire
serialized row group. This is an evolution of that idea where we want to
prefetch only the needed parts of a row group in a minimum number of IO
calls (consider reading the first 10 columns from a file with 1000 columns
-- so we want to do one IO call instead of 10 like we do now).



> Regards
>
> Antoine.
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:41 PM Antoine Pitrou  wrote:

>
> Le 06/02/2020 à 19:37, Wes McKinney a écrit :
> > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou  wrote:
> >
> >> Le 06/02/2020 à 16:26, Wes McKinney a écrit :
> >>>
> >>> This seems useful, too. It becomes a question of where do you want to
> >>> manage the cached memory segments, however you obtain them. I'm
> >>> arguing that we should not have much custom code in the Parquet
> >>> library to manage the prefetched segments (and providing the correct
> >>> buffer slice to each column reader when they need it), and instead
> >>> encapsulate this logic so it can be reused.
> >>
> >> I see, so RandomAccessFile would have some associative caching logic to
> >> find whether the exact requested range was cached and then return it to
> >> the caller?  That sounds doable.  How is lifetime handled then?  Are
> >> cached buffers kept on the RandomAccessFile until they are requested, at
> >> which point their ownership is transferred to the caller?
> >>
> >
> > This seems like too much to try to build into RandomAccessFile. I would
> > suggest a class that wraps a random access file and manages cached
> segments
> > and their lifetimes through explicit APIs.
>
> So Parquet would expect to receive that class rather than
> RandomAccessFile?  Or it would grow separate paths for it?
>

If the user opts in to coalesced prefetching then the RowGroupReader would
instantiate the wrapper under the hood. Public APIs (aside from new APIs in
ReaderProperties for prefetching) would be unchanged.



>
>
>
> Regards
>
> Antoine.
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou


Le 06/02/2020 à 19:40, Antoine Pitrou a écrit :
> 
> Le 06/02/2020 à 19:37, Wes McKinney a écrit :
>> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou  wrote:
>>
>>> Le 06/02/2020 à 16:26, Wes McKinney a écrit :

 This seems useful, too. It becomes a question of where do you want to
 manage the cached memory segments, however you obtain them. I'm
 arguing that we should not have much custom code in the Parquet
 library to manage the prefetched segments (and providing the correct
 buffer slice to each column reader when they need it), and instead
 encapsulate this logic so it can be reused.
>>>
>>> I see, so RandomAccessFile would have some associative caching logic to
>>> find whether the exact requested range was cached and then return it to
>>> the caller?  That sounds doable.  How is lifetime handled then?  Are
>>> cached buffers kept on the RandomAccessFile until they are requested, at
>>> which point their ownership is transferred to the caller?
>>>
>>
>> This seems like too much to try to build into RandomAccessFile. I would
>> suggest a class that wraps a random access file and manages cached segments
>> and their lifetimes through explicit APIs.
> 
> So Parquet would expect to receive that class rather than
> RandomAccessFile?  Or it would grow separate paths for it?

Actually, on a more high-level basis, is the goal to prefetch for
sequential consumption of row groups?

Regards

Antoine.


Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-06 Thread Neal Richardson
I re-verified the macOS wheels and they worked but I had to hard-code
`MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported
previously. I tried to set that env var dynamically based on your current
OS version but didn't succeed in getting it passed through to pytest,
despite many attempts to `export` it; someone with better bash skills than
I should probably add that to the script. FTR `defaults read loginwindow
SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS
version.

Neal

On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney  wrote:

> +1 (binding)
>
> I was able to verify the Windows wheels with the following patch applied
>
> https://github.com/apache/arrow/pull/6364
>
> On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs
>  wrote:
> >
> > There were binary naming issues with the macosx and the win-cp38
> > wheels.
> > I've uploaded them, all of the wheels should be available now [1]
> >
> > Note that the newly built macosx wheels have 10_9 platform tag
> > instead of 10_6, so the verification script must be updated [2] to
> > verify the macosx wheels.
> >
> > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > [2]
> https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658
> >
> > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs
> >  wrote:
> > >
> > > The wheel was built successfully and is available under the crossbow
> > > releases. Something must have gone wrong during the download/upload
> > > to bintray. I'm re-uploading the wheels, waiting for the network.
> > >
> > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney 
> wrote:
> > > >
> > > > The Windows wheel RC script is broken
> > > >
> > > > wget --no-check-certificate -O
> pyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > >
> https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1
> > > > 6.0-cp38-cp38m-win_amd64.whl   || EXIT /B 1
> > > > --2020-02-05 11:11:15--
> > > >
> https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> > > > Resolving bintray.com (bintray.com)... 75.126.208.206
> > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443...
> > > > connected.
> > > > HTTP request sent, awaiting response... 404 Not Found
> > > > 2020-02-05 11:11:15 ERROR 404: Not Found.
> > > >
> > > > I will try to fix
> > > >
> > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn  wrote:
> > > > >
> > > > > I'm failing to verify C++ on macOS as it seems that we nowadays
> pull all dependencies from the system. Is there a known way to build & test
> on OSX with the script and use conda for the requirements?
> > > > >
> > > > > Otherwise I probably need to invest some time to create such a way.
> > > > >
> > > > > Cheers
> > > > > Uwe
> > > > >
> > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release
> tag,
> > > > > > re-built the wheels using crossbow [2], and uploaded them to
> > > > > > bintray [3] (also removed win-py38m).
> > > > > >
> > > > > > Anyone who has voted after verifying the wheels, please re-run
> > > > > > the verification script again for the wheels and re-vote.
> > > > > >
> > > > > > Thanks, Krisztian
> > > > > >
> > > > > > [1]
> > > > > >
> https://github.com/apache/arrow/commit/67e34c53b3be4c88348369f8109626b4a8a997aa
> > > > > > [2]
> https://github.com/ursa-labs/crossbow/branches/all?query=build-733
> > > > > > [3] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> > > > > >
> > > > > > On Tue, Feb 4, 2020 at 7:08 PM Wes McKinney 
> wrote:
> > > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Some patches were required to the verification scripts but I
> have run:
> > > > > > >
> > > > > > > * Full source verification on Ubuntu 18.04
> > > > > > > * Linux binary verification
> > > > > > > * Source verification on Windows 10 (needed ARROW-6757)
> > > > > > > * Windows binary verification. Note that Python 3.8 wheel is
> broken
> > > > > > > (see ARROW-7755). Whoever uploads the wheels to PyPI _SHOULD
> NOT_
> > > > > > > upload this 3.8 wheel until we know what's wrong (if we upload
> a
> > > > > > > broken wheel then `pip install pyarrow==0.16.0` will be
> permanently
> > > > > > > broken on Windows/Python 3.8)
> > > > > > >
> > > > > > > On Mon, Feb 3, 2020 at 9:26 PM Francois Saint-Jacques
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Tested on ubuntu 18.04 for the source release.
> > > > > > > >
> > > > > > > > On Mon, Feb 3, 2020 at 10:07 PM Francois Saint-Jacques
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Binaries verification didn't have any issues.
> > > > > > > > > Sources verification worked with some local environment
> hiccups
> > > > > > > > >
> > > > > > > > > François
> > > > > > > > >
> > > > > > > > > On Mon, Feb 3, 2020 at 8:46 PM Andy Grove <

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou


Le 06/02/2020 à 19:37, Wes McKinney a écrit :
> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou  wrote:
> 
>> Le 06/02/2020 à 16:26, Wes McKinney a écrit :
>>>
>>> This seems useful, too. It becomes a question of where do you want to
>>> manage the cached memory segments, however you obtain them. I'm
>>> arguing that we should not have much custom code in the Parquet
>>> library to manage the prefetched segments (and providing the correct
>>> buffer slice to each column reader when they need it), and instead
>>> encapsulate this logic so it can be reused.
>>
>> I see, so RandomAccessFile would have some associative caching logic to
>> find whether the exact requested range was cached and then return it to
>> the caller?  That sounds doable.  How is lifetime handled then?  Are
>> cached buffers kept on the RandomAccessFile until they are requested, at
>> which point their ownership is transferred to the caller?
>>
> 
> This seems like too much to try to build into RandomAccessFile. I would
> suggest a class that wraps a random access file and manages cached segments
> and their lifetimes through explicit APIs.

So Parquet would expect to receive that class rather than
RandomAccessFile?  Or it would grow separate paths for it?

Regards

Antoine.


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou  wrote:

>
> Le 06/02/2020 à 16:26, Wes McKinney a écrit :
> >
> > This seems useful, too. It becomes a question of where do you want to
> > manage the cached memory segments, however you obtain them. I'm
> > arguing that we should not have much custom code in the Parquet
> > library to manage the prefetched segments (and providing the correct
> > buffer slice to each column reader when they need it), and instead
> > encapsulate this logic so it can be reused.
>
> I see, so RandomAccessFile would have some associative caching logic to
> find whether the exact requested range was cached and then return it to
> the caller?  That sounds doable.  How is lifetime handled then?  Are
> cached buffers kept on the RandomAccessFile until they are requested, at
> which point their ownership is transferred to the caller?
>

This seems like too much to try to build into RandomAccessFile. I would
suggest a class that wraps a random access file and manages cached segments
and their lifetimes through explicit APIs.

Where to put the "async multiple range request" API is a separate question,
though. Probably makes sense to start writing some working code and sort it
out there.
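
To sketch the shape such a wrapper could take (FileLike is a stand-in for the
real RandomAccessFile interface and none of these signatures are Arrow's actual
APIs): the wrapper owns the prefetched buffers and serves reads out of them, so
cached segments simply live as long as the wrapper does:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the RandomAccessFile interface.
class FileLike {
 public:
  virtual ~FileLike() = default;
  virtual int64_t ReadAt(int64_t offset, int64_t length, void* out) = 0;
};

class CachingInputFile : public FileLike {
 public:
  explicit CachingInputFile(std::shared_ptr<FileLike> naked_file)
      : file_(std::move(naked_file)) {}

  // Fetch and retain the given (already coalesced) ranges as (offset, length).
  void CacheRanges(const std::vector<std::pair<int64_t, int64_t>>& ranges) {
    for (const auto& range : ranges) {
      std::vector<uint8_t> buf(static_cast<std::size_t>(range.second));
      const int64_t n = file_->ReadAt(range.first, range.second, buf.data());
      buf.resize(static_cast<std::size_t>(n));
      cache_[range.first] = std::move(buf);
    }
  }

  // Serve from a cached segment when possible, else fall back to the raw file.
  int64_t ReadAt(int64_t offset, int64_t length, void* out) override {
    auto it = cache_.upper_bound(offset);
    if (it != cache_.begin()) {
      --it;  // segment starting at or before `offset`
      const int64_t seg_offset = it->first;
      const std::vector<uint8_t>& seg = it->second;
      if (offset + length <= seg_offset + static_cast<int64_t>(seg.size())) {
        std::memcpy(out, seg.data() + (offset - seg_offset),
                    static_cast<std::size_t>(length));
        return length;
      }
    }
    return file_->ReadAt(offset, length, out);
  }

 private:
  std::shared_ptr<FileLike> file_;
  std::map<int64_t, std::vector<uint8_t>> cache_;  // segment offset -> bytes
};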


> Regards
>
> Antoine.
>


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou


Le 06/02/2020 à 16:26, Wes McKinney a écrit :
> 
> This seems useful, too. It becomes a question of where do you want to
> manage the cached memory segments, however you obtain them. I'm
> arguing that we should not have much custom code in the Parquet
> library to manage the prefetched segments (and providing the correct
> buffer slice to each column reader when they need it), and instead
> encapsulate this logic so it can be reused.

I see, so RandomAccessFile would have some associative caching logic to
find whether the exact requested range was cached and then return it to
the caller?  That sounds doable.  How is lifetime handled then?  Are
cached buffers kept on the RandomAccessFile until they are requested, at
which point their ownership is transferred to the caller?

Regards

Antoine.


[jira] [Created] (ARROW-7786) [R] Wire up check_metadata in Table.Equals method

2020-02-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7786:
--

 Summary: [R] Wire up check_metadata in Table.Equals method
 Key: ARROW-7786
 URL: https://issues.apache.org/jira/browse/ARROW-7786
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


See https://github.com/apache/arrow/pull/6318/files#r375404306. Followup to 
ARROW-7720.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou


Le 06/02/2020 à 17:07, Wes McKinney a écrit :
> In case folks are interested in how some other systems deal with IO
> management / scheduling, the comments in
> 
> https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h
> 
> and related files might be interesting

Thanks.  There's quite a lot of functionality.  It would be useful to
discuss which parts of that functionality are desirable, and which are
not.  For example, I don't think we should spend development time
writing a complex IO scheduler (using which heuristics?) like Impala
has, but that's my opinion :-)

Regards

Antoine.


> On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney  wrote:
>>
>> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou  wrote:
>>>
>>> On Wed, 5 Feb 2020 15:46:15 -0600
>>> Wes McKinney  wrote:

 I'll comment in more detail on some of the other items in due course,
 but I think this should be handled by an implementation of
 RandomAccessFile (that wraps a naked RandomAccessFile) with some
 additional methods, rather than adding this to the abstract
 RandomAccessFile interface, e.g.

 class CachingInputFile : public RandomAccessFile {
  public:
    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
    Status CacheRanges(...);
 };

 etc.
>>>
>>> IMHO it may be more beneficial to expose it as an asynchronous API on
>>> RandomAccessFile, for example:
>>> class RandomAccessFile {
>>>  public:
>>>   struct Range {
>>>     int64_t offset;
>>>     int64_t length;
>>>   };
>>>
>>>   std::vector<Promise<std::shared_ptr<Buffer>>>
>>>     ReadRangesAsync(std::vector<Range> ranges);
>>> };
>>>
>>>
>>> The reason is that some APIs such as the C++ AWS S3 API have their own
>>> async support, which may be beneficial to use over a generic Arrow
>>> thread-pool implementation.
>>>
>>> Also, by returning a Promise instead of simply caching the results, you
>>> make it easier to handle the lifetime of the results.
>>
>> This seems useful, too. It becomes a question of where do you want to
>> manage the cached memory segments, however you obtain them. I'm
>> arguing that we should not have much custom code in the Parquet
>> library to manage the prefetched segments (and providing the correct
>> buffer slice to each column reader when they need it), and instead
>> encapsulate this logic so it can be reused.
>>
>> The API I proposed was just a mockup, I agree it would make sense for
>> the prefetching to occur asynchronously so that a column reader can
>> proceed as soon as its coalesced chunk has been prefetched, rather
>> than having to wait synchronously for all prefetching to complete.
>>
>>>
>>> (Promise can be something like std::future<std::shared_ptr<Buffer>>, though
>>> std::future<> has annoying limitations and we may want to write our own
>>> instead)
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>


[jira] [Created] (ARROW-7785) [C++] sparse_tensor.cc is extremely slow to compile

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7785:
-

 Summary: [C++] sparse_tensor.cc is extremely slow to compile
 Key: ARROW-7785
 URL: https://issues.apache.org/jira/browse/ARROW-7785
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This comes up especially when doing an optimized build. {{sparse_tensor.cc}} is 
always enabled even if all components are disabled, and it takes multiple 
seconds to compile.

Using [ClangBuildAnalyzer|https://github.com/aras-p/ClangBuildAnalyzer] I get 
the following results:
{code}
 Files that took longest to codegen (compiler backend):
 66372 ms: 
build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o
 16457 ms: 
build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o
  6283 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
  5284 ms: 
build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
  5090 ms: 
build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
In case folks are interested in how some other systems deal with IO
management / scheduling, the comments in

https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h

and related files might be interesting

On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney  wrote:
>
> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou  wrote:
> >
> > On Wed, 5 Feb 2020 15:46:15 -0600
> > Wes McKinney  wrote:
> > >
> > > I'll comment in more detail on some of the other items in due course,
> > > but I think this should be handled by an implementation of
> > > RandomAccessFile (that wraps a naked RandomAccessFile) with some
> > > additional methods, rather than adding this to the abstract
> > > RandomAccessFile interface, e.g.
> > >
> > > class CachingInputFile : public RandomAccessFile {
> > >  public:
> > >    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
> > >    Status CacheRanges(...);
> > > };
> > >
> > > etc.
> >
> > IMHO it may be more beneficial to expose it as an asynchronous API on
> > RandomAccessFile, for example:
> > class RandomAccessFile {
> >  public:
> >   struct Range {
> >     int64_t offset;
> >     int64_t length;
> >   };
> >
> >   std::vector<Promise<std::shared_ptr<Buffer>>>
> >     ReadRangesAsync(std::vector<Range> ranges);
> > };
> >
> >
> > The reason is that some APIs such as the C++ AWS S3 API have their own
> > async support, which may be beneficial to use over a generic Arrow
> > thread-pool implementation.
> >
> > Also, by returning a Promise instead of simply caching the results, you
> > make it easier to handle the lifetime of the results.
>
> This seems useful, too. It becomes a question of where do you want to
> manage the cached memory segments, however you obtain them. I'm
> arguing that we should not have much custom code in the Parquet
> library to manage the prefetched segments (and providing the correct
> buffer slice to each column reader when they need it), and instead
> encapsulate this logic so it can be reused.
>
> The API I proposed was just a mockup, I agree it would make sense for
> the prefetching to occur asynchronously so that a column reader can
> proceed as soon as its coalesced chunk has been prefetched, rather
> than having to wait synchronously for all prefetching to complete.
>
> >
> > (Promise can be something like std::future<std::shared_ptr<Buffer>>, though
> > std::future<> has annoying limitations and we may want to write our own
> > instead)
> >
> > Regards
> >
> > Antoine.
> >
> >


[jira] [Created] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7784:
-

 Summary: [C++] diff.cc is extremely slow to compile
 Key: ARROW-7784
 URL: https://issues.apache.org/jira/browse/ARROW-7784
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This comes up especially when doing an optimized build. {{diff.cc}} is always 
enabled even if all components are disabled, and it takes multiple seconds to 
compile. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Wes McKinney
On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou  wrote:
>
> On Wed, 5 Feb 2020 15:46:15 -0600
> Wes McKinney  wrote:
> >
> > I'll comment in more detail on some of the other items in due course,
> > but I think this should be handled by an implementation of
> > RandomAccessFile (that wraps a naked RandomAccessFile) with some
> > additional methods, rather than adding this to the abstract
> > RandomAccessFile interface, e.g.
> >
> > class CachingInputFile : public RandomAccessFile {
> >  public:
> >    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
> >    Status CacheRanges(...);
> > };
> >
> > etc.
>
> IMHO it may be more beneficial to expose it as an asynchronous API on
> RandomAccessFile, for example:
> class RandomAccessFile {
>  public:
>   struct Range {
>     int64_t offset;
>     int64_t length;
>   };
>
>   std::vector<Promise<std::shared_ptr<Buffer>>>
>     ReadRangesAsync(std::vector<Range> ranges);
> };
>
>
> The reason is that some APIs such as the C++ AWS S3 API have their own
> async support, which may be beneficial to use over a generic Arrow
> thread-pool implementation.
>
> Also, by returning a Promise instead of simply caching the results, you
> make it easier to handle the lifetime of the results.

This seems useful, too. It becomes a question of where do you want to
manage the cached memory segments, however you obtain them. I'm
arguing that we should not have much custom code in the Parquet
library to manage the prefetched segments (and providing the correct
buffer slice to each column reader when they need it), and instead
encapsulate this logic so it can be reused.

The API I proposed was just a mockup, I agree it would make sense for
the prefetching to occur asynchronously so that a column reader can
proceed as soon as its coalesced chunk has been prefetched, rather
than having to wait synchronously for all prefetching to complete.

>
> (Promise can be something like std::future<std::shared_ptr<Buffer>>, though
> std::future<> has annoying limitations and we may want to write our own
> instead)
>
> Regards
>
> Antoine.
>
>


[jira] [Created] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE

2020-02-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7783:
-

 Summary: [C++] ARROW_DATASET should enable ARROW_COMPUTE
 Key: ARROW-7783
 URL: https://issues.apache.org/jira/browse/ARROW-7783
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Currently, passing {{-DARROW_DATASET=ON}} to CMake doesn't enable ARROW_COMPUTE, 
which leads to linker errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7782) Losing index information when using write_to_dataset with partition_cols

2020-02-06 Thread Ludwik Bielczynski (Jira)
Ludwik Bielczynski created ARROW-7782:
-

 Summary: Losing index information when using write_to_dataset with 
partition_cols
 Key: ARROW-7782
 URL: https://issues.apache.org/jira/browse/ARROW-7782
 Project: Apache Arrow
  Issue Type: Bug
 Environment: pyarrow==0.15.1
Reporter: Ludwik Bielczynski


One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} 
with the partition_cols argument given. Here is a minimal example 
which shows the issue:
{code:python}
from pathlib import Path

import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset

path = Path('/home/ludwik/Documents/YieldPlanet/research/trials')
file_name = 'trial_pq.parquet'

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))

table = Table.from_pandas(df)
write_to_dataset(table, str(path / file_name), partition_cols=['B'],
                 partition_filename_cb=None, filesystem=None)
{code}
 

The issue is rather important for pandas and dask users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault

2020-02-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7781:


 Summary: [C++][Dataset] Filtering on a non-existent column gives a 
segfault
 Key: ARROW-7781
 URL: https://issues.apache.org/jira/browse/ARROW-7781
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Example with python code:

{code}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3]})

In [3]: df.to_parquet("test-filter-crash.parquet")

In [4]: import pyarrow.dataset as ds

In [5]: dataset = ds.dataset("test-filter-crash.parquet")

In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas()
Out[6]:
   a
0  2
1  3

In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas()
../src/arrow/dataset/filter.cc:929:  Check failed: _s.ok() Operation failed: 
maybe_value.status()
Bad status: Invalid: attempting to cast non-null scalar to NullScalar
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]

[jira] [Created] (ARROW-7780) [Release] Fix Windows wheel RC verification script given lack of "m" ABI tag in Python 3.8

2020-02-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7780:
--

 Summary: [Release] Fix Windows wheel RC verification script given 
lack of "m" ABI tag in Python 3.8
 Key: ARROW-7780
 URL: https://issues.apache.org/jira/browse/ARROW-7780
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Krisztian Szucs


Python 3.8 wheels don't have the "m" postfix in their ABI tag.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++] Arrow added to OSS-Fuzz

2020-02-06 Thread Antoine Pitrou


Hello,

A quick update: since Arrow C++ started being fuzzed in OSS-Fuzz, 41
issues (usually crashes) on invalid input have been found, 35 of which
have already been corrected.

We plan to expand the fuzzed areas to cover Parquet files, as well as
serialized Tensor and SparseTensor data.

Regards

Antoine.


On Wed, 15 Jan 2020 19:59:24 +0100
Antoine Pitrou  wrote:
> Hello,
> 
> I would like to announce that Arrow has been accepted on the OSS-Fuzz
> infrastructure (a continuous fuzzing infrastructure operated by Google):
> https://github.com/google/oss-fuzz/pull/3233
> 
> Right now the only fuzz targets are the C++ stream and file IPC readers.
> The first build results haven't appeared yet.  They will appear on
> https://oss-fuzz.com/ .   Access needs a Google account, and you also
> need to be listed in the "auto_ccs" here:
> https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml
> 
> (if you are a PMC or core developer and want to be listed, just open a
> PR to the oss-fuzz repository)
> 
> Once we confirm the first builds succeed on OSS-Fuzz, we should probably
> add more fuzz targets (for example for reading Parquet files).
> 
> Regards
> 
> Antoine.
> 





Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 16:37:17 -0500
David Li  wrote:
> 
> As a separate step, prefetching/caching should also make use of a
> global (or otherwise shared) IO thread pool, so that parallel reads of
> different files implicitly coordinate work with each other as well.
> Then, you could queue up reads of several Parquet files, such that a
> slow network call for one file doesn't block progress for other files,
> without issuing reads for all of these files at once.

Typically you can solve this by having enough IO concurrency at once :-)
I'm not sure having sophisticated global coordination (based on which
algorithms) would bring anything.  Would you care to elaborate?

> It's unclear to me what readahead at the record batch level would
> accomplish - Parquet reads each column chunk in a row group as a
> whole, and if the row groups are large, then multiple record batches
> would fall in the same row group, so then we wouldn't gain any
> parallelism, no? (Admittedly, I'm not familiar with the internals
> here.)

Well, if each row group is read as a whole, then readahead can be
applied at the row group level (e.g. read K row groups in advance).
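
For example, a K-deep readahead loop could look roughly like this sketch
(RowGroupData and ReadRowGroup are placeholders for illustration, not actual
Parquet APIs):

#include <deque>
#include <future>
#include <vector>

struct RowGroupData { std::vector<char> bytes; };

// Placeholder for a blocking read of row group i.
RowGroupData ReadRowGroup(int i) {
  return RowGroupData{std::vector<char>(1024, static_cast<char>(i))};
}

void ConsumeAllRowGroups(int num_row_groups, int k) {
  std::deque<std::future<RowGroupData>> inflight;
  int next = 0;
  // Keep up to K reads outstanding ahead of the consumer.
  while (next < num_row_groups && static_cast<int>(inflight.size()) < k) {
    inflight.push_back(std::async(std::launch::async, ReadRowGroup, next++));
  }
  while (!inflight.empty()) {
    RowGroupData rg = inflight.front().get();  // oldest row group, in order
    inflight.pop_front();
    if (next < num_row_groups) {
      inflight.push_back(std::async(std::launch::async, ReadRowGroup, next++));
    }
    // ... deserialize and process `rg` while the later reads proceed ...
  }
}

int main() { ConsumeAllRowGroups(/*num_row_groups=*/8, /*k=*/3); }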

Regards

Antoine.




Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-02-06 Thread Antoine Pitrou
On Wed, 5 Feb 2020 15:46:15 -0600
Wes McKinney  wrote:
> 
> I'll comment in more detail on some of the other items in due course,
> but I think this should be handled by an implementation of
> RandomAccessFile (that wraps a naked RandomAccessFile) with some
> additional methods, rather than adding this to the abstract
> RandomAccessFile interface, e.g.
> 
> class CachingInputFile : public RandomAccessFile {
>  public:
>    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file);
>    Status CacheRanges(...);
> };
> 
> etc.

IMHO it may be more beneficial to expose it as an asynchronous API on
RandomAccessFile, for example:

class RandomAccessFile {
 public:
  struct Range {
    int64_t offset;
    int64_t length;
  };

  std::vector<Promise<std::shared_ptr<Buffer>>>
    ReadRangesAsync(std::vector<Range> ranges);
};


The reason is that some APIs such as the C++ AWS S3 API have their own
async support, which may be beneficial to use over a generic Arrow
thread-pool implementation.

Also, by returning a Promise instead of simply caching the results, you
make it easier to handle the lifetime of the results.


(Promise can be something like std::future<std::shared_ptr<Buffer>>, though
std::future<> has annoying limitations and we may want to write our own
instead)
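
For illustration, a generic thread-pool-backed version of the above could look
like the following sketch, with std::future standing in for the Promise
placeholder (Buffer and the signatures here are assumptions, not Arrow's actual
classes); an S3-backed file would instead override it to use the SDK's own
async support:

#include <cstdint>
#include <future>
#include <memory>
#include <vector>

struct Buffer { std::vector<uint8_t> data; };

class RandomAccessFile {
 public:
  struct Range {
    int64_t offset;
    int64_t length;
  };

  virtual ~RandomAccessFile() = default;
  virtual std::shared_ptr<Buffer> ReadAt(int64_t offset, int64_t length) = 0;

  // Default: schedule each range on the generic async machinery.
  virtual std::vector<std::future<std::shared_ptr<Buffer>>> ReadRangesAsync(
      std::vector<Range> ranges) {
    std::vector<std::future<std::shared_ptr<Buffer>>> futures;
    futures.reserve(ranges.size());
    for (const Range& r : ranges) {
      futures.push_back(std::async(
          std::launch::async, [this, r] { return ReadAt(r.offset, r.length); }));
    }
    return futures;
  }
};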

Regards

Antoine.