Re: [VOTE] Release Apache Arrow 0.16.0 - RC2
The VOTE carries with 4 binding +1 votes, 3 non-binding +1 votes and one binding +0 vote. I'm starting the post-release tasks, if anyone wants to help please let me know. On Fri, Feb 7, 2020 at 12:25 AM Krisztián Szűcs wrote: > > So far we have the following votes: > > +0 (binding) > +1 (binding) > +1 (non-binding) > +1 (binding) > +1 (non-binding) > +1 (binding) > +1 (non-binding) > +1 (binding) > > 4 +1 (binding) > 3 +1 (non-binding) > > I'm waiting for votes until tomorrow morning (UTC), then I'm closing the VOTE. > > Thanks everyone! > > - Krisztian > > On Fri, Feb 7, 2020 at 12:06 AM Krisztián Szűcs > wrote: > > > > Testing on macOS Catalina > > > > Binaries: OK > > > > Wheels: OK > > Verified on macOS and on Linux. > > On linux the verification script has failed for python 3.5 and manylinux2010 > > and manylinux2014 with unsupported platform tag. I've manually checked > > these wheels in the python:3.5 docker image, and the wheels were good > > (this is automatically checked by crossbow too [1]). All other wheels were > > passing using the verification script. > > > > Source: OK > > I had to revert the nvm path [2] to pass the js and integration tests and > > force the glib test to use my system python instead of the conda one. > > > > I vote with +1 (binding) > > > > [1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568 > > [2]: > > https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2 > > > > > > On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson > > wrote: > > > > > > I re-verified the macOS wheels and they worked but I had to hard-code > > > `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported > > > previously. I tried to set that env var dynamically based on your current > > > OS version but didn't succeed in getting it passed through to pytest, > > > despite many attempts to `export` it; someone with better bash skills than > > > I should probably add that to the script. FTR `defaults read loginwindow > > > SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS > > > version. > > > > > > Neal > > > > > > On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney wrote: > > > > > > > +1 (binding) > > > > > > > > I was able to verify the Windows wheels with the following patch applied > > > > > > > > https://github.com/apache/arrow/pull/6364 > > > > > > > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs > > > > wrote: > > > > > > > > > > There were binary naming issues with the macosx and the win-cp38 > > > > > wheels. > > > > > I've uploaded them, all of the wheels should be available now [1] > > > > > > > > > > Note that the newly built macosx wheels have 10_9 platform tag > > > > > instead of 10_6, so the verification script must be updated [2] to > > > > > verify the macosx wheels. > > > > > > > > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > > > [2] > > > > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658 > > > > > > > > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs > > > > > wrote: > > > > > > > > > > > > The wheel was built successfully and available under the crossbow > > > > > > releases. Something must have gone wrong during download/upload > > > > > > to bintray. I'm re-uploading the wheels again, waiting for the > > > > > > network. 
> > > > > > > > > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney > > > > wrote: > > > > > > > > > > > > > > The Windows wheel RC script is broken > > > > > > > > > > > > > > wget --no-check-certificate -O > > > > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1 > > > > > > > 6.0-cp38-cp38m-win_amd64.whl || EXIT /B 1 > > > > > > > --2020-02-05 11:11:15-- > > > > > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > > > Resolving bintray.com (bintray.com)... 75.126.208.206 > > > > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443... > > > > > > > connected. > > > > > > > HTTP request sent, awaiting response... 404 Not Found > > > > > > > 2020-02-05 11:11:15 ERROR 404: Not Found. > > > > > > > > > > > > > > I will try to fix > > > > > > > > > > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn > > > > > > > wrote: > > > > > > > > > > > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays > > > > pull all dependencies from the system. Is there a known way to build & > > > > test > > > > on OSX with the script and use conda for the requirements? > > > > > > > > > > > > > > > > Otherwise I probably need to investe to create such a way. > > > > > > > > > > > > > > > > Cheers > > > > > > > > Uwe > > > > > > > > > > > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote: > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I've cherry-picked the
[jira] [Created] (ARROW-7788) Add schema conversion support for map type
Micah Kornfield created ARROW-7788: -- Summary: Add schema conversion support for map type Key: ARROW-7788 URL: https://issues.apache.org/jira/browse/ARROW-7788 Project: Apache Arrow Issue Type: Sub-task Reporter: Micah Kornfield Assignee: Micah Kornfield There is also some other cleanup that is probably worth doing: 1. Adding "large types" 2. Adding a flag to support the Parquet spec's required naming for list types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] Release Apache Arrow 0.16.0 - RC2
So far we have the following votes: +0 (binding) +1 (binding) +1 (non-binding) +1 (binding) +1 (non-binding) +1 (binding) +1 (non-binding) +1 (binding) 4 +1 (binding) 3 +1 (non-binding) I'm waiting for votes until tomorrow morning (UTC), then I'm closing the VOTE. Thanks everyone! - Krisztian On Fri, Feb 7, 2020 at 12:06 AM Krisztián Szűcs wrote: > > Testing on macOS Catalina > > Binaries: OK > > Wheels: OK > Verified on macOS and on Linux. > On linux the verification script has failed for python 3.5 and manylinux2010 > and manylinux2014 with unsupported platform tag. I've manually checked > these wheels in the python:3.5 docker image, and the wheels were good > (this is automatically checked by crossbow too [1]). All other wheels were > passing using the verification script. > > Source: OK > I had to revert the nvm path [2] to pass the js and integration tests and > force the glib test to use my system python instead of the conda one. > > I vote with +1 (binding) > > [1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568 > [2]: > https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2 > > > On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson > wrote: > > > > I re-verified the macOS wheels and they worked but I had to hard-code > > `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported > > previously. I tried to set that env var dynamically based on your current > > OS version but didn't succeed in getting it passed through to pytest, > > despite many attempts to `export` it; someone with better bash skills than > > I should probably add that to the script. FTR `defaults read loginwindow > > SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS > > version. > > > > Neal > > > > On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney wrote: > > > > > +1 (binding) > > > > > > I was able to verify the Windows wheels with the following patch applied > > > > > > https://github.com/apache/arrow/pull/6364 > > > > > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs > > > wrote: > > > > > > > > There were binary naming issues with the macosx and the win-cp38 > > > > wheels. > > > > I've uploaded them, all of the wheels should be available now [1] > > > > > > > > Note that the newly built macosx wheels have 10_9 platform tag > > > > instead of 10_6, so the verification script must be updated [2] to > > > > verify the macosx wheels. > > > > > > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > > [2] > > > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658 > > > > > > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs > > > > wrote: > > > > > > > > > > The wheel was built successfully and available under the crossbow > > > > > releases. Something must have gone wrong during download/upload > > > > > to bintray. I'm re-uploading the wheels again, waiting for the > > > > > network. 
> > > > > > > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney > > > wrote: > > > > > > > > > > > > The Windows wheel RC script is broken > > > > > > > > > > > > wget --no-check-certificate -O > > > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1 > > > > > > 6.0-cp38-cp38m-win_amd64.whl || EXIT /B 1 > > > > > > --2020-02-05 11:11:15-- > > > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > > Resolving bintray.com (bintray.com)... 75.126.208.206 > > > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443... > > > > > > connected. > > > > > > HTTP request sent, awaiting response... 404 Not Found > > > > > > 2020-02-05 11:11:15 ERROR 404: Not Found. > > > > > > > > > > > > I will try to fix > > > > > > > > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn wrote: > > > > > > > > > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays > > > pull all dependencies from the system. Is there a known way to build & > > > test > > > on OSX with the script and use conda for the requirements? > > > > > > > > > > > > > > Otherwise I probably need to investe to create such a way. > > > > > > > > > > > > > > Cheers > > > > > > > Uwe > > > > > > > > > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release > > > tag, > > > > > > > > re-built the wheels using crossbow [2], and uploaded them to > > > > > > > > bintray [3] (also removed win-py38m). > > > > > > > > > > > > > > > > Anyone who has voted after verifying the wheels, please re-run > > > > > > > > the verification script again for the wheels and re-vote. > > > > > > > > > > > > > > > > Thanks, Krisztian > > > > > > > > > > > > > > > > [1] > > > > > > > > > > >
Re: [VOTE] Release Apache Arrow 0.16.0 - RC2
Testing on macOS Catalina Binaries: OK Wheels: OK Verified on macOS and on Linux. On linux the verification script has failed for python 3.5 and manylinux2010 and manylinux2014 with unsupported platform tag. I've manually checked these wheels in the python:3.5 docker image, and the wheels were good (this is automatically checked by crossbow too [1]). All other wheels were passing using the verification script. Source: OK I had to revert the nvm path [2] to pass the js and integration tests and force the glib test to use my system python instead of the conda one. I vote with +1 (binding) [1]: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L568 [2]: https://github.com/apache/arrow/commit/37434fb34a1f2cd5273092ed3e1c61db90bb4dd2 On Thu, Feb 6, 2020 at 7:42 PM Neal Richardson wrote: > > I re-verified the macOS wheels and they worked but I had to hard-code > `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported > previously. I tried to set that env var dynamically based on your current > OS version but didn't succeed in getting it passed through to pytest, > despite many attempts to `export` it; someone with better bash skills than > I should probably add that to the script. FTR `defaults read loginwindow > SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS > version. > > Neal > > On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney wrote: > > > +1 (binding) > > > > I was able to verify the Windows wheels with the following patch applied > > > > https://github.com/apache/arrow/pull/6364 > > > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs > > wrote: > > > > > > There were binary naming issues with the macosx and the win-cp38 > > > wheels. > > > I've uploaded them, all of the wheels should be available now [1] > > > > > > Note that the newly built macosx wheels have 10_9 platform tag > > > instead of 10_6, so the verification script must be updated [2] to > > > verify the macosx wheels. > > > > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > [2] > > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658 > > > > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs > > > wrote: > > > > > > > > The wheel was built successfully and available under the crossbow > > > > releases. Something must have gone wrong during download/upload > > > > to bintray. I'm re-uploading the wheels again, waiting for the network. > > > > > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney > > wrote: > > > > > > > > > > The Windows wheel RC script is broken > > > > > > > > > > wget --no-check-certificate -O > > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1 > > > > > 6.0-cp38-cp38m-win_amd64.whl || EXIT /B 1 > > > > > --2020-02-05 11:11:15-- > > > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > Resolving bintray.com (bintray.com)... 75.126.208.206 > > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443... > > > > > connected. > > > > > HTTP request sent, awaiting response... 404 Not Found > > > > > 2020-02-05 11:11:15 ERROR 404: Not Found. > > > > > > > > > > I will try to fix > > > > > > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn wrote: > > > > > > > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays > > pull all dependencies from the system. 
Is there a known way to build & test > > on OSX with the script and use conda for the requirements? > > > > > > > > > > > > Otherwise I probably need to investe to create such a way. > > > > > > > > > > > > Cheers > > > > > > Uwe > > > > > > > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote: > > > > > > > Hi, > > > > > > > > > > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release > > tag, > > > > > > > re-built the wheels using crossbow [2], and uploaded them to > > > > > > > bintray [3] (also removed win-py38m). > > > > > > > > > > > > > > Anyone who has voted after verifying the wheels, please re-run > > > > > > > the verification script again for the wheels and re-vote. > > > > > > > > > > > > > > Thanks, Krisztian > > > > > > > > > > > > > > [1] > > > > > > > > > https://github.com/apache/arrow/commit/67e34c53b3be4c88348369f8109626b4a8a997aa > > > > > > > [2] > > https://github.com/ursa-labs/crossbow/branches/all?query=build-733 > > > > > > > [3] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > > > > > > > > > > > > On Tue, Feb 4, 2020 at 7:08 PM Wes McKinney > > wrote: > > > > > > > > > > > > > > > > +1 (binding) > > > > > > > > > > > > > > > > Some patches were required to the verification scripts but I > > have run: > > > > > > > > > > > > > > > > * Full source verification on Ubuntu 18.04 > > > > > > > > * Linux binary verification > > > > > > > > * Source verification on
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Catching up on questions here... > Typically you can solve this by having enough IO concurrency at once :-) > I'm not sure having sophisticated global coordination (based on which > algorithms) would bring anything. Would you care to elaborate? We aren't proposing *sophisticated* global coordination, rather, just using a global pool with a global limit, so that a user doesn't unintentionally start hundreds of requests in parallel, and so that you can adjust the resource consumption/performance tradeoff. Essentially, what our library does is maintain two pools (for I/O): - One pool produces I/O requests, by going through the list of files, fetching the Parquet footers, and queuing up I/O requests on the main pool. (This uses a pool so we can fetch and parse metadata from multiple Parquet files at once.) - One pool serves I/O requests, by fetching chunks and placing them in buffers inside the file object implementation. The global concurrency manager additionally limits the second pool by not servicing I/O requests for a file until all of the I/O requests for previous files have at least started. (By just having lots of concurrency, you might end up starving yourself by reading data you don't want quite yet.) Additionally, the global pool could still be a win for non-Parquet files - an implementation can at least submit, say, an entire CSV file as a "chunk" and have it read in the background. > Actually, on a more high-level basis, is the goal to prefetch for > sequential consumption of row groups? At least for us, our query pattern is to sequentially consume row groups from a large dataset, where we select a subset of columns and a subset of the partition key range (usually time range). Prefetching speeds this up substantially, or in general, pipelining discovery of files, I/O, and deserialization. > There are no situations where you would want to consume a scattered > subset of row groups (e.g. predicate pushdown)? With coalescing, this "automatically" gets optimized. If you happen to need column chunks from separate row groups that are adjacent or close on-disk, coalescing will still fetch them in a single IO call. We found that having large row groups was more beneficial than small row groups, since when you combine small row groups with column selection, you end up with a lot of small non-adjacent column chunks - which coalescing can't help with. The exact tradeoff depends on the dataset and workload, of course. > This seems like too much to try to build into RandomAccessFile. I would > suggest a class that wraps a random access file and manages cached segments > and their lifetimes through explicit APIs. A wrapper class seems ideal, especially as the logic is agnostic to the storage backend (except for some parameters which can either be hand-tuned or estimated on the fly). It also keeps the scope of the changes down. > Where to put the "async multiple range request" API is a separate question, > though. Probably makes sense to start writing some working code and sort it > out there. We haven't looked in this direction much. Our designs are based around thread pools partly because we wanted to avoid modifying the Parquet and Arrow internals, instead choosing to modify the I/O layer to "keep Parquet fed" as quickly as possible. Overall, I recall there's an issue open for async APIs in Arrow...perhaps we want to move that to a separate discussion, or on the contrary, explore some experimental APIs here to inform the overall design. 
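To make the shape of this concrete, here is a rough, self-contained sketch of the two-stage pipeline (all names here - Footer, ReadRequest, ParseFooter, PlanColumnChunkReads, FetchRange - are placeholders rather than Arrow or our library's APIs, and std::async stands in for the two pools):

#include <cstdint>
#include <future>
#include <string>
#include <vector>

struct Footer {};                                        // parsed Parquet footer (stand-in)
struct ReadRequest { int64_t offset; int64_t length; };  // one planned byte range
struct Buffer {};                                        // fetched bytes (stand-in)

Footer ParseFooter(const std::string& /*path*/) { return Footer{}; }
std::vector<ReadRequest> PlanColumnChunkReads(const Footer&) { return {}; }
Buffer FetchRange(const std::string& /*path*/, const ReadRequest&) { return Buffer{}; }

// Stage 1 ("metadata pool"): fetch and parse footers, plan the byte ranges for
// the selected columns.  Stage 2 ("IO pool"): fetch those ranges in the
// background while the consumer is still decoding earlier files.
std::vector<std::future<std::vector<Buffer>>> PrefetchDataset(
    const std::vector<std::string>& paths) {
  std::vector<std::future<std::vector<Buffer>>> per_file;
  for (const auto& path : paths) {
    per_file.push_back(std::async(std::launch::async, [path] {
      Footer footer = ParseFooter(path);
      std::vector<Buffer> buffers;
      for (const auto& req : PlanColumnChunkReads(footer)) {
        buffers.push_back(FetchRange(path, req));  // a real implementation would
      }                                            // fan these out onto the IO pool
      return buffers;
    }));
  }
  return per_file;  // consumed with .get() in file order; later files keep loading
}

A real implementation would additionally bound the number of in-flight requests against a global limit, which is the resource/performance knob described above.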
Thanks, David On 2/6/20, Wes McKinney wrote: > On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou wrote: >> >> >> Le 06/02/2020 à 20:20, Wes McKinney a écrit : >> >> Actually, on a more high-level basis, is the goal to prefetch for >> >> sequential consumption of row groups? >> >> >> > >> > Essentially yes. One "easy" optimization is to prefetch the entire >> > serialized row group. This is an evolution of that idea where we want >> > to >> > prefetch only the needed parts of a row group in a minimum number of IO >> > calls (consider reading the first 10 columns from a file with 1000 >> > columns >> > -- so we want to do one IO call instead of 10 like we do now). >> >> There are no situations where you would want to consume a scattered >> subset of row groups (e.g. predicate pushdown)? > > There are. If it can be demonstrated that there are performance gains > resulting from IO optimizations involving multiple row groups then I > see no reason not to implement them. > >> Regards >> >> Antoine. >
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Thu, Feb 6, 2020 at 1:30 PM Antoine Pitrou wrote: > > > Le 06/02/2020 à 20:20, Wes McKinney a écrit : > >> Actually, on a more high-level basis, is the goal to prefetch for > >> sequential consumption of row groups? > >> > > > > Essentially yes. One "easy" optimization is to prefetch the entire > > serialized row group. This is an evolution of that idea where we want to > > prefetch only the needed parts of a row group in a minimum number of IO > > calls (consider reading the first 10 columns from a file with 1000 columns > > -- so we want to do one IO call instead of 10 like we do now). > > There are no situations where you would want to consume a scattered > subset of row groups (e.g. predicate pushdown)? There are. If it can be demonstrated that there are performance gains resulting from IO optimizations involving multiple row groups then I see no reason not to implement them. > Regards > > Antoine.
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Le 06/02/2020 à 20:20, Wes McKinney a écrit : >> Actually, on a more high-level basis, is the goal to prefetch for >> sequential consumption of row groups? >> > > Essentially yes. One "easy" optimization is to prefetch the entire > serialized row group. This is an evolution of that idea where we want to > prefetch only the needed parts of a row group in a minimum number of IO > calls (consider reading the first 10 columns from a file with 1000 columns > -- so we want to do one IO call instead of 10 like we do now). There are no situations where you would want to consume a scattered subset of row groups (e.g. predicate pushdown)? Regards Antoine.
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou wrote: > > Le 06/02/2020 à 19:40, Antoine Pitrou a écrit : > > > > Le 06/02/2020 à 19:37, Wes McKinney a écrit : > >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou > wrote: > >> > >>> Le 06/02/2020 à 16:26, Wes McKinney a écrit : > > This seems useful, too. It becomes a question of where do you want to > manage the cached memory segments, however you obtain them. I'm > arguing that we should not have much custom code in the Parquet > library to manage the prefetched segments (and providing the correct > buffer slice to each column reader when they need it), and instead > encapsulate this logic so it can be reused. > >>> > >>> I see, so RandomAccessFile would have some associative caching logic to > >>> find whether the exact requested range was cached and then return it to > >>> the caller? That sounds doable. How is lifetime handled then? Are > >>> cached buffers kept on the RandomAccessFile until they are requested, > at > >>> which point their ownership is transferred to the caller? > >>> > >> > >> This seems like too much to try to build into RandomAccessFile. I would > >> suggest a class that wraps a random access file and manages cached > segments > >> and their lifetimes through explicit APIs. > > > > So Parquet would expect to receive that class rather than > > RandomAccessFile? Or it would grow separate paths for it? > > Actually, on a more high-level basis, is the goal to prefetch for > sequential consumption of row groups? > Essentially yes. One "easy" optimization is to prefetch the entire serialized row group. This is an evolution of that idea where we want to prefetch only the needed parts of a row group in a minimum number of IO calls (consider reading the first 10 columns from a file with 1000 columns -- so we want to do one IO call instead of 10 like we do now). > Regards > > Antoine. >
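To illustrate the "one IO call instead of 10" point, a minimal sketch of range coalescing (CoalesceReadRanges, ReadRange and hole_size_limit are placeholder names, not an existing Arrow API):

#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges that are adjacent or separated by less than hole_size_limit
// bytes, so that a selection of nearby column chunks becomes a small number
// of larger reads.
std::vector<ReadRange> CoalesceReadRanges(std::vector<ReadRange> ranges,
                                          int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> out;
  for (const auto& r : ranges) {
    if (!out.empty() &&
        r.offset - (out.back().offset + out.back().length) <= hole_size_limit) {
      // Extend the previous range to cover this one (plus the small hole).
      const int64_t end = std::max(out.back().offset + out.back().length,
                                   r.offset + r.length);
      out.back().length = end - out.back().offset;
    } else {
      out.push_back(r);
    }
  }
  return out;
}

Fed with the byte ranges of the first 10 column chunks of a row group (which are typically laid out contiguously on disk), this collapses them into a single request, i.e. one round trip to S3 instead of ten.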
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Thu, Feb 6, 2020, 12:41 PM Antoine Pitrou wrote: > > Le 06/02/2020 à 19:37, Wes McKinney a écrit : > > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > > >> Le 06/02/2020 à 16:26, Wes McKinney a écrit : > >>> > >>> This seems useful, too. It becomes a question of where do you want to > >>> manage the cached memory segments, however you obtain them. I'm > >>> arguing that we should not have much custom code in the Parquet > >>> library to manage the prefetched segments (and providing the correct > >>> buffer slice to each column reader when they need it), and instead > >>> encapsulate this logic so it can be reused. > >> > >> I see, so RandomAccessFile would have some associative caching logic to > >> find whether the exact requested range was cached and then return it to > >> the caller? That sounds doable. How is lifetime handled then? Are > >> cached buffers kept on the RandomAccessFile until they are requested, at > >> which point their ownership is transferred to the caller? > >> > > > > This seems like too much to try to build into RandomAccessFile. I would > > suggest a class that wraps a random access file and manages cached > segments > > and their lifetimes through explicit APIs. > > So Parquet would expect to receive that class rather than > RandomAccessFile? Or it would grow separate paths for it? > If the user opts in to coalesced prefetching then the RowGroupReader would instantiate the wrapper under the hood. Public APIs (aside from new APIs in ReaderProperties for prefetching) would be unchanged. > > > > Regards > > Antoine. >
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Le 06/02/2020 à 19:40, Antoine Pitrou a écrit : > > Le 06/02/2020 à 19:37, Wes McKinney a écrit : >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: >> >>> Le 06/02/2020 à 16:26, Wes McKinney a écrit : This seems useful, too. It becomes a question of where do you want to manage the cached memory segments, however you obtain them. I'm arguing that we should not have much custom code in the Parquet library to manage the prefetched segments (and providing the correct buffer slice to each column reader when they need it), and instead encapsulate this logic so it can be reused. >>> >>> I see, so RandomAccessFile would have some associative caching logic to >>> find whether the exact requested range was cached and then return it to >>> the caller? That sounds doable. How is lifetime handled then? Are >>> cached buffers kept on the RandomAccessFile until they are requested, at >>> which point their ownership is transferred to the caller? >>> >> >> This seems like too much to try to build into RandomAccessFile. I would >> suggest a class that wraps a random access file and manages cached segments >> and their lifetimes through explicit APIs. > > So Parquet would expect to receive that class rather than > RandomAccessFile? Or it would grow separate paths for it? Actually, on a more high-level basis, is the goal to prefetch for sequential consumption of row groups? Regards Antoine.
Re: [VOTE] Release Apache Arrow 0.16.0 - RC2
I re-verified the macOS wheels and they worked but I had to hard-code `MACOSX_DEPLOYMENT_TARGET="10.14"` to get past the cython error I reported previously. I tried to set that env var dynamically based on your current OS version but didn't succeed in getting it passed through to pytest, despite many attempts to `export` it; someone with better bash skills than I should probably add that to the script. FTR `defaults read loginwindow SystemVersionStampAsString | sed s/\.[0-9]$//` returns the right macOS version. Neal On Wed, Feb 5, 2020 at 1:29 PM Wes McKinney wrote: > +1 (binding) > > I was able to verify the Windows wheels with the following patch applied > > https://github.com/apache/arrow/pull/6364 > > On Wed, Feb 5, 2020 at 1:09 PM Krisztián Szűcs > wrote: > > > > There were binary naming issues with the macosx and the win-cp38 > > wheels. > > I've uploaded them, all of the wheels should be available now [1] > > > > Note that the newly built macosx wheels have 10_9 platform tag > > instead of 10_6, so the verification script must be updated [2] to > > verify the macosx wheels. > > > > [1] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > [2] > https://github.com/apache/arrow/pull/6362/files#diff-8cc7fa3ae5de30b356c17d7a4b59fe09R658 > > > > On Wed, Feb 5, 2020 at 6:30 PM Krisztián Szűcs > > wrote: > > > > > > The wheel was built successfully and available under the crossbow > > > releases. Something must have gone wrong during download/upload > > > to bintray. I'm re-uploading the wheels again, waiting for the network. > > > > > > On Wed, Feb 5, 2020 at 6:14 PM Wes McKinney > wrote: > > > > > > > > The Windows wheel RC script is broken > > > > > > > > wget --no-check-certificate -O > pyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.1 > > > > 6.0-cp38-cp38m-win_amd64.whl || EXIT /B 1 > > > > --2020-02-05 11:11:15-- > > > > > https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl > > > > Resolving bintray.com (bintray.com)... 75.126.208.206 > > > > Connecting to bintray.com (bintray.com)|75.126.208.206|:443... > > > > connected. > > > > HTTP request sent, awaiting response... 404 Not Found > > > > 2020-02-05 11:11:15 ERROR 404: Not Found. > > > > > > > > I will try to fix > > > > > > > > On Wed, Feb 5, 2020 at 7:31 AM Uwe L. Korn wrote: > > > > > > > > > > I'm failing to verify C++ on macOS as it seems that we nowadays > pull all dependencies from the system. Is there a known way to build & test > on OSX with the script and use conda for the requirements? > > > > > > > > > > Otherwise I probably need to investe to create such a way. > > > > > > > > > > Cheers > > > > > Uwe > > > > > > > > > > On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote: > > > > > > Hi, > > > > > > > > > > > > I've cherry-picked the wheel fix [1] on top of the 0.16 release > tag, > > > > > > re-built the wheels using crossbow [2], and uploaded them to > > > > > > bintray [3] (also removed win-py38m). > > > > > > > > > > > > Anyone who has voted after verifying the wheels, please re-run > > > > > > the verification script again for the wheels and re-vote. 
> > > > > > > > > > > > Thanks, Krisztian > > > > > > > > > > > > [1] > > > > > > > https://github.com/apache/arrow/commit/67e34c53b3be4c88348369f8109626b4a8a997aa > > > > > > [2] > https://github.com/ursa-labs/crossbow/branches/all?query=build-733 > > > > > > [3] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > > > > > > > > > > On Tue, Feb 4, 2020 at 7:08 PM Wes McKinney > wrote: > > > > > > > > > > > > > > +1 (binding) > > > > > > > > > > > > > > Some patches were required to the verification scripts but I > have run: > > > > > > > > > > > > > > * Full source verification on Ubuntu 18.04 > > > > > > > * Linux binary verification > > > > > > > * Source verification on Windows 10 (needed ARROW-6757) > > > > > > > * Windows binary verification. Note that Python 3.8 wheel is > broken > > > > > > > (see ARROW-7755). Whoever uploads the wheels to PyPI _SHOULD > NOT_ > > > > > > > upload this 3.8 wheel until we know what's wrong (if we upload > a > > > > > > > broken wheel then `pip install pyarrow==0.16.0` will be > permanently > > > > > > > broken on Windows/Python 3.8) > > > > > > > > > > > > > > On Mon, Feb 3, 2020 at 9:26 PM Francois Saint-Jacques > > > > > > > wrote: > > > > > > > > > > > > > > > > Tested on ubuntu 18.04 for the source release. > > > > > > > > > > > > > > > > On Mon, Feb 3, 2020 at 10:07 PM Francois Saint-Jacques > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > +1 > > > > > > > > > > > > > > > > > > Binaries verification didn't have any issues. > > > > > > > > > Sources verification worked with some local environment > hiccups > > > > > > > > > > > > > > > > > > François > > > > > > > > > > > > > > > > > > On Mon, Feb 3, 2020 at 8:46 PM Andy Grove <
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Le 06/02/2020 à 19:37, Wes McKinney a écrit : > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > >> Le 06/02/2020 à 16:26, Wes McKinney a écrit : >>> >>> This seems useful, too. It becomes a question of where do you want to >>> manage the cached memory segments, however you obtain them. I'm >>> arguing that we should not have much custom code in the Parquet >>> library to manage the prefetched segments (and providing the correct >>> buffer slice to each column reader when they need it), and instead >>> encapsulate this logic so it can be reused. >> >> I see, so RandomAccessFile would have some associative caching logic to >> find whether the exact requested range was cached and then return it to >> the caller? That sounds doable. How is lifetime handled then? Are >> cached buffers kept on the RandomAccessFile until they are requested, at >> which point their ownership is transferred to the caller? >> > > This seems like too much to try to build into RandomAccessFile. I would > suggest a class that wraps a random access file and manages cached segments > and their lifetimes through explicit APIs. So Parquet would expect to receive that class rather than RandomAccessFile? Or it would grow separate paths for it? Regards Antoine.
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou wrote: > > Le 06/02/2020 à 16:26, Wes McKinney a écrit : > > > > This seems useful, too. It becomes a question of where do you want to > > manage the cached memory segments, however you obtain them. I'm > > arguing that we should not have much custom code in the Parquet > > library to manage the prefetched segments (and providing the correct > > buffer slice to each column reader when they need it), and instead > > encapsulate this logic so it can be reused. > > I see, so RandomAccessFile would have some associative caching logic to > find whether the exact requested range was cached and then return it to > the caller? That sounds doable. How is lifetime handled then? Are > cached buffers kept on the RandomAccessFile until they are requested, at > which point their ownership is transferred to the caller? > This seems like too much to try to build into RandomAccessFile. I would suggest a class that wraps a random access file and manages cached segments and their lifetimes through explicit APIs. Where to put the "async multiple range request" API is a separate question, though. Probably makes sense to start writing some working code and sort it out there. > Regards > > Antoine. >
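For illustration, one way such a wrapper could answer the lifetime question - prefetched segments stay owned by the wrapper until a caller takes a shared reference - using stand-in types rather than the real Arrow classes:

#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct Buffer { std::vector<uint8_t> data; };  // stand-in for arrow::Buffer

class CachingInputFile {
 public:
  // Pretend CacheRanges() has already read these (possibly coalesced)
  // segments from the wrapped file and parked them here.
  void AddCachedSegment(int64_t offset, std::shared_ptr<Buffer> buf) {
    cache_[offset] = std::move(buf);
  }

  // Exact-range lookup: return the cached segment if one starting at `offset`
  // with at least `length` bytes exists, nullptr otherwise.  (A real
  // implementation would also slice segments that merely contain the range,
  // and fall back to the wrapped RandomAccessFile on a miss.)
  std::shared_ptr<Buffer> TryReadCached(int64_t offset, int64_t length) const {
    auto it = cache_.find(offset);
    if (it == cache_.end()) return nullptr;
    if (static_cast<int64_t>(it->second->data.size()) < length) return nullptr;
    // Handing out the shared_ptr gives the caller shared ownership; the
    // segment also stays alive in the cache until it is explicitly evicted.
    return it->second;
  }

 private:
  std::map<int64_t, std::shared_ptr<Buffer>> cache_;
};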
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Le 06/02/2020 à 16:26, Wes McKinney a écrit : > > This seems useful, too. It becomes a question of where do you want to > manage the cached memory segments, however you obtain them. I'm > arguing that we should not have much custom code in the Parquet > library to manage the prefetched segments (and providing the correct > buffer slice to each column reader when they need it), and instead > encapsulate this logic so it can be reused. I see, so RandomAccessFile would have some associative caching logic to find whether the exact requested range was cached and then return it to the caller? That sounds doable. How is lifetime handled then? Are cached buffers kept on the RandomAccessFile until they are requested, at which point their ownership is transferred to the caller? Regards Antoine.
[jira] [Created] (ARROW-7786) [R] Wire up check_metadata in Table.Equals method
Neal Richardson created ARROW-7786: -- Summary: [R] Wire up check_metadata in Table.Equals method Key: ARROW-7786 URL: https://issues.apache.org/jira/browse/ARROW-7786 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Neal Richardson Fix For: 1.0.0 See https://github.com/apache/arrow/pull/6318/files#r375404306. Followup to ARROW-7720. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
Le 06/02/2020 à 17:07, Wes McKinney a écrit : > In case folks are interested in how some other systems deal with IO > management / scheduling, the comments in > > https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h > > and related files might be interesting Thanks. There's quite a lot of functionality. It would be useful to discuss which parts of that functionality is desirable, and which are not. For example, I don't think we should spend development time writing a complex IO scheduler (using which heuristics?) like Impala has, but that's my opinion :-) Regards Antoine. > On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney wrote: >> >> On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou wrote: >>> >>> On Wed, 5 Feb 2020 15:46:15 -0600 >>> Wes McKinney wrote: I'll comment in more detail on some of the other items in due course, but I think this should be handled by an implementation of RandomAccessFile (that wraps a naked RandomAccessFile) with some additional methods, rather than adding this to the abstract RandomAccessFile interface, e.g. class CachingInputFile : public RandomAccessFile { public: CachingInputFile(std::shared_ptr naked_file); Status CacheRanges(...); }; etc. >>> >>> IMHO it may be more beneficial to expose it as an asynchronous API on >>> RandomAccessFile, for example: >>> class RandomAccessFile { >>> public: >>> struct Range { >>> int64_t offset; >>> int64_t length; >>> }; >>> >>> std::vector>> >>> ReadRangesAsync(std::vector ranges); >>> }; >>> >>> >>> The reason is that some APIs such as the C++ AWS S3 API have their own >>> async support, which may be beneficial to use over a generic Arrow >>> thread-pool implementation. >>> >>> Also, by returning a Promise instead of simply caching the results, you >>> make it easier to handle the lifetime of the results. >> >> This seems useful, too. It becomes a question of where do you want to >> manage the cached memory segments, however you obtain them. I'm >> arguing that we should not have much custom code in the Parquet >> library to manage the prefetched segments (and providing the correct >> buffer slice to each column reader when they need it), and instead >> encapsulate this logic so it can be reused. >> >> The API I proposed was just a mockup, I agree it would make sense for >> the prefetching to occur asynchronously so that a column reader can >> proceed as soon as its coalesced chunk has been prefetched, rather >> than having to wait synchronously for all prefetching to complete. >> >>> >>> (Promise can be something like std::future>, though >>> std::future<> has annoying limitations and we may want to write our own >>> instead) >>> >>> Regards >>> >>> Antoine. >>> >>>
[jira] [Created] (ARROW-7785) [C++] sparse_tensor.cc is extremely slow to compile
Antoine Pitrou created ARROW-7785: - Summary: [C++] sparse_tensor.cc is extremely slow to compile Key: ARROW-7785 URL: https://issues.apache.org/jira/browse/ARROW-7785 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou This comes up especially when doing an optimized build. {{sparse_tensor.cc}} is always enabled even if all components are disabled, and it takes multiple seconds to compile. Using [CLangBuildAnalyzer|https://github.com/aras-p/ClangBuildAnalyzer] I get the following results: {code} Files that took longest to codegen (compiler backend): 66372 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o 16457 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o 6283 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o 5284 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o 5090 ms: build-clang-profile/src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
In case folks are interested in how some other systems deal with IO management / scheduling, the comments in https://github.com/apache/impala/blob/master/be/src/runtime/io/disk-io-mgr.h and related files might be interesting On Thu, Feb 6, 2020 at 9:26 AM Wes McKinney wrote: > > On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou wrote: > > > > On Wed, 5 Feb 2020 15:46:15 -0600 > > Wes McKinney wrote: > > > > > > I'll comment in more detail on some of the other items in due course, > > > but I think this should be handled by an implementation of > > > RandomAccessFile (that wraps a naked RandomAccessFile) with some > > > additional methods, rather than adding this to the abstract > > > RandomAccessFile interface, e.g. > > > > > > class CachingInputFile : public RandomAccessFile { > > > public: > > >CachingInputFile(std::shared_ptr naked_file); > > >Status CacheRanges(...); > > > }; > > > > > > etc. > > > > IMHO it may be more beneficial to expose it as an asynchronous API on > > RandomAccessFile, for example: > > class RandomAccessFile { > > public: > > struct Range { > > int64_t offset; > > int64_t length; > > }; > > > > std::vector>> > > ReadRangesAsync(std::vector ranges); > > }; > > > > > > The reason is that some APIs such as the C++ AWS S3 API have their own > > async support, which may be beneficial to use over a generic Arrow > > thread-pool implementation. > > > > Also, by returning a Promise instead of simply caching the results, you > > make it easier to handle the lifetime of the results. > > This seems useful, too. It becomes a question of where do you want to > manage the cached memory segments, however you obtain them. I'm > arguing that we should not have much custom code in the Parquet > library to manage the prefetched segments (and providing the correct > buffer slice to each column reader when they need it), and instead > encapsulate this logic so it can be reused. > > The API I proposed was just a mockup, I agree it would make sense for > the prefetching to occur asynchronously so that a column reader can > proceed as soon as its coalesced chunk has been prefetched, rather > than having to wait synchronously for all prefetching to complete. > > > > > (Promise can be something like std::future>, though > > std::future<> has annoying limitations and we may want to write our own > > instead) > > > > Regards > > > > Antoine. > > > >
[jira] [Created] (ARROW-7784) [C++] diff.cc is extremely slow to compile
Antoine Pitrou created ARROW-7784: - Summary: [C++] diff.cc is extremely slow to compile Key: ARROW-7784 URL: https://issues.apache.org/jira/browse/ARROW-7784 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou This comes up especially when doing an optimized build. {{diff.cc}} is always enabled even if all components are disabled, and it takes multiple seconds to compile. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Thu, Feb 6, 2020 at 2:46 AM Antoine Pitrou wrote: > > On Wed, 5 Feb 2020 15:46:15 -0600 > Wes McKinney wrote: > > > > I'll comment in more detail on some of the other items in due course, > > but I think this should be handled by an implementation of > > RandomAccessFile (that wraps a naked RandomAccessFile) with some > > additional methods, rather than adding this to the abstract > > RandomAccessFile interface, e.g. > > > > class CachingInputFile : public RandomAccessFile { > > public: > >    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file); > >    Status CacheRanges(...); > > }; > > > > etc. > > IMHO it may be more beneficial to expose it as an asynchronous API on > RandomAccessFile, for example: > class RandomAccessFile { > public: > struct Range { > int64_t offset; > int64_t length; > }; > > std::vector<Promise<std::shared_ptr<Buffer>>> > ReadRangesAsync(std::vector<Range> ranges); > }; > > > The reason is that some APIs such as the C++ AWS S3 API have their own > async support, which may be beneficial to use over a generic Arrow > thread-pool implementation. > > Also, by returning a Promise instead of simply caching the results, you > make it easier to handle the lifetime of the results. This seems useful, too. It becomes a question of where do you want to manage the cached memory segments, however you obtain them. I'm arguing that we should not have much custom code in the Parquet library to manage the prefetched segments (and providing the correct buffer slice to each column reader when they need it), and instead encapsulate this logic so it can be reused. The API I proposed was just a mockup, I agree it would make sense for the prefetching to occur asynchronously so that a column reader can proceed as soon as its coalesced chunk has been prefetched, rather than having to wait synchronously for all prefetching to complete. > > (Promise can be something like std::future<Result<std::shared_ptr<Buffer>>>, though > std::future<> has annoying limitations and we may want to write our own > instead) > > Regards > > Antoine. > >
[jira] [Created] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE
Antoine Pitrou created ARROW-7783: - Summary: [C++] ARROW_DATASET should enable ARROW_COMPUTE Key: ARROW-7783 URL: https://issues.apache.org/jira/browse/ARROW-7783 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Currenty, passing {{-DARROW_DATASET=ON}} to CMake doesn't enable ARROW_COMPUTE, which leads to linker errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7782) Losing index information when using write_to_dataset with partition_cols
Ludwik Bielczynski created ARROW-7782: - Summary: Losing index information when using write_to_dataset with partition_cols Key: ARROW-7782 URL: https://issues.apache.org/jira/browse/ARROW-7782 Project: Apache Arrow Issue Type: Bug Environment: pyarrow==0.15.1 Reporter: Ludwik Bielczynski One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} with the partition_cols argument given. Here I have created a minimal example which shows the issue: {code:python} from pathlib import Path import pandas as pd from pyarrow import Table from pyarrow.parquet import write_to_dataset path = Path('/home/ludwik/Documents/YieldPlanet/research/trials') file_name = 'trial_pq.parquet' df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b'] }, index=pd.Index(['a', 'b', 'c'], name='idx')) table = Table.from_pandas(df) write_to_dataset(table, str(path / file_name), partition_cols=['B'], partition_filename_cb=None, filesystem=None) {code} The issue is rather important for pandas and dask users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault
Joris Van den Bossche created ARROW-7781: Summary: [C++][Dataset] Filtering on a non-existent column gives a segfault Key: ARROW-7781 URL: https://issues.apache.org/jira/browse/ARROW-7781 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Joris Van den Bossche Fix For: 1.0.0 Example with python code: {code} In [1]: import pandas as pd In [2]: df = pd.DataFrame({'a': [1, 2, 3]}) In [3]: df.to_parquet("test-filter-crash.parquet") In [4]: import pyarrow.dataset as ds In [5]: dataset = ds.dataset("test-filter-crash.parquet") In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas() Out[6]: a 0 2 1 3 In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas() ../src/arrow/dataset/filter.cc:929: Check failed: _s.ok() Operation failed: maybe_value.status() Bad status: Invalid: attempting to cast non-null scalar to NullScalar /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220] /home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c] /home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] 
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
[jira] [Created] (ARROW-7780) [Release] Fix Windows wheel RC verification script given lack of "m" ABI tag in Python 3.8
Krisztian Szucs created ARROW-7780: -- Summary: [Release] Fix Windows wheel RC verification script given lack of "m" ABI tag in Python 3.8 Key: ARROW-7780 URL: https://issues.apache.org/jira/browse/ARROW-7780 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Reporter: Krisztian Szucs Python 3.8 wheels don't have the "m" postfix in their ABI tag. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [C++] Arrow added to OSS-Fuzz
Hello, A quick update: since Arrow C++ started being fuzzed in OSS-Fuzz, 41 issues (usually crashes) on invalid input have been found, 35 of which have already been corrected. We plan to expand the fuzzed areas to cover Parquet files, as well as serialized Tensor and SparseTensor data. Regards Antoine. On Wed, 15 Jan 2020 19:59:24 +0100 Antoine Pitrou wrote: > Hello, > > I would like to announce that Arrow has been accepted on the OSS-Fuzz > infrastructure (a continuous fuzzing infrastructure operated by Google): > https://github.com/google/oss-fuzz/pull/3233 > > Right now the only fuzz targets are the C++ stream and file IPC readers. > The first build results haven't appeared yet. They will appear on > https://oss-fuzz.com/ . Access needs a Google account, and you also > need to be listed in the "auto_ccs" here: > https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml > > (if you are a PMC or core developer and want to be listed, just open a > PR to the oss-fuzz repository) > > Once we confirm the first builds succeed on OSS-Fuzz, we should probably > add more fuzz targets (for example for reading Parquet files). > > Regards > > Antoine. >
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Wed, 5 Feb 2020 16:37:17 -0500 David Li wrote: > > As a separate step, prefetching/caching should also make use of a > global (or otherwise shared) IO thread pool, so that parallel reads of > different files implicitly coordinate work with each other as well. > Then, you could queue up reads of several Parquet files, such that a > slow network call for one file doesn't block progress for other files, > without issuing reads for all of these files at once. Typically you can solve this by having enough IO concurrency at once :-) I'm not sure having sophisticated global coordination (based on which algorithms) would bring anything. Would you care to elaborate? > It's unclear to me what readahead at the record batch level would > accomplish - Parquet reads each column chunk in a row group as a > whole, and if the row groups are large, then multiple record batches > would fall in the same row group, so then we wouldn't gain any > parallelism, no? (Admittedly, I'm not familiar with the internals > here.) Well, if each row group is read as a whole, then readahead can be applied at the row group level (e.g. read K row groups in advance). Regards Antoine.
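A sketch of that row-group-level readahead (ReadRowGroup is a placeholder, and std::async stands in for a real IO thread pool):

#include <deque>
#include <future>

struct RowGroupData {};  // fetched/decoded row group (stand-in)
RowGroupData ReadRowGroup(int /*index*/) { return RowGroupData{}; }

// Keep up to `readahead` row groups in flight while consuming them in order,
// so the next K fetches overlap with decoding of the current one.
void ConsumeWithReadahead(int num_row_groups, int readahead) {
  std::deque<std::future<RowGroupData>> in_flight;
  int next = 0;
  while (next < num_row_groups || !in_flight.empty()) {
    while (next < num_row_groups &&
           static_cast<int>(in_flight.size()) < readahead) {
      in_flight.push_back(std::async(std::launch::async, ReadRowGroup, next++));
    }
    RowGroupData rg = in_flight.front().get();  // blocks only if not ready yet
    in_flight.pop_front();
    (void)rg;  // ... decode into record batches here ...
  }
}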
Re: [Discuss] Proposal for optimizing Datasets over S3/object storage
On Wed, 5 Feb 2020 15:46:15 -0600 Wes McKinney wrote: > > I'll comment in more detail on some of the other items in due course, > but I think this should be handled by an implementation of > RandomAccessFile (that wraps a naked RandomAccessFile) with some > additional methods, rather than adding this to the abstract > RandomAccessFile interface, e.g. > > class CachingInputFile : public RandomAccessFile { > public: >    CachingInputFile(std::shared_ptr<RandomAccessFile> naked_file); >    Status CacheRanges(...); > }; > > etc. IMHO it may be more beneficial to expose it as an asynchronous API on RandomAccessFile, for example: class RandomAccessFile { public: struct Range { int64_t offset; int64_t length; }; std::vector<Promise<std::shared_ptr<Buffer>>> ReadRangesAsync(std::vector<Range> ranges); }; The reason is that some APIs such as the C++ AWS S3 API have their own async support, which may be beneficial to use over a generic Arrow thread-pool implementation. Also, by returning a Promise instead of simply caching the results, you make it easier to handle the lifetime of the results. (Promise can be something like std::future<Result<std::shared_ptr<Buffer>>>, though std::future<> has annoying limitations and we may want to write our own instead) Regards Antoine.
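For concreteness, the proposed shape could be emulated with std::async / std::future over a toy file class (purely illustrative; a real implementation would plug into the filesystem's native async support, e.g. the AWS SDK, or a shared IO thread pool):

#include <cstdint>
#include <future>
#include <memory>
#include <vector>

struct Buffer { std::vector<uint8_t> data; };  // stand-in for arrow::Buffer

class MockRandomAccessFile {
 public:
  struct Range {
    int64_t offset;
    int64_t length;
  };

  // Synchronous read, stubbed out: a real file would fetch bytes here.
  std::shared_ptr<Buffer> ReadAt(int64_t /*offset*/, int64_t length) {
    auto buf = std::make_shared<Buffer>();
    buf->data.resize(static_cast<size_t>(length));
    return buf;
  }

  // One promise per requested range; callers wait only on the ranges they
  // need, so decoding of early chunks overlaps with later fetches.
  std::vector<std::future<std::shared_ptr<Buffer>>> ReadRangesAsync(
      const std::vector<Range>& ranges) {
    std::vector<std::future<std::shared_ptr<Buffer>>> out;
    for (const auto& r : ranges) {
      out.push_back(std::async(std::launch::async,
                               [this, r] { return ReadAt(r.offset, r.length); }));
    }
    return out;
  }
};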