[jira] [Created] (ARROW-8228) [C++][Parquet] Support writing lists that have null elements that are non-empty.
Micah Kornfield created ARROW-8228: -- Summary: [C++][Parquet] Support writing lists that have null elements that are non-empty. Key: ARROW-8228 URL: https://issues.apache.org/jira/browse/ARROW-8228 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Micah Kornfield Fix For: 1.0.0 With the new V2 level-writing engine we can detect this case, but we fail with a not-implemented error. Fixing this will require changes to the "core" Parquet API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
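For reference, a minimal pyarrow sketch of the unsupported case: a list slot whose validity bit marks it null while its offsets still span child values. The from_buffers construction is an assumption about how to build such an array from Python, and the final write is expected to fail as described above:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

values = pa.array([1, 2, 3, 4], type=pa.int64())
# Offsets [0, 2, 4, 4]: list slot 1 spans values[2:4] even though it is null.
offsets = pa.array([0, 2, 4, 4], type=pa.int32())
# Reuse a boolean array's data bitmap as the validity bitmap (slot 1 null).
validity = pa.array([True, False, True]).buffers()[1]
arr = pa.Array.from_buffers(
    pa.list_(pa.int64()), 3,
    [validity, offsets.buffers()[1]],
    children=[values])
# With the V2 level-writing engine this should raise a not-implemented error.
pq.write_table(pa.table({"col": arr}), "lists.parquet")
{code}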
[jira] [Created] (ARROW-8227) [C++] Propose refining SIMD code framework
Yibo Cai created ARROW-8227: --- Summary: [C++] Propose refining SIMD code framework Key: ARROW-8227 URL: https://issues.apache.org/jira/browse/ARROW-8227 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yibo Cai Assignee: Yibo Cai Arrow supports a wide range of hardware (x86, arm, ppc?) + OS (linux, windows, macos, others?) + compiler (gcc, clang, msvc, others?) combinations. Managing platform-dependent code is non-trivial. This Jira aims to refine (or mess up) the SIMD-related code framework. Some goals: Move SIMD feature definitions into one place, possibly in cmake, and reduce compiler-based ifdefs in source code. Manage SIMD code in one place, but leave non-SIMD default implementations where they are. Don't introduce any performance penalty; prefer direct inlining to a runtime dispatcher. Code should be easy to maintain and extend, and hard to make mistakes in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8226) [Go] Add binary builder that uses 64 bit offsets and make binary builders resettable
Richard created ARROW-8226: -- Summary: [Go] Add binary builder that uses 64 bit offsets and make binary builders resettable Key: ARROW-8226 URL: https://issues.apache.org/jira/browse/ARROW-8226 Project: Apache Arrow Issue Type: New Feature Components: Go Reporter: Richard I ran into some overflow issues with the existing 32-bit binary builder. My changes add a new binary builder that uses 64-bit offsets, plus tests. I also added a panic for when the 32-bit-offset binary builder overflows. Finally, I made both binary builders resettable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
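The same offset-width distinction exists in the other Arrow implementations; as a hedged cross-language illustration (Python rather than Go), pyarrow exposes it as binary() versus large_binary():
{code:python}
import pyarrow as pa

# binary() stores value offsets as int32, so the cumulative byte length of
# all values in one array is capped near 2**31 and can overflow.
small = pa.array([b"payload"] * 3, type=pa.binary())

# large_binary() stores offsets as int64, avoiding that overflow.
large = pa.array([b"payload"] * 3, type=pa.large_binary())
{code}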
[jira] [Created] (ARROW-8224) [C++] Remove APIs deprecated prior to 0.16.0
Wes McKinney created ARROW-8224: --- Summary: [C++] Remove APIs deprecated prior to 0.16.0 Key: ARROW-8224 URL: https://issues.apache.org/jira/browse/ARROW-8224 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8223) Schema.from_pandas breaks with pandas nullable integer dtype
Ged Steponavicius created ARROW-8223: Summary: Schema.from_pandas breaks with pandas nullable integer dtype Key: ARROW-8223 URL: https://issues.apache.org/jira/browse/ARROW-8223 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1, 0.16.0, 0.15.0 Environment: pyarrow 0.16 Reporter: Ged Steponavicius
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([{'int_col': 1}, {'int_col': 2}])
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())
schema = pa.Schema.from_pandas(df)
{code}
produces {noformat}ArrowTypeError: Did not pass numpy.dtype object{noformat} However, this works fine:
{code:python}
schema = pa.Table.from_pandas(df).schema
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build
Neal Richardson created ARROW-8222: -- Summary: [C++] Use bcp to make a slim boost for bundled build Key: ARROW-8222 URL: https://issues.apache.org/jira/browse/ARROW-8222 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson We don't use much of Boost (just system, filesystem, and regex), but when we do a bundled build, we still download and extract all of Boost. The tarball itself is 113 MB; expanded, it is over 700 MB. This can be slow, and it requires a lot of free disk space that we don't really need. [bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is a Boost tool that lets you extract a subset of Boost, resolving any of its necessary dependencies across Boost. The savings for us could be huge:
{code}
mkdir test
./bcp system.hpp filesystem.hpp regex.hpp test
tar -czf test.tar.gz test/
{code}
The resulting tarball is 885K (kilobytes!). {{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as well. We would need a place to host this tarball, and we would have to update it whenever we (1) bump the Boost version or (2) add a new Boost library dependency. This patch would of course include a script that generates the tarball. Given the small size, we could also consider just vendoring it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Preparing for 0.17.0 Arrow release
I just took a first pass at reviewing the Java and Rust issues and removed some from the 0.17.0 release. There are a few small Rust issues that I am actively working on for this release. Thanks. On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney wrote: > hi Neal, > > Thanks for helping coordinate. I agree we should be in a position to > release sometime next week. > > Can folks from the Rust and Java side review issues in the backlog? > According to the dashboard there are 19 Rust issues open and 7 Java > issues. > > Thanks > > On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson > wrote: > > > > Hi all, > > A few weeks ago, there seemed to be consensus (lazy, at least) for a 0.17 > > release at the end of the month. Judging from > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release, > it > > looks like we're getting closer. > > > > I'd encourage everyone to review their backlogs and (1) bump from 0.17 > > scope any tickets they don't plan to finish this week, and (2) if there > are > > any issues that should block release, make sure they are flagged as > > "blockers". > > > > Neal > > > > On Tue, Mar 10, 2020 at 7:39 AM Wes McKinney > wrote: > > > > > It seems like the consensus is to push for a 0.17.0 major release > > > sooner rather than doing a patch release, since releases in general > > > are costly. This is fine with me. I see that a 0.17.0 milestone has > > > been created in JIRA and some JIRA gardening has begun. Do you think > > > we can be in a position to release by the week of March 23 or the week > > > of March 30? > > > > > > On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney > wrote: > > > > > > > > If people are generally on board with accelerating a 0.17.0 major > > > > release, then I would suggest renaming "1.0.0" to "0.17.0" and > > > > beginning to do issue gardening to whittle things down to > > > > critical-looking bugs and high probability patches for the next > couple > > > > of weeks. > > > > > > > > On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney > > > wrote: > > > > > > > > > > I recall there are some other issues that have been reported or > fixed > > > > > that are critical and not yet marked with 0.16.1. > > > > > > > > > > I'm also OK with doing a 0.17.0 release sooner > > > > > > > > > > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson > > > > > wrote: > > > > > > > > > > > > I would also be more supportive of doing 0.17 earlier instead of > a > > > patch > > > > > > release. > > > > > > > > > > > > Neal > > > > > > > > > > > > > > > > > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson < > > > neal.p.richard...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > If releases were costless to make, I'd be all for it, but it's > not > > > clear > > > > > > > to me that it's worth the diversion from other priorities to > make > > > a release > > > > > > > right now. Nothing on > > > > > > > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1 > > > > > > > jumps out to me as super urgent--what are you seeing as > critical? > > > > > > > > > > > > > > If we did decide to go forward, would it be possible to do a > > > release that > > > > > > > is limited to the affected implementations (say, do a > Python-only > > > release)? > > > > > > > That might reduce the cost of building and verifying enough to > > > make it > > > > > > > reasonable to consider. 
> > > > > > > > > > > > > > Neal > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs < > > > szucs.kriszt...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney < > wesmck...@gmail.com> > > > wrote: > > > > > > >> > > > > > > > >> > hi folks, > > > > > > >> > > > > > > > >> > There have been a number of critical issues reported (many > of > > > them > > > > > > >> > fixed already) since 0.16.0 was released. Is there interest > in > > > > > > >> > preparing a patch 0.16.1 release (with backported patches > onto a > > > > > > >> > maint-0.16.x branch as with 0.15.1) since the next major > > > release is a > > > > > > >> > minimum of 6-8 weeks away from general availability? > > > > > > >> > > > > > > > >> > Did the 0.15.1 patch release helper script that Krisztian > wrote > > > get > > > > > > >> > contributed as a PR? > > > > > > >> Not yet, but it is available at > > > > > > >> > https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0 > > > > > > >> > > > > > > > >> > Thanks > > > > > > >> > Wes > > > > > > >> > > > > > > > > > > >
Re: Preparing for 0.17.0 Arrow release
hi Neal, Thanks for helping coordinate. I agree we should be in a position to release sometime next week. Can folks from the Rust and Java side review issues in the backlog? According to the dashboard there are 19 Rust issues open and 7 Java issues. Thanks On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson wrote: > > Hi all, > A few weeks ago, there seemed to be consensus (lazy, at least) for a 0.17 > release at the end of the month. Judging from > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release, it > looks like we're getting closer. > > I'd encourage everyone to review their backlogs and (1) bump from 0.17 > scope any tickets they don't plan to finish this week, and (2) if there are > any issues that should block release, make sure they are flagged as > "blockers". > > Neal > > On Tue, Mar 10, 2020 at 7:39 AM Wes McKinney wrote: > > > It seems like the consensus is to push for a 0.17.0 major release > > sooner rather than doing a patch release, since releases in general > > are costly. This is fine with me. I see that a 0.17.0 milestone has > > been created in JIRA and some JIRA gardening has begun. Do you think > > we can be in a position to release by the week of March 23 or the week > > of March 30? > > > > On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney wrote: > > > > > > If people are generally on board with accelerating a 0.17.0 major > > > release, then I would suggest renaming "1.0.0" to "0.17.0" and > > > beginning to do issue gardening to whittle things down to > > > critical-looking bugs and high probability patches for the next couple > > > of weeks. > > > > > > On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney > > wrote: > > > > > > > > I recall there are some other issues that have been reported or fixed > > > > that are critical and not yet marked with 0.16.1. > > > > > > > > I'm also OK with doing a 0.17.0 release sooner > > > > > > > > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson > > > > wrote: > > > > > > > > > > I would also be more supportive of doing 0.17 earlier instead of a > > patch > > > > > release. > > > > > > > > > > Neal > > > > > > > > > > > > > > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson < > > neal.p.richard...@gmail.com> > > > > > wrote: > > > > > > > > > > > If releases were costless to make, I'd be all for it, but it's not > > clear > > > > > > to me that it's worth the diversion from other priorities to make > > a release > > > > > > right now. Nothing on > > > > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1 > > > > > > jumps out to me as super urgent--what are you seeing as critical? > > > > > > > > > > > > If we did decide to go forward, would it be possible to do a > > release that > > > > > > is limited to the affected implementations (say, do a Python-only > > release)? > > > > > > That might reduce the cost of building and verifying enough to > > make it > > > > > > reasonable to consider. > > > > > > > > > > > > Neal > > > > > > > > > > > > > > > > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs < > > szucs.kriszt...@gmail.com> > > > > > > wrote: > > > > > > > > > > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney > > wrote: > > > > > >> > > > > > > >> > hi folks, > > > > > >> > > > > > > >> > There have been a number of critical issues reported (many of > > them > > > > > >> > fixed already) since 0.16.0 was released. 
Is there interest in > > > > > >> > preparing a patch 0.16.1 release (with backported patches onto a > > > > > >> > maint-0.16.x branch as with 0.15.1) since the next major > > release is a > > > > > >> > minimum of 6-8 weeks away from general availability? > > > > > >> > > > > > > >> > Did the 0.15.1 patch release helper script that Krisztian wrote > > get > > > > > >> > contributed as a PR? > > > > > >> Not yet, but it is available at > > > > > >> https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0 > > > > > >> > > > > > > >> > Thanks > > > > > >> > Wes > > > > > >> > > > > > > > >
[jira] [Created] (ARROW-8220) [Python] Make dataset FileFormat objects serializable
Joris Van den Bossche created ARROW-8220: Summary: [Python] Make dataset FileFormat objects serializable Key: ARROW-8220 URL: https://issues.apache.org/jira/browse/ARROW-8220 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 Similar to ARROW-8060 and ARROW-8059, the FileFormat objects also need to be pickleable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
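A sketch of the desired behavior, assuming the current Python dataset bindings; ParquetFileFormat is used as a representative format class:
{code:python}
import pickle
import pyarrow.dataset as ds

fmt = ds.ParquetFileFormat()
# Once this issue is fixed, a pickle round trip should succeed.
restored = pickle.loads(pickle.dumps(fmt))
assert isinstance(restored, ds.ParquetFileFormat)
{code}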
[jira] [Created] (ARROW-8219) [Rust] sqlparser crate needs to be bumped to version 0.2.5
Paddy Horan created ARROW-8219: -- Summary: [Rust] sqlparser crate needs to be bumped to version 0.2.5 Key: ARROW-8219 URL: https://issues.apache.org/jira/browse/ARROW-8219 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Affects Versions: 0.16.0 Reporter: Paddy Horan Assignee: Paddy Horan -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
If it isn't hard, could you run with batch sizes of 1024 or 2048 records? I think a question was previously raised about whether there is a benefit for smaller buffer sizes. Thanks, Micah On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney wrote: > On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield > wrote: > > > > > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > > > dataset. So that's a huge space savings > > > > One more question on this. What was the average row-batch size used? I > > see in the proposal some buffers might not be compressed, did you use this > > feature in the test? > > I used 64K row batch size. I haven't implemented the optional > non-compressed buffers (for cases where there is little space savings) > so everything is compressed. I can check different batch sizes if you > like > > > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney > wrote: > > > > > hi folks, > > > > > > Sorry it's taken me a little while to produce supporting benchmarks. > > > > > > * I implemented experimental trivial body buffer compression in > > > https://github.com/apache/arrow/pull/6638 > > > * I hooked up the Arrow IPC file format with compression as the new > > > Feather V2 format in > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 > > > > > > I tested a couple of real-world datasets from a prior blog post > > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4 > > > codecs > > > > > > The complete results are here > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 > > > > > > Summary: > > > > > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > > > dataset. So that's a huge space savings > > > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4 > > > and 1.2-3GByte/s with ZSTD > > > > > > I would have to do some more engineering to test throughput changes > > > with Flight, but given these results on slower networking (e.g. 1 > > > Gigabit) my guess is that the compression and decompression overhead > > > is little compared with the time savings due to high compression > > > ratios. If people would like to see these numbers to help make a > > > decision I can take a closer look > > > > > > As far as what Micah said about having a limited number of > > > compressors: I would be in favor of having just LZ4 and ZSTD. It seems > > > anecdotally that these outperform Snappy in most real world scenarios > > > and generally have > 1 GB/s decompression performance. Some Linux > > > distributions (Arch at least) have already started adopting ZSTD over > > > LZMA or GZIP [1] > > > > > > - Wes > > > > > > [1]: > > > > https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/ > > > > > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya wrote: > > > > > > > > Hi Wes, > > > > > > > > Thanks a lot for the additional information. > > > > Looking forward to seeing the good results from your experiments. > > > > > > > > Best, > > > > Liya Fan > > > > > > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney > > > wrote: > > > > > > > > > I see, thank you.
> > > > > > > > > > For such a scenario, implementations would need to define a > > > > > "UserDefinedCodec" interface to enable codecs to be registered from > > > > > third party code, similar to what is done for extension types [1] > > > > > > > > > > I'll update this thread when I get my experimental C++ patch up to > see > > > > > what I'm thinking at least for the built-in codecs we have like > ZSTD. > > > > > > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types > > > > > > > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya > wrote: > > > > > > > > > > > > Hi Wes, > > > > > > > > > > > > Thanks a lot for your further clarification. > > > > > > > > > > > > Some of my preliminary thoughts: > > > > > > > > > > > > 1. We assign a unique GUID to each pair of > compression/decompression > > > > > > strategies. The GUID is stored as part of the > > > Message.custom_metadata. > > > > > When > > > > > > receiving the GUID, the receiver knows which decompression > strategy > > > to > > > > > use. > > > > > > > > > > > > 2. We serialize the decompression strategy, and store it into the > > > > > > Message.custom_metadata. The receiver can decompress data after > > > > > > deserializing the strategy. > > > > > > > > > > > > Method 1 is generally used in static strategy scenarios while > method > > > 2 is > > > > > > generally used in dynamic strategy scenarios. > > > > > > > > > > > > Best, > > > > > > Liya Fan > > > > > > > > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney < > wesmck...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Okay, I guess my question is how the receiver is going to be > able > > > to >
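For readers who want to reproduce the experiment discussed in this thread, here is a hedged sketch against the Feather V2 entry point from PR #6694; the argument names are assumptions based on that PR, and chunksize=65536 mirrors the 64K row batch size Wes mentions above:
{code:python}
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"x": list(range(100_000)),
                  "y": ["some repetitive text"] * 100_000})

# Write one file per codec and compare sizes on disk.
for codec in ["uncompressed", "lz4", "zstd"]:
    feather.write_feather(table, "demo-%s.feather" % codec,
                          compression=codec, chunksize=65536)
{code}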
[jira] [Created] (ARROW-8218) [C++] Parallelize decompression at field level in experimental IPC compression code
Wes McKinney created ARROW-8218: --- Summary: [C++] Parallelize decompression at field level in experimental IPC compression code Key: ARROW-8218 URL: https://issues.apache.org/jira/browse/ARROW-8218 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 This is follow-up work to ARROW-7979; a minor amount of refactoring will be required to move the decompression step out of {{ArrayLoader}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8217) [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979
Wes McKinney created ARROW-8217: --- Summary: [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979 Key: ARROW-8217 URL: https://issues.apache.org/jira/browse/ARROW-8217 Project: Apache Arrow Issue Type: Bug Components: C++, R Reporter: Wes McKinney Fix For: 0.17.0 If we can obtain a gdb backtrace from the failed test in https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8216) filter method for Dataset doesn't distinguish between empty strings and NAs
Sam Albers created ARROW-8216: - Summary: filter method for Dataset doesn't distinguish between empty strings and NAs Key: ARROW-8216 URL: https://issues.apache.org/jira/browse/ARROW-8216 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 0.16.0 Environment: R 3.6.3, Windows 10 Reporter: Sam Albers I have just noticed some slightly odd behaviour with the filter method for Dataset.
{code:r}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'

## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)

## df in memory
df_mem <- starwars %>% filter(hair_color == "")

## reading from the parquet
df_parquet <- read_parquet(fpath) %>% filter(hair_color == "")

## using open_dataset
df_dataset <- open_dataset(dir) %>% filter(hair_color == "") %>% collect()
{code}
I'm pretty sure all these should return the same data.frame. Am I missing something? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8215) [CI][Glib] Meson install fails in the macOS build
Krisztian Szucs created ARROW-8215: -- Summary: [CI][Glib] Meson install fails in the macOS build Key: ARROW-8215 URL: https://issues.apache.org/jira/browse/ARROW-8215 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, GLib Reporter: Krisztian Szucs It also happens in the pull request builds; see the build log https://github.com/apache/arrow/runs/533168517#step:5:1230 cc @kou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield wrote: > > > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > > dataset. So that's a huge space savings > > One more question on this. What was the average row-batch size used? I > see in the proposal some buffers might not be compressed, did you use this > feature in the test? I used 64K row batch size. I haven't implemented the optional non-compressed buffers (for cases where there is little space savings) so everything is compressed. I can check different batch sizes if you like > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney wrote: > > > hi folks, > > > > Sorry it's taken me a little while to produce supporting benchmarks. > > > > * I implemented experimental trivial body buffer compression in > > https://github.com/apache/arrow/pull/6638 > > * I hooked up the Arrow IPC file format with compression as the new > > Feather V2 format in > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 > > > > I tested a couple of real-world datasets from a prior blog post > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4 > > codecs > > > > The complete results are here > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 > > > > Summary: > > > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae > > dataset. So that's a huge space savings > > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4 > > and 1.2-3GByte/s with ZSTD > > > > I would have to do some more engineering to test throughput changes > > with Flight, but given these results on slower networking (e.g. 1 > > Gigabit) my guess is that the compression and decompression overhead > > is little compared with the time savings due to high compression > > ratios. If people would like to see these numbers to help make a > > decision I can take a closer look > > > > As far as what Micah said about having a limited number of > > compressors: I would be in favor of having just LZ4 and ZSTD. It seems > > anecdotally that these outperform Snappy in most real world scenarios > > and generally have > 1 GB/s decompression performance. Some Linux > > distributions (Arch at least) have already started adopting ZSTD over > > LZMA or GZIP [1] > > > > - Wes > > > > [1]: > > https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/ > > > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya wrote: > > > > > > Hi Wes, > > > > > > Thanks a lot for the additional information. > > > Looking forward to seeing the good results from your experiments. > > > > > > Best, > > > Liya Fan > > > > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney > > wrote: > > > > > > > I see, thank you. > > > > > > > > For such a scenario, implementations would need to define a > > > > "UserDefinedCodec" interface to enable codecs to be registered from > > > > third party code, similar to what is done for extension types [1] > > > > > > > > I'll update this thread when I get my experimental C++ patch up to see > > > > what I'm thinking at least for the built-in codecs we have like ZSTD. > > > > > > > > > > > > > > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types > > > > > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya wrote: > > > > > > > > > > Hi Wes, > > > > > > > > > > Thanks a lot for your further clarification.
> > > > > > > > > > Some of my preliminary thoughts: > > > > > > > > > > 1. We assign a unique GUID to each pair of compression/decompression > > > > > strategies. The GUID is stored as part of the > > Message.custom_metadata. > > > > When > > > > > receiving the GUID, the receiver knows which decompression strategy > > to > > > > use. > > > > > > > > > > 2. We serialize the decompression strategy, and store it into the > > > > > Message.custom_metadata. The receiver can decompress data after > > > > > deserializing the strategy. > > > > > > > > > > Method 1 is generally used in static strategy scenarios while method > > 2 is > > > > > generally used in dynamic strategy scenarios. > > > > > > > > > > Best, > > > > > Liya Fan > > > > > > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney > > > > wrote: > > > > > > > > > > > Okay, I guess my question is how the receiver is going to be able > > to > > > > > > determine how to "rehydrate" the record batch buffers: > > > > > > > > > > > > What I've proposed amounts to the following: > > > > > > > > > > > > * UNCOMPRESSED: the current behavior > > > > > > * ZSTD/LZ4/...: each buffer is compressed and written with an int64 > > > > > > length prefix > > > > > > > > > > > > (I'm close to putting up a PR implementing an experimental version > > of > > > > > > this that uses Message.custom_metadata to transmit the codec, so > > this > > > > > > will make the implementation
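The framing Wes describes above (each buffer compressed and written with an int64 length prefix) can be sketched independently of Arrow; a minimal illustration using the third-party zstandard package, with a little-endian prefix assumed for the example:
{code:python}
import struct
import zstandard  # third-party zstd binding, used here only for illustration

def frame_buffer(raw: bytes) -> bytes:
    # int64 uncompressed-length prefix, then the compressed payload.
    return struct.pack("<q", len(raw)) + zstandard.ZstdCompressor().compress(raw)

def unframe_buffer(framed: bytes) -> bytes:
    (n,) = struct.unpack_from("<q", framed)
    return zstandard.ZstdDecompressor().decompress(framed[8:], max_output_size=n)
{code}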
[jira] [Created] (ARROW-8213) [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message
Joris Van den Bossche created ARROW-8213: Summary: [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message Key: ARROW-8213 URL: https://issues.apache.org/jira/browse/ARROW-8213 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 Even after the previous PRs related to local paths (https://github.com/apache/arrow/pull/6643, https://github.com/apache/arrow/pull/6655), I don't find the user experience optimal when you are working with local files and pass a wrong, non-existent path (e.g. due to a typo). Currently, you get this error:
{code}
>>> dataset = ds.dataset("data_with_typo.parquet", format="parquet")
...
ArrowInvalid: URI has empty scheme: 'data_with_typo.parquet'
{code}
where "URI has empty scheme" is rather confusing for the user in the case of a non-existent path. I think ideally we should raise a "No such file or directory" error. I am not fully sure what the best solution is, as {{FileSystem.from_uri}} can also give other errors that we do want to propagate to the user. The most straightforward solution I can think of is checking whether "URI has empty scheme" is in the error message and then rewording it, but that's not very clean... -- This message was sent by Atlassian Jira (v8.3.4#803005)
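A possible user-side workaround until this is resolved is to check the local path before handing it to the dataset API; the helper below is hypothetical and not part of pyarrow:
{code:python}
import os
import pyarrow.dataset as ds

def open_local_dataset(path, **kwargs):
    # Hypothetical helper: raise the clearer error suggested above instead
    # of letting URI parsing report "URI has empty scheme: ...".
    if not os.path.exists(path):
        raise FileNotFoundError("No such file or directory: %r" % path)
    return ds.dataset(path, **kwargs)

# Raises FileNotFoundError rather than a confusing URI error.
dataset = open_local_dataset("data_with_typo.parquet", format="parquet")
{code}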
[jira] [Created] (ARROW-8212) [Python][Dataset] Consider adding Cast like operation
Krisztian Szucs created ARROW-8212: -- Summary: [Python][Dataset] Consider adding Cast like operation Key: ARROW-8212 URL: https://issues.apache.org/jira/browse/ARROW-8212 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs It would cast an expression to the datatype of another expression. Re-evaluate once the new LogicalPlan implementation is merged. -- This message was sent by Atlassian Jira (v8.3.4#803005)
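A sketch of how the proposed operation might be spelled from Python; the .cast method on dataset expressions is hypothetical here and does not exist yet:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
# Hypothetical: coerce the field expression to float64 so it matches the
# type of the right-hand side before comparison.
expr = ds.field("value").cast(pa.float64()) > 10.0
table = dataset.to_table(filter=expr)
{code}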
[jira] [Created] (ARROW-8211) [C++] Sanitize hdfs host when creating HadoopFileSystem from endpoint
Krisztian Szucs created ARROW-8211: -- Summary: [C++] Sanitize hdfs host when creating HadoopFileSystem from endpoint Key: ARROW-8211 URL: https://issues.apache.org/jira/browse/ARROW-8211 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Creating a HadoopFileSystem from a URI always [prepends|https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/hdfs.cc#L283] the host with the URI scheme, whereas configuring it from an endpoint [does not|https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/hdfs.cc#L253]. This has caused issues during equality checks and serialization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8209) [Python] Accessing duplicate column of Table by name gives wrong error
Joris Van den Bossche created ARROW-8209: Summary: [Python] Accessing duplicate column of Table by name gives wrong error Key: ARROW-8209 URL: https://issues.apache.org/jira/browse/ARROW-8209 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche When you have a table with duplicate column names and you try to access this column, you get an error about the column not existing:
{code}
>>> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
...                  names=['a', 'b', 'a'])
>>> table.column('a')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
----> 1 table.column('a')
~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.column()
KeyError: 'Column a does not exist in table'
{code}
It should rather give an error message about the column name being duplicated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
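Until then, integer indices remain unambiguous, so positional access is a practical workaround (a minimal sketch using the reporter's table):
{code:python}
import pyarrow as pa

table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
                 names=['a', 'b', 'a'])
# Table.column also accepts an integer index, which sidesteps the
# duplicated name entirely.
first_a, second_a = table.column(0), table.column(2)
{code}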
[NIGHTLY] Arrow Build Report for Job nightly-2020-03-25-0
Arrow Build Report for Job nightly-2020-03-25-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0 Failed Tasks: - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-gandiva-jar-trusty - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-cpp-valgrind - test-debian-10-go-1.12: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-debian-10-go-1.12 - test-r-linux-as-cran: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-test-r-linux-as-cran - wheel-win-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp36m - wheel-win-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp37m - wheel-win-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-8 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py38 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-debian-buster - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-debian-stretch - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-gandiva-jar-osx - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-homebrew-cpp - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-macos-r-autobrew - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-cpp - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-kartothek-master - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-spark-master - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL:
[jira] [Created] (ARROW-8208) [Python] RowGroup filtering with ParquetDataset
Christophe Clienti created ARROW-8208: - Summary: [Python] RowGroup filtering with ParquetDataset Key: ARROW-8208 URL: https://issues.apache.org/jira/browse/ARROW-8208 Project: Apache Arrow Issue Type: New Feature Reporter: Christophe Clienti Hello, I tried to use row-group filtering at the file level with an instance of ParquetDataset, without success. I've tested the workaround proposed here: [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883] But I wonder if it can work on a file, as I get an exception with the following code:
{code:python}
ParquetDataset('data.parquet', filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned datasets. Moreover, I read some information in the following JIRA ticket: https://issues.apache.org/jira/browse/ARROW-1796 So I'm not sure whether a ParquetDataset can use row-group statistics to filter specific row groups within a file. As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug (statistics.min instead of statistics.min_value), I was able to apply the row-group filtering. Today, with pyarrow, I'm forced to filter the row groups in each file manually, which prevents me from using the ParquetDataset partition-filtering functionality. Row groups are really useful because they keep the filesystem from filling up with many small files... -- This message was sent by Atlassian Jira (v8.3.4#803005)
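For context, the manual per-file row-group pruning described above can be sketched with pyarrow's metadata API. This is a hedged illustration, not the requested feature: it assumes 'ticker' is the first column of the schema and that min/max string statistics are present and meaningful.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
TICKER = 0  # assumed position of the 'ticker' column in the schema
kept = []
for i in range(pf.num_row_groups):
    stats = pf.metadata.row_group(i).column(TICKER).statistics
    # Keep a row group unless its statistics prove 'AAPL' cannot occur in it.
    if stats is None or not stats.has_min_max or stats.min <= "AAPL" <= stats.max:
        kept.append(pf.read_row_group(i))
table = pa.concat_tables(kept)
{code}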
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Wed, Mar 25, 2020 at 2:32 AM Wes McKinney wrote: > From what I've found searching on the internet > > - Java: > * ZSTD -- JNI-based library available > * LZ4 -- both JNI and native Java available > > - Go: ZSTD is a C binding, while there is an LZ4 native Go implementation > AFAIK, one has access to pure-Go packages for both of these compressors: - github.com/pierrec/lz4 - github.com/klauspost/compress -s