[jira] [Created] (ARROW-7979) [C++] Implement experimental buffer compression in IPC messages
Wes McKinney created ARROW-7979:
--------------------------------

             Summary: [C++] Implement experimental buffer compression in IPC messages
                 Key: ARROW-7979
                 URL: https://issues.apache.org/jira/browse/ARROW-7979
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

The idea is that this can be used for experiments and bespoke applications (e.g. in the context of ARROW-5510). If this is adopted formally into the IPC format, then the experimental implementation can be altered to match the specification.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
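A minimal sketch of the read side of such a scheme, assuming each compressed buffer is framed with a little-endian int64 uncompressed-length prefix as discussed on the mailing list (zlib stands in here for a fast codec like ZSTD, and the function name is illustrative, not an actual Arrow API):

{code:python}
import struct
import zlib

def decompress_buffer(framed: bytes) -> bytes:
    """Decode one buffer framed as <int64 uncompressed length><compressed bytes>."""
    (uncompressed_len,) = struct.unpack_from("<q", framed, 0)
    data = zlib.decompress(framed[8:])
    assert len(data) == uncompressed_len, "length prefix mismatch"
    return data
{code}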
[jira] [Created] (ARROW-7978) [Developer] GitHub Actions "lint" task is running include-what-you-use and failing
Wes McKinney created ARROW-7978:
--------------------------------

             Summary: [Developer] GitHub Actions "lint" task is running include-what-you-use and failing
                 Key: ARROW-7978
                 URL: https://issues.apache.org/jira/browse/ARROW-7978
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7977) [C++] Rename fs::FileStats to fs::FileStat
Kouhei Sutou created ARROW-7977:
--------------------------------

             Summary: [C++] Rename fs::FileStats to fs::FileStat
                 Key: ARROW-7977
                 URL: https://issues.apache.org/jira/browse/ARROW-7977
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou

The widely used stat(2) is an abbreviation of "status", not "statistics". It's better that we follow the widely used existing convention.

Linux: http://man7.org/linux/man-pages/man2/stat.2.html
{quote}
get file status
{quote}

FreeBSD: https://www.freebsd.org/cgi/man.cgi?query=stat&sektion=2
{quote}
get file status
{quote}

If we use FileStat instead of FileStats, we can use the singular form "stat" and the plural form "stats" as variable names, instead of "stats" and "stats_vector". It will help us write readable code.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7976) [C++] Add field to IpcOptions to include padding in Buffer metadata accounting
Wes McKinney created ARROW-7976:
--------------------------------

             Summary: [C++] Add field to IpcOptions to include padding in Buffer metadata accounting
                 Key: ARROW-7976
                 URL: https://issues.apache.org/jira/browse/ARROW-7976
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

While this will modify buffer sizes in roundtrips, it may be the desired behavior when, for example, transmitting buffers that you wish to be 64-byte padded. See the related discussion in ARROW-7975.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7975) [C++] Do not include padding bytes in "Buffer" IPC metadata accounting
Wes McKinney created ARROW-7975:
--------------------------------

             Summary: [C++] Do not include padding bytes in "Buffer" IPC metadata accounting
                 Key: ARROW-7975
                 URL: https://issues.apache.org/jira/browse/ARROW-7975
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

At this line, we include the padding bytes in the IPC metadata:

https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L192

The effect of this is that buffer sizes are modified by an IPC roundtrip. According to the format specification, the padding bytes do not need to be accounted for in the metadata:

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L330

The Java implementation, for example, does not include them. I ran into this when working on a prototype implementation of ARROW-300, where it is important to have the exact unpadded size of the original buffer that was written.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
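To make the accounting concrete, a small sketch of the 64-byte padding rule and of which size should, per the spec, be recorded in the Buffer metadata (illustrative numbers, not the writer.cc code):

{code:python}
def padded_length(size: int, alignment: int = 64) -> int:
    """Round a buffer size up to the next multiple of the alignment."""
    return ((size + alignment - 1) // alignment) * alignment

buffer_size = 100                        # exact bytes produced by the writer
body_bytes = padded_length(buffer_size)  # 128 bytes occupied in the IPC body
# The Buffer metadata should record 100 (the exact size), not 128;
# recording 128 inflates the buffer by the padding on every roundtrip.
{code}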
[jira] [Created] (ARROW-7974) [Developer][C++] ResourceWarning in "make check-format"
Wes McKinney created ARROW-7974:
--------------------------------

             Summary: [Developer][C++] ResourceWarning in "make check-format"
                 Key: ARROW-7974
                 URL: https://issues.apache.org/jira/browse/ARROW-7974
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

Related to ARROW-7973, I also see:

{code}
$ ninja check-format
[1/1] cd /home/wesm/code/arrow/cpp/preflight...ce_dir /home/wesm/code/arrow/cpp/src --quiet
/home/wesm/code/arrow/cpp/build-support/run_clang_format.py:77: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt' mode='r' encoding='UTF-8'>
  for line in open(arguments.exclude_globs):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7973) [Developer][C++] ResourceWarnings in run_cpplint.py
Wes McKinney created ARROW-7973:
--------------------------------

             Summary: [Developer][C++] ResourceWarnings in run_cpplint.py
                 Key: ARROW-7973
                 URL: https://issues.apache.org/jira/browse/ARROW-7973
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

Seeing warnings like this locally:

{code}
$ ninja lint
[1/1] cd /home/wesm/code/arrow/cpp/preflight...ce_dir /home/wesm/code/arrow/cpp/src --quiet
FAILED: CMakeFiles/lint
cd /home/wesm/code/arrow/cpp/preflight-build && /home/wesm/miniconda/envs/arrow-3.7/bin/python /home/wesm/code/arrow/cpp/build-support/run_cpplint.py --cpplint_binary /home/wesm/code/arrow/cpp/build-support/cpplint.py --exclude_globs /home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt --source_dir /home/wesm/code/arrow/cpp/src --quiet
/home/wesm/code/arrow/cpp/build-support/run_cpplint.py:77: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt' mode='r' encoding='UTF-8'>
  for line in open(arguments.exclude_globs):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/home/wesm/code/arrow/cpp/build-support/cpplint.py:6240: ResourceWarning: unclosed file <_io.BufferedReader name='/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/add.h'>
  lines = codecs.open(filename, 'r', 'utf8', 'replace').read().split('\n')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/home/wesm/code/arrow/cpp/build-support/cpplint.py:6240: ResourceWarning: unclosed file <_io.BufferedReader name='/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util_internal.h'>
  lines = codecs.open(filename, 'r', 'utf8', 'replace').read().split('\n')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
{code}

I was using {{PYTHONDEVMODE=1}}, so this may be related.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
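The warning points at file handles that are opened but never explicitly closed. A minimal sketch of the usual fix for the {{run_cpplint.py}} line cited above, using a context manager so the handle is closed deterministically (the helper name is illustrative, not the script's actual structure):

{code:python}
def read_exclude_globs(path):
    """Read exclusion globs, closing the file deterministically.

    Replaces the unclosed-handle pattern:
        for line in open(arguments.exclude_globs): ...
    """
    with open(path) as f:
        return [line.strip() for line in f]
{code}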
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou wrote:
>
> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast compressors like ZSTD when
> > writing the RecordBatch bodies. This could be handled privately as an
> > implementation detail of the Feather file, but since ZSTD compression
> > could improve throughput in Flight, for example, I thought I would
> > bring it up for discussion.
> >
> > I can see two simple compression strategies:
> >
> > * Compress the entire message body in one shot, writing the result out
> > with an 8-byte int64 prefix indicating the uncompressed size
> > * Compress each non-zero-length constituent Buffer prior to writing to
> > the body (and using the same uncompressed-length prefix when writing
> > the compressed buffer)
> >
> > The latter strategy is preferable for scenarios where we may project
> > out only a few fields from a larger record batch (such as reading from
> > a memory-mapped file).
>
> Agreed. It may also allow using different compression strategies for
> different kinds of buffers (for example a bytestream splitting strategy
> for floats and doubles, or a delta encoding strategy for integers).

If we wanted to allow different compression schemes to apply to different
buffers, I think we will need a new Message type, because this would
inflate metadata sizes in a way that is not likely to be acceptable for
the current uncompressed use case. Here is my strawman proposal:

https://github.com/apache/arrow/compare/master...wesm:compression-strawman

> > Implementation could be accomplished by one of the following methods:
> >
> > * Setting a field in Message.custom_metadata
> > * Adding a new field to Message
>
> I think it has to be a new field in Message. Making it an ignorable
> metadata field means non-supporting receivers will decode and interpret
> the data wrongly.
>
> Regards
>
> Antoine.
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
I also support compression at the buffer level, and making it an extra
message type.

Talking about compression and Flight: has anyone tested using gRPC's
compression to compress at the transport level (if that's the correct way
to describe it)? I believe only gzip and brotli are currently supported,
so that might be insufficient.

On Sun, 01 Mar 2020, 23:14 Antoine Pitrou wrote:

>
> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast compressors like ZSTD when
> > writing the RecordBatch bodies. This could be handled privately as an
> > implementation detail of the Feather file, but since ZSTD compression
> > could improve throughput in Flight, for example, I thought I would
> > bring it up for discussion.
> >
> > I can see two simple compression strategies:
> >
> > * Compress the entire message body in one shot, writing the result out
> > with an 8-byte int64 prefix indicating the uncompressed size
> > * Compress each non-zero-length constituent Buffer prior to writing to
> > the body (and using the same uncompressed-length prefix when writing
> > the compressed buffer)
> >
> > The latter strategy is preferable for scenarios where we may project
> > out only a few fields from a larger record batch (such as reading from
> > a memory-mapped file).
>
> Agreed. It may also allow using different compression strategies for
> different kinds of buffers (for example a bytestream splitting strategy
> for floats and doubles, or a delta encoding strategy for integers).
>
> > Implementation could be accomplished by one of the following methods:
> >
> > * Setting a field in Message.custom_metadata
> > * Adding a new field to Message
>
> I think it has to be a new field in Message. Making it an ignorable
> metadata field means non-supporting receivers will decode and interpret
> the data wrongly.
>
> Regards
>
> Antoine.
>
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On 01/03/2020 at 22:01, Wes McKinney wrote:
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies. This could be handled privately as an
> implementation detail of the Feather file, but since ZSTD compression
> could improve throughput in Flight, for example, I thought I would
> bring it up for discussion.
>
> I can see two simple compression strategies:
>
> * Compress the entire message body in one shot, writing the result out
> with an 8-byte int64 prefix indicating the uncompressed size
> * Compress each non-zero-length constituent Buffer prior to writing to
> the body (and using the same uncompressed-length prefix when writing
> the compressed buffer)
>
> The latter strategy is preferable for scenarios where we may project
> out only a few fields from a larger record batch (such as reading from
> a memory-mapped file).

Agreed. It may also allow using different compression strategies for
different kinds of buffers (for example a bytestream splitting strategy
for floats and doubles, or a delta encoding strategy for integers).

> Implementation could be accomplished by one of the following methods:
>
> * Setting a field in Message.custom_metadata
> * Adding a new field to Message

I think it has to be a new field in Message. Making it an ignorable
metadata field means non-supporting receivers will decode and interpret
the data wrongly.

Regards

Antoine.
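To make the bytestream-splitting idea concrete, a small sketch of the transform for float64 data, assuming numpy is available (an illustration of the technique, not Arrow code):

{code:python}
import numpy as np

def bytestream_split(values: np.ndarray) -> bytes:
    """Group the i-th byte of every value into its own contiguous stream."""
    n, width = len(values), values.dtype.itemsize
    raw = values.view(np.uint8).reshape(n, width)
    return raw.T.tobytes()  # 'width' streams of n bytes each

split = bytestream_split(np.linspace(0.0, 1.0, 1024))
# 'split' usually compresses better than the raw bytes because the
# high-order exponent bytes of nearby floats are nearly identical.
{code}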
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Sun, Mar 1, 2020 at 3:01 PM Wes McKinney wrote:
>
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies. This could be handled privately as an
> implementation detail of the Feather file, but since ZSTD compression
> could improve throughput in Flight, for example, I thought I would
> bring it up for discussion.

I should also add that I'm nearly done with implementing this for
experimentation purposes, which would allow us to collect some benchmark
data about how this affects Flight throughput on data having good
compression ratios.

> I can see two simple compression strategies:
>
> * Compress the entire message body in one shot, writing the result out
> with an 8-byte int64 prefix indicating the uncompressed size
> * Compress each non-zero-length constituent Buffer prior to writing to
> the body (and using the same uncompressed-length prefix when writing
> the compressed buffer)
>
> The latter strategy is preferable for scenarios where we may project
> out only a few fields from a larger record batch (such as reading from
> a memory-mapped file).
>
> Implementation could be accomplished by one of the following methods:
>
> * Setting a field in Message.custom_metadata
> * Adding a new field to Message
>
> There have been past discussions about standardizing encodings and
> allowing for sparse data representations, so compression could get
> rolled up in that, but I still think there would be value in having a
> very simple one-shot compression option for record batch bodies, so I
> don't think the initiatives are in conflict with each other.
>
> If this were of interest, it would be important to add this to the
> columnar specification ASAP for forward compatibility reasons, and any
> implementation that does not want to implement decompression right
> away can at least raise an error to say "this isn't supported".
>
> thanks
> Wes
[DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
In the context of a "next version of the Feather format" ARROW-5510
(which is consumed only by Python and R at the moment), I have been
looking at compressing buffers using fast compressors like ZSTD when
writing the RecordBatch bodies. This could be handled privately as an
implementation detail of the Feather file, but since ZSTD compression
could improve throughput in Flight, for example, I thought I would
bring it up for discussion.

I can see two simple compression strategies:

* Compress the entire message body in one shot, writing the result out
with an 8-byte int64 prefix indicating the uncompressed size
* Compress each non-zero-length constituent Buffer prior to writing to
the body (and using the same uncompressed-length prefix when writing
the compressed buffer)

The latter strategy is preferable for scenarios where we may project
out only a few fields from a larger record batch (such as reading from
a memory-mapped file).

Implementation could be accomplished by one of the following methods:

* Setting a field in Message.custom_metadata
* Adding a new field to Message

There have been past discussions about standardizing encodings and
allowing for sparse data representations, so compression could get
rolled up in that, but I still think there would be value in having a
very simple one-shot compression option for record batch bodies, so I
don't think the initiatives are in conflict with each other.

If this were of interest, it would be important to add this to the
columnar specification ASAP for forward compatibility reasons, and any
implementation that does not want to implement decompression right
away can at least raise an error to say "this isn't supported".

thanks
Wes
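A minimal sketch of the second (per-buffer) strategy described above, framing each compressed buffer with a little-endian int64 uncompressed-length prefix; zlib stands in for a fast codec like ZSTD, and the function names are illustrative rather than any actual Arrow API:

{code:python}
import struct
import zlib

def compress_buffer(buf: bytes) -> bytes:
    """Frame one buffer as <int64 uncompressed length><compressed bytes>."""
    return struct.pack("<q", len(buf)) + zlib.compress(buf)

def write_body(buffers):
    """Compress each non-zero-length buffer; empty buffers pass through."""
    return b"".join(compress_buffer(b) if b else b for b in buffers)

body = write_body([b"some column data" * 1000, b"", b"validity bitmap bytes"])
{code}

A reader would locate each frame via the existing Buffer offset/length metadata, read the 8-byte prefix, decompress the remainder, and check it against the stated length.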
[jira] [Created] (ARROW-7972) Allow reading CSV in chunks
Bulat Yaminov created ARROW-7972:
---------------------------------

             Summary: Allow reading CSV in chunks
                 Key: ARROW-7972
                 URL: https://issues.apache.org/jira/browse/ARROW-7972
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Bulat Yaminov

Currently in the Python API you can read a CSV using [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html]. There are some settings for the reader that you can pass in [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions], but I don't see an option to read a part of the CSV file instead of the whole file (or starting from {{skip_rows}}). As a result, if I have a big CSV file that cannot fit into memory, I cannot process it with this API.

Would it be possible to implement a chunked iterator similar to [what Pandas allows|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]:

{code:python}
from pyarrow import csv

for table_chunk in csv.read_csv("big.csv", read_options=csv.ReadOptions(chunksize=1_000_000)):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}

Thanks in advance for your reaction.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
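A possible stop-gap that gets the chunked pattern today, assuming pandas is an acceptable intermediary (it trades pyarrow's multithreaded CSV parser for the pandas one):

{code:python}
import pandas as pd
import pyarrow as pa

# Read the CSV in bounded chunks via pandas, converting each to Arrow.
for df_chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    table_chunk = pa.Table.from_pandas(df_chunk)
    # do something with table_chunk, e.g. filter and save to disk
{code}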
Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-29-0
On Sat, Feb 29, 2020 at 3:57 PM Neal Richardson wrote:
>
> I'm looking into the R failures (https://github.com/apache/arrow/pull/6509).
> Since all of those docker-compose jobs are failing on Crossbow on Azure,
> but the one that we run on push/pull_request on GHA is passing
> (https://github.com/apache/arrow/actions/runs/46824058), my guess is
> something transient. Spot-checking one of the wheel failures, there's a
> timeout trying to download Boost from bintray, so it could be the same issue.

I've updated the OSX wheels to use the system Boost installed by brew,
although the issue still persists in other builds, like the Ubuntu 16.04
ones where we try to build the Boost external project. The download is
rejected by bintray with 403 Forbidden; there is an issue about it [1].
The GitHub release of Boost is not identical to the bintray source
release, and 1.71 is not available on SourceForge [2].

[1]: https://github.com/boostorg/boost/issues/375
[2]: https://sourceforge.net/projects/boost/files/boost/1.71.0/

> Either way I'll try to reproduce and get more failure logging.
>
> Neal
>
> On Sat, Feb 29, 2020 at 8:31 AM Crossbow wrote:
> >
> > Arrow Build Report for Job nightly-2020-02-29-0
> >
> > All tasks:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0
> >
> > Failed Tasks:
> > - centos-7:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-github-centos-7
> > - centos-8:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-github-centos-8
> > - conda-linux-gcc-py37:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-conda-linux-gcc-py37
> > - conda-osx-clang-py36:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-conda-osx-clang-py36
> > - gandiva-jar-trusty:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-gandiva-jar-trusty
> > - test-conda-python-3.7-pandas-master:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-pandas-master
> > - test-conda-python-3.7-turbodbc-latest:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-turbodbc-latest
> > - test-conda-python-3.7-turbodbc-master:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-turbodbc-master
> > - test-r-rhub-debian-gcc-devel:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rhub-debian-gcc-devel
> > - test-r-rhub-ubuntu-gcc-release:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rhub-ubuntu-gcc-release
> > - test-r-rstudio-r-base-3.6-bionic:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-bionic
> > - test-r-rstudio-r-base-3.6-centos6:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-centos6
> > - test-r-rstudio-r-base-3.6-opensuse15:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-opensuse15
> > - test-r-rstudio-r-base-3.6-opensuse42:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-opensuse42
> > - test-ubuntu-16.04-cpp:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-ubuntu-16.04-cpp
> > - test-ubuntu-18.04-cpp-cmake32:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-ubuntu-18.04-cpp-cmake32
> > - wheel-manylinux1-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux1-cp35m
> > - wheel-manylinux2010-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux2010-cp35m
> > - wheel-manylinux2014-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux2014-cp35m
> > - wheel-osx-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp35m
> > - wheel-osx-cp36m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp36m
> > - wheel-osx-cp37m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp37m
> > - wheel-osx-cp38:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp38
> > - wheel-win-cp38:
> >   URL:
[jira] [Created] (ARROW-7971) Create rowcount utility in Rust
Ken Suenobu created ARROW-7971:
-------------------------------

             Summary: Create rowcount utility in Rust
                 Key: ARROW-7971
                 URL: https://issues.apache.org/jira/browse/ARROW-7971
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Rust
            Reporter: Ken Suenobu

As a developer, I would like the ability to count the number of rows present in a Parquet file from the command line. Ideally, this would be something similar to {{parquet-rowcount}} or {{parquet-rows}} that counts the rows in one or more Parquet files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
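The count is available from the Parquet footer metadata without scanning any data pages; for reference, a sketch of the equivalent lookup in Python via pyarrow (the file name is hypothetical), which an eventual Rust utility could mirror by reading the same footer:

{code:python}
import pyarrow.parquet as pq

def row_count(path):
    """Read the total row count from the Parquet footer metadata."""
    return pq.ParquetFile(path).metadata.num_rows

print(row_count("example.parquet"))
{code}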