[jira] [Created] (ARROW-7979) [C++] Implement experimental buffer compression in IPC messages
Wes McKinney created ARROW-7979:
--------------------------------

             Summary: [C++] Implement experimental buffer compression in IPC messages
                 Key: ARROW-7979
                 URL: https://issues.apache.org/jira/browse/ARROW-7979
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

The idea is that this can be used for experiments and bespoke applications (e.g. in the context of ARROW-5510). If this is adopted formally into the IPC format, then the experimental implementation can be altered to match the specification.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
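A minimal sketch of the read side of such a scheme, assuming each compressed buffer is framed with a little-endian int64 uncompressed-length prefix as discussed on the mailing list (zlib stands in here for a fast codec like ZSTD, and the function name is illustrative, not an actual Arrow API):

{code:python}
import struct
import zlib

def decompress_buffer(framed: bytes) -> bytes:
    """Decode one buffer framed as <int64 uncompressed length><compressed bytes>."""
    (uncompressed_len,) = struct.unpack_from("<q", framed, 0)
    data = zlib.decompress(framed[8:])
    assert len(data) == uncompressed_len, "length prefix mismatch"
    return data
{code}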
[jira] [Created] (ARROW-7978) [Developer] GitHub Actions "lint" task is running include-what-you-use and failing
Wes McKinney created ARROW-7978:
--------------------------------

             Summary: [Developer] GitHub Actions "lint" task is running include-what-you-use and failing
                 Key: ARROW-7978
                 URL: https://issues.apache.org/jira/browse/ARROW-7978
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7977) [C++] Rename fs::FileStats to fs::FileStat
Kouhei Sutou created ARROW-7977:
--------------------------------

             Summary: [C++] Rename fs::FileStats to fs::FileStat
                 Key: ARROW-7977
                 URL: https://issues.apache.org/jira/browse/ARROW-7977
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou

The widely used stat(2) is an abbreviation of "status", not "statistics". It's better that we follow the widely used existing convention.

Linux: http://man7.org/linux/man-pages/man2/stat.2.html
{quote}
get file status
{quote}

FreeBSD: https://www.freebsd.org/cgi/man.cgi?query=stat&sektion=2
{quote}
get file status
{quote}

If we use FileStat instead of FileStats, we can use the singular form "stat" and the plural form "stats" as variable names, instead of "stats" and "stats_vector". It will help us write readable code.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7976) [C++] Add field to IpcOptions to include padding in Buffer metadata accounting
Wes McKinney created ARROW-7976:
--------------------------------

             Summary: [C++] Add field to IpcOptions to include padding in Buffer metadata accounting
                 Key: ARROW-7976
                 URL: https://issues.apache.org/jira/browse/ARROW-7976
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

While this will modify buffer sizes in roundtrips, it may be the desired behavior when, for example, transmitting buffers that you wish to be 64-byte padded. See the related discussion in ARROW-7975.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7975) [C++] Do not include padding bytes in "Buffer" IPC metadata accounting
Wes McKinney created ARROW-7975:
--------------------------------

             Summary: [C++] Do not include padding bytes in "Buffer" IPC metadata accounting
                 Key: ARROW-7975
                 URL: https://issues.apache.org/jira/browse/ARROW-7975
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Wes McKinney
             Fix For: 1.0.0

At this line, we include the padding bytes in the IPC metadata:

https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L192

The effect of this is that buffer sizes are modified by an IPC roundtrip. According to the format specification, the padding bytes do not need to be accounted for in the metadata:

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L330

The Java implementation, for example, does not include them. I ran into this when working on a prototype implementation of ARROW-300, where it is important to have the exact unpadded size of the original buffer that was written.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
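To make the accounting concrete, a small sketch of the 64-byte padding rule and of which size should, per the spec, be recorded in the Buffer metadata (illustrative numbers, not the writer.cc code):

{code:python}
def padded_length(size: int, alignment: int = 64) -> int:
    """Round a buffer size up to the next multiple of the alignment."""
    return ((size + alignment - 1) // alignment) * alignment

buffer_size = 100                        # exact bytes produced by the writer
body_bytes = padded_length(buffer_size)  # 128 bytes occupied in the IPC body
# The Buffer metadata should record 100 (the exact size), not 128;
# recording 128 inflates the buffer by the padding on every roundtrip.
{code}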
[jira] [Created] (ARROW-7974) [Developer][C++] ResourceWarning in "make check-format"
Wes McKinney created ARROW-7974:
--------------------------------

             Summary: [Developer][C++] ResourceWarning in "make check-format"
                 Key: ARROW-7974
                 URL: https://issues.apache.org/jira/browse/ARROW-7974
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

Related to ARROW-7973, I also see:

{code}
$ ninja check-format
[1/1] cd /home/wesm/code/arrow/cpp/preflight...ce_dir /home/wesm/code/arrow/cpp/src --quiet
/home/wesm/code/arrow/cpp/build-support/run_clang_format.py:77: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt' mode='r' encoding='UTF-8'>
  for line in open(arguments.exclude_globs):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7973) [Developer][C++] ResourceWarnings in run_cpplint.py
Wes McKinney created ARROW-7973:
--------------------------------

             Summary: [Developer][C++] ResourceWarnings in run_cpplint.py
                 Key: ARROW-7973
                 URL: https://issues.apache.org/jira/browse/ARROW-7973
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Developer Tools
            Reporter: Wes McKinney
             Fix For: 1.0.0

Seeing warnings like this locally:

{code}
$ ninja lint
[1/1] cd /home/wesm/code/arrow/cpp/preflight...ce_dir /home/wesm/code/arrow/cpp/src --quiet
FAILED: CMakeFiles/lint
cd /home/wesm/code/arrow/cpp/preflight-build && /home/wesm/miniconda/envs/arrow-3.7/bin/python /home/wesm/code/arrow/cpp/build-support/run_cpplint.py --cpplint_binary /home/wesm/code/arrow/cpp/build-support/cpplint.py --exclude_globs /home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt --source_dir /home/wesm/code/arrow/cpp/src --quiet
/home/wesm/code/arrow/cpp/build-support/run_cpplint.py:77: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/wesm/code/arrow/cpp/build-support/lint_exclusions.txt' mode='r' encoding='UTF-8'>
  for line in open(arguments.exclude_globs):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/home/wesm/code/arrow/cpp/build-support/cpplint.py:6240: ResourceWarning: unclosed file <_io.BufferedReader name='/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/add.h'>
  lines = codecs.open(filename, 'r', 'utf8', 'replace').read().split('\n')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/home/wesm/code/arrow/cpp/build-support/cpplint.py:6240: ResourceWarning: unclosed file <_io.BufferedReader name='/home/wesm/code/arrow/cpp/src/arrow/compute/kernels/util_internal.h'>
  lines = codecs.open(filename, 'r', 'utf8', 'replace').read().split('\n')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
{code}

I was using {{PYTHONDEVMODE=1}}, so this may be related.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
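The warning points at file handles that are opened but never explicitly closed. A minimal sketch of the usual fix for the {{run_cpplint.py}} line cited above, using a context manager so the handle is closed deterministically (the helper name is illustrative, not the script's actual structure):

{code:python}
def read_exclude_globs(path):
    """Read exclusion globs, closing the file deterministically.

    Replaces the unclosed-handle pattern:
        for line in open(arguments.exclude_globs): ...
    """
    with open(path) as f:
        return [line.strip() for line in f]
{code}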
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou wrote:
>
> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast compressors like ZSTD when
> > writing the RecordBatch bodies. This could be handled privately as an
> > implementation detail of the Feather file, but since ZSTD compression
> > could improve throughput in Flight, for example, I thought I would
> > bring it up for discussion.
> >
> > I can see two simple compression strategies:
> >
> > * Compress the entire message body in one shot, writing the result out
> > with an 8-byte int64 prefix indicating the uncompressed size
> > * Compress each non-zero-length constituent Buffer prior to writing to
> > the body (and using the same uncompressed-length prefix when writing
> > the compressed buffer)
> >
> > The latter strategy is preferable for scenarios where we may project
> > out only a few fields from a larger record batch (such as reading from
> > a memory-mapped file).
>
> Agreed. It may also allow using different compression strategies for
> different kinds of buffers (for example a bytestream splitting strategy
> for floats and doubles, or a delta encoding strategy for integers).

If we wanted to allow different compression schemes to apply to different
buffers, I think we will need a new Message type, because this would
inflate metadata sizes in a way that is not likely to be acceptable for
the current uncompressed use case. Here is my strawman proposal:

https://github.com/apache/arrow/compare/master...wesm:compression-strawman

> > Implementation could be accomplished by one of the following methods:
> >
> > * Setting a field in Message.custom_metadata
> > * Adding a new field to Message
>
> I think it has to be a new field in Message. Making it an ignorable
> metadata field means non-supporting receivers will decode and interpret
> the data wrongly.
>
> Regards
>
> Antoine.
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
I also support compression at the buffer level, and making it an extra
message type.

Talking about compression and Flight: has anyone tested using gRPC's
compression to compress at the transport level (if that's the correct way
to describe it)? I believe only gzip and brotli are currently supported,
so that might be insufficient.

On Sun, 01 Mar 2020, 23:14 Antoine Pitrou wrote:

>
> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast compressors like ZSTD when
> > writing the RecordBatch bodies. This could be handled privately as an
> > implementation detail of the Feather file, but since ZSTD compression
> > could improve throughput in Flight, for example, I thought I would
> > bring it up for discussion.
> >
> > I can see two simple compression strategies:
> >
> > * Compress the entire message body in one shot, writing the result out
> > with an 8-byte int64 prefix indicating the uncompressed size
> > * Compress each non-zero-length constituent Buffer prior to writing to
> > the body (and using the same uncompressed-length prefix when writing
> > the compressed buffer)
> >
> > The latter strategy is preferable for scenarios where we may project
> > out only a few fields from a larger record batch (such as reading from
> > a memory-mapped file).
>
> Agreed. It may also allow using different compression strategies for
> different kinds of buffers (for example a bytestream splitting strategy
> for floats and doubles, or a delta encoding strategy for integers).
>
> > Implementation could be accomplished by one of the following methods:
> >
> > * Setting a field in Message.custom_metadata
> > * Adding a new field to Message
>
> I think it has to be a new field in Message. Making it an ignorable
> metadata field means non-supporting receivers will decode and interpret
> the data wrongly.
>
> Regards
>
> Antoine.
>
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On 01/03/2020 at 22:01, Wes McKinney wrote:
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies. This could be handled privately as an
> implementation detail of the Feather file, but since ZSTD compression
> could improve throughput in Flight, for example, I thought I would
> bring it up for discussion.
>
> I can see two simple compression strategies:
>
> * Compress the entire message body in one shot, writing the result out
> with an 8-byte int64 prefix indicating the uncompressed size
> * Compress each non-zero-length constituent Buffer prior to writing to
> the body (and using the same uncompressed-length prefix when writing
> the compressed buffer)
>
> The latter strategy is preferable for scenarios where we may project
> out only a few fields from a larger record batch (such as reading from
> a memory-mapped file).

Agreed. It may also allow using different compression strategies for
different kinds of buffers (for example a bytestream splitting strategy
for floats and doubles, or a delta encoding strategy for integers).

> Implementation could be accomplished by one of the following methods:
>
> * Setting a field in Message.custom_metadata
> * Adding a new field to Message

I think it has to be a new field in Message. Making it an ignorable
metadata field means non-supporting receivers will decode and interpret
the data wrongly.

Regards

Antoine.
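To make the bytestream-splitting idea concrete, a small sketch of the transform for float64 data, assuming numpy is available (an illustration of the technique, not Arrow code):

{code:python}
import numpy as np

def bytestream_split(values: np.ndarray) -> bytes:
    """Group the i-th byte of every value into its own contiguous stream."""
    n, width = len(values), values.dtype.itemsize
    raw = values.view(np.uint8).reshape(n, width)
    return raw.T.tobytes()  # 'width' streams of n bytes each

split = bytestream_split(np.linspace(0.0, 1.0, 1024))
# 'split' usually compresses better than the raw bytes because the
# high-order exponent bytes of nearby floats are nearly identical.
{code}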
Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
On Sun, Mar 1, 2020 at 3:01 PM Wes McKinney wrote:
>
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies. This could be handled privately as an
> implementation detail of the Feather file, but since ZSTD compression
> could improve throughput in Flight, for example, I thought I would
> bring it up for discussion.

I should also add that I'm nearly done with implementing this for
experimentation purposes, which would allow us to collect some benchmark
data about how this affects Flight throughput on data having good
compression ratios.

> I can see two simple compression strategies:
>
> * Compress the entire message body in one shot, writing the result out
> with an 8-byte int64 prefix indicating the uncompressed size
> * Compress each non-zero-length constituent Buffer prior to writing to
> the body (and using the same uncompressed-length prefix when writing
> the compressed buffer)
>
> The latter strategy is preferable for scenarios where we may project
> out only a few fields from a larger record batch (such as reading from
> a memory-mapped file).
>
> Implementation could be accomplished by one of the following methods:
>
> * Setting a field in Message.custom_metadata
> * Adding a new field to Message
>
> There have been past discussions about standardizing encodings and
> allowing for sparse data representations, so compression could get
> rolled up in that, but I still think there would be value in having a
> very simple one-shot compression option for record batch bodies, so I
> don't think the initiatives are in conflict with each other.
>
> If this were of interest, it would be important to add this to the
> columnar specification ASAP for forward compatibility reasons, and any
> implementation that does not want to implement decompression right
> away can at least raise an error to say "this isn't supported".
>
> thanks
> Wes
[DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)
In the context of a "next version of the Feather format" ARROW-5510
(which is consumed only by Python and R at the moment), I have been
looking at compressing buffers using fast compressors like ZSTD when
writing the RecordBatch bodies. This could be handled privately as an
implementation detail of the Feather file, but since ZSTD compression
could improve throughput in Flight, for example, I thought I would
bring it up for discussion.

I can see two simple compression strategies:

* Compress the entire message body in one shot, writing the result out
with an 8-byte int64 prefix indicating the uncompressed size
* Compress each non-zero-length constituent Buffer prior to writing to
the body (and using the same uncompressed-length prefix when writing
the compressed buffer)

The latter strategy is preferable for scenarios where we may project
out only a few fields from a larger record batch (such as reading from
a memory-mapped file).

Implementation could be accomplished by one of the following methods:

* Setting a field in Message.custom_metadata
* Adding a new field to Message

There have been past discussions about standardizing encodings and
allowing for sparse data representations, so compression could get
rolled up in that, but I still think there would be value in having a
very simple one-shot compression option for record batch bodies, so I
don't think the initiatives are in conflict with each other.

If this were of interest, it would be important to add this to the
columnar specification ASAP for forward compatibility reasons, and any
implementation that does not want to implement decompression right
away can at least raise an error to say "this isn't supported".

thanks
Wes
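A minimal sketch of the second (per-buffer) strategy described above, framing each compressed buffer with a little-endian int64 uncompressed-length prefix; zlib stands in for a fast codec like ZSTD, and the function names are illustrative rather than any actual Arrow API:

{code:python}
import struct
import zlib

def compress_buffer(buf: bytes) -> bytes:
    """Frame one buffer as <int64 uncompressed length><compressed bytes>."""
    return struct.pack("<q", len(buf)) + zlib.compress(buf)

def write_body(buffers):
    """Compress each non-zero-length buffer; empty buffers pass through."""
    return b"".join(compress_buffer(b) if b else b for b in buffers)

body = write_body([b"some column data" * 1000, b"", b"validity bitmap bytes"])
{code}

A reader would locate each frame via the existing Buffer offset/length metadata, read the 8-byte prefix, decompress the remainder, and check it against the stated length.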
[jira] [Created] (ARROW-7972) Allow reading CSV in chunks
Bulat Yaminov created ARROW-7972:
---------------------------------

             Summary: Allow reading CSV in chunks
                 Key: ARROW-7972
                 URL: https://issues.apache.org/jira/browse/ARROW-7972
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Bulat Yaminov

Currently in the Python API you can read a CSV using [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html]. There are some settings for the reader that you can pass in [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions], but I don't see an option to read a part of the CSV file instead of the whole file (or starting from {{skip_rows}}). As a result, if I have a big CSV file that cannot fit into memory, I cannot process it with this API.

Would it be possible to implement a chunked iterator similar to [what Pandas allows|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]:

{code:python}
from pyarrow import csv

for table_chunk in csv.read_csv("big.csv", read_options=csv.ReadOptions(chunksize=1_000_000)):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}

Thanks in advance for your reaction.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
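A possible stop-gap that gets the chunked pattern today, assuming pandas is an acceptable intermediary (it trades pyarrow's multithreaded CSV parser for the pandas one):

{code:python}
import pandas as pd
import pyarrow as pa

# Read the CSV in bounded chunks via pandas, converting each to Arrow.
for df_chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    table_chunk = pa.Table.from_pandas(df_chunk)
    # do something with table_chunk, e.g. filter and save to disk
{code}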
Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-29-0
On Sat, Feb 29, 2020 at 3:57 PM Neal Richardson wrote:
>
> I'm looking into the R failures (https://github.com/apache/arrow/pull/6509).
> Since all of those docker-compose jobs are failing on Crossbow on Azure,
> but the one that we run on push/pull_request on GHA is passing
> (https://github.com/apache/arrow/actions/runs/46824058), my guess is
> something transient. Spot-checking one of the wheel failures, there's a
> timeout trying to download Boost from bintray, so it could be the same issue.

I've updated the OSX wheels to use the system Boost installed by brew,
although the issue still persists in other builds, like the Ubuntu 16.04
ones where we try to build the Boost external project. The download is
rejected by bintray with 403 Forbidden; there is an issue about it [1].
The GitHub release of Boost is not identical to the bintray source
release, and 1.71 is not available on SourceForge [2].

[1]: https://github.com/boostorg/boost/issues/375
[2]: https://sourceforge.net/projects/boost/files/boost/1.71.0/

> Either way I'll try to reproduce and get more failure logging.
>
> Neal
>
> On Sat, Feb 29, 2020 at 8:31 AM Crossbow wrote:
> >
> > Arrow Build Report for Job nightly-2020-02-29-0
> >
> > All tasks:
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0
> >
> > Failed Tasks:
> > - centos-7:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-github-centos-7
> > - centos-8:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-github-centos-8
> > - conda-linux-gcc-py37:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-conda-linux-gcc-py37
> > - conda-osx-clang-py36:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-conda-osx-clang-py36
> > - gandiva-jar-trusty:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-gandiva-jar-trusty
> > - test-conda-python-3.7-pandas-master:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-pandas-master
> > - test-conda-python-3.7-turbodbc-latest:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-turbodbc-latest
> > - test-conda-python-3.7-turbodbc-master:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-conda-python-3.7-turbodbc-master
> > - test-r-rhub-debian-gcc-devel:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rhub-debian-gcc-devel
> > - test-r-rhub-ubuntu-gcc-release:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rhub-ubuntu-gcc-release
> > - test-r-rstudio-r-base-3.6-bionic:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-bionic
> > - test-r-rstudio-r-base-3.6-centos6:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-centos6
> > - test-r-rstudio-r-base-3.6-opensuse15:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-opensuse15
> > - test-r-rstudio-r-base-3.6-opensuse42:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-test-r-rstudio-r-base-3.6-opensuse42
> > - test-ubuntu-16.04-cpp:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-ubuntu-16.04-cpp
> > - test-ubuntu-18.04-cpp-cmake32:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-circle-test-ubuntu-18.04-cpp-cmake32
> > - wheel-manylinux1-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux1-cp35m
> > - wheel-manylinux2010-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux2010-cp35m
> > - wheel-manylinux2014-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-azure-wheel-manylinux2014-cp35m
> > - wheel-osx-cp35m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp35m
> > - wheel-osx-cp36m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp36m
> > - wheel-osx-cp37m:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp37m
> > - wheel-osx-cp38:
> >   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-29-0-travis-wheel-osx-cp38
> > - wheel-win-cp38:
> >   URL:
[jira] [Created] (ARROW-7971) Create rowcount utility in Rust
Ken Suenobu created ARROW-7971:
-------------------------------

             Summary: Create rowcount utility in Rust
                 Key: ARROW-7971
                 URL: https://issues.apache.org/jira/browse/ARROW-7971
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Rust
            Reporter: Ken Suenobu

As a developer, I would like the ability to count the number of rows present in a Parquet file from the command line. Ideally, this would be something similar to {{parquet-rowcount}} or {{parquet-rows}} that counts the rows in one or more Parquet files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
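The count is available from the Parquet footer metadata without scanning any data pages; for reference, a sketch of the equivalent lookup in Python via pyarrow (the file name is hypothetical), which an eventual Rust utility could mirror by reading the same footer:

{code:python}
import pyarrow.parquet as pq

def row_count(path):
    """Read the total row count from the Parquet footer metadata."""
    return pq.ParquetFile(path).metadata.num_rows

print(row_count("example.parquet"))
{code}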