[jira] [Created] (ARROW-8055) [GLib][Ruby] Add some metadata bindings to GArrowSchema

2020-03-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8055: --- Summary: [GLib][Ruby] Add some metadata bindings to GArrowSchema Key: ARROW-8055 URL: https://issues.apache.org/jira/browse/ARROW-8055 Project: Apache Arrow Is

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Radev, Martin
Hey Evan, thank you for the interest. There has been some effort for compressing floating-point data on the Parquet side, namely the BYTE_STREAM_SPLIT encoding. On its own it does not compress floating point data but makes it more compressible for when a compressor, such as ZSTD, LZ4, etc, is
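
A minimal sketch of the byte-splitting idea described in this message, not the Parquet implementation. It assumes little-endian float32 values, uses zlib from the standard library as a stand-in for ZSTD or LZ4, and the function names byte_stream_split and byte_stream_join are hypothetical.

```python
import struct
import zlib

WIDTH = 4  # bytes per float32 value (assumption: single-precision data)


def byte_stream_split(values):
    """Put the k-th byte of every value into stream k, then concatenate the streams."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_join(encoded):
    """Inverse transform: re-interleave the streams and unpack the floats."""
    n = len(encoded) // WIDTH
    streams = [encoded[k * n:(k + 1) * n] for k in range(WIDTH)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))


# Illustrative data only: slowly varying values share sign and exponent bytes.
values = [1.0 + i * 0.001 for i in range(10_000)]
plain = struct.pack(f"<{len(values)}f", *values)
split = byte_stream_split(values)

# The transform itself never changes the size; any gain comes from the compressor.
print("raw bytes:       ", len(plain), len(split))
print("zlib, plain:     ", len(zlib.compress(plain)))
print("zlib, split:     ", len(zlib.compress(split)))
```

The split output has the same length as the input, which matches the point that the encoding does not compress on its own; the two zlib sizes show how much the regrouping helps for a given dataset.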

[jira] [Created] (ARROW-8056) [R] Support read and write orc file format

2020-03-10 Thread Dyfan Jones (Jira)
Dyfan Jones created ARROW-8056: -- Summary: [R] Support read and write orc file format Key: ARROW-8056 URL: https://issues.apache.org/jira/browse/ARROW-8056 Project: Apache Arrow Issue Type: New F

[jira] [Created] (ARROW-8057) Schema equality not roundtrip safe

2020-03-10 Thread Florian Jetter (Jira)
Florian Jetter created ARROW-8057: - Summary: Schema equality not roundtrip safe Key: ARROW-8057 URL: https://issues.apache.org/jira/browse/ARROW-8057 Project: Apache Arrow Issue Type: Bug

[DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Fan Liya
Dear all, A non-nullable vector is one that is guaranteed to contain no nulls. We want to support non-nullable vectors in Java. *Motivations:* 1. It is widely used in practice. For example, in a database engine, a column can be declared as not null, so it cannot contain null values. 2. Non-nullabl

[jira] [Created] (ARROW-8058) [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions

2020-03-10 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8058: --- Summary: [C++][Python][Dataset] Provide an option to skip validation in FileSystemDatasetFactoryOptions Key: ARROW-8058 URL: https://issues.apache.org/jira/browse/ARROW-8058

Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Wes McKinney
hi Liya, In C++ we elect certain faster code paths when the null count is 0 or computed to be zero. When the null count is 0, we do not allocate a validity bitmap. And there is a "nullable" metadata-only flag at the Field level. Could the same kinds of optimizations be implemented in Java without
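
A rough sketch of the optimization described in this message, written in plain Python rather than against the Arrow C++ or Java APIs. The IntColumn class and its methods are hypothetical and exist only to illustrate skipping the validity bitmap and taking a faster path when the null count is 0.

```python
from typing import List, Optional


class IntColumn:
    """Hypothetical column: a value list plus an optional validity bitmap."""

    def __init__(self, values: List[Optional[int]]):
        self.null_count = sum(v is None for v in values)
        self.values = [0 if v is None else v for v in values]
        # Only allocate the validity bitmap when at least one null exists.
        self.validity: Optional[bytearray] = None
        if self.null_count > 0:
            self.validity = bytearray((len(values) + 7) // 8)
            for i, v in enumerate(values):
                if v is not None:
                    self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i: int) -> bool:
        # With no bitmap, every slot is valid by construction.
        if self.validity is None:
            return True
        return bool(self.validity[i // 8] & (1 << (i % 8)))

    def total(self) -> int:
        if self.null_count == 0:
            # Fast path: no per-element validity check.
            return sum(self.values)
        return sum(v for i, v in enumerate(self.values) if self.is_valid(i))


dense = IntColumn([1, 2, 3])
sparse = IntColumn([1, None, 3])
print(dense.validity is None, dense.total())
print(sparse.null_count, sparse.total())
```

In this sketch the null count is computed once at construction; whether it stays mutable afterwards is exactly the kind of design difference the thread goes on to discuss.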

Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Fan Liya
Hi Wes, Thanks a lot for your quick reply. I think what you mentioned is almost exactly what we want to do in Java. The concept is not important. Maybe there are only some minor differences: 1. In C++, the null_count is mutable, while for Java, once a vector is constructed as non-nullable, its nul

Re: Making a patch 0.16.1 Arrow release

2020-03-10 Thread Wes McKinney
It seems like the consensus is to push for a 0.17.0 major release sooner rather than doing a patch release, since releases in general are costly. This is fine with me. I see that a 0.17.0 milestone has been created in JIRA and some JIRA gardening has begun. Do you think we can be in a position to r

Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0

2020-03-10 Thread Don Hilborn
Unsubscribe -Don On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira) wrote: > Wes McKinney created ARROW-8049: > --- > > Summary: [C++] Upgrade bundled Thrift version to 0.13.0 > Key: ARROW-8049 > URL: https://issue

[jira] [Created] (ARROW-8059) [Python] Make FileSystem objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8059: Summary: [Python] Make FileSystem objects serializable Key: ARROW-8059 URL: https://issues.apache.org/jira/browse/ARROW-8059 Project: Apache Arrow

[jira] [Created] (ARROW-8060) [Python] Make dataset Expression objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8060: Summary: [Python] Make dataset Expression objects serializable Key: ARROW-8060 URL: https://issues.apache.org/jira/browse/ARROW-8060 Project: Apache Ar

[jira] [Created] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8061: Summary: [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups) Key: ARROW-8061 URL: https://issues.apache.org/jira/browse/ARROW

[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8062: Summary: [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file Key: ARROW-8062 URL: https://issues.apache.org/jira/browse/ARROW-8062

[jira] [Created] (ARROW-8063) [Python] Add user guide documentation for Datasets API

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8063: Summary: [Python] Add user guide documentation for Datasets API Key: ARROW-8063 URL: https://issues.apache.org/jira/browse/ARROW-8063 Project: Apache A

Re: [Rust] Dictionary encoding for strings?

2020-03-10 Thread Wes McKinney
I believe that dictionary encoding in-memory was very recently implemented (February 28) in https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab. Not sure about the other questions On Mon, Mar 9, 2020 at 11:07 PM Evan Chan wrote: >

[jira] [Created] (ARROW-8064) [Dev] Implement Comment bot via Github actions

2020-03-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8064: -- Summary: [Dev] Implement Comment bot via Github actions Key: ARROW-8064 URL: https://issues.apache.org/jira/browse/ARROW-8064 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions

2020-03-10 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8065: - Summary: [C++][Dataset] Untangle Dataset, Fragment and ScanOptions Key: ARROW-8065 URL: https://issues.apache.org/jira/browse/ARROW-8065 Project: Apa

[jira] [Created] (ARROW-8066) PyArrow discards timezones

2020-03-10 Thread Markovtsev Vadim (Jira)
Markovtsev Vadim created ARROW-8066: --- Summary: PyArrow discards timezones Key: ARROW-8066 URL: https://issues.apache.org/jira/browse/ARROW-8066 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-8067) [Python] FindPython3 fails on Python 3.5

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8067: --- Summary: [Python] FindPython3 fails on Python 3.5 Key: ARROW-8067 URL: https://issues.apache.org/jira/browse/ARROW-8067 Project: Apache Arrow Issue Type: Bug

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Wes McKinney
See this past mailing list thread https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937%40%3Cdev.arrow.apache.org%3E and associated PR https://github.com/apache/arrow/pull/4815 There hasn't been a lot of movement on this but primarily because all the key

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Evan Chan
Martin, Many thanks for the links. My main concern is not actually FP and integer data, but sparse string data. Having many very sparse arrays, each with a bitmap and values (assume dictionary also), would be really expensive. I have some ideas I’d like to throw out there, around something

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Evan Chan
Thank you Wes. If the stars line up I’d be interested in joining and contributing to this effort. I have a ton of ideas around efficient encodings for different types of data. > On Mar 10, 2020, at 2:52 PM, Wes McKinney wrote: > > See this past mailing list thread > > https://lists.apache.

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Wes McKinney
On Tue, Mar 10, 2020 at 5:01 PM Evan Chan wrote: > > Martin, > > Many thanks for the links. > > My main concern is not actually FP and integer data, but sparse string data. > Having many very sparse arrays, each with a bitmap and values (assume > dictionary also), would be really expensive. I h

[jira] [Created] (ARROW-8068) [Python] Externalize option whether to bundle zlib DLL in Python packages

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8068: --- Summary: [Python] Externalize option whether to bundle zlib DLL in Python packages Key: ARROW-8068 URL: https://issues.apache.org/jira/browse/ARROW-8068 Project: Apache

[jira] [Created] (ARROW-8069) [C++] Should the default value of "check_metadata" arguments of Equals methods be "true"?

2020-03-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8069: --- Summary: [C++] Should the default value of "check_metadata" arguments of Equals methods be "true"? Key: ARROW-8069 URL: https://issues.apache.org/jira/browse/ARROW-8069

[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0

2020-03-10 Thread Crossbow
Arrow Build Report for Job nightly-2020-03-10-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0 Failed Tasks: - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-10-0-github-centos-7 - centos-8: URL: https://g

[jira] [Created] (ARROW-8070) [Python] Casting Segfault

2020-03-10 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-8070: Summary: [Python] Casting Segfault Key: ARROW-8070 URL: https://issues.apache.org/jira/browse/ARROW-8070 Project: Apache Arrow Issue Type: Bug Re

[jira] [Created] (ARROW-8071) [GLib] Build error with configure

2020-03-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8071: --- Summary: [GLib] Build error with configure Key: ARROW-8071 URL: https://issues.apache.org/jira/browse/ARROW-8071 Project: Apache Arrow Issue Type: Bug

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0

2020-03-10 Thread Sutou Kouhei
Hi, Failures of Linux packages will be fixed by https://github.com/apache/arrow/pull/6575 . Sorry. Thanks, -- kou In <5e6834bf.1c69fb81.a268f.f...@mx.google.com> "[NIGHTLY] Arrow Build Report for Job nightly-2020-03-10-0" on Tue, 10 Mar 2020 17:45:51 -0700 (PDT), Crossbow wrote: > > Arro

Re: Summary of RLE and other compression efforts?

2020-03-10 Thread Micah Kornfield
+1 to what Wes said. I'm hoping to have some more time to spend on this end of Q2/beginning of Q3 if no progress is made by then. I still think we should be careful on what is added to the spec, in particular, we should be focused on encodings that can be used to improve computational efficiency

Re: [jira] [Created] (ARROW-8049) [C++] Upgrade bundled Thrift version to 0.13.0

2020-03-10 Thread Micah Kornfield
Hi Don, I believe you need to send an e-mail to dev-unsubscr...@arrow.apache.org instead of simply replying to the list. Thanks, Micah On Tue, Mar 10, 2020 at 8:57 AM Don Hilborn wrote: > Unsubscribe > > > -Don > > > On Mon, Mar 9, 2020 at 6:19 PM Wes McKinney (Jira) > wrote: > > > Wes McKinney create

Re: [Java] Port vector validate functionality

2020-03-10 Thread Micah Kornfield
I agree, it would also be good to run with some of the fuzzed IPC files. On Fri, Mar 6, 2020 at 6:20 AM Wes McKinney wrote: > Seems useful. It may be a good idea to run within integration tests as > an extra sanity check also > > On Fri, Mar 6, 2020 at 2:27 AM Ji Liu wrote: > > > > > > Hi all,

Re: [Discuss] [Java] Implement vector diff functionality

2020-03-10 Thread Micah Kornfield
I'm in favor of this. I think this can be combined with a custom matcher for Google's Truth [1] library, to make a lot of our unit tests much more readable [1] https://github.com/google/truth On Thu, Mar 5, 2020 at 11:29 PM Ji Liu wrote: > > Hi all, > In C++ side, we already have array diff fu

Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-10 Thread Micah Kornfield
Hi Liya Fan, I'm a little concerned that this will change assumptions for at least some of the clients using the library (some might always rely on the validity buffer being present). I think this is a good feature to have for the reasons you mentioned. It seems like there would need to be some so

[jira] [Created] (ARROW-8072) Add const constraint when parsing data

2020-03-10 Thread Siyuan Zhuang (Jira)
Siyuan Zhuang created ARROW-8072: Summary: Add const constraint when parsing data Key: ARROW-8072 URL: https://issues.apache.org/jira/browse/ARROW-8072 Project: Apache Arrow Issue Type: Impro

[jira] [Created] (ARROW-8073) [GLib] Add binding of arrow::fs::PathForest

2020-03-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8073: --- Summary: [GLib] Add binding of arrow::fs::PathForest Key: ARROW-8073 URL: https://issues.apache.org/jira/browse/ARROW-8073 Project: Apache Arrow Issue Type: Ne