Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
Hi Antoine, I think Liya Fan raised some good points in his reply but I'd like to answer your questions directly. > So the question is whether this really needs to be in the in-memory > format, i.e. is it desired to operate directly on this compressed > format, or is it solely for transport? I

Re: IPC Tensor + Indices

2019-07-12 Thread Micah Kornfield
Hi Razvan, I'm not sure about plans around tensors. However, depending on how you are trying to transfer the data and consume it, you might consider using an extension type [1]. For the physical representation you could model it as something like: { RowLabel : Date32/64 ColumnLabels :

[jira] [Created] (ARROW-5943) [GLib][Gandiva] Add support for function aliases

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5943: --- Summary: [GLib][Gandiva] Add support for function aliases Key: ARROW-5943 URL: https://issues.apache.org/jira/browse/ARROW-5943 Project: Apache Arrow Issue

[jira] [Created] (ARROW-5942) [JS] Implement Tensor Type

2019-07-12 Thread Todd Hay (JIRA)
Todd Hay created ARROW-5942: --- Summary: [JS] Implement Tensor Type Key: ARROW-5942 URL: https://issues.apache.org/jira/browse/ARROW-5942 Project: Apache Arrow Issue Type: New Feature

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Sutou Kouhei
Hi, I've created pull requests that were used to release 0.14.0: ARROW-5937: [Release] Stop parallel binary upload https://github.com/apache/arrow/pull/4868 ARROW-5938: [Release] Create branch for adding release note automatically https://github.com/apache/arrow/pull/4869 ARROW-5939: [Release]

[jira] [Created] (ARROW-5941) [Release] Avoid re-uploading already uploaded binary artifacts

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5941: --- Summary: [Release] Avoid re-uploading already uploaded binary artifacts Key: ARROW-5941 URL: https://issues.apache.org/jira/browse/ARROW-5941 Project: Apache Arrow

[jira] [Created] (ARROW-5940) [Release] Add support for re-uploading sign/checksum for binary artifacts

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5940: --- Summary: [Release] Add support for re-uploading sign/checksum for binary artifacts Key: ARROW-5940 URL: https://issues.apache.org/jira/browse/ARROW-5940 Project:

[jira] [Created] (ARROW-5939) [Release] Add support for generating vote email template separately

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5939: --- Summary: [Release] Add support for generating vote email template separately Key: ARROW-5939 URL: https://issues.apache.org/jira/browse/ARROW-5939 Project: Apache

RE: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-12 Thread Malakhov, Anton
Hi, folks We were discussing improvements for the threading engine back in May and agreed to implement benchmarks (sorry, I've lost the original mail thread, here is the link:

[jira] [Created] (ARROW-5938) [Release] Create branch for adding release note automatically

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5938: --- Summary: [Release] Create branch for adding release note automatically Key: ARROW-5938 URL: https://issues.apache.org/jira/browse/ARROW-5938 Project: Apache Arrow

[jira] [Created] (ARROW-5937) [Release] Stop parallel binary upload

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5937: --- Summary: [Release] Stop parallel binary upload Key: ARROW-5937 URL: https://issues.apache.org/jira/browse/ARROW-5937 Project: Apache Arrow Issue Type:

Re: IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Sure. I'd like to bundle an M x N shaped tensor along with the M row labels (dates) and N column labels (string identifiers) in one response. Razvan On Fri, Jul 12, 2019, 6:53 PM Wes McKinney wrote: > hi Razvan -- can you clarify what "together with a row and a column > index" means? > > On

[jira] [Created] (ARROW-5936) [C++] [FlightRPC] user_metadata is not present in fields read from flight

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5936: Summary: [C++] [FlightRPC] user_metadata is not present in fields read from flight Key: ARROW-5936 URL: https://issues.apache.org/jira/browse/ARROW-5936

[jira] [Created] (ARROW-5935) [C++] ArrayBuilders with mutable type are not robustly supported

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5935: Summary: [C++] ArrayBuilders with mutable type are not robustly supported Key: ARROW-5935 URL: https://issues.apache.org/jira/browse/ARROW-5935 Project:

[jira] [Created] (ARROW-5934) [Python] Bundle arrow's LICENSE with the wheels

2019-07-12 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5934: -- Summary: [Python] Bundle arrow's LICENSE with the wheels Key: ARROW-5934 URL: https://issues.apache.org/jira/browse/ARROW-5934 Project: Apache Arrow

Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-12 Thread Ben Kietzman
Thanks all, this is helpful and I've added https://issues.apache.org/jira/browse/ARROW-5933 to improve the documentation for future developers. On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau wrote: > I was also supportive of this pattern. We definitely have used it before to > optimize in

[jira] [Created] (ARROW-5933) [C++] [Documentation] add discussion of Union.typeIds to Layout.rst

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5933: Summary: [C++] [Documentation] add discussion of Union.typeIds to Layout.rst Key: ARROW-5933 URL: https://issues.apache.org/jira/browse/ARROW-5933 Project:

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Krisztián Szűcs
Thanks for collecting them! We should also run the packaging tasks on them before cutting RC0. On Fri, Jul 12, 2019 at 8:28 PM Wes McKinney wrote: > I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 > to include all the cited patches, as well as the Parquet forward >

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Wes McKinney
I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 to include all the cited patches, as well as the Parquet forward compatibility fix. I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered IPC crash) and the ARROW-5889 (Parquet backwards compatibility with 0.13)

[jira] [Created] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

2019-07-12 Thread Cong Ding (JIRA)
Cong Ding created ARROW-5932: Summary: undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11' Key: ARROW-5932 URL: https://issues.apache.org/jira/browse/ARROW-5932 Project: Apache Arrow

[jira] [Created] (ARROW-5931) [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5931: --- Summary: [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips Key: ARROW-5931 URL:

Re: IPC Tensor + Indices

2019-07-12 Thread Wes McKinney
hi Razvan -- can you clarify what "together with a row and a column index" means? On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu wrote: > > Hi, > > Does the IPC format currently support streaming a tensor together with a > row and a column index? If not, are there any plans for this to be >

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Francois Saint-Jacques
There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This one fixes a segfault found via fuzzing. François On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs wrote: > > PRs touching the wheel packaging scripts: > - https://github.com/apache/arrow/pull/4828 (lz4) > -

[jira] [Created] (ARROW-5930) [FlightRPC] [Python] Flight CI tests are failing

2019-07-12 Thread lidavidm (JIRA)
lidavidm created ARROW-5930: --- Summary: [FlightRPC] [Python] Flight CI tests are failing Key: ARROW-5930 URL: https://issues.apache.org/jira/browse/ARROW-5930 Project: Apache Arrow Issue Type: Bug

IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Hi, Does the IPC format currently support streaming a tensor together with a row and a column index? If not, are there any plans for this to be supported? It'd be quite useful for matrices that could have tens of thousands of rows, columns, or both. For my use case I am currently

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-12 Thread Wes McKinney
hi Liya -- yes, it seems reasonable to defer the conversion from your pointer-based extension representation to a proper VarCharVector until you need to send over IPC. Note that there is no mechanism yet in Java with extension types to cause a conversion to take place when the IPC step is

[jira] [Created] (ARROW-5929) [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5929: --- Summary: [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC Key: ARROW-5929 URL: https://issues.apache.org/jira/browse/ARROW-5929

[jira] [Created] (ARROW-5928) [JS] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5928: --- Summary: [JS] Test fuzzer inputs Key: ARROW-5928 URL: https://issues.apache.org/jira/browse/ARROW-5928 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-5927) [Go] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5927: --- Summary: [Go] Test fuzzer inputs Key: ARROW-5927 URL: https://issues.apache.org/jira/browse/ARROW-5927 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-5926) [Java] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5926: --- Summary: [Java] Test fuzzer inputs Key: ARROW-5926 URL: https://issues.apache.org/jira/browse/ARROW-5926 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-5925) [Gandiva][C++] cast decimal to int should round up

2019-07-12 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5925: - Summary: [Gandiva][C++] cast decimal to int should round up Key: ARROW-5925 URL: https://issues.apache.org/jira/browse/ARROW-5925 Project: Apache Arrow

[jira] [Created] (ARROW-5924) [C++][Plasma] It is not convenient to release a GPU object

2019-07-12 Thread shengjun.li (JIRA)
shengjun.li created ARROW-5924: -- Summary: [C++][Plasma] It is not convenient to release a GPU object Key: ARROW-5924 URL: https://issues.apache.org/jira/browse/ARROW-5924 Project: Apache Arrow

[jira] [Created] (ARROW-5923) [C++] Fix int96 comment

2019-07-12 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5923: - Summary: [C++] Fix int96 comment Key: ARROW-5923 URL: https://issues.apache.org/jira/browse/ARROW-5923 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-5922) Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-07-12 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-5922: Summary: Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API Key: ARROW-5922 URL: https://issues.apache.org/jira/browse/ARROW-5922

[jira] [Created] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-07-12 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5921: Summary: [C++][Fuzzing] Missing nullptr checks in IPC Key: ARROW-5921 URL: https://issues.apache.org/jira/browse/ARROW-5921 Project: Apache Arrow Issue

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Krisztián Szűcs
PRs touching the wheel packaging scripts: - https://github.com/apache/arrow/pull/4828 (lz4) - https://github.com/apache/arrow/pull/4833 (uriparser - only if https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a is cherry picked as well) -

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Fan Liya
@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme. For some encoding schemes, it is not a good idea to use them for in-memory data compression. For others, it is beneficial to operate directly on the compressed data. For example, it is beneficial to

[jira] [Created] (ARROW-5920) [Java] Support sort & compare for all variable width vectors

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5920: --- Summary: [Java] Support sort & compare for all variable width vectors Key: ARROW-5920 URL: https://issues.apache.org/jira/browse/ARROW-5920 Project: Apache Arrow

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Hatem Helal
Thanks François, I closed PARQUET-1623 this morning. It would be nice to include the PR in the patch release: https://github.com/apache/arrow/pull/4857 This bug has been around for a few releases but I think it should be a low risk change to include. Hatem On 7/12/19, 2:27 AM, "Francois

Re: [Python] Wheel questions

2019-07-12 Thread Antoine Pitrou
On 12/07/2019 at 11:39, Uwe L. Korn wrote: > Actually the most pragmatic way I have thought of yet would be to use conda > and build all our dependencies. Instead of using the compilers defaults and > conda-forge use, we should build the dependencies in the manylinux image > and then

[jira] [Created] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge

2019-07-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5919: -- Summary: [R] Add nightly tests for building r-arrow with dependencies from conda-forge Key: ARROW-5919 URL: https://issues.apache.org/jira/browse/ARROW-5919 Project:

Re: [Python] Wheel questions

2019-07-12 Thread Uwe L. Korn
Hello, On Thu, Jul 11, 2019, at 9:51 PM, Wes McKinney wrote: > On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou wrote: > > > > > > On 11/07/2019 at 17:52, Krisztián Szűcs wrote: > > > Hi All, > > > > > > I have a couple of questions about the wheel packaging: > > > - why do we build an arrow

Re: [DISCUSS][FORMAT] Data Integrity

2019-07-12 Thread Antoine Pitrou
On 12/07/2019 at 09:56, Micah Kornfield wrote: > Per Antoine's recommendation, I'm splitting off the discussion about data > integrity from the previous e-mail thread about the format additions [1]. > To recap, I made a proposal including data integrity [2] by adding a new > message type to

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Antoine Pitrou
On 12/07/2019 at 10:08, Micah Kornfield wrote: > OK, I've created a separate thread for data integrity/digests [1], and > retitled this thread to continue the discussion on compression and > encodings. As a reminder the PR for the format additions [2] suggested a > new SparseRecordBatch that

[jira] [Created] (ARROW-5918) [Java] Revise the BaseIntVector interface

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5918: --- Summary: [Java] Revise the BaseIntVector interface Key: ARROW-5918 URL: https://issues.apache.org/jira/browse/ARROW-5918 Project: Apache Arrow Issue Type: Improvement

[jira] [Created] (ARROW-5917) [Java] Redesign the dictionary encoder

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5917: --- Summary: [Java] Redesign the dictionary encoder Key: ARROW-5917 URL: https://issues.apache.org/jira/browse/ARROW-5917 Project: Apache Arrow Issue Type: New Feature

[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
OK, I've created a separate thread for data integrity/digests [1], and retitled this thread to continue the discussion on compression and encodings. As a reminder the PR for the format additions [2] suggested a new SparseRecordBatch that would allow for the following features: 1. Different data

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-12 Thread Antoine Pitrou
I think it would be worthwhile to split the discussion into two separate threads. One thread for compression & encodings (which are related or even the same topic), one thread for data integrity. Regards Antoine. On 08/07/2019 at 07:22, Micah Kornfield wrote: > > - Compression: >*