Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
Hi Antoine,
I think Liya Fan raised some good points in his reply but I'd like to
answer your questions directly.


> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?

I tried to separate the two concepts into Encodings (things Arrow can
operate directly on) and Compression (solely for transport).  While there
is some overlap I think the two features can be considered separately.

For each encoding there is additional implementation complexity to properly
exploit it.  However, the benefit for some workloads can be large [1][2].
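As a toy illustration of the first category (plain Python, hypothetical data): with run-length encoding, an aggregation can be computed without materializing the decoded values, which is the kind of benefit [1][2] measure.

# RLE stores (value, run length) pairs; here they encode 3500 values.
runs = [(5, 1000), (7, 2500)]
# Sum over the column without decoding it first.
total = sum(value * length for value, length in runs)
print(total)  # 22500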

> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.


This is a reasonable point.  However, there is a continuum here between file
size and read/write times.  Parquet will likely always be the smallest, with
the largest times to convert to and from Arrow.  An uncompressed Feather/Arrow
file will likely always take the most space but will have much faster
conversion times.  The question is whether a buffer-level or some other
sub-file-level compression scheme provides enough value compared with
compressing the entire Feather file.  This is somewhat hand-wavy, but if we
feel we might want to investigate this further I can write some benchmarks to
quantify the differences.
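To make that concrete, a minimal sketch of such a benchmark (hypothetical data
and file names; assumes a recent pyarrow in which feather.write_feather accepts
a Table):

import time

import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Hypothetical table; a real benchmark would use representative data.
table = pa.table({"x": np.random.randn(1_000_000),
                  "y": np.random.randint(0, 100, size=1_000_000)})

def timed(label, write, read, path):
    start = time.time()
    write(path)
    middle = time.time()
    read(path)
    print("%s: write %.3fs, read %.3fs"
          % (label, middle - start, time.time() - middle))

timed("parquet", lambda p: pq.write_table(table, p),
      pq.read_table, "bench.parquet")
timed("feather", lambda p: feather.write_feather(table, p),
      feather.read_table, "bench.feather")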

Cheers,
Micah

[1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf

On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou  wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>


Re: IPC Tensor + Indices

2019-07-12 Thread Micah Kornfield
Hi Razvan,
I'm not sure about plans around tensors.  However, depending on how you are
trying to transfer the data and consume it, you might consider using an
extension type [1].  For the physical representation you could model it as
something like:

{
  RowLabel : Date32/64
  ColumnLabels : FixedSizeList (dictionary encoded)
  Data : FixedSizeList
}

which would be more compact than making N individual columns if N is
large.  You would have to handle the mapping from column label to index at
the application level, though.
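As a rough sketch in pyarrow (assuming a recent version where pa.list_ accepts
a fixed list_size; N stands in for the known column count):

import pyarrow as pa

N = 1000  # number of matrix columns (hypothetical)

schema = pa.schema([
    ("RowLabel", pa.date32()),
    ("ColumnLabels", pa.list_(pa.dictionary(pa.int32(), pa.string()), N)),
    ("Data", pa.list_(pa.float64(), N)),
])

Each record batch row would then carry one matrix row together with its
labels.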

Hope this helps.

-Micah

[1]
https://github.com/apache/arrow/blob/6fb850cf57fd6227573cca6d43a46e1d5d2b0a66/docs/source/format/Metadata.rst#extension-types

On Fri, Jul 12, 2019 at 1:53 PM Razvan Chitu 
wrote:

> Sure. I'd like to bundle an M x N shaped tensor along with the M row labels
> (dates) and N column labels (string identifiers) in one response.
>
> Razvan
>
> On Fri, Jul 12, 2019, 6:53 PM Wes McKinney  wrote:
>
> > hi Razvan -- can you clarify what "together with a row and a column
> > index" means?
> >
> > On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu 
> > wrote:
> > >
> > > Hi,
> > >
> > > Does the IPC format currently support streaming a tensor together with
> a
> > > row and a column index? If not, are there any plans for this to be
> > > supported? It'd be quite useful for matrices that could have 10s of
> > > thousands of either rows, columns or both. For my use case I am
> currently
> > > representing matrices as record batches, but performance is not that
> > great
> > > when there are many columns and few rows.
> > >
> > > Thanks,
> > > Razvan
> >
>


[jira] [Created] (ARROW-5943) [GLib][Gandiva] Add support for function aliases

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5943:
---

 Summary: [GLib][Gandiva] Add support for function aliases
 Key: ARROW-5943
 URL: https://issues.apache.org/jira/browse/ARROW-5943
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0








[jira] [Created] (ARROW-5942) [JS] Implement Tensor Type

2019-07-12 Thread Todd Hay (JIRA)
Todd Hay created ARROW-5942:
---

 Summary: [JS] Implement Tensor Type
 Key: ARROW-5942
 URL: https://issues.apache.org/jira/browse/ARROW-5942
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Todd Hay


Implement a generic N-dimensional tensor type for JavaScript





Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Sutou Kouhei
Hi,

I've created pull requests that were used to release 0.14.0:

ARROW-5937: [Release] Stop parallel binary upload
https://github.com/apache/arrow/pull/4868

ARROW-5938: [Release] Create branch for adding release note automatically
https://github.com/apache/arrow/pull/4869

ARROW-5939: [Release] Add support for generating vote email template separately
https://github.com/apache/arrow/pull/4870

ARROW-5940: [Release] Add support for re-uploading sign/checksum for binary 
artifacts
https://github.com/apache/arrow/pull/4871

ARROW-5941: [Release] Avoid re-uploading already uploaded binary artifacts
https://github.com/apache/arrow/pull/4872
(This will conflict with https://github.com/apache/arrow/pull/4868 .)


These will be useful for the 0.14.1 release.


Thanks,
--
kou

In 
  "Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, 
Parquet forward compatibility problems" on Fri, 12 Jul 2019 13:27:41 -0500,
  Wes McKinney  wrote:

> I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> to include all the cited patches, as well as the Parquet forward
> compatibility fix.
> 
> I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered
> IPC crash), and ARROW-5889 (Parquet backwards compatibility with
> 0.13) needs to be rebased
> 
> https://github.com/apache/arrow/pull/4856
> 
> I think those are the last 2 patches that should go into the branch
> unless something else comes up. Once those land I'll update the
> commands and then push up the patch release branch (hopefully
> everything will cherry pick cleanly)
> 
> On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques
>  wrote:
>>
>> There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
>> one fixes a segfault found via fuzzing.
>>
>> François
>>
>> On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
>>  wrote:
>> >
>> > PRs touching the wheel packaging scripts:
>> > - https://github.com/apache/arrow/pull/4828 (lz4)
>> > - https://github.com/apache/arrow/pull/4833 (uriparser - only if
>> > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
>> > is cherry picked as well)
>> > - https://github.com/apache/arrow/pull/4834 (zlib)
>> >
>> > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  wrote:
>> >
>> > > Thanks François, I closed PARQUET-1623 this morning.  It would be nice to
>> > > include the PR in the patch release:
>> > >
>> > > https://github.com/apache/arrow/pull/4857
>> > >
>> > > This bug has been around for a few releases but I think it should be a 
>> > > low
>> > > risk change to include.
>> > >
>> > > Hatem
>> > >
>> > >
>> > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
>> > > wrote:
>> > >
>> > > I just merged PARQUET-1623, I think it's worth inserting since it
>> > > fixes an invalid memory write. Note that I couldn't resolve/close the
> >> > > parquet issue, do I have to be a contributor to the project?
>> > >
>> > > François
>> > >
>> > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
>> > > wrote:
>> > > >
>> > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all 
>> > > the
>> > > > patches since the release commit and have come up with the 
>> > > following
>> > > > list of 32 fix-only patches to pick into a maintenance branch:
>> > > >
>> > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
>> > > >
>> > > > Note there's still unresolved Parquet forward/backward 
>> > > compatibility
>> > > > issues in C++ that we haven't merged patches for yet, so that is
>> > > > pending.
>> > > >
>> > > > Are there any other patches / JIRA issues people would like to see
>> > > > resolved in a patch release?
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney 
>> > > wrote:
>> > > > >
>> > > > > Eric -- you are free to set the Fix Version prior to the patch
>> > > being merged
>> > > > >
>> > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
>> > > > >  wrote:
>> > > > > >
>> > > > > > The two C# fixes I'd like in the 0.14.1 release are:
>> > > > > >
>> > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
>> > > marked with 0.14.1 fix version.
>> > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been
>> > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one
>> > > approver and the Rust failure doesn't appear to be caused by my change.
>> > > > > >
>> > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version
>> > > until the PR has been merged.
>> > > > > >
>> > > > > > -Original Message-
>> > > > > > From: Neal Richardson 
>> > > > > > Sent: Thursday, July 11, 2019 11:59 AM
>> > > > > > To: dev@arrow.apache.org
>> > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
>> > > package problems, Parquet forward compatibility problems
>> > > > > >
>> > 

[jira] [Created] (ARROW-5941) [Release] Avoid re-uploading already uploaded binary artifacts

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5941:
---

 Summary: [Release] Avoid re-uploading already uploaded binary 
artifacts
 Key: ARROW-5941
 URL: https://issues.apache.org/jira/browse/ARROW-5941
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0, 0.14.1








[jira] [Created] (ARROW-5940) [Release] Add support for re-uploading sign/checksum for binary artifacts

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5940:
---

 Summary: [Release] Add support for re-uploading sign/checksum for 
binary artifacts
 Key: ARROW-5940
 URL: https://issues.apache.org/jira/browse/ARROW-5940
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0, 0.14.1








[jira] [Created] (ARROW-5939) [Release] Add support for generating vote email template separately

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5939:
---

 Summary: [Release] Add support for generating vote email template 
separately
 Key: ARROW-5939
 URL: https://issues.apache.org/jira/browse/ARROW-5939
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0, 0.14.1








RE: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-12 Thread Malakhov, Anton
Hi, folks

We were discussing improvements for the threading engine back in May and agreed 
to implement benchmarks (sorry, I've lost the original mail thread, here is the 
link: 
https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E
 )

Here is an update on what's going on with this effort.
We've implemented a rough prototype for group_by, aggregate, and transform 
execution nodes on top of Arrow (along with studying the whole data analytics 
domain along the way :-) ) and made them parallel, as you can see in this 
repository: https://github.com/anton-malakhov/nyc_taxi

The result is that all these execution nodes scale well enough and run in under 
100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz with 128 GB RAM, while the 
CSV reader takes several seconds to complete even when reading from an 
in-memory file (8 GB); thus it is not yet I/O bound, even with good 
consumer-grade SSDs. My focus recently has therefore been on optimizing the CSV 
parser, where I have achieved a 50% improvement by substituting all the 
small-object allocations with the TBB scalable allocator and by using a 
TBB-based memory pool instead of the default one, with pre-allocated huge 
(2 MB) memory pages (echo 3 > /proc/sys/vm/nr_hugepages). 
I have not yet found a way to do both of these tricks with jemalloc, so please 
try to beat or meet my times without the TBB allocator. I also see other 
hotspots and opportunities for optimization; for example, memset is heavily 
used while resizing buffers (why?), and the column builder thrashes caches by 
not using streaming stores.

I used TBB directly to make the execution nodes parallel; however, I have also 
implemented a simple TBB-based ThreadPool and TaskGroup, as you can see in this 
PR: https://github.com/aregm/arrow/pull/6
I see consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup and 
BM_ThreadPoolSpawn microbenchmarks; however, applying it to the real-world task 
of the CSV reader, I don't see any improvements yet. Or even worse: while 
reading the file, TBB wastes some cycles spinning, probably because of the 
read-ahead thread, which oversubscribes the machine. Arrow's threading 
interacts better with the OS scheduler and thus shows better performance. So 
this simple approach to TBB, without a deeper redesign, didn't help. I'll be 
looking into applying more sophisticated NUMA- and locality-aware tricks as I 
clean up the paths for the data streams in the parser. Though I'll take some 
time off before returning to this effort. See you in September!


Regards,
// Anton



[jira] [Created] (ARROW-5938) [Release] Create branch for adding release note automatically

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5938:
---

 Summary: [Release] Create branch for adding release note 
automatically
 Key: ARROW-5938
 URL: https://issues.apache.org/jira/browse/ARROW-5938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0, 0.14.1








[jira] [Created] (ARROW-5937) [Release] Stop parallel binary upload

2019-07-12 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5937:
---

 Summary: [Release] Stop parallel binary upload
 Key: ARROW-5937
 URL: https://issues.apache.org/jira/browse/ARROW-5937
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 1.0.0, 0.14.1








Re: IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Sure. I'd like to bundle an M x N shaped tensor along with the M row labels
(dates) and N column labels (string identifiers) in one response.

Razvan

On Fri, Jul 12, 2019, 6:53 PM Wes McKinney  wrote:

> hi Razvan -- can you clarify what "together with a row and a column
> index" means?
>
> On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu 
> wrote:
> >
> > Hi,
> >
> > Does the IPC format currently support streaming a tensor together with a
> > row and a column index? If not, are there any plans for this to be
> > supported? It'd be quite useful for matrices that could have 10s of
> > thousands of either rows, columns or both. For my use case I am currently
> > representing matrices as record batches, but performance is not that
> great
> > when there are many columns and few rows.
> >
> > Thanks,
> > Razvan
>


[jira] [Created] (ARROW-5936) [C++] [FlightRPC] user_metadata is not present in fields read from flight

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5936:


 Summary: [C++] [FlightRPC] user_metadata is not present in fields 
read from flight
 Key: ARROW-5936
 URL: https://issues.apache.org/jira/browse/ARROW-5936
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Benjamin Kietzman


Should this go in the arrow::Field::metadata property somewhere? Does 
user_metadata round trip through some other channel?

https://github.com/apache/arrow/pull/4841#discussion_r302623241






[jira] [Created] (ARROW-5935) [C++] ArrayBuilders with mutable type are not robustly supported

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5935:


 Summary: [C++] ArrayBuilders with mutable type are not robustly 
supported
 Key: ARROW-5935
 URL: https://issues.apache.org/jira/browse/ARROW-5935
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


(Dense|Sparse)UnionBuilder, DictionaryBuilder, Adaptive(U)IntBuilders and any 
nested builder which contains one of those may Finish to an array whose type 
disagrees with what was passed to MakeBuilder. This is not well documented or 
supported; ListBuilder checks if its child has changed type but StructBuilder 
does not. Furthermore, ListBuilder's check does not catch modifications to a 
DictionaryBuilder's type and results in an invalid array on Finish: 
https://github.com/apache/arrow/blob/1bcfbe1/cpp/src/arrow/array-dict-test.cc#L951-L994

Let's add to the ArrayBuilder contract: the type property is null iff that 
builder's type is indeterminate until Finish() is called. Then all nested 
builders can check this on their children at construction and bubble type 
mutability correctly.





[jira] [Created] (ARROW-5934) [Python] Bundle arrow's LICENSE with the wheels

2019-07-12 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5934:
--

 Summary: [Python] Bundle arrow's LICENSE with the wheels
 Key: ARROW-5934
 URL: https://issues.apache.org/jira/browse/ARROW-5934
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Krisztian Szucs
 Fix For: 1.0.0, 0.14.1


Guide to bundle LICENSE files with the wheels: 
https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file

We also need to ensure that all third-party dependencies' licenses are 
attached, especially because we're statically linking multiple third-party 
dependencies; for example, uriparser is missing from the LICENSE file.

cc [~wesmckinn]





Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-12 Thread Ben Kietzman
Thanks all, this is helpful and I've added
https://issues.apache.org/jira/browse/ARROW-5933 to improve the
documentation for future developers.
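For reference, a minimal sketch of the id-to-child lookup at issue (plain
Python, hypothetical ids):

# Union.typeIds from the metadata maps each type_id to a child array.
type_ids = [3, 7, 10]
child_for_id = {tid: i for i, tid in enumerate(type_ids)}

# The type_ids buffer of a union array stores one id per slot.
buffer = [7, 3, 10, 7]
child_indices = [child_for_id[tid] for tid in buffer]  # -> [1, 0, 2, 1]

# Without Union.typeIds, the id is itself the child index, i.e. [1, 0, 2, 1].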

On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau  wrote:

> I was also supportive of this pattern. We definitely have used it before to
> optimize in certain cases.
>
> On Wed, Jul 10, 2019, 2:40 PM Wes McKinney  wrote:
>
> > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman 
> > wrote:
> > >
> > > In this scenario option A (include child arrays for each child type,
> even
> > > if that type is not observed) seems like the clearly correct choice to
> > me.
> > > It yields a more intuitive layout for the union array and incurs no
> > runtime
> > > overhead (since the absent children are empty/null arrays).
> >
> > I am not sure this is right. The child arrays still occupy memory in
> > the Sparse Union case (where all child arrays have the same length).
> > In order to satisfy the requirements of the IPC protocol, the child
> > arrays need to be of the same type as the types in the union. In the
> > Dense Union case, the not-present children will have length 0.
> >
> > >
> > > > why not allow them to be flexible in this regard?
> > >
> > > I would say that if code doesn't add anything except cognitive overhead
> > > then it's worthwhile to remove it.
> >
> > The cognitive overhead comes for the Arrow library implementer --
> > users of the libraries aren't required to deal with this detail
> > necessarily. The type ids are optional, after all. Even if it is
> > removed, you still have ids, so whether it's
> >
> > type 0, id=0
> > type 1, id=1
> > type 2, id=2
> >
> > or
> >
> > type 0, id=3
> > type 1, id=7
> > type 2, id=10
> >
> > the difference is in the second case, you have to look up the code
> > corresponding to each type rather than assuming that the type's
> > position and its code are the same.
> >
> > In processing, branching should occur at the Type level, so a function
> > to process a child looks like
> >
> > ProcessChild(child, child_id, ...)
> >
> > In either case you have to match a child with its id that appears in the
> > data.
> >
> > Anyway, since Julien and I are responsible for introducing this
> > concept in the early stages of the project I'm interested to hear more
> > from others. Note that this doesn't serve to resolve the
> > Union-of-Nested-Types problem that has prevented the development of
> > integration tests between Java and C++.
> >
> > >
> > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney 
> > wrote:
> > >
> > > > hi Ben,
> > > >
> > > > Some applications use static type ids for various data types. Let's
> > > > consider one possibility:
> > > >
> > > > BOOLEAN: 0
> > > > INT32: 1
> > > > DOUBLE: 2
> > > > STRING (UTF8): 3
> > > >
> > > > If you were parsing JSON and constructing unions while parsing, you
> > > > might encounter some types, but not all. So if we _don't_ have the
> > > > option of having type ids in the metadata then we are left with some
> > > > unsatisfactory options:
> > > >
> > > > A: Include all types in the resulting union, even if they are
> > unobserved,
> > > > or
> > > > B: Assign type id dynamically to types when they are observed
> > > >
> > > > Option B: is potentially bad because it does not parallelize across
> > > > threads or nodes.
> > > >
> > > > So I do think the feature is useful. It does make the implementations
> > > > of unions more complex, though, so it does not come without cost. But
> > > > unions are already the most complex tool we have in our nested data
> > > > toolbox, so why not allow them to be flexible in this regard?
> > > >
> > > > In any case I'm -0 on making changes, but would be interested in
> > > > feedback of others if there is strong sentiment about deprecating the
> > > > feature.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <
> ben.kietz...@rstudio.com
> > >
> > > > wrote:
> > > > >
> > > > > The Union.typeIds property is confusing and its utility is unclear.
> > I'd
> > > > > like to remove it (or at least document it better). Unless anyone
> > knows a
> > > > > real advantage for keeping it I plan to assemble a PR to drop it
> > from the
> > > > > format and the C++ implementation.
> > > > >
> > > > > ARROW-257 ( resolved by pull request
> > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > > optional
> > > > > typeIds property (in the C++ implementation, this is
> > > > > UnionType::type_codes). Prior to that pull request each element
> > (int8) in
> > > > > the type_ids (second) buffer of a union array was the index of a
> > child
> > > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> > union
> > > > > array began with a value from child_data[5]. After that change, to
> > > > > interpret a type_id of 5 one must look through the typeIds property,
> > > > > and the index at which a 5 is found is the index of the corresponding
> > > > > child array.
> > > > >
> > > > > The 

[jira] [Created] (ARROW-5933) [C++] [Documentation] add discussion of Union.typeIds to Layout.rst

2019-07-12 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5933:


 Summary: [C++] [Documentation] add discussion of Union.typeIds to 
Layout.rst 
 Key: ARROW-5933
 URL: https://issues.apache.org/jira/browse/ARROW-5933
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Union.typeIds is poorly documented and the corresponding property in UnionType 
is confusingly named type_codes. In particular, Layout.rst doesn't include an 
explanation of Union.typeIds and implies that an element of a union array's 
type_ids buffer is always the index of a child array.





Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Krisztián Szűcs
Thanks for collecting them!
We should also run the packaging tasks on them before cutting RC0.


On Fri, Jul 12, 2019 at 8:28 PM Wes McKinney  wrote:

> I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> to include all the cited patches, as well as the Parquet forward
> compatibility fix.
>
> I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered
> IPC crash), and ARROW-5889 (Parquet backwards compatibility with
> 0.13) needs to be rebased
>
> https://github.com/apache/arrow/pull/4856
>
> I think those are the last 2 patches that should go into the branch
> unless something else comes up. Once those land I'll update the
> commands and then push up the patch release branch (hopefully
> everything will cherry pick cleanly)
>
> On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques
>  wrote:
> >
> > There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
> > one fixes a segfault found via fuzzing.
> >
> > François
> >
> > On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
> >  wrote:
> > >
> > > PRs touching the wheel packaging scripts:
> > > - https://github.com/apache/arrow/pull/4828 (lz4)
> > > - https://github.com/apache/arrow/pull/4833 (uriparser - only if
> > >
> https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
> > > is cherry picked as well)
> > > - https://github.com/apache/arrow/pull/4834 (zlib)
> > >
> > > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal 
> wrote:
> > >
> > > > Thanks François, I closed PARQUET-1623 this morning.  It would be
> nice to
> > > > include the PR in the patch release:
> > > >
> > > > https://github.com/apache/arrow/pull/4857
> > > >
> > > > This bug has been around for a few releases but I think it should be
> a low
> > > > risk change to include.
> > > >
> > > > Hatem
> > > >
> > > >
> > > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" <
> fsaintjacq...@gmail.com>
> > > > wrote:
> > > >
> > > > I just merged PARQUET-1623, I think it's worth inserting since it
> > > > fixes an invalid memory write. Note that I couldn't
> resolve/close the
> > > > parquet issue, do I have to be a contributor to the project?
> > > >
> > > > François
> > > >
> > > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > wrote:
> > > > >
> > > > > I just merged Eric's 2nd patch ARROW-5908 and I went through
> all the
> > > > > patches since the release commit and have come up with the
> following
> > > > > list of 32 fix-only patches to pick into a maintenance branch:
> > > > >
> > > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > > > >
> > > > > Note there's still unresolved Parquet forward/backward
> compatibility
> > > > > issues in C++ that we haven't merged patches for yet, so that
> is
> > > > > pending.
> > > > >
> > > > > Are there any other patches / JIRA issues people would like to
> see
> > > > > resolved in a patch release?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > Eric -- you are free to set the Fix Version prior to the
> patch
> > > > being merged
> > > > > >
> > > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> > > > > >  wrote:
> > > > > > >
> > > > > > > The two C# fixes I'd like in the 0.14.1 release are:
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
> > > > marked with 0.14.1 fix version.
> > > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't
> been
> > > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has
> one
> > > > approver and the Rust failure doesn't appear to be caused by my
> change.
> > > > > > >
> > > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix
> version
> > > > until the PR has been merged.
> > > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Neal Richardson 
> > > > > > > Sent: Thursday, July 11, 2019 11:59 AM
> > > > > > > To: dev@arrow.apache.org
> > > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to
> Python
> > > > package problems, Parquet forward compatibility problems
> > > > > > >
> > > > > > > I just moved https://issues.apache.org/jira/browse/ARROW-5850
> > > > > > > from 1.0.0 to 0.14.1.
> > > > > > >
> > > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney <
> > > > wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > > To limit uncertainty, I'm going to start preparing a
> 0.14.1
> > > > patch
> > > > > > > > release branch. 

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Wes McKinney
I updated https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
to include all the cited patches, as well as the Parquet forward
compatibility fix.

I'm waiting on CI to be able to pass ARROW-5921 (fuzzing-discovered
IPC crash), and ARROW-5889 (Parquet backwards compatibility with
0.13) needs to be rebased

https://github.com/apache/arrow/pull/4856

I think those are the last 2 patches that should go into the branch
unless something else comes up. Once those land I'll update the
commands and then push up the patch release branch (hopefully
everything will cherry pick cleanly)

On Fri, Jul 12, 2019 at 12:34 PM Francois Saint-Jacques
 wrote:
>
> There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
> one fixes a segfault found via fuzzing.
>
> François
>
> On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
>  wrote:
> >
> > PRs touching the wheel packaging scripts:
> > - https://github.com/apache/arrow/pull/4828 (lz4)
> > - https://github.com/apache/arrow/pull/4833 (uriparser - only if
> > https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
> > is cherry picked as well)
> > - https://github.com/apache/arrow/pull/4834 (zlib)
> >
> > On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  wrote:
> >
> > > Thanks François, I closed PARQUET-1623 this morning.  It would be nice to
> > > include the PR in the patch release:
> > >
> > > https://github.com/apache/arrow/pull/4857
> > >
> > > This bug has been around for a few releases but I think it should be a low
> > > risk change to include.
> > >
> > > Hatem
> > >
> > >
> > > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
> > > wrote:
> > >
> > > I just merged PARQUET-1623, I think it's worth inserting since it
> > > fixes an invalid memory write. Note that I couldn't resolve/close the
> > > parquet issue, do I have to be a contributor to the project?
> > >
> > > François
> > >
> > > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
> > > wrote:
> > > >
> > > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the
> > > > patches since the release commit and have come up with the following
> > > > list of 32 fix-only patches to pick into a maintenance branch:
> > > >
> > > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > > >
> > > > Note there's still unresolved Parquet forward/backward compatibility
> > > > issues in C++ that we haven't merged patches for yet, so that is
> > > > pending.
> > > >
> > > > Are there any other patches / JIRA issues people would like to see
> > > > resolved in a patch release?
> > > >
> > > > Thanks
> > > >
> > > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney 
> > > wrote:
> > > > >
> > > > > Eric -- you are free to set the Fix Version prior to the patch
> > > being merged
> > > > >
> > > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> > > > >  wrote:
> > > > > >
> > > > > > The two C# fixes I'd like in the 0.14.1 release are:
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
> > > marked with 0.14.1 fix version.
> > > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been
> > > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one
> > > approver and the Rust failure doesn't appear to be caused by my change.
> > > > > >
> > > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version
> > > until the PR has been merged.
> > > > > >
> > > > > > -Original Message-
> > > > > > From: Neal Richardson 
> > > > > > Sent: Thursday, July 11, 2019 11:59 AM
> > > > > > To: dev@arrow.apache.org
> > > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
> > > package problems, Parquet forward compatibility problems
> > > > > >
> > > > > > I just moved https://issues.apache.org/jira/browse/ARROW-5850
> > > > > > from 1.0.0 to 0.14.1.
> > > > > >
> > > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney <
> > > wesmck...@gmail.com> wrote:
> > > > > >
> > > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1
> > > patch
> > > > > > > release branch. I will update the list with the patches that
> > > are being
> > > > > > > cherry-picked. If other folks could give me a list of other
> > > PRs that
> > > > > > > need to be backported I will add them to the list. Any JIRA
> > > that needs
> > > > > > > to be included should have the "0.14.1" fix version added so
> > > we can
> > > > > > > keep track
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > > > > > >  

[jira] [Created] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

2019-07-12 Thread Cong Ding (JIRA)
Cong Ding created ARROW-5932:


 Summary: undefined reference to 
`__cxa_init_primary_exception@CXXABI_1.3.11'
 Key: ARROW-5932
 URL: https://issues.apache.org/jira/browse/ARROW-5932
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.0
 Environment: Linux Mint 19.1 Tessa
g++-6
Reporter: Cong Ding


I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed 
the instructions on the official Arrow website (using the Ubuntu 18.04 method). 
However, when I was trying to compile the examples, the g++ compiler threw 
some errors.

I have updated my g++ to g++-6, updated my libstdc++ library, and used the 
-lstdc++ flag, but it still didn't work.

 
{code:java}
g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ 
{code}
The error message:

/usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
`__cxa_init_primary_exception@CXXABI_1.3.11'
/usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
`std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
collect2: error: ld returned 1 exit status.

 

I do not know what to do at this moment. Can anyone help me?





[jira] [Created] (ARROW-5931) [C++] Extend extension types facility to provide for serialization and deserialization in IPC roundtrips

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5931:
---

 Summary: [C++] Extend extension types facility to provide for 
serialization and deserialization in IPC roundtrips
 Key: ARROW-5931
 URL: https://issues.apache.org/jira/browse/ARROW-5931
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


A use case here is when an array needs to reference some external data. For 
example, suppose that we wanted to implement an array that references a 
sequence of Python objects as {{PyObject**}}. Obviously, a {{PyObject*}} must 
be managed by the Python interpreter.

For a vector of some {{T*}} to be sent through the IPC machinery, it must be 
embedded in some Arrow type on the wire. For example, the memory-resident 
version of {{PyObject**}} might be 8 bytes per value (one pointer per value), 
while for serialization to the binary IPC protocol such {{PyObject*}} values 
must be serialized into an Arrow Binary type.
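A rough sketch of the kind of conversion implied (Python-level, hypothetical 
data; the C++ API for this is what this issue proposes to design):

{code:python}
import pickle

import pyarrow as pa

# Hypothetical Python objects that a pointer-based extension array
# would reference via PyObject*.
objs = [{"a": 1}, [1, 2, 3], "text"]

# Crossing the IPC boundary requires Arrow-resident bytes, e.g. by
# pickling each object into a Binary array.
serialized = pa.array([pickle.dumps(o) for o in objs], type=pa.binary())
{code}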







Re: IPC Tensor + Indices

2019-07-12 Thread Wes McKinney
hi Razvan -- can you clarify what "together with a row and a column
index" means?

On Fri, Jul 12, 2019 at 11:17 AM Razvan Chitu  wrote:
>
> Hi,
>
> Does the IPC format currently support streaming a tensor together with a
> row and a column index? If not, are there any plans for this to be
> supported? It'd be quite useful for matrices that could have 10s of
> thousands of either rows, columns or both. For my use case I am currently
> representing matrices as record batches, but performance is not that great
> when there are many columns and few rows.
>
> Thanks,
> Razvan


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Francois Saint-Jacques
There's also ARROW-5921 (I tagged it 0.14.1) if it passes travis. This
one fixes a segfault found via fuzzing.

François

On Fri, Jul 12, 2019 at 6:54 AM Krisztián Szűcs
 wrote:
>
> PRs touching the wheel packaging scripts:
> - https://github.com/apache/arrow/pull/4828 (lz4)
> - https://github.com/apache/arrow/pull/4833 (uriparser - only if
> https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
> is cherry picked as well)
> - https://github.com/apache/arrow/pull/4834 (zlib)
>
> On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  wrote:
>
> > Thanks François, I closed PARQUET-1623 this morning.  It would be nice to
> > include the PR in the patch release:
> >
> > https://github.com/apache/arrow/pull/4857
> >
> > This bug has been around for a few releases but I think it should be a low
> > risk change to include.
> >
> > Hatem
> >
> >
> > On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
> > wrote:
> >
> > I just merged PARQUET-1623, I think it's worth inserting since it
> > fixes an invalid memory write. Note that I couldn't resolve/close the
> > parquet issue, do I have to be a contributor to the project?
> >
> > François
> >
> > On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
> > wrote:
> > >
> > > I just merged Eric's 2nd patch ARROW-5908 and I went through all the
> > > patches since the release commit and have come up with the following
> > > list of 32 fix-only patches to pick into a maintenance branch:
> > >
> > > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> > >
> > > Note there's still unresolved Parquet forward/backward compatibility
> > > issues in C++ that we haven't merged patches for yet, so that is
> > > pending.
> > >
> > > Are there any other patches / JIRA issues people would like to see
> > > resolved in a patch release?
> > >
> > > Thanks
> > >
> > > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney 
> > wrote:
> > > >
> > > > Eric -- you are free to set the Fix Version prior to the patch
> > being merged
> > > >
> > > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> > > >  wrote:
> > > > >
> > > > > The two C# fixes I'd like in the 0.14.1 release are:
> > > > >
> > > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
> > marked with 0.14.1 fix version.
> > > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been
> > resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one
> > approver and the Rust failure doesn't appear to be caused by my change.
> > > > >
> > > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version
> > until the PR has been merged.
> > > > >
> > > > > -Original Message-
> > > > > From: Neal Richardson 
> > > > > Sent: Thursday, July 11, 2019 11:59 AM
> > > > > To: dev@arrow.apache.org
> > > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
> > package problems, Parquet forward compatibility problems
> > > > >
> > > > > I just moved https://issues.apache.org/jira/browse/ARROW-5850
> > > > > from 1.0.0 to 0.14.1.
> > > > >
> > > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney <
> > wesmck...@gmail.com> wrote:
> > > > >
> > > > > > To limit uncertainty, I'm going to start preparing a 0.14.1
> > patch
> > > > > > release branch. I will update the list with the patches that
> > are being
> > > > > > cherry-picked. If other folks could give me a list of other
> > PRs that
> > > > > > need to be backported I will add them to the list. Any JIRA
> > that needs
> > > > > > to be included should have the "0.14.1" fix version added so
> > we can
> > > > > > keep track
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > > > > >  wrote:
> > > > > > >
> > > > > > > I personally prefer 0.14.1 over 0.15.0. I think that is
> > clearer in
> > > > > > > communication, as we are fixing regressions of the 0.14.0
> > release.
> > > > > > >
> > > > > > > (but I haven't been involved much in releases, so certainly
> > no
> > > > > > > strong
> > > > > > > opinion)
> > > > > > >
> > > > > > > Joris
> > > > > > >
> > > > > > >
> > > > > > > Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney <
> > wesmck...@gmail.com>:
> > > > > > >
> > > > > > > > hi folks,
> > > > > > > >
> > > > > > > > Are there any opinions / strong feelings about the two
> > options:
> > > > > > > >
> > > > > > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > > > > > * Release 0.15.0 out of master
> > > > > > > >
> > > 

[jira] [Created] (ARROW-5930) [FlightRPC] [Python] Flight CI tests are failing

2019-07-12 Thread lidavidm (JIRA)
lidavidm created ARROW-5930:
---

 Summary: [FlightRPC] [Python] Flight CI tests are failing
 Key: ARROW-5930
 URL: https://issues.apache.org/jira/browse/ARROW-5930
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Python
Affects Versions: 0.14.0
Reporter: lidavidm


Flight tests segfault on Travis: 
[https://travis-ci.org/apache/arrow/jobs/557690959]

The relevant part is:
{noformat}
Fatal Python error: Aborted
Thread 0x7fcf009fe700 (most recent call first):
  File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", 
line 386 in _server_thread
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", 
line 864 in run
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", 
line 916 in _bootstrap_inner
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/threading.py", 
line 884 in _bootstrap
Current thread 0x7fcf1f9fa700 (most recent call first):
  File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", 
line 411 in flight_server
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/contextlib.py", 
line 99 in __exit__
  File "/home/travis/build/apache/arrow/python/pyarrow/tests/test_flight.py", 
line 670 in test_tls_do_get
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/python.py",
 line 165 in pytest_pyfunc_call
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 81 in 
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in _hookexec
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py",
 line 289 in __call__
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/python.py",
 line 1451 in runtest
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 117 in pytest_runtest_call
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 81 in 
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in _hookexec
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py",
 line 289 in __call__
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 192 in 
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 220 in from_call
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 192 in call_runtest_hook
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 167 in call_and_report
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 87 in runtestprotocol
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/runner.py",
 line 72 in pytest_runtest_protocol
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 81 in 
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in _hookexec
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py",
 line 289 in __call__
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py",
 line 278 in pytest_runtestloop
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/callers.py",
 line 187 in _multicall
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 81 in 
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/manager.py",
 line 87 in _hookexec
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pluggy/hooks.py",
 line 289 in __call__
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py",
 line 257 in _main
  File 
"/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/_pytest/main.py",
 

IPC Tensor + Indices

2019-07-12 Thread Razvan Chitu
Hi,

Does the IPC format currently support streaming a tensor together with a
row and a column index? If not, are there any plans for this to be
supported? It'd be quite useful for matrices that could have 10s of
thousands of either rows, columns or both. For my use case I am currently
representing matrices as record batches, but performance is not that great
when there are many columns and few rows.

Thanks,
Razvan


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-12 Thread Wes McKinney
hi Liya -- yes, it seems reasonable to defer the conversion from your
pointer-based extension representation to a proper VarCharVector until
you need to send over IPC.

Note that there is no mechanism yet in Java with extension types to
cause a conversion to take place when the IPC step is reached.

I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try
to explain this issue. Let me know if it is not clear

I'm interested to experiment with the same thing in C++. We would have
an ExtensionArray in C++ whose values are string_view referencing
external memory, for example.

- Wes

On Thu, Jul 11, 2019 at 10:16 PM Fan Liya  wrote:
>
> @Wes McKinney,
>
> Thanks a lot for the brainstorming. I think your ideas are reasonable and
> feasible.
> About IPC, my idea is that we can send the vector as a PointerStringVector,
> and receive it as a VarCharVector, so that the overhead of memory
> compaction can be hidden.
> What do you think?
>
> Best,
> Liya Fan
>
> On Fri, Jul 12, 2019 at 11:07 AM Fan Liya  wrote:
>
> > @Uwe L. Korn
> >
> > Thanks a lot for the suggestion. I think this is exactly what we are doing
> > right now.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney  wrote:
> >
> >> hi Liya -- have you thought about implementing this as an
> >> ExtensionType / ExtensionVector? You actually can already do this, so
> >> if this helps you reference strings stored in some external memory
> >> then that seems reasonable. Such a PointerStringVector could have a
> >> method that converts it into the Arrow varbinary columnar
> >> representation.
> >>
> >> You wouldn't be able to put such an object into the IPC binary
> >> protocol, though. If that's a requirement (being able to use the IPC
> >> protocol) for this kind of data, before going any further in the
> >> discussion I would suggest that you work out exactly how such data
> >> would be moved from one process address space to another (using
> >> Buffers).
> >>
> >> - Wes
> >>
> >> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn  wrote:
> >> >
> >> > Hello Liya Fan,
> >> >
> >> > here your best approach is to copy into the Arrow format, as you can
> >> > then use this as the basis for working with the Arrow-native
> >> > representation as well as your internal representation. You will have to
> >> > use two different offset vectors, as those two will always differ; but in
> >> > the case of your internal representation you don't have the requirement
> >> > of consecutive data that Arrow has, and you can still work with the
> >> > strings just as before even when stored consecutively.
> >> >
> >> > Uwe
> >> >
> >> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> >> > > Hi Korn,
> >> > >
> >> > > Thanks a lot for your comments.
> >> > >
> >> > > In my opinion, your comments make sense to me. Allowing
> >> non-consecutive
> >> > > memory segments will break some good design choices of Arrow.
> >> > > However, there are widespread user requirements for non-consecutive
> >> > > memory segments. I am wondering how we can help such users. What
> >> > > advice can we give to them?
> >> > >
> >> > > Memory copy/move can be a solution, but is there a better solution?
> >> > > Is there a third alternative? Can we virtualize the non-consecutive
> >> memory
> >> > > segments into a consecutive one? (Although performance overhead is
> >> > > unavoidable.)
> >> > >
> >> > > What do you think? Let's brain-storm it.
> >> > >
> >> > > Best,
> >> > > Liya Fan
> >> > >
> >> > >
> >> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
> >> > >
> >> > > > Hello Liya,
> >> > > >
> >> > > > I'm quite -1 on this type as Arrow is about efficient columnar
> >> structures.
> >> > > > We have opened the standard also to matrix-like types but always
> >> keep the
> >> > > > constraint of consecutive memory. Now also adding types where
> >> memory is no
> >> > > > longer consecutive but spread in the heap will make the scope of the
> >> > > > project much wider (It seems that we then just turn into a general
> >> > > > serialization framework).
> >> > > >
> >> > > > One of the ideas of a common standard is that some need to make
> >> > > > compromises. I think in this case it is a necessary compromise to
> >> not allow
> >> > > > all kind of string representations.
> >> > > >
> >> > > > Uwe
> >> > > >
> >> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> >> > > > > Hi all,
> >> > > > >
> >> > > > >
> >> > > > > We are thinking of providing varchar/varbinary vectors with a
> >> > > > > different memory layout which exists in a wide range of systems.
> >> > > > > The memory layout is different from that of VarCharVector in the
> >> > > > > following ways:
> >> > > > >
> >> > > > >    1. Instead of storing (start offset, end offset), the new
> >> > > > >       layout stores (start offset, length).
> >> > > > >    2. The content of varchars may not be in a 
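For reference, a rough sketch of the compaction implied by item 1 above (plain
Python with pyarrow; hypothetical data, and assuming pa.StringArray.from_buffers
as in recent pyarrow):

import numpy as np
import pyarrow as pa

data = b"...hello...world...."   # one shared heap of bytes (hypothetical)
entries = [(3, 5), (11, 5)]      # (start offset, length) per value

# Compact into Arrow's layout: contiguous values plus int32 offsets
# of length n + 1.
values = b"".join(data[s:s + n] for s, n in entries)
offsets = np.zeros(len(entries) + 1, dtype=np.int32)
for i, (_, n) in enumerate(entries):
    offsets[i + 1] = offsets[i] + n

arr = pa.StringArray.from_buffers(
    len(entries), pa.py_buffer(offsets), pa.py_buffer(values))
print(arr.to_pylist())  # ['hello', 'world']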

[jira] [Created] (ARROW-5929) [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5929:
---

 Summary: [Java] Define API for ExtensionVector whose data must be 
serialized prior to being sent via IPC
 Key: ARROW-5929
 URL: https://issues.apache.org/jira/browse/ARROW-5929
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Wes McKinney


As discussed on the mailing list, a possible use case for ExtensionVector 
involves having the Arrow buffers contain pointer-type values referring to 
memory outside of the Arrow memory heap. In IPC, such vectors would need to be 
serialized to a wholly Arrow-resident form, such as a VarBinaryVector. We do 
not have an API to allow for this, so this JIRA proposes to add new functions 
that can indicate to the IPC layer that an ExtensionVector requires additional 
serialization to a native Arrow type (in such case, the extension type metadata 
would be discarded)





[jira] [Created] (ARROW-5928) [JS] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5928:
---

 Summary: [JS] Test fuzzer inputs
 Key: ARROW-5928
 URL: https://issues.apache.org/jira/browse/ARROW-5928
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: 1.0.0


We are developing a fuzzer-based corpus of malformed IPC inputs

https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc

The JavaScript implementation should also test against these to verify that the 
correct kind of exception is raised





[jira] [Created] (ARROW-5927) [Go] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5927:
---

 Summary: [Go] Test fuzzer inputs
 Key: ARROW-5927
 URL: https://issues.apache.org/jira/browse/ARROW-5927
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney
 Fix For: 1.0.0


We are developing a fuzzer-based corpus of malformed IPC inputs

https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc

The Go implementation should also test against these to verify that the correct 
kind of exception is raised





[jira] [Created] (ARROW-5926) [Java] Test fuzzer inputs

2019-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5926:
---

 Summary: [Java] Test fuzzer inputs
 Key: ARROW-5926
 URL: https://issues.apache.org/jira/browse/ARROW-5926
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Wes McKinney
 Fix For: 1.0.0


We are developing a fuzzer-based corpus of malformed IPC inputs

https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc

The Java implementation should also test against these to verify that the 
correct kind of exception is raised





[jira] [Created] (ARROW-5925) [Gandiva][C++] cast decimal to int should round up

2019-07-12 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5925:
-

 Summary: [Gandiva][C++] cast decimal to int should round up
 Key: ARROW-5925
 URL: https://issues.apache.org/jira/browse/ARROW-5925
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra
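The description is empty; as a rough illustration of the presumed intent (an 
assumption from the title: the cast should round 2.5 to 3 rather than truncate 
it to 2):

{code}
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration only; "round up" is assumed to mean rounding halves up
// rather than discarding the fractional part.
public class DecimalCastIntent {
  public static void main(String[] args) {
    BigDecimal d = new BigDecimal("2.5");
    System.out.println(d.longValue());                        // 2 (truncation)
    System.out.println(d.setScale(0, RoundingMode.HALF_UP));  // 3 (rounded)
  }
}
{code}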






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5924) [C++][Plasma] It is not convenient to release a GPU object

2019-07-12 Thread shengjun.li (JIRA)
shengjun.li created ARROW-5924:
--

 Summary: [C++][Plasma] It is not convenient to release a GPU object
 Key: ARROW-5924
 URL: https://issues.apache.org/jira/browse/ARROW-5924
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Affects Versions: 0.14.0
Reporter: shengjun.li
 Fix For: 0.14.1


cmake_modules/DefineOptions.cmake
  define_option(ARROW_CUDA "Build the Arrow CUDA extensions (requires CUDA toolkit)" ON)
  define_option(ARROW_PLASMA "Build the plasma object store along with Arrow" ON)


The current sequence is as follows:
(1) plasma_client.Create(object_id, size, nullptr, 0, &buff, 1);  // where device_num > 0
(2) plasma_client.Seal(object_id);
(3) buff = nullptr;
(4) plasma_client.Release(object_id);
(5) plasma_client.Delete(object_id);


Buff must be set to nullptr (step 3) just before the object is released (step 
4), because CloseIpcBuffer is called in the destructor (class CudaBuffer).
If a user does not do that promptly, CloseIpcBuffer is blocked. Then, the 
following error may occur when another object is created:
    IOError: Cuda Driver API call in 
/home/zilliz/arrow/cpp/src/arrow/gpu/cuda_context.cc at line 156 failed with 
code 208: cuIpcOpenMemHandle(&data, *handle, 
CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS) (nil)


To prevent the risk, we can call CloseIpcBuffer manually.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5923) [C++] Fix int96 comment

2019-07-12 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5923:
-

 Summary: [C++] Fix int96 comment
 Key: ARROW-5923
 URL: https://issues.apache.org/jira/browse/ARROW-5923
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques
Assignee: Micah Kornfield






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5922) Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API

2019-07-12 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-5922:


 Summary: Unable to connect to HDFS from a worker/data node on a 
Kerberized cluster using pyarrow's hdfs API
 Key: ARROW-5922
 URL: https://issues.apache.org/jira/browse/ARROW-5922
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
 Environment: Unix
Reporter: Saurabh Bajaj
 Fix For: 0.14.0


Here's what I'm trying:

```
import pyarrow as pa

conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
```

However, when I submit this job to the cluster using Dask-YARN, I get the 
following error:

```
File "test/run.py", line 3
  fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
```

I also tried setting host (to a name node) and port (=8020); however, I run 
into the same error. Since the error is not descriptive, I'm not sure which 
setting needs to be altered. Any clues, anyone?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-07-12 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5921:


 Summary: [C++][Fuzzing] Missing nullptr checks in IPC
 Key: ARROW-5921
 URL: https://issues.apache.org/jira/browse/ARROW-5921
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.0
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, 
crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, 
crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, 
crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, 
crash-fd237566879dc60fff4d956d5fe3533d74a367f3

{{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-xxx
{code}

The attached crashes all have distinct sources and are all related to missing 
nullptr checks. I have a fix basically ready.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Krisztián Szűcs
PRs touching the wheel packaging scripts:
- https://github.com/apache/arrow/pull/4828 (lz4)
- https://github.com/apache/arrow/pull/4833 (uriparser - only if
https://github.com/apache/arrow/commit/88fcb096c4f24861bc7f8181cba1ad8be0e4048a
is cherry-picked as well)
- https://github.com/apache/arrow/pull/4834 (zlib)

On Fri, Jul 12, 2019 at 11:49 AM Hatem Helal  wrote:

> Thanks François, I closed PARQUET-1623 this morning.  It would be nice to
> include the PR in the patch release:
>
> https://github.com/apache/arrow/pull/4857
>
> This bug has been around for a few releases but I think it should be a low
> risk change to include.
>
> Hatem
>
>
> On 7/12/19, 2:27 AM, "Francois Saint-Jacques" 
> wrote:
>
> I just merged PARQUET-1623, I think it's worth inserting since it
> fixes an invalid memory write. Note that I couldn't resolve/close the
> parquet issue, do I have to be contributor to the project?
>
> François
>
> On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney 
> wrote:
> >
> > I just merged Eric's 2nd patch ARROW-5908 and I went through all the
> > patches since the release commit and have come up with the following
> > list of 32 fix-only patches to pick into a maintenance branch:
> >
> > https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
> >
> > Note there's still unresolved Parquet forward/backward compatibility
> > issues in C++ that we haven't merged patches for yet, so that is
> > pending.
> >
> > Are there any other patches / JIRA issues people would like to see
> > resolved in a patch release?
> >
> > Thanks
> >
> > On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney 
> wrote:
> > >
> > > Eric -- you are free to set the Fix Version prior to the patch
> being merged
> > >
> > > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> > >  wrote:
> > > >
> > > > The two C# fixes I'd like in the 0.14.1 release are:
> > > >
> > > > https://issues.apache.org/jira/browse/ARROW-5887 - already
> marked with 0.14.1 fix version.
> > > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been
> resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one
> approver and the Rust failure doesn't appear to be caused by my change.
> > > >
> > > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version
> until the PR has been merged.
> > > >
> > > > -Original Message-
> > > > From: Neal Richardson 
> > > > Sent: Thursday, July 11, 2019 11:59 AM
> > > > To: dev@arrow.apache.org
> > > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
> package problems, Parquet forward compatibility problems
> > > >
> > > > I just moved https://issues.apache.org/jira/browse/ARROW-5850
> > > > from 1.0.0 to 0.14.1.
> > > >
> > > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney <
> wesmck...@gmail.com> wrote:
> > > >
> > > > > To limit uncertainty, I'm going to start preparing a 0.14.1
> patch
> > > > > release branch. I will update the list with the patches that
> are being
> > > > > cherry-picked. If other folks could give me a list of other
> PRs that
> > > > > need to be backported I will add them to the list. Any JIRA
> that needs
> > > > > to be included should have the "0.14.1" fix version added so
> we can
> > > > > keep track
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > > > >  wrote:
> > > > > >
> > > > > > I personally prefer 0.14.1 over 0.15.0. I think that is
> clearer in
> > > > > > communication, as we are fixing regressions of the 0.14.0
> release.
> > > > > >
> > > > > > (but I haven't been involved much in releases, so certainly
> no
> > > > > > strong
> > > > > > opinion)
> > > > > >
> > > > > > Joris
> > > > > >
> > > > > >
> > > > > > Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney <
> wesmck...@gmail.com>:
> > > > > >
> > > > > > > hi folks,
> > > > > > >
> > > > > > > Are there any opinions / strong feelings about the two
> options:
> > > > > > >
> > > > > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > > > > * Release 0.15.0 out of master
> > > > > > >
> > > > > > > Aside from the Parquet forward compatibility issues we're
> still
> > > > > > > discussing, and Eric's C# patch PR 4836, are there any
> other
> > > > > > > issues that need to be fixed before we go down one of
> these paths?
> > > > > > >
> > > > > > > Would anyone like to help with release management? I can
> do so if
> > > > > > > necessary, but I've already done a lot of release management :)

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Fan Liya
@Antoine Pitrou,

Good question. I think the answer depends on the concrete encoding scheme.

For some encoding schemes, it is not a good idea to use them for in-memory
data compression.
For others, it is beneficial to operate directly on the compressed data.

For example, it is beneficial to directly work on RLE data, with better
locality and fewer cache misses.
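
As a toy illustration of that point (plain Java, not an Arrow API), an
aggregation over RLE data visits one entry per run instead of one per value:

```
// Summing a run-length-encoded int column: the loop does one multiply-add
// per run, so long runs translate directly into fewer memory accesses.
public class RleSum {
  public static void main(String[] args) {
    int[] runEnds = {1000, 1500, 4000};  // exclusive end index of each run
    int[] values  = {7, 3, 9};           // value repeated within each run
    long sum = 0;
    int start = 0;
    for (int i = 0; i < runEnds.length; i++) {
      sum += (long) values[i] * (runEnds[i] - start);
      start = runEnds[i];
    }
    System.out.println(sum);  // 7*1000 + 3*500 + 9*2500 = 31000
  }
}
```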

Best,
Liya Fan

On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou  wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>


[jira] [Created] (ARROW-5920) [Java] Support sort & compare for all variable width vectors

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5920:
---

 Summary: [Java] Support sort & compare for all variable width 
vectors
 Key: ARROW-5920
 URL: https://issues.apache.org/jira/browse/ARROW-5920
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


All variable-width vectors can reuse the same comparator for sorting & searching.
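
A sketch of the shared-comparator idea (illustrative only, not the final API): 
once a slot is materialized as a byte range, one lexicographic byte comparison 
serves VarCharVector, VarBinaryVector, and the rest.

{code}
// Minimal sketch: a single unsigned lexicographic comparator over the
// bytes of two values, shared by all variable-width vectors.
public final class BytesComparator {
  public static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int k = 0; k < n; k++) {
      int c = Byte.toUnsignedInt(a[k]) - Byte.toUnsignedInt(b[k]);
      if (c != 0) return c;
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    System.out.println(compare("apple".getBytes(), "apply".getBytes()) < 0);  // true
  }
}
{code}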



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-12 Thread Hatem Helal
Thanks François, I closed PARQUET-1623 this morning.  It would be nice to 
include the PR in the patch release:

https://github.com/apache/arrow/pull/4857

This bug has been around for a few releases but I think it should be a low risk 
change to include.

Hatem


On 7/12/19, 2:27 AM, "Francois Saint-Jacques"  wrote:

I just merged PARQUET-1623, I think it's worth inserting since it
fixes an invalid memory write. Note that I couldn't resolve/close the
parquet issue, do I have to be contributor to the project?

François

On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney  wrote:
>
> I just merged Eric's 2nd patch ARROW-5908 and I went through all the
> patches since the release commit and have come up with the following
> list of 32 fix-only patches to pick into a maintenance branch:
>
> https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
>
> Note there's still unresolved Parquet forward/backward compatibility
> issues in C++ that we haven't merged patches for yet, so that is
> pending.
>
> Are there any other patches / JIRA issues people would like to see
> resolved in a patch release?
>
> Thanks
>
> On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney  wrote:
> >
> > Eric -- you are free to set the Fix Version prior to the patch being 
merged
> >
> > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> >  wrote:
> > >
> > > The two C# fixes I'd like in the 0.14.1 release are:
> > >
> > > https://issues.apache.org/jira/browse/ARROW-5887 - already marked 
with 0.14.1 fix version.
> > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been 
resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one approver 
and the Rust failure doesn't appear to be caused by my change.
> > >
> > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until 
the PR has been merged.
> > >
> > > -Original Message-
> > > From: Neal Richardson 
> > > Sent: Thursday, July 11, 2019 11:59 AM
> > > To: dev@arrow.apache.org
> > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package 
problems, Parquet forward compatibility problems
> > >
> > > I just moved https://issues.apache.org/jira/browse/ARROW-5850
> > > from 1.0.0 to 0.14.1.
> > >
> > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  
wrote:
> > >
> > > > To limit uncertainty, I'm going to start preparing a 0.14.1 patch
> > > > release branch. I will update the list with the patches that are 
being
> > > > cherry-picked. If other folks could give me a list of other PRs that
> > > > need to be backported I will add them to the list. Any JIRA that 
needs
> > > > to be included should have the "0.14.1" fix version added so we can
> > > > keep track
> > > >
> > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > > >  wrote:
> > > > >
> > > > > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> > > > > communication, as we are fixing regressions of the 0.14.0 release.
> > > > >
> > > > > (but I haven't been involved much in releases, so certainly no
> > > > > strong
> > > > > opinion)
> > > > >
> > > > > Joris
> > > > >
> > > > >
> > > > > Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney 
:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > Are there any opinions / strong feelings about the two options:
> > > > > >
> > > > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > > > * Release 0.15.0 out of master
> > > > > >
> > > > > > Aside from the Parquet forward compatibility issues we're still
> > > > > > discussing, and Eric's C# patch PR 4836, are there any other
> > > > > > issues that need to be fixed before we go down one of these 
paths?
> > > > > >
> > > > > > Would anyone like to help with release management? I can do so 
if
> > > > > > necessary, but I've already done a lot of release management :)
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 

> > > > wrote:
> > > > > > >
> > > > > > > Hi Eric -- of course!
> > > > > > >
> > > > > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> > > > eric.erha...@microsoft.com.invalid>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> Can we propose getting changes other than Python or Parquet
> > > > > > >> related
> > > > > > into this release?
> > > > > > >>
> > > > > > >> For example, I found a critical issue in the C# 
implementation
> > 

Re: [Python] Wheel questions

2019-07-12 Thread Antoine Pitrou


Le 12/07/2019 à 11:39, Uwe L. Korn a écrit :
> Actually the most pragmatic way I have thought of yet would be to use conda 
> and build all our dependencies. Instead of using the compilers that defaults 
> and conda-forge use, we should build the dependencies in the manylinux image 
> and then upload them to a custom channel. This should also make the 
> maintenance of the arrow-manylinux docker container easy as this won't require 
> you then to do a full recompile of LLVM just because you changed something in 
> a preceding step.

That sounds cumbersome though.  Each upgrade or modification in the
building of those libraries needs changing and updating some conda
packages somewhere...  So we would be trading one inconvenience against
another.

Note I recently moved llvm and clang compilation up in the Dockerfile,
so most changes can now be done without recompiling them.

Regards

Antoine.


[jira] [Created] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge

2019-07-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5919:
--

 Summary: [R] Add nightly tests for building r-arrow with 
dependencies from conda-forge
 Key: ARROW-5919
 URL: https://issues.apache.org/jira/browse/ARROW-5919
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Python] Wheel questions

2019-07-12 Thread Uwe L. Korn
Hello,

On Thu, Jul 11, 2019, at 9:51 PM, Wes McKinney wrote:
> On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou  wrote:
> >
> >
> > Le 11/07/2019 à 17:52, Krisztián Szűcs a écrit :
> > > Hi All,
> > >
> > > I have a couple of questions about the wheel packaging:
> > > - why do we build an arrow namespaced boost on linux and osx, could we 
> > > link
> > > statically like with the windows wheels?
> >
> > No idea.  Boost shouldn't leak in the public APIs, so theoretically a
> > static build would be fine...

Static linkage is fine as long as we don't expose any Boost symbols. We had 
that historically in the Decimals. If this is gone, we can switch to static 
linkage.

> > > - do we explicitly say somewhere in the linux wheels to link the 3rdparty
> > > dependencies statically or just implicitly, by removing (or not building)
> > > the shared libs for the 3rdparty dependencies?
> >
> > It's implicit by removing the shared libs (or not building them).
> > Some time ago the compression libs were always linked statically by
> > default, but it was changed to dynamic along the time, probably to
> > please system packagers.
> 
> I think only libz shared library is being bundled, for security reasons

Ah, yes. This was why we made the dynamic linkage! Can you add a comment the 
next time you touch the build scripts?

> > > - couldn't we use the 3rdparty toolchain to build the smaller 3rdparty
> > > dependencies for the linux wheels instead of building them manually in the
> > > manylinux docker image - it'd be easier to say _SOURCE=BUNDLED
> >
> > I don't think so.  The conda-forge and Anaconda packages use a different
> > build chain (different compiler, different libstdc++ version) and may
> > not be usable directly on manylinux-compliant systems.
> 
> I think you may misunderstand. Krisztian is suggesting building the
> dependencies through the ExternalProject mechanism during "docker run"
> on the image rather than caching pre-built versions in the Docker
> image.
> 
> For small dependencies, I don't see why we couldn't used the BUNDLED
> approach. This might spare us having to maintain some of the build
> scripts. It will strictly increase build times, though -- I think the
> reason that everything is cached now is to save on build times (which
> have historically been quite long)

Actually the most pragmatic way I have thought of yet would be to use conda and 
build all our dependencies. Instead of using the compilers that defaults and 
conda-forge use, we should build the dependencies in the manylinux image 
and then upload them to a custom channel. This should also make the maintenance 
of the arrow-manylinux docker container easy as this won't require you then to 
do a full recompile of LLVM just because you changed something in a preceding 
step.

Uwe


Re: [DISCUSS][FORMAT] Data Integrity

2019-07-12 Thread Antoine Pitrou



Le 12/07/2019 à 09:56, Micah Kornfield a écrit :
> Per Antoine's recommendation, I'm splitting off the discussion about data
> integrity from the previous e-mail thread about the format additions [1].
> To recap, I made a proposal including data integrity [2] by adding a new
> message type to the IPC stream.
> 
> From the previous thread the main question was at what level to apply
> digests to Arrow data (Message level, array, buffer or potentially some
> hybrid).
> 
> Some trade-offs I've thought of for each approach:
> * Message level
> + Simplest implementation and can be applied across all messages with
> pretty much the same code.
> + Smallest amount of additional data (each digest will likely be 8-64 bytes)
> - It lacks granularity to recover partial data from a record batch if there
> is corruption.

Also:
- Will only apply to transmission errors using the IPC mechanism, not
other kinds of errors that may occur

> Array level:
> + Allows for reading non-corrupted columns
> + Allows for potentially more complicated use-cases like have different
> compute engines "collaborate" and sign each array they computed to
> establish a "chain-of-trust"
> - Adds some implementation complexity. Will need different schemes for
> message types other than RecordBatch and for message metadata.  We also
> need to determine digest boundaries (would a complex column be consumed
> entirely or would child arrays be separate).

Also:
- Need to compute a new checksum when slicing an array?

> Buffer level:
> More or less same issues as array but with the following other factors:
> - The most amount of additional data

It's not clear that's much of a problem (currently?), especially if
checksumming is optional.  Arrow isn't well-suited for use cases with
many tiny buffers...

> - Its not clear if there is a benefit of detecting if a single buffer is
> corrupted if it means we can't accurately decode the array.

Also:
+ decorrelated from logical interpretation of buffer, e.g. slicing

I think the possibility of a hybrid scheme should be discussed as well.
For example, compute physical checksums at the buffer level, then
devise a lightweight formula for the checksum of an array based on those
physical checksums.  And a formula for an IPC message's checksum based
on its type (schema, record batch, dictionary...).
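
To make the hybrid idea concrete, here is a rough sketch (CRC32 is just a
stand-in for whichever digest gets chosen, and the combining formula is an
assumption for illustration):

```
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Checksum each buffer physically, then derive an array-level checksum
// from the buffer checksums plus the array length, so most of the work
// is decorrelated from the array's logical interpretation.
public class HybridChecksum {
  static long bufferChecksum(ByteBuffer buf) {
    CRC32 crc = new CRC32();
    crc.update(buf.duplicate());  // duplicate() keeps the caller's position intact
    return crc.getValue();
  }

  static long arrayChecksum(long[] bufferChecksums, long arrayLength) {
    CRC32 crc = new CRC32();
    for (long c : bufferChecksums) {
      crc.update(ByteBuffer.allocate(8).putLong(0, c));
    }
    crc.update(ByteBuffer.allocate(8).putLong(0, arrayLength));
    return crc.getValue();
  }
}
```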

Regards

Antoine.


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Antoine Pitrou


Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> OK, I've created a separate thread for data integrity/digests [1], and
> retitled this thread to continue the discussion on compression and
> encodings.  As a reminder the PR for the format additions [2] suggested a
> new SparseRecordBatch that would allow for the following features:
> 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> (e.g. narrower bit-width integers)
> 2.  Compression at the buffer level
> 3.  Eliding all metadata and data for empty columns.

So the question is whether this really needs to be in the in-memory
format, i.e. is it desired to operate directly on this compressed
format, or is it solely for transport?

If the latter, I wonder why Parquet cannot simply be used instead of
reinventing something similar but different.

Regards

Antoine.


[jira] [Created] (ARROW-5918) [Java] Revise the BaseIntVector interface

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5918:
---

 Summary: [Java] Revise the BaseIntVector interface
 Key: ARROW-5918
 URL: https://issues.apache.org/jira/browse/ARROW-5918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan


1. The set method should not use long as a parameter. It is hardly ever the 
case that there are more than 2^32 distinct values in a dictionary. If that 
really happens, maybe it means we should not have used a dictionary in the 
first place.

2. In addition to the get method, there should also be a set method.
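
A hypothetical sketch of the revised interface (method names are illustrative, 
not the final API):

{code}
// Dictionary indices are read and written as int rather than long; the
// setter is the newly proposed companion to the existing getter.
public interface BaseIntVector {
  int getValueAsInt(int index);

  void setValueAsInt(int index, int value);
}
{code}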



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5917) [Java] Redesign the dictionary encoder

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5917:
---

 Summary: [Java] Redesign the dictionary encoder
 Key: ARROW-5917
 URL: https://issues.apache.org/jira/browse/ARROW-5917
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current dictionary encoder implementation 
(org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance 
overhead, which prevents it from being useful in practice:
 1. There are repeated conversions between Java objects and bytes (e.g. 
vector.getObject(i)).
 2. Unnecessary memory copies (the vector data must be copied to the hash table).
 3. The hash table cannot be reused for encoding multiple vectors (other data 
structures & results cannot be reused either).
 4. The output vector should not be created/managed by the encoder (just like in 
the out-of-place sorter).
 5. The hash table requires that the hashCode & equals methods be implemented 
appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module and gradually deprecate 
the current one.
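
A toy sketch of two of these ideas (illustrative only, not the planned 
algorithm-module API): key the hash table on raw byte ranges instead of 
decoded Java objects, and keep the table reusable across vectors.

{code}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class ByteDictionaryEncoder {
  // ByteBuffer has content-based hashCode/equals, so byte ranges can key
  // the table without decoding each slot into a String or similar object.
  private final Map<ByteBuffer, Integer> table = new HashMap<>();

  // Returns the dictionary index for the given byte range, assigning the
  // next free index on first sight.
  public int encode(byte[] data, int offset, int length) {
    ByteBuffer key = ByteBuffer.wrap(data, offset, length).slice();
    return table.computeIfAbsent(key, k -> table.size());
  }

  // Clears state so the same instance (and its table) can encode the
  // next vector.
  public void reset() {
    table.clear();
  }
}
{code}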



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
OK, I've created a separate thread for data integrity/digests [1], and
retitled this thread to continue the discussion on compression and
encodings.  As a reminder the PR for the format additions [2] suggested a
new SparseRecordBatch that would allow for the following features:
1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
(e.g. narrower bit-width integers)
2.  Compression at the buffer level
3.  Eliding all metadata and data for empty columns.

To recap my understanding of the highlights of the discussion so far:

Encodings:
There are some concerns over efficiency of some of the encodings in
different scenarios.
 * Eliding null values makes many algorithms less efficient
 * Joins might become harder with these encodings.
 * Also the additional code complexity came up on the Arrow sync call.

Compression:
- Buffer level compression might be too small a granularity for data
compression.
- General purpose compression at this level might not add much value, so it
might be better to keep it at the transport level.

Alternative designs:
* Put buffer level compression in specific transports (e.g. flight)
* Try to use the extension mechanism to support different encodings

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815

On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou  wrote:

>
> I think it would be worthwhile to split the discussion into two separate
> threads.  One thread for compression & encodings (which are related or
> even the same topic), one thread for data integrity.
>
> Regards
>
> Antoine.
>
>
> Le 08/07/2019 à 07:22, Micah Kornfield a écrit :
> >
> > - Compression:
> >*  Use parquet for random access to data elements.
> >-  This is one option, the main downside I see to this is
> generally
> > higher encoding/decoding costs.  Per below, I think it is reasonable to
> > wait until we have more data to add compression into the spec.
> >*  Have the transport layer do buffer specific compression:
> >   - I'm not a fan of this approach.  One nice thing about the
> current
> > communication protocols is once you strip away "framing" data all the
> byte
> > streams are equivalent.  I think the simplicity that follows in code from
> > this is a nice feature.
> >
> >
> > *Computational efficiency of array encodings:*
> >
> >> How does "more efficient computation" play out for operations such as
> >> hash or join?
> >
> > You would still need to likely materialize rows in most cases.   In some
> > "join" cases the sparse encoding of the null bitmap buffer could be a win
> > because it serves as an index to non-null values.
> >
> > I think I should clarify that these encodings aren't always a win
> depending
> > on workload/data shape, but can have a large impact when used
> appropriately
> > (especially at the "Expression evaluation stage").  Also, any wins don't
> > come for free; to exploit encodings properly will add some level of
> > complication to existing computation code.
> >
> > On a packed sparse array representation:
> >
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but
> >> compacting null slots complicates more advanced parallel routines that
> >> execute independently and rely on indices aligning with an element's
> >> logical position.
> >
> >
> > The main use-case I had in mind here was for scenarios like loading data
> > directly from parquet (i.e. nulls are already elided), doing some computation
> and
> > then potentially translating to a dense representation.  Similarly it
> > appears others have had advantages in some contexts for saving time at
> > shuffle [1].  In many cases there is an overlap with RLE, so I'd be open
> to
> > removing this from the proposal.
> >
> >
> > *On buffer encodings:*
> > To paraphrase, the main concern here seems to be it is similar to
> metadata
> > that was already removed [2].
> >
> > A few points on this:
> > 1.  There was a typo in the original e-mail on sparse-integer set
> encoding
> > where it said "all" values are either null or not null.  This should have
> > read "most" values.  The elision of buffers is a separate feature.
> > 2.  I believe these are different than the previous metadata because this
> > isn't repetitive information. It provides new information about the
> > contents of buffers not available anywhere else.
> > 3.  The proposal is to create a new message type for the this feature so
> it
> > wouldn't be bringing back the old code and hopefully would have minimal
> > impact on already existing IPC code.
> >
> >
> > *On Compression:*
> > So far my take is the consensus is that this can probably be applied at
> the
> > transport level without being in the spec directly.  There might be value
> > in more specific types of compression at the buffer level, but we should
> > benchmark them first..
> >
> > *Data Integrity/Digest:*
> >
> >> 

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-12 Thread Antoine Pitrou


I think it would be worthwhile to split the discussion into two separate
threads.  One thread for compression & encodings (which are related or
even the same topic), one thread for data integrity.

Regards

Antoine.


Le 08/07/2019 à 07:22, Micah Kornfield a écrit :
> 
> - Compression:
>*  Use parquet for random access to data elements.
>-  This is one option, the main downside I see to this is generally
> higher encoding/decoding costs.  Per below, I think it is reasonable to
> wait until we have more data to add compression into the spec.
>*  Have the transport layer do buffer specific compression:
>   - I'm not a fan of this approach.  One nice thing about the current
> communication protocols is once you strip away "framing" data all the byte
> streams are equivalent.  I think the simplicity that follows in code from
> this is a nice feature.
> 
> 
> *Computational efficiency of array encodings:*
> 
>> How does "more efficient computation" play out for operations such as
>> hash or join?
> 
> You would still need to likely materialize rows in most cases.   In some
> "join" cases the sparse encoding of the null bitmap buffer could be a win
> because it serves as an index to non-null values.
> 
> I think I should clarify that these encodings aren't always a win depending
> on workload/data shape, but can have a large impact when used appropriately
> (especially at the "Expression evaluation stage").  Also, any wins don't
> come for free; to exploit encodings properly will add some level of
> complication to existing computation code.
> 
> On a packed sparse array representation:
> 
>> This would be fine for simple SIMD aggregations like count/avg/mean, but
>> compacting null slots complicates more advanced parallel routines that
>> execute independently and rely on indices aligning with an element's
>> logical position.
> 
> 
> The main use-case I had in mind here was for scenarios like loading data
> directly from parquet (i.e. nulls are already elided), doing some computation and
> then potentially translating to a dense representation.  Similarly it
> appears other have had advantage in some contexts for saving time at
> shuffle [1].  In many cases there is an overlap with RLE, so I'd be open to
> removing this from the proposal.
> 
> 
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be it is similar to metadata
> that was already removed [2].
> 
> A few points on this:
> 1.  There was a typo in the original e-mail on sparse-integer set encoding
> where it said "all" values are either null or not null.  This should have
> read "most" values.  The elision of buffers is a separate feature.
> 2.  I believe these are different than the previous metadata because this
> isn't repetitive information. It provides new information about the
> contents of buffers not available anywhere else.
> 3.  The proposal is to create a new message type for the this feature so it
> wouldn't be bringing back the old code and hopefully would have minimal
> impact on already existing IPC code.
> 
> 
> *On Compression:*
> So far my take is the consensus is that this can probably be applied at the
> transport level without being in the spec directly.  There might be value
> in more specific types of compression at the buffer level, but we should
> benchmark them first..
> 
> *Data Integrity/Digest:*
> 
>> one question is whether this occurs at the table level, column level,
>> sequential array level, etc.
> 
> This is a good question, it seemed like the batch level was easiest and
> that is why I proposed it, but I'd be open to other options.  One nice
> thing about the batch level is that it works for all other message types
> out of the box (i.e. we can ensure the schema has been transmitted
> faithfully).
> 
> Cheers,
> Micah
> 
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
> 
> 
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor 
> wrote:
> 
>> Hi Micah,
>>
>> Similar to Jacques I'm not disagreeing, but wondering if they belong in
>> Arrow vs. can be done externally. I'm mostly interested in changes that
>> might impact SIMD processing, considering Arrow's already made conscious
>> design decisions to trade memory for speed. Apologies in advance if I've
>> misunderstood any of the proposals.
>>
>>> a. Add a run-length encoding scheme to efficiently represent repeated
>>> values (the actual scheme encodes run ends instead of length to preserve
>>> sub-linear random access).
>> Couldn't one do RLE at the buffer level via a custom
>> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
>>
>>> b. Add a “packed” sparse representation (null values don’t take up
>>> space in value buffers)
>> This would be fine for simple SIMD aggregations like count/avg/mean, but
>> compacting null slots complicates more advanced parallel routines that
>> execute independently