RE: [C++] Runtime SIMD dispatching for Arrow

2020-09-03 Thread Du, Frank
Just want to give some updates on the dispatching. We now have workable runtime functionality, including a dispatch mechanism[1][2] and a build framework for both the compute kernels and other parts of C++. There is some remaining SIMD static compiler code in the code base that I will try to work

Re: Decimal128 scale limits

2020-09-03 Thread Micah Kornfield
With regard to scale, my colleague discovered some inconsistencies and filed a JIRA with a proposed fix (a PR should be attached shortly). I think this is an edge case that should be fixed, but if someone with more historical context has opinions, I'd like to hear them. [1]

Re: Multifile parquet support

2020-09-03 Thread Micah Kornfield
Hi Radu, This is a conversation best had on dev@parquet. It came up recently [1] and I cross-posted there as well. [1] https://lists.apache.org/thread.html/re4fe4bc80c9eadd446761588f9b03d827193f91269a7c14ce0c444dd%40%3Cdev.arrow.apache.org%3E On Thu, Sep 3, 2020 at 3:20 PM Radu Teodorescu

Multifile parquet support

2020-09-03 Thread Radu Teodorescu
Hello, What is the current thinking around allowing the logical content of a parquet file to be split across multiple files? I see that in theory there is support for reading files where different row groups are in separate files but I cannot see any features that allow that for writing. On a

[Rust][DataFusion] Proposal for Basic Timestamp Support

2020-09-03 Thread Andrew Lamb
I am working on an engine for processing timeseries data. Unsurprisingly for such a system, values of timestamp type feature prominently and we need basic support for them in DataFusion. Initially, we want to use DataFusion with predicates such as '=', '<', '>', etc on timestamp columns and

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Antoine Pitrou
It would be useful for outsiders to expose what those two API levels are, and to what usage they correspond. Is Parquet encryption used only with Spark? While Spark interoperability is important, Parquet files are more ubiquitous than that. Regards Antoine. Le 03/09/2020 à 22:31, Gidon

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Why would the low-level API be exposed directly? This will break the interop between the two analytic ecosystems down the road. Again, let me suggest leveraging the high-level interface, based on the PropertiesDrivenCryptoFactory. It should address your technical requirements; if it doesn't, we

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Roee Shlomo
Hi Itamar, I implemented some python wrappers for the low level API and would be happy to collaborate on that. The reason I didn't push this forward yet is what Gidon mentioned. The API to expose to python users needs to be finalized first and it must include the key tools API for interop with

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Itamar Turner-Trauring
On Thu, Sep 3, 2020, at 11:01 AM, Antoine Pitrou wrote: > > Hi Gidon, > > Le 03/09/2020 à 16:53, Gidon Gershinsky a écrit : > > Hi Itamar, > > > > My suggestion would be to wrap a different API in Python - the high-level > > encryption interface of > > https://github.com/apache/arrow/pull/8023 >

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Hi Antoine, Sounds good to me. This PR is already being actively reviewed, and it'd be good to have Itamar's assessment. Cheers, Gidon On Thu, Sep 3, 2020 at 6:01 PM Antoine Pitrou wrote: > > Hi Gidon, > > Le 03/09/2020 à 16:53, Gidon Gershinsky a écrit : > > Hi Itamar, > > > > My

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Antoine Pitrou
Hi Gidon, Le 03/09/2020 à 16:53, Gidon Gershinsky a écrit : > Hi Itamar, > > My suggestion would be to wrap a different API in Python - the high-level > encryption interface of > https://github.com/apache/arrow/pull/8023 We need a strategy for reviewing those changes. The PR is quite large,

Re: Adding Parquet encryption support to PyArrow

2020-09-03 Thread Gidon Gershinsky
Hi Itamar, My suggestion would be to wrap a different API in Python - the high-level encryption interface of https://github.com/apache/arrow/pull/8023 This will enable interoperability with Apache Spark (and other frameworks), where we don't expose the low-level Parquet encryption API. If such a

Re: Sort int tuples across Arrow arrays in C++

2020-09-03 Thread Wes McKinney
There are various open source columnar database engines you could look at to get inspiration for a varargs variant of sort_indices. On Thu, Sep 3, 2020 at 9:26 AM Ben Kietzman wrote: > > Hi Rares, > > The arrow API does not currently support sorting against multiple columns. > We'd welcome a

Re: Sort int tuples across Arrow arrays in C++

2020-09-03 Thread Ben Kietzman
Hi Rares, The arrow API does not currently support sorting against multiple columns. We'd welcome a JIRA/PR to add that support. One potential workaround is storing the tuple as a single column of fixed_size_list(int32, 2), which could then be viewed [1] as int64 (for which sorting is

Adding Parquet encryption support to PyArrow

2020-09-03 Thread Itamar Turner-Trauring
Hi, I'm looking into implementing this, and it seems like there are two parts: packaging, but also wrapping the APIs in Python. Is the latter item accurate? If so, any examples of similar existing wrapped APIs, or should I just come up with something on my own? Context:

Re: [Flight Format] Authentication Redesign

2020-09-03 Thread David Li
The C++/Python authentication implementation is entirely different (because the C++/Python/Java gRPC APIs are in turn entirely different). In particular, gRPC middleware in C++ is still experimental (compared to Java) and much more limited (unless recent versions changed this). C++/Python might

Re: pyarrow filesystem interface for Azure Data Lake gen2

2020-09-03 Thread Joris Van den Bossche
Thanks for sharing! It's cool to see the new PyFileSystem directly being used ;) Note that there is also an fsspec-compatible Azure filesystem implementation that should support Data Lake Gen2 (https://github.com/dask/adlfs) for another Python-based implementation, and which can be used with

Sort int tuples across Arrow arrays in C++

2020-09-03 Thread Rares Vernica
Hello, I have a set of integer tuples that need to be collected and sorted at a coordinator. Here is an example with tuples of length 2: [(1, 10), (1, 15), (2, 10), (2, 15)] I am considering storing each column in an Arrow array, e.g., [1, 1, 2, 2] and [10, 15, 10, 15], and have the Arrow
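For illustration only, the kind of sort described above (one array per tuple position, ordered together) might be sketched in plain C++ as an index sort over two columns. This is a minimal sketch using only the standard library, not the Arrow API: the helper name SortIndicesByPair is hypothetical, and std::vector stands in for the Arrow arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <tuple>
#include <vector>

// Hypothetical helper (not part of Arrow): returns the permutation of row
// indices that orders the pairs (first[i], second[i]) lexicographically.
std::vector<std::size_t> SortIndicesByPair(
    const std::vector<std::int64_t>& first,
    const std::vector<std::int64_t>& second) {
  // Both columns must describe the same number of tuples.
  assert(first.size() == second.size());

  // Start from the identity permutation 0, 1, ..., n-1.
  std::vector<std::size_t> indices(first.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // Compare rows by the first column, then by the second on ties.
  // stable_sort keeps equal tuples in their original relative order.
  std::stable_sort(indices.begin(), indices.end(),
                   [&](std::size_t lhs, std::size_t rhs) {
                     return std::tie(first[lhs], second[lhs]) <
                            std::tie(first[rhs], second[rhs]);
                   });
  return indices;
}
```

The returned indices can then be used to gather each column in sorted order, so the columns themselves never need to be zipped into tuples.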

[NIGHTLY] Arrow Build Report for Job nightly-2020-09-03-0

2020-09-03 Thread Crossbow
Arrow Build Report for Job nightly-2020-09-03-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-03-0 Failed Tasks: - test-conda-python-3.7-hdfs-2.9.2: URL: