Property-driven Parquet encryption

2020-07-06 Thread Gidon Gershinsky
Hi all,

We are working on Parquet modular encryption, and are currently adding
a high-level interface that allows encrypting/decrypting Parquet files via
properties only (without calling the low-level API). In the
Spark/parquet-mr domain, we're using the Hadoop configuration properties
for that purpose - they are already passed from Spark to Parquet, and they
allow adding custom key-value properties that can carry the list of
encrypted columns, key identities, etc., as described in this design document:
https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing

I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp
ecosystem. Is there an analog of the Hadoop configuration (a free-form
key-value map, passed all the way down to parquet-cpp)? Or is there a more
structured configuration object (to which we'd need to add the
encryption-related properties)? All suggestions are welcome.
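For context, a minimal sketch of how such properties travel on the Spark
side today (the encryption property names below are hypothetical
placeholders, not a finalized API):

from pyspark.sql import SparkSession

# Sketch only: Spark copies any "spark.hadoop.*" key into the Hadoop
# Configuration that parquet-mr sees, so encryption settings could ride
# along the same channel. The property names here are hypothetical.
spark = (
    SparkSession.builder
    .config("spark.hadoop.parquet.encryption.column.keys",
            "keyA: col1, col2; keyB: col3")
    .config("spark.hadoop.parquet.encryption.footer.key", "keyF")
    .getOrCreate()
)
spark.range(10).write.parquet("/tmp/encrypted.parquet")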

Cheers, Gidon


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou  wrote:
>
>
> On 06/07/2020 17:57, Steve Kim wrote:
> > The Parquet format specification is ambiguous about the exact details of
> > LZ4 compression. However, the *de facto* reference implementation in Java
> > (parquet-mr) uses the Hadoop LZ4 codec.
> >
> > I think that it is important for Parquet C++ to have compatibility and
> > feature parity with parquet-mr when possible. I prefer to change the
> > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> > that is used by parquet-mr (
> > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> > change will be quick and easy. I have an intern under my supervision who is
> > available to work on it full time, starting immediately. Please let me know
> > if we ought to proceed.
>
> Would that keep compatibility with existing files produced by Parquet C++?

Given that LZ4 has been repeatedly broken in C++ (first using the raw
format, then the block format -- still incompatible, apparently), I
think we would recommend that, in the rare event that people have
LZ4-compressed files (likely not very common; FWIW, Snappy is used
most often), they rewrite their files with a different codec using
e.g. pyarrow 0.17.1.
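For example, a minimal pyarrow sketch of such a rewrite (file names are
illustrative):

import pyarrow.parquet as pq

# Read the LZ4-compressed file and rewrite it with a codec that other
# Parquet implementations read reliably.
table = pq.read_table("data_lz4.parquet")
pq.write_table(table, "data_snappy.parquet", compression="snappy")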

> Regards
>
> Antoine.


Re: language independent representation of filter expressions

2020-07-06 Thread Wes McKinney
I would also be interested in having a reusable serialized format for
filter- and projection-like expressions. Going all the way to full
logical query plans suitable for building a SQL engine is perhaps too
ambitious, but we could start small with the use case from the JNI
Datasets PR as a motivating example. We should also consider replacing
or deprecating Gandiva's serialized expressions in favor of something
more general.

It may be a slight bikeshed issue, but I wouldn't be thrilled about
having this be based on Protocol Buffers, because of the runtime
requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++
applications. Flatbuffers may offer a less pleasant developer UX in
Java, but at least in C++ the fact that Flatbuffers introduces zero
build- or runtime dependencies is a significant advantage.

On Mon, Jul 6, 2020 at 4:12 PM Andy Grove  wrote:
>
> This is something that I am also interested in.
>
> My current approach in my personal project that uses Arrow is to use
> protobuf to represent expressions (as well as logical and physical query
> plans). I used the Gandiva protobuf definition as a starting point.
>
> Protobuf works for going between different languages in the same process as
> well as for passing query plans over the network. I'm passing these
> protobuf definitions over the Flight protocol.
>
> I only have support for a few simple expressions so far, but here is my
> protobuf file for reference:
>
> https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto
>
> Andy.
>
> On Mon, Jul 6, 2020 at 1:50 PM Steve Kim  wrote:
>
> > I have been following the discussion on a pull request (
> > https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the
> > high-level dataset API via JNI.
> >
> > An obstacle encountered in this PR is that there is no good way to pass
> > a filter expression via JNI. Expressions have a defined
> > serialization in the C++ implementation, but this serialization includes
> > enums and types that are only defined in C++ and are not accessible in
> > other languages.
> >
> > I agree with Micah Kornfield's comment (
> > https://github.com/apache/arrow/pull/7030#discussion_r425563920) that
> > there
> > ought to be one representation that we reuse across languages. If we had
> > this cross-language functionality, then we could do the following:
> >
> >1. build an arbitrary filter expression in Java
> >2. serialize the expression to bytes to be passed via JNI
> >3. deserialize from bytes to a native filter expression in the C++
> >implementation
> >
> > Has there already been discussion about what a cross-language
> > representation of filter expressions (and possibly other parts of the
> > Dataset API) might look like? I see that we use Flatbuffers in other parts
> > of Arrow.
> >
> > What would need to change in the C++ implementation to make use of such a
> > representation?
> >
> > Steve
> >


Re: language independent representation of filter expressions

2020-07-06 Thread Andy Grove
This is something that I am also interested in.

My current approach in my personal project that uses Arrow is to use
protobuf to represent expressions (as well as logical and physical query
plans). I used the Gandiva protobuf definition as a starting point.

Protobuf works for going between different languages in the same process as
well as for passing query plans over the network. I'm passing these
protobuf definitions over the Flight protocol.

I only have support for a few simple expressions so far, but here is my
protobuf file for reference:

https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto

Andy.

On Mon, Jul 6, 2020 at 1:50 PM Steve Kim  wrote:

> I have been following the discussion on a pull request (
> https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the
> high-level dataset API via JNI.
>
> An obstacle encountered in this PR is that there is no good way to pass
> a filter expression via JNI. Expressions have a defined
> serialization in the C++ implementation, but this serialization includes
> enums and types that are only defined in C++ and are not accessible in
> other languages.
>
> I agree with Micah Kornfield's comment (
> https://github.com/apache/arrow/pull/7030#discussion_r425563920) that
> there
> ought to be one representation that we reuse across languages. If we had
> this cross-language functionality, then we could do the following:
>
>1. build an arbitrary filter expression in Java
>2. serialize the expression to bytes to be passed via JNI
>3. deserialize from bytes to a native filter expression in the C++
>implementation
>
> Has there already been discussion about what a cross-language
> representation of filter expressions (and possibly other parts of the
> Dataset API) might look like? I see that we use Flatbuffers in other parts
> of Arrow.
>
> What would need to change in the C++ implementation to make use of such a
> representation?
>
> Steve
>


language independent representation of filter expressions

2020-07-06 Thread Steve Kim
I have been following the discussion on a pull request (
https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the
high-level dataset API via JNI.

An obstacle encountered in this PR is that there is no good way to pass
a filter expression via JNI. Expressions have a defined
serialization in the C++ implementation, but this serialization includes
enums and types that are only defined in C++ and are not accessible in
other languages.

I agree with Micah Kornfield's comment (
https://github.com/apache/arrow/pull/7030#discussion_r425563920) that there
ought to be one representation that we reuse across languages. If we had
this cross-language functionality, then we could do the following:

   1. build an arbitrary filter expression in Java
   2. serialize the expression to bytes to be passed via JNI
   3. deserialize from bytes to a native filter expression in the C++
   implementation

Has there already been discussion about what a cross-language
representation of filter expressions (and possibly other parts of the
Dataset API) might look like? I see that we use Flatbuffers in other parts
of Arrow.

What would need to change in the C++ implementation to make use of such a
representation?
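
To illustrate only the shape of the round trip (with JSON standing in for
whichever serialization format is ultimately chosen), it might look like:

import json

# Illustrative stand-in only: a filter expression as a language-neutral
# tree, serialized to bytes on the Java side and decoded on the C++ side.
expr = {"call": "greater", "args": [{"field": "x"}, {"literal": 5}]}
payload = json.dumps(expr).encode("utf-8")  # bytes handed across JNI
decoded = json.loads(payload)               # reconstructed on the other side
assert decoded == expr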

Steve


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++?

Changing the LZ4 implementation to be compatible with parquet-mr/Hadoop
would break compatibility with any existing files that were written by
Parquet C++ using LZ4 compression. I believe that it is not possible to
reliably detect, from inspection of the first few bytes, which
implementation variant was used by the writer. But I could be misinformed,
as I do not have expert knowledge of LZ4 compression.
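
For reference, a sketch of the framing difference at issue, assuming the
python-lz4 package (the header layout follows the description in
PARQUET-1878; treat it as illustrative):

import struct
import lz4.block  # assumption: the python-lz4 package is available

def hadoop_lz4_block(raw: bytes) -> bytes:
    # Hadoop's Lz4Codec prefixes each chunk with two big-endian uint32s
    # (uncompressed length, compressed length); a raw LZ4 block carries no
    # such header, which is why the variants are hard to distinguish from
    # the first few bytes alone.
    compressed = lz4.block.compress(raw, store_size=False)
    return struct.pack(">II", len(raw), len(compressed)) + compressed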


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Antoine Pitrou


On 06/07/2020 17:57, Steve Kim wrote:
> The Parquet format specification is ambiguous about the exact details of
> LZ4 compression. However, the *de facto* reference implementation in Java
> (parquet-mr) uses the Hadoop LZ4 codec.
> 
> I think that it is important for Parquet C++ to have compatibility and
> feature parity with parquet-mr when possible. I prefer to change the
> LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> that is used by parquet-mr (
> https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> change will be quick and easy. I have an intern under my supervision who is
> available to work on it full time, starting immediately. Please let me know
> if we ought to proceed.

Would that keep compatibility with existing files produced by Parquet C++?

Regards

Antoine.


Re: Question: How to pass data between two languages interprocess without extra libraries?

2020-07-06 Thread Neal Richardson
Could you clarify what you mean by "without external libraries"? Do you
mean without using pyarrow and the arrow R package?

Neal

On Mon, Jul 6, 2020 at 1:40 AM Fan Liya  wrote:

> Hi Teng,
>
> Arrow provides two formats for IPC between different languages: streaming
> and file.
> This article gives a tutorial for Java:
> https://arrow.apache.org/docs/java/ipc.html
>
> For other languages, it may be helpful to read the test cases.
>
> Best,
> Liya Fan
>
>
> On Sun, Jul 5, 2020 at 4:24 PM Teng Peng  wrote:
>
> > Hi dev,
> >
> > I have read the article "Introducing the Apache Arrow C Data Interface"
> > (https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/)
> > and I have a question about passing data between two languages:
> >
> > In the article, the R library reticulate is used for sharing data between
> > R and Python. Is it possible to share data without external libraries?
> > Let's say I want to create data in R and then read it from my Python
> > script. If it is possible, are there any tutorials on this? I believe I
> > have to record the memory address of the data in R, correct?
> >
> > Thanks.
> >
>


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of
LZ4 compression. However, the *de facto* reference implementation in Java
(parquet-mr) uses the Hadoop LZ4 codec.

I think that it is important for Parquet C++ to have compatibility and
feature parity with parquet-mr when possible. I prefer to change the
LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
that is used by parquet-mr (
https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
change will be quick and easy. I have an intern under my supervision who is
available to work on it full time, starting immediately. Please let me know
if we ought to proceed.

If it is not feasible to achieve compatibility in the next release, then I
am in favor of disabling LZ4 support (
https://issues.apache.org/jira/browse/PARQUET-1515) until it can be fixed.

Thanks,
Steve


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn"  wrote:
> I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long term will harm more than not supporting LZ4 for a while. Snappy
> is generally more used than LZ4 in this category, as it has been
> available since the inception of Parquet and thus should be considered a
> viable alternative.
>
> Cheers
> Uwe
>
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
> > >
> > >
> > > On 25/06/2020 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of
> > > them.
> >
> > It's true; however, I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening, as I understand it). If we are indeed shipping something
> > broken, then we should either fix it or disable it until it can be
> > fixed.
> >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Neville Dipale
Thanks Rok and Antoine,

I couldn't see what the issue could have been, so the SO link was
very helpful and informative.

I'll try it out, and submit a PR if I get it right.

On Mon, 6 Jul 2020 at 14:30, Antoine Pitrou  wrote:

>
> Yes, that's certainly the case.
> Changing:
> values = np.random.randint(lower, upper, size=size)
> to:
> values = np.random.randint(lower, upper, size=size, dtype=np.int64)
>
> would hopefully fix the issue.  Neville, could you try it out?
>
> Thank you
>
> Antoine.
>
> On 06/07/2020 14:16, Rok Mihevc wrote:
> > NumPy on Windows has a different default integer bitwidth than on Linux.
> > Perhaps this is causing the issue? (see:
> > https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine
> > )
> >
> > Rok
> >
> > On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale 
> > wrote:
> >
> >> Hi Arrow devs,
> >>
> >> I'm trying to run archery integration tests in Windows 10 (Python 3.7.7;
> >> conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds
> >> for int32* (
> >> https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1
> >> ).
> >>
> >> Has someone else encountered this problem before?
> >>
> >> Regards
> >> Neville
> >>
> >
>


Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Antoine Pitrou


Yes, that's certainly the case.
Changing:
values = np.random.randint(lower, upper, size=size)
to:
values = np.random.randint(lower, upper, size=size, dtype=np.int64)

would hopefully fix the issue.  Neville, could you try it out?

Thank you

Antoine.

On 06/07/2020 14:16, Rok Mihevc wrote:
> NumPy on Windows has a different default integer bitwidth than on Linux.
> Perhaps this is causing the issue? (see:
> https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine
> )
> 
> Rok
> 
> On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale 
> wrote:
> 
>> Hi Arrow devs,
>>
>> I'm trying to run archery integration tests in Windows 10 (Python 3.7.7;
>> conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds
>> for int32* (
>> https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1
>> ).
>>
>> Has someone else encountered this problem before?
>>
>> Regards
>> Neville
>>
> 


Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Rok Mihevc
NumPy on Windows has a different default integer bitwidth than on Linux.
Perhaps this is causing the issue? (see:
https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine
)
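
A quick way to see the platform difference (a sketch):

import numpy as np

# numpy's default integer follows the platform C "long": 64-bit on Linux,
# 32-bit on Windows, so bounds above 2**31 - 1 raise "ValueError: low is
# out of bounds for int32" there unless a dtype is forced.
print(np.array([1]).dtype)  # int64 on Linux, int32 on 64-bit Windows
values = np.random.randint(-2**40, 2**40, size=4, dtype=np.int64)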

Rok

On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale 
wrote:

> Hi Arrow devs,
>
> I'm trying to run archery integration tests in Windows 10 (Python 3.7.7;
> conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds
> for int32* (
> https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1
> ).
>
> Has someone else encountered this problem before?
>
> Regards
> Neville
>


[Integration] Errors running archery integration on Windows

2020-07-06 Thread Neville Dipale
Hi Arrow devs,

I'm trying to run archery integration tests in Windows 10 (Python 3.7.7;
conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds
for int32* (https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1
).

Has someone else encountered this problem before?

Regards
Neville


[NIGHTLY] Arrow Build Report for Job nightly-2020-07-06-0

2020-07-06 Thread Crossbow


Arrow Build Report for Job nightly-2020-07-06-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0

Failed Tasks:
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.8-jpype
- test-ubuntu-20.04-cpp-14:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-ubuntu-20.04-cpp-14
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-wheel-osx-cp35m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-wheel-osx-cp38
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp35m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?q

Re: Question: How to pass data between two languages interprocess without extra libraries?

2020-07-06 Thread Fan Liya
Hi Teng,

Arrow provides two formats for IPC between different languages: streaming
and file.
This article gives a tutorial for Java:
https://arrow.apache.org/docs/java/ipc.html

For other languages, it may be helpful to read the test cases.
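
As a minimal Python-side sketch of the file format (assuming pyarrow is
installed; the Java tutorial above covers the same concepts):

import pyarrow as pa

# Write a table in the IPC file format...
table = pa.table({"x": [1, 2, 3]})
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# ...and read it back (from any language with an Arrow implementation).
with pa.memory_map("data.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
assert loaded.equals(table)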

Best,
Liya Fan


On Sun, Jul 5, 2020 at 4:24 PM Teng Peng  wrote:

> Hi dev,
>
> I have read the article "Introducing the Apache Arrow C Data Interface"
> (https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/)
> and I have a question about passing data between two languages:
>
> In the article, the R library reticulate is used for sharing data between
> R and Python. Is it possible to share data without external libraries?
> Let's say I want to create data in R and then read it from my Python
> script. If it is possible, are there any tutorials on this? I believe I
> have to record the memory address of the data in R, correct?
>
> Thanks.
>