Property-driven Parquet encryption
Hi all, We are working on Parquet modular encryption, and are currently adding a high-level interface that allows encrypting/decrypting parquet files via properties only (without calling the low-level API). In the Spark/parquet-mr domain, we're using the Hadoop configuration properties for that purpose - they are already passed from Spark to Parquet, and allow adding custom key-value properties that can carry the list of encrypted columns, key identities, etc., as described in this design document: https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing I'm not sufficiently familiar with the pandas/pyarrow/parquet-cpp ecosystem. Is there an analog of Hadoop configuration (a free-form key-value map, passed all the way down to parquet-cpp)? Or a more structured configuration object (where we'd need to add the encryption-related properties)? All suggestions are welcome. Cheers, Gidon
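For concreteness, here is a minimal sketch of how such key-value properties flow from PySpark down through the Hadoop configuration; the encryption property names and values below are illustrative placeholders, not the final names from the design document:

    from pyspark.sql import SparkSession

    # Minimal sketch, assuming placeholder property names (the real names
    # are defined in the design document linked above). Custom key-value
    # pairs set on the Hadoop configuration travel from Spark down to
    # parquet-mr, which can read them when writing files.
    spark = SparkSession.builder.getOrCreate()
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("parquet.encryption.column.keys", "keyA:col1,col2;keyB:col3")
    hadoop_conf.set("parquet.encryption.footer.key", "keyF")
    spark.range(10).write.parquet("/tmp/encrypted_example.parquet")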
Re: [DISCUSS] Ongoing LZ4 problems with Parquet files
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou wrote: > > > Le 06/07/2020 à 17:57, Steve Kim a écrit : > > The Parquet format specification is ambiguous about the exact details of > > LZ4 compression. However, the *de facto* reference implementation in Java > > (parquet-mr) uses the Hadoop LZ4 codec. > > > > I think that it is important for Parquet C++ to have compatibility and > > feature parity with parquet-mr when possible. I prefer to change the > > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation > > that is used by parquet-mr ( > > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this > > change will be quick and easy. I have an intern under my supervision who is > > available to work on it full time, starting immediately. Please let me know > > if we ought to proceed. > > Would that keep compatibility with existing files produced by Parquet C++? Given that LZ4 has been repeatedly broken in C++ (first using the raw format, then the block format -- still incompatible, apparently), I think we would recommend that in the rare event that people have LZ4-compressed files (likely not widespread; Snappy is mostly used, FWIW) they rewrite their files with a different codec using e.g. pyarrow 0.17.1. > Regards > > Antoine.
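To make the suggested workaround concrete, a minimal sketch of rewriting such a file with pyarrow (the file names are placeholders):

    import pyarrow.parquet as pq

    # Read an existing LZ4-compressed file and rewrite it with a
    # different codec, e.g. using pyarrow 0.17.1 as suggested above.
    table = pq.read_table("data_lz4.parquet")
    pq.write_table(table, "data_snappy.parquet", compression="snappy")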
Re: language independent representation of filter expressions
I would also be interested in having a reusable serialized format for filter- and projection-like expressions. I think trying to go as far as full logical query plans suitable for building a SQL engine is perhaps too ambitious, but we could start small with the use case from the JNI Datasets PR as a motivating example. We should also consider replacing or deprecating Gandiva's serialized expressions in favor of something more general. It may be a slight bikeshed issue, but I wouldn't be thrilled about having this be based on Protocol Buffers, because of the runtime requirement (on libprotobuf.so / libprotobuf.a) it introduces into C++ applications. Flatbuffers might make for a less pleasant developer UX in Java, but at least in C++ the fact that Flatbuffers results in zero build- or runtime dependencies is a significant advantage. On Mon, Jul 6, 2020 at 4:12 PM Andy Grove wrote: > > This is something that I am also interested in. > > My current approach in my personal project that uses Arrow is to use > protobuf to represent expressions (as well as logical and physical query > plans). I used the Gandiva protobuf definition as a starting point. > > Protobuf works for going between different languages in the same process as > well as for passing query plans over the network. I'm passing these > protobuf definitions over the Flight protocol. > > I only have support for a few simple expressions so far, but here is my > protobuf file for reference: > > https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto > > Andy. > > On Mon, Jul 6, 2020 at 1:50 PM Steve Kim wrote: > > > I have been following the discussion on a pull request ( > > https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the > > high-level dataset API via JNI. > > > > An obstacle that was encountered in this PR is that there is not a good way > > to pass a filter expression via JNI. Expressions have a defined > > serialization in the C++ implementation, but this serialization includes > > enums and types that are only defined in C++ and are not accessible in > > other languages. > > > > I agree with Micah Kornfield's comment ( > > https://github.com/apache/arrow/pull/7030#discussion_r425563920) that > > there > > ought to be one representation that we reuse across languages. If we had > > this cross-language functionality, then we could do the following: > > > >1. build an arbitrary filter expression in Java > >2. serialize the expression to bytes to be passed via JNI > >3. deserialize from bytes to a native filter expression in the C++ > >implementation > > > > Has there already been discussion about what a cross-language > > representation of filter expressions (and possibly other parts of the > > Dataset API) might look like? I see that we use Flatbuffers in other parts > > of Arrow. > > > > What would need to change in the C++ implementation to make use of such a > > representation? > > > > Steve > >
Re: language independent representation of filter expressions
This is something that I am also interested in. My current approach in my personal project that uses Arrow is to use protobuf to represent expressions (as well as logical and physical query plans). I used the Gandiva protobuf definition as a starting point. Protobuf works for going between different languages in the same process as well as for passing query plans over the network. I'm passing these protobuf definitions over the Flight protocol. I only have support for a few simple expressions so far, but here is my protobuf file for reference: https://github.com/ballista-compute/ballista/blob/main/proto/ballista.proto Andy. On Mon, Jul 6, 2020 at 1:50 PM Steve Kim wrote: > I have been following the discussion on a pull request ( > https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the > high-level dataset API via JNI. > > An obstacle that was encountered in this PR is that there is not a good way > to pass a filter expression via JNI. Expressions have a defined > serialization in the C++ implementation, but this serialization includes > enums and types that are only defined in C++ and are not accessible in > other languages. > > I agree with Micah Kornfield's comment ( > https://github.com/apache/arrow/pull/7030#discussion_r425563920) that > there > ought to be one representation that we reuse across languages. If we had > this cross-language functionality, then we could do the following: > >1. build an arbitrary filter expression in Java >2. serialize the expression to bytes to be passed via JNI >3. deserialize from bytes to a native filter expression in the C++ >implementation > > Has there already been discussion about what a cross-language > representation of filter expressions (and possibly other parts of the > Dataset API) might look like? I see that we use Flatbuffers in other parts > of Arrow. > > What would need to change in the C++ implementation to make use of such a > representation? > > Steve >
language independent representation of filter expressions
I have been following the discussion on a pull request (https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the high-level dataset API via JNI.

An obstacle that was encountered in this PR is that there is not a good way to pass a filter expression via JNI. Expressions have a defined serialization in the C++ implementation, but this serialization includes enums and types that are only defined in C++ and are not accessible in other languages.

I agree with Micah Kornfield's comment (https://github.com/apache/arrow/pull/7030#discussion_r425563920) that there ought to be one representation that we reuse across languages. If we had this cross-language functionality, then we could do the following:

1. build an arbitrary filter expression in Java
2. serialize the expression to bytes to be passed via JNI
3. deserialize from bytes to a native filter expression in the C++ implementation

Has there already been discussion about what a cross-language representation of filter expressions (and possibly other parts of the Dataset API) might look like? I see that we use Flatbuffers in other parts of Arrow.

What would need to change in the C++ implementation to make use of such a representation?

Steve
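For illustration, the three steps could look like the sketch below; the JSON encoding is invented purely as a stand-in for whatever serialization format is eventually chosen:

    import json

    # Step 1: build a filter expression ("x > 3") as a plain tree.
    # This schema is invented for illustration, not a proposal.
    expr = {
        "op": "greater",
        "left": {"field": "x"},
        "right": {"literal": 3, "type": "int64"},
    }

    # Step 2: serialize to bytes that can cross the JNI boundary.
    payload = json.dumps(expr).encode("utf-8")

    # Step 3: deserialize on the native side into its own expression type.
    decoded = json.loads(payload)
    assert decoded["op"] == "greater"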
Re: [DISCUSS] Ongoing LZ4 problems with Parquet files
> Would that keep compatibility with existing files produced by Parquet C++? Changing the LZ4 implementation to be compatible with parquet-mr/Hadoop would break compatibility with any existing files that were written by Parquet C++ using LZ4 compression. I believe that it is not possible to reliably detect, from inspecting the first few bytes, which implementation variant was used by the writer. But I could be misinformed, as I do not have expert knowledge of LZ4 compression.
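For what it's worth, here is a sketch of the kind of heuristic a reader could attempt, with the caveat (per the above) that it cannot be fully reliable:

    import struct

    def looks_like_hadoop_lz4(page: bytes, uncompressed_size: int) -> bool:
        # Hadoop's LZ4 codec prefixes each block with two big-endian
        # 32-bit sizes (uncompressed, then compressed). Check whether
        # the first 8 bytes are plausible under that framing. A raw
        # LZ4 block can coincidentally match, so this is only a guess.
        if len(page) < 8:
            return False
        raw_len, comp_len = struct.unpack(">II", page[:8])
        return raw_len <= uncompressed_size and comp_len <= len(page) - 8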
Re: [DISCUSS] Ongoing LZ4 problems with Parquet files
Le 06/07/2020 à 17:57, Steve Kim a écrit : > The Parquet format specification is ambiguous about the exact details of > LZ4 compression. However, the *de facto* reference implementation in Java > (parquet-mr) uses the Hadoop LZ4 codec. > > I think that it is important for Parquet C++ to have compatibility and > feature parity with parquet-mr when possible. I prefer to change the > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation > that is used by parquet-mr ( > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this > change will be quick and easy. I have an intern under my supervision who is > available to work on it full time, starting immediately. Please let me know > if we ought to proceed. Would that keep compatibility with existing files produced by Parquet C++? Regards Antoine.
Re: Question: How to pass data between two languages interprocess without extra libraries?
Could you clarify what you mean by "without external libraries"? Do you mean without using pyarrow and the arrow R package? Neal On Mon, Jul 6, 2020 at 1:40 AM Fan Liya wrote: > Hi Teng, > > Arrow provides two formats for IPC between different languages: streaming > and file. > This article gives a tutorial for Java: > https://arrow.apache.org/docs/java/ipc.html > > For other languages, it may be helpful to read the test cases. > > Best, > Liya Fan > > > On Sun, Jul 5, 2020 at 4:24 PM Teng Peng wrote: > > > Hi dev, > > > > I have read the article "Introducing the Apache Arrow C Data Interface" > > ( > > https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/ > > ) and > > I have a question about passing data between two languages: > > > > In the article, the R library reticulate is used for sharing data between R > and > > Python. Is it possible to share data without external libraries? Let's > say > > I want to create data from R and then read it from my Python script. If > it > > is possible, are there any tutorials on this? I believe I have to record > > the memory address of the data in R, correct? > > > > Thanks. > >
Re: [DISCUSS] Ongoing LZ4 problems with Parquet files
The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec. I think that it is important for Parquet C++ to have compatibility and feature parity with parquet-mr when possible. I prefer to change the LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation that is used by parquet-mr (https://issues.apache.org/jira/browse/PARQUET-1878). I think that this change will be quick and easy. I have an intern under my supervision who is available to work on it full time, starting immediately. Please let me know if we ought to proceed. If it is not feasible to achieve compatibility in the next release, then I am in favor of disabling LZ4 support (https://issues.apache.org/jira/browse/PARQUET-1515) until it can be fixed. Thanks, Steve On Tue, 30 Jun 2020 14:33:17 +0200 "Uwe L. Korn" wrote: > I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long term will harm more than not supporting LZ4 for a while. Snappy is generally more widely used than LZ4 in this category, as it has been available since the inception of Parquet, and thus should be considered a viable alternative. > > Cheers > Uwe > > On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote: > > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou wrote: > > > > > > > > > Le 25/06/2020 à 00:02, Wes McKinney a écrit : > > > > hi folks, > > > > > > > > (cross-posting to dev@arrow and dev@parquet since there are > > > > stakeholders in both places) > > > > > > > > It seems there are still problems at least with the C++ implementation > > > > of LZ4 compression in Parquet files > > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1241 > > > > https://issues.apache.org/jira/browse/PARQUET-1878 > > > > > > I don't have any particular opinion on how to solve the LZ4 issue, but > > > I'd like to mention that LZ4 and ZStandard are the two most efficient > > > compression algorithms available, and they span different parts of the > > > speed/compression spectrum, so it would be a pity to disable one of them. > > > > It's true; however, I think it's worse to write LZ4-compressed files > > that cannot be read by other Parquet implementations (if that's what's > > happening, as I understand it). If we are indeed shipping something > > broken then we should either fix it or disable it until it can be > > fixed. > > > > > Regards > > > > > > Antoine. > > >
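For reference, a minimal sketch of the Hadoop-style framing as I understand it from PARQUET-1878, using the python-lz4 package and assuming a single inner block per page:

    import struct
    import lz4.block  # pip install lz4

    def hadoop_lz4_compress(data: bytes) -> bytes:
        # Hadoop's codec frames each block as two big-endian 32-bit
        # sizes (uncompressed, then compressed) followed by a raw
        # LZ4 block.
        compressed = lz4.block.compress(data, store_size=False)
        return struct.pack(">II", len(data), len(compressed)) + compressed

    def hadoop_lz4_decompress(data: bytes) -> bytes:
        raw_len, comp_len = struct.unpack(">II", data[:8])
        return lz4.block.decompress(data[8:8 + comp_len],
                                    uncompressed_size=raw_len)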
Re: [Integration] Errors running archery integration on Windows
Thanks Rok and Antoine, I couldn't see what the issue could have been, so the SO link was very helpful and informative. I'll try it out, and submit a PR if I get it right. On Mon, 6 Jul 2020 at 14:30, Antoine Pitrou wrote: > > Yes, that's certainly the case. > Changing: > values = np.random.randint(lower, upper, size=size) > to: > values = np.random.randint(lower, upper, size=size, dtype=np.int64) > > would hopefully fix the issue. Neville, could you try it out? > > Thank you > > Antoine. > > Le 06/07/2020 à 14:16, Rok Mihevc a écrit : > > Numpy on windows has different default bitwidth than on linux. Perhaps > this > > is causing the issue? (see: > > > https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine > > ) > > > > Rok > > > > On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale > > wrote: > > > >> Hi Arrow devs, > >> > >> I'm trying to run archery integration tests in Windows 10 (Python 3.7.7; > >> conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds > >> for int32* ( > >> https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1 > >> ). > >> > >> Has someone else encountered this problem before? > >> > >> Regards > >> Neville > >> > > >
Re: [Integration] Errors running archery integration on Windows
Yes, that's certainly the case. Changing:

    values = np.random.randint(lower, upper, size=size)

to:

    values = np.random.randint(lower, upper, size=size, dtype=np.int64)

would hopefully fix the issue. Neville, could you try it out? Thank you Antoine. Le 06/07/2020 à 14:16, Rok Mihevc a écrit : > NumPy on Windows has a different default integer bitwidth than on Linux. Perhaps this > is causing the issue? (see: > https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine > ) > > Rok > > On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale > wrote: > >> Hi Arrow devs, >> >> I'm trying to run archery integration tests on Windows 10 (Python 3.7.7; >> conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds >> for int32* ( >> https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1 >> ). >> >> Has someone else encountered this problem before? >> >> Regards >> Neville >> >
Re: [Integration] Errors running archery integration on Windows
NumPy on Windows has a different default integer bitwidth than on Linux. Perhaps this is causing the issue? (see: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine) Rok On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale wrote: > Hi Arrow devs, > > I'm trying to run archery integration tests on Windows 10 (Python 3.7.7; > conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds > for int32* ( > https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1 > ). > > Has someone else encountered this problem before? > > Regards > Neville >
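A quick way to see the difference (the bound values here are illustrative):

    import numpy as np

    # On Windows, NumPy's default integer dtype is int32, so bounds
    # outside its range raise "low is out of bounds for int32".
    # Forcing int64 makes the call behave the same as on Linux.
    values = np.random.randint(-2**40, 2**40, size=8, dtype=np.int64)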
[Integration] Errors running archery integration on Windows
Hi Arrow devs, I'm trying to run archery integration tests on Windows 10 (Python 3.7.7; conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds for int32* (https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1). Has someone else encountered this problem before? Regards Neville
[NIGHTLY] Arrow Build Report for Job nightly-2020-07-06-0
Arrow Build Report for Job nightly-2020-07-06-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0

Failed Tasks:
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-conda-python-3.8-jpype
- test-ubuntu-20.04-cpp-14:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-test-ubuntu-20.04-cpp-14
- wheel-osx-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-wheel-osx-cp35m
- wheel-osx-cp38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-wheel-osx-cp38
- wheel-win-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp35m
- wheel-win-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?q
Re: Question: How to pass data between two languages interprocess without extra libraries?
Hi Teng, Arrow provides two formats for IPC between different languages: streaming and file. This article gives a tutorial for Java: https://arrow.apache.org/docs/java/ipc.html For other languages, it may be helpful to read the test cases. Best, Liya Fan On Sun, Jul 5, 2020 at 4:24 PM Teng Peng wrote: > Hi dev, > > I have read the article "Introducing the Apache Arrow C Data Interface" > ( > https://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/ > ) and > I have a question about passing data between two languages: > > In the article, the R library reticulate is used for sharing data between R and > Python. Is it possible to share data without external libraries? Let's say > I want to create data from R and then read it from my Python script. If it > is possible, are there any tutorials on this? I believe I have to record > the memory address of the data in R, correct? > > Thanks. >
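To make the file format concrete, a minimal sketch in Python (the path is a placeholder; the R arrow package could read the same file):

    import pyarrow as pa

    # Write a table to an Arrow IPC file that any other implementation
    # (R, Java, C++, ...) can read back.
    table = pa.table({"x": [1, 2, 3]})
    with pa.OSFile("/tmp/shared.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Read it back -- possibly from another process or language.
    with pa.memory_map("/tmp/shared.arrow") as source:
        shared = pa.ipc.open_file(source).read_all()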