Crash with 0.15.1 when transposing dicts with null values
When I read back an array of type dictionary int32 -> string from a Parquet file and that array has null positions, it seems that the indices corresponding to the null positions are undefined, i.e. not guaranteed to be 0. This causes a crash when applying a transpose map, which tries to read the transposed value at those indices. Does this seem possible? Is it fixed in 0.16.0? If not, I can create a JIRA, but it is difficult to write a code snippet to reproduce it because the behavior depends on uninitialized memory. Pierre
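For context, a minimal pure-Python sketch of the suspected failure mode (all names here are hypothetical, not Arrow's actual C++ transpose code): if the index slot behind a null position holds uninitialized garbage, a transpose-map lookup that ignores the validity bitmap can read far out of bounds.

```python
# Hypothetical sketch of the failure mode; not Arrow code.
# A dictionary array stores indices into a dictionary; null slots
# still carry an index value, which may be uninitialized garbage.
indices = [0, 2, 937829, 1]        # slot 2 is null; its index is garbage
validity = [True, True, False, True]

# A transpose map remaps old dictionary indices to new ones.
transpose_map = [5, 6, 7]          # valid only for indices 0..2

def transpose_unsafe(indices, transpose_map):
    # Ignores validity: touches the garbage index behind the null slot.
    return [transpose_map[i] for i in indices]

def transpose_safe(indices, validity, transpose_map):
    # Consults the validity bitmap before reading the map.
    return [transpose_map[i] if v else None
            for i, v in zip(indices, validity)]

try:
    transpose_unsafe(indices, transpose_map)
except IndexError as e:
    print("crash:", e)             # out-of-bounds read at the null slot

print(transpose_safe(indices, validity, transpose_map))  # [5, 7, None, 6]
```

In C++ the out-of-bounds read is undefined behavior rather than a clean exception, which matches the reported crash.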
[jira] [Created] (ARROW-7967) [CI][Crossbow] Move autobrew job back to old macOS
Neal Richardson created ARROW-7967: -- Summary: [CI][Crossbow] Move autobrew job back to old macOS Key: ARROW-7967 URL: https://issues.apache.org/jira/browse/ARROW-7967 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Neal Richardson Assignee: Neal Richardson Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere in Travis, revert the changes in that issue so that we're still testing on old macOS. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Flight testing inconsistency for empty batches
Thanks all, I agree with validating each record batch independently. I made https://issues.apache.org/jira/browse/ARROW-7966 to ensure this, and that will hopefully iron out any kinks in the different implementations. Thanks, Bryan On Wed, Feb 26, 2020 at 3:13 PM Wes McKinney wrote: > I agree with independent validation. > > On Tue, Feb 25, 2020 at 2:55 PM David Li wrote: > > > > Hey Bryan, > > > > Thanks for looking into this issue. I would vote that we should > > validate each batch independently, so we can catch issues related to > > the structure of the data and not just the content. C++ doesn't do any > > detection of empty batches per se, but on both ends it reads all the > > data into a table, which would eliminate any empty batches. > > > > It also wouldn't be reasonable to stop sending batches that are empty, > > because Flight lets you attach metadata to batches, and so an empty > > batch might still have metadata that the client or server wants. > > > > Best, > > David > > > > On 2/24/20, Bryan Cutler wrote: > > > While looking into Null type testing for ARROW-7899, a couple small > issues > > > came up regarding Flight integration testing with empty batches (row > count > > > == 0) that could be worked out with a quick discussion. It seems there > is a > > > small difference between the C++ and Java Flight servers when there are > > > empty record batches at the end of a stream, more details in PR > > > https://github.com/apache/arrow/pull/6476. > > > > > > The Java server sends all record batches, even the empty ones, and the > test > > > client verifies each of these batches matches the batches read from a > JSON > > > file. The C++ servers seems to recognize if the end of the stream is > only > > > empty batches (please correct me if I'm wrong) and will not serve them. > > > This seems reasonable, as there is no more actual data left in the > stream. 
> > > The C++ test client reads all batches into a table, does the same for > the > > > JSON file, and compares final Tables. I also noticed that empty > batches in > > > the middle of the stream will be served. My questions are: > > > > > > 1) What is the expected behavior of a Flight server for empty record > > > batches, can they be ignored and not sent to the Client? > > > > > > 2) Is it good enough to test against a final concatenation of all > batches > > > in the stream or should each batch be verified individually to ensure > the > > > server is sending out correctly batched data? > > > > > > Thanks, > > > Bryan > > > >
[jira] [Created] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
Bryan Cutler created ARROW-7966: --- Summary: [Integration][Flight][C++] Client should verify each batch independently Key: ARROW-7966 URL: https://issues.apache.org/jira/browse/ARROW-7966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Bryan Cutler Currently the C++ Flight test client in {{test_integration_client.cc}} reads all batches from JSON into a Table, reads all batches in the flight stream from the server into a Table, then compares the Tables for equality. This is potentially a problem because a record batch might have specific information that is then lost in the conversion to a Table. For example, if the server sends empty batches, the resulting Table would not be different from one with no empty batches. Instead, the client should check each record batch from the JSON file against each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
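To illustrate why table-level comparison is too coarse, here is a pure-Python sketch (not the actual integration client; a "stream" is modeled as a list of batches and a batch as a list of rows): concatenating batches erases empty ones, so two streams can differ batch-wise yet still compare equal as tables.

```python
from itertools import chain, zip_longest

stream_a = [[1, 2], [3], []]   # trailing empty batch, as the JSON file has it
stream_b = [[1, 2], [3]]       # empty batch silently dropped by a server

def as_table(stream):
    # Table-level comparison: concatenate all batches into one sequence.
    return list(chain.from_iterable(stream))

def batches_equal(a, b):
    # Batch-level comparison: every batch must match, including empty ones.
    return all(x == y for x, y in zip_longest(a, b, fillvalue=None))

print(as_table(stream_a) == as_table(stream_b))   # True: empties are invisible
print(batches_equal(stream_a, stream_b))          # False: structure differs
```

Checking each batch independently is what catches the dropped empty batch.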
[jira] [Created] (ARROW-7965) [Python] Hold a reference to the dataset factory for later reuse
Krisztian Szucs created ARROW-7965: -- Summary: [Python] Hold a reference to the dataset factory for later reuse Key: ARROW-7965 URL: https://issues.apache.org/jira/browse/ARROW-7965 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Provide a more intuitive way to construct a nested dataset:

```python
# instead of using the confusing factory function
dataset([
    factory("s3://old-taxi-data", format="parquet"),
    factory("local/path/to/new/data", format="csv")
])

# let the user construct a new dataset directly from dataset objects
dataset([
    dataset("s3://old-taxi-data", format="parquet"),
    dataset("local/path/to/new/data", format="csv")
])
```

In the future we might want to introduce a new Dataset class which wraps the functionality of both the dataset factory and the materialized dataset, enabling optimizations over rediscovery of already materialized datasets. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments
Joris Van den Bossche created ARROW-7963: Summary: [C++][Python][Dataset] Expose listing fragments Key: ARROW-7963 URL: https://issues.apache.org/jira/browse/ARROW-7963 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche Assignee: Ben Kietzman It would be useful to be able to list the fragments, to get their paths / partition expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2020-02-28-0
Arrow Build Report for Job nightly-2020-02-28-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0 Failed Tasks: - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-debian-buster - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-debian-stretch - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-macos-r-autobrew - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-turbodbc-master - test-ubuntu-16.04-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-ubuntu-16.04-cpp - ubuntu-eoan: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-ubuntu-eoan - ubuntu-xenial: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-ubuntu-xenial - wheel-manylinux1-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux1-cp35m - wheel-manylinux2010-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux2010-cp35m - wheel-manylinux2014-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux2014-cp35m - wheel-osx-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-wheel-osx-cp35m - wheel-osx-cp38: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-wheel-osx-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-8 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py38 - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-gandiva-jar-osx - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-gandiva-jar-trusty - homebrew-cpp: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-homebrew-cpp - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-cpp-valgrind - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-cpp - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-pandas-latest: URL:
[jira] [Created] (ARROW-7962) [R][Dataset] Followup to "Consolidate Source and Dataset classes"
Neal Richardson created ARROW-7962: -- Summary: [R][Dataset] Followup to "Consolidate Source and Dataset classes" Key: ARROW-7962 URL: https://issues.apache.org/jira/browse/ARROW-7962 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset, R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 This was pushed to ARROW-7886 but it got dropped in a force push. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7961) pyarrow 0.16.0 cannot deserialize content serialised with < 0.16.0
Rob created ARROW-7961: -- Summary: pyarrow 0.16.0 cannot deserialize content serialised with < 0.16.0 Key: ARROW-7961 URL: https://issues.apache.org/jira/browse/ARROW-7961 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: MacOS, python 3.7 Reporter: Rob A pandas data frame was serialised into a Redis cache using pyarrow 0.14.x. After upgrading to 0.16.0, deserialize() fails. Upgrading to 0.15.1 instead works with objects serialised with 0.14.x.

{{import pyarrow as pa, redis}}
{{print(pa.__version__)}}
{{c = redis.Redis.from_url("redis://127.0.0.1")}}
{{obj = c.get("breakable")}}
{{df = pa.deserialize(obj)}}
{{print(df.head())}}
{{c.set("breakable", pa.serialize(df).to_buffer().to_pybytes())}}

When run in a venv with 0.15.1 installed, there are no errors: the version number goes to stdout and nothing goes to stderr. When run with 0.16.0, the following error is generated:

{{'0.16.0',}}
{{ 'Traceback (most recent call last):',}}
{{ ' File "/tmp/pa.py", line 6, in ',}}
{{ ' df = pa.deserialize(obj)',}}
{{ ' File "pyarrow/serialization.pxi", line 476, in pyarrow.lib.deserialize',}}
{{ ' File "pyarrow/serialization.pxi", line 438, in pyarrow.lib.deserialize_from',}}
{{ ' File "pyarrow/serialization.pxi", line 414, in pyarrow.lib.read_serialized',}}
{{ ' File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status',}}
{{ 'OSError: Expected IPC message of type unknown but got unknown']}}

-- This message was sent by Atlassian Jira (v8.3.4#803005)