Crash with 0.15.1 when transposing dicts with null values

2020-02-28 Thread Pierre Belzile
When I read an array of type dictionary (int32 -> string) from a Parquet
file and that array has null positions, the indices that correspond to
null positions appear to be undefined, i.e. not guaranteed to be 0.
This causes a crash when a transpose map is applied, because the lookup of
the transposed value uses the undefined index. Does this seem possible? Is
it fixed in 0.16.0?

If not, I can create a JIRA, but it is difficult to write a code snippet
that reproduces it because the behavior depends on uninitialized memory.

Pierre
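
The failure mode described above can be illustrated without Arrow at all. The following is a minimal pure-Python sketch, not Arrow's actual implementation: the function name `transpose_indices` and the list-based validity mask are hypothetical stand-ins for an index buffer and a validity bitmap. It assumes the index stored under a null slot may hold an arbitrary value, so the transpose map is only consulted for valid slots.

```python
def transpose_indices(indices, validity, transpose_map):
    """Remap dictionary indices through transpose_map, skipping null slots.

    indices:       list of ints; values under null slots may be garbage
    validity:      list of bools, True where the slot is non-null
    transpose_map: list mapping old dictionary index -> new dictionary index
    """
    out = []
    for idx, is_valid in zip(indices, validity):
        if is_valid:
            out.append(transpose_map[idx])
        else:
            # Never read transpose_map[idx] for a null slot: idx may be
            # out of range. Write a harmless placeholder instead.
            out.append(0)
    return out


# Old dictionary ["a", "b", "c"] is reordered to ["c", "a", "b"].
transpose_map = [1, 2, 0]
# The middle slot is null and carries a garbage index.
indices = [0, 987654, 2]
validity = [True, False, True]

remapped = transpose_indices(indices, validity, transpose_map)
print(remapped)  # [1, 0, 0]
```

A naive `[transpose_map[i] for i in indices]` on the same input raises `IndexError` in Python; in C++ the equivalent unchecked read would be an out-of-bounds access, which matches the crash reported here.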


[jira] [Created] (ARROW-7967) [CI][Crossbow] Move autobrew job back to old macOS

2020-02-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7967:
--

 Summary: [CI][Crossbow] Move autobrew job back to old macOS
 Key: ARROW-7967
 URL: https://issues.apache.org/jira/browse/ARROW-7967
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson


Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere 
in Travis, revert the changes in that issue so that we're still testing on old 
macOS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Flight testing inconsistency for empty batches

2020-02-28 Thread Bryan Cutler
Thanks all, I agree with validating each record batch independently. I made
https://issues.apache.org/jira/browse/ARROW-7966 to ensure this, and that
will hopefully iron out any kinks in the different implementations.

Thanks,
Bryan

On Wed, Feb 26, 2020 at 3:13 PM Wes McKinney wrote:

> I agree with independent validation.
>
> On Tue, Feb 25, 2020 at 2:55 PM David Li wrote:
> >
> > Hey Bryan,
> >
> > Thanks for looking into this issue. I would vote that we should
> > validate each batch independently, so we can catch issues related to
> > the structure of the data and not just the content. C++ doesn't do any
> > detection of empty batches per se, but on both ends it reads all the
> > data into a table, which would eliminate any empty batches.
> >
> > It also wouldn't be reasonable to stop sending batches that are empty,
> > because Flight lets you attach metadata to batches, and so an empty
> > batch might still have metadata that the client or server wants.
> >
> > Best,
> > David
> >
> > On 2/24/20, Bryan Cutler wrote:
> > > While looking into Null type testing for ARROW-7899, a couple small
> > > issues came up regarding Flight integration testing with empty batches
> > > (row count == 0) that could be worked out with a quick discussion. It
> > > seems there is a small difference between the C++ and Java Flight
> > > servers when there are empty record batches at the end of a stream,
> > > more details in PR https://github.com/apache/arrow/pull/6476.
> > >
> > > The Java server sends all record batches, even the empty ones, and the
> > > test client verifies each of these batches matches the batches read
> > > from a JSON file. The C++ server seems to recognize if the end of the
> > > stream is only empty batches (please correct me if I'm wrong) and will
> > > not serve them. This seems reasonable, as there is no more actual data
> > > left in the stream. The C++ test client reads all batches into a table,
> > > does the same for the JSON file, and compares final Tables. I also
> > > noticed that empty batches in the middle of the stream will be served.
> > > My questions are:
> > >
> > > 1) What is the expected behavior of a Flight server for empty record
> > > batches, can they be ignored and not sent to the Client?
> > >
> > > 2) Is it good enough to test against a final concatenation of all
> > > batches in the stream or should each batch be verified individually to
> > > ensure the server is sending out correctly batched data?
> > >
> > > Thanks,
> > > Bryan
> > >
>


[jira] [Created] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-02-28 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7966:
---

 Summary: [Integration][Flight][C++] Client should verify each 
batch independently
 Key: ARROW-7966
 URL: https://issues.apache.org/jira/browse/ARROW-7966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Bryan Cutler


Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
all batches from JSON into a Table, reads all batches in the flight stream from 
the server into a Table, then compares the Tables for equality.  This is 
potentially a problem because a record batch might have specific information 
that is then lost in the conversion to a Table. For example, if the server 
sends empty batches, the resulting Table would not be different from one with 
no empty batches.

Instead, the client should check each record batch from the JSON file against 
each record batch from the server independently. 
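
The difference between the two checks can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the code in {{test_integration_client.cc}}: a batch is modeled as a list of rows, and the helper names `concat`, `tables_equal`, and `batches_equal` are hypothetical.

```python
def concat(batches):
    """Flatten a list of batches (each a list of rows) into one 'table'."""
    return [row for batch in batches for row in batch]


def tables_equal(expected, actual):
    """Current approach: compare only the concatenated contents."""
    return concat(expected) == concat(actual)


def batches_equal(expected, actual):
    """Proposed approach: compare batch count, then each batch in order."""
    if len(expected) != len(actual):
        return False
    return all(e == a for e, a in zip(expected, actual))


# The JSON file describes four batches, two of them empty.
expected = [[1, 2], [], [3], []]
# A server that drops the trailing empty batch sends only three.
actual = [[1, 2], [], [3]]

print(tables_equal(expected, actual))   # True: the dropped batch goes unnoticed
print(batches_equal(expected, actual))  # False: the batch structure differs
```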





[jira] [Created] (ARROW-7965) [Python] Hold a reference to the dataset factory for later reuse

2020-02-28 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7965:
--

 Summary: [Python] Hold a reference to the dataset factory for 
later reuse
 Key: ARROW-7965
 URL: https://issues.apache.org/jira/browse/ARROW-7965
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs


Provide a more intuitive way to construct a nested dataset:

```python
# instead of using the confusing factory function
dataset([
    factory("s3://old-taxi-data", format="parquet"),
    factory("local/path/to/new/data", format="csv")
])

# let the user construct a new dataset directly from dataset objects
dataset([
    dataset("s3://old-taxi-data", format="parquet"),
    dataset("local/path/to/new/data", format="csv")
])
```

In the future we might want to introduce a new Dataset class which wraps the
functionality of both the dataset factory and the materialized dataset, enabling
optimizations over rediscovery of already materialized datasets.





[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments

2020-02-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7963:


 Summary: [C++][Python][Dataset] Expose listing fragments
 Key: ARROW-7963
 URL: https://issues.apache.org/jira/browse/ARROW-7963
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche
Assignee: Ben Kietzman


It would be useful to be able to list the fragments, to get their paths and
partition expressions.





[NIGHTLY] Arrow Build Report for Job nightly-2020-02-28-0

2020-02-28 Thread Crossbow


Arrow Build Report for Job nightly-2020-02-28-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0

Failed Tasks:
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-debian-stretch
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-macos-r-autobrew
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-turbodbc-master
- test-ubuntu-16.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-ubuntu-16.04-cpp
- ubuntu-eoan:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-ubuntu-eoan
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-ubuntu-xenial
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux1-cp35m
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux2010-cp35m
- wheel-manylinux2014-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-wheel-manylinux2014-cp35m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-wheel-osx-cp35m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-azure-conda-win-vs2015-py38
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-gandiva-jar-osx
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-28-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 

[jira] [Created] (ARROW-7962) [R][Dataset] Followup to "Consolidate Source and Dataset classes"

2020-02-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7962:
--

 Summary: [R][Dataset] Followup to "Consolidate Source and Dataset 
classes"
 Key: ARROW-7962
 URL: https://issues.apache.org/jira/browse/ARROW-7962
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


This was pushed to ARROW-7886 but it got dropped in a force push.





[jira] [Created] (ARROW-7961) pyarrow 0.16.0 cannot deserialize content serialised with < 0.16.0

2020-02-28 Thread Rob (Jira)
Rob created ARROW-7961:
--

 Summary: pyarrow 0.16.0 cannot deserialize content serialised with 
< 0.16.0
 Key: ARROW-7961
 URL: https://issues.apache.org/jira/browse/ARROW-7961
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
 Environment: MacOS, python 3.7
Reporter: Rob


A pandas data frame was serialised into a Redis cache using pyarrow 0.14.x.
After upgrading to 0.16.0, {{pa.deserialize()}} fails. With 0.15.1, objects
serialised with 0.14.x can still be deserialised.

{{import pyarrow as pa, redis}}
{{print(pa.__version__)}}
{{c = redis.Redis.from_url("redis://127.0.0.1")}}
{{obj = c.get("breakable")}}
{{df = pa.deserialize(obj)}}
{{print(df.head())}}
{{c.set("breakable", pa.serialize(df).to_buffer().to_pybytes())}}

When run in a venv with 0.15.1 installed, there are no errors: the version
number goes to stdout and nothing is written to stderr.

When run with 0.16.0, the following error is generated:

{{'0.16.0',}}
{{ 'Traceback (most recent call last):',}}
{{ ' File "/tmp/pa.py", line 6, in <module>',}}
{{ ' df = pa.deserialize(obj)',}}
{{ ' File "pyarrow/serialization.pxi", line 476, in pyarrow.lib.deserialize',}}
{{ ' File "pyarrow/serialization.pxi", line 438, in pyarrow.lib.deserialize_from',}}
{{ ' File "pyarrow/serialization.pxi", line 414, in pyarrow.lib.read_serialized',}}
{{ ' File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status',}}
{{ 'OSError: Expected IPC message of type unknown but got unknown']}}
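
A cache shared across library versions could at least detect this situation instead of failing inside the deserializer. The following is a minimal pure-Python sketch of that idea, not a fix for the incompatibility and not part of pyarrow: the helpers `wrap` and `unwrap` and the header layout are hypothetical. It assumes the writer prefixes each cached value with the library version it used, so that a reader on a different version can treat the entry as a cache miss.

```python
import json


def wrap(payload: bytes, writer_version: str) -> bytes:
    """Prefix payload with a length-delimited JSON header naming the writer."""
    header = json.dumps({"writer_version": writer_version}).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + payload


def unwrap(blob: bytes, reader_version: str):
    """Return the payload, or None if it was written by another version."""
    header_len = int.from_bytes(blob[:4], "big")
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    if header["writer_version"] != reader_version:
        # Treat as a cache miss so the caller can rebuild the entry.
        return None
    return blob[4 + header_len:]


# Stand-in for the serialized data frame bytes stored in the cache.
blob = wrap(b"serialized-bytes", "0.14.1")

print(unwrap(blob, "0.14.1"))  # b'serialized-bytes'
print(unwrap(blob, "0.16.0"))  # None
```

In the script above, the value passed to {{c.set}} would be wrapped with {{pa.__version__}}, and the value returned by {{c.get}} would be unwrapped before calling {{pa.deserialize}}.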

 


