[jira] [Created] (ARROW-8128) [C#] NestedType values serialized on wrong length

2020-03-15 Thread Takashi Hashida (Jira)
Takashi Hashida created ARROW-8128:
--

 Summary: [C#] NestedType values serialized on wrong length
 Key: ARROW-8128
 URL: https://issues.apache.org/jira/browse/ARROW-8128
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Takashi Hashida


NestedType values are serialized using the parent node's Length and NullCount.

 

[https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs#L219]

 

{code}
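// As reported above: at this point CurrentNode still refers to the parent's field
// node, so the child values are read with the parent's Length and NullCount;
// MoveNextNode() needs to run before CurrentNode is read.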
Flatbuf.FieldNode childFieldNode = recordBatchEnumerator.CurrentNode;
recordBatchEnumerator.MoveNextNode();
{code}

At these lines, MoveNextNode should be executed before CurrentNode is assigned to childFieldNode.

This can be reproduced by changing TestData.ArrayCreator.Visit(ListType type) 
as shown below and running ArrowFileReaderTests.

{code}
public void Visit(ListType type)
{
    var builder = new ListArray.Builder(type.ValueField).Reserve(Length);

    // TODO: Support various types
    var valueBuilder = (Int64Array.Builder)builder.ValueBuilder.Reserve(Length);

    for (var i = 0; i < Length; i++)
    {
        builder.Append();
        valueBuilder.Append(i);
    }

    // Add a value to check whether the Values length can exceed the ListArray length
    valueBuilder.Append(0);

    Array = builder.Build();
}
{code}





[jira] [Created] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes

2020-03-15 Thread TP Boudreau (Jira)
TP Boudreau created ARROW-8127:
--

 Summary: [C++] [Parquet] Incorrect column chunk metadata for 
multipage batch writes
 Key: ARROW-8127
 URL: https://issues.apache.org/jira/browse/ARROW-8127
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: TP Boudreau
Assignee: TP Boudreau
 Attachments: multipage-batch-write.cc

When writing to a buffered column writer using PLAIN encoding, if the size of 
the batch supplied for writing exceeds the page size for the writer, the 
resulting file has an incorrect data_page_offset set in its column chunk 
metadata.  This causes an exception to be thrown when reading the file (file 
appears to be too short to the reader).

For example, the attached code, which attempts to write a batch of 262145 
Int32s (= 1048576 + 4 bytes) using the default page size of 1048576 bytes 
(with a buffered writer and PLAIN encoding), fails on reading, throwing the 
error: "Tried reading 1048678 bytes starting at position 1048633 from file 
but only got 333".

The error is caused by the second page write tripping the conditional at 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302 
in the serialized in-memory writer that the buffered writer wraps.

The fix builds the metadata with offsets from the terminal sink rather than 
the in-memory buffered sink. A PR is coming.
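
To make the scenario concrete, here is a minimal sketch of the write pattern 
described above (this is not the attached multipage-batch-write.cc; the file 
path and column name are placeholders): a single buffered row group with 
dictionary encoding disabled so the column uses PLAIN encoding, and one 
WriteBatch call of 262145 Int32 values, just over the default 1 MiB data page 
size.

{code}
// Sketch only, not the attached reproducer; file path and column name are placeholders.
#include <memory>
#include <numeric>
#include <vector>

#include <arrow/io/file.h>
#include <parquet/api/writer.h>

int main() {
  using parquet::Repetition;
  using parquet::Type;
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  auto sink =
      arrow::io::FileOutputStream::Open("/tmp/multipage.parquet").ValueOrDie();

  // Single required INT32 column.
  auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
      "schema", Repetition::REQUIRED,
      {PrimitiveNode::Make("i32", Repetition::REQUIRED, Type::INT32,
                           parquet::ConvertedType::NONE)}));

  // Disable dictionary encoding so the column is written with PLAIN encoding.
  auto properties =
      parquet::WriterProperties::Builder().disable_dictionary()->build();

  auto file_writer = parquet::ParquetFileWriter::Open(sink, schema, properties);

  // Buffered row group: pages accumulate in an in-memory sink until Close().
  parquet::RowGroupWriter* row_group = file_writer->AppendBufferedRowGroup();
  auto* writer = static_cast<parquet::Int32Writer*>(row_group->column(0));

  // 262145 * 4 bytes = 1048580 bytes of values, i.e. 1048576 + 4, so a second
  // data page is triggered within a single WriteBatch call.
  std::vector<int32_t> values(262145);
  std::iota(values.begin(), values.end(), 0);
  writer->WriteBatch(static_cast<int64_t>(values.size()), nullptr, nullptr,
                     values.data());

  row_group->Close();
  file_writer->Close();
  return 0;
}
{code}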





[jira] [Created] (ARROW-8126) [C++][Compute] Add Top-K kernel benchmark

2020-03-15 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8126:
---

 Summary: [C++][Compute] Add Top-K kernel benchmark
 Key: ARROW-8126
 URL: https://issues.apache.org/jira/browse/ARROW-8126
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai








[jira] [Created] (ARROW-8125) [C++] "arrow-tests" target broken with ninja build

2020-03-15 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8125:
---

 Summary: [C++] "arrow-tests" target broken with ninja build
 Key: ARROW-8125
 URL: https://issues.apache.org/jira/browse/ARROW-8125
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


{code}
$ ninja arrow-tests
ninja: no work to do.
{code}

According to git bisect, this was introduced by:

{code}
$ git bisect bad
7db3855cd4a2e2f704b8715af3a36cbef0bb2a27 is the first bad commit
commit 7db3855cd4a2e2f704b8715af3a36cbef0bb2a27
Author: Benjamin Kietzman 
Date:   Mon Mar 9 16:40:21 2020 +0100

ARROW-8014: [C++] Provide CMake targets exercising tests with a label

To run a subset of the tests, use:
```shell-session
$ ninja -C ~/arrow/cpp/debug-build test-arrow_dataset
```

Closes #6547 from bkietz/8014-Provide-CMake-targets-to- and squashes the 
following commits:

cf9bbb06a  test-lable- => test-
90a1a7f3b  ARROW-8014:  Provide Cmake targets exercising 
tests with a label

Authored-by: Benjamin Kietzman 
Signed-off-by: Antoine Pitrou 

 cpp/cmake_modules/BuildUtils.cmake | 15 +++
 cpp/src/arrow/CMakeLists.txt   |  2 --
 2 files changed, 15 insertions(+), 2 deletions(-)
{code}





[NIGHTLY] Arrow Build Report for Job nightly-2020-03-15-0

2020-03-15 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-15-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0

Failed Tasks:
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-cpp-valgrind
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-turbodbc-master
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7:
  URL: 
https:/

Re: [DISCUSS] Field reference ambiguity

2020-03-15 Thread Wes McKinney
It seems like there are two common patterns for projection from a record batch:

* Selecting top-level fields by index
* Selecting a collection of column paths.

I'm on board with deprecating the std::vector<int>-based APIs,
since these are a special case of selecting a collection of column
paths that include all children of nested types.

Suppose we have the following schema:

a: int64
b: struct<f0: ..., f1: float64, f2: struct<f3: ..., ...>>

What would be the proposed syntax of projecting this to

a: int64
b: struct<f0: ..., f2: struct<f3: ...>>

?

Probably something like

{
  FieldRef("a"),
  FieldRef("b", {FieldRef("f0"), FieldRef("f2", {FieldRef("f3"})})
}

(I apologize if this is already addressed in the PR, I will certainly
take a closer look)

- Wes

On Fri, Mar 13, 2020 at 3:04 PM Francois Saint-Jacques
 wrote:
>
> Hello,
>
> the recent dataset and compute work has forced us to think about
> schema projection. One problem that surfaced is referencing fields in
> nested schemas and/or schemas where duplicate column names exist. We
> currently have (C++) APIs that pass either a vector<int> or a
> vector<std::string> to represent a subset of fields; both ways pose
> challenges:
>
> - Referencing a column by index can't access sub-fields of nested type.
> - Referencing a column by name can return more than one field.
>
> Thus, Ben drafted a PR [1] to allow referencing fields in a (hopefully)
> non-ambiguous way. This is divided into two concepts:
>
> - FieldPath: A stack of indices pointing into nested structures. It
> points to exactly one field, or none if ill formed. If the depth is
> one, it is equivalent to referencing a field by index.
> - FieldRef: A friendlier version that supports referencing by names
> and/or a tiny string DSL similar to JSONPath. One can "dereference" a
> FieldRef into a FieldPath given a schema. Since it supports name
> components, a FieldRef can expand to more than one FieldPath.
>
> We'd like to standardise most C++ APIs where a vector of indices (or
> names) is given as an indicator of a subset of columns to use this new
> facility. For this reason, we'd like feedback on the implementation. I
> encourage other language developers to look at this as they'll likely
> face the same issues.
>
> Thank you,
> François
>
> [1] https://github.com/apache/arrow/pull/6545
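
For orientation, a rough C++ sketch of how FieldPath and FieldRef might look
in use once [1] lands; the method names here (FromDotPath, FindAll, Get) are
assumptions following the PR's direction and may differ in the final API:

#include <iostream>
#include <memory>
#include <vector>

#include <arrow/api.h>

arrow::Status ProjectExample(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // FieldPath: a stack of indices pointing at exactly one (possibly nested) field.
  arrow::FieldPath path({1, 2});  // second top-level field, then its third child
  ARROW_ASSIGN_OR_RAISE(auto by_index, path.Get(*batch));
  std::cout << by_index->ToString() << std::endl;

  // FieldRef: names and/or a small DSL; dereferencing it against a schema can
  // yield several FieldPaths when duplicate names match more than one field.
  ARROW_ASSIGN_OR_RAISE(auto ref, arrow::FieldRef::FromDotPath(".b.f2"));
  std::vector<arrow::FieldPath> matches = ref.FindAll(*batch->schema());

  for (const auto& match : matches) {
    ARROW_ASSIGN_OR_RAISE(auto column, match.Get(*batch));
    std::cout << column->ToString() << std::endl;
  }
  return arrow::Status::OK();
}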


[jira] [Created] (ARROW-8124) Update library dependencies

2020-03-15 Thread Bryant Biggs (Jira)
Bryant Biggs created ARROW-8124:
---

 Summary: Update library dependencies
 Key: ARROW-8124
 URL: https://issues.apache.org/jira/browse/ARROW-8124
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Bryant Biggs
 Fix For: 0.17.0


Update Rust library dependencies to the latest versions, except for thrift and 
sqlparser, which require additional work.





[jira] [Created] (ARROW-8123) [Rust] [DataFusion] Create LogicalPlanBuilder

2020-03-15 Thread Andy Grove (Jira)
Andy Grove created ARROW-8123:
-

 Summary: [Rust] [DataFusion] Create LogicalPlanBuilder
 Key: ARROW-8123
 URL: https://issues.apache.org/jira/browse/ARROW-8123
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 1.0.0


Building logical plans is arduous and a builder would make this nicer. Example:
{code:java}
let plan = LogicalPlanBuilder::new()
    .scan(
        "default",
        "employee.csv",
        &employee_schema(),
        Some(vec![0, 3]),
    )?
    .filter(col(1).eq(&lit_str("CO")))?
    .project(vec![col(0)])?
    .build()?;
{code}
Note that I am already working on this and will have a PR shortly.





Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

2020-03-15 Thread Antoine Pitrou


On 15/03/2020 04:57, Wes McKinney wrote:
> On Sat, Mar 14, 2020, 10:52 PM Micah Kornfield 
> wrote:
> 
>> Hi Antoine,
>> Could you clarify what you mean by:
>>
>>> Given our current resource utilization on Github Actions, it seems that
>>> even a non-auto-scaling setup could be useful.
>>
>>
>> I could interpret it in a couple of ways ...
>>
> 
> I think he means that we would not have difficulty keeping some persistent
> nodes fully (or at least > 50%) utilized during regular working hours.

Right.  And we have a non-trivial number of "nightly" jobs (depending on
where you are on Earth) as well :-)

Regards

Antoine.