[jira] [Created] (ARROW-8128) [C#] NestedType values serialized on wrong length
Takashi Hashida created ARROW-8128:
--
Summary: [C#] NestedType values serialized on wrong length
Key: ARROW-8128
URL: https://issues.apache.org/jira/browse/ARROW-8128
Project: Apache Arrow
Issue Type: Bug
Components: C#
Reporter: Takashi Hashida

NestedType values are serialized using the parent node's Length and NullCount.

[https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs#L219]

{code}
Flatbuf.FieldNode childFieldNode = recordBatchEnumerator.CurrentNode;
recordBatchEnumerator.MoveNextNode();
{code}

On these lines, MoveNextNode should be executed before CurrentNode is assigned.

This can be reproduced by changing TestData.ArrayCreator.Visit(ListType type) as shown below and running ArrowFileReaderTests.

{code}
public void Visit(ListType type)
{
    var builder = new ListArray.Builder(type.ValueField).Reserve(Length);

    // TODO: Support various types
    var valueBuilder = (Int64Array.Builder)builder.ValueBuilder.Reserve(Length);

    for (var i = 0; i < Length; i++)
    {
        builder.Append();
        valueBuilder.Append(i);
    }

    // Add a value to check whether the Values length can exceed the ListArray length
    valueBuilder.Append(0);

    Array = builder.Build();
}
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
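The ordering problem described in ARROW-8128 can be illustrated with a small model. This is only a sketch under stated assumptions, not the C# reader itself: FieldNode, NodeEnumerator, and both helper functions are hypothetical stand-ins, and the enumerator is assumed to be positioned on the parent (list) node when the child is read, as the report implies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldNode:
    """Hypothetical stand-in for a field node carrying length and null count."""
    length: int
    null_count: int


class NodeEnumerator:
    """Hypothetical stand-in for the reader's record batch enumerator."""

    def __init__(self, nodes: List[FieldNode]) -> None:
        self._nodes = nodes
        self._index = 0  # assumed to start on the parent (list) node

    @property
    def current_node(self) -> FieldNode:
        return self._nodes[self._index]

    def move_next_node(self) -> None:
        self._index += 1


def child_node_buggy(enumerator: NodeEnumerator) -> FieldNode:
    # Order from the report: read first, then advance.
    # The "child" node is really the parent's node.
    child = enumerator.current_node
    enumerator.move_next_node()
    return child


def child_node_fixed(enumerator: NodeEnumerator) -> FieldNode:
    # Suggested order: advance first, then read the child's own node.
    enumerator.move_next_node()
    return enumerator.current_node


# A list array with 3 entries whose values child holds 4 entries,
# mirroring the extra valueBuilder.Append(0) in the reproduction.
nodes = [FieldNode(length=3, null_count=0), FieldNode(length=4, null_count=0)]
buggy = child_node_buggy(NodeEnumerator(nodes))
fixed = child_node_fixed(NodeEnumerator(nodes))
print(buggy.length, fixed.length)
```

With equal parent and child lengths the two orderings give the same numbers, which is presumably why the existing tests did not catch the problem until the values array was made longer than the list array.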
[jira] [Created] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
TP Boudreau created ARROW-8127:
--
Summary: [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
Key: ARROW-8127
URL: https://issues.apache.org/jira/browse/ARROW-8127
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: TP Boudreau
Assignee: TP Boudreau
Attachments: multipage-batch-write.cc

When writing to a buffered column writer using PLAIN encoding, if the size of the batch supplied for writing exceeds the page size for the writer, the resulting file has an incorrect data_page_offset set in its column chunk metadata. This causes an exception to be thrown when reading the file (the file appears too short to the reader).

For example, the attached code, which attempts to write a batch of 262145 Int32 values (= 1048576 + 4 bytes) using the default page size of 1048576 bytes (with buffered writer, PLAIN encoding), fails on reading, throwing the error: "Tried reading 1048678 bytes starting at position 1048633 from file but only got 333".

The error is caused by the second page write tripping the conditional here https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302, in the serialized in-memory writer wrapped by the buffered writer.

The fix builds the metadata with offsets from the terminal sink rather than the in-memory buffered sink. A PR is coming.
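The batch size quoted in ARROW-8127 can be checked with a few lines of arithmetic. This is a rough sketch that assumes each PLAIN-encoded Int32 value occupies 4 bytes and ignores page headers and definition/repetition levels; the constant names are illustrative only.

```python
INT32_BYTES = 4        # assumed size of one PLAIN-encoded Int32 value
PAGE_SIZE = 1048576    # default page size quoted in the report
NUM_VALUES = 262145    # batch size used by the attached reproduction

batch_bytes = NUM_VALUES * INT32_BYTES
full_pages, remainder = divmod(batch_bytes, PAGE_SIZE)

# One full page plus 4 bytes, so the batch spills into a second page.
print(batch_bytes, full_pages, remainder)
```

The 4-byte overflow is what forces the second page write that the report identifies as the trigger.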
[jira] [Created] (ARROW-8126) [C++][Compute] Add Top-K kernel benchmark
Yibo Cai created ARROW-8126:
---
Summary: [C++][Compute] Add Top-K kernel benchmark
Key: ARROW-8126
URL: https://issues.apache.org/jira/browse/ARROW-8126
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai
[jira] [Created] (ARROW-8125) [C++] "arrow-tests" target broken with ninja build
Wes McKinney created ARROW-8125:
---
Summary: [C++] "arrow-tests" target broken with ninja build
Key: ARROW-8125
URL: https://issues.apache.org/jira/browse/ARROW-8125
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.17.0

{code}
$ ninja arrow-tests
ninja: no work to do.
{code}

According to git bisect this was introduced by

{code}
$ git bisect bad
7db3855cd4a2e2f704b8715af3a36cbef0bb2a27 is the first bad commit
commit 7db3855cd4a2e2f704b8715af3a36cbef0bb2a27
Author: Benjamin Kietzman
Date: Mon Mar 9 16:40:21 2020 +0100

    ARROW-8014: [C++] Provide CMake targets exercising tests with a label

    To run a subset of the tests, use:
    ```shell-session
    $ ninja -C ~/arrow/cpp/debug-build test-arrow_dataset
    ```

    Closes #6547 from bkietz/8014-Provide-CMake-targets-to- and squashes the following commits:

    cf9bbb06a test-lable- => test-
    90a1a7f3b ARROW-8014: Provide Cmake targets exercising tests with a label

    Authored-by: Benjamin Kietzman
    Signed-off-by: Antoine Pitrou

 cpp/cmake_modules/BuildUtils.cmake | 15 +++
 cpp/src/arrow/CMakeLists.txt | 2 --
 2 files changed, 15 insertions(+), 2 deletions(-)
{code}
[NIGHTLY] Arrow Build Report for Job nightly-2020-03-15-0
Arrow Build Report for Job nightly-2020-03-15-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0 Failed Tasks: - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-gandiva-jar-trusty - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-cpp-valgrind - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-turbodbc-master - wheel-osx-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp35m - wheel-osx-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp36m - wheel-osx-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp37m - wheel-osx-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-wheel-osx-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-centos-8 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-azure-conda-win-vs2015-py38 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-debian-buster - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-github-debian-stretch - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-gandiva-jar-osx - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-homebrew-cpp - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-travis-macos-r-autobrew - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-cpp - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-hdfs-2.9.2: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-kartothek-master - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-15-0-circle-test-conda-python-3.7-spark-master - test-conda-python-3.7: URL: https:/
Re: [DISCUSS] Field reference ambiguity
It seems like there are two common patterns for projection from a record batch:

* Selecting top-level fields by index
* Selecting a collection of column paths

I'm on board with deprecating std::vector-based APIs, since these are a special case of selecting a collection of column paths that includes all children of nested types.

Suppose we have the following schema:

a: int64
b: struct, f1: float64, f2: struct>

What would be the proposed syntax for projecting this to

a: int64
b: struct, f2: struct>

? Probably something like

{
  FieldRef("a"),
  FieldRef("b", {FieldRef("f0"), FieldRef("f2", {FieldRef("f3")})})
}

(I apologize if this is already addressed in the PR; I will certainly take a closer look.)

- Wes

On Fri, Mar 13, 2020 at 3:04 PM Francois Saint-Jacques wrote:
>
> Hello,
>
> the recent dataset and compute work has forced us to think about
> schema projection. One problem that surfaced is referencing fields in
> nested schemas and/or schemas where duplicate column names exist. We
> currently have (C++) APIs that pass either a vector of indices or a
> vector of names to represent a subset of fields; both ways pose
> challenges:
>
> - Referencing a column by index can't access sub-fields of a nested type.
> - Referencing a column by name can return more than one field.
>
> Thus, Ben drafted a PR [1] to allow referencing fields in a (hopefully)
> unambiguous way. This is divided into 2 concepts:
>
> - FieldPath: A stack of indices pointing into nested structures. It
> points to exactly one field, or none if ill-formed. If the depth is
> one, it is equivalent to referencing a field by index.
> - FieldRef: A friendlier version that supports referencing by names
> and/or a tiny string DSL similar to JSONPath. One can "dereference" a
> FieldRef into a FieldPath given a schema. Since it supports name
> components, a FieldRef can expand to more than one FieldPath.
>
> We'd like to standardise most C++ APIs where a vector of indices (or
> names) is given as an indicator of a subset of columns to use this new
> facility. For this reason, we'd like feedback on the implementation. I
> encourage other language developers to look at this as they'll likely
> face the same issues.
>
> Thank you,
> François
>
> [1] https://github.com/apache/arrow/pull/6545
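The two concepts in the quoted message can be sketched in a few lines of Python. This is only an illustration of the idea, not the API from the PR: Field, get_by_path, and find_all are hypothetical names, the schema is made up, and the recursive name search is a simplifying assumption about how a name-based reference might expand to several paths.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical schema field; nested types carry child fields."""
    name: str
    type: str
    children: List["Field"] = field(default_factory=list)


# A path is a stack of indices: (1, 2, 0) means fields[1].children[2].children[0].
FieldPath = Tuple[int, ...]


def get_by_path(fields: List[Field], path: FieldPath) -> Field:
    """Follow the indices; raises IndexError if an index is out of range."""
    if not path:
        raise IndexError("empty path")
    current = fields[path[0]]
    for index in path[1:]:
        current = current.children[index]
    return current


def find_all(fields: List[Field], name: str, prefix: FieldPath = ()) -> List[FieldPath]:
    """Expand a name into every matching path (assumed to search nested fields too)."""
    matches: List[FieldPath] = []
    for i, f in enumerate(fields):
        path = prefix + (i,)
        if f.name == name:
            matches.append(path)
        matches.extend(find_all(f.children, name, path))
    return matches


# Illustrative schema with a duplicated name "a" at two nesting levels.
schema = [
    Field("a", "int64"),
    Field("b", "struct", [
        Field("f0", "int64"),
        Field("f1", "float64"),
        Field("f2", "struct", [Field("f3", "int64"), Field("a", "float64")]),
    ]),
]

print(get_by_path(schema, (1, 2, 0)).name)
print(find_all(schema, "a"))
```

A path of depth one behaves like a plain index into the top-level fields, while the name lookup shows why a name-based reference may resolve to more than one path.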
[jira] [Created] (ARROW-8124) Update library dependencies
Bryant Biggs created ARROW-8124:
---
Summary: Update library dependencies
Key: ARROW-8124
URL: https://issues.apache.org/jira/browse/ARROW-8124
Project: Apache Arrow
Issue Type: Task
Components: Rust
Reporter: Bryant Biggs
Fix For: 0.17.0

Update Rust library dependencies to the latest versions, except for thrift and sqlparser, which require additional work.
[jira] [Created] (ARROW-8123) [Rust] [DataFusion] Create LogicalPlanBuilder
Andy Grove created ARROW-8123:
-
Summary: [Rust] [DataFusion] Create LogicalPlanBuilder
Key: ARROW-8123
URL: https://issues.apache.org/jira/browse/ARROW-8123
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 1.0.0

Building logical plans is arduous and a builder would make this nicer.

Example:

{code:java}
let plan = LogicalPlanBuilder::new()
    .scan(
        "default",
        "employee.csv",
        &employee_schema(),
        Some(vec![0, 3]),
    )?
    .filter(col(1).eq(&lit_str("CO")))?
    .project(vec![col(0)])?
    .build()?;
{code}

Note that I am already working on this and will have a PR shortly.
Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads
On 15/03/2020 at 04:57, Wes McKinney wrote:
> On Sat, Mar 14, 2020, 10:52 PM Micah Kornfield wrote:
>
>> Hi Antoine,
>> Could you clarify what you mean by:
>>
>>> Given our current resource utilization on Github Actions, it seems that
>>> even a non-auto-scaling setup could be useful.
>>
>>
>> I could interpret it in a couple of ways ...
>>
>
> I think he means that we would not have difficulty keeping some persistent
> nodes fully (or at least > 50%) utilized during regular working hours.

Right. And we have a non-trivial number of "nightly" jobs (depending on
where you are on Earth) as well :-)

Regards

Antoine.