[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-1
Arrow Build Report for Job nightly-2020-04-14-1

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1

Failed Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-centos-6-amd64
- ubuntu-focal-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-ubuntu-focal-amd64
- ubuntu-xenial-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-ubuntu-xenial-amd64
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-travis-wheel-osx-cp36m
- wheel-win-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-appveyor-wheel-win-cp35m

Pending Tasks:
- test-conda-cpp-hiveserver2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp-hiveserver2
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.8-jpype
- test-conda-python-3.8:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.8
- test-conda-r-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-r-3.6
- test-debian-10-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-cpp
- test-debian-10-go-1.12:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-go-1.12
- test-debian-10-python-3:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-python-3
- test-debian-c-glib:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-c-glib
- test-debian-ruby:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-ruby
- test-fedora-30-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-fedora-30-cpp
- test-ubuntu-16.04-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-16.04-cpp
- test-ubuntu-18.04-cpp-release:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-cpp-release
- test-ubuntu-18.04-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-cpp
- test-ubuntu-18.04-docs:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-docs
- test-ubuntu-18.04-r-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-r-3.6
- test-ubuntu-c-glib:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-c-glib
- test-ubuntu-ruby:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-ruby
- wheel-win-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-appveyor-wheel-win-cp36m

Succeeded Tasks:
- centos-7-amd64:
  URL: https://github.com/ursa-labs/
[jira] [Created] (ARROW-8439) [Python] Filesystem docs are outdated
Joris Van den Bossche created ARROW-8439: Summary: [Python] Filesystem docs are outdated Key: ARROW-8439 URL: https://issues.apache.org/jira/browse/ARROW-8439 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8440) Refine simd header files
Yibo Cai created ARROW-8440: --- Summary: Refine simd header files Key: ARROW-8440 URL: https://issues.apache.org/jira/browse/ARROW-8440 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Yibo Cai Assignee: Yibo Cai This is a follow-up to ARROW-8227. It aims to unify the SIMD header files and simplify the code. Currently, the SSE headers are included in sse_util.h, the NEON headers in neon_util.h, and the AVX headers are included directly in C++ source files. sse_util.h and neon_util.h also contain CRC code that is not used by all of the files that #include them. It may be better to gather all SIMD headers in a single simd.h and move the CRC code to where it is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8441) [C++] Fix crashes on invalid input (OSS-Fuzz)
Antoine Pitrou created ARROW-8441: - Summary: [C++] Fix crashes on invalid input (OSS-Fuzz) Key: ARROW-8441 URL: https://issues.apache.org/jira/browse/ARROW-8441 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8442) [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy
Joris Van den Bossche created ARROW-8442: Summary: [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy Key: ARROW-8442 URL: https://issues.apache.org/jira/browse/ARROW-8442 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

There is this behaviour of {{to_pandas_dtype}} returning float, while all actual conversions to numpy or pandas use object dtype:

{code}
In [23]: pa.null().to_pandas_dtype()
Out[23]: numpy.float64

In [24]: pa.array([], pa.null()).to_pandas()
Out[24]: Series([], dtype: object)

In [25]: pa.array([], pa.null()).to_numpy(zero_copy_only=False)
Out[25]: array([], dtype=object)
{code}

So we should probably fix {{NullType.to_pandas_dtype}} to return object, which is what is used in practice. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8443) [Gandiva][C++] Fix round/truncate to no-op for special cases
Praveen Kumar created ARROW-8443: Summary: [Gandiva][C++] Fix round/truncate to no-op for special cases Key: ARROW-8443 URL: https://issues.apache.org/jira/browse/ARROW-8443 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Affects Versions: 1.0.0 Reporter: Praveen Kumar For round and truncate, when the target scale is greater than the input scale, make the operation a no-op. -- This message was sent by Atlassian Jira (v8.3.4#803005)
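The intended semantics can be illustrated with Python's stdlib decimal module; `rescale` is a hypothetical helper sketching the proposal, not Gandiva's actual implementation. When the target scale is at least the input's scale, there is nothing to round or truncate away, so the value is returned untouched.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def rescale(value: Decimal, target_scale: int,
            rounding=ROUND_HALF_UP) -> Decimal:
    """Round (or, with ROUND_DOWN, truncate) `value` to `target_scale`
    fractional digits. Hypothetical helper sketching the proposed rule."""
    current_scale = -value.as_tuple().exponent
    if target_scale >= current_scale:
        return value  # no-op: the value already fits the target scale
    quantum = Decimal(1).scaleb(-target_scale)
    return value.quantize(quantum, rounding=rounding)
```

For example, `rescale(Decimal("12.34"), 4)` is a no-op, while `rescale(Decimal("12.345"), 2)` rounds to `12.35` and `rescale(Decimal("12.345"), 2, ROUND_DOWN)` truncates to `12.34`.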
[jira] [Created] (ARROW-8444) [Documentation] Fix spelling errors across the codebase
Krisztian Szucs created ARROW-8444: -- Summary: [Documentation] Fix spelling errors across the codebase Key: ARROW-8444 URL: https://issues.apache.org/jira/browse/ARROW-8444 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Krisztian Szucs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8445) [Gandiva][UDF] Add a udf for gandiva to extract the first capture in regex.
ZMZ91 created ARROW-8445: Summary: [Gandiva][UDF] Add a udf for gandiva to extract the first capture in regex. Key: ARROW-8445 URL: https://issues.apache.org/jira/browse/ARROW-8445 Project: Apache Arrow Issue Type: New Feature Components: C++, C++ - Gandiva Reporter: ZMZ91 Add a Gandiva UDF to extract the first capture group of a regex: [https://github.com/apache/arrow/pull/6925] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8446) [Python][Dataset] Detect and use _metadata file in a list of file paths
Joris Van den Bossche created ARROW-8446: Summary: [Python][Dataset] Detect and use _metadata file in a list of file paths Key: ARROW-8446 URL: https://issues.apache.org/jira/browse/ARROW-8446 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

From https://github.com/dask/dask/pull/6047#discussion_r402391318

When specifying a directory to {{ParquetDataset}}, we will detect whether a {{_metadata}} file is present in the directory and use it to populate the {{metadata}} attribute (and not include this file in the list of "pieces", since it does not contain any data). However, when passing a list of files to {{ParquetDataset}} with one of them being "_metadata", the metadata attribute is not populated, and the "_metadata" path is included as one of the ParquetDatasetPiece objects instead (which leads to an ArrowIOError during the read of that piece). We _could_ detect it in a list of paths as well.

Note, I mentioned {{ParquetDataset}}, but if working on this, we should probably do it directly in the datasets-API-based version. Also, I labeled this as Python and not C++ for now, as this might be something that can be handled on the Python side (once the C++ side knows how to process this kind of metadata -> ARROW-8062) -- This message was sent by Atlassian Jira (v8.3.4#803005)
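On the Python side, the detection could amount to partitioning the input paths before the pieces are constructed. A minimal sketch with a hypothetical helper (not the actual ParquetDataset code):

```python
import posixpath

def split_metadata_paths(paths):
    """Separate Parquet summary files (_metadata) from the data pieces.
    Hypothetical helper sketching the proposed detection."""
    metadata, pieces = [], []
    for path in paths:
        if posixpath.basename(path) == "_metadata":
            metadata.append(path)  # candidates for the `metadata` attribute
        else:
            pieces.append(path)    # actual data pieces
    return metadata, pieces
```

With this, `split_metadata_paths(["d/part-0.parquet", "d/_metadata"])` yields the `_metadata` path separately instead of letting it become a (failing) dataset piece.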
[jira] [Created] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering
Francois Saint-Jacques created ARROW-8447: - Summary: [C++][Dataset] Ensure Scanner::ToTable preserve ordering Key: ARROW-8447 URL: https://issues.apache.org/jira/browse/ARROW-8447 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques

This can be refactored with a little effort in Scanner::ToTable:
# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector; when all batches are consumed, move the local vector into `batches` at the right index, resizing and emplacing with a mutex
# After waiting for the task group completion, either
* Concatenate into a single vector and call `Table::FromRecordBatch`, or
* Write a RecordBatchReader that supports vector and add a method `Table::FromRecordBatchReader`

The latter involves more work but is the cleaner way; the other FromRecordBatch method can be implemented from it and supports "streaming". -- This message was sent by Atlassian Jira (v8.3.4#803005)
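The ordering idea above (tag each scan task with an incrementing id and land its batches at that index) can be sketched in Python with a thread pool; the task objects here are hypothetical callables returning lists of batches, not the actual ScanTask API:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_to_batches(scan_tasks):
    """Run scan tasks concurrently while keeping their batches in task
    order. Sketch of the scan_task_id idea: results land at the index of
    the task that produced them, regardless of completion order."""
    results = [None] * len(scan_tasks)

    def run(task_id, task):
        # collect this task's batches locally, then place them at the
        # right index (a real implementation would guard with a mutex)
        results[task_id] = task()

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run, i, t) for i, t in enumerate(scan_tasks)]
        for f in futures:
            f.result()  # wait for task group completion, propagate errors

    # concatenate in deterministic (task) order
    return [batch for batches in results for batch in batches]
```

Even if a later task finishes first, its batches are concatenated after those of earlier tasks, which is the property Scanner::ToTable should guarantee.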
Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)
hi Micah,

I'm glad that we have the write side of nested completed for 0.17.0. As far as completing the read side and then implementing sufficient testing to exercise corner cases in end-to-end reads/writes, do you anticipate being able to work on this in the next 4-6 weeks (obviously the state of the world has affected everyone's availability / bandwidth)? I ask because someone from my team (or me also) may be able to get involved and help this move along. It'd be great to have this 100% completed and checked off our list for the next release (i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration tests get completed also).

thanks
Wes

On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield wrote:
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
> I'd like to avoid a separate branch if possible. I'm willing to close the
> open PR till I'm sure it is needed, but I'm hoping that keeping PRs as small
> and focused as possible, with performance testing along the way, will be a better
> reviewer and developer experience here.
>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>
> Hopefully, Igor can help out; otherwise I'll take up the read path after I
> finish the write path.
>
> -Micah
>
> On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney wrote:
>>
>> hi Micah
>>
>> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield wrote:
>> >
>> > Just to give an update. I've been a little bit delayed, but my progress is
>> > as follows:
>> > 1. Had 1 PR merged that will exercise basic end-to-end tests.
>> > 2. Have another PR open that allows a configuration option in C++ to
>> > determine which algorithm version to use for reading/writing: the existing
>> > version and the new version supporting complex nested arrays. I think a
>> > large amount of code will be reused/delegated to, but I will err on the side
>> > of not touching the existing code/algorithms so that any errors in the
>> > implementation or performance regressions can hopefully be mitigated at
>> > runtime. I expect that in later releases (once the code has "baked") this
>> > option will become a no-op.
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
>> > 3. Started coding the write path.
>> >
>> > Which leaves:
>> > 1. Finishing the write path (I estimate 2-3 weeks) to be code complete
>> > 2. Implementing the read path.
>>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>>
>> > Again, I'm happy to collaborate if people have bandwidth and want to
>> > contribute.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield wrote:
>> >
>> > > Hi Wes,
>> > > I'm still interested in doing the work. But I don't want to hold anybody up if
>> > > they have bandwidth.
>> > >
>> > > In order to actually make progress on this, my plan will be to:
>> > > 1. Help with the current Java review backlog through early next week or
>> > > so (this has been taking the majority of my time allocated for Arrow
>> > > contributions for the last 6 months or so).
>> > > 2. Shift all my attention to trying to get this done (this means no
>> > > reviews other than closing out existing ones that I've started until it is
>> > > done). Hopefully, other Java committers can help shrink the backlog
>> > > further (Jacques, thanks for your recent efforts here).
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney wrote:
>> > >
>> > >> hi folks,
>> > >>
>> > >> I think we have reached a point where the incomplete C++ Parquet
>> > >> nested data assembly/disassembly is harming the value of several
>> > >> other parts of the project, for example the Datasets API. As another
>> > >> example, it's possible to ingest nested data from JSON but not write
>> > >> it to Parquet in general.
>> > >>
>> > >> Implementing the nested data read and write path completely is a
>> > >> difficult project requiring at least several weeks of dedicated work,
>> > >> so it's not so surprising that it hasn't been accomplished yet. I know
>> > >> that several people have expressed interest in working on it, but I
>> > >> would like to see if anyone would be able to volunteer a commitment of
>> > >> tim
[jira] [Created] (ARROW-8448) [Package] Can't build apt packages with ubuntu-focal
Francois Saint-Jacques created ARROW-8448: - Summary: [Package] Can't build apt packages with ubuntu-focal Key: ARROW-8448 URL: https://issues.apache.org/jira/browse/ARROW-8448 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Francois Saint-Jacques Assignee: Kouhei Sutou

While trying to debug the failing nightly builds (due to disk space), I encountered the following error: the tarball generated by the build script does not conform to what debuilder expects, which blocks the build.

{code}
Successfully built ecdda7ea015d
Successfully tagged apache-arrow-ubuntu-focal:latest
docker run --rm --tty --volume /home/fsaintjacques/src/db/arrow/dev/tasks/linux-packages/apache-arrow/apt:/host:rw --env DEBUG=yes apache-arrow-ubuntu-focal /host/build.sh
This package has a Debian revision number but there does not seem to be
an appropriate original tar file or .orig directory in the parent directory;
(expected one of apache-arrow_0.16.0.orig.tar.gz, apache-arrow_0.16.0.orig.tar.bz2,
apache-arrow_0.16.0.orig.tar.lzma, apache-arrow_0.16.0.orig.tar.xz
or apache-arrow-1.0.0~dev20200414.orig)
continue anyway? (y/n)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8449) [R] Use CMAKE_UNITY_BUILD everywhere
Neal Richardson created ARROW-8449: -- Summary: [R] Use CMAKE_UNITY_BUILD everywhere Key: ARROW-8449 URL: https://issues.apache.org/jira/browse/ARROW-8449 Project: Apache Arrow Issue Type: Improvement Components: Packaging, R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8450) [Integration][C++] Implement large list/binary/utf8 integration
Antoine Pitrou created ARROW-8450: - Summary: [Integration][C++] Implement large list/binary/utf8 integration Key: ARROW-8450 URL: https://issues.apache.org/jira/browse/ARROW-8450 Project: Apache Arrow Issue Type: Improvement Components: C++, Integration Reporter: Antoine Pitrou Assignee: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8451) [Rust] [Datafusion]
Remi Dettai created ARROW-8451: -- Summary: [Rust] [Datafusion] Key: ARROW-8451 URL: https://issues.apache.org/jira/browse/ARROW-8451 Project: Apache Arrow Issue Type: Wish Components: Rust - DataFusion Reporter: Remi Dettai

Datafusion is a great example of how to use Arrow. But having Datafusion inside the Arrow project has several drawbacks:
* longer build times (Rust builds are already slow)
* more frequent updates (which create noise)
* its roadmap can be quite independent of Arrow's

What is the actual benefit of having Datafusion inside the Arrow repo? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8452) [Go][Integration] Go JSON producer generates incorrect nullable flag for nested types
Antoine Pitrou created ARROW-8452: - Summary: [Go][Integration] Go JSON producer generates incorrect nullable flag for nested types Key: ARROW-8452 URL: https://issues.apache.org/jira/browse/ARROW-8452 Project: Apache Arrow Issue Type: Bug Components: Go, Integration Reporter: Antoine Pitrou

It seems that when generating JSON integration data for a nested type, e.g. "list(int32)", the list's nullable flag is also inherited by child fields. This is wrong, because child fields have independent nullable flags, e.g. you may have:
* "list(field("ints", int32, nullable=True), nullable=True)"
* "list(field("ints", int32, nullable=False), nullable=True)"
* "list(field("ints", int32, nullable=True), nullable=False)"
* "list(field("ints", int32, nullable=False), nullable=False)"
-- This message was sent by Atlassian Jira (v8.3.4#803005)
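The correct behavior can be sketched with a toy JSON field builder (the real integration-JSON layout has more keys; the names here are hypothetical): each field's nullable flag comes from that field itself, and children carry their own independent flags rather than inheriting the parent's.

```python
def field_to_json(name, dtype, nullable, children=()):
    """Build a toy integration-JSON field description in which each
    child's nullable flag is taken from the child, not the parent.
    Hypothetical sketch, not the actual arrjson layout."""
    return {
        "name": name,
        "type": dtype,
        "nullable": nullable,  # this field's own flag
        # recurse: children keep their independent flags
        "children": [field_to_json(*child) for child in children],
    }
```

For `list(field("ints", int32, nullable=False), nullable=True)`, the parent serializes with `nullable: true` while its child serializes with `nullable: false`, which is the combination the Go producer currently collapses.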
[jira] [Created] (ARROW-8453) [Integration][Go] Recursive nested types unsupported
Antoine Pitrou created ARROW-8453: - Summary: [Integration][Go] Recursive nested types unsupported Key: ARROW-8453 URL: https://issues.apache.org/jira/browse/ARROW-8453 Project: Apache Arrow Issue Type: Bug Components: Go, Integration Reporter: Antoine Pitrou

The Go JSON integration implementation doesn't support recursive nested types, e.g. "list(list(int32))". Here is an example traceback when Go is the consumer:
{code}
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1687c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x1710
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc16858, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x838
github.com/apache/arrow/go/arrow/internal/arrjson.fieldFromJSON(0xc16860, 0xb, 0xc16858, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:309 +0xb5
github.com/apache/arrow/go/arrow/internal/arrjson.fieldsFromJSON(0xcca280, 0x4, 0x4, 0x0, 0x6f6d08, 0xc0db60)
	/arrow/go/arrow/internal/arrjson/arrjson.go:301 +0xfe
github.com/apache/arrow/go/arrow/internal/arrjson.schemaFromJSON(0xcca280, 0x4, 0x4, 0xc0db60)
	/arrow/go/arrow/internal/arrjson/arrjson.go:274 +0x3f
github.com/apache/arrow/go/arrow/internal/arrjson.NewReader(0x5b4700, 0xc0e028, 0x0, 0x0, 0x0, 0x0, 0x0, 0xd0)
	/arrow/go/arrow/internal/arrjson/reader.go:56 +0x13d
main.validate(0x7ffbc819, 0x37, 0x7ffbc857, 0x26, 0x4acf01, 0x0, 0x0)
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:181 +0x1c8
main.runCommand(0x7ffbc857, 0x26, 0x7ffbc819, 0x37, 0x7ffbc884, 0x8, 0xc16101, 0xc86260, 0x40568f)
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:65 +0x228
main.main()
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:44 +0x24e
{code}
When Go is the producer:
{code}
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1687c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x1710
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1686c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x838
github.com/apache/arrow/go/arrow/internal/arrjson.fieldFromJSON(0xc16860, 0xb, 0xc1686c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/arrow/go/arrow/internal/arrjson/arrjson.go:309 +0xb5
github.com/apache/arrow/go/arrow/internal/arrjson.fieldsFromJSON(0xcca280, 0x4, 0x4, 0x0, 0x6f6d08, 0xc0db60)
	/arrow/go/arrow/internal/arrjson/arrjson.go:301 +0xfe
github.com/apache/arrow/go/arrow/internal/arrjson.schemaFromJSON(0xcca280, 0x4, 0x4, 0xc0db60)
	/arrow/go/arrow/internal/arrjson/arrjson.go:274 +0x3f
github.com/apache/arrow/go/arrow/internal/arrjson.NewReader(0x5b4700, 0xc0e028, 0x0, 0x0, 0x0, 0x0, 0x0, 0xcc37a1760fc5b719)
	/arrow/go/arrow/internal/arrjson/reader.go:56 +0x13d
main.cnvToARROW(0x7ffbc814, 0x37, 0x7ffbc852, 0x26, 0x4acf01, 0x0, 0x0)
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:137 +0x319
main.runCommand(0x7ffbc852, 0x26, 0x7ffbc814, 0x37, 0x7ffbc87f, 0xd, 0xc16101, 0xc86260, 0x40568f)
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:63 +0x172
main.main()
	/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:44 +0x24e
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
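The fix amounts to resolving child types recursively instead of only one level deep. A minimal sketch with a toy schema dict (hypothetical structure, not the actual arrjson format):

```python
def dtype_from_json(typ):
    """Recursively resolve a nested type description such as
    list(list(int32)). Toy sketch: a type is a dict with a "name" and,
    for lists, a single-element "children" list."""
    if typ["name"] == "list":
        child, = typ["children"]          # exactly one child type
        return ("list", dtype_from_json(child))  # recurse into it
    return typ["name"]                    # primitive leaf type
```

Given `{"name": "list", "children": [{"name": "list", "children": [{"name": "int32"}]}]}`, the recursion yields `("list", ("list", "int32"))`, whereas a non-recursive resolver would fail on the inner list, much like the index-out-of-range panic above.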
[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-2
Arrow Build Report for Job nightly-2020-04-14-2

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2

Failed Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-6-amd64
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-cpp
- test-conda-cpp-hiveserver2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp-hiveserver2
- test-conda-python-3.7-hdfs-2.9.2:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-turbodbc-master
- ubuntu-focal-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-ubuntu-focal-amd64
- ubuntu-xenial-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-ubuntu-xenial-amd64
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-wheel-osx-cp36m
- wheel-win-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-appveyor-wheel-win-cp35m

Pending Tasks:
- test-debian-ruby:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-debian-ruby
- test-fedora-30-python-3:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-fedora-30-python-3

Succeeded Tasks:
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-7-amd64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-debian-buster-amd64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-debian-stretch-amd64
- gandiva-jar-xenial:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-gandiva-jar-xenial
- homebrew-cpp-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-cpp-autobrew
- homebrew-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-kartothek-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: ht
[jira] [Created] (ARROW-8454) [CI] Add 3rdparty Apache dependency tarballs to github
Krisztian Szucs created ARROW-8454: -- Summary: [CI] Add 3rdparty Apache dependency tarballs to github Key: ARROW-8454 URL: https://issues.apache.org/jira/browse/ARROW-8454 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Reporter: Krisztian Szucs Follow-up on https://github.com/apache/arrow/pull/6922#issuecomment-613527789 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8455) [Rust] Parquet Arrow column read on partially compatible files
Remi Dettai created ARROW-8455: -- Summary: [Rust] Parquet Arrow column read on partially compatible files Key: ARROW-8455 URL: https://issues.apache.org/jira/browse/ARROW-8455 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 0.15.1 Reporter: Remi Dettai Seen behavior: when reading a Parquet file into Arrow with `get_record_reader_by_columns`, the call fails if any column of the file is a list (or any other unsupported type). Expected behavior: it should only fail if you are actually reading a column with an unsupported type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
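The expected behavior amounts to validating only the projected columns rather than the whole file schema. A hypothetical sketch, with `schema` as a plain name-to-type mapping and "list" standing in for a type the reader cannot handle:

```python
def project_columns(schema, requested):
    """Return the types of the requested columns, raising only when a
    *requested* column has an unsupported type. Sketch of the expected
    lazy validation, not the actual Rust reader."""
    unsupported = {"list"}  # stand-in for types the reader cannot handle
    for name in requested:
        if schema[name] in unsupported:
            raise ValueError(
                "column {!r} has unsupported type {!r}".format(name, schema[name]))
    return {name: schema[name] for name in requested}
```

With `schema = {"a": "int32", "b": "list"}`, projecting `["a"]` succeeds even though the file contains an unsupported list column; only projecting `["a", "b"]` should raise.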
[jira] [Created] (ARROW-8456) [Release] Add python script to help curating JIRA
Krisztian Szucs created ARROW-8456: -- Summary: [Release] Add python script to help curating JIRA Key: ARROW-8456 URL: https://issues.apache.org/jira/browse/ARROW-8456 Project: Apache Arrow Issue Type: Task Components: Developer Tools Reporter: Krisztian Szucs Fix For: 1.0.0

The following script produces reports like https://gist.github.com/kszucs/9857ef69c92a230ce5a5068551b83ed8

{code:python}
from jira import JIRA
import re  # required by Patch._parse; missing from the original snippet
import warnings
import pygit2
import pandas as pd
from io import StringIO


class Patch:

    def __init__(self, commit):
        self.commit = commit
        self.issue_key, self.msg = self._parse(commit.message)

    def _parse(self, message):
        first_line = message.splitlines()[0]
        m = re.match(r"(?P<ticket>(ARROW|PARQUET)-\d+):?(?P<msg>.*)", first_line)
        if m is None:
            return None, ''
        values = m.groupdict()
        return values['ticket'], values['msg']

    @property
    def shortmessage(self):
        if not self.msg:
            return self.commit.message.splitlines()[0]
        else:
            return self.msg

    @property
    def sha(self):
        return self.commit.id

    @property
    def issue_url(self):
        return 'https://issues.apache.org/jira/browse/{}'.format(self.issue_key)

    @property
    def commit_url(self):
        return 'https://github.com/apache/arrow/commit/{}'.format(self.sha)

    def to_markdown(self):
        if self.issue_key is None:
            return "[{}]({})\n".format(self.shortmessage, self.commit_url)
        else:
            return "[{}]({}): [{}]({})\n".format(
                self.issue_key, self.issue_url, self.shortmessage,
                self.commit_url
            )


JIRA_SEARCH_LIMIT = 1
# JIRA_SEARCH_LIMIT = 50


class Release:
    """Release object for querying issues and commits

    Usage:

        jira = JIRA(
            {'server': 'https://issues.apache.org/jira'},
            basic_auth=(user, password)
        )
        repo = pygit2.Repository('path/to/arrow/repo')
        release = Release(jira, repo, '0.15.1', '0.15.0')

        # show the commits in application order
        for commit in release.commits():
            print(commit.oid)

        # cherry-pick the patches to a branch
        release.apply_patches_to('a-branch')
    """

    def __init__(self, jira, repo, version, previous_version):
        self.jira = jira
        self.repo = repo
        self.version = version
        self.previous_version = previous_version
        self._issues = None
        self._patches = None

    def _tag(self, version):
        return self.repo.revparse_single(f'refs/tags/apache-arrow-{version}')

    def issues(self):
        # FIXME(kszucs): paginate instead of maxresults
        if self._issues is None:
            query = f'project=ARROW AND fixVersion={self.version}'
            self._issues = self.jira.search_issues(query,
                                                   maxResults=JIRA_SEARCH_LIMIT)
        return self._issues

    def patches(self):
        """Commits belonging to release applied on master branch

        The returned commits' order corresponds to the output of git log.
        """
        if self._patches is None:
            previous_tag = self._tag(self.previous_version)
            master = self.repo.branches['master']
            ordering = pygit2.GIT_SORT_TOPOLOGICAL | pygit2.GIT_SORT_REVERSE
            walker = self.repo.walk(master.target, ordering)
            walker.hide(previous_tag.oid)
            self._patches = list(map(Patch, walker))
        return self._patches

    def curate(self):
        issues = self.issues()
        patches = self.patches()
        issue_keys = {issue.key for issue in self.issues()}

        within, outside, nojira = [], [], []
        for p in patches:
            if p.issue_key is None:
                nojira.append(p)
            elif p.issue_key in issue_keys:
                within.append(p)
                issue_keys.remove(p.issue_key)
            else:
                outside.append(p)

        # remaining jira tickets
        nopatch = list(issue_keys)

        return within, outside, nojira, nopatch

    def curation_report(self):
        out = StringIO()
        out.write('Total number of JIRA tickets assigned to version {}: {}\n'
                  .format(self.version, len(self.issues())))
        out.write('\n')
        out.write('Total number of applied patches since {}: {}\n'
                  .format(self.previous_version, len(self.patches())))
        out.write('\n\n')

        within, outside, nojira, nopatch = self.curate()

        out.write('Patches with assigned issu
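As a quick sanity check, the ticket-parsing regex the script relies on can be exercised in isolation; the group names `ticket` and `msg` are inferred from the `values['ticket']` / `values['msg']` lookups in `Patch._parse`:

```python
import re

# Split "ARROW-1234: message" commit titles into (ticket, message);
# titles without a JIRA key (e.g. "MINOR: ...") do not match at all.
pattern = re.compile(r"(?P<ticket>(ARROW|PARQUET)-\d+):?(?P<msg>.*)")

m = pattern.match("ARROW-8456: [Release] Add python script")
assert m.group('ticket') == 'ARROW-8456'
assert m.group('msg').strip() == '[Release] Add python script'

# No ARROW/PARQUET key at the start -> no match, parsed as (None, '')
assert pattern.match("MINOR: fix typo") is None
```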
Follow up on ARROW-8451, datafusion part of Arrow
This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451. First, thanks for your answer! It's true that I was also surprised to see all implementations of Arrow mixed up in a single repository! I was really considering the separation of the repositories as a means to separate concerns. I am not 100% sure I understand how it would fragment the community, but I think I get the point, even though I still believe that it comes at the cost of extra complexity. As for the legal protection, I did not take that aspect into consideration, and I find it very interesting! What is the PMC exactly, and why would Datafusion be more exposed in a separate repository?
[jira] [Created] (ARROW-8457) [C++] bridge test does not take care of endianness
Kazuaki Ishizaki created ARROW-8457: --- Summary: [C++] bridge test does not take care of endianness Key: ARROW-8457 URL: https://issues.apache.org/jira/browse/ARROW-8457 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kazuaki Ishizaki According to the [specification|https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst] for ArrowSchema, the memory format uses the native endianness of the CPU. However, the test cases assume only little endian. -- This message was sent by Atlassian Jira (v8.3.4#803005)
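The failure mode can be illustrated with a small Python sketch (not part of the C++ test suite): a test that hard-codes the little-endian byte layout agrees with native-order packing only on little-endian hosts, which is exactly the assumption the ticket calls out.

```python
import struct
import sys

# '=' packs in native byte order (what the C data interface specifies);
# '<' forces little-endian (what the current tests implicitly assume).
native = struct.pack('=i', 1)
little = struct.pack('<i', 1)

# The two layouts coincide only on little-endian hosts, so a
# hard-coded little-endian expectation breaks on big-endian CPUs.
assert (native == little) == (sys.byteorder == 'little')
```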
[jira] [Created] (ARROW-8458) [C++] Prefer the original mirrors for the bundled thirdparty dependencies
Krisztian Szucs created ARROW-8458: -- Summary: [C++] Prefer the original mirrors for the bundled thirdparty dependencies Key: ARROW-8458 URL: https://issues.apache.org/jira/browse/ARROW-8458 Project: Apache Arrow Issue Type: Task Components: C++, Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8459) [Dev][Archery] Use a more recent cmake-format
Krisztian Szucs created ARROW-8459: -- Summary: [Dev][Archery] Use a more recent cmake-format Key: ARROW-8459 URL: https://issues.apache.org/jira/browse/ARROW-8459 Project: Apache Arrow Issue Type: Task Components: Developer Tools Reporter: Krisztian Szucs Fix For: 1.0.0 Reading through the cmake-format releases page, it seems to contain improvements. Additionally, we should check cmake-format's version in run-cmake-format.py to ensure unified behaviour both locally and on CI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
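A minimal sketch of the kind of version check the ticket proposes for run-cmake-format.py; the pinned version string and function name here are hypothetical, for illustration only:

```python
# Hypothetical pin; a run-cmake-format.py-style script would compare the
# tool's reported version against this before formatting, so local runs
# and CI behave identically.
EXPECTED_CMAKE_FORMAT_VERSION = "0.6.10"


def version_matches(reported, expected=EXPECTED_CMAKE_FORMAT_VERSION):
    """Return True when the reported version string equals the pin."""
    return reported.strip() == expected


assert version_matches("0.6.10\n")      # trailing newline from CLI output
assert not version_matches("0.5.2")     # mismatched version is rejected
```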
[jira] [Created] (ARROW-8460) [Packaging][deb] Ubuntu Focal build is failed
Kouhei Sutou created ARROW-8460: --- Summary: [Packaging][deb] Ubuntu Focal build is failed Key: ARROW-8460 URL: https://issues.apache.org/jira/browse/ARROW-8460 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou Fix For: 0.17.0 It seems that this is a "no disk space" error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8461) [Packaging][deb] Use zstd package for Ubuntu Xenial
Kouhei Sutou created ARROW-8461: --- Summary: [Packaging][deb] Use zstd package for Ubuntu Xenial Key: ARROW-8461 URL: https://issues.apache.org/jira/browse/ARROW-8461 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Follow up on ARROW-8451, datafusion part of Arrow
hi Remi, It's no problem, it's a common question we get. Some developers believe as a matter of principle that large projects should be broken up into many smaller repositories. Arrow is different from many open source projects. Maintaining protocol-level interoperability (although note that Rust does not yet participate in the integration tests) has been a great deal of effort, and the community has felt that trying to coordinate changes that impact interoperability is substantially simpler in a monorepo arrangement on GitHub. This way we always know with relative certainty whether any pull request may break interoperability between one component and another. It's very easy to get into a situation where you have a mess of cross-repository (or even circular) build and runtime dependencies -- the monorepo makes all of this pain go away. If you have a change that affects multiple repositories, CI tools don't make it easy to test those PRs together; generally you'll just see that a PR on one repo is breaking against the master of the other repository. In some cases, components may not have integrations with other languages, but that may not always be the case in the future. We have just developed the C interface, for example, which would enable DataFusion to be built as a shared library and imported in Python (if someone wanted to do that). Another dimension is that all of the PLs and components have benefited greatly from the community's investment in CI and packaging infrastructure. I also believe that the project's common PR queue helps create a sense of community awareness and solidarity amongst project contributors. If Rust were working off in their own corner of GitHub, I think it would be easy for people who are not working on Rust to ignore them. I think the net result of the way that we currently operate is that we're producing higher quality software and have a healthier community than we would otherwise with a more fragmented approach. 
Lastly, the shared release cycle creates social pressure to get patches finished and merged. Anecdotally this seems to be effective. On the governance questions, see the roles section on https://www.apache.org/foundation/how-it-works.html#roles If a part of apache/arrow truly believed that they were being hindered by being a part of monorepo, we could create a new repository under apache/ on GitHub for the part that wants to split into a standalone GitHub repository. That wouldn't change the governance of that code. - Wes On Tue, Apr 14, 2020 at 1:26 PM Rémi Dettai wrote: > > This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451. > > First thanks for your answer! > > It's true that I was also surprised to see all implementations of Arrow > mixed up in a single repository! > > I was really considering the separation of the repositories as a mean to > separate concerns. I am not 100% sure to understand how it would fragment > the community but I think I get the point, even though I still believe that > it is at the cost of extra complexity. > > As for the legal protection, I did not take that aspect into consideration, > and I find it very interesting! What is the PMC exactly and why would > Datafusion be more exposed in a separate repository?
[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows
Tom Augspurger created ARROW-8462: - Summary: Crash in lib.concat_tables on Windows Key: ARROW-8462 URL: https://issues.apache.org/jira/browse/ARROW-8462 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Tom Augspurger

This crashes for me with pyarrow 0.16 on my Windows VM

{{
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])
print('done')
}}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug info on Windows, unfortunately. With `python -X faulthandler` I see

{{
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in (module)
}}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8463) [CI] Balance the nightly test builds between CircleCI, Azure and Github
Krisztian Szucs created ARROW-8463: -- Summary: [CI] Balance the nightly test builds between CircleCI, Azure and Github Key: ARROW-8463 URL: https://issues.apache.org/jira/browse/ARROW-8463 Project: Apache Arrow Issue Type: Task Components: Continuous Integration Reporter: Krisztian Szucs Most of our nightly docker builds are running on CircleCI, and builds are queuing there, so we should try to offload some of them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8464) [Rust] [DataFusion] Add support for dictionary types
Andy Grove created ARROW-8464: - Summary: [Rust] [DataFusion] Add support for dictionary types Key: ARROW-8464 URL: https://issues.apache.org/jira/browse/ARROW-8464 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove

* BatchIterator should accept both DictionaryBatch and RecordBatch
* Type Coercion optimizer rule should inject expressions for converting dictionary value types to index types (for equality expressions, and IN(values, ...))
* Physical expressions would look up the index for dictionary values referenced in the query so that at runtime, only indices are compared per batch

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-3
Arrow Build Report for Job nightly-2020-04-14-3 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3 Failed Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-6-amd64 - homebrew-cpp-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-cpp-autobrew - test-conda-cpp-hiveserver2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp-hiveserver2 - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-hdfs-2.9.2 - ubuntu-focal-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-ubuntu-focal-amd64 - ubuntu-xenial-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-ubuntu-xenial-amd64 - wheel-manylinux2014-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-wheel-manylinux2014-cp36m - wheel-win-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-appveyor-wheel-win-cp35m Pending Tasks: - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-spark-master Succeeded Tasks: - centos-7-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-7-amd64 - centos-8-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-8-amd64 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py38 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py38 - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py38 - debian-buster-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-debian-buster-amd64 - debian-stretch-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-debian-stretch-amd64 - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-gandiva-jar-osx - gandiva-jar-xenial: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-gandiva-jar-xenial - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-cpp - homebrew-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-r-autobrew - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp-valgrind - test-conda-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp - test-conda-python-3.6: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-kartothek-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-kartothek-latest - test-conda-python-3.7-kartothek-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-kartothek-master - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-
[jira] [Created] (ARROW-8465) [Packaging][Python] Windows py35 wheel build fails because of boost
Krisztian Szucs created ARROW-8465: -- Summary: [Packaging][Python] Windows py35 wheel build fails because of boost Key: ARROW-8465 URL: https://issues.apache.org/jira/browse/ARROW-8465 Project: Apache Arrow Issue Type: Bug Components: Packaging, Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.17.0 See build log https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-appveyor-wheel-win-cp35m -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8466) [Packaging] The python unittests are not running in the windows wheel builds
Krisztian Szucs created ARROW-8466: -- Summary: [Packaging] The python unittests are not running in the windows wheel builds Key: ARROW-8466 URL: https://issues.apache.org/jira/browse/ARROW-8466 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Krisztian Szucs The Appveyor log swallows the reason why those tests are not running. This requires investigation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)
Hi Wes, Yes, I'm making progress, and at this point I anticipate being able to finish it off by next release, possibly without support for round tripping fixed size lists. I've been spending some time thinking about different approaches and have started coding some of the building blocks, which I think in the common case (relatively low nesting levels) should be fairly performant (I'm also going to write some benchmarks to sanity check this). One caveat to this is my schedule is going to change slightly next week and it's possible my bandwidth might be more limited; I'll update the list if this happens. I think there are at least two areas that I'm not working on that could be parallelized if you or your team has bandwidth. 1. It would be good to have some parquet files representing real world datasets available to benchmark against. 2. The higher level bookkeeping of tracking which def-levels/rep-levels are needed to compare against for any particular column (i.e. preceding repeated parent). I'm currently working on the code that takes these and converts them to offsets/null fields. I can go into more details if you or your team would like to collaborate. Thanks, Micah On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney wrote: > hi Micah, > > I'm glad that we have the write side of nested completed for 0.17.0. > > As far as completing the read side and then implementing sufficient > testing to exercise corner cases in end-to-end reads/writes, do you > anticipate being able to work on this in the next 4-6 weeks (obviously > the state of the world has affected everyone's availability / > bandwidth)? I ask because someone from my team (or me also) may be > able to get involved and help this move along. It'd be great to have > this 100% completed and checked off our list for the next release > (i.e. 
0.18.0 or 1.0.0 depending on whether the Java/C++ integration > tests get completed also) > > thanks > Wes > > On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield > wrote: > >> > >> Glad to hear about the progress. As I mentioned on #2, what do you > >> think about setting up a feature branch for you to merge PRs into? > >> Then the branch can be iterated on and we can merge it back when it's > >> feature complete and does not have perf regressions for the flat > >> read/write path. > >> > > I'd like to avoid a separate branch if possible. I'm willing to close > the open PR till I'm sure it is needed but I'm hoping keeping PRs as small > focused as possible with performance testing a long the way will be a > better reviewer and developer experience here. > > > >> The earliest I'd have time to work on this myself would likely be > >> sometime in March. Others are welcome to jump in as well (and it'd be > >> great to increase the overall level of knowledge of the Parquet > >> codebase) > > > > Hopefully, Igor can help out otherwise I'll take up the read path after > I finish the write path. > > > > -Micah > > > > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney wrote: > >> > >> hi Micah > >> > >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield > wrote: > >> > > >> > Just to give an update. I've been a little bit delayed, but my > progress is > >> > as follows: > >> > 1. Had 1 PR merged that will exercise basic end-to-end tests. > >> > 2. Have another PR open that allows a configuration option in C++ to > >> > determine which algorithm version to use for reading/writing, the > existing > >> > version and the new version supported complex-nested arrays. I think > a > >> > large amount of code will be reused/delegated to but I will err on > the side > >> > of not touching the existing code/algorithms so that any errors in the > >> > implementation or performance regressions can hopefully be mitigated > at > >> > runtime. 
I expect in later releases (once the code has "baked") will > >> > become a no-op. > >> > >> Glad to hear about the progress. As I mentioned on #2, what do you > >> think about setting up a feature branch for you to merge PRs into? > >> Then the branch can be iterated on and we can merge it back when it's > >> feature complete and does not have perf regressions for the flat > >> read/write path. > >> > >> > 3. Started coding the write path. > >> > > >> > Which leaves: > >> > 1. Finishing the write path (I estimate 2-3 weeks) to be code > complete > >> > 2. Implementing the read path. > >> > >> The earliest I'd have time to work on this myself would likely be > >> sometime in March. Others are welcome to jump in as well (and it'd be > >> great to increase the overall level of knowledge of the Parquet > >> codebase) > >> > >> > Again, I'm happy to collaborate if people have bandwidth and want to > >> > contribute. > >> > > >> > Thanks, > >> > Micah > >> > > >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield < > emkornfi...@gmail.com> > >> > wrote: > >> > > >> > > Hi Wes, > >> > > I'm still interested in doing the work. But don't to hold anybody > up if > >> > > they have bandwidth. > >> > > > >>
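The def-level-to-nulls conversion Micah describes can be sketched in miniature. This is a toy flat-column case, not the Arrow C++ implementation: with a maximum definition level of 1, a def-level of 1 marks a present value and 0 a null, and reassembly interleaves the dense value stream with nulls accordingly.

```python
# Toy Dremel-style reassembly for a flat nullable column (max def level 1).
# Real nested data adds repetition levels and per-level offset building.
def_levels = [1, 0, 1, 1, 0]
values = [10, 20, 30]  # only the defined slots carry values

it = iter(values)
column = [next(it) if d == 1 else None for d in def_levels]

assert column == [10, None, 20, 30, None]
```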
Re: Follow up on ARROW-8451, datafusion part of Arrow
Hi Wes ! Thanks for your reply, all much clearer now. I guess it is just a question of getting used to it :-) Remi Le mar. 14 avr. 2020 à 22:54, Wes McKinney a écrit : > hi Remi, > > It's no problem, it's a common question we get. Some developers > believe as a matter of principle that large projects should be broken > up into many smaller repositories. > > Arrow is a different than many open source projects. Maintaining > protocol-level interoperability (although note that Rust does not yet > participate in the integration tests) has been a great deal of effort, > and the community has felt that trying to coordinate changes that > impact interoperability is substantially simpler in a monorepo > arrangement on GitHub. That we always know with relative certainty > whether any pull request may break interoperability between one > component and another. It's very easy to get into a situation where > you have a mess of cross-repository (or even circular) build and > runtime dependencies -- the monorepo makes all of this pain go away. > If you have a change that affects multiple repositories, CI tools > don't make it easy to test those PRs together, generally you'll just > see that a PR on one repo is breaking against the master of the other > repository. > > In some cases, components may not have integrations with other > languages but that may not always be the case in the future. We have > just developed the C interface, for example, which would enable > DataFusion to be built as a shared library and imported in Python (if > someone wanted to do that). > > Another dimension is that all of the PLs and components have benefited > greatly from the community's investment in CI and packaging > infrastructure. > > I also believe that the project's common PR queue helps create a sense > of community awareness and solidarity amongst projects contributors. 
> If Rust were working off in their own corner of GitHub, I think it > would be easy for people who are not working on Rust to ignore them. I > think the net result of the way that we currently operate is that > we're producing higher quality software and have a healthier community > than we would otherwise with a more fragmented approach. > > Lastly, the shared release cycle creates social pressure to get > patches finished and merged. Anecdotally this seems to be effective. > > On the governance questions, see the roles section on > https://www.apache.org/foundation/how-it-works.html#roles > > If a part of apache/arrow truly believed that they were being hindered > by being a part of monorepo, we could create a new repository under > apache/ on GitHub for the part that wants to split into a standalone > GitHub repository. That wouldn't change the governance of that code. > > - Wes > > On Tue, Apr 14, 2020 at 1:26 PM Rémi Dettai wrote: > > > > This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451. > > > > First thanks for your answer! > > > > It's true that I was also surprised to see all implementations of Arrow > > mixed up in a single repository! > > > > I was really considering the separation of the repositories as a mean to > > separate concerns. I am not 100% sure to understand how it would fragment > > the community but I think I get the point, even though I still believe > that > > it is at the cost of extra complexity. > > > > As for the legal protection, I did not take that aspect into > consideration, > > and I find it very interesting! What is the PMC exactly and why would > > Datafusion be more exposed in a separate repository? >