[jira] [Created] (ARROW-6393) [C++] Add EqualOptions support in SparseTensor::Equals
Kenta Murata created ARROW-6393: --- Summary: [C++] Add EqualOptions support in SparseTensor::Equals Key: ARROW-6393 URL: https://issues.apache.org/jira/browse/ARROW-6393 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kenta Murata Assignee: Kenta Murata SparseTensor::Equals should take an EqualOptions argument, as Tensor::Equals does. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6392) [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor is return value validated
Wes McKinney created ARROW-6392: --- Summary: [Python][Flight] list_actions Server RPC is not tested in test_flight.py, nor is return value validated Key: ARROW-6392 URL: https://issues.apache.org/jira/browse/ARROW-6392 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Python Reporter: Wes McKinney Fix For: 0.15.0 This server method is implemented and part of the Python server vtable, but it is not tested. If you mistakenly return a "string" action type, it will pass silently. We might want to constrain the output to be ActionType or a tuple -- This message was sent by Atlassian Jira (v8.3.2#803003)
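The suggested constraint could look like the following sketch in plain Python (ActionType here is a stand-in namedtuple for illustration, not the pyarrow.flight class):

```python
from collections import namedtuple

# Stand-in for flight.ActionType: a type name plus a
# human-readable description.
ActionType = namedtuple("ActionType", ["type", "description"])

def normalize_action_type(item):
    """Constrain list_actions results to ActionType or a
    (type, description) tuple, so a bare string fails loudly
    instead of passing silently."""
    if isinstance(item, ActionType):
        return item
    if isinstance(item, tuple) and len(item) == 2:
        return ActionType(*item)
    raise TypeError(
        "list_actions must yield ActionType or (type, description) "
        "tuples, got {!r}".format(item))
```

A server implementation could run each yielded value through such a normalizer before serializing, which would have caught the silent-pass case described above.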
[jira] [Created] (ARROW-6391) [Python][Flight] Add built-in methods on FlightServerBase to start server and wait for it to be available
Wes McKinney created ARROW-6391: --- Summary: [Python][Flight] Add built-in methods on FlightServerBase to start server and wait for it to be available Key: ARROW-6391 URL: https://issues.apache.org/jira/browse/ARROW-6391 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Python Reporter: Wes McKinney Fix For: 0.15.0 It seems like this logic could be a part of the library / made general purpose to make it more convenient to spawn servers in Python https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_flight.py#L414 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
Wes McKinney created ARROW-6390: --- Summary: [Python][Flight] Add Python documentation / tutorial for Flight Key: ARROW-6390 URL: https://issues.apache.org/jira/browse/ARROW-6390 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Python Reporter: Wes McKinney Fix For: 0.15.0 There is no Sphinx documentation for using Flight from Python. I have found that writing documentation is an effective way to uncover usability problems -- I would suggest we write comprehensive documentation for using Flight from Python as a way to refine the public Python API -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
Ben Schreck created ARROW-6389: -- Summary: java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] Key: ARROW-6389 URL: https://issues.apache.org/jira/browse/ARROW-6389 Project: Apache Arrow Issue Type: Bug Components: Java, Python Affects Versions: 0.14.1 Environment: Hadoop 2.8.5 EMR 5.24.1 python version: 3.7.4 skein version: 0.8.0 Reporter: Ben Schreck

I can't access HDFS through pyarrow (from inside a YARN container created by skein). This code works in a Jupyter notebook running on the master node, or in an ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:

{code:python}
import pyarrow
pyarrow.hdfs.connect()
{code}

However, when running on YARN by submitting the following skein application, I get a Java error.

{code:yaml}
name: test_conn
queue: default
master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"
{code}

FYI I tried with/without all those extra env vars, to no effect. I also tried modifying the EMR cluster with any of the following:

{code}
"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
{code}

The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: it was able to find which class by name to use for the "hdfs://" prefix, namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but not able to find that class.
Logs:

{code}
= LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 + 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
        at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
        at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
        at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
        at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
  File "", line 1, in
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log

LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59
{code}
Re: [DISCUSS] Ternary logic
Indeed it's not about sanitizing nulls; it's about how nulls should interact with boolean (and other) expressions. For purposes of discussion, I'm naming the current approach of propagating null "NaN logic" (since all expressions involving NaN evaluate to NaN).

To give some context for this discussion, I'm currently working on support for filter expressions (ARROW-6243). As an example of when this would come into play, let there be a dataset spanning several files. The older files have an IPV4 column while the newer files have IPV6 as well. With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields null for all of the older files since they lack an IPV6 column (regardless of their IPV4 column), which seems undesirable.

Could you explain what you mean by "safest"?

Under NaN logic, the Kleene result can be recovered with (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false)). Under Kleene logic, the NaN result can be recovered with (case IPV4 is null or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end). I don't think we're losing information either way. I'm not attached to either system, but I'd like to understand and document the rationale behind our choice.

On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou wrote:
>
> IIUC it's not about sanitizing to false. Ben explained it in more
> detail in private to me, perhaps he wants to copy that explanation here ;-)
>
> Regards
>
> Antoine.
>
> On 29/08/2019 19:05, Wes McKinney wrote:
> > hi Ben,
> >
> > My instinct is that always propagating null (at least by default) is
> > the safest choice. Applications can choose to sanitize null to false
> > if that's what they want semantically.
> >
> > - Wes
> >
> > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman wrote:
> >>
> >> To my knowledge, there isn't explicit documentation on how null slots in an
> >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> >> explicitly an unknown rather than a special value. This yields for example
> >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> >> possible values of x. This is also the behavior of Gandiva's boolean
> >> expressions.
> >>
> >> By contrast the boolean kernels implement something closer to the behavior
> >> of NaN: `(null AND false) -> null`. I think this is simply an error in the
> >> boolean kernels but in any case I think explicit documentation should be
> >> added to prevent future confusion.
> >>
> >> https://issues.apache.org/jira/browse/ARROW-6386
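The two systems under discussion can be sketched in plain Python, with None standing in for an Arrow null (the helper names are mine, not Arrow APIs):

```python
def kleene_and(a, b):
    # Kleene (SQL) logic: null is an unknown truth value, so
    # (x AND False) is False for every possible x.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b

def kleene_or(a, b):
    # Dually, (x OR True) is True for every possible x.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return a or b

def nan_and(a, b):
    # "NaN logic": any null operand propagates null.
    return None if (a is None or b is None) else (a and b)

def nan_or(a, b):
    return None if (a is None or b is None) else (a or b)

def coalesce(x, default):
    return default if x is None else x

# The example from the thread:
assert kleene_and(None, False) is False
assert nan_and(None, False) is None

# The coalesce recovery from the thread: under NaN logic, wrapping
# each operand in coalesce(_, False) gives the Kleene result for
# filtering purposes (null rows treated as excluded).
assert kleene_or(None, True) is True
assert nan_or(None, True) is None
assert nan_or(coalesce(None, False), coalesce(True, False)) is True
```

Note the coalesce rewrite agrees with Kleene wherever Kleene yields True or False; it maps Kleene's null to False, which is exactly the "null means exclude the row" filtering behavior described above.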
Re: [DISCUSS] Ternary logic
IIUC it's not about sanitizing to false. Ben explained it in more detail in private to me, perhaps he wants to copy that explanation here ;-)

Regards

Antoine.

On 29/08/2019 19:05, Wes McKinney wrote:
> hi Ben,
>
> My instinct is that always propagating null (at least by default) is
> the safest choice. Applications can choose to sanitize null to false
> if that's what they want semantically.
>
> - Wes
>
> On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman wrote:
>>
>> To my knowledge, there isn't explicit documentation on how null slots in an
>> array should be interpreted. SQL uses Kleene logic, wherein a null is
>> explicitly an unknown rather than a special value. This yields for example
>> `(null AND false) -> false`, since `(x AND false) -> false` for all
>> possible values of x. This is also the behavior of Gandiva's boolean
>> expressions.
>>
>> By contrast the boolean kernels implement something closer to the behavior
>> of NaN: `(null AND false) -> null`. I think this is simply an error in the
>> boolean kernels but in any case I think explicit documentation should be
>> added to prevent future confusion.
>>
>> https://issues.apache.org/jira/browse/ARROW-6386
Re: [DISCUSS] Ternary logic
hi Ben,

My instinct is that always propagating null (at least by default) is the safest choice. Applications can choose to sanitize null to false if that's what they want semantically.

- Wes

On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman wrote:
>
> To my knowledge, there isn't explicit documentation on how null slots in an
> array should be interpreted. SQL uses Kleene logic, wherein a null is
> explicitly an unknown rather than a special value. This yields for example
> `(null AND false) -> false`, since `(x AND false) -> false` for all
> possible values of x. This is also the behavior of Gandiva's boolean
> expressions.
>
> By contrast the boolean kernels implement something closer to the behavior
> of NaN: `(null AND false) -> null`. I think this is simply an error in the
> boolean kernels but in any case I think explicit documentation should be
> added to prevent future confusion.
>
> https://issues.apache.org/jira/browse/ARROW-6386
[jira] [Created] (ARROW-6388) [C++] Consider implementing BufferOutputStream using BufferBuilder internally
Wes McKinney created ARROW-6388: --- Summary: [C++] Consider implementing BufferOutputStream using BufferBuilder internally Key: ARROW-6388 URL: https://issues.apache.org/jira/browse/ARROW-6388 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney See discussion in ARROW-6381 https://github.com/apache/arrow/pull/5222 -- This message was sent by Atlassian Jira (v8.3.2#803003)
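The refactoring being considered is essentially delegation to a growable builder; the shape of it, as a toy Python sketch (not the Arrow C++ API):

```python
class BufferBuilder:
    """Toy growable byte buffer; appends are amortized O(1)."""
    def __init__(self):
        self._data = bytearray()

    def append(self, data: bytes):
        self._data += data

    def length(self) -> int:
        return len(self._data)

    def finish(self) -> bytes:
        return bytes(self._data)


class BufferOutputStream:
    """Output stream implemented on top of BufferBuilder, mirroring
    the proposal: all capacity/growth logic lives in one place
    instead of being duplicated in the stream."""
    def __init__(self):
        self._builder = BufferBuilder()

    def write(self, data: bytes):
        self._builder.append(data)

    def tell(self) -> int:
        return self._builder.length()

    def finish(self) -> bytes:
        return self._builder.finish()
```

The design point is that the stream becomes a thin adapter, so any improvement to the builder's growth strategy benefits the stream for free.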
[jira] [Created] (ARROW-6387) [Archery] Errors with make
Omer Ozarslan created ARROW-6387: Summary: [Archery] Errors with make Key: ARROW-6387 URL: https://issues.apache.org/jira/browse/ARROW-6387 Project: Apache Arrow Issue Type: Bug Reporter: Omer Ozarslan

{{archery --debug benchmark run}} gives an error on Debian 10, CMake 3.13.4, GNU make 4.2.1:

{code:java}
(.venv) omer@omer ~/src/ext/arrow/cpp/build master ● archery --debug benchmark run
DEBUG:archery:Running benchmark WORKSPACE
DEBUG:archery:Executing `['/usr/bin/cmake', '-GMake', '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON', '-DCMAKE_BUILD_TYPE=release', '-DBUILD_WARNING_LEVEL=production', '-DARROW_BUILD_TESTS=ON', '-DARROW_BUILD_BENCHMARKS=ON', '-DARROW_PYTHON=OFF', '-DARROW_PARQUET=OFF', '-DARROW_GANDIVA=OFF', '-DARROW_PLASMA=OFF', '-DARROW_FLIGHT=OFF', '/home/omer/src/ext/arrow/cpp']`
CMake Error: Could not create named generator Make

Generators
  Unix Makefiles                  = Generates standard UNIX makefiles.
  Ninja                           = Generates build.ninja files.
  Watcom WMake                    = Generates Watcom WMake makefiles.
  CodeBlocks - Ninja              = Generates CodeBlocks project files.
  CodeBlocks - Unix Makefiles     = Generates CodeBlocks project files.
  CodeLite - Ninja                = Generates CodeLite project files.
  CodeLite - Unix Makefiles       = Generates CodeLite project files.
  Sublime Text 2 - Ninja          = Generates Sublime Text 2 project files.
  Sublime Text 2 - Unix Makefiles = Generates Sublime Text 2 project files.
  Kate - Ninja                    = Generates Kate project files.
  Kate - Unix Makefiles           = Generates Kate project files.
  Eclipse CDT4 - Ninja            = Generates Eclipse CDT 4.0 project files.
  Eclipse CDT4 - Unix Makefiles   = Generates Eclipse CDT 4.0 project files.
Traceback (most recent call last):
[[[cropped]]]{code}

After a trivial fix:

{code:java}
diff --git a/dev/archery/archery/utils/cmake.py b/dev/archery/archery/utils/cmake.py
index 38aedab2d..3150ea9a6 100644
--- a/dev/archery/archery/utils/cmake.py
+++ b/dev/archery/archery/utils/cmake.py
@@ -34,7 +34,7 @@ class CMake(Command):
         in the search path.
         """
         found_ninja = which("ninja")
-        return "Ninja" if found_ninja else "Make"
+        return "Ninja" if found_ninja else "Unix Makefiles"{code}

I get another error:

{code:java}
[[[cropped]]
-- Generating done
-- Build files have been written to: /tmp/arrow-bench-48x_yleb/WORKSPACE/build
DEBUG:archery:Executing `[None]`
Traceback (most recent call last):
  File "/home/omer/src/ext/arrow/.venv/bin/archery", line 11, in
    load_entry_point('archery', 'console_scripts', 'archery')()
  File "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/omer/src/ext/arrow/.venv/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return
[jira] [Created] (ARROW-6386) [C++][Documentation] Explicit documentation of null slot interpretation
Benjamin Kietzman created ARROW-6386: Summary: [C++][Documentation] Explicit documentation of null slot interpretation Key: ARROW-6386 URL: https://issues.apache.org/jira/browse/ARROW-6386 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman To my knowledge, there isn't explicit documentation on how null slots in an array should be interpreted. SQL uses Kleene logic, wherein a null is explicitly an unknown rather than a special value. This yields for example `(null AND false) -> false`, since `(x AND false) -> false` for all possible values of x. This is also the behavior of Gandiva's boolean expressions. By contrast the boolean kernels implement something closer to the behavior of NaN: `(null AND false) -> null`. I think this is simply an error in the boolean kernels but in any case I think explicit documentation should be added to prevent future confusion. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[DISCUSS] Ternary logic
To my knowledge, there isn't explicit documentation on how null slots in an array should be interpreted. SQL uses Kleene logic, wherein a null is explicitly an unknown rather than a special value. This yields for example `(null AND false) -> false`, since `(x AND false) -> false` for all possible values of x. This is also the behavior of Gandiva's boolean expressions. By contrast the boolean kernels implement something closer to the behavior of NaN: `(null AND false) -> null`. I think this is simply an error in the boolean kernels but in any case I think explicit documentation should be added to prevent future confusion. https://issues.apache.org/jira/browse/ARROW-6386
Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
Here is the Java implementation https://github.com/apache/arrow/pull/5229 cc @Wes McKinney @emkornfield Thanks, Ji Liu -- From: Ji Liu Send Time: Wednesday, August 28, 2019 17:34 To: emkornfield ; dev Cc: Paul Taylor Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote) I could take the Java implementation and will keep a close watch on this issue in the next few days. Thanks, Ji Liu -- From: Micah Kornfield Send Time: Wednesday, August 28, 2019 17:14 To: dev Cc: Paul Taylor Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote) I should have integration tests with 0.14.1 generated binaries in the next few days. The Java implementation is, I think, the one remaining unassigned piece of work; I can take that up next if no one else gets to it. On Tue, Aug 27, 2019 at 7:19 PM Wes McKinney wrote: > Here are the C++ changes > > https://github.com/apache/arrow/pull/5211 > > I'm going to create an integration branch where we can merge each patch > before merging to master > > On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney wrote: > > > > It isn't implemented in C++ yet but I will try to get a patch up for > > that soon (today maybe). I think we should create a branch where we > > can stack the patches that implement this for each language. > > > > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor > wrote: > > > > > > I'll do the JS updates. Is it safe to validate against the Arrow C++ > > > integration tests? > > > > > > > > > On 8/22/19 7:28 PM, Micah Kornfield wrote: > > > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a > tracking > > > > issue with sub-issues on the development work. So far no-one has > claimed > > > > Java and Javascript tasks. > > > > > > > > Would it make sense to have a separate dev branch for this work?
> > > > > > > > Thanks, > > > > Micah > > > > > > > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney > wrote: > > > > > > > >> The vote carries with 4 binding +1 votes and 1 non-binding +1 > > > >> > > > >> I'll merge the specification patch later today and we can begin > > > >> working on implementations so we can get this done for 0.15.0 > > > >> > > > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler > wrote: > > > >>> +1 (non-binding) > > > >>> > > > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou > > > >> wrote: > > > Sorry, had forgotten to send my vote on this. > > > > > > +1 from me. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > On Wed, 14 Aug 2019 17:42:33 -0500 > > > Wes McKinney wrote: > > > > hi all, > > > > > > > > As we've been discussing [1], there is a need to introduce 4 > bytes of > > > > padding into the preamble of the "encapsulated IPC message" > format to > > > > ensure that the Flatbuffers metadata payload begins on an 8-byte > > > > aligned memory offset. The alternative to this would be for Arrow > > > > implementations where alignment is important (e.g. C or C++) to > copy > > > > the metadata (which is not always small) into memory when it is > > > > unaligned. > > > > > > > > Micah has proposed to address this by adding a > > > > 4-byte "continuation" value at the beginning of the payload > > > > having the value 0xFFFFFFFF. The reason to do it this way is that > > > > old clients will see an invalid length (what is currently the > > > > first 4 bytes of the message -- a 32-bit little endian signed > > > > integer indicating the metadata length) rather than potentially > > > > crashing on a valid length. We also propose to expand the "end of > > > > stream" marker used in the stream and file format from 4 to 8 > > > > bytes. This has the additional effect of aligning the file footer > > > > defined in File.fbs.
> > > > > > > > This would be a backwards incompatible protocol change, so older > > > >> Arrow > > > > libraries would not be able to read these new messages. > Maintaining > > > > forward compatibility (reading data produced by older libraries) > > > >> would > > > > be possible as we can reason that a value other than the > continuation > > > > value was produced by an older library (and then validate the > > > > Flatbuffer message of course). Arrow implementations could offer > a > > > > backward compatibility mode for the sake of old readers if they > > > >> desire > > > > (this may also assist with testing). > > > > > > > > Additionally with this vote, we want to formally approve the > change > > > >> to > > > > the Arrow "file" format to always write the (new 8-byte) > > > >> end-of-stream > > > > marker, which enables code that processes Arrow streams to safely > > > >> read > > > > the file's
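The framing described in the vote can be sketched in Python. This is an illustration of the layout, not pyarrow's implementation; the helper name is hypothetical, and the continuation value shown is the 0xFFFFFFFF marker the proposal describes:

```python
import struct

CONTINUATION = 0xFFFFFFFF  # reads as an invalid length to old clients

def encapsulate(metadata: bytes) -> bytes:
    """Frame a Flatbuffers metadata payload with the new preamble:
    a 4-byte continuation marker, then the 32-bit little-endian
    metadata length, then the metadata padded out to 8 bytes."""
    padded = (len(metadata) + 7) & ~7       # round up to multiple of 8
    return (struct.pack("<I", CONTINUATION)
            + struct.pack("<i", padded)
            + metadata
            + b"\x00" * (padded - len(metadata)))

msg = encapsulate(b"\x01\x02\x03")
assert msg[:4] == b"\xff\xff\xff\xff"   # old readers see an invalid length
assert len(msg[8:]) % 8 == 0            # metadata region padded to 8 bytes
# The metadata now begins at offset 8, an 8-byte aligned position;
# with the old 4-byte length-only prefix it began at offset 4.
```

The point of the 8-byte preamble is visible in the offsets: a reader that keeps the whole message 8-byte aligned can hand the Flatbuffers verifier a correctly aligned pointer without copying.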
[PROPOSAL] Consolidate Arrow's CI configuration
Hi,

Arrow's current continuous integration setup utilizes multiple CI providers, tools, and scripts:

- Unit tests are running on Travis and Appveyor
- Binary packaging builds are running on crossbow, an abstraction over multiple CI providers driven through a GitHub repository
- For local tests and tasks, there is a docker-compose setup, or of course you can maintain your own environment

This setup has run into some limitations:

- It’s slow: the CI parallelism of Travis has degraded over the last couple of months. Testing a PR takes more than an hour, which is a long time for both the maintainers and the contributors, and it has a negative effect on the development throughput.
- Build configurations are not portable; they are tied to specific services. You can’t just take a Travis script and run it somewhere else.
- Because they’re not portable, build configurations are duplicated in several places.
- The Travis, Appveyor and crossbow builds are not reproducible locally, so developing them requires the slow git push cycles.
- Public CI has limited platform support; for example, ARM machines are not available.
- Public CI also has limited hardware support; no GPUs are available.

Resolving all of the issues above is complicated, but is a must for the long term sustainability of Arrow. For some time, we’ve been working on a tool called Ursabot[1], a library on top of the CI framework Buildbot[2]. Buildbot is well maintained and widely used for complex projects, including CPython, Webkit, LLVM, MariaDB, etc. Buildbot is not another hosted CI service like Travis or Appveyor: it is an extensible framework to implement various automations like continuous integration tasks.

You’ve probably noticed additional “Ursabot” builds appearing on pull requests, in addition to the Travis and Appveyor builds. We’ve been testing the framework with a fully featured CI server at ci.ursalabs.org. This service runs build configurations we can’t run on Travis, does it faster than Travis, and has the GitHub comment bot integration for ad hoc build triggering.

While we’re not prepared to propose moving all CI to a self-hosted setup, our work has demonstrated the potential of using buildbot to resolve Arrow’s continuous integration challenges:

- The docker-based builders reuse the docker images, which eliminates slow dependency installation steps. Some builds on this setup, run on Ursa Labs’s infrastructure, run 20 minutes faster than the comparable Travis-CI jobs.
- It’s scalable. We can deploy buildbot wherever and add more masters and workers, which we can’t do with public CI.
- It’s platform and CI-provider independent. Builds can be run on arbitrary architectures, operating systems, and hardware: Python is the only requirement. Additionally, builds specified in buildbot/ursabot can be run anywhere: not only on custom buildbot infrastructure but also on Travis, or even on your own machine.
- It improves reproducibility and encourages consolidation of configuration. You can run the exact job locally that ran on Travis, and you can even get an interactive shell in the build so you can debug a test failure. And because you can run the same job anywhere, we wouldn’t need to have duplicated, Travis-specific or docker-compose build configuration stored separately.
- It’s extensible. More exotic features like a comment bot, benchmark database, benchmark dashboard, artifact store, or integration with other systems are easily implementable within the same system.

I’m proposing to donate the build configuration we’ve been iterating on in Ursabot to the Arrow codebase. Here [3] is a patch that adds the configuration. This will enable us to explore consolidating build configuration using the buildbot framework.
A next step to explore would be to port a Travis build to Ursabot and, in the Travis configuration, execute the build with the shell command `$ ursabot project build `. This is the same way we would be able to execute the build locally--something we can’t currently do with the Travis builds. I am not proposing here that we stop using Travis-CI and Appveyor to run CI for apache/arrow, though that may well be a direction we choose to go in the future. Moving build configuration into something like buildbot would be a necessary first step to do that; that said, there are other immediate benefits to be had by porting build configuration into buildbot: local reproducibility, consolidation of build logic, independence from a particular CI provider, and ease of using and maintaining faster, Docker-based jobs. Self-hosting CI brings a number of other challenges, which we will concurrently continue to explore, but we believe that there are benefits to adopting buildbot build configuration regardless.

Regards,
Krisztian

[1]: https://github.com/ursa-labs/ursabot
[2]: https://buildbot.net https://docs.buildbot.net
[jira] [Created] (ARROW-6385) [C++] Investigate xxh3
Antoine Pitrou created ARROW-6385: - Summary: [C++] Investigate xxh3 Key: ARROW-6385 URL: https://issues.apache.org/jira/browse/ARROW-6385 Project: Apache Arrow Issue Type: Task Components: Benchmarking, C++ Reporter: Antoine Pitrou xxh3 is a new hash algorithm by Yann Collet that claims excellent speed on both small/tiny and large keys. It has accelerated paths for x86 SSE2, AVX and ARM NEON. It also has excellent hash quality. https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html Perhaps this can replace our current complex strategy involving a custom tiny string hashing implementation, a HW CRC32-based path where available for large strings, and a murmurhash2 fallback. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6384) [C++] Bump dependencies
Antoine Pitrou created ARROW-6384: - Summary: [C++] Bump dependencies Key: ARROW-6384 URL: https://issues.apache.org/jira/browse/ARROW-6384 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6383) [Java] report outstanding child allocators on parent allocator close
Pindikura Ravindra created ARROW-6383: - Summary: [Java] report outstanding child allocators on parent allocator close Key: ARROW-6383 URL: https://issues.apache.org/jira/browse/ARROW-6383 Project: Apache Arrow Issue Type: Task Reporter: Pindikura Ravindra Assignee: Pindikura Ravindra When a parent allocator is closed, we should report the child allocators if any are outstanding. This helps in debugging memory leaks: it will tell you whether the leak happened in the parent or a child. -- This message was sent by Atlassian Jira (v8.3.2#803003)
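The debugging aid being described, as a toy Python sketch of an allocator tree (the names here are hypothetical; this is not the Java arrow-memory API):

```python
class Allocator:
    """Toy hierarchical allocator that reports outstanding children
    on close, pinpointing whether a leak is in the parent or a child."""
    def __init__(self, name, parent=None):
        self.name = name
        self.closed = False
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def close(self):
        # Instead of failing with an opaque leak error, name the
        # child allocators that are still open.
        outstanding = [c.name for c in self.children if not c.closed]
        if outstanding:
            raise RuntimeError(
                "{}: outstanding child allocators: {}".format(
                    self.name, outstanding))
        self.closed = True
```

Closing a parent while a child is still open raises an error that names the child, which is the "will tell if the leak happened in the parent or the child" behavior the ticket asks for.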
[jira] [Created] (ARROW-6382) Unable to catch Python UDF exceptions when using PyArrow
Jan created ARROW-6382: -- Summary: Unable to catch Python UDF exceptions when using PyArrow Key: ARROW-6382 URL: https://issues.apache.org/jira/browse/ARROW-6382 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.1 Environment: Ubuntu 18.04 Reporter: Jan

When PyArrow is enabled, Pandas UDF exceptions raised by the Executor become impossible to catch: see the example below. Is this expected behavior? If so, what is the rationale? If not, how do I fix this? Confirmed behavior in PyArrow 0.11 and 0.14.1 (latest) and PySpark 2.4.0 and 2.4.3. Python 3.6.5. To reproduce:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
# setting this to false will allow the exception to be caught
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@udf
def disrupt(x):
    raise Exception("Test EXCEPTION")

data = spark.createDataFrame(pd.DataFrame({"A": [1, 2, 3]}))
try:
    test = data.withColumn("test", disrupt("A")).toPandas()
except:
    print("exception caught")
print('end')
{code}

I would hope there's a way to catch the exception with the general except clause. -- This message was sent by Atlassian Jira (v8.3.2#803003)
Re: [DISCUSS] Release cadence and release vote conventions
Hi Andy, Just curious, with the next release coming up, whether you had a chance to look into the Maven issues yet. Thanks, Micah On Thu, Aug 1, 2019 at 2:04 PM Wes McKinney wrote: > I agree. In my experiences as RM I have found the involvement of Maven > in the release process to be a nuisance. I think it makes more sense > in Java-only projects > > On Thu, Aug 1, 2019 at 2:54 PM Andy Grove wrote: > > > > I'll start taking a look at the maven issue. We might not want to use > maven > > release plugin given that we control the version number already across > this > > repository via other means. > > > > On Wed, Jul 31, 2019 at 4:26 PM Sutou Kouhei wrote: > > > > > > Hi, > > > > > > Sorry for not replying to this thread. > > > > > > I think that the biggest problem is related to our Java > > > package. > > > > > > > > > We'll be able to resolve the GPG key problem by creating a > > > GPG key only for nightly release test. We can share the test > > > GPG key publicly because it's just for testing. > > > > > > It'll work for our binary artifacts and APT/Yum repositories > > > but not work for our Java package. I don't know where the GPG > > > key is used in our Java package... > > > > > > > > > We'll be able to resolve the Git commit problem by creating > > > a cloned Git repository for test. It's done in our > > > dev/release/00-prepare-test.rb[1]. > > > > > > [1] > > > > https://github.com/apache/arrow/blob/master/dev/release/00-prepare-test.rb#L30 > > > > > > The biggest problem for the Git commit is our Java package > > > requires an "apache-arrow-${VERSION}" tag on > > > https://github.com/apache/arrow . (Right?) > > > I think that "mvn release:perform" in > > > dev/release/01-perform.sh does so but I don't know the > > > details of "mvn release:perform"... > > > > > > > > > More details: > > > > > > dev/release/00-prepare.sh: > > > > > > We'll be able to run this automatically when we can resolve > > > the above GPG key problem in our Java package.
We can > > > resolve the Git commit problem by creating a cloned Git > > > repository. > > > > > > dev/release/01-prepare.sh: > > > > > > We'll be able to run this automatically when we can resolve > > > the above Git commit ("apche-arrow-${VERSION}" tag) problem > > > in our Java package. > > > > > > dev/release/02-source.sh: > > > > > > We'll be able to run this automatically by creating a GPG > > > key for nightly release test. We'll use Bintray to upload RC > > > source archive instead of dist.apache.org. Ah, we need a > > > Bintray API key for this. It must be secret. > > > > > > dev/release/03-binary.sh: > > > > > > We'll be able to run this automatically by creating a GPG > > > key for nightly release test. We need a Bintray API key too. > > > > > > We need to improve this to support nightly release test. It > > > use "XXX-rc" such as "debian-rc" for Bintray "package" name. > > > It should use "XXX-nightly" such as "debian-nightly" for > > > nightly release test instead. > > > > > > dev/release/post-00-release.sh: > > > > > > We'll be able to skip this. > > > > > > dev/release/post-01-upload.sh: > > > > > > We'll be able to skip this. > > > > > > dev/release/post-02-binary.sh: > > > > > > We'll be able to run this automatically by creating Bintray > > > "packages" for nightly release and use them. We can create > > > "XXX-nightly-release" ("debian-nightly-release") Bintray > > > "packages" and use them instead of "XXX" ("debian") Bintray > > > "packages". > > > > > > "debian" Bintray "package": https://bintray.com/apache/debian/ > > > > > > We need to improve this to support nightly release. > > > > > > dev/release/post-03-website.sh: > > > > > > We'll be able to run this automatically by creating a cloned > > > Git repository for test. > > > > > > It's better that we have a Web site to show generated pages. > > > We can create > > > https://github.com/apache/arrow-site/tree/asf-site/nightly > > > and use it but I don't like it. 
Because arrow-site increases > > > a commit day by day. > > > Can we prepare a Web site for this? (arrow-nightly.ursalabs.org?) > > > > > > dev/release/post-04-rubygems.sh: > > > > > > We may be able to use GitHub Package Registry[2] to upload > > > RubyGems. We can use "pre-release" package feature of > > > https://rubygems.org/ but it's not suitable for > > > nightly. It's for RC or beta release. > > > > > > [2] > https://github.blog/2019-05-10-introducing-github-package-registry/ > > > > > > dev/release/post-05-js.sh: > > > > > > We may be able to use GitHub Package Registry[2] to upload > > > npm packages. > > > > > > dev/release/post-06-csharp.sh: > > > > > > We may be able to use GitHub Package Registry[2] to upload > > > NuGet packages. > > > > > > dev/release/post-07-rust.sh: > > > > > > I don't have any idea. But it must be ran > > > automatically. It's always failed. I needed to run each > > > command manually. > > > > > > dev/release/post-08-remove-rc.sh: > > > > > > We'll be able to skip this. > > > > > > > > > Thanks, > > >
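The tag requirement Kouhei calls the biggest Git problem (the Java release step expecting an "apache-arrow-${VERSION}" tag on the repository) can be probed before running any release script. Below is a minimal Python sketch of such a pre-flight check; the helper names are hypothetical and not part of the Arrow release tooling.

```python
import subprocess

def release_tag_name(version: str) -> str:
    """Tag name that the Java release step ("mvn release:perform") expects."""
    return f"apache-arrow-{version}"

def tag_exists(version: str, tag_list_output: str) -> bool:
    """Check the output of `git tag --list` for the release tag."""
    return release_tag_name(version) in tag_list_output.split()

def check_repo(version: str, repo: str = ".") -> bool:
    """Run `git tag --list` in `repo` and look for the release tag."""
    out = subprocess.run(
        ["git", "-C", repo, "tag", "--list"],
        capture_output=True, text=True, check=True,
    ).stdout
    return tag_exists(version, out)
```

Splitting the pure string check from the `git` invocation keeps the logic testable without a real clone, which fits the cloned-test-repository approach used by dev/release/00-prepare-test.rb.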
Re: [Format] Semantics for dictionary batches in streams
> > I was thinking the file format must satisfy one of two conditions:
> > 1. Exactly one DictionaryBatch per encoded column
> > 2. DictionaryBatches are interleaved correctly.
>
> Could you clarify?

I think you clarified it very well :) My motivation for suggesting the additional complexity is that I see two use cases for the file format. These roughly correspond with the two options I suggested:

1. We are encoding data from scratch. In this case, it seems like all dictionaries would be built incrementally, would not need replacement, and we would write them at the end of the file [1].

2. The data being written out is essentially a "tee" off of some stream that is generating new dictionaries requiring replacement on the fly (e.g. reading back two Parquet files).

> It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).

It is certainly possible to accept the slippage from the stream format for now and add this capability later, since it should be forward compatible.

Thanks,
Micah

[1] There is also a medium-complexity option where we require one non-delta dictionary and as many delta dictionaries as the user wants.

On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney wrote:

> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield wrote:
> >
> > I was thinking the file format must satisfy one of two conditions:
> > 1. Exactly one DictionaryBatch per encoded column
> > 2. DictionaryBatches are interleaved correctly.
>
> Could you clarify? In the first case, there is no issue with
> dictionary replacements. I'm not sure about the second case -- if a
> dictionary id appears twice, then you'll see it twice in the file
> footer. I suppose you could look at the file offsets to determine
> whether a dictionary batch precedes a particular record batch block
> (to know which dictionary you should be using), but that's rather
> complicated to implement. It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).
>
> > On Tuesday, August 27, 2019, Wes McKinney wrote:
> > >
> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote:
> > > >
> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> > > > >
> > > > > So the current situation we have right now in C++ is that if we tried
> > > > > to create an IPC stream from a sequence of record batches that don't
> > > > > all have the same dictionary, we'd run into two scenarios:
> > > > >
> > > > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > > > the prior dictionary is a prefix of their dictionary. For example,
> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > > > such a case we could compute and send a delta batch.
> > > > >
> > > > > * Batches with a dictionary that is a permutation of values, and
> > > > > possibly new unique values.
> > > > >
> > > > > In this latter case, without the option of replacing an existing id in
> > > > > the stream, we would have to do a unification / permutation of indices
> > > > > and then also possibly send a delta batch. We should probably have
> > > > > code at some point that deals with both cases, but in the meantime I
> > > > > would like to allow dictionaries to be redefined in this case. Seems
> > > > > like we might need a vote to formalize this?
> > > >
> > > > Isn't the stream format deviating from the file format then? In the
> > > > file format, IIUC, dictionaries can appear after the respective record
> > > > batches, so there's no way to tell whether the original or redefined
> > > > version of a dictionary is being referred to.
> > >
> > > You make a good point -- we can consider changes to the file format to
> > > allow for record batches to have different dictionaries. Even handling
> > > delta dictionaries with the current file format would be a bit tedious
> > > (though not indeterminate).
> > >
> > > > Regards
> > > >
> > > > Antoine.
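The two scenarios Wes describes (prior dictionary is a prefix of the new one, so send a delta batch; otherwise the values are permuted and the dictionary must be replaced) can be sketched as a small decision function. This is an illustrative Python sketch, not the Arrow C++ implementation; the function name and return convention are invented for the example.

```python
def plan_dictionary_update(old, new):
    """Decide how to ship an updated dictionary on an IPC stream.

    If the previously sent dictionary is a prefix of the new one, only
    the appended values need to go out, as a delta batch. Otherwise the
    values were permuted (and possibly extended), so without index
    unification the whole dictionary must be resent as a replacement.
    """
    if len(new) >= len(old) and new[:len(old)] == old:
        delta = new[len(old):]
        return ("delta", delta) if delta else ("unchanged", [])
    return ("replacement", list(new))

# Wes's example: ['A', 'B', 'C'] followed by ['A', 'B', 'C', 'D', 'E']
# yields a delta batch carrying only ['D', 'E'].
```

A reader of the file format would need the inverse of this decision: with replacements allowed, it must use block offsets in the footer to know which version of a dictionary governs a given record batch, which is exactly the complication raised in the thread.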