Re: [VOTE] Release Apache Arrow 8.0.0 - RC0

2022-04-26 Thread Sutou Kouhei
-1 There are some problems with the RPM package and C GLib. I've fixed them: * https://github.com/apache/arrow/pull/13002 * https://github.com/apache/arrow/pull/13006 I'm still verifying RC0. Krisztián, it seems that you didn't press the "Close" button in

Re: [VOTE] Release Apache Arrow 8.0.0 - RC0

2022-04-26 Thread Sutou Kouhei
> - Binary verification consistently fails for Ubuntu hirsute. I dropped support for Ubuntu hirsute but forgot to remove it from our verification targets. Sorry. We can just ignore it. It has already been removed from the verification targets in master by

Re: [VOTE] Release Apache Arrow 8.0.0 - RC0

2022-04-26 Thread David Li
I ran into some issues: - Source verification for C++ fails with USE_CONDA=1 because mamba for whatever reason resolves orc==1.6.4 which is too old to build adapters/orc/adapter.cc. Manually patching the extracted source after verification works around this (although this is annoying to do). -

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Yue Ni
This is a very interesting topic. I wonder, if we have a UDF mechanism in Arrow compute, whether there is any chance Gandiva's UDFs could be integrated with Arrow compute's UDF function registry? [1] From an external user's perspective, Gandiva is part of the Arrow project, having two UDF registries that are

Re: [Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

2022-04-26 Thread Kevin Crouse
Hi Antoine (and all), Thanks for your thoughts. I'll finish up the prototype and share my branch. It also wouldn't increase the time to import pyarrow by itself, though the rough idea would increase the import time of pyarrow.compute as it's currently written; more on that below. The only external

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Weston Pace
There was an old design document I proposed on this ML a while back. I never got around to implementing it and I think it has aged somewhat but it covers some of the points I brought up and it might be worth reviewing.

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Sasha Krassovsky
An ExecPlan is composed of a bunch of implicit “pipelines”. Each node in a pipeline (starting with a source node) implements `InputReceived` and `InputFinished`. On `InputReceived`, it performs its computation and calls `InputReceived` on its output. On `InputFinished`, it performs any cleanup
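The push-based flow described in this message can be illustrated with a minimal sketch. The class and method names below (`Node`, `Filter`, `Sink`, `input_received`, `input_finished`, `run_source`) are hypothetical Python stand-ins chosen for illustration; they are not Arrow's C++ `ExecNode` API, and batches are plain lists rather than Arrow record batches.

```python
class Node:
    """Hypothetical push-based node; not Arrow's ExecNode API."""

    def __init__(self, output=None):
        self.output = output

    def input_received(self, batch):
        # Do this node's work, then push the result to the next node.
        result = self.process(batch)
        if result is not None and self.output is not None:
            self.output.input_received(result)

    def input_finished(self):
        # Clean up, then propagate the end-of-stream signal downstream.
        self.cleanup()
        if self.output is not None:
            self.output.input_finished()

    def process(self, batch):
        return batch

    def cleanup(self):
        pass


class Filter(Node):
    def __init__(self, predicate, output=None):
        super().__init__(output)
        self.predicate = predicate

    def process(self, batch):
        return [row for row in batch if self.predicate(row)]


class Sink(Node):
    def __init__(self):
        super().__init__(None)
        self.batches = []
        self.finished = False

    def process(self, batch):
        self.batches.append(batch)
        return None  # nothing to forward; this ends the pipeline

    def cleanup(self):
        self.finished = True


def run_source(batches, first_node):
    # The source drives the pipeline by pushing each batch into it.
    for batch in batches:
        first_node.input_received(batch)
    first_node.input_finished()


sink = Sink()
pipeline = Filter(lambda x: x % 2 == 0, output=sink)
run_source([[1, 2, 3], [4, 5, 6]], pipeline)
```

In this sketch the whole pipeline runs on the caller's stack: each call to `input_received` returns only after every downstream node has handled the batch.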

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Weston Pace
I think this is doable. I think we want to introduce the concept of a batch index. The scanner is then responsible for assigning a batch index to each outgoing batch. Some ExecNodes would reset or destroy the batch index (for example, you cannot generally do an asof join after a hash join
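One way to picture the batch index idea is the sketch below. It assumes the scanner tags batches with consecutive integers starting at 0 and that a downstream consumer buffers batches until the next expected index arrives. `scan` and `Sequencer` are hypothetical names, not part of Arrow.

```python
import heapq


def scan(batches):
    """Hypothetical scanner: tags each outgoing batch with a batch index."""
    for index, batch in enumerate(batches):
        yield index, batch


class Sequencer:
    """Buffers out-of-order batches and releases them in index order."""

    def __init__(self):
        self._heap = []   # min-heap of (index, batch); indices are unique
        self._next = 0    # next index that may be released

    def push(self, index, batch):
        heapq.heappush(self._heap, (index, batch))
        ready = []
        while self._heap and self._heap[0][0] == self._next:
            ready.append(heapq.heappop(self._heap)[1])
            self._next += 1
        return ready


tagged = list(scan(["a", "b", "c", "d"]))
# Simulate nondeterministic scheduling: batches arrive out of order.
arrival = [tagged[2], tagged[0], tagged[1], tagged[3]]

sequencer = Sequencer()
released = []
for index, batch in arrival:
    released.extend(sequencer.push(index, batch))
```

The cost of this design is buffering: a late batch holds back every batch with a higher index until it arrives.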

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Li Jin
Hey, thanks again for the reply! > I would suggest accumulating all batches just like in Hash Join This is something I intentionally try to avoid because asof join (and many other time series operations) can be performed in a streaming fashion to reduce memory footprint. > When you want to scale

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
Also, this may sound counter-intuitive, but LLVM IR is actually architecture-specific because it is tied to various parameters of the architecture such as type widths and alignments. On 26/04/2022 at 19:51, Sasha Krassovsky wrote: I think I can help answer these: 1) LLVM IR is an

Re: [DISC] Improving Arrow's database support

2022-04-26 Thread Wes McKinney
I don't have major new things to add on this topic except that I've long had the aspiration of creating something like Python's DBAPI 2.0 [1] at the C or C++ level to enable a measure of API standardization for Arrow-native read/write interfaces with database drivers. It seems like a natural
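For readers unfamiliar with DBAPI 2.0, the sketch below shows the shape of the interface using Python's standard-library `sqlite3` module, which implements it. The table name and rows are made up for illustration; the point is the standardized connection/cursor/execute/fetch pattern that a C or C++ equivalent would presumably mirror.

```python
import sqlite3

# connect() -> Connection, cursor() -> Cursor: the core DBAPI 2.0 objects.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
# Parameterized statements use the driver's paramstyle ("qmark" for sqlite3).
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.commit()

cur.execute("SELECT id, name FROM t ORDER BY id")
rows = cur.fetchall()                              # list of row tuples
column_names = [d[0] for d in cur.description]     # result-set metadata

cur.close()
conn.close()
```

Results come back row by row as Python tuples, which is the main mismatch with Arrow's columnar format.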

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Sasha Krassovsky
I would advise against relying on any specific ordering of batches coming in. When you want to scale up to multiple threads, you will no longer be able to rely on any order because scheduling is generally pretty nondeterministic. I would suggest accumulating all batches just like in Hash Join,

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Li Jin
> In order to produce an output for a left batch, I would need to wait until I received enough batches from the right tables to cover all potential matches (wait until I have seen right timestamps outside the matching range) To add a bit more explanation: let's say the time range of the current left

Re: [Compute][C++] Question on compute scheduler

2022-04-26 Thread Li Jin
Thanks both for the reply. To add a bit more context, I am trying to implement an "asof join". Here I have one left table and n right tables, and all batches arrive in time order. In order to produce an output for a left batch, I would need to wait until I have received enough batches from the right
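The waiting rule described here can be sketched for the simplest case. The sketch assumes a single right table, rows as `(timestamp, value)` tuples arriving in time order, and a backward match within a fixed `tolerance`; `StreamingAsofJoin` and its methods are hypothetical names, not an Arrow API. A left row is emitted only once the right side has produced a timestamp strictly greater than the left timestamp, or has finished.

```python
from collections import deque


class StreamingAsofJoin:
    """Hypothetical streaming asof join over one left and one right stream."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.left = deque()    # pending (ts, value) left rows
        self.right = deque()   # buffered (ts, value) right rows
        self.right_done = False

    def push_left(self, rows):
        self.left.extend(rows)
        return self._drain()

    def push_right(self, rows):
        self.right.extend(rows)
        return self._drain()

    def finish_right(self):
        self.right_done = True
        return self._drain()

    def _drain(self):
        out = []
        while self.left:
            ts, value = self.left[0]
            # Wait until the right side has moved past ts (or has ended):
            # only then is the latest right row at or before ts known.
            if not self.right_done and (not self.right or self.right[-1][0] <= ts):
                break
            # Drop right rows superseded by a later row that is still <= ts.
            while len(self.right) > 1 and self.right[1][0] <= ts:
                self.right.popleft()
            match = None
            if self.right and ts - self.tolerance <= self.right[0][0] <= ts:
                match = self.right[0][1]
            out.append((ts, value, match))
            self.left.popleft()
        return out


join = StreamingAsofJoin(tolerance=5)
out1 = join.push_left([(10, "L1"), (20, "L2")])   # nothing known about right yet
out2 = join.push_right([(8, "R1"), (12, "R2")])   # right has passed ts=10
out3 = join.finish_right()                         # right ended; flush the rest
```

Because superseded right rows are dropped as left rows are emitted, the buffer holds only the rows that can still match, which is the memory advantage over accumulating every batch first.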

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Sasha Krassovsky
I think I can help answer these: 1) LLVM IR is an intermediate representation for compilers, WASM is an open standard for sandboxed computation. They fulfill different but complementary roles. If the query engine were handed LLVM IR, it would still have to JIT the IR to wasm in order to

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Li Jin
This is a very interesting topic and one that we care a lot about when using/thinking about Arrow compute. I come from Python data analytics where most of our users use Pandas/Numpy. This is also my first time learning about WASM and my previous understanding of "Python UDF in Arrow C++ compute"

Re: [DISC] Improving Arrow's database support

2022-04-26 Thread Antoine Pitrou
Do we want something more flexible than dlopen() and runtime symbol lookup (a mechanism which constrains the way you can organize and distribute drivers)? For example, perhaps we could expose an API struct of function pointers that could be obtained through driver-specific means. Le
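The "API struct of function pointers" idea can be illustrated with Python's `ctypes`, which can model a C struct whose fields are function pointers. Everything named below (`DriverApi`, `connect`, `version`, `get_driver_api`, the `demo://` scheme) is a hypothetical illustration, not a proposed or existing Arrow interface. The caller receives the struct from some driver-specific entry point and never looks up symbols by name.

```python
import ctypes

# Hypothetical function signatures for a driver; not from any real API.
ConnectFn = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_char_p)
VersionFn = ctypes.CFUNCTYPE(ctypes.c_int)


class DriverApi(ctypes.Structure):
    """C-style struct whose fields are function pointers."""
    _fields_ = [("connect", ConnectFn), ("version", VersionFn)]


def _connect(uri):
    # Return 0 on success, -1 on failure, as a C API typically would.
    return 0 if uri.startswith(b"demo://") else -1


def _version():
    return 1


# Keep module-level references so the callbacks outlive the struct.
_connect_cb = ConnectFn(_connect)
_version_cb = VersionFn(_version)


def get_driver_api():
    """Driver-specific way to obtain the table of entry points."""
    return DriverApi(connect=_connect_cb, version=_version_cb)


api = get_driver_api()
status_ok = api.connect(b"demo://localhost")
status_bad = api.connect(b"other://localhost")
api_version = api.version()
```

With this shape, a driver can be statically linked, loaded dynamically, or embedded in another language runtime, as long as it can hand back the struct.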

Re: Perf/Benchmark for temporal operations

2022-04-26 Thread Rok Mihevc
I've opened a PR for temporal benchmarks: https://github.com/apache/arrow/pull/12997 Please chime in if some more benchmarks are needed. Results for the first run are here: https://conbench.ursa.dev/runs/019c6f9cdd82415382280c89be122b58/ Rok

[DISC] Improving Arrow's database support

2022-04-26 Thread David Li
Hello, In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Weston Pace
In addition to the memory copy it looks like WASM is going to bounds check all loads/stores. It does, at least, have some vectorized load/store operations so that can help amortize the cost. It appears you aren't going to get the same performance as native today using WASM but I'm guessing that

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
I need to correct myself here - it is currently not possible to pass memory at zero cost between the engine and WASM interpreter. This is related to your point about safety - WASM provides memory safety guarantees because it controls the memory region that it can read from and write to. Therefore,
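The copy-in/copy-out cost mentioned here can be pictured with a toy model. The sketch below does not use a real WASM runtime: `LinearMemory` is a hypothetical stand-in for a sandbox's private, bounds-checked memory region, and `run_udf` shows the host copying data into that region, running the function against it, and copying the result back out.

```python
class LinearMemory:
    """Toy stand-in for a WASM linear memory: private and bounds-checked."""

    def __init__(self, size):
        self._data = bytearray(size)

    def _check(self, offset, length):
        if offset < 0 or length < 0 or offset + length > len(self._data):
            raise IndexError("out-of-bounds access to sandbox memory")

    def write(self, offset, payload):
        self._check(offset, len(payload))
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        self._check(offset, length)
        return bytes(self._data[offset:offset + length])


def run_udf(host_buffer, udf, memory):
    memory.write(0, host_buffer)              # copy host data into the sandbox
    udf(memory, 0, len(host_buffer))          # the UDF sees only sandbox memory
    return memory.read(0, len(host_buffer))   # copy the result back out


def add_one(memory, offset, length):
    # Hypothetical UDF: increments every byte in place (wrapping at 256).
    data = memory.read(offset, length)
    memory.write(offset, bytes((b + 1) % 256 for b in data))


memory = LinearMemory(16)
host_input = bytes([1, 2, 3])
result = run_udf(host_input, add_one, memory)
```

The two copies in `run_udf` are the price of the isolation: the function never receives a pointer into host memory, so it cannot read or corrupt it.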

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread David Li
Ah, fair point Antoine. Yes, I believe you are expected to copy data in/out right now: https://github.com/WebAssembly/design/issues/1162 On Tue, Apr 26, 2022, at 10:43, Antoine Pitrou wrote: > On 26/04/2022 at 16:30, Gavin Ray wrote: >> Antoine, sandboxing comes into play from two places: >>

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
On 26/04/2022 at 16:30, Gavin Ray wrote: Antoine, sandboxing comes into play from two places: 1) The WASM specification itself, which puts bounds on the types of behaviors possible 2) The implementation of the WASM bytecode interpreter chosen, like Jorge mentioned in the comment above

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Gavin Ray
Antoine, sandboxing comes into play from two places: 1) The WASM specification itself, which puts bounds on the types of behaviors possible 2) The implementation of the WASM bytecode interpreter chosen, like Jorge mentioned in the comment above The wasmtime docs have a pretty solid section

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote: Would WASM be able to interact in-process with non-WASM buffers safely? AFAIK yes. My understanding from playing with it in JS is that a WASM-backed UDF execution would be something like: 1. compile the C++/Rust/etc UDF to WASM (a binary

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
> Would WASM be able to interact in-process with non-WASM buffers safely? AFAIK yes. My understanding from playing with it in JS is that a WASM-backed udf execution would be something like: 1. compile the C++/Rust/etc UDF to WASM (a binary format) 2. provide a small WASM-compiled middleware of

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
On 25/04/2022 at 23:04, David Li wrote: The WebAssembly documentation has a rundown of the techniques used: https://webassembly.org/docs/security/ I think usually you would run WASM in-process, though we could indeed also put it in a subprocess to further isolate things. Would WASM be

[VOTE] Release Apache Arrow 8.0.0 - RC0

2022-04-26 Thread Krisztián Szűcs
Hi, I would like to propose the following release candidate (RC0) of Apache Arrow version 8.0.0. This is a release consisting of 564 resolved JIRA issues[1]. This release candidate is based on commit: 4d2f6991574a7e494679c891ab76c6c110af89a0 [2] The source release rc0 is hosted at [3]. The

Re: [Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

2022-04-26 Thread Antoine Pitrou
Hi Kevin, There are a couple of concerns to keep in mind: - we don't want to increase the import time of PyArrow too much - we would like to limit the required runtime dependencies for PyArrow (an issue is open to move docstring generation to package build time: