-1
There are some problems for RPM package and C GLib. I've
fixed them:
* https://github.com/apache/arrow/pull/13002
* https://github.com/apache/arrow/pull/13006
I'm still verifying RC0.
Krisztián, it seems that you didn't press the "Close" button
in
> - Binary verification consistently fails for Ubuntu hirsute.
I dropped support for Ubuntu hirsute, but I forgot to remove it from
our verification targets. Sorry.
We can just ignore it. We already removed it from
our verification targets on master by
I ran into some issues:
- Source verification for C++ fails with USE_CONDA=1 because mamba for whatever
reason resolves orc==1.6.4 which is too old to build adapters/orc/adapter.cc.
Manually patching the extracted source after verification works around this
(although this is annoying to do).
-
This is a very interesting topic. If we add a UDF mechanism to Arrow
compute, is there any chance Gandiva's UDFs could be integrated with
Arrow compute's UDF function registry? [1]
From an external user's perspective, Gandiva is part of the Arrow project;
having two UDF registries that are
Hi Antoine (and all),
Thanks for your thoughts. I'll finish up the prototype and share my
branch. It wouldn't increase the time to import pyarrow by itself, but the
rough idea would increase the import time of pyarrow.compute as it's
currently written; more on that below. The only external
There was an old design document I proposed on this ML a while back.
I never got around to implementing it and I think it has aged somewhat
but it covers some of the points I brought up and it might be worth
reviewing.
An ExecPlan is composed of a bunch of implicit “pipelines”. Each node in a
pipeline (starting with a source node) implements `InputReceived` and
`InputFinished`. On `InputReceived`, it performs its computation and calls
`InputReceived` on its output. On `InputFinished`, it performs any cleanup
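The push-based pattern described above can be sketched in a few lines. This is a toy illustration with hypothetical names, not the actual Arrow ExecNode API: each node receives batches via `input_received` and a completion signal via `input_finished`, and pushes results downstream.

```python
# Toy push-based pipeline: each node receives batches via input_received()
# and a completion signal via input_finished(), pushing results downstream.
class Node:
    def __init__(self, output=None):
        self.output = output

    def input_received(self, batch):
        raise NotImplementedError

    def input_finished(self):
        # Default behavior: propagate completion downstream.
        if self.output:
            self.output.input_finished()

class ProjectNode(Node):
    """Applies a function to each batch and pushes the result onward."""
    def __init__(self, fn, output):
        super().__init__(output)
        self.fn = fn

    def input_received(self, batch):
        self.output.input_received(self.fn(batch))

class SinkNode(Node):
    """Collects all batches at the end of the pipeline."""
    def __init__(self):
        super().__init__()
        self.batches = []
        self.finished = False

    def input_received(self, batch):
        self.batches.append(batch)

    def input_finished(self):
        self.finished = True

# A source simply pushes batches into the first node of the pipeline.
sink = SinkNode()
project = ProjectNode(lambda b: [x * 2 for x in b], sink)
for batch in ([1, 2], [3]):
    project.input_received(batch)
project.input_finished()
```

The pipeline never materializes its full input; each batch flows through as soon as it arrives, which is what makes the "implicit pipelines" cheap.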
I think this is doable. I think we want to introduce the concept of a
batch index. The scanner is then responsible for assigning a batch
index to each outgoing batch. Some ExecNodes would reset or destroy
the batch index (for example, you cannot generally do an asof join
after a hash join
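One way to picture the batch-index idea (a sketch with made-up names, not Arrow code): the source tags each outgoing batch with a monotonically increasing index, order-preserving nodes pass the tag through, and order-destroying nodes such as a hash join drop it, so downstream nodes can tell whether ordering is still meaningful.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedBatch:
    index: Optional[int]  # None once ordering has been destroyed
    data: list

def source(batches):
    # The scanner assigns a monotonically increasing batch index.
    return [TaggedBatch(i, b) for i, b in enumerate(batches)]

def filter_node(tagged, pred):
    # A filter preserves relative batch order, so the index survives.
    return [TaggedBatch(t.index, [x for x in t.data if pred(x)])
            for t in tagged]

def hash_join_like(tagged):
    # A hash join emits batches in nondeterministic order,
    # so it must destroy the batch index.
    return [TaggedBatch(None, t.data) for t in tagged]

out = hash_join_like(
    filter_node(source([[1, 2], [3, 4]]), lambda x: x % 2 == 0))
```

An order-sensitive node like an asof join could then refuse inputs whose index is `None`, matching the "you cannot generally do an asof join after a hash join" constraint.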
Hey thanks again for the reply!
> I would suggest accumulating all batches just like in Hash Join
This is something I intentionally try to avoid because asof join (and many
other time series operations) can be performed in a streaming fashion to
reduce memory footprint.
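The streaming shape being described could look roughly like this (a sketch under the assumption that both inputs arrive sorted by timestamp; not Arrow code): only a bounded window of right rows is buffered, instead of accumulating the whole right table as a hash join would.

```python
def asof_join_streaming(left, right, tolerance):
    """Sketch of a streaming asof join on sorted (ts, value) rows.

    For each left row, pick the latest right row with
    left.ts - tolerance <= right.ts <= left.ts. Because both inputs
    arrive in time order, only a bounded window of right rows needs
    to be buffered, keeping the memory footprint small.
    """
    buf = []  # buffered right rows still inside the tolerance window
    r = iter(right)
    pending = next(r, None)
    for lts, lval in left:
        # Pull right rows up to the current left timestamp.
        while pending is not None and pending[0] <= lts:
            buf.append(pending)
            pending = next(r, None)
        # Drop right rows that fell out of the tolerance window.
        while buf and buf[0][0] < lts - tolerance:
            buf.pop(0)
        match = buf[-1][1] if buf else None
        yield (lts, lval, match)

rows = list(asof_join_streaming(
    left=[(1, "a"), (5, "b")],
    right=[(0, "x"), (4, "y"), (9, "z")],
    tolerance=2,
))
```

The buffer size is bounded by how many right rows fall within one tolerance window, which is exactly the memory advantage over accumulate-everything strategies.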
> When you want to scale
Also, this may sound counter-intuitive, but LLVM IR is actually
architecture-specific because it is tied to various parameters of the
architecture such as type widths and alignments.
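The point about type widths is visible even from Python: C's `long`, for instance, is 8 bytes on LP64 Linux/macOS but 4 bytes on 64-bit Windows, so IR that baked in one width would be wrong on the other target. A quick illustration with the stdlib `ctypes` module:

```python
import ctypes

# Type widths and alignments are properties of the target ABI, which is
# exactly the kind of detail LLVM IR bakes in (e.g. i32 vs i64 for `long`).
widths = {
    "long": ctypes.sizeof(ctypes.c_long),      # 8 on LP64, 4 on Windows x64
    "size_t": ctypes.sizeof(ctypes.c_size_t),  # pointer-sized
    "double": ctypes.sizeof(ctypes.c_double),  # 8 on all common ABIs
}
align_double = ctypes.alignment(ctypes.c_double)
```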
On 26/04/2022 at 19:51, Sasha Krassovsky wrote:
I think I can help answer these:
1) LLVM IR is an
I don't have major new things to add on this topic except that I've
long had the aspiration of creating something like Python's DBAPI 2.0
[1] at the C or C++ level to enable a measure of API standardization
for Arrow-native read/write interfaces with database drivers. It seems
like a natural
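For reference, this is the DBAPI 2.0 shape being alluded to, shown here with the stdlib sqlite3 driver; the idea would be an analogous connect/cursor/fetch contract at the C or C++ level that returns Arrow data instead of Python tuples.

```python
import sqlite3

# PEP 249 (DBAPI 2.0) standardizes connect() -> connection -> cursor ->
# execute()/fetch*() across drivers; only the connect() arguments differ
# from one database to another.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
cur.execute("SELECT sum(x) FROM t")
total = cur.fetchone()[0]
conn.close()
```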
I would advise against relying on any specific ordering of batches coming in.
When you want to scale up to multiple threads, you will no longer be able to
rely on any order because scheduling is generally pretty nondeterministic. I
would suggest accumulating all batches just like in Hash Join,
> In order to produce an output for a left batch, I would need to wait until
I received enough batches from the right tables to cover all potential
matches (wait until I have seen right timestamps outside the matching range)
To add a bit more explanation, let's say the time range of the current left
Thanks both for the reply. To add a bit more context, I am trying to
implement an "asof join". Here I have one left table and n right table, and
all batches arrive in time order.
In order to produce an output for a left batch, I would need to wait until I
received enough batches from the right
I think I can help answer these:
1) LLVM IR is an intermediate representation for compilers, WASM is an open
standard for sandboxed computation. They fulfill different but complementary
roles. If the query engine were handed LLVM IR, it would still have to JIT the
IR to wasm in order to
This is a very interesting topic and one that we care a lot about when
using/thinking about Arrow compute.
I come from Python data analytics where most of our users use Pandas/Numpy.
This is also my first time learning about WASM and my previous
understanding of "Python UDF in Arrow C++ compute"
Do we want something more flexible than dlopen() and runtime symbol
lookup (a mechanism which constrains the way you can organize and
distribute drivers)?
For example, perhaps we could expose an API struct of function pointers
that could be obtained through driver-specific means.
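A sketch of what "an API struct of function pointers" could look like, modeled here from Python with the stdlib `ctypes` module (the `DriverApi` layout and field names are hypothetical, not an actual driver ABI): instead of resolving many symbols via dlopen()/dlsym(), the driver exposes one entry point that fills in a version-tagged struct.

```python
import ctypes

# Hypothetical driver ABI: a single struct of function pointers, obtained
# from one entry point, instead of per-symbol dlopen()/dlsym() lookups.
QueryFn = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_char_p)
CloseFn = ctypes.CFUNCTYPE(None)

class DriverApi(ctypes.Structure):
    _fields_ = [
        ("version", ctypes.c_int),  # lets the caller negotiate the ABI
        ("execute", QueryFn),
        ("close", CloseFn),
    ]

# In a real driver these pointers would be filled in by the shared
# library's entry point; here they are Python callbacks for illustration.
# (The callback objects must be kept alive alongside the struct.)
calls = []
execute_cb = QueryFn(lambda sql: calls.append(sql) or 0)
close_cb = CloseFn(lambda: None)
api = DriverApi(version=1, execute=execute_cb, close=close_cb)

status = api.execute(b"SELECT 1")
```

The version field is what makes this more flexible than raw symbol lookup: the caller can check it before touching any other member, and the struct can grow in later ABI versions without breaking old drivers.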
On
I've opened a PR for temporal benchmarks:
https://github.com/apache/arrow/pull/12997
Please chime in if some more benchmarks are needed.
Results for the first run are here:
https://conbench.ursa.dev/runs/019c6f9cdd82415382280c89be122b58/
Rok
Hello,
In light of recent efforts around Flight SQL, projects like pgeon [1], and
long-standing tickets/discussions about database support in Arrow [2], it seems
there's an opportunity to define standard database interfaces for Arrow that
could unify these efforts. So we've put together a
In addition to the memory copy it looks like WASM is going to bounds
check all loads/stores. It does, at least, have some vectorized
load/store operations so that can help amortize the cost. It appears
you aren't going to get the same performance as native today using
WASM but I'm guessing that
I need to correct myself here - it is currently not possible to pass memory
at zero cost between the engine and WASM interpreter. This is related to
your point about safety - WASM provides memory safety guarantees because it
controls the memory region that it can read from and write to. Therefore,
Ah, fair point Antoine. Yes, I believe you are expected to copy data in/out
right now: https://github.com/WebAssembly/design/issues/1162
On Tue, Apr 26, 2022, at 10:43, Antoine Pitrou wrote:
> On 26/04/2022 at 16:30, Gavin Ray wrote:
>> Antoine, sandboxing comes into play from two places:
>>
On 26/04/2022 at 16:30, Gavin Ray wrote:
Antoine, sandboxing comes into play from two places:
1) The WASM specification itself, which puts bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above
Antoine, sandboxing comes into play from two places:
1) The WASM specification itself, which puts bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above
The wasmtime docs have a pretty solid section
On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
Would WASM be able to interact in-process with non-WASM buffers safely?
AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:
1. compile the C++/Rust/etc UDF to WASM (a binary
> Would WASM be able to interact in-process with non-WASM buffers safely?
AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:
1. compile the C++/Rust/etc UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of
On 25/04/2022 at 23:04, David Li wrote:
The WebAssembly documentation has a rundown of the techniques used:
https://webassembly.org/docs/security/
I think usually you would run WASM in-process, though we could indeed also put
it in a subprocess to further isolate things.
Would WASM be
Hi,
I would like to propose the following release candidate (RC0) of Apache
Arrow version 8.0.0. This is a release consisting of 564
resolved JIRA issues[1].
This release candidate is based on commit:
4d2f6991574a7e494679c891ab76c6c110af89a0 [2]
The source release rc0 is hosted at [3].
The
Hi Kevin,
There are a couple of concerns to keep in mind:
- we don't want to increase the import time of PyArrow too much
- we would like to limit the required runtime dependencies for PyArrow
(an issue is open to move docstring generation at package build time: