Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Yue Ni
This is a very interesting topic. If arrow compute gets a UDF mechanism,
is there any chance Gandiva's UDFs could be integrated with arrow
compute's UDF function registry? [1]
From an external user's perspective, Gandiva is part of the Arrow project,
so having two UDF registries that are not interoperable seems a bit of a
waste. If arrow compute had the option to make Gandiva UDFs accessible, it
would be great for users. As far as I know, Gandiva's precompiled UDFs use
LLVM IR.

[1] https://www.dremio.com/blog/adding-a-user-define-function-to-gandiva/

On Wed, Apr 27, 2022 at 3:37 AM Antoine Pitrou  wrote:

>
> Also, this may sound counter-intuitive, but LLVM IR is actually
> architecture-specific because it is tied to various parameters of the
> architecture such as type widths and alignments.
>
>
> On 26/04/2022 at 19:51, Sasha Krassovsky wrote:
> > I think I can help answer these:
> > 1) LLVM IR is an intermediate representation for compilers, WASM is an
> open standard for sandboxed computation. They fulfill different but
> complementary roles. If the query engine were handed LLVM IR, it would
> still have to JIT the IR to wasm in order to maintain the sandboxing
> guarantees. It would also tie the query engine to LLVM, whereas there may
> be other wasm generators out there.
> >
> > 2) The idea would be for the user to use some external tool or compiler
> that generates wasm, and pass the wasm to the query engine. This would mean
> that you could write a UDF in any language of your choosing. It seems like
> it wouldn’t be much work to use your existing numpy + numba pipeline as
> well, you would just have to add a step to generate wasm from your LLVM IR
> before passing it to the engine.
> >
> > Sasha
> >
> >> On 26 Apr 2022, at 10:39, Li Jin wrote:
> >>
> >> This is a very interesting topic and one that we care a lot about when
> >> using/thinking about Arrow compute.
> >>
> >> I come from Python data analytics where most of our users use
> Pandas/Numpy.
> >> This is also my first time learning about WASM and my previous
> >> understanding of "Python UDF in Arrow C++ compute" engine is more of:
> >>
> >> UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
> >> Execute LLVM IR within Arrow C++ engine on Arrow Arrays
> >>
> >> Which in my understanding is similar to UDFs in Impala with LLVM IR that
> >> Wes mentioned.
> >>
> >> I wonder how WASM would potentially change things. A couple of questions:
> >> (1) What is the advantage of using WASM instead of something like LLVM IR?
> >> (2) Do we envision using something like a NumPy API as the language for
> >> writing these UDFs, or something completely different? (Another DSL?)
> >>
> >> Li
> >>
> >>> On Tue, Apr 26, 2022 at 11:04 AM Weston Pace 
> wrote:
> >>>
> >>> In addition to the memory copy it looks like WASM is going to bounds
> >>> check all loads/stores.  It does, at least, have some vectorized
> >>> load/store operations so that can help amortize the cost.  It appears
> >>> you aren't going to get the same performance as native today using
> >>> WASM but I'm guessing that is an active area of research and
> >>> investment.
> >>>
>  On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
>   wrote:
> 
>  I need to correct myself here - it is currently not possible to pass
> >>> memory
>  at zero cost between the engine and WASM interpreter. This is related
> to
>  your point about safety - WASM provides memory safety guarantees
> because
> >>> it
>  controls the memory region that it can read from and write to.
> Therefore,
>  currently passing data from and into WASM requires a memcopy.
> 
>  There is a proposal [1] to improve the situation, but currently would
> >>> incur
>  a cost in the query engine, since we would need to memcopy the regions
>  around.
> 
>  I forgot that on my poc I passed the parquet file from js to WASM and
>  de-serialized it to arrow directly in wasm - so memory was already
> being
>  allocated from within WASM sandbox, not JS. Sorry for the confusion.
> 
>  [1] https://github.com/WebAssembly/design/issues/1439
> 
>  Best,
>  Jorge
> 
> 
> 
>  On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou 
> >>> wrote:
> 
> >
> > On 26/04/2022 at 16:30, Gavin Ray wrote:
> >> Antoine, sandboxing comes into play from two places:
> >>
> >> 1) The WASM specification itself, which puts a bounds on the types
> of
> >> behaviors possible
> >> 2) The implementation of the WASM bytecode interpreter chosen, like
> >>> Jorge
> >> mentioned in the comment above
> >>
> >> The wasmtime docs have a pretty solid section covering the
> sandboxing
> >> guarantees of WASM, and then the interpreter-specific
> >>> behavior/abilities
> > of
> >> wasmtime FWIW:
> >> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> >
> > This doesn't really answer my 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou



Also, this may sound counter-intuitive, but LLVM IR is actually 
architecture-specific because it is tied to various parameters of the 
architecture such as type widths and alignments.
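
As a small illustration of the kind of parameters meant here (this is not
LLVM itself, only an analogy): the sizes and alignments of C types differ
between targets, and IR produced by a C-family front end typically assumes
the values of one target. The sketch below uses Python's ctypes to report
them for the host it runs on. The helper name host_layout is made up for
this example.

```python
import ctypes


def host_layout():
    """Return {type name: (size, alignment)} in bytes for the current host."""
    # These three commonly differ between platforms (e.g. 32-bit vs 64-bit,
    # or Linux vs Windows), which is why they are picked for illustration.
    c_types = {
        "void*": ctypes.c_void_p,
        "long": ctypes.c_long,
        "long double": ctypes.c_longdouble,
    }
    return {
        name: (ctypes.sizeof(t), ctypes.alignment(t))
        for name, t in c_types.items()
    }


if __name__ == "__main__":
    for name, (size, align) in host_layout().items():
        print(f"{name}: size={size} align={align}")
```

Running this on two different platforms will generally print different
numbers, which is the portability problem in miniature.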



On 26/04/2022 at 19:51, Sasha Krassovsky wrote:

I think I can help answer these:
1) LLVM IR is an intermediate representation for compilers, WASM is an open 
standard for sandboxed computation. They fulfill different but complementary
roles. If the query engine were handed LLVM IR, it would still have to JIT the 
IR to wasm in order to maintain the sandboxing guarantees. It would also tie 
the query engine to LLVM, whereas there may be other wasm generators out there.

2) The idea would be for the user to use some external tool or compiler that 
generates wasm, and pass the wasm to the query engine. This would mean that you 
could write a UDF in any language of your choosing. It seems like it wouldn’t 
be much work to use your existing numpy + numba pipeline as well, you would 
just have to add a step to generate wasm from your LLVM IR before passing it to 
the engine.

Sasha


On 26 Apr 2022, at 10:39, Li Jin wrote:

This is a very interesting topic and one that we care a lot about when
using/thinking about Arrow compute.

I come from Python data analytics where most of our users use Pandas/Numpy.
This is also my first time learning about WASM and my previous
understanding of "Python UDF in Arrow C++ compute" engine is more of:

UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
Execute LLVM IR within Arrow C++ engine on Arrow Arrays

Which in my understanding is similar to UDFs in Impala with LLVM IR that
Wes mentioned.

I wonder how WASM would potentially change things. A couple of questions:
(1) What is the advantage of using WASM instead of something like LLVM IR?
(2) Do we envision using something like a NumPy API as the language for
writing these UDFs, or something completely different? (Another DSL?)

Li


On Tue, Apr 26, 2022 at 11:04 AM Weston Pace  wrote:

In addition to the memory copy it looks like WASM is going to bounds
check all loads/stores.  It does, at least, have some vectorized
load/store operations so that can help amortize the cost.  It appears
you aren't going to get the same performance as native today using
WASM but I'm guessing that is an active area of research and
investment.


On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
 wrote:

I need to correct myself here - it is currently not possible to pass memory
at zero cost between the engine and WASM interpreter. This is related to
your point about safety - WASM provides memory safety guarantees because it
controls the memory region that it can read from and write to. Therefore,
currently passing data from and into WASM requires a memcopy.

There is a proposal [1] to improve the situation, but currently would incur
a cost in the query engine, since we would need to memcopy the regions
around.

I forgot that on my poc I passed the parquet file from js to WASM and
de-serialized it to arrow directly in wasm - so memory was already being
allocated from within WASM sandbox, not JS. Sorry for the confusion.

[1] https://github.com/WebAssembly/design/issues/1439

Best,
Jorge



On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou wrote:




On 26/04/2022 at 16:30, Gavin Ray wrote:

Antoine, sandboxing comes into play from two places:

1) The WASM specification itself, which puts a bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above

The wasmtime docs have a pretty solid section covering the sandboxing
guarantees of WASM, and then the interpreter-specific behavior/abilities of
wasmtime FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core


This doesn't really answer my question, does it?




On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou wrote:




On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:

Would WASM be able to interact in-process with non-WASM buffers safely?

AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:

1. compile the C++/Rust/etc UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the c data interface that
consumes (binary, c data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass binary and c data interface pointers from the query engine program
to the interpreter with WASM-compiled middleware

Ok, but the key word in my question was "safely". What mechanisms are in
place such that the WASM user function will not access Arrow buffers out
of bounds? Nothing really stands out in
https://webassembly.github.io/spec/core/index.html, but it's the first
time I try to have a look at the WebAssembly spec.

Regards

Antoine.




Step 2 is necessary to read the buffers from FFI and output the result back
from the interpreter once the UDF is done, similar to what 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Sasha Krassovsky
I think I can help answer these:
1) LLVM IR is an intermediate representation for compilers; WASM is an open
standard for sandboxed computation. They fulfill different but complementary
roles. If the query engine were handed LLVM IR, it would still have to JIT the
IR to wasm in order to maintain the sandboxing guarantees. It would also tie
the query engine to LLVM, whereas there may be other wasm generators out there.

2) The idea would be for the user to use some external tool or compiler that
generates wasm, and pass the wasm to the query engine. This would mean that you
could write a UDF in any language of your choosing. It seems like it wouldn’t
be much work to use your existing numpy + numba pipeline as well; you would
just have to add a step to generate wasm from your LLVM IR before passing it to
the engine.

Sasha

> On 26 Apr 2022, at 10:39, Li Jin wrote:
> 
> This is a very interesting topic and one that we care a lot about when
> using/thinking about Arrow compute.
> 
> I come from Python data analytics where most of our users use Pandas/Numpy.
> This is also my first time learning about WASM and my previous
> understanding of "Python UDF in Arrow C++ compute" engine is more of:
> 
> UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
> Execute LLVM IR within Arrow C++ engine on Arrow Arrays
> 
> Which in my understanding is similar to UDFs in Impala with LLVM IR that
> Wes mentioned.
> 
> I wonder how WASM would potentially change things. A couple of questions:
> (1) What is the advantage of using WASM instead of something like LLVM IR?
> (2) Do we envision using something like a NumPy API as the language for
> writing these UDFs, or something completely different? (Another DSL?)
> 
> Li
> 
>> On Tue, Apr 26, 2022 at 11:04 AM Weston Pace  wrote:
>> 
>> In addition to the memory copy it looks like WASM is going to bounds
>> check all loads/stores.  It does, at least, have some vectorized
>> load/store operations so that can help amortize the cost.  It appears
>> you aren't going to get the same performance as native today using
>> WASM but I'm guessing that is an active area of research and
>> investment.
>> 
>>> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
>>>  wrote:
>>> 
>>> I need to correct myself here - it is currently not possible to pass
>> memory
>>> at zero cost between the engine and WASM interpreter. This is related to
>>> your point about safety - WASM provides memory safety guarantees because
>> it
>>> controls the memory region that it can read from and write to. Therefore,
>>> currently passing data from and into WASM requires a memcopy.
>>> 
>>> There is a proposal [1] to improve the situation, but currently would
>> incur
>>> a cost in the query engine, since we would need to memcopy the regions
>>> around.
>>> 
>>> I forgot that on my poc I passed the parquet file from js to WASM and
>>> de-serialized it to arrow directly in wasm - so memory was already being
>>> allocated from within WASM sandbox, not JS. Sorry for the confusion.
>>> 
>>> [1] https://github.com/WebAssembly/design/issues/1439
>>> 
>>> Best,
>>> Jorge
>>> 
>>> 
>>> 
>>> On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou 
>> wrote:
>>> 
 
 On 26/04/2022 at 16:30, Gavin Ray wrote:
> Antoine, sandboxing comes into play from two places:
> 
> 1) The WASM specification itself, which puts a bounds on the types of
> behaviors possible
> 2) The implementation of the WASM bytecode interpreter chosen, like
>> Jorge
> mentioned in the comment above
> 
> The wasmtime docs have a pretty solid section covering the sandboxing
> guarantees of WASM, and then the interpreter-specific
>> behavior/abilities
 of
> wasmtime FWIW:
> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
 
 This doesn't really answer my question, does it?
 
 
> 
> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou 
 wrote:
> 
>> 
>> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
 Would WASM be able to interact in-process with non-WASM buffers
 safely?
>>> 
>>> AFAIK yes. My understanding from playing with it in JS is that a
>>> WASM-backed udf execution would be something like:
>>> 
>>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
>>> 2. provide a small WASM-compiled middleware of the c data interface
 that
>>> consumes (binary, c data interface pointers)
>>> 3. ship a WASM interpreter as part of the query engine
>>> 4. pass binary and c data interface pointers from the query engine
>> program
>>> to the interpreter with WASM-compiled middleware
>> 
>> Ok, but the key word in my question was "safely". What mechanisms
>> are in
>> place such that the WASM user function will not access Arrow
>> buffers out
>> of bounds? Nothing really stands out in
>> https://webassembly.github.io/spec/core/index.html, but it's the
>> first
>> 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Li Jin
This is a very interesting topic and one that we care a lot about when
using/thinking about Arrow compute.

I come from Python data analytics where most of our users use Pandas/Numpy.
This is also my first time learning about WASM and my previous
understanding of "Python UDF in Arrow C++ compute" engine is more of:

UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
Execute LLVM IR within Arrow C++ engine on Arrow Arrays

Which in my understanding is similar to UDFs in Impala with LLVM IR that
Wes mentioned.

I wonder how WASM would potentially change things. A couple of questions:
(1) What is the advantage of using WASM instead of something like LLVM IR?
(2) Do we envision using something like a NumPy API as the language for
writing these UDFs, or something completely different? (Another DSL?)

Li

On Tue, Apr 26, 2022 at 11:04 AM Weston Pace  wrote:

> In addition to the memory copy it looks like WASM is going to bounds
> check all loads/stores.  It does, at least, have some vectorized
> load/store operations so that can help amortize the cost.  It appears
> you aren't going to get the same performance as native today using
> WASM but I'm guessing that is an active area of research and
> investment.
>
> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
>  wrote:
> >
> > I need to correct myself here - it is currently not possible to pass
> memory
> > at zero cost between the engine and WASM interpreter. This is related to
> > your point about safety - WASM provides memory safety guarantees because
> it
> > controls the memory region that it can read from and write to. Therefore,
> > currently passing data from and into WASM requires a memcopy.
> >
> > There is a proposal [1] to improve the situation, but currently would
> incur
> > a cost in the query engine, since we would need to memcopy the regions
> > around.
> >
> > I forgot that on my poc I passed the parquet file from js to WASM and
> > de-serialized it to arrow directly in wasm - so memory was already being
> > allocated from within WASM sandbox, not JS. Sorry for the confusion.
> >
> > [1] https://github.com/WebAssembly/design/issues/1439
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > On 26/04/2022 at 16:30, Gavin Ray wrote:
> > > > Antoine, sandboxing comes into play from two places:
> > > >
> > > > 1) The WASM specification itself, which puts a bounds on the types of
> > > > behaviors possible
> > > > 2) The implementation of the WASM bytecode interpreter chosen, like
> Jorge
> > > > mentioned in the comment above
> > > >
> > > > The wasmtime docs have a pretty solid section covering the sandboxing
> > > > guarantees of WASM, and then the interpreter-specific
> behavior/abilities
> > > of
> > > > wasmtime FWIW:
> > > > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> > >
> > > This doesn't really answer my question, does it?
> > >
> > >
> > > >
> > > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > >>
> > > >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> > >  Would WASM be able to interact in-process with non-WASM buffers
> > > safely?
> > > >>>
> > > >>> AFAIK yes. My understanding from playing with it in JS is that a
> > > >>> WASM-backed udf execution would be something like:
> > > >>>
> > > >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > > >>> 2. provide a small WASM-compiled middleware of the c data interface
> > > that
> > > >>> consumes (binary, c data interface pointers)
> > > >>> 3. ship a WASM interpreter as part of the query engine
> > > >>> 4. pass binary and c data interface pointers from the query engine
> > > >> program
> > > >>> to the interpreter with WASM-compiled middleware
> > > >>
> > > >> Ok, but the key word in my question was "safely". What mechanisms
> are in
> > > >> place such that the WASM user function will not access Arrow
> buffers out
> > > >> of bounds? Nothing really stands out in
> > > >> https://webassembly.github.io/spec/core/index.html, but it's the
> first
> > > >> time I try to have a look at the WebAssembly spec.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>>
> > > >>> Step 2 is necessary to read the buffers from FFI and output the
> result
> > > >> back
> > > >>> from the interpreter once the UDF is done, similar to what we do in
> > > >>> datafusion to run Python from Rust. In the case of datafusion the
> > > >> "binary"
> > > >>> is a Python function, which has security implications since the
> Python
> > > >>> interpreter allows everything by default.
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou  >
> > > >> wrote:
> > > >>>
> > > 
> > >  On 25/04/2022 at 23:04, David Li wrote:
> > > > The WebAssembly documentation has a rundown of the techniques
> used:
> > >  https://webassembly.org/docs/security/
> > > >
> > > 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Weston Pace
In addition to the memory copy it looks like WASM is going to bounds
check all loads/stores.  It does, at least, have some vectorized
load/store operations so that can help amortize the cost.  It appears
you aren't going to get the same performance as native today using
WASM but I'm guessing that is an active area of research and
investment.
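
To make the bounds-checking point concrete, here is a toy sketch in Python
(not how any real WASM engine is implemented). Linear memory is modelled as
a bytearray, every load/store is checked against its length, and a
hypothetical 4-lane load shows how one check can cover several values. The
names LinearMemory, Trap and load_i32x4 are invented for this example.

```python
import struct


class Trap(Exception):
    """Raised when an access falls outside linear memory."""


class LinearMemory:
    """Toy model of a sandbox's linear memory with checked accesses."""

    def __init__(self, size):
        self.data = bytearray(size)

    def _check(self, addr, width):
        # Every access pays for this comparison before touching memory.
        if addr < 0 or addr + width > len(self.data):
            raise Trap(f"out-of-bounds access at {addr} (+{width} bytes)")

    def store_i32(self, addr, value):
        self._check(addr, 4)
        struct.pack_into("<i", self.data, addr, value)

    def load_i32(self, addr):
        self._check(addr, 4)
        return struct.unpack_from("<i", self.data, addr)[0]

    def load_i32x4(self, addr):
        # A single check covers four 32-bit lanes (16 bytes).
        self._check(addr, 16)
        return struct.unpack_from("<4i", self.data, addr)
```

With a 32-byte memory, load_i32(30) raises Trap instead of reading past the
end, while load_i32x4(0) performs one check for four values.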

On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
 wrote:
>
> I need to correct myself here - it is currently not possible to pass memory
> at zero cost between the engine and WASM interpreter. This is related to
> your point about safety - WASM provides memory safety guarantees because it
> controls the memory region that it can read from and write to. Therefore,
> currently passing data from and into WASM requires a memcopy.
>
> There is a proposal [1] to improve the situation, but currently would incur
> a cost in the query engine, since we would need to memcopy the regions
> around.
>
> I forgot that on my poc I passed the parquet file from js to WASM and
> de-serialized it to arrow directly in wasm - so memory was already being
> allocated from within WASM sandbox, not JS. Sorry for the confusion.
>
> [1] https://github.com/WebAssembly/design/issues/1439
>
> Best,
> Jorge
>
>
>
> On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou  wrote:
>
> >
> > On 26/04/2022 at 16:30, Gavin Ray wrote:
> > > Antoine, sandboxing comes into play from two places:
> > >
> > > 1) The WASM specification itself, which puts a bounds on the types of
> > > behaviors possible
> > > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > > mentioned in the comment above
> > >
> > > The wasmtime docs have a pretty solid section covering the sandboxing
> > > guarantees of WASM, and then the interpreter-specific behavior/abilities
> > of
> > > wasmtime FWIW:
> > > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> >
> > This doesn't really answer my question, does it?
> >
> >
> > >
> > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >  Would WASM be able to interact in-process with non-WASM buffers
> > safely?
> > >>>
> > >>> AFAIK yes. My understanding from playing with it in JS is that a
> > >>> WASM-backed udf execution would be something like:
> > >>>
> > >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > >>> 2. provide a small WASM-compiled middleware of the c data interface
> > that
> > >>> consumes (binary, c data interface pointers)
> > >>> 3. ship a WASM interpreter as part of the query engine
> > >>> 4. pass binary and c data interface pointers from the query engine
> > >> program
> > >>> to the interpreter with WASM-compiled middleware
> > >>
> > >> Ok, but the key word in my question was "safely". What mechanisms are in
> > >> place such that the WASM user function will not access Arrow buffers out
> > >> of bounds? Nothing really stands out in
> > >> https://webassembly.github.io/spec/core/index.html, but it's the first
> > >> time I try to have a look at the WebAssembly spec.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>>
> > >>> Step 2 is necessary to read the buffers from FFI and output the result
> > >> back
> > >>> from the interpreter once the UDF is done, similar to what we do in
> > >>> datafusion to run Python from Rust. In the case of datafusion the
> > >> "binary"
> > >>> is a Python function, which has security implications since the Python
> > >>> interpreter allows everything by default.
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 
> > >> wrote:
> > >>>
> > 
> >  On 25/04/2022 at 23:04, David Li wrote:
> > > The WebAssembly documentation has a rundown of the techniques used:
> >  https://webassembly.org/docs/security/
> > >
> > > I think usually you would run WASM in-process, though we could indeed
> >  also put it in a subprocess to further isolate things.
> > 
> >  Would WASM be able to interact in-process with non-WASM buffers
> > safely?
> >  It's not obvious from reading the page above.
> > 
> > 
> > >
> > > It would be interesting to define the Flight "harness" protocol.
> >  Handling heterogeneous arguments may require some evolution in Flight
> > >> (e.g.
> >  if the function is non scalar and arguments are of different length -
> > >> we'd
> >  need something like the ColumnBag proposal, so this might be a good
> > >> reason
> >  to revive that).
> > >
> > > On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > >> On 25/04/2022 at 22:19, Wes McKinney wrote:
> > >>> I was going to reply to this e-mail thread on user@ but thought I
> > >>> would start a new thread on dev@.
> > >>>
> > >>> Executing user-defined functions in memory, especially untrusted
> > >>> functions, in general is unsafe. For "trusted" 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
I need to correct myself here - it is currently not possible to pass memory
at zero cost between the engine and WASM interpreter. This is related to
your point about safety - WASM provides memory safety guarantees because it
controls the memory region that it can read from and write to. Therefore,
currently passing data from and into WASM requires a memcopy.
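
A minimal sketch of what that copy looks like, in plain Python and without
any real WASM runtime: the host values are copied into a separate bytearray
standing in for the sandbox's linear memory, the UDF only ever sees that
buffer, and the result is copied back out. The function names
run_sandboxed_udf and add_one are hypothetical.

```python
import struct


def run_sandboxed_udf(values, udf):
    """Copy int32 `values` into a private buffer, run `udf`, copy results out."""
    n = len(values)
    sandbox = bytearray(4 * n)  # stands in for WASM linear memory
    struct.pack_into(f"<{n}i", sandbox, 0, *values)  # copy in
    udf(sandbox, n)  # the UDF can only touch `sandbox`
    return list(struct.unpack_from(f"<{n}i", sandbox, 0))  # copy out


def add_one(memory, n):
    """Example UDF: add 1 to each int32 stored in `memory`."""
    for i in range(n):
        (v,) = struct.unpack_from("<i", memory, 4 * i)
        struct.pack_into("<i", memory, 4 * i, v + 1)
```

Calling run_sandboxed_udf([1, 2, 3], add_one) leaves the host list untouched
and returns a new list, at the price of two copies per call.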

There is a proposal [1] to improve the situation, but for now we would incur
a cost in the query engine, since we would need to memcopy the regions
around.

I forgot that in my PoC I passed the parquet file from JS to WASM and
deserialized it to Arrow directly in WASM, so memory was already being
allocated from within the WASM sandbox, not JS. Sorry for the confusion.

[1] https://github.com/WebAssembly/design/issues/1439

Best,
Jorge



On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou  wrote:

>
> On 26/04/2022 at 16:30, Gavin Ray wrote:
> > Antoine, sandboxing comes into play from two places:
> >
> > 1) The WASM specification itself, which puts a bounds on the types of
> > behaviors possible
> > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > mentioned in the comment above
> >
> > The wasmtime docs have a pretty solid section covering the sandboxing
> > guarantees of WASM, and then the interpreter-specific behavior/abilities
> of
> > wasmtime FWIW:
> > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
> >
> > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
>  Would WASM be able to interact in-process with non-WASM buffers
> safely?
> >>>
> >>> AFAIK yes. My understanding from playing with it in JS is that a
> >>> WASM-backed udf execution would be something like:
> >>>
> >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>> 2. provide a small WASM-compiled middleware of the c data interface
> that
> >>> consumes (binary, c data interface pointers)
> >>> 3. ship a WASM interpreter as part of the query engine
> >>> 4. pass binary and c data interface pointers from the query engine
> >> program
> >>> to the interpreter with WASM-compiled middleware
> >>
> >> Ok, but the key word in my question was "safely". What mechanisms are in
> >> place such that the WASM user function will not access Arrow buffers out
> >> of bounds? Nothing really stands out in
> >> https://webassembly.github.io/spec/core/index.html, but it's the first
> >> time I try to have a look at the WebAssembly spec.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Step 2 is necessary to read the buffers from FFI and output the result
> >> back
> >>> from the interpreter once the UDF is done, similar to what we do in
> >>> datafusion to run Python from Rust. In the case of datafusion the
> >> "binary"
> >>> is a Python function, which has security implications since the Python
> >>> interpreter allows everything by default.
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>>
> >>>
> >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 
> >> wrote:
> >>>
> 
>  On 25/04/2022 at 23:04, David Li wrote:
> > The WebAssembly documentation has a rundown of the techniques used:
>  https://webassembly.org/docs/security/
> >
> > I think usually you would run WASM in-process, though we could indeed
>  also put it in a subprocess to further isolate things.
> 
>  Would WASM be able to interact in-process with non-WASM buffers
> safely?
>  It's not obvious from reading the page above.
> 
> 
> >
> > It would be interesting to define the Flight "harness" protocol.
>  Handling heterogeneous arguments may require some evolution in Flight
> >> (e.g.
>  if the function is non scalar and arguments are of different length -
> >> we'd
>  need something like the ColumnBag proposal, so this might be a good
> >> reason
>  to revive that).
> >
> > On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>> I was going to reply to this e-mail thread on user@ but thought I
> >>> would start a new thread on dev@.
> >>>
> >>> Executing user-defined functions in memory, especially untrusted
> >>> functions, in general is unsafe. For "trusted" functions, having an
> >>> in-memory API for writing them in user languages is very useful. I
> >>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>> would allow UDFs to have performance consistent with built-ins
> >>> (because built-in functions are all inlined into code-generated
> >>> expressions), but segfaults would bring down the server, so only
> >>> admins could be trusted to add new UDFs.
> >>>
> >>> However, I wonder if we should eventually define an "external UDF"
> >>> protocol and an example UDF "harness", using Flight to do RPC
> across
> >>> the process 

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread David Li
Ah, fair point Antoine. Yes, I believe you are expected to copy data in/out 
right now: https://github.com/WebAssembly/design/issues/1162

On Tue, Apr 26, 2022, at 10:43, Antoine Pitrou wrote:
> On 26/04/2022 at 16:30, Gavin Ray wrote:
>> Antoine, sandboxing comes into play from two places:
>> 
>> 1) The WASM specification itself, which puts a bounds on the types of
>> behaviors possible
>> 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
>> mentioned in the comment above
>> 
>> The wasmtime docs have a pretty solid section covering the sandboxing
>> guarantees of WASM, and then the interpreter-specific behavior/abilities of
>> wasmtime FWIW:
>> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
>> 
>> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou  wrote:
>> 
>>>
>>> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> Would WASM be able to interact in-process with non-WASM buffers safely?

 AFAIK yes. My understanding from playing with it in JS is that a
 WASM-backed udf execution would be something like:

 1. compile the C++/Rust/etc UDF to WASM (a binary format)
 2. provide a small WASM-compiled middleware of the c data interface that
 consumes (binary, c data interface pointers)
 3. ship a WASM interpreter as part of the query engine
 4. pass binary and c data interface pointers from the query engine
>>> program
 to the interpreter with WASM-compiled middleware
>>>
>>> Ok, but the key word in my question was "safely". What mechanisms are in
>>> place such that the WASM user function will not access Arrow buffers out
>>> of bounds? Nothing really stands out in
>>> https://webassembly.github.io/spec/core/index.html, but it's the first
>>> time I try to have a look at the WebAssembly spec.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>

>>>> Step 2 is necessary to read the buffers from FFI and output the result back
>>>> from the interpreter once the UDF is done, similar to what we do in
>>>> datafusion to run Python from Rust. In the case of datafusion the "binary"
>>>> is a Python function, which has security implications since the Python
>>>> interpreter allows everything by default.
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou wrote:

>>>>>
>>>>> Le 25/04/2022 à 23:04, David Li a écrit :
>>>>>> The WebAssembly documentation has a rundown of the techniques used:
>>>>>> https://webassembly.org/docs/security/
>>>>>>
>>>>>> I think usually you would run WASM in-process, though we could indeed
>>>>>> also put it in a subprocess to further isolate things.
>>>>>
>>>>> Would WASM be able to interact in-process with non-WASM buffers safely?
>>>>> It's not obvious from reading the page above.
>>>>>
>>>>>> It would be interesting to define the Flight "harness" protocol.
>>>>>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
>>>>>> if the function is non scalar and arguments are of different length - we'd
>>>>>> need something like the ColumnBag proposal, so this might be a good reason
>>>>>> to revive that).
>>>>>>
>>>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
>>>>>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
>>>>>>>> I was going to reply to this e-mail thread on user@ but thought I
>>>>>>>> would start a new thread on dev@.
>>>>>>>>
>>>>>>>> Executing user-defined functions in memory, especially untrusted
>>>>>>>> functions, in general is unsafe. For "trusted" functions, having an
>>>>>>>> in-memory API for writing them in user languages is very useful. I
>>>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
>>>>>>>> would allow UDFs to have performance consistent with built-ins
>>>>>>>> (because built-in functions are all inlined into code-generated
>>>>>>>> expressions), but segfaults would bring down the server, so only
>>>>>>>> admins could be trusted to add new UDFs.
>>>>>>>>
>>>>>>>> However, I wonder if we should eventually define an "external UDF"
>>>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
>>>>>>>> the process boundaries. So the idea is that an external local UDF
>>>>>>>> Flight execution service is spun up, and then data is sent to the UDF
>>>>>>>> in a DoExchange call.
>>>>>>>>
>>>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
>>>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
>>>>>>>> functions to be run safely in-process.
>>>>>>>
>>>>>>> How does the sandboxing work in this case? Is it simply executing in a
>>>>>>> separate process with restricted capabilities, or are other mechanisms
>>>>>>> put in place?


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou



Le 26/04/2022 à 16:30, Gavin Ray a écrit :

Antoine, sandboxing comes into play from two places:

1) The WASM specification itself, which puts bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above

The wasmtime docs have a pretty solid section covering the sandboxing
guarantees of WASM, and then the interpreter-specific behavior/abilities of
wasmtime FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core


This doesn't really answer my question, does it?




On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou  wrote:



Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :

Would WASM be able to interact in-process with non-WASM buffers safely?


AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:

1. compile the C++/Rust/etc UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the c data interface that
consumes (binary, c data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass binary and c data interface pointers from the query engine

program

to the interpreter with WASM-compiled middleware


Ok, but the key word in my question was "safely". What mechanisms are in
place such that the WASM user function will not access Arrow buffers out
of bounds? Nothing really stands out in
https://webassembly.github.io/spec/core/index.html, but it's the first
time I try to have a look at the WebAssembly spec.

Regards

Antoine.




Step 2 is necessary to read the buffers from FFI and output the result

back

from the interpreter once the UDF is done, similar to what we do in
datafusion to run Python from Rust. In the case of datafusion the

"binary"

is a Python function, which has security implications since the Python
interpreter allows everything by default.

Best,
Jorge



On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 

wrote:




Le 25/04/2022 à 23:04, David Li a écrit :

The WebAssembly documentation has a rundown of the techniques used:

https://webassembly.org/docs/security/


I think usually you would run WASM in-process, though we could indeed

also put it in a subprocess to further isolate things.

Would WASM be able to interact in-process with non-WASM buffers safely?
It's not obvious from reading the page above.




It would be interesting to define the Flight "harness" protocol.

Handling heterogeneous arguments may require some evolution in Flight

(e.g.

if the function is non scalar and arguments are of different length -

we'd

need something like the ColumnBag proposal, so this might be a good

reason

to revive that).


On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:

Le 25/04/2022 à 22:19, Wes McKinney a écrit :

I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process.


How does the sandboxing work in this case? Is it simply executing in a
separate process with restricted capabilities, or are other mechanisms
put in place?










Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Gavin Ray
Antoine, sandboxing comes into play from two places:

1) The WASM specification itself, which puts bounds on the types of
behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above

The wasmtime docs have a pretty solid section covering the sandboxing
guarantees of WASM, and then the interpreter-specific behavior/abilities of
wasmtime FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core

On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou  wrote:

>
> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >
> > AFAIK yes. My understanding from playing with it in JS is that a
> > WASM-backed udf execution would be something like:
> >
> > 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > 2. provide a small WASM-compiled middleware of the c data interface that
> > consumes (binary, c data interface pointers)
> > 3. ship a WASM interpreter as part of the query engine
> > 4. pass binary and c data interface pointers from the query engine
> program
> > to the interpreter with WASM-compiled middleware
>
> Ok, but the key word in my question was "safely". What mechanisms are in
> place such that the WASM user function will not access Arrow buffers out
> of bounds? Nothing really stands out in
> https://webassembly.github.io/spec/core/index.html, but it's the first
> time I try to have a look at the WebAssembly spec.
>
> Regards
>
> Antoine.
>
>
> >
> > Step 2 is necessary to read the buffers from FFI and output the result
> back
> > from the interpreter once the UDF is done, similar to what we do in
> > datafusion to run Python from Rust. In the case of datafusion the
> "binary"
> > is a Python function, which has security implications since the Python
> > interpreter allows everything by default.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 25/04/2022 à 23:04, David Li a écrit :
> >>> The WebAssembly documentation has a rundown of the techniques used:
> >> https://webassembly.org/docs/security/
> >>>
> >>> I think usually you would run WASM in-process, though we could indeed
> >> also put it in a subprocess to further isolate things.
> >>
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >> It's not obvious from reading the page above.
> >>
> >>
> >>>
> >>> It would be interesting to define the Flight "harness" protocol.
> >> Handling heterogeneous arguments may require some evolution in Flight
> (e.g.
> >> if the function is non scalar and arguments are of different length -
> we'd
> >> need something like the ColumnBag proposal, so this might be a good
> reason
> >> to revive that).
> >>>
> >>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >>>>> I was going to reply to this e-mail thread on user@ but thought I
> >>>>> would start a new thread on dev@.
> >>>>>
> >>>>> Executing user-defined functions in memory, especially untrusted
> >>>>> functions, in general is unsafe. For "trusted" functions, having an
> >>>>> in-memory API for writing them in user languages is very useful. I
> >>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>>>> would allow UDFs to have performance consistent with built-ins
> >>>>> (because built-in functions are all inlined into code-generated
> >>>>> expressions), but segfaults would bring down the server, so only
> >>>>> admins could be trusted to add new UDFs.
> >>>>>
> >>>>> However, I wonder if we should eventually define an "external UDF"
> >>>>> protocol and an example UDF "harness", using Flight to do RPC across
> >>>>> the process boundaries. So the idea is that an external local UDF
> >>>>> Flight execution service is spun up, and then data is sent to the UDF
> >>>>> in a DoExchange call.
> >>>>>
> >>>>> As Jacques pointed out in an interview [1], a compelling solution to
> >>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>>>> functions to be run safely in-process.
> >>>>
> >>>> How does the sandboxing work in this case? Is it simply executing in a
> >>>> separate process with restricted capabilities, or are other mechanisms
> >>>> put in place?
> >>
> >
>


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou



Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :

Would WASM be able to interact in-process with non-WASM buffers safely?


AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:

1. compile the C++/Rust/etc UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the c data interface that
consumes (binary, c data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass binary and c data interface pointers from the query engine program
to the interpreter with WASM-compiled middleware


Ok, but the key word in my question was "safely". What mechanisms are in 
place such that the WASM user function will not access Arrow buffers out 
of bounds? Nothing really stands out in 
https://webassembly.github.io/spec/core/index.html, but it's the first 
time I try to have a look at the WebAssembly spec.


Regards

Antoine.




Step 2 is necessary to read the buffers from FFI and output the result back
from the interpreter once the UDF is done, similar to what we do in
datafusion to run Python from Rust. In the case of datafusion the "binary"
is a Python function, which has security implications since the Python
interpreter allows everything by default.

Best,
Jorge



On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou  wrote:



Le 25/04/2022 à 23:04, David Li a écrit :

The WebAssembly documentation has a rundown of the techniques used:

https://webassembly.org/docs/security/


I think usually you would run WASM in-process, though we could indeed

also put it in a subprocess to further isolate things.

Would WASM be able to interact in-process with non-WASM buffers safely?
It's not obvious from reading the page above.




It would be interesting to define the Flight "harness" protocol.

Handling heterogeneous arguments may require some evolution in Flight (e.g.
if the function is non scalar and arguments are of different length - we'd
need something like the ColumnBag proposal, so this might be a good reason
to revive that).


On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:

Le 25/04/2022 à 22:19, Wes McKinney a écrit :

I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process.


How does the sandboxing work in this case? Is it simply executing in a
separate process with restricted capabilities, or are other mechanisms
put in place?






Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Jorge Cardoso Leitão
> Would WASM be able to interact in-process with non-WASM buffers safely?

AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed udf execution would be something like:

1. compile the C++/Rust/etc UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the c data interface that
consumes (binary, c data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass binary and c data interface pointers from the query engine program
to the interpreter with WASM-compiled middleware

Step 2 is necessary to read the buffers from FFI and output the result back
from the interpreter once the UDF is done, similar to what we do in
datafusion to run Python from Rust. In the case of datafusion the "binary"
is a Python function, which has security implications since the Python
interpreter allows everything by default.

Best,
Jorge



On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou  wrote:

>
> Le 25/04/2022 à 23:04, David Li a écrit :
> > The WebAssembly documentation has a rundown of the techniques used:
> https://webassembly.org/docs/security/
> >
> > I think usually you would run WASM in-process, though we could indeed
> also put it in a subprocess to further isolate things.
>
> Would WASM be able to interact in-process with non-WASM buffers safely?
> It's not obvious from reading the page above.
>
>
> >
> > It would be interesting to define the Flight "harness" protocol.
> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> if the function is non scalar and arguments are of different length - we'd
> need something like the ColumnBag proposal, so this might be a good reason
> to revive that).
> >
> > On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >>> I was going to reply to this e-mail thread on user@ but thought I
> >>> would start a new thread on dev@.
> >>>
> >>> Executing user-defined functions in memory, especially untrusted
> >>> functions, in general is unsafe. For "trusted" functions, having an
> >>> in-memory API for writing them in user languages is very useful. I
> >>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>> would allow UDFs to have performance consistent with built-ins
> >>> (because built-in functions are all inlined into code-generated
> >>> expressions), but segfaults would bring down the server, so only
> >>> admins could be trusted to add new UDFs.
> >>>
> >>> However, I wonder if we should eventually define an "external UDF"
> >>> protocol and an example UDF "harness", using Flight to do RPC across
> >>> the process boundaries. So the idea is that an external local UDF
> >>> Flight execution service is spun up, and then data is sent to the UDF
> >>> in a DoExchange call.
> >>>
> >>> As Jacques pointed out in an interview [1], a compelling solution to
> >>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>> functions to be run safely in-process.
> >>
> >> How does the sandboxing work in this case? Is it simply executing in a
> >> separate process with restricted capabilities, or are other mechanisms
> >> put in place?
>


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou



Le 25/04/2022 à 23:04, David Li a écrit :

The WebAssembly documentation has a rundown of the techniques used: 
https://webassembly.org/docs/security/

I think usually you would run WASM in-process, though we could indeed also put 
it in a subprocess to further isolate things.


Would WASM be able to interact in-process with non-WASM buffers safely?
It's not obvious from reading the page above.




It would be interesting to define the Flight "harness" protocol. Handling 
heterogeneous arguments may require some evolution in Flight (e.g. if the function is non 
scalar and arguments are of different length - we'd need something like the ColumnBag 
proposal, so this might be a good reason to revive that).

On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:

Le 25/04/2022 à 22:19, Wes McKinney a écrit :

I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process.


How does the sandboxing work in this case? Is it simply executing in a
separate process with restricted capabilities, or are other mechanisms
put in place?


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Weston Pace
I think there is a certain amount of tricky "package management"
involved with such a harness.  For example, if I want to build my UDF
on top of tensorflow then I would need a version of the tensorflow C
libs that has been compiled to WASM and (potentially) language
runtimes for whatever language users might want to write the
computation in.  I wonder if there are existing WASM solutions for
this kind of challenge.

On Mon, Apr 25, 2022 at 11:05 AM David Li  wrote:
>
> The WebAssembly documentation has a rundown of the techniques used: 
> https://webassembly.org/docs/security/
>
> I think usually you would run WASM in-process, though we could indeed also 
> put it in a subprocess to further isolate things.
>
> It would be interesting to define the Flight "harness" protocol. Handling 
> heterogeneous arguments may require some evolution in Flight (e.g. if the 
> function is non scalar and arguments are of different length - we'd need 
> something like the ColumnBag proposal, so this might be a good reason to 
> revive that).
>
> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >> I was going to reply to this e-mail thread on user@ but thought I
> >> would start a new thread on dev@.
> >>
> >> Executing user-defined functions in memory, especially untrusted
> >> functions, in general is unsafe. For "trusted" functions, having an
> >> in-memory API for writing them in user languages is very useful. I
> >> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >> would allow UDFs to have performance consistent with built-ins
> >> (because built-in functions are all inlined into code-generated
> >> expressions), but segfaults would bring down the server, so only
> >> admins could be trusted to add new UDFs.
> >>
> >> However, I wonder if we should eventually define an "external UDF"
> >> protocol and an example UDF "harness", using Flight to do RPC across
> >> the process boundaries. So the idea is that an external local UDF
> >> Flight execution service is spun up, and then data is sent to the UDF
> >> in a DoExchange call.
> >>
> >> As Jacques pointed out in an interview [1], a compelling solution to
> >> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >> functions to be run safely in-process.
> >
> > How does the sandboxing work in this case? Is it simply executing in a
> > separate process with restricted capabilities, or are other mechanisms
> > put in place?


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread David Li
The WebAssembly documentation has a rundown of the techniques used: 
https://webassembly.org/docs/security/

I think usually you would run WASM in-process, though we could indeed also put 
it in a subprocess to further isolate things.

It would be interesting to define the Flight "harness" protocol. Handling 
heterogeneous arguments may require some evolution in Flight (e.g. if the 
function is non scalar and arguments are of different length - we'd need 
something like the ColumnBag proposal, so this might be a good reason to revive 
that).
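
As a rough illustration of the out-of-process harness idea, the sketch below runs a UDF in a separate worker process and exchanges batches with it one at a time. It uses JSON lines over pipes purely as a stand-in for a Flight DoExchange stream; it does not use Flight or Arrow at all, and `WORKER_SOURCE`, `udf` and `call_external_udf` are hypothetical names chosen for this example.

```python
import json
import subprocess
import sys

# Hypothetical worker program: reads one JSON batch per line, applies the
# UDF, and writes one JSON result per line. It runs in its own process.
WORKER_SOURCE = r"""
import json, sys

def udf(values):
    return [v * 2 for v in values]

for line in sys.stdin:
    batch = json.loads(line)
    print(json.dumps(udf(batch)), flush=True)
"""


def call_external_udf(batches):
    """Send each batch to the worker process and collect its replies in order."""
    results = []
    with subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    ) as worker:
        for batch in batches:
            worker.stdin.write(json.dumps(batch) + "\n")
            worker.stdin.flush()
            results.append(json.loads(worker.stdout.readline()))
    # Leaving the `with` block closes the pipes, which ends the worker's loop.
    return results
```

If the worker crashes, the host sees a failed read as an ordinary Python exception rather than a segfault in its own process, which is the isolation the harness is meant to buy. Batches of different lengths are sent as separate messages here; how a real protocol would carry arguments of different lengths in one call is exactly the open question raised above.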

On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
>> I was going to reply to this e-mail thread on user@ but thought I
>> would start a new thread on dev@.
>> 
>> Executing user-defined functions in memory, especially untrusted
>> functions, in general is unsafe. For "trusted" functions, having an
>> in-memory API for writing them in user languages is very useful. I
>> remember tinkering with adding UDFs in Impala with LLVM IR, which
>> would allow UDFs to have performance consistent with built-ins
>> (because built-in functions are all inlined into code-generated
>> expressions), but segfaults would bring down the server, so only
>> admins could be trusted to add new UDFs.
>> 
>> However, I wonder if we should eventually define an "external UDF"
>> protocol and an example UDF "harness", using Flight to do RPC across
>> the process boundaries. So the idea is that an external local UDF
>> Flight execution service is spun up, and then data is sent to the UDF
>> in a DoExchange call.
>> 
>> As Jacques pointed out in an interview [1], a compelling solution to
>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
>> functions to be run safely in-process.
>
> How does the sandboxing work in this case? Is it simply executing in a 
> separate process with restricted capabilities, or are other mechanisms 
> put in place?


Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Antoine Pitrou



Le 25/04/2022 à 22:19, Wes McKinney a écrit :

I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process.


How does the sandboxing work in this case? Is it simply executing in a 
separate process with restricted capabilities, or are other mechanisms 
put in place?




Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Gavin Ray
Sounds like a fantastic idea, and WASM seems a natural choice.

You get the ability to opt into IO if you want/need to, with WASI, but by
default you can rest assured that the worst-case consequences are contained.

On Mon, Apr 25, 2022 at 4:20 PM Wes McKinney  wrote:

> I was going to reply to this e-mail thread on user@ but thought I
> would start a new thread on dev@.
>
> Executing user-defined functions in memory, especially untrusted
> functions, in general is unsafe. For "trusted" functions, having an
> in-memory API for writing them in user languages is very useful. I
> remember tinkering with adding UDFs in Impala with LLVM IR, which
> would allow UDFs to have performance consistent with built-ins
> (because built-in functions are all inlined into code-generated
> expressions), but segfaults would bring down the server, so only
> admins could be trusted to add new UDFs.
>
> However, I wonder if we should eventually define an "external UDF"
> protocol and an example UDF "harness", using Flight to do RPC across
> the process boundaries. So the idea is that an external local UDF
> Flight execution service is spun up, and then data is sent to the UDF
> in a DoExchange call.
>
> As Jacques pointed out in an interview [1], a compelling solution to
> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> functions to be run safely in-process. However, we would need to
> harden and document the details of the interface between the host
> language and the user WASM code.
>
> Since there are many different potential kinds of user-defined
> functions aside from scalar functions, that increases the complexity /
> scope of specification work here also.
>
> - Wes
>
> [1]:
> https://reneeshah.medium.com/how-webassembly-gets-used-the-18-most-exciting-startups-building-with-wasm-939474e951db
>
> On Fri, Apr 22, 2022 at 2:09 PM David Li  wrote:
> >
> > This is currently being implemented for Python:
> https://github.com/apache/arrow/pull/12590 It may not land for 8.0.0 but
> should be there for 9.0.0, presumably.
> >
> > It is already possible in C++. The same APIs that built-in functions use
> to register themselves should be available to applications and there's a
> fairly trivial example of this in [1]. Such a function would also be
> available from Python/R/etc. if you could figure out how to
> package/distribute/load the application library appropriately.
> >
> > [1]:
> https://github.com/apache/arrow/blob/e1e782a4542817e8a6139d6d5e022b56abdbc81d/cpp/examples/arrow/compute_register_example.cc
> >
> > On Fri, Apr 22, 2022, at 15:04, Wenlei Xie wrote:
> >
> > Hi,
> >
> > I am wondering if I can define my own Arrow Compute function and use it,
> say in PyArrow? It looks like Compute Function has a FunctionRegistry, but I
> didn't find documentation about how to write your own Arrow Compute
> function (but maybe just didn't find the right place)
> >
> > Thank you so much!
> >
> > --
> > Best Regards,
> > Wenlei Xie
> >
> > Email: wenlei@gmail.com
> >
> >
>