I need to correct myself here: it is currently not possible to pass memory
at zero cost between the engine and the WASM interpreter. This is related to
your point about safety: WASM provides memory safety guarantees because it
controls the memory region that it can read from and write to. Therefore,
passing data into and out of WASM currently requires a memcopy.
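
To illustrate why the copy is needed, here is a toy Python model of a
sandboxed linear memory. It is only a sketch, not a real WASM runtime:
LinearMemory, copy_in, copy_out and guest_increment are hypothetical names
made up for this example. The guest can only address its own bounds-checked
region, so host bytes have to be copied in and results copied back out.

```python
class LinearMemory:
    """Toy stand-in for a WASM linear memory: one contiguous, bounds-checked region."""

    PAGE_SIZE = 64 * 1024  # WASM linear memory grows in 64 KiB pages

    def __init__(self, pages: int = 1) -> None:
        self._bytes = bytearray(pages * self.PAGE_SIZE)

    def _check(self, offset: int, length: int) -> None:
        # Every access is validated against the region's bounds.
        if offset < 0 or length < 0 or offset + length > len(self._bytes):
            raise IndexError("out-of-bounds access to linear memory")

    def copy_in(self, offset: int, data: bytes) -> None:
        """Host -> sandbox: the data is copied, not shared."""
        self._check(offset, len(data))
        self._bytes[offset:offset + len(data)] = data

    def copy_out(self, offset: int, length: int) -> bytes:
        """Sandbox -> host: returns a fresh copy of the requested range."""
        self._check(offset, length)
        return bytes(self._bytes[offset:offset + length])


def guest_increment(memory: LinearMemory, offset: int, length: int) -> None:
    """Hypothetical 'UDF': adds 1 to each byte, touching only linear memory."""
    data = memory.copy_out(offset, length)
    memory.copy_in(offset, bytes((b + 1) % 256 for b in data))


host_buffer = bytes([1, 2, 3, 4])              # lives outside the sandbox
memory = LinearMemory(pages=1)
memory.copy_in(0, host_buffer)                 # first memcopy: host -> sandbox
guest_increment(memory, 0, len(host_buffer))
result = memory.copy_out(0, len(host_buffer))  # second memcopy: sandbox -> host
```

In this model the host buffer is never touched by the guest, and any access
outside the region raises an error instead of reading host memory.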

There is a proposal [1] to improve the situation, but for now we would incur
a cost in the query engine, since we would need to memcopy the regions back
and forth.

I forgot that in my PoC I passed the Parquet file from JS to WASM and
deserialized it to Arrow directly in WASM, so memory was already being
allocated from within the WASM sandbox, not JS. Sorry for the confusion.

[1] https://github.com/WebAssembly/design/issues/1439

Best,
Jorge



On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 26/04/2022 à 16:30, Gavin Ray a écrit :
> > Antoine, sandboxing comes into play from two places:
> >
> > 1) The WASM specification itself, which puts a bounds on the types of
> > behaviors possible
> > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > mentioned in the comment above
> >
> > The wasmtime docs have a pretty solid section covering the sandboxing
> > guarantees of WASM, and then the interpreter-specific behavior/abilities
> of
> > wasmtime FWIW:
> > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
> >
> > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >>
> >> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
> >>>> Would WASM be able to interact in-process with non-WASM buffers
> safely?
> >>>
> >>> AFAIK yes. My understanding from playing with it in JS is that a
> >>> WASM-backed udf execution would be something like:
> >>>
> >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>> 2. provide a small WASM-compiled middleware of the c data interface
> that
> >>> consumes (binary, c data interface pointers)
> >>> 3. ship a WASM interpreter as part of the query engine
> >>> 4. pass binary and c data interface pointers from the query engine
> >> program
> >>> to the interpreter with WASM-compiled middleware
> >>
> >> Ok, but the key word in my question was "safely". What mechanisms are in
> >> place such that the WASM user function will not access Arrow buffers out
> >> of bounds? Nothing really stands out in
> >> https://webassembly.github.io/spec/core/index.html, but it's the first
> >> time I try to have a look at the WebAssembly spec.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Step 2 is necessary to read the buffers from FFI and output the result
> >> back
> >>> from the interpreter once the UDF is done, similar to what we do in
> >>> datafusion to run Python from Rust. In the case of datafusion the
> >> "binary"
> >>> is a Python function, which has security implications since the Python
> >>> interpreter allows everything by default.
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>>
> >>>
> >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <anto...@python.org>
> >> wrote:
> >>>
> >>>>
> >>>> Le 25/04/2022 à 23:04, David Li a écrit :
> >>>>> The WebAssembly documentation has a rundown of the techniques used:
> >>>> https://webassembly.org/docs/security/
> >>>>>
> >>>>> I think usually you would run WASM in-process, though we could indeed
> >>>> also put it in a subprocess to further isolate things.
> >>>>
> >>>> Would WASM be able to interact in-process with non-WASM buffers
> safely?
> >>>> It's not obvious from reading the page above.
> >>>>
> >>>>
> >>>>>
> >>>>> It would be interesting to define the Flight "harness" protocol.
> >>>> Handling heterogeneous arguments may require some evolution in Flight
> >> (e.g.
> >>>> if the function is non scalar and arguments are of different length -
> >> we'd
> >>>> need something like the ColumnBag proposal, so this might be a good
> >> reason
> >>>> to revive that).
> >>>>>
> >>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >>>>>>> I was going to reply to this e-mail thread on user@ but thought I
> >>>>>>> would start a new thread on dev@.
> >>>>>>>
> >>>>>>> Executing user-defined functions in memory, especially untrusted
> >>>>>>> functions, in general is unsafe. For "trusted" functions, having an
> >>>>>>> in-memory API for writing them in user languages is very useful. I
> >>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>>>>>> would allow UDFs to have performance consistent with built-ins
> >>>>>>> (because built-in functions are all inlined into code-generated
> >>>>>>> expressions), but segfaults would bring down the server, so only
> >>>>>>> admins could be trusted to add new UDFs.
> >>>>>>>
> >>>>>>> However, I wonder if we should eventually define an "external UDF"
> >>>>>>> protocol and an example UDF "harness", using Flight to do RPC
> across
> >>>>>>> the process boundaries. So the idea is that an external local UDF
> >>>>>>> Flight execution service is spun up, and then data is sent to the
> UDF
> >>>>>>> in a DoExchange call.
> >>>>>>>
> >>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
> >>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>>>>>> functions to be run safely in-process.
> >>>>>>
> >>>>>> How does the sandboxing work in this case? Is it simply executing
> in a
> >>>>>> separate process with restricted capabilities, or are other
> mechanisms
> >>>>>> put in place?
> >>>>
> >>>
> >>
> >
>
