I need to correct myself here: it is currently not possible to pass memory at zero cost between the engine and the WASM interpreter. This is related to your point about safety: WASM provides its memory safety guarantees because it controls the memory region that a module can read from and write to. Therefore, passing data into and out of WASM currently requires a memcopy.
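
To make the copy concrete, here is a toy Python model of the situation, not a real WASM runtime. `LinearMemory`, `guest_add_one` and `call_udf` are hypothetical names made up for this sketch; the only assumption carried over from WASM is that the guest can address nothing but its own bounds-checked memory, so the host has to copy its buffer in and copy the result back out.

```python
class LinearMemory:
    """Toy stand-in for a WASM linear memory: one contiguous, bounds-checked region."""

    def __init__(self, size: int) -> None:
        self._buf = bytearray(size)
        self._next = 0  # cursor of a trivial bump allocator

    def alloc(self, nbytes: int) -> int:
        """Reserve nbytes inside the sandbox and return their offset."""
        if self._next + nbytes > len(self._buf):
            raise MemoryError("linear memory exhausted")
        offset = self._next
        self._next += nbytes
        return offset

    def _check(self, offset: int, nbytes: int) -> None:
        # Every access is validated against the sandbox bounds.
        if offset < 0 or nbytes < 0 or offset + nbytes > len(self._buf):
            raise IndexError("out-of-bounds access")

    def write(self, offset: int, data: bytes) -> None:
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data  # this is the memcopy

    def read(self, offset: int, nbytes: int) -> bytes:
        self._check(offset, nbytes)
        return bytes(self._buf[offset:offset + nbytes])  # and a copy back out


def guest_add_one(mem: LinearMemory, offset: int, nbytes: int) -> None:
    """Hypothetical UDF: it only ever sees offsets into its own memory."""
    data = mem.read(offset, nbytes)
    mem.write(offset, bytes((b + 1) % 256 for b in data))


def call_udf(host_buffer: bytes) -> bytes:
    """Host side: copy in, run the guest, copy out."""
    mem = LinearMemory(64 * 1024)  # 64 KiB, the size of one WASM page
    offset = mem.alloc(len(host_buffer))
    mem.write(offset, host_buffer)  # host buffer -> sandbox
    guest_add_one(mem, offset, len(host_buffer))
    return mem.read(offset, len(host_buffer))  # sandbox -> host


print(call_udf(bytes([1, 2, 3])))  # b'\x02\x03\x04'
```

The guest never receives a pointer to the host buffer, which is what keeps it from reading out of bounds, and also why the two copies cannot be avoided in this model.
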
There is a proposal [1] to improve the situation, but for now using WASM would incur a cost in the query engine, since we would need to memcopy the regions around. I forgot that in my PoC I passed the Parquet file from JS to WASM and deserialized it to Arrow directly in WASM, so the memory was already being allocated from within the WASM sandbox, not in JS. Sorry for the confusion.

[1] https://github.com/WebAssembly/design/issues/1439

Best,
Jorge

On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <anto...@python.org> wrote:

>
> On 26/04/2022 at 16:30, Gavin Ray wrote:
> > Antoine, sandboxing comes into play from two places:
> >
> > 1) The WASM specification itself, which puts a bounds on the types of
> > behaviors possible
> > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > mentioned in the comment above
> >
> > The wasmtime docs have a pretty solid section covering the sandboxing
> > guarantees of WASM, and then the interpreter-specific behavior/abilities of
> > wasmtime FWIW:
> > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
> >
> > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> >>>
> >>> AFAIK yes. My understanding from playing with it in JS is that a
> >>> WASM-backed udf execution would be something like:
> >>>
> >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>> 2. provide a small WASM-compiled middleware of the c data interface that
> >>> consumes (binary, c data interface pointers)
> >>> 3. ship a WASM interpreter as part of the query engine
> >>> 4. pass binary and c data interface pointers from the query engine program
> >>> to the interpreter with WASM-compiled middleware
> >>
> >> Ok, but the key word in my question was "safely". What mechanisms are in
> >> place such that the WASM user function will not access Arrow buffers out
> >> of bounds? Nothing really stands out in
> >> https://webassembly.github.io/spec/core/index.html, but it's the first
> >> time I try to have a look at the WebAssembly spec.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>>
> >>> Step 2 is necessary to read the buffers from FFI and output the result back
> >>> from the interpreter once the UDF is done, similar to what we do in
> >>> datafusion to run Python from Rust. In the case of datafusion the "binary"
> >>> is a Python function, which has security implications since the Python
> >>> interpreter allows everything by default.
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>>
> >>>> On 25/04/2022 at 23:04, David Li wrote:
> >>>>> The WebAssembly documentation has a rundown of the techniques used:
> >>>>> https://webassembly.org/docs/security/
> >>>>>
> >>>>> I think usually you would run WASM in-process, though we could indeed
> >>>>> also put it in a subprocess to further isolate things.
> >>>>
> >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> >>>> It's not obvious from reading the page above.
> >>>>
> >>>>>
> >>>>> It would be interesting to define the Flight "harness" protocol.
> >>>>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> >>>>> if the function is non scalar and arguments are of different length - we'd
> >>>>> need something like the ColumnBag proposal, so this might be a good reason
> >>>>> to revive that).
> >>>>>
> >>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>>>>>> I was going to reply to this e-mail thread on user@ but thought I
> >>>>>>> would start a new thread on dev@.
> >>>>>>>
> >>>>>>> Executing user-defined functions in memory, especially untrusted
> >>>>>>> functions, in general is unsafe. For "trusted" functions, having an
> >>>>>>> in-memory API for writing them in user languages is very useful. I
> >>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>>>>>> would allow UDFs to have performance consistent with built-ins
> >>>>>>> (because built-in functions are all inlined into code-generated
> >>>>>>> expressions), but segfaults would bring down the server, so only
> >>>>>>> admins could be trusted to add new UDFs.
> >>>>>>>
> >>>>>>> However, I wonder if we should eventually define an "external UDF"
> >>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
> >>>>>>> the process boundaries. So the idea is that an external local UDF
> >>>>>>> Flight execution service is spun up, and then data is sent to the UDF
> >>>>>>> in a DoExchange call.
> >>>>>>>
> >>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
> >>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>>>>>> functions to be run safely in-process.
> >>>>>>
> >>>>>> How does the sandboxing work in this case? Is it simply executing in a
> >>>>>> separate process with restricted capabilities, or are other mechanisms
> >>>>>> put in place?