I think such a harness involves a certain amount of tricky "package management". For example, if I want to build my UDF on top of TensorFlow, then I would need a version of the TensorFlow C libs that has been compiled to WASM, plus (potentially) WASM builds of the language runtimes for whatever languages users might want to write the computation in. I wonder if there are existing WASM solutions for this kind of challenge.
On Mon, Apr 25, 2022 at 11:05 AM David Li <lidav...@apache.org> wrote:
>
> The WebAssembly documentation has a rundown of the techniques used:
> https://webassembly.org/docs/security/
>
> I think usually you would run WASM in-process, though we could indeed also
> put it in a subprocess to further isolate things.
>
> It would be interesting to define the Flight "harness" protocol. Handling
> heterogeneous arguments may require some evolution in Flight (e.g. if the
> function is non scalar and arguments are of different length - we'd need
> something like the ColumnBag proposal, so this might be a good reason to
> revive that).
>
> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > On 25/04/2022 at 22:19, Wes McKinney wrote:
> >> I was going to reply to this e-mail thread on user@ but thought I
> >> would start a new thread on dev@.
> >>
> >> Executing user-defined functions in memory, especially untrusted
> >> functions, in general is unsafe. For "trusted" functions, having an
> >> in-memory API for writing them in user languages is very useful. I
> >> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >> would allow UDFs to have performance consistent with built-ins
> >> (because built-in functions are all inlined into code-generated
> >> expressions), but segfaults would bring down the server, so only
> >> admins could be trusted to add new UDFs.
> >>
> >> However, I wonder if we should eventually define an "external UDF"
> >> protocol and an example UDF "harness", using Flight to do RPC across
> >> the process boundaries. So the idea is that an external local UDF
> >> Flight execution service is spun up, and then data is sent to the UDF
> >> in a DoExchange call.
> >>
> >> As Jacques pointed out in an interview [1], a compelling solution to
> >> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >> functions to be run safely in-process.
> >
> > How does the sandboxing work in this case? Is it simply executing in a
> > separate process with restricted capabilities, or are other mechanisms
> > put in place?
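
To make the process-boundary idea from the thread concrete, here is a minimal sketch. It is not Flight: it assumes plain stdin/stdout pipes and JSON as a stand-in for a DoExchange call carrying Arrow record batches. The names WORKER_SOURCE, call_external_udf, add_one and crash are made up for illustration, and nothing here is an Arrow API. It only shows the isolation property discussed above: a hard crash inside the UDF kills the child process, and the caller sees an error instead of going down with it.

```python
import json
import subprocess
import sys

# Hypothetical worker program: reads one JSON request from stdin, applies the
# named UDF, and writes the result to stdout as JSON. In a real harness this
# would be a long-lived service, not a fresh process per call.
WORKER_SOURCE = r"""
import json, os, sys

def add_one(values):
    return [v + 1 for v in values]

def crash(values):
    os.abort()  # simulate a segfault-style hard crash inside the UDF

UDFS = {"add_one": add_one, "crash": crash}

request = json.load(sys.stdin)
result = UDFS[request["udf"]](request["values"])
json.dump({"values": result}, sys.stdout)
"""


def call_external_udf(udf_name, values, timeout=10.0):
    """Run one UDF call in a child process and return the resulting list."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps({"udf": udf_name, "values": values}),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # A crash or unhandled exception in the UDF shows up as a non-zero exit
    # code here; the calling process itself keeps running.
    if proc.returncode != 0:
        raise RuntimeError(
            f"UDF process failed with exit code {proc.returncode}"
        )
    return json.loads(proc.stdout)["values"]
```

Calling call_external_udf("add_one", [1, 2, 3]) returns [2, 3, 4], while call_external_udf("crash", [1]) raises RuntimeError in the caller. Note that this gives crash isolation only. The child still runs with the caller's full OS privileges, so by itself it is not a sandbox for untrusted code.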