Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Weston Pace Mon, 25 Apr 2022 16:25:25 -0700

I think there is a certain amount of tricky "package management"
involved with such a harness.  For example, if I want to build my UDF
on top of TensorFlow then I would need a version of the TensorFlow C
libs that has been compiled to WASM, plus (potentially) WASM builds of
the language runtimes for whatever languages users might want to write
the computation in.  I wonder if there are existing WASM solutions for
this kind of challenge.


On Mon, Apr 25, 2022 at 11:05 AM David Li <[email protected]> wrote:
>
> The WebAssembly documentation has a rundown of the techniques used: 
> https://webassembly.org/docs/security/
>
> I think usually you would run WASM in-process, though we could indeed also 
> put it in a subprocess to further isolate things.
>
> It would be interesting to define the Flight "harness" protocol. Handling 
> heterogeneous arguments may require some evolution in Flight (e.g. if the 
> function is non-scalar and the arguments have different lengths, we'd need 
> something like the ColumnBag proposal, so this might be a good reason to 
> revive that).
>
> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >> I was going to reply to this e-mail thread on user@ but thought I
> >> would start a new thread on dev@.
> >>
> >> Executing user-defined functions in memory, especially untrusted
> >> functions, in general is unsafe. For "trusted" functions, having an
> >> in-memory API for writing them in user languages is very useful. I
> >> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >> would allow UDFs to have performance consistent with built-ins
> >> (because built-in functions are all inlined into code-generated
> >> expressions), but segfaults would bring down the server, so only
> >> admins could be trusted to add new UDFs.
> >>
> >> However, I wonder if we should eventually define an "external UDF"
> >> protocol and an example UDF "harness", using Flight to do RPC across
> >> the process boundaries. So the idea is that an external local UDF
> >> Flight execution service is spun up, and then data is sent to the UDF
> >> in a DoExchange call.
> >>
> >> As Jacques pointed out in an interview [1], a compelling solution to
> >> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >> functions to be run safely in-process.
> >
> > How does the sandboxing work in this case? Is it simply executing in a
> > separate process with restricted capabilities, or are other mechanisms
> > put in place?
