GitHub user timsaucer created a discussion: Requesting feedback on long term plan for FFI approach
## Current Status Currently when working with the foreign function interface, we use protobuf parsing to pass back and forth many things which do not have FFI representations. For example in `FFI_TableProvdider` when we need to evaluate push down filters [pass the expression as serialized bytes](https://github.com/apache/datafusion/blob/6d52e54bc8b2129be371c40df2a09164c7a522ed/datafusion/ffi/src/table_provider.rs#L185). This has been very helpful in getting us up and running on table providers. In fact we now have multiple users of this approach - https://github.com/datafusion-contrib/datafusion-table-providers - https://github.com/delta-io/delta-rs - https://github.com/apache/iceberg-rust - https://github.com/rerun-io/rerun - https://github.com/lancedb/lance These are the ones I know about. I suspect the actual list is even longer! ## The Problem The problem we have is twofold: - We create large libraries because we create a `SessionContext` at any point where we need to do the serialization and deserialization. This is because we need to know how to serialize/deserialize functions. This happens in places like `parse_exprs` and others. - The default `SessionContext` we use does not know anything about custom functions. So this method will fail when passing across custom registered functions. In general the protobuf code requires `TaskContext`, `FunctionRegistry`, `LogicalExtensionCodec`, and `PhysicalExtensionCodec`. I have been working on a few implementations. I have branches on my local machine where I have implemented FFI equivalents for `PhysicalExpr`, `FunctionRegistry`, and `Session`. I believe we can also implement a limited version of the extension codecs. With these implementations we could pass around something like a `FFI_Session` where the code can then get the task context. **The core implementation problem:** Suppose I want to create a `FFI_TableProvider` and what we want to do is pass along the `FFI_Session`. I might need to instead pass `FFI_FunctionRegistry`, or both - depends on the need of which FFI interface we're talking about. This means I am cloning an `Arc<dyn Session>` under the hood when I create my `FFI_TableProvider`. But then I will be registering my table provider with my session context. So now I have a loop one the `Arc`. The Session Context holds the table provider which holds the session context. Please correct the above thinking if I've overlooked something. ## Proposed solutions These are the approaches I've been considering: 1. Push forward with passing along the `FFI_Session`, `FFI_FunctionRegistry`, etc when creating the `FFI_TableProvider` (or others). This means a pretty decent breaking change to the existing implementations, but it may in the long term be well worth the effort. This would mean resolving my concern above. 2. Create some kind of singleton in these libraries and have the users call setting functions on it to pass along the function registry, session, etc. I really don't like this idea because it violates some of the ideas I have about what it means to hold one or many `SessionContext`s. In practice, this may not actually be a problem. I'm open to input. 3. Keep going further with the FFI coverage to include logical expressions. Since I have `PhysicalExpr` trait implemented, it's one more thing to duplicate and maintain. If we do this then we can basically remove the protobuf serialization with the notable exception of `ScalarValue`. That comes from the `datafusion-proto-common` crate which is relatively light. Maybe there is another solution or a misunderstanding in my evaluation. I'm open to all input and insight. GitHub link: https://github.com/apache/datafusion/discussions/18208 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
