GitHub user timsaucer edited a discussion: Requesting feedback on long term 
plan for FFI approach

## Current Status

Currently when working with the foreign function interface, we use protobuf 
parsing to pass back and forth many things which do not have FFI 
representations. For example in `FFI_TableProvdider` when we need to evaluate 
push down filters [pass the expression as serialized 
bytes](https://github.com/apache/datafusion/blob/6d52e54bc8b2129be371c40df2a09164c7a522ed/datafusion/ffi/src/table_provider.rs#L185).
 This has been very helpful in getting us up and running on table providers. In 
fact we now have multiple users of this approach

- https://github.com/datafusion-contrib/datafusion-table-providers
- https://github.com/delta-io/delta-rs
- https://github.com/apache/iceberg-rust
- https://github.com/rerun-io/rerun
- https://github.com/lancedb/lance

These are the ones I know about. I suspect the actual list is even longer!

## The Problem

The problem we have is twofold:

- We create large libraries because we create a `SessionContext` at any point 
where we need to do the serialization and deserialization. This is because we 
need to know how to serialize/deserialize functions. This happens in places 
like `parse_exprs` and others.
- The default `SessionContext` we use does not know anything about custom 
functions. So this method will fail when passing across custom registered 
functions.

In general the protobuf code requires `TaskContext`, `FunctionRegistry`, 
`LogicalExtensionCodec`, and `PhysicalExtensionCodec`.

I have been working on a few implementations. I have branches on my local 
machine where I have implemented FFI equivalents for `PhysicalExpr`, 
`FunctionRegistry`, and `Session`. I believe we can also implement a limited 
version of the extension codecs. With these implementations we could pass 
around something like a `FFI_Session` where the code can then get the task 
context.

**The core implementation problem:**

Suppose I want to create a `FFI_TableProvider` and what we want to do is pass 
along the `FFI_Session`. I might need to instead pass  `FFI_FunctionRegistry`, 
or both - depends on the need of which FFI interface we're talking about. This 
means I am cloning an `Arc<dyn Session>` under the hood when I create my 
`FFI_TableProvider`. But then I will be registering my table provider with my 
session context. So now I have a loop on the `Arc`. The Session Context holds 
the table provider which holds the session context.

Please correct the above thinking if I've overlooked something.

## Proposed solutions

These are the approaches I've been considering:

1. Push forward with passing along the `FFI_Session`, `FFI_FunctionRegistry`, 
etc when creating the `FFI_TableProvider` (or others). This means a pretty 
decent breaking change to the existing implementations, but it may in the long 
term be well worth the effort. This would mean resolving my concern above.
2. Create some kind of singleton in these libraries and have the users call 
setting functions on it to pass along the function registry, session, etc. I 
really don't like this idea because it violates some of the ideas I have about 
what it means to hold one or many `SessionContext`s. In practice, this may not 
actually be a problem. I'm open to input.
3. Keep going further with the FFI coverage to include logical expressions. 
Since I have `PhysicalExpr` trait implemented, it's one more thing to duplicate 
and maintain. If we do this then we can basically remove the protobuf 
serialization with the notable exception of `ScalarValue`. That comes from the 
`datafusion-proto-common` crate which is relatively light.

Maybe there is another solution or a misunderstanding in my evaluation. I'm 
open to all input and insight.

GitHub link: https://github.com/apache/datafusion/discussions/18208

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to