colinmarc commented on issue #1181:
URL: 
https://github.com/apache/datafusion-python/issues/1181#issuecomment-3064059046

   I explored the solution space a bit today, and I don't think this problem is 
really solvable with the APIs as they currently exist.
   
   Just to be clear about what is actually needed, consider the following setup:
   
   1. A python service, let's call it "frontend", builds a datafusion plan and 
wants to serialize it. The plan includes an iceberg scan, which is provided by 
[`IcebergTableProvider`](https://rust.iceberg.apache.org/api/iceberg_datafusion/table/struct.IcebergTableProvider.html),
 exposed into python via the pycapsule interface 
(`__datafusion_table_provider__`).
   2. A downstream rust service ("backend") wants to deserialize the plan.
   3. A downstream python service ("logger") also wants to deserialize the plan.
   
   ### `LogicalExtensionCodec` is not compatible with `ForeignTableProvider`
   
   As stated above, the python bindings currently use 
`DefaultLogicalExtensionCodec`:
   
   
https://github.com/apache/datafusion-python/blob/2e1b71369eefc97c22b82be84bbabb414f748fb9/src/sql/logical.rs#L187
   
   However, this is only half of the problem. Imagine that I fork 
datafusion-python and change the code to use a custom `LogicalExtensionCodec` - 
thereby skipping the problem of how to expose the right API. My custom 
`LogicalExtensionCodec` would look something like this:
   
   ```rust
   fn try_encode_table_provider(
       &self,
       table_ref: &TableReference,
       node: std::sync::Arc<dyn TableProvider>,
       buf: &mut Vec<u8>,
   ) -> datafusion::common::Result<()> {
       let source = node.as_any();
   
       // This downcast won't work on ForeignTableProvider.
       if let Some(iceberg_table) = 
source.downcast_ref::<IcebergTableProvider>() {
           todo!()
       }
   }
   ```
   
   The problem is that the `downcast` will fail, because the `TableProvider` is 
a `ForeignTableProvider` - the type information was lost during the "detour" 
through python.
   
   ### Python type association is one-way
   
   In step 3 above, we're in the "logger" python code, trying to deserialize 
the plan that the "frontend" python serialized.
   
   Our `decode` implementation might look like:
   
   ```rust
   fn try_decode_table_provider(
       &self,
       buf: &[u8],
       table_ref: &TableReference,
       schema: SchemaRef,
       ctx: &SessionContext,
   ) -> datafusion::common::Result<std::sync::Arc<dyn TableProvider>> {
       if let Ok(iceberg_data) = IcebergTableData::decode(buf) {
           let table_provider = IcebergTableProvider::new(...)
           // what do we do with the table provider now?
       }
   ```
   
   This will work, but the python code has no way to turn the resulting 
provider back into the python wrapper type, since that information is lost. 
Therefore the plan can't really "roundtrip" in any meaningful sense.
   
   ### `encode` can be stored on `FFI_TableProvider`, but not `decode`
   
   As described above, we could somehow force an implementation of `encode` 
into `FFI_TableProvider`. (It would probably have to involve storing an 
`Arc<dyn LogicalExtensionCodec>` in private_data.) However, in order to iterate 
through and try possible "decoders", a global registry of 
`LogicalExtensionCodec`s would be needed somewhere. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to