Re: [I] C++/Rust UDF implementation [texera]

via GitHub Sat, 23 May 2026 17:55:12 -0700


carloea2 commented on issue #5162:
URL: https://github.com/apache/texera/issues/5162#issuecomment-4526974521


   Thanks @Yicong-Huang and @Ma77Ball.
   
   My proposal is not to introduce a full C++/Rust Amber engine at first; I 
think that would be too heavy for an initial version. Python has its own 
runtime under `amber/src/main/python` because Python support is a broader 
execution layer: it handles Python workers, environments, imports, Python 
objects/dataframes, UDF/source operators, etc.
   
   For C++/Rust, I think the first useful step can be smaller: keep Texera’s 
existing worker/executor architecture, and treat compiled native code as a 
persistent UDF worker process owned by a normal Texera operator executor.
   
   So the flow would be:
   
   - Texera still creates a normal logical/physical operator.
   - The JVM-side executor owns scheduling, schema propagation, ports, 
buffering, timeouts, and result emission.
   - On `open()`, it compiles or reuses a cached native binary.
   - It starts one persistent C++/Rust process for that executor.
   - It sends the input schema once.
   - During execution, it streams tuple/batch/table frames to the process.
   - The native process calls the user’s `process_tuple`, `process_batch`, or 
`process_table`.
   - The executor reads typed output rows back and emits normal Texera tuples.
   - On `close()`, it shuts down the native process.
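   
   As a rough sketch of the native-process side of that flow: the loop below 
reads one schema frame, then tuple frames until an end frame, and writes one 
output row per tuple. The line-based `SCHEMA` / `TUPLE` / `END` / `ROW` wire 
format and the `serve` function are invented here for illustration only; a real 
version would likely use a typed binary framing.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical native-side loop: one SCHEMA frame, then TUPLE frames until
/// END. Calls the user's `process_tuple` for each tuple and writes one ROW
/// frame back. Returns the number of tuples processed.
pub fn serve<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    mut process_tuple: impl FnMut(&[String], &[String]) -> Vec<String>,
) -> io::Result<usize> {
    let mut schema: Vec<String> = Vec::new();
    let mut seen = 0;
    for line in input.lines() {
        let line = line?;
        // Split "KIND payload"; a frame without a payload has an empty one.
        let (kind, payload) = line.split_once(' ').unwrap_or((line.as_str(), ""));
        match kind {
            // The schema is sent once and kept for the life of the process.
            "SCHEMA" => schema = payload.split(',').map(str::to_string).collect(),
            "TUPLE" => {
                let fields: Vec<String> =
                    payload.split(',').map(str::to_string).collect();
                let out = process_tuple(&schema, &fields);
                writeln!(output, "ROW {}", out.join(","))?;
                seen += 1;
            }
            // END corresponds to the executor's `close()`.
            "END" => break,
            other => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("unknown frame: {other}"),
                ));
            }
        }
    }
    output.flush()?;
    Ok(seen)
}
```

   In the real process `input` and `output` would be locked stdin/stdout (or a 
socket); taking them as generic parameters keeps the loop testable with 
in-memory buffers.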
   
   This means C++/Rust would not be replacing Amber or becoming a second 
execution engine initially. It would be more like a native-code UDF runtime 
plugged into the existing Texera worker path.
   
   For @Yicong-Huang’s tuple-by-tuple point, I think we can support that 
directly with the API:
   
   - `process_tuple`
   - `process_batch`
   - `process_table`
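   
   One possible shape for this API on the Rust side, as a sketch only: the 
`NativeUdf` trait, the `Value` enum, and the `Row` alias are hypothetical names, 
and the batch/table defaults simply fall back to the per-tuple method so a user 
only has to write one function.

```rust
/// Hypothetical value type; Texera's real tuple model is richer than this.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Double(f64),
    Text(String),
    Null,
}

pub type Row = Vec<Value>;

/// Hypothetical UDF trait. Only `process_tuple` is required; the batch and
/// table entry points default to calling it row by row.
pub trait NativeUdf {
    fn process_tuple(&mut self, row: &Row) -> Vec<Row>;

    fn process_batch(&mut self, batch: &[Row]) -> Vec<Row> {
        batch.iter().flat_map(|r| self.process_tuple(r)).collect()
    }

    fn process_table(&mut self, table: &[Row]) -> Vec<Row> {
        self.process_batch(table)
    }
}

/// Example UDF: doubles the first column when it is an integer.
pub struct DoubleFirstInt;

impl NativeUdf for DoubleFirstInt {
    fn process_tuple(&mut self, row: &Row) -> Vec<Row> {
        let mut out = row.clone();
        if let Some(Value::Int(n)) = out.first_mut() {
            *n *= 2;
        }
        vec![out]
    }
}
```

   A UDF that benefits from vectorized work could override `process_batch` or 
`process_table` instead of relying on the defaults.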
   
   The native process can stay alive across calls, so tuple-by-tuple execution 
does not require spawning a process per tuple. That also leaves room for 
stateful native operators later, since the same operator instance can be reused 
inside the native process.
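   
   To illustrate the stateful case, here is a minimal sketch of an operator 
whose state survives across tuple-by-tuple calls because the same instance lives 
for the whole process. `RunningSum` is a made-up example, not an existing 
Texera operator.

```rust
/// Hypothetical stateful UDF: emits the running sum of the values seen so far.
pub struct RunningSum {
    total: i64,
}

impl RunningSum {
    pub fn new() -> Self {
        Self { total: 0 }
    }

    /// Called once per tuple; `total` persists between calls because the
    /// operator instance is created once and reused by the native process.
    pub fn process_tuple(&mut self, value: i64) -> i64 {
        self.total += value;
        self.total
    }
}
```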
   
   For @Ma77Ball’s point about compiler/env complexity, I agree. I think the 
first version should state explicitly that the toolchain is managed by the 
deployer's local environment. Similar to how Python currently depends on 
configured Python environments, C++/Rust would depend on configured compiler 
paths like `CXX` / `RUSTC` and a set of supported compiler flags. Native 
library/package support should probably be future work, not part of the first 
pass.
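   
   A sketch of what that configuration handling could look like, under 
assumptions: the allowlist contents, the function names, and the cache-key 
scheme below are all hypothetical. The idea is that the compiler path comes 
from a configured value with a fallback, flags outside a supported set are 
rejected, and the compiled binary is cached under a key derived from compiler, 
flags, and source.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical allowlist; which flags are supported is up to the deployer.
const ALLOWED_FLAGS: &[&str] = &["-O2", "-O3", "-std=c++17", "-std=c++20"];

/// Picks the compiler from a configured value (e.g. the `CXX` variable),
/// falling back to a default when it is unset or blank.
pub fn resolve_compiler(configured: Option<&str>, default: &str) -> String {
    match configured.map(str::trim) {
        Some(path) if !path.is_empty() => path.to_string(),
        _ => default.to_string(),
    }
}

/// Rejects the first flag found outside the allowlist.
pub fn validate_flags(flags: &[&str]) -> Result<(), String> {
    match flags.iter().copied().find(|f| !ALLOWED_FLAGS.contains(f)) {
        Some(bad) => Err(format!("unsupported compiler flag: {bad}")),
        None => Ok(()),
    }
}

/// Cache key for a compiled binary. `DefaultHasher` keeps the sketch
/// std-only; its output is not guaranteed stable across Rust releases, so an
/// on-disk cache would want a stable content hash instead.
pub fn cache_key(compiler: &str, flags: &[&str], source: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    compiler.hash(&mut hasher);
    flags.hash(&mut hasher);
    source.hash(&mut hasher);
    hasher.finish()
}
```

   At startup this could be called as 
`resolve_compiler(std::env::var("CXX").ok().as_deref(), "c++")`, with the 
resulting key deciding whether `open()` recompiles or reuses a binary.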
   
   So the main tradeoff is:
   
   - Python UDF remains the flexible, general-purpose scripting path.
   - C++/Rust UDFs become a focused native-code path for CPU-heavy UDF kernels.
   - A full C++/Rust engine can be considered later if we need native 
source/sink operators, native dependency management, distributed binary 
deployment, or deeper fault-tolerance/runtime integration.
   
   I think this keeps the first version useful without committing Texera to a 
full new language engine immediately.


