carloea2 commented on issue #5162: URL: https://github.com/apache/texera/issues/5162#issuecomment-4526974521
Thanks @Yicong-Huang and @Ma77Ball.

The implementation idea I have in mind is not to introduce a full new C++/Rust Amber engine at first. I think that would be too heavy for the initial version. Python has its own runtime under `amber/src/main/python` because Python support is a broader execution layer: it handles Python workers, environments, imports, Python objects/dataframes, UDF/source operators, etc.

For C++/Rust, I think the first useful step can be smaller: keep Texera's existing worker/executor architecture, and treat compiled native code as a persistent UDF worker process owned by a normal Texera operator executor.

So the flow would be:

- Texera still creates a normal logical/physical operator.
- The JVM-side executor owns scheduling, schema propagation, ports, buffering, timeout, and result emission.
- On `open()`, it compiles or reuses a cached native binary.
- It starts one persistent C++/Rust process for that executor.
- It sends the input schema once.
- During execution, it streams tuple/batch/table frames to the process.
- The native process calls the user's `process_tuple`, `process_batch`, or `process_table`.
- The executor reads typed output rows back and emits normal Texera tuples.
- On `close()`, it shuts down the native process.

This means C++/Rust would not be replacing Amber or becoming a second execution engine initially. It would be more like a native-code UDF runtime plugged into the existing Texera worker path.

For @Yicong-Huang's tuple-by-tuple point, I think we can support that directly with the API:

- `process_tuple`
- `process_batch`
- `process_table`

The native process can stay alive across calls, so tuple-by-tuple execution does not require spawning a process per tuple. That also leaves room for stateful native operators later, since the same operator instance can be reused inside the native process.

For @Ma77Ball's point about compiler/env complexity, I agree.
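
Before the environment side, the executor flow listed above can be made concrete with a rough sketch. This is only an illustration under stated assumptions, not the proposed implementation: the real executor would live on the JVM side and the worker would be a compiled C++/Rust binary. Here a small Python child process stands in for that binary, and the names `NativeUdfExecutor`, `WORKER_SOURCE`, the JSON-lines framing, and the `seen` counter are all hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the compiled native UDF binary. It reads one JSON
# frame per line from stdin and writes one JSON row per line to stdout.
WORKER_SOURCE = r'''
import json, sys

schema = None   # set once by the schema frame
seen = 0        # state that survives across process_tuple calls

def process_tuple(row):
    # Example user logic: double "x" and report how many tuples were seen.
    global seen
    seen += 1
    return {"x": row["x"] * 2, "seen": seen}

for line in iter(sys.stdin.readline, ""):
    frame = json.loads(line)
    if frame["type"] == "schema":
        schema = frame["fields"]
    elif frame["type"] == "tuple":
        sys.stdout.write(json.dumps({"row": process_tuple(frame["row"])}) + "\n")
        sys.stdout.flush()
'''


class NativeUdfExecutor:
    """Owns one persistent worker process for the lifetime of the executor."""

    def __init__(self, command, schema):
        self.command = command
        self.schema = schema
        self.proc = None

    def _send(self, frame):
        # One frame per line; flush so the worker sees it immediately.
        self.proc.stdin.write(json.dumps(frame) + "\n")
        self.proc.stdin.flush()

    def open(self):
        # A real version would compile or reuse a cached binary before this.
        self.proc = subprocess.Popen(
            self.command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )
        self._send({"type": "schema", "fields": self.schema})  # sent once

    def process_tuple(self, row):
        # Stream one tuple to the already-running worker and read one row back.
        self._send({"type": "tuple", "row": row})
        return json.loads(self.proc.stdout.readline())["row"]

    def close(self):
        # Closing stdin ends the worker's read loop, so it exits on its own.
        self.proc.stdin.close()
        code = self.proc.wait(timeout=10)
        self.proc.stdout.close()
        return code


def run_demo():
    executor = NativeUdfExecutor([sys.executable, "-c", WORKER_SOURCE], ["x"])
    executor.open()
    rows = [executor.process_tuple({"x": 1}), executor.process_tuple({"x": 5})]
    executor.close()
    return rows


if __name__ == "__main__":
    print(run_demo())
```

The `seen` counter increases across calls because both tuples go to the same worker process, which is the property that avoids spawning a process per tuple. Batch and table frames, typed rows, and timeouts are left out of the sketch.
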
I think the first version should be explicit that this is deployer/local-environment managed. Similar to how Python currently depends on configured Python environments, C++/Rust would depend on configured compiler paths like `CXX` / `RUSTC` and supported compiler flags. Native library/package support should probably be future work, not part of the first pass.

So the main tradeoff is:

- Python UDF remains the flexible, general-purpose scripting path.
- C++/Rust UDFs become a focused native-code path for CPU-heavy UDF kernels.
- A full C++/Rust engine can be considered later if we need native source/sink operators, native dependency management, distributed binary deployment, or deeper fault-tolerance/runtime integration.

I think this keeps the first version useful without committing Texera to a full new language engine immediately.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
