carloea2 commented on issue #5162: URL: https://github.com/apache/texera/issues/5162#issuecomment-4539728462
Good question. In the closed prototype PR #5078, I did not use Arrow IPC yet. The MVP used a simple typed row-frame protocol over stdin/stdout:

- The JVM executor selected the input columns, then wrote the column count, base64-encoded column names, type tags, and row data.
- Each field was encoded as `<type-tag>:<is-null>:<payload>`; strings/binary used base64, while numeric/boolean/timestamp values used textual payloads.
- The native side returned typed output rows on stdout, and the JVM parsed/enforced them against the declared output schema.

The prototype supported three execution APIs:

- `process_tuple`: one-row frames, mainly for simple or low-latency cases.
- `process_batch`: the default path, accumulating a configurable batch size before sending a frame.
- `process_table`: collect input and send once on finish for whole-table algorithms.

One thing I would change from #5078 is the lifecycle. The hackathon prototype was intentionally simple and started the compiled executable per flush. For the real sidecar design, I think the executor should compile/reuse the binary in `open()`, start one persistent native process, send the schema once, stream tuple/batch/table frames to it, and shut it down in `close()`. In that design, tuple mode does not mean spawning one process per tuple.

So my preference for the first version is batch-basis by default, while still exposing tuple and table APIs. Arrow IPC is a good candidate for a later transport, especially for larger batches, binary-heavy data, or columnar data. I would keep the transport pluggable: start with a simple debuggable framed protocol like the prototype, then add Arrow IPC after the API/lifecycle is settled.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
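
For readers who want a concrete picture of the row-frame encoding described in the comment, here is a minimal Python sketch of the `<type-tag>:<is-null>:<payload>` field format. It is not the code from PR #5078. The tag names (`int`, `float`, `bool`, `str`, `bin`), the `1`/`0` null flag, the comma field separator, and the newline-delimited frame layout are all assumptions made for illustration; the comment only specifies the three-part field shape, base64 for strings/binary, and textual payloads for the other types.

```python
import base64

def encode_field(tag: str, value) -> str:
    """Encode one field as <type-tag>:<is-null>:<payload> (tag names are hypothetical)."""
    if value is None:
        return f"{tag}:1:"  # null flag set, empty payload
    if tag == "str":
        payload = base64.b64encode(value.encode("utf-8")).decode("ascii")
    elif tag == "bin":
        payload = base64.b64encode(value).decode("ascii")
    elif tag == "bool":
        payload = "true" if value else "false"
    else:
        payload = str(value)  # numeric values use a textual payload
    return f"{tag}:0:{payload}"

def decode_field(field: str):
    """Parse a field back into (tag, value); the inverse of encode_field."""
    # maxsplit=2 keeps any ':' inside a textual payload intact
    tag, is_null, payload = field.split(":", 2)
    if is_null == "1":
        return tag, None
    if tag == "str":
        return tag, base64.b64decode(payload).decode("utf-8")
    if tag == "bin":
        return tag, base64.b64decode(payload)
    if tag == "bool":
        return tag, payload == "true"
    if tag == "int":
        return tag, int(payload)
    if tag == "float":
        return tag, float(payload)
    raise ValueError(f"unknown type tag: {tag}")

def encode_frame(names, tags, rows) -> str:
    """Assumed frame layout: column count, base64 names, type tags, then one line per row."""
    lines = [
        str(len(names)),
        ",".join(base64.b64encode(n.encode("utf-8")).decode("ascii") for n in names),
        ",".join(tags),
    ]
    for row in rows:
        lines.append(",".join(encode_field(t, v) for t, v in zip(tags, row)))
    return "\n".join(lines) + "\n"
```

Under these assumptions, a one-row frame would come from `encode_frame(names, tags, [row])` (tuple mode), and a batch frame from passing the accumulated rows. Because base64 output and the numeric/boolean payloads never contain `,` or a newline, the hypothetical separators stay unambiguous; a textual timestamp format would need the same property.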
