Yicong-Huang opened a new pull request, #5557:
URL: https://github.com/apache/texera/pull/5557

   ### What changes were proposed in this PR?
   
   Adds an end-to-end micro-benchmark of the Arrow Flight data path between 
Scala/Pekko and the Python UDF worker, plus a bench-agnostic CI workflow that 
publishes results to a github-action-benchmark dashboard.
   
   - `amber/src/test/scala/.../bench/ArrowFlightActorBench.scala` — spawns a 
real `PythonWorkflowWorker` actor (real Pekko mailbox + real 
`texera_run_python_worker.py` subprocess + real Arrow Flight gRPC) wired to an 
identity Python UDF. It sweeps a 36-config grid (`batch_size × schema_width × 
string_len`), records per-batch send→echo latency percentiles and throughput, 
and writes CSV + JSON incrementally after each config, so a killed or 
timed-out sweep still leaves usable artifacts.
   - `bin/run-benchmarks.sh` — single opaque CI entry point. Future bench 
suites (e.g. JMH for `ArrowUtils.fromTexeraSchema` / `appendTexeraTuple`) plug 
in by appending one line here.
   - `.github/workflows/benchmarks.yml` — bench-agnostic umbrella workflow. 
Label-based trigger gate mirrors `amber-integration` exactly (`python` / 
`engine` / `amber-integration` / `common` / `ddl-change` / `ci`); the `Wait for 
Pull Request Labeler` step is lifted from `required-checks.yml` so the labeler 
race is handled the same way. Adding a new bench is one publish-step block; 
this workflow file otherwise stays unchanged.
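
The sweep-and-checkpoint pattern described for `ArrowFlightActorBench.scala` can be sketched roughly as follows. This is an illustrative Python sketch, not the Scala implementation in this PR: the axis values, the `measure` callback, and the output field names are all hypothetical (the PR only states that the grid has 36 configs), and throughput is omitted for brevity.

```python
import itertools
import json
import os

# Hypothetical axis values; the PR only says the grid has 36 configs (4 x 3 x 3 here).
BATCH_SIZES = [100, 1_000, 10_000, 100_000]
SCHEMA_WIDTHS = [1, 10, 50]
STRING_LENS = [8, 64, 512]


def percentile(sorted_samples, p):
    """Nearest-rank percentile of a non-empty list sorted in ascending order."""
    rank = max(1, (len(sorted_samples) * p + 99) // 100)  # ceil(n * p / 100)
    return sorted_samples[rank - 1]


def summarize(latencies_ms):
    """Reduce per-batch latencies to the percentiles reported for one config."""
    ordered = sorted(latencies_ms)
    return {"p50_ms": percentile(ordered, 50), "p99_ms": percentile(ordered, 99)}


def run_sweep(measure, out_path):
    """Run every config and rewrite out_path after each one.

    `measure(batch_size, width, str_len)` is a hypothetical callback that
    returns a list of per-batch latencies in milliseconds.
    """
    results = []
    grid = itertools.product(BATCH_SIZES, SCHEMA_WIDTHS, STRING_LENS)
    for batch_size, width, str_len in grid:
        latencies_ms = measure(batch_size, width, str_len)
        results.append({
            "batch_size": batch_size,
            "schema_width": width,
            "string_len": str_len,
            **summarize(latencies_ms),
        })
        # Write to a temp file and rename, so a kill mid-write never
        # leaves a truncated artifact behind.
        tmp_path = out_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f, indent=2)
        os.replace(tmp_path, out_path)
    return results
```

If the process is killed during config N, the file on disk still holds the N-1 completed configs, which is the property the incremental writes are meant to provide.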
   
   Non-blocking by design: the workflow is not included in 
`required-checks.yml`'s `required-checks` aggregator, so a flaky bench does not 
gate merges. The gh-pages auto-push runs only on pushes to main, so PR runs do 
not pollute the tracked baseline.
   
   ASF policy: `benchmark-action/github-action-benchmark` is SHA-pinned to 
`52576c92bccf6ac60c8223ec7eb2565637cae9ba` (v1.22.1), matching the entry 
already on `apache/infrastructure-actions`'s allow-list.
   
   ### Any related issues, documentation, discussions?
   
   Closes #5556
   
   ### How was this PR tested?
   
   Local smoke runs on macOS exercised the full Pass-1 round-trip and a 
2-config sweep. They confirmed the control sequence (`InitializeExecutor` → 
`AssignPort` × 2 → `AddInputChannel` → `AddPartitioning` → `OpenExecutor` → 
`StartWorker`), the `StartChannel` ECM, and the `DataFrame` echo through the 
real Python identity UDF, and that 
`bench-results/arrow-flight-e2e-{throughput,latency}.json` validate against the 
github-action-benchmark schema.
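
A minimal sketch of what such a schema check might look like is given below. It assumes the result files use the action's custom-tool input format (`customSmallerIsBetter` / `customBiggerIsBetter`), which is a JSON array of objects with required `name`, `unit`, and `value` fields and optional string `range` and `extra` fields. The PR does not state which `tool` setting is used, and the helper names here are hypothetical.

```python
import json
import math


def validate_entries(entries):
    """Return a list of problems; an empty list means the shape looks valid."""
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]
    errors = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"[{i}]: entry must be a JSON object")
            continue
        # Required string fields.
        for key in ("name", "unit"):
            if not isinstance(entry.get(key), str):
                errors.append(f"[{i}]: '{key}' must be a string")
        # Required numeric field; bool is excluded because it subclasses int.
        value = entry.get("value")
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            errors.append(f"[{i}]: 'value' must be a finite number")
        # Optional string fields.
        for key in ("range", "extra"):
            if key in entry and not isinstance(entry[key], str):
                errors.append(f"[{i}]: '{key}' must be a string when present")
    return errors


def validate_file(path):
    """Load one result file and validate its entries."""
    with open(path) as f:
        return validate_entries(json.load(f))
```

Calling `validate_file` on each of the two result files and checking for an empty list would be one way to reproduce the check locally.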
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)

