Yicong-Huang opened a new pull request, #5557:
URL: https://github.com/apache/texera/pull/5557
### What changes were proposed in this PR?
Adds an end-to-end micro-benchmark of the Arrow Flight data path between
Scala/Pekko and the Python UDF worker, plus a bench-agnostic CI workflow that
publishes results to a github-action-benchmark dashboard.
- `amber/src/test/scala/.../bench/ArrowFlightActorBench.scala` — spawns a
real `PythonWorkflowWorker` actor (real Pekko mailbox + real
`texera_run_python_worker.py` subprocess + real Arrow Flight gRPC) wired to an
identity Python UDF; sweeps a 36-config grid (`batch_size × schema_width ×
string_len`) with per-batch send→echo latency percentiles and throughput;
writes CSV + JSON incrementally after each config so a killed/timed-out sweep
still leaves usable artifacts.
- `bin/run-benchmarks.sh` — single opaque CI entry point. Future bench
suites (e.g. JMH for `ArrowUtils.fromTexeraSchema` / `appendTexeraTuple`) plug
in by appending one line here.
- `.github/workflows/benchmarks.yml` — bench-agnostic umbrella workflow.
Label-based trigger gate mirrors `amber-integration` exactly (`python` /
`engine` / `amber-integration` / `common` / `ddl-change` / `ci`); the `Wait for
Pull Request Labeler` step is lifted from `required-checks.yml` so the labeler
race is handled the same way. Adding a new bench is one publish-step block;
this workflow file otherwise stays unchanged.
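
To make the sweep structure in the first bullet concrete, here is a minimal Python sketch of the same idea (the actual bench is Scala). It is an illustration under stated assumptions, not the PR's code: the grid values, the column names, the nearest-rank percentile method, and the `measure` callback are all hypothetical, and throughput assumes batches are sent one at a time so total time is the sum of the round-trip latencies.

```python
import csv
import itertools
import math

# Hypothetical grid values: the PR states 36 configs, not the values themselves.
BATCH_SIZES = [100, 1_000, 10_000, 100_000]
SCHEMA_WIDTHS = [1, 10, 50]
STRING_LENS = [8, 64, 512]

FIELDS = ["batch_size", "schema_width", "string_len",
          "p50_ms", "p95_ms", "p99_ms", "tuples_per_s"]


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with integer p in 1..100."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, batch_size):
    """Reduce per-batch send-to-echo latencies to percentiles and throughput."""
    ordered = sorted(latencies_ms)
    # Assumption: batches are sequential, so elapsed time is the latency sum.
    total_s = sum(latencies_ms) / 1000
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "tuples_per_s": batch_size * len(latencies_ms) / total_s,
    }


def run_sweep(measure, out):
    """Run every config, writing and flushing one CSV row per finished config."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    grid = itertools.product(BATCH_SIZES, SCHEMA_WIDTHS, STRING_LENS)
    for batch_size, width, string_len in grid:
        row = {"batch_size": batch_size, "schema_width": width,
               "string_len": string_len}
        row.update(summarize(measure(batch_size, width, string_len), batch_size))
        writer.writerow(row)
        out.flush()  # a killed sweep still leaves the rows written so far
```

Here `measure(batch_size, width, string_len)` stands in for whatever drives the real worker and returns a list of per-batch latencies in milliseconds. Flushing after each row is what keeps partial results usable when a run times out.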
Non-blocking by design: the workflow is not included in `required-checks.yml`'s
`required-checks` aggregator, so a flaky bench does not gate merges. The
gh-pages auto-push is gated on push-to-main, so PR runs do not pollute the
tracked baseline.
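
The two gates described above (label-based triggering and push-to-main publishing) reduce to two small predicates. The sketch below expresses them in Python purely for illustration; the real logic lives in the workflow YAML, and the function names are hypothetical. The `"push"` event name and `refs/heads/main` ref follow the standard GitHub Actions context values.

```python
# Labels listed in this PR description for the trigger gate.
TRIGGER_LABELS = {"python", "engine", "amber-integration",
                  "common", "ddl-change", "ci"}


def has_trigger_label(pr_labels):
    """True when the pull request carries at least one trigger label."""
    return not TRIGGER_LABELS.isdisjoint(pr_labels)


def should_auto_push(event_name, ref):
    """Publish to gh-pages only for pushes to main, never for PR runs."""
    return event_name == "push" and ref == "refs/heads/main"
```

For example, `has_trigger_label(["docs", "engine"])` is true, while `should_auto_push("pull_request", "refs/pull/5557/merge")` is false, so a PR run can produce results without touching the tracked baseline.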
ASF policy: `benchmark-action/github-action-benchmark` is SHA-pinned to
`52576c92bccf6ac60c8223ec7eb2565637cae9ba` (v1.22.1), matching the entry
already on `apache/infrastructure-actions`'s allow-list.
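
As a rough illustration of what "SHA-pinned" means for a `uses:` reference, the following Python sketch checks that the part after `@` is a full 40-character commit SHA rather than a tag or branch. The helper name and the regular expression are assumptions made for this example; they are not part of the PR or of any ASF tooling.

```python
import re

# owner/repo (optionally with a sub-path), then "@" and 40 lowercase hex digits.
PINNED_USES = re.compile(r"[\w.-]+/[\w./-]+@[0-9a-f]{40}")


def is_sha_pinned(uses):
    """True when a workflow `uses:` value references a full commit SHA."""
    return PINNED_USES.fullmatch(uses) is not None
```

With this check, `benchmark-action/github-action-benchmark@52576c92bccf6ac60c8223ec7eb2565637cae9ba` passes, while the same action referenced as `@v1.22.1` does not.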
### Any related issues, documentation, discussions?
Closes #5556
### How was this PR tested?
Ran local smoke tests on macOS covering the full Pass-1 round-trip and a
2-config sweep. These confirmed the control sequence (`InitializeExecutor` →
`AssignPort` × 2 → `AddInputChannel` → `AddPartitioning` → `OpenExecutor` →
`StartWorker`), the `StartChannel` ECM, and the `DataFrame` echo through the
real Python identity UDF. They also confirmed that
`bench-results/arrow-flight-e2e-{throughput,latency}.json` validate against the
github-action-benchmark schema.
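
For reference, the custom-tool input that github-action-benchmark accepts is a JSON array of objects with a string `name`, a string `unit`, and a numeric `value` (plus optional `range` and `extra`). A minimal validation sketch along those lines is shown below. It assumes the two result files use this custom format, which the description does not state explicitly, and `validate_entries` is a hypothetical helper rather than anything shipped in the PR.

```python
import json


def validate_entries(text):
    """Parse benchmark JSON and check the required name/unit/value fields."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("top level must be a JSON array")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not an object")
        for key in ("name", "unit"):
            if not isinstance(entry.get(key), str):
                raise ValueError(f"entry {index}: '{key}' must be a string")
        value = entry.get("value")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"entry {index}: 'value' must be a number")
    return entries
```

A check like this could be run against each file under `bench-results/` before the publish step, so that a malformed result fails fast instead of producing a confusing dashboard error.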
### Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)