andygrove opened a new pull request, #1917:
URL: https://github.com/apache/datafusion-ballista/pull/1917
# Which issue does this PR close?
No dedicated issue; this strengthens the TPC-H CI added in #1913 by turning
it
into a correctness gate. The Q11 scale-factor threshold noted below is left
for
a separate follow-up.
# Rationale for this change
The TPC-H CI job runs all queries under AQE off and AQE on, but only checks
that
each query *runs without erroring* — it never checks results. A change that
made
Ballista return wrong rows (a broadcast/shuffle/serialization bug that drops,
duplicates, or corrupts rows) would pass CI silently.
This makes the job a correctness gate by comparing every Ballista result
against
single-process DataFusion over the same data. Ballista embeds DataFusion, so
for
a correct distributed execution the two must agree. This catches
Ballista-specific divergence — the highest-value class — and is scale-factor
agnostic (no committed answer files).
# What changes are included in this PR?
- Add a `--verify` flag to `tpch benchmark ballista`. For each query it runs
the
same SQL through a single-process DataFusion `SessionContext` over the same
Parquet data and compares the result.
- Add `compare_results`, replacing the old `assert_expected_results`. It
returns
a descriptive error (row-count, schema, or first differing cell) instead of
panicking, so a mismatch fails the binary with a useful message. Decimal,
integer, date, and string columns are compared exactly; only `Float32`/
`Float64` columns use a small relative-or-absolute tolerance, since
distributed vs single-process aggregation order differs and float `sum` is
non-associative. `Utf8`/`Utf8View` (and the large/binary view variants) are
treated as equal representations.
- Fix multi-statement result capture: a query file may wrap the answer in
setup/teardown statements (q15 creates and drops a view). The run loops now
execute every statement but capture the result of the answer statement (the
last `SELECT`/`WITH`), instead of reporting a trailing `DROP VIEW`.
Applied to
both the Ballista and DataFusion paths.
- Wire `--verify` into `.github/workflows/tpch.yml` for both the AQE-off and
AQE-on suites.
Validated locally on TPC-H SF10 (1 scheduler, 1 executor, 4 task slots, 16
partitions): all 21 queries (q16 omitted, as in CI) pass `--verify` under
both
AQE off and AQE on, including the ratio queries (q14/q17/q19/q22) under the
float
tolerance and the multi-statement q15.
Known follow-up (out of scope here): q11's `HAVING` threshold literal
`0.0001`
is the SF1 value and is not scaled by `1/SF`, so it returns 0 rows at SF10 in
both Ballista and single-process DataFusion. Since both engines agree, the
cross-check passes; the fix belongs in the query file and will be handled
separately.
# Are there any user-facing changes?
No. The change adds an opt-in `--verify` flag to the benchmark tool and
enables
it in CI; there are no changes to Ballista's runtime behavior or public APIs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]