luis4a0 opened a new pull request, #12108:
URL: https://github.com/apache/gluten/pull/12108
## What changes are proposed in this pull request?
> **Stacked on top of #12107.** Until that PR merges this PR's diff will
> appear larger (it carries both #12107's commits and this PR's commits).
> Once #12107 lands in `main`, the diff here will narrow to just the
> SplitRV sub-stage changes (2 commits, ~232 lines). Reviewers may wait
> for #12107 to merge if they prefer to look at this PR's contribution
> in isolation.
The existing `CpuWallTimingSplitRV` bucket wraps the entire
`splitRowVector()` call, so when it dominates wall time in a
`VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which
of the four sub-paths is responsible: fixed-width scatter, validity
copy, binary split, or complex (Presto-serialized) split.
This PR subdivides `SplitRV` into four new buckets, one per helper
inside `splitRowVector()`:
| New enum | Helper |
|---------------------------------------|-----------------------------------|
| `CpuWallTimingSplitFixedWidth` | `splitFixedWidthValueBuffer()` |
| `CpuWallTimingSplitValidity` | `splitValidityBuffer()` |
| `CpuWallTimingSplitBinary` | `splitBinaryArray()` |
| `CpuWallTimingSplitComplex` | `splitComplexType()` |
The outer `SplitRV` bucket is unchanged: it continues to count once per
call, and its wall/cpu measurement now happens to overlap the four
inner buckets (the outer timer is still ticking while each inner one
is). The outer **minus** the sum-of-inners is therefore the time spent
in `splitRowVector()` outside the four sub-paths (mostly the post-split
`partitionBufferBase_` update loop). This is the same parent/child
pattern most profiling tools use for nested timers, and it's documented
in a comment next to the enum.
To make the new sub-stage data accessible to anything other than the
existing compile-time log path (tests, future `ShuffleWriterMetrics`
exposure, profilers), three things are hoisted from `protected` to
`public` in `VeloxShuffleWriter`:
- The `CpuWallTimingType` enum
- The `CpuWallTimingName()` helper (already de-facto public, since
its output is printed by the `VELOX_SHUFFLE_WRITER_LOG_FLAG` log)
- A new single-value accessor `cpuWallTiming(CpuWallTimingType)`
The underlying `cpuWallTimingList_` storage stays `protected`.
### Why now
This is the natural follow-up to #12083 (which made the per-stage
timers actually record real numbers) and to #12107 (which adds a
per-batch input-encoding counter). Together, the three changes let a
reviewer answer **"what shape did the writer see, where did the time
go, and what inner path drove it"** from a single
`VELOX_SHUFFLE_WRITER_LOG_FLAG` run — instead of being stuck at
"`SplitRV` was 65 % of wall time, *somewhere in there*".
A per-query TPC-DS/TPC-H SF10 profile sweep on a closely-related
internal branch shows `SplitRV` is the #1 or #2 bucket on 10/12
queries surveyed (46 – 71 % of wall), with substantially different
sub-stage breakdowns: bigint-heavy queries (q17, q14a) are
overwhelmingly `splitFixedWidthValueBuffer`, while VARCHAR-heavy q67
shifts toward `splitBinaryArray`. Without the sub-stage timers, those
breakdowns are invisible.
## How was this patch tested?
This PR adds `cpp/velox/tests/VeloxHashShuffleWriterSplitRvSubStagesTest.cc`
(second commit) with five gtest cases:
- `enumNames` — validates the strings emitted under
`VELOX_SHUFFLE_WRITER_LOG_FLAG` for the 4 new enum entries (these
strings are stable observable surface).
- `fixedWidthBumpsItsBucket` — single fixed-width batch; asserts each
of the 4 inner buckets has `count == 1` (every helper inside
`splitRowVector()` is called once per batch regardless of column-type
mix, since the new sub-stage `SCOPED_TIMER` calls live at function
entry).
- `binaryBumpsItsBucket` — VARCHAR batch; asserts the binary
sub-stage's `wallNanos > 0`, confirming real work is being measured
(not just function-entry overhead).
- `complexBumpsItsBucket` — ARRAY batch; same `wallNanos > 0`
assertion against the complex sub-stage (Presto-serialized
round-trip per partition).
- `outerSplitRvStillCounted` — sanity: two batches in, outer `SplitRV`
bucket has `count == 2`. Confirms the new inner timers refine, not
replace, the existing outer timer.
End-to-end runtime validation in a real Spark/Gluten run was not
possible in my development environment for the same JNI-symbol-resolution
reason documented in #12083 and #12107. The local Velox ep build here
is configured with `VELOX_BUILD_TEST_UTILS=OFF`, so the new gtest
binary was syntax-validated locally (`g++ -fsyntax-only` against the
in-tree compile flags) but not executed; Gluten CI will run it.
Marked as draft until #12107 merges (so reviewers see the focused
SplitRV diff).
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]