GitHub user PG1204 edited a comment on the discussion: RFC: Workflow 
Performance Profiler - full design & implementation walkthrough

On the six rules. They're not a designed taxonomy; they're the six bottleneck 
shapes I kept hitting on hackathon test workflows, each mapped to a recurring 
user question:


Rule | User question it answers
-- | --
RUNTIME_OUTLIER | Which op dominates the runtime?
LOW_PARALLELISM_HOT_OP | Is the hot op slow because of compute or parallelism?
IDLE_HEAVY | Is this op the bottleneck, or starved upstream?
UPSTREAM_OVERPRODUCTION | Am I reading way more data than I consume?
JOIN_HIGH_FANIN_LOW_FANOUT | Is this join doing wasted work?
SCAN_FULL_TABLE_NO_FILTER | Could a predicate pushdown save this?

Thresholds (3× median, ≥10× upstream ratio, <5% join fan-out, ≥70% idle, ≥1M 
scan rows, score ≥ user-configured hot threshold with workers ≤ 1) were 
hand-tuned on those workflows to favor a low false-positive rate over recall. 
No formal benchmark backs them; they're conservative defaults, open for revision.
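To make the thresholds concrete, here is a minimal Python sketch of how a few of these checks could be expressed. It is not the profiler's actual code: the `Thresholds` class, its field names, the function names, and the choice of inclusive vs. strict comparison for "3× median" are all assumptions made for illustration. Only the numeric values come from the list above.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Thresholds:
    # Numeric values are the ones quoted above; field names are hypothetical.
    runtime_outlier_factor: float = 3.0   # op runtime vs. median op runtime
    upstream_ratio_min: float = 10.0      # upstream rows produced / rows consumed
    join_fanout_max: float = 0.05         # join output rows / join input rows
    idle_fraction_min: float = 0.70       # idle time / total time
    scan_rows_min: int = 1_000_000        # rows read by a scan


DEFAULTS = Thresholds()


def is_runtime_outlier(runtime_ms, all_runtimes_ms, t=DEFAULTS):
    # Assumption: "3x median" means at least 3x the median runtime across ops.
    return runtime_ms >= t.runtime_outlier_factor * median(all_runtimes_ms)


def is_idle_heavy(idle_ms, total_ms, t=DEFAULTS):
    return total_ms > 0 and idle_ms / total_ms >= t.idle_fraction_min


def is_low_fanout_join(input_rows, output_rows, t=DEFAULTS):
    return input_rows > 0 and output_rows / input_rows < t.join_fanout_max


def is_low_parallelism_hot_op(score, workers, hot_threshold):
    # hot_threshold is user-configured, so it is passed in rather than defaulted.
    return score >= hot_threshold and workers <= 1
```

Keeping the values in one frozen object would make the "open for revision" part cheap: a revised set of defaults is a single `Thresholds(...)` instance passed to the checks.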

Which one wins when several fire: None, as they're additive. The panel shows 
all hints that fire on the same op, in stable alphabetical-by-rule-id order. In 
the screenshot above, RUNTIME_OUTLIER and LOW_PARALLELISM_HOT_OP fire together 
by design: the first says "this is the bottleneck", the second "and here's 
likely why." They compound rather than conflict. Real overlaps exist 
(UPSTREAM_OVERPRODUCTION vs JOIN_HIGH_FANIN_LOW_FANOUT, IDLE_HEAVY vs 
LOW_PARALLELISM_HOT_OP), but dedup/priority isn't designed yet. Happy to 
make that part of the separate hints issue you suggested.
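As a rough sketch of the additive behavior described above (every fired hint is kept, ordered alphabetically by rule id per op), something like the following would do. The function name, the tuple shape, and the op id `"join-1"` are hypothetical, not taken from the implementation.

```python
from collections import defaultdict


def group_hints(fired):
    """Group (op_id, rule_id, message) hints per op, sorted by rule id.

    No rule suppresses another: every fired hint is kept.
    """
    by_op = defaultdict(list)
    for op_id, rule_id, message in fired:
        by_op[op_id].append((rule_id, message))
    # sorted() is stable, so hints sharing a rule id keep their input order.
    return {op: sorted(hints, key=lambda h: h[0]) for op, hints in by_op.items()}


fired = [
    ("join-1", "RUNTIME_OUTLIER", "this is the bottleneck"),
    ("join-1", "LOW_PARALLELISM_HOT_OP", "and here's likely why"),
]
panel = group_hints(fired)
```

A future dedup/priority pass could slot in as a filter over each per-op list without changing this grouping step.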

GitHub link: 
https://github.com/apache/texera/discussions/5216#discussioncomment-17285710
