Xiao-zhen-Liu opened a new issue, #5884: URL: https://github.com/apache/texera/issues/5884
Parent: #5881 · Design: #5880

## Goal

Let the scheduler reuse matched ports. Before `CostBasedScheduleGenerator` runs, remove the operators whose results are reused, leaving the run-skeleton; schedule the run-skeleton as today; then combine it with the skipped regions into one schedule. This is the core PR and needs the most review.

Note: the planning diagrams in the design Discussion show a cost estimator inside the scheduler; in this PR the cost-based choice is not used (every matched port is reused). It is the existing estimator, run only on the run-skeleton.

## What is included

- `CacheReusePreSchedulingStep`: the reuse planner and skeleton generator. Starting from the outputs the run needs (sinks plus any ports the user asked to view), it keeps an operator only if one of its needed outputs is not a matched port; if a needed output is matched, its saved result supplies it and the operators above it are removed. The result is the run-skeleton, the skipped (ToSkip) regions, and the saved result locations to use for matched inputs and outputs.
- `PreSchedulingHints`: a small carrier for those result-location overrides; empty by default.
- `CostBasedScheduleGenerator`: two new optional inputs (the hints and the skipped regions), both empty by default. When empty, the schedule is built exactly as today. When present, it schedules the run-skeleton and places the skipped regions ahead of the regions that run.
- `Region` gains a `cached` flag (default false). `RegionExecutionCoordinator` runs the normal path when false, and a skip path when true: it records a completed-from-cache result with no workers and passes the saved result locations downstream.
- Supporting changes: thread the execution id through the coordinators, move the workflow-completion signal to the execution coordinator (a skipped region completes at once with no workers), and add the optional cached fields on the port config.
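
The backward walk described for `CacheReusePreSchedulingStep` can be illustrated with a minimal Python sketch. This is not the PR's code: the function name `plan_reuse`, the `(operator, port)` tuples, the `inputs` map, and the location strings are all hypothetical, and the sketch works at operator level rather than on regions. It assumes every operator in the plan feeds some needed output.

```python
from collections import deque


def plan_reuse(inputs, needed_outputs, matched):
    """Sketch of the reuse planner (hypothetical names, not the PR's API).

    inputs:         operator -> list of upstream (operator, port) outputs it reads
    needed_outputs: (operator, port) outputs the run must produce
    matched:        (operator, port) -> saved result location
    Returns (kept operators, skipped operators, location overrides).
    """
    if not matched:
        # Nothing to reuse: the run-skeleton is the whole plan.
        return set(inputs), set(), {}

    kept, overrides, seen = set(), {}, set()
    queue = deque(needed_outputs)
    while queue:
        port = queue.popleft()
        if port in seen:
            continue
        seen.add(port)
        if port in matched:
            # The saved result supplies this output; do not walk further upstream.
            overrides[port] = matched[port]
            continue
        op, _ = port
        if op not in kept:
            # A needed output is not matched, so the operator must run,
            # and everything it reads becomes needed in turn.
            kept.add(op)
            queue.extend(inputs.get(op, []))
    skipped = set(inputs) - kept
    return kept, skipped, overrides


# Linear plan scan -> filter -> agg, where the output of filter is matched.
inputs = {
    "scan": [],
    "filter": [("scan", "out")],
    "agg": [("filter", "out")],
}
kept, skipped, overrides = plan_reuse(
    inputs, [("agg", "out")], {("filter", "out"): "saved/filter-out"}
)
print(sorted(kept))     # ['agg']
print(sorted(skipped))  # ['filter', 'scan']
```

In this example only `agg` stays in the run-skeleton, and the override tells it to read its input from the saved location instead of from `filter`.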
## Why this is safe when there are no matched ports

When the cache lookup returns nothing, `CacheReusePreSchedulingStep` returns early: the run-skeleton is the whole plan, the hints are empty, and there are no skipped regions. `CostBasedScheduleGenerator` then builds the schedule as it does today, and every region has `cached = false`, so the coordinator runs the normal path.

## Depends on

PR 1 (the `cachedOutputs` field and cache types) and PR 2 (the `COMPLETED_FROM_CACHE` state). It can merge while still inactive, since nothing reads or writes cache entries until PR 4.

## Out of scope

No cost-based decision about whether to reuse (full reuse only). Reading and writing cache entries at submission and completion (PR 4).

## Size

About 800 lines of code, plus tests.
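
The no-match argument in "Why this is safe when there are no matched ports" can be sketched in a few lines of Python. Everything here is illustrative rather than taken from the PR: `build_schedule`, `run_region`, the `saved_locations` field, the dictionary result shape, and the `"COMPLETED"` state name are assumptions. Only the `cached` flag and the `COMPLETED_FROM_CACHE` state come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    cached: bool = False          # default false, as in the description
    saved_locations: tuple = ()   # hypothetical field for saved result locations


def build_schedule(skeleton_regions, skipped_regions=()):
    """Place skipped regions ahead of the regions that run.

    With the default empty `skipped_regions`, the result is just the
    skeleton schedule, unchanged.
    """
    return list(skipped_regions) + list(skeleton_regions)


def run_region(region):
    """Dispatch on the cached flag (result shape is an assumption)."""
    if region.cached:
        # Skip path: no workers, saved locations are passed downstream.
        return {
            "region": region.name,
            "state": "COMPLETED_FROM_CACHE",
            "workers": 0,
            "downstream": region.saved_locations,
        }
    # Normal path: stands in for the existing execution logic.
    return {"region": region.name, "state": "COMPLETED", "workers": 1, "downstream": ()}


# No matched ports: no skipped regions, every region has cached = False.
regions = [Region("r1"), Region("r2")]
schedule = build_schedule(regions)
print(schedule == regions)                             # True
print([run_region(r)["state"] for r in schedule])      # ['COMPLETED', 'COMPLETED']
```

Because both new inputs default to empty and `cached` defaults to false, the sketch takes the same path as the pre-existing code whenever the cache lookup returns nothing.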
