andygrove opened a new pull request, #4129:
URL: https://github.com/apache/datafusion-comet/pull/4129

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   Each Spark-version-specific TPC-DS plan-stability directory currently stores 
a full copy of every query's golden plan, even when that query's plan is 
byte-identical to the previous version's. After the recent Spark 4.1 enablement 
(#4106), the per-version `approved-plans-*-spark4_1` directories duplicated 196 
of 206 (v1_4) and 60 of 64 (v2_7) plans against the 4.0 directory, which itself 
was largely a duplicate of 3.5, which was largely a duplicate of the base. This 
pattern grows quadratically as new versions are added.
   
   This PR teaches `CometPlanStabilitySuite` to fall back through the chain of 
older version directories at read time, so each version-specific directory only 
needs to contain the queries whose plans actually diverge.
   
   ## What changes are included in this PR?
   
   - `CometPlanStabilitySuite` gains a `protected def fallbackGoldenFilePaths: 
Seq[String]` hook. `getDirForTest` is unchanged in regenerate mode (always 
writes to the primary directory). For read-time lookups, when the primary 
`goldenFilePath` does not contain a directory for the active query, the suite 
walks the fallback list and returns the first existing directory.
   - Both `CometTPCDSV1_4_PlanStabilitySuite` and 
`CometTPCDSV2_7_PlanStabilitySuite` build their fallback chain via 
`CometPlanStabilitySuite.planNameChain(variant)`, which assembles the right 
sequence based on `isSpark{35,40,41}Plus`. The chain ends at the unsuffixed 
base directory.
   - `dev/regenerate-golden-files.sh` now prunes any query directory whose 
contents match what the fallback chain would resolve to, immediately after 
regenerating each version. This keeps the directories sparse without manual 
cleanup. The allowed `--spark-version` values are extended to include `4.1` 
(now supported on `main`).
   - Applies the prune once across the existing tree, removing 766 duplicate 
query directories:
   
     | directory | before | after |
     | --- | --- | --- |
     | `approved-plans-v1_4-spark3_5` | 206 | 10 |
     | `approved-plans-v1_4-spark4_0` | 206 | 12 |
     | `approved-plans-v1_4-spark4_1` | 206 | 10 |
     | `approved-plans-v2_7-spark3_5` | 64  | 4  |
     | `approved-plans-v2_7-spark4_0` | 64  | 4  |
     | `approved-plans-v2_7-spark4_1` | 64  | 4  |
   
   ## How are these changes tested?
   
   Locally, `CometTPCDSV1_4_PlanStabilitySuite` (194 tests) and 
`CometTPCDSV2_7_PlanStabilitySuite` (64 tests) pass with 0 failures against 
`-Pspark-3.5`, `-Pspark-4.0`, and `-Pspark-4.1` after the prune. Spark 3.4 
reads only the base directory and is unaffected. CI will exercise all four 
matrix profiles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to