brijrajk opened a new issue, #12375:
URL: https://github.com/apache/gluten/issues/12375

   ## Summary
   
   `GlutenTPCHPlanStabilitySuite` → `tpch/q19` fails in `spark-test-spark40` CI 
for any PR that touches Velox backend Scala files. The failure is caused by a 
stale golden file combined with a known limitation in the ExprId normalizer.
   
   ## Affected check
   
   `spark-test-spark40` (and `spark-test-spark41`)
   
   ## Root cause
   
   `GlutenPlanStabilitySuite.glutenNormalizeIds()` uses the regex 
`(?<prefix>(?<!id=)#)\\d+L?`, which matches **any** `#<number>` in the explain 
text that is not immediately preceded by `id=`, including TPC-H string 
literals. The `p_brand` filter in q19 uses values 
`Brand#11`, `Brand#12`, `Brand#13` (actual TPC-H spec data values). These 
appear unquoted in the explain output:
   
   ```
   EqualTo(p_brand, Brand#12)
   ```
   
   The normalizer incorrectly treats `#12` as an ExprId and remaps it 
sequentially based on encounter order. The suite code itself documents this 
limitation at lines 67–68:
   
   > *"Running all suites together in one JVM is recommended to avoid ExprId 
normalization issues where string constants (e.g., Brand#23 in TPCH q19) may 
collide with ExprId numbers."*
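
The remapping can be illustrated with a minimal sketch. It uses the regex quoted above, but the object name `ExprIdCollisionSketch` and the body of `normalize` are assumptions written for illustration (first-encounter renumbering, as described in this issue). It is not the actual `glutenNormalizeIds` implementation.

```scala
import scala.collection.mutable
import scala.util.matching.Regex

// Hypothetical stand-in for the suite's normalizer, for illustration only.
object ExprIdCollisionSketch {
  // Regex quoted in the issue: '#' not preceded by "id=", then digits and an optional 'L'.
  private val exprIdRegex: Regex = "(?<prefix>(?<!id=)#)\\d+L?".r

  // Assumed behaviour: every distinct match is renumbered 1, 2, 3, ... in encounter order.
  def normalize(plan: String): String = {
    val seen = mutable.LinkedHashMap.empty[String, Int]
    exprIdRegex.replaceAllIn(plan, m => {
      val id = seen.getOrElseUpdate(m.matched, seen.size + 1)
      Regex.quoteReplacement(s"#$id")
    })
  }

  def main(args: Array[String]): Unit = {
    // The literal Brand#12 is renumbered as if it were an ExprId.
    println(normalize("EqualTo(p_brand#7, Brand#12)"))
    // One extra attribute earlier in the plan shifts the "normalized" literal.
    println(normalize("Project [l_quantity#3] EqualTo(p_brand#7, Brand#12)"))
  }
}
```

In this sketch the first call yields `EqualTo(p_brand#1, Brand#2)` and the second yields `Project [l_quantity#1] EqualTo(p_brand#2, Brand#3)`: the literal's rendered value depends on how many ExprIds precede it, so any change in the plan ahead of it alters the normalized literal.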
   
   ## How it manifests
   
   The golden file was committed in #11805 (`c37fee4e5`, 2026-03-24). Over the 
264 commits since then, new optimizer rules and expressions shifted the ExprId 
counter. `Brand#12` now normalizes to `Brand#6` and `_pre_1#14` shifts to 
`_pre_1#13`, causing a spurious mismatch.
   
   Reproduced on `main` at commit `6097b59a6` (2026-06-25) without any pending 
PR:
   ```
   Tests: succeeded 21, failed 1  ← tpch/q19
   BUILD FAILURE
   ```
   
   ## PRs affected
   
   - #12151 — [GLUTEN-12013][VL] Fix bloom-filter bytes corruption on 
whole-stage AQE fallback
   - #12095 — [GLUTEN-12094][VL] Strip default comparator from array_sort for 
Velox offloading
   - #12056 — [GLUTEN-11921] Enable Parquet read/write test for NullType
   
   ## Short-term fix
   
   Refresh `q19/explain.txt` via `SPARK_GENERATE_GOLDEN_FILES=1` — tracked in 
#12374.
   
   ## Long-term fix
   
   Make `glutenNormalizeIds` skip `#N` patterns that appear inside string 
literal contexts (i.e., where the `#` is preceded by non-whitespace word 
characters that are not a column/expression name). This would prevent TPC-H 
brand values like `Brand#12` from being incorrectly normalized.
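
One narrow way to approximate this is sketched below. It assumes the literal prefixes can be enumerated up front (here only `Brand`), which is a simpler heuristic than the general "string literal context" detection proposed above. The names `LiteralAwareNormalizerSketch` and `literalPrefixes` are hypothetical and do not exist in the suite.

```scala
import scala.collection.mutable
import scala.util.matching.Regex

// Hypothetical variant of the normalizer that leaves known literals alone.
object LiteralAwareNormalizerSketch {
  // Assumption: literal prefixes that must never be treated as ExprId owners.
  private val literalPrefixes: Seq[String] = Seq("Brand")

  // One negative lookbehind per excluded prefix, plus the existing "id=" exclusion.
  private val lookbehinds: String =
    ("id=" +: literalPrefixes).map(p => s"(?<!${Regex.quote(p)})").mkString

  private val exprIdRegex: Regex = (lookbehinds + "#\\d+L?").r

  // Same first-encounter renumbering as before, applied only to the remaining matches.
  def normalize(plan: String): String = {
    val seen = mutable.LinkedHashMap.empty[String, Int]
    exprIdRegex.replaceAllIn(plan, m => {
      val id = seen.getOrElseUpdate(m.matched, seen.size + 1)
      Regex.quoteReplacement(s"#$id")
    })
  }

  def main(args: Array[String]): Unit = {
    // Brand#12 survives unchanged regardless of how many ExprIds precede it.
    println(normalize("EqualTo(p_brand#7, Brand#12)"))
    println(normalize("Project [l_quantity#3] EqualTo(p_brand#7, Brand#12)"))
  }
}
```

With this variant `Brand#12` is left as-is in both calls while `p_brand#7` is still renumbered, since the lookbehind is case-sensitive. The trade-off is that an allow-list has to be maintained by hand; a real fix may prefer a structural check over enumerated prefixes.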


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

