[PR] [refactor](search) Refactor SearchDslParser to single-phase ANTLR parsing and fix ES compatibility issues [doris]

via GitHub Tue, 10 Feb 2026 09:29:36 -0800


airborne12 opened a new pull request, #60654:
URL: https://github.com/apache/doris/pull/60654


   ### What problem does this PR solve?
   
   Problem Summary:
   
   The `search()` function's DSL parser had multiple ES compatibility issues 
and used a two-phase parsing approach (manual pre-parse + ANTLR) that was 
error-prone. This PR refactors the parser and fixes several bugs:
   
   1. **SearchDslParser refactoring**: Consolidated from two-phase (manual 
pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now handles 
all DSL syntax directly, eliminating the fragile manual pre-parse layer. This 
fixes issues with operator precedence, grouping, and edge cases.
   
   2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and 
`SearchParser.g4` to properly handle quoted phrases, field-qualified 
expressions, prefix/wildcard/regexp patterns, range queries, and boolean 
operators with correct precedence.
   
   3. **minimum_should_match pipeline**: Added `default_operator` and 
`minimum_should_match` fields to `TSearchParam` thrift, passing them from FE 
`SearchPredicate` through to BE `function_search`. When 
`minimum_should_match > 0`, the query uses `OccurBooleanQuery` for proper 
Lucene-style boolean query semantics.
   
   4. **Wildcard/Prefix/Regexp case-sensitivity**: WILDCARD and PREFIX patterns 
are now lowercased when the index has a parser configured and `lower_case=true` 
(matching ES query_string normalizer behavior). REGEXP patterns are NOT 
lowercased (matching ES regex behavior, where patterns bypass analysis).
   
   5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for 
standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery` with 
deferred `max_doc` from `context.segment_num_rows` and nullable field support 
via `NullableScorer`.
   
   6. **BE fixes**:
      - `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to 
prevent PREFIX queries from missing documents
      - `occur_boolean_weight`: Fixed a bug that swapped scorers in instead of 
appending them when all SHOULD clauses must match; existing MUST scorers are 
now preserved
      - Variant subcolumn `index_properties` propagation for proper analyzer 
selection
      - `lower_case` default handling: inverted index `lower_case` defaults to 
`"true"` when a parser is configured
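
The case-handling rule from item 4, combined with the `lower_case` default from item 6, can be illustrated with a minimal Python sketch. This is not the Doris implementation: `effective_lower_case`, `normalize_pattern`, the dictionary-shaped index properties, and the `"english"` parser value are hypothetical and exist only for this example.

```python
from typing import Mapping


def effective_lower_case(index_properties: Mapping[str, str]) -> bool:
    """Hypothetical helper: decide whether patterns should be lowercased.

    Assumption taken from the description above: `lower_case` only applies
    when a parser is configured, and it defaults to "true" in that case.
    """
    if not index_properties.get("parser"):
        return False
    return index_properties.get("lower_case", "true").lower() == "true"


def normalize_pattern(kind: str, pattern: str,
                      index_properties: Mapping[str, str]) -> str:
    """Lowercase WILDCARD/PREFIX patterns; leave REGEXP patterns untouched."""
    if kind in ("WILDCARD", "PREFIX") and effective_lower_case(index_properties):
        return pattern.lower()
    # REGEXP (and any other kind) bypasses normalization in this sketch.
    return pattern


props = {"parser": "english"}  # lower_case omitted, so it defaults to "true"
print(normalize_pattern("WILDCARD", "Foo*Bar", props))
print(normalize_pattern("REGEXP", "Foo.*Bar", props))
```

Under these assumptions the wildcard pattern is lowercased while the regexp pattern keeps its original case.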
   
   ### Release note
   
   Refactor search() DSL parser to single-phase ANTLR parsing and fix multiple 
ES compatibility issues including minimum_should_match, wildcard 
case-sensitivity, and MATCH_ALL_DOCS support.
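
For readers unfamiliar with the Lucene-style semantics referred to here, the following Python sketch shows how `default_operator` and `minimum_should_match` are commonly understood to interact. It is an illustration under stated assumptions, not the code in this PR: `Clause`, `clauses_from_terms`, and `matches` are hypothetical names, and documents are modeled as plain sets of terms.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Clause:
    occur: str  # "MUST", "SHOULD", or "MUST_NOT"
    term: str


def clauses_from_terms(terms: Iterable[str], default_operator: str) -> List[Clause]:
    """Hypothetical helper: bare terms become MUST under AND, SHOULD under OR."""
    occur = "MUST" if default_operator.lower() == "and" else "SHOULD"
    return [Clause(occur, t) for t in terms]


def matches(doc: Set[str], clauses: List[Clause],
            minimum_should_match: int = 0) -> bool:
    """Evaluate a boolean query against one document (a set of terms)."""
    must = [c for c in clauses if c.occur == "MUST"]
    should = [c for c in clauses if c.occur == "SHOULD"]
    must_not = [c for c in clauses if c.occur == "MUST_NOT"]

    if any(c.term in doc for c in must_not):
        return False
    if not all(c.term in doc for c in must):
        return False

    required = minimum_should_match
    # Lucene-style default: with no MUST clause, at least one SHOULD must match.
    if required == 0 and should and not must:
        required = 1
    # Assumption: a pure-NOT query has no positive clause, so this sketch
    # treats it as "match all, minus the excluded terms".
    hits = sum(1 for c in should if c.term in doc)
    return hits >= required


doc = {"apache", "doris"}
query = clauses_from_terms(["apache", "doris", "spark"], "or")
print(matches(doc, query, minimum_should_match=2))
print(matches(doc, query, minimum_should_match=3))
```

With two of the three SHOULD terms present, the document satisfies `minimum_should_match=2` but not `minimum_should_match=3`.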
   
   ### Check List (For Author)
   
   - Test
       - [x] Regression test
       - [x] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason
   
   - Behavior changed:
       - [x] Yes. Wildcard/PREFIX patterns are now lowercased when the index 
has lower_case=true (matching ES behavior). REGEXP patterns remain 
case-sensitive. minimum_should_match is now properly passed from FE to BE.
   
   - Does this need documentation?
       - [ ] No.
       - [x] Yes. The search() function now supports minimum_should_match and 
default_operator parameters with proper ES-compatible semantics.
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


