airborne12 opened a new pull request, #60654:
URL: https://github.com/apache/doris/pull/60654
### What problem does this PR solve?
Problem Summary:
The `search()` function's DSL parser had multiple Elasticsearch (ES)
compatibility issues and used an error-prone two-phase parsing approach
(manual pre-parse + ANTLR). This PR refactors the parser and fixes several bugs:
1. **SearchDslParser refactoring**: Consolidated from two-phase (manual
pre-parse + ANTLR) to single-phase ANTLR parsing. The ANTLR grammar now handles
all DSL syntax directly, eliminating the fragile manual pre-parse layer. This
fixes issues with operator precedence, grouping, and edge cases.
2. **ANTLR grammar improvements**: Updated `SearchLexer.g4` and
`SearchParser.g4` to properly handle quoted phrases, field-qualified
expressions, prefix/wildcard/regexp patterns, range queries, and boolean
operators with correct precedence.
3. **minimum_should_match pipeline**: Added `default_operator` and
`minimum_should_match` fields to the `TSearchParam` Thrift definition and
passed them from FE `SearchPredicate` through to BE `function_search`. When
`minimum_should_match > 0`, the BE uses `OccurBooleanQuery` for proper
Lucene-style boolean query semantics.
4. **Wildcard/Prefix/Regexp case-sensitivity**: Wildcard and PREFIX patterns
are now lowercased when the index has `parser + lower_case=true` (matching ES
query_string normalizer behavior). REGEXP patterns are NOT lowercased (matching
ES regex behavior where patterns bypass analysis).
5. **MATCH_ALL_DOCS support**: Added `MATCH_ALL_DOCS` clause type for
standalone `*` queries and pure NOT query rewrites. Enhanced `AllQuery` with
deferred `max_doc` from `context.segment_num_rows` and nullable field support
via `NullableScorer`.
6. **BE fixes**:
- `regexp_weight._max_expansions`: Changed from 50 to 0 (unlimited) to
prevent PREFIX queries from missing documents
- `occur_boolean_weight`: Fixed swap→append bug when all SHOULD clauses
must match, preserving existing MUST scorers
- Variant subcolumn `index_properties` propagation for proper analyzer
selection
- `lower_case` default handling: inverted index `lower_case` defaults to
`"true"` when a parser is configured
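
The case-sensitivity rule from item 4, combined with the `lower_case` default from item 6, can be sketched roughly as follows. This is an illustrative Python model only, not the actual FE/BE code: `IndexProperties`, `PatternKind`, `effective_lower_case`, and `normalize_pattern` are hypothetical names, and treating an index without a parser as "no lowercasing" is an assumption drawn from the wording above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PatternKind(Enum):
    """Pattern clause types discussed in this PR (names are illustrative)."""
    WILDCARD = "wildcard"
    PREFIX = "prefix"
    REGEXP = "regexp"


@dataclass
class IndexProperties:
    """Hypothetical stand-in for an inverted index's properties."""
    parser: Optional[str] = None      # None models "no parser configured"
    lower_case: Optional[str] = None  # "true" / "false"; None means not set


def effective_lower_case(props: IndexProperties) -> bool:
    """Resolve lower_case, defaulting to true when a parser is configured."""
    if props.parser is None:
        # Assumption: without a parser, patterns are left as written.
        return False
    if props.lower_case is None:
        return True
    return props.lower_case.lower() == "true"


def normalize_pattern(kind: PatternKind, pattern: str,
                      props: IndexProperties) -> str:
    """Lowercase wildcard/prefix patterns; leave regexp patterns untouched."""
    if kind is PatternKind.REGEXP:
        # Regexp patterns bypass analysis, so case is preserved.
        return pattern
    if effective_lower_case(props):
        return pattern.lower()
    return pattern
```

Under these assumptions, a wildcard pattern such as `Appl*` against an index with a parser and no explicit `lower_case` would be rewritten to `appl*`, while a regexp pattern such as `App[A-Z]+` would be passed through unchanged.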
### Release note
Refactor search() DSL parser to single-phase ANTLR parsing and fix multiple
ES compatibility issues including minimum_should_match, wildcard
case-sensitivity, and MATCH_ALL_DOCS support.
### Check List (For Author)
- Test
- [x] Regression test
- [x] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason
- Behavior changed:
- [x] Yes. Wildcard/PREFIX patterns are now lowercased when the index has
lower_case=true (matching ES behavior). REGEXP patterns remain case-sensitive.
minimum_should_match is now passed from FE to BE.
- Does this need documentation?
- [ ] No.
- [x] Yes. The search() function now supports minimum_should_match and
default_operator parameters with proper ES-compatible semantics.
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label