deepthi912 opened a new pull request, #18579:
URL: https://github.com/apache/pinot/pull/18579
## Summary
Fix a crash where `regexp_like(col, pattern, 'i')` or `LIKE 'pattern'`
(which is converted to a case-insensitive `REGEXP_LIKE` internally) throws
`UnsupportedOperationException` on segments where the column has
`encodingType: RAW` but a separate dictionary is built for secondary indexes
(`inverted`, `fst`, `ifst`, `range`).
## Repro
Table config (typical for Iceberg / external-table integrations):
```json
{
  "name": "col_string",
  "encodingType": "RAW",
  "indexes": {
    "forward": { "encodingType": "RAW" },
    "dictionary": {},
    "ifst": { "enabled": true }
  }
}
```
Query:
```sql
SELECT * FROM t WHERE col_string LIKE 'abc%';
-- or
SELECT * FROM t WHERE regexp_like(col_string, 'abc', 'i');
```
Stack trace:
```
java.lang.UnsupportedOperationException
    at BaseDictionaryBasedPredicateEvaluator.applySV(BaseDictionaryBasedPredicateEvaluator.java:133)
    at SVScanDocIdIterator$StringMatcher.doesValueMatch(SVScanDocIdIterator.java:308)
    at ImmutableRoaringBitmap.flip(...)
    at AndFilterOperator.getTrues(...)
## Root cause
For this layout:
1. `FilterPlanNode` builds an `IFSTBasedRegexpPredicateEvaluator` (extends
`BaseDictIdBasedRegexpLikePredicateEvaluator`) which only implements
`applySV(int dictId)`.
2. `FilterOperatorUtils.getLeafFilterOperator` checks for sorted / inverted
index to route to a dict-consuming operator; with neither available it falls
through to `ScanBasedFilterOperator`.
3. `SVScanDocIdIterator.getValueMatcher()` picks `StringMatcher` based on
`_reader.isDictionaryEncoded() == false` (forward index is RAW), ignoring the
fact that a `Dictionary` is still present in the segment.
4. `StringMatcher` calls `applySV(String)` on the dict-based evaluator, but
`BaseDictionaryBasedPredicateEvaluator.applySV(String)` is `final` and throws
`UnsupportedOperationException`.
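
The failure mode in steps 3 and 4 can be reproduced in isolation. The following is a minimal sketch that does not use Pinot classes: `DictIdOnlyEvaluator`, `doesValueMatch`, and `rawScanThrows` are hypothetical stand-ins, assumed only to mirror the behaviour described above (a dict-id overload that works and a `final` raw-value overload that throws).

```java
final class RawScanFailureSketch {
  /** Hypothetical stand-in for an evaluator that only understands dictionary ids. */
  abstract static class DictIdOnlyEvaluator {
    abstract boolean applySV(int dictId);

    // Mirrors the behaviour described above: the raw-value overload is final and throws.
    final boolean applySV(String value) {
      throw new UnsupportedOperationException();
    }
  }

  /** Hypothetical stand-in for the raw string matcher chosen by the scan path. */
  static boolean doesValueMatch(DictIdOnlyEvaluator evaluator, String rawValue) {
    // The matcher only has the raw value, so it calls the overload that throws.
    return evaluator.applySV(rawValue);
  }

  /** Returns true if scanning a raw value with a dict-id-only evaluator throws. */
  static boolean rawScanThrows(String rawValue) {
    DictIdOnlyEvaluator evaluator = new DictIdOnlyEvaluator() {
      @Override
      boolean applySV(int dictId) {
        return dictId == 0;
      }
    };
    try {
      doesValueMatch(evaluator, rawValue);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("raw scan throws: " + rawScanThrows("abc"));
  }
}
```

The evaluator itself is not at fault here; the mismatch is between what the matcher has (a raw value) and what the evaluator accepts (a dict id).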
## Fix
In `SVScanDocIdIterator.getValueMatcher()`, before falling back to typed raw
matchers, route to a new `<Type>DictLookupMatcher` when:
- `_reader.isDictionaryEncoded() == false` (forward index is RAW), **and**
- `_dictionary != null` (a separate dictionary is built), **and**
- `_predicateEvaluator instanceof BaseDictionaryBasedPredicateEvaluator`
(the evaluator wants dict ids).
Each new matcher reads the raw value from the forward index, looks up its
dict id via `dictionary.indexOf(value)`, and calls `applySV(int dictId)`. One
matcher per stored type (INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING, BYTES).
`dictId < 0` means the value isn't in the dictionary, which is treated as "no
match".
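
As a rough illustration of the lookup-then-apply idea for the STRING case, here is a self-contained sketch. It is not the code in this PR: `SortedStringDictionary`, `DictIdRegexpEvaluator`, `StringDictLookupMatcher`, and `newMatcher` are hypothetical simplifications, and the dictionary is assumed to be a sorted array searched with `Arrays.binarySearch`, which returns a negative number for absent values.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class DictLookupMatcherSketch {
  /** Hypothetical sorted dictionary; indexOf returns a negative number for absent values. */
  static final class SortedStringDictionary {
    private final String[] _sortedValues;

    SortedStringDictionary(String... values) {
      _sortedValues = values.clone();
      Arrays.sort(_sortedValues);
    }

    int length() {
      return _sortedValues.length;
    }

    String get(int dictId) {
      return _sortedValues[dictId];
    }

    int indexOf(String value) {
      return Arrays.binarySearch(_sortedValues, value);
    }
  }

  /** Hypothetical dict-id-only evaluator: matching dict ids are resolved once, up front. */
  static final class DictIdRegexpEvaluator {
    private final boolean[] _matchingDictIds;

    DictIdRegexpEvaluator(SortedStringDictionary dictionary, String regex) {
      Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
      _matchingDictIds = new boolean[dictionary.length()];
      for (int dictId = 0; dictId < _matchingDictIds.length; dictId++) {
        _matchingDictIds[dictId] = pattern.matcher(dictionary.get(dictId)).find();
      }
    }

    boolean applySV(int dictId) {
      return _matchingDictIds[dictId];
    }
  }

  /** Reads a raw value, resolves its dict id, then asks the dict-id evaluator. */
  static final class StringDictLookupMatcher {
    private final SortedStringDictionary _dictionary;
    private final DictIdRegexpEvaluator _evaluator;

    StringDictLookupMatcher(SortedStringDictionary dictionary, DictIdRegexpEvaluator evaluator) {
      _dictionary = dictionary;
      _evaluator = evaluator;
    }

    boolean doesValueMatch(String rawValue) {
      int dictId = _dictionary.indexOf(rawValue);
      // A negative dict id means the value is not in the dictionary: treat as "no match".
      return dictId >= 0 && _evaluator.applySV(dictId);
    }
  }

  /** Convenience factory for the sketch; not part of any real API. */
  static StringDictLookupMatcher newMatcher(String regex, String... dictionaryValues) {
    SortedStringDictionary dictionary = new SortedStringDictionary(dictionaryValues);
    return new StringDictLookupMatcher(dictionary, new DictIdRegexpEvaluator(dictionary, regex));
  }

  public static void main(String[] args) {
    StringDictLookupMatcher matcher = newMatcher("abc", "xyz", "abc", "ABCDE");
    for (String value : new String[] {"ABCDE", "abc", "xyz", "not-in-dictionary"}) {
      System.out.println(value + " -> " + matcher.doesValueMatch(value));
    }
  }
}
```

In this sketch each scanned document costs one dictionary lookup (a binary search here) on top of the forward-index read, in exchange for reusing the dict-id evaluator unchanged.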
`DataSource` is already passed to the main constructor; the test constructor
gains an optional `@Nullable Dictionary` parameter (the existing 3-arg test
constructor delegates with `null`).
## Test plan
- [ ] Add a unit test exercising `REGEXP_LIKE` / `LIKE` against a string
column with RAW forward index + dictionary + IFST (no inverted) — should match
correctly instead of throwing.
- [ ] Verify existing `SVScanDocIdIteratorTest` paths still pass
(dict-encoded → `DictIdMatcher`, RAW without dictionary → typed raw matcher).
- [ ] Smoke test integration with Iceberg/external-table tables in StarTree
Cloud.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]