xiangfu0 opened a new pull request, #18500:
URL: https://github.com/apache/pinot/pull/18500

   ## Summary
   
   A column configured with `EncodingType.RAW` + `dictionaryIndex` has a 
dictionary file on disk but a non-dict-encoded forward index. 
`ColumnContext.fromDataSource` and `ProjectionBlockValSet.getDictionary` 
exposed the dictionary unconditionally, so callers gating on `getDictionary() 
!= null` then called `getDictionaryIdsSV/MV` — which routes to 
`ForwardIndexReader.readDictIds` and throws `UnsupportedOperationException` on 
a RAW forward index.
   
   Reported as a multi-column GROUP BY crash on a S3 Parquet column configured 
with RAW forward + dictionary:
   
   ```
   Caused by: java.lang.UnsupportedOperationException
     at 
org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readDictIds(ForwardIndexReader.java:129)
     at 
org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readDictIds(DataFetcher.java:353)
     at 
org.apache.pinot.core.common.DataFetcher.fetchDictIds(DataFetcher.java:105)
     at 
org.apache.pinot.core.common.DataBlockCache.getDictIdsForSVColumn(DataBlockCache.java:114)
     at 
org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getDictionaryIdsSV(ProjectionBlockValSet.java:105)
     at 
org.apache.pinot.core.query.aggregation.groupby.NoDictionaryMultiColumnGroupKeyGenerator.generateKeysForBlock(NoDictionaryMultiColumnGroupKeyGenerator.java:120)
   ```
   
   The same buggy assumption ("dictionary exists implies dict-encoded forward 
index") affects several distinct/aggregation paths beyond 
`NoDictionaryMultiColumnGroupKeyGenerator`: `DistinctExecutorFactory`, 
`SegmentPartitionedDistinctCount`, `AnyValue`, `ModeAggregation`, the 
`DistinctCount*` sketch aggregators, and `BaseDistinctAggregate*`. Fixing the 
two source methods covers all of them.
   
   ## Fix
   
   Apply the same gating already used by `IdentifierTransformFunction` and 
`DataFetcher.addDataSource`: expose the dictionary only when 
`ForwardIndexReader.isDictionaryEncoded()` is true. This matches the 
`BlockValSet.getDictionary()` Javadoc contract: *null if the column is not 
dictionary-encoded*.
   
   After this change, `getDictionary() != null` correctly implies 
`getDictionaryIdsSV/MV()` is callable, so every caller that already gates on 
this idiom is correct.
   
   ## Test plan
   
   - [x] Added `testMultiColumnGroupByWithRawDictColumnReturnsSameResults` in 
`RawForwardIndexWithDictionaryTest`, exercising the exact 
`NoDictionaryMultiColumnGroupKeyGenerator` path from the stack trace.
   - [x] `pinot-core` and `pinot-integration-tests` compile cleanly.
   - [x] Spotless, checkstyle, and license checks pass on both modules.
   - [x] Existing unit tests pass (238 tests across 
`DictionaryBasedGroupKeyGeneratorTest`, `NoDictionaryGroupKeyGeneratorTest`, 
`DistinctQueriesTest`, `InterSegmentAggregation{Single,Multi}ValueQueriesTest`, 
`InnerSegmentAggregation{Single,Multi}ValueQueriesTest`, `RangeQueriesTest`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to