Copilot commented on code in PR #18504:
URL: https://github.com/apache/pinot/pull/18504#discussion_r3245403955
##########
pinot-core/src/main/java/org/apache/pinot/core/operator/ColumnContext.java:
##########
@@ -49,25 +52,44 @@ public boolean isSingleValue() {
return _isSingleValue;
}
+ /// Returns the column's dictionary file if one exists, regardless of
whether the forward index can answer
+ /// dictionary-id reads. Callers that need to select between a dict-id read
path and a value read path MUST gate
+ /// on {@link #isDictionaryEncoded()} rather than {@code getDictionary() !=
null} — a column declared as
+ /// {@code EncodingType.RAW} with an explicit {@code dictionaryIndex}
returns a non-null dictionary here but its
+ /// forward index throws on {@link ForwardIndexReader#readDictIds}.
@Nullable
public Dictionary getDictionary() {
return _dictionary;
}
+ /// Returns {@code true} if the column's forward index is dictionary-encoded
and the dict-id read path
+ /// ({@link org.apache.pinot.core.common.BlockValSet#getDictionaryIdsSV()})
is callable. A column with
+ /// {@code EncodingType.RAW} + an explicit {@code dictionaryIndex} returns
{@code false} here even though
+ /// {@link #getDictionary()} is non-null.
+ public boolean isDictionaryEncoded() {
+ return _dictionaryEncoded;
+ }
+
@Nullable
public DataSource getDataSource() {
return _dataSource;
}
public static ColumnContext fromDataSource(DataSource dataSource) {
DataSourceMetadata dataSourceMetadata = dataSource.getDataSourceMetadata();
- return new ColumnContext(dataSourceMetadata.getDataType(),
dataSourceMetadata.isSingleValue(),
- dataSource.getDictionary(), dataSource);
+ Dictionary dictionary = dataSource.getDictionary();
+ ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
+ boolean dictEncoded = dictionary != null && (forwardIndex == null ||
forwardIndex.isDictionaryEncoded());
Review Comment:
ColumnContext#fromDataSource computes dictEncoded as `dictionary != null &&
(forwardIndex == null || forwardIndex.isDictionaryEncoded())`, which returns
true when the forward index is disabled (null). In that scenario dict-id reads
from the forward index are not callable, so isDictionaryEncoded() can
incorrectly steer callers into dict-id code paths. Consider treating
`forwardIndex == null` as not dictionary-encoded (or updating the contract/docs
if null forward index is intended to be supported).
##########
pinot-core/src/main/java/org/apache/pinot/core/common/BlockValSet.java:
##########
@@ -48,11 +48,27 @@ public interface BlockValSet {
boolean isSingleValue();
/**
- * Returns the dictionary for the column, or {@code null} if the column is
not dictionary-encoded.
+ * Returns the dictionary file for the column if one exists, or {@code null}
otherwise. The dictionary may be
+ * present even when {@link #isDictionaryEncoded()} returns {@code false} —
a column declared as
+ * {@code EncodingType.RAW} with an explicit {@code dictionaryIndex} carries
a dictionary file on disk but a RAW
+ * forward index. Callers that select between a dictionary-id read path
+ * ({@link #getDictionaryIdsSV()} / {@link #getDictionaryIdsMV()}) and a
value read path MUST gate on
+ * {@link #isDictionaryEncoded()}, not {@code getDictionary() != null}.
Review Comment:
The updated BlockValSet#getDictionary() Javadoc refers to a "dictionary
file". BlockValSet is used for transform-layer value sets as well (e.g.,
TransformBlockValSet returns TransformFunction#getDictionary(), which can be an
in-memory/on-the-fly dictionary), so this wording is inaccurate/misleading.
Consider rephrasing to simply "dictionary" (may be on-disk or computed) and
keep the distinction focused on whether dict-id reads are supported via
isDictionaryEncoded().
##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/RawForwardIndexWithDictionaryTest.java:
##########
@@ -542,6 +541,146 @@ RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION,
getTableName(),
"Multi-column GROUP BY rows must match between dictionary-only and
raw+dictionary columns");
}
+ /// Multi-column DISTINCT exercises {@link
org.apache.pinot.core.query.distinct.DistinctExecutorFactory}'s
+ /// multi-column path. The factory routes to {@code
DictionaryBasedMultiColumnDistinctExecutor} whenever every
+ /// column has a non-null dictionary, then that executor calls {@code
BlockValSet#getDictionaryIdsSV()}
+ /// — which throws on a RAW+dictionary column.
+ @Test(dataProvider = "useBothQueryEngines")
+ public void
testMultiColumnDistinctWithRawDictColumnReturnsSameResults(boolean
useMultiStageQueryEngine)
+ throws Exception {
+ setUseMultiStageQueryEngine(useMultiStageQueryEngine);
+ JsonNode dictRows = postQuery(
+ String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+ DICT_DIMENSION, DICT_INT_DIMENSION, getTableName(),
DICT_DIMENSION, DICT_INT_DIMENSION))
+ .get("resultTable").get("rows");
+ JsonNode rawRows = postQuery(
+ String.format("SELECT DISTINCT %s, %s FROM %s ORDER BY %s, %s",
+ RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION, getTableName(),
+ RAW_DICT_DIMENSION, RAW_DICT_INT_DIMENSION))
+ .get("resultTable").get("rows");
+ assertEquals(rawRows, dictRows,
+ "Multi-column DISTINCT rows must match between dictionary-only and
raw+dictionary columns");
+ }
+
+ /// {@code DISTINCTCOUNT} on a RAW+dictionary column drives
+ /// {@link
org.apache.pinot.core.query.aggregation.function.BaseDistinctAggregateAggregationFunction#svAggregate}
+ /// down the dictionary-id path because {@code blockValSet.getDictionary()
!= null}. That path then calls
+ /// {@code blockValSet.getDictionaryIdsSV()} on the RAW forward index. A
{@code WHERE} predicate is included to
+ /// bypass {@link
org.apache.pinot.core.operator.query.NonScanBasedAggregationOperator}, which
would otherwise
+ /// serve the aggregation directly from the dictionary and hide the bug.
Review Comment:
Several newly added test comments describe the pre-fix behavior as if it is
current (e.g., DistinctExecutorFactory routing based on `getDictionary() !=
null`, BaseDistinctAggregateAggregationFunction choosing the dict-id path for
the same reason). Since the production code now gates on isDictionaryEncoded(),
these comments are now stale and can confuse future debugging; consider
updating them to describe the historical bug explicitly ("previously...") or
reference isDictionaryEncoded() as the correct gate.
##########
pinot-core/src/main/java/org/apache/pinot/core/operator/docvalsets/ProjectionBlockValSet.java:
##########
@@ -98,6 +99,19 @@ public Dictionary getDictionary() {
return _dataSource.getDictionary();
}
+ /// A column with {@code EncodingType.RAW} + an explicit {@code
dictionaryIndex} has {@link #getDictionary()}
+ /// non-null but a RAW forward index that throws on {@link
ForwardIndexReader#readDictIds}. Callers selecting
+ /// between dict-id and value paths must gate on this method, not {@code
getDictionary() != null}.
+ @Override
+ public boolean isDictionaryEncoded() {
+ Dictionary dictionary = _dataSource.getDictionary();
+ if (dictionary == null) {
+ return false;
+ }
+ ForwardIndexReader<?> forwardIndex = _dataSource.getForwardIndex();
+ return forwardIndex == null || forwardIndex.isDictionaryEncoded();
Review Comment:
ProjectionBlockValSet#isDictionaryEncoded() currently returns true when
`ForwardIndexReader` is null. Elsewhere in the codebase (e.g., DataFetcher) a
null forward index indicates the forward index is disabled, and dict-id reads
are not possible. Returning true here can misroute callers into
getDictionaryIdsSV()/MV() and lead to failures; consider returning false when
the forward index is null.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]