Jackie-Jiang opened a new pull request, #18496:
URL: https://github.com/apache/pinot/pull/18496

   ## Summary
   
   When a forward-index update runs against a pre-1.6.0 segment, the 
per-element length stats added in 1.6.0 (`LENGTH_OF_SHORTEST_ELEMENT`, 
`LENGTH_OF_LONGEST_ELEMENT`, `MAX_ROW_LENGTH_IN_BYTES` for MV, `IS_ASCII` for 
STRING) may be missing from `ColumnMetadata`. Today the per-op handlers in 
`ForwardIndexHandler` cover for that with two ad-hoc scan paths, and the 
`InvertedIndexAndDictionaryBasedForwardIndexCreator` rebuild path doesn't 
backfill at all. After this PR the missing stats are gathered once during the 
rebuild and persisted to `metadata.properties`, so future reloads see a 
complete column metadata.
   
   ### `ForwardIndexHandler`
   - `updateIndices` runs a per-column `backfillMissingLengthStatsForColumn` 
pre-pass before dispatching ops. The helper opens the forward-index reader, 
scans the column once, and persists the gathered stats. No-op for fixed-width 
columns, columns whose metadata already carries the stats, or columns whose 
forward index is absent on disk.
   - The reader-driven scan in `rewriteForwardIndexForCompressionChange` and 
the metadata-driven scan in `rewriteDictToRawForwardIndex` are removed. Per-op 
methods construct their `IndexCreationContext` from `ColumnMetadata` directly — 
the pre-pass guarantees the lengths are populated.
   - The `IndexCreationContext.Builder` constructor already reads stats from 
`ColumnMetadata`, so the explicit `withLengthOfLongestElement` / 
`withMaxRowLengthInBytes` overrides on the per-op builders go away.
   
   ### `InvertedIndexAndDictionaryBasedForwardIndexCreator`
   - Constructor now takes `FieldIndexConfigs` directly. The three derived 
parameters (`dictionaryEnabled`, `dictionaryBasedForwardIndex`, 
`forwardIndexConfig`) are gone — they were confusing because the two booleans 
referred to *post-rebuild* state.
   - `_dictionaryPresent` renamed to `_keepDictionary` with a clearer javadoc 
that names the post-rebuild semantic ("kept vs. dropped").
   - Both SV and MV paths now track per-element length stats from the 
dictionary inline when the source segment is missing them, so the backfill 
happens during the existing main loop with no second pass over the column. The 
values flow into both the `IndexCreationContext.Builder` (so the new 
forward-index creator sees them) and the persisted `metadataProperties` map.
   - For MV var-length, `MAX_ROW_LENGTH_IN_BYTES` is always recomputed and 
persisted — in-row duplicates can drop during the rebuild and change the 
per-row max.
   
   ### `SegmentMetadataUtils.updateMetadataProperties`
   - A `null` value now removes the property via `clearProperty` rather than 
writing the literal string. Callers can drop a key by mapping it to `null`.
   
   ### `BaseIndexHandler.createForwardIndexIfNeeded`
   - Simplified to pass `_fieldIndexConfigs.get(columnName)` directly to the 
inv→fwd creator constructor instead of three derived locals.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to