xiangfu0 opened a new pull request, #18239:
URL: https://github.com/apache/pinot/pull/18239

   ## Summary
   
   Adds `UUID` as a first-class Pinot data type, backed by 16-byte BYTES storage 
and exposed externally as canonical lowercase RFC 4122 strings.
   
   ### Core type plumbing
   - `FieldSpec.DataType.UUID` — stored as BYTES (16 bytes), sortable via 
unsigned byte order. UUID is placed **after `UNKNOWN`** (ordinal 14) to avoid 
shifting existing ordinals since `AnyValueAggregationFunction` serializes 
`DataType.ordinal()` into intermediate results.
   - `PinotDataType.UUID` — conversions to/from STRING/BYTES; numeric ops throw 
as expected.
   - `DataSchema.ColumnDataType.UUID` — backed by BYTES internally, maps to 
Calcite `SqlTypeName.UUID`.
   - `Schema.validate()` — UUID and BIG_DECIMAL enforce SV-only constraint.
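
The ordinal concern in the first bullet can be illustrated with a minimal sketch. The enums below are shortened, hypothetical stand-ins (not the real `FieldSpec.DataType` constant list); they only show why appending a constant keeps previously serialized ordinals decodable, while inserting one mid-list would not.

```java
/** Illustrative only: shortened, hypothetical enums, not Pinot's FieldSpec.DataType. */
final class OrdinalStabilitySketch {
  private OrdinalStabilitySketch() {}

  /** The enum as it looked when older data was serialized. */
  enum Before { INT, STRING, BYTES, UNKNOWN }

  /** New constant appended at the end: existing ordinals are unchanged. */
  enum Appended { INT, STRING, BYTES, UNKNOWN, UUID }

  /** New constant inserted mid-list: BYTES and UNKNOWN shift by one. */
  enum Inserted { INT, STRING, UUID, BYTES, UNKNOWN }

  /** Decodes a serialized ordinal the way a reader of older data would. */
  static <E extends Enum<E>> E decode(Class<E> type, int ordinal) {
    return type.getEnumConstants()[ordinal];
  }

  public static void main(String[] args) {
    int stored = Before.BYTES.ordinal(); // written before UUID existed
    System.out.println(decode(Appended.class, stored)); // BYTES
    System.out.println(decode(Inserted.class, stored)); // UUID (misread)
  }
}
```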
   
   ### UuidUtils (`pinot-spi`)
   - Canonical conversions: `toBytes()`, `toUUID()`, `toString()`
   - `UuidKey` inner class — stores UUID as two `long` fields for O(1) 
hashCode/equals in hot groupby/join paths (avoids ByteArray allocation per row)
   - `compare()` uses `Long.compareUnsigned` for correct unsigned byte-order 
UUID comparison
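
For readers unfamiliar with the technique, here is a rough sketch of a 16-byte conversion and an unsigned comparison built only on `java.util.UUID` and `java.nio.ByteBuffer`. The class name `UuidBytesSketch` and its method bodies are assumptions made for illustration; they are not the `UuidUtils` code from this PR.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper for illustration; not the PR's UuidUtils. */
final class UuidBytesSketch {
  private UuidBytesSketch() {}

  /** Packs a UUID into 16 big-endian bytes, most significant half first. */
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  /** Rebuilds a UUID from exactly 16 big-endian bytes. */
  static UUID toUUID(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong());
  }

  /**
   * Compares two UUIDs as unsigned 128-bit values, which matches the
   * unsigned lexicographic order of their toBytes() encodings.
   */
  static int compare(UUID a, UUID b) {
    int high = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
    return high != 0
        ? high
        : Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
  }
}
```

The unsigned comparison matters because `java.util.UUID.compareTo` compares the two halves as signed longs, so a UUID starting with `ffffffff` sorts before one starting with `00000000` there, whereas `compare` above orders them the other way.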
   
   ### Query engine
   - Predicate evaluators — Eq/NotEq/In/NotIn handle UUID via 
`BytesRaw*Evaluator` + `DataType.UUID`
   - Group key generators — `NoDictionarySingleColumnGroupKeyGenerator`, 
`NoDictionaryMultiColumnGroupKeyGenerator` updated
   - MSQE hot-path: `UuidToIdMap`, `OneUuidKeyGroupIdGenerator`, 
`UuidLookupTable`
   - `RequestContextUtils.evaluateLiteralValue` — supports 
CAST/TO_UUID/UUID_TO_BYTES/BYTES_TO_UUID/UUID_TO_STRING/IS_UUID on predicate 
RHS literals
   - `CastTransformFunction` — CAST(col AS UUID) per-row support
   - Scalar functions: `IS_UUID`, `TO_UUID`, `UUID_TO_BYTES`, `BYTES_TO_UUID`, 
`UUID_TO_STRING`
   - `ValueBasedSegmentPruner` — probes the UUID bloom filter with the canonical 
UUID string produced by `DataType.toString()`
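
As a rough illustration of the two-`long` key idea used in the group-by paths, the sketch below assigns dense group ids to UUID values held as a pair of longs. `UuidGroupIdSketch` and its nested `Key` are hypothetical names; the actual `UuidKey`, `UuidToIdMap`, and `OneUuidKeyGroupIdGenerator` in the PR may be structured quite differently (for example, to avoid the per-lookup key allocation that this simplified version still performs).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the PR's UuidToIdMap or UuidKey. */
final class UuidGroupIdSketch {

  /** Value-type key: equality and hashing touch only two longs. */
  static final class Key {
    private final long msb;
    private final long lsb;

    Key(long msb, long lsb) {
      this.msb = msb;
      this.lsb = lsb;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return msb == other.msb && lsb == other.lsb;
    }

    @Override
    public int hashCode() {
      return 31 * Long.hashCode(msb) + Long.hashCode(lsb);
    }
  }

  private final Map<Key, Integer> groupIds = new HashMap<>();

  /** Returns the existing group id for this UUID, or assigns the next dense id. */
  int getGroupId(long msb, long lsb) {
    Key key = new Key(msb, lsb);
    Integer existing = groupIds.get(key);
    if (existing != null) {
      return existing;
    }
    int id = groupIds.size();
    groupIds.put(key, id);
    return id;
  }

  /** Number of distinct UUIDs seen so far. */
  int size() {
    return groupIds.size();
  }
}
```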
   
   ### gRPC / response encoding (bug fixes)
   - `JsonResponseEncoder.extractValue` — added `case UUID` to avoid 
`IllegalArgumentException` when decoding gRPC responses with UUID columns. 
**Root cause of `UuidUpsertRealtimeTest` failure.**
   - `ArrowResponseEncoder` — full UUID support via `getVarCharValue()` helper; 
avoids `ClassCastException` in Arrow encoding path.
   - `ServerPlanRequestUtils.computeInOperands` — emits canonical UUID string 
literals (not raw bytes) for UUID columns in MSQE join dynamic filters.
   
   ### Bloom filter (bug fix)
   - `BloomFilterHandler` — UUID columns are now handled correctly in both 
rebuild paths:
     - Dictionary path: calls `UuidUtils.toString(dictionary.getBytesValue(i))` 
instead of `dictionary.getStringValue(i)` (which returns hex encoding for 
byte-backed dicts).
     - Non-dictionary path: adds explicit `case UUID` before `case BYTES` to 
avoid `IllegalStateException` on segment reload.
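
The dictionary-path mismatch can be reproduced in isolation. The sketch below uses a hypothetical `BloomKeySketch` class (not Pinot code) to show that the plain hex encoding of the 16 stored bytes differs from the canonical UUID string, so a bloom filter populated with one form would never match a probe that uses the other.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical illustration of hex vs. canonical bloom-filter keys. */
final class BloomKeySketch {
  private BloomKeySketch() {}

  /** Encodes a canonical UUID string as 16 big-endian bytes. */
  static byte[] toBytes(String canonical) {
    UUID uuid = UUID.fromString(canonical);
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  /** Plain lowercase hex of the raw bytes: 32 characters, no dashes. */
  static String hexKey(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(Character.forDigit((b >> 4) & 0xf, 16));
      sb.append(Character.forDigit(b & 0xf, 16));
    }
    return sb.toString();
  }

  /** Canonical 8-4-4-4-12 form: 36 characters including dashes. */
  static String canonicalKey(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong()).toString();
  }
}
```

For `123e4567-e89b-12d3-a456-426614174000`, `hexKey` yields `123e4567e89b12d3a456426614174000` while `canonicalKey` returns the original dashed string, which is why the rebuild path has to use the same form as the query-side probe.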
   
   ### Segment / realtime
   - `MutableSegmentImpl` — UUID primary key normalization for upsert
   - `MutableNoDictColumnStatistics.isSorted()` — uses `UuidUtils.compare` 
(unsigned) instead of `ByteArray.compare` (signed) for UUID
   - `BloomFilterCreator` — UUID stored as canonical string
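
To see why the choice of comparator affects `isSorted()`, consider a minimal sketch built on the JDK's `Arrays.compare` (signed) and `Arrays.compareUnsigned` (both available since Java 9). `SortednessSketch` is a hypothetical class; it does not reproduce `ByteArray.compare` or `MutableNoDictColumnStatistics`, and only demonstrates the general signed-versus-unsigned difference the bullet above describes.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration of signed vs. unsigned byte ordering. */
final class SortednessSketch {
  private SortednessSketch() {}

  /** Lexicographic order treating each byte as signed (-128..127). */
  static final Comparator<byte[]> SIGNED = (a, b) -> Arrays.compare(a, b);

  /** Lexicographic order treating each byte as unsigned (0..255). */
  static final Comparator<byte[]> UNSIGNED = (a, b) -> Arrays.compareUnsigned(a, b);

  /** Returns true if every value is >= its predecessor under the given order. */
  static boolean isSorted(byte[][] values, Comparator<byte[]> order) {
    for (int i = 1; i < values.length; i++) {
      if (order.compare(values[i - 1], values[i]) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // 0x7f then 0x80: ascending as unsigned bytes, descending as signed bytes.
    byte[][] values = { {0x7f}, {(byte) 0x80} };
    System.out.println(isSorted(values, UNSIGNED)); // true
    System.out.println(isSorted(values, SIGNED));   // false
  }
}
```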
   
   ### Input format
   - Avro plugin — `logicalType: "uuid"` Avro schema field → UUID DataType
   
   ### Tests
   - `UuidTypeTest`, `UuidTypeRealtimeTest` — full integration tests for batch 
and realtime
   - `UuidUpsertRealtimeTest` — upsert with UUID primary key
   - Unit tests for UuidUtils, DataSchema, PinotDataType, RequestContextUtils, 
response encoders, Avro utils, predicate evaluators
   
   ## Test plan
   
   - [ ] `UuidTypeTest` (batch): predicate pushdown, group-by, CAST, ORDER BY
   - [ ] `UuidTypeRealtimeTest`: same for realtime segment
   - [ ] `UuidUpsertRealtimeTest`: upsert/dedup with UUID primary key (verifies 
gRPC fix)
   - [ ] `UuidUtilsTest`, `DataSchemaTest`, `PinotDataTypeTest`: unit coverage
   - [ ] `ArrowResponseEncoderTest`, `JsonResponseEncoderTest`: encoding 
round-trips
   - [ ] `RequestContextUtilsTest`: literal evaluation for UUID functions
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

