andygrove commented on PR #4532: URL: https://github.com/apache/datafusion-comet/pull/4532#issuecomment-4742461444
A few observations from a review pass. Acknowledging up front that this comment was drafted with help from AI (Claude Code). The reuse of the existing per-type `ArrowFieldWriter`s is exactly the right design choice — every Spark type comes along for free and the logic stays in sync with non-constant batches. A few notes: 1. **Hardcoded `"UTC"` timezone in `Utils.materializeConstantColumnVector`.** The choice is defensible since Spark stores `TimestampType` as micros in UTC, but elsewhere in this file `toArrowSchema` requires callers to pass the session timezone explicitly. If a batch flows through `getBatchFieldVectors` containing both a `ConstantColumnVector` of `TimestampType` and a non-constant timestamp column, would they end up with mismatched Arrow timezone metadata? Worth either threading a `timeZoneId` parameter through `getBatchFieldVectors` (sourced from `SQLConf.sessionLocalTimeZone`), or adding a comment explaining why `"UTC"` is safe in this serialization context. 2. **Test coverage in `UtilsSuite`.** Nice round-trip test. Would you mind adding a case or two beyond `IntegerType`? A nullable `StructType` or `ArrayType` constant exercises a different `ArrowFieldWriter` code path (via `getStruct(rowId)` / `getArray(rowId)` on `ConstantColumnVector`), and a `TimestampType` constant would pin down the timezone choice in case anyone later changes it. The materializer claims to support every type and it'd be useful to prove it for at least the complex case. 3. **FFI `exportBatch` arm.** It shares `materializeConstantColumnVector` with `getBatchFieldVectors`, so the materializer itself is covered. The surrounding code (`Data.exportVector` + the allocator handoff) isn't exercised by the new test. If easy, a small smoke test that runs a `ConstantColumnVector`-containing batch through an actual Comet operator path would catch a regression in the FFI wiring. Optional. 4. **CI flake.** The one failing check (`Spark 4.1 sql_core-3`) annotates as `The hosted runner lost communication with the server` — infrastructure, not the PR. The matching `sql_core-3` job on Spark 3.5 passes. A re-run should clear it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
