zeroshade commented on PR #854: URL: https://github.com/apache/arrow-go/pull/854#issuecomment-4747951695
## Review notes Rebased locally onto current `main` (past #852 / #856 / #857) — clean, no conflicts, the generated `column_writer_types.gen.go` still matches its template, and the full `parquet/...` test suite passes. For an experimental, opt-in proposal this is in good shape; nothing below blocks it landing as a proposal. The items worth resolving before `VECTOR` graduates from experimental are the two panic paths and the cross-version read behavior. ### Correctness — looks right - Row accounting (`values / vector_length`) is consistent across every write path via `rowsForLeafValues` (`file/column_writer.go:664`), and the whole-vector-per-page guarantee holds (`alignBatchToVector` + the per-batch flush in `commitWriteAndCheckPageLimit`). - Composes cleanly with #852: the misaligned-batch `panic` is caught by `defer recover()` at each public writer boundary and returned as an `error`. - Reader rejection is thorough — `file/row_group_reader.go:97-110` (value count must be a whole, consistent multiple of the row count, with an overflow guard), plus the stride/overflow guards in `seekToRowWithValueStride` and the `BuildArray` divisibility check. - The `Repetitions.Undefined` 3→4 shift is safe: it is an internal-only sentinel (`schema/reflection.go`), never serialized; on-disk `repetition_type` was only ever 0–2. ### Worth addressing before graduating from experimental 1. **Panic-as-validation on the public `file` writer.** `vectorLengthForBatch` panics on a misaligned batch (`file/column_writer.go:677`); through the low-level `WriteBatch` API this surfaces as a `recover()`'d `"unknown error type: …"`. pqarrow never triggers it, but the public writer should validate `n % vector_length == 0` up front and return a clean typed error. 2. **`NewSchemaChecked` can still panic.** `effectiveVectorLength` (`schema/column.go:58`) panics on int32 overflow, reachable via `buildTree`→`NewColumn` under `NewSchemaChecked`, which is documented to return an error instead of panicking. Unreachable today (groups can't be `VECTOR`), but a latent contract break once nested vectors exist — better to thread an error out. 3. **Cross-version read is a silent misread.** A `VECTOR` (repetition type 3) file opened by a pre-this-change reader has no `case` for `3` and would be read as a flat required column with `N×len` values / wrong row count rather than failing loudly. This is inherent to introducing a new enum value with no feature guard — worth calling out in the `WithVectorEncoding` docs, and ideally a read-side guard before this is non-experimental. 4. **Statistics stay element-level** (`file/column_writer.go:660`), so page/column min-max on a `VECTOR` column are not row-meaningful for predicate pushdown. A format-proposal decision worth stating explicitly. ### Minor - DataPageV1 without an offset index: the seek does a throwaway page seek and then re-seeks to 0 (`file/column_reader.go:644-683`) — correct result, but two scans. - The V1 + offset-index seek relies on `FirstRowIndex()` being in parent-row units; it holds, but a test would lock it in (multi-page `VECTOR`, V1, page index enabled, seek into later pages). - No explicit test for the `WriteBatchSpacedWithError` + `VECTOR` error path, or for a malformed `vector_length`. - Read reconstruction without `WithStoreSchema` names the element `"element"` and drops element/list metadata (both restored when the schema is stored) — fine for Phase 1, worth a doc note. - `pqarrow/encode_arrow.go` now returns `err` instead of swallowing it on builder-creation failure — good catch; unrelated to `VECTOR` but correct. ### Format / spec The Thrift choices (`FieldRepetitionType.VECTOR = 3`, `SchemaElement.vector_length` = field id 12) are hand-applied. Worth confirming these match the apache/parquet-format proposal exactly — if the standard settles on different values, files written now become silently incompatible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
