zeroshade commented on PR #854:
URL: https://github.com/apache/arrow-go/pull/854#issuecomment-4747951695

   ## Review notes
   
   Rebased locally onto current `main` (past #852 / #856 / #857) — clean, no 
conflicts, the generated `column_writer_types.gen.go` still matches its 
template, and the full `parquet/...` test suite passes. For an experimental, 
opt-in proposal this is in good shape; nothing below blocks it landing as a 
proposal. The items worth resolving before `VECTOR` graduates from experimental 
are the two panic paths and the cross-version read behavior.
   
   ### Correctness — looks right
   - Row accounting (`values / vector_length`) is consistent across every write 
path via `rowsForLeafValues` (`file/column_writer.go:664`), and the 
whole-vector-per-page guarantee holds (`alignBatchToVector` + the per-batch 
flush in `commitWriteAndCheckPageLimit`).
   - Composes cleanly with #852: the misaligned-batch `panic` is caught by 
`defer recover()` at each public writer boundary and returned as an `error`.
   - Reader rejection is thorough — `file/row_group_reader.go:97-110` (value 
count must be a whole, consistent multiple of the row count, with an overflow 
guard), plus the stride/overflow guards in `seekToRowWithValueStride` and the 
`BuildArray` divisibility check.
   - The `Repetitions.Undefined` 3→4 shift is safe: it is an internal-only 
sentinel (`schema/reflection.go`), never serialized; on-disk `repetition_type` 
was only ever 0–2.
   
   ### Worth addressing before graduating from experimental
   1. **Panic-as-validation on the public `file` writer.** 
`vectorLengthForBatch` panics on a misaligned batch 
(`file/column_writer.go:677`); through the low-level `WriteBatch` API this 
surfaces as a `recover()`'d `"unknown error type: …"`. pqarrow never triggers 
it, but the public writer should validate `n % vector_length == 0` up front and 
return a clean typed error.
   2. **`NewSchemaChecked` can still panic.** `effectiveVectorLength` 
(`schema/column.go:58`) panics on int32 overflow, reachable via 
`buildTree`→`NewColumn` under `NewSchemaChecked`, which is documented to return 
an error instead of panicking. Unreachable today (groups can't be `VECTOR`), 
but a latent contract break once nested vectors exist — better to thread an 
error out.
   3. **Cross-version read is a silent misread.** A `VECTOR` (repetition type 
3) file opened by a pre-this-change reader has no `case` for `3` and would be 
read as a flat required column with `N×len` values / wrong row count rather 
than failing loudly. This is inherent to introducing a new enum value with no 
feature guard — worth calling out in the `WithVectorEncoding` docs, and ideally 
a read-side guard before this is non-experimental.
   4. **Statistics stay element-level** (`file/column_writer.go:660`), so 
page/column min-max on a `VECTOR` column are not row-meaningful for predicate 
pushdown. A format-proposal decision worth stating explicitly.
   
   ### Minor
   - DataPageV1 without an offset index: the seek does a throwaway page seek 
and then re-seeks to 0 (`file/column_reader.go:644-683`) — correct result, but 
two scans.
   - The V1 + offset-index seek relies on `FirstRowIndex()` being in parent-row 
units; it holds, but a test would lock it in (multi-page `VECTOR`, V1, page 
index enabled, seek into later pages).
   - No explicit test for the `WriteBatchSpacedWithError` + `VECTOR` error 
path, or for a malformed `vector_length`.
   - Read reconstruction without `WithStoreSchema` names the element 
`"element"` and drops element/list metadata (both restored when the schema is 
stored) — fine for Phase 1, worth a doc note.
   - `pqarrow/encode_arrow.go` now returns `err` instead of swallowing it on 
builder-creation failure — good catch; unrelated to `VECTOR` but correct.
   
   ### Format / spec
   The Thrift choices (`FieldRepetitionType.VECTOR = 3`, 
`SchemaElement.vector_length` = field id 12) are hand-applied. Worth confirming 
these match the apache/parquet-format proposal exactly — if the standard 
settles on different values, files written now become silently incompatible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to