rok opened a new pull request, #854:
URL: https://github.com/apache/arrow-go/pull/854
## What
Adds an **experimental** Parquet `VECTOR` `FieldRepetitionType` and `Vector`
logical type, and maps Arrow `FixedSizeList<T, N>` to it on the pqarrow write
and read paths, **opt-in** via `pqarrow.WithVectorEncoding()`.
`VECTOR` stores fixed-shape data (embeddings, image/tensor patches,
fixed-precision decimal vectors) **without per-element repetition/definition
levels**, eliminating the standard 3-level `LIST` overhead. This is the "Option
B" design from the *Fixed-size list type for Parquet* proposal (see also
apache/arrow#34510 for the measured ~3x read gap that motivates it).
## Scope (Phase 1)
This PR is intentionally the first, smallest slice:
- **Only** dense, **non-nullable**, **top-level** `FixedSizeList` columns
with a **fixed-width primitive** element are encoded as `VECTOR`.
- Every other `FixedSizeList` — nullable value or element, zero-length,
variable-width / dictionary / extension element, struct or nested-list element,
or a `FixedSizeList` nested inside another type — **transparently falls back to
the standard `LIST` encoding**. Nothing that writes today changes unless the
flag is set, and unsupported shapes never error.
- Nullable vectors, struct elements, and nested vectors are deferred to a
follow-up PR.
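The Phase 1 rule above can be read as a single predicate over the column's shape. The following is a minimal, stdlib-only sketch of that decision, not the code in this PR: `fslColumn`, `elemKind`, `useVector`, and `encodingFor` are hypothetical names introduced for illustration and are not part of `pqarrow`.

```go
package main

import "fmt"

// elemKind is a hypothetical classification of the list's element type.
type elemKind int

const (
	fixedWidthPrimitive elemKind = iota // e.g. int32, float64
	variableWidth                       // e.g. string, binary
	dictionaryEncoded
	extensionType
	structOrNestedList
)

// fslColumn is a hypothetical summary of a FixedSizeList field.
// It is not a pqarrow type; it only carries what the rule needs.
type fslColumn struct {
	Length       int  // N in FixedSizeList<T, N>
	ListNullable bool // the list value itself may be null
	ElemNullable bool // individual elements may be null
	TopLevel     bool // false when nested inside another type
	Elem         elemKind
}

// useVector reports whether a column meets every Phase 1 condition.
// optIn stands in for the writer having been given the opt-in flag.
func useVector(optIn bool, c fslColumn) bool {
	return optIn &&
		c.TopLevel &&
		!c.ListNullable &&
		!c.ElemNullable &&
		c.Length > 0 &&
		c.Elem == fixedWidthPrimitive
}

// encodingFor never returns an error: unsupported shapes fall back to LIST.
func encodingFor(optIn bool, c fslColumn) string {
	if useVector(optIn, c) {
		return "VECTOR"
	}
	return "LIST"
}

func main() {
	emb := fslColumn{Length: 8, TopLevel: true, Elem: fixedWidthPrimitive}
	fmt.Println(encodingFor(true, emb))  // VECTOR
	fmt.Println(encodingFor(false, emb)) // LIST (flag not set)

	nullable := emb
	nullable.ListNullable = true
	fmt.Println(encodingFor(true, nullable)) // LIST (fallback)
}
```

Returning a string rather than an error mirrors the stated behavior that unsupported shapes fall back silently instead of failing the write.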
## Canonical structure (mirrors LIST)
```
<required|optional> group <name> (VECTOR) {
  vector group list [N] {
    <required|optional> <element-type> element;
  }
}
```
The `VECTOR`-repeated middle group does **not** increment the max
definition or repetition level, so a dense vector leaf has no inner levels. The
column writer counts rows as `values / vector_length` and never splits a
vector across data pages. The reader reconstructs the `FixedSizeList`
**without needing a stored Arrow schema**.
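To make the level and row arithmetic concrete, here is a small sketch under stated assumptions. It applies the usual Parquet rules (an `optional` node adds one definition level, a `repeated` node adds one definition and one repetition level) and treats `vector` as adding neither, as described above. The names `repetition`, `maxLevels`, and `rowsFor` are hypothetical and are not the identifiers used in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// repetition mirrors FieldRepetitionType, including the experimental value 3.
type repetition int

const (
	required repetition = iota // 0
	optional                   // 1
	repeated                   // 2
	vector                     // 3 (experimental VECTOR)
)

// maxLevels walks the path from the root to a leaf and returns the
// maximum definition and repetition levels for that leaf.
func maxLevels(path []repetition) (def, rep int) {
	for _, r := range path {
		switch r {
		case optional:
			def++
		case repeated:
			def++
			rep++
		case required, vector:
			// Neither contributes a level.
		}
	}
	return def, rep
}

var errPartialVector = errors.New("value count is not a multiple of vector length")

// rowsFor converts a leaf value count into a row count and rejects
// input that would end in a partial vector.
func rowsFor(numValues, vectorLength int) (int, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector length must be positive")
	}
	if numValues%vectorLength != 0 {
		return 0, errPartialVector
	}
	return numValues / vectorLength, nil
}

func main() {
	// required group (VECTOR) > vector group list [8] > required element
	fmt.Println(maxLevels([]repetition{required, vector, required})) // 0 0

	// Standard LIST: required group (LIST) > repeated group list > required element
	fmt.Println(maxLevels([]repetition{required, repeated, required})) // 1 1

	fmt.Println(rowsFor(4000, 8)) // 500 <nil>

	_, err := rowsFor(4001, 8)
	fmt.Println(err) // value count is not a multiple of vector length
}
```

With both max levels at zero, the dense leaf carries no level data at all, which is where the saving over the `LIST` layout comes from.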
## Format additions
- `FieldRepetitionType.VECTOR = 3`
- `VectorType` logical type (`LogicalType` union id **19**)
- `SchemaElement.vector_length` (field id **12**)
Since `VECTOR` is not yet part of `apache/parquet-format`, the additions to
the generated `parquet/internal/gen-go/parquet/parquet.go` were applied **by
hand** in the existing Thrift 0.21.0 code-generator style.
`parquet/parquet_vector.thrift` vendors the IDL fragment as the source of truth
(byte-identical regeneration would need Thrift 0.21.0 + the full upstream
`parquet.thrift`). The ids (19 for the `LogicalType` union member, 12 for
`vector_length`) and `VECTOR = 3` match the arrow-cpp Option B prototype, so
files written by either implementation interoperate.
## ⚠️ Compatibility
Files written with `VECTOR` are **not readable** by Parquet readers that
don't understand the `VECTOR` repetition type. This is the defining trade-off
of Option B and the reason it is strictly opt-in and documented experimental.
## Testing
- Thrift compact-protocol round-trip for the new format types.
- Schema layer: logical type, node `vector_length` validation, level
computation, effective-length (incl. nested product), full schema round-trip.
- Core column writer: row accounting + page-not-split invariant (multi-page)
+ partial-vector rejection.
- pqarrow: schema mapping (VECTOR vs LIST fallback cases), manifest
reconstruction, and a full **Arrow round-trip** (`FixedSizeList<float64,8> ×
500` written as VECTOR and read back identically, leak-checked).
All new tests pass; the only failing tests in the suite are the pre-existing
ones that require the `parquet-testing` data submodule / `PARQUET_TEST_DATA`.
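The page-not-split invariant listed above can be illustrated with a simplified model. This sketch assumes pages are capped by a value count, whereas a real column writer sizes pages by bytes; `pageSizes` and its parameters are hypothetical and do not correspond to functions in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// pageSizes splits numValues leaf values into pages of at most
// maxValuesPerPage values, rounding each page down to a whole number
// of vectors so that no vector straddles a page boundary.
// If a single vector exceeds the cap, the page holds exactly one vector.
func pageSizes(numValues, vectorLength, maxValuesPerPage int) ([]int, error) {
	if vectorLength <= 0 {
		return nil, errors.New("vector length must be positive")
	}
	if numValues%vectorLength != 0 {
		return nil, errors.New("partial vector: value count is not a multiple of vector length")
	}
	perPage := (maxValuesPerPage / vectorLength) * vectorLength
	if perPage == 0 {
		perPage = vectorLength
	}
	var sizes []int
	for remaining := numValues; remaining > 0; {
		n := perPage
		if remaining < n {
			n = remaining
		}
		sizes = append(sizes, n)
		remaining -= n
	}
	return sizes, nil
}

func main() {
	// 500 vectors of length 8, with a deliberately awkward cap of 100 values.
	sizes, err := pageSizes(4000, 8, 100)
	if err != nil {
		panic(err)
	}
	total := 0
	for _, n := range sizes {
		if n%8 != 0 {
			panic("page splits a vector")
		}
		total += n
	}
	fmt.Println(len(sizes), sizes[0], sizes[len(sizes)-1], total) // 42 96 64 4000
}
```

A cap that is not a multiple of the vector length, as in the usage above, is the case worth testing, since naive chunking would cut the thirteenth vector of each page in half.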
## Follow-ups (Phase 2)
Nullable vectors (spaced leaf materialization + def-level→validity
collapse), struct elements, nested vectors, and broadening the write paths
(`WriteBatchSpaced`/dictionary) to be vector-aligned.
🤖 Generated with [Claude Code](https://claude.com/claude-code)