rok opened a new pull request, #854:
URL: https://github.com/apache/arrow-go/pull/854

   ## What
   
   Adds an **experimental** Parquet `VECTOR` `FieldRepetitionType` and `Vector` 
logical type, and maps Arrow `FixedSizeList<T, N>` to it on the pqarrow write 
and read paths, **opt-in** via `pqarrow.WithVectorEncoding()`.
   
   `VECTOR` stores fixed-shape data (embeddings, image/tensor patches, 
fixed-precision decimal vectors) **without per-element repetition/definition 
levels**, eliminating the standard 3-level `LIST` overhead. This is the "Option 
B" design from the *Fixed-size list type for Parquet* proposal (see also 
apache/arrow#34510 for the measured ~3x read gap that motivates it).
   
   ## Scope (Phase 1)
   
   This PR is intentionally the first, smallest slice:
   
   - **Only** dense, **non-nullable**, **top-level** `FixedSizeList` columns 
with a **fixed-width primitive** element are encoded as `VECTOR`.
   - Every other `FixedSizeList` (nullable value or element, zero-length, 
variable-width / dictionary / extension element, struct or nested-list element, 
or a `FixedSizeList` nested inside another type) **transparently falls back to 
the standard `LIST` encoding**. Existing write behavior is unchanged unless the 
option is set, and unsupported shapes never return an error.
   - Nullable vectors, struct elements, and nested vectors are deferred to a 
follow-up PR.
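
The Phase 1 eligibility rule above can be read as a single predicate. The following is a minimal sketch only: `Kind`, `FixedSizeListCol`, and `UseVector` are hypothetical names invented for illustration and are not the pqarrow API or the code in this PR.

```go
package main

import "fmt"

// Kind is a hypothetical, simplified stand-in for an Arrow element type class.
type Kind int

const (
	KindFixedWidthPrimitive Kind = iota // e.g. int32, float64
	KindVariableWidth                   // e.g. string, binary
	KindDictionary
	KindExtension
	KindStruct
	KindList
)

// FixedSizeListCol is a hypothetical summary of one FixedSizeList column.
type FixedSizeListCol struct {
	TopLevel        bool // not nested inside another type
	Nullable        bool // the list value itself may be null
	ElementNullable bool // individual elements may be null
	Length          int  // N in FixedSizeList<T, N>
	Element         Kind
}

// UseVector reports whether a column would be written as VECTOR under the
// Phase 1 rules. Every other shape falls back to LIST; nothing errors.
func UseVector(optIn bool, c FixedSizeListCol) bool {
	return optIn &&
		c.TopLevel &&
		!c.Nullable &&
		!c.ElementNullable &&
		c.Length > 0 &&
		c.Element == KindFixedWidthPrimitive
}

func main() {
	dense := FixedSizeListCol{TopLevel: true, Length: 8, Element: KindFixedWidthPrimitive}
	nullable := dense
	nullable.Nullable = true

	fmt.Println(UseVector(true, dense))    // true: eligible and opted in
	fmt.Println(UseVector(true, nullable)) // false: falls back to LIST
	fmt.Println(UseVector(false, dense))   // false: flag not set
}
```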
   
   ## Canonical structure (mirrors LIST)
   
   ```
   <required|optional> group <name> (VECTOR) {
     vector group list [N] {
       <required|optional> <element-type> element;
     }
   }
   ```
   
   The VECTOR-repeated middle group does **not** increment the max 
definition/repetition level, so a dense vector leaf has no inner levels. The 
column writer counts rows as `values / vector_length` and never splits a 
vector across a data page. The reader reconstructs the `FixedSizeList` 
**without needing a stored Arrow schema**.
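
To illustrate the level rule, here is a small sketch that computes the maximum definition and repetition levels along a root-to-leaf path. `Repetition` and `MaxLevels` are hypothetical names for this example, not the schema code in the PR. The `Optional` and `Repeated` cases follow the standard Parquet rules; the `Vector` case reflects the behavior described above.

```go
package main

import "fmt"

// Repetition mirrors the Parquet FieldRepetitionType values, plus the
// experimental VECTOR = 3 proposed in this PR.
type Repetition int

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
	Vector   Repetition = 3
)

// MaxLevels walks the repetition types from the top-level field down to the
// leaf and returns the leaf's maximum definition and repetition levels.
func MaxLevels(path []Repetition) (def, rep int) {
	for _, r := range path {
		switch r {
		case Optional:
			def++
		case Repeated:
			def++
			rep++
		case Required, Vector:
			// Neither contributes a level.
		}
	}
	return def, rep
}

func main() {
	// required group (LIST) { repeated group list { required element } }
	listDef, listRep := MaxLevels([]Repetition{Required, Repeated, Required})
	// required group (VECTOR) { vector group list [N] { required element } }
	vecDef, vecRep := MaxLevels([]Repetition{Required, Vector, Required})

	fmt.Println(listDef, listRep) // 1 1
	fmt.Println(vecDef, vecRep)   // 0 0
}
```

With both levels at zero, a dense vector leaf carries no level data at all, which is where the saving over the 3-level `LIST` layout comes from.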
   
   ## Format additions
   
   - `FieldRepetitionType.VECTOR = 3`
   - `VectorType` logical type (`LogicalType` union id **19**)
   - `SchemaElement.vector_length` (field id **12**)
   
   Since `VECTOR` is not yet part of `apache/parquet-format`, the additions to 
the generated `parquet/internal/gen-go/parquet/parquet.go` were applied **by 
hand** in the existing Thrift 0.21.0 code-generator style. 
`parquet/parquet_vector.thrift` vendors the IDL fragment as the source of truth 
(byte-identical regeneration would need Thrift 0.21.0 + the full upstream 
`parquet.thrift`). Union id 19, field id 12, and `VECTOR = 3` match the 
arrow-cpp Option B prototype, so files written by either implementation 
interoperate.
   
   ## ⚠️ Compatibility
   
   Files written with `VECTOR` are **not readable** by Parquet readers that 
don't understand the `VECTOR` repetition type. This is the defining trade-off 
of Option B and the reason it is strictly opt-in and documented experimental.
   
   ## Testing
   
   - Thrift compact-protocol round-trip for the new format types.
   - Schema layer: logical type, node `vector_length` validation, level 
computation, effective-length (incl. nested product), full schema round-trip.
   - Core column writer: row accounting + page-not-split invariant (multi-page) 
+ partial-vector rejection.
   - pqarrow: schema mapping (VECTOR vs LIST fallback cases), manifest 
reconstruction, and a full **Arrow round-trip** (`FixedSizeList<float64,8> × 
500` written as VECTOR and read back identically, leak-checked).
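
The row-accounting and page-not-split invariants exercised by the column writer tests can be sketched as follows. This is an illustrative model, not the writer's actual code: `RowsForValues`, `PageChunks`, and the `maxValuesPerPage` budget are hypothetical names and parameters introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// RowsForValues counts rows as values / vectorLength and rejects a batch
// that ends in a partial vector.
func RowsForValues(numValues, vectorLength int) (int, error) {
	if vectorLength <= 0 {
		return 0, errors.New("vector length must be positive")
	}
	if numValues < 0 || numValues%vectorLength != 0 {
		return 0, fmt.Errorf("%d values is not a whole number of %d-element vectors",
			numValues, vectorLength)
	}
	return numValues / vectorLength, nil
}

// PageChunks splits numValues into per-page value counts so that every page
// holds a whole number of vectors.
func PageChunks(numValues, vectorLength, maxValuesPerPage int) ([]int, error) {
	if _, err := RowsForValues(numValues, vectorLength); err != nil {
		return nil, err
	}
	// Round the page budget down to a multiple of the vector length, but
	// always allow at least one whole vector per page.
	perPage := (maxValuesPerPage / vectorLength) * vectorLength
	if perPage < vectorLength {
		perPage = vectorLength
	}
	var chunks []int
	for remaining := numValues; remaining > 0; {
		n := perPage
		if remaining < n {
			n = remaining
		}
		chunks = append(chunks, n)
		remaining -= n
	}
	return chunks, nil
}

func main() {
	rows, err := RowsForValues(4000, 8) // 500 vectors of 8 values
	fmt.Println(rows, err)              // 500 <nil>

	chunks, err := PageChunks(40, 8, 20)
	fmt.Println(chunks, err) // [16 16 8] <nil>

	_, err = RowsForValues(10, 8)
	fmt.Println(err != nil) // true: partial vector rejected
}
```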
   
   All new tests pass; the only failing tests in the suite are the pre-existing 
ones that require the `parquet-testing` data submodule / `PARQUET_TEST_DATA`.
   
   ## Follow-ups (Phase 2)
   
   Nullable vectors (spaced leaf materialization + def-level→validity 
collapse), struct elements, nested vectors, and broadening the write paths 
(`WriteBatchSpaced`/dictionary) to be vector-aligned.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
