zeroshade commented on PR #854:
URL: https://github.com/apache/arrow-go/pull/854#issuecomment-4747966882

   **Follow-up — format/spec conformance**
   
   I checked the hand-applied Thrift against the in-flight parquet-format 
proposal. `VECTOR` is not in `apache/parquet-format` master yet 
(`FieldRepetitionType` there is still just `REQUIRED`/`OPTIONAL`/`REPEATED`). 
Comparing against the current Option B draft (Antoine Pitrou's 
`vector-repetition` branch):
   
   | | `FieldRepetitionType.VECTOR` | `SchemaElement.vector_length` field id |
   |---|---|---|
   | This PR (`parquet/parquet_vector.thrift`) | `3` | **`12`** |
   | parquet-format draft (`pitrou:vector-repetition`) | `3` | **`11`** |
   
   The enum value matches, but the **`vector_length` Thrift field id differs 
(12 here vs 11 in the draft).** A reader built to the draft would skip the 
unknown field 12, then see `repetition_type = VECTOR` with no `vector_length` 
and reject the file as malformed, so files written by this PR would not be 
readable by a draft-conformant reader. Since cross-implementation compatibility 
is the whole point of Option B, it'd be worth aligning the field id with the 
proposal (or calling out the divergence explicitly) before any files get written.
   
   Secondary, lower confidence: the Parquet C++ Option B prototype 
(`rok/arrow#51`) appears to model VECTOR as a three-level group carrying a 
dedicated `Vector` logical type, whereas this PR uses the primitive-leaf 
"reduced Option B" with no logical annotation — another representational 
difference to reconcile as the proposal converges.
   


