Hi all,

Picking this thread back up. I've put together a design doc outlining three
options we've discussed:
https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing

* Option A: logical type annotating FIXED_LEN_BYTE_ARRAY.
* Option B: new VECTOR repetition type.
* Option C: logical type annotating a normal LIST, where a reader that
recognizes the annotation skips rep-level decoding and a reader that does
not still sees a working LIST. A future revision would let writers omit
rep-levels entirely.
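
For readers less familiar with repetition levels, here is a minimal Python
sketch of the idea behind Option C. It assumes a top-level, required list
column with no nulls (so the maximum repetition level is 1 and definition
levels are ignored); the function names are hypothetical and not part of any
Parquet library.

```python
from typing import List, Sequence


def rep_levels_for_fixed_size_lists(num_rows: int, list_size: int) -> List[int]:
    """Repetition levels a plain LIST writer would emit for fixed-size rows.

    Assumption: level 0 starts a new list, level 1 continues the current one.
    """
    if list_size <= 0:
        raise ValueError("list_size must be positive")
    return [0 if i % list_size == 0 else 1 for i in range(num_rows * list_size)]


def split_with_rep_levels(values: Sequence[float], rep_levels: Sequence[int]) -> List[List[float]]:
    """Generic LIST read path: rebuild rows by walking the repetition levels."""
    rows: List[List[float]] = []
    for value, level in zip(values, rep_levels):
        if level == 0:
            rows.append([value])      # a new list starts here
        else:
            rows[-1].append(value)    # continue the current list
    return rows


def split_without_rep_levels(values: Sequence[float], list_size: int) -> List[List[float]]:
    """Recognizing-reader path: the size is known, so just slice the values."""
    if len(values) % list_size != 0:
        raise ValueError("value count is not a multiple of list_size")
    return [list(values[i:i + list_size]) for i in range(0, len(values), list_size)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
levels = rep_levels_for_fixed_size_lists(num_rows=2, list_size=3)
rows_generic = split_with_rep_levels(values, levels)
rows_fast = split_without_rep_levels(values, list_size=3)
```

Both paths rebuild the same rows, which is why a recognizing reader could
skip the levels while other readers keep working.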

The document evaluates these against the same requirements and compares
them along six axes (backwards compatibility, composability, encoding
flexibility, implementation complexity, on-disk overhead, read
performance). The doc aims to centralize the discussion and help us pick a
direction.

Comments are open. The most useful pushback would be on the requirements
(especially the "no-fallback breaks adoption" one).

Best,
Rok

On Tue, Mar 3, 2026 at 8:58 PM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> The downside with this approach is that the top-level "unit" type is not
> the element type.
>
> For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> applicable at all (for the latter two).
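
A small Python sketch of the example above, assuming elements are packed back
to back in their little-endian PLAIN form; the helper names and lookup tables
are illustrative only and not taken from the proposal.

```python
import struct
from typing import Sequence

# Byte width of each element type in PLAIN encoding (fixed-width types only).
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}
# Matching format codes for Python's struct module.
STRUCT_CODE = {"INT32": "i", "INT64": "q", "FLOAT": "f", "DOUBLE": "d"}


def flba_type_length(element_type: str, num_values: int) -> int:
    """type_length of the FIXED_LEN_BYTE_ARRAY that holds one whole list."""
    return PLAIN_WIDTH[element_type] * num_values


def pack_vector(element_type: str, values: Sequence[float]) -> bytes:
    """Pack one list into a single fixed-length byte string (little-endian)."""
    fmt = "<" + str(len(values)) + STRUCT_CODE[element_type]
    return struct.pack(fmt, *values)


type_length = flba_type_length("FLOAT", 3)
packed = pack_vector("FLOAT", [1.0, 2.0, 3.0])
unpacked = struct.unpack("<3f", packed)
```

The column encoder only sees one opaque 12-byte value per row here, not three
FLOAT values.
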
>
> I wonder if we can find an approach that doesn't emit repetition levels
> but still allows using efficient encodings for the element type.
>
> Regards
>
> Antoine.
>
>
> On 03/03/2026 at 01:13, Rok Mihevc wrote:
> > Hi all,
> >
> > I'd like to resurrect this thread in light of the recent discussion of
> > vectors in Parquet [1].
> > There is a (now updated) proposal PR from when the thread was started
> > that has a nice discussion [2].
> >
> > TLDR of the current proposal:
> > - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> > FixedSizeListType { type, num_values }.
> > - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> > FIXED_LEN_BYTE_ARRAY); num_values > 0.
> > - type_length must equal the size of num_values elements in the PLAIN
> > representation of type.
> > - If the field is optional, the whole list value may be null; elements
> > are always non-null.
> > - Intentionally not a `LIST` encoding (no def/rep levels).
> > - Outer page/column encoding behavior is unchanged (any encoding valid
> > for `FIXED_LEN_BYTE_ARRAY` remains valid).
> >
> > I also added explicit validity requirements: writers must not emit
> > violating metadata, and readers must treat violating metadata as invalid.
> >
> >
> > Rok
> >
> > [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> > [2] https://github.com/apache/parquet-format/pull/241
> >
> > On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> >
> >> I would love to see this!
> >>
> >> It is an important optimization for vectors, which are becoming
> >> increasingly important and ubiquitous for grounding LLMs.
> >>
> >> Note however that the logical type route has one drawback: A logical type
> >> may not change the physical representation of values! Thus, if we make
> >> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
> >> as even clients not knowing this logical type need to be able to decode the
> >> column. We could avoid reading the R-Levels and just assume that each list
> >> has the fixed size, so the read path would be optimized but the write path
> >> wouldn't.
> >>
> >> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
> >> it. It needs to be something different. E.g., in the schema, we could store
> >> an optional `count` for repeated fields. Whenever this count is present, we
> >> would not write R-Levels for this field (or more precisely, this field
> >> would not take part in the R-Level computation, as if it wasn't a repeated
> >> field). This of course is a more intrusive change, as legacy clients
> >> couldn't read such columns anymore.
> >>
> >> I don't know which of the two alternatives is better. I agree with Gang
> >> that we should probably discuss this in a PR.
> >>
> >> Cheers,
> >> Jan
> >>
> >>
> >> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> >>
> >>> Hi Rok,
> >>>
> >>> Happy to see you here :)
> >>>
> >>> According to my past experience, it would be more helpful to open
> >>> a PR against the parquet-format repository and post it here.
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> >>>> canonical extension types [1] that use FixedSizeList and
> >>>> StructArray(List, FixedSizeList) as storage, respectively. These are
> >>>> targeted at machine learning and scientific applications that deal
> >>>> with large datasets and would benefit from using Parquet as on-disk
> >>>> storage.
> >>>>
> >>>> However, FixedSizeList is currently stored as List in Parquet, which
> >>>> adds significant conversion overhead when reading and writing [2]. It
> >>>> would therefore be beneficial to introduce a FIXED_SIZE_LIST logical
> >>>> type.
> >>>>
> >>>> I would like to open a discussion on potentially adding a
> >>>> FIXED_SIZE_LIST type and prepare a proposal if the discussion
> >>>> supports it.
> >>>>
> >>>>
> >>>> Best,
> >>>> Rok
> >>>>
> >>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> >>>> [2] https://github.com/apache/arrow/issues/34510
> >>>>
> >>>
> >>
> >
>
>
>
