Hi all,

Picking this thread back up. I've put together a design doc outlining the three options we've discussed:
https://docs.google.com/document/d/1nf30OqK_UqxA4YTEZQszmOBEG56m9M5mp9rIYC2SUWc/edit?usp=sharing

* Option A: a logical type annotating FIXED_LEN_BYTE_ARRAY.
* Option B: a new VECTOR repetition type.
* Option C: a logical type annotating a normal LIST, where a recognizing reader skips rep-level decode and an unknown reader still sees a working LIST. A future revision would let writers omit rep-levels entirely.

The document evaluates these against the same requirements and compares them along six axes (backwards compatibility, composability, encoding flexibility, implementation complexity, on-disk overhead, read performance).

The doc aims to centralize the discussion and help us pick a direction. Comments are open. The most useful pushback would be on the requirements (especially the "no-fallback breaks adoption" one).

Best,
Rok

On Tue, Mar 3, 2026 at 8:58 PM Antoine Pitrou <[email protected]> wrote:
>
> Hello,
>
> The downside with this approach is that the top-level "unit" type is not
> the element type.
>
> For example, if you have a FIXED_SIZE_LIST(FLOAT32, 3), then the
> top-level unit type is FIXED_LEN_BYTE_ARRAY(12). This means that
> specialized encodings such as BYTE_STREAM_SPLIT, DELTA_BINARY_PACKED or
> ALP may either be less efficient (for BYTE_STREAM_SPLIT) or not be
> applicable at all (for the latter two).
>
> I wonder if we can find an approach that doesn't emit repetition levels
> but still allows using efficient encodings for the element type.
>
> Regards
>
> Antoine.
>
>
> On 03/03/2026 at 01:13, Rok Mihevc wrote:
> > Hi all,
> >
> > I'd like to resurrect this thread in light of recent vectors in Parquet
> > discussion [1].
> > There is a (now updated) proposal PR from when the thread was started that
> > has a nice discussion [2].
> >
> > TLDR of the current proposal:
> > - FIXED_SIZE_LIST annotates a FIXED_LEN_BYTE_ARRAY primitive leaf with
> > FixedSizeListType { type, num_values }.
> > - type must be fixed-width and non-array (INT32, INT64, FLOAT, DOUBLE,
> > FIXED_LEN_BYTE_ARRAY); num_values > 0.
> > - type_length must match num_values encoded with PLAIN representation of
> > type.
> > - If the field is optional, the whole list value may be null; elements are
> > always non-null.
> > - Intentionally not a `LIST` encoding (no def/rep levels).
> > - Outer page/column encoding behavior is unchanged (any encoding valid for
> > `FIXED_LEN_BYTE_ARRAY` remains valid).
> >
> > I also added explicit validity requirements: writers must not emit
> > violating metadata, and readers must treat violating metadata as invalid.
> >
> >
> > Rok
> >
> > [1] https://lists.apache.org/thread/nmq7odlbg1p6yx0hg00clzjbc3tb1tc3
> > [2] https://github.com/apache/parquet-format/pull/241
> >
> > On Thu, May 16, 2024 at 4:34 AM Jan Finis <[email protected]> wrote:
> >
> >> I would love to see this!
> >>
> >> It is an important optimization for vectors, which become more and more
> >> important and ubiquitous for grounding of LLMs.
> >>
> >> Note however that the logical type route has one drawback: A logical type
> >> may not change the physical representation of values! Thus, if we make
> >> FIXED_SIZE_LIST just a logical type, we would still need to write R-Levels,
> >> as even clients not knowing this logical type need to be able to decode the
> >> column. We could avoid reading the R-Levels and just assume that each list
> >> has the fixed size, so the read path would be optimized but the write path
> >> wouldn't.
> >>
> >> If we want to avoid writing R-Levels altogether, a logical type doesn't cut
> >> it. It needs to be something different. E.g., in the schema, we could store
> >> an optional `count` for repeated fields. Whenever this count is present, we
> >> would not write R-Levels for this field (or more precisely, this field
> >> would not take part in the R-Level computation, as if it wasn't a repeated
> >> field). This of course is a more intrusive change, as legacy clients
> >> couldn't read such columns anymore.
> >>
> >> I don't know which of the two alternatives is better. I agree with Gang
> >> that we should probably discuss this in a PR.
> >>
> >> Cheers,
> >> Jan
> >>
> >>
> >> On Wed, May 15, 2024 at 14:03, Gang Wu <[email protected]> wrote:
> >>
> >>> Hi Rok,
> >>>
> >>> Happy to see you here :)
> >>>
> >>> According to my past experience, it would be more helpful to open
> >>> a PR against the parquet-format repository and post it here.
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Wed, May 15, 2024 at 7:25 PM Rok Mihevc <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Arrow recently introduced FixedShapeTensor and VariableShapeTensor
> >>>> canonical extension types [1] that use FixedSizeList and
> >>>> StructArray(List, FixedSizeList) as storage respectively. These are
> >>>> targeted at machine learning and scientific applications that deal with
> >>>> large datasets and would benefit from using Parquet as on disk storage.
> >>>>
> >>>> However currently FixedSizeList is stored as List in Parquet which adds
> >>>> significant conversion overhead when reading and writing [2]. It would
> >>>> therefore be beneficial to introduce a FIXED_SIZE_LIST logical type.
> >>>>
> >>>> I would like to open a discussion on potentially adding FIXED_SIZE_LIST
> >>>> type and prepare a proposal if discussion supports it.
> >>>>
> >>>>
> >>>> Best,
> >>>> Rok
> >>>>
> >>>> [1] https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list
> >>>> [2] https://github.com/apache/arrow/issues/34510
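
For readers following the thread, the size arithmetic behind the FIXED_LEN_BYTE_ARRAY option and the FIXED_LEN_BYTE_ARRAY(12) example can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the proposal or from any Parquet library: the helper names (`expected_type_length`, `pack_vector`, `list_rep_levels`) are hypothetical, it assumes little-endian PLAIN encoding and a required top-level list column, and it leaves out the case where the element type is itself FIXED_LEN_BYTE_ARRAY.

```python
import struct

# PLAIN-encoded byte widths of the fixed-width element types named in the thread.
PLAIN_WIDTH = {"INT32": 4, "INT64": 8, "FLOAT": 4, "DOUBLE": 8}
# Matching format characters for Python's struct module.
STRUCT_CODE = {"INT32": "i", "INT64": "q", "FLOAT": "f", "DOUBLE": "d"}


def expected_type_length(element_type, num_values):
    """type_length a FIXED_LEN_BYTE_ARRAY leaf would need to hold num_values elements."""
    if num_values <= 0:
        raise ValueError("num_values must be > 0")
    return PLAIN_WIDTH[element_type] * num_values


def pack_vector(element_type, values):
    """Concatenate the little-endian encodings of the elements into one byte string."""
    fmt = "<%d%s" % (len(values), STRUCT_CODE[element_type])
    return struct.pack(fmt, *values)


def list_rep_levels(num_rows, num_values):
    """Repetition levels a plain LIST column would store when every row has num_values elements."""
    # The first element of each row starts a new list (level 0); the rest continue it (level 1).
    return [0 if i == 0 else 1 for _ in range(num_rows) for i in range(num_values)]


if __name__ == "__main__":
    print(expected_type_length("FLOAT", 3))
    print(len(pack_vector("FLOAT", [1.0, 2.0, 3.0])))
    print(list_rep_levels(2, 3))
```

Under these assumptions, three FLOAT values occupy 12 bytes, which is where FIXED_LEN_BYTE_ARRAY(12) comes from, and the repetition levels of a fixed-size LIST follow the same pattern for every row, which is why a reader that knows the size can skip decoding them.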
