[ https://issues.apache.org/jira/browse/ARROW-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446064#comment-17446064 ]
Martin Durant commented on ARROW-14770:
---------------------------------------

Quick comment: the separate file offsets to the three components are explicitly given in V2 pages, where only the data portion is compressed. For V1, the components are compressed together, and the lengths of the components are only known after decompression, although that decompression could be streamed.

> Direct (individualized) access to definition levels, repetition levels, and
> numeric data of a column
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14770
>                 URL: https://issues.apache.org/jira/browse/ARROW-14770
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Jim Pivarski
>            Priority: Minor
>
> It would be useful to have more low-level access to the three components of a
> Parquet column in Python: the definition levels, the repetition levels, and
> the numeric data, _individually_.
> The particular use-case we have in Awkward Array is that users will sometimes
> lazily read an array of lists of structs without reading any of the fields of
> those structs. To build the data structure, we need the lengths of the lists
> independently of the columns (which users can then use in functions like
> {{ak.num}}; the number of structs without their field values is useful
> information).
> What we're doing right now is reading a column, converting it to Arrow
> ({{pa.Array}}), and getting the list lengths from that Arrow array. We
> have been using the schema to try to pick the smallest column (booleans are
> best!), but that's because we really just want the definition and repetition
> levels without the numeric data.
> I've heard that the Parquet metadata includes offsets to select just the
> definition levels, just the repetition levels, or just the numeric data
> (pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would
> be ideal.
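To illustrate why the levels alone are enough for the use-case described above, here is a minimal sketch in plain Python that recovers list lengths from already-decoded repetition and definition levels. It assumes a single level of list nesting (maximum repetition level 1). The function name `list_lengths` and the parameter `repeated_def_level` are hypothetical and not part of pyarrow; `repeated_def_level` stands for the definition level at which the repeated node is defined, which depends on which schema nodes are optional.

```python
def list_lengths(rep_levels, def_levels, repeated_def_level=1):
    """Return the length of each list in a singly nested list column.

    Assumptions (not taken from any library API):
    - rep_levels and def_levels are equal-length sequences of ints,
    - the maximum repetition level is 1 (one level of list nesting),
    - an entry is a real list element when its definition level is at
      least repeated_def_level; lower values mark an empty or null list.
    """
    if len(rep_levels) != len(def_levels):
        raise ValueError("level sequences must have the same length")

    lengths = []
    for rep, dfn in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 starts a new row, hence a new list.
            lengths.append(0)
        elif not lengths:
            raise ValueError("first repetition level must be 0")
        if dfn >= repeated_def_level:
            # The repeated node is defined, so this entry is one element.
            lengths[-1] += 1
    return lengths


# Rows [[a, b], [], [c]] for a required list of required elements:
print(list_lengths([0, 1, 0, 0], [1, 1, 0, 1]))  # [2, 0, 1]
```

No element values are touched in this computation, which is the point of the request: the numeric data page content never needs to be decoded to obtain the list structure.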
> Beyond our use case, such a feature could also help with wide structs in
> lists: all of the non-nullable fields of the struct would share the same
> definition and repetition levels, so they don't need to be re-read. For that
> use-case, the ability to pick out definition, repetition, and numeric data
> separately would still be useful, but the purpose would be to read the
> numeric data without the structural integers (opposite of ours).
> The desired interface would be like {{ParquetFile.read_row_group}}, but
> would return one, two, or three {{pa.Buffer}} objects depending on three
> boolean arguments, {{definition}}, {{repetition}}, and
> {{numeric}}. The {{pa.Buffer}} would be unpacked, with all run-length
> encodings and fixed-width encodings converted into integers of at least one
> byte each. It may make more sense for the output to be {{np.ndarray}}, to
> carry {{dtype}} information if that can depend on the maximum level (though
> levels larger than 255 are likely rare!). This information must be available
> at some level in Arrow's C++ code; the request is to expose it to Python.
> I've labeled this minor because it is for optimizations, but it would be
> really nice to have!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
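As a rough illustration of the "unpacked" output requested in the issue, the following plain-Python sketch expands a buffer of levels stored in Parquet's RLE/bit-packing hybrid encoding into one integer per level. It is not Arrow's implementation. The names `read_varint` and `decode_levels` are hypothetical, and the sketch assumes the buffer contains only the runs (any length prefix already stripped) and that the caller knows the bit width and the number of values.

```python
def read_varint(buf, pos):
    """Read an unsigned LEB128 varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_levels(buf, bit_width, num_values):
    """Expand RLE/bit-packed hybrid runs into a list of ints.

    Assumptions: buf holds only the runs, bit_width is the width implied
    by the maximum level, and num_values is the number of levels expected.
    """
    out = []
    pos = 0
    value_bytes = (bit_width + 7) // 8  # bytes used by an RLE run's value
    mask = (1 << bit_width) - 1
    while len(out) < num_values and pos < len(buf):
        header, pos = read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each,
            # packed least-significant bit first.
            groups = header >> 1
            nbytes = groups * bit_width
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                out.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            out.extend([value] * count)
    # Bit-packed runs are padded to a multiple of 8, so trim the excess.
    return out[:num_values]


# One bit-packed group of the values 0..7 at bit width 3:
print(decode_levels(bytes([0x03, 0x88, 0xC6, 0xFA]), 3, 8))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

The result could be wrapped in an `np.ndarray` with a `dtype` chosen from the maximum level, as the issue suggests; that step is left out here to keep the sketch dependency-free.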