Jim Pivarski created ARROW-14770: ------------------------------------ Summary: Direct (individualized) access to definition levels, repetition levels, and numeric data of a column Key: ARROW-14770 URL: https://issues.apache.org/jira/browse/ARROW-14770 Project: Apache Arrow Issue Type: New Feature Components: C++, Parquet, Python Reporter: Jim Pivarski
It would be useful to have more low-level access to the three components of a Parquet column in Python: the definition levels, the repetition levels, and the numeric data, {_}individually{_}. The particular use-case we have in Awkward Array is that users will sometimes lazily read an array of lists of structs without reading any of the fields of those structs. To build the data structure, we need the lengths of the lists independently of the columns (which users can then use in functions like {{{}ak.num{}}}; the number of structs without their field values is useful information). What we're doing right now is reading a column, converting it to Arrow ({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We have been using the schema to try to pick the smallest column (booleans are best!), but that's because we really just want the definition and repetition levels without the numeric data. I've heard that the Parquet metadata includes offsets to select just the definition levels, just the repetition levels, or just the numeric data (pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would be ideal. Beyond our use case, such a feature could also help with wide structs in lists: all of the non-nullable fields of the struct would share the same definition and repetition levels, so they don't need to be re-read. For that use-case, the ability to pick out definition, repetition, and numeric data separately would still be useful, but the purpose would be to read the numeric data without the structural integers (opposite of ours). The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but would return one, two, or three {{pa.Buffer}} objects depending on three boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and {{{}numeric{}}}. The {{pa.Buffer}} would be unpacked, with all run-length encodings and fixed-width encodings converted into integers of at least one byte each. It may make more sense for the output to be {{{}np.ndarray{}}}, to carry {{dtype}} information if that can depend on the maximum level (though levels larger than 255 are likely rare!). This information must be available at some level in Arrow's C++ code; the request is to expose it to Python. I've labeled this minor because it is for optimizations, but it would be really nice to have! -- This message was sent by Atlassian Jira (v8.20.1#820001)