Jim Pivarski created ARROW-14770:
------------------------------------

             Summary: Direct (individualized) access to definition levels, 
repetition levels, and numeric data of a column
                 Key: ARROW-14770
                 URL: https://issues.apache.org/jira/browse/ARROW-14770
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Parquet, Python
            Reporter: Jim Pivarski


It would be useful to have more low-level access to the three components of a 
Parquet column in Python: the definition levels, the repetition levels, and the 
numeric data, {_}individually{_}.

The particular use-case we have in Awkward Array is that users will sometimes 
lazily read an array of lists of structs without reading any of the fields of 
those structs. To build the data structure, we need the lengths of the lists 
independently of the columns (which users can then use in functions like 
{{{}ak.num{}}}; the number of structs without their field values is useful 
information).

What we're doing right now is reading a column, converting it to Arrow 
({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We have 
been using the schema to try to pick the smallest column (booleans are best!), 
but that's because we really just want the definition and repetition levels 
without the numeric data.

I've heard that the Parquet metadata includes offsets to select just the 
definition levels, just the repetition levels, or just the numeric data 
(pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would 
be ideal.

Beyond our use case, such a feature could also help with wide structs in lists: 
all of the non-nullable fields of the struct would share the same definition 
and repetition levels, so they don't need to be re-read. For that use-case, the 
ability to pick out definition, repetition, and numeric data separately would 
still be useful, but the purpose would be to read the numeric data without the 
structural integers (opposite of ours).

The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but 
would return one, two, or three {{pa.Buffer}} objects depending on three 
boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and {{{}numeric{}}}. 
The {{pa.Buffer}} would be unpacked, with all run-length encodings and 
fixed-width encodings converted into integers of at least one byte each. It may 
make more sense for the output to be {{{}np.ndarray{}}}, to carry {{dtype}} 
information if that can depend on the maximum level (though levels larger than 
255 are likely rare!). This information must be available at some level in 
Arrow's C++ code; the request is to expose it to Python.

I've labeled this minor because it is for optimizations, but it would be really 
nice to have!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to