[jira] [Commented] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column

2021-11-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447632#comment-17447632
 ] 

Micah Kornfield commented on ARROW-14770:
-

FWIW, writing V2 data pages isn't production-ready in Arrow.  There is at least 
one open bug for incorrect statistics, and we don't align pages to row 
boundaries, which I believe is a requirement for V2.  My understanding is that 
V2 is not widely used in general, and we certainly haven't put a lot of effort 
into optimizing the read paths either.

With regard to the specific issue, would a higher-level API that returned 
list lengths be more appropriate? 
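
As a rough illustration of what such a list-length API would compute, here is 
a minimal pure-Python sketch. It is not Arrow code: the function name 
list_lengths and its empty_def_level parameter are hypothetical, and it 
assumes a single-level optional list with required elements (definition level 
0 = null list, 1 = empty list, 2 = an element is present; repetition level 0 
starts a new row).

```python
def list_lengths(rep_levels, def_levels, empty_def_level=1):
    """Derive per-row list lengths from Dremel-style levels.

    Assumes one level of nesting: a definition level below
    ``empty_def_level`` is a null list, equal to it is an empty list,
    and above it is one list element. Null lists are reported as None.
    """
    if len(rep_levels) != len(def_levels):
        raise ValueError("level sequences must have the same length")

    lengths = []
    for rep, d in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 marks the start of a new row.
            if d < empty_def_level:
                lengths.append(None)   # the list itself is null
            elif d == empty_def_level:
                lengths.append(0)      # the list exists but is empty
            else:
                lengths.append(1)      # first element of a new list
        else:
            if not lengths:
                raise ValueError("first entry must have repetition level 0")
            # A nonzero repetition level continues the current list.
            lengths[-1] += 1
    return lengths


# Rows: [a, b, c], [], null, [d]
print(list_lengths([0, 1, 1, 0, 0, 0], [2, 2, 2, 1, 0, 2]))
```

Note that no value data is touched here, which is why the levels alone are 
enough for the use case described in the issue.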

I think exposing the "values" column as a raw buffer is not something I would 
really like to support: while it is easy to get to a representation that 
users would agree on for numeric types, it is less straightforward for 
string/byte types.   For processing only the repetition and definition 
levels, it would take some refactoring to isolate these components, but there 
still might be a performance win if we simply decode and ignore the values 
buffer (which would in turn allow the use of existing Parquet C++ APIs).   

[~jpivarski], is this something you would like to contribute? I can give you 
some code pointers. 

> Direct (individualized) access to definition levels, repetition levels, and 
> numeric data of a column
> 
>
> Key: ARROW-14770
> URL: https://issues.apache.org/jira/browse/ARROW-14770
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Parquet, Python
> Reporter: Jim Pivarski
> Priority: Minor
>
> It would be useful to have more low-level access to the three components of a 
> Parquet column in Python: the definition levels, the repetition levels, and 
> the numeric data, _individually_.
> The particular use-case we have in Awkward Array is that users will sometimes 
> lazily read an array of lists of structs without reading any of the fields of 
> those structs. To build the data structure, we need the lengths of the lists 
> independently of the columns (which users can then use in functions like 
> {{ak.num}}; the number of structs without their field values is useful 
> information).
> What we're doing right now is reading a column, converting it to Arrow 
> ({{pa.Array}}), and getting the list lengths from that Arrow array. We 
> have been using the schema to try to pick the smallest column (booleans are 
> best!), but that's because we really just want the definition and repetition 
> levels without the numeric data.
> I've heard that the Parquet metadata includes offsets to select just the 
> definition levels, just the repetition levels, or just the numeric data 
> (pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would 
> be ideal.
> Beyond our use case, such a feature could also help with wide structs in 
> lists: all of the non-nullable fields of the struct would share the same 
> definition and repetition levels, so they don't need to be re-read. For that 
> use-case, the ability to pick out definition, repetition, and numeric data 
> separately would still be useful, but the purpose would be to read the 
> numeric data without the structural integers (opposite of ours).
> The desired interface would be like {{ParquetFile.read_row_group}}, but 
> would return one, two, or three {{pa.Buffer}} objects depending on three 
> boolean arguments, {{definition}}, {{repetition}}, and 
> {{numeric}}. The {{pa.Buffer}} would be unpacked, with all run-length 
> encodings and fixed-width encodings converted into integers of at least one 
> byte each. It may make more sense for the output to be {{np.ndarray}}, to 
> carry {{dtype}} information if that can depend on the maximum level (though 
> levels larger than 255 are likely rare!). This information must be available 
> at some level in Arrow's C++ code; the request is to expose it to Python.
> I've labeled this minor because it is for optimizations, but it would be 
> really nice to have!
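
To make the "unpacked" levels in the request concrete, the following is a 
minimal pure-Python sketch of decoding Parquet's RLE/bit-packing hybrid 
encoding into plain integers. It is an illustration only, not Arrow's 
implementation: decode_levels and _read_uvarint are hypothetical names, the 
input is assumed to be the level payload without the 4-byte length prefix 
that V1 pages put in front of it, and bit_width is assumed to be the number 
of bits needed for the column's maximum level.

```python
def _read_uvarint(buf, pos):
    """Read an unsigned LEB128 varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_levels(buf, bit_width, num_values):
    """Decode RLE/bit-packing hybrid data into a list of ints."""
    out = []
    pos = 0
    value_bytes = (bit_width + 7) // 8   # width of an RLE run's repeated value
    mask = (1 << bit_width) - 1
    while len(out) < num_values and pos < len(buf):
        header, pos = _read_uvarint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each,
            # packed starting from the least significant bit of each byte.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            out.extend([value] * count)
    # The last bit-packed group may be padded past num_values.
    return out[:num_values]


# One RLE run (five 2s) followed by one bit-packed group (0..7), bit width 3.
encoded = bytes([0x0A, 0x02, 0x03, 0x88, 0xC6, 0xFA])
print(decode_levels(encoded, bit_width=3, num_values=13))
```

For a maximum level of 255 or less, each decoded value fits in one byte, 
which matches the remark in the description about larger levels being rare.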



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column

2021-11-22 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447408#comment-17447408
 ] 

Joris Van den Bossche commented on ARROW-14770:
---

cc [~emkornfield]



[jira] [Commented] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column

2021-11-18 Thread Martin Durant (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446064#comment-17446064
 ] 

Martin Durant commented on ARROW-14770:
---

Quick comment: the separate file offsets to the three components are 
explicitly given in V2 pages, where only the data portion is compressed. For 
V1, the components are compressed together, and the lengths of the components 
are only known after decompression, although that decompression could be 
streamed.
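
The difference between the two page layouts can be sketched in a few lines 
of Python. This is an illustration under stated assumptions, not code from 
Arrow or fastparquet: split_v2_page and split_v1_page are hypothetical 
helpers, the page header is assumed to be already parsed, rep_len and 
def_len stand for the V2 header fields repetition_levels_byte_length and 
definition_levels_byte_length, and the V1 case assumes RLE-encoded levels, 
each preceded by a 4-byte little-endian length.

```python
import struct


def split_v2_page(body, rep_len, def_len):
    """Slice a V2 page body into (rep levels, def levels, values).

    The two level sections are never compressed, so they can be sliced
    out directly; the returned values section may still be compressed.
    """
    def_end = rep_len + def_len
    if def_end > len(body):
        raise ValueError("level byte lengths exceed page size")
    view = memoryview(body)
    return view[:rep_len], view[rep_len:def_end], view[def_end:]


def split_v1_page(decompressed, has_rep, has_def):
    """Slice an already-decompressed V1 page body into the same three parts.

    has_rep / has_def say whether the column's maximum repetition and
    definition levels are nonzero; a section is absent otherwise.
    """
    view = memoryview(decompressed)
    pos = 0
    parts = []
    for present in (has_rep, has_def):
        if present:
            # Each RLE-encoded level section starts with its byte length.
            (nbytes,) = struct.unpack_from("<I", view, pos)
            pos += 4
            parts.append(view[pos:pos + nbytes])
            pos += nbytes
        else:
            parts.append(view[0:0])
    return parts[0], parts[1], view[pos:]


# Toy bodies: "RR" = repetition bytes, "DDD" = definition bytes, "vvvv" = values.
v2_body = b"RRDDDvvvv"
v1_body = struct.pack("<I", 2) + b"RR" + struct.pack("<I", 3) + b"DDD" + b"vvvv"
print([bytes(p) for p in split_v2_page(v2_body, 2, 3)])
print([bytes(p) for p in split_v1_page(v1_body, True, True)])
```

In the V1 case the whole page has to be decompressed before the first length 
prefix can even be read, which is the cost the comment above points out.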
