[
https://issues.apache.org/jira/browse/ARROW-14383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437532#comment-17437532
]
Weston Pace commented on ARROW-14383:
-------------------------------------
This is a good question as for what is expected via the C data interface.
I can maybe offer some insights as to why you're getting this behavior as I
recently explored some of these cases as part of
https://github.com/apache/arrow/pull/11542:
The C++ impl considers array offsets to be inherited by children (note, I don't
see any description of this in the C data interface docs so we should clarify
intent). This allows for (as you have noticed) slicing parent arrays without
having to mutate child arrays. I also don't know of any requirement that the
child arrays must have the same length as the parent arrays. This is similar
to buffers not being required to have the same length as the containing array.
For most data types this is pretty straightforward to implement. It's a bit
trickier for variable length data types (you need to use the parent array's
offsets AND indices maps to determine the referenced portion of the child
array).
It's a downright pain for dense unions. There is no way to easy way know what
the appropriate offset of the children arrays if you are trying to determine
which portion of the child arrays is referenced by the union without iterating
the type ids array of the union.
The last notable exception is ArrowArray.dictionary. Slicing an arrow array in
C++ does not modify the ArrowArray.dictionary field at all and
ArrowArray.dictionary does not inherit the offsets of the containing array.
> [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
> -----------------------------------------------------------------------
>
> Key: ARROW-14383
> URL: https://issues.apache.org/jira/browse/ARROW-14383
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Reporter: Jorge Leitão
> Priority: Major
>
> I am struggling to roundtrip a sliced StructArray over the c data interface.
> Consider the array:
> {code:python}
> fields = [
> ("f1", pyarrow.int32()),
> ("f2", pyarrow.string()),
> ]
> a = pyarrow.array(
> [
> {"f1": 1, "f2": "a"},
> None,
> {"f1": 3, "f2": None},
> {"f1": None, "f2": "d"},
> {"f1": None, "f2": None},
> ],
> pyarrow.struct(fields),
> ).slice(1, 2)
> {code}
> When reading this array from the c data interface, I get:
> {code:java}
> array: Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 1,
> n_children: 2,
> buffers: 0x00007f61796091c0,
> children: 0x00007f6179609280,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f617aef2ba0,
> ),
> private_data: 0x00007f617960b3c0,
> }
> child #0: Ffi_ArrowArray {
> length: 5,
> null_count: 2,
> offset: 0,
> n_buffers: 2,
> n_children: 0,
> buffers: 0x00007f0f49609200,
> children: 0x0000000000000000,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f0f4aec9ba0,
> ),
> private_data: 0x00007f0f4960b480,
> }
> child #1: Ffi_ArrowArray {
> length: 5,
> null_count: 2,
> offset: 0,
> n_buffers: 3,
> n_children: 0,
> buffers: 0x00007f0f49609240,
> children: 0x0000000000000000,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f0f4aec9ba0,
> ),
> private_data: 0x00007f0f4960b540,
> }
> {code}
> This does not seem consistent with what the Python API offers:
> {code:python}
> print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5? (or
> better, vice-versa)
> {code}
> Secondly and most importantly, the condition that each child's length must
> equal the array's own length is violated (children length is 5, array's
> length is 2 in the example above).
> We could argue that a consumer MUST slice each child to achieve the desired
> behavior, but that won't roundtrip because, when writing the StructArray
> (after consuming it), we would now write
> {code}
> write child: Ffi_ArrowArray {
> length: 2,
> null_count: 0,
> offset: 1,
> n_buffers: 2,
> n_children: 0,
> buffers: 0x00000000021c8b20,
> children: 0x0000000000000008,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x00000000024f0db0,
> }
> write child: Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 3,
> n_children: 0,
> buffers: 0x00000000024998f0,
> children: 0x0000000000000008,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x0000000002499910,
> }
> Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 1,
> n_children: 2,
> buffers: 0x00000000024f12d0,
> children: 0x00000000021c8ae0,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x00000000024999c0,
> }
> {code}
> is consumed as
> {code}
> print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
> print(b.offset, len(b)) # 1 2 <-- OK
> {code}
> which causes the check in [this
> line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115]
> to fail.
> I was unable to find a test for a roundtrip of a sliced struct [in pyarrow
> tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py]
> to compare my test with a reference test, but it seems to me that when we
> slice a StructArray, we should slice its children accordingly so that its C
> data interface yields a consistent result?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)