AlenkaF commented on issue #49740:
URL: https://github.com/apache/arrow/issues/49740#issuecomment-4295607756
Thank you for opening the issue!
I can reproduce the segfault locally; the repro provided is very clear. I
looked into the issue a bit and asked Copilot for some help. What I found is
that the segfault only happens for short strings, while long strings work
fine, for example:
```python
In [2]: long = pa.chunked_array([
...: pa.array([b'a' * 13, b'e' * 13], type=pa.binary()),
...: pa.array([b'a' * 13, b'e' * 13], type=pa.binary()),
...: ]).combine_chunks().cast(pa.binary_view())
In [3]: long._export_to_c(ctypes.addressof(c_array), ctypes.addressof(c_schema))
...: print("Long strings: OK")
...:
Long strings: OK
```
```python
In [4]: short = pa.chunked_array([
...: pa.array([b'a', b'e'], type=pa.binary()),
...: pa.array([b'a', b'e'], type=pa.binary()),
...: ]).combine_chunks().cast(pa.binary_view())
In [5]: short._export_to_c(ctypes.addressof(c_array), ctypes.addressof(c_schema))
...: print("Short strings: OK") # never reached
[1] 18460 segmentation fault ipython
```
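A likely reason the two cases differ: in the Arrow columnar format, a binary view value of 12 bytes or fewer is stored inline in its 16-byte view, while a longer value references a separate data buffer. The sketch below models that layout with the standard library only, to show why 13-byte values need a data buffer and 1-byte values do not. `encode_view` and `all_inline` are hypothetical helper names for illustration, not pyarrow API.

```python
import struct

# Bytes that fit inline in a 16-byte view, per the Arrow columnar format.
INLINE_LIMIT = 12


def encode_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one value into a 16-byte view (hypothetical helper, not pyarrow API)."""
    length = len(value)
    if length <= INLINE_LIMIT:
        # Short value: int32 length, then the bytes zero-padded to 12.
        return struct.pack("<i12s", length, value)
    # Long value: int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def all_inline(values) -> bool:
    """True when no value needs a separate data buffer."""
    return all(len(v) <= INLINE_LIMIT for v in values)


short_values = [b"a", b"e"]
long_values = [b"a" * 13, b"e" * 13]

print(all_inline(short_values))  # True
print(all_inline(long_values))   # False
```

With the short values every entry is inline, which matches the case where the segfault shows up.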
Copilot points to the `scalar_cast_string.cc` implementation as the source of
the bug: the extra data buffer is dropped when `all_entries_are_inline` is
`True`:
https://github.com/apache/arrow/blob/e8b7b4e35e231a0fcdbfa74f6a6b0075108dd5dc/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L465-L467
So this would be a bug in the cast kernel. We might also update the bridge
file
https://github.com/apache/arrow/blob/e8b7b4e35e231a0fcdbfa74f6a6b0075108dd5dc/cpp/src/arrow/c/bridge.cc#L606-L614
to guard against null-pointer variadic buffers.
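To make the suggested guard concrete, here is a rough model in Python (to match the snippets above) rather than the C++ of `bridge.cc`. It relies on the fact that the C Data Interface exports the size of each variadic data buffer for view types; everything else is an assumption. `variadic_buffer_sizes` is a hypothetical name, `None` stands in for a null buffer pointer, and this is not the actual bridge code.

```python
from typing import List, Optional, Sequence


def variadic_buffer_sizes(data_buffers: Sequence[Optional[bytes]]) -> List[int]:
    """Return one size per variadic buffer, treating a missing buffer as empty.

    Hypothetical model of a null-pointer guard; None stands in for a null buffer.
    """
    sizes = []
    for buf in data_buffers:
        # Without this check, len(None) raises TypeError, the Python analogue
        # of dereferencing a null buffer pointer.
        sizes.append(0 if buf is None else len(buf))
    return sizes


print(variadic_buffer_sizes([b"a" * 13 + b"e" * 13, None]))  # [26, 0]
```

Whether a null buffer should be reported as size 0 or rejected with an error is a design choice for the actual fix.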
cc @pitrou