rustyconover commented on issue #36388:
URL: https://github.com/apache/arrow/issues/36388#issuecomment-4364908856
Hit this from a different angle in `pyarrow 24.0.0` — small `N`, large
scalar, where `N * len(scalar)` overflows int32 even though `N` itself is well
below `2**31`. Posting in case it's useful for the fix discussion (in
particular: confirms @AlenkaF's read that the check belongs on `value_size *
N`, not on `N`).
```python
import struct, pyarrow as pa
N, SIZE = 2048, 1 << 20 # 2 GiB total
arr = pa.repeat(pa.scalar("z" * SIZE), N)
# Surface API silently looks fine:
print(len(arr), arr.null_count) # 2048 0
print(arr[0].as_py()[:4], arr[-1].as_py()[:4]) # zzzz zzzz
# Offsets buffer wraps:
print(struct.unpack("<3i", bytes(arr.buffers()[1])[-12:]))
# (2145386496, 2146435072, -2147483648)
arr.validate()
# pyarrow.lib.ArrowInvalid: Negative offsets in binary array
pa.RecordBatch.from_arrays([arr], names=["x"])
# pyarrow.lib.ArrowInvalid: In column 0: Invalid: Negative offsets in binary array
```
Threshold is exactly at `value_size * N >= 2**31`:

| `N` | `value_size` | total bytes | `validate()` |
|---|---|---|---|
| 2047 | 1 MiB | 2 146 435 072 | ✅ |
| 2048 | 1 MiB | 2 147 483 648 | ❌ |
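The wrap itself needs no pyarrow to demonstrate: the final offset is just `N * value_size` truncated to a signed 32-bit integer. A pure-Python sketch (the `to_int32` helper is mine, emulating C int32 wraparound via `struct`) reproduces the exact offset values shown above:

```python
import struct

def to_int32(x: int) -> int:
    # Emulate C int32 wraparound: truncate to 32 bits, reinterpret as signed.
    return struct.unpack("<i", struct.pack("<I", x & 0xFFFFFFFF))[0]

SIZE = 1 << 20  # 1 MiB per value, as in the repro
for n in (2046, 2047, 2048):
    print(n, to_int32(n * SIZE))
# 2046 2145386496
# 2047 2146435072
# 2048 -2147483648
```

The last three values match the tail of the offsets buffer printed by the repro.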
Two characteristics worth flagging:
1. **Failure point is far from the cause.** `pa.repeat` returns
"successfully", and `len()` / `__getitem__` work because the element accessor
uses wider arithmetic. The corruption only surfaces at `validate()` (and any
consumer that calls it — `RecordBatch.from_arrays`, IPC writers, etc.). In the
wild this manifests as `ArrowInvalid: In column 0: Invalid: Negative offsets in
binary array` from `RecordBatch.from_arrays` with no hint that `pa.repeat` was
the actual culprit. Hit this in a real consumer:
<https://github.com/duckdb/duckdb-python/...> (DuckDB→PyArrow conversion
fixture in [VGI](https://github.com/query-farm/vgi-python)).
2. **An overflow-on-the-product check (per @AlenkaF's sketch in
`RepeatedArrayFactory`) catches both this case and the original `pa.repeat("?",
2**31)` case** — bounds-checking just `N` would miss this one.
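For concreteness, here is what a product-based guard could look like, sketched in Python; the actual fix would live in the C++ `RepeatedArrayFactory`, and the function name and error wording here are illustrative only:

```python
INT32_MAX = 2**31 - 1

def check_repeat_capacity(value_size: int, n: int) -> None:
    # Hypothetical guard mirroring the proposed check: reject when the final
    # offset (value_size * n) cannot be represented as a non-negative int32.
    # Python ints are arbitrary-precision, so the product itself cannot overflow here.
    if value_size * n > INT32_MAX:
        raise OverflowError(
            f"{n} values of {value_size} bytes need int64 offsets; "
            "use large_string/large_binary"
        )
```

This rejects both the `N=2048, value_size=1 MiB` case above and the original `pa.repeat("?", 2**31)` case (`value_size=1, n=2**31`), while a check on `N` alone would only catch the latter.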
Workaround for callers that know the data shape: build with
`pa.large_string()` (int64 offsets) explicitly, e.g.
```python
# int64 offsets cannot wrap here; note this materializes the list in Python memory
pa.array([scalar.as_py()] * N, type=pa.large_string())
```