rustyconover commented on issue #36388:
URL: https://github.com/apache/arrow/issues/36388#issuecomment-4364908856

   Hit this from a different angle in `pyarrow 24.0.0` — small `N`, large 
scalar, where `N * len(scalar)` overflows int32 even though `N` itself is well 
below `2**31`. Posting in case it's useful for the fix discussion (in 
particular: confirms @AlenkaF's read that the check belongs on `value_size * 
N`, not on `N`).
   
   ```python
   import struct
   import pyarrow as pa
   N, SIZE = 2048, 1 << 20                  # 2 GiB total
   arr = pa.repeat(pa.scalar("z" * SIZE), N)
   
   # Surface API silently looks fine:
   print(len(arr), arr.null_count)          # 2048 0
   print(arr[0].as_py()[:4], arr[-1].as_py()[:4])  # zzzz zzzz
   
   # Offsets buffer wraps:
   print(struct.unpack("<3i", bytes(arr.buffers()[1])[-12:]))
   # (2145386496, 2146435072, -2147483648)
   
   arr.validate()
   # pyarrow.lib.ArrowInvalid: Negative offsets in binary array
   
   pa.RecordBatch.from_arrays([arr], names=["x"])
   # pyarrow.lib.ArrowInvalid: In column 0: Invalid: Negative offsets in binary array
   ```
   
   Threshold is exactly at `value_size * N >= 2**31`:
   
   | `N` | `value_size` | total | `validate()` |
   |---|---|---|---|
   | 2047 | 1 MiB | 2 146 435 072 | ✅ |
   | 2048 | 1 MiB | 2 147 483 648 | ❌ |
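
The wrap at that boundary can be reproduced without pyarrow. The sketch below is pure Python and only mimics the arithmetic: it assumes each offset is `i * value_size` truncated to 32 bits, and `wrap_int32` / `simulated_offsets_tail` are hypothetical helper names, not pyarrow APIs.

```python
import struct


def wrap_int32(x: int) -> int:
    """Truncate x to 32 bits and reinterpret it as a signed int32."""
    return struct.unpack("<i", struct.pack("<I", x & 0xFFFFFFFF))[0]


def simulated_offsets_tail(n: int, value_size: int, k: int = 3) -> tuple:
    """Last k of the n + 1 offsets as they would look in an int32 buffer."""
    return tuple(wrap_int32(i * value_size) for i in range(n + 1 - k, n + 1))


MiB = 1 << 20
print(simulated_offsets_tail(2047, MiB))  # all offsets still positive
print(simulated_offsets_tail(2048, MiB))  # final offset wraps negative
```

For `N = 2048` the simulated tail matches the three values read from the real offsets buffer in the repro above.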
   
   Two characteristics worth flagging:
   
   1. **Failure point is far from the cause.** `pa.repeat` returns 
"successfully", and `len()` / `__getitem__` work because the element accessor 
uses wider arithmetic. The corruption only surfaces at `validate()` (and any 
consumer that calls it — `RecordBatch.from_arrays`, IPC writers, etc.). In the 
wild this manifests as `ArrowInvalid: In column 0: Invalid: Negative offsets in 
binary array` from `RecordBatch.from_arrays` with no hint that `pa.repeat` was 
the actual culprit. Hit this in a real consumer: 
<https://github.com/duckdb/duckdb-python/...> (DuckDB→PyArrow conversion 
fixture in [VGI](https://github.com/query-farm/vgi-python)).
   
   2. **An overflow-on-the-product check (per @AlenkaF's sketch in 
`RepeatedArrayFactory`) catches both this case and the original `pa.repeat("?", 
2**31)` case** — bounds-checking just `N` would miss this one.
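
Until a fix lands, a caller-side version of the product check could look roughly like this. It is a sketch only, not the C++ change in `RepeatedArrayFactory`: `check_repeat_fits_int32` is a hypothetical name, and it assumes `value_size` is the byte length of the scalar's encoded value.

```python
INT32_MAX = 2**31 - 1


def check_repeat_fits_int32(value_size: int, n: int) -> None:
    """Raise if n copies of a value_size-byte value overflow int32 offsets."""
    if value_size < 0 or n < 0:
        raise ValueError("value_size and n must be non-negative")
    total = value_size * n  # Python ints are arbitrary precision, so no wrap here
    if total > INT32_MAX:
        raise OverflowError(
            f"{n} x {value_size} bytes = {total} exceeds the int32 offset "
            f"limit {INT32_MAX}; consider a large_* type"
        )


check_repeat_fits_int32(1 << 20, 2047)  # passes: product is below 2**31
```

Calling it with `(1 << 20, 2048)` or `(1, 2**31)` raises `OverflowError`, so the same check covers both the small-`N` case here and the original large-`N` report.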
   
   Workaround for callers that know the data shape: build with 
`pa.large_string()` (int64 offsets) explicitly, e.g.
   
   ```python
   pa.array([scalar.as_py()] * N, type=pa.large_string())
   ```
   

