jding-xyz opened a new pull request, #152:
URL: https://github.com/apache/arrow-swift/pull/152
## Problem

`VariableBufferBuilder<T>` (used by both `StringArrayBuilder` and `BinaryArrayBuilder`) has several correctness issues that can cause memory corruption and produce Arrow-invalid output:

1. **First-value buffer overflow.** The `append()` method only checks values buffer capacity inside `if index > 0`, so the first non-nil value can overflow the initial 64-byte allocation if it is larger than that. This corrupts memory silently.
2. **Null slots write phantom bytes.** Null entries create a 4-byte zero placeholder and copy it into the values buffer. Per the Arrow columnar format, a null variable-width slot should repeat the previous offset and append no value bytes.
3. **Offsets buffer is sized for N entries instead of N+1.** The Arrow spec requires N+1 offsets for N rows. The old `finish()` creates an offsets buffer with `length` entries, and the reader likewise expects `length` offsets, so the offset buffer is off by one.
4. **Logical buffer lengths are never updated.** `values.length` and `offsets.length` stay at their initial value (0) after construction. `finish()` then uses these stale lengths; in particular, the values buffer is created with `length = 0` regardless of actual content.
5. **Resize checks the wrong metric.** `value_resize()` compares against `values.length` (the stale logical length) instead of `values.capacity` (the actual allocation).

These issues were discovered while using `StringArrayBuilder` in a production IPC-based data pipeline, where the first string value in a column was typically longer than 64 bytes, causing silent memory corruption during batch construction.

## Fix

Rewrote `VariableBufferBuilder<T>.append(_:)` around explicit Arrow invariants rather than patching individual branches:

- Read `currentOffset` from the offsets buffer uniformly for all indices (including row 0), instead of special-casing the first row.
- For non-null values: compute `nextOffset = currentOffset + byteCount`, ensure values buffer capacity, then copy the bytes.
- For null values: keep `nextOffset = currentOffset` and write no bytes.
- Store the next offset, then update `self.length`, `nulls.length`, `offsets.length`, and `values.length` consistently after every append.
- Initialize the offsets buffer with a single `Int32(0)` entry at construction (the required first offset).

Updated the corresponding resize and finish methods:

- `value_resize()` now checks `capacity` instead of `length`.
- `resize()` decouples null bitmap and offset buffer growth, checks against `capacity`, and preserves logical lengths across reallocation.
- `finish()` creates the offsets buffer with `length + 1` entries.

Propagated the N+1 offset convention to the reader and C data importer:

- `ArrowReader`: offset buffer length is now `node.length + 1`; values buffer length is derived from the last offset entry.
- `ArrowCImporter`: offset buffer length changed from `length` to `length + 1`.
- `ArrowData` convenience init: for variable-width types, derives array length as `offsets.length - 1`.

## Testing

Added targeted regression tests with raw-buffer assertions (not just same-library round-trip checks):

- **Long first value**: 256-byte first string; verifies offsets = `[0, 256]` and values buffer length = 256.
- **Null does not advance offsets**: `["a", nil, "bbb"]`; verifies offsets = `[0, 1, 1, 4]` and values buffer length = 4.
- **Null first, then long value**: `[nil, 512-byte string]`; verifies offsets = `[0, 0, 512]`.
- **Binary parity**: same null-handling test for `BinaryArrayBuilder`.
- **IPC round-trip**: `[256-byte string, nil, "tail"]` through `writeStreaming` / `readStreaming`, verifying schema, row count, decoded values, and null handling.

All existing tests continue to pass. The 5 pre-existing failures on `main` (missing generated test data files) are unrelated.
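For reviewers less familiar with the Arrow variable-width layout, the invariants the fix enforces (N+1 offsets with a leading 0, nulls repeating the previous offset and contributing no value bytes) can be checked with a short language-neutral simulation. This is a Python sketch of the layout rule itself, not the Swift implementation; the function name `build_varbinary` is illustrative only:

```python
def build_varbinary(values):
    """Simulate the Arrow variable-width layout for a list of optional strings.

    Returns (offsets, data, validity) where offsets has N+1 entries for N rows.
    """
    offsets = [0]          # the required first offset
    data = bytearray()     # concatenated value bytes
    validity = []          # 1 = valid, 0 = null
    for v in values:
        if v is None:
            validity.append(0)
            offsets.append(offsets[-1])  # null: repeat previous offset, no bytes
        else:
            b = v.encode("utf-8")
            validity.append(1)
            data += b
            offsets.append(offsets[-1] + len(b))  # advance by byte count
    return offsets, bytes(data), validity

# Mirrors the "null does not advance offsets" regression test in this PR:
offsets, data, validity = build_varbinary(["a", None, "bbb"])
# offsets == [0, 1, 1, 4], data == b"abbb", validity == [1, 0, 1]
```

The values buffer length for N rows is always `offsets[N]`, which is why the reader can derive it from the last offset entry.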
## AI disclosure

This fix was developed with AI assistance (Cursor with Claude). The bug was identified through real-world usage in a project that uses Arrow IPC for data serialization: string columns with long first values caused memory corruption during batch construction. The root cause was then traced by manual inspection of `VariableBufferBuilder.append()` against the Arrow columnar format spec, and documented in a detailed plan before any code was written. All generated code was reviewed line by line, traced through concrete examples, and validated with `swift test`. The approach of writing the regression tests first, then fixing the implementation, was deliberate, so that the tests can catch regressions independently of the fix.

Made with [Cursor](https://cursor.com)
