maltzsama opened a new pull request, #144:
URL: https://github.com/apache/arrow-swift/pull/144

   Closes #106
   
   ## What
   
   `alignTo64` was adding 8 extra bytes to every buffer allocation, regardless
   of whether the length was already aligned:
   
   ```swift
   // before
   private static func alignTo64(_ length: UInt) -> UInt {
       let bufAlignment = length % 64
       if bufAlignment != 0 {
           return length + (64 - bufAlignment) + 8  // +8 has no spec basis
       }
       return length + 8  // +8 even when already aligned
   }
   
   // after
   private static func alignTo64(_ length: UInt) -> UInt {
       let bufAlignment = length % 64
       if bufAlignment != 0 {
           return length + (64 - bufAlignment)
       }
       return length
   }
   ```
   
   ## Why
   
   The Arrow columnar format specification requires buffer alignment to 64 
bytes.
   There is no provision in the spec for additional padding beyond that 
alignment
   boundary. The extra 8 bytes were silently inflating every buffer allocation —
   both the allocated size and the `capacity` field used by `append(to:)` when
   writing IPC messages.
   
   The extra bytes are zeros (buffers are zero-initialized), so they have no
   effect on correctness when reading IPC streams — readers skip padding between
   messages at the IPC layer, not at the buffer layer. But they do produce
   non-conformant buffer sizes that diverge from what other Arrow 
implementations
   produce for the same input, which can surface as subtle interop issues when
   exchanging data via the C Data Interface with Go or other runtimes.
   
   The most likely origin of this `+8` is a confusion with the IPC message 
framing
   format, which uses a 4-byte continuation indicator followed by a 4-byte 
metadata
   length prefix before each message body. That framing is the responsibility of
   `ArrowWriter`, not `ArrowBuffer`.
   
   ## Impact
   
   Every `ArrowBuffer` created via `createBuffer` was 8 bytes larger than
   necessary. In workloads with many columns or high message throughput — 
exactly
   the scenarios Arrow is designed for — this overhead accumulates across every
   batch. The fix brings allocation sizes into conformance with the spec.
   
   No existing tests assert on `capacity` values directly, so no test changes 
are
   required. The behavior of all existing tests is unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to