trxcllnt commented on code in PR #438:
URL: https://github.com/apache/arrow-js/pull/438#discussion_r3357916044


##########
src/visitor/vectorloader.ts:
##########
@@ -167,6 +170,22 @@ export class VectorLoader extends Visitor {
         return nullCount > 0 && this.readData(type, buffer) || new 
Uint8Array(0);
     }
     protected readOffsets<T extends DataType>(type: T, buffer?: BufferRegion) 
{ return this.readData(type, buffer); }
+    // Large* types carry int64 offsets. Downstream code narrows each offset 
to a JS
+    // number when indexing buffers, which is lossless for any offset that 
indexes a
+    // buffer the runtime can actually allocate — so the common case is 
returned as a
+    // zero-copy view over the wire bytes, untouched. The exception is a 
sliced array
+    // serialized with absolute (non-rebased) offsets, whose values can exceed
+    // Number.MAX_SAFE_INTEGER even when the referenced span is small; only 
then do we
+    // rebase to 0 (the one case that requires a copy) so the offsets stay 
narrowable.
+    protected readLargeOffsets<T extends DataType>(type: T, buffer?: 
BufferRegion) {
+        const offsets = this.readOffsets(type, buffer);
+        const wide: BigInt64Array = toBigInt64Array(offsets);
+        if (wide.length === 0 || wide.at(-1)! <= 
BigInt(Number.MAX_SAFE_INTEGER)) {
+            return offsets;
+        }
+        const base = wide[0];
+        return wide.map((value) => value - base);
+    }

Review Comment:
   The format spec for variable-sized layouts 
[states](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout):
   > Generally the first slot in the offsets array is 0, and the last slot is 
the length of the values array. When serializing this layout, we recommend 
normalizing the offsets to start at 0.
   
   While this sounds permissive, I don't know of any language implementations 
that don't strictly enforce this.
   
   There are many possible ways to construct non-spec-conforming IPC streams, 
but it either isn't practical to attempt to catch them all, or we assume the 
user knows and has a very good reason for doing what they're doing.
   
   For example, we also don't validate whether the offsets buffer for List, 
Utf8, or Binary vectors start with offset 0, because we assume the IPC stream 
is from a conforming implementation.
   
   There are a few instances where I've taken advantage of ambiguities or UB in 
the spec to achieve specific results when working with certain technologies. 
For example, I've previously implemented a custom in-memory layout with large 
contiguous buffers on GPUs that included the IPC metadata for reading the 
contiguous buffers in chunks. I intentionally didn't rebase valueOffsets in 
that implementation, because the consumer mapped the valueOffsets to VBOs for 
GPU shaders. While this is technically UB in the spec, the CPU IPC streaming 
code shouldn't prohibit doing this.
   
   I think we should remove this function and revert to calling `readOffsets()` 
like before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to