This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-js.git


The following commit(s) were added to refs/heads/main by this push:
     new 25915cd  fix: Respect typed-array byteOffset in bigNumsToStrings (#439)
25915cd is described below

commit 25915cd9e79da52be3a476ceb87157aaeb9c618a
Author: Karakatiza666 <[email protected]>
AuthorDate: Tue Jun 9 04:48:51 2026 +0400

    fix: Respect typed-array byteOffset in bigNumsToStrings (#439)
    
    This PR was co-authored with [Claude
    Code](https://claude.com/claude-code).
    
    ---
    
    ## Fix incorrect 64-bit JSON IPC output for tables loaded from binary
    Arrow
    
    ### The bug
    When the JSON IPC writer serializes a column whose values are stored as
    64-bit words (offsets or data), it reads the buffer's underlying
    ArrayBuffer directly without honoring the typed-array's byteOffset /
    byteLength. If the column happens to be a view into a larger shared
    buffer rather than a freshly-allocated array, the writer reads from the
    wrong window of memory and emits garbage for the affected DATA / OFFSET
    strings.
    
    ### When it actually triggers
    Only on the flow "read data in binary Arrow IPC, write data in JSON
    IPC." That's the only public path that produces vectors backed by offset
    views — the binary IPC reader carves every buffer out of a single shared
    byte array, so anything loaded from a file, stream, or another Arrow
    implementation arrives at the JSON writer in exactly the at-risk shape.
    
    Tables built directly in JS memory (e.g. via the existing JSON
    round-trip tests, or tableFromArrays) are not affected, which is why the
    bug must have gone unnoticed: the RecordBatchJSONWriter test suite
    synthesizes its inputs in memory and never sees the offset-view state.
    The flow is mostly used in cross-language integration testing, not by
    typical end users — which may be the second reason it lingered.
    
    ### What's affected
    Any column whose values are 64-bit words processed through the writer's
    bignum helper:
    
    - Int64, Uint64
    - DateMillisecond
    - All four Timestamp units
    - TimeMicrosecond, TimeNanosecond
    - All four Duration units
    - Decimal128
    - LargeUtf8 and LargeBinary (their OFFSET arrays)
    
    Fixed types unaffected: 32-bit ints/floats, DateDay,
    TimeSecond/Millisecond, regular Utf8/Binary, anything that doesn't go
    through the 64-bit serialization path.
    
    ### Test coverage
    
    A new `test/unit/ipc/writer/json-offset-views-tests.ts` adds 16 focused
    unit tests — one per type whose JSON serialization goes through the
    buggy 64-bit helper. Each test takes a freshly-generated vector from the
    shared generate-test-data helpers, rewraps every backing typed array at
    a nonzero byteOffset, and asserts the JSON writer produces the same
    output for the rewrapped vector as for the original. The list of covered
    types is a flat table at the top of the file; **adding a future affected
    type is a one-line addition**.
    
    The existing `RecordBatchJSONWriter` suite in
    `test/unit/ipc/writer/json-writer-tests.ts` was also extended: it now
    routes its generated source tables through binary IPC before the JSON
    round-trip, so the JSON writer sees the offset-view typed arrays that
    real-world inputs always carry. This widens end-to-end coverage to every
    type the random/dictionary table generators produce, on the user-facing
    API path.
    
    The two layers are complementary:
    
    The focused unit tests pinpoint exactly which type fails on regression
    and stay sound even if the binary IPC reader's memory layout ever
    changes — they construct offset views directly, without depending on
    `tableFromIPC`.
    The extended round-trip suite catches the same regression on a much
    broader type matrix through the actual public flow a user would use.
    With the fix reverted, all 16 focused tests fail and over 100 cases in
    the extended round-trip suite fail. With the fix applied, all tests
    across 46 suites pass.
    
    The new test file is not strictly necessary right now, but it makes
    catching a regression more robust.
---
 src/visitor/jsonvectorassembler.ts               |   4 +-
 test/unit/ipc/writer/bignums-to-strings-tests.ts | 108 +++++++++++++++++++++
 test/unit/ipc/writer/json-offset-views-tests.ts  | 116 +++++++++++++++++++++++
 test/unit/ipc/writer/json-writer-tests.ts        |  14 ++-
 4 files changed, 236 insertions(+), 6 deletions(-)

diff --git a/src/visitor/jsonvectorassembler.ts 
b/src/visitor/jsonvectorassembler.ts
index f89dbdd..fc0ddc6 100644
--- a/src/visitor/jsonvectorassembler.ts
+++ b/src/visitor/jsonvectorassembler.ts
@@ -207,8 +207,8 @@ function* binaryToString(vector: Vector<Binary> | 
Vector<LargeBinary> | Vector<F
 }
 
 /** @ignore */
-function* bigNumsToStrings(values: BigUint64Array | BigInt64Array | 
Uint32Array | Int32Array | IntArray, stride: number) {
-    const u32s = new Uint32Array(values.buffer);
+export function* bigNumsToStrings(values: BigUint64Array | BigInt64Array | 
Uint32Array | Int32Array | IntArray, stride: number) {
+    const u32s = new Uint32Array(values.buffer, values.byteOffset, 
values.byteLength / Uint32Array.BYTES_PER_ELEMENT);
     for (let i = -1, n = u32s.length / stride; ++i < n;) {
         yield `${BN.new(u32s.subarray((i + 0) * stride, (i + 1) * stride), 
false)}`;
     }
diff --git a/test/unit/ipc/writer/bignums-to-strings-tests.ts 
b/test/unit/ipc/writer/bignums-to-strings-tests.ts
new file mode 100644
index 0000000..73ae154
--- /dev/null
+++ b/test/unit/ipc/writer/bignums-to-strings-tests.ts
@@ -0,0 +1,108 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Focused unit tests for bigNumsToStrings(), the helper every 64-/128-bit
+// JSON callsite (Int64, Uint64, Date, Timestamp, Time, Duration, Decimal,
+// LargeUtf8/Binary/List offsets) routes through. The end-to-end matrix in
+// json-offset-views-tests.ts exercises it through JSONVectorAssembler; these
+// tests pin the helper's contract directly:
+//
+//   - it reinterprets the backing buffer as little-endian u32 words and groups
+//     `stride` words per emitted value,
+//   - it always formats UNSIGNED (BN.new(words, false)), so a signed input's
+//     negative values come out as their two's-complement 64-bit magnitude,
+//   - it honours values.byteOffset/byteLength rather than the whole buffer
+//     (the regression fixed in "Respect typed-array byteOffset in
+//     bigNumsToStrings"): reverting that fix breaks the offset-view cases 
here.
+
+import { bigNumsToStrings } from 'apache-arrow/visitor/jsonvectorassembler';
+
+// JSONVectorAssembler's module is an internal (non-public) deep import. UMD /
+// single-bundle build targets collapse every `apache-arrow/*` path onto the
+// public bundle, which does not re-export this helper. Skip there; the same
+// behaviour is covered via the public API in json-writer-tests.ts.
+const describeHelper = bigNumsToStrings ? describe : describe.skip;
+if (!bigNumsToStrings) {
+    console.warn('Skipping "bigNumsToStrings" unit test under UMD build.');
+}
+
+/**
+ * Place `arr`'s contents at a nonzero byteOffset inside a larger buffer and
+ * return a subarray view over them — the memory layout a typed array has after
+ * coming out of tableFromIPC or a slice.
+ */
+function atOffset<T extends { length: number; set(a: any, o: number): void; 
subarray(a: number, b: number): T }>(
+    Ctor: new (n: number) => T,
+    values: ArrayLike<any>,
+): T {
+    const pad = 3;
+    const backing = new Ctor(values.length + pad * 2);
+    backing.set(values as any, pad);
+    return backing.subarray(pad, pad + values.length);
+}
+
+describeHelper('bigNumsToStrings', () => {
+    test('formats unsigned 64-bit values (stride 2)', () => {
+        const values = [0n, 1n, 42n, 0xFFFFFFFFn, 0x1_0000_0000n, 
1234567890123456789n, 0xFFFFFFFFFFFFFFFFn];
+        const arr = BigUint64Array.from(values);
+        expect([...bigNumsToStrings(arr, 2)]).toEqual(values.map(String));
+    });
+
+    test('formats signed 64-bit values as their unsigned two\'s-complement 
(stride 2)', () => {
+        const values = [-1n, -2n, -1234567890n, 5n];
+        const arr = BigInt64Array.from(values);
+        const expected = values.map((v) => String(BigInt.asUintN(64, v)));
+        expect([...bigNumsToStrings(arr, 2)]).toEqual(expected);
+        // sanity: -1 is the all-ones 64-bit word
+        expect(expected[0]).toBe('18446744073709551615');
+    });
+
+    test('formats unsigned 128-bit values (stride 4)', () => {
+        // little-endian u32 words: value = w0 + w1*2^32 + w2*2^64 + w3*2^96
+        const one = [1, 0, 0, 0];
+        const twoTo96 = [0, 0, 0, 1];
+        const arr = Uint32Array.from([...one, ...twoTo96]);
+        expect([...bigNumsToStrings(arr, 4)]).toEqual([
+            '1',
+            String(2n ** 96n),
+        ]);
+    });
+
+    test('yields one string per group of `stride` words', () => {
+        const arr = BigUint64Array.from([10n, 20n, 30n]);
+        expect([...bigNumsToStrings(arr, 2)]).toHaveLength(3);
+    });
+
+    test('honours byteOffset for 64-bit views (regression)', () => {
+        const values = [0n, 1n, 1234567890123456789n, 0xFFFFFFFFFFFFFFFFn];
+        const view = atOffset(BigUint64Array, values);
+
+        // precondition: the view really starts past the buffer origin
+        expect(view.byteOffset).toBeGreaterThan(0);
+
+        expect([...bigNumsToStrings(view, 2)]).toEqual(values.map(String));
+    });
+
+    test('offset view and fresh array agree for signed 64-bit', () => {
+        const values = [-1n, 9223372036854775807n, -9223372036854775808n, 0n];
+        const fresh = BigInt64Array.from(values);
+        const view = atOffset(BigInt64Array, values);
+
+        expect(view.byteOffset).toBeGreaterThan(0);
+        expect([...bigNumsToStrings(view, 
2)]).toEqual([...bigNumsToStrings(fresh, 2)]);
+    });
+});
diff --git a/test/unit/ipc/writer/json-offset-views-tests.ts 
b/test/unit/ipc/writer/json-offset-views-tests.ts
new file mode 100644
index 0000000..b368f5d
--- /dev/null
+++ b/test/unit/ipc/writer/json-offset-views-tests.ts
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// JSONVectorAssembler must serialize a vector the same way regardless of where
+// its backing typed arrays sit in their buffers. For each affected type we 
take
+// a freshly-generated Vector (byteOffset === 0), rewrap every backing typed
+// array in a subarray view at byteOffset > 0, and assert both serialize to
+// identical JSON.
+
+import {
+    Data, DataType, Field, RecordBatch, Schema, Struct,
+} from 'apache-arrow';
+import { JSONVectorAssembler } from 'apache-arrow/visitor/jsonvectorassembler';
+import * as generate from '../../../generate-test-data.js';
+
+/**
+ * Return a copy of `arr` whose data lives at a nonzero byteOffset inside a
+ * larger underlying buffer. Equivalent to the state of a typed array that
+ * came out of `tableFromIPC` — same logical contents, shifted memory layout.
+ */
+function padBuffer<T extends ArrayBufferView>(arr: T | undefined): T | 
undefined {
+    if (!arr) return arr;
+    const Ctor = arr.constructor as new (n: number) => any;
+    const padding = 4;
+    const padded = new Ctor((arr as any).length + padding * 2);
+    padded.set(arr, padding);
+    return padded.subarray(padding, padding + (arr as any).length) as T;
+}
+
+/**
+ * Clone a Data so every backing typed array sits at byteOffset > 0.
+ * Recurses into children so nested types stay consistent.
+ */
+function withOffsetViews<T extends DataType>(d: Data<T>): Data<T> {
+    const newBuffers = Array.from(d.buffers as unknown as 
Array<ArrayBufferView | undefined>, padBuffer);
+    const newChildren = d.children.map((c) => withOffsetViews(c));
+    return d.clone(d.type, d.offset, d.length, d.nullCount, newBuffers as any, 
newChildren);
+}
+
+/**
+ * Wrap a Data in a single-field RecordBatch so we can hand it to assemble().
+ */
+function batchOf<T extends DataType>(type: T, data: Data<T>): RecordBatch<any> 
{
+    const field = new Field('x', type, true);
+    const schema = new Schema([field]);
+    const structType = new Struct([field]);
+    const root = data.clone(structType as any, 0, data.length, 0, [undefined, 
undefined, undefined, undefined] as any, [data]);
+    return new RecordBatch(schema, root as Data<Struct>);
+}
+
+// Every affected callsite of bigNumsToStrings, one entry per type. Adding
+// a new affected type is a one-line addition here.
+const affectedTypeCases: Array<{ name: string; gen: () => 
generate.GeneratedVector<any> }> = [
+    { name: 'Int64', gen: () => generate.int64() },
+    { name: 'Uint64', gen: () => generate.uint64() },
+    { name: 'DateMillisecond', gen: () => generate.dateMillisecond() },
+    { name: 'TimestampSecond', gen: () => generate.timestampSecond() },
+    { name: 'TimestampMillisecond', gen: () => generate.timestampMillisecond() 
},
+    { name: 'TimestampMicrosecond', gen: () => generate.timestampMicrosecond() 
},
+    { name: 'TimestampNanosecond', gen: () => generate.timestampNanosecond() },
+    { name: 'TimeMicrosecond', gen: () => generate.timeMicrosecond() },
+    { name: 'TimeNanosecond', gen: () => generate.timeNanosecond() },
+    { name: 'DurationSecond', gen: () => generate.durationSecond() },
+    { name: 'DurationMillisecond', gen: () => generate.durationMillisecond() },
+    { name: 'DurationMicrosecond', gen: () => generate.durationMicrosecond() },
+    { name: 'DurationNanosecond', gen: () => generate.durationNanosecond() },
+    { name: 'Decimal', gen: () => generate.decimal() },
+    { name: 'LargeUtf8', gen: () => generate.largeUtf8() },
+    { name: 'LargeBinary', gen: () => generate.largeBinary() },
+];
+
+// JSONVectorAssembler is an internal (non-public) export. Single-bundle build
+// targets such as the UMD bundle collapse every `apache-arrow/*` deep import 
to
+// the public bundle, which doesn't re-export it.
+// So, skip the test on UMD builds (covered there via the public API in
+// json-writer-tests.ts).
+if (!JSONVectorAssembler) {
+    console.warn(
+        'Skipping "JSONVectorAssembler offset-view safety" test under UMD 
build.'
+    );
+}
+const describeAssembler = JSONVectorAssembler ? describe : describe.skip;
+
+describeAssembler('JSONVectorAssembler offset-view safety', () => {
+    for (const { name, gen } of affectedTypeCases) {
+        test(name, () => {
+            const { vector } = gen();
+            const fresh = vector.data[0];
+            const viewed = withOffsetViews(fresh);
+
+            // Sanity: the rewrap really did shift at least one buffer
+            // past byteOffset 0, so this run actually exercises the bug
+            // precondition.
+            const buffers = Array.from(viewed.buffers as unknown as 
Array<ArrayBufferView | undefined>);
+            expect(buffers.some((b) => b && b.byteOffset > 0)).toBe(true);
+
+            const [[jsonFresh]] = 
JSONVectorAssembler.assemble(batchOf(fresh.type, fresh));
+            const [[jsonViewed]] = 
JSONVectorAssembler.assemble(batchOf(fresh.type, viewed));
+            expect(jsonViewed).toEqual(jsonFresh);
+        });
+    }
+});
diff --git a/test/unit/ipc/writer/json-writer-tests.ts 
b/test/unit/ipc/writer/json-writer-tests.ts
index 29b6651..c7d48ee 100644
--- a/test/unit/ipc/writer/json-writer-tests.ts
+++ b/test/unit/ipc/writer/json-writer-tests.ts
@@ -24,7 +24,9 @@ import {
     ArrowJSONLike,
     RecordBatchJSONWriter,
     RecordBatchReader,
-    Table
+    Table,
+    tableFromIPC,
+    tableToIPC,
 } from 'apache-arrow';
 
 describe('RecordBatchJSONWriter', () => {
@@ -38,12 +40,16 @@ describe('RecordBatchJSONWriter', () => {
 
 function testJSONWriter(table: Table, name: string) {
     describe(`should write the Arrow IPC JSON format (${name})`, () => {
-        test(`Table`, validateTable.bind(0, table));
+        test(`Table`, validateJSONWriter.bind(0, table));
     });
 }
 
-async function validateTable(source: Table) {
-    const writer = RecordBatchJSONWriter.writeAll(source);
+async function validateJSONWriter(source: Table) {
+    // Route through binary IPC so the JSON writer sees realistic typed-array
+    // views with byteOffset > 0 (as `tableFromIPC` produces), not just fresh
+    // allocations from the generator.
+    const loaded = tableFromIPC(tableToIPC(source, 'stream'));
+    const writer = RecordBatchJSONWriter.writeAll(loaded);
     const string = await writer.toString();
     const json = JSON.parse(string) as ArrowJSONLike;
     const result = new Table(RecordBatchReader.from(json));

Reply via email to