[PR] [SPARK-56904][SQL] Fix Int overflow in LongToUnsafeRowMap page size computations [spark]

via GitHub Sat, 16 May 2026 23:08:08 -0700


viirya opened a new pull request, #55929:
URL: https://github.com/apache/spark/pull/55929


   ### What changes were proposed in this pull request?
   
   Fix three sites in `LongToUnsafeRowMap` where a `Long` page-word count is
   multiplied by 8 using `Int` arithmetic. At the upper bound of `1 << 30` long
   words (the explicit cap in `grow`, which corresponds to the 8 GiB ceiling),
   `Int * 8` wraps to 0:
   
     - `LongToUnsafeRowMap.grow`: `val newPage = allocatePage(newNumWords.toInt * 8)`
     - `LongToUnsafeRowMap.read` (deserialization on executors):
       `page = allocatePage(pageLength * 8)` and
       `cursor = pageLength * 8 + page.getBaseOffset`
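
The wraparound can be checked in isolation. The following is a minimal sketch that does not depend on Spark; `IntOverflowDemo` and `numWords` are illustrative names, and the word count is the `1 << 30` cap described above.

```scala
object IntOverflowDemo {
  // Word count at the cap described above: 1 << 30 eight-byte words (8 GiB).
  val numWords: Long = 1L << 30

  // Buggy shape: narrow to Int first, then multiply in 32-bit arithmetic.
  // 2^30 * 8 = 2^33, which is 0 modulo 2^32.
  val bytesAsInt: Int = numWords.toInt * 8

  // Fixed shape: multiply in 64-bit arithmetic.
  val bytesAsLong: Long = numWords * 8L

  def main(args: Array[String]): Unit = {
    println(s"Int arithmetic:  $bytesAsInt")
    println(s"Long arithmetic: $bytesAsLong")
  }
}
```

Only the literal suffix differs between the two expressions, which is why the bug is easy to miss in review.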
   
   When the multiplication overflows to 0, `MemoryConsumer.allocatePage(0)`
   falls through to `TaskMemoryManager.allocatePage(Math.max(pageSize, 0))` and
   returns a default-sized page. Subsequent `append`s keep advancing `cursor`
   past the end of the new page, so
   `Platform.copyMemory(... page.getBaseObject, cursor, ...)` reads and writes
   into adjacent native pages. The process eventually crashes inside the
   SIMD-optimized `StubRoutines::forward_copy_longs` on aarch64 (SEGV_ACCERR
   when the over-read reaches the next mmap page).
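
To make the size mismatch concrete, here is a toy model of the failure path, not Spark code. `PageOverrunModel`, `allocatedBytes`, and `overrunBytes` are hypothetical names, and the 64 MiB default page size is an assumption chosen for illustration; the real default depends on the memory manager configuration.

```scala
object PageOverrunModel {
  // Hypothetical default page size, for illustration only.
  val defaultPageSize: Long = 64L * 1024 * 1024

  // Mirrors the Math.max(pageSize, required) shape quoted above:
  // a request of 0 bytes yields a default-sized page.
  def allocatedBytes(required: Long): Long = math.max(defaultPageSize, required)

  // How many bytes the map believes it owns beyond what was allocated.
  def overrunBytes(numWords: Int): Long = {
    val requested: Long = numWords * 8       // Int multiply; wraps before widening
    val actual = allocatedBytes(requested)   // what the allocator hands back
    val believed = numWords.toLong * 8L      // what later appends assume
    believed - actual
  }
}
```

Under these assumptions, `overrunBytes(1 << 30)` is 8 GiB minus 64 MiB: the range the cursor can walk over without any bounds check stopping it.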
   
   We observed the crash on ARM Graviton; this fix resolves it. The bug is a 
latent heap corruption regardless of architecture.
   
   Fix: use `Long` multiplication (`* 8L`) at all three sites so the multiply 
matches `allocatePage`/`cursor`'s declared `Long` types.
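
The shape of the fix can be sketched as a small helper. `PageSizeFix`, `pageBytes`, and `pageBytesChecked` are hypothetical names that do not appear in the patch; the checked variant is not part of this PR and is shown only to contrast silent wrapping with a loud failure.

```scala
object PageSizeFix {
  // Shape of the fix: the 8L literal promotes the multiply to Long.
  def pageBytes(pageLength: Int): Long = pageLength * 8L

  // Not part of this PR: a checked Int multiply that throws
  // ArithmeticException instead of wrapping.
  def pageBytesChecked(pageLength: Int): Int = Math.multiplyExact(pageLength, 8)
}
```

With `pageLength = 1 << 30`, `pageBytes` returns the full 8 GiB byte count, while `pageBytesChecked` throws rather than returning 0.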
   
   ### Why are the changes needed?
   
   To fix a JVM SEGV in `LongToUnsafeRowMap` triggered when the page reaches 
the 8 GiB cap, observed on ARM Graviton.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing `HashedRelationSuite` tests cover the affected paths. Validated on 
a downstream broadcast-hash-join build on ARM Graviton where the original SEGV 
reproduced; no crash with this fix applied.
   
   The suite that reproduces the crash is internal and hard to port to OSS,
   but the bug is evident from the code.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code



