[
https://issues.apache.org/jira/browse/SPARK-54354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hongze Zhang updated SPARK-54354:
---------------------------------
Description:
This is a bug specifically happening when very large UnsafeHashedRelations are
created for broadcast hash join.
The rationale of the bug is explained as follows:
This code:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
creates a unified memory manager with Long.MaxValue / 2 as the heap memory
size for the new hashed relation. This practice essentially guarantees that the
actual JVM heap is significantly smaller than the unified memory manager's heap
size. Given that we also have the recursive retry policy for this specific
case:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
and that the unified memory manager's heap size is far larger than the real
JVM heap, the program will likely hang for a very long time. In a typical local
test, it could hang for as long as 2 hours.
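
To make the failure mode concrete, below is a small self-contained Scala sketch of the
interaction described above. It is only an illustrative model with hypothetical names
(BroadcastRelationHangSketch, acquireExecutionMemory, allocatePage), not the actual Spark
code: a logical memory budget of Long.MaxValue / 2 is paired with an allocation path that,
on a real JVM OutOfMemoryError, keeps the logical reservation and simply retries.

{code:scala}
// Simplified, self-contained model of the mechanism described above.
// All names are hypothetical; this is NOT the Spark code, just a sketch of
// an oversized logical budget combined with a retry-on-OOM allocation path.
object BroadcastRelationHangSketch {

  // Stand-in for the ad-hoc unified memory manager created for the hashed relation:
  // it believes it owns Long.MaxValue / 2 bytes, far more than any real JVM heap.
  private val logicalLimit: Long = Long.MaxValue / 2
  private var logicallyReserved: Long = 0L

  // Logical acquisition almost always succeeds because the budget is effectively unbounded.
  private def acquireExecutionMemory(bytes: Long): Long = {
    val granted = math.min(bytes, logicalLimit - logicallyReserved)
    logicallyReserved += granted
    granted
  }

  // Shape of the retry described in the second link above: when the physical (JVM)
  // allocation fails, the logical reservation is kept and the allocation is retried,
  // so the loop only ends once the Long.MaxValue / 2 budget has been burned through.
  private def allocatePage(bytes: Long): Array[Byte] = {
    val granted = acquireExecutionMemory(bytes)
    if (granted <= 0L) throw new OutOfMemoryError("logical memory budget exhausted")
    try {
      new Array[Byte](granted.toInt) // real heap allocation; fails once the JVM heap is full
    } catch {
      case _: OutOfMemoryError => allocatePage(bytes) // keep the reservation and retry
    }
  }

  def main(args: Array[String]): Unit = {
    // Keep strong references so the heap genuinely fills up, as a growing hashed relation would.
    val pages = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
    while (true) pages += allocatePage(64L * 1024 * 1024) // 64 MB "pages"
  }
}
{code}

Because each failed retry consumes only one page worth of the practically inexhaustible
logical budget, and each failed JVM allocation is typically preceded by a full GC, the
loop can keep spinning for hours before it finally gives up, which would be consistent
with the roughly 2-hour hang observed in a local test.
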
was:
This is a bug specifically happening when very large UnsafeHashedRelations are
created for broadcast hash join.
The rationale of the bug is explained as follows:
This code:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
creates a unified memory manager with Int.Max as the heap memory size for the
new hashed relation. This practice essentially guarantees that the actual JVM
heap is significantly smaller than the unified memory manager's heap size.
Given that we also have the recursive retry policy for this specific case:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
and that the unified memory manager's heap size is far larger than the real
JVM heap, the program will likely hang for a very long time. In a typical local
test, it could hang for as long as 2 hours.
> Driver / Executor hangs when there's not enough JVM heap memory for broadcast
> hashed relation
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-54354
> URL: https://issues.apache.org/jira/browse/SPARK-54354
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Hongze Zhang
> Priority: Major
>
> This is a bug specifically happening when very large UnsafeHashedRelations
> are created for broadcast hash join.
>
> The rationale of the bug is explained as follows:
>
> This code:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
> creates a unified memory manager with Long.MaxValue / 2 as the heap memory
> size for the new hashed relation. This practice essentially guarantees that
> the actual JVM heap is significantly smaller than the unified memory
> manager's heap size. Given that we also have the recursive retry policy for
> this specific case:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
> and that the unified memory manager's heap size is far larger than the real
> JVM heap, the program will likely hang for a very long time. In a typical
> local test, it could hang for as long as 2 hours.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]