Hongze Zhang created SPARK-54354:
------------------------------------

             Summary: Driver / Executor hangs when there isn't enough JVM 
heap memory for the broadcast hashed relation
                 Key: SPARK-54354
                 URL: https://issues.apache.org/jira/browse/SPARK-54354
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.1
            Reporter: Hongze Zhang


This bug specifically occurs when very large UnsafeHashedRelations are built 
for a broadcast hash join.

 

The rationale of the bug is as follows:

 

This code: 
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
  creates a stand-alone unified memory manager for the new hashed relation, 
declaring a practically unlimited heap memory size. This essentially sets up 
the condition where the actual JVM heap is significantly smaller than the 
heap size the unified memory manager believes it has.
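
For reference, the pattern in question looks roughly like the following 
simplified paraphrase of the linked Scala code (the wrapper function name is 
illustrative, and exact constants and surrounding context may differ):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.memory.{TaskMemoryManager, UnifiedMemoryManager}

// Simplified paraphrase of HashedRelation.scala#L142-L150: when no
// TaskMemoryManager is supplied (e.g. while building the broadcast
// relation), a stand-alone one is created whose declared heap size is
// effectively unbounded, regardless of the real JVM heap (-Xmx).
def memoryManagerFor(taskMemoryManager: TaskMemoryManager): TaskMemoryManager =
  Option(taskMemoryManager).getOrElse {
    new TaskMemoryManager(
      new UnifiedMemoryManager(
        new SparkConf().set("spark.memory.offHeap.enabled", "false"),
        Long.MaxValue,     // declared max heap memory, >> real JVM heap
        Long.MaxValue / 2, // on-heap storage region size
        1),                // numCores
      0)                   // taskAttemptId
  }
{code}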
On top of that, there is a recursive retry policy for this specific case: 
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402]
 Each retry acquires more memory from the unified memory manager's 
bookkeeping, which, with such a large declared heap size, virtually never 
runs out, while the JVM-level allocation fails again and the cycle repeats. 
Instead of failing fast, the program will likely hang for a very long time. 
In a typical local test, it could hang for as long as 2 hours.
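
A hypothetical local reproduction could look like the sketch below; the row 
count, driver heap size, and object names are illustrative assumptions, not 
taken from an actual test:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical reproduction sketch: run with a deliberately small driver
// heap (e.g. --driver-memory 1g) so the real JVM heap is far below what
// the broadcast hashed relation needs.
object Spark54354Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-54354-repro")
      .getOrCreate()

    // Build side large enough that the UnsafeHashedRelation cannot fit in
    // the real JVM heap (the row count here is an assumption).
    val big = spark.range(0L, 100000000L)
      .selectExpr("id AS k", "concat('value_', id) AS v")
    val small = spark.range(0L, 1000L).selectExpr("id AS k")

    // The broadcast hint forces a broadcast hash join with the oversized
    // side as the build side; building its hashed relation should hit the
    // hang described above.
    small.join(broadcast(big), "k").count()
  }
}
{code}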



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
