Hongze Zhang created SPARK-54354:
------------------------------------
Summary: Driver / Executor hangs when there isn't enough JVM heap
memory for the broadcast hashed relation
Key: SPARK-54354
URL: https://issues.apache.org/jira/browse/SPARK-54354
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.1
Reporter: Hongze Zhang
This is a bug that occurs specifically when a very large UnsafeHashedRelation is
created for a broadcast hash join; a hypothetical reproduction sketch follows.
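A minimal reproduction sketch, assuming a build side large enough that the
hashed relation exceeds the real JVM heap (the dataset sizes, column names, and
driver-heap setting here are illustrative assumptions, not a verified repro):

{code:scala}
// Hypothetical repro sketch: force a broadcast build side that is larger
// than the real JVM heap. Run with a deliberately small driver heap, e.g.:
//   spark-shell --driver-memory 2g
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, lit}

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// ~100M rows x ~80 bytes each: far beyond a 2g heap once hashed.
val big = spark.range(0L, 100000000L)
  .select(col("id").as("k"), lit("x" * 64).as("v"))
val probe = spark.range(0L, 1000L).withColumnRenamed("id", "k")

// Building the UnsafeHashedRelation for the broadcast side is where the
// hang described below shows up.
probe.join(broadcast(big), "k").count()
{code}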
The rationale of the bug is as follows:
This code:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
creates a unified memory manager with Int.Max as the heap memory size for the
new hashed relation. This effectively creates a condition where the actual JVM
heap is significantly smaller than the heap size the unified memory manager
believes it has. Given that we also have a recursive retry policy for this
specific case:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402]
and that the unified memory manager's heap size is far too large, the program
will likely hang for a very long time. In a typical local test, it could hang
for as long as 2 hours.
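For illustration, here is a simplified paraphrase of the pattern at the two
links above (not the verbatim Spark source; TaskMemoryManager and
UnifiedMemoryManager are internal private[spark] APIs, so a sketch like this
only compiles inside Spark's own packages):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.memory.{TaskMemoryManager, UnifiedMemoryManager}

// A standalone memory manager is created with a heap size that can vastly
// exceed the real JVM heap (simplified paraphrase of the HashedRelation code):
val mm = new TaskMemoryManager(
  new UnifiedMemoryManager(
    new SparkConf(),
    Int.MaxValue.toLong, // maxHeapMemory: far larger than the actual -Xmx
    0L,                  // onHeapStorageRegionSize
    1),                  // numCores
  0)                     // taskAttemptId

// When an allocation fails with an OutOfMemoryError, the recursive retry at
// the linked TaskMemoryManager line calls allocatePage again, because the
// synthetic manager above still reports plenty of free memory; with a limit
// this far above the real heap, that retry loop can spin for hours.
{code}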