[
https://issues.apache.org/jira/browse/SPARK-54354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hongze Zhang updated SPARK-54354:
---------------------------------
Description:
This is a bug specifically happening when very large UnsafeHashedRelations are
created for broadcast hash join.
The rationale of the bug is explained as follows:
This code:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
creates a unified memory manager with Long.MaxValue / 2 as the heap memory
size for the new hashed relation. This practice essentially guarantees that the
actual JVM heap is significantly smaller than the unified memory manager's heap
size. Given that we also have the recursive retry policy for this specific
case:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
and that the unified memory manager's heap size is far larger than the real
JVM heap, the program will likely hang for a very long time. In a typical local
test, it could hang for as long as 2 hours.
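
To make the failure mode concrete, below is a small self-contained Scala sketch of the
interaction described above. It is only an illustrative model with hypothetical names
(BroadcastRelationHangSketch, acquireExecutionMemory, allocatePage), not the actual Spark
code: a logical memory budget of Long.MaxValue / 2 is paired with an allocation path that,
on a real JVM OutOfMemoryError, keeps the logical reservation and simply retries.

{code:scala}
// Simplified, self-contained model of the mechanism described above.
// All names are hypothetical; this is NOT the Spark code, just a sketch of
// an oversized logical budget combined with a retry-on-OOM allocation path.
object BroadcastRelationHangSketch {

  // Stand-in for the ad-hoc unified memory manager created for the hashed relation:
  // it believes it owns Long.MaxValue / 2 bytes, far more than any real JVM heap.
  private val logicalLimit: Long = Long.MaxValue / 2
  private var logicallyReserved: Long = 0L

  // Logical acquisition almost always succeeds because the budget is effectively unbounded.
  private def acquireExecutionMemory(bytes: Long): Long = {
    val granted = math.min(bytes, logicalLimit - logicallyReserved)
    logicallyReserved += granted
    granted
  }

  // Shape of the retry described in the second link above: when the physical (JVM)
  // allocation fails, the logical reservation is kept and the allocation is retried,
  // so the loop only ends once the Long.MaxValue / 2 budget has been burned through.
  private def allocatePage(bytes: Long): Array[Byte] = {
    val granted = acquireExecutionMemory(bytes)
    if (granted <= 0L) throw new OutOfMemoryError("logical memory budget exhausted")
    try {
      new Array[Byte](granted.toInt) // real heap allocation; fails once the JVM heap is full
    } catch {
      case _: OutOfMemoryError => allocatePage(bytes) // keep the reservation and retry
    }
  }

  def main(args: Array[String]): Unit = {
    // Keep strong references so the heap genuinely fills up, as a growing hashed relation would.
    val pages = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
    while (true) pages += allocatePage(64L * 1024 * 1024) // 64 MB "pages"
  }
}
{code}

Because each failed retry consumes only one page worth of the practically inexhaustible
logical budget, and each failed JVM allocation is typically preceded by a full GC, the
loop can keep spinning for hours before it finally gives up, which would be consistent
with the roughly 2-hour hang observed in a local test.
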
was:
This is a bug specifically happening when very large UnsafeHashedRelations are
created for broadcast hash join.
The rationale of the bug is explained as follows:
This code:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
creates a unified memory manager with Int.Max as the heap memory size for the
new hashed relation. This practice essentially guarantees that the actual JVM
heap is significantly smaller than the unified memory manager's heap size.
Given that we also have the recursive retry policy for this specific case:
[https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
and that the unified memory manager's heap size is far larger than the real
JVM heap, the program will likely hang for a very long time. In a typical local
test, it could hang for as long as 2 hours.
> Driver / Executor hangs when there's not enough JVM heap memory for broadcast
> hashed relation
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-54354
> URL: https://issues.apache.org/jira/browse/SPARK-54354
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Hongze Zhang
> Priority: Major
>
> This is a bug specifically happening when very large UnsafeHashedRelations
> are created for broadcast hash join.
>
> The rationale of the bug is explained as follows:
>
> This code:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
> creates a unified memory manager with Long.MaxValue / 2 as the heap memory
> size for the new hashed relation. This practice essentially guarantees that
> the actual JVM heap is significantly smaller than the unified memory
> manager's heap size. Given that we also have the recursive retry policy for
> this specific case:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402],
> and that the unified memory manager's heap size is far larger than the real
> JVM heap, the program will likely hang for a very long time. In a typical
> local test, it could hang for as long as 2 hours.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]