[
https://issues.apache.org/jira/browse/SPARK-54354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun reassigned SPARK-54354:
-------------------------------------
Assignee: Hongze Zhang
> Driver / Executor hangs when there's not enough JVM heap memory for broadcast
> hashed relation
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-54354
> URL: https://issues.apache.org/jira/browse/SPARK-54354
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.1
> Reporter: Hongze Zhang
> Assignee: Hongze Zhang
> Priority: Major
> Labels: pull-request-available
>
> This bug occurs specifically when very large UnsafeHashedRelations are
> created for a broadcast hash join.
>
> The rationale of the bug is as follows:
>
> This code:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L142-L150]
> creates a temporary unified memory manager with Long.MaxValue / 2 as the
> heap memory size for the new hashed relation. This essentially creates a
> condition where the actual JVM heap is significantly smaller than the heap
> size the unified memory manager believes it has.
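>
> For reference, the fallback at the linked lines looks roughly like the
> sketch below (a paraphrase of the linked revision, not a verbatim copy;
> the constructor argument roles are as I read them there):
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.internal.config.MEMORY_OFFHEAP_ENABLED
> import org.apache.spark.memory.{TaskMemoryManager, UnifiedMemoryManager}
>
> // When no real TaskMemoryManager is available, a throwaway one is built
> // whose bookkeeping heap has nothing to do with the actual JVM heap.
> val mm = Option(taskMemoryManager).getOrElse {
>   new TaskMemoryManager(
>     new UnifiedMemoryManager(
>       new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
>       Long.MaxValue,      // maxHeapMemory
>       Long.MaxValue / 2,  // onHeapStorageRegionSize, leaving execution
>                           // ~Long.MaxValue / 2 of "available" heap
>       1),                 // numCores
>     0)                    // taskAttemptId
> }
> {code}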
> On top of that, TaskMemoryManager retries recursively when the JVM fails
> to back an allocation that the memory manager has already granted:
> [https://github.com/apache/spark/blob/6cb88c10126bde79076ce5c8d7574cc5c9524746/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L402]
> Because the unified memory manager's heap size is far larger than the real
> JVM heap, this retry loop takes a very long time to terminate, so the
> program effectively hangs. In a typical local test, it hung for as long as
> 2 hours.
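>
> The retry in question (the linked code is Java; the sketch below renders
> it in Scala and omits the locking and page-table bookkeeping) behaves
> roughly like this:
> {code:scala}
> // Simplified paraphrase of TaskMemoryManager.allocatePage's retry path.
> // acquireExecutionMemory only updates bookkeeping; the JVM allocation is
> // what actually fails. The failed grant is parked in acquiredButNotUsed
> // and the method recurses, so with ~Long.MaxValue / 2 of bookkeeping
> // headroom each failed round shrinks the pool far too slowly for the
> // recursion to terminate in reasonable time.
> def allocatePage(size: Long, consumer: MemoryConsumer): MemoryBlock = {
>   val acquired = acquireExecutionMemory(size, consumer)
>   try {
>     memoryManager.tungstenMemoryAllocator().allocate(acquired)
>   } catch {
>     case _: OutOfMemoryError =>
>       acquiredButNotUsed += acquired // granted on paper, never allocated
>       allocatePage(size, consumer)   // recursive retry
>   }
> }
> {code}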