mcdull_zhang created SPARK-43911:
------------------------------------
Summary: Directly use Set to consume iterator data to deduplicate,
thereby reducing memory usage
Key: SPARK-43911
URL: https://issues.apache.org/jira/browse/SPARK-43911
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: mcdull_zhang
When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for
dynamic partition pruning, it first copies all of the keys into an Array and
then calls distinct on that Array to remove duplicates.

A broadcast HashedRelation can contain many rows, and its keys are often
highly duplicated. The intermediate Array therefore holds every key,
duplicates included, and can occupy a large amount of memory. That memory is
not managed by the MemoryManager, so it may trigger an OOM.
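
The difference between the two approaches can be sketched with plain Scala
collections. This is only an illustration of the idea in the summary, not the
actual SubqueryBroadcastExec code: the object and method names below are
hypothetical, keys are simplified to Long values, and it assumes the order of
the resulting keys does not matter.

```scala
import scala.collection.mutable

// Hypothetical helper for illustration; not part of Spark.
object DedupSketch {
  // Approach described above: materialize every key, then deduplicate.
  // The intermediate Array holds all keys, duplicates included.
  def dedupViaArray(keys: Iterator[Long]): Array[Long] =
    keys.toArray.distinct

  // Approach proposed in the summary: consume the iterator directly into a Set.
  // Only the distinct keys are ever held in memory.
  def dedupViaSet(keys: Iterator[Long]): Array[Long] = {
    val seen = mutable.HashSet.empty[Long]
    keys.foreach(seen += _)
    seen.toArray
  }
}
```

Both methods return the same set of distinct keys, but the Set-based version
does not preserve the order in which keys were first seen, so it only fits
callers that treat the result as an unordered collection.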
--
This message was sent by Atlassian Jira
(v8.20.10#820010)