mcdull_zhang created SPARK-43911:
------------------------------------

             Summary: Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage
                 Key: SPARK-43911
                 URL: https://issues.apache.org/jira/browse/SPARK-43911
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: mcdull_zhang

When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for dynamic partition pruning, it first copies all of the keys into an Array and then calls distinct on that Array to remove duplicates.

A broadcast HashedRelation can contain many rows, and the key often repeats at a high rate, so the intermediate Array can occupy a large amount of memory before any duplicates are removed. This memory is not managed by MemoryManager, so the allocation may trigger an OOM. Consuming the key iterator directly into a Set, as the summary proposes, avoids materializing the duplicates in the first place.
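
The difference between the two approaches can be sketched as follows. This is only an illustration in Python, not Spark's Scala code and not taken from SubqueryBroadcastExec: the function names are hypothetical, the keys are assumed to be plain integers, and the "peak" counts are a rough stand-in for memory use.

```python
from typing import Iterator, List, Set, Tuple


def keys_via_array_then_distinct(keys: Iterator[int]) -> Tuple[List[int], int]:
    """Pattern described in the issue: buffer every key, then deduplicate."""
    buffered = list(keys)  # one entry per row, duplicates included
    distinct = list(dict.fromkeys(buffered))  # order-preserving distinct
    return distinct, len(buffered)  # second value: peak buffered entries


def keys_via_set(keys: Iterator[int]) -> Tuple[Set[int], int]:
    """Proposed pattern: add each key to a set as it is read."""
    seen: Set[int] = set()
    peak = 0
    for key in keys:
        seen.add(key)  # a duplicate key does not grow the set
        peak = max(peak, len(seen))
    return seen, peak  # second value: peak entries held at once


def make_keys(rows: int, distinct: int) -> Iterator[int]:
    """Hypothetical key stream with a high repetition rate."""
    return (i % distinct for i in range(rows))


if __name__ == "__main__":
    _, peak_array = keys_via_array_then_distinct(make_keys(100_000, 10))
    _, peak_set = keys_via_set(make_keys(100_000, 10))
    print(peak_array, peak_set)  # prints "100000 10"
```

With 100,000 rows and 10 distinct keys, the first approach holds 100,000 entries before deduplicating, while the second never holds more than 10. The actual saving in Spark depends on the key type and the Set implementation used.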