[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720257#comment-16720257 ]
Lantao Jin commented on SPARK-26305: ------------------------------------ Add a design doc. Not totally completed. > Breakthrough the memory limitation of broadcast join > ---------------------------------------------------- > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Lantao Jin > Priority: Major > > If the join between a big table and a small one faces data skewing issue, we > usually use a broadcast hint in SQL to resolve it. However, current broadcast > join has many limitations. The primary restriction is memory. The small table > which is broadcasted must be fulfilled to memory in driver/executors side. > Although it will spill to disk when the memory is insufficient, it still > causes OOM if the small table actually is not absolutely small, it's > relatively small. In our company, we have many real big data SQL analysis > jobs which handle dozens of hundreds terabytes join and shuffle. For example, > the size of large table is 100TB, and the small one is 10000 times less, > still 10GB. In this case, broadcast join couldn't be finished since the small > one is still larger than expected. If the join is data skewing, the sortmerge > join always failed. > Hive has a skew join hint which could trigger two-stage task to handle the > skew key and normal key separately. I guess Databricks Runtime has the > similar implementation. However, the skew join hint needs SQL users know the > data in table like their children. They must know which key is skewing in a > join. It's very hard to know since the data is changing day by day and the > join key isn't fixed in different queries. The users have to set a huge > partition number to try their luck. > So, do we have a simple, rude and efficient way to resolve it? Back to the > limitation, if the broadcasted table no needs to fill to memory, in other > words, driver/executor stores the broadcasted table to disk only. The problem > mentioned above could be resolved. > A new hint like BROADCAST_DISK or an additional parameter in original > BROADCAST hint will be introduced to cover this case. The original broadcast > behavior won’t be changed. > I will offer a design doc if you have same feeling about it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org