[ 
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720257#comment-16720257
 ] 

Lantao Jin commented on SPARK-26305:
------------------------------------

Added a design doc. It is not fully complete yet.

> Breakthrough the memory limitation of broadcast join
> ----------------------------------------------------
>
>                 Key: SPARK-26305
>                 URL: https://issues.apache.org/jira/browse/SPARK-26305
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> When a join between a big table and a small one suffers from data skew, we
> usually resolve it with a broadcast hint in SQL. However, the current
> broadcast join has many limitations, and the primary one is memory: the
> broadcast table must fit in memory on the driver and executor side.
> Although it can spill to disk when memory is insufficient, it still causes
> OOM when the "small" table is small only relative to the big one. In our
> company, many real big-data SQL analysis jobs join and shuffle tens to
> hundreds of terabytes. For example, if the large table is 100TB and the
> small one is 10,000 times smaller, the small one is still 10GB. In this case
> the broadcast join cannot finish, because the small table is still larger
> than expected. If the join keys are skewed, the sort-merge join always
> fails.
> Hive has a skew join hint that triggers a two-stage task to handle the
> skewed keys and the normal keys separately. I guess Databricks Runtime has a
> similar implementation. However, the skew join hint requires SQL users to
> know the data in their tables as well as they know their own children: they
> must know which keys are skewed in a join. That is very hard, since the data
> changes day by day and the join key differs from query to query. Users end
> up setting a huge partition number and trying their luck.
> So, is there a simple, crude, and efficient way to resolve this? Going back
> to the limitation: if the broadcast table did not need to fit in memory, in
> other words, if the driver and executors stored the broadcast table on disk
> only, the problem described above could be resolved.
> A new hint such as BROADCAST_DISK, or an additional parameter on the
> existing BROADCAST hint, will be introduced to cover this case. The existing
> broadcast behavior won’t be changed.
> I will provide a design doc if you feel the same way about it.
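
As a rough illustration of the disk-only idea in the description above (this is not Spark code and not the proposed design), the sketch below keeps the small side of a join in an on-disk SQLite table and streams the big side past it, so the small side never has to fit in memory. The function name `disk_broadcast_join`, the two-column row layout, and the use of SQLite as the on-disk store are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def disk_broadcast_join(big_rows, small_rows, db_path):
    """Inner-join big_rows to small_rows on the first field of each row.

    Hypothetical helper: the small (build) side is written to an on-disk
    SQLite table at db_path instead of an in-memory hash map, and the big
    (probe) side is streamed one row at a time.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Build phase: persist the small side on disk and index the join key.
        conn.execute("CREATE TABLE small (k TEXT, v TEXT)")
        conn.executemany("INSERT INTO small VALUES (?, ?)", small_rows)
        conn.execute("CREATE INDEX small_k ON small (k)")
        conn.commit()

        # Probe phase: each big-side row looks up its key locally, so a
        # heavily repeated (skewed) key needs no repartitioning of the big side.
        for key, payload in big_rows:
            cursor = conn.execute(
                "SELECT v FROM small WHERE k = ? ORDER BY rowid", (key,)
            )
            for (value,) in cursor:
                yield key, payload, value
    finally:
        conn.close()


small = [("a", "A1"), ("b", "B1")]
big = [("a", 1), ("a", 2), ("c", 3), ("b", 4)]  # key "a" is the skewed key

with tempfile.TemporaryDirectory() as tmp:
    result = list(disk_broadcast_join(big, small, os.path.join(tmp, "small.db")))
```

The trade-off in this sketch is the one the description implies: every probe pays for a disk-backed lookup instead of an in-memory one, in exchange for a build side whose size is bounded by disk rather than by heap.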



