[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931054#comment-15931054 ]
Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM: ------------------------------------------------------------- The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. I have opened a JIRA for that issue as well. https://issues.apache.org/jira/browse/SPARK-19890 was (Author: zhzhan): The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. https://issues.apache.org/jira/browse/SPARK-19890 > Separate threshold for broadcast and shuffled hash join > ------------------------------------------------------- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Zhan Zhang > Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always build on heap. For shuffledHashJoin, the hash map may be build on > heap(longHash), or off heap(other map if off heap is enabled). The same > configuration makes the configuration hard to tune (how to allocate memory > onheap/offheap). Propose to use different configuration. Please comments > whether it is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org