[ https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303940#comment-16303940 ]
Haijia Zhou commented on SPARK-21795:
-------------------------------------

Any updates on this issue? We have run into the same problem and would like to see it fixed.

> Broadcast hint ignored when dataframe is cached
> -----------------------------------------------
>
>                 Key: SPARK-21795
>                 URL: https://issues.apache.org/jira/browse/SPARK-21795
>             Project: Spark
>          Issue Type: Question
>          Components: Documentation, SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>            Priority: Minor
>
> Not sure if it's a bug or by design, but if a DF is cached, the broadcast
> hint is ignored, and Spark uses SortMergeJoin.
> {code}
> val largeDf = ...
> var smallDf = ...
> smallDf = smallDf.cache
> largeDf.join(broadcast(smallDf))
> {code}
> It makes sense that there's no need to cache a DF when using a broadcast join.
> However, I wonder whether it's correct for Spark to ignore the
> broadcast hint just because the DF is cached. Consider a case where a DF
> should be cached for several queries, and in some of those queries it should
> be broadcast.
> If this is the correct behavior, it's at least worth documenting that a cached
> DF cannot be broadcast.
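
For anyone trying to confirm whether they are hitting the behavior described above, a minimal check along the following lines may help. It is only a sketch under stated assumptions: the data, the column names (`id`, `label`), and the object name `BroadcastHintCheck` are made up for illustration and do not come from the report, and the outcome depends on the Spark version in use. It turns off size-based automatic broadcast so that only the explicit hint can produce a broadcast join, then inspects the physical plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical reproduction harness; names and data are illustrative only.
object BroadcastHintCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-check")
      // Disable size-based automatic broadcast so that only the explicit
      // broadcast() hint can cause a broadcast join in this test.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the "large" and "small" DataFrames from the report.
    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("id", "label").cache()

    // Join with an explicit broadcast hint on the cached DataFrame.
    val joined = largeDf.join(broadcast(smallDf), "id")

    // Print the physical plan; look for BroadcastHashJoin vs. SortMergeJoin.
    joined.explain()

    // Same check done programmatically on the plan's string form.
    val planText = joined.queryExecution.executedPlan.toString
    println(s"uses BroadcastHashJoin: ${planText.contains("BroadcastHashJoin")}")

    spark.stop()
  }
}
```

If the plan shows SortMergeJoin despite the hint, removing the `.cache()` call and re-running should show whether caching is what makes the difference.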