[jira] [Commented] (SPARK-21795) Broadcast hint ignored when dataframe is cached

2017-12-26 Thread Haijia Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303940#comment-16303940
 ] 

Haijia Zhou commented on SPARK-21795:
-

Any updates on this issue? We have run into the same problem and would like to 
see it fixed.

> Broadcast hint ignored when dataframe is cached
> ---
>
> Key: SPARK-21795
> URL: https://issues.apache.org/jira/browse/SPARK-21795
> Project: Spark
> Issue Type: Question
> Components: Documentation, SQL
> Affects Versions: 2.2.0
> Reporter: Lior Chaga
> Priority: Minor
>
> Not sure if it's a bug or by design, but if a DF is cached, the broadcast 
> hint is ignored and Spark uses SortMergeJoin.
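> {code}
> import org.apache.spark.sql.functions.broadcast
>
> val largeDf = ...
> val smallDf = ...
> // A val cannot be reassigned, so bind the cached DF to a new name.
> val cachedSmallDf = smallDf.cache()
> largeDf.join(broadcast(cachedSmallDf))
> {code}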
> It makes sense that there is no need to cache a DF when using a broadcast 
> join; however, I wonder whether it is correct for Spark to ignore the 
> broadcast hint just because the DF is cached. Consider a case where a DF 
> should be cached for several queries, and in some of those queries it should 
> be broadcast.
> If this is the correct behavior, it is at least worth documenting that a 
> cached DF cannot be broadcast.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21795) Broadcast hint ignored when dataframe is cached

2017-11-30 Thread Anton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273642#comment-16273642
 ] 

Anton commented on SPARK-21795:
---

A simple example of when you would want to broadcast a cached dataframe is when 
the dataframe is the result of a long pipeline of transformations, e.g. a few 
large table joins followed by aggregations that make it small enough to be 
broadcast. If you want to broadcast it in joins against two different tables 
(or run multiple actions on that dataframe), it is better to cache it than to 
run the transformation pipeline multiple times.
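
The use case above can be sketched roughly as follows. This is only an illustration under assumptions: the object name, `buildSummary`, the table names, and the generated data are all hypothetical stand-ins, and the code does not show whether the hint is honored for the cached side, which is the open question in this issue.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, sum}

object CachedBroadcastReuse {
  // Hypothetical stand-in for the long pipeline: aggregate a large table
  // down to one row per key.
  def buildSummary(events: DataFrame): DataFrame =
    events.groupBy("key").agg(sum("value").as("total"))

  // Joins two tables against the same cached summary and returns both row counts.
  def joinCounts(spark: SparkSession): (Long, Long) = {
    // Generated data, assumed purely for illustration.
    val events = spark.range(0L, 100000L)
      .select((col("id") % 10).as("key"), col("id").as("value"))
    val tableA = spark.range(0L, 1000L)
      .select((col("id") % 10).as("key"), col("id").as("a"))
    val tableB = spark.range(0L, 500L)
      .select((col("id") % 10).as("key"), col("id").as("b"))

    // Cache once so the pipeline is not recomputed for each join.
    val summary = buildSummary(events).cache()

    // Request a broadcast of the cached summary in two separate joins.
    val joinedA = tableA.join(broadcast(summary), "key")
    val joinedB = tableB.join(broadcast(summary), "key")

    val counts = (joinedA.count(), joinedB.count())
    summary.unpersist()
    counts
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-broadcast-reuse")
      .getOrCreate()
    println(joinCounts(spark))
    spark.stop()
  }
}
```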

I was able to replicate the issue. It appears that the hint is ignored on 
cached dataframes; however, if the size of the dataframe is below the current 
spark.sql.autoBroadcastJoinThreshold value, it will still be broadcast. 
This makes it feel more like a bug than a design decision.
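
One way to check this on a given Spark version is to disable size-based automatic broadcasting and inspect the physical plan, as in the sketch below. The object name, `planFor`, and the generated data are assumptions made for illustration; no particular output is claimed for the cached case, since that depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintOnCachedDf {
  // Returns the physical plan of a hinted join, with or without caching the small side.
  def planFor(spark: SparkSession, cacheSmallSide: Boolean): String = {
    // Setting the threshold to -1 disables size-based automatic broadcasting,
    // so only the explicit hint can request a broadcast join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val small = spark.range(0L, 100L).toDF("id")
    val smallDf = if (cacheSmallSide) small.cache() else small
    if (cacheSmallSide) smallDf.count() // materialize the cache

    val joined = largeDf.join(broadcast(smallDf), "id")
    val plan = joined.queryExecution.executedPlan.toString

    if (cacheSmallSide) smallDf.unpersist()
    plan
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-on-cached-df")
      .getOrCreate()

    for (cached <- Seq(false, true)) {
      val plan = planFor(spark, cached)
      println(s"cached=$cached uses BroadcastHashJoin: ${plan.contains("BroadcastHashJoin")}")
    }
    spark.stop()
  }
}
```

Comparing the two printed lines shows whether caching changes the join strategy chosen for the hinted side.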




[jira] [Commented] (SPARK-21795) Broadcast hint ignored when dataframe is cached

2017-08-22 Thread Lior Chaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136891#comment-16136891
 ] 

Lior Chaga commented on SPARK-21795:


Hi, what I am saying is that if I use a hint to request a broadcast join on a 
DF, I expect Spark to use BroadcastHashJoin even if this DF was previously 
cached (or, if not, it should be documented that broadcast doesn't work with a 
cached DF).

My second claim is that one might have multiple queries in a Spark session, and 
dataframes may be reused across different queries. So it is not entirely 
impossible that one would like to benefit from caching a DF in one query and 
broadcast this DF in another, unrelated query. But this is just a general 
statement; personally I don't have such a use case.


