Or, since you already use the DataFrame API instead of SQL, you can add the 
broadcast function to force it.


https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
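In code that would look roughly like the sketch below (assuming Scala and the same A/B
DataFrames as in the question; broadcast() is only a hint, so the planner may still
decline it for some join types):

    import org.apache.spark.sql.functions.broadcast

    // Mark A (the smaller table) as broadcastable before joining.
    val joined = broadcast(A).join(B, A("key1") === B("key2"), "left")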


Yong


________________________________
From: Bryan Jeffrey <bryan.jeff...@gmail.com>
Sent: Friday, June 30, 2017 6:57 AM
To: d...@spark.org; user@spark.apache.org; paleyl
Subject: Re: about broadcast join of base table in spark sql

Hello.

If you want to allow broadcast joins with larger tables, you can set 
spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
planner to allow the broadcast join despite 'A' being larger than the default threshold.
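For example (a sketch only, assuming a spark-shell style sqlContext; the 100 MB value
is arbitrary, the default is 10 MB):

    // Raise the broadcast threshold to ~100 MB.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
    // Or at submit time: --conf spark.sql.autoBroadcastJoinThreshold=104857600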




From: paleyl
Sent: Wednesday, June 28, 10:42 PM
Subject: about broadcast join of base table in spark sql
To: d...@spark.org, user@spark.apache.org


Hi All,


Recently I ran into a problem with broadcast join: I want to left join tables A and B, 
where A is the smaller one and the left table, so I wrote

A = A.join(B, A("key1") === B("key2"), "left")

but I found that A is not broadcast, as the shuffle size is still very 
large.
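
(One way to see which join strategy was chosen is to print the physical plan; a
broadcast join shows up as a BroadcastHashJoin-style operator rather than a
shuffle-based one:

    // Inspect the physical plan before triggering the job.
    A.join(B, A("key1") === B("key2"), "left").explain()
)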

I guess this is a deliberate design in Spark, so could anyone please tell me 
why it is designed like this? I am just very curious.


Best,


Paley


