Thanks! After using the syntax provided in the link, select /*+ BROADCAST(A) */ ..., I got what I wanted.
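For the record, the hinted query looked roughly like this (a sketch run from spark-shell; the hint syntax is from the linked page):

```scala
// The BROADCAST hint forces table A onto the broadcast side of the join,
// regardless of spark.sql.autoBroadcastJoinThreshold.
val df = spark.sql("""
  SELECT /*+ BROADCAST(A) */
         A.*,
         B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
  FROM B
  LEFT JOIN A
    ON trim(A.country) = trim(B.cntry_code)
""")
df.explain() // the plan should now show a BroadcastHashJoin
```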
But I want to ask: besides using queryExecution.stringWithStats (DataFrame API) to show the table statistics, is there any way to show the table statistics with an explain statement on the Spark SQL command line?
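For context, this is what I do today from the DataFrame API (a minimal sketch; the table name is taken from the plan below):

```scala
// stringWithStats renders the optimized logical plan annotated with its
// statistics (sizeInBytes, plus row counts when CBO stats are available).
val q = spark.sql("SELECT * FROM fpv365h.cc_base_part1")
println(q.queryExecution.stringWithStats)
```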


Best Regards
Kelly




On 2019-07-29 14:29:50, "Rubén Berenguel" <rbereng...@gmail.com> wrote:

Hi, I hope this answers your question.


You can hint the broadcast in SQL as detailed here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html
(thanks Jacek :) )
I'd recommend creating a temporary table with the trimming you use in the join (for clarity). Also keep in mind that the DataFrame API methods are often more powerful and readable than using Spark SQL directly (as happens with the broadcast case, although it depends on personal preference).
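A rough sketch of what I mean, with illustrative table and column names (broadcast() marks the small side explicitly, independently of the size threshold):

```scala
import org.apache.spark.sql.functions.{broadcast, col, trim}

// Pre-trim the join keys once, for clarity, then force the broadcast on
// the small side instead of relying on spark.sql.autoBroadcastJoinThreshold.
val small = spark.table("A").withColumn("country_t", trim(col("country")))
val big   = spark.table("B").withColumn("cntry_code_t", trim(col("cntry_code")))
val joined = big.join(broadcast(small),
  big("cntry_code_t") === small("country_t"), "left_outer")
```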


Regards, 


Ruben


-- 
Rubén Berenguel



On 29 July 2019 at 07:12:30, zhangliyun (kelly...@126.com) wrote:

Hi all:
I want to ask a question about broadcast joins in Spark SQL.




```
select A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
on trim(A.country) = trim(B.cntry_code);
```
Here A is a small table with only 8 rows, but somehow the statistics of table A have a problem.


A join B is planned as a sort-merge join, while the join key (trim(A.country) = trim(B.cntry_code)) only has several values (nearly 21 countries). Is there any way I can force Spark SQL to use a broadcast join? (I cannot simply enlarge spark.sql.autoBroadcastJoinThreshold, as I do not know the exact size Spark SQL computes for the table.)
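For reference, this is the setting I mean (a sketch; the default is 10 MB and -1 disables size-based broadcasting, but I do not know what value would be large enough here):

```scala
// Raise the auto-broadcast threshold (in bytes); tables whose estimated
// size falls below it are broadcast automatically. Default: 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```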


I tried to print the physical plan, but it did not show the table size, so I did not know how large to make spark.sql.autoBroadcastJoinThreshold in order to turn the sort-merge join into a broadcast join.




```
== Parsed Logical Plan ==
'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio * 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
+- 'Join LeftOuter, ('trim('cc_base_part1.country) = 'trim('cc_rank_agg.cntry_code))
   :- 'UnresolvedRelation `cc_base_part1`
   +- 'UnresolvedRelation `cc_rank_agg`

== Analyzed Logical Plan ==
cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string, credit_card_created_date: string, card_usage: smallint, cc_category: string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2), dt: string, nsf_ratio_to_pop: decimal(38,6)
Project [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381, dt#382, CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381 as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2)))), DecimalType(22,4)) as decimal(38,16))) / promote_precision(cast(nsf_on_entry#386 as decimal(38,16)))), DecimalType(38,6)) AS nsf_ratio_to_pop#369]
+- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
   :- SubqueryAlias cc_base_part1
   :  +- HiveTableRelation `fpv365h`.`cc_base_part1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381], [dt#382]
   +- SubqueryAlias cc_rank_agg
      +- HiveTableRelation `fpv365h`.`cc_rank_agg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383, num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
```


Does Spark have any command to show the table size when printing the physical plan? I would appreciate any help with this question.
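For reference, a sketch of the statistics commands I know of (whether EXPLAIN itself can print the size is exactly what I am asking):

```scala
// Compute table-level statistics so the optimizer has a real size for the
// small table; DESCRIBE EXTENDED should then list the computed size under
// the detailed table information.
spark.sql("ANALYZE TABLE fpv365h.cc_base_part1 COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED fpv365h.cc_base_part1").show(100, truncate = false)
```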




Best regards


Kelly Zhang




 
