Hi all,
I want to ask a question about broadcast joins in Spark SQL.
```
select B.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
on trim(B.country) = trim(A.cntry_code);
```
Here A is a small table with only 8 rows, but somehow the statistics of table A
are wrong. The join of A and B is planned as a sort-merge join, even though the
join key (trim(B.country) = trim(A.cntry_code)) only has several distinct values
(nearly 21 countries). Is there any way I can force Spark SQL to use a broadcast
join? I cannot simply enlarge spark.sql.autoBroadcastJoinThreshold, because I do
not know what size Spark SQL computes for the table.
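For example, I would like to be able to write something like the sketch below
(the /*+ BROADCAST(...) */ hint and the threshold value are my assumptions from
the docs; I have not verified them on my version):
```
-- option 1: enlarge the threshold for the session (default is 10485760 bytes = 10 MB);
-- I do not know which value is large enough, because the plan does not show the size
SET spark.sql.autoBroadcastJoinThreshold=104857600;

-- option 2: hint the small table directly, which should not depend on the threshold
SELECT /*+ BROADCAST(A) */
       B.*,
       B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
FROM cc_base_part1 B
LEFT JOIN cc_rank_agg A
  ON trim(B.country) = trim(A.cntry_code);
```
Would either of these still work when the statistics of the small table are wrong?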
I tried to print the physical plan, but it does not show the table sizes, so I
do not know how large I would have to set spark.sql.autoBroadcastJoinThreshold
to turn the sort-merge join into a broadcast join.
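For reference, I printed the plan with something like this (assuming the EXPLAIN
EXTENDED syntax; Dataset.explain(true) should give the same output):
```
EXPLAIN EXTENDED
SELECT B.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
FROM cc_base_part1 B
LEFT JOIN cc_rank_agg A
  ON trim(B.country) = trim(A.cntry_code);
```
The first two sections of the output are: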
```
== Parsed Logical Plan ==
'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio * 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
+- 'Join LeftOuter, ('trim('cc_base_part1.country) = 'trim('cc_rank_agg.cntry_code))
   :- 'UnresolvedRelation `cc_base_part1`
   +- 'UnresolvedRelation `cc_rank_agg`

== Analyzed Logical Plan ==
cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string, credit_card_created_date: string, card_usage: smallint, cc_category: string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2), dt: string, nsf_ratio_to_pop: decimal(38,6)
Project [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381, dt#382, CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381 as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2)))), DecimalType(22,4)) as decimal(38,16))) / promote_precision(cast(nsf_on_entry#386 as decimal(38,16)))), DecimalType(38,6)) AS nsf_ratio_to_pop#369]
+- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
   :- SubqueryAlias cc_base_part1
   :  +- HiveTableRelation `fpv365h`.`cc_base_part1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381], [dt#382]
   +- SubqueryAlias cc_rank_agg
      +- HiveTableRelation `fpv365h`.`cc_rank_agg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383, num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
```
Does Spark have any command to show the table sizes when printing the physical
plan? I would appreciate any help with this.
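What I plan to try next is to recompute and inspect the statistics of the small
table (a sketch; I have not verified this against my Hive setup, and the table
name is taken from the plan above):
```
-- recompute the table-level statistics (size in bytes, row count) for the small table
ANALYZE TABLE fpv365h.cc_rank_agg COMPUTE STATISTICS;

-- the recorded size should then show up in the "Statistics" field
DESCRIBE EXTENDED fpv365h.cc_rank_agg;
```
I have also read that newer Spark versions support EXPLAIN COST, which prints
statistics next to the optimized plan, but I am not sure it is available on my
version.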
Best regards
Kelly Zhang