wangxiaoying commented on issue #423:
URL:
https://github.com/apache/incubator-wayang/issues/423#issuecomment-2043314822
> Thanks @wangxiaoying. I guess the broadcast join reduces the amount of
data shuffled for this specific dataset/query. Could you disable the broadcast
join in Spark to make sure if the difference comes from the join only? Sth
like: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) That would at
least make equal the amount of data shuffled to better understand the
performance difference.
Hi @zkaoudi , I set the config of `autoBroadcastJoinThreashold` to 1048576
(10485760 by default) since setting this value too small will make both joins
`SortMergeJoin` instead of `HashJoin`. Here is the result plan with two hash
joins:
```
*(6) Sort [revenue#138 DESC NULLS LAST, o_orderdate#58 ASC NULLS FIRST],
true, 0
+- AQEShuffleRead coalesced
+- ShuffleQueryStage 4
+- Exchange rangepartitioning(revenue#138 DESC NULLS LAST,
o_orderdate#58 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=350]
+- *(5) HashAggregate(keys=[l_orderkey#0, o_orderdate#58,
o_shippriority#61], functions=[sum((l_extendedprice#5 * (1 - l_discount#6)))])
+- *(5) HashAggregate(keys=[l_orderkey#0, o_orderdate#58,
o_shippriority#61], functions=[partial_sum((l_extendedprice#5 * (1 -
l_discount#6)))])
+- *(5) Project [o_orderdate#58, o_shippriority#61,
l_orderkey#0, l_extendedprice#5, l_discount#6]
+- *(5) ShuffledHashJoin [o_orderkey#54],
[l_orderkey#0], Inner, BuildLeft
:- AQEShuffleRead coalesced
: +- ShuffleQueryStage 3
: +- Exchange hashpartitioning(o_orderkey#54,
200), ENSURE_REQUIREMENTS, [plan_id=266]
: +- *(4) Project [o_orderkey#54,
o_orderdate#58, o_shippriority#61]
: +- *(4) ShuffledHashJoin [c_custkey#36],
[o_custkey#55], Inner, BuildLeft
: :- AQEShuffleRead coalesced
: : +- ShuffleQueryStage 0
: : +- Exchange
hashpartitioning(c_custkey#36, 200), ENSURE_REQUIREMENTS, [plan_id=132]
: : +- *(1) Project
[c_custkey#36]
: : +- *(1) Filter
(staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
StringType, readSidePadding, c_mktsegment#42, 10, true, false,
true) = BUILDING )
: : +- *(1) Scan
JDBCRelation(public.customer) [numPartitions=1] [c_custkey#36,c_mktsegment#42]
PushedFilters: [*IsNotNull(c_custkey)], ReadSchema: struct<c_c
ustkey:int,c_mktsegment:string>
: +- AQEShuffleRead coalesced
: +- ShuffleQueryStage 1
: +- Exchange
hashpartitioning(o_custkey#55, 200), ENSURE_REQUIREMENTS, [plan_id=139]
: +- *(2) Scan
JDBCRelation(public.orders) [numPartitions=1]
[o_orderkey#54,o_custkey#55,o_orderdate#58,o_shippriority#61] PushedFilters:
[*IsNotNull(o_orderdate)
, *LessThan(o_orderdate,1995-03-15), *IsNotNull(o_custkey),
*IsNotNull(o_..., ReadSchema:
struct<o_orderkey:int,o_custkey:int,o_orderdate:date,o_shippriority:int>
+- AQEShuffleRead coalesced
+- ShuffleQueryStage 2
+- Exchange hashpartitioning(l_orderkey#0,
200), ENSURE_REQUIREMENTS, [plan_id=150]
+- *(3) Scan JDBCRelation(public.lineitem)
[numPartitions=1] [l_orderkey#0,l_extendedprice#5,l_discount#6] PushedFilters:
[*IsNotNull(l_shipdate), *GreaterThan(l_shipdate,1995
-03-15), *IsNotNull(l_orderkey)], ReadSchema:
struct<l_orderkey:int,l_extendedprice:decimal(15,2),l_discount:decimal(15,2)>
```
The performance does not change much (still ~40s).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]