Re: [I] Performance issue on TPCH comparing with sparksql [incubator-wayang]

via GitHub Mon, 08 Apr 2024 10:43:24 -0700


wangxiaoying commented on issue #423:
URL: 
https://github.com/apache/incubator-wayang/issues/423#issuecomment-2043314822


   > Thanks @wangxiaoying. I guess the broadcast join reduces the amount of 
data shuffled for this specific dataset/query. Could you disable the broadcast 
join in Spark to make sure if the difference comes from the join only? Sth 
like: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) That would at 
least make equal the amount of data shuffled to better understand the 
performance difference.
   
   Hi @zkaoudi , I set the config of `autoBroadcastJoinThreashold` to 1048576 
(10485760 by default) since setting this value too small will make both joins 
`SortMergeJoin` instead of `HashJoin`.  Here is the result plan with two hash 
joins:
   
   ```
      *(6) Sort [revenue#138 DESC NULLS LAST, o_orderdate#58 ASC NULLS FIRST], 
true, 0
      +- AQEShuffleRead coalesced
         +- ShuffleQueryStage 4
            +- Exchange rangepartitioning(revenue#138 DESC NULLS LAST, 
o_orderdate#58 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=350]
               +- *(5) HashAggregate(keys=[l_orderkey#0, o_orderdate#58, 
o_shippriority#61], functions=[sum((l_extendedprice#5 * (1 - l_discount#6)))])
                  +- *(5) HashAggregate(keys=[l_orderkey#0, o_orderdate#58, 
o_shippriority#61], functions=[partial_sum((l_extendedprice#5 * (1 - 
l_discount#6)))])
                     +- *(5) Project [o_orderdate#58, o_shippriority#61, 
l_orderkey#0, l_extendedprice#5, l_discount#6]
                        +- *(5) ShuffledHashJoin [o_orderkey#54], 
[l_orderkey#0], Inner, BuildLeft
                           :- AQEShuffleRead coalesced
                           :  +- ShuffleQueryStage 3
                           :     +- Exchange hashpartitioning(o_orderkey#54, 
200), ENSURE_REQUIREMENTS, [plan_id=266]
                           :        +- *(4) Project [o_orderkey#54, 
o_orderdate#58, o_shippriority#61]
                           :           +- *(4) ShuffledHashJoin [c_custkey#36], 
[o_custkey#55], Inner, BuildLeft
                           :              :- AQEShuffleRead coalesced
                           :              :  +- ShuffleQueryStage 0
                           :              :     +- Exchange 
hashpartitioning(c_custkey#36, 200), ENSURE_REQUIREMENTS, [plan_id=132]
                           :              :        +- *(1) Project 
[c_custkey#36]
                           :              :           +- *(1) Filter 
(staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, 
StringType, readSidePadding, c_mktsegment#42, 10, true, false,
    true) = BUILDING  )
                           :              :              +- *(1) Scan 
JDBCRelation(public.customer) [numPartitions=1] [c_custkey#36,c_mktsegment#42] 
PushedFilters: [*IsNotNull(c_custkey)], ReadSchema: struct<c_c
   ustkey:int,c_mktsegment:string>
                           :              +- AQEShuffleRead coalesced
                           :                 +- ShuffleQueryStage 1
                           :                    +- Exchange 
hashpartitioning(o_custkey#55, 200), ENSURE_REQUIREMENTS, [plan_id=139]
                           :                       +- *(2) Scan 
JDBCRelation(public.orders) [numPartitions=1] 
[o_orderkey#54,o_custkey#55,o_orderdate#58,o_shippriority#61] PushedFilters: 
[*IsNotNull(o_orderdate)
   , *LessThan(o_orderdate,1995-03-15), *IsNotNull(o_custkey), 
*IsNotNull(o_..., ReadSchema: 
struct<o_orderkey:int,o_custkey:int,o_orderdate:date,o_shippriority:int>
                           +- AQEShuffleRead coalesced
                              +- ShuffleQueryStage 2
                                 +- Exchange hashpartitioning(l_orderkey#0, 
200), ENSURE_REQUIREMENTS, [plan_id=150]
                                    +- *(3) Scan JDBCRelation(public.lineitem) 
[numPartitions=1] [l_orderkey#0,l_extendedprice#5,l_discount#6] PushedFilters: 
[*IsNotNull(l_shipdate), *GreaterThan(l_shipdate,1995
   -03-15), *IsNotNull(l_orderkey)], ReadSchema: 
struct<l_orderkey:int,l_extendedprice:decimal(15,2),l_discount:decimal(15,2)>
   ```
   
   The performance does not change much (still ~40s).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Performance issue on TPCH comparing with sparksql [incubator-wayang]

Reply via email to