Hi Senthil,
I have just run a couple of quick tests for TPCDS Q4, using the TPCDS schema
created at scale 1500 that I have on a Hadoop/YARN cluster, and was not able to
reproduce the difference in execution time between Spark 2 and Spark 3 that you
report in your mail.
This is the Spark config I used:
bin/spark-shell --master yarn --driver-memory 8g --executor-cores 10
--executor-memory 50g --conf spark.dynamicAllocation.enabled=false
--num-executors 20
This is how I ran the tests:
```
val path="/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"
val tables = List(
  "catalog_returns", "catalog_sales", "inventory", "store_returns", "store_sales", "web_returns", "web_sales",
  "call_center", "catalog_page", "customer", "customer_address", "customer_demographics", "date_dim",
  "household_demographics", "income_band", "item", "promotion", "reason", "ship_mode", "store", "time_dim",
  "warehouse", "web_page", "web_site")
// Register each TPCDS table as a temporary view backed by its Parquet directory
for (t <- tables) {
  println(s"Creating temporary view $t")
  spark.read.parquet(path + t).createOrReplaceTempView(t)
}
val q4="""…"""
// SQL from https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q4.sql
spark.time(sql(q4).collect) // note q4 result set is only 100 rows
```
Spark 2.4.5:
Time taken: 256812 ms
Time taken: 226571 ms
Time taken: 305508 ms
Spark 3.1.2:
spark.time(sql(q4).collect)
Time taken: 235356 ms
Time taken: 236284 ms
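To put the timings above side by side, here is a minimal plain-Scala sketch (no Spark required) that computes the mean, median and run-to-run spread of the reported numbers. The helper names mean, median and relativeSpread are only illustrative and are not part of any Spark API.

```scala
// Wall-clock times in milliseconds, copied from the runs reported above.
val spark245Ms = Seq(256812L, 226571L, 305508L)
val spark312Ms = Seq(235356L, 236284L)

// Arithmetic mean of a non-empty sequence of timings.
def mean(xs: Seq[Long]): Double = xs.sum.toDouble / xs.size

// Median of a non-empty sequence (average of the two middle values when the size is even).
def median(xs: Seq[Long]): Double = {
  val s = xs.sorted
  val n = s.size
  if (n % 2 == 1) s(n / 2).toDouble else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

// Gap between the slowest and fastest run, as a fraction of the fastest run.
def relativeSpread(xs: Seq[Long]): Double = (xs.max - xs.min).toDouble / xs.min

println(f"Spark 2.4.5: mean ${mean(spark245Ms)}%.0f ms, median ${median(spark245Ms)}%.0f ms, spread ${relativeSpread(spark245Ms) * 100}%.1f%%")
println(f"Spark 3.1.2: mean ${mean(spark312Ms)}%.0f ms, median ${median(spark312Ms)}%.0f ms, spread ${relativeSpread(spark312Ms) * 100}%.1f%%")
```

With only two or three runs per version these summaries are indicative at best: the Spark 2.4.5 runs alone differ by roughly 35% between fastest and slowest.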
Best,
Luca
From: Senthil Kumar <[email protected]>
Sent: Monday, December 20, 2021 10:20
To: Rao, Abhishek (Nokia - IN/Bangalore) <[email protected]>
Cc: dev <[email protected]>
Subject: Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
We also checked, and we have already backported the fix from
https://issues.apache.org/jira/browse/SPARK-33557.
On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar <[email protected]> wrote:
@Abhishek, we use Spark 3.1*.
On Mon, 20 Dec 2021, 09:50 Rao, Abhishek (Nokia - IN/Bangalore),
<[email protected]> wrote:
Hi Senthil,
Which version of Spark 3 are we using? We made a similar observation with
Spark 3.0.2 and 3.1.x, and then figured out that we had configured a large value
for spark.network.timeout and that this value did not take effect in any release
prior to 3.0.2.
This was fixed as part of https://issues.apache.org/jira/browse/SPARK-33557.
Because we had configured a large value for spark.network.timeout, TPCDS
queries took a long time when run with Spark 3.0.2 and 3.1.x. Once we
corrected the setting, the queries ran much faster.
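A quick way to see whether spark.network.timeout has been set explicitly is to inspect the configuration from spark-shell. This is a minimal sketch and assumes it is run inside spark-shell, where `spark` is the predefined SparkSession. It only reports the configured value and does not say which value a given release honours.

```scala
// Run inside spark-shell, where `spark` is the predefined SparkSession.
// SparkConf.getOption returns None when the key was never set explicitly;
// in that case Spark falls back to its documented default of 120s.
val networkTimeout = spark.sparkContext.getConf.getOption("spark.network.timeout")
println(networkTimeout.getOrElse("spark.network.timeout not set explicitly (default applies)"))
```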
Thanks and Regards,
Abhishek
From: Senthil Kumar <[email protected]>
Sent: Sunday, December 19, 2021 11:58 PM
To: dev <[email protected]>
Subject: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
Hi All,
We are comparing Spark 2.4.5 and Spark 3 (without enabling Spark 3's additional
features) using TPCDS queries and found that Spark 3's performance is at least
30-40% lower than that of Spark 2.4.5.
E.g.:
Data size used: 1 TB
Spark 2.4.5 finishes Q4 in 1.5 min, but Spark 3.* takes at least 2.5 min.
Note: We tested this on the same cluster with the same data size, and we
ensured that the parameters we passed are identical for Spark 2.4* and Spark
3*.
It would be helpful to know whether any of you have encountered the same issue
in your benchmarking activities. If so, please share your input on what could
be the reason behind this poor performance.
--
Senthil kumar