Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

2021-12-20 Thread Senthil Kumar
Hi Luca,

I am collecting the logical and physical plans, which should help us find
the root cause of this issue.
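
For reference, a minimal sketch of how the plans could be captured from
spark-shell. It assumes `spark` is the predefined session and that `q4` holds
the Q4 SQL text, as in the test script quoted in this thread; the variable
names are only illustrative.

```scala
// Assumes a spark-shell session: `spark` is predefined and `q4` holds the Q4 SQL text.
val df = spark.sql(q4)

// Prints the parsed, analyzed and optimized logical plans, then the physical plan.
df.explain(true)

// The same plans as strings, e.g. to save to a file and diff between Spark versions.
val optimizedLogicalPlan = df.queryExecution.optimizedPlan.toString
val physicalPlan = df.queryExecution.executedPlan.toString
```

Running the same snippet on Spark 2.4.5 and Spark 3.1.x and comparing the
output should show whether the two versions pick different join strategies or
exchanges for this query.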

On Mon, 20 Dec 2021, 16:46 Luca Canali wrote:

> Hi Senthil,
>
>
>
> I have just run a couple of quick tests for TPCDS Q4, using the TPCDS
> schema created at scale 1500 that I have on a Hadoop/YARN cluster, and was
> not able to reproduce the difference in execution time between Spark 2 and
> Spark 3 that you report in your mail.


RE: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

2021-12-20 Thread Luca Canali
Hi Senthil,

 

I have just run a couple of quick tests for TPCDS Q4, using the TPCDS schema 
created at scale 1500 that I have on a Hadoop/YARN cluster, and was not able to 
reproduce the difference in execution time between Spark 2 and Spark 3 that you 
report in your mail.  

This is the Spark config I used:   

bin/spark-shell --master yarn --driver-memory 8g --executor-cores 10 
--executor-memory 50g --conf spark.dynamicAllocation.enabled=false 
--num-executors 20   

 

This is how I ran the tests:

 

```
val path = "/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"

// TPCDS tables: fact tables first, then dimension tables
val tables = List(
  "catalog_returns", "catalog_sales", "inventory", "store_returns",
  "store_sales", "web_returns", "web_sales",
  "call_center", "catalog_page", "customer", "customer_address",
  "customer_demographics", "date_dim", "household_demographics",
  "income_band", "item", "promotion", "reason", "ship_mode", "store",
  "time_dim", "warehouse", "web_page", "web_site")

// Register each Parquet table as a temporary view
for (t <- tables) {
  println(s"Creating temporary view $t")
  spark.read.parquet(path + t).createOrReplaceTempView(t)
}

// SQL from https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q4.sql
val q4 = """…"""

spark.time(sql(q4).collect) // note q4 result set is only 100 rows
```

 

Spark 2.4.5:
Time taken: 256812 ms
Time taken: 226571 ms
Time taken: 305508 ms

Spark 3.1.2:
spark.time(sql(q4).collect)
Time taken: 235356 ms
Time taken: 236284 ms
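
To make the comparison easier to read, the runs above can be averaged. This is
only a small sketch over the five timings listed here; the names `spark245Ms`,
`spark312Ms` and `meanMs` are illustrative, and with so few runs the means are
indicative at best.

```scala
// Timings in milliseconds, copied from the runs listed above.
val spark245Ms = Seq(256812L, 226571L, 305508L)
val spark312Ms = Seq(235356L, 236284L)

// Arithmetic mean of a set of timings.
def meanMs(xs: Seq[Long]): Double = xs.sum.toDouble / xs.size

val mean245 = meanMs(spark245Ms)
val mean312 = meanMs(spark312Ms)

println(f"Spark 2.4.5 mean: ${mean245 / 1000}%.1f s")
println(f"Spark 3.1.2 mean: ${mean312 / 1000}%.1f s")
println(f"Spark 3.1.2 / Spark 2.4.5: ${mean312 / mean245}%.2f")
```

The means come out at roughly 263 s for Spark 2.4.5 and 236 s for Spark 3.1.2,
so in these runs Spark 3.1.2 is not slower.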

 

Best, 

Luca

 

From: Senthil Kumar  
Sent: Monday, December 20, 2021 10:20
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: dev 
Subject: Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

 

We also checked that we have already backported the fix from
https://issues.apache.org/jira/browse/SPARK-33557.

 

On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar <sen...@gmail.com> wrote:

@Abhishek: we use Spark 3.1.*

 

On Mon, 20 Dec 2021, 09:50 Rao, Abhishek (Nokia - IN/Bangalore)
<abhishek@nokia.com> wrote:

Hi Senthil,

 

Which version of Spark 3 are we using? We made a similar observation with
Spark 3.0.2 and 3.1.x, and then figured out that we had configured a large
value for spark.network.timeout, a setting that did not take effect in any
release prior to 3.0.2.

This was fixed as part of https://issues.apache.org/jira/browse/SPARK-33557.
Because we had configured a large value for spark.network.timeout, TPCDS
queries took a long time when run with Spark 3.0.2 and 3.1.x. Once we
corrected the value, the queries executed much faster.
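
One way to check which spark.network.timeout value an application actually
runs with is to read it back from the session. A minimal sketch, assuming a
spark-shell session where `spark` is predefined; 120s is Spark's documented
default for this setting, and the variable name is illustrative.

```scala
// Assumes a spark-shell session where `spark` is predefined.
// Reads the value explicitly set for this application, if any.
// When the key is unset, Spark uses its documented default of 120s.
val networkTimeout: String =
  spark.sparkContext.getConf
    .getOption("spark.network.timeout")
    .getOrElse("120s (default, not set explicitly)")

println(s"spark.network.timeout = $networkTimeout")
```

Comparing this value between the Spark 2.4.5 and Spark 3.x runs would show
whether an oversized timeout is being passed to both.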

 

Thanks and Regards,

Abhishek

 

From: Senthil Kumar <sen...@gmail.com>
Sent: Sunday, December 19, 2021 11:58 PM
To: dev <dev@spark.apache.org>
Subject: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

 

Hi All,

We are comparing Spark 2.4.5 and Spark 3 (without enabling Spark 3's
additional features) using TPCDS queries and found that Spark 3's performance
is reduced by at least 30-40% compared to Spark 2.4.5.

 

For example, with a data size of 1 TB, Spark 2.4.5 finishes Q4 in 1.5 min,
but Spark 3.* takes at least 2.5 min.
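
The Q4 figures can be read in two ways, and the short sketch below works out
both. It uses only the 1.5 min and 2.5 min values reported here; the variable
names are illustrative.

```scala
// Q4 wall-clock times from the report above, in minutes.
val spark2Min = 1.5
val spark3Min = 2.5

// Extra wall-clock time relative to the Spark 2.4.5 run.
val runtimeIncrease = (spark3Min - spark2Min) / spark2Min

// Drop in throughput (queries per unit time) relative to Spark 2.4.5.
val throughputDrop = 1.0 - spark2Min / spark3Min

println(f"Runtime increase: ${runtimeIncrease * 100}%.0f%%")
println(f"Throughput drop:  ${throughputDrop * 100}%.0f%%")
```

Read as runtime, Q4 takes about 67% longer; read as throughput, the drop is
40%, which matches the upper end of the 30-40% range quoted above.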

 

Note: We tested this on the same cluster with the same data size, and we
ensured that the parameters we passed are identical for Spark 2.4.* and
Spark 3.*.

 

Has anyone else encountered the same issue in their benchmarking? If so,
please share your thoughts on what could be causing this poor performance.


 

-- 

Senthil kumar




 




Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

2021-12-20 Thread Senthil Kumar
We also checked that we have already backported the fix from
https://issues.apache.org/jira/browse/SPARK-33557.

On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar wrote:

> @Abhishek: we use Spark 3.1.*

-- 
Senthil kumar