RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore) Thu, 10 Sep 2020 00:27:24 -0700

Hi All,

We regenerated the TPC DS data on S3, and after regeneration the queries run 
faster: the execution time is now comparable with the execution time on HDFS 
with Spark 3.0.0.
So maybe there was some issue in generating the TPC DS data the first time, 
which caused the discrepancy we saw in query execution time on S3 with 
Spark 3.0.0.

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,
Can you try using EMRFS?
Your study looks good. Best of luck.

Regards
Gourav

On Wed, 26 Aug 2020, 12:37 Rao, Abhishek (Nokia - IN/Bangalore), 
<abhishek....@nokia.com> wrote:
Yeah… Not sure if I’m missing any configuration that is causing this issue. 
Any suggestions?

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

So the results do not make sense.

Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek....@nokia.com> wrote:
Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Are you using s3a, which does not use EMRFS? In that case, these results do 
not make sense to me.

Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) 
<abhishek....@nokia.com> wrote:
Hi All,

We’re doing some performance comparisons between Spark querying data on HDFS 
and Spark querying data on S3 (Ceph Object Store used for S3 storage) using 
the standard TPC DS queries. We observe that Spark 3.0 with S3 takes 
significantly longer than Spark 3.0 with HDFS for some of the queries.
We also ran similar queries with Spark 2.4.5 querying data from S3, and for 
this set of queries Spark 2.4.5 takes less time than Spark 3.0, which looks 
very strange.
Below are the details of the 9 queries where Spark 3.0 takes more than 5 times 
as long on S3 as on HDFS.

Environment Details:

  *   Spark running on Kubernetes
  *   TPC DS Scale Factor: 500 GB
  *   Hadoop 3.x
  *   Same CPU and memory used for all executions

All times are in seconds. "HDFS vs S3" is the Spark 3.0 S3 time divided by the 
Spark 3.0 HDFS time; "2.4.5 vs 3.0" is the Spark 3.0 S3 time divided by the 
Spark 2.4.5 S3 time.

Query  Spark 3.0  Spark 3.0  Spark 2.4.5  HDFS vs S3  2.4.5 vs 3.0  Table
       with S3    with HDFS  with S3      (factor)    (factor)      involved
-----  ---------  ---------  -----------  ----------  ------------  -----------
9      880.129    106.109    147.65       8.294574    5.960914      store_sales
44     129.618    23.747     103.916      5.458289    1.247334      store_sales
58     142.113    20.996     33.936       6.768575    4.187677      store_sales
62     32.519     5.425      14.809       5.994286    2.195894      web_sales
76     138.765    20.73      49.892       6.693922    2.781308      store_sales
88     475.824    48.2       94.382       9.871867    5.04147       store_sales
90     53.896     6.804      18.11        7.921223    2.976035      web_sales
94     241.172    43.49      81.181       5.545459    2.970794      web_sales
96     67.059     10.396     15.993       6.450462    4.193022      store_sales
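
For reference, the two factor columns can be recomputed from the raw timings. The following is only a minimal sketch in Python: the names `timings` and `slowdown_factors` are made up for illustration and are not part of the original test setup, and the numbers are copied from the table above.

```python
# Timings in seconds, copied from the table above.
# query -> (Spark 3.0 on S3, Spark 3.0 on HDFS, Spark 2.4.5 on S3)
timings = {
    9: (880.129, 106.109, 147.65),
    44: (129.618, 23.747, 103.916),
    58: (142.113, 20.996, 33.936),
    62: (32.519, 5.425, 14.809),
    76: (138.765, 20.73, 49.892),
    88: (475.824, 48.2, 94.382),
    90: (53.896, 6.804, 18.11),
    94: (241.172, 43.49, 81.181),
    96: (67.059, 10.396, 15.993),
}


def slowdown_factors(s3_30, hdfs_30, s3_245):
    """Return (S3 vs HDFS on Spark 3.0, Spark 3.0 vs Spark 2.4.5 on S3)."""
    return s3_30 / hdfs_30, s3_30 / s3_245


# Hypothetical helper output: one line per query with both ratios.
for query, (s3_30, hdfs_30, s3_245) in sorted(timings.items()):
    vs_hdfs, vs_245 = slowdown_factors(s3_30, hdfs_30, s3_245)
    print(f"q{query}: S3/HDFS = {vs_hdfs:.6f}, 3.0/2.4.5 = {vs_245:.6f}")
```

Every S3/HDFS ratio computed this way is above 5, which is the selection criterion used for the 9 queries listed.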

When we analysed it further, we saw that all these queries operate on either 
the store_sales or the web_sales table. Spark 3 with S3 seems to be 
downloading much more data from storage than Spark 3 with HDFS or Spark 2.4.5 
with S3, and this results in longer query completion times. I’m attaching 
screenshots of the Driver UI for one such instance (Query 9) for reference.
I have also attached the Spark configurations (Spark 3.0) used for these tests.

We’re not sure why Spark 3.0 on S3 behaves this way. Any inputs on what 
we’re missing?

Thanks and Regards,
Abhishek

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
