Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Gourav Sengupta Wed, 26 Aug 2020 02:06:11 -0700

Hi,

So the results do not make sense.



Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) <
abhishek....@nokia.com> wrote:

> Hi Gourav,
>
>
>
> Yes. We’re using s3a.
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
> *Sent:* Wednesday, August 26, 2020 1:18 PM
> *To:* Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark 3.0 using S3 taking long time for some set of TPC DS
> Queries
>
>
>
> Hi,
>
>
>
> Are you using s3a, that is, not EMRFS? In that case, these results
> do not make sense to me.
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <
> abhishek....@nokia.com> wrote:
>
> Hi All,
>
>
>
> We’re comparing the performance of Spark querying data on HDFS with Spark
> querying data on S3 (Ceph Object Store is used for the S3 storage), using
> the standard TPC-DS queries. We observe that Spark 3.0 with S3 takes
> significantly longer than Spark 3.0 with HDFS for some of the queries.
>
> We also ran similar queries with Spark 2.4.5 querying data from S3, and
> for this set of queries Spark 2.4.5 takes less time than Spark 3.0, which
> looks very strange.
>
> Below are the details of the 9 queries where Spark 3.0 takes more than 5
> times as long on S3 as on Hadoop.
>
>
>
> *Environment Details:*
>
>    - *Spark running on Kubernetes*
>    - *TPC DS Scale Factor*: *500 GB*
>    - *Hadoop 3.x*
>    - *Same CPU and memory used for all executions*
>
>
>
> Query | Spark 3.0 with S3 (s) | Spark 3.0 with Hadoop (s) | Spark 2.4.5 with S3 (s) | Spark 3.0 HDFS vs S3 (Factor) | Spark 2.4.5 S3 vs Spark 3.0 S3 (Factor) | Table involved
> ------+-----------------------+---------------------------+-------------------------+-------------------------------+-----------------------------------------+---------------
> 9     | 880.129               | 106.109                   | 147.65                  | 8.294574                      | 5.960914                                | store_sales
> 44    | 129.618               | 23.747                    | 103.916                 | 5.458289                      | 1.247334                                | store_sales
> 58    | 142.113               | 20.996                    | 33.936                  | 6.768575                      | 4.187677                                | store_sales
> 62    | 32.519                | 5.425                     | 14.809                  | 5.994286                      | 2.195894                                | web_sales
> 76    | 138.765               | 20.73                     | 49.892                  | 6.693922                      | 2.781308                                | store_sales
> 88    | 475.824               | 48.2                      | 94.382                  | 9.871867                      | 5.04147                                 | store_sales
> 90    | 53.896                | 6.804                     | 18.11                   | 7.921223                      | 2.976035                                | web_sales
> 94    | 241.172               | 43.49                     | 81.181                  | 5.545459                      | 2.970794                                | web_sales
> 96    | 67.059                | 10.396                    | 15.993                  | 6.450462                      | 4.193022                                | store_sales
>
>
>
> On further analysis, we see that all of these queries operate on either
> the store_sales or the web_sales table, and that Spark 3 with S3 seems to
> download much more data from storage than Spark 3 with Hadoop or Spark
> 2.4.5 with S3, which results in longer query completion times. I’m
> attaching screenshots of the Driver UI for one such instance (Query 9)
> for reference.
>
> I have also attached the Spark configuration (Spark 3.0) used for these
> tests.
>
>
>
> We’re not sure why Spark 3.0 on S3 behaves this way. Any input on
> what we’re missing?
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
>
>
>
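
For reference, the two "Factor" columns in the quoted table can be reproduced from the reported timings. The sketch below assumes each factor is simply the Spark 3.0 S3 time divided by the other run's time, which matches the quoted figures. The names `timings` and `slowdown_factors` are made up for illustration and do not come from the original message.

```python
# Timings in seconds, copied from the quoted table:
# query -> (Spark 3.0 on S3, Spark 3.0 on HDFS, Spark 2.4.5 on S3)
timings = {
    9: (880.129, 106.109, 147.65),
    44: (129.618, 23.747, 103.916),
    58: (142.113, 20.996, 33.936),
    62: (32.519, 5.425, 14.809),
    76: (138.765, 20.73, 49.892),
    88: (475.824, 48.2, 94.382),
    90: (53.896, 6.804, 18.11),
    94: (241.172, 43.49, 81.181),
    96: (67.059, 10.396, 15.993),
}


def slowdown_factors(s3_v3, hdfs_v3, s3_v2):
    """Return (Spark 3.0 S3 / Spark 3.0 HDFS, Spark 3.0 S3 / Spark 2.4.5 S3)."""
    return s3_v3 / hdfs_v3, s3_v3 / s3_v2


# Assumption: a factor is the ratio of the Spark 3.0 S3 time to the other run.
factors = {query: slowdown_factors(*t) for query, t in timings.items()}

for query, (vs_hdfs, vs_245) in sorted(factors.items()):
    print(f"Q{query}: {vs_hdfs:.2f}x vs Spark 3.0 HDFS, "
          f"{vs_245:.2f}x vs Spark 2.4.5 S3")
```

Read this way, every listed query is more than 5 times slower on S3 than on HDFS under Spark 3.0, while the regression relative to Spark 2.4.5 on S3 varies much more from query to query.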
