RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi Abhishek,

Just a few ideas/comments on the topic:

When benchmarking/testing I find it useful to collect a more complete view of resource usage and Spark metrics, beyond just measuring query elapsed time. Something like this: https://github.com/cerndb/spark-dashboard

I'd rather not use dynamic allocation when benchmarking if possible, as it adds a layer of complexity when examining results.

If you suspect that reading from S3 vs. HDFS may play an important role in the performance you observe, you may want to drill down on that with a simple micro-benchmark, for example something like this (for Spark 3.0):

    val df = spark.read.parquet("/TPCDS/tpcds_1500/store_sales")
    df.write.format("noop").mode("overwrite").save

Best,
Luca

From: Rao, Abhishek (Nokia - IN/Bangalore)
Sent: Monday, August 24, 2020 13:50
To: user@spark.apache.org
Subject: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi All,

We're doing some performance comparisons between Spark querying data on HDFS and Spark querying data on S3 (Ceph Object Store used for S3 storage) using standard TPC DS queries. We are observing that Spark 3.0 with S3 takes significantly longer for some queries when compared with HDFS.

We also ran similar queries with Spark 2.4.5 querying data from S3, and for this set of queries the time taken by Spark 2.4.5 is less than with Spark 3.0, which looks very strange.

Below are the details of 9 queries where Spark 3.0 takes more than 5 times as long on S3 as on Hadoop.

Environment Details:
* Spark running on Kubernetes
* TPC DS Scale Factor: 500 GB
* Hadoop 3.x
* Same CPU and memory used for all executions

Query | Spark 3.0 with S3 (Time in seconds) | Spark 3.0 with Hadoop (Time in seconds) | Spark 2.4.5 with S3 (Time in seconds) | Spark 3.0 HDFS vs S3 (Factor) | Spark 2.4.5 S3 vs Spark 3.0 S3 (Factor) | Table involved
9     | 880.129 | 106.109 | 147.65  | 8.294574 | 5.960914 | store_sales
44    | 129.618 | 23.747  | 103.916 | 5.458289 | 1.247334 | store_sales
58    | 142.113 | 20.996  | 33.936  | 6.768575 | 4.187677 | store_sales
62    | 32.519  | 5.425   | 14.809  | 5.994286 | 2.195894 | web_sales
76    | 138.765 | 20.73   | 49.892  | 6.693922 | 2.781308 | store_sales
88    | 475.824 | 48.2    | 94.382  | 9.871867 | 5.04147  | store_sales
90    | 53.896  | 6.804   | 18.11   | 7.921223 | 2.976035 | web_sales
94    | 241.172 | 43.49   | 81.181  | 5.545459 | 2.970794 | web_sales
96    | 67.059  | 10.396  | 15.993  | 6.450462 | 4.193022 | store_sales

When we analysed it further, we saw that all these queries operate on either the store_sales or the web_sales table, and that Spark 3 with S3 seems to be downloading much more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3. This results in longer query completion times. I'm attaching screenshots of the Driver UI for one such instance (Query 9) for reference. Also attached are the Spark configurations (Spark 3.0) used for these tests.

We're not sure why Spark 3.0 on S3 behaves this way. Any inputs on what we're missing?

Thanks and Regards,
Abhishek
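
For readers who want to re-derive the two "Factor" columns, the sketch below recomputes them from the timings quoted in the table. It assumes that each factor is simply the Spark 3.0 S3 time divided by the comparison time, which matches the posted values but is not stated in the thread; the object and field names are made up for illustration.

```scala
// Recompute the slowdown factors from the timings posted in the thread.
// Assumption: factor = (Spark 3.0 on S3 time) / (comparison time).
object SlowdownFactors {
  // One row of the posted table; all times are in seconds.
  final case class Timing(query: Int, s3Spark3: Double, hdfsSpark3: Double, s3Spark245: Double) {
    // How many times slower Spark 3.0 is on S3 than on HDFS.
    def hdfsVsS3: Double = s3Spark3 / hdfsSpark3
    // How many times slower Spark 3.0 is than Spark 2.4.5, both on S3.
    def spark245VsSpark3: Double = s3Spark3 / s3Spark245
  }

  // Values copied from the table in the original message.
  val timings: Seq[Timing] = Seq(
    Timing(9, 880.129, 106.109, 147.65),
    Timing(44, 129.618, 23.747, 103.916),
    Timing(58, 142.113, 20.996, 33.936),
    Timing(62, 32.519, 5.425, 14.809),
    Timing(76, 138.765, 20.73, 49.892),
    Timing(88, 475.824, 48.2, 94.382),
    Timing(90, 53.896, 6.804, 18.11),
    Timing(94, 241.172, 43.49, 81.181),
    Timing(96, 67.059, 10.396, 15.993)
  )

  def main(args: Array[String]): Unit = {
    println("query  hdfs_vs_s3  2.4.5_vs_3.0")
    timings.foreach { t =>
      println(f"${t.query}%5d  ${t.hdfsVsS3}%10.2f  ${t.spark245VsSpark3}%12.2f")
    }
  }
}
```

Every query in the list is more than 5 times slower on S3 than on HDFS with Spark 3.0, while the gap between Spark 2.4.5 and Spark 3.0 on S3 varies much more from query to query.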
RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi Luca,

Thanks for sharing the feedback. We'll include these recommendations in our tests. However, we feel the issue that we're seeing right now is due to the difference in size of data downloaded from storage by the executors. In case of S3, executors are downloading almost 50 GB of data whereas in case of HDFS, it is only 4.5 GB. Any idea why this difference is there?

Thanks and Regards,
Abhishek
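
One way to put numbers on a bytes-read difference like this is to extend the noop micro-benchmark suggested earlier in the thread so that it also sums the input bytes reported by each task. The following is only a sketch for spark-shell (paste mode) on Spark 3.0: the helper name scanAndDiscard and the dataset locations in the usage comment are hypothetical, and because a noop write scans the whole table, it compares raw read volume and throughput rather than the pruning behaviour of any particular TPC DS query.

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Result of one full scan: where it read from, how long it took, how much it read.
final case class ScanResult(path: String, elapsedMs: Long, bytesRead: Long)

// Hypothetical helper: read a Parquet dataset, discard the rows via the
// "noop" sink, and report elapsed time plus input bytes summed over tasks.
def scanAndDiscard(spark: SparkSession, path: String): ScanResult = {
  val bytes = new LongAdder
  val listener = new SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val metrics = taskEnd.taskMetrics
      // taskMetrics can be null for tasks that failed before reporting.
      if (metrics != null) bytes.add(metrics.inputMetrics.bytesRead)
    }
  }
  spark.sparkContext.addSparkListener(listener)
  try {
    val start = System.nanoTime()
    spark.read.parquet(path).write.format("noop").mode("overwrite").save()
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    // Caveat: listener events are delivered asynchronously, so the last few
    // task events may not be counted yet; other jobs running concurrently in
    // the same application would be counted as well.
    ScanResult(path, elapsedMs, bytes.sum())
  } finally {
    spark.sparkContext.removeSparkListener(listener)
  }
}

// Usage with placeholder locations (replace with the real dataset paths):
// Seq("hdfs:///path/to/store_sales", "s3a://some-bucket/path/to/store_sales")
//   .map(scanAndDiscard(spark, _))
//   .foreach(println)
```

If the full-scan byte counts already differ between the two copies, the datasets themselves are probably not equivalent; if they match, the difference is more likely to come from how much each query manages to skip.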
Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi,

are you using s3a, which is not using EMRFS? In that case, these results do not make sense to me.

Regards,
Gourav Sengupta
RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek
Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi,

So the results do not make sense.

Regards,
Gourav
RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Yeah… Not sure if I’m missing any configuration that is causing this issue. Any suggestions?

Thanks and Regards,
Abhishek
Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi,

Can you try using EMRFS? Your study looks good, best of luck.

Regards,
Gourav
RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi All,

We tried regenerating the TPC DS data on S3, and after regeneration the queries run faster: the execution time is now comparable with the execution time on HDFS with Spark 3.0.0. So maybe there was some issue in generating the TPC DS data the first time, which caused the discrepancy in query execution time on S3 with Spark 3.0.0.

Thanks and Regards,
Abhishek
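
The thread does not say what was wrong with the first data set. As one possible check before rerunning a benchmark, the sketch below summarises the Parquet file layout (file count and total size) of a dataset copy through the Hadoop FileSystem API, so that two copies can be compared side by side. The names LayoutSummary and summarizeParquetLayout and the locations in the usage comment are hypothetical; it is meant for spark-shell (paste mode), where the Hadoop classes are already on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// File count and total size of the Parquet files under one location.
final case class LayoutSummary(files: Long, totalBytes: Long) {
  def avgFileBytes: Long = if (files == 0L) 0L else totalBytes / files
}

// Hypothetical helper: recursively list a dataset location and count the
// files whose names end in ".parquet". Works for any filesystem the given
// Hadoop configuration can resolve (hdfs://, s3a://, local paths, ...).
def summarizeParquetLayout(location: String, conf: Configuration): LayoutSummary = {
  val root = new Path(location)
  val fs = root.getFileSystem(conf)
  val listing = fs.listFiles(root, true) // true = recurse into partition directories
  var files = 0L
  var bytes = 0L
  while (listing.hasNext) {
    val status = listing.next()
    if (status.getPath.getName.endsWith(".parquet")) {
      files += 1
      bytes += status.getLen
    }
  }
  LayoutSummary(files, bytes)
}

// Usage with placeholder locations (replace with the real dataset paths):
// val conf = spark.sparkContext.hadoopConfiguration
// println(summarizeParquetLayout("hdfs:///path/to/store_sales", conf))
// println(summarizeParquetLayout("s3a://some-bucket/path/to/store_sales", conf))
```

A large mismatch in file count or total size between the HDFS and S3 copies would point to a data generation problem rather than to the storage connector.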