Hi All,

We regenerated the TPC DS data on S3, and after regeneration the queries run faster: the execution time is now comparable with the execution time on HDFS with Spark 3.0.0. So maybe there was some issue in generating the TPC DS data the first time, which caused the discrepancy we were seeing in query execution time on S3 with Spark 3.0.0.
Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Can you try using EMRFS? Your study looks good; best of luck.

Regards,
Gourav

On Wed, 26 Aug 2020, 12:37 Rao, Abhishek (Nokia - IN/Bangalore), <abhishek....@nokia.com> wrote:

Yeah… Not sure if I’m missing any configuration that is causing this issue. Any suggestions?

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

So the results do not make sense.

Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com> wrote:

Hi Gourav,

Yes. We’re using s3a.

Thanks and Regards,
Abhishek

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
Cc: user@spark.apache.org
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Hi,

Are you using s3a, which does not use EMRFS? In that case, these results do not make sense to me.
Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com> wrote:

Hi All,

We’re doing some performance comparisons between Spark querying data on HDFS and Spark querying data on S3 (Ceph Object Store used for S3 storage) using standard TPC DS queries. We are observing that Spark 3.0 with S3 takes significantly longer than Spark 3.0 with HDFS for some queries. We also ran the same queries with Spark 2.4.5 against S3, and for this set of queries Spark 2.4.5 takes less time than Spark 3.0, which looks very strange. Below are the details of the 9 queries where Spark 3.0 takes more than 5 times as long on S3 as on HDFS.

Environment Details:
* Spark running on Kubernetes
* TPC DS Scale Factor: 500 GB
* Hadoop 3.x
* Same CPU and memory used for all executions

Times are in seconds. "S3/HDFS factor" is the Spark 3.0 S3 time divided by the Spark 3.0 HDFS time; "3.0/2.4.5 factor" is the Spark 3.0 S3 time divided by the Spark 2.4.5 S3 time.

Query  Spark 3.0 S3 (s)  Spark 3.0 HDFS (s)  Spark 2.4.5 S3 (s)  S3/HDFS factor  3.0/2.4.5 factor  Table involved
-----  ----------------  ------------------  ------------------  --------------  ----------------  --------------
9      880.129           106.109             147.65              8.294574        5.960914          store_sales
44     129.618           23.747              103.916             5.458289        1.247334          store_sales
58     142.113           20.996              33.936              6.768575        4.187677          store_sales
62     32.519            5.425               14.809              5.994286        2.195894          web_sales
76     138.765           20.73               49.892              6.693922        2.781308          store_sales
88     475.824           48.2                94.382              9.871867        5.04147           store_sales
90     53.896            6.804               18.11               7.921223        2.976035          web_sales
94     241.172           43.49               81.181              5.545459        2.970794          web_sales
96     67.059            10.396              15.993              6.450462        4.193022          store_sales

When we analysed it further, we saw that all these queries operate on either the store_sales or the web_sales table, and that Spark 3 with S3 seems to download much more data from storage than Spark 3 with HDFS or Spark 2.4.5 with S3, which results in longer query completion times.
I’m attaching screenshots of the Driver UI for one such instance (Query 9) for reference. Also attached are the Spark configurations (Spark 3.0) used for these tests. We’re not sure why Spark 3.0 on S3 behaves this way. Any inputs on what we’re missing?

Thanks and Regards,
Abhishek
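
For anyone who wants to re-check the two factor columns in the table above, here is a minimal sketch that recomputes them from the reported timings. The names `TIMINGS` and `slowdown_factors` are hypothetical and only used for illustration. The sketch assumes each factor is simply the Spark 3.0 S3 time divided by the baseline time, which matches the reported values.

```python
# Timings in seconds, copied from the table in the original message.
# query -> (Spark 3.0 on S3, Spark 3.0 on HDFS, Spark 2.4.5 on S3)
TIMINGS = {
    9: (880.129, 106.109, 147.65),
    44: (129.618, 23.747, 103.916),
    58: (142.113, 20.996, 33.936),
    62: (32.519, 5.425, 14.809),
    76: (138.765, 20.73, 49.892),
    88: (475.824, 48.2, 94.382),
    90: (53.896, 6.804, 18.11),
    94: (241.172, 43.49, 81.181),
    96: (67.059, 10.396, 15.993),
}


def slowdown_factors(s3_v3, hdfs_v3, s3_v245):
    """Return (S3 vs HDFS factor, Spark 3.0 vs Spark 2.4.5 factor).

    Assumption: both factors use the Spark 3.0 S3 time as the numerator.
    """
    return s3_v3 / hdfs_v3, s3_v3 / s3_v245


if __name__ == "__main__":
    print("query  s3/hdfs  3.0/2.4.5")
    for query, times in sorted(TIMINGS.items()):
        hdfs_factor, version_factor = slowdown_factors(*times)
        print(f"{query:>5}  {hdfs_factor:7.3f}  {version_factor:9.3f}")
```

Keeping the raw timings in one place makes it easy to rerun the comparison after the data is regenerated and see whether the S3 slowdown disappears.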