Hi,

Can you try using EMRFS? Your study looks good, best of luck.

Regards,
Gourav
On Wed, 26 Aug 2020, 12:37 Rao, Abhishek (Nokia - IN/Bangalore), <abhishek....@nokia.com> wrote:

> Yeah… Not sure if I'm missing any configuration that is causing this
> issue. Any suggestions?
>
> Thanks and Regards,
> Abhishek
>
> From: Gourav Sengupta <gourav.sengu...@gmail.com>
> Sent: Wednesday, August 26, 2020 2:35 PM
> To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
> Cc: user@spark.apache.org
> Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
>
> Hi,
>
> In that case the results do not make sense.
>
> Regards,
> Gourav
>
> On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com> wrote:
>
> Hi Gourav,
>
> Yes, we're using s3a.
>
> Thanks and Regards,
> Abhishek
>
> From: Gourav Sengupta <gourav.sengu...@gmail.com>
> Sent: Wednesday, August 26, 2020 1:18 PM
> To: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
> Cc: user@spark.apache.org
> Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
>
> Hi,
>
> Are you using s3a, which does not use EMRFS? In that case, these results
> do not make sense to me.
>
> Regards,
> Gourav Sengupta
>
> On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com> wrote:
>
> Hi All,
>
> We're doing some performance comparisons between Spark querying data on
> HDFS and Spark querying data on S3 (a Ceph object store used for S3
> storage) with the standard TPC-DS queries. We are observing that Spark
> 3.0 with S3 takes significantly longer for a certain set of queries than
> it does with HDFS.
>
> We also ran the same queries with Spark 2.4.5 reading from S3, and for
> this set of queries Spark 2.4.5 is faster than Spark 3.0, which seems
> very strange.
> Below are the details of the 9 queries where Spark 3.0 takes more than
> 5 times as long to run against S3 as against Hadoop.
>
> Environment details:
> - Spark running on Kubernetes
> - TPC-DS scale factor: 500 GB
> - Hadoop 3.x
> - Same CPU and memory used for all executions
>
> Query | Spark 3.0 S3 (s) | Spark 3.0 Hadoop (s) | Spark 2.4.5 S3 (s) | 3.0 HDFS vs S3 (factor) | 2.4.5 S3 vs 3.0 S3 (factor) | Table involved
> ------+------------------+----------------------+--------------------+-------------------------+-----------------------------+---------------
>     9 |          880.129 |              106.109 |            147.650 |                8.294574 |                    5.960914 | store_sales
>    44 |          129.618 |               23.747 |            103.916 |                5.458289 |                    1.247334 | store_sales
>    58 |          142.113 |               20.996 |             33.936 |                6.768575 |                    4.187677 | store_sales
>    62 |           32.519 |                5.425 |             14.809 |                5.994286 |                    2.195894 | web_sales
>    76 |          138.765 |               20.730 |             49.892 |                6.693922 |                    2.781308 | store_sales
>    88 |          475.824 |               48.200 |             94.382 |                9.871867 |                    5.041470 | store_sales
>    90 |           53.896 |                6.804 |             18.110 |                7.921223 |                    2.976035 | web_sales
>    94 |          241.172 |               43.490 |             81.181 |                5.545459 |                    2.970794 | web_sales
>    96 |           67.059 |               10.396 |             15.993 |                6.450462 |                    4.193022 | store_sales
>
> When we analysed this further, we saw that all of these queries operate
> on either the store_sales or the web_sales table, and that Spark 3.0
> with S3 appears to download much more data from storage than either
> Spark 3.0 with Hadoop or Spark 2.4.5 with S3, which results in longer
> query completion times. I'm attaching screenshots of the Driver UI for
> one such instance (Query 9) for reference. The Spark configuration
> (Spark 3.0) used for these tests is also attached.
>
> We're not sure why Spark 3.0 on S3 shows this behaviour. Any inputs on
> what we might be missing?
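Not a diagnosis of the regression above, but since the symptom is far more bytes downloaded over s3a for Parquet/ORC scans, one commonly tuned knob in Hadoop 3.x's s3a client is the input read policy. A sketch for spark-defaults.conf follows; the property names are real s3a options, but the values are illustrative and should be measured against your own Ceph endpoint:

```
# Read policy: "random" suits the seek-heavy access pattern of columnar
# (Parquet/ORC) reads; the default sequential-style policy can re-open and
# re-read large ranges of an object, inflating bytes downloaded.
spark.hadoop.fs.s3a.experimental.input.fadvise  random

# How far past the requested range the client reads ahead.
spark.hadoop.fs.s3a.readahead.range             256K

# Connection/thread pool sizes for many executors hitting one endpoint.
spark.hadoop.fs.s3a.connection.maximum          200
spark.hadoop.fs.s3a.threads.max                 64
```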
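As a quick consistency check on the table above, the two "factor" columns can be recomputed from the raw timings with plain arithmetic (three representative rows copied by hand; no Spark required):

```python
# Recompute the two "factor" columns from the raw timings in the table above.
timings = {
    # query: (Spark 3.0 S3, Spark 3.0 Hadoop, Spark 2.4.5 S3), seconds
    9:  (880.129, 106.109, 147.650),
    44: (129.618, 23.747, 103.916),
    88: (475.824, 48.200, 94.382),
}

for query, (s3_30, hdfs_30, s3_245) in sorted(timings.items()):
    hdfs_vs_s3 = s3_30 / hdfs_30   # Spark 3.0: HDFS vs S3 slowdown
    v245_vs_30 = s3_30 / s3_245    # Spark 2.4.5 S3 vs Spark 3.0 S3 slowdown
    # q9 prints 8.294574 / 5.960914, matching the table
    print(f"q{query}: {hdfs_vs_s3:.6f} / {v245_vs_30:.6f}")
```

Both factors divide the Spark 3.0 S3 time by the other configuration's time, which matches every row in the table.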
> Thanks and Regards,
> Abhishek

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org