Hi Abhishek,
Just a few ideas/comments on the topic:
When benchmarking/testing I find it useful to collect a more complete view of
resource usage and Spark metrics, beyond just measuring query elapsed time.
Something like this:
https://github.com/cerndb/spark-dashboard
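If you want to start with just the built-in metrics system that the dashboard
builds on, the idea is to route Spark's metrics to a Graphite-compatible
endpoint (spark-dashboard feeds InfluxDB through its Graphite listener). A
minimal sketch; the host, port, and app name below are placeholders, not the
dashboard's actual settings:

  import org.apache.spark.sql.SparkSession

  // Sketch: send Spark metrics to a Graphite-compatible endpoint.
  // Host, port, and app name are placeholder values.
  val spark = SparkSession.builder
    .appName("tpcds-benchmark")
    .config("spark.metrics.conf.*.sink.graphite.class",
      "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "influxdb.example.com")
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")
    .config("spark.metrics.conf.*.sink.graphite.period", "10")
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
    .config("spark.metrics.appStatusSource.enabled", "true")
    .getOrCreate()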
I'd rather not use dynamic allocation when benchmarking if possible, as it adds
a layer of complexity when examining results.
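For example, pinning resources explicitly instead (the executor count here is
a placeholder):

  import org.apache.spark.sql.SparkSession

  // Sketch: disable dynamic allocation and fix the executor count,
  // so every benchmark run gets the same resources. "8" is a placeholder.
  val spark = SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "8")
    .getOrCreate()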
If you suspect that reading from S3 vs. HDFS plays an important role in the
performance you observe, you may want to drill down on that with a simple
micro-benchmark, for example something like this (for Spark 3.0):
val df = spark.read.parquet("/TPCDS/tpcds_1500/store_sales")
df.write.format("noop").mode("overwrite").save()
Best,
Luca
From: Rao, Abhishek (Nokia - IN/Bangalore)
Sent: Monday, August 24, 2020 13:50
To: user@spark.apache.org
Subject: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Hi All,
We're comparing the performance of Spark querying data on HDFS vs. Spark
querying data on S3 (Ceph Object Store used for S3 storage) using the standard
TPC-DS queries. We are observing that Spark 3.0 with S3 takes significantly
longer for some queries than it does with HDFS. We also ran the same queries
with Spark 2.4.5 against S3 and, strangely, for this set of queries Spark
2.4.5 is faster than Spark 3.0.
Below are the details of 9 queries where Spark 3.0 takes more than 5 times as
long on S3 as it does on HDFS.
Environment Details:
* Spark running on Kubernetes
* TPC DS Scale Factor: 500 GB
* Hadoop 3.x
* Same CPU and memory used for all executions
All times in seconds; the factor columns are elapsed-time ratios.

Query | 3.0 on S3 | 3.0 on HDFS | 2.4.5 on S3 | HDFS vs S3 (3.0) | 2.4.5 vs 3.0 (S3) | Table involved
    9 |   880.129 |     106.109 |     147.650 |         8.294574 |          5.960914 | store_sales
   44 |   129.618 |      23.747 |     103.916 |         5.458289 |          1.247334 | store_sales
   58 |   142.113 |      20.996 |      33.936 |         6.768575 |          4.187677 | store_sales
   62 |    32.519 |       5.425 |      14.809 |         5.994286 |          2.195894 | web_sales
   76 |   138.765 |      20.730 |      49.892 |         6.693922 |          2.781308 | store_sales
   88 |   475.824 |      48.200 |      94.382 |         9.871867 |          5.041470 | store_sales
   90 |    53.896 |       6.804 |      18.110 |         7.921223 |          2.976035 | web_sales
   94 |   241.172 |      43.490 |      81.181 |         5.545459 |          2.970794 | web_sales
   96 |    67.059 |      10.396 |      15.993 |         6.450462 |          4.193022 | store_sales
When we analysed this further, we found that all of these queries operate on
either the store_sales or the web_sales table, and that Spark 3.0 with S3
appears to download much more data from storage than Spark 3.0 with HDFS or
Spark 2.4.5 with S3, which results in longer query times. I'm attaching
screenshots of the Driver UI for one such instance (Query 9) for reference.
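For anyone who wants to reproduce this, the amount of data read from storage
can also be tracked programmatically; a minimal sketch using Spark's task
metrics (illustrative only):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
  import java.util.concurrent.atomic.AtomicLong

  // Sketch: accumulate bytes read from storage across all tasks of a query.
  val bytesRead = new AtomicLong(0L)
  spark.sparkContext.addSparkListener(new SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val metrics = taskEnd.taskMetrics
      if (metrics != null) bytesRead.addAndGet(metrics.inputMetrics.bytesRead)
    }
  })
  // ... run the query under test, then:
  println(s"Input bytes read: ${bytesRead.get}")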
Also attached are the Spark configurations (Spark 3.0) used for these tests.
We're not sure why Spark 3.0 on S3 behaves this way. Any inputs on what we
might be missing?
Thanks and Regards,
Abhishek