Hello,

We are running a few Star Schema Benchmark (SSB) queries to measure Parquet
performance in Spark. We have set up the functions shown at the bottom for
getting runtimes; this is a simplified version of Spark's benchmark API. The
benchmarks are called with 1 warmup and 10 runs. SSB paper:
https://www.cs.umb.edu/~poneil/StarSchemaB.PDF

We make sure the SparkSession is closed, GC is triggered, and a new session
is started before and after each run. Each iteration should therefore be a
fresh run, with no cached tables or other state carried over to the next
execution, which means each iteration we observed is independent (warmup
times are not included in the average when warmup is enabled).

However, we are observing that the warmup (or initial) run is *always*
slower than the rest. We are unsure what Spark is doing in the initial run
that makes it slower than the subsequent runs, especially since the
SparkSession is recreated every time. We make sure the timings only capture
the spark.sql(...).collect() call itself.

For instance, here are two different observations for Query 1.1:
val q1Dot1 = "select sum(loExtendedprice*loDiscount) as revenue" +
    " from lineorder, date" +
    " where loOrderdate = dDatekey" +
    " and dYear = 1993 and loDiscount between 1 and 3" +
    " and loQuantity < 25"

1 Warm Up 10 Runs (w/o warmup time)
iteration 1 *3.171736257* seconds 
iteration 2 3.255397683 seconds 
iteration 3 3.290669444 seconds 
iteration 4 2.947071071 seconds 
iteration 5 2.873120724 seconds 
iteration 6 4.070648529 seconds 
iteration 7 2.683136339 seconds 
iteration 8 2.704282199 seconds 
iteration 9 2.433551473 seconds 
iteration 10 2.639714696 seconds
Average time taken in q1Dot1 for 10 runs: 3.0069328415000003 seconds

No Warm Up 10 Run(s)
iteration 1 *8.394320631* seconds 
iteration 2 3.323732197 seconds 
iteration 3 3.107753855 seconds 
iteration 4 2.916643549 seconds 
iteration 5 2.646391655 seconds 
iteration 6 2.938923536 seconds 
iteration 7 2.848152271 seconds 
iteration 8 2.894685696 seconds 
iteration 9 3.705443473 seconds 
iteration 10 2.777800344 seconds 
Average time taken in q1Dot1 for 10 runs: 3.5553847207 seconds

No Warm Up 1 Run(s)
iteration 1 *8.168479413* seconds 
Average time taken in q1Dot1 for 1 runs: 8.168479413 seconds

Could someone please help us understand what Spark might be doing in the
initial run and how to optimize it? Assuming our performance measurement
approach is correct, what is causing the delay in the single runs? Are
there any configs we can set to minimize this delay for the initial
execution?

Thank you so much! 

*More information if needed:*
We perform our benchmarks as follows.

In our main method, we have some statements that call our *runQuery* method:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture.png> 

This is where we specify queries from SSB with the number of warmups and the
number of executions. Our *runQuery* method loads the tables with
spark.read.parquet, registers them as temp tables, and calls our
*benchmark* method like so:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture2.png> 
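In case the screenshot does not render, here is a minimal sketch of that
flow. The method name *runQuery* matches ours, but the table paths, the
builder options, and the helper signature are illustrative assumptions, not
our exact code:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: paths and names are hypothetical, not our actual setup.
def runQuery(query: String, name: String, warmups: Int, runs: Int): Unit = {
  val spark = SparkSession.builder().appName("ssb-bench").getOrCreate()

  // Load the SSB tables from Parquet and register them as temp views
  // so the SQL text can refer to them by name.
  spark.read.parquet("/data/ssb/lineorder").createOrReplaceTempView("lineorder")
  spark.read.parquet("/data/ssb/date").createOrReplaceTempView("date")

  benchmark(name, warmups, runs) {
    spark.sql(query).collect() // only this call is timed
  }
}
```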

The *init()* and *cleanAll* methods ensure the old SparkSession is closed
and a new one is started, so each iteration is a fresh run with no cached
tables or state carried over to the next execution (warmup times are not
included in the average, as noted above).

The benchmark function looks like this, and time is measured like so.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture3.png> 
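Again, in case the screenshot does not render, the timing harness is roughly
the following. This is a sketch with illustrative names; our real
*benchmark* method is the one in the image:

```scala
// Sketch of the timing harness: run `f` untimed `warmups` times, then
// time `runs` executions with System.nanoTime and report each iteration
// plus the average. Names are illustrative.
object BenchmarkSketch {
  def benchmark(name: String, warmups: Int, runs: Int)(f: => Unit): Seq[Double] = {
    (1 to warmups).foreach(_ => f) // warmup iterations are not timed

    val times = (1 to runs).map { i =>
      val start = System.nanoTime()
      f
      val secs = (System.nanoTime() - start) / 1e9
      println(s"iteration $i $secs seconds")
      secs
    }

    println(s"Average time taken in $name for $runs runs: ${times.sum / runs} seconds")
    times
  }
}
```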




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
