Hello,

We are running a few Star Schema Benchmark (SSB) queries to measure Parquet performance in Spark. We have set up the functions shown at the bottom of this message for collecting runtimes; they are a simplified version of Spark's benchmark API. The benchmarks are called with 1 warmup and 10 runs.

SSB paper: https://www.cs.umb.edu/~poneil/StarSchemaB.PDF
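For context, here is a rough sketch of the shape of our benchmark helper (names and details are illustrative; the exact code is in the screenshots further down). It only times the body passed to it:

```scala
// Simplified sketch of the benchmark helper (illustrative, not our exact code).
// Warm-up iterations run the body but are excluded from the reported average.
object BenchmarkSketch {
  def benchmark(name: String, warmups: Int, runs: Int)(body: => Unit): Double = {
    (1 to warmups).foreach(_ => body)          // warmups: executed, not timed into avg
    val times = (1 to runs).map { i =>
      val start = System.nanoTime()
      body                                     // only this call is timed
      val elapsed = (System.nanoTime() - start) / 1e9
      println(s"iteration $i $elapsed seconds")
      elapsed
    }
    val avg = times.sum / runs
    println(s"Average time taken in $name for $runs runs: $avg seconds")
    avg
  }
}
```

In the real runs the body is the spark.sql(...).collect() call, so only query execution and result collection fall inside the timed window.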
We make sure the SparkSession is closed, GC is initiated, and a new session is started before and after each run. Each iteration should therefore be a fresh run, with no cached tables or other state carried over to the next execution, which means each iteration we observed is independent (warmup times are not included in the average when a warmup is set). However, we are observing that the warmup (or initial) run is *always* slower than the rest. We are unsure what is happening in the initial run to make it slower than the others, especially since the SparkSession is recreated every time. We make sure the timer only covers the spark.sql(...).collect() call and nothing after it.

For instance, two different observations for Query 1.1:

    val q1Dot1 = "select sum(loExtendedprice*loDiscount) as revenue" +
      " from lineorder, date" +
      " where loOrderdate = dDatekey" +
      " and dYear = 1993 and loDiscount between 1 and 3" +
      " and loQuantity < 25"

1 warmup, 10 runs (warmup time not included in the average):

    iteration 1   *3.171736257* seconds
    iteration 2   3.255397683 seconds
    iteration 3   3.290669444 seconds
    iteration 4   2.947071071 seconds
    iteration 5   2.873120724 seconds
    iteration 6   4.070648529 seconds
    iteration 7   2.683136339 seconds
    iteration 8   2.704282199 seconds
    iteration 9   2.433551473 seconds
    iteration 10  2.639714696 seconds
    Average time taken in q1Dot1 for 10 runs: 3.0069328415000003 seconds

No warmup, 10 runs:

    iteration 1   *8.394320631* seconds
    iteration 2   3.323732197 seconds
    iteration 3   3.107753855 seconds
    iteration 4   2.916643549 seconds
    iteration 5   2.646391655 seconds
    iteration 6   2.938923536 seconds
    iteration 7   2.848152271 seconds
    iteration 8   2.894685696 seconds
    iteration 9   3.705443473 seconds
    iteration 10  2.777800344 seconds
    Average time taken in q1Dot1 for 10 runs: 3.5553847207 seconds

No warmup, 1 run:

    iteration 1   *8.168479413* seconds
    Average time taken in q1Dot1 for 1 runs: 8.168479413 seconds

Could someone please help us understand what Spark might be doing in the initial run, and how to optimize it? Assuming our performance measurement approach is correct, what is causing the delay on the first execution? Are there any configs we can set to help minimize this delay for the initial execution? Thank you so much!

*More information if needed:*

The way we perform our benchmarks is as follows. We have statements in main that call our *runQuery* method:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture.png
This is where we specify queries from SSB along with the number of warmups and the number of executions.

Our *runQuery* method loads the tables with spark.read.parquet, registers them as temp tables, and then calls our *benchmark* method like so:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture2.png

The *init()* and *cleanAll()* methods ensure the old SparkSession is closed and a new one is started, so each iteration should be a fresh run with no cached tables or state carried over to the next execution (warmup times are not included in the average when a warmup is set).

The benchmark function, and how time is measured:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8398/Capture3.png
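To make the per-iteration setup concrete, here is a rough sketch of one full iteration as described above (the runOnce wrapper, app name, and parquet paths are illustrative assumptions, not our exact code):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a single benchmark iteration (illustrative; real code is in the
// linked screenshots). A fresh SparkSession is built each time, so no cached
// tables or state survive between iterations.
def runOnce(query: String): Unit = {
  val spark = SparkSession.builder()
    .appName("ssb-parquet-benchmark")   // assumed name
    .getOrCreate()

  // Load the SSB tables from parquet and register them as temp views
  // (paths are placeholders for our actual data locations).
  spark.read.parquet("/data/ssb/lineorder").createOrReplaceTempView("lineorder")
  spark.read.parquet("/data/ssb/date").createOrReplaceTempView("date")

  // Only this call falls inside the timed window.
  spark.sql(query).collect()

  // Equivalent of our cleanAll()/init(): close the session and trigger GC
  // so the next iteration starts from a clean slate.
  spark.stop()
  System.gc()
}
```

The timed window wraps only the spark.sql(query).collect() line; session creation, parquet loading, and temp-view registration happen outside it.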