Hi,

I am using data generated with spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark SQL performance (Spark on YARN, with 10 nodes), using the following code. The table store_sales has about 90 million records and is about 6 GB in size:

val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name = "store_sales"

sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE ${name}
     |USING org.apache.spark.sql.parquet
     |OPTIONS (
     |  path '${outputDir}'
     |)
   """.stripMargin)

val sql = """
  |select
  |  t1.ss_quantity,
  |  t1.ss_list_price,
  |  t1.ss_coupon_amt,
  |  t1.ss_cdemo_sk,
  |  t1.ss_item_sk,
  |  t1.ss_promo_sk,
  |  t1.ss_sold_date_sk
  |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
  |where
  |  t1.ss_sold_date_sk between 2450815 and 2451179
""".stripMargin

val df = sqlContext.sql(sql)
// Force full evaluation of the query without collecting results to the driver.
df.rdd.foreach(row => ())
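For what it's worth, the minutes I quote below are wall-clock times measured around the final action with a small helper like the one sketched here (time is my own utility, not a Spark API):

// Minimal timing helper (my own, not part of Spark), shown only to make
// clear how the wall-clock numbers below were obtained.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(s"$label took $elapsedSec seconds")
  result
}

time("store_sales self-join") {
  df.rdd.foreach(row => ())
}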
With 1.4.1, the query finishes in 6 minutes, but it takes 10+ minutes with 1.5. The configurations are basically the same, since I copied the configuration from 1.4.1 over to 1.5:

sparkVersion                                    1.4.1   1.5.0
scaleFactor                                     30      30
spark.sql.shuffle.partitions                    600     600
spark.sql.sources.partitionDiscovery.enabled    true    true
spark.default.parallelism                       200     200
spark.driver.memory                             4G      4G
spark.executor.memory                           4G      4G
spark.executor.instances                        10      10
spark.shuffle.consolidateFiles                  true    true
spark.storage.memoryFraction                    0.4     0.4
spark.executor.cores                            3       3

I am not sure what is going wrong. Any ideas?
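If it would help narrow this down, I can also dump the query plans on both versions and diff them. This just uses the standard DataFrame.explain API; the idea that the physical plan (e.g. the chosen join strategy) changed between 1.4.1 and 1.5.0 is only my guess:

// Print the parsed, analyzed, optimized, and physical plans for the query.
df.explain(true)

// The physical plan can also be captured for easier diffing across versions.
println(df.queryExecution.executedPlan)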