Thanks Davies for the explanation. Even with the following options turned off, I still see that Spark 1.5 is much slower than 1.4.1. I am wondering how to configure Spark 1.5 so that it performs similarly to 1.4.1 for this particular query...
--conf spark.sql.planner.sortMergeJoin=false
--conf spark.sql.tungsten.enabled=false
--conf spark.shuffle.reduceLocality.enabled=false
--conf spark.sql.planner.externalSort=false
--conf spark.sql.parquet.filterPushdown=false
--conf spark.sql.codegen=false

At 2015-09-12 01:32:15, "Davies Liu" <dav...@databricks.com> wrote:
>I ran a similar benchmark for 1.5: a self join on a fact table with a
>join key that has many duplicated rows (N rows for the same join key);
>after the join there will be N*N rows for each join key. Generating
>the joined row is slower in 1.5 than in 1.4 (it needs to copy the left
>and right rows together, which 1.4 does not). If the generated row is
>accessed after the join, there will not be much difference between 1.5
>and 1.4, because accessing the joined row is slower in 1.4 than in 1.5.
>
>So for this particular query 1.5 is slower than 1.4, and will be even
>slower as you increase N. But for real workloads it will not be; 1.5
>is usually faster than 1.4.
>
>On Fri, Sep 11, 2015 at 1:31 AM, prosp4300 <prosp4...@163.com> wrote:
>>
>> By the way, turning off code generation could be an option to try;
>> sometimes code generation can introduce slowness.
>>
>> On 2015-09-11 15:58, Cheng, Hao wrote:
>>
>> Can you confirm whether the query really runs in cluster mode, not
>> local mode? Can you print the call stack of an executor while the
>> query is running?
>>
>> BTW: spark.shuffle.reduceLocality.enabled is a configuration of Spark
>> core, not Spark SQL.
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 3:39 PM
>> To: Todd
>> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
>> compared with spark 1.4.1 SQL
>>
>> I added the following two options:
>> spark.sql.planner.sortMergeJoin=false
>> spark.shuffle.reduceLocality.enabled=false
>>
>> But it still performs the same as without setting them.
>>
>> One other thing: on the Spark UI, when I click the SQL tab, it shows
>> an empty page with only the header title 'SQL'; there is no table
>> showing queries and execution plan information.
>>
>> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>>
>> Thanks Hao.
>> Yes, it is still slow with SMJ. Let me try the options you suggested.
>>
>> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> You mean the performance is still as slow as with the SMJ in Spark 1.5?
>>
>> Can you set spark.shuffle.reduceLocality.enabled=false when you start
>> spark-shell/spark-sql? It's a new feature in Spark 1.5 and it's true
>> by default, but we found it can reduce performance dramatically.
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 2:17 PM
>> To: Cheng, Hao
>> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared
>> with spark 1.4.1 SQL
>>
>> Thanks Hao for the reply.
>> I turned sort merge join off; the physical plan is below, but the
>> performance is roughly the same as with it on...
>>
>> == Physical Plan ==
>> TungstenProject [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>>   TungstenExchange hashpartitioning(ss_item_sk#2)
>>    ConvertToUnsafe
>>     Scan ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>>   TungstenExchange hashpartitioning(ss_item_sk#25)
>>    ConvertToUnsafe
>>     Scan ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>>
>> Code Generation: true
>>
>> At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> It is not a big surprise that the SMJ is slower than the HashJoin, as
>> we do not fully utilize the sorting yet; more details can be found at
>> https://issues.apache.org/jira/browse/SPARK-2926 .
>>
>> Anyway, can you disable the sort merge join with
>> "spark.sql.planner.sortMergeJoin=false" in Spark 1.5 and run the query
>> again? In our previous testing it was about 20% slower with sort merge
>> join. I am not sure whether anything else is slowing down the
>> performance.
>>
>> Hao
>>
>> From: Jesse F Chen [mailto:jfc...@us.ibm.com]
>> Sent: Friday, September 11, 2015 1:18 PM
>> To: Michael Armbrust
>> Cc: Todd; user@spark.apache.org
>> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared
>> with spark 1.4.1 SQL
>>
>> Could this be a build issue (i.e., sbt package)?
>>
>> If I run the same jar built for 1.4.1 on 1.5, I see a large regression
>> too in queries (all other things identical)...
>>
>> I am curious: to build 1.5 (when it isn't released yet), what do I
>> need to do with the build.sbt file? Are there any special parameters I
>> should be using to make sure I load the latest Hive dependencies?
>>
>> From: Michael Armbrust <mich...@databricks.com>
>> To: Todd <bit1...@163.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Date: 09/10/2015 11:07 AM
>> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared
>> with spark 1.4.1 SQL
>>
>> I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on
>> S3, so this is surprising. In my experiments Spark 1.5 is either the
>> same as or faster than 1.4, with only small exceptions. A few thoughts:
>>
>> - 600 partitions is probably way too many for 6G of data.
>> - Providing the output of explain for both runs would be helpful
>>   whenever reporting performance changes.
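For reference, a minimal sketch of acting on Michael's two suggestions from a spark-shell session (assuming the shell's pre-defined sqlContext; the partition count 120 is only an illustrative value, not one recommended in this thread):

    // Lower the shuffle partition count for the ~6G input; 600 is likely too many.
    sqlContext.setConf("spark.sql.shuffle.partitions", "120")

    // Print the extended plan (logical and physical) for the query, so the
    // 1.4.1 and 1.5 plans can be captured and compared side by side.
    val df = sqlContext.sql(sql)  // `sql` is the query string from the original post below
    df.explain(true)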
>> On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:
>>
>> Hi,
>>
>> I am using data generated with spark-sql-perf
>> (https://github.com/databricks/spark-sql-perf) to test Spark SQL
>> performance (Spark on YARN, with 10 nodes) with the following code
>> (the table store_sales is about 90 million records, 6G in size):
>>
>> val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
>> val name = "store_sales"
>> sqlContext.sql(
>>   s"""
>>     |CREATE TEMPORARY TABLE ${name}
>>     |USING org.apache.spark.sql.parquet
>>     |OPTIONS (
>>     |  path '${outputDir}'
>>     |)
>>   """.stripMargin)
>>
>> val sql = """
>>     |select
>>     |  t1.ss_quantity,
>>     |  t1.ss_list_price,
>>     |  t1.ss_coupon_amt,
>>     |  t1.ss_cdemo_sk,
>>     |  t1.ss_item_sk,
>>     |  t1.ss_promo_sk,
>>     |  t1.ss_sold_date_sk
>>     |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
>>     |where
>>     |  t1.ss_sold_date_sk between 2450815 and 2451179
>>   """.stripMargin
>>
>> val df = sqlContext.sql(sql)
>> df.rdd.foreach(row => Unit)
>>
>> With 1.4.1 I can finish the query in 6 minutes, but I need 10+ minutes
>> with 1.5.
>>
>> The configuration is basically the same, since I copied it from 1.4.1
>> to 1.5:
>>
>> sparkVersion                                  1.4.1  1.5.0
>> scaleFactor                                   30     30
>> spark.sql.shuffle.partitions                  600    600
>> spark.sql.sources.partitionDiscovery.enabled  true   true
>> spark.default.parallelism                     200    200
>> spark.driver.memory                           4G     4G
>> spark.executor.memory                         4G     4G
>> spark.executor.instances                      10     10
>> spark.shuffle.consolidateFiles                true   true
>> spark.storage.memoryFraction                  0.4    0.4
>> spark.executor.cores                          3      3
>>
>> I am not sure what is going wrong, any ideas?
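To see the effect Davies describes at the top of the thread (N duplicate rows per join key producing N*N joined rows per key), here is a minimal self-join sketch for a spark-shell session. This is a hypothetical toy table, not the benchmark Davies actually ran; sc and sqlContext are the shell's pre-defined contexts, and the key count, N, and column names are only illustrative:

    import sqlContext.implicits._

    // Build a small fact table in which every join key appears N times.
    val N = 100
    val fact = sc.parallelize(for (k <- 1 to 1000; i <- 1 to N) yield (k, i))
                 .toDF("key", "value")
    fact.registerTempTable("fact")

    // Self join on the duplicated key: each of the 1000 keys produces
    // N*N = 10,000 joined rows, so generating joined rows dominates the cost.
    val joined = sqlContext.sql(
      "select t1.key, t1.value from fact t1 join fact t2 on t1.key = t2.key")

    // Materialize the result without accessing the joined rows afterwards,
    // the case where 1.5 was reported slower than 1.4.1. Increasing N should
    // widen the gap, per Davies' explanation.
    joined.rdd.foreach(row => Unit)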