Thanks Davies for the explanation.
Even with the following options turned off, I still see that Spark 1.5 is much slower
than 1.4.1. I am wondering how I can configure Spark 1.5 so that it gets
performance similar to 1.4.1 for this particular query.

--conf spark.sql.planner.sortMergeJoin=false 
--conf spark.sql.tungsten.enabled=false
--conf spark.shuffle.reduceLocality.enabled=false
--conf spark.sql.planner.externalSort=false
--conf spark.sql.parquet.filterPushdown=false
--conf spark.sql.codegen=false

At 2015-09-12 01:32:15, "Davies Liu" <dav...@databricks.com> wrote:
>I had run a similar benchmark for 1.5: a self join on a fact table whose
>join key has many duplicated rows, say N rows per key, so after the join
>there are N*N rows for each join key. Generating the joined row is slower
>in 1.5 than in 1.4 (1.5 needs to copy the left and right rows together,
>1.4 does not). If the generated row is actually accessed after the join,
>there is not much difference between 1.5 and 1.4, because accessing the
>joined row is slower in 1.4 than in 1.5.
>
>So, for this particular query, 1.5 is slower than 1.4, and it will be even
>slower as you increase N. But for real workloads it will not be: 1.5 is
>usually faster than 1.4.
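>
>For illustration, a minimal sketch of that blow-up (a hypothetical tiny
>table, not the TPC-DS data):
>
>  import sqlContext.implicits._
>  // 1,000 rows with 10 distinct keys => each key appears N = 100 times per side
>  val df = sqlContext.range(0, 1000).selectExpr("id % 10 as k")
>  val joined = df.as("t1").join(df.as("t2"), $"t1.k" === $"t2.k")
>  joined.count()  // 10 keys * 100 * 100 = 100,000 joined rows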
>
>On Fri, Sep 11, 2015 at 1:31 AM, prosp4300 <prosp4...@163.com> wrote:
>>
>>
>> By the way, turning off code generation could be an option to try; sometimes
>> code generation can introduce slowness.
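>>
>> (A rough sketch of flipping it, assuming the 1.4-style SQL conf still
>> applies in 1.5:)
>>
>>   sqlContext.sql("SET spark.sql.codegen=false")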
>>
>>
>> On 2015-09-11 15:58, Cheng, Hao wrote:
>>
>> Can you confirm whether the query really runs in cluster mode, not local
>> mode? Can you print the call stack of the executor while the query is
>> running?
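>>
>> For example, on an executor node something like this would dump the stack
>> (hypothetical, assuming one executor per node; adjust to your environment):
>>
>>   jstack $(pgrep -f CoarseGrainedExecutorBackend) > executor_stack.txt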
>>
>>
>>
>> BTW: spark.shuffle.reduceLocality.enabled is a Spark configuration, not a
>> Spark SQL one.
>>
>>
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 3:39 PM
>> To: Todd
>> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
>> compared with spark 1.4.1 SQL
>>
>>
>>
>> I added the following two options:
>> spark.sql.planner.sortMergeJoin=false
>> spark.shuffle.reduceLocality.enabled=false
>>
>> But it still performs the same as without setting these two.
>>
>> One other thing: on the Spark UI, when I click the SQL tab, it shows an
>> empty page with only the header title 'SQL'; there is no table showing the
>> queries and execution plan information.
>>
>>
>>
>>
>>
>> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>>
>>
>> Thanks Hao.
>> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>>
>>
>>
>>
>> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> You mean the performance is still as slow as with the SMJ in Spark 1.5?
>>
>>
>>
>> Can you set spark.shuffle.reduceLocality.enabled=false when you start the
>> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
>> default, but we found it can cause performance to drop dramatically.
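>>
>> For example, when launching the shell:
>>
>>   bin/spark-shell --conf spark.shuffle.reduceLocality.enabled=false
>>   bin/spark-sql --conf spark.shuffle.reduceLocality.enabled=false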
>>
>>
>>
>>
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 2:17 PM
>> To: Cheng, Hao
>> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
>> spark 1.4.1 SQL
>>
>>
>>
>> Thanks Hao for the reply.
>> I turned the sort merge join off; the physical plan is below, but the
>> performance is roughly the same as with it on...
>>
>> == Physical Plan ==
>> TungstenProject 
>> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>>   TungstenExchange hashpartitioning(ss_item_sk#2)
>>    ConvertToUnsafe
>>     Scan 
>> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>>   TungstenExchange hashpartitioning(ss_item_sk#25)
>>    ConvertToUnsafe
>>     Scan 
>> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>>
>> Code Generation: true
>>
>>
>>
>>
>> At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> It is not a big surprise that the SMJ is slower than the HashJoin, as we do
>> not fully utilize the sorting yet; more details can be found at
>> https://issues.apache.org/jira/browse/SPARK-2926 .
>>
>>
>>
>> Anyway, can you disable the sort merge join with
>> “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5 and run the query
>> again? In our previous testing, sort merge join was about 20% slower. I am
>> not sure if there is anything else slowing down the performance.
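>>
>> For example, in the spark-sql shell:
>>
>>   SET spark.sql.planner.sortMergeJoin=false;
>>
>> or equivalently from code:
>>
>>   sqlContext.sql("SET spark.sql.planner.sortMergeJoin=false")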
>>
>>
>>
>> Hao
>>
>>
>>
>>
>>
>> From: Jesse F Chen [mailto:jfc...@us.ibm.com]
>> Sent: Friday, September 11, 2015 1:18 PM
>> To: Michael Armbrust
>> Cc: Todd; user@spark.apache.org
>> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
>> spark 1.4.1 SQL
>>
>>
>>
>> Could this be a build issue (i.e., sbt package)?
>>
>> If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression
>> too in queries (all other things identical)...
>>
>> I am curious: to build against 1.5 (when it isn't officially released yet),
>> what do I need to do with the build.sbt file?
>>
>> Any special parameters I should be using to make sure I load the latest Hive
>> dependencies?
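>>
>> Something like this is what I have in mind for build.sbt (a sketch, assuming
>> the 1.5.0 artifacts are in your resolver; a pre-release build would need the
>> Apache staging/snapshot repository instead):
>>
>>   scalaVersion := "2.10.4"
>>   libraryDependencies ++= Seq(
>>     "org.apache.spark" %% "spark-sql"  % "1.5.0" % "provided",
>>     "org.apache.spark" %% "spark-hive" % "1.5.0" % "provided"
>>   )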
>>
>>
>> From: Michael Armbrust <mich...@databricks.com>
>> To: Todd <bit1...@163.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Date: 09/10/2015 11:07 AM
>> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
>> spark 1.4.1 SQL
>>
>>
>>
>>
>>
>> I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, 
>> so this is surprising.  In my experiments Spark 1.5 is either the same or 
>> faster than 1.4 with only small exceptions.  A few thoughts,
>>
>>  - 600 partitions is probably way too many for 6G of data (see the sketch
>> after these points for lowering it).
>>  - Providing the output of explain for both runs would be helpful whenever
>> reporting performance changes.
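>>
>> For example (a sketch of both; the partition count here is just a guess to
>> tune from):
>>
>>   sqlContext.setConf("spark.sql.shuffle.partitions", "100")  // instead of 600
>>   sqlContext.sql(sql).explain(true)  // include this output for 1.4.1 and 1.5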
>>
>> On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:
>>
>> Hi,
>>
>> I am using data generated with spark-sql-perf
>> (https://github.com/databricks/spark-sql-perf) to test Spark SQL performance
>> (Spark on YARN, with 10 nodes) with the following code (the table
>> store_sales is about 90 million records, 6G in size):
>>
>> val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
>> val name = "store_sales"
>> sqlContext.sql(
>>   s"""
>>      |CREATE TEMPORARY TABLE ${name}
>>      |USING org.apache.spark.sql.parquet
>>      |OPTIONS (
>>      |  path '${outputDir}'
>>      |)
>>    """.stripMargin)
>>
>> val sql="""
>>          |select
>>          |  t1.ss_quantity,
>>          |  t1.ss_list_price,
>>          |  t1.ss_coupon_amt,
>>          |  t1.ss_cdemo_sk,
>>          |  t1.ss_item_sk,
>>          |  t1.ss_promo_sk,
>>          |  t1.ss_sold_date_sk
>>          |from store_sales t1 join store_sales t2
>>          |  on t1.ss_item_sk = t2.ss_item_sk
>>          |where
>>          |  t1.ss_sold_date_sk between 2450815 and 2451179
>>        """.stripMargin
>>
>> val df = sqlContext.sql(sql)
>> df.rdd.foreach(row => ())  // no-op per row, just forces full materialization
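>>
>> (If useful, a minimal sketch of how the end-to-end time is measured:)
>>
>>   val start = System.nanoTime()
>>   df.rdd.foreach(row => ())
>>   println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")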
>>
>> With 1.4.1 I can finish the query in 6 minutes, but I need 10+ minutes
>> with 1.5.
>>
>> The configurations are basically the same, since I copied them from 1.4.1
>> to 1.5:
>>
>> Configuration                                    1.4.1    1.5.0
>> scaleFactor                                      30       30
>> spark.sql.shuffle.partitions                     600      600
>> spark.sql.sources.partitionDiscovery.enabled     true     true
>> spark.default.parallelism                        200      200
>> spark.driver.memory                              4G       4G
>> spark.executor.memory                            4G       4G
>> spark.executor.instances                         10       10
>> spark.shuffle.consolidateFiles                   true     true
>> spark.storage.memoryFraction                     0.4      0.4
>> spark.executor.cores                             3        3
>>
>> I am not sure what is going wrong, any ideas?