[ https://issues.apache.org/jira/browse/SPARK-31143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058573#comment-17058573 ]
Jungtaek Lim commented on SPARK-31143:
--------------------------------------

[~shijiezhiai] Could you please share the reason for closing this as "Not A Problem"?

> Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-31143
>                 URL: https://issues.apache.org/jira/browse/SPARK-31143
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.4.4
>        Environment: Spark 2.4.4 with a self-built Spark thrift server
>                     Hadoop 2.6.0-cdh5.7.4
>                     Hive 1.2.1
>                     Spark 1.6.2 contained in CDH-5.7.4
>            Reporter: Kevin Ma
>            Priority: Major
>
> In our company, we are migrating our ad-hoc query engine from Hive to Spark. We use the Spark thrift server, version 2.4.4. Many of our queries run well and faster than on Hive, but one complex query with multiple count(distinct) expressions runs extremely slowly on Spark compared to Hive. It is much slower even than on Spark 1.6.2.
> The query is as follows:
>
> {code:java}
> select
>     'All' as industry_name
>     ,sum(shop_cnt)/31 as shop_cnt
>     ,sum(sku_cnt)/31 as sku_cnt
>     ,sum(limit_act_sku_cnt)/31 as limit_act_sku_cnt
>     ,sum(discount_act_sku_cnt)/31 as discount_act_sku_cnt
>     ,sum(sku_mid_price)/31 as sku_mid_price
>     ,sum(sku_high_mid_price)/31 as sku_high_mid_price
> FROM
> (
>     select
>         cal_dt
>         ,approx_count_distinct(shop_id) as shop_cnt
>         ,sum(sku_cnt)/approx_count_distinct(shop_id) as sku_cnt
>         ,sum(limit_act_sku_cnt)/approx_count_distinct(shop_id) as limit_act_sku_cnt
>         ,sum(discount_act_sku_cnt)/approx_count_distinct(shop_id) as discount_act_sku_cnt
>         ,percentile(cast(sku_mid_price as bigint), 0.5) as sku_mid_price
>         ,percentile(cast(sku_high_mid_price as bigint), 0.75) as sku_high_mid_price
>     from
>     (
>         select
>             cal_dt
>             ,vender_id
>             ,shop_id
>             ,approx_count_distinct(sku_id) as sku_cnt
>             ,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
>             ,approx_count_distinct(case when is_offer_act_sku=1 then sku_id end) as discount_act_sku_cnt
>             ,percentile(cast(sku_price as bigint), 0.5) as sku_mid_price
>             ,percentile(cast(sku_price as bigint), 0.75) as sku_high_mid_price
>         from bi_dw.dw_dj_prd_shop_sku_info
>         where cal_dt = '2019-12-01'
>         group by cal_dt, vender_id, shop_id
>     ) a
>     group by cal_dt
> ) a;
> {code}
>
> The query took about 18 minutes on Spark 2.4.4, while it took only about 80 seconds on Hive 1.2.1. On Spark 1.6.2 it took about 2 to 3 minutes (run from the Spark shell, so there is no accurate timing output).
>
> While investigating this, I found https://issues.apache.org/jira/browse/SPARK-9241, which optimizes count distinct. But when I looked at the Spark 2.4.4 code, I found the related code is not there.
>
> So my question is: why was that code removed?
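For readers unfamiliar with the optimization the reporter cites: SPARK-9241 planned multiple distinct aggregates by expanding each input row once per distinct aggregate (tagged with a group id) and then aggregating in two phases, rather than running one shuffle per distinct column. A minimal sketch of that two-phase idea in plain Python, with invented rows and column names, not Spark's actual implementation:

```python
# Sketch of the "expand" rewrite for multiple COUNT(DISTINCT x) aggregates
# in one pass over the data. Rows are (cal_dt, shop_id, sku_id); data invented.
rows = [
    ("2019-12-01", "shop1", "skuA"),
    ("2019-12-01", "shop1", "skuB"),
    ("2019-12-01", "shop2", "skuA"),
]

# Phase 1 (expand): replicate each row once per distinct aggregate, tagging
# each copy with a group id saying which aggregate that copy feeds.
expanded = []
for cal_dt, shop_id, sku_id in rows:
    expanded.append((cal_dt, 1, shop_id))  # gid 1 -> count(distinct shop_id)
    expanded.append((cal_dt, 2, sku_id))   # gid 2 -> count(distinct sku_id)

# Phase 2: de-duplicate on (key, gid, value), then count rows per (key, gid).
counts = {}
for cal_dt, gid, _ in set(expanded):
    counts[(cal_dt, gid)] = counts.get((cal_dt, gid), 0) + 1

print(sorted(counts.items()))
# [(('2019-12-01', 1), 2), (('2019-12-01', 2), 2)]
# i.e. 2 distinct shops and 2 distinct skus on 2019-12-01
```

The query in this issue uses approx_count_distinct rather than exact count(distinct), so it takes a different (HyperLogLog-based) path, but the expand-with-gid rewrite is the mechanism SPARK-9241 describes.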
--
This message was sent by Atlassian Jira (v8.3.4#803005)