Kevin Ma created SPARK-31143:
---------------------------------

             Summary: Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1
                 Key: SPARK-31143
                 URL: https://issues.apache.org/jira/browse/SPARK-31143
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.4.4
         Environment: Spark 2.4.4 with self-built Spark Thrift Server
Hadoop 2.6.0-cdh5.7.4
Hive 1.2.1
Spark 1.6.2 contained in CDH-5.7.4
            Reporter: Kevin Ma


In our company, we are migrating our ad-hoc query engine from Hive to Spark. We use Spark Thrift Server, version 2.4.4. Many of our queries run well and are faster than on Hive. But we have a complex query with multiple count(distinct) expressions, and this query runs extremely slowly on Spark compared to Hive. It is much slower even compared to Spark 1.6.2. The query is as follows:

{code:java}
select 'All' as industry_name
      ,sum(shop_cnt)/31             as shop_cnt
      ,sum(sku_cnt)/31              as sku_cnt
      ,sum(limit_act_sku_cnt)/31    as limit_act_sku_cnt
      ,sum(discount_act_sku_cnt)/31 as discount_act_sku_cnt
      ,sum(sku_mid_price)/31        as sku_mid_price
      ,sum(sku_high_mid_price)/31   as sku_high_mid_price
from
(
    select cal_dt
          ,approx_count_distinct(shop_id)                           as shop_cnt
          ,sum(sku_cnt)/approx_count_distinct(shop_id)              as sku_cnt
          ,sum(limit_act_sku_cnt)/approx_count_distinct(shop_id)    as limit_act_sku_cnt
          ,sum(discount_act_sku_cnt)/approx_count_distinct(shop_id) as discount_act_sku_cnt
          ,percentile(cast(sku_mid_price as bigint), 0.5)           as sku_mid_price
          ,percentile(cast(sku_high_mid_price as bigint), 0.75)     as sku_high_mid_price
    from
    (
        select cal_dt
              ,vender_id
              ,shop_id
              ,approx_count_distinct(sku_id)                                            as sku_cnt
              ,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
              ,approx_count_distinct(case when is_offer_act_sku=1 then sku_id end)      as discount_act_sku_cnt
              ,percentile(cast(sku_price as bigint), 0.5)                               as sku_mid_price
              ,percentile(cast(sku_price as bigint), 0.75)                              as sku_high_mid_price
        from bi_dw.dw_dj_prd_shop_sku_info
        where cal_dt = '2019-12-01'
        group by cal_dt, vender_id, shop_id
    ) a
    group by cal_dt
) a;
{code}

The query took about 18 minutes on Spark 2.4.4, while it took only about 80 seconds on Hive 1.2.1. On Spark 1.6.2 it took only about 2 to 3 minutes (run in the Spark shell, so there is no accurate time-taken output). While investigating this, I found SPARK-9241 (https://issues.apache.org/jira/browse/SPARK-9241), which optimizes count distinct. But when I looked at the code of Spark 2.4.4, I found the related code is not there. So my question is: why was the code removed?
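To make the comparison easier to reproduce, here is a minimal sketch (not part of the original report; it assumes spark-shell on 2.4.4, with the table and column names taken from the query above) that times a cut-down version of the inner aggregation and prints its plans:

{code:scala}
// Run in spark-shell on Spark 2.4.4. Illustrative sketch only: it times a
// cut-down version of the inner aggregation and prints its plans so the
// aggregate execution strategy is visible.
val q = """
  select cal_dt, vender_id, shop_id,
         approx_count_distinct(sku_id)              as sku_cnt,
         percentile(cast(sku_price as bigint), 0.5) as sku_mid_price
  from bi_dw.dw_dj_prd_shop_sku_info
  where cal_dt = '2019-12-01'
  group by cal_dt, vender_id, shop_id
"""

// SparkSession.time prints the wall-clock time of the enclosed action.
spark.time(spark.sql(q).count())

// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the physical plan, showing how the aggregates will be executed.
spark.sql(q).explain(true)
{code}

Running the same statement in the 1.6.2 shell (where the entry point is sqlContext.sql rather than spark.sql) and diffing the two physical plans should show where the extra work in 2.4.4 comes from.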