[ https://issues.apache.org/jira/browse/SPARK-31143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058573#comment-17058573 ]
Jungtaek Lim commented on SPARK-31143:
--------------------------------------

[~shijiezhiai] Could you please share the reason for closing this as "Not A Problem"?

> Spark 2.4.4 count distinct query much slower than Spark 1.6.2 and Hive 1.2.1
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-31143
>                 URL: https://issues.apache.org/jira/browse/SPARK-31143
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.4.4
>        Environment: Spark 2.4.4 with a self-built Spark thrift server
>                     Hadoop 2.6.0-cdh5.7.4
>                     Hive 1.2.1
>                     Spark 1.6.2 contained in CDH-5.7.4
>            Reporter: Kevin Ma
>            Priority: Major
>
> In our company, we are migrating our ad-hoc query engine from Hive to Spark. We use the Spark thrift server, version 2.4.4. Many of our queries run well and faster than on Hive, but one complex query with multiple count(distinct) expressions runs extremely slowly on Spark compared to Hive. It is much slower even than on Spark 1.6.2.
> The query is as follows:
>
> {code:java}
> select
>     'All' as industry_name
>     ,sum(shop_cnt)/31 as shop_cnt
>     ,sum(sku_cnt)/31 as sku_cnt
>     ,sum(limit_act_sku_cnt)/31 as limit_act_sku_cnt
>     ,sum(discount_act_sku_cnt)/31 as discount_act_sku_cnt
>     ,sum(sku_mid_price)/31 as sku_mid_price
>     ,sum(sku_high_mid_price)/31 as sku_high_mid_price
> FROM
> (
>     select
>         cal_dt
>         ,approx_count_distinct(shop_id) as shop_cnt
>         ,sum(sku_cnt)/approx_count_distinct(shop_id) as sku_cnt
>         ,sum(limit_act_sku_cnt)/approx_count_distinct(shop_id) as limit_act_sku_cnt
>         ,sum(discount_act_sku_cnt)/approx_count_distinct(shop_id) as discount_act_sku_cnt
>         ,percentile(cast(sku_mid_price as bigint), 0.5) as sku_mid_price
>         ,percentile(cast(sku_high_mid_price as bigint), 0.75) as sku_high_mid_price
>     from
>     (
>         select
>             cal_dt
>             ,vender_id
>             ,shop_id
>             ,approx_count_distinct(sku_id) as sku_cnt
>             ,approx_count_distinct(case when is_limit_grab_act_sku=1 then sku_id end) as limit_act_sku_cnt
>             ,approx_count_distinct(case when is_offer_act_sku=1 then sku_id end) as discount_act_sku_cnt
>             ,percentile(cast(sku_price as bigint), 0.5) as sku_mid_price
>             ,percentile(cast(sku_price as bigint), 0.75) as sku_high_mid_price
>         from bi_dw.dw_dj_prd_shop_sku_info
>         where cal_dt = '2019-12-01'
>         group by cal_dt, vender_id, shop_id
>     ) a
>     group by cal_dt
> ) a;
> {code}
>
> The query took about 18 minutes on Spark 2.4.4, while it took only about 80 seconds on Hive 1.2.1. On Spark 1.6.2 it took about 2 to 3 minutes (run from the Spark shell, so there is no accurate timing output).
>
> While investigating this, I found https://issues.apache.org/jira/browse/SPARK-9241, which optimizes count distinct. But when I looked at the Spark 2.4.4 code, I found the related code is not there.
>
> So my question is: why was that code removed?
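For readers unfamiliar with the optimization the reporter cites: SPARK-9241 planned multiple distinct aggregates by expanding each input row once per distinct aggregate (tagged with a group id) and then aggregating in two phases, rather than running one shuffle per distinct column. A minimal sketch of that two-phase idea in plain Python, with invented rows and column names, not Spark's actual implementation:

```python
# Sketch of the "expand" rewrite for multiple COUNT(DISTINCT x) aggregates
# in one pass over the data. Rows are (cal_dt, shop_id, sku_id); data invented.
rows = [
    ("2019-12-01", "shop1", "skuA"),
    ("2019-12-01", "shop1", "skuB"),
    ("2019-12-01", "shop2", "skuA"),
]

# Phase 1 (expand): replicate each row once per distinct aggregate, tagging
# each copy with a group id saying which aggregate that copy feeds.
expanded = []
for cal_dt, shop_id, sku_id in rows:
    expanded.append((cal_dt, 1, shop_id))  # gid 1 -> count(distinct shop_id)
    expanded.append((cal_dt, 2, sku_id))   # gid 2 -> count(distinct sku_id)

# Phase 2: de-duplicate on (key, gid, value), then count rows per (key, gid).
counts = {}
for cal_dt, gid, _ in set(expanded):
    counts[(cal_dt, gid)] = counts.get((cal_dt, gid), 0) + 1

print(sorted(counts.items()))
# [(('2019-12-01', 1), 2), (('2019-12-01', 2), 2)]
# i.e. 2 distinct shops and 2 distinct skus on 2019-12-01
```

The query in this issue uses approx_count_distinct rather than exact count(distinct), so it takes a different (HyperLogLog-based) path, but the expand-with-gid rewrite is the mechanism SPARK-9241 describes.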
--
This message was sent by Atlassian Jira (v8.3.4#803005)