[ https://issues.apache.org/jira/browse/SPARK-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24440. ---------------------------------- Resolution: Incomplete > When use constant as column we may get wrong answer versus impala > ----------------------------------------------------------------- > > Key: SPARK-24440 > URL: https://issues.apache.org/jira/browse/SPARK-24440 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0, 2.3.0 > Reporter: zhoukang > Priority: Major > Labels: bulk-closed > > For query below: > {code:java} > select `date`, 100 as platform, count(distinct deviceid) as new_user from > tv.clean_new_user where `date`=20180528 group by `date`, platform > {code} > We intended to group by 100 and get distinct deviceid number. > By spark sql,we get: > {code} > +-----------+-----------+-----------+--+ > | date | platform | new_user | > +-----------+-----------+-----------+--+ > | 20180528 | 100 | 521 | > | 20180528 | 100 | 82 | > | 20180528 | 100 | 3 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 870 | > | 20180528 | 100 | 3 | > | 20180528 | 100 | 8 | > | 20180528 | 100 | 3 | > | 20180528 | 100 | 2204 | > | 20180528 | 100 | 1123 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 54 | > | 20180528 | 100 | 440 | > | 20180528 | 100 | 4 | > | 20180528 | 100 | 478 | > | 20180528 | 100 | 34 | > | 20180528 | 100 | 195 | > | 20180528 | 100 | 17 | > | 20180528 | 100 | 18 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 84 | > | 20180528 | 100 | 1616 | > | 20180528 | 100 | 15 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 479 | > | 20180528 | 100 | 50 | > | 20180528 | 100 | 376 | > | 20180528 | 100 | 21 | > | 20180528 | 100 | 842 | > | 20180528 | 100 | 444 | > | 20180528 | 100 | 538 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 17 | > | 20180528 | 100 | 133 | > | 20180528 | 100 | 7 | > | 20180528 | 100 | 415 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 318 | > | 20180528 | 100 | 5 | > | 20180528 | 100 | 1 | > | 20180528 | 100 | 2060 | > | 20180528 | 100 | 1217 | > | 20180528 | 100 | 2 | > | 20180528 | 100 | 60 | > | 20180528 | 100 | 22 | > | 20180528 | 100 | 4 | > +-----------+-----------+-----------+--+ > {code} > Actually sum of the deviceid is below: > {code} > 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as > platform, count(distinct deviceid) as new_user from tv.clean_new_user where > `date`=20180528 group by `date`, platform)t1; > +----------------+--+ > | sum(new_user) | > +----------------+--+ > | 14816 | > +----------------+--+ > 1 row selected (4.934 seconds) > {code} > And the real distinct deviceid value is below: > {code} > 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as > new_user from tv.clean_new_user where `date`=20180528; > +-----------+-----------+--+ > | platform | new_user | > +-----------+-----------+--+ > | 100 | 14773 | > +-----------+-----------+--+ > 1 row selected (2.846 seconds) > {code} > In impala,with the first query we can get result below: > {code} > [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user > from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: > select `date`, 100 as platform, count(distinct deviceid) as new_user from > tv.clean_new_user where `date`=20180528 group by `date`, platform > +----------+----------+----------+ > | date | platform | new_user | > +----------+----------+----------+ > | 20180528 | 100 | 14773 | > +----------+----------+----------+ > Fetched 1 row(s) in 1.00s > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org