zhoukang created SPARK-24440: -------------------------------- Summary: When use constant as column we may get wrong answer versus impala Key: SPARK-24440 URL: https://issues.apache.org/jira/browse/SPARK-24440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.1.0 Reporter: zhoukang
For query below: {code:java} select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform {code} We intended to group by 100 and get distinct deviceid number. By spark sql,we get: {code} +-----------+-----------+-----------+--+ | date | platform | new_user | +-----------+-----------+-----------+--+ | 20180528 | 100 | 521 | | 20180528 | 100 | 82 | | 20180528 | 100 | 3 | | 20180528 | 100 | 2 | | 20180528 | 100 | 7 | | 20180528 | 100 | 870 | | 20180528 | 100 | 3 | | 20180528 | 100 | 8 | | 20180528 | 100 | 3 | | 20180528 | 100 | 2204 | | 20180528 | 100 | 1123 | | 20180528 | 100 | 1 | | 20180528 | 100 | 54 | | 20180528 | 100 | 440 | | 20180528 | 100 | 4 | | 20180528 | 100 | 478 | | 20180528 | 100 | 34 | | 20180528 | 100 | 195 | | 20180528 | 100 | 17 | | 20180528 | 100 | 18 | | 20180528 | 100 | 2 | | 20180528 | 100 | 2 | | 20180528 | 100 | 84 | | 20180528 | 100 | 1616 | | 20180528 | 100 | 15 | | 20180528 | 100 | 7 | | 20180528 | 100 | 479 | | 20180528 | 100 | 50 | | 20180528 | 100 | 376 | | 20180528 | 100 | 21 | | 20180528 | 100 | 842 | | 20180528 | 100 | 444 | | 20180528 | 100 | 538 | | 20180528 | 100 | 1 | | 20180528 | 100 | 2 | | 20180528 | 100 | 7 | | 20180528 | 100 | 17 | | 20180528 | 100 | 133 | | 20180528 | 100 | 7 | | 20180528 | 100 | 415 | | 20180528 | 100 | 2 | | 20180528 | 100 | 318 | | 20180528 | 100 | 5 | | 20180528 | 100 | 1 | | 20180528 | 100 | 2060 | | 20180528 | 100 | 1217 | | 20180528 | 100 | 2 | | 20180528 | 100 | 60 | | 20180528 | 100 | 22 | | 20180528 | 100 | 4 | +-----------+-----------+-----------+--+ {code} Actually sum of the deviceid is below: {code} 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform)t1; +----------------+--+ | sum(new_user) | +----------------+--+ | 14816 | +----------------+--+ 1 row selected (4.934 seconds) {code} And the real distinct deviceid value is below: {code} 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528; +-----------+-----------+--+ | platform | new_user | +-----------+-----------+--+ | 100 | 14773 | +-----------+-----------+--+ 1 row selected (2.846 seconds) {code} In impala,with the first query we can get result below: {code} [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: select `date`, 100 as platform, count(distinct deviceid) as new_user from tv.clean_new_user where `date`=20180528 group by `date`, platform +----------+----------+----------+ | date | platform | new_user | +----------+----------+----------+ | 20180528 | 100 | 14773 | +----------+----------+----------+ Fetched 1 row(s) in 1.00s {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org