[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060838#comment-14060838 ]

Ashutosh Chauhan commented on HIVE-7400:

[~darranl] If you can upload a small dataset that reproduces this, that would be great.

count and count distinct not correct

Key: HIVE-7400
URL: https://issues.apache.org/jira/browse/HIVE-7400
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.11.0
Reporter: Danran Lai

I have a table in Hive and I want to count unique records and all records. The table looks like this:
{quote}
sid string
param map<string,string>
domain string
product string
{quote}
And my query is:
{quote}
select domain,product,count(1) as num,count(distinct param['from']) as user_num from table group by domain,product
{quote}
But the results are not correct. I get the right user_num, but num is wrong: it is lower than the real count. The real count is about 30 million, but I only get 9 million. How can I fix this so that I get the correct result?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061654#comment-14061654 ]

Gopal V commented on HIVE-7400:

Never mind, laggy JIRA updates. I see the file now.

count and count distinct not correct

Key: HIVE-7400
URL: https://issues.apache.org/jira/browse/HIVE-7400
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.11.0
Reporter: Danran Lai
Attachments: data_15

I have a table in Hive and I want to count unique records and all records. The table looks like this:
{quote}
sid string
sender string
domain string
product string
{quote}
And my query is:
{quote}
select domain,product,count(1) as num,count(distinct sender) as user_num from table group by domain,product
{quote}
But the results are not correct. I get the right user_num, but num is wrong: it is lower than the real count. The real count is about 30 million, but I only get 9 million. How can I fix this so that I get the correct result?

==Updated==
The dataset is uploaded. Row format: delimited, fields terminated by '\t'. The dataset has 150,000 rows. With this query, I got the following result:
bq. domain1 product136424 36424
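For anyone who wants to cross-check the expected numbers outside Hive, here is a minimal sketch in Python. It is not part of the original report. It assumes the attached file is tab-delimited with the columns in the order given in the table description (sid, sender, domain, product), and that NULL senders are stored as `\N`, Hive's default NULL marker for text tables. The function name `count_rows_and_distinct_senders` is hypothetical.

```python
import csv
from collections import defaultdict


def count_rows_and_distinct_senders(lines):
    """Return {(domain, product): (num, user_num)} for tab-delimited rows.

    Assumed column order (taken from the table description in the report):
    sid, sender, domain, product.
    """
    totals = defaultdict(int)    # plays the role of count(1) per group
    senders = defaultdict(set)   # plays the role of count(distinct sender)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        _sid, sender, domain, product = row[:4]
        key = (domain, product)
        totals[key] += 1
        # Assumption: "\N" marks NULL, which count(distinct ...) ignores.
        if sender != r"\N":
            senders[key].add(sender)
    return {key: (totals[key], len(senders[key])) for key in totals}
```

Calling it on an open file handle, for example `count_rows_and_distinct_senders(open("data_15", newline=""))`, gives one `(num, user_num)` pair per `(domain, product)` group. For a 150,000-row file the `num` values should sum to 150,000, provided every line has four fields.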
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061651#comment-14061651 ]

Gopal V commented on HIVE-7400:

Where is the data-set? Can you attach it to the JIRA?
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061662#comment-14061662 ]

Gopal V commented on HIVE-7400:

With Tez enabled, I cannot reproduce this.
{code}
Status: Finished successfully
OK
domain1 product115 36424
Time taken: 6.389 seconds, Fetched: 1 row(s)
hive>
{code}
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061666#comment-14061666 ]

Danran Lai commented on HIVE-7400:

~Gopal V Which version of Hive are you using? In my own Hive environment, I only get the wrong result. By the way, what is Tez?
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061672#comment-14061672 ]

Gopal V commented on HIVE-7400:

[~darranl]: I am using hive-14, which is the only branch in development at the moment.

Tez is the new, faster execution engine for Hive on Hadoop-2 clusters: http://www.slideshare.net/t3rmin4t0r/hivetez-a-performance-deep-dive/3
[jira] [Commented] (HIVE-7400) count and count distinct not correct
[ https://issues.apache.org/jira/browse/HIVE-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061677#comment-14061677 ]

Danran Lai commented on HIVE-7400:

~gopalv : Thank you, that clears some things up. But I'm using hive-11, and for a variety of reasons it is not easy for me to upgrade from hive-11 to hive-14. Do you know whether there is a patch for hive-11 that fixes this bug?