[GitHub] [doris] dutyu opened a new issue, #21960: [Bug] Hive catalog query result not right

via GitHub Tue, 18 Jul 2023 21:03:33 -0700


dutyu opened a new issue, #21960:
URL: https://github.com/apache/doris/issues/21960


   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Version
   
   2.0-beta
   
   ### What's Wrong?
   
   SQL of Query 1: 
   ```
   select 
     count(*), 
     count(distinct user_no) as "no_reuse_distinct_user_total" 
   from 
     (
       SELECT 
         `id` AS `id`, 
         `intf_type` AS `intf_type`, 
         `actual_intf_type` AS `actual_intf_type`, 
          ...
         `user_no` AS `user_no`, 
         `reuse_flag` AS `reuse_flag`, 
           ...
         `partitions` AS `partitions` 
       FROM 
         (
           select 
             `crs_query_cr_query_info_partition`.`id`, 
             `crs_query_cr_query_info_partition`.`actual_intf_type`, 
              ...
             `crs_query_cr_query_info_partition`.`user_no`, 
             `crs_query_cr_query_info_partition`.`reuse_flag`, 
              ...
             `crs_query_cr_query_info_partition`.`partitions` 
           from 
             `ods_safe`.`crs_query_cr_query_info_partition`
         ) `crs_query_cr_query_info_partition`
     ) t 
   WHERE 
     t.partitions in (
       DATE_FORMAT(
         DATE_SUB(NOW(), INTERVAL 1 DAY), 
         'yyyy-MM-dd'
       )
     ) 
     and t.actual_intf_type = 'FuZhouPbocScore' 
     and (
       t.reuse_flag is null 
       or t.reuse_flag <> 'Y'
     );
   ```
   
   Query 1 Result: 
   ```
   +----------+------------------------------+
   | count(*) | no_reuse_distinct_user_total |
   +----------+------------------------------+
   |       23 |                           23 |
   +----------+------------------------------+
   1 row in set (1.89 sec)
   ```
   
   SQL of Query 2:
   ```
   select 
     count(*), 
     count(distinct user_no) as "no_reuse_distinct_user_total" 
   from 
     (
       SELECT 
         `id` AS `id`, 
         `intf_type` AS `intf_type`, 
         `actual_intf_type` AS `actual_intf_type`, 
          ...
         `user_no` AS `user_no`, 
         `reuse_flag` AS `reuse_flag`, 
           ...
         `partitions` AS `partitions` 
       FROM 
         (
           select 
             `crs_query_cr_query_info_partition`.`id`, 
             `crs_query_cr_query_info_partition`.`actual_intf_type`, 
              ...
             `crs_query_cr_query_info_partition`.`user_no`, 
             `crs_query_cr_query_info_partition`.`reuse_flag`, 
              ...
             `crs_query_cr_query_info_partition`.`partitions` 
           from 
             `ods_safe`.`crs_query_cr_query_info_partition`
         ) `crs_query_cr_query_info_partition`
     ) t 
   WHERE 
     t.partitions in (
       DATE_FORMAT(
         DATE_SUB(NOW(), INTERVAL 1 DAY), 
         'yyyy-MM-dd'
       )
     ) 
     and t.actual_intf_type = 'FuZhouPbocScore' 
     and t.reuse_flag is null;
   ``` 
   
   Result of Query 2:
   ```
   +----------+------------------------------+
   | count(*) | no_reuse_distinct_user_total |
   +----------+------------------------------+
   |   123084 |                       119912 |
   +----------+------------------------------+
   1 row in set (2.03 sec)
   ```
   
   Table ddl: 
   ```
   CREATE TABLE `crs_query_cr_query_info_partition`(
     `id` bigint COMMENT 'physical primary key', 
     `actual_intf_type` string COMMENT '', 
     `user_no` string COMMENT '', 
      ...
     `reuse_flag` string COMMENT '', 
      ...
   ) PARTITIONED BY (`partitions` string) 
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
   LOCATION 'hdfs://xxx/ods_safe.db/crs_query_cr_query_info_partition' 
   TBLPROPERTIES (
     'last_modified_time' = '1688024841', 
     'spark.sql.sources.schema.numPartCols' = '1', 
     'spark.sql.sources.schema.part.0' = '...', 
     'spark.sql.sources.schema.partCol.0' = 'partitions', 
     'transient_lastDdlTime' = '1688025415', 
     'bucketing_version' = '2', 'last_modified_by' = 'hive', 
     'spark.sql.sources.schema.numParts' = '1', 
     'spark.sql.create.version' = '2.2 or prior'
   );
   ```
   
   
   ### What You Expected?
   
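   The count of Query 1 should be equal to or greater than the count of Query 2, because every row matching Query 2's filter (`reuse_flag is null`) also matches Query 1's filter (`reuse_flag is null or reuse_flag <> 'Y'`). Instead, Query 1 returns 23 rows while Query 2 returns 123084.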
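   
   One way to narrow this down, sketched here as an assumption rather than something taken from the original report, is to group the same filtered rows by `reuse_flag`. The row count of the `NULL` group should match the `count(*)` of Query 2, and the `NULL` group plus every group other than `'Y'` should add up to the `count(*)` of Query 1. The alias `row_total` is hypothetical.
   
   ```sql
   -- Hypothetical diagnostic, not part of the original report:
   -- same partition and actual_intf_type filters as Query 1 and Query 2,
   -- but with the reuse_flag predicate removed and replaced by a GROUP BY.
   select 
     reuse_flag, 
     count(*) as row_total 
   from 
     `ods_safe`.`crs_query_cr_query_info_partition` 
   where 
     partitions in (
       DATE_FORMAT(
         DATE_SUB(NOW(), INTERVAL 1 DAY), 
         'yyyy-MM-dd'
       )
     ) 
     and actual_intf_type = 'FuZhouPbocScore' 
   group by 
     reuse_flag;
   ```
   
   If the `NULL` group here is close to 123084 while Query 1 still returns 23, the `is null` branch of the `or` predicate is probably being lost when the filter is evaluated against the Hive table. Running the same statement in Hive or Spark would give a baseline to compare against.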
   
   ### How to Reproduce?
   
   _No response_
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

