[ 
https://issues.apache.org/jira/browse/HIVE-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924543#comment-17924543
 ] 

Paramvir Singh commented on HIVE-28735:
---------------------------------------

My setup 

- 1 Primary  - 32 vCore, 256 GiB memory,
- 2 core nodes - 4 vCore, 16 GiB memory
- Data : 
https://drive.google.com/drive/folders/14KEaeRRlYCVhHKqohuql7kO7_HxZwdqO?usp=drive_link




Repro scenario DDLs


{code:java}

CREATE EXTERNAL TABLE `catalog_sales_small2`(
  `cs_sold_date_sk` int, 
  `cs_sold_time_sk` int, 
  `cs_ship_date_sk` int, 
  `cs_bill_customer_sk` int, 
  `cs_bill_cdemo_sk` int, 
  `cs_bill_hdemo_sk` int, 
  `cs_bill_addr_sk` int, 
  `cs_ship_customer_sk` int, 
  `cs_ship_cdemo_sk` int, 
  `cs_ship_hdemo_sk` int, 
  `cs_ship_addr_sk` int, 
  `cs_call_center_sk` int, 
  `cs_catalog_page_sk` int, 
  `cs_ship_mode_sk` int, 
  `cs_warehouse_sk` int, 
  `cs_item_sk` int, 
  `cs_promo_sk` int, 
  `cs_order_number` bigint, 
  `cs_quantity` int, 
  `cs_wholesale_cost` decimal(7,2), 
  `cs_list_price` decimal(7,2), 
  `cs_sales_price` decimal(7,2), 
  `cs_ext_discount_amt` decimal(7,2), 
  `cs_ext_sales_price` decimal(7,2), 
  `cs_ext_wholesale_cost` decimal(7,2), 
  `cs_ext_list_price` decimal(7,2), 
  `cs_ext_tax` decimal(7,2), 
  `cs_coupon_amt` decimal(7,2), 
  `cs_ext_ship_cost` decimal(7,2), 
  `cs_net_paid` decimal(7,2), 
  `cs_net_paid_inc_tax` decimal(7,2), 
  `cs_net_paid_inc_ship` decimal(7,2), 
  `cs_net_paid_inc_ship_tax` decimal(7,2), 
  `cs_net_profit` decimal(7,2))
STORED AS ORC
LOCATION
  's3://prmsingh-hive/hive4/catalog_sales_small2/'; 

CREATE EXTERNAL TABLE `customer_small`(
  `c_customer_sk` int, 
  `c_customer_id` string, 
  `c_current_cdemo_sk` int, 
  `c_current_hdemo_sk` int, 
  `c_current_addr_sk` int, 
  `c_first_shipto_date_sk` int, 
  `c_first_sales_date_sk` int, 
  `c_salutation` string, 
  `c_first_name` string, 
  `c_last_name` string, 
  `c_preferred_cust_flag` string, 
  `c_birth_day` int, 
  `c_birth_month` int, 
  `c_birth_year` int, 
  `c_birth_country` string, 
  `c_login` string, 
  `c_email_address` string, 
  `c_last_review_date` string)
STORED AS ORC
LOCATION
  's3://prmsingh-hive/hive4/customer_small/';

CREATE EXTERNAL TABLE `customer_address_small`(
  `ca_address_sk` int, 
  `ca_address_id` string, 
  `ca_street_number` string, 
  `ca_street_name` string, 
  `ca_street_type` string, 
  `ca_suite_number` string, 
  `ca_city` string, 
  `ca_county` string, 
  `ca_state` string, 
  `ca_zip` string, 
  `ca_country` string, 
  `ca_gmt_offset` decimal(5,2), 
  `ca_location_type` string)
STORED AS ORC
LOCATION
  's3://prmsingh-hive/hive4/customer_address_small/';

{code}

Then run the following query


{code:java}
select ca_zip, count(*)
from catalog_sales_small2, customer_small, customer_address_small
where cs_bill_customer_sk = c_customer_sk
and c_current_addr_sk = ca_address_sk
group by ca_zip
order by ca_zip
limit 100;
{code}

Case 1:
{code:java}
hive.auto.convert.join.noconditionaltask.size=1073741824
hive.vectorized.execution.enabled=true
hive.auto.convert.join=true
hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true
{code}

Results are wrong and random on different runs
Case 2 : 
When any of the above property is set to false. It'll give correct results 
which are as below.

{code:java}
+---------+--------+
| ca_zip  |  _c1   |
+---------+--------+
| 00601   | 550    |
| 00608   | 516    |
| 00626   | 382    |
| 00629   | 12     |
| 00649   | 40     |
| 00656   | 25     |
| 00659   | 9811   |
| 00668   | 14     |
| 00669   | 10559  |
| 00716   | 616    |
| 00725   | 561    |
| 00728   | 14     |
| 00741   | 491    |
| 00749   | 4      |
| 00750   | 9976   |
| 00756   | 4      |
| 00762   | 9764   |
| 00763   | 12     |
| 00764   | 575    |
| 00769   | 10538  |
| 00791   | 10021  |
| 00793   | 37     |
| 00794   | 569    |
| 00804   | 25     |
| 00816   | 497    |
| 00824   | 6      |
| 00838   | 18     |
| 00854   | 13     |
| 00862   | 512    |
| 00868   | 376    |
| 00878   | 18     |
| 00896   | 692    |
| 00897   | 11     |
| 00902   | 554    |
| 00907   | 46     |
| 00909   | 525    |
| 00911   | 21     |
| 00914   | 46     |
| 00917   | 13     |
| 00919   | 550    |
| 00920   | 28     |
| 00930   | 38     |
| 00936   | 29     |
| 00944   | 609    |
| 00948   | 31     |
| 00969   | 445    |
| 00981   | 7      |
| 00983   | 24     |
| 00986   | 21     |
| 00999   | 9840   |
| 01008   | 361    |
| 01011   | 11210  |
| 01014   | 11     |
| 01018   | 415    |
| 01019   | 469    |
| 01028   | 9      |
| 01044   | 32     |
| 01051   | 430    |
| 01081   | 29     |
| 01085   | 22     |
| 01092   | 544    |
| 01099   | 10428  |
| 01114   | 13     |
| 01119   | 461    |
| 01122   | 8      |
| 01125   | 535    |
| 01126   | 25     |
| 01134   | 10346  |
| 01140   | 490    |
| 01173   | 521    |
| 01180   | 23     |
| 01184   | 22     |
| 01186   | 24     |
| 01187   | 10502  |
| 01189   | 369    |
| 01218   | 625    |
| 01231   | 389    |
| 01234   | 572    |
| 01235   | 467    |
| 01236   | 41     |
| 01247   | 48     |
| 01262   | 17     |
| 01263   | 8      |
| 01279   | 18     |
| 01281   | 38     |
| 01289   | 9      |
| 01302   | 19     |
| 01325   | 466    |
| 01328   | 7      |
| 01332   | 501    |
| 01344   | 510    |
| 01357   | 4      |
| 01365   | 14     |
| 01366   | 496    |
| 01417   | 18     |
| 01451   | 14     |
| 01452   | 541    |
| 01454   | 555    |
| 01458   | 77     |
| 01470   | 27     |
+---------+--------+
{code}



> TPCDS queries q15, q19 are failing when 
> hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled  is set to 
> true
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-28735
>                 URL: https://issues.apache.org/jira/browse/HIVE-28735
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Vectorization
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Paramvir Singh
>            Priority: Major
>              Labels: hive-4.1.0-must
>
> TPCDS queries q15, q19 are failing when 
> hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled  is set to 
> true. 
> Setup should include atleast 2 node cluster.  It's passing when the cluster 
> has only 1 node. 
> The wrong result is also random(on each run I get different random wrong 
> values).
> Small repro query on TPCDS dataset
> {code:java}
> select ca_zip, count(*)
> from catalog_sales_small, customer_small, customer_address_small
> where cs_bill_customer_sk = c_customer_sk
> and c_current_addr_sk = ca_address_sk
> group by ca_zip
> order by ca_zip
> limit 100;
> {code}
> If we set the following properties, we get correct results
> {code:java}
> set hive.vectorized.execution.enabled=false; - Correct results
> {code}
> OR
> {code:java}
> set hive.auto.convert.join=false; - Correct results
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to