[ 
https://issues.apache.org/jira/browse/HIVE-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077612#comment-17077612
 ] 

Sungwoo commented on HIVE-23114:
--------------------------------

Hello [~kuczoram],

I tried this combination: HIVE-21164 (ffee30) + HIVE-22832 + 
HIVE-23114.2.patch. I tested with 100GB dataset, and the ORC database is now 
loaded okay. You can see that the size of catalog_returns is 911MB.

{code:sh}
$ hdfs dfs -du -h /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/
7.6 K    22.9 K   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/call_center
722.2 K  2.1 M    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_page
911.1 M  2.7 G    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
9.0 G    27.0 G   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_sales
73.9 M   221.8 M  
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer
12.4 M   37.3 M   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer_address
137.1 K  411.3 K  
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer_demographics
348.5 K  1.0 M    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/date_dim
1.5 K    4.4 K    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/household_demographics
881      2.6 K    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/income_band
846.0 M  2.5 G    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/inventory
13.5 M   40.5 M   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/item
30.6 K   91.9 K   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/promotion
1.4 K    4.2 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/reason
1.8 K    5.3 K    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/ship_mode
27.4 K   82.2 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store
1.4 G    4.2 G    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store_returns
13.1 G   39.3 G   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store_sales
125.1 K  375.4 K  
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/time_dim
2.9 K    8.7 K    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/warehouse
22.7 K   68.1 K   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_page
437.8 M  1.3 G    
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_returns
4.8 G    14.4 G   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_sales
6.3 K    18.8 K   
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_site
{code}

Let me report the result of running TPC-DS queries later. (There is a problem 
with my setup for testing.)

> Insert overwrite with dynamic partitioning is not working correctly with 
> direct insert
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-23114
>                 URL: https://issues.apache.org/jira/browse/HIVE-23114
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-23114.1.patch, HIVE-23114.2.patch
>
>
> This is a follow-up Jira for the 
> [conversation|https://issues.apache.org/jira/browse/HIVE-21164?focusedCommentId=17059280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17059280]
>  in HIVE-21164
>  Doing an insert overwrite from a multi-insert statement with dynamic 
> partitioning will give wrong results for ACID tables when 
> 'hive.acid.direct.insert.enabled' is true or for insert-only tables.
> Reproduction:
> {noformat}
> set hive.acid.direct.insert.enabled=true;
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.vectorized.execution.enabled=false;
> set hive.stats.autogather=false;
> create external table multiinsert_test_text (a int, b int, c int) stored as 
> textfile;
> insert into multiinsert_test_text values (1111, 11, 1111), (2222, 22, 1111), 
> (3333, 33, 2222), (4444, 44, NULL), (5555, 55, NULL);
> create table multiinsert_test_acid (a int, b int) partitioned by (c int) 
> stored as orc tblproperties('transactional'='true');
> create table multiinsert_test_mm (a int, b int) partitioned by (c int) stored 
> as orc tblproperties('transactional'='true', 
> 'transactional_properties'='insert_only');
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_acid;
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_mm;
> {noformat}
> The result of these steps can be different, it depends on the execution order 
> of the FileSinkOperators of the insert overwrite statements. It can happen 
> that an error occurs due to manifest file collision, it can happen that no 
> error occurs but the result will be incorrect.
>  Running the same insert query with an external table of with and ACID table 
> with 'hive.acid.direct.insert.enabled=false' will give the follwing result:
> {noformat}
> 1111    11      1111
> 2222    22      1111
> 3333    33      2222
> 4444    44      NULL
> 5555    55      NULL
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to