[jira] [Commented] (HIVE-23114) Insert overwrite with dynamic partitioning is not working correctly with direct insert

Sungwoo (Jira) Tue, 31 Mar 2020 14:24:23 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072165#comment-17072165
 ]


Sungwoo commented on HIVE-23114:
--------------------------------

I tried 1.patch with Hive 4 on Tez with these commits:

- Hive 4 commit: ffee30e6267e85f00a22767262192abb9681cfb7 (HIVE-21164, Fri Feb 
21)
- Tez commit: fd19ce6c93bc1f899ccca7161b0c0407f850bd77 (TEZ-4123, Wed Feb 12)

I tried the above script for reproduction with 
hive.acid.direct.insert.enabled=true, and obtained the correct result. So, I 
think building Hive on Tez was okay.

The scenario in my testing is:

1. generate TPC-DS data of 100GB in text format (on HDFS)
2. Create external tables from the TPC-DS data
3. Create ORC tables from the external tables

The TPC-DS data has 24 tables, and the result is that all ORC tables are 
properly created, except catalog_returns. The catalog_returns table has about 
2.1GB of data:
{code:sh}
gitlab-runner@indigo1:hive-logs$ hdfs dfs -du -h /tmp/tpcds-generate/100/
...
2.1 G     2.1 G     /tmp/tpcds-generate/100/catalog_returns
{code}

The corresponding ORC table has less than 1MB of data (after generating 4GB of 
intermediate data).
{code:sh}
200.7 K  602.0 K  
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
{code}

>From the log of HiveServer2, the intermediate data is deleted for 
>catalog_returns, following the same pattern reported earlier for HIVE-21164 
>(where FileSinkOperator.jobCloseOp() is called twice and the second call 
>deletes all the intermediate data).
{code:sh}
20/03/31 20:22:02 INFO FileOperations: Reading manifest 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1/000000_0.manifest
...
20/03/31 20:22:02 INFO FileOperations: Reading manifest 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1/000041_0.manifest
20/03/31 20:22:02 INFO FileOperations: Looking for files in: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
20/03/31 20:22:03 INFO FileOperations: Looking at path: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001
...
20/03/31 20:22:03 INFO FileOperations: Looking at path: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001
20/03/31 20:22:03 INFO FileOperations: Deleting manifest directory 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1
20/03/31 20:22:04 INFO FileOperations: No manifests found - query produced no 
output
20/03/31 20:22:04 INFO FileOperations: Looking for files in: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
20/03/31 20:22:05 INFO FileOperations: Looking at path: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001
...
20/03/31 20:22:05 INFO FileOperations: Looking at path: 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001
20/03/31 20:22:05 INFO FileOperations: Deleting 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001/bucket_00013_0
 that was not committed
...
20/03/31 20:22:07 INFO FileOperations: Deleting 
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001/bucket_00017_0
 that was not committed
{code}

The queries for creating and loading the table are:
{code:sql}
create table catalog_returns
(
    cr_returned_time_sk       bigint,
    cr_item_sk                bigint,
    cr_refunded_customer_sk   bigint,
    cr_refunded_cdemo_sk      bigint,
    cr_refunded_hdemo_sk      bigint,
    cr_refunded_addr_sk       bigint,
    cr_returning_customer_sk  bigint,
    cr_returning_cdemo_sk     bigint,
    cr_returning_hdemo_sk     bigint,
    cr_returning_addr_sk      bigint,
    cr_call_center_sk         bigint,
    cr_catalog_page_sk        bigint,
    cr_ship_mode_sk           bigint,
    cr_warehouse_sk           bigint,
    cr_reason_sk              bigint,
    cr_order_number           bigint,
    cr_return_quantity        int,
    cr_return_amount          double,
    cr_return_tax             double,
    cr_return_amt_inc_tax     double,
    cr_fee                    double,
    cr_return_ship_cost       double,
    cr_refunded_cash          double,
    cr_reversed_charge        double,
    cr_store_credit           double,
    cr_net_loss               double
)
partitioned by (cr_returned_date_sk bigint)
stored as orc
TBLPROPERTIES('transactional'='true', 'transactional_properties'='default');

from tpcds_text_100.catalog_returns cr
insert overwrite table catalog_returns partition(cr_returned_date_sk)
select
        cr.cr_returned_time_sk,
        cr.cr_item_sk,
        cr.cr_refunded_customer_sk,
        cr.cr_refunded_cdemo_sk,
        cr.cr_refunded_hdemo_sk,
        cr.cr_refunded_addr_sk,
        cr.cr_returning_customer_sk,
        cr.cr_returning_cdemo_sk,
        cr.cr_returning_hdemo_sk,
        cr.cr_returning_addr_sk,
        cr.cr_call_center_sk,
        cr.cr_catalog_page_sk,
        cr.cr_ship_mode_sk,
        cr.cr_warehouse_sk,
        cr.cr_reason_sk,
        cr.cr_order_number,
        cr.cr_return_quantity,
        cr.cr_return_amount,
        cr.cr_return_tax,
        cr.cr_return_amt_inc_tax,
        cr.cr_fee,
        cr.cr_return_ship_cost,
        cr.cr_refunded_cash,
        cr.cr_reversed_charge,
        cr.cr_store_credit,
        cr.cr_net_loss,
        cr.cr_returned_date_sk
      where cr.cr_returned_date_sk is not null
insert overwrite table catalog_returns partition (cr_returned_date_sk)
select
        cr.cr_returned_time_sk,
        cr.cr_item_sk,
        cr.cr_refunded_customer_sk,
        cr.cr_refunded_cdemo_sk,
        cr.cr_refunded_hdemo_sk,
        cr.cr_refunded_addr_sk,
        cr.cr_returning_customer_sk,
        cr.cr_returning_cdemo_sk,
        cr.cr_returning_hdemo_sk,
        cr.cr_returning_addr_sk,
        cr.cr_call_center_sk,
        cr.cr_catalog_page_sk,
        cr.cr_ship_mode_sk,
        cr.cr_warehouse_sk,
        cr.cr_reason_sk,
        cr.cr_order_number,
        cr.cr_return_quantity,
        cr.cr_return_amount,
        cr.cr_return_tax,
        cr.cr_return_amt_inc_tax,
        cr.cr_fee,
        cr.cr_return_ship_cost,
        cr.cr_refunded_cash,
        cr.cr_reversed_charge,
        cr.cr_store_credit,
        cr.cr_net_loss,
        cr.cr_returned_date_sk
      where cr.cr_returned_date_sk is null
      sort by cr_returned_date_sk
;
{code}

On the other hand, other tables (such as store_sales) are correctly loaded, so 
I guess the result is non-deterministic.

I can quickly test new patches using the same scenario. A minor issue with my 
testing is that for some reason, applying the patch to commit 
ffee30e6267e85f00a22767262192abb9681cfb7 had a problem in 
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:
{code:java}
+    Set<String> dynamicPartitionSpecs = new HashSet<>();
     Set<Path> committed = Collections.newSetFromMap(new ConcurrentHashMap<>());
{code}
The existing line looks like:
{code:java}
     HashSet<Path> committed = new HashSet<>();
{code}
So, I kept the existing code (using new HashSet).


> Insert overwrite with dynamic partitioning is not working correctly with 
> direct insert
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-23114
>                 URL: https://issues.apache.org/jira/browse/HIVE-23114
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-23114.1.patch
>
>
> This is a follow-up Jira for the 
> [conversation|https://issues.apache.org/jira/browse/HIVE-21164?focusedCommentId=17059280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17059280]
>  in HIVE-21164
>  Doing an insert overwrite from a multi-insert statement with dynamic 
> partitioning will give wrong results for ACID tables when 
> 'hive.acid.direct.insert.enabled' is true or for insert-only tables.
> Reproduction:
> {noformat}
> set hive.acid.direct.insert.enabled=true;
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.vectorized.execution.enabled=false;
> set hive.stats.autogather=false;
> create external table multiinsert_test_text (a int, b int, c int) stored as 
> textfile;
> insert into multiinsert_test_text values (1111, 11, 1111), (2222, 22, 1111), 
> (3333, 33, 2222), (4444, 44, NULL), (5555, 55, NULL);
> create table multiinsert_test_acid (a int, b int) partitioned by (c int) 
> stored as orc tblproperties('transactional'='true');
> create table multiinsert_test_mm (a int, b int) partitioned by (c int) stored 
> as orc tblproperties('transactional'='true', 
> 'transactional_properties'='insert_only');
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_acid;
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_mm;
> {noformat}
> The result of these steps can be different, it depends on the execution order 
> of the FileSinkOperators of the insert overwrite statements. It can happen 
> that an error occurs due to manifest file collision, it can happen that no 
> error occurs but the result will be incorrect.
>  Running the same insert query with an external table of with and ACID table 
> with 'hive.acid.direct.insert.enabled=false' will give the follwing result:
> {noformat}
> 1111    11      1111
> 2222    22      1111
> 3333    33      2222
> 4444    44      NULL
> 5555    55      NULL
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-23114) Insert overwrite with dynamic partitioning is not working correctly with direct insert

Reply via email to