[
https://issues.apache.org/jira/browse/HIVE-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072165#comment-17072165
]
Sungwoo commented on HIVE-23114:
--------------------------------
I tried 1.patch with Hive 4 on Tez with these commits:
- Hive 4 commit: ffee30e6267e85f00a22767262192abb9681cfb7 (HIVE-21164, Fri Feb
21)
- Tez commit: fd19ce6c93bc1f899ccca7161b0c0407f850bd77 (TEZ-4123, Wed Feb 12)
I tried the above script for reproduction with
hive.acid.direct.insert.enabled=true, and obtained the correct result. So, I
think building Hive on Tez was okay.
The scenario in my testing is:
1. generate TPC-DS data of 100GB in text format (on HDFS)
2. Create external tables from the TPC-DS data
3. Create ORC tables from the external tables
The TPC-DS data has 24 tables, and the result is that all ORC tables are
properly created, except catalog_returns. The catalog_returns table has about
2.1GB of data:
{code:sh}
gitlab-runner@indigo1:hive-logs$ hdfs dfs -du -h /tmp/tpcds-generate/100/
...
2.1 G 2.1 G /tmp/tpcds-generate/100/catalog_returns
{code}
The corresponding ORC table has less than 1MB of data (after generating 4GB of
intermediate data).
{code:sh}
200.7 K 602.0 K
/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
{code}
>From the log of HiveServer2, the intermediate data is deleted for
>catalog_returns, following the same pattern reported earlier for HIVE-21164
>(where FileSinkOperator.jobCloseOp() is called twice and the second call
>deletes all the intermediate data).
{code:sh}
20/03/31 20:22:02 INFO FileOperations: Reading manifest
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1/000000_0.manifest
...
20/03/31 20:22:02 INFO FileOperations: Reading manifest
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1/000041_0.manifest
20/03/31 20:22:02 INFO FileOperations: Looking for files in:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
20/03/31 20:22:03 INFO FileOperations: Looking at path:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001
...
20/03/31 20:22:03 INFO FileOperations: Looking at path:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001
20/03/31 20:22:03 INFO FileOperations: Deleting manifest directory
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/_tmp.base_0000001_1
20/03/31 20:22:04 INFO FileOperations: No manifests found - query produced no
output
20/03/31 20:22:04 INFO FileOperations: Looking for files in:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
20/03/31 20:22:05 INFO FileOperations: Looking at path:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001
...
20/03/31 20:22:05 INFO FileOperations: Looking at path:
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001
20/03/31 20:22:05 INFO FileOperations: Deleting
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2450821/base_0000001/bucket_00013_0
that was not committed
...
20/03/31 20:22:07 INFO FileOperations: Deleting
hdfs://indigo1:8020/tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns/cr_returned_date_sk=2452921/base_0000001/bucket_00017_0
that was not committed
{code}
The queries for creating and loading the table are:
{code:sql}
create table catalog_returns
(
cr_returned_time_sk bigint,
cr_item_sk bigint,
cr_refunded_customer_sk bigint,
cr_refunded_cdemo_sk bigint,
cr_refunded_hdemo_sk bigint,
cr_refunded_addr_sk bigint,
cr_returning_customer_sk bigint,
cr_returning_cdemo_sk bigint,
cr_returning_hdemo_sk bigint,
cr_returning_addr_sk bigint,
cr_call_center_sk bigint,
cr_catalog_page_sk bigint,
cr_ship_mode_sk bigint,
cr_warehouse_sk bigint,
cr_reason_sk bigint,
cr_order_number bigint,
cr_return_quantity int,
cr_return_amount double,
cr_return_tax double,
cr_return_amt_inc_tax double,
cr_fee double,
cr_return_ship_cost double,
cr_refunded_cash double,
cr_reversed_charge double,
cr_store_credit double,
cr_net_loss double
)
partitioned by (cr_returned_date_sk bigint)
stored as orc
TBLPROPERTIES('transactional'='true', 'transactional_properties'='default');
from tpcds_text_100.catalog_returns cr
insert overwrite table catalog_returns partition(cr_returned_date_sk)
select
cr.cr_returned_time_sk,
cr.cr_item_sk,
cr.cr_refunded_customer_sk,
cr.cr_refunded_cdemo_sk,
cr.cr_refunded_hdemo_sk,
cr.cr_refunded_addr_sk,
cr.cr_returning_customer_sk,
cr.cr_returning_cdemo_sk,
cr.cr_returning_hdemo_sk,
cr.cr_returning_addr_sk,
cr.cr_call_center_sk,
cr.cr_catalog_page_sk,
cr.cr_ship_mode_sk,
cr.cr_warehouse_sk,
cr.cr_reason_sk,
cr.cr_order_number,
cr.cr_return_quantity,
cr.cr_return_amount,
cr.cr_return_tax,
cr.cr_return_amt_inc_tax,
cr.cr_fee,
cr.cr_return_ship_cost,
cr.cr_refunded_cash,
cr.cr_reversed_charge,
cr.cr_store_credit,
cr.cr_net_loss,
cr.cr_returned_date_sk
where cr.cr_returned_date_sk is not null
insert overwrite table catalog_returns partition (cr_returned_date_sk)
select
cr.cr_returned_time_sk,
cr.cr_item_sk,
cr.cr_refunded_customer_sk,
cr.cr_refunded_cdemo_sk,
cr.cr_refunded_hdemo_sk,
cr.cr_refunded_addr_sk,
cr.cr_returning_customer_sk,
cr.cr_returning_cdemo_sk,
cr.cr_returning_hdemo_sk,
cr.cr_returning_addr_sk,
cr.cr_call_center_sk,
cr.cr_catalog_page_sk,
cr.cr_ship_mode_sk,
cr.cr_warehouse_sk,
cr.cr_reason_sk,
cr.cr_order_number,
cr.cr_return_quantity,
cr.cr_return_amount,
cr.cr_return_tax,
cr.cr_return_amt_inc_tax,
cr.cr_fee,
cr.cr_return_ship_cost,
cr.cr_refunded_cash,
cr.cr_reversed_charge,
cr.cr_store_credit,
cr.cr_net_loss,
cr.cr_returned_date_sk
where cr.cr_returned_date_sk is null
sort by cr_returned_date_sk
;
{code}
On the other hand, other tables (such as store_sales) are correctly loaded, so
I guess the result is non-deterministic.
I can quickly test new patches using the same scenario. A minor issue with my
testing is that for some reason, applying the patch to commit
ffee30e6267e85f00a22767262192abb9681cfb7 had a problem in
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:
{code:java}
+ Set<String> dynamicPartitionSpecs = new HashSet<>();
Set<Path> committed = Collections.newSetFromMap(new ConcurrentHashMap<>());
{code}
The existing line looks like:
{code:java}
HashSet<Path> committed = new HashSet<>();
{code}
So, I kept the existing code (using new HashSet).
> Insert overwrite with dynamic partitioning is not working correctly with
> direct insert
> --------------------------------------------------------------------------------------
>
> Key: HIVE-23114
> URL: https://issues.apache.org/jira/browse/HIVE-23114
> Project: Hive
> Issue Type: Bug
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Attachments: HIVE-23114.1.patch
>
>
> This is a follow-up Jira for the
> [conversation|https://issues.apache.org/jira/browse/HIVE-21164?focusedCommentId=17059280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17059280]
> in HIVE-21164
> Doing an insert overwrite from a multi-insert statement with dynamic
> partitioning will give wrong results for ACID tables when
> 'hive.acid.direct.insert.enabled' is true or for insert-only tables.
> Reproduction:
> {noformat}
> set hive.acid.direct.insert.enabled=true;
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.vectorized.execution.enabled=false;
> set hive.stats.autogather=false;
> create external table multiinsert_test_text (a int, b int, c int) stored as
> textfile;
> insert into multiinsert_test_text values (1111, 11, 1111), (2222, 22, 1111),
> (3333, 33, 2222), (4444, 44, NULL), (5555, 55, NULL);
> create table multiinsert_test_acid (a int, b int) partitioned by (c int)
> stored as orc tblproperties('transactional'='true');
> create table multiinsert_test_mm (a int, b int) partitioned by (c int) stored
> as orc tblproperties('transactional'='true',
> 'transactional_properties'='insert_only');
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_acid partition (c)
> select
> a.a,
> a.b,
> a.c
> where a.c is not null
> insert overwrite table multiinsert_test_acid partition (c)
> select
> a.a,
> a.b,
> a.c
> where a.c is null;
> select * from multiinsert_test_acid;
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_mm partition (c)
> select
> a.a,
> a.b,
> a.c
> where a.c is not null
> insert overwrite table multiinsert_test_mm partition (c)
> select
> a.a,
> a.b,
> a.c
> where a.c is null;
> select * from multiinsert_test_mm;
> {noformat}
> The result of these steps can be different, it depends on the execution order
> of the FileSinkOperators of the insert overwrite statements. It can happen
> that an error occurs due to manifest file collision, it can happen that no
> error occurs but the result will be incorrect.
> Running the same insert query with an external table of with and ACID table
> with 'hive.acid.direct.insert.enabled=false' will give the follwing result:
> {noformat}
> 1111 11 1111
> 2222 22 1111
> 3333 33 2222
> 4444 44 NULL
> 5555 55 NULL
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)