icyjhl created HIVE-29343:
-----------------------------

             Summary: Multi Insert Caused Data Missing when 
hive.execute.parallel set to true
                 Key: HIVE-29343
                 URL: https://issues.apache.org/jira/browse/HIVE-29343
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 3.1.3
         Environment: hive.execute.parallel set to True
hive.merge.mapredfiles true
            Reporter: icyjhl
         Attachments: hiveserver2.log

I used multi insert SQL as below, then partition cmrto_diff_lim_tot got data 
loaded successfully, but data missed for partition cmdiff_date_lim_chg_now.

from (select user_id,
             **********
) ab

insert overwrite table xc.fact_user_model_variable partition(tag = 
'cmrto_diff_lim_tot',dt='2025-11-25') select user_id
,
cmrto_diff_lim_tot
 where ab.rn=1
insert overwrite table xc.fact_user_model_variable partition(tag = 
'cmdiff_date_lim_chg_now',dt='2025-11-25') select user_id
,
DATEDIFF('2025-11-25',dt)
 where ab.rn=1


In Hiveserver2 log, we found dir -ext-10007 was renamed to -ext-10002, but data 
inside -ext-10002 was not loaded to the final folder.

In HDFS Audit log, we can see 2 entries, one is deleting -ext-10002 and the 
other is rename -ext-10007  to -ext-10002, but they have the same timestamp.

In Ranger Audit log, these 2 entries also got the same timestamp, but the 
delete action got the higher version and _expire_at_ time.

So we suspect the order of delete and rename are messed up, which caused the 
data deleted.

Hive Server 2 log:
2025-11-25 05:42:59,846 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Launching Job 2 out of 8
2025-11-25 05:42:59,849 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-3:MAPRED] 
in parallel
2025-11-25 05:50:47,107 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task 
[Stage-9:CONDITIONAL] in parallel
2025-11-25 05:50:47,108 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Launching Job 3 out of 8
2025-11-25 05:50:47,114 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-10:MAPRED] 
in parallel
2025-11-25 05:50:47,114 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task 
[Stage-16:CONDITIONAL] in parallel
2025-11-25 05:50:47,114 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Launching Job 4 out of 8
2025-11-25 05:50:47,122 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-17:MAPRED] 
in parallel
2025-11-25 05:50:47,136 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-13:MOVE] 
in serial mode
2025-11-25 05:50:47,140 INFO  org.apache.hadoop.hive.ql.exec.Task: 
[HiveServer2-Background-Pool: Thread-33006771]: Moving data to directory 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10002
 from 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10007
2025-11-25 05:50:47,144 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-6:MOVE] in 
serial mode
2025-11-25 05:50:47,145 INFO  org.apache.hadoop.hive.ql.exec.Task: 
[HiveServer2-Background-Pool: Thread-33006771]: Moving data to directory 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmrto_diff_lim_tot/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10000
 from 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmrto_diff_lim_tot/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10005
2025-11-25 05:50:47,152 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-0:MOVE] in 
serial mode
2025-11-25 05:50:47,153 INFO  org.apache.hadoop.hive.ql.exec.Task: 
[HiveServer2-Background-Pool: Thread-33006771]: Loading data to table 
xc.fact_user_model_variable partition (tag=cmrto_diff_lim_tot, dt=2025-11-25) 
from 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmrto_diff_lim_tot/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10000
2025-11-25 05:50:47,159 INFO  org.apache.hadoop.hive.ql.exec.MoveTask: 
[HiveServer2-Background-Pool: Thread-33006771]: Partition is: 
{tag=cmrto_diff_lim_tot, dt=2025-11-25}
2025-11-25 05:50:47,176 INFO  org.apache.hadoop.hive.common.FileUtils: 
[HiveServer2-Background-Pool: Thread-33006771]: Creating directory if it 
doesn't exist: 
hdfs://ppdcdpha/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmrto_diff_lim_tot/dt=2025-11-25
2025-11-25 05:50:47,244 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Starting task [Stage-4:STATS] 
in parallel
2025-11-25 05:52:21,495 INFO  
org.apache.hadoop.hive.ql.hooks.HiveProtoLoggingHook: 
[HiveServer2-Background-Pool: Thread-33006771]: Received post-hook notification 
for: hive_20251125050916_cb971ed7-dfd6-48bf-b854-519aa883581d
2025-11-25 05:52:21,496 ERROR SessionState: [HiveServer2-Background-Pool: 
Thread-33006771]: Failed to log lineage graph, query is not affected
2025-11-25 05:52:21,554 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: MapReduce Jobs Launched:
2025-11-25 05:52:21,555 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Stage-Stage-2: Map: 16462  
Reduce: 200   Cumulative CPU: 522783.84 sec   HDFS Read: 1365864271204 HDFS 
Write: 508643901 HDFS EC Read: 0 SUCCESS
2025-11-25 05:52:21,555 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Stage-Stage-3: Map: 119  
Reduce: 1   Cumulative CPU: 1796.17 sec   HDFS Read: 509032112 HDFS Write: 
102197949 HDFS EC Read: 0 SUCCESS
2025-11-25 05:52:21,555 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Stage-Stage-10: Map: 1  Reduce: 
1   Cumulative CPU: 8.21 sec   HDFS Read: 12135 HDFS Write: 2038 HDFS EC Read: 
0 SUCCESS
2025-11-25 05:52:21,555 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Stage-Stage-17: Map: 1  Reduce: 
1   Cumulative CPU: 10.67 sec   HDFS Read: 11990 HDFS Write: 1680 HDFS EC Read: 
0 SUCCESS
2025-11-25 05:52:21,555 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Total MapReduce CPU Time Spent: 
6 days 1 hours 43 minutes 18 seconds 890 msec
2025-11-25 05:52:21,556 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: Completed executing 
command(queryId=hive_20251125050916_cb971ed7-dfd6-48bf-b854-519aa883581d); Time 
taken: 2576.387 seconds
2025-11-25 05:52:21,556 INFO  org.apache.hadoop.hive.ql.Driver: 
[HiveServer2-Background-Pool: Thread-33006771]: OK
2025-11-25 05:52:21,556 INFO  org.apache.hadoop.hive.ql.lockmgr.DbTxnManager: 
[HiveServer2-Background-Pool: Thread-33006771]: Stopped heartbeat for query: 
hive_20251125050916_cb971ed7-dfd6-48bf-b854-519aa883581d
2025-11-25 05:52:21,560 INFO  org.apache.hadoop.hive.common.LogUtils: 
[HiveServer2-Background-Pool: Thread-33006771]: Unregistered logging context.


HDFS Audit Log:

kafka_time      info_time       info_level      audit_state     cmd     src     
dst
2025-11-25 05:50:48.065 2025-11-25 05:50:47,141 INFO    true    rename  
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10007
 
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10002
2025-11-25 05:50:48.065 2025-11-25 05:50:47,141 INFO    true    delete  
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10002
 null


Ranger Audit Log:
id      access  resource        reason  action  evttime _ttl_   _expire_at_     
_version_
e2b61c58-d844-4bff-af57-b99fe2f29d98-0  listStatus      
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10007
 
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10007
 read    2025-11-25 05:50:47.135 +30DAYS 2025-12-25 05:50:49.947 
1849710136473419777
0e5a2e56-523e-4317-9004-0085e1057fc4-0  rename  
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10002
 
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301
    write   2025-11-25 05:50:47.141 +30DAYS 2025-12-25 05:50:50.101 
1849710136640143371
eb0e83af-66c6-4615-a9bc-32aab4424417-0  delete  
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301/-ext-10002
 
/user/hive/warehouse/xc.db/fact_user_model_variable/tag=cmdiff_date_lim_chg_now/dt=2025-11-25/.hive-staging_hive_2025-11-25_05-09-16_900_6728589823245031279-1054301
    write   2025-11-25 05:50:47.141 +30DAYS 2025-12-25 05:50:50.38  
1849710136925356042





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to