[ 
https://issues.apache.org/jira/browse/HIVE-28700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihua Deng updated HIVE-28700:
-------------------------------
    Description: 
Steps to repro:

set mapreduce.job.reduces=7;
create table ext(a int);
insert into table ext values(1),(2),(3),(3),(3),(3),(4),(5),(6),(7);
create table full_acid(a int) stored as orc 
tblproperties("transactional"="true");
insert overwrite table full_acid select * from ext where a  = 3;
insert into table full_acid select * from ext where a != 3 group by a;

select * from full_acid;
alter table full_acid compact 'major' and wait;
select * from full_acid;

After the major compaction, the full_acid table is missing the records where 
a = 3.

This issue might happen when an ACID table is overwritten and then inserted 
into or merged, followed by a major compaction. During the major compaction, 
the base file happens to sit in a bucket that has no matching bucket in the 
delta files, so the compactor skips this base file and all records in it are 
lost.
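
The failure mode described above can be illustrated with a toy model. This is 
only a sketch of one possible reading of the description, not Hive's compactor 
code: the directory names mimic the ACID layout, but the bucket numbers and the 
functions plan_from_deltas_only, plan_from_all_dirs and compact are 
hypothetical names made up for illustration.

```python
# Toy model only: bucket numbers and function names are hypothetical,
# chosen for illustration; this is not how Hive's MRCompactor is written.
from typing import Dict, List, Set

# Each directory maps a bucket file name to the rows stored in it.
Table = Dict[str, Dict[str, List[int]]]

table: Table = {
    # Written by INSERT OVERWRITE: every row with a = 3 is in one base file.
    "base_0000001": {"bucket_00000": [3, 3, 3, 3]},
    # Written by INSERT INTO ... GROUP BY: rows land in other bucket files.
    "delta_0000002_0000002_0000": {
        "bucket_00001": [1, 7],
        "bucket_00002": [2, 4],
        "bucket_00003": [5, 6],
    },
}


def plan_from_deltas_only(t: Table) -> Set[str]:
    """Flawed planning: only buckets that appear in delta directories."""
    return {b for d, files in t.items() if d.startswith("delta_") for b in files}


def plan_from_all_dirs(t: Table) -> Set[str]:
    """Planning that also considers buckets found in the base directory."""
    return {b for files in t.values() for b in files}


def compact(t: Table, buckets: Set[str]) -> List[int]:
    """Merge the rows of every planned bucket into one sorted list."""
    rows: List[int] = []
    for files in t.values():
        for bucket, bucket_rows in files.items():
            if bucket in buckets:
                rows.extend(bucket_rows)
    return sorted(rows)


print(compact(table, plan_from_deltas_only(table)))  # [1, 2, 4, 5, 6, 7]
print(compact(table, plan_from_all_dirs(table)))     # [1, 2, 3, 3, 3, 3, 4, 5, 6, 7]
```

In this model the rows with a = 3 disappear only because the base bucket is 
never planned, which matches the symptom in the repro steps.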

  was:
Steps to repro:

set mapreduce.job.reduces=7;
create table ext(a int);
insert into table ext values(1),(2),(3),(3),(3),(3),(4),(5),(6),(7);
create table full_acid(a int) stored as orc 
tblproperties("transactional"="true");
insert overwrite table full_acid select * from ext where a  = 3;
insert into table full_acid select * from ext where a != 3 group by a;

select * from full_acid;
alter table full_acid compact 'major' and wait;
select * from full_acid;

After the major compaction, the full_acid table is missing the records where 
a = 3.

This issue might happen when a table is overwritten and then inserted into, 
followed by a major compaction. During the major compaction, the base file 
happens to sit in a bucket that has no matching bucket in the delta files, so 
the compactor skips this base file and all records in it are lost.


> MRCompactor may cause data loss when performing the major compaction
> --------------------------------------------------------------------
>
>                 Key: HIVE-28700
>                 URL: https://issues.apache.org/jira/browse/HIVE-28700
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Zhihua Deng
>            Assignee: Zhihua Deng
>            Priority: Blocker
>              Labels: hive-4.1.0-must, pull-request-available
>             Fix For: 4.1.0


