[ https://issues.apache.org/jira/browse/HBASE-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788986#comment-16788986 ]
Toshihiro Suzuki commented on HBASE-21920:
------------------------------------------

[~arshiya9414] Could you please add a test for the issue to the patch?

> Ignoring 'empty' end_key while calculating end_key for new region in HBCK
> -fixHdfsOverlaps command can cause data loss
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21920
>                 URL: https://issues.apache.org/jira/browse/HBASE-21920
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck
>    Affects Versions: 1.0.0
>            Reporter: Syeda Arshiya Tabreen
>            Assignee: Syeda Arshiya Tabreen
>            Priority: Major
>         Attachments: HBASE-21920.branch-1.patch
>
>
> When the *-fixHdfsOverlaps* command is run on a table whose regions overlap,
> it moves all the hfiles of the overlapping regions into a new region. The
> start_key and end_key of the new region are calculated from the minimum
> start_key and the maximum end_key of all the overlapping regions.
> When the end_key for the new region is calculated, an 'Empty' end_key is
> not considered, which leads to data loss when the table is scanned with
> *STARTROW*.
>
> *For example:*
> 1. Create table 't'.
> 2. Insert records {00,111,200} into table 't' and flush the data.
> 3. Split table 't' with split key '100'.
> 4. Now we have three regions (one parent and two daughter regions):
>    1. *Region-1* ('Empty','Empty') => {00,111,200}
>    2. *Region-2* ('Empty','100') => {00}
>    3. *Region-3* ('100','Empty') => {111,200}
> 5. Make sure the parent region is not deleted from the file system, then
>    run the *-fixHdfsOverlaps* command.
>
> The *-fixHdfsOverlaps* command moves all the hfiles of the three regions
> {*Region-1, Region-2, Region-3*} into a new region (*Region-4*) created with
> start_key='*Empty*' and end_key='*100*'.
> This is because it does not consider end_key='*Empty*' and instead takes
> end_key='*100*' as the maximum. As a result, all the hfiles of the three
> regions are moved into the new region even though some of them contain
> records greater than end_key='*100*'. One empty region, *Region-5*
> (100,Empty), is also created because the end key of the last region in the
> table was not empty.
>
> Now we have 2 regions:
> 1. *Region-4* (Empty,100) => {00,111,200}
> 2. *Region-5* (100,Empty) => {}
>
> When the entire table is scanned, all the records are returned, so no data
> appears to be lost. When a scan is done with a start row, however, the
> results are:
> 1. scan 't', { STARTROW => '00' } => {00,111,200}
> 2. scan 't', { STARTROW => '100' } => {}
>
> The second scan returns an empty result because it searches for the rows in
> *Region-5* (100,Empty), which contains no records, while the records
> {111,200} are actually present in *Region-4* (Empty,100).
>
> The problem exists only when end_key='*Empty*' is present in any of the
> overlapping regions. I think if an 'Empty' end_key is present in any of the
> overlapping regions, we have to consider it as the maximum end_key.
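
The end_key calculation described in the report can be illustrated with a small, self-contained sketch. This is not the HBaseFsck code and not the attached patch: the class `OverlapEndKey` and the methods `naiveMaxEndKey`, `maxEndKey` and `compare` are hypothetical names used only for illustration. The sketch assumes that keys are byte arrays compared lexicographically as unsigned bytes, and that a zero-length end_key means "end of table", as in the example above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the actual HBaseFsck implementation.
final class OverlapEndKey {

    // Lexicographic comparison of two keys as unsigned bytes.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Models the reported behaviour: empty end keys are skipped, so the
    // largest non-empty end key wins (null if every end key is empty).
    static byte[] naiveMaxEndKey(List<byte[]> endKeys) {
        byte[] max = null;
        for (byte[] key : endKeys) {
            if (key.length == 0) {
                continue; // the 'Empty' end key is ignored
            }
            if (max == null || compare(key, max) > 0) {
                max = key;
            }
        }
        return max;
    }

    // Models the proposed behaviour: an empty end key means "end of table",
    // so it is treated as greater than every other end key.
    static byte[] maxEndKey(List<byte[]> endKeys) {
        if (endKeys.isEmpty()) {
            throw new IllegalArgumentException("no end keys given");
        }
        byte[] max = null;
        for (byte[] key : endKeys) {
            if (key.length == 0) {
                return key; // nothing can be larger than "end of table"
            }
            if (max == null || compare(key, max) > 0) {
                max = key;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] k100 = "100".getBytes(StandardCharsets.UTF_8);
        // End keys of Region-1, Region-2 and Region-3 from the example.
        List<byte[]> endKeys = Arrays.asList(empty, k100, empty);

        System.out.println("naive end_key:    '"
            + new String(naiveMaxEndKey(endKeys), StandardCharsets.UTF_8) + "'");
        System.out.println("proposed end_key: '"
            + new String(maxEndKey(endKeys), StandardCharsets.UTF_8) + "'");
    }
}
```

With the three end keys from the example, the naive version picks '100', which reproduces Region-4 (Empty,100), while the proposed version returns the empty key, so the merged region would span (Empty,Empty). The start_key needs no special case under these assumptions, because an empty key already sorts before every other key.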