[jira] [Commented] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss

chunhui shen (Commented) (JIRA) Wed, 14 Mar 2012 22:08:13 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229894#comment-13229894
 ]


chunhui shen commented on HBASE-5568:
-------------------------------------

I think the problem is in the test case TestStore#testDeleteExpiredStoreFiles.
See its code:
{code}
int storeFileNum = 4;
int ttl = 1; // column family TTL, in seconds
...
hcd.setTimeToLive(ttl);
init(getName(), conf, hcd);

// scanInfo.getTtl() is in milliseconds, so sleepTime = 1000 / 4 = 250ms
long sleepTime = this.store.scanInfo.getTtl() / storeFileNum;
...
// Write one store file per iteration, 250ms apart
for (int i = 1; i <= storeFileNum; i++) {
  LOG.info("Adding some data for the store file #" + i);
  timeStamp = EnvironmentEdgeManager.currentTimeMillis();
  this.store.add(new KeyValue(row, family, qf1, timeStamp, (byte[]) null));
  this.store.add(new KeyValue(row, family, qf2, timeStamp, (byte[]) null));
  this.store.add(new KeyValue(row, family, qf3, timeStamp, (byte[]) null));
  flush(i);
  Thread.sleep(sleepTime);
}
{code}

Because ttl=1 (second), sleepTime is 250ms, so the maxExpiredTimeStamp
values of the 4 store files differ by only 250ms from one file to the next.

If the test runs slowly, several store files may therefore expire at the same
time, and then for
CompactionRequest cr = this.store.requestCompaction();
cr.getFiles().size() returns 3 instead of the expected 1.
We can confirm this from the logs:
https://builds.apache.org/job/PreCommit-HBASE-Build/1187//testReport/org.apache.hadoop.hbase.regionserver/TestStore/testDeleteExpiredStoreFiles/

{code}2012-03-14 17:19:14,672 INFO  [pool-1-thread-1] regionserver.Store(1002): Completed compaction of 1 file(s) in family of table,,1331745551742.011a93cc4f763307dc36f577662db4b1. into none, size=none; total size for store is 2.4k
2012-03-14 17:19:14,923 INFO  [pool-1-thread-1] compactions.CompactSelection(104): Deleting the expired store file by compaction: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4d75667a-f352-40cc-978a-f36fec71baf2/TestStoretestDeleteExpiredStoreFiles/011a93cc4f763307dc36f577662db4b1/family/489673e6568241c6bb500b34d7ff29ad whose maxTimeStamp is 1331745552397 while the max expired timestamp is 1331745553923
2012-03-14 17:19:14,923 INFO  [pool-1-thread-1] compactions.CompactSelection(104): Deleting the expired store file by compaction: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4d75667a-f352-40cc-978a-f36fec71baf2/TestStoretestDeleteExpiredStoreFiles/011a93cc4f763307dc36f577662db4b1/family/7ac26f52fd214aca88bd555f3c82dd91 whose maxTimeStamp is 1331745552671 while the max expired timestamp is 1331745553923
2012-03-14 17:19:14,923 INFO  [pool-1-thread-1] compactions.CompactSelection(104): Deleting the expired store file by compaction: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4d75667a-f352-40cc-978a-f36fec71baf2/TestStoretestDeleteExpiredStoreFiles/011a93cc4f763307dc36f577662db4b1/family/7ebf7908f5044fcab65a7327d3e19405 whose maxTimeStamp is 1331745552939 while the max expired timestamp is 1331745553923
{code}
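
The logged values can be re-checked with a few lines of arithmetic. This is only a minimal sketch, not HBase code: the class and method names are hypothetical, and it assumes (from the wording of the log message) that a store file counts as expired when its maxTimeStamp is less than the max expired timestamp.

```java
// Hypothetical helper, not part of HBase: re-checks the values from the log above.
class ExpiredFileCheck {

    // Assumption: a file is expired when maxTimeStamp < maxExpiredTimeStamp.
    static int countExpired(long[] maxTimeStamps, long maxExpiredTimeStamp) {
        int expired = 0;
        for (long ts : maxTimeStamps) {
            if (ts < maxExpiredTimeStamp) {
                expired++;
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // maxTimeStamp values of the three store files, copied from the log
        long[] logged = {1331745552397L, 1331745552671L, 1331745552939L};
        // "max expired timestamp" from the same log lines
        long maxExpired = 1331745553923L;

        System.out.println("expired files: " + countExpired(logged, maxExpired)); // expired files: 3
        for (long ts : logged) {
            // How long each file had already been expired when selection ran
            System.out.println("expired for " + (maxExpired - ts) + " ms");
        }
    }
}
```

All three files were already past the cutoff (by 1526, 1252 and 984 ms), so a single requestCompaction() call selects all of them, while the test expects exactly one.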

{code}
for (int i = 1; i <= storeFileNum; i++) {
  // verify the expired store file.
  CompactionRequest cr = this.store.requestCompaction();
  assertEquals(1, cr.getFiles().size());
  assertTrue(cr.getFiles().get(0).getReader().getMaxTimestamp() <
      (System.currentTimeMillis() - this.store.scanInfo.getTtl()));
  // Verify that the expired the store has been deleted.
  this.store.compact(cr);
  assertEquals(storeFileNum - i, this.store.getStorefiles().size());

  // Let the next store file expired.
  Thread.sleep(sleepTime);
}
{code}

After the first compaction, by the time of the next call to
this.store.requestCompaction(), the other three store files have all expired
at the same time.
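
The timing dependence can be illustrated with a small simulation on a virtual clock. This is a simplified model under stated assumptions, not the real test: flushes and compactions are treated as instantaneous, the class name and the stallMs parameter (extra delay before each requestCompaction() call) are hypothetical, and a file is taken to be expired when its timestamp is less than now minus the TTL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of the test's timing; not HBase code.
class TtlTimingSimulation {
    static final long TTL_MS = 1000;                        // ttl = 1 second
    static final int STORE_FILE_NUM = 4;
    static final long SLEEP_MS = TTL_MS / STORE_FILE_NUM;   // 250ms

    // Returns how many files each verification iteration would select.
    static int[] simulate(long[] stallMs) {
        long clock = 0;
        List<Long> fileTimestamps = new ArrayList<>();
        // Write phase: one store file every SLEEP_MS.
        for (int i = 0; i < STORE_FILE_NUM; i++) {
            fileTimestamps.add(clock);
            clock += SLEEP_MS;
        }
        // Verify phase: select and remove every file older than the TTL.
        int[] selected = new int[STORE_FILE_NUM];
        for (int i = 0; i < STORE_FILE_NUM; i++) {
            clock += stallMs[i];             // assumed extra delay on a slow machine
            long cutoff = clock - TTL_MS;
            for (Iterator<Long> it = fileTimestamps.iterator(); it.hasNext();) {
                if (it.next() < cutoff) {
                    it.remove();
                    selected[i]++;
                }
            }
            clock += SLEEP_MS;
        }
        return selected;
    }

    public static void main(String[] args) {
        // Fast run: only a small delay before the first check.
        System.out.println(Arrays.toString(simulate(new long[] {10, 0, 0, 0})));   // [1, 1, 1, 1]
        // Slow run: an extra 500ms stall before the second check.
        System.out.println(Arrays.toString(simulate(new long[] {10, 500, 0, 0}))); // [1, 3, 0, 0]
    }
}
```

In this model a stall of more than 250ms before any check is enough to make two or more files expire together, which is why assertEquals(1, cr.getFiles().size()) can fail on a slow build machine.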
                
> Multi concurrent flushcache() for one region could cause data loss
> ------------------------------------------------------------------
>
>                 Key: HBASE-5568
>                 URL: https://issues.apache.org/jira/browse/HBASE-5568
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>             Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>
>         Attachments: HBASE-5568-90.patch, HBASE-5568.patch, HBASE-5568.patch
>
>
> We can now call HRegion#flushcache() concurrently, through 
> HRegionServer#splitRegion or HRegionServer#flushRegion by HBaseAdmin.
> However, we find that if HRegion#internalFlushcache() is called concurrently 
> by multiple threads, HRegion.memstoreSize will be calculated incorrectly.
> At the end of HRegion#internalFlushcache(), we do 
> this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the 
> actual memsize that was flushed to HDFS. This causes HRegion.memstoreSize to 
> become negative and prevents the next flush when we close this region.
> Logs in RS for region e9d827913a056e696c39bc569ea3
> 2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
> 2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
> 2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
> 2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
> 2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
> 2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
> 2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
> 2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
> 2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
> 2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
> 2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
> 2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
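
The accounting problem described in the quoted issue can be replayed as a fixed interleaving. This is a rough sketch of one possible reading of the quoted logs, not HBase code: the class and variable names are hypothetical, the sizes are the rounded log figures in units of 0.1 MB, and it assumes each flush records the region's memstore size when it starts and subtracts that recorded value when it finishes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical replay of three overlapping flushes; not HBase code.
class MemstoreAccountingSketch {

    // Sizes are in units of 0.1 MB (1280 == 128.0m), taken from the rounded log figures.
    static long replay() {
        AtomicLong memstoreSize = new AtomicLong(1280);

        long flushA = memstoreSize.get();   // flush A starts, records 128.0m
        memstoreSize.addAndGet(68);         // writes keep arriving: 134.8m
        long flushB = memstoreSize.get();   // flush B starts concurrently, records 134.8m

        memstoreSize.addAndGet(-flushA);    // A finishes: 134.8m - 128.0m = 6.8m
        long flushC = memstoreSize.get();   // flush C starts, records 6.8m

        // B wrote only a few MB to disk but subtracts its recorded 134.8m.
        memstoreSize.addAndGet(-flushB);    // 6.8m - 134.8m = -128.0m
        // C likewise subtracts its recorded 6.8m.
        memstoreSize.addAndGet(-flushC);    // -128.0m - 6.8m = -134.8m

        return memstoreSize.get();
    }

    public static void main(String[] args) {
        System.out.println("memstoreSize (x0.1 MB): " + replay()); // memstoreSize (x0.1 MB): -1348
    }
}
```

Under these assumptions the counter ends up negative because overlapping flushes each subtract a size that includes data another flush has already accounted for.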

