[jira] [Commented] (HBASE-3845) data loss because lastSeqWritten can miss memstore edits

gaojinchao (JIRA) Thu, 25 Aug 2011 02:23:09 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090881#comment-13090881
 ]


gaojinchao commented on HBASE-3845:
-----------------------------------

I verified the patch and I think it is OK.
I created a table with one region and put a lot of data into it. The logs 
show that the sequence ids are continuous.
Code:
      // updateLock not needed for removing snapshot's entry
      // Cleaning up of lastSeqWritten is in the finally clause because we
      // don't want to confuse getOldestOutstandingSeqNum()
      this.lastSeqWritten.remove(getSnapshotName(encodedRegionName));
      Long seq = this.lastSeqWritten.get(encodedRegionName);
      if (null != seq) {
        LOG.error("gjc: end flush seq " + logSeqId + "current seq" + seq);
      } else {
        LOG.error("gjc: end flush seq " + logSeqId);
      }
Logs with the patch:
2011-08-25 04:11:50,807 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start flush seq495032
2011-08-25 04:11:50,808 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start flush seq495032current seq499908
2011-08-25 04:12:11,073 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc: end flush seq 499908current seq499909
2011-08-25 04:12:11,700 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start flush seq499909
2011-08-25 04:12:11,700 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start flush seq499909current seq505058
2011-08-25 04:12:58,532 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc: end flush seq 505058current seq505059
2011-08-25 04:12:58,784 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start flush seq505059

The logs before the patch:
2011-08-25 05:35:20,691 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start seq679214
2011-08-25 05:35:20,940 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:end current seq679215
2011-08-25 05:36:19,024 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start seq682145
2011-08-25 05:36:26,928 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:end current seq685931
2011-08-25 05:36:27,571 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start seq686209
2011-08-25 05:36:36,311 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:end current seq690191
2011-08-25 05:36:36,768 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start seq690244
2011-08-25 05:36:44,709 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:  
gjc:end current seq693566
2011-08-25 05:36:45,940 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
gjc:start seq694126

> data loss because lastSeqWritten can miss memstore edits
> --------------------------------------------------------
>
>                 Key: HBASE-3845
>                 URL: https://issues.apache.org/jira/browse/HBASE-3845
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Prakash Khemani
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.92.0
>
>         Attachments: 
> 0001-HBASE-3845-data-loss-because-lastSeqWritten-can-miss.patch, 
> HBASE-3845-fix-TestResettingCounters-test.txt, HBASE-3845_1.patch, 
> HBASE-3845_2.patch, HBASE-3845_4.patch, HBASE-3845_5.patch, 
> HBASE-3845_6.patch, HBASE-3845__trunk.patch, HBASE-3845_branch90V1.patch, 
> HBASE-3845_branch90V2.patch, HBASE-3845_trunk_2.patch, 
> HBASE-3845_trunk_3.patch
>
>
> (I don't have a test case to prove this yet but I have run it by Dhruba and 
> Kannan internally and wanted to put this up for some feedback.)
> In this discussion let us assume that the region has only one column family. 
> That way I can use region/memstore interchangeably.
> After a memstore flush it is possible for lastSeqWritten to have a 
> log-sequence-id for a region that is not the earliest log-sequence-id for 
> that region's memstore.
> HLog.append() does a putIfAbsent into lastSeqWritten. This is to ensure 
> that we only keep track of the earliest log-sequence-number that is present 
> in the memstore.
> Every time the memstore is flushed we remove the region's entry in 
> lastSeqWritten and wait for the next append to populate this entry 
> again. This is where the problem happens.
> Step 1:
> flusher.prepare() snapshots the memstore under 
> HRegion.updatesLock.writeLock().
> Step 2:
> As soon as the updatesLock.writeLock() is released, new entries will be 
> added to the memstore.
> Step 3:
> wal.completeCacheFlush() is called. This method removes the region's entry 
> from lastSeqWritten.
> Step 4:
> The next append will create a new entry for the region in lastSeqWritten, 
> but this will be the log seq id of the current append. All the edits that 
> were added in step 2 are missing.
> ==
> As a temporary measure, instead of removing the region's entry in step 3 I 
> will replace it with the log-seq-id of the region-flush-event.
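
The four steps in the quoted description can be replayed in a small 
single-threaded toy model. This is only a sketch of the bookkeeping, not the 
HLog code or the attached patch: the class LastSeqWrittenSketch and its 
methods (append, flushSeqId, completeFlushByRemoving, 
completeFlushByReplacing, replay) are hypothetical names, and the model 
assumes the flush event takes its own sequence id before the step 2 edits 
arrive.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy model of the lastSeqWritten bookkeeping (hypothetical, not HLog). */
final class LastSeqWrittenSketch {
    // region name -> oldest log sequence id believed to be unflushed
    private final ConcurrentMap<String, Long> lastSeqWritten =
        new ConcurrentHashMap<>();
    private long nextSeq = 1;

    /** Models an append: only the first edit after a cleanup records its id. */
    long append(String region) {
        long seq = nextSeq++;
        lastSeqWritten.putIfAbsent(region, seq);
        return seq;
    }

    /** Assumption: the flush event is given its own sequence id. */
    long flushSeqId() {
        return nextSeq++;
    }

    /** Step 3 as described: drop the region's entry. */
    void completeFlushByRemoving(String region) {
        lastSeqWritten.remove(region);
    }

    /** Temporary measure: keep an entry, set to the flush event's id. */
    void completeFlushByReplacing(String region, long flushSeq) {
        lastSeqWritten.put(region, flushSeq);
    }

    /**
     * Replays steps 1-4 and returns
     * { seq id of the step 2 edit, value tracked in lastSeqWritten }.
     */
    static long[] replay(boolean replaceInsteadOfRemove) {
        LastSeqWrittenSketch wal = new LastSeqWrittenSketch();
        String region = "r1";
        wal.append(region);                  // edit that ends up in the snapshot
        long flushSeq = wal.flushSeqId();    // step 1: snapshot taken
        long step2Edit = wal.append(region); // step 2: putIfAbsent is a no-op
        if (replaceInsteadOfRemove) {        // step 3
            wal.completeFlushByReplacing(region, flushSeq);
        } else {
            wal.completeFlushByRemoving(region);
        }
        wal.append(region);                  // step 4: next append
        return new long[] { step2Edit, wal.lastSeqWritten.get(region) };
    }

    public static void main(String[] args) {
        long[] removed = replay(false);
        long[] replaced = replay(true);
        System.out.println("remove:  step 2 edit=" + removed[0]
            + " tracked=" + removed[1]);
        System.out.println("replace: step 2 edit=" + replaced[0]
            + " tracked=" + replaced[1]);
    }
}
```

In this model, a tracked value greater than the step 2 edit's id means that 
edit is still only in the memstore but is no longer covered by 
lastSeqWritten. With removal the tracked value ends up above the step 2 
edit; with replacement it stays at or below it.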

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
