[ 
https://issues.apache.org/jira/browse/HBASE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905351#comment-13905351
 ] 

Feng Honghua commented on HBASE-10556:
--------------------------------------

Thanks [~yuzhih...@gmail.com] for the review. Pinging [~lhofhansl], [~stack] 
and [~apurtell] for a review and another +1, thanks :-)

> Possible data loss due to non-handled DroppedSnapshotException for 
> user-triggered flush from client/shell
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10556
>                 URL: https://issues.apache.org/jira/browse/HBASE-10556
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Feng Honghua
>            Assignee: Feng Honghua
>            Priority: Critical
>         Attachments: HBASE-10556-trunk_v1.patch
>
>
> While reviewing code during the investigation of HBASE-10499, we found a 
> possible data loss caused by an unhandled DroppedSnapshotException in 
> user-triggered flushes.
> Data loss can happen as follows:
> # A flush for some region is triggered via HBaseAdmin or shell
> # The request reaches the regionserver and HRegion.internalFlushcache is 
> eventually called. It fails while persisting the memstore's snapshot to an 
> hfile, so DroppedSnapshotException is thrown and the snapshot is left 
> uncleared.
> # DroppedSnapshotException is not handled in HRegion; it is just wrapped in 
> a ServiceException and returned to the client.
> # After a while, some new writes are handled and put in the current memstore, 
> then a new flush is triggered for the region because memstoreSize exceeds the 
> flush threshold
> # This second (new) flush succeeds. For the HStore that failed in the 
> previous user-triggered flush, the leftover non-empty snapshot is used rather 
> than a new snapshot made from the current memstore, but HLog's latest 
> sequenceId is used for the resulting hfiles. The sequenceId attached to the 
> hfiles says that all edits with sequenceId <= it have been persisted, which 
> is not true for the edits still in the current memstore.
> # Now the regionserver hosting this region dies
> # During the replay phase of failover, the edits that were in the memstore 
> and not actually persisted in hfiles when the previous regionserver died are 
> ignored, since they are deemed persisted when compared with the hfiles' 
> latest sequenceId. These edits are lost.
> For the second flush, we also can't discard the leftover snapshot and make a 
> new one from the current memstore, because the data in the leftover snapshot 
> would then be lost. We should abort the regionserver immediately and rely on 
> failover to replay the log for data safety.
> DroppedSnapshotException is correctly handled in MemStoreFlusher for 
> internally triggered flushes (those generated by flush-size / rollWriter / 
> periodicFlusher). But a user-triggered flush is processed directly by 
> HRegionServer->HRegion without putting a flush entry into flushQueue, so it 
> is not handled by MemStoreFlusher.
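
The scenario in the description can be sketched with a small toy model. This is
only an illustration under stated assumptions, not the attached patch and not
HBase's real classes: Region, Server, lostOnReplay and userTriggeredFlush are
hypothetical stand-ins, and the local DroppedSnapshotException merely mirrors
the name of the HBase exception. The model tracks edits by sequenceId, reuses a
leftover snapshot on the next flush, and shows a user-triggered flush path that
requests an abort instead of only rethrowing.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model only: every class here is a hypothetical stand-in, not HBase code. */
class FlushAbortSketch {

    /** Stand-in that mirrors the name of HBase's DroppedSnapshotException. */
    static class DroppedSnapshotException extends IOException {
        DroppedSnapshotException(String msg) { super(msg); }
    }

    /** Minimal region: a memstore, a snapshot, and the edits persisted in hfiles. */
    static class Region {
        final List<Long> memstore = new ArrayList<>();   // sequenceIds of unflushed edits
        final List<Long> snapshot = new ArrayList<>();   // snapshot being flushed
        final List<Long> hfileEdits = new ArrayList<>(); // edits actually persisted
        long hfileMaxSeqId = -1;                         // sequenceId recorded in hfiles
        long lastSeqId = 0;                              // latest log sequenceId
        boolean failNextPersist = false;                 // simulate a failed persist

        void put() { memstore.add(++lastSeqId); }

        void flush() throws DroppedSnapshotException {
            if (snapshot.isEmpty()) {   // a leftover snapshot is reused as-is
                snapshot.addAll(memstore);
                memstore.clear();
            }
            if (failNextPersist) {
                failNextPersist = false;
                // The snapshot is left uncleared, as in step 2 of the description.
                throw new DroppedSnapshotException("failed to persist snapshot");
            }
            hfileEdits.addAll(snapshot);
            snapshot.clear();
            hfileMaxSeqId = lastSeqId;  // the latest log sequenceId is recorded
        }

        /** Edits that replay would skip although no hfile contains them. */
        List<Long> lostOnReplay() {
            List<Long> lost = new ArrayList<>();
            for (long id : memstore) {
                if (id <= hfileMaxSeqId) lost.add(id);
            }
            return lost;
        }
    }

    /** Minimal server whose abort() only records that an abort was requested. */
    static class Server {
        boolean aborted = false;
        void abort(String why, Throwable cause) { aborted = true; }
    }

    /** User-triggered flush path that aborts the server on a dropped snapshot. */
    static void userTriggeredFlush(Server server, Region region) throws IOException {
        try {
            region.flush();
        } catch (DroppedSnapshotException e) {
            server.abort("Replay of log required after failed flush", e);
            throw e;
        }
    }
}
```

In this model, two puts followed by a failed flush, two more puts and a
successful flush leave edits 3 and 4 in the memstore while the hfiles record
sequenceId 4, so lostOnReplay() reports both. Routing the first flush through
userTriggeredFlush marks the server as aborted before any later flush can run.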



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
