[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383887#comment-16383887 ]
Ted Yu edited comment on HBASE-20090 at 3/5/18 10:31 AM:
---------------------------------------------------------

With more debug logging, we can see the scenario in which the exception was triggered.
{code}
2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., compaction_queue=(0:0), split_queue=1
2018-03-02 17:28:30,098 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
2018-03-02 17:28:30,296 INFO [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=381.5 M
2018-03-02 17:28:30,297 INFO [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Nothing to flush for atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Excluding unflushable region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. - trying to find a different region to flush.
{code}
Region 0453f29030757eedb6e6a1c57e88c085 was being split. In HRegion#flushcache, the log message from the else branch can be seen in 20090-server-61260-01-000007.log:
{code}
    synchronized (writestate) {
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("NOT flushing memstore for region " + this
              + ", flushing=" + writestate.flushing + ", writesEnabled="
              + writestate.writesEnabled);
        }
{code}
This means region 0453f29030757eedb6e6a1c57e88c085 could not flush, leaving memory pressure at a high level. By the time MemStoreFlusher reached the following call, the region was no longer a flush candidate:
{code}
      HRegion bestFlushableRegion =
          getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
{code}
So the other region, atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae., was examined next. Since that region was not receiving writes, the (current) Precondition check failed.

The proposed fix is to convert the Precondition check into a normal return, as sketched below.
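As a rough sketch of that change (the variable names follow the branch-2 snippet quoted in the description below; the exact log message and placement are placeholders, and the attached patches remain authoritative):
{code}
    // Instead of:
    //   Preconditions.checkState(
    //     (regionToFlush != null && regionToFlushSize > 0) ||
    //     (bestRegionReplica != null && bestRegionReplicaSize > 0));
    // return gracefully when every candidate region is excluded, e.g.
    // because the biggest region is mid-split and cannot flush:
    if (!((regionToFlush != null && regionToFlushSize > 0)
        || (bestRegionReplica != null && bestRegionReplicaSize > 0))) {
      LOG.info("No flushable region found to relieve global memstore pressure");
      return false; // signals the caller that nothing was flushed
    }
{code}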
> Properly handle Preconditions check failure in
> MemStoreFlusher$FlushHandler.run
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-20090
>                 URL: https://issues.apache.org/jira/browse/HBASE-20090
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 20090-server-61260-01-000007.log, 20090.v1.txt,
>                      20090.v4.txt, 20090.v5.txt
>
>
> Here is the code in branch-2 :
> {code}
>     try {
>       wakeupPending.set(false); // allow someone to wake us up again
>       fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>       if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>         if (!flushOneForGlobalPressure()) {
> ...
>       FlushRegionEntry fre = (FlushRegionEntry) fqe;
>       if (!flushRegion(fre)) {
>         break;
> ...
>     } catch (Exception ex) {
>       LOG.error("Cache flusher failed for entry " + fqe, ex);
>       if (!server.checkFileSystem()) {
>         break;
>       }
>     }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>       Preconditions.checkState(
>         (regionToFlush != null && regionToFlushSize > 0) ||
>         (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, the IllegalStateException is caught by the catch block shown above.
> However, the fqe is not flushed, resulting in potential data loss.
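For illustration, a self-contained sketch of the failure mode described above, assuming only Guava's Preconditions on the classpath (the class and method shapes are invented for the demo and are not HBase code):
{code}
import com.google.common.base.Preconditions;

public class FlushHandlerDemo {
  // Stand-in for flushOneForGlobalPressure(): every candidate region has
  // been excluded, so the condition is false and checkState() throws
  // IllegalStateException.
  static boolean flushOneForGlobalPressure(Object regionToFlush, long regionToFlushSize) {
    Preconditions.checkState(regionToFlush != null && regionToFlushSize > 0);
    return true;
  }

  public static void main(String[] args) {
    try {
      flushOneForGlobalPressure(null, 0L); // biggest region is mid-split
    } catch (Exception ex) {
      // Mirrors the catch block in FlushHandler.run(): the exception is
      // logged and swallowed, so the pending flush work is simply skipped.
      System.out.println("Cache flusher failed for entry: " + ex);
    }
  }
}
{code}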