[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383887#comment-16383887 ]
Ted Yu edited comment on HBASE-20090 at 3/5/18 10:31 AM:
---------------------------------------------------------

With more debug logging, we can see the scenario in which the exception was triggered.
{code}
2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., compaction_queue=(0:0), split_queue=1
2018-03-02 17:28:30,098 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
2018-03-02 17:28:30,296 INFO [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=381.5 M
2018-03-02 17:28:30,297 INFO [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Nothing to flush for atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 17:28:30,298 INFO [MemStoreFlusher.1] regionserver.MemStoreFlusher: Excluding unflushable region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. - trying to find a different region to flush.
{code}
Region 0453f29030757eedb6e6a1c57e88c085 was being split. In HRegion#flushcache, the log message from the else branch can be seen in 20090-server-61260-01-000007.log:
{code}
    synchronized (writestate) {
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("NOT flushing memstore for region " + this
              + ", flushing=" + writestate.flushing + ", writesEnabled="
              + writestate.writesEnabled);
        }
{code}
This means region 0453f29030757eedb6e6a1c57e88c085 could not flush, leaving memory pressure at a high level. By the time MemStoreFlusher reached the following call, the region was no longer a flush candidate:
{code}
      HRegion bestFlushableRegion =
          getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
{code}
So the other region, atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae., was examined next. Since that region was not receiving writes, the (current) Precondition check failed.

The proposed fix is to convert the Precondition check into a normal return, as sketched below.
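As a rough sketch of that change (the variable names follow the branch-2 snippet quoted in the description below; the exact log message and placement are placeholders, and the attached patches remain authoritative):
{code}
    // Instead of:
    //   Preconditions.checkState(
    //     (regionToFlush != null && regionToFlushSize > 0) ||
    //     (bestRegionReplica != null && bestRegionReplicaSize > 0));
    // return gracefully when every candidate region is excluded, e.g.
    // because the biggest region is mid-split and cannot flush:
    if (!((regionToFlush != null && regionToFlushSize > 0)
        || (bestRegionReplica != null && bestRegionReplicaSize > 0))) {
      LOG.info("No flushable region found to relieve global memstore pressure");
      return false; // signals the caller that nothing was flushed
    }
{code}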
> Properly handle Preconditions check failure in
> MemStoreFlusher$FlushHandler.run
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-20090
>                 URL: https://issues.apache.org/jira/browse/HBASE-20090
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 20090-server-61260-01-000007.log, 20090.v1.txt,
>                      20090.v4.txt, 20090.v5.txt
>
>
> Here is the code in branch-2 :
> {code}
>     try {
>       wakeupPending.set(false); // allow someone to wake us up again
>       fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>       if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>         if (!flushOneForGlobalPressure()) {
> ...
>       FlushRegionEntry fre = (FlushRegionEntry) fqe;
>       if (!flushRegion(fre)) {
>         break;
> ...
>     } catch (Exception ex) {
>       LOG.error("Cache flusher failed for entry " + fqe, ex);
>       if (!server.checkFileSystem()) {
>         break;
>       }
>     }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>       Preconditions.checkState(
>         (regionToFlush != null && regionToFlushSize > 0) ||
>         (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, the IllegalStateException is caught by the catch block shown above.
> However, the fqe is not flushed, resulting in potential data loss.
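For illustration, a self-contained sketch of the failure mode described above, assuming only Guava's Preconditions on the classpath (the class and method shapes are invented for the demo and are not HBase code):
{code}
import com.google.common.base.Preconditions;

public class FlushHandlerDemo {
  // Stand-in for flushOneForGlobalPressure(): every candidate region has
  // been excluded, so the condition is false and checkState() throws
  // IllegalStateException.
  static boolean flushOneForGlobalPressure(Object regionToFlush, long regionToFlushSize) {
    Preconditions.checkState(regionToFlush != null && regionToFlushSize > 0);
    return true;
  }

  public static void main(String[] args) {
    try {
      flushOneForGlobalPressure(null, 0L); // biggest region is mid-split
    } catch (Exception ex) {
      // Mirrors the catch block in FlushHandler.run(): the exception is
      // logged and swallowed, so the pending flush work is simply skipped.
      System.out.println("Cache flusher failed for entry: " + ex);
    }
  }
}
{code}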