[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-20 Thread stack (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407425#comment-16407425 ]

stack edited comment on HBASE-20090 at 3/21/18 3:56 AM:


Now I get it. Thanks. Let me backport to branch-2.0... after it goes in 
elsewhere.


was (Author: stack):
Now I get it. Thanks. Let me backport...

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20090-server-61260-01-07.log, 20090.v10.txt, 
> 20090.v10.txt, 20090.v6.txt, 20090.v7.txt, 20090.v8.txt, 20090.v9.txt
>
>
> Copied the following from a comment since it was a better description of the 
> race condition.
> The original description was merged into the beginning of my first comment 
> below.
> Observed the following in the region server log (running on a hadoop3 cluster):
> {code}
> 2018-02-26 16:06:50,044 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=16020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 135ms
> 2018-02-26 16:06:50,049 ERROR [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Cache flusher failed for entry 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2adfadd7
> java.lang.IllegalStateException
> at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:441)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:174)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$600(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:237)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> Here is the Precondition from MemStoreFlusher#flushOneForGlobalPressure():
> {code}
>   Preconditions.checkState(
> (regionToFlush != null && regionToFlushSize > 0) || 
> (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, IllegalStateException would be raised.
> With more debug logging, we can see the scenario where the exception was 
> triggered.
> {code}
> 2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
> Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
> compaction_queue=(0:0), split_queue=1
> 2018-03-02 17:28:30,098 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
> regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because 
> info  size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
> 2018-03-02 17:28:30,296 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush thread woke up because memory above low 
> water=381.5 M
> 2018-03-02 17:28:30,297 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush of region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global
>  heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap 
> size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Nothing to flush for 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Excluding unflushable region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -trying to 
> find a different region to flush.
> {code}
> Region 0453f29030757eedb6e6a1c57e88c085 was being split.
> In HRegion#flushcache, the log from the else branch can be seen in 
> 20090-server-61260-01-07.log:
> {code}
>   synchronized (writestate) {
> if (!writestate.flushing && writestate.writesEnabled) {
>   this.writestate.flushing = true;
> } else {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("NOT flushing memstore for region " + this
> + ", flushing=" + writestate.flushing + ", writesEnabled="
> + writestate.writesEnabled);
>   }
> {code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-12 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396158#comment-16396158 ]

Ted Yu edited comment on HBASE-20090 at 3/12/18 10:17 PM:

Eshcar:

bq. in the next loop calling flushOneForGlobalPressure, region A is not selected (correct?)

Right. See the following condition in getBiggestMemStoreRegion():
{code}
if (region.writestate.flushing || !region.writestate.writesEnabled) {
  continue;
}
{code}


See also the log line from 20090-server-61260-01-07.log:

{code}

2018-03-02 17:28:30,096 DEBUG [MemStoreFlusher.0] regionserver.HRegion: NOT 
flushing memstore for region 
TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., flushing=false, 
writesEnabled=false
{code}
bq. Is this the right way to describe the problem?

Your description mostly matches my understanding (with a minor correction 
below).

bq. is it because its size is set to 0 or because it is marked as non-flushable?

See the following log line from 20090-server-61260-01-07.log:
{code}
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 
400432696
{code}
The size of region A was non-zero, but writesEnabled being false resulted in 
the region not being included in the candidates.

bq. adding a check in line 187 would solve the problem?

That may work.
However, please take a look at patch v7, where the exclusion of ineligible 
regions is done in getBiggestMemStoreRegion().
With patch v7, the existing check on line 187 should suffice; see the sketch 
below.
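
For illustration, here is a minimal sketch of that exclusion (a sketch only, 
not the actual v7 patch; getMemStoreDataSize() is the HRegion accessor quoted 
elsewhere in this thread):
{code}
// Sketch: inside the candidate loop of getBiggestMemStoreRegion(), also
// skip regions whose memstore is empty, so a zero-size region can never
// be returned as the "best" candidate.
if (region.writestate.flushing || !region.writestate.writesEnabled
    || region.getMemStoreDataSize() == 0) {
  continue;
}
{code}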

Thanks


was (Author: yuzhih...@gmail.com):
Eshcar:

bq. in the next loop calling flushOneForGlobalPressure, region A is not selected (correct?)

Right. See the following condition in getBiggestMemStoreRegion():

if (region.writestate.flushing || !region.writestate.writesEnabled) {

  continue;

}



See also the log line from 20090-server-61260-01-07.log:

{code}

2018-03-02 17:28:30,096 DEBUG [MemStoreFlusher.0] regionserver.HRegion: NOT 
flushing memstore for region 
TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., flushing=false, 
writesEnabled=false
{code}
bq. Is this the right way to describe the problem?

Your description mostly matches my understanding (with a minor correction 
below).

bq. is it because its size is set to 0 or because it is marked as non-flushable?

See the following log line from 20090-server-61260-01-07.log:
{code}
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 
400432696
{code}
The size of region A was non-zero, but writesEnabled being false resulted in 
the region not being included in the candidates.

bq. adding a check in line 187 would solve the problem?

That may work.
However, please take a look at patch v7, where the exclusion of ineligible 
regions is done in getBiggestMemStoreRegion().
With patch v7, the existing check on line 187 should suffice.

Thanks

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20090-server-61260-01-07.log, 20090.v6.txt, 
> 20090.v7.txt
>
>
> Copied the following from a comment since it was a better description of the 
> race condition.
> The original description was merged into the beginning of my first comment 
> below.
> With more debug logging, we can see the scenario where the exception was 
> triggered.
> {code}
> 2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
> Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
> compaction_queue=(0:0), split_queue=1
> 2018-03-02 17:28:30,098 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
> regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because 
> info  size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
> 2018-03-02 17:28:30,296 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush thread woke up because memory above low 
> water=381.5 M
> 2018-03-02 17:28:30,297 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
> {code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-12 Thread Eshcar Hillel (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395229#comment-16395229 ]

Eshcar Hillel edited comment on HBASE-20090 at 3/12/18 1:27 PM:


I am trying to see if I understand the discussed scenario correctly:
 * Region A (TestTable) and region B (Atlas table) are the only two regions on 
the region server.
 * The region server exceeds the global pressure threshold (380MB).
 * Region A has 400MB while region B has 0MB.
 * Region A has more than 5GB of data on disk (6.9GB), hence a split is 
triggered.
 * Finally (this is what is not 100% clear): in the next loop calling 
flushOneForGlobalPressure, region A is not selected (correct?); region B is 
selected, but since its size is 0 the precondition check throws an exception.

Is this the right way to describe the problem? Can you explain again why region 
A is not selected: is it because its size is set to 0 or because it is marked 
as non-flushable?

To summarize, if I understand correctly: there is a period of time where the 
global memstore size is high (>380MB) while the region holding all this data is 
being split, and hence the data is in the process of being flushed. As a 
result, the memstore flusher that is triggered due to global pressure finds 
itself in an inconsistent state:

{color:#008000}"Above memory mark but there are no flushable regions!"{color}

Do you think that adding a check in line 187 would solve the 
problem?
{code:java}
if (bestAnyRegion == null 
||  bestAnyRegion.getMemStoreDataSize() == 0) // NEW
{
  LOG.error("Above memory mark but there are no flushable regions!");
  return false;
}{code}
 


was (Author: eshcar):
I am trying to see if I understand the discussed scenario correctly:
 * Region A (TestTable) and region B (Atlas table) are the only two regions on 
the region server.
 * The region server exceeds the global pressure threshold (380MB).
 * Region A has 400MB while region B has 0MB.
 * Region A has more than 5GB of data on disk (6.9GB), hence a split is 
triggered.
 * Finally (this is what is not 100% clear): in the next loop calling 
flushOneForGlobalPressure, region A is not selected (correct?); region B is 
selected, but since its size is 0 the precondition check throws an exception.

Is this the right way to describe the problem? Can you explain again why region 
A is not selected: is it because its size is set to 0 or because it is marked 
as non-flushable?

To summarize, if I understand correctly: there is a period of time where the 
global memstore size is high (>380MB) while the region holding all this data is 
being split, and hence the data is in the process of being flushed. As a 
result, the memstore flusher that is triggered due to global pressure finds 
itself in an inconsistent state:

{color:#008000}"Above memory mark but there are no flushable regions!"{color}

Do you think that adding a check in line 187 would solve the problem?
{code:java}
if (bestAnyRegion == null 
||  bestAnyRegion.getMemStoreDataSize() == 0) // NEW
{
  LOG.error("Above memory mark but there are no flushable regions!");
  return false;
}{code}
 

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20090-server-61260-01-07.log, 20090.v6.txt
>
>
> Copied the following from a comment since it was a better description of the 
> race condition.
> The original description was merged into the beginning of my first comment 
> below.
> With more debug logging, we can see the scenario where the exception was 
> triggered.
> {code}
> 2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
> Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
> compaction_queue=(0:0), split_queue=1
> 2018-03-02 17:28:30,098 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
> regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because 
> info  size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
> 2018-03-02 17:28:30,296 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush thread woke up because memory above low 
> water=381.5 M
> 2018-03-02 17:28:30,297 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> {code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-06 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388569#comment-16388569 ]

Ted Yu edited comment on HBASE-20090 at 3/7/18 3:17 AM:


Looking at the javadoc for getCopyOfOnlineRegionsSortedByOffHeapSize():
{code}
   *   the biggest.  If two regions are the same size, then the last one found 
wins; i.e. this
   *   method may NOT return all regions.
{code}
Currently the value type is HRegion - we only store one region per size.
I think we should change the value type to Collection<HRegion> so that we don't 
miss any region (potentially one with a big size).
e.g.
Suppose there are three regions (R1, R2 and R3) with sizes 100, 100 and 1, 
respectively.
Using the current data structure, R2 would be stored in the Map, evicting R1 
from the Map.
This means that the current code would choose to flush regions R2 and R3, 
releasing a total of 101 from memory.
If the value type is changed to Collection<HRegion>, we would flush both R1 and 
R2, releasing 200. This achieves faster memory reclamation. A sketch follows.
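
As a hedged sketch (not the actual patch) of the bookkeeping, where the 
reverse-ordered TreeMap and the onlineRegions collection are assumptions for 
illustration, java.util collection types are assumed imported, and 
getMemStoreDataSize() is the HRegion accessor quoted elsewhere in this thread:
{code}
// Sketch: bucket regions by memstore size so that two regions with the
// same size are both retained. Largest size first.
SortedMap<Long, Collection<HRegion>> regionsBySize =
    new TreeMap<>(Collections.reverseOrder());
for (HRegion region : onlineRegions) {
  long size = region.getMemStoreDataSize();
  // Regions of equal size share one bucket, so none is silently evicted.
  regionsBySize.computeIfAbsent(size, k -> new ArrayList<>()).add(region);
}
{code}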


was (Author: yuzhih...@gmail.com):
Looking at the javadoc for getCopyOfOnlineRegionsSortedByOffHeapSize():
{code}
   *   the biggest.  If two regions are the same size, then the last one found 
wins; i.e. this
   *   method may NOT return all regions.
{code}
Currently the value type is HRegion - we only store one region per size.
One enhancement is to change the value type to Collection<HRegion> so that we 
don't miss any region (potentially one with a big size).

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20090-server-61260-01-07.log, 20090.v6.txt
>
>
> Copied the following from a comment since it was a better description of the 
> race condition.
> The original description was merged into the beginning of my first comment 
> below.
> With more debug logging, we can see the scenario where the exception was 
> triggered.
> {code}
> 2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
> Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
> compaction_queue=(0:0), split_queue=1
> 2018-03-02 17:28:30,098 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
> regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because 
> info  size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
> 2018-03-02 17:28:30,296 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush thread woke up because memory above low 
> water=381.5 M
> 2018-03-02 17:28:30,297 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush of region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global
>  heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap 
> size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Nothing to flush for 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Excluding unflushable region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -trying to 
> find a different region to flush.
> {code}
> Region 0453f29030757eedb6e6a1c57e88c085 was being split.
> In HRegion#flushcache, the log from the else branch can be seen in 
> 20090-server-61260-01-07.log:
> {code}
>   synchronized (writestate) {
> if (!writestate.flushing && writestate.writesEnabled) {
>   this.writestate.flushing = true;
> } else {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("NOT flushing memstore for region " + this
> + ", flushing=" + writestate.flushing + ", writesEnabled="
> + writestate.writesEnabled);
>   }
> {code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-05 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377843#comment-16377843 ]

Ted Yu edited comment on HBASE-20090 at 3/5/18 1:48 PM:


Here is the code in branch-2:
{code}
try {
  wakeupPending.set(false); // allow someone to wake us up again
  fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
  if (fqe == null || fqe instanceof WakeupFlushThread) {
...
  if (!flushOneForGlobalPressure()) {
...
  FlushRegionEntry fre = (FlushRegionEntry) fqe;
  if (!flushRegion(fre)) {
break;
...
} catch (Exception ex) {
  LOG.error("Cache flusher failed for entry " + fqe, ex);
  if (!server.checkFileSystem()) {
break;
  }
}
{code}
Inside flushOneForGlobalPressure():
{code}
  Preconditions.checkState(
(regionToFlush != null && regionToFlushSize > 0) ||
(bestRegionReplica != null && bestRegionReplicaSize > 0));
{code}
When the Preconditions check fails, IllegalStateException is caught by the 
catch block shown above.
However, the fqe is not flushed, resulting in potential data loss.
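
As a standalone illustration of the failure mode (not HBase code; the real 
check also tests the two regions for null, and the shaded Preconditions class 
from the stack trace below is assumed on the classpath):
{code}
import org.apache.hbase.thirdparty.com.google.common.base.Preconditions;

public class CheckStateDemo {
  public static void main(String[] args) {
    long regionToFlushSize = 0;      // the big region is excluded (mid-split)
    long bestRegionReplicaSize = 0;  // no region replica in play
    // checkState throws java.lang.IllegalStateException when its argument
    // is false; FlushHandler.run catches it and the entry is not flushed.
    Preconditions.checkState(regionToFlushSize > 0 || bestRegionReplicaSize > 0);
  }
}
{code}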

Observed the following in the region server log (on a hadoop3 cluster):
{code}
2018-02-26 16:06:49,962 INFO  [MemStoreFlusher.1] regionserver.HRegion: 
Flushing 1/1 column families, memstore=804.67 KB
2018-02-26 16:06:50,028 INFO  [MemStoreFlusher.1] 
regionserver.DefaultStoreFlusher: Flushed, sequenceid=5448, memsize=804.7 K, 
hasBloomFilter=true, into tmp file 
hdfs://mycluster/apps/hbase/data/data/default/TestTable/3552368c92476437cb96e357d2c7d618/.tmp/info/81721cc57fee43ebb55ba430f5730c25
2018-02-26 16:06:50,042 INFO  [MemStoreFlusher.1] regionserver.HStore: Added 
hdfs://mycluster/apps/hbase/data/data/default/TestTable/3552368c92476437cb96e357d2c7d618/info/81721cc57fee43ebb55ba430f5730c25, 
entries=784, sequenceid=5448, filesize=813.9 K
2018-02-26 16:06:50,044 INFO  [MemStoreFlusher.1] regionserver.HRegion: 
Finished memstore flush of ~804.67 KB/823984, currentsize=0 B/0 for region 
TestTable,00155728,1519661093622.3552368c92476437cb96e357d2c7d618. in 82ms, 
sequenceid=5448, compaction requested=true
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 185ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 163ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=22,queue=1,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 160ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 160ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=23,queue=2,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 158ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=27,queue=0,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 151ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 147ms
2018-02-26 16:06:50,044 WARN  
[RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=16020] 
regionserver.MemStoreFlusher: Memstore is above high water mark and block 135ms

2018-02-26 16:06:50,049 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Cache flusher failed for entry 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2adfadd7
java.lang.IllegalStateException
at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:441)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:174)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$600(MemStoreFlusher.java:69)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:237)
at java.lang.Thread.run(Thread.java:748)
{code}
Unfortunately the DEBUG logging was not on.
Will see if I can reproduce the exception next time.


was (Author: yuzhih...@gmail.com):
Observed the following in the region server log (on a hadoop3 cluster):
{code}
2018-02-26 16:06:49,962 INFO  [MemStoreFlusher.1] regionserver.HRegion: 
Flushing 1/1 column families, memstore=804.67 KB
2018-02-26 16:06:50,028 INFO  [MemStoreFlusher.1] 
regionserver.DefaultStoreFlusher: Flushed, sequenceid=5448, memsize=804.7 K, 
hasBloomFilter=true, into tmp file 
hdfs://mycluster/apps/hbase/data/data/default/TestTable/3552368c92476437cb96e357d2c7d618/.tmp/info/81721cc57fee43ebb55ba430f5730c25
{code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-05 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385924#comment-16385924 ]

Ted Yu edited comment on HBASE-20090 at 3/5/18 10:45 AM:

Here is how I produced the server log with additional DEBUG information:

* Wipe out TestTable
* Add the DEBUG log, build a tarball, load the tarball onto the cluster, and 
start PerformanceEval
* Observe that TestTable has 8 regions, then search for the new DEBUG log in 
each region server log

The above procedure was repeated several times until I got what you saw as 
20090-server-61260-01-07.log.


was (Author: yuzhih...@gmail.com):
Here is how I produced the server log with additional DEBUG information:

* Wipe out TestTable
* Add the DEBUG log, build a tarball, and load the tarball onto the cluster
* Observe that TestTable has 8 regions, then search for the new DEBUG log in 
each region server log

The above procedure was repeated several times until I got what you saw as 
20090-server-61260-01-07.log.

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20090-server-61260-01-07.log, 20090.v1.txt, 
> 20090.v4.txt, 20090.v5.txt
>
>
> Here is the code in branch-2:
> {code}
> try {
>   wakeupPending.set(false); // allow someone to wake us up again
>   fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>   if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>   if (!flushOneForGlobalPressure()) {
> ...
>   FlushRegionEntry fre = (FlushRegionEntry) fqe;
>   if (!flushRegion(fre)) {
> break;
> ...
> } catch (Exception ex) {
>   LOG.error("Cache flusher failed for entry " + fqe, ex);
>   if (!server.checkFileSystem()) {
> break;
>   }
> }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>   Preconditions.checkState(
> (regionToFlush != null && regionToFlushSize > 0) ||
> (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, IllegalStateException is caught by the 
> catch block shown above.
> However, the fqe is not flushed, resulting in potential data loss.





[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-05 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383887#comment-16383887 ]

Ted Yu edited comment on HBASE-20090 at 3/5/18 10:31 AM:

With more debug logging, we can see the scenario where the exception was 
triggered.
{code}
2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
compaction_queue=(0:0), split_queue=1
2018-03-02 17:28:30,098 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info  
size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
2018-03-02 17:28:30,296 INFO  
[RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush thread woke up because memory above low water=381.5 M
2018-03-02 17:28:30,297 INFO  
[RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 
400432696
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. 
due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal 
Memstore Heap size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
wake up by WAKEUPFLUSH_INSTANCE
2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Nothing to flush for 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Excluding unflushable region 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -trying to 
find a different region to flush.
{code}
Region 0453f29030757eedb6e6a1c57e88c085 was being split.
In HRegion#flushcache, the log from the else branch can be seen in 
20090-server-61260-01-07.log:
{code}
  synchronized (writestate) {
if (!writestate.flushing && writestate.writesEnabled) {
  this.writestate.flushing = true;
} else {
  if (LOG.isDebugEnabled()) {
LOG.debug("NOT flushing memstore for region " + this
+ ", flushing=" + writestate.flushing + ", writesEnabled="
+ writestate.writesEnabled);
  }
{code}
This means region 0453f29030757eedb6e6a1c57e88c085 could not flush, leaving 
memory pressure at a high level.
When MemStoreFlusher reached the following call, the region was no longer a 
flush candidate:
{code}
  HRegion bestFlushableRegion =
  getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
{code}
So the other region, 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae., was examined 
next. Since that region was not receiving writes, the (current) Precondition 
check failed.

The proposed fix is to convert the Precondition to a normal return, sketched 
below.
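
For illustration, a minimal sketch of that conversion, reusing the variable 
names from the snippets above (a sketch only; the actual patch may read 
differently):
{code}
// Sketch: replace the Preconditions.checkState with a normal return so no
// IllegalStateException reaches FlushHandler.run and the flusher can simply
// try again on the next wakeup.
if (!((regionToFlush != null && regionToFlushSize > 0)
    || (bestRegionReplica != null && bestRegionReplicaSize > 0))) {
  LOG.info("Above memory mark but found no flushable region, returning");
  return false;
}
{code}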


was (Author: yuzhih...@gmail.com):
With more debug logging, we can see the scenario where the exception was 
triggered.
{code}
2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
compaction_queue=(0:0), split_queue=1
2018-03-02 17:28:30,098 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because info  
size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
2018-03-02 17:28:30,296 INFO  
[RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush thread woke up because memory above low water=381.5 M
2018-03-02 17:28:30,297 INFO  
[RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 
400432696
2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. 
due to global heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal 
Memstore Heap size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
{code}

[jira] [Comment Edited] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

2018-03-02 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383145#comment-16383145 ]

Ted Yu edited comment on HBASE-20090 at 3/2/18 10:11 AM:

I added a log line (the first line below) for the variables involved in the 
Precondition check:
{code}
2018-03-02 00:58:18,880 DEBUG [MemStoreFlusher.0] regionserver.MemStoreFlusher: 
regionToFlush 
ATLAS_ENTITY_AUDIT_EVENTS,,1519927487389.6b67b274d95d61fcf4c5ab91e102994d.  
 regionToFlushSize=0 bestRegionReplica null bestRegionReplicaSize=0
2018-03-02 00:58:18,881 ERROR [MemStoreFlusher.0] regionserver.MemStoreFlusher: 
Cache flusher failed for entry 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$1@2a
java.lang.IllegalStateException
at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:441)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:259)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$700(MemStoreFlusher.java:69)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:345)
{code}
We can see that bestRegionReplica was null and the region for 
ATLAS_ENTITY_AUDIT_EVENTS had a flush size of 0 (because TestTable, not 
ATLAS_ENTITY_AUDIT_EVENTS, was being written to).

It seems the Preconditions check can be converted to a normal condition check.
[~ram_krish] [~anoop.hbase] [~anastas] :
Can you take a look at the patch?
Here is a snippet from the region server log during PE randomWrite:
{code}
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. 
due to global heap pressure. Flush type=ABOVE_ONHEAP_HIGHER_MARKTotal 
Memstore Heap size=403.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Nothing to flush for 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Excluding unflushable region 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -trying to 
find a different region to flush.
{code}
Note that atlas_janus was not the table being written to; TestTable was.


was (Author: yuzhih...@gmail.com):
It seems the Preconditions check can be converted to a normal condition check.
[~ram_krish] [~anoop.hbase] [~anastas] :
Can you take a look at the patch?
Here is a snippet from the region server log during PE randomWrite:
{code}
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Flush of region atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. 
due to global heap pressure. Flush type=ABOVE_ONHEAP_HIGHER_MARKTotal 
Memstore Heap size=403.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Nothing to flush for 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
2018-03-02 03:55:19,232 INFO  [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Excluding unflushable region 
atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -trying to 
find a different region to flush.
{code}
Note that atlas_janus was not the table being written to; TestTable was.

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> ---
>
> Key: HBASE-20090
> URL: https://issues.apache.org/jira/browse/HBASE-20090
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20094.v01.patch
>
>
> Here is the code in branch-2:
> {code}
> try {
>   wakeupPending.set(false); // allow someone to wake us up again
>   fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>   if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>   if (!flushOneForGlobalPressure()) {
> ...
>   FlushRegionEntry fre = (FlushRegionEntry) fqe;
>   if (!flushRegion(fre)) {
> break;
> ...
> } catch (Exception ex) {
>   LOG.error("Cache flusher failed for entry " + fqe, ex);
>   if (!server.checkFileSystem()) {
> break;
>   }
> }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>   Preconditions.checkState(
> (regionToFlush != null && regionToFlushSize > 0) ||
> (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, IllegalStateException is caught by the 
> catch block shown above.
> However, the fqe is not flushed, resulting in potential data loss.