[GitHub] [hbase] wchevreuil commented on issue #931: HBASE-22285 A normalizer which merges small size regions with adjacen…

2019-12-18 Thread GitBox
wchevreuil commented on issue #931: HBASE-22285 A normalizer which merges small 
size regions with adjacen…
URL: https://github.com/apache/hbase/pull/931#issuecomment-566935817
 
 
   > After giving it a thought, I believe we should have two normalizers: one 
which splits, a second which merges, and maybe a third SimpleRegionNormalizer 
which uses these two together.
   And to reuse code, maybe we can have a utility class for the normalizer which 
contains only pure methods that would help other normalization implementations.
   
   I'm not sure moving logic to a util class is a good design practice. Here I 
think we could probably apply a strategy pattern: an abstract normalizer class 
implementing most of the _computePlanForTable_ logic, with only the parts related 
to picking a region for the region plan, and the actions to be performed on that 
plan (merge, or split, or both), delegated to methods that would be implemented 
differently in each of the subclasses.
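
   For illustration, a minimal, self-contained sketch of that 
template-method/strategy split. Everything except the _computePlanForTable_ name 
is a placeholder invented for this sketch, not HBase's actual normalizer API:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Placeholder types and names for illustration only; not HBase's normalizer API.
abstract class AbstractNormalizer {
  // Template method: the shared skeleton of computePlanForTable.
  final List<String> computePlanForTable(List<Long> regionSizesMb, long avgMb) {
    List<String> plans = new ArrayList<>();
    for (int i = 0; i < regionSizesMb.size(); i++) {
      // Each subclass decides what (if anything) to do with this region.
      plans.addAll(plansFor(i, regionSizesMb.get(i), avgMb));
    }
    return plans;
  }

  // Hook implemented differently by split-only, merge-only or combined subclasses.
  protected abstract List<String> plansFor(int regionIdx, long sizeMb, long avgMb);
}

class SplitNormalizer extends AbstractNormalizer {
  @Override
  protected List<String> plansFor(int regionIdx, long sizeMb, long avgMb) {
    if (sizeMb > 2 * avgMb) {
      return Collections.singletonList("SPLIT region#" + regionIdx);
    }
    return Collections.emptyList();
  }
}

class MergeNormalizer extends AbstractNormalizer {
  @Override
  protected List<String> plansFor(int regionIdx, long sizeMb, long avgMb) {
    if (sizeMb < avgMb / 2) {
      return Collections.singletonList("MERGE region#" + regionIdx + " with a neighbor");
    }
    return Collections.emptyList();
  }
}
{code}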


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998965#comment-16998965
 ] 

Anoop Sam John commented on HBASE-23588:


bq. Btw just a question: Is it possible that cacheIndexesOnWrite & 
cacheBloomsOnWrite both might be enabled but cacheDataOnWrite might be 
disabled? If so, is it recommended for a specific use case (maybe the majority 
of use cases?) or does it have any advantage?
This is possible. It depends on the cache size available in the RS. The size of 
the index and bloom blocks will be much smaller than the data blocks, so for a 
smaller cache it makes sense. But the reverse does not make sense. The same 
applies to the new config around caching compacted blocks on write.
+1 for the idea Ram.
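
For reference, a minimal sketch of that combination (index/bloom cache-on-write 
enabled, data-block cache-on-write disabled). The key names are the standard 
HBase cache-on-write properties; the compacted-blocks key added by HBASE-23066 is 
assumed here and worth double-checking against your version:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CacheOnWriteConfigSketch {
  public static Configuration smallCacheFriendlyConf() {
    Configuration conf = HBaseConfiguration.create();
    // Index and bloom blocks are small, so caching them on write is cheap.
    conf.setBoolean("hfile.block.index.cacheonwrite", true);
    conf.setBoolean("hfile.block.bloom.cacheonwrite", true);
    // Data blocks are large; with a small cache, skip caching them on write.
    conf.setBoolean("hbase.rs.cacheblocksonwrite", false);
    // Assumed key from HBASE-23066 for blocks written during compaction.
    conf.setBoolean("hbase.rs.cachecompactedblocksonwrite", false);
    return conf;
  }
}
{code}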

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998966#comment-16998966
 ] 

Anoop Sam John commented on HBASE-23588:


Can we make this a sub-task of the other Jira which added the new config for 
caching compacted blocks?

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-23588:
-
Parent: HBASE-23066
Issue Type: Sub-task  (was: Improvement)

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23584) Descrease rpc getFileStatus count when open a storefile

2019-12-18 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998981#comment-16998981
 ] 

Anoop Sam John commented on HBASE-23584:


Can we have a master branch patch and a GitHub pull request?

> Descrease rpc getFileStatus count when open a storefile 
> 
>
> Key: HBASE-23584
> URL: https://issues.apache.org/jira/browse/HBASE-23584
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 2.1.1
>Reporter: yuhuiyang
>Priority: Minor
> Attachments: HBASE-23584-branch-2.1-01.patch
>
>
> When a store needs to open a storefile, it issues the getFileStatus RPC 
> twice. So opening a region with too many files, or opening too many regions 
> at once, can take a very long time, especially when the namenode is slow to 
> process each RPC (in my case sometimes 5s) because of problems on the 
> namenode itself. So I think we can decrease the number of getFileStatus 
> calls; this will reduce the stress on the namenode and take less time when a 
> store opens a storefile.
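
The general idea, illustrated with the plain Hadoop FileSystem API only (this is 
not the attached HBase patch): fetch the FileStatus once and reuse it, instead of 
issuing a separate getFileStatus RPC for every attribute that is needed.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileStatusSketch {
  static void openWithSingleStatusLookup(FileSystem fs, Path storeFile) throws IOException {
    FileStatus status = fs.getFileStatus(storeFile); // one namenode RPC
    long length = status.getLen();                   // reused, no extra RPC
    long mtime = status.getModificationTime();       // reused, no extra RPC
    System.out.println(storeFile + ": len=" + length + " mtime=" + mtime);
  }
}
{code}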



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21428) Performance issue due to userRegionLock in the ConnectionManager.

2019-12-18 Thread koo (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998995#comment-16998995
 ] 

koo commented on HBASE-21428:
-

[~stack]

https://github.com/apache/hbase/commit/b6fcc8458002ac8fb5b9f2e93783eb282607fd9b

I have checked that this issue has been resolved since the above patch.

Now our service works well even under overload.

 

thanks :)

> Performance issue due to userRegionLock in the ConnectionManager.
> -
>
> Key: HBASE-21428
> URL: https://issues.apache.org/jira/browse/HBASE-21428
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.2.7
>Reporter: koo
>Priority: Major
>
> My service executes a lot of puts using HTableMultiplexer.
> After the version change, most of the requests are rejected.
> It works fine in 1.2.6.1, but there is a problem in 1.2.7.
> This issue is related to HBASE-19260.
> Most of my threads spend a lot of time as below.
>  
> |"Worker-972" #2479 daemon prio=5 os_prio=0 tid=0x7f8cea86b000 nid=0x4c8c 
> waiting on condition [0x7f8b78104000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005dd703b78> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>  at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1274)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1186)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1170)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1127)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:962)
>  at 
> org.apache.hadoop.hbase.client.HTableMultiplexer.put(HTableMultiplexer.java:206)
>  at 
> org.apache.hadoop.hbase.client.HTableMultiplexer.put(HTableMultiplexer.java:150)|
>  
> When I looked at the issue (HBASE-19260), I recognized the danger of allowing 
> access from multiple threads.
> However, many threads are already created within those configured limits, so
> I think it is very inefficient to allow only one thread access.
>  
> | this.metaLookupPool = getThreadPool(
>  conf.getInt("hbase.hconnection.meta.lookup.threads.max", 128),
>  conf.getInt("hbase.hconnection.meta.lookup.threads.core", 10),
>  "-metaLookup-shared-", new LinkedBlockingQueue<Runnable>());|
>  
> I want to suggest changing it to allow multiple locks (but not one per 
> thread).
> The following is pseudocode.
>  
> |int lockSize = conf.getInt("hbase.hconnection.meta.lookup.threads.max", 128) 
> / 2;
> BlockingQueue<ReentrantLock> userRegionLockQueue = new 
> LinkedBlockingQueue<>();
>  for (int i = 0; i < lockSize; i++) {
>    userRegionLockQueue.put(new ReentrantLock());
>  }|
>  
> thanks.
>  
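
One concrete way to realize the quoted proposal is classic lock striping: hash 
the table name and region start key onto a fixed set of locks, so unrelated meta 
lookups no longer serialize behind a single userRegionLock. The sketch below is 
illustrative only; it is neither the reporter's pseudocode nor shipped HBase code.

{code:java}
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

public class StripedRegionLocks {
  private final ReentrantLock[] locks;

  public StripedRegionLocks(int stripes) {
    locks = new ReentrantLock[stripes];
    for (int i = 0; i < stripes; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  // Pick a lock deterministically from the lookup key, so identical keys
  // still serialize while unrelated keys can proceed in parallel.
  private ReentrantLock lockFor(byte[] tableAndStartKey) {
    int idx = (Arrays.hashCode(tableAndStartKey) & 0x7fffffff) % locks.length;
    return locks[idx];
  }

  public void withLookupLock(byte[] tableAndStartKey, Runnable metaLookup) {
    ReentrantLock lock = lockFor(tableAndStartKey);
    lock.lock();
    try {
      metaLookup.run(); // e.g. the meta-table region lookup for this key
    } finally {
      lock.unlock();
    }
  }
}
{code}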



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23366) Test failure due to flaky tests on ppc64le

2019-12-18 Thread AK97 (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998865#comment-16998865
 ] 

AK97 edited comment on HBASE-23366 at 12/18/19 10:27 AM:
-

Any leads will be appreciated. 




was (Author: ak2019):
Any leads will be appreciated. 


> Test failure due to flaky tests on ppc64le
> --
>
> Key: HBASE-23366
> URL: https://issues.apache.org/jira/browse/HBASE-23366
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.2.0
> Environment: {color:#172b4d}os: rhel 7.6{color}
> {color:#172b4d} arch: ppc64le{color}
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
> passes, however it leads to flaky test failures in module hbase-server.
> All the test pass most of the times when run individually.
> Following is the list of the tests that fail often:
>  * TestMetaTableMetrics
>  * TestMasterAbortWhileMergingTable
>  * TestSnapshotFromMaster
>  * TestReplicationAdminWithClusters
>  * TestAsyncDecommissionAdminApi
>  * TestCompactSplitThread
>  
>    
> I am on branch rel/2.2.0
> {color:#172b4d}Would like some help on understanding the cause for the same . 
> I am running it on a High end VM with good connectivity.{color}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23366) Test failure due to flaky tests on ppc64le

2019-12-18 Thread AK97 (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998865#comment-16998865
 ] 

AK97 edited comment on HBASE-23366 at 12/18/19 10:30 AM:
-

Any leads will be appreciated. 
[~liuml07] could you please look into this.



was (Author: ak2019):
Any leads will be appreciated. 



> Test failure due to flaky tests on ppc64le
> --
>
> Key: HBASE-23366
> URL: https://issues.apache.org/jira/browse/HBASE-23366
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.2.0
> Environment: {color:#172b4d}os: rhel 7.6{color}
> {color:#172b4d} arch: ppc64le{color}
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
> passes, however it leads to flaky test failures in module hbase-server.
> All the test pass most of the times when run individually.
> Following is the list of the tests that fail often:
>  * TestMetaTableMetrics
>  * TestMasterAbortWhileMergingTable
>  * TestSnapshotFromMaster
>  * TestReplicationAdminWithClusters
>  * TestAsyncDecommissionAdminApi
>  * TestCompactSplitThread
>  
>    
> I am on branch rel/2.2.0
> {color:#172b4d}Would like some help on understanding the cause for the same . 
> I am running it on a High end VM with good connectivity.{color}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23366) Test failure due to flaky tests on ppc64le

2019-12-18 Thread AK97 (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998865#comment-16998865
 ] 

AK97 edited comment on HBASE-23366 at 12/18/19 10:30 AM:
-

Any leads will be appreciated. 
[~liuml07] could you please look into this.
Thank you




was (Author: ak2019):
Any leads will be appreciated. 
[~liuml07] could you please look into this.


> Test failure due to flaky tests on ppc64le
> --
>
> Key: HBASE-23366
> URL: https://issues.apache.org/jira/browse/HBASE-23366
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.2.0
> Environment: {color:#172b4d}os: rhel 7.6{color}
> {color:#172b4d} arch: ppc64le{color}
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
> passes, however it leads to flaky test failures in module hbase-server.
> All the test pass most of the times when run individually.
> Following is the list of the tests that fail often:
>  * TestMetaTableMetrics
>  * TestMasterAbortWhileMergingTable
>  * TestSnapshotFromMaster
>  * TestReplicationAdminWithClusters
>  * TestAsyncDecommissionAdminApi
>  * TestCompactSplitThread
>  
>    
> I am on branch rel/2.2.0
> {color:#172b4d}Would like some help on understanding the cause for the same . 
> I am running it on a High end VM with good connectivity.{color}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23366) Test failure due to flaky tests on ppc64le

2019-12-18 Thread AK97 (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AK97 updated HBASE-23366:
-
Description: 
I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
passes, however it leads to flaky test failures in module hbase-server.

All the tests pass most of the times when run individually.

Following is the list of the tests that fail often:
 * TestMetaTableMetrics
 * TestMasterAbortWhileMergingTable
 * TestSnapshotFromMaster
 * TestReplicationAdminWithClusters
 * TestAsyncDecommissionAdminApi
 * TestCompactSplitThread
 

   
I am on branch rel/2.2.0

{color:#172b4d}Would like some help on understanding the cause for the same . I 
am running it on a High end VM with good connectivity.{color}
 
 
 

  was:
I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
passes, however it leads to flaky test failures in module hbase-server.

All the test pass most of the times when run individually.

Following is the list of the tests that fail often:
 * TestMetaTableMetrics
 * TestMasterAbortWhileMergingTable
 * TestSnapshotFromMaster
 * TestReplicationAdminWithClusters
 * TestAsyncDecommissionAdminApi
 * TestCompactSplitThread
 

   
I am on branch rel/2.2.0

{color:#172b4d}Would like some help on understanding the cause for the same . I 
am running it on a High end VM with good connectivity.{color}
 
 
 


> Test failure due to flaky tests on ppc64le
> --
>
> Key: HBASE-23366
> URL: https://issues.apache.org/jira/browse/HBASE-23366
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.2.0
> Environment: {color:#172b4d}os: rhel 7.6{color}
> {color:#172b4d} arch: ppc64le{color}
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Hbase on rhel_7.6/ppc64le. The build 
> passes, however it leads to flaky test failures in module hbase-server.
> All the tests pass most of the times when run individually.
> Following is the list of the tests that fail often:
>  * TestMetaTableMetrics
>  * TestMasterAbortWhileMergingTable
>  * TestSnapshotFromMaster
>  * TestReplicationAdminWithClusters
>  * TestAsyncDecommissionAdminApi
>  * TestCompactSplitThread
>  
>    
> I am on branch rel/2.2.0
> {color:#172b4d}Would like some help on understanding the cause for the same . 
> I am running it on a High end VM with good connectivity.{color}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-21428) Performance issue due to userRegionLock in the ConnectionManager.

2019-12-18 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li resolved HBASE-21428.
---
Resolution: Won't Fix

Thanks for the update [~taejin], and glad to know you found a solution. I'm 
closing this issue as Won't Fix and linking HBASE-21196 to it.

> Performance issue due to userRegionLock in the ConnectionManager.
> -
>
> Key: HBASE-21428
> URL: https://issues.apache.org/jira/browse/HBASE-21428
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.2.7
>Reporter: koo
>Priority: Major
>
> My service executes a lot of puts using HTableMultiplexer.
> After the version change, most of the requests are rejected.
> It works fine in 1.2.6.1, but there is a problem in 1.2.7.
> This issue is related to HBASE-19260.
> Most of my threads spend a lot of time as below.
>  
> |"Worker-972" #2479 daemon prio=5 os_prio=0 tid=0x7f8cea86b000 nid=0x4c8c 
> waiting on condition [0x7f8b78104000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005dd703b78> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>  at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1274)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1186)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1170)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1127)
>  at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:962)
>  at 
> org.apache.hadoop.hbase.client.HTableMultiplexer.put(HTableMultiplexer.java:206)
>  at 
> org.apache.hadoop.hbase.client.HTableMultiplexer.put(HTableMultiplexer.java:150)|
>  
> When I looked at the issue (HBASE-19260), I recognized the danger of allowing 
> access from multiple threads.
> However, many threads are already created within those configured limits, so
> I think it is very inefficient to allow only one thread access.
>  
> | this.metaLookupPool = getThreadPool(
>  conf.getInt("hbase.hconnection.meta.lookup.threads.max", 128),
>  conf.getInt("hbase.hconnection.meta.lookup.threads.core", 10),
>  "-metaLookup-shared-", new LinkedBlockingQueue<Runnable>());|
>  
> I want to suggest changing it to allow multiple locks (but not one per 
> thread).
> The following is pseudocode.
>  
> |int lockSize = conf.getInt("hbase.hconnection.meta.lookup.threads.max", 128) 
> / 2;
> BlockingQueue<ReentrantLock> userRegionLockQueue = new 
> LinkedBlockingQueue<>();
>  for (int i = 0; i < lockSize; i++) {
>    userRegionLockQueue.put(new ReentrantLock());
>  }|
>  
> thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23350) Make compaction files cacheonWrite configurable based on threshold

2019-12-18 Thread ramkrishna.s.vasudevan (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999038#comment-16999038
 ] 

ramkrishna.s.vasudevan commented on HBASE-23350:


Let's add a new config which takes a size value. If the resulting compacted 
file's size is going to be above this configured value, we just don't cache the 
blocks of that file.
It will be a static way of doing it, and maybe later we can see if some 
intelligence can be added here.
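
A minimal sketch of that static check, assuming a hypothetical config key and 
that the expected size of the compacted file is known up front; this is not the 
eventual HBASE-23350 implementation:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class CompactedFileCacheDecision {
  // Hypothetical key: largest compacted file size (bytes) still cached on write.
  static final String MAX_CACHEABLE_COMPACTED_SIZE_KEY =
      "hbase.rs.cachecompactedblocksonwrite.threshold";

  static boolean shouldCacheCompactedBlocks(Configuration conf, long expectedFileSizeBytes) {
    long threshold = conf.getLong(MAX_CACHEABLE_COMPACTED_SIZE_KEY, Long.MAX_VALUE);
    // Cache on write only if the resulting compacted file stays under the threshold.
    return expectedFileSizeBytes <= threshold;
  }
}
{code}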


> Make compaction files cacheonWrite configurable based on threshold
> --
>
> Key: HBASE-23350
> URL: https://issues.apache.org/jira/browse/HBASE-23350
> Project: HBase
>  Issue Type: Sub-task
>  Components: Compaction
>Reporter: ramkrishna.s.vasudevan
>Assignee: Abhinaba Sarkar
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
>
> As per comment from [~javaman_chen] in the parent JIRA
> https://issues.apache.org/jira/browse/HBASE-23066?focusedCommentId=16937361&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16937361
> This is to introduce a config to decide whether the resulting compacted file's 
> blocks should be added to the cache while writing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hbase] virajjasani opened a new pull request #948: HBASE-23588 : Cache index & bloom blocks on write if CacheCompactedBl…

2019-12-18 Thread GitBox
virajjasani opened a new pull request #948: HBASE-23588 : Cache index & bloom 
blocks on write if CacheCompactedBl…
URL: https://github.com/apache/hbase/pull/948
 
 
   …ocksOnWrite is enabled


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-23588:
-
Status: Patch Available  (was: In Progress)

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-23588 started by Viraj Jasani.

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23584) Descrease rpc getFileStatus count when open a storefile

2019-12-18 Thread HBase QA (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999101#comment-16999101
 ] 

HBase QA commented on HBASE-23584:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  3m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.1 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
18s{color} | {color:green} branch-2.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} branch-2.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
28s{color} | {color:green} branch-2.1 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
23s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} branch-2.1 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
57s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
56s{color} | {color:green} branch-2.1 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
30s{color} | {color:red} hbase-server: The patch generated 3 new + 18 unchanged 
- 0 fixed = 21 total (was 18) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
20s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
21m 50s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.7 2.8.5 or 3.0.3 3.1.2. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}193m 55s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}256m 47s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hbase.regionserver.TestOpenSeqNumUnexpectedIncrease |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 base: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1072/artifact/patchprocess/Dockerfile
 |
| JIRA Issue | HBASE-23584 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12989072/HBASE-23584-branch-2.1-01.patch
 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs 
shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 889ed8471e4f 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 GNU/Linux |
| Build tool | m

[jira] [Created] (HBASE-23589) FlushDescriptor contains non-matching family/output combinations

2019-12-18 Thread Szabolcs Bukros (Jira)
Szabolcs Bukros created HBASE-23589:
---

 Summary: FlushDescriptor contains non-matching family/output 
combinations
 Key: HBASE-23589
 URL: https://issues.apache.org/jira/browse/HBASE-23589
 Project: HBase
  Issue Type: Bug
  Components: read replicas
Affects Versions: 2.2.2
Reporter: Szabolcs Bukros
Assignee: Szabolcs Bukros


Flushing the active region creates the following files:
{code:java}
2019-12-13 08:00:20,866 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/dab4d1cc01e44773bad7bdb5d2e33b6c,
 entries=49128, sequenceid
=70688, filesize=41.4 M
2019-12-13 08:00:20,897 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f3/ecc50f33085042f7bd2397253b896a3a,
 entries=5, sequenceid
=70688, filesize=42.3 M
{code}
On the read replica region when we try to replay the flush we see the following:
{code:java}
2019-12-13 08:00:21,279 WARN org.apache.hadoop.hbase.regionserver.HRegion: 
bfa9cdb0ab13d60b389df6621ab316d1 : At least one of the store files in flush: 
action: COMMIT_FLUSH table_name: "IntegrationTestRegionReplicaReplication" 
encoded_region_name: "20af2eb8929408f26d0b3b81e6b86d47" flush_sequence_number: 
70688 store_flushes { family_name: "f2" store_home_dir: "f2" flush_output: 
"ecc50f33085042f7bd2397253b896a3a" } store_flushes { family_name: "f3" 
store_home_dir: "f3" flush_output: "dab4d1cc01e44773bad7bdb5d2e33b6c" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576252065847.20af2eb8929408f26d0b3b81e6b86d47."
 doesn't exist any more. Skip loading the file(s)
java.io.FileNotFoundException: HFileLink 
locations=[hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/.tmp/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/mobdir/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/archive/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a]
at org.apache.hadoop.hbase.io.FileLink.getFileStatus(FileLink.java:415)
at 
org.apache.hadoop.hbase.util.ServerRegionReplicaUtil.getStoreFileInfo(ServerRegionReplicaUtil.java:135)
at 
org.apache.hadoop.hbase.regionserver.HRegionFileSystem.getStoreFileInfo(HRegionFileSystem.java:311)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2414)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:5310)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:5184)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:5018)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:1143)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:2229)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:29754)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
As you can see the flush_outputs are mixed up.

 

The issue is caused by HRegion.internalFlushCacheAndCommit. The code assumes 
"{color:#808080}stores.values() and storeFlushCtxs have same order{color}" 
which no longer seems to be true.
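
In plain terms, the shape of the fix is to pair each flush output with its column 
family explicitly, instead of zipping two separately ordered collections. The 
sketch below is a self-contained illustration of that idea, not the actual 
HRegion code:

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlushDescriptorSketch {

  // Fragile shape: assumes families.get(i) produced outputsPerStore.get(i),
  // i.e. that the two collections iterate in the same order.
  static Map<String, List<String>> zipByPosition(List<String> families,
                                                 List<List<String>> outputsPerStore) {
    Map<String, List<String>> descriptor = new LinkedHashMap<>();
    for (int i = 0; i < families.size(); i++) {
      descriptor.put(families.get(i), outputsPerStore.get(i));
    }
    return descriptor;
  }

  // Robust shape: each store flush reports its own family together with its
  // own outputs, so iteration order cannot mix them up.
  static Map<String, List<String>> keyedByFamily(Map<String, List<String>> outputsByFamily) {
    return new LinkedHashMap<>(outputsByFamily);
  }
}
{code}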



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23589) FlushDescriptor contains non-matching family/output combinations

2019-12-18 Thread Szabolcs Bukros (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szabolcs Bukros updated HBASE-23589:

Description: 
Flushing the active region creates the following files:
{code:java}
2019-12-13 08:00:20,866 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/dab4d1cc01e44773bad7bdb5d2e33b6c,
 entries=49128, sequenceid
=70688, filesize=41.4 M
2019-12-13 08:00:20,897 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f3/ecc50f33085042f7bd2397253b896a3a,
 entries=5, sequenceid
=70688, filesize=42.3 M
{code}
On the read replica region when we try to replay the flush we see the following:
{code:java}
2019-12-13 08:00:21,279 WARN org.apache.hadoop.hbase.regionserver.HRegion: 
bfa9cdb0ab13d60b389df6621ab316d1 : At least one of the store files in flush: 
action: COMMIT_FLUSH table_name: "IntegrationTestRegionReplicaReplication" 
encoded_region_name: "20af2eb8929408f26d0b3b81e6b86d47" flush_sequence_number: 
70688 store_flushes { family_name: "f2" store_home_dir: "f2" flush_output: 
"ecc50f33085042f7bd2397253b896a3a" } store_flushes { family_name: "f3" 
store_home_dir: "f3" flush_output: "dab4d1cc01e44773bad7bdb5d2e33b6c" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576252065847.20af2eb8929408f26d0b3b81e6b86d47."
 doesn't exist any more. Skip loading the file(s)
java.io.FileNotFoundException: HFileLink 
locations=[hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/.tmp/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/mobdir/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a,
 
hdfs://replica-1:8020/hbase/archive/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a]
at org.apache.hadoop.hbase.io.FileLink.getFileStatus(FileLink.java:415)
at 
org.apache.hadoop.hbase.util.ServerRegionReplicaUtil.getStoreFileInfo(ServerRegionReplicaUtil.java:135)
at 
org.apache.hadoop.hbase.regionserver.HRegionFileSystem.getStoreFileInfo(HRegionFileSystem.java:311)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2414)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:5310)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:5184)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:5018)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:1143)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:2229)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:29754)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
As we can see the flush_outputs got mixed up. 

 

The issue is caused by HRegion.internalFlushCacheAndCommit. The code assumes 
"{color:#808080}stores.values() and storeFlushCtxs have same order{color}" 
which no longer seems to be true.

  was:
Flushing the active region creates the following files:
{code:java}
2019-12-13 08:00:20,866 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/dab4d1cc01e44773bad7bdb5d2e33b6c,
 entries=49128, sequenceid
=70688, filesize=41.4 M
2019-12-13 08:00:20,897 INFO org.apache.hadoop.hbase.regionserver.HStore: Added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f3/ecc50f33085042f7bd2397253b896a3a,
 entries=5, sequenceid
=70688, filesize=42.3 M
{code}
On the read replica region when we try to replay the flush we see the following:
{code:java}
2019-12-13 08:00:21,279 WARN org.apache.hadoop.hbase.regionserver.HRegion: 
bfa9cdb0ab13d60b389df6621ab316d1 : At least one of the store files in flush: 
action: COMMIT_FLUSH table_name: "IntegrationTestRegionReplicaReplication" 
encoded_region_name: "20af2eb8929408f26d0b3b81e6b86d47" flush_sequence_number: 
70688 sto

[GitHub] [hbase] BukrosSzabolcs opened a new pull request #949: HBASE-23589: FlushDescriptor contains non-matching family/output combinations

2019-12-18 Thread GitBox
BukrosSzabolcs opened a new pull request #949: HBASE-23589: FlushDescriptor 
contains non-matching family/output combinations
URL: https://github.com/apache/hbase/pull/949
 
 
   Make sure committed files belong to the correct column family when
   creating a FlushDescriptor.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HBASE-23590) Update maxStoreFileRefCount to maxCompactedStoreFileRefCount

2019-12-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-23590:


 Summary: Update maxStoreFileRefCount to 
maxCompactedStoreFileRefCount
 Key: HBASE-23590
 URL: https://issues.apache.org/jira/browse/HBASE-23590
 Project: HBase
  Issue Type: Bug
Affects Versions: 3.0.0, 2.3.0, 1.6.0
Reporter: Viraj Jasani
Assignee: Viraj Jasani


As per discussion on HBASE-23349, RegionsRecoveryChore should use max refCount 
on compacted away store files and not on new store files to determine when to 
reopen the region. Although work on HBASE-23349 is in progress, we need to at 
least update the metric to get the desired refCount i.e. max refCount among all 
compacted away store files for a given region.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23591) Negative memStoreSizing

2019-12-18 Thread Szabolcs Bukros (Jira)
Szabolcs Bukros created HBASE-23591:
---

 Summary: Negative memStoreSizing
 Key: HBASE-23591
 URL: https://issues.apache.org/jira/browse/HBASE-23591
 Project: HBase
  Issue Type: Bug
  Components: read replicas
Reporter: Szabolcs Bukros
 Fix For: 2.2.2


After a flush on the replica region the memStoreSizing becomes negative:

 
{code:java}
2019-12-17 08:31:59,983 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Replaying flush marker action: COMMIT_FLUSH 
table_name: "IntegrationTestRegionReplicaReplicati
on" encoded_region_name: "544affde3e027454f67c8ea46c8f69ee" 
flush_sequence_number: 41392 store_flushes { family_name: "f1" store_home_dir: 
"f1" flush_output: "3c48a23eac784a348a18e10e337d80a2" } store_flushes { 
family_name: "f2" store_home_dir: "f2" flush_output: 
"9a5283ec95694667b4ead2398af5f01e" } store_flushes { family_name: "f3" 
store_home_dir: "f3" flush_output: "e6f25e6b0eca4d22af15d0626d0f8759" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576599911697.544affde3e027454f67c8ea46c8f69ee."
2019-12-17 08:31:59,984 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Received a flush commit marker with 
seqId:41392 and a previous prepared snapshot was found
2019-12-17 08:31:59,993 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f1/3c48a23eac784a348a18e10e337d80a2,
 entries=32445, sequenceid=41392, filesize=27.6 M
2019-12-17 08:32:00,016 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f2/9a5283ec95694667b4ead2398af5f01e,
 entries=12264, sequenceid=41392, filesize=10.9 M
2019-12-17 08:32:00,121 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f3/e6f25e6b0eca4d22af15d0626d0f8759,
 entries=32379, sequenceid=41392, filesize=27.5 M
2019-12-17 08:32:00,122 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
CustomLog decrMemStoreSize. Current: dataSize=135810071, getHeapSize=148400960, 
getOffHeapSize=0, getCellsCount=167243 delta: dataSizeDelta=155923644, 
heapSizeDelta=170112320, offHeapSizeDelta=0, cellsCountDelta=188399
2019-12-17 08:32:00,122 ERROR org.apache.hadoop.hbase.regionserver.HRegion: 
Asked to modify this region's 
(IntegrationTestRegionReplicaReplication,,1576599911697_0001.0beaae111b0f6e98bfde31ba35be54
08.) memStoreSizing to a negative value which is incorrect. Current 
memStoreSizing=135810071, delta=-155923644
java.lang.Exception
at 
org.apache.hadoop.hbase.regionserver.HRegion.checkNegativeMemStoreDataSize(HRegion.java:1323)
at 
org.apache.hadoop.hbase.regionserver.HRegion.decrMemStoreSize(HRegion.java:1316)
at 
org.apache.hadoop.hbase.regionserver.HRegion.decrMemStoreSize(HRegion.java:1303)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:5194)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:5025)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:1143)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:2232)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:29754)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

{code}
 

 

I added some custom logging to the snapshot logic to be able to see snapshot 
sizes:

 
{code:java}
2019-12-17 08:31:56,900 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Replaying flush marker action: START_FLUSH 
table_name: "IntegrationTestRegionReplicaReplication" encoded_region_name: 
"544affde3e027454f67c8ea46c8f69ee" flush_sequence_number: 41392 store_flushes { 
family_name: "f1" store_home_dir: "f1" } store_flushes { family_name: "f2" 
store_home_dir: "f2" } store_flushes { family_name: "f3" store_home_dir: "f3" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576599911697.544affde3e027454f67c8ea46c8f69ee."
2019-12-17 08:31:56,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Flushing 0beaae111b0f6e98bfde31ba35be5408 3/3 column famil

[jira] [Updated] (HBASE-23591) Negative memStoreSizing

2019-12-18 Thread Szabolcs Bukros (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szabolcs Bukros updated HBASE-23591:

Description: 
After a flush on the replica region the memStoreSizing becomes negative:
{code:java}
2019-12-17 08:31:59,983 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Replaying flush marker action: COMMIT_FLUSH 
table_name: "IntegrationTestRegionReplicaReplicati
on" encoded_region_name: "544affde3e027454f67c8ea46c8f69ee" 
flush_sequence_number: 41392 store_flushes { family_name: "f1" store_home_dir: 
"f1" flush_output: "3c48a23eac784a348a18e10e337d80a2" } store_flushes { 
family_name: "f2" store_home_dir: "f2" flush_output: 
"9a5283ec95694667b4ead2398af5f01e" } store_flushes { family_name: "f3" 
store_home_dir: "f3" flush_output: "e6f25e6b0eca4d22af15d0626d0f8759" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576599911697.544affde3e027454f67c8ea46c8f69ee."
2019-12-17 08:31:59,984 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Received a flush commit marker with 
seqId:41392 and a previous prepared snapshot was found
2019-12-17 08:31:59,993 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f1/3c48a23eac784a348a18e10e337d80a2,
 entries=32445, sequenceid=41392, filesize=27.6 M
2019-12-17 08:32:00,016 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f2/9a5283ec95694667b4ead2398af5f01e,
 entries=12264, sequenceid=41392, filesize=10.9 M
2019-12-17 08:32:00,121 INFO org.apache.hadoop.hbase.regionserver.HStore: 
Region: 0beaae111b0f6e98bfde31ba35be5408 added 
hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/544affde3e027454f67c8ea46c8f69ee/f3/e6f25e6b0eca4d22af15d0626d0f8759,
 entries=32379, sequenceid=41392, filesize=27.5 M
2019-12-17 08:32:00,122 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
CustomLog decrMemStoreSize. Current: dataSize=135810071, getHeapSize=148400960, 
getOffHeapSize=0, getCellsCount=167243 delta: dataSizeDelta=155923644, 
heapSizeDelta=170112320, offHeapSizeDelta=0, cellsCountDelta=188399
2019-12-17 08:32:00,122 ERROR org.apache.hadoop.hbase.regionserver.HRegion: 
Asked to modify this region's 
(IntegrationTestRegionReplicaReplication,,1576599911697_0001.0beaae111b0f6e98bfde31ba35be54
08.) memStoreSizing to a negative value which is incorrect. Current 
memStoreSizing=135810071, delta=-155923644
java.lang.Exception
at 
org.apache.hadoop.hbase.regionserver.HRegion.checkNegativeMemStoreDataSize(HRegion.java:1323)
at 
org.apache.hadoop.hbase.regionserver.HRegion.decrMemStoreSize(HRegion.java:1316)
at 
org.apache.hadoop.hbase.regionserver.HRegion.decrMemStoreSize(HRegion.java:1303)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:5194)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:5025)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:1143)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:2232)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:29754)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

{code}
I added some custom logging to the snapshot logic to be able to see snapshot 
sizes: 
{code:java}
2019-12-17 08:31:56,900 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
0beaae111b0f6e98bfde31ba35be5408 : Replaying flush marker action: START_FLUSH 
table_name: "IntegrationTestRegionReplicaReplication" encoded_region_name: 
"544affde3e027454f67c8ea46c8f69ee" flush_sequence_number: 41392 store_flushes { 
family_name: "f1" store_home_dir: "f1" } store_flushes { family_name: "f2" 
store_home_dir: "f2" } store_flushes { family_name: "f3" store_home_dir: "f3" } 
region_name: 
"IntegrationTestRegionReplicaReplication,,1576599911697.544affde3e027454f67c8ea46c8f69ee."
2019-12-17 08:31:56,900 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Flushing 0beaae111b0f6e98bfde31ba35be5408 3/3 column families, dataSize=126.49 
MB heapSize=138.24 MB
2019-12-17 08:31:56,900 WARN 
org.apache.hadoop.hbase.regionserver.DefaultMemStore: Snapshot called again 
without clearing previous.

[GitHub] [hbase] virajjasani opened a new pull request #950: HBASE-23590 : Update maxStoreFileRefCount to maxCompactedStoreFileRef…

2019-12-18 Thread GitBox
virajjasani opened a new pull request #950: HBASE-23590 : Update 
maxStoreFileRefCount to maxCompactedStoreFileRef…
URL: https://github.com/apache/hbase/pull/950
 
 
   …Count for auto region recovery based on old reader references


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HBASE-23590) Update maxStoreFileRefCount to maxCompactedStoreFileRefCount

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-23590:
-
Fix Version/s: 1.6.0
   2.3.0
   3.0.0
   Status: Patch Available  (was: In Progress)

> Update maxStoreFileRefCount to maxCompactedStoreFileRefCount
> 
>
> Key: HBASE-23590
> URL: https://issues.apache.org/jira/browse/HBASE-23590
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> As per discussion on HBASE-23349, RegionsRecoveryChore should use max 
> refCount on compacted away store files and not on new store files to 
> determine when to reopen the region. Although work on HBASE-23349 is in 
> progress, we need to at least update the metric to get the desired refCount 
> i.e. max refCount among all compacted away store files for a given region.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HBASE-23590) Update maxStoreFileRefCount to maxCompactedStoreFileRefCount

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-23590 started by Viraj Jasani.

> Update maxStoreFileRefCount to maxCompactedStoreFileRefCount
> 
>
> Key: HBASE-23590
> URL: https://issues.apache.org/jira/browse/HBASE-23590
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> As per discussion on HBASE-23349, RegionsRecoveryChore should use max 
> refCount on compacted away store files and not on new store files to 
> determine when to reopen the region. Although work on HBASE-23349 is in 
> progress, we need to at least update the metric to get the desired refCount 
> i.e. max refCount among all compacted away store files for a given region.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23588) Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite is enabled

2019-12-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-23588:
-
Fix Version/s: 2.3.0
   3.0.0

> Cache index blocks and bloom blocks on write if CacheCompactedBlocksOnWrite 
> is enabled
> --
>
> Key: HBASE-23588
> URL: https://issues.apache.org/jira/browse/HBASE-23588
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: ramkrishna.s.vasudevan
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
>
> The existing behaviour, even when cacheOnWrite is enabled, is that we don't 
> cache the index or bloom blocks. Now with HBASE-23066 in place we also cache 
> blocks written on compaction. So it may be better to cache the index/bloom 
> blocks as well if cacheOnWrite is enabled?
> FYI [~javaman_chen]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23589) FlushDescriptor contains non-matching family/output combinations

2019-12-18 Thread HBase QA (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999422#comment-16999422
 ] 

HBase QA commented on HBASE-23589:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
32s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
 1s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  4m 
47s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
45s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
 0s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
17m 11s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.8.5 2.9.2 or 3.1.2. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}261m 
27s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}324m 29s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 base: 
https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-949/1/artifact/out/Dockerfile
 |
| GITHUB PR | https://github.com/apache/hbase/pull/949 |
| JIRA Issue | HBASE-23589 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs 
shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 48a2deed3164 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-949/out/precommit/personality/provided.sh
 |
| git revision | master / 17e180e4ee |
| Default Java | 1.8.0_181 |
|  Test Results | 
https://builds.apache.or

[jira] [Created] (HBASE-23592) Refactor tests in hbase-kafka-proxy in hbase-connectors

2019-12-18 Thread Jan Hentschel (Jira)
Jan Hentschel created HBASE-23592:
-

 Summary: Refactor tests in hbase-kafka-proxy in hbase-connectors
 Key: HBASE-23592
 URL: https://issues.apache.org/jira/browse/HBASE-23592
 Project: HBase
  Issue Type: Improvement
Reporter: Jan Hentschel
Assignee: Jan Hentschel


The tests in {{hbase-kafka-proxy}} within {{hbase-connectors}} should be 
refactored to

* move the usage of the character set to {{StandardCharsets}}
* remove printing the stacktrace
* simplify the asserts
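
As a rough illustration of what these cleanups usually look like (the test class 
and values below are invented for the example, not taken from hbase-kafka-proxy):

{code:java}
import static org.junit.Assert.assertEquals;

import java.nio.charset.StandardCharsets;
import org.junit.Test;

public class ExampleRefactoredTest {

  @Test
  public void testRoundTrip() {
    // Before: "value".getBytes("UTF-8") inside a try/catch whose catch block
    // called e.printStackTrace(); after: StandardCharsets, and let any failure
    // simply fail the test.
    byte[] encoded = "value".getBytes(StandardCharsets.UTF_8);

    // Before: assertTrue("value".equals(new String(encoded, StandardCharsets.UTF_8)));
    // after: assertEquals gives a much better failure message.
    assertEquals("value", new String(encoded, StandardCharsets.UTF_8));
  }
}
{code}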



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hbase-connectors] HorizonNet opened a new pull request #57: HBASE-23592 Refactored tests in hbase-kafka-proxy

2019-12-18 Thread GitBox
HorizonNet opened a new pull request #57: HBASE-23592 Refactored tests in 
hbase-kafka-proxy
URL: https://github.com/apache/hbase-connectors/pull/57
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] YamasakiSS opened a new pull request #951: [UI] Master UI shows long stack traces when table is broken

2019-12-18 Thread GitBox
YamasakiSS opened a new pull request #951: [UI] Master UI shows long stack 
traces when table is broken
URL: https://github.com/apache/hbase/pull/951
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HBASE-23578) [UI] Master UI shows long stack traces when table is broken

2019-12-18 Thread Shuhei Yamasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999560#comment-16999560
 ] 

Shuhei Yamasaki commented on HBASE-23578:
-

[~xucang] 

Thank you for the comment. I created a pull request.

It's a simple patch that only modifies table.jsp.

> [UI] Master UI shows long stack traces when table is broken
> ---
>
> Key: HBASE-23578
> URL: https://issues.apache.org/jira/browse/HBASE-23578
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Shuhei Yamasaki
>Priority: Minor
> Attachments: stackCompact1_short.png, table_jsp.png
>
>
> The table.jsp in the Master UI shows long stack traces when a table is broken 
> (shown as table_jsp.png).
> These messages are hard to read and the web page becomes very wide because the 
> stack traces are displayed in a single line.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23362) WalPrettyPrinter should include the table name

2019-12-18 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-23362:

Fix Version/s: 1.6.0

> WalPrettyPrinter should include the table name
> --
>
> Key: HBASE-23362
> URL: https://issues.apache.org/jira/browse/HBASE-23362
> Project: HBase
>  Issue Type: Improvement
>  Components: tooling
>Affects Versions: master
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0, 2.1.8, 2.2.3
>
>
> I was playing around with a large WAL file to debug something and I noticed a 
> couple of missing items.
> - Pretty printer doesn't print the table name. It is difficult to map the 
> region hashes to a table name manually.
> - It should include an option to filter the edits by table so that we can 
> only see the entries for a given table. A similar option exists for regions.
> I hacked it locally; I thought this might save some time for others too if 
> the fix goes into master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23366) Test failure due to flaky tests on ppc64le

2019-12-18 Thread Mingliang Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999592#comment-16999592
 ] 

Mingliang Liu commented on HBASE-23366:
---

[~AK2019] I'm not seeing these errors in our daily builds because we are not on 
this 2.2 release. The tests are failing for different reasons, so you can check 
the status of each failing test by searching in JIRA. Perhaps they have been 
addressed in later releases, e.g. 
https://issues.apache.org/jira/browse/HBASE-22637?jql=text%20~%20TestMetaTableMetrics%20ORDER%20BY%20created%20DESC
 If they have been fixed in later releases, you can either backport to your 
fork, or upgrade your version if that fits.

> Test failure due to flaky tests on ppc64le
> --
>
> Key: HBASE-23366
> URL: https://issues.apache.org/jira/browse/HBASE-23366
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.2.0
> Environment: {color:#172b4d}os: rhel 7.6{color}
> {color:#172b4d} arch: ppc64le{color}
>Reporter: AK97
>Priority: Major
>
> I have been trying to build Apache HBase on rhel_7.6/ppc64le. The build 
> passes; however, it leads to flaky test failures in the hbase-server module.
> All the tests pass most of the times when run individually.
> Following is the list of the tests that fail often:
>  * TestMetaTableMetrics
>  * TestMasterAbortWhileMergingTable
>  * TestSnapshotFromMaster
>  * TestReplicationAdminWithClusters
>  * TestAsyncDecommissionAdminApi
>  * TestCompactSplitThread
>  
>    
> I am on branch rel/2.2.0
> {color:#172b4d}I would like some help understanding the cause of these failures. 
> I am running it on a high-end VM with good connectivity.{color}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23304) Implement RPCs needed for master based registry

2019-12-18 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999617#comment-16999617
 ] 

Andrew Kyle Purtell commented on HBASE-23304:
-

bq. old client -> new server - does not work, the channel will be closed on 
error because the end points it is looking up would still not be implemented on 
the server side.

This should work unless and until the client config is updated to swap in the 
new asyncregistry provider, right? So it does work, you just can't update your 
client config... and why would you, it is an old client...

> Implement RPCs needed for master based registry
> ---
>
> Key: HBASE-23304
> URL: https://issues.apache.org/jira/browse/HBASE-23304
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 3.0.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>
> We need to implement RPCs on masters needed by client to fetch information 
> like clusterID, active master server name, meta locations etc. These RPCs are 
> used by clients during connection init.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23304) Implement RPCs needed for master based registry

2019-12-18 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993699#comment-16993699
 ] 

Bharath Vissapragada edited comment on HBASE-23304 at 12/19/19 12:59 AM:
-

Following is the information needed by clients during connection init and the 
corresponding RPC endpoints added by the patch.
 # clusterid - GetClusterId(GetClusterIdRequest) -> (GetClusterIdResponse)
 # activemaster - GetActiveMaster(GetActiveMasterRequest) -> 
(GetActiveMasterResponse)
 # metalocations - GetMetaRegionLocations(GetMetaRegionLocationsRequest) -> 
(GetMetaRegionLocationsResponse)
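
A client-side view of these calls might look roughly like the sketch below (the 
interface and method names are illustrative assumptions, not the actual classes 
added by the patch):
{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.hbase.RegionLocations;
import org.apache.hadoop.hbase.ServerName;

// Hypothetical wrapper around the three registry RPCs listed above.
interface MasterRegistryRpcs {
  CompletableFuture<String> getClusterId();                     // GetClusterId
  CompletableFuture<ServerName> getActiveMaster();              // GetActiveMaster
  CompletableFuture<RegionLocations> getMetaRegionLocations();  // GetMetaRegionLocations
}
{code}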

Given these are new service endpoints, we need to consider some upgrade 
scenarios.
 * old client -> old server works (duh!)
 * old client -> new server works (backwards compatible since the patch does 
not change any existing RPC signatures)
 * new client -> old server - *does not work*, the channel will be closed on 
error because the end points it is looking up would still not be implemented on 
the server side. (note: this is updated after Andrew's next comment)
 * new client -> new server works (duh!)

Given this compatibility matrix,
 * Client-server compatibility - clients and servers are *not* allowed to 
upgrade out of sync. Servers should be upgraded first before upgrading the 
clients.
 * Server-server compatibility - unaffected.
 * File format compatibility - unaffected.
 * Client API compatibility - unaffected.
 * Client binary compatibility - unaffected. (only configuration changes needed)
 * Server side limited API compatibility - unaffected.
 * Dependency compatibility - unaffected.


was (Author: bharathv):
Following is the information needed by clients during connection init and the 
corresponding RPC endpoints added by the patch.
 # clusterid - GetClusterId(GetClusterIdRequest) -> (GetClusterIdResponse)
 # activemaster - GetActiveMaster(GetActiveMasterRequest) -> 
(GetActiveMasterResponse)
 # metalocations - GetMetaRegionLocations(GetMetaRegionLocationsRequest) -> 
(GetMetaRegionLocationsResponse)

Given these are new service endpoints, we need to consider some upgrade 
scenarios.
 * old client -> old server works (duh!)
 * old client -> new server works (backwards compatible since the patch does 
not change any existing RPC signatures)
 * old client -> new server - *does not work*, the channel will be closed on 
error because the end points it is looking up would still not be implemented on 
the server side.
 * new client -> new server works (duh!)

Given this compatibility matrix,
 * Client-server compatibility - clients and servers are *not* allowed to 
upgrade out of sync. Servers should be upgraded first before upgrading the 
clients.
 * Server-server compatibility - unaffected.
 * File format compatibility - unaffected.
 * Client API compatibility - unaffected.
 * Client binary compatibility - unaffected. (only configuration changes needed)
 * Server side limited API compatibility - unaffected.
 * Dependency compatibility - unaffected.

> Implement RPCs needed for master based registry
> ---
>
> Key: HBASE-23304
> URL: https://issues.apache.org/jira/browse/HBASE-23304
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 3.0.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>
> We need to implement RPCs on masters needed by client to fetch information 
> like clusterID, active master server name, meta locations etc. These RPCs are 
> used by clients during connection init.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23304) Implement RPCs needed for master based registry

2019-12-18 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999633#comment-16999633
 ] 

Bharath Vissapragada commented on HBASE-23304:
--

Actually, that was a typo. I meant "new client -> old server" (updated it now 
with a note). This is meant to cover a scenario where a "new client" (with 
updated configs) tries to talk to an "old server" (without updated RPC 
definitions). However what you said makes sense to me. This case is unlikely.

> Implement RPCs needed for master based registry
> ---
>
> Key: HBASE-23304
> URL: https://issues.apache.org/jira/browse/HBASE-23304
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 3.0.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>
> We need to implement RPCs on masters needed by client to fetch information 
> like clusterID, active master server name, meta locations etc. These RPCs are 
> used by clients during connection init.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23304) Implement RPCs needed for master based registry

2019-12-18 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999635#comment-16999635
 ] 

Andrew Kyle Purtell commented on HBASE-23304:
-

Thanks, makes sense, that part confused me because why would you change the 
config for an old client...

> Implement RPCs needed for master based registry
> ---
>
> Key: HBASE-23304
> URL: https://issues.apache.org/jira/browse/HBASE-23304
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 3.0.0
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Major
>
> We need to implement RPCs on masters needed by client to fetch information 
> like clusterID, active master server name, meta locations etc. These RPCs are 
> used by clients during connection init.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-18095) Provide an option for clients to find the server hosting META that does not involve the ZooKeeper client

2019-12-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999661#comment-16999661
 ] 

Hudson commented on HBASE-18095:


Results for branch HBASE-18095/client-locate-meta-no-zookeeper
[build #13 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18095%252Fclient-locate-meta-no-zookeeper/13/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18095%252Fclient-locate-meta-no-zookeeper/13//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18095%252Fclient-locate-meta-no-zookeeper/13//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18095%252Fclient-locate-meta-no-zookeeper/13//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Provide an option for clients to find the server hosting META that does not 
> involve the ZooKeeper client
> 
>
> Key: HBASE-18095
> URL: https://issues.apache.org/jira/browse/HBASE-18095
> Project: HBase
>  Issue Type: New Feature
>  Components: Client
>Reporter: Andrew Kyle Purtell
>Assignee: Bharath Vissapragada
>Priority: Major
> Attachments: HBASE-18095.master-v1.patch, HBASE-18095.master-v2.patch
>
>
> Clients are required to connect to ZooKeeper to find the location of the 
> regionserver hosting the meta table region. Site configuration provides the 
> client a list of ZK quorum peers and the client uses an embedded ZK client to 
> query meta location. Timeouts and retry behavior of this embedded ZK client 
> are managed orthogonally to HBase layer settings and in some cases the ZK 
> cannot manage what in theory the HBase client can, i.e. fail fast upon outage 
> or network partition.
> We should consider new configuration settings that provide a list of 
> well-known master and backup master locations, and with this information the 
> client can contact any of the master processes directly. Any master in either 
> active or passive state will track meta location and respond to requests for 
> it with its cached last known location. If this location is stale, the client 
> can ask again with a flag set that requests the master refresh its location 
> cache and return the up-to-date location. Every client interaction with the 
> cluster thus uses only HBase RPC as transport, with appropriate settings 
> applied to the connection. The configuration toggle that enables this 
> alternative meta location lookup should be false by default.
> This removes the requirement that HBase clients embed the ZK client and 
> contact the ZK service directly at the beginning of the connection lifecycle. 
> This has several benefits. ZK service need not be exposed to clients, and 
> their potential abuse, yet no benefit ZK provides the HBase server cluster is 
> compromised. Normalizing HBase client and ZK client timeout settings and 
> retry behavior - in some cases, impossible, i.e. for fail-fast - is no longer 
> necessary. 
> And, from [~ghelmling]: There is an additional complication here for 
> token-based authentication. When a delegation token is used for SASL 
> authentication, the client uses the cluster ID obtained from Zookeeper to 
> select the token identifier to use. So there would also need to be some 
> Zookeeper-less, unauthenticated way to obtain the cluster ID as well. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23584) Decrease rpc getFileStatus count when opening a storefile

2019-12-18 Thread yuhuiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhuiyang updated HBASE-23584:
--
Attachment: HBASE-23584-master-001.patch

> Decrease rpc getFileStatus count when opening a storefile 
> 
>
> Key: HBASE-23584
> URL: https://issues.apache.org/jira/browse/HBASE-23584
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 2.1.1
>Reporter: yuhuiyang
>Priority: Minor
> Attachments: HBASE-23584-branch-2.1-01.patch, 
> HBASE-23584-master-001.patch
>
>
> When a store needs to open a storefile, it issues the getFileStatus RPC 
> twice. So opening a region with too many files, or opening too many regions at 
> once, can take a very long time if the namenode itself is slow and wastes too 
> much time processing each RPC (in my case sometimes 5s). So I think we can 
> decrease the number of getFileStatus calls; this will reduce the stress on the 
> namenode and make opening a storefile take less time.
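> As a rough sketch of the idea (hypothetical code, not the attached patch): fetch 
> the FileStatus once and reuse it, instead of asking the namenode again for the 
> same storefile.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public final class StoreFileOpenSketch {
>   // Hypothetical sketch: one getFileStatus RPC, then reuse the FileStatus for
>   // everything that previously triggered a second lookup.
>   public static long storeFileLength(FileSystem fs, Path storefile) throws IOException {
>     FileStatus status = fs.getFileStatus(storefile); // single RPC to the namenode
>     return status.getLen();                          // reuse; no second getFileStatus
>   }
> }
> {code}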



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23584) Decrease rpc getFileStatus count when opening a storefile

2019-12-18 Thread yuhuiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999697#comment-16999697
 ] 

yuhuiyang commented on HBASE-23584:
---

[~anoop.hbase] I uploaded a patch for master. And by a Git pull, do you mean a PR?

> Decrease rpc getFileStatus count when opening a storefile 
> 
>
> Key: HBASE-23584
> URL: https://issues.apache.org/jira/browse/HBASE-23584
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Affects Versions: 2.1.1
>Reporter: yuhuiyang
>Priority: Minor
> Attachments: HBASE-23584-branch-2.1-01.patch, 
> HBASE-23584-master-001.patch
>
>
> When a store needs to open a storefile, it issues the getFileStatus RPC 
> twice. So opening a region with too many files, or opening too many regions at 
> once, can take a very long time if the namenode itself is slow and wastes too 
> much time processing each RPC (in my case sometimes 5s). So I think we can 
> decrease the number of getFileStatus calls; this will reduce the stress on the 
> namenode and make opening a storefile take less time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-19146) Hbase3.0 protobuf-maven-plugin do not support Arm64(only for x86)

2019-12-18 Thread Ganesh Raju (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-19146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999710#comment-16999710
 ] 

Ganesh Raju commented on HBASE-19146:
-

Any update/progress on this?

> Hbase3.0  protobuf-maven-plugin do not support Arm64(only for x86)
> --
>
> Key: HBASE-19146
> URL: https://issues.apache.org/jira/browse/HBASE-19146
> Project: HBase
>  Issue Type: Bug
>  Components: build, pom
>Affects Versions: 3.0.0
> Environment: OS:  Ubuntu 16.04.3 
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
> Hw platform:  AARCH64
>Reporter: Yuqi Gu
>Priority: Major
>
> We are building HBase 3.0.0-SNAPSHOT on AARCH64.
> It is noted that 'protobuf-maven-plugin' only supports x86, as shown below:
> {code:java}
>  
>org.xolstice.maven.plugins
>protobuf-maven-plugin
>${protobuf.plugin.version}
>
>   com.google.protobuf:protoc:${external.protobuf.version}:
> exe:${os.detected.classifier}
> 
> com.google.protobuf:protoc:${external.protobuf.version}:exe:${os.detected.classifier}
> 
>false
>true
>   
> 
> {code}
> So the build fails.
> {code:java}
> [INFO] --- protobuf-maven-plugin:0.5.0:compile (compile-protoc) @ 
> hbase-protocol-shaded ---
> [INFO] Compiling 32 proto file(s) to 
> /root/hbase/hbase-protocol-shaded/target/generated-sources/protobuf/java
> Failed to execute goal 
> org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.0:compile 
> (compile-protoc) on project hbase-protocol-shaded: Missing:
> {code}
> Then I installed aarch64 protobuf 2.5.0 on the host and modified the pom:
> {code:java}
> -   
> com.google.protobuf:protoc:${external.protobuf.version}:exe:${os.detected.classifier}
> +  /usr/local/bin/protoc
> {code}
>  The build also fails:
> {code:java}
> [INFO] Compiling 32 proto file(s) to 
> /root/hbase/hbase-protocol-shaded/target/generated-sources/protobuf/java
> [ERROR] PROTOC FAILED: google/protobuf/any.proto:31:10: Unrecognized syntax 
> identifier "proto3".  This parser only recognizes "proto2".
> {code}
> It seems that "internal.protobuf.version" in "hbase-protocol-shaded" is 3.3.0.
> How to fix it? Thanks!
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23376) NPE happens while replica region is moving

2019-12-18 Thread Sun Xin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999714#comment-16999714
 ] 

Sun Xin commented on HBASE-23376:
-

Ping [~zhangduo], could you please check this?

> NPE happens while replica region is moving
> --
>
> Key: HBASE-23376
> URL: https://issues.apache.org/jira/browse/HBASE-23376
> Project: HBase
>  Issue Type: Bug
>  Components: read replicas
>Reporter: Sun Xin
>Assignee: Sun Xin
>Priority: Minor
> Attachments: HBASE-23376.branch-2.001.patch, 
> HBASE-23376.master.v02.dummy.patch, HBASE-23376.master.v02.dummy.patch
>
>
> The following code is from AsyncNonMetaRegionLocator#addToCache
> {code:java}
> private RegionLocations addToCache(TableCache tableCache, RegionLocations 
> locs) {
>   LOG.trace("Try adding {} to cache", locs);
>   byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
>   ...
> }{code}
>  we will get an NPE if the locs does not contain the default region location.
>  
> The following code is from 
> AsyncRegionLocatorHelper#updateCachedLocationOnError 
> {code:java}
> ...
> if (cause instanceof RegionMovedException) {
>   RegionMovedException rme = (RegionMovedException) cause;
>   HRegionLocation newLoc =
> new HRegionLocation(loc.getRegion(), rme.getServerName(), 
> rme.getLocationSeqNum());
>   LOG.debug("Try updating {} with the new location {} constructed by {}", 
> loc, newLoc,
> rme.toString());
>   addToCache.accept(newLoc);
> ...{code}
> If the replica region is moving, we will get a RegionMovedException and add 
> the HRegionLocation of the replica region to the cache. And finally the NPE happens.
>   
> {code:java}
> java.lang.NullPointerExceptionjava.lang.NullPointerException at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addToCache(AsyncNonMetaRegionLocator.java:240)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addLocationToCache(AsyncNonMetaRegionLocator.java:596)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocatorHelper.updateCachedLocationOnError(AsyncRegionLocatorHelper.java:80)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.updateCachedLocationOnError(AsyncNonMetaRegionLocator.java:610)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocator.updateCachedLocationOnError(AsyncRegionLocator.java:153)
> {code}
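> One way to avoid the NPE would be a guard along these lines (a sketch only, not 
> the actual fix; the real change may well handle replica-only locations 
> differently):
> {code:java}
> // Sketch: skip caching when the RegionLocations has no default-replica entry,
> // so getDefaultRegionLocation() is never dereferenced when it would be null.
> private RegionLocations addToCache(TableCache tableCache, RegionLocations locs) {
>   if (locs == null || locs.getDefaultRegionLocation() == null) {
>     LOG.trace("Skip caching {}, no default replica location", locs);
>     return locs;
>   }
>   byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
>   ...
> }{code}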



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Michael Stack (Jira)
Michael Stack created HBASE-23593:
-

 Summary: Stalled SCP Assigns
 Key: HBASE-23593
 URL: https://issues.apache.org/jira/browse/HBASE-23593
 Project: HBase
  Issue Type: Bug
  Components: proc-v2
Affects Versions: 2.2.3
Reporter: Michael Stack


I'm stuck on this one so doing a write up here in case anyone else has ideas.

Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem but 
from time to time I'll see the SCP stuck waiting on an Assign to finish. The 
assign seems stuck at the queuing of the OpenRegionProcedure. We've stored the 
procedure but then not a peek thereafter. Later we'll see complaint that the 
region is STUCK. Doesn't recover. Doesn't run.

Basic story is as follows:

Server dies:
{code}
 2019-12-17 11:10:42,002 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration [s011.example.org,16020,1576561318119]
 2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added 
s011.example.org,16020,1576561318119; numProcessing=1
...
 2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Started processing s011.example.org,16020,1576561318119; numProcessing=1
{code}

The dead server restarts which purges the old server from dead server and 
processing lists:

{code}
 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
from the dead servers list
{code}
 

even though we are still processing logs in the SCP of the old server...

{code}
 2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
Archived processed log 
hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
 to hdfs://nameservice1/hbase/oldWALs/s011.example.
org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
{code}

I thought early purge of deadserver was a problem but I don't think so after 
study.

WALS split took two minutes to split and server was removed from dead 
servers...  three minutes earlier...
{code}
 2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log 
files in 
[hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] 
in 143236ms
{code}

 Almost immediately we get this:

{code}
 2019-12-17 11:14:08,649 WARN 
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119, 
table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
{code}

For this region assign, I see the SCP proc making an assign for this region 
which then makes a subtask to OpenRegionProcedure. This is where it gets stuck. 
No progress after this. The procedure does not come alive to run.

Here are logs for the ORP pid=421761:

{code}
2019-12-17 11:38:34,761 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]

2019-12-17 11:38:34,765 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the 
exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2019-12-17 11:38:34,770 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
rollback step
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23594) Procedure stuck due to a region being recorded on two servers.

2019-12-18 Thread Lijin Bin (Jira)
Lijin Bin created HBASE-23594:
-

 Summary: Procedure stuck due to a region being recorded on two 
servers.
 Key: HBASE-23594
 URL: https://issues.apache.org/jira/browse/HBASE-23594
 Project: HBase
  Issue Type: Bug
Reporter: Lijin Bin


Master log:
{code}
$ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
hbase-hbaseadmin-master-100.107.176.225.log.1
2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
by anyone when adding pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
forceNewPlan=true, retain=false
2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.176.215,60020,1576552834619; 
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, 
ppid=193706, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY 
=> 'user68694', ENDKEY => 'user68703'} to server 
100.107.176.215,60020,1576552834619, this usually because the server is alread 
dead, give up and mark the procedure as complete, the parent procedure will 
take care of this.
2019-12-17 11:24:22,921 DEBUG [PEWorker-8] procedure.MasterProcedureScheduler: 
Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1664) to run queue 
because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
Finished subprocedure pid=243482, resume processing parent pid=193706, 
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,921 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=193706, 
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, 
location=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd8

[jira] [Commented] (HBASE-23594) Procedure stuck due to a region being recorded on two servers.

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999742#comment-16999742
 ] 

Duo Zhang commented on HBASE-23594:
---

What's the hbase version? We have fixed a lot of double assign issues on 2.2.x 
branches.

> Procedure stuck due to a region being recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
> location=100.107.176.215,60020,1576552834619; 
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
> procedure pid=193706, ppid=187614, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation 
> pid=243482, ppid=193706, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', 
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 
> 100.107.176.215,60020,1576552834619, this usually because the server is 
> alread dead, give up and mark the procedure as complete, the parent procedure 
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=243482, resume processing parent pid=193706, 
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> p

[jira] [Commented] (HBASE-23376) NPE happens while replica region is moving

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999745#comment-16999745
 ] 

Duo Zhang commented on HBASE-23376:
---

Sorry, too busy recently...

Plan to take a look this evening. Please ping me again if I have not replied by 
next week...

> NPE happens while replica region is moving
> --
>
> Key: HBASE-23376
> URL: https://issues.apache.org/jira/browse/HBASE-23376
> Project: HBase
>  Issue Type: Bug
>  Components: read replicas
>Reporter: Sun Xin
>Assignee: Sun Xin
>Priority: Minor
> Attachments: HBASE-23376.branch-2.001.patch, 
> HBASE-23376.master.v02.dummy.patch, HBASE-23376.master.v02.dummy.patch
>
>
> The following code is from AsyncNonMetaRegionLocator#addToCache
> {code:java}
> private RegionLocations addToCache(TableCache tableCache, RegionLocations 
> locs) {
>   LOG.trace("Try adding {} to cache", locs);
>   byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
>   ...
> }{code}
>  we will get an NPE if the locs does not contain the default region location.
>  
> The following code is from 
> AsyncRegionLocatorHelper#updateCachedLocationOnError 
> {code:java}
> ...
> if (cause instanceof RegionMovedException) {
>   RegionMovedException rme = (RegionMovedException) cause;
>   HRegionLocation newLoc =
> new HRegionLocation(loc.getRegion(), rme.getServerName(), 
> rme.getLocationSeqNum());
>   LOG.debug("Try updating {} with the new location {} constructed by {}", 
> loc, newLoc,
> rme.toString());
>   addToCache.accept(newLoc);
> ...{code}
> If the replica region is moving, we will get a RegionMovedException and add 
> the HRegionLocation of the replica region to the cache. And finally the NPE happens.
>   
> {code:java}
> java.lang.NullPointerExceptionjava.lang.NullPointerException at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addToCache(AsyncNonMetaRegionLocator.java:240)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addLocationToCache(AsyncNonMetaRegionLocator.java:596)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocatorHelper.updateCachedLocationOnError(AsyncRegionLocatorHelper.java:80)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.updateCachedLocationOnError(AsyncNonMetaRegionLocator.java:610)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocator.updateCachedLocationOnError(AsyncRegionLocator.java:153)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23066) Create a config that forces to cache blocks on compaction

2019-12-18 Thread ramkrishna.s.vasudevan (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-23066:
---
Fix Version/s: (was: 1.6.0)
   3.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to branch-2 and master. Let's raise a backport JIRA for branch-1.6. 
Confirmed that the test cases also pass with this change in branch-2. 

> Create a config that forces to cache blocks on compaction
> -
>
> Key: HBASE-23066
> URL: https://issues.apache.org/jira/browse/HBASE-23066
> Project: HBase
>  Issue Type: Improvement
>  Components: Compaction, regionserver
>Affects Versions: 1.4.10
>Reporter: Jacob LeBlanc
>Assignee: Jacob LeBlanc
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: HBASE-23066.patch, performance_results.png, 
> prefetchCompactedBlocksOnWrite.patch
>
>
> In cases where users care a lot about read performance for tables that are 
> small enough to fit into a cache (or the cache is large enough), 
> prefetchOnOpen can be enabled to make the entire table available in cache 
> after the initial region opening is completed. Any new data can also be 
> guaranteed to be in cache with the cacheBlocksOnWrite setting.
> However, the missing piece is when all blocks are evicted after a compaction. 
> We found very poor performance after compactions for tables under heavy read 
> load and a slower backing filesystem (S3). After a compaction the prefetching 
> threads need to compete with threads servicing read requests and get 
> constantly blocked as a result. 
> This is a proposal to introduce a new cache configuration option that would 
> cache blocks on write during compaction for any column family that has 
> prefetch enabled. This would virtually guarantee all blocks are kept in cache 
> after the initial prefetch on open is completed allowing for guaranteed 
> steady read performance despite a slow backing file system.
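> For reference, the two per-family switches mentioned above can be enabled like 
> this (a minimal sketch using the public 2.x API; the new compaction-time option 
> introduced by this issue is a separate setting and is not shown):
> {code:java}
> import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
> import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
> import org.apache.hadoop.hbase.util.Bytes;
>
> ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
>     .newBuilder(Bytes.toBytes("cf1"))
>     .setPrefetchBlocksOnOpen(true) // prefetchOnOpen: warm the cache when the region opens
>     .setCacheDataOnWrite(true)     // cacheBlocksOnWrite: keep newly written blocks cached
>     .build();
> {code}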



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999751#comment-16999751
 ] 

Duo Zhang commented on HBASE-23593:
---

Have you checked the log on the RegionServer side?

> Stalled SCP Assigns
> ---
>
> Key: HBASE-23593
> URL: https://issues.apache.org/jira/browse/HBASE-23593
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.2.3
>Reporter: Michael Stack
>Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem 
> but from time to time I'll see the SCP stuck waiting on an Assign to finish. 
> The assign seems stuck at the queuing of the OpenRegionProcedure. We've 
> stored the procedure but then not a peek thereafter. Later we'll see 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.   
>  
> org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after 
> study.
> WALS split took two minutes to split and server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region 
> which then makes a subtask to OpenRegionProcedure. This is where it gets 
> stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23594) Procedure stuck due to a region being recorded on two servers.

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23594:
--
Affects Version/s: 2.2.2

> Procedure stuck due to a region being recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
> location=100.107.176.215,60020,1576552834619; 
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
> procedure pid=193706, ppid=187614, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation 
> pid=243482, ppid=193706, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', 
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 
> 100.107.176.215,60020,1576552834619, this usually because the server is 
> alread dead, give up and mark the procedure as complete, the parent procedure 
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=243482, resume processing parent pid=193706, 
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=t

[jira] [Commented] (HBASE-23594) Procedure stuck due to a region being recorded on two servers.

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999758#comment-16999758
 ] 

Lijin Bin commented on HBASE-23594:
---

[~zhangduo] It's a version based on 2.2.2.

> Procedure stuck due to a region being recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
> location=100.107.176.215,60020,1576552834619; 
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
> procedure pid=193706, ppid=187614, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation 
> pid=243482, ppid=193706, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', 
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 
> 100.107.176.215,60020,1576552834619, this usually because the server is 
> alread dead, give up and mark the procedure as complete, the parent procedure 
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=243482, resume processing parent pid=193706, 
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=193706, ppid=187614, 

[jira] [Commented] (HBASE-23594) Procedure stuck due to region happen to recorded on two servers.

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999765#comment-16999765
 ] 

Duo Zhang commented on HBASE-23594:
---

{quote}
Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region 
cf9a4ec6cd890aa6806fb313d71e2ebd,
region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and 
assign to 100.107.164.90,60020,1576553001648 and open on 
100.107.164.90,60020,1576553001648 success.
{quote}

This is a bit strange: we only try to reassign when the regionserver is dead, 
and if the target server is dead, we skip the reassign and let the SCP 
interrupt us.

So the problem here is that we quit due to some other condition, and then in the 
ORP we find out the RS is not dead, so we finish the ORP and let the parent TRSP 
reassign. At the same time the RS dies and an SCP comes in, finds the region in 
OPENING state, and includes it in its list of regions to reassign. Then the 
existing TRSP finishes, so when the SCP assigns its regions it will assign this 
one again. This could cause a double assign, or a deadlock if it schedules the 
region to the same server again.

Seems possible. [~binlijin], could you please check what the reason was for 
giving up assigning the region to 100.107.176.215,60020,1576552834619? You can 
see the code in RSProcedureDispatcher.ExecuteProceduresRemoteCall.scheduleForRetry; 
I think we have a warn log for every quit reason.

Thanks.

> Procedure stuck due to region happen to recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
> location=100.107.176.215,60020,1576552834619; 
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
> procedure pid=193706, ppid=187614, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation 
> pid=243482, ppid=193706, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment

[jira] [Resolved] (HBASE-20461) Implement fsync for AsyncFSWAL

2019-12-18 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-20461.
---
Hadoop Flags: Reviewed
  Resolution: Fixed

Pushed to branch-2.1+.

Thanks [~stack] for reviewing.

> Implement fsync for AsyncFSWAL
> --
>
> Key: HBASE-20461
> URL: https://issues.apache.org/jira/browse/HBASE-20461
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
> Environment: Parent issue adds a config so we can fsync rather than 
> sync for FSHLog, the branch-1 WAL. Add same for asyncfwal, the branch-2 WAL
>Reporter: Michael Stack
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.2.3, 2.1.9
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999769#comment-16999769
 ] 

Michael Stack commented on HBASE-23593:
---

The server we are trying to open a region on is struggling. Syncs are taking a 
second and more, WALs are backed up by ~100s, and it is throwing CallQueueTooBig 
exceptions.

I don't see this line though from RSProcedureDispatcher... for the struggling 
server:

  LOG.debug("request to {} failed, try={}", serverName, 
numberOfAttemptsSoFar, e);

... almost as though it just didn't get scheduled/originally dispatched. 
Nothing about the region on the RS side.

> Stalled SCP Assigns
> ---
>
> Key: HBASE-23593
> URL: https://issues.apache.org/jira/browse/HBASE-23593
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.2.3
>Reporter: Michael Stack
>Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem 
> but from time to time I'll see the SCP stuck waiting on an Assign to finish. 
> The assign seems stuck at the queuing of the OpenRegionProcedure. We've 
> stored the procedure but then not a peek thereafter. Later we'll see 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.   
>  
> org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after 
> study.
> WALS split took two minutes to split and server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region 
> which then makes a subtask to OpenRegionProcedure. This is where it gets 
> stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}



--
Thi

[jira] [Updated] (HBASE-20461) Implement fsync for AsyncFSWAL

2019-12-18 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-20461:
--
Release Note: Now AsyncFSWAL also supports Durability.FSYNC_WAL.

> Implement fsync for AsyncFSWAL
> --
>
> Key: HBASE-20461
> URL: https://issues.apache.org/jira/browse/HBASE-20461
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
> Environment: Parent issue adds a config so we can fsync rather than 
> sync for FSHLog, the branch-1 WAL. Add same for asyncfwal, the branch-2 WAL
>Reporter: Michael Stack
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.2.3, 2.1.9
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)
Lijin Bin created HBASE-23595:
-

 Summary: HMaster abort when write to meta failed
 Key: HBASE-23595
 URL: https://issues.apache.org/jira/browse/HBASE-23595
 Project: HBase
  Issue Type: Bug
Reporter: Lijin Bin


RegionStateStore
{code}
  private void updateRegionLocation(RegionInfo regionInfo, State state, Put put)
  throws IOException {
try (Table table = 
master.getConnection().getTable(TableName.META_TABLE_NAME)) {
  table.put(put);
} catch (IOException e) {
  // TODO: Revist Means that if a server is loaded, then we will abort 
our host!
  // In tests we abort the Master!
  String msg = String.format("FAILED persisting region=%s state=%s",
regionInfo.getShortNameToLog(), state);
  LOG.error(msg, e);
  master.abort(msg, e);
  throw e;
}
  }
{code}
When the regionserver carrying meta stops or crashes, and the ServerCrashProcedure 
has not yet started processing, writes to meta will fail and abort the master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999795#comment-16999795
 ] 

Duo Zhang commented on HBASE-23593:
---

{code}
  boolean submitRegionProcedure(long procId) {
if (procId == -1) {
  return true;
}
// Ignore the region procedures which already submitted.
Long previous = submittedRegionProcedures.putIfAbsent(procId, procId);
if (previous != null) {
  LOG.warn("Received procedure pid={}, which already submitted, just ignore 
it", procId);
  return false;
}
// Ignore the region procedures which already executed.
if (executedRegionProcedures.getIfPresent(procId) != null) {
  LOG.warn("Received procedure pid={}, which already executed, just ignore 
it", procId);
  return false;
}
return true;
  }
{code}

Have you seen something like this at the region server side?

> Stalled SCP Assigns
> ---
>
> Key: HBASE-23593
> URL: https://issues.apache.org/jira/browse/HBASE-23593
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.2.3
>Reporter: Michael Stack
>Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem 
> but from time to time I'll see the SCP stuck waiting on an Assign to finish. 
> The assign seems stuck at the queuing of the OpenRegionProcedure. We've 
> stored the procedure but then not a peek thereafter. Later we'll see 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.   
>  
> org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after 
> study.
> WALS split took two minutes to split and server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region 
> which then makes a subtask to OpenRegionProcedure. This is where it gets 
> stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-

[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999793#comment-16999793
 ] 

Michael Stack commented on HBASE-23593:
---

[~zhangduo] Yeah, see above. Probably just a case of a super overloaded server. 
The upshot though is that the assign doesn't complete, so the SCP is stuck.

> Stalled SCP Assigns
> ---
>
> Key: HBASE-23593
> URL: https://issues.apache.org/jira/browse/HBASE-23593
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.2.3
>Reporter: Michael Stack
>Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem 
> but from time to time I'll see the SCP stuck waiting on an Assign to finish. 
> The assign seems stuck at the queuing of the OpenRegionProcedure. We've 
> stored the procedure but then not a peek thereafter. Later we'll see 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.   
>  
> org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after 
> study.
> WALS split took two minutes to split and server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region 
> which then makes a subtask to OpenRegionProcedure. This is where it gets 
> stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999798#comment-16999798
 ] 

Duo Zhang commented on HBASE-23595:
---

This is by design I think? There is no simple way to recover this...

> HMaster abort when write to meta failed
> ---
>
> Key: HBASE-23595
> URL: https://issues.apache.org/jira/browse/HBASE-23595
> Project: HBase
>  Issue Type: Bug
>Reporter: Lijin Bin
>Priority: Major
>
> RegionStateStore
> {code}
>   private void updateRegionLocation(RegionInfo regionInfo, State state, Put 
> put)
>   throws IOException {
> try (Table table = 
> master.getConnection().getTable(TableName.META_TABLE_NAME)) {
>   table.put(put);
> } catch (IOException e) {
>   // TODO: Revist Means that if a server is loaded, then we will 
> abort our host!
>   // In tests we abort the Master!
>   String msg = String.format("FAILED persisting region=%s state=%s",
> regionInfo.getShortNameToLog(), state);
>   LOG.error(msg, e);
>   master.abort(msg, e);
>   throw e;
> }
>   }
> {code}
> When the regionserver carrying meta stops or crashes, and the ServerCrashProcedure 
> has not yet started processing, writes to meta will fail and abort the master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23594) Procedure stuck due to region happen to recorded on two servers.

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999799#comment-16999799
 ] 

Lijin Bin commented on HBASE-23594:
---

{code}
2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.176.215,60020,1576552834619; 
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619


2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=243483, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
2019-12-17 11:24:22,915 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=243483 for 
100.107.176.215,60020,1576552834619 (carryingMeta=false) 
100.107.176.215,60020,1576552834619/CRASHED/regionCount=22284/lock=java.util.concurrent.locks.ReentrantReadWriteLock@34d5ca67[Write
 locks = 1, Read locks = 0], oldState=ONLINE.


2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, 
ppid=193706, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY 
=> 'user68694', ENDKEY => 'user68703'} to server 
100.107.176.215,60020,1576552834619, this usually because the server is alread 
dead, give up and mark the procedure as complete, the parent procedure will 
take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
100.107.176.215,60020,1576552834619; pid=243482, ppid=193706, state=RUNNABLE, 
locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
{code}

[~zhangduo]

> Procedure stuck due to region happen to recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure 

[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

2019-12-18 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999801#comment-16999801
 ] 

Michael Stack commented on HBASE-23593:
---

[~zhangduo] I don't see those 'Received procedure pid...' lines on the RS side. 
Dang (thanks for the help).

> Stalled SCP Assigns
> ---
>
> Key: HBASE-23593
> URL: https://issues.apache.org/jira/browse/HBASE-23593
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.2.3
>Reporter: Michael Stack
>Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem 
> but from time to time I'll see the SCP stuck waiting on an Assign to finish. 
> The assign seems stuck at the queuing of the OpenRegionProcedure. We've 
> stored the procedure but then not a peek thereafter. Later we'll see 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.   
>  
> org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after 
> study.
> WALS split took two minutes to split and server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region 
> which then makes a subtask to OpenRegionProcedure. This is where it gets 
> stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999805#comment-16999805
 ] 

Lijin Bin commented on HBASE-23595:
---

[~zhangduo] I think we do not need to abort; we should process the 
ServerCrashProcedure and assign meta...
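
A rough sketch of that idea, for illustration only: retry the meta write (giving 
the ServerCrashProcedure time to reassign meta) and only abort once a retry budget 
is exhausted. The method shape follows the updateRegionLocation code quoted below; 
the retry budget, backoff and log wording are hypothetical, not an actual patch.

{code:java}
// Sketch only, not a patch: same signature as the quoted updateRegionLocation,
// but the put is retried a few times before we fall back to aborting the master.
private void updateRegionLocation(RegionInfo regionInfo, State state, Put put)
    throws IOException {
  final int maxAttempts = 5;                 // hypothetical retry budget
  for (int attempt = 1; ; attempt++) {
    try (Table table =
        master.getConnection().getTable(TableName.META_TABLE_NAME)) {
      table.put(put);
      return;                                // meta write succeeded
    } catch (IOException e) {
      if (attempt >= maxAttempts) {
        // Behave as today once retries are exhausted.
        String msg = String.format("FAILED persisting region=%s state=%s",
          regionInfo.getShortNameToLog(), state);
        LOG.error(msg, e);
        master.abort(msg, e);
        throw e;
      }
      LOG.warn("Meta update failed for region={} state={}, attempt {}/{}, retrying",
        regionInfo.getShortNameToLog(), state, attempt, maxAttempts, e);
      try {
        Thread.sleep(1000L * attempt);       // simple linear backoff
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new java.io.InterruptedIOException("Interrupted while retrying meta update");
      }
    }
  }
}
{code}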

> HMaster abort when write to meta failed
> ---
>
> Key: HBASE-23595
> URL: https://issues.apache.org/jira/browse/HBASE-23595
> Project: HBase
>  Issue Type: Bug
>Reporter: Lijin Bin
>Priority: Major
>
> RegionStateStore
> {code}
>   private void updateRegionLocation(RegionInfo regionInfo, State state, Put 
> put)
>   throws IOException {
> try (Table table = 
> master.getConnection().getTable(TableName.META_TABLE_NAME)) {
>   table.put(put);
> } catch (IOException e) {
>   // TODO: Revist Means that if a server is loaded, then we will 
> abort our host!
>   // In tests we abort the Master!
>   String msg = String.format("FAILED persisting region=%s state=%s",
> regionInfo.getShortNameToLog(), state);
>   LOG.error(msg, e);
>   master.abort(msg, e);
>   throw e;
> }
>   }
> {code}
> When the regionserver carrying meta stops or crashes, and the ServerCrashProcedure 
> has not yet started processing, writes to meta will fail and abort the master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23596) HBCKServerCrashProcedure can double assign

2019-12-18 Thread Michael Stack (Jira)
Michael Stack created HBASE-23596:
-

 Summary: HBCKServerCrashProcedure can double assign
 Key: HBASE-23596
 URL: https://issues.apache.org/jira/browse/HBASE-23596
 Project: HBase
  Issue Type: Bug
  Components: proc-v2
Reporter: Michael Stack
 Fix For: 2.2.3


The new SCP that does the usual SCP work plus cleanup of 'Unknown Servers' still 
referenced in hbase:meta, added by the commit below, can make for double assignments.

{code}
commit c238891a26734e1e4276b6b1677a58cf83de5dc4
Author: stack 
Date:   Wed Nov 13 22:36:26 2019 -0800

HBASE-23282 HBCKServerCrashProcedure for 'Unknown Servers'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23594) Procedure stuck due to region happen to recorded on two servers.

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999799#comment-16999799
 ] 

Lijin Bin edited comment on HBASE-23594 at 12/19/19 7:17 AM:
-

{code}
2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.176.215,60020,1576552834619; 
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619


2019-12-17 11:24:22,913 DEBUG [RegionServerTracker-0] master.DeadServer: Added 
100.107.176.215,60020,1576552834619; numProcessing=2
2019-12-17 11:24:22,913 INFO  [RegionServerTracker-0] master.ServerManager: 
Processing expiration of 100.107.176.215,60020,1576552834619 on 
100.107.176.225,6,1576552667220
2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=243483, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
2019-12-17 11:24:22,915 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=243483 for 
100.107.176.215,60020,1576552834619 (carryingMeta=false) 
100.107.176.215,60020,1576552834619/CRASHED/regionCount=22284/lock=java.util.concurrent.locks.ReentrantReadWriteLock@34d5ca67[Write
 locks = 1, Read locks = 0], oldState=ONLINE.


2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, 
ppid=193706, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY 
=> 'user68694', ENDKEY => 'user68703'} to server 
100.107.176.215,60020,1576552834619, this usually because the server is alread 
dead, give up and mark the procedure as complete, the parent procedure will 
take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
100.107.176.215,60020,1576552834619; pid=243482, ppid=193706, state=RUNNABLE, 
locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
{code}

[~zhangduo]


was (Author: aoxiang):
{code}
2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.176.215,60020,1576552834619; 
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619


2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=243483, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
2019-12-17 11:24:22,915 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=243483 for 
100.107.176.215,60020,1576552834619 (carryingMeta=false) 
100.107.176.215,60020,1576552834619/CRASHED/regionCount=22284/lock=java.util.concurrent.locks.ReentrantReadWriteLock@34d5ca67[Write
 locks = 1, Read locks = 0], oldState=ONLINE.


2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, 
ppid=193706, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{EN

[GitHub] [hbase] saintstack opened a new pull request #952: HBASE-23596 HBCKServerCrashProcedure can double assign

2019-12-18 Thread GitBox
saintstack opened a new pull request #952: HBASE-23596 HBCKServerCrashProcedure 
can double assign
URL: https://github.com/apache/hbase/pull/952
 
 
   First attempt. Will be back after testing.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999811#comment-16999811
 ] 

Duo Zhang commented on HBASE-23595:
---

Then you can just increase the operation timeout for accessing the meta table...
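
For reference, a minimal sketch of bumping that timeout from configuration. The 
assumption here is that hbase.client.operation.timeout is the knob bounding these 
meta writes; the thread does not name the exact property, so treat the key and the 
value as illustrative:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MetaOperationTimeoutExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: this client-side property also bounds the master's writes to
    // hbase:meta; 300 s here is just an example value, not a recommendation.
    conf.setInt("hbase.client.operation.timeout", 300_000);
    System.out.println("operation timeout = "
        + conf.getInt("hbase.client.operation.timeout", -1) + " ms");
  }
}
{code}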

> HMaster abort when write to meta failed
> ---
>
> Key: HBASE-23595
> URL: https://issues.apache.org/jira/browse/HBASE-23595
> Project: HBase
>  Issue Type: Bug
>Reporter: Lijin Bin
>Priority: Major
>
> RegionStateStore
> {code}
>   private void updateRegionLocation(RegionInfo regionInfo, State state, Put 
> put)
>   throws IOException {
> try (Table table = 
> master.getConnection().getTable(TableName.META_TABLE_NAME)) {
>   table.put(put);
> } catch (IOException e) {
>   // TODO: Revist Means that if a server is loaded, then we will 
> abort our host!
>   // In tests we abort the Master!
>   String msg = String.format("FAILED persisting region=%s state=%s",
> regionInfo.getShortNameToLog(), state);
>   LOG.error(msg, e);
>   master.abort(msg, e);
>   throw e;
> }
>   }
> {code}
> When the regionserver carrying meta stops or crashes, and the ServerCrashProcedure 
> has not yet started processing, writes to meta will fail and abort the master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22263) Master creates duplicate ServerCrashProcedure on initialization, leading to assignment hanging in region-dense clusters

2019-12-18 Thread Pankaj Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999812#comment-16999812
 ] 

Pankaj Kumar commented on HBASE-22263:
--

We also hit this problem in our production environment (HBase 1.3.1); it took 
more than 40 minutes to assign 20k regions.

[~busbey] how's your internal branch testing report? 

> Master creates duplicate ServerCrashProcedure on initialization, leading to 
> assignment hanging in region-dense clusters
> ---
>
> Key: HBASE-22263
> URL: https://issues.apache.org/jira/browse/HBASE-22263
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2, Region Assignment
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Critical
> Attachments: HBASE-22263-branch-1.v0.patch
>
>
> h3. Problem:
> During Master initialization we
>  # restore existing procedures that still need to run from prior active 
> Master instances
>  # look for signs that Region Servers have died and need to be recovered 
> while we were out and schedule a ServerCrashProcedure (SCP) for each them
>  # turn on the assignment manager
> The normal turn of events for a ServerCrashProcedure will attempt to use a 
> bulk assignment to maintain the set of regions on a RS if possible. However, 
> we wait around and retry a bit later if the assignment manager isn’t ready 
> yet.
> Note that currently #2 has no notion of whether or not a previous active 
> Master instance has already done a check. This means we might schedule an 
> SCP for a ServerName (host, port, start code) that already has an SCP 
> scheduled. Ideally, such a duplicate should be a no-op.
> However, before step #2 schedules the SCP it first marks the region server as 
> dead and not yet processed, with the expectation that the SCP it just created 
> will look if there is log splitting work and then mark the server as easy for 
> region assignment. At the same time, any restored SCPs that are past the step 
> of log splitting will be waiting for the AssignmentManager still. As a part 
> of restoring themselves, they do not update with the current master instance 
> to show that they are past the point of WAL processing.
> Once the AssignmentManager starts in #3 the restored SCP continues; it will 
> eventually get to the assignment phase and find that its server is marked as 
> dead and in need of wal processing. Such assignments are skipped with a log 
> message. Thus as we iterate over the regions to assign we’ll skip all of 
> them. This non-intuitively shifts the “no-op” status from the newer SCP we 
> scheduled at #2 to the older SCP that was restored in #1.
> Bulk assignment works by sending the assign calls via a pool to allow more 
> parallelism. Once we’ve set up the pool we just wait to see if the region 
> state updates to online. Unfortunately, since all of the assigns got skipped, 
> we’ll never change the state for any of these regions. That means the bulk 
> assign, and the older SCP that started it, will wait until it hits a timeout.
> By default the timeout for a bulk assignment is the smaller of {{(# Regions 
> in the plan * 10s)}} or {{(# Regions in the most loaded RS in the plan * 1s + 
> 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters 
> with several hundreds of regions per region server, this means the “no-op” 
> SCP will end up waiting ~tens-of-minutes (e.g. ~50 minutes for an average 
> region density of 300 regions per region server on a 100 node cluster. ~11 
> minutes for 300 regions per region server on a 10 node cluster). During this 
> time, the SCP will hold one of the available procedure execution slots for 
> both the overall pool and for the specific server queue.
> As previously mentioned, restored SCPs will retry their submission if the 
> assignment manager has not yet been activated (done in #3), this can cause 
> them to be scheduled after the newer SCPs (created in #2). Thus the order of 
> execution of no-op and usable SCPs can vary from run-to-run of master 
> initialization.
> This means that unless you get lucky with SCP ordering, impacted regions will 
> remain as RIT for an extended period of time. If you get particularly unlucky 
> and a critical system table is included in the regions that are being 
> recovered, then master initialization itself will end up blocked on this 
> sequence of SCP timeouts. If there are enough of them to exceed the master 
> initialization timeouts, then the situation can be self-sustaining as 
> additional master fails over cause even more duplicative SCPs to be scheduled.
> h3. Indicators:
>  * Master appears to hang; failing to assign regions to available re
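
A quick worked check of the bulk-assignment timeout formula quoted in the 
description above (the constants and example cluster sizes are taken from the 
text, not read from the HBase source):

{code:java}
public class BulkAssignTimeoutEstimate {
  // Smaller of (#regions in plan * 10s) and
  // (#regions on most loaded RS * 1s + 60s + #regionservers * 30s), per the description.
  static long timeoutSeconds(int regionsInPlan, int regionsOnMostLoadedRs, int regionServers) {
    long perRegion = regionsInPlan * 10L;
    long perServer = regionsOnMostLoadedRs * 1L + 60L + regionServers * 30L;
    return Math.min(perRegion, perServer);
  }

  public static void main(String[] args) {
    // 300 regions per RS, 100 node cluster -> min(3000s, 3360s) = 3000s ~ 50 minutes
    System.out.println(timeoutSeconds(300, 300, 100) / 60 + " min");
    // 300 regions per RS, 10 node cluster  -> min(3000s, 660s)  = 660s  ~ 11 minutes
    System.out.println(timeoutSeconds(300, 300, 10) / 60 + " min");
  }
}
{code}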

[jira] [Commented] (HBASE-23594) Procedure stuck due to region happen to recorded on two servers.

2019-12-18 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999813#comment-16999813
 ] 

Duo Zhang commented on HBASE-23594:
---

OK, so there could be a race: after getting the regions on the crashed server, a 
region in the list can be moved to another RS.

Let me see if we can provide a UT to reproduce this.

> Procedure stuck due to region happen to recorded on two servers.
> 
>
> Key: HBASE-23594
> URL: https://issues.apache.org/jira/browse/HBASE-23594
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
> hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
> by anyone when adding pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
> procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
> forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
> procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:6] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
> location=100.107.176.215,60020,1576552834619; 
> loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
> regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
> procedure pid=193706, ppid=187614, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation 
> pid=243482, ppid=193706, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
> 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', 
> STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 
> 100.107.176.215,60020,1576552834619, this usually because the server is 
> alread dead, give up and mark the procedure as complete, the parent procedure 
> will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8] 
> procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
> sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=243482, resume processing parent pid=193706, 
> ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=table1w_7, 
> region=cf9a4ec6cd890aa6806fb313d71e2ebd, AS

[jira] [Updated] (HBASE-23376) NPE happens while replica region is moving

2019-12-18 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-23376:
--
Hadoop Flags: Reviewed
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

Pushed to branch-2.1+.

Thanks [~Ddupg] for contributing.

> NPE happens while replica region is moving
> --
>
> Key: HBASE-23376
> URL: https://issues.apache.org/jira/browse/HBASE-23376
> Project: HBase
>  Issue Type: Bug
>  Components: read replicas
>Reporter: Sun Xin
>Assignee: Sun Xin
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 2.2.3, 2.1.9
>
> Attachments: HBASE-23376.branch-2.001.patch, 
> HBASE-23376.master.v02.dummy.patch, HBASE-23376.master.v02.dummy.patch
>
>
> The following code is from AsyncNonMetaRegionLocator#addToCache
> {code:java}
> private RegionLocations addToCache(TableCache tableCache, RegionLocations 
> locs) {
>   LOG.trace("Try adding {} to cache", locs);
>   byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
>   ...
> }{code}
>  we will get an NPE if the locs does not contain the default region.
>  
> The following code is from 
> AsyncRegionLocatorHelper#updateCachedLocationOnError 
> {code:java}
> ...
> if (cause instanceof RegionMovedException) {
>   RegionMovedException rme = (RegionMovedException) cause;
>   HRegionLocation newLoc =
> new HRegionLocation(loc.getRegion(), rme.getServerName(), 
> rme.getLocationSeqNum());
>   LOG.debug("Try updating {} with the new location {} constructed by {}", 
> loc, newLoc,
> rme.toString());
>   addToCache.accept(newLoc);
> ...{code}
> If the replica region is moving, we will get a RegionMovedException and add 
> the HRegionLocation of the replica region to the cache, and finally the NPE 
> happens.
>   
> {code:java}
> java.lang.NullPointerException at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addToCache(AsyncNonMetaRegionLocator.java:240)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addLocationToCache(AsyncNonMetaRegionLocator.java:596)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocatorHelper.updateCachedLocationOnError(AsyncRegionLocatorHelper.java:80)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.updateCachedLocationOnError(AsyncNonMetaRegionLocator.java:610)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocator.updateCachedLocationOnError(AsyncRegionLocator.java:153)
> {code}
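
For illustration only, here is a minimal, self-contained sketch of the 
defensive pattern the report points at: never dereference the default replica 
location without checking it first. The stub types below (LocationStub, 
LocationsStub) are invented for the example and are not HBase classes, and 
this is not the committed HBASE-23376 patch.
{code:java}
// Illustrative sketch only -- invented stub types, not the committed HBASE-23376 patch.
// The idea: a RegionLocations built for a replica move may carry no default
// replica location, so the cache update has to check before dereferencing it.
import java.util.Optional;

public class ReplicaCacheGuardSketch {

  static final class LocationStub {
    final String serverName;
    LocationStub(String serverName) { this.serverName = serverName; }
  }

  static final class LocationsStub {
    private final LocationStub defaultLocation; // may be null for a replica-only update
    LocationsStub(LocationStub defaultLocation) { this.defaultLocation = defaultLocation; }
    Optional<LocationStub> defaultLocation() { return Optional.ofNullable(defaultLocation); }
  }

  // Mirrors the NPE site in AsyncNonMetaRegionLocator#addToCache: only touch the
  // cache when the default replica location is actually present.
  static void addToCache(LocationsStub locs) {
    Optional<LocationStub> defaultLoc = locs.defaultLocation();
    if (!defaultLoc.isPresent()) {
      System.out.println("skip caching: no default replica location");
      return;
    }
    System.out.println("caching location on " + defaultLoc.get().serverName);
  }

  public static void main(String[] args) {
    addToCache(new LocationsStub(new LocationStub("100.107.176.215,60020,1576552834619")));
    addToCache(new LocationsStub(null)); // the replica-move case that NPE'd before the guard
  }
}
{code}
The same check would apply on the updateCachedLocationOnError path quoted 
above, since that is where the replica-only location is handed to addToCache.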



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23376) NPE happens while replica region is moving

2019-12-18 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-23376:
--
Fix Version/s: 2.1.9
   2.2.3
   2.3.0
   3.0.0

> NPE happens while replica region is moving
> --
>
> Key: HBASE-23376
> URL: https://issues.apache.org/jira/browse/HBASE-23376
> Project: HBase
>  Issue Type: Bug
>  Components: read replicas
>Reporter: Sun Xin
>Assignee: Sun Xin
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 2.2.3, 2.1.9
>
> Attachments: HBASE-23376.branch-2.001.patch, 
> HBASE-23376.master.v02.dummy.patch, HBASE-23376.master.v02.dummy.patch
>
>
> The following code is from AsyncNonMetaRegionLocator#addToCache
> {code:java}
> private RegionLocations addToCache(TableCache tableCache, RegionLocations 
> locs) {
>   LOG.trace("Try adding {} to cache", locs);
>   byte[] startKey = locs.getDefaultRegionLocation().getRegion().getStartKey();
>   ...
> }{code}
>  we will get an NPE if the locs does not contain the default region.
>  
> The following code is from 
> AsyncRegionLocatorHelper#updateCachedLocationOnError 
> {code:java}
> ...
> if (cause instanceof RegionMovedException) {
>   RegionMovedException rme = (RegionMovedException) cause;
>   HRegionLocation newLoc =
> new HRegionLocation(loc.getRegion(), rme.getServerName(), 
> rme.getLocationSeqNum());
>   LOG.debug("Try updating {} with the new location {} constructed by {}", 
> loc, newLoc,
> rme.toString());
>   addToCache.accept(newLoc);
> ...{code}
> If the replica region is moving, we will get a RegionMovedException and add 
> the HRegionLocation of the replica region to the cache, and finally the NPE 
> happens.
>   
> {code:java}
> java.lang.NullPointerException at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addToCache(AsyncNonMetaRegionLocator.java:240)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.addLocationToCache(AsyncNonMetaRegionLocator.java:596)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocatorHelper.updateCachedLocationOnError(AsyncRegionLocatorHelper.java:80)
>  at 
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.updateCachedLocationOnError(AsyncNonMetaRegionLocator.java:610)
>  at 
> org.apache.hadoop.hbase.client.AsyncRegionLocator.updateCachedLocationOnError(AsyncRegionLocator.java:153)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999818#comment-16999818
 ] 

Lijin Bin edited comment on HBASE-23595 at 12/19/19 7:36 AM:
-

Yes, I think we need to give high priority to the meta assign procedure and to 
ServerCrashProcedure (carrying meta), so that meta gets assigned quickly.
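
For illustration only, here is that scheduling idea in miniature, in plain 
Java; the stub types and comparator below are invented for the example and are 
not the real MasterProcedureScheduler API.
{code:java}
// The scheduling idea in miniature -- plain Java stubs, not the real
// MasterProcedureScheduler API. Procedures touching hbase:meta (meta ASSIGN,
// SCP carrying meta) are dequeued before everything else.
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class MetaFirstQueueSketch {

  static final class ProcStub {
    final long pid;
    final boolean touchesMeta;
    ProcStub(long pid, boolean touchesMeta) { this.pid = pid; this.touchesMeta = touchesMeta; }
    @Override public String toString() { return "pid=" + pid + (touchesMeta ? " (meta)" : ""); }
  }

  public static void main(String[] args) {
    // Meta-related procedures first, then FIFO by pid.
    Comparator<ProcStub> metaFirst = Comparator
        .comparing((ProcStub p) -> !p.touchesMeta)
        .thenComparingLong(p -> p.pid);
    PriorityBlockingQueue<ProcStub> runQueue = new PriorityBlockingQueue<>(11, metaFirst);

    runQueue.add(new ProcStub(193706, false)); // ordinary region ASSIGN
    runQueue.add(new ProcStub(243482, false)); // ordinary OpenRegionProcedure
    runQueue.add(new ProcStub(69619, true));   // SCP carrying meta

    while (!runQueue.isEmpty()) {
      System.out.println("run " + runQueue.poll()); // pid=69619 (meta) comes out first
    }
  }
}
{code}
The log excerpt below shows what happens without such a preference: the SCP 
for the server carrying meta (regionCount=13026) is scheduled at 16:15:07, 
clients still cannot locate hbase:meta at 16:16:14, and the master finally 
aborts at 16:33:04.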
{code}
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
loc=100.107.165.61,60020,1576553057082
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as 100.107.165.61,60020,1576553057082
2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23569, resume processing parent pid=23568, 
ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=23569, ppid=23568, state=SUCCESS; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23568, resume processing parent pid=23567, 
state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec


2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=69619, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure.MasterProcedureScheduler: Add 
ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
size=1) to run queue because: the exclusive lock is not held by anyone when 
adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=69619 for 
100.107.165.61,60020,1576553057082 (carryingMeta=true) 
100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
2019-12-18 16:15:21,629 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
master.DeadServer: Removed 100.107.165.61,60020,1576553057082, processing=true, 
numProcessing=0



2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
attempt=0 of 31 failed; retrying after sleep of 31
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
callTimeout=6, callDuration=81362: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on 100.107.165.61,60020,1576656916048
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
 row 'hbase:rsgroup,,99' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
seqNum=-1




2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: * ABORTI

[jira] [Commented] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999818#comment-16999818
 ] 

Lijin Bin commented on HBASE-23595:
---

Yes, I think we need to give high priority to the meta assign procedure and to 
ServerCrashProcedure (carrying meta), so that meta gets assigned quickly.
{code}
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
loc=100.107.165.61,60020,1576553057082
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as 100.107.165.61,60020,1576553057082
2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23569, resume processing parent pid=23568, 
ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=23569, ppid=23568, state=SUCCESS; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23568, resume processing parent pid=23567, 
state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec


2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=69619, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure.MasterProcedureScheduler: Add 
ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
size=1) to run queue because: the exclusive lock is not held by anyone when 
adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=69619 for 
100.107.165.61,60020,1576553057082 (carryingMeta=true) 
100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
2019-12-18 16:15:21,629 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
master.DeadServer: Removed 100.107.165.61,60020,1576553057082, processing=true, 
numProcessing=0



2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
attempt=0 of 31 failed; retrying after sleep of 31
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
callTimeout=6, callDuration=81362: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on 100.107.165.61,60020,1576656916048
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
 row 'hbase:rsgroup,,99' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
seqNum=-1




2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: * ABORTING 
master 100.107.176.225,6,1576656778460: FAI

[jira] [Created] (HBASE-23597) Give priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-18 Thread Lijin Bin (Jira)
Lijin Bin created HBASE-23597:
-

 Summary: Give priority for meta assign procedure and 
ServerCrashProcedure which carry meta.
 Key: HBASE-23597
 URL: https://issues.apache.org/jira/browse/HBASE-23597
 Project: HBase
  Issue Type: Improvement
Reporter: Lijin Bin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23597) Give priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23597:
--
Description: 
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
loc=100.107.165.61,60020,1576553057082
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as 100.107.165.61,60020,1576553057082
2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23569, resume processing parent pid=23568, 
ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=23569, ppid=23568, state=SUCCESS; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23568, resume processing parent pid=23567, 
state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec


2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=69619, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure.MasterProcedureScheduler: Add 
ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
size=1) to run queue because: the exclusive lock is not held by anyone when 
adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=69619 for 
100.107.165.61,60020,1576553057082 (carryingMeta=true) 
100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
2019-12-18 16:15:21,629 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
master.DeadServer: Removed 100.107.165.61,60020,1576553057082, processing=true, 
numProcessing=0



2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
attempt=0 of 31 failed; retrying after sleep of 31
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
callTimeout=6, callDuration=81362: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on 100.107.165.61,60020,1576656916048
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
 row 'hbase:rsgroup,,99' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
seqNum=-1




2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: * ABORTING 
master 100.107.176.225,6,1576656778460: FAILED persisting 
region=38d18fd824890c80cff972cbf2e4c174 state=OPENING *
java.net.SocketTimeoutException: callTimeout=120, callDuration=1286005: 
org.apache.hadoop.hbase.NotServi

[jira] [Updated] (HBASE-23597) Give priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23597:
--
Description: 
{code}
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
loc=100.107.165.61,60020,1576553057082
2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as 100.107.165.61,60020,1576553057082
2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23569, resume processing parent pid=23568, 
ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN
2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=23569, ppid=23568, state=SUCCESS; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished subprocedure pid=23568, resume processing parent pid=23567, 
state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec


2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=69619, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
procedure.MasterProcedureScheduler: Add 
ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
size=1) to run queue because: the exclusive lock is not held by anyone when 
adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=69619 for 
100.107.165.61,60020,1576553057082 (carryingMeta=true) 
100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
2019-12-18 16:15:21,629 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
master.DeadServer: Removed 100.107.165.61,60020,1576553057082, processing=true, 
numProcessing=0



2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
attempt=0 of 31 failed; retrying after sleep of 31
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
callTimeout=6, callDuration=81362: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on 100.107.165.61,60020,1576656916048
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
 row 'hbase:rsgroup,,99' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, hostname=100.107.165.61,60020,1576553057082, 
seqNum=-1




2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: Master server 
abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint]
2019-12-18 16:33:04,715 ERROR [PEWorker-15] master.HMaster: * ABORTING 
master 100.107.176.225,6,1576656778460: FAILED persisting 
region=38d18fd824890c80cff972cbf2e4c174 state=OPENING *
java.net.SocketTimeoutException: callTimeout=120, callDuration=1286005: 
org.apache.hadoop.hbase.N

[jira] [Updated] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23597:
--
Summary: Give high priority for meta assign procedure and 
ServerCrashProcedure which carry meta.  (was: Give priority for meta assign 
procedure and ServerCrashProcedure which carry meta.)

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.

[jira] [Updated] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23595:
--
Affects Version/s: 2.2.2

> HMaster abort when write to meta failed
> ---
>
> Key: HBASE-23595
> URL: https://issues.apache.org/jira/browse/HBASE-23595
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> RegionStateStore
> {code}
>   private void updateRegionLocation(RegionInfo regionInfo, State state, Put 
> put)
>   throws IOException {
> try (Table table = 
> master.getConnection().getTable(TableName.META_TABLE_NAME)) {
>   table.put(put);
> } catch (IOException e) {
>   // TODO: Revist Means that if a server is loaded, then we will 
> abort our host!
>   // In tests we abort the Master!
>   String msg = String.format("FAILED persisting region=%s state=%s",
> regionInfo.getShortNameToLog(), state);
>   LOG.error(msg, e);
>   master.abort(msg, e);
>   throw e;
> }
>   }
> {code}
> When the regionserver carrying meta stops or crashes, and the 
> ServerCrashProcedure has not yet started processing, writes to meta will fail 
> and abort the master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin updated HBASE-23597:
--
Affects Version/s: 2.2.2

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>  row 'hbase:rsgrou

[jira] [Resolved] (HBASE-23595) HMaster abort when write to meta failed

2019-12-18 Thread Lijin Bin (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijin Bin resolved HBASE-23595.
---
Resolution: Won't Fix

> HMaster abort when write to meta failed
> ---
>
> Key: HBASE-23595
> URL: https://issues.apache.org/jira/browse/HBASE-23595
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> RegionStateStore
> {code}
>   private void updateRegionLocation(RegionInfo regionInfo, State state, Put 
> put)
>   throws IOException {
> try (Table table = 
> master.getConnection().getTable(TableName.META_TABLE_NAME)) {
>   table.put(put);
> } catch (IOException e) {
>   // TODO: Revist Means that if a server is loaded, then we will 
> abort our host!
>   // In tests we abort the Master!
>   String msg = String.format("FAILED persisting region=%s state=%s",
> regionInfo.getShortNameToLog(), state);
>   LOG.error(msg, e);
>   master.abort(msg, e);
>   throw e;
> }
>   }
> {code}
> When the regionserver carrying meta stops or crashes, and the 
> ServerCrashProcedure has not yet started processing, writes to meta will fail 
> and abort the master.
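
For context only, a minimal sketch of the alternative the TODO in that catch 
block hints at: retry the meta write with backoff up to a deadline instead of 
aborting immediately. This is purely illustrative plain Java (the names and 
retry policy are invented), and it is not committed HBase behaviour; the issue 
was closed Won't Fix, with the related HBASE-23597 proposing to assign meta 
faster instead.
{code:java}
// Purely illustrative, invented names and policy -- not committed HBase behaviour.
// Retry a meta write with backoff up to a deadline before giving up, rather than
// aborting the master on the first failure while meta is being reassigned.
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class MetaWriteRetrySketch {

  interface MetaWrite {
    void run() throws IOException; // e.g. the table.put(put) in updateRegionLocation
  }

  static void writeWithRetry(MetaWrite write, long deadlineMs)
      throws IOException, InterruptedException {
    long start = System.currentTimeMillis();
    long backoffMs = 100;
    while (true) {
      try {
        write.run();
        return;
      } catch (IOException e) {
        if (System.currentTimeMillis() - start > deadlineMs) {
          throw e; // only now escalate (the current code aborts the master here)
        }
        TimeUnit.MILLISECONDS.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 10_000); // exponential backoff, capped
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated meta outage: fail the first two attempts, then succeed.
    int[] attempts = {0};
    writeWithRetry(() -> {
      if (attempts[0]++ < 2) {
        throw new IOException("hbase:meta is not online yet");
      }
      System.out.println("meta write succeeded on attempt " + attempts[0]);
    }, 60_000);
  }
}
{code}
Whether such a retry is safe depends on how long a region is allowed to sit in 
a transient state while meta is offline, which is presumably part of why 
speeding up the meta assignment itself was preferred.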



--
This message was sent by Atlassian Jira
(v8.3.4#803005)