[jira] [Comment Edited] (HBASE-28293) Add metric for GetClusterStatus request count.

2024-01-05 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803724#comment-17803724
 ] 

Viraj Jasani edited comment on HBASE-28293 at 1/5/24 11:18 PM:
---

+1. Maybe for this Jira we can focus on getClusterStatus, as it is a heavy one, 
and in follow-up Jiras we can extend this to other RPCs served by the master.


was (Author: vjasani):
+1

> Add metric for GetClusterStatus request count.
> --
>
> Key: HBASE-28293
> URL: https://issues.apache.org/jira/browse/HBASE-28293
> Project: HBase
>  Issue Type: Bug
>Reporter: Rushabh Shah
>Priority: Major
>
> We have been bitten multiple times by GetClusterStatus requests overwhelming 
> the HMaster's memory. It would be good to add a metric for the total 
> GetClusterStatus request count.
> In almost all of our production incidents involving GetClusterStatus, the 
> HMaster ran out of memory because many clients called this RPC in parallel 
> and the response size was very big.
> In HBase 2 we have 
> [ClusterMetrics.Option|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterMetrics.java#L164-L224]
>  which can reduce the size of the response.
> It would be nice to add another metric to indicate whether the response size 
> of GetClusterStatus is greater than some threshold (like 5 MB).
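For reference, the option-based API mentioned above lets a client request only the 
pieces of cluster status it actually needs; a minimal sketch using the standard 2.x 
client API (connection settings assumed to come from the local configuration):
{code:java}
import java.util.EnumSet;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.ClusterMetrics.Option;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClusterMetricsOptionExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Ask only for the live server list and master info instead of the full status payload.
      ClusterMetrics metrics =
        admin.getClusterMetrics(EnumSet.of(Option.LIVE_SERVERS, Option.MASTER));
      System.out.println("live servers: " + metrics.getLiveServerMetrics().size());
    }
  }
}
{code}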



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28293) Add metric for GetClusterStatus request count.

2024-01-05 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803724#comment-17803724
 ] 

Viraj Jasani commented on HBASE-28293:
--

+1

> Add metric for GetClusterStatus request count.
> --
>
> Key: HBASE-28293
> URL: https://issues.apache.org/jira/browse/HBASE-28293
> Project: HBase
>  Issue Type: Bug
>Reporter: Rushabh Shah
>Priority: Major
>
> We have been bitten multiple times by GetClusterStatus requests overwhelming 
> the HMaster's memory. It would be good to add a metric for the total 
> GetClusterStatus request count.
> In almost all of our production incidents involving GetClusterStatus, the 
> HMaster ran out of memory because many clients called this RPC in parallel 
> and the response size was very big.
> In HBase 2 we have 
> [ClusterMetrics.Option|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterMetrics.java#L164-L224]
>  which can reduce the size of the response.
> It would be nice to add another metric to indicate whether the response size 
> of GetClusterStatus is greater than some threshold (like 5 MB).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28293) Add metric for GetClusterStatus request count.

2024-01-05 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803718#comment-17803718
 ] 

Viraj Jasani commented on HBASE-28293:
--

We can have two metrics: response size and request count
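A minimal sketch of the idea, using plain JDK counters and hypothetical names rather 
than the actual MetricsMasterSource plumbing the patch would use:
{code:java}
import java.util.concurrent.atomic.LongAdder;

/** Illustrative only: the real change would expose these through MetricsMasterSource. */
public class GetClusterStatusMetricsSketch {
  // Hypothetical metric names, not necessarily the ones the patch will use.
  private final LongAdder requestCount = new LongAdder();
  private final LongAdder oversizedResponseCount = new LongAdder();

  /** Called once per GetClusterStatus RPC. */
  void onRequest() {
    requestCount.increment();
  }

  /** Called after the response is built; thresholdBytes would come from configuration. */
  void onResponse(long responseSizeBytes, long thresholdBytes) {
    if (responseSizeBytes > thresholdBytes) {
      oversizedResponseCount.increment();
    }
  }

  long getRequestCount() { return requestCount.sum(); }
  long getOversizedResponseCount() { return oversizedResponseCount.sum(); }
}
{code}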

> Add metric for GetClusterStatus request count.
> --
>
> Key: HBASE-28293
> URL: https://issues.apache.org/jira/browse/HBASE-28293
> Project: HBase
>  Issue Type: Bug
>Reporter: Rushabh Shah
>Priority: Major
>
> We have been bitten multiple times by GetClusterStatus requests overwhelming 
> the HMaster's memory. It would be good to add a metric for the total 
> GetClusterStatus request count.
> In almost all of our production incidents involving GetClusterStatus, the 
> HMaster ran out of memory because many clients called this RPC in parallel 
> and the response size was very big.
> In HBase 2 we have 
> [ClusterMetrics.Option|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterMetrics.java#L164-L224]
>  which can reduce the size of the response.
> It would be nice to add another metric to indicate whether the response size 
> of GetClusterStatus is greater than some threshold (like 5 MB).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2024-01-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28271:
-
Fix Version/s: 2.6.0
   2.4.18
   2.5.8
   3.0.0-beta-2
   Status: Patch Available  (was: In Progress)

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.5.7, 2.4.17, 3.0.0-alpha-4
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
>
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2024-01-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-28271 started by Viraj Jasani.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2024-01-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802345#comment-17802345
 ] 

Viraj Jasani commented on HBASE-28271:
--

Thanks for pointing that out [~dmanning]; yes, it is actually worse than what I 
thought earlier.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-26192) Master UI hbck should provide a JSON formatted output option

2023-12-23 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HBASE-26192:


Assignee: Mihir Monani

> Master UI hbck should provide a JSON formatted output option
> 
>
> Key: HBASE-26192
> URL: https://issues.apache.org/jira/browse/HBASE-26192
> Project: HBase
>  Issue Type: New Feature
>Reporter: Andrew Kyle Purtell
>Assignee: Mihir Monani
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-beta-2
>
> Attachments: Screen Shot 2022-05-31 at 5.18.15 PM.png
>
>
> It used to be possible to get hbck's verdict of cluster status from the 
> command line, especially useful for headless deployments, i.e. without 
> requiring a browser with sufficient connectivity to load a UI, or scrape 
> information out of raw HTML, or write regex to comb over log4j output. The 
> hbck tool's output wasn't particularly convenient to parse but it was 
> straightforward to extract the desired information with a handful of regular 
> expressions. 
> HBCK2 has a different design philosophy than the old hbck, which is to serve 
> as a collection of small and discrete recovery and repair functions, rather 
> than attempt to be a universal repair tool. This makes a lot of sense and 
> isn't the issue at hand. Unfortunately the old hbck's utility for reporting 
> the current cluster health assessment has not been replaced either in whole 
> or in part. Instead:
> {quote}
> HBCK2 is for fixes. For listings of inconsistencies or blockages in the 
> running cluster, you go elsewhere, to the logs and UI of the running cluster 
> Master. Once an issue has been identified, you use the HBCK2 tool to ask the 
> Master to effect fixes or to skip-over bad state. Asking the Master to make 
> the fixes rather than try and effect the repair locally in a fix-it tool's 
> context is another important difference between HBCK2 and hbck1. 
> {quote}
> Developing custom tooling to mine logs and scrape UI simply to gain a top 
> level assessment of system health is unsatisfying. There should be a 
> convenient means for querying the system if issues that rise to the level of 
> _inconsistency_, in the hbck parlance, are believed to be present. It would 
> be relatively simple to bring back the experience of invoking a command line 
> tool to deliver a verdict. This could be added to the hbck2 tool itself but 
> given that hbase-operator-tools is a separate project an intrinsic solution 
> is desirable. 
> An option that immediately comes to mind is modification of the Master's 
> hbck.jsp page to provide a JSON formatted output option if the HTTP Accept 
> header asks for text/json. However, looking at the source of hbck.jsp, it 
> makes more sense to leave it as is and implement a convenient machine 
> parseable output format elsewhere. This can be trivially accomplished with a 
> new servlet. Like hbck.jsp the servlet implementation would get a reference 
> to HbckChore and present the information this class makes available via its 
> various getters.  
> The machine parseable output is sufficient to enable headless hbck status 
> checking but it still would be nice if we could provide operators a command 
> line tool that formats the information for convenient viewing in a terminal. 
> That part could be implemented in the hbck2 tool after this proposal is 
> implemented.
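As a rough sketch of the servlet approach described above (the JSON fields and values 
are placeholders; a real implementation would pull them from HbckChore's getters and 
likely use a proper JSON library):
{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Illustrative only: serves a machine-parseable hbck summary. */
public class HbckStatusServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // In the real servlet these values would come from HbckChore's getters;
    // the numbers below are placeholders.
    long checkEndTimestamp = System.currentTimeMillis();
    int orphanRegionsOnFs = 0;
    int inconsistentRegions = 0;

    resp.setContentType("application/json");
    resp.getWriter().write("{\"checkEndTimestamp\":" + checkEndTimestamp
        + ",\"orphanRegionsOnFs\":" + orphanRegionsOnFs
        + ",\"inconsistentRegions\":" + inconsistentRegions + "}");
  }
}
{code}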



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-20 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799124#comment-17799124
 ] 

Viraj Jasani commented on HBASE-28271:
--

Thank you [~frostruan]! I also wonder if setting 
"hbase.snapshot.zk.coordinated" in any test would even make any difference 
since we no longer use that config?

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798791#comment-17798791
 ] 

Viraj Jasani commented on HBASE-28271:
--

[~frostruan] After HBASE-26323, do we have any test for snapshot creation 
without specifying nonce group and nonce?

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798696#comment-17798696
 ] 

Viraj Jasani edited comment on HBASE-28271 at 12/19/23 6:19 PM:


LockProcedure implementation at a high level:

Just like any procedure, it first tries to acquire its lock => lock acquired 
(here the lock is its own lock implementation, i.e. exclusive/shared locks at 
the table/namespace/region level).

Only if the lock is acquired does execution begin, as per the generic logic:
{code:java}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
} {code}
For LockProcedure, the latch is released only when the procedure actually 
executes. This is how the snapshot handler ensures that the table-level lock 
has already been acquired before it moves forward with creating the snapshot.
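Stripped down to just the pattern described above, the handler-side wait looks roughly 
like this (illustrative sketch only; the helper below is hypothetical and the real code 
lives in the snapshot/procedure classes):
{code:java}
import java.util.concurrent.CountDownLatch;

/** Illustrative sketch of the wait pattern described above, not the actual snapshot code. */
public class SnapshotLockWaitSketch {

  public void takeSnapshot() throws InterruptedException {
    CountDownLatch tableLockAcquired = new CountDownLatch(1);

    // The snapshot handler submits a LockProcedure that is handed this latch and
    // releases it from its execute() phase, i.e. only after the table-level lock
    // has actually been acquired by the procedure framework.
    submitTableLockProcedure(tableLockAcquired);

    // Today the handler blocks here with no timeout; if a region of the table is
    // stuck in transition, the procedure never reaches execute() and this wait
    // never returns, pinning a master handler thread.
    tableLockAcquired.await();
  }

  // Hypothetical stand-in for scheduling the LockProcedure with the latch.
  private void submitTableLockProcedure(CountDownLatch latch) {
    // ...
  }
}
{code}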


was (Author: vjasani):
LockProcedure implementation at a high level:

Just like any procedure, it first tries to acquire its lock => lock acquired.

Only if the lock is acquired does execution begin, as per the generic logic:
{code:java}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
} {code}
For LockProcedure, the latch is released only when the procedure actually 
executes. This is how the snapshot handler ensures that the table-level lock 
has already been acquired before it moves forward with creating the snapshot.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798696#comment-17798696
 ] 

Viraj Jasani commented on HBASE-28271:
--

LockProcedure implementation at a high level:

Just like any procedure, it first tries to acquire its lock => lock acquired.

Only if the lock is acquired does execution begin, as per the generic logic:
{code:java}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
} {code}
For LockProcedure, the latch is released only when the procedure actually 
executes. This is how the snapshot handler ensures that the table-level lock 
has already been acquired before it moves forward with creating the snapshot.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798694#comment-17798694
 ] 

Viraj Jasani commented on HBASE-28271:
--

Snapshot is the only consumer of the lock procedure that provides a countdown 
latch to the procedure and waits until the latch is released by the procedure; 
no other consumer of the lock procedure provides a non-null latch to implement 
any wait strategy.

So, yes, the plan is to make it generic enough, but the only consumer we have 
today is the snapshot path.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798470#comment-17798470
 ] 

Viraj Jasani commented on HBASE-28271:
--

I think in general we can keep the default timeout much lower (5-10 min?) and 
make it throw SnapshotCreationException sooner so that we don't keep master 
handlers occupied. But otherwise, no problem with taking the RPC timeout into 
the equation too.
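A minimal sketch of the fail-fast idea: bound the wait on the latch and surface the 
failure as SnapshotCreationException (the timeout parameter and config name below are 
illustrative, not the final ones):
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.snapshot.SnapshotCreationException;

/** Illustrative only: bound the wait for the table-level lock instead of waiting forever. */
public class FailFastSnapshotLockWait {

  public void awaitTableLock(CountDownLatch tableLockAcquired, long timeoutMs, String tableName)
      throws InterruptedException, SnapshotCreationException {
    // Hypothetical timeout, e.g. read from a config such as
    // "hbase.snapshot.lock.acquire.timeout" (name made up for illustration).
    if (!tableLockAcquired.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      throw new SnapshotCreationException(
        "Could not acquire table-level lock for " + tableName + " within " + timeoutMs
          + " ms; a region may be stuck in transition. Failing fast instead of blocking a handler.");
    }
  }
}
{code}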

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> --
>
> Key: HBASE-28271
> URL: https://issues.apache.org/jira/browse/HBASE-28271
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot of an enabled or disabled table, a 
> LockProcedure is executed to acquire the table-level lock, but if any region 
> of the table is in transition, the LockProcedure cannot proceed, so the 
> snapshot handler waits forever until the region transition completes and the 
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, enough snapshot 
> attempts from clients can easily exhaust all handler threads, leading to a 
> potentially unresponsive master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28271:


 Summary: Infinite waiting on lock acquisition by snapshot can 
result in unresponsive master
 Key: HBASE-28271
 URL: https://issues.apache.org/jira/browse/HBASE-28271
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.7, 2.4.17, 3.0.0-alpha-4
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Attachments: image.png

When a region is stuck in transition for a significant time, any attempt to 
take a snapshot of the table keeps a master handler thread waiting forever. As 
part of creating a snapshot of an enabled or disabled table, a LockProcedure is 
executed to acquire the table-level lock, but if any region of the table is in 
transition, the LockProcedure cannot proceed, so the snapshot handler waits 
forever until the region transition completes and the table-level lock can be 
acquired.

In cases where a region stays in RIT for a considerable time, enough snapshot 
attempts from clients can easily exhaust all handler threads, leading to a 
potentially unresponsive master. A sample thread dump is attached.

Proposal: The snapshot handler should not stay stuck forever if it cannot take 
the table-level lock; it should fail fast.

!image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26192) Master UI hbck should provide a JSON formatted output option

2023-12-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798293#comment-17798293
 ] 

Viraj Jasani commented on HBASE-26192:
--

Some folks are interested in picking this up; I will update the assignee 
shortly. Thanks

> Master UI hbck should provide a JSON formatted output option
> 
>
> Key: HBASE-26192
> URL: https://issues.apache.org/jira/browse/HBASE-26192
> Project: HBase
>  Issue Type: New Feature
>Reporter: Andrew Kyle Purtell
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-beta-2
>
> Attachments: Screen Shot 2022-05-31 at 5.18.15 PM.png
>
>
> It used to be possible to get hbck's verdict of cluster status from the 
> command line, especially useful for headless deployments, i.e. without 
> requiring a browser with sufficient connectivity to load a UI, or scrape 
> information out of raw HTML, or write regex to comb over log4j output. The 
> hbck tool's output wasn't particularly convenient to parse but it was 
> straightforward to extract the desired information with a handful of regular 
> expressions. 
> HBCK2 has a different design philosophy than the old hbck, which is to serve 
> as a collection of small and discrete recovery and repair functions, rather 
> than attempt to be a universal repair tool. This makes a lot of sense and 
> isn't the issue at hand. Unfortunately the old hbck's utility for reporting 
> the current cluster health assessment has not been replaced either in whole 
> or in part. Instead:
> {quote}
> HBCK2 is for fixes. For listings of inconsistencies or blockages in the 
> running cluster, you go elsewhere, to the logs and UI of the running cluster 
> Master. Once an issue has been identified, you use the HBCK2 tool to ask the 
> Master to effect fixes or to skip-over bad state. Asking the Master to make 
> the fixes rather than try and effect the repair locally in a fix-it tool's 
> context is another important difference between HBCK2 and hbck1. 
> {quote}
> Developing custom tooling to mine logs and scrape UI simply to gain a top 
> level assessment of system health is unsatisfying. There should be a 
> convenient means for querying the system if issues that rise to the level of 
> _inconsistency_, in the hbck parlance, are believed to be present. It would 
> be relatively simple to bring back the experience of invoking a command line 
> tool to deliver a verdict. This could be added to the hbck2 tool itself but 
> given that hbase-operator-tools is a separate project an intrinsic solution 
> is desirable. 
> An option that immediately comes to mind is modification of the Master's 
> hbck.jsp page to provide a JSON formatted output option if the HTTP Accept 
> header asks for text/json. However, looking at the source of hbck.jsp, it 
> makes more sense to leave it as is and implement a convenient machine 
> parseable output format elsewhere. This can be trivially accomplished with a 
> new servlet. Like hbck.jsp the servlet implementation would get a reference 
> to HbckChore and present the information this class makes available via its 
> various getters.  
> The machine parseable output is sufficient to enable headless hbck status 
> checking but it still would be nice if we could provide operators a command 
> line tool that formats the information for convenient viewing in a terminal. 
> That part could be implemented in the hbck2 tool after this proposal is 
> implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-12-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792703#comment-17792703
 ] 

Viraj Jasani commented on HBASE-28221:
--

Yeah, I was thinking about that earlier, but flushes can be delayed even if 
compaction is extremely slow or inefficient, hence I thought this would be a 
better metric at the MetricsRegionServerSource level.

> Introduce regionserver metric for delayed flushes
> -
>
> Key: HBASE-28221
> URL: https://issues.apache.org/jira/browse/HBASE-28221
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Assignee: Rahul Kumar
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
> can forget to re-enable it. This can result in flushes getting delayed for 
> the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
> happen eventually after waiting for the max blocking time, it is important to 
> realize that no cluster can function well with compaction disabled for a 
> significant amount of time.
>  
> We also block any write requests until the region is flushed (90+ seconds by 
> default):
> {code:java}
> 2023-11-27 20:40:52,124 WARN  [,queue=18,port=60020] regionserver.HRegion - 
> Region is too busy due to exceeding memstore size limit.
> org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, 
> regionName=table1,1699923733811.4fd5e52e2133df1e347f32c646f23ab4., 
> server=server-1,60020,1699421714454, memstoreSize=1073820928, 
> blockingMemStoreSize=1073741824
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4200)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3264)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3215)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:967)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:895)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2524)
>     at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36812)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2432)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291) 
> {code}
>  
> Delayed flush logs:
> {code:java}
> LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
>   region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
>   this.blockingWaitTime); {code}
> Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for 
> the number of flushes getting delayed due to too many store files.
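A minimal sketch of the suggestion, counting delayed flushes at the point where the 
warning above is logged (the counter and class names are hypothetical; the real change 
would surface the value through MetricsRegionServerSource):
{code:java}
import java.util.concurrent.atomic.LongAdder;

/** Illustrative only: track how often a flush is delayed because of too many store files. */
public class DelayedFlushMetricSketch {
  // Hypothetical counter; the real patch would surface this via MetricsRegionServerSource.
  private final LongAdder delayedFlushCount = new LongAdder();

  /** Called from the same place that emits the "delaying flush up to {} ms" warning. */
  void onFlushDelayed() {
    delayedFlushCount.increment();
  }

  long getDelayedFlushCount() {
    return delayedFlushCount.sum();
  }
}
{code}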



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28204) Canary can take lot more time If any region (except the first region) starts with delete markers

2023-12-01 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28204:
-
Fix Version/s: 2.7.0

> Canary can take lot more time If any region (except the first region) starts 
> with delete markers
> 
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7, 2.7.0
>
>
> In CanaryTool.java, Canary reads only the first row of a region using a 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses a [Scan with FirstRowKeyFilter for 
> the table 
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this only happens when the region 
> is the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions that can have a high number of delete 
> markers. In the current implementation, [RawScan is only 
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of 
> the table).
> RawScan therefore does not work for any region other than the first one. Also, 
> if all or most of a region's rows have delete markers, the Get operation can 
> take a lot of time, which can cause timeouts for CanaryTool.
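For context, a raw, single-row probe of an arbitrary region's start key looks roughly 
like this (sketch only, using standard client APIs; CanaryTool's actual code differs in 
details):
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

/** Illustrative sketch: probe one row of a region with a raw scan so delete markers do not slow the read. */
public class CanaryRawScanSketch {

  static Result probeRegion(Table table, byte[] regionStartKey) throws IOException {
    Scan scan = new Scan();
    scan.setRaw(true);                        // return delete markers too, instead of skipping over them
    scan.setCaching(1);
    scan.setCacheBlocks(false);
    scan.setFilter(new FirstKeyOnlyFilter()); // only the first cell of the first row is needed
    scan.withStartRow(regionStartKey);
    scan.setLimit(1);
    try (ResultScanner scanner = table.getScanner(scan)) {
      return scanner.next();
    }
  }
}
{code}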



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28204) Canary can take lot more time If any region (except the first region) starts with delete markers

2023-11-30 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791900#comment-17791900
 ] 

Viraj Jasani commented on HBASE-28204:
--

FYI [~bbeaudreault], we have seen a bit of a perf regression, so we need to 
revert the commit. Just wanted to keep you in the loop in case you have already 
started preparing RC0.

> Canary can take lot more time If any region (except the first region) starts 
> with delete markers
> 
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> In CanaryTool.java, Canary reads only the first row of a region using a 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses a [Scan with FirstRowKeyFilter for 
> the table 
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this only happens when the region 
> is the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions that can have a high number of delete 
> markers. In the current implementation, [RawScan is only 
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of 
> the table).
> RawScan therefore does not work for any region other than the first one. Also, 
> if all or most of a region's rows have delete markers, the Get operation can 
> take a lot of time, which can cause timeouts for CanaryTool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-28204) Canary can take lot more time If any region (except the first region) starts with delete markers

2023-11-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-28204:
--

Reopening for revert.

> Canary can take lot more time If any region (except the first region) starts 
> with delete markers
> 
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> In CanaryTool.java, Canary reads only the first row of a region using a 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses a [Scan with FirstRowKeyFilter for 
> the table 
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this only happens when the region 
> is the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions that can have a high number of delete 
> markers. In the current implementation, [RawScan is only 
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of 
> the table).
> RawScan therefore does not work for any region other than the first one. Also, 
> if all or most of a region's rows have delete markers, the Get operation can 
> take a lot of time, which can cause timeouts for CanaryTool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-25714) Offload the compaction job to independent Compaction Server

2023-11-30 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791724#comment-17791724
 ] 

Viraj Jasani commented on HBASE-25714:
--

[~niuyulin], it looks like the feature branch has not had much activity for some 
time. Do you have plans to move this forward?

> Offload the compaction job to independent Compaction Server
> ---
>
> Key: HBASE-25714
> URL: https://issues.apache.org/jira/browse/HBASE-25714
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Yulin Niu
>Assignee: Yulin Niu
>Priority: Major
> Attachments: CoprocessorSupport1.png, CoprocessorSupport2.png
>
>
> The basic idea is add a role "CompactionServer" to take the Compaction job. 
> HMaster is responsible for scheduling the compaction job to different 
> CompactionServer.
> [design 
> doc|https://docs.google.com/document/d/1exmhQpQArAgnryLaV78K3260rKm64BHBNzZE4VdTz0c/edit?usp=sharing]
> Suggestions are welcomed. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28221:
-
Fix Version/s: 2.4.18
   2.5.7

> Introduce regionserver metric for delayed flushes
> -
>
> Key: HBASE-28221
> URL: https://issues.apache.org/jira/browse/HBASE-28221
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
> can forget to re-enable it. This can result in flushes getting delayed for 
> the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
> happen eventually after waiting for the max blocking time, it is important to 
> realize that no cluster can function well with compaction disabled for a 
> significant amount of time.
>  
> We also block any write requests until the region is flushed (90+ seconds by 
> default):
> {code:java}
> 2023-11-27 20:40:52,124 WARN  [,queue=18,port=60020] regionserver.HRegion - 
> Region is too busy due to exceeding memstore size limit.
> org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, 
> regionName=table1,1699923733811.4fd5e52e2133df1e347f32c646f23ab4., 
> server=server-1,60020,1699421714454, memstoreSize=1073820928, 
> blockingMemStoreSize=1073741824
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4200)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3264)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3215)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:967)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:895)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2524)
>     at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36812)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2432)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291) 
> {code}
>  
> Delayed flush logs:
> {code:java}
> LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
>   region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
>   this.blockingWaitTime); {code}
> Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for 
> the number of flushes getting delayed due to too many store files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28221:
-
Affects Version/s: 2.5.6
   2.4.17

> Introduce regionserver metric for delayed flushes
> -
>
> Key: HBASE-28221
> URL: https://issues.apache.org/jira/browse/HBASE-28221
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
> can forget to re-enable it. This can result in flushes getting delayed for 
> the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
> happen eventually after waiting for the max blocking time, it is important to 
> realize that no cluster can function well with compaction disabled for a 
> significant amount of time.
>  
> We also block any write requests until the region is flushed (90+ seconds by 
> default):
> {code:java}
> 2023-11-27 20:40:52,124 WARN  [,queue=18,port=60020] regionserver.HRegion - 
> Region is too busy due to exceeding memstore size limit.
> org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, 
> regionName=table1,1699923733811.4fd5e52e2133df1e347f32c646f23ab4., 
> server=server-1,60020,1699421714454, memstoreSize=1073820928, 
> blockingMemStoreSize=1073741824
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4200)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3264)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3215)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:967)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:895)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2524)
>     at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36812)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2432)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291) 
> {code}
>  
> Delayed flush logs:
> {code:java}
> LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
>   region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
>   this.blockingWaitTime); {code}
> Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for 
> the number of flushes getting delayed due to too many store files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28221:
-
Description: 
If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
can forget to re-enable it. This can result in flushes getting delayed for the 
"hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
happen eventually after waiting for the max blocking time, it is important to 
realize that no cluster can function well with compaction disabled for a 
significant amount of time.

 

We also block any write requests until the region is flushed (90+ seconds by 
default):
{code:java}
2023-11-27 20:40:52,124 WARN  [,queue=18,port=60020] regionserver.HRegion - 
Region is too busy due to exceeding memstore size limit.
org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, 
regionName=table1,1699923733811.4fd5e52e2133df1e347f32c646f23ab4., 
server=server-1,60020,1699421714454, memstoreSize=1073820928, 
blockingMemStoreSize=1073741824
    at 
org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4200)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3264)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3215)
    at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:967)
    at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:895)
    at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2524)
    at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36812)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2432)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
    at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291) {code}
 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.

  was:
If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
can forget to re-enable it. This can result in flushes getting delayed for the 
"hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
happen eventually after waiting for the max blocking time, it is important to 
realize that no cluster can function well with compaction disabled for a 
significant amount of time, as we block any write requests while the region 
memstore stays at full capacity.

 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.


> Introduce regionserver metric for delayed flushes
> -
>
> Key: HBASE-28221
> URL: https://issues.apache.org/jira/browse/HBASE-28221
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
> can forget to re-enable it. This can result in flushes getting delayed for 
> the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
> happen eventually after waiting for the max blocking time, it is important to 
> realize that no cluster can function well with compaction disabled for a 
> significant amount of time.
>  
> We also block any write requests until the region is flushed (90+ seconds by 
> default):
> {code:java}
> 2023-11-27 20:40:52,124 WARN  [,queue=18,port=60020] regionserver.HRegion - 
> Region is too busy due to exceeding memstore size limit.
> org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, 
> regionName=table1,1699923733811.4fd5e52e2133df1e347f32c646f23ab4., 
> server=server-1,60020,1699421714454, memstoreSize=1073820928, 
> blockingMemStoreSize=1073741824
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4200)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3264)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3215)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:967)
>     at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRe

[jira] [Updated] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28221:
-
Description: 
If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
can forget to re-enable it. This can result in flushes getting delayed for the 
"hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
happen eventually after waiting for the max blocking time, it is important to 
realize that no cluster can function well with compaction disabled for a 
significant amount of time, as we block any write requests while the region 
memstore stays at full capacity.

 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.

  was:
If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
can forget to re-enable it. This can result in flushes getting delayed for the 
"hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
happen eventually after waiting for the max blocking time, it is important to 
realize that no cluster can function well with compaction disabled for a 
significant amount of time.

 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.


> Introduce regionserver metric for delayed flushes
> -
>
> Key: HBASE-28221
> URL: https://issues.apache.org/jira/browse/HBASE-28221
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> If compaction is disabled temporarily to allow the HDFS load to stabilize, we 
> can forget to re-enable it. This can result in flushes getting delayed for 
> the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
> happen eventually after waiting for the max blocking time, it is important to 
> realize that no cluster can function well with compaction disabled for a 
> significant amount of time, as we block any write requests while the region 
> memstore stays at full capacity.
>  
> Delayed flush logs:
> {code:java}
> LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
>   region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
>   this.blockingWaitTime); {code}
> Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
> number of flushes getting delayed due to too many store files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28221:


 Summary: Introduce regionserver metric for delayed flushes
 Key: HBASE-28221
 URL: https://issues.apache.org/jira/browse/HBASE-28221
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
 Fix For: 2.6.0, 3.0.0-beta-1


If compaction is disabled temporarily to allow stabilizing hdfs load, we might 
forget to re-enable compaction. This can result in flushes getting delayed for 
the "hbase.hstore.blockingWaitTime" period (90s by default). While flushes do 
happen eventually after waiting for the max blocking time, it is important to 
realize that a cluster cannot function well with compaction disabled for a 
significant amount of time.

 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.
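
A minimal, self-contained sketch of the idea follows. The class, method, and 
metric names are assumptions for illustration (not the actual patch), and a 
LongAdder stands in for the real MetricsRegionServerSource plumbing:
{code:java}
import java.util.concurrent.atomic.LongAdder;

/**
 * Sketch: count flushes that get delayed because a region has too many store
 * files. In a real change this counter would be exposed through
 * MetricsRegionServerSource like the other flush-related counters.
 */
public class DelayedFlushCounterSketch {

  private final LongAdder flushesDelayedTooManyStoreFiles = new LongAdder();
  private final int blockingStoreFiles;   // e.g. hbase.hstore.blockingStoreFiles
  private final long blockingWaitTimeMs;  // e.g. hbase.hstore.blockingWaitTime (90s)

  public DelayedFlushCounterSketch(int blockingStoreFiles, long blockingWaitTimeMs) {
    this.blockingStoreFiles = blockingStoreFiles;
    this.blockingWaitTimeMs = blockingWaitTimeMs;
  }

  /** Called where the flusher decides whether a flush must wait for compaction. */
  boolean shouldDelayFlush(String encodedRegionName, int storeFileCount) {
    if (storeFileCount > blockingStoreFiles) {
      // Same branch that currently emits the "has too many store files" WARN log.
      flushesDelayedTooManyStoreFiles.increment();
      System.out.printf("%s has too many store files (%d); delaying flush up to %d ms%n",
        encodedRegionName, storeFileCount, blockingWaitTimeMs);
      return true;
    }
    return false;
  }

  /** Value that the new regionserver metric would report. */
  long getFlushesDelayedTooManyStoreFiles() {
    return flushesDelayedTooManyStoreFiles.sum();
  }
}
{code}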



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-21785) master reports open regions as RITs and also messes up rit age metric

2023-11-21 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788526#comment-17788526
 ] 

Viraj Jasani commented on HBASE-21785:
--

Agreed, this deserves to be rolled out with any upcoming 2.x release (patch or 
minor).

> master reports open regions as RITs and also messes up rit age metric
> -
>
> Key: HBASE-21785
> URL: https://issues.apache.org/jira/browse/HBASE-21785
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha-1, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.2.0
>
> Attachments: HBASE-21785.01.patch, HBASE-21785.patch
>
>
> {noformat}
> RegionState   RIT time (ms)   Retries
> dba183f0dadfcc9dc8ae0a6dd59c84e6  dba183f0dadfcc9dc8ae0a6dd59c84e6. 
> state=OPEN, ts=Wed Dec 31 16:00:00 PST 1969 (1548453918s ago), 
> server=server,17020,1548452922054  1548453918735   0
> {noformat}
> RIT age metric also gets set to a bogus value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784677#comment-17784677
 ] 

Viraj Jasani commented on HBASE-28192:
--

{quote}Actually, the most dangerous thing is always that, people think they can 
fix something without knowing the root cause and then they just make thing 
worse...
{quote}
I agree, but this case is quite particular. I am not suggesting we schedule 
recovery for any inconsistent meta state; I just meant that if meta is already 
online as per the AssignmentManager but the server it is online on is not even 
live, we already have a problem that we will likely not recover from unless the 
SCP for that dead server is processed. The only way out in this case is for the 
operator to schedule recovery of the old server, and the longer it takes the 
operator to understand the current state of the cluster, the higher the chances 
of client request failures in that window and the more stuck procedures 
accumulate.

If meta state is not online, we don't need any change in the current logic.

 
{quote}So here, meta is already online on server3-1,61020,1699456864765, but 
after server1 becomes active, the loaded meta location is 
server3-1,61020,1698687384632, which is a dead server?
{quote}
Correct.
{quote}And this happens on a rolling upgrading from 2.4 to 2.5? What is the 
version for server1 and server4? Server4 is 2.4.x and server and server1 is 
2.5.x?
{quote}
Yes, so far we have observed this only during the 2.4 to 2.5 upgrade. Let me get 
back with the version details of the masters (server4 and server1) in some time.

> Master should recover if meta region state is inconsistent
> --
>
> Key: HBASE-28192
> URL: https://issues.apache.org/jira/browse/HBASE-28192
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> During active master initialization, before we set master as active (i.e. 
> {_}setInitialized(true){_}), we need both meta and namespace regions online. 
> If the region state of meta or namespace is inconsistent, active master can 
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
> ...
> ...
> ...
> // Check once-a-minute.
> if (rc == null) {
>   rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
> }
> Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
>  {code}
> In one of the recent outages, we observed that meta was online on a server, 
> which was correctly reflected in the meta znode, but the server start time was 
> different. This means that, as per the latest transition record, meta was 
> marked online on the old server (same server with an old start time). This kept 
> active master initialization waiting forever, and some SCPs got stuck in the 
> initial stage where they need to access the meta table before getting 
> candidates for region moves.
> The only way out of this outage was for the operator to schedule recoveries 
> using hbck for the old server, which triggers an SCP for the old server address 
> of meta. Since many SCPs were stuck, processing the new SCP also took some 
> time; a manual restart of the active master triggered failover, and the new 
> master was able to complete the SCP for the old meta server, correcting the 
> meta assignment details, which eventually marked the master as active. Only 
> after this were we able to see the real, large number of RITs that had been 
> hidden so far.
> We need to let the master recover from this state to avoid manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784659#comment-17784659
 ] 

Viraj Jasani edited comment on HBASE-28192 at 11/10/23 3:15 AM:


Let me add some logs:

regionserver where meta is online:
{code:java}
2023-11-08 18:10:31,079 INFO  [MemStoreFlusher.1] regionserver.HStore - Added 
hdfs://{cluster}/hbase/data/hbase/meta/1588230740/rep_barrier/3e5faf652f1e4c6db1c4ba1ae676c3ee,
 entries=1630, sequenceid=94325525, filesize=362.1 K {code}
master server 4 which thought it was active:
{code:java}
2023-11-08 18:14:34,563 DEBUG [0:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1699456864765, table=hbase:meta, region=1588230740

2023-11-08 18:14:34,609 INFO  [0:becomeActiveMaster] master.ServerManager - 
Registering regionserver=server3-1,61020,1699456864765 {code}
master server 1 which thought it was active:
{code:java}
2023-11-08 18:15:50,350 DEBUG [aster/server1:61000:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1698687384632, table=hbase:meta, region=1588230740

2023-11-08 18:15:50,399 INFO  [aster/server1:61000:becomeActiveMaster] 
master.ServerManager - Registering regionserver=server3-1,61020,1699456864765 
{code}
master server 4 gave up:
{code:java}
2023-11-08 18:16:22,776 INFO  [aster/server4:61000:becomeActiveMaster] 
master.ActiveMasterManager - Another master is the active master, 
server1,61000,1699467212235; waiting to become the next active master {code}
 

When server 4 was trying to become the active master and loaded meta, it 
retrieved the correct location of meta, i.e. server3-1,61020,1699456864765.

However, when server 1 (the eventual active master) loaded meta, it retrieved an 
incorrect location, i.e. server3-1,61020,1698687384632.

 

For hbase 2.5, I see that with HBASE-26193 we no longer rely on zookeeper and 
instead rely on scanning the master region:
{code:java}
  // Start the Assignment Thread
  startAssignmentThread();
  // load meta region states.
  // here we are still in the early steps of active master startup. There is 
only one thread(us)
  // can access AssignmentManager and create region node, so here we do not 
need to lock the
  // region node.
  try (ResultScanner scanner =
masterRegion.getScanner(new Scan().addFamily(HConstants.CATALOG_FAMILY))) {
for (;;) {
  Result result = scanner.next();
  if (result == null) {
break;
  }
  RegionStateStore
.visitMetaEntry((r, regionInfo, state, regionLocation, lastHost, 
openSeqNum) -> {
  RegionStateNode regionNode = 
regionStates.getOrCreateRegionStateNode(regionInfo);
  regionNode.setState(state);
  regionNode.setLastHost(lastHost);
  regionNode.setRegionLocation(regionLocation);
  regionNode.setOpenSeqNum(openSeqNum);
  if (regionNode.getProcedure() != null) {
regionNode.getProcedure().stateLoaded(this, regionNode);
  }
  if (regionLocation != null) {
regionStates.addRegionToServer(regionNode);
  }
  if (RegionReplicaUtil.isDefaultReplica(regionInfo.getReplicaId())) {
setMetaAssigned(regionInfo, state == State.OPEN);
  }
  LOG.debug("Loaded hbase:meta {}", regionNode);
}, result);
}
  }
  mirrorMetaLocations();
}
 {code}
 

Maybe this incident was a one-off case that only happens during the hbase 2.4 to 
2.5 upgrade. Once the meta location is only read from the master region, there 
should not be any inconsistency, I think.


was (Author: vjasani):
Let me add some logs:

regionserver where meta is online:
{code:java}
2023-11-08 18:10:31,079 INFO  [MemStoreFlusher.1] regionserver.HStore - Added 
hdfs://{cluster}/hbase/data/hbase/meta/1588230740/rep_barrier/3e5faf652f1e4c6db1c4ba1ae676c3ee,
 entries=1630, sequenceid=94325525, filesize=362.1 K {code}
master server 4 which thought it was active:
{code:java}
2023-11-08 18:14:34,563 DEBUG [0:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1699456864765, table=hbase:meta, region=1588230740

2023-11-08 18:14:34,609 INFO  [0:becomeActiveMaster] master.ServerManager - 
Registering regionserver=server3-1,61020,1699456864765 {code}
master server 1 which thought it was active:
{code:java}
2023-11-08 18:15:50,350 DEBUG [aster/server1:61000:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1698687384632, table=hbase:meta, region=1588230740

2023-11-08 18:15:50,399 INFO  [aster/server1:61000:becomeActiveMaster] 
master.ServerManager - Registering regionserver=server3-1,61020,1699456864765 
{code}
master server 4 gave up:
{code:java}
2023-11-08 18:16:22,776 INFO  [aster/server4:61000:becomeActiveMaster] 
master.ActiveMasterManager - Another master is th

[jira] [Comment Edited] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784659#comment-17784659
 ] 

Viraj Jasani edited comment on HBASE-28192 at 11/10/23 3:15 AM:


Let me add some logs:

regionserver where meta is online:
{code:java}
2023-11-08 18:10:31,079 INFO  [MemStoreFlusher.1] regionserver.HStore - Added 
hdfs://{cluster}/hbase/data/hbase/meta/1588230740/rep_barrier/3e5faf652f1e4c6db1c4ba1ae676c3ee,
 entries=1630, sequenceid=94325525, filesize=362.1 K {code}
master server 4 which thought it was active:
{code:java}
2023-11-08 18:14:34,563 DEBUG [0:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1699456864765, table=hbase:meta, region=1588230740

2023-11-08 18:14:34,609 INFO  [0:becomeActiveMaster] master.ServerManager - 
Registering regionserver=server3-1,61020,1699456864765 {code}
master server 1 which thought it was active:
{code:java}
2023-11-08 18:15:50,350 DEBUG [aster/server1:61000:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1698687384632, table=hbase:meta, region=1588230740

2023-11-08 18:15:50,399 INFO  [aster/server1:61000:becomeActiveMaster] 
master.ServerManager - Registering regionserver=server3-1,61020,1699456864765 
{code}
master server 4 gave up:
{code:java}
2023-11-08 18:16:22,776 INFO  [aster/server4:61000:becomeActiveMaster] 
master.ActiveMasterManager - Another master is the active master, 
server1,61000,1699467212235; waiting to become the next active master {code}
 

When server 4 was trying to become the active master and loaded meta, it 
retrieved the correct location of meta, i.e. server3-1,61020,1699456864765.

However, when server 1 (the eventual active master) loaded meta, it retrieved an 
incorrect location, i.e. server3-1,61020,1698687384632.

 

For hbase 2.5, I see that with HBASE-26193 we no longer rely on zookeeper and 
instead rely on scanning the master region:
{code:java}
  // Start the Assignment Thread
  startAssignmentThread();
  // load meta region states.
  // here we are still in the early steps of active master startup. There is 
only one thread(us)
  // can access AssignmentManager and create region node, so here we do not 
need to lock the
  // region node.
  try (ResultScanner scanner =
masterRegion.getScanner(new Scan().addFamily(HConstants.CATALOG_FAMILY))) {
for (;;) {
  Result result = scanner.next();
  if (result == null) {
break;
  }
  RegionStateStore
.visitMetaEntry((r, regionInfo, state, regionLocation, lastHost, 
openSeqNum) -> {
  RegionStateNode regionNode = 
regionStates.getOrCreateRegionStateNode(regionInfo);
  regionNode.setState(state);
  regionNode.setLastHost(lastHost);
  regionNode.setRegionLocation(regionLocation);
  regionNode.setOpenSeqNum(openSeqNum);
  if (regionNode.getProcedure() != null) {
regionNode.getProcedure().stateLoaded(this, regionNode);
  }
  if (regionLocation != null) {
regionStates.addRegionToServer(regionNode);
  }
  if (RegionReplicaUtil.isDefaultReplica(regionInfo.getReplicaId())) {
setMetaAssigned(regionInfo, state == State.OPEN);
  }
  LOG.debug("Loaded hbase:meta {}", regionNode);
}, result);
}
  }
  mirrorMetaLocations();
}
 {code}
 

Maybe this incident was a one-off case that only happens during the hbase 2.4 to 
2.5 upgrade. Once the meta location is only read from the master region (for 
2.5+ releases), there should not be any inconsistency, I think.


was (Author: vjasani):
Let me add some logs:

regionserver where meta is online:
{code:java}
2023-11-08 18:10:31,079 INFO  [MemStoreFlusher.1] regionserver.HStore - Added 
hdfs://{cluster}/hbase/data/hbase/meta/1588230740/rep_barrier/3e5faf652f1e4c6db1c4ba1ae676c3ee,
 entries=1630, sequenceid=94325525, filesize=362.1 K {code}
master server 4 which thought it was active:
{code:java}
2023-11-08 18:14:34,563 DEBUG [0:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1699456864765, table=hbase:meta, region=1588230740

2023-11-08 18:14:34,609 INFO  [0:becomeActiveMaster] master.ServerManager - 
Registering regionserver=server3-1,61020,1699456864765 {code}
master server 1 which thought it was active:
{code:java}
2023-11-08 18:15:50,350 DEBUG [aster/server1:61000:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1698687384632, table=hbase:meta, region=1588230740

2023-11-08 18:15:50,399 INFO  [aster/server1:61000:becomeActiveMaster] 
master.ServerManager - Registering regionserver=server3-1,61020,1699456864765 
{code}
master server 4 gave up:
{code:java}
2023-11-08 18:16:22,776 INFO  [aster/server4:61000:becomeActiveMaster] 
master.ActiveMasterManage

[jira] [Commented] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784659#comment-17784659
 ] 

Viraj Jasani commented on HBASE-28192:
--

Let me add some logs:

regionserver where meta is online:
{code:java}
2023-11-08 18:10:31,079 INFO  [MemStoreFlusher.1] regionserver.HStore - Added 
hdfs://{cluster}/hbase/data/hbase/meta/1588230740/rep_barrier/3e5faf652f1e4c6db1c4ba1ae676c3ee,
 entries=1630, sequenceid=94325525, filesize=362.1 K {code}
master server 4 which thought it was active:
{code:java}
2023-11-08 18:14:34,563 DEBUG [0:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1699456864765, table=hbase:meta, region=1588230740

2023-11-08 18:14:34,609 INFO  [0:becomeActiveMaster] master.ServerManager - 
Registering regionserver=server3-1,61020,1699456864765 {code}
master server 1 which thought it was active:
{code:java}
2023-11-08 18:15:50,350 DEBUG [aster/server1:61000:becomeActiveMaster] 
assignment.AssignmentManager - Loaded hbase:meta state=OPEN, 
location=server3-1,61020,1698687384632, table=hbase:meta, region=1588230740

2023-11-08 18:15:50,399 INFO  [aster/server1:61000:becomeActiveMaster] 
master.ServerManager - Registering regionserver=server3-1,61020,1699456864765 
{code}
master server 4 gave up:
{code:java}
2023-11-08 18:16:22,776 INFO  [aster/server4:61000:becomeActiveMaster] 
master.ActiveMasterManager - Another master is the active master, 
server1,61000,1699467212235; waiting to become the next active master {code}
 

When server 4 was trying to become the active master and loaded meta, it 
retrieved the correct location of meta, i.e. server3-1,61020,1699456864765.

However, when server 1 (the eventual active master) loaded meta, it retrieved an 
incorrect location, i.e. server3-1,61020,1698687384632.

 

For hbase 2.5, I see that with HBASE-26193 we no longer rely on zookeeper and 
instead rely on scanning the master region:
{code:java}
  // Start the Assignment Thread
  startAssignmentThread();
  // load meta region states.
  // here we are still in the early steps of active master startup. There is 
only one thread(us)
  // can access AssignmentManager and create region node, so here we do not 
need to lock the
  // region node.
  try (ResultScanner scanner =
masterRegion.getScanner(new Scan().addFamily(HConstants.CATALOG_FAMILY))) {
for (;;) {
  Result result = scanner.next();
  if (result == null) {
break;
  }
  RegionStateStore
.visitMetaEntry((r, regionInfo, state, regionLocation, lastHost, 
openSeqNum) -> {
  RegionStateNode regionNode = 
regionStates.getOrCreateRegionStateNode(regionInfo);
  regionNode.setState(state);
  regionNode.setLastHost(lastHost);
  regionNode.setRegionLocation(regionLocation);
  regionNode.setOpenSeqNum(openSeqNum);
  if (regionNode.getProcedure() != null) {
regionNode.getProcedure().stateLoaded(this, regionNode);
  }
  if (regionLocation != null) {
regionStates.addRegionToServer(regionNode);
  }
  if (RegionReplicaUtil.isDefaultReplica(regionInfo.getReplicaId())) {
setMetaAssigned(regionInfo, state == State.OPEN);
  }
  LOG.debug("Loaded hbase:meta {}", regionNode);
}, result);
}
  }
  mirrorMetaLocations();
}
 {code}

> Master should recover if meta region state is inconsistent
> --
>
> Key: HBASE-28192
> URL: https://issues.apache.org/jira/browse/HBASE-28192
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> During active master initialization, before we set master as active (i.e. 
> {_}setInitialized(true){_}), we need both meta and namespace regions online. 
> If the region state of meta or namespace is inconsistent, active master can 
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
> ...
> ...
> ...
> // Check once-a-minute.
> if (rc == null) {
>   rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
> }
> Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
>  {code}
> In one of the recent outage, we observed that meta was online on a server, 
> which was correctly reflected in meta znode, but the server starttime was 
> different. This means that as per the latest transition record, meta was 
> marked online on old server (same server with old start time). This kept 
> active master initialization waiting forever and some SCPs got stuck in 
> initial 

[jira] [Comment Edited] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784650#comment-17784650
 ] 

Viraj Jasani edited comment on HBASE-28192 at 11/10/23 2:41 AM:


[~zhangduo] I am not aware of the exact root cause, but this was an hbase 2.4 to 
2.5 upgrade and HBASE-26193 might be a suspect; I am not sure and need to dig 
in. Even if we figure out the cause and resolve it, there could be something 
else tomorrow that makes active master init get stuck in the loop, maybe during 
an upgrade or maybe during usual restarts, and that is not good anyway, right?

If meta is online but not on a live server, the master should be able to 
recover. Any cause should be handled separately too, but right now we let the 
master get stuck in an infinite loop for this edge case, which is also not 
reliable IMO. At least we should not expect the operator to perform hbck 
recovery for meta and/or namespace regions while the master stays stuck forever 
in the loop.


was (Author: vjasani):
[~zhangduo] i am not aware of the exact root cause but this was hbase 2.4 to 
2.5 upgrade and HBASE-26193 might be suspect, i am not sure, need to dig in, 
but let's say we do know the reason and there could be something else tomorrow 
that can make active master init stuck in the loop, it's not good anyway right? 
If meta is online but not on live server, master should be able to recover. Any 
cause should be handled separately too, but right now we let master stuck in 
infinite loop for this edge case, which is also not reliable IMO.

> Master should recover if meta region state is inconsistent
> --
>
> Key: HBASE-28192
> URL: https://issues.apache.org/jira/browse/HBASE-28192
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> During active master initialization, before we set master as active (i.e. 
> {_}setInitialized(true){_}), we need both meta and namespace regions online. 
> If the region state of meta or namespace is inconsistent, active master can 
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
> ...
> ...
> ...
> // Check once-a-minute.
> if (rc == null) {
>   rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
> }
> Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
>  {code}
> In one of the recent outages, we observed that meta was online on a server, 
> which was correctly reflected in the meta znode, but the server start time was 
> different. This means that, as per the latest transition record, meta was 
> marked online on the old server (same server with an old start time). This kept 
> active master initialization waiting forever, and some SCPs got stuck in the 
> initial stage where they need to access the meta table before getting 
> candidates for region moves.
> The only way out of this outage was for the operator to schedule recoveries 
> using hbck for the old server, which triggers an SCP for the old server address 
> of meta. Since many SCPs were stuck, processing the new SCP also took some 
> time; a manual restart of the active master triggered failover, and the new 
> master was able to complete the SCP for the old meta server, correcting the 
> meta assignment details, which eventually marked the master as active. Only 
> after this were we able to see the real, large number of RITs that had been 
> hidden so far.
> We need to let the master recover from this state to avoid manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784650#comment-17784650
 ] 

Viraj Jasani commented on HBASE-28192:
--

[~zhangduo] I am not aware of the exact root cause, but this was an hbase 2.4 to 
2.5 upgrade and HBASE-26193 might be a suspect; I am not sure and need to dig 
in. Even if we do know the reason, there could be something else tomorrow that 
makes active master init get stuck in the loop, and that is not good anyway, 
right? If meta is online but not on a live server, the master should be able to 
recover. Any cause should be handled separately too, but right now we let the 
master get stuck in an infinite loop for this edge case, which is also not 
reliable IMO.

> Master should recover if meta region state is inconsistent
> --
>
> Key: HBASE-28192
> URL: https://issues.apache.org/jira/browse/HBASE-28192
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.6
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> During active master initialization, before we set master as active (i.e. 
> {_}setInitialized(true){_}), we need both meta and namespace regions online. 
> If the region state of meta or namespace is inconsistent, active master can 
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
> ...
> ...
> ...
> // Check once-a-minute.
> if (rc == null) {
>   rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
> }
> Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
>  {code}
> In one of the recent outages, we observed that meta was online on a server, 
> which was correctly reflected in the meta znode, but the server start time was 
> different. This means that, as per the latest transition record, meta was 
> marked online on the old server (same server with an old start time). This kept 
> active master initialization waiting forever, and some SCPs got stuck in the 
> initial stage where they need to access the meta table before getting 
> candidates for region moves.
> The only way out of this outage was for the operator to schedule recoveries 
> using hbck for the old server, which triggers an SCP for the old server address 
> of meta. Since many SCPs were stuck, processing the new SCP also took some 
> time; a manual restart of the active master triggered failover, and the new 
> master was able to complete the SCP for the old meta server, correcting the 
> meta assignment details, which eventually marked the master as active. Only 
> after this were we able to see the real, large number of RITs that had been 
> hidden so far.
> We need to let the master recover from this state to avoid manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28192:


 Summary: Master should recover if meta region state is inconsistent
 Key: HBASE-28192
 URL: https://issues.apache.org/jira/browse/HBASE-28192
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.6, 2.4.17
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7


During active master initialization, before we set master as active (i.e. 
{_}setInitialized(true){_}), we need both meta and namespace regions online. If 
the region state of meta or namespace is inconsistent, active master can get 
stuck in the initialization step:
{code:java}
private boolean isRegionOnline(RegionInfo ri) {
  RetryCounter rc = null;
  while (!isStopped()) {
...
...
...
// Check once-a-minute.
if (rc == null) {
  rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
}
Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
  }
  return false;
}
 {code}
In one of the recent outages, we observed that meta was online on a server, 
which was correctly reflected in the meta znode, but the server start time was 
different. This means that, as per the latest transition record, meta was marked 
online on the old server (same server with an old start time). This kept active 
master initialization waiting forever, and some SCPs got stuck in the initial 
stage where they need to access the meta table before getting candidates for 
region moves.

The only way out of this outage was for the operator to schedule recoveries 
using hbck for the old server, which triggers an SCP for the old server address 
of meta. Since many SCPs were stuck, processing the new SCP also took some time; 
a manual restart of the active master triggered failover, and the new master was 
able to complete the SCP for the old meta server, correcting the meta assignment 
details, which eventually marked the master as active. Only after this were we 
able to see the real, large number of RITs that had been hidden so far.

We need to let the master recover from this state to avoid manual intervention.
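
A minimal sketch of what such a recovery check could look like; the interfaces 
below are hypothetical stand-ins for illustration, not HBase's real classes:
{code:java}
/**
 * Sketch: if meta is recorded OPEN on a server that is not live and has no crash
 * procedure in flight, schedule recovery for that server instead of waiting in
 * the isRegionOnline() loop forever.
 */
public class MetaStateRecoverySketch {

  /** Hypothetical view over server-liveness and crash-procedure state. */
  interface ServerView {
    boolean isServerLive(String serverName);        // live regionserver with this start code?
    boolean hasCrashProcedure(String serverName);   // SCP already queued or running?
    void scheduleCrashRecovery(String serverName);  // equivalent of scheduling an SCP
  }

  private final ServerView servers;

  public MetaStateRecoverySketch(ServerView servers) {
    this.servers = servers;
  }

  /**
   * Called while the active master waits for meta to come online.
   * Returns true if recovery was scheduled for a stale meta location.
   */
  boolean maybeRecoverStaleMetaLocation(String metaState, String metaLocation) {
    if (!"OPEN".equals(metaState) || metaLocation == null) {
      return false; // nothing to recover here; the normal assignment path applies
    }
    if (servers.isServerLive(metaLocation) || servers.hasCrashProcedure(metaLocation)) {
      return false; // location is consistent, or recovery is already underway
    }
    // Meta claims to be OPEN on a dead server with no SCP in flight: schedule
    // recovery so reassignment can happen without operator-driven hbck steps.
    servers.scheduleCrashRecovery(metaLocation);
    return true;
  }
}
{code}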



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-20881) Introduce a region transition procedure to handle all the state transition for a region

2023-11-06 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783471#comment-17783471
 ] 

Viraj Jasani commented on HBASE-20881:
--

I no longer have the example at hand, but when I see another incident of 
ABNORMALLY_CLOSED, I will be happy to share the logs.

In the meantime, I was curious: what is the best resolution for a region stuck 
in this state? Is running "hbck assigns -o" the only resolution?
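
For reference, a rough sketch of the programmatic counterpart of "hbck assigns 
-o" via the Hbck client API; this assumes a client where Connection#getHbck() 
and Hbck#assigns(List, override) are available, and the region name below is a 
placeholder:
{code:java}
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Hbck;

public class AssignOverrideSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Hbck hbck = connection.getHbck()) {
      // override=true is the "-o" equivalent: detach the stale procedure from the
      // RegionStateNode so a freshly scheduled assign procedure can take the lock.
      List<Long> pids = hbck.assigns(
        Collections.singletonList("placeholder-encoded-region-name"), true);
      System.out.println("Scheduled assign procedure ids: " + pids);
    }
  }
}
{code}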

> Introduce a region transition procedure to handle all the state transition 
> for a region
> ---
>
> Key: HBASE-20881
> URL: https://issues.apache.org/jira/browse/HBASE-20881
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.2.0
>
> Attachments: HBASE-20881-branch-2-v1.patch, 
> HBASE-20881-branch-2-v2.patch, HBASE-20881-branch-2.patch, 
> HBASE-20881-v1.patch, HBASE-20881-v10.patch, HBASE-20881-v11.patch, 
> HBASE-20881-v12.patch, HBASE-20881-v13.patch, HBASE-20881-v13.patch, 
> HBASE-20881-v14.patch, HBASE-20881-v14.patch, HBASE-20881-v15.patch, 
> HBASE-20881-v16.patch, HBASE-20881-v2.patch, HBASE-20881-v3.patch, 
> HBASE-20881-v4.patch, HBASE-20881-v4.patch, HBASE-20881-v5.patch, 
> HBASE-20881-v6.patch, HBASE-20881-v7.patch, HBASE-20881-v7.patch, 
> HBASE-20881-v8.patch, HBASE-20881-v9.patch, HBASE-20881.patch
>
>
> Now have an AssignProcedure, an UnssignProcedure, and also a 
> MoveRegionProcedure which schedules an AssignProcedure and an 
> UnssignProcedure to move a region. This makes the logic a bit complicated, as 
> MRP is not a RIT, so when SCP can not interrupt it directly...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-20881) Introduce a region transition procedure to handle all the state transition for a region

2023-10-23 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778758#comment-17778758
 ] 

Viraj Jasani edited comment on HBASE-20881 at 10/23/23 5:09 PM:


[~zhangduo] IIUC, the only reason why we had to introduce the ABNORMALLY_CLOSED 
state is that when a region is already in RIT, and the target server where it is 
assigned, getting assigned to, or getting moved from (and therefore getting 
closed) crashes, SCP has to interrupt the old TRSP. SCP anyway creates new TRSPs 
to take care of assigning all regions that were previously hosted by the target 
server, but any region which was already in transition might require manual 
intervention because SCP cannot be certain which step of the previous TRSP the 
region was stuck in while it was in RIT.

For SCP, any RIT on a dead server is a complex state to deal with because it 
cannot know for certain whether the region was stuck in a coproc hook on the 
host, or whether it was stuck while making an RPC call to a remote server and 
what the outcome of that RPC call was, etc.

 

Does this seem correct? We were thinking of digging a bit more into the details 
to see if there are any cases for which we can convert the region state to 
CLOSED rather than ABNORMALLY_CLOSED and therefore avoid any operator 
intervention, but I fear we might introduce double assignment of regions if this 
is not done carefully.


was (Author: vjasani):
[~zhangduo] IIUC, the only reason why we had to introduce ABNORMALLY_CLOSED 
state is because when a region is already in RIT, and the target server where 
it is assigned or getting assigned to crashes, SCP has to interrupt old TRSP 
and create new TRSPs to take care of assigning all regions that were previously 
hosted by the target server, but any region already in transition might require 
manual intervention because SCP cannot be certain what step of the previous 
TRSP, the region was stuck while it was in RIT.

For SCP, any RIT on dead server is a complex state to deal with because it 
cannot know for certain whether the region was stuck in any coproc hook on the 
host or it was stuck while making RPC call to remote server and what was the 
outcome of the RPC call etc.

 

Does this seem correct? We were thinking of digging a bit more in detail to see 
if there are any cases for which we can convert region state to CLOSED rather 
than ABNORMALLY_CLOSED and therefore avoid any operator intervention, but i 
fear we might introduce double assignment of regions if this is not done 
carefully.

> Introduce a region transition procedure to handle all the state transition 
> for a region
> ---
>
> Key: HBASE-20881
> URL: https://issues.apache.org/jira/browse/HBASE-20881
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.2.0
>
> Attachments: HBASE-20881-branch-2-v1.patch, 
> HBASE-20881-branch-2-v2.patch, HBASE-20881-branch-2.patch, 
> HBASE-20881-v1.patch, HBASE-20881-v10.patch, HBASE-20881-v11.patch, 
> HBASE-20881-v12.patch, HBASE-20881-v13.patch, HBASE-20881-v13.patch, 
> HBASE-20881-v14.patch, HBASE-20881-v14.patch, HBASE-20881-v15.patch, 
> HBASE-20881-v16.patch, HBASE-20881-v2.patch, HBASE-20881-v3.patch, 
> HBASE-20881-v4.patch, HBASE-20881-v4.patch, HBASE-20881-v5.patch, 
> HBASE-20881-v6.patch, HBASE-20881-v7.patch, HBASE-20881-v7.patch, 
> HBASE-20881-v8.patch, HBASE-20881-v9.patch, HBASE-20881.patch
>
>
> Now have an AssignProcedure, an UnssignProcedure, and also a 
> MoveRegionProcedure which schedules an AssignProcedure and an 
> UnssignProcedure to move a region. This makes the logic a bit complicated, as 
> MRP is not a RIT, so when SCP can not interrupt it directly...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-20881) Introduce a region transition procedure to handle all the state transition for a region

2023-10-23 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778758#comment-17778758
 ] 

Viraj Jasani commented on HBASE-20881:
--

[~zhangduo] IIUC, the only reason why we had to introduce the ABNORMALLY_CLOSED 
state is that when a region is already in RIT, and the target server where it is 
assigned or getting assigned to crashes, SCP has to interrupt the old TRSP and 
create new TRSPs to take care of assigning all regions that were previously 
hosted by the target server, but any region already in transition might require 
manual intervention because SCP cannot be certain which step of the previous 
TRSP the region was stuck in while it was in RIT.

For SCP, any RIT on a dead server is a complex state to deal with because it 
cannot know for certain whether the region was stuck in a coproc hook on the 
host, or whether it was stuck while making an RPC call to a remote server and 
what the outcome of that RPC call was, etc.

 

Does this seem correct? We were thinking of digging a bit more into the details 
to see if there are any cases for which we can convert the region state to 
CLOSED rather than ABNORMALLY_CLOSED and therefore avoid any operator 
intervention, but I fear we might introduce double assignment of regions if this 
is not done carefully.

> Introduce a region transition procedure to handle all the state transition 
> for a region
> ---
>
> Key: HBASE-20881
> URL: https://issues.apache.org/jira/browse/HBASE-20881
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.2.0
>
> Attachments: HBASE-20881-branch-2-v1.patch, 
> HBASE-20881-branch-2-v2.patch, HBASE-20881-branch-2.patch, 
> HBASE-20881-v1.patch, HBASE-20881-v10.patch, HBASE-20881-v11.patch, 
> HBASE-20881-v12.patch, HBASE-20881-v13.patch, HBASE-20881-v13.patch, 
> HBASE-20881-v14.patch, HBASE-20881-v14.patch, HBASE-20881-v15.patch, 
> HBASE-20881-v16.patch, HBASE-20881-v2.patch, HBASE-20881-v3.patch, 
> HBASE-20881-v4.patch, HBASE-20881-v4.patch, HBASE-20881-v5.patch, 
> HBASE-20881-v6.patch, HBASE-20881-v7.patch, HBASE-20881-v7.patch, 
> HBASE-20881-v8.patch, HBASE-20881-v9.patch, HBASE-20881.patch
>
>
> Now have an AssignProcedure, an UnssignProcedure, and also a 
> MoveRegionProcedure which schedules an AssignProcedure and an 
> UnssignProcedure to move a region. This makes the logic a bit complicated, as 
> MRP is not a RIT, so when SCP can not interrupt it directly...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2023-10-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28151:
-
Description: 
When operator uses hbck assigns or unassigns with "-o", the override will also 
skip pre transit checks. While this is one of the intentions with "-o", the 
primary purpose should still be to only unattach existing procedure from 
RegionStateNode so that newly scheduled assign proc can take exclusive region 
level lock.

We should restrict bypassing preTransitCheck by only providing it as site 
config.

If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
allowed to bypass this check, otherwise by default they should go through the 
check.

 

It is important to keep "unset of the procedure from RegionStateNode" and 
"bypassing preTransitCheck" separate so that when the cluster state is bad, we 
don't explicitly deteriorate it further e.g. if a region was successfully split 
and now if operator performs "hbck assigns \{region} -o" and if it bypasses the 
transit check, master would bring the region online and it could compact store 
files and archive the store file which is referenced by daughter region. This 
would not allow daughter region to come online.

Let's introduce hbase site config to allow bypassing preTransitCheck, it should 
not be doable only by operator using hbck alone.

 

"-o" should mean "override" the procedure that is attached to the 
RegionStateNode, it should not mean forcefully skip any region transition 
validation checks.

  was:
When operator uses hbck assigns or unassigns with "-o", the override will also 
skip pre transit checks. While this is one of the intentions with "-o", the 
primary purpose should still be to only unattach existing procedure from 
RegionStateNode so that newly scheduled assign proc can take exclusive region 
level lock.

We should restrict bypassing preTransitCheck by only providing it as site 
config.

If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
allowed to bypass this check, otherwise by default they should go through the 
check.

 

It is important to keep "unset of the procedure from RegionStateNode" and 
"bypassing preTransitCheck" separate so that when the cluster state is bad, we 
don't explicitly deteriorate it further e.g. if a region was successfully split 
and now if operator performs "hbck assigns \{region} -o" and if it bypasses the 
transit check, master would bring the region online and it could compact store 
files and archive the store file which is referenced by daughter region. This 
would not allow daughter region to come online.

Let's introduce hbase site config to allow bypassing preTransitCheck, it should 
not be doable only by operator using hbck alone.

 

"-o" should mean "override" the procedure that is attached to the 
RegionStateNode, it should not mean forcefully skip any region transition 
validation checks and perform the region assignments.


> hbck -o should not allow bypassing pre transit check by default
> ---
>
> Key: HBASE-28151
> URL: https://issues.apache.org/jira/browse/HBASE-28151
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
>
> When operator uses hbck assigns or unassigns with "-o", the override will 
> also skip pre transit checks. While this is one of the intentions with "-o", 
> the primary purpose should still be to only unattach existing procedure from 
> RegionStateNode so that newly scheduled assign proc can take exclusive region 
> level lock.
> We should restrict bypassing preTransitCheck by only providing it as site 
> config.
> If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
> allowed to bypass this check, otherwise by default they should go through the 
> check.
>  
> It is important to keep "unset of the procedure from RegionStateNode" and 
> "bypassing preTransitCheck" separate so that when the cluster state is bad, 
> we don't explicitly deteriorate it further e.g. if a region was successfully 
> split and now if operator performs "hbck assigns \{region} -o" and if it 
> bypasses the transit check, master would bring the region online and it could 
> compact store files and archive the store file which is referenced by 
> daughter region. This would not allow daughter region to come online.
> Let's introduce hbase site config to allow bypassing preTransitCheck, it 
> should not be doable only by operator using hbck alone.
>  
> "-o" should mean "override" the procedure that is attached to the 
> RegionStateNode, it should not mean forcefully skip any region transition 
> validation checks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2023-10-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28151:
-
Description: 
When operator uses hbck assigns or unassigns with "-o", the override will also 
skip pre transit checks. While this is one of the intentions with "-o", the 
primary purpose should still be to only unattach existing procedure from 
RegionStateNode so that newly scheduled assign proc can take exclusive region 
level lock.

We should restrict bypassing preTransitCheck by only providing it as site 
config.

If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
allowed to bypass this check, otherwise by default they should go through the 
check.

 

It is important to keep "unset of the procedure from RegionStateNode" and 
"bypassing preTransitCheck" separate so that when the cluster state is bad, we 
don't explicitly deteriorate it further e.g. if a region was successfully split 
and now if operator performs "hbck assigns \{region} -o" and if it bypasses the 
transit check, master would bring the region online and it could compact store 
files and archive the store file which is referenced by daughter region. This 
would not allow daughter region to come online.

Let's introduce hbase site config to allow bypassing preTransitCheck, it should 
not be doable only by operator using hbck alone.

 

"-o" should mean "override" the procedure that is attached to the 
RegionStateNode, it should not mean forcefully skip any region transition 
validation checks and perform the region assignments.

  was:
When operator uses hbck assigns or unassigns with "-o", the override will also 
skip pre transit checks. While this is one of the intentions with "-o", the 
primary purpose should still be to only unattach existing procedure from 
RegionStateNode so that newly scheduled assign proc can take exclusive region 
level lock.

We should restrict bypassing preTransitCheck by only providing it as site 
config.

If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
allowed to bypass this check, otherwise by default they should go through the 
check.

 

It is important to keep "unset of the procedure from RegionStateNode" and 
"bypassing preTransitCheck" separate so that when the cluster state is bad, we 
don't explicitly deteriorate it further e.g. if a region was successfully split 
and now if operator performs "hbck assigns \{region} -o" and if it bypasses the 
transit check, master would bring the region online and it could compact store 
files and archive the store file which is referenced by daughter region. This 
would not allow daughter region to come online.

Let's introduce hbase site config to allow bypassing preTransitCheck, it should 
not be doable only by operator using hbck alone.


> hbck -o should not allow bypassing pre transit check by default
> ---
>
> Key: HBASE-28151
> URL: https://issues.apache.org/jira/browse/HBASE-28151
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
>
> When operator uses hbck assigns or unassigns with "-o", the override will 
> also skip pre transit checks. While this is one of the intentions with "-o", 
> the primary purpose should still be to only unattach existing procedure from 
> RegionStateNode so that newly scheduled assign proc can take exclusive region 
> level lock.
> We should restrict bypassing preTransitCheck by only providing it as site 
> config.
> If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
> allowed to bypass this check, otherwise by default they should go through the 
> check.
>  
> It is important to keep "unset of the procedure from RegionStateNode" and 
> "bypassing preTransitCheck" separate so that when the cluster state is bad, 
> we don't explicitly deteriorate it further e.g. if a region was successfully 
> split and now if operator performs "hbck assigns \{region} -o" and if it 
> bypasses the transit check, master would bring the region online and it could 
> compact store files and archive the store file which is referenced by 
> daughter region. This would not allow daughter region to come online.
> Let's introduce hbase site config to allow bypassing preTransitCheck, it 
> should not be doable only by operator using hbck alone.
>  
> "-o" should mean "override" the procedure that is attached to the 
> RegionStateNode, it should not mean forcefully skip any region transition 
> validation checks and perform the region assignments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2023-10-12 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28151:


 Summary: hbck -o should not allow bypassing pre transit check by 
default
 Key: HBASE-28151
 URL: https://issues.apache.org/jira/browse/HBASE-28151
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


When the operator uses hbck assigns or unassigns with "-o", the override will 
also skip the pre-transit checks. While this is one of the intentions of "-o", 
the primary purpose should still be to only detach the existing procedure from 
the RegionStateNode so that the newly scheduled assign proc can take the 
exclusive region-level lock.

We should restrict bypassing preTransitCheck by only allowing it via a site 
config.

Only if bypassing preTransitCheck is configured should any hbck "-o" be allowed 
to bypass this check; otherwise, by default, it should go through the check.

 

It is important to keep "unsetting the procedure from the RegionStateNode" and 
"bypassing preTransitCheck" separate so that when the cluster state is bad, we 
don't explicitly deteriorate it further. For example, if a region was 
successfully split and the operator then performs "hbck assigns \{region} -o" 
and it bypasses the transit check, the master would bring the region online and 
it could compact store files and archive the store file which is referenced by 
the daughter region. This would not allow the daughter region to come online.

Let's introduce an hbase site config to allow bypassing preTransitCheck; it 
should not be doable by the operator using hbck alone.
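
A minimal sketch of the proposed gating; the config key and names below are 
hypothetical, for illustration only:
{code:java}
/**
 * Sketch: hbck "-o" always detaches the existing procedure, but skipping the
 * pre-transit check additionally requires a site-level opt-in.
 */
public class OverridePreTransitCheckGateSketch {

  // Hypothetical site property; default false so "-o" alone never skips the check.
  static final String BYPASS_KEY = "hbase.assignment.hbck.override.skip.pre.transit.check";

  private final boolean bypassAllowedBySiteConfig;

  public OverridePreTransitCheckGateSketch(boolean bypassAllowedBySiteConfig) {
    this.bypassAllowedBySiteConfig = bypassAllowedBySiteConfig;
  }

  /** Decide whether the pre-transit check runs for an hbck-triggered assign. */
  boolean shouldRunPreTransitCheck(boolean overrideRequested) {
    if (!overrideRequested) {
      return true; // normal assigns always validate
    }
    // "-o" detaches the stale procedure regardless, but only a cluster that has
    // explicitly opted in via site config may also skip the validation step.
    return !bypassAllowedBySiteConfig;
  }
}
{code}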



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2023-10-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28151:
-
Affects Version/s: 2.5.5
   2.4.17

> hbck -o should not allow bypassing pre transit check by default
> ---
>
> Key: HBASE-28151
> URL: https://issues.apache.org/jira/browse/HBASE-28151
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
>
> When operator uses hbck assigns or unassigns with "-o", the override will 
> also skip pre transit checks. While this is one of the intentions with "-o", 
> the primary purpose should still be to only unattach existing procedure from 
> RegionStateNode so that newly scheduled assign proc can take exclusive region 
> level lock.
> We should restrict bypassing preTransitCheck by only providing it as site 
> config.
> If bypassing preTransitCheck is configured, only then any hbck "-o" should be 
> allowed to bypass this check, otherwise by default they should go through the 
> check.
>  
> It is important to keep "unset of the procedure from RegionStateNode" and 
> "bypassing preTransitCheck" separate so that when the cluster state is bad, 
> we don't explicitly deteriorate it further e.g. if a region was successfully 
> split and now if operator performs "hbck assigns \{region} -o" and if it 
> bypasses the transit check, master would bring the region online and it could 
> compact store files and archive the store file which is referenced by 
> daughter region. This would not allow daughter region to come online.
> Let's introduce hbase site config to allow bypassing preTransitCheck, it 
> should not be doable only by operator using hbck alone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28144) Canary publish read failure fails with NPE if region location is null

2023-10-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28144.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Canary publish read failure fails with NPE if region location is null
> -
>
> Key: HBASE-28144
> URL: https://issues.apache.org/jira/browse/HBASE-28144
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.5
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Region with null server name causes canary failures while publishing read 
> failure i.e. while updating perServerFailuresCount map:
> {code:java}
> 2023-10-09 15:24:11 [CanaryMonitor-1696864805801] ERROR tool.Canary(1480): 
> Sniff region failed!
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionMonitor.run(CanaryTool.java:1478)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1837)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.incFailuresCountDetails(CanaryTool.java:327)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.publishReadFailure(CanaryTool.java:353)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:548)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:587)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:502)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:470)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   ... 1 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28144) Canary publish read failure fails with NPE if region location is null

2023-10-09 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HBASE-28144:


Assignee: Viraj Jasani

> Canary publish read failure fails with NPE if region location is null
> -
>
> Key: HBASE-28144
> URL: https://issues.apache.org/jira/browse/HBASE-28144
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.5
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Region with null server name causes canary failures while publishing read 
> failure i.e. while updating perServerFailuresCount map:
> {code:java}
> 2023-10-09 15:24:11 [CanaryMonitor-1696864805801] ERROR tool.Canary(1480): 
> Sniff region failed!
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionMonitor.run(CanaryTool.java:1478)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1837)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.incFailuresCountDetails(CanaryTool.java:327)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.publishReadFailure(CanaryTool.java:353)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:548)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:587)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:502)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:470)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   ... 1 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28144) Canary publish read failure fails with NPE if region location is null

2023-10-09 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28144:


 Summary: Canary publish read failure fails with NPE if region 
location is null
 Key: HBASE-28144
 URL: https://issues.apache.org/jira/browse/HBASE-28144
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.5.5
Reporter: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


Region with null server name causes canary failures while publishing read 
failure i.e. while updating perServerFailuresCount map:
{code:java}
2023-10-09 15:24:11 [CanaryMonitor-1696864805801] ERROR tool.Canary(1480): 
Sniff region failed!
java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionMonitor.run(CanaryTool.java:1478)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1837)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.incFailuresCountDetails(CanaryTool.java:327)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.publishReadFailure(CanaryTool.java:353)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:548)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:587)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:502)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:470)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more {code}
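
For illustration only, a minimal sketch of the kind of guard that avoids this NPE 
(class and method names here are hypothetical, not the actual HBASE-28144 patch). 
ConcurrentHashMap rejects null keys, so a null server name has to be handled before 
compute() is called:
{code:java}
import java.util.concurrent.ConcurrentHashMap;

public class PerServerFailureCounter {

  private final ConcurrentHashMap<String, Long> perServerFailuresCount = new ConcurrentHashMap<>();

  public void incFailuresCountDetails(String serverName) {
    if (serverName == null) {
      // ConcurrentHashMap.compute() throws NullPointerException for a null key,
      // so a region whose location is unknown must be skipped (or accounted
      // elsewhere) instead of crashing the canary's region monitor.
      return;
    }
    perServerFailuresCount.compute(serverName, (k, v) -> v == null ? 1L : v + 1);
  }
}
{code}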



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28119) LogRoller stuck by FanOutOneBlockAsyncDFSOutputHelper.createOutput waitting get future all time

2023-10-01 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28119:
-
Component/s: wal

> LogRoller stuck by FanOutOneBlockAsyncDFSOutputHelper.createOutput waitting 
> get future all time
> ---
>
> Key: HBASE-28119
> URL: https://issues.apache.org/jira/browse/HBASE-28119
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.2.7
>Reporter: Li Chao
>Priority: Major
> Attachments: image-2023-09-29-17-23-04-560.png
>
>
> We found this problem in our production. The LogRoller is stuck in 
> FanOutOneBlockAsyncDFSOutputHelper.createOutput, waiting on the returned future forever.
> !image-2023-09-29-17-23-04-560.png|width=566,height=191!
> Checking the regionserver's log, the regionserver started SASL negotiation with 
> two datanodes, but only one completed the check; the other did nothing after 
> connecting to its datanode.
> {code:java}
> 518415 2023-04-17 14:17:25,434 INFO 
> io.transwarp.guardian.client.cache.PeriodCacheUpdater: Fetch change version: 0
> 518416 2023-04-17 14:17:29,092 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> RefreshCredentials execution time: 0 ms.
> 518417 2023-04-17 14:17:29,768 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> CompactionChecker execution time: 0 ms.
> 518418 2023-04-17 14:17:29,768 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> CompactionThroughputTuner execution time: 0 ms.518419 2023-04-17 14:17:29,768 
> DEBUG org.apache.hadoop.hbase.ScheduledChore: MemstoreFlusherChore execution 
> time: 0 ms.
> 518420 2023-04-17 14:17:29,768 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> gy-dmz-swrzjzcc-gx-2-19,60020,1677341424491-Hea       pMemoryTunerChore 
> execution time: 0 ms.
> 518421 2023-04-17 14:17:39,375 DEBUG 
> org.apache.hadoop.hbase.regionserver.LogRoller: WAL AsyncFSWAL 
> gy-dmz-swrzjzcc-gx-2-19%       2C60020%2C1677341424491:(num 1681711899342) 
> roll requested
> 518422 2023-04-17 14:17:39,389 DEBUG 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper: 
> SASL client        doing general handshake for addr = 
> 10.179.157.10/10.179.157.10, datanodeId = 
> DatanodeInfoWithStorage[10.179.157.10:50       
> 010,DS-4815c34a-8d0c-42b9-b56c-529d2732d956,DISK]
> 518423 2023-04-17 14:17:39,391 DEBUG 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper: 
> SASL client        doing general handshake for addr = 
> 10.179.157.29/10.179.157.29, datanodeId = 
> DatanodeInfoWithStorage[10.179.157.29:50       
> 010,DS-509f84fe-2e88-403e-87b5-f4765e49094f,DISK]
> 518424 2023-04-17 14:17:39,392 DEBUG 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper: 
> Verifying QO       P, requested QOP = [auth], negotiated QOP = auth
> 518425 2023-04-17 14:17:39,743 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> MemstoreFlusherChore execution time: 0 ms.
> 518426 2023-04-17 14:17:39,743 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> CompactionChecker execution time: 0 ms.
> 518427 2023-04-17 14:17:49,977 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> CompactionChecker execution time: 0 ms.
> 518428 2023-04-17 14:17:49,977 DEBUG org.apache.hadoop.hbase.ScheduledChore: 
> MemstoreFlusherChore execution time: 0 ms.
> 518429 2023-04-17 14:17:55,492 INFO {code}
> FanOutOneBlockAsyncDFSOutputHelper.createOutput connects to the datanode and 
> then calls trySaslNegotiate. In SASL authentication mode, SaslNegotiateHandler 
> is used to handle the authentication. If the datanode is shut down, 
> SaslNegotiateHandler.channelInactive does not call back to the promise, which 
> leaves the future stuck forever.
> {code:java}
> @Override
> public void handlerAdded(ChannelHandlerContext ctx) throws Exception {
>   ctx.write(ctx.alloc().buffer(4).writeInt(SASL_TRANSFER_MAGIC_NUMBER));
>   sendSaslMessage(ctx, new byte[0]);
>   ctx.flush();
>   step++;
> }
> @Override
> public void channelInactive(ChannelHandlerContext ctx) throws Exception {
>   saslClient.dispose();
> } {code}
> So SaslNegotiateHandler.channelInactive should call promise.tryFailure to 
> avoid the future getting stuck forever.
>  
>  
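
A minimal sketch of the suggested fix (illustrative only; it uses plain Netty types 
rather than HBase's shaded classes, and the class and field names are assumptions):
{code:java}
import java.io.IOException;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.concurrent.Promise;

public class SaslNegotiateHandlerSketch extends ChannelInboundHandlerAdapter {

  private final Promise<Void> promise;

  public SaslNegotiateHandlerSketch(Promise<Void> promise) {
    this.promise = promise;
  }

  @Override
  public void channelInactive(ChannelHandlerContext ctx) throws Exception {
    // Disposing resources alone is not enough: also complete the promise
    // exceptionally so the caller waiting on the future is released instead
    // of hanging forever when the datanode goes away mid-negotiation.
    promise.tryFailure(
      new IOException("Connection to datanode closed before SASL negotiation completed"));
    ctx.fireChannelInactive();
  }
}
{code}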



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28081) Snapshot working dir does not retain ACLs after snapshot commit phase

2023-09-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28081.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Snapshot working dir does not retain ACLs after snapshot commit phase
> -
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard.
>  
> After the snapshot is committed from the working dir to the final destination 
> (under the /.hbase-snapshot dir), if the operation was an atomic rename, the 
> working dir (e.g. /hbase/.hbase-snapshot/.tmp) no longer preserves the ACLs that 
> were derived from the snapshot parent dir (e.g. /hbase/.hbase-snapshot) when the 
> first working snapshot dir was created. Hence, for the new working dir, we should 
> ensure that we preserve ACLs from the snapshot parent dir.
> This would ensure that the final snapshot commit dir has the expected ACLs 
> regardless of whether we perform an atomic rename or a non-atomic copy operation 
> in the snapshot commit phase.
>  
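
As a rough illustration of the idea (not the actual HBASE-28081 patch; it assumes the 
underlying filesystem supports the Hadoop ACL calls):
{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;

public final class SnapshotAclSketch {

  private SnapshotAclSketch() {
  }

  /**
   * Re-apply the ACL entries of the snapshot parent dir (e.g. /hbase/.hbase-snapshot)
   * onto a freshly (re)created working dir so that a later atomic rename carries the
   * expected ACLs over to the final snapshot dir.
   */
  public static void preserveParentAcls(FileSystem fs, Path snapshotParentDir, Path workingDir)
    throws IOException {
    List<AclEntry> parentAcls = fs.getAclStatus(snapshotParentDir).getEntries();
    if (!parentAcls.isEmpty()) {
      fs.setAcl(workingDir, parentAcls);
    }
  }
}
{code}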



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28050) RSProcedureDispatcher to fail-fast for krb auth failures

2023-09-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28050.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RSProcedureDispatcher to fail-fast for krb auth failures
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.
>  
> Example log:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying...  {code}
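
A small sketch of the fail-fast check the description argues for (illustrative only, 
not the committed change; how this gets wired into RSProcedureDispatcher is omitted):
{code:java}
import javax.security.sasl.SaslException;

public final class RemoteCallFailures {

  private RemoteCallFailures() {
  }

  /**
   * Walks the cause chain of a failed remote call and reports whether it is a
   * kerberos/SASL handshake failure, in which case retrying the dispatch cannot
   * help and the remote procedure should be marked as failed right away.
   */
  public static boolean isSaslFailure(Throwable error) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (t instanceof SaslException) {
        return true;
      }
    }
    return false;
  }
}
{code}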



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-25549) Provide a switch that allows avoiding reopening all regions when modifying a table to prevent RIT storms.

2023-09-27 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769848#comment-17769848
 ] 

Viraj Jasani commented on HBASE-25549:
--

Thanks [~GeorryHuang], could you please rebase again? The build seems broken.

> Provide a switch that allows avoiding reopening all regions when modifying a 
> table to prevent RIT storms.
> -
>
> Key: HBASE-25549
> URL: https://issues.apache.org/jira/browse/HBASE-25549
> Project: HBase
>  Issue Type: Improvement
>  Components: master, shell
>Affects Versions: 3.0.0-alpha-1
>Reporter: Zhuoyue Huang
>Assignee: Zhuoyue Huang
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> Under normal circumstances, modifying a table will cause all regions 
> belonging to the table to enter RIT. Imagine the following two scenarios:
>  # Someone entered the wrong configuration (e.g. negative 
> 'hbase.busy.wait.multiplier.max' value) when altering the table, causing 
> thousands of online regions to fail to open, leading to online accidents.
>  # The configuration of a table is modified, but the change is not urgent, so 
> the regions are not expected to enter RIT immediately.
> -'alter_lazy' is a new command to modify a table without reopening any online 
> regions except those regions were assigned by other threads or split etc.-
>  
> Provide an optional lazy_mode for the alter command to modify the 
> TableDescriptor without the regions entering RIT. The modification will 
> take effect when a region is reopened.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28081) Snapshot working dir does not retain ACLs after snapshot commit phase

2023-09-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28081:
-
Description: 
TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard.

 

After the snapshot is committed from the working dir to the final destination (under 
the /.hbase-snapshot dir), if the operation was an atomic rename, the working dir 
(e.g. /hbase/.hbase-snapshot/.tmp) no longer preserves the ACLs that were derived 
from the snapshot parent dir (e.g. /hbase/.hbase-snapshot) when the first working 
snapshot dir was created. Hence, for the new working dir, we should ensure that we 
preserve ACLs from the snapshot parent dir.

This would ensure that the final snapshot commit dir has the expected ACLs 
regardless of whether we perform an atomic rename or a non-atomic copy operation in 
the snapshot commit phase.

 

> Snapshot working dir does not retain ACLs after snapshot commit phase
> -
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard.
>  
> After the snapshot is committed from the working dir to the final destination 
> (under the /.hbase-snapshot dir), if the operation was an atomic rename, the 
> working dir (e.g. /hbase/.hbase-snapshot/.tmp) no longer preserves the ACLs that 
> were derived from the snapshot parent dir (e.g. /hbase/.hbase-snapshot) when the 
> first working snapshot dir was created. Hence, for the new working dir, we should 
> ensure that we preserve ACLs from the snapshot parent dir.
> This would ensure that the final snapshot commit dir has the expected ACLs 
> regardless of whether we perform an atomic rename or a non-atomic copy operation 
> in the snapshot commit phase.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28081) Snapshot working dir does not retain ACLs after snapshot commit phase

2023-09-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28081:
-
Summary: Snapshot working dir does not retain ACLs after snapshot commit 
phase  (was: TestSnapshotScannerHDFSAclController is failing 100% on flaky 
dashboard)

> Snapshot working dir does not retain ACLs after snapshot commit phase
> -
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28081) TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard

2023-09-25 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769019#comment-17769019
 ] 

Viraj Jasani commented on HBASE-28081:
--

[~zhangduo] [~apurtell]

Please help review [https://github.com/apache/hbase/pull/5437]

> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard
> ---
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28081) TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard

2023-09-25 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769015#comment-17769015
 ] 

Viraj Jasani commented on HBASE-28081:
--

We have already done something similar in HBASE-24097 as well.

> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard
> ---
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28081) TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard

2023-09-25 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HBASE-28081:


Assignee: Viraj Jasani

> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard
> ---
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-09-25 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769012#comment-17769012
 ] 

Viraj Jasani commented on HBASE-28042:
--

[~apurtell] this seems more like a test issue; running the test alone always 
passes locally. I posted an update on HBASE-28081 as well.

> Snapshot corruptions due to non-atomic rename within same filesystem
> 
>
> Key: HBASE-28042
> URL: https://issues.apache.org/jira/browse/HBASE-28042
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Sequence of events that can lead to snapshot corruptions:
>  # Create snapshot using admin command
>  # Active master triggers async snapshot creation
>  # If the snapshot operation doesn't complete within 5 min, the client gets an 
> exception
> {code:java}
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms 
>   {code}
>  # Client initiates snapshot deletion after this error
>  # In the snapshot completion/commit phase, the files are moved from tmp to 
> final dir.
>  # Snapshot delete and snapshot commit operations can cause corruption by 
> leaving incomplete metadata:
>  * [Snapshot commit] create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot delete from client]  delete 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot commit]  create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"
>  
> The changes introduced by HBASE-21098 perform an atomic rename for HBase 1 but 
> not for HBase 2:
> {code:java}
>   public static void completeSnapshot(Path snapshotDir, Path workingDir, 
> FileSystem fs,
> FileSystem workingDirFs, final Configuration conf)
> throws SnapshotCreationException, IOException {
> LOG.debug(
>   "Sentinel is done, just moving the snapshot from " + workingDir + " to 
> " + snapshotDir);
> URI workingURI = workingDirFs.getUri();
> URI rootURI = fs.getUri();
> if (
>   (!workingURI.getScheme().equals(rootURI.getScheme()) || 
> workingURI.getAuthority() == null
> || !workingURI.getAuthority().equals(rootURI.getAuthority())
> || workingURI.getUserInfo() == null //always true for hdfs://{cluster}
> || !workingURI.getUserInfo().equals(rootURI.getUserInfo())
> || !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
> evaluated due to short circuit above
> && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
> true, conf) // non-atomic rename operation
> ) {
>   throw new SnapshotCreationException("Failed to copy working directory(" 
> + workingDir
> + ") to completed directory(" + snapshotDir + ").");
> }
>   } {code}
> whereas for HBase 1:
> {code:java}
> // check UGI/userInfo
> if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
>   return true;
> }
> if (workingURI.getUserInfo() != null &&
> !workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
>   return true;
> }
>  {code}
> this causes shouldSkipRenameSnapshotDirectories() to return false if 
> workingURI and rootURI share the same filesystem, which would always lead to 
> atomic rename:
> {code:java}
> if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
> || !fs.rename(workingDir, snapshotDir))
>  && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
> conf)) {
>   throw new SnapshotCreationException("Failed to copy working directory(" + 
> workingDir
>   + ") to completed directory(" + snapshotDir + ").");
> } {code}
>  
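
A simplified sketch of the intended commit behavior (illustrative only, not the 
HBASE-28042 patch; the same-filesystem check here is deliberately naive):
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public final class SnapshotCommitSketch {

  private SnapshotCommitSketch() {
  }

  /** Prefer an atomic rename on the same filesystem; only fall back to a non-atomic copy. */
  public static void commit(FileSystem workingDirFs, Path workingDir, FileSystem fs,
    Path snapshotDir, Configuration conf) throws IOException {
    boolean sameFs = workingDirFs.getUri().equals(fs.getUri());
    if (sameFs && fs.rename(workingDir, snapshotDir)) {
      return; // atomic rename succeeded, no window for partially committed metadata
    }
    // different filesystems (or the rename failed): non-atomic copy as a last resort
    if (!FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, conf)) {
      throw new IOException(
        "Failed to commit snapshot from " + workingDir + " to " + snapshotDir);
    }
  }
}
{code}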



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28081) TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard

2023-09-25 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769010#comment-17769010
 ] 

Viraj Jasani commented on HBASE-28081:
--

This is more of a test issue. If I run testModifyTable1() only, it passes 100% 
of the time locally, whereas running all tests in 
TestSnapshotScannerHDFSAclController makes testModifyTable1() fail.
Separating the test should be good enough.

> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard
> ---
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28076) NPE on initialization error in RecoveredReplicationSourceShipper

2023-09-14 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28076.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
 Hadoop Flags: Reviewed
   Resolution: Fixed

> NPE on initialization error in RecoveredReplicationSourceShipper
> 
>
> Key: HBASE-28076
> URL: https://issues.apache.org/jira/browse/HBASE-28076
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.4.17, 2.5.5
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Minor
> Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> When we run into problems starting RecoveredReplicationSourceShipper, we try 
> to stop the reader thread which we haven't initialized yet, resulting in an 
> NPE.
> {noformat}
> ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
> Unexpected exception in redacted currentPath=hdfs://redacted
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.terminate(RecoveredReplicationSourceShipper.java:100)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.getRecoveredQueueStartPos(RecoveredReplicationSourceShipper.java:87)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.getStartPosition(RecoveredReplicationSourceShipper.java:62)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$tryStartNewShipper$3(ReplicationSource.java:349)
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.tryStartNewShipper(ReplicationSource.java:341)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:601)
>         at java.lang.Thread.run(Thread.java:750)
> {noformat}
> A simple null check should fix this.
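
A minimal sketch of such a guard (field and method names are illustrative, not the 
actual HBASE-28076 patch):
{code:java}
public class RecoveredShipperSketch {

  // may still be null if the shipper failed during initialization
  private Thread entryReaderThread;

  public void terminate(String reason) {
    if (entryReaderThread != null) { // guard: avoid NPE when startup never created the reader
      entryReaderThread.interrupt();
    }
    // ... continue with the rest of the shutdown path ...
  }
}
{code}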



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28050) RSProcedureDispatcher to fail-fast for krb auth failures

2023-09-13 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28050:
-
Summary: RSProcedureDispatcher to fail-fast for krb auth failures  (was: 
RSProcedureDispatcher to fail-fast for krb auth issues)

> RSProcedureDispatcher to fail-fast for krb auth failures
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.
>  
> Example log:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying...  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28050) RSProcedureDispatcher to fail-fast for krb auth issues

2023-09-13 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28050:
-
Summary: RSProcedureDispatcher to fail-fast for krb auth issues  (was: 
RSProcedureDispatcher to fail-fast for SaslException)

> RSProcedureDispatcher to fail-fast for krb auth issues
> --
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.
>  
> Example log:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying...  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27231) FSHLog should retry writing WAL entries when syncs to HDFS failed.

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764432#comment-17764432
 ] 

Viraj Jasani commented on HBASE-27231:
--

+1. Besides, the WAL implementation is internal anyway; this is a good 
improvement and it also includes a nice refactor from AsyncFSWAL to AbstractFSWAL. 
I am also +1 for backporting to branch-2 and branch-2.5.

> FSHLog should retry writing WAL entries when syncs to HDFS failed.
> --
>
> Key: HBASE-27231
> URL: https://issues.apache.org/jira/browse/HBASE-27231
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Affects Versions: 3.0.0-alpha-4
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 3.0.0-beta-1
>
>
> Just as HBASE-27223 said, basically, if the {{WAL}} write to HDFS fails, we 
> do not know whether the data has been persisted or not. The implementation 
> for {{AsyncFSWAL}} is to open a new writer, try to write the WAL entries 
> again, and then add logic in WAL split and replay to deal with duplicate 
> entries. {{FSHLog}}, however, does not have the same logic as 
> {{AsyncFSWAL}}: when {{ProtobufLogWriter.append}} or 
> {{ProtobufLogWriter.sync}} fails, {{FSHLog.sync}} immediately rethrows the 
> exception. We should implement the same retry logic as 
> {{AsyncFSWAL}}, so that {{WAL.sync}} can only throw {{TimeoutIOException}} and 
> we can uniformly abort the RegionServer when {{WAL.sync}} fails.
> The basic idea: because both {{FSHLog.RingBufferEventHandler}} and 
> {{AsyncFSWAL.consumeExecutor}} are single-threaded, we can reuse the logic 
> in {{AsyncWAL}}, move most of the {{AsyncWAL}} code upward to 
> {{AbstractFSWAL}}, and just adapt the {{SyncRunner}} in {{FSHLog}} to the 
> logic in {{AsyncWriter.sync}}. Once we do that, most logic in {{AsyncWAL}} 
> and {{FSHLog}} is unified; only how to sync the {{writer}} differs.
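
This is not HBase's code, just a generic sketch of the "retry until a single 
timeout" contract the description asks {{WAL.sync}} to honor (rolling to a new 
writer between attempts is assumed to happen elsewhere):
{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

public final class SyncRetrySketch {

  private SyncRetrySketch() {
  }

  /**
   * Keep retrying a sync until a deadline; if nothing succeeds, surface a single
   * timeout-style IOException so the caller can abort the RegionServer uniformly.
   */
  public static void syncWithRetry(Callable<Void> sync, long timeoutMs) throws IOException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    Exception last = null;
    while (System.currentTimeMillis() < deadline) {
      try {
        sync.call();
        return; // sync succeeded
      } catch (Exception e) {
        last = e; // a real implementation would roll to a new writer here
        try {
          Thread.sleep(100L); // brief pause before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while retrying WAL sync", ie);
        }
      }
    }
    throw new IOException("WAL sync did not succeed within " + timeoutMs + " ms", last);
  }
}
{code}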



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764327#comment-17764327
 ] 

Viraj Jasani commented on HBASE-28068:
--

The default value of such a config is debatable: if we don't set it to 
Long.MAX_VALUE, we are arguably breaking the compatibility of a feature, though 
not in any way that would harm client queries. Let's wait for some more opinions; 
otherwise I am in favor of starting with a small value for the config. It would be 
painful for someone to first hit the negative impact of a huge number of region 
merges getting stuck and only then discover that a config already exists, instead 
of us setting a sensible value from the beginning, limiting the number of merges a 
single normalizer run can trigger, so that no one has to face this situation.

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime 
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLE, 
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764326#comment-17764326
 ] 

Viraj Jasani edited comment on HBASE-28068 at 9/12/23 6:07 PM:
---

In fact, the config limit can be applied during plan computation (i.e. 
{_}computeMergeNormalizationPlans(){_}).

For instance, we can limit the size of rangeMembers here:
{code:java}
...
...
...

if (
  rangeMembers.isEmpty() // when there are no range members, seed the range with whatever
                         // we have. this way we're prepared in case the next region is
                         // 0-size.
    || (rangeMembers.size() == 1 && sumRangeMembersSizeMb == 0) // when there is only one
                                                                // region and the size is 0,
                                                                // seed the range with
                                                                // whatever we have.
    || regionSizeMb == 0 // always add an empty region to the current range.
    || (regionSizeMb + sumRangeMembersSizeMb <= avgRegionSizeMb)
) { // add the current region to the range when there's capacity remaining.
  rangeMembers.add(new NormalizationTarget(regionInfo, regionSizeMb));
  sumRangeMembersSizeMb += regionSizeMb;
  continue;
}

...
...
... {code}
Once {_}rangeMembers.size(){_} reaches the configured limit, we don't need to 
compute any further.

Though this is for the merge plan, it might also be improved in general and be 
applicable to _computeSplitNormalizationPlans()_ as well.
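
As a rough sketch of the batching idea (the property name 
hbase.normalizer.plan_region_limit comes from this discussion and is an assumption, 
not an existing HBase config; this is not the normalizer's real code):
{code:java}
import java.util.ArrayList;
import java.util.List;

public final class MergePlanBatching {

  private MergePlanBatching() {
  }

  /** Split a long run of empty-region sizes into merge batches no larger than the limit. */
  public static List<List<Integer>> batchEmptyRegions(List<Integer> regionSizesMb,
    int planRegionLimit) {
    List<List<Integer>> plans = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    for (int sizeMb : regionSizesMb) {
      current.add(sizeMb);
      if (current.size() >= planRegionLimit) {
        plans.add(current);          // close out this merge plan
        current = new ArrayList<>(); // and start a new one
      }
    }
    if (current.size() > 1) {        // a merge plan needs at least two regions
      plans.add(current);
    }
    return plans;
  }
}
{code}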


was (Author: vjasani):
In fact, the config limit can be applied during plan computation (i.e. 
{_}computeMergeNormalizationPlans(){_}).

For instance, we can limit the size of rangeMembers here:
{code:java}
...
...
...

if (
  rangeMembers.isEmpty() // when there are no range members, seed the range with whatever
                         // we have. this way we're prepared in case the next region is
                         // 0-size.
    || (rangeMembers.size() == 1 && sumRangeMembersSizeMb == 0) // when there is only one
                                                                // region and the size is 0,
                                                                // seed the range with
                                                                // whatever we have.
    || regionSizeMb == 0 // always add an empty region to the current range.
    || (regionSizeMb + sumRangeMembersSizeMb <= avgRegionSizeMb)
) { // add the current region to the range when there's capacity remaining.
  rangeMembers.add(new NormalizationTarget(regionInfo, regionSizeMb));
  sumRangeMembersSizeMb += regionSizeMb;
  continue;
}

...
...
... {code}
Once {_}rangeMembers.size(){_} reaches the configured limit, we don't need to 
compute any further. This is for the merge plan; it might be improved in general 
as well.

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPD

[jira] [Commented] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764326#comment-17764326
 ] 

Viraj Jasani commented on HBASE-28068:
--

In fact, the config limit can be applied during plan computation (i.e. 
{_}computeMergeNormalizationPlans(){_}).

For instance, we can limit the size of rangeMembers here:
{code:java}
...
...
...

if (
  rangeMembers.isEmpty() // when there are no range members, seed the range with whatever
                         // we have. this way we're prepared in case the next region is
                         // 0-size.
    || (rangeMembers.size() == 1 && sumRangeMembersSizeMb == 0) // when there is only one
                                                                // region and the size is 0,
                                                                // seed the range with
                                                                // whatever we have.
    || regionSizeMb == 0 // always add an empty region to the current range.
    || (regionSizeMb + sumRangeMembersSizeMb <= avgRegionSizeMb)
) { // add the current region to the range when there's capacity remaining.
  rangeMembers.add(new NormalizationTarget(regionInfo, regionSizeMb));
  sumRangeMembersSizeMb += regionSizeMb;
  continue;
}

...
...
... {code}
Once {_}rangeMembers.size(){_} reaches the configured limit, we don't need to 
compute any further. This is for the merge plan; it might be improved in general 
as well.

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime 
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLE, 
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28068:
-
Fix Version/s: 2.5.6
   3.0.0-beta-1
   (was: 3.0.0)

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime 
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLE, 
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764315#comment-17764315
 ] 

Viraj Jasani edited comment on HBASE-28068 at 9/12/23 5:49 PM:
---

{quote}I was thinking of exposing a property like 
_*hbase.normalizer.plan_region_limit*_ (limit on every plan)
{quote}
Thanks [~rkrahul324], this sounds great!

Since we know the consequences of an unlimited number of region merges being 
triggered by the normalizer, the default value can be kept as low as 10 (or a bit 
higher, but not higher than 50), so only 10 regions are merged at a time, then 10 
more in the next run, and so on.

IMO, we don't need to keep the default value as Long.MAX_VALUE. Even though it 
would take many normalizer runs to completely fix ~25k regions of size 0, that is 
fine compared to the procedure resources being heavily occupied by a single 
normalizer run.

 

WDYT [~ndimiduk] [~zhangduo] [~apurtell] [~rvaleti]?


was (Author: vjasani):
{quote}I was thinking of exposing a property like 
_*hbase.normalizer.plan_region_limit*_ (limit on every plan)
{quote}
Sounds good, though given that we know the consequences of an unlimited number of 
merges being triggered, the default value can be kept as low as 10, so only 10 
regions are merged at a time, then 10 more in the next run, and so on.

IMO, we don't need to keep the default value as Long.MAX_VALUE.

WDYT [~zhangduo] [~apurtell] [~rvaleti]?

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 3.0.0
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime 
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLE, 
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28068) Normalizer should batch merging 0 sized/empty regions

2023-09-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764315#comment-17764315
 ] 

Viraj Jasani commented on HBASE-28068:
--

{quote}I was thinking of exposing a property like 
_*hbase.normalizer.plan_region_limit*_ (limit on every plan)
{quote}
Sounds good, though given that we know the consequences of an unlimited number of 
merges being triggered, the default value can be kept as low as 10, so only 10 
regions are merged at a time, then 10 more in the next run, and so on.

IMO, we don't need to keep the default value as Long.MAX_VALUE.

WDYT [~zhangduo] [~apurtell] [~rvaleti]?

> Normalizer should batch merging 0 sized/empty regions
> -
>
> Key: HBASE-28068
> URL: https://issues.apache.org/jira/browse/HBASE-28068
> Project: HBase
>  Issue Type: Improvement
>  Components: Normalizer
>Affects Versions: 2.5.5
>Reporter: Ravi Kishore Valeti
>Assignee: Rahul Kumar
>Priority: Minor
> Fix For: 2.6.0, 3.0.0
>
>
> In our production environment, while investigating an issue, we observed that 
> the Normalizer had scheduled one single merge procedure handing an RS 27K+ 
> empty regions of a table to merge (the result of a failed copy-table job that 
> left 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and the procedure framework eventually 
> bailed out after ~40 mins. This was happening on each normalizer run until we 
> deleted the table manually.
> Logs
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED 
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,.', 
> STARTKEY => '', ENDKEY => ''},{*}regionSizeMb=0{*}], 
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b, 
> NAME => 'TEST.TEST_TABLE,', STARTKEY => 'XXYY', ENDKEY => 
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time 12.4850 sec
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run 
> time *40 mins, 58.055 sec*
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356, 
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true, 
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime 
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META, 
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLE, 
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-09-11 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-28050:
-
Description: 
As discussed on the parent Jira, let's mark the remote procedures as failed when 
we encounter SaslException (GSS initiate failed), as this belongs to the category 
of known IOExceptions where we are certain that the request has not reached the 
target regionserver yet.

This should help release dispatcher threads for other 
ExecuteProceduresRemoteCall executions.

 

Example log:
{code:java}
2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=0, retrying...  {code}

  was:
As discussed on the parent Jira, let's mark the remote procedures as failed when 
we encounter SaslException (GSS initiate failed), as this belongs to the category 
of known IOExceptions where we are certain that the request has not reached the 
target regionserver yet.

This should help release dispatcher threads for other 
ExecuteProceduresRemoteCall executions.


> RSProcedureDispatcher to fail-fast for SaslException
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.
>  
> Example log:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying...  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-09-11 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HBASE-28050:


Assignee: Viraj Jasani

> RSProcedureDispatcher to fail-fast for SaslException
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27231) FSHLog should retry writing WAL entries when syncs to HDFS failed.

2023-09-10 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763536#comment-17763536
 ] 

Viraj Jasani commented on HBASE-27231:
--

[~comnetwork] [~zhangduo] shall we backport this for upcoming 2.6 and 2.5 
releases?

cc [~apurtell] 

> FSHLog should retry writing WAL entries when syncs to HDFS failed.
> --
>
> Key: HBASE-27231
> URL: https://issues.apache.org/jira/browse/HBASE-27231
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Affects Versions: 3.0.0-alpha-4
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 3.0.0-beta-1
>
>
> Just as HBASE-27223 said, basically, if the {{WAL}} write to HDFS fails, we 
> do not know whether the data has been persisted or not. The implementation 
> for {{AsyncFSWAL}} is to open a new writer, try to write the WAL entries 
> again, and then add logic in WAL split and replay to deal with duplicate 
> entries. {{FSHLog}}, however, does not have the same logic as 
> {{AsyncFSWAL}}: when {{ProtobufLogWriter.append}} or 
> {{ProtobufLogWriter.sync}} fails, {{FSHLog.sync}} immediately rethrows the 
> exception. We should implement the same retry logic as 
> {{AsyncFSWAL}}, so that {{WAL.sync}} can only throw {{TimeoutIOException}} and 
> we can uniformly abort the RegionServer when {{WAL.sync}} fails.
> The basic idea: because both {{FSHLog.RingBufferEventHandler}} and 
> {{AsyncFSWAL.consumeExecutor}} are single-threaded, we can reuse the logic 
> in {{AsyncWAL}}, move most of the {{AsyncWAL}} code upward to 
> {{AbstractFSWAL}}, and just adapt the {{SyncRunner}} in {{FSHLog}} to the 
> logic in {{AsyncWriter.sync}}. Once we do that, most logic in {{AsyncWAL}} 
> and {{FSHLog}} is unified; only how to sync the {{writer}} differs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-08-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760044#comment-17760044
 ] 

Viraj Jasani edited comment on HBASE-28050 at 8/29/23 6:25 PM:
---

The changes are not complicated but this requires thorough testing.

(sample patch, exact changes can be different depending on test results)
{code:java}
diff --git 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
index ac2c971b02..eeeab029a6 100644
--- 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
+++ 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
@@ -22,6 +22,7 @@ import java.lang.Thread.UncaughtExceptionHandler;
 import java.util.List;
 import java.util.Set;
 import java.util.concurrent.TimeUnit;
+import javax.security.sasl.SaslException;
 import org.apache.hadoop.hbase.CallQueueTooBigException;
 import org.apache.hadoop.hbase.DoNotRetryIOException;
 import org.apache.hadoop.hbase.ServerName;
@@ -287,6 +288,11 @@ public class RSProcedureDispatcher extends 
RemoteProcedureDispatcher {code}

> RSProcedureDispatcher to fail-fast for SaslException
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760052#comment-17760052
 ] 

Viraj Jasani edited comment on HBASE-28048 at 8/29/23 6:24 PM:
---

Let's assume we are moving all regions from server A to server B. If server A 
is not reachable and we fail all TRSPs for region moves from A to B, the only 
alternative left to the operator or software is stopping server A 
non-gracefully so that a new SCP for server A can be processed by the master.

This should still be okay, I guess, assuming the remaining servers are 
responsive to requests from the master, and hence procedures are overall making 
good progress (instead of some of them getting stuck).

 

Created sub-task HBASE-28050 to specifically deal with SaslException.


was (Author: vjasani):
Let's assume we are moving all regions from server A to server B. If server A 
is not reachable and we fail all TRSPs for region moves from A to B, the only 
alternative left to the operator or software is stopping server A 
non-gracefully so that a new SCP for server A can be processed by the master.

This should still be okay, I guess, assuming the remaining servers are 
responsive to requests from the master, and hence procedures are overall making 
good progress (instead of some of them getting stuck).

> RSProcedureDispatcher to abort executing request after configurable retries
> ---
>
> Key: HBASE-28048
> URL: https://issues.apache.org/jira/browse/HBASE-28048
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying... {code}
>  
>  
> If the remote execution results in IOException, the dispatcher attempts to 
> schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, 
> numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting 
> it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck 
> condition, will fill logs
>       // with failed attempts. None of our backoff classes -- RetryCounter or 
> ClientBackoffPolicy
>       // -- fit here nicely so just do something simple; increment by 
> rsRpcRetryInterval millis *
>       // retry^2 on each try
>       // up to max of 10 seconds (don't want to back off too much in case of 
> situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
>  {code}
>  
>  
> Even though we try to provide backoff while retrying, max wait time is 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
> 10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
>  
>  
> This results in endless loop of retries, until either the underlying issue is 
> fixed (e.g. krb issue in this case) or regionserver is killed and the ongoing 
> open/close region procedure (and perhaps entire SCP) for the affected 
> regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.third

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760052#comment-17760052
 ] 

Viraj Jasani commented on HBASE-28048:
--

Let's assume we are moving all regions from server A to server B. If server A 
is not reachable and we fail all TRSPs for region moves from A to B, the only 
alternative left to the operator or software is stopping server A 
non-gracefully so that a new SCP for server A can be processed by the master.

This should still be okay, I guess, assuming the remaining servers are 
responsive to requests from the master, and hence procedures are overall making 
good progress (instead of some of them getting stuck).

> RSProcedureDispatcher to abort executing request after configurable retries
> ---
>
> Key: HBASE-28048
> URL: https://issues.apache.org/jira/browse/HBASE-28048
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying... {code}
>  
>  
> If the remote execution results in IOException, the dispatcher attempts to 
> schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, 
> numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting 
> it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck 
> condition, will fill logs
>       // with failed attempts. None of our backoff classes -- RetryCounter or 
> ClientBackoffPolicy
>       // -- fit here nicely so just do something simple; increment by 
> rsRpcRetryInterval millis *
>       // retry^2 on each try
>       // up to max of 10 seconds (don't want to back off too much in case of 
> situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
>  {code}
>  
>  
> Even though we try to provide backoff while retrying, max wait time is 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
> 10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
>  
>  
> This results in endless loop of retries, until either the underlying issue is 
> fixed (e.g. krb issue in this case) or regionserver is killed and the ongoing 
> open/close region procedure (and perhaps entire SCP) for the affected 
> regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] 

[jira] [Comment Edited] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-08-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760044#comment-17760044
 ] 

Viraj Jasani edited comment on HBASE-28050 at 8/29/23 5:21 PM:
---

The changes are not complicated but this requires thorough testing.

(sample patch, exact changes can be different)
{code:java}
diff --git 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
index ac2c971b02..eeeab029a6 100644
--- 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
+++ 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
@@ -22,6 +22,7 @@ import java.lang.Thread.UncaughtExceptionHandler;
 import java.util.List;
 import java.util.Set;
 import java.util.concurrent.TimeUnit;
+import javax.security.sasl.SaslException;
 import org.apache.hadoop.hbase.CallQueueTooBigException;
 import org.apache.hadoop.hbase.DoNotRetryIOException;
 import org.apache.hadoop.hbase.ServerName;
@@ -287,6 +288,11 @@ public class RSProcedureDispatcher extends 
RemoteProcedureDispatcher {code}

> RSProcedureDispatcher to fail-fast for SaslException
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-08-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760044#comment-17760044
 ] 

Viraj Jasani commented on HBASE-28050:
--

The changes are not complicated but this requires thorough testing
{code:java}
diff --git 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
index ac2c971b02..eeeab029a6 100644
--- 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
+++ 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
@@ -22,6 +22,7 @@ import java.lang.Thread.UncaughtExceptionHandler;
 import java.util.List;
 import java.util.Set;
 import java.util.concurrent.TimeUnit;
+import javax.security.sasl.SaslException;
 import org.apache.hadoop.hbase.CallQueueTooBigException;
 import org.apache.hadoop.hbase.DoNotRetryIOException;
 import org.apache.hadoop.hbase.ServerName;
@@ -287,6 +288,11 @@ public class RSProcedureDispatcher extends 
RemoteProcedureDispatcher {code}

> RSProcedureDispatcher to fail-fast for SaslException
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Priority: Major
>
> As discussed on the parent Jira, let's mark the remote procedures as failed 
> when we encounter SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-08-29 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28050:


 Summary: RSProcedureDispatcher to fail-fast for SaslException
 Key: HBASE-28050
 URL: https://issues.apache.org/jira/browse/HBASE-28050
 Project: HBase
  Issue Type: Sub-task
Reporter: Viraj Jasani


As discussed on the parent Jira, let's mark the remote procedures as failed 
when we encounter SaslException (GSS initiate failed), as this belongs to the 
category of known IOExceptions where we are certain that the request has not 
reached the target regionserver yet.

This should help release dispatcher threads for other 
ExecuteProceduresRemoteCall executions.
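
A minimal, self-contained sketch of the idea (names here are hypothetical; the 
real change would hook into RSProcedureDispatcher's existing 
unwrapException/scheduleForRetry/remoteCallFailed path rather than a standalone 
class):

{code:java}
// Sketch only, not the committed patch: walk the cause chain of the IOException
// returned by the remote call and decide whether to fail fast instead of retrying.
import java.io.IOException;
import javax.security.sasl.SaslException;

public final class SaslFailFastCheck {

  /** True if a SaslException (e.g. "GSS initiate failed") is anywhere in the cause chain. */
  static boolean isSaslFailure(IOException e) {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (t instanceof SaslException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    IOException wrapped = new IOException("Call to rs1:61020 failed on local exception",
      new SaslException("GSS initiate failed"));
    // A dispatcher could use a check like this to return false from scheduleForRetry(...)
    // so that remoteCallFailed(...) runs once, releasing the dispatcher thread.
    System.out.println(isSaslFailure(wrapped)); // true
  }
}
{code}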



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-28 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759843#comment-17759843
 ] 

Viraj Jasani commented on HBASE-28048:
--

I agree, we already have such logic for ServerNotRunningYetException, 
DoNotRetryIOException, CallQueueTooBigException.

Adding SaslException should be relatively straightforward.

 

I still wonder if we could track how many dispatcher threads are occupied with 
a given regionserver (essentially group the in-flight calls by server name and 
check how many are busy serving each one). If we realize that a considerably 
higher number of threads are busy performing region transitions against the 
same target server, with the majority of them having already exhausted a high 
number of retries, perhaps it would make sense to fail them, even if that leads 
to a master abort. The key is to not saturate the dispatcher threads on only a 
single or a few problematic regionservers.

At worst, we will see inconsistencies when the new master takes over as active, 
and that requires operational intervention, which is still fine compared to the 
majority of dispatcher threads being occupied with work that is just not making 
any progress.
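
A rough illustration of that grouping idea (everything below is hypothetical: 
the record, threshold, and method names are stand-ins, not existing HBase code):

{code:java}
// Illustration only: group in-flight dispatcher calls by target server and flag
// servers where most of the calls have already exceeded a retry budget.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public final class StuckServerDetector {

  record InFlightCall(String serverName, int attemptsSoFar) {}

  /** Servers where more than half of the in-flight calls exceeded maxAttempts. */
  static List<String> findStuckServers(List<InFlightCall> calls, int maxAttempts) {
    Map<String, List<InFlightCall>> byServer =
      calls.stream().collect(Collectors.groupingBy(InFlightCall::serverName));
    return byServer.entrySet().stream()
      .filter(e -> e.getValue().stream()
        .filter(c -> c.attemptsSoFar() > maxAttempts).count() * 2 > e.getValue().size())
      .map(Map.Entry::getKey)
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<InFlightCall> calls = List.of(
      new InFlightCall("rs1,61020,1692930044498", 217),
      new InFlightCall("rs1,61020,1692930044498", 193),
      new InFlightCall("rs2,61020,1692930044499", 1));
    // Only rs1 shows up: most of its calls are far past the budget.
    System.out.println(findStuckServers(calls, 100));
  }
}
{code}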

> RSProcedureDispatcher to abort executing request after configurable retries
> ---
>
> Key: HBASE-28048
> URL: https://issues.apache.org/jira/browse/HBASE-28048
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying... {code}
>  
>  
> If the remote execution results in IOException, the dispatcher attempts to 
> schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, 
> numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting 
> it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck 
> condition, will fill logs
>       // with failed attempts. None of our backoff classes -- RetryCounter or 
> ClientBackoffPolicy
>       // -- fit here nicely so just do something simple; increment by 
> rsRpcRetryInterval millis *
>       // retry^2 on each try
>       // up to max of 10 seconds (don't want to back off too much in case of 
> situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
>  {code}
>  
>  
> Even though we try to provide backoff while retrying, max wait time is 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
> 10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
>  
>  
> This results in endless loop of retries, until either the underlying issue is 
> fixed (e.g. krb issue in this case) or regionserver is killed and the ongoing 
> open/close region procedure (and perhaps entire SCP) for the affected 
> regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=193, retrying...

[jira] [Commented] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-28 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759795#comment-17759795
 ] 

Viraj Jasani commented on HBASE-28048:
--

We already have relevant TODOs :)
{code:java}
try {
  sendRequest(getServerName(), request.build());
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
remoteCallFailed(procedureEnv, e);
  }
} {code}
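
A tiny standalone sketch of what "use numberOfAttemptsSoFar" could mean 
behaviourally (the cap and its "0 means unlimited" semantics are assumptions 
for illustration, not an existing config):

{code:java}
// Bounded-retry decision: once the configured budget is exhausted, the caller would
// invoke remoteCallFailed(...) instead of scheduleForRetry(...).
public final class RetryBudget {

  static boolean shouldRetry(int numberOfAttemptsSoFar, int maxAttempts) {
    // Zero or negative cap keeps today's unlimited-retry behaviour.
    return maxAttempts <= 0 || numberOfAttemptsSoFar < maxAttempts;
  }

  public static void main(String[] args) {
    System.out.println(shouldRetry(217, 10)); // false -> give up, fail the procedure
    System.out.println(shouldRetry(3, 10));   // true  -> schedule another attempt
  }
}
{code}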

> RSProcedureDispatcher to abort executing request after configurable retries
> ---
>
> Key: HBASE-28048
> URL: https://issues.apache.org/jira/browse/HBASE-28048
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying... {code}
>  
>  
> If the remote execution results in IOException, the dispatcher attempts to 
> schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, 
> numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting 
> it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck 
> condition, will fill logs
>       // with failed attempts. None of our backoff classes -- RetryCounter or 
> ClientBackoffPolicy
>       // -- fit here nicely so just do something simple; increment by 
> rsRpcRetryInterval millis *
>       // retry^2 on each try
>       // up to max of 10 seconds (don't want to back off too much in case of 
> situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
>  {code}
>  
>  
> Even though we try to provide backoff while retrying, max wait time is 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
> this.numberOfAttemptsSoFar),
> 10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
>  
>  
> This results in endless loop of retries, until either the underlying issue is 
> fixed (e.g. krb issue in this case) or regionserver is killed and the ongoing 
> open/close region procedure (and perhaps entire SCP) for the affected 
> regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 faile

[jira] [Created] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-28 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28048:


 Summary: RSProcedureDispatcher to abort executing request after 
configurable retries
 Key: HBASE-28048
 URL: https://issues.apache.org/jira/browse/HBASE-28048
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
Reporter: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


In a recent incident, we observed that RSProcedureDispatcher continues 
executing region open/close procedures with unbounded retries even in the 
presence of known failures like GSS initiate failure:

 
{code:java}
2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=0, retrying... {code}
 

 

If the remote execution results in IOException, the dispatcher attempts to 
schedule the procedure for further retries:

 
{code:java}
    private boolean scheduleForRetry(IOException e) {
      LOG.debug("Request to {} failed, try={}", serverName, 
numberOfAttemptsSoFar, e);
      // Should we wait a little before retrying? If the server is starting 
it's yes.
      ...
      ...
      ...
      numberOfAttemptsSoFar++;
      // Add some backoff here as the attempts rise otherwise if a stuck 
condition, will fill logs
      // with failed attempts. None of our backoff classes -- RetryCounter or 
ClientBackoffPolicy
      // -- fit here nicely so just do something simple; increment by 
rsRpcRetryInterval millis *
      // retry^2 on each try
      // up to max of 10 seconds (don't want to back off too much in case of 
situation change).
      submitTask(this,
        Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
this.numberOfAttemptsSoFar),
          10 * 1000),
        TimeUnit.MILLISECONDS);
      return true;
    }
 {code}
 

 

Even though we try to provide backoff while retrying, max wait time is 10s:

 
{code:java}
submitTask(this,
  Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
this.numberOfAttemptsSoFar),
10 * 1000),
  TimeUnit.MILLISECONDS); {code}
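
For illustration, a small worked example of that capped quadratic backoff 
(assuming, purely for illustration, rsRpcRetryInterval = 100 ms):

{code:java}
// Worked example of min(rsRpcRetryInterval * n^2, 10s). The 100 ms interval is an
// assumed value; the real one comes from configuration.
public final class BackoffDemo {
  public static void main(String[] args) {
    long rsRpcRetryInterval = 100; // ms, assumed for this example
    for (int attempt : new int[] {1, 2, 5, 10, 50, 217}) {
      long delay = Math.min(rsRpcRetryInterval * attempt * attempt, 10 * 1000);
      System.out.println("attempt " + attempt + " -> " + delay + " ms");
    }
    // From attempt 10 onward every delay is capped at 10000 ms, so hundreds of
    // retries still arrive only seconds apart and keep dispatcher threads busy.
  }
}
{code}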
 

 

This results in endless loop of retries, until either the underlying issue is 
fixed (e.g. krb issue in this case) or regionserver is killed and the ongoing 
open/close region procedure (and perhaps entire SCP) for the affected 
regionserver is sidelined manually.
{code:java}
2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=217, retrying...
2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=193, retrying...
2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=266, retrying...
2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=266, retrying...{code}
 

While external issues like "krb ticket expiry" require operator intervention, 
it is not prudent to fill up the active handlers with endless retries while 
attempting to execute RPCs against only a single affected regionserver. This 
eventually degrades the overall cluster state, specifically in the event of 
multiple regionserver restarts resulting from planned activities.

One of the resolutions here would be:
 # Configure max retries as part of ExecuteProceduresRequest request (or it 
could be part of RemoteProcedureRequest)
 # This retry count should be used b

[jira] [Resolved] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-08-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28042.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Snapshot corruptions due to non-atomic rename within same filesystem
> 
>
> Key: HBASE-28042
> URL: https://issues.apache.org/jira/browse/HBASE-28042
> Project: HBase
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Sequence of events that can lead to snapshot corruptions:
>  # Create snapshot using admin command
>  # Active master triggers async snapshot creation
>  # If the snapshot operation doesn't complete within 5 min, client gets 
> exception
> {code:java}
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms 
>   {code}
>  # Client initiates snapshot deletion after this error
>  # In the snapshot completion/commit phase, the files are moved from tmp to 
> final dir.
>  # Snapshot delete and snapshot commit operations can cause corruption by 
> leaving incomplete metadata:
>  * [Snapshot commit] create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot delete from client]  delete 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot commit]  create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"
>  
> The changes introduced by HBASE-21098 perform atomic rename for hbase 1 but 
> not for hbase 2:
> {code:java}
>   public static void completeSnapshot(Path snapshotDir, Path workingDir, 
> FileSystem fs,
> FileSystem workingDirFs, final Configuration conf)
> throws SnapshotCreationException, IOException {
> LOG.debug(
>   "Sentinel is done, just moving the snapshot from " + workingDir + " to 
> " + snapshotDir);
> URI workingURI = workingDirFs.getUri();
> URI rootURI = fs.getUri();
> if (
>   (!workingURI.getScheme().equals(rootURI.getScheme()) || 
> workingURI.getAuthority() == null
> || !workingURI.getAuthority().equals(rootURI.getAuthority())
> || workingURI.getUserInfo() == null //always true for hdfs://{cluster}
> || !workingURI.getUserInfo().equals(rootURI.getUserInfo())
> || !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
> evaluated due to short circuit above
> && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
> true, conf) // non-atomic rename operation
> ) {
>   throw new SnapshotCreationException("Failed to copy working directory(" 
> + workingDir
> + ") to completed directory(" + snapshotDir + ").");
> }
>   } {code}
> whereas for hbase 1
> {code:java}
> // check UGI/userInfo
> if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
>   return true;
> }
> if (workingURI.getUserInfo() != null &&
> !workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
>   return true;
> }
>  {code}
> this causes shouldSkipRenameSnapshotDirectories() to return false if 
> workingURI and rootURI share the same filesystem, which would always lead to 
> atomic rename:
> {code:java}
> if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
> || !fs.rename(workingDir, snapshotDir))
>  && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
> conf)) {
>   throw new SnapshotCreationException("Failed to copy working directory(" + 
> workingDir
>   + ") to completed directory(" + snapshotDir + ").");
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-08-23 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758257#comment-17758257
 ] 

Viraj Jasani commented on HBASE-28042:
--

Thanks to [~ukumar] for the detailed find

> Snapshot corruptions due to non-atomic rename within same filesystem
> 
>
> Key: HBASE-28042
> URL: https://issues.apache.org/jira/browse/HBASE-28042
> Project: HBase
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Sequence of events that can lead to snapshot corruptions:
>  # Create snapshot using admin command
>  # Active master triggers async snapshot creation
>  # If the snapshot operation doesn't complete within 5 min, client gets 
> exception
> {code:java}
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms 
>   {code}
>  # Client initiates snapshot deletion after this error
>  # In the snapshot completion/commit phase, the files are moved from tmp to 
> final dir.
>  # Snapshot delete and snapshot commit operations can cause corruption by 
> leaving incomplete metadata:
>  * [Snapshot commit] create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot delete from client]  delete 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot commit]  create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"
>  
> The changes introduced by HBASE-21098 perform atomic rename for hbase 1 but 
> not for hbase 2:
> {code:java}
>   public static void completeSnapshot(Path snapshotDir, Path workingDir, 
> FileSystem fs,
> FileSystem workingDirFs, final Configuration conf)
> throws SnapshotCreationException, IOException {
> LOG.debug(
>   "Sentinel is done, just moving the snapshot from " + workingDir + " to 
> " + snapshotDir);
> URI workingURI = workingDirFs.getUri();
> URI rootURI = fs.getUri();
> if (
>   (!workingURI.getScheme().equals(rootURI.getScheme()) || 
> workingURI.getAuthority() == null
> || !workingURI.getAuthority().equals(rootURI.getAuthority())
> || workingURI.getUserInfo() == null //always true for hdfs://{cluster}
> || !workingURI.getUserInfo().equals(rootURI.getUserInfo())
> || !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
> evaluated due to short circuit above
> && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
> true, conf) // non-atomic rename operation
> ) {
>   throw new SnapshotCreationException("Failed to copy working directory(" 
> + workingDir
> + ") to completed directory(" + snapshotDir + ").");
> }
>   } {code}
> whereas for hbase 1
> {code:java}
> // check UGI/userInfo
> if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
>   return true;
> }
> if (workingURI.getUserInfo() != null &&
> !workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
>   return true;
> }
>  {code}
> this causes shouldSkipRenameSnapshotDirectories() to return false if 
> workingURI and rootURI share the same filesystem, which would always lead to 
> atomic rename:
> {code:java}
> if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
> || !fs.rename(workingDir, snapshotDir))
>  && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
> conf)) {
>   throw new SnapshotCreationException("Failed to copy working directory(" + 
> workingDir
>   + ") to completed directory(" + snapshotDir + ").");
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-21098) Improve Snapshot Performance with Temporary Snapshot Directory when rootDir on S3

2023-08-23 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758251#comment-17758251
 ] 

Viraj Jasani commented on HBASE-21098:
--

Can someone please provide a review of HBASE-28042? PR: 
[https://github.com/apache/hbase/pull/5369]

Thanks

> Improve Snapshot Performance with Temporary Snapshot Directory when rootDir 
> on S3
> -
>
> Key: HBASE-21098
> URL: https://issues.apache.org/jira/browse/HBASE-21098
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha-1, 1.4.8, 2.1.1
>Reporter: Tyler Mi
>Assignee: Tyler Mi
>Priority: Major
>  Labels: s3
> Fix For: 3.0.0-alpha-1, 2.2.0, 1.4.9
>
> Attachments: HBASE-21098.branch-1.001.patch, 
> HBASE-21098.branch-1.002.patch, HBASE-21098.master.001.patch, 
> HBASE-21098.master.002.patch, HBASE-21098.master.003.patch, 
> HBASE-21098.master.004.patch, HBASE-21098.master.005.patch, 
> HBASE-21098.master.006.patch, HBASE-21098.master.007.patch, 
> HBASE-21098.master.008.patch, HBASE-21098.master.009.patch, 
> HBASE-21098.master.010.patch, HBASE-21098.master.011.patch, 
> HBASE-21098.master.012.patch, HBASE-21098.master.013.patch
>
>
> When using Apache HBase, the snapshot feature can be used for point-in-time 
> recovery. To do this, HBase creates a manifest of all the files in all of the 
> Regions so that those files can be referenced again when a user restores a 
> snapshot. With HBase's S3 storage mode, developers can store their data 
> off-cluster on Amazon S3. However, utilizing S3 as a file system is 
> inefficient for some operations, namely renames. Most Hadoop ecosystem 
> applications use an atomic rename as a method of committing data. With S3, 
> however, a rename is a separate copy and then a delete of every file, which 
> is no longer atomic and, in fact, quite costly. In addition, puts and deletes 
> on S3 have latency issues that traditional filesystems do not encounter when 
> manipulating the region snapshots to consolidate into a single manifest. When 
> HBase on S3 users have a significant number of regions, puts, deletes, and 
> renames (the final commit stage of the snapshot) become the bottleneck, 
> causing snapshots to take many minutes or even hours to complete.
> The purpose of this patch is to increase the overall performance of snapshots 
> while utilizing HBase on S3 through the use of a temporary directory for the 
> snapshots that exists on a traditional filesystem like HDFS to circumvent the 
> bottlenecks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-08-23 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28042:


 Summary: Snapshot corruptions due to non-atomic rename within same 
filesystem
 Key: HBASE-28042
 URL: https://issues.apache.org/jira/browse/HBASE-28042
 Project: HBase
  Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


Sequence of events that can lead to snapshot corruptions:
 # Create snapshot using admin command
 # Active master triggers async snapshot creation
 # If the snapshot operation doesn't complete within 5 min, client gets 
exception

{code:java}
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms   
{code}
 # Client initiates snapshot deletion after this error
 # In the snapshot completion/commit phase, the files are moved from tmp to 
final dir.
 # Snapshot delete and snapshot commit operations can cause corruption by 
leaving incomplete metadata:

 * [Snapshot commit] create 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
 * [Snapshot delete from client]  delete 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
 * [Snapshot commit]  create 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"

 

The changes introduced by HBASE-21098 perform atomic rename for hbase 1 but 
not for hbase 2:
{code:java}
  public static void completeSnapshot(Path snapshotDir, Path workingDir, 
FileSystem fs,
FileSystem workingDirFs, final Configuration conf)
throws SnapshotCreationException, IOException {
LOG.debug(
  "Sentinel is done, just moving the snapshot from " + workingDir + " to " 
+ snapshotDir);
URI workingURI = workingDirFs.getUri();
URI rootURI = fs.getUri();
if (
  (!workingURI.getScheme().equals(rootURI.getScheme()) || 
workingURI.getAuthority() == null
|| !workingURI.getAuthority().equals(rootURI.getAuthority())
|| workingURI.getUserInfo() == null //always true for hdfs://{cluster}
|| !workingURI.getUserInfo().equals(rootURI.getUserInfo())
|| !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
evaluated due to short circuit above
&& !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
true, conf) // non-atomic rename operation
) {
  throw new SnapshotCreationException("Failed to copy working directory(" + 
workingDir
+ ") to completed directory(" + snapshotDir + ").");
}
  } {code}
whereas for hbase 1
{code:java}
// check UGI/userInfo
if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
  return true;
}
if (workingURI.getUserInfo() != null &&
!workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
  return true;
}
 {code}
this causes shouldSkipRenameSnapshotDirectories() to return false if workingURI 
and rootURI share the same filesystem, which would always lead to atomic rename:
{code:java}
if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
|| !fs.rename(workingDir, snapshotDir))
 && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
conf)) {
  throw new SnapshotCreationException("Failed to copy working directory(" + 
workingDir
  + ") to completed directory(" + snapshotDir + ").");
} {code}
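
A self-contained sketch of the hbase-1-style comparison described above, for 
illustration only (it mirrors the quoted snippets and is not the committed fix):

{code:java}
// Rename is only skipped (falling back to copy) when the working dir and the root
// dir are clearly on different filesystems; a null userInfo on both sides must not
// force the copy path, unlike the hbase-2 snippet above.
import java.net.URI;
import java.util.Objects;

public final class SnapshotRenameCheck {

  /** True -> fall back to copy; false -> same filesystem, atomic rename can be used. */
  static boolean shouldSkipRename(URI workingURI, URI rootURI) {
    if (!Objects.equals(workingURI.getScheme(), rootURI.getScheme())) {
      return true;
    }
    if (!Objects.equals(workingURI.getAuthority(), rootURI.getAuthority())) {
      return true;
    }
    return !Objects.equals(workingURI.getUserInfo(), rootURI.getUserInfo());
  }

  public static void main(String[] args) {
    URI working = URI.create("hdfs://cluster/hbase/.hbase-snapshot/.tmp/snap1");
    URI root = URI.create("hdfs://cluster/hbase/.hbase-snapshot/snap1");
    System.out.println(shouldSkipRename(working, root)); // false -> atomic rename
  }
}
{code}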
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HBASE-28040) hbck2 bypass should provide an option to bypass existing top N procedures

2023-08-22 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HBASE-28040:


Assignee: Nishtha Shah

> hbck2 bypass should provide an option to bypass existing top N procedures
> -
>
> Key: HBASE-28040
> URL: https://issues.apache.org/jira/browse/HBASE-28040
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Nishtha Shah
>Priority: Major
>
> For the degraded cluster state where several SCPs and underlying TRSPs are 
> stuck due to network issues, it becomes difficult to resolve RITs and recover 
> regions from SCPs.
> In order to bypass stuck procedures, we need to extract and then provide a 
> list of proc ids from list_procedures or the procedures.jsp page. If we could 
> provide an option to bypass the initial N procedures listed on the 
> procedures.jsp page, that would be really helpful.
> Implementation steps:
>  # Similar to BypassProcedureRequest, provide BypassTopNProcedureRequest with 
> only attribute value as N
>  # MasterRpcServices to provide new API: 
>  # 
> {code:java}
> bypassProcedure(RpcController controller,
>   MasterProtos.BypassTopNProcedureRequest request) {code}
>  # Hbck to provide utility to consume this master rpc
>  # HBCK2 to use new hbck utility for bypassing top N requests
>  
> For this new option, the top N procedures matter, hence they should follow a 
> sorting order similar to the one used by procedures.jsp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28040) hbck2 bypass should provide an option to bypass existing top N procedures

2023-08-22 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28040:


 Summary: hbck2 bypass should provide an option to bypass existing 
top N procedures
 Key: HBASE-28040
 URL: https://issues.apache.org/jira/browse/HBASE-28040
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


For the degraded cluster state where several SCPs and underlying TRSPs are 
stuck due to network issues, it becomes difficult to resolve RITs and recover 
regions from SCPs.

In order to bypass stuck procedures, we need to extract and then provide a list 
of proc ids from list_procedures or the procedures.jsp page. If we could provide 
an option to bypass the initial N procedures listed on the procedures.jsp page, 
that would be really helpful.

Implementation steps:
 # Similar to BypassProcedureRequest, provide BypassTopNProcedureRequest with 
only attribute value as N
 # MasterRpcServices to provide new API: 
 # 
{code:java}
bypassProcedure(RpcController controller,
  MasterProtos.BypassTopNProcedureRequest request) {code}

 # Hbck to provide utility to consume this master rpc
 # HBCK2 to use new hbck utility for bypassing top N requests

 

For this new option, the top N procedures matter, hence they should follow a 
sorting order similar to the one used by procedures.jsp.
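
A standalone illustration of the server-side "top N" selection such an RPC 
would need (the Procedure record and the sort key below are simplified 
assumptions, not HBase classes):

{code:java}
// Pick the proc ids of the first N procedures in a procedures.jsp-style ordering;
// the master-side bypassProcedure(...) could then feed these ids into the existing
// bypass machinery.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public final class BypassTopN {

  record Procedure(long procId, long submittedTime) {}

  static List<Long> topNProcIds(List<Procedure> procs, int n) {
    return procs.stream()
      .sorted(Comparator.comparingLong(Procedure::submittedTime)) // assumed sort key
      .limit(n)
      .map(Procedure::procId)
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Procedure> procs = List.of(
      new Procedure(101, 3L), new Procedure(102, 1L), new Procedure(103, 2L));
    System.out.println(topNProcIds(procs, 2)); // [102, 103]
  }
}
{code}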



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28011) The logStats about LruBlockCache is not accurate

2023-08-09 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28011.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> The logStats about LruBlockCache is not accurate
> 
>
> Key: HBASE-28011
> URL: https://issues.apache.org/jira/browse/HBASE-28011
> Project: HBase
>  Issue Type: Bug
>  Components: BlockCache
>Affects Versions: 2.4.13
> Environment: Centos 7.6
> HBase 2.4.13
>Reporter: guluo
>Assignee: guluo
>Priority: Minor
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> LruBlockCache.logStats would print info as follows:
> {code:java}
> // code placeholder
> INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=2.42 MB, 
> freeSize=3.20 GB, max=3.20 GB, blockCount=14, accesses=31200, hits=31164, 
> hitRatio=99.88%, , cachingAccesses=31179, cachingHits=31156, 
> cachingHitsRatio=99.93%, evictions=426355, evicted=0, evictedPerRun=0.0 {code}
> I think the description *totalSize=2.42 MB* is not accurate; it actually 
> represents the used size of the BlockCache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26494) Using RefCnt to fix the flawed MemStoreLABImpl

2023-08-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750929#comment-17750929
 ] 

Viraj Jasani commented on HBASE-26494:
--

probable cause for HBASE-27941?

> Using RefCnt to fix the flawed MemStoreLABImpl
> --
>
> Key: HBASE-26494
> URL: https://issues.apache.org/jira/browse/HBASE-26494
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 3.0.0-alpha-2, 2.4.9
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> As HBASE-26465 said, the reference count implementation in {{MemStoreLABImpl}} 
> is flawed because checking and incrementing or decrementing are not done 
> atomically, and it ignores illegal states in the reference count (e.g. 
> incrementing or decrementing when the resource is already freed), as the 
> following {{incScannerCount}} and {{decScannerCount}} methods illustrate; this 
> flawed implementation has shielded the bugs HBASE-26465 and HBASE-26488.
> {code:java}
>   public void incScannerCount() {
> this.openScannerCount.incrementAndGet();
>   }
>   public void decScannerCount() {
> int count = this.openScannerCount.decrementAndGet();
> if (this.closed.get() && count == 0) {
>   recycleChunks();
> }
>   }
> {code}
> We could introduce {{RefCnt}} into {{MemStoreLABImpl}} to replace its flawed 
> reference count implementation, so that checking and incrementing or 
> decrementing are done atomically, and an illegal reference count state throws 
> an exception rather than silently continuing with corrupt data and causing 
> subtle bugs; as a bonus, the code becomes simpler.
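
A minimal standalone sketch of the RefCnt idea (illustration only; not the 
RefCnt class HBase actually uses):

{code:java}
// Retain/release are atomic (CAS loop) and any illegal transition throws instead of
// silently continuing with corrupt state.
import java.util.concurrent.atomic.AtomicInteger;

public final class SimpleRefCnt {

  private final AtomicInteger cnt = new AtomicInteger(1);
  private final Runnable onFree;

  SimpleRefCnt(Runnable onFree) {
    this.onFree = onFree;
  }

  void retain() {
    int c;
    do {
      c = cnt.get();
      if (c <= 0) {
        throw new IllegalStateException("retain on already-freed resource, refCnt=" + c);
      }
    } while (!cnt.compareAndSet(c, c + 1));
  }

  void release() {
    int c;
    do {
      c = cnt.get();
      if (c <= 0) {
        throw new IllegalStateException("release on already-freed resource, refCnt=" + c);
      }
    } while (!cnt.compareAndSet(c, c - 1));
    if (c == 1) {
      onFree.run(); // e.g. recycleChunks() in MemStoreLABImpl
    }
  }

  public static void main(String[] args) {
    SimpleRefCnt ref = new SimpleRefCnt(() -> System.out.println("chunks recycled"));
    ref.retain();  // scanner opened
    ref.release(); // scanner closed
    ref.release(); // owner closed -> "chunks recycled"
  }
}
{code}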



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-27941) Possible memory leak in MemStoreLAB implementation

2023-08-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750913#comment-17750913
 ] 

Viraj Jasani edited comment on HBASE-27941 at 8/4/23 12:48 AM:
---

[~zhangduo] I came across this today on the latest 2.5 as well
{code:java}
2023-08-04 00:02:56,694 ERROR [MemStoreFlusher.0] util.ResourceLeakDetector - 
Cnt.create(RefCnt.java:54)
    
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
    sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
    
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
    
org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
    
org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
    
org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
    
org.apache.hadoop.hbase.regionserver.DefaultMemStore.snapshot(DefaultMemStore.java:106)
    
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.prepare(HStore.java:1946)
    
org.apache.hadoop.hbase.regionserver.HRegion.lambda$internalPrepareFlushCache$2(HRegion.java:2712)
    java.util.TreeMap.forEach(TreeMap.java:1005)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2711)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2584)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2558)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
    org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1736)
    org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1557)
    
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:120)
    org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750) {code}


was (Author: vjasani):
[~zhangduo] I came across this today on the latest 2.5 as well, though the 
stack trace is different; it's coming from _createMutableSegment()_
{code:java}
2023-08-04 00:02:56,694 ERROR [MemStoreFlusher.0] util.ResourceLeakDetector - 
Cnt.create(RefCnt.java:54)
    
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
    sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
    
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
    
org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
    
org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
    
org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
    
org.apache.hadoop.hbase.regionserver.DefaultMemStore.snapshot(DefaultMemStore.java:106)
    
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.prepare(HStore.java:1946)
    
org.apache.hadoop.hbase.regionserver.HRegion.lambda$internalPrepareFlushCache$2(HRegion.java:2712)
    java.util.TreeMap.forEach(TreeMap.java:1005)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2711)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2584)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2558)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
    org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1736)
    org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1557)
    
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:120)
    org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750) {code}

> Possible memory leak in MemStoreLAB implementation
> --
>
>  

[jira] [Comment Edited] (HBASE-27941) Possible memory leak in MemStoreLAB implementation

2023-08-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750913#comment-17750913
 ] 

Viraj Jasani edited comment on HBASE-27941 at 8/4/23 12:45 AM:
---

[~zhangduo] I came across this today on the latest 2.5 as well, though the 
stack trace is different; it's coming from _createMutableSegment()_
{code:java}
2023-08-04 00:02:56,694 ERROR [MemStoreFlusher.0] util.ResourceLeakDetector - 
Cnt.create(RefCnt.java:54)
    
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
    sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
    
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
    
org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
    
org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
    
org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
    
org.apache.hadoop.hbase.regionserver.DefaultMemStore.snapshot(DefaultMemStore.java:106)
    
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.prepare(HStore.java:1946)
    
org.apache.hadoop.hbase.regionserver.HRegion.lambda$internalPrepareFlushCache$2(HRegion.java:2712)
    java.util.TreeMap.forEach(TreeMap.java:1005)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2711)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2584)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2558)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
    org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1736)
    org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1557)
    
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:120)
    org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750) {code}


was (Author: vjasani):
[~zhangduo] i came across this today on latest 2.5 as well
{code:java}
2023-08-04 00:02:56,694 ERROR [MemStoreFlusher.0] util.ResourceLeakDetector - 
Cnt.create(RefCnt.java:54)
    
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
    sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
    
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
    
org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
    
org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
    
org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
    
org.apache.hadoop.hbase.regionserver.DefaultMemStore.snapshot(DefaultMemStore.java:106)
    
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.prepare(HStore.java:1946)
    
org.apache.hadoop.hbase.regionserver.HRegion.lambda$internalPrepareFlushCache$2(HRegion.java:2712)
    java.util.TreeMap.forEach(TreeMap.java:1005)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2711)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2584)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2558)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
    org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1736)
    org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1557)
    
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:120)
    org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750) {code}

> Possible memory leak in MemStoreLAB implementation
> --
>
>  

[jira] [Comment Edited] (HBASE-27941) Possible memory leak in MemStoreLAB implementation

2023-08-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750913#comment-17750913
 ] 

Viraj Jasani edited comment on HBASE-27941 at 8/4/23 12:43 AM:
---

[~zhangduo] I came across this today on the latest 2.5 as well
{code:java}
2023-08-04 00:02:56,694 ERROR [MemStoreFlusher.0] util.ResourceLeakDetector - 
Cnt.create(RefCnt.java:54)
    
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
    sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
    
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
    
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
    
org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
    
org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
    
org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
    
org.apache.hadoop.hbase.regionserver.DefaultMemStore.snapshot(DefaultMemStore.java:106)
    
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.prepare(HStore.java:1946)
    
org.apache.hadoop.hbase.regionserver.HRegion.lambda$internalPrepareFlushCache$2(HRegion.java:2712)
    java.util.TreeMap.forEach(TreeMap.java:1005)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2711)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2584)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2558)
    
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
    org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1736)
    org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1557)
    
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:120)
    org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750) {code}


was (Author: vjasani):
[~zhangduo] i came across this today on latest 2.5 as well

> Possible memory leak in MemStoreLAB implementation
> --
>
> Key: HBASE-27941
> URL: https://issues.apache.org/jira/browse/HBASE-27941
> Project: HBase
>  Issue Type: Bug
>  Components: in-memory-compaction, regionserver
>Reporter: Duo Zhang
>Priority: Major
>
> We got this error message when running ITBLL against branch-3.
> {noformat}
> 2023-06-09 14:44:15,386 ERROR 
> [regionserver/core-1-2:16020-shortCompactions-0] util.ResourceLeakDetector: 
> LEAK: RefCnt.release() was not called before it's garbage-collected. See 
> https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records:
> Created at:
> org.apache.hadoop.hbase.nio.RefCnt.<init>(RefCnt.java:59)
> org.apache.hadoop.hbase.nio.RefCnt.create(RefCnt.java:54)
> 
> org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
> sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
> 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
> 
> org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
> 
> org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
> 
> org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
> 
> org.apache.hadoop.hbase.regionserver.AbstractMemStore.<init>(AbstractMemStore.java:83)
> 
> org.apache.hadoop.hbase.regionserver.DefaultMemStore.<init>(DefaultMemStore.java:79)
> sun.reflect.GeneratedConstructorAccessor12.newInstance(Unknown Source)
> 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
> 
> org.apache.hadoop.hbas

[jira] [Commented] (HBASE-27941) Possible memory leak in MemStoreLAB implementation

2023-08-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750913#comment-17750913
 ] 

Viraj Jasani commented on HBASE-27941:
--

[~zhangduo] I came across this today on the latest 2.5 as well

> Possible memory leak in MemStoreLAB implementation
> --
>
> Key: HBASE-27941
> URL: https://issues.apache.org/jira/browse/HBASE-27941
> Project: HBase
>  Issue Type: Bug
>  Components: in-memory-compaction, regionserver
>Reporter: Duo Zhang
>Priority: Major
>
> We got this error message when running ITBLL against branch-3.
> {noformat}
> 2023-06-09 14:44:15,386 ERROR 
> [regionserver/core-1-2:16020-shortCompactions-0] util.ResourceLeakDetector: 
> LEAK: RefCnt.release() was not called before it's garbage-collected. See 
> https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records:
> Created at:
> org.apache.hadoop.hbase.nio.RefCnt.<init>(RefCnt.java:59)
> org.apache.hadoop.hbase.nio.RefCnt.create(RefCnt.java:54)
> 
> org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.<init>(MemStoreLABImpl.java:108)
> sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
> 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
> 
> org.apache.hadoop.hbase.regionserver.MemStoreLAB.newInstance(MemStoreLAB.java:116)
> 
> org.apache.hadoop.hbase.regionserver.SegmentFactory.createMutableSegment(SegmentFactory.java:81)
> 
> org.apache.hadoop.hbase.regionserver.AbstractMemStore.resetActive(AbstractMemStore.java:93)
> 
> org.apache.hadoop.hbase.regionserver.AbstractMemStore.<init>(AbstractMemStore.java:83)
> 
> org.apache.hadoop.hbase.regionserver.DefaultMemStore.<init>(DefaultMemStore.java:79)
> sun.reflect.GeneratedConstructorAccessor12.newInstance(Unknown Source)
> 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.instantiate(ReflectionUtils.java:55)
> 
> org.apache.hadoop.hbase.util.ReflectionUtils.newInstance(ReflectionUtils.java:92)
> 
> org.apache.hadoop.hbase.regionserver.HStore.getMemstore(HStore.java:377)
> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:283)
> 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6904)
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1173)
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1170)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:750)
> {noformat}
> Need to dig more.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27904) A random data generator tool leveraging bulk load.

2023-07-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27904:
--

re-opening for branch-2 backport

> A random data generator tool leveraging bulk load.
> --
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
>  Issue Type: New Feature
>  Components: util
>Reporter: Himanshu Gwalani
>Assignee: Himanshu Gwalani
>Priority: Major
> Fix For: 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase that leverages bulk load. 
> Since bulk load skips the client write path, it is much faster to generate data 
> and use it for load/performance tests where client writes are not a requirement.
> {*}Example{*}: Any tooling over HBase that needs x TBs of HBase table data for load 
> testing.
> {*}Requirements{*}:
> 1. The tooling should generate RANDOM data on the fly and should not require any 
> pre-generated data as CSV/XML input files.
> 2. The tooling should support pre-split tables (the number of splits to be taken as 
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split, with the number of splits taken as input).
> 2. The mapper of a custom MapReduce job will generate random key-value pairs 
> and ensure that they are equally distributed across all regions of the table (a 
> hypothetical mapper sketch follows the Usage section below).
> 3. 
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
>  will be used to add a reducer to the MR job and create HFiles based on the 
> key-value pairs generated by the mapper. 
> 4. Bulk load those HFiles into the respective regions of the table using 
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had a POC of this tool in our organization and tested it with an 11-node 
> HBase cluster (running both HBase and Hadoop services). The tool 
> generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool 
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 100 -split-count 
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>  
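
To make the uniform-distribution idea in the High-level Steps concrete, here is a minimal, hypothetical mapper sketch. It is not the actual BulkDataGeneratorTool source; the input key/value types, column family name, qualifier, and configuration keys are assumptions for illustration, and it assumes the table is pre-split on the same zero-padded bucket prefix.
{code:java}
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: emit (rowkey, Put) pairs whose keys carry a random bucket
// prefix in [0, splitCount), so the generated HFiles spread evenly across all
// regions of a table that was pre-split on that prefix.
public class RandomDataMapper
    extends Mapper<LongWritable, NullWritable, ImmutableBytesWritable, Put> {

  private static final byte[] CF = Bytes.toBytes("cf"); // assumed column family
  private int splitCount;
  private long rowsPerMapper;

  @Override
  protected void setup(Context context) {
    // Configuration key names are assumptions, not the tool's real options.
    splitCount = context.getConfiguration().getInt("split.count", 100);
    rowsPerMapper = context.getConfiguration().getLong("rows.per.mapper", 100);
  }

  @Override
  protected void map(LongWritable key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    for (long i = 0; i < rowsPerMapper; i++) {
      int bucket = rnd.nextInt(splitCount);
      // Zero-padded bucket prefix followed by a random suffix keeps keys uniform
      // across the pre-split regions.
      byte[] row = Bytes.toBytes(String.format("%05d-%016x", bucket, rnd.nextLong()));
      Put put = new Put(row);
      put.addColumn(CF, Bytes.toBytes("q"), Bytes.toBytes(rnd.nextLong()));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }
}
{code}
In the driver, HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator) would then wire the sorting reducer and total-order partitioner, matching step 3 above.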



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27904) A random data generator tool leveraging bulk load.

2023-07-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27904.
--
Fix Version/s: 2.6.0
   Resolution: Fixed

> A random data generator tool leveraging bulk load.
> --
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
>  Issue Type: New Feature
>  Components: util
>Reporter: Himanshu Gwalani
>Assignee: Himanshu Gwalani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase that leverages bulk load. 
> Since bulk load skips the client write path, it is much faster to generate data 
> and use it for load/performance tests where client writes are not a requirement.
> {*}Example{*}: Any tooling over HBase that needs x TBs of HBase table data for load 
> testing.
> {*}Requirements{*}:
> 1. The tooling should generate RANDOM data on the fly and should not require any 
> pre-generated data as CSV/XML input files.
> 2. The tooling should support pre-split tables (the number of splits to be taken as 
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split, with the number of splits taken as input).
> 2. The mapper of a custom MapReduce job will generate random key-value pairs 
> and ensure that they are equally distributed across all regions of the table.
> 3. 
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
>  will be used to add a reducer to the MR job and create HFiles based on the 
> key-value pairs generated by the mapper. 
> 4. Bulk load those HFiles into the respective regions of the table using 
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had a POC of this tool in our organization and tested it with an 11-node 
> HBase cluster (running both HBase and Hadoop services). The tool 
> generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool 
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 100 -split-count 
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-25549) Provide lazy mode when modifying table to avoid RIT storm

2023-07-19 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-25549:
-
Fix Version/s: 2.6.0
   2.5.6

> Provide lazy mode when modifying table to avoid RIT storm
> -
>
> Key: HBASE-25549
> URL: https://issues.apache.org/jira/browse/HBASE-25549
> Project: HBase
>  Issue Type: Improvement
>  Components: master, shell
>Affects Versions: 3.0.0-alpha-1
>Reporter: Zhuoyue Huang
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> Under normal circumstances, modifying a table will cause all regions 
> belonging to the table to enter RIT. Imagine the following two scenarios:
>  # Someone entered the wrong configuration (e.g. a negative 
> 'hbase.busy.wait.multiplier.max' value) when altering the table, causing 
> thousands of online regions to fail to open, leading to a production incident.
>  # The configuration of a table is modified, but the change is not urgent and 
> the regions are not expected to enter RIT immediately.
> -'alter_lazy' is a new command to modify a table without reopening any online 
> regions, except those regions that were assigned by other threads, split, etc.-
>  
> Provide an optional lazy_mode for the alter command to modify the 
> TableDescriptor without the regions entering RIT. The modification will 
> take effect when a region is reopened.
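
A hedged illustration of the proposal from the Java Admin API side; the table name and configuration value are examples, and the lazy variant shown in the trailing comment is hypothetical (it does not exist in HBase today):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class LazyAlterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("t1"); // example table
      TableDescriptor modified = TableDescriptorBuilder
          .newBuilder(admin.getDescriptor(table))
          .setValue("hbase.busy.wait.multiplier.max", "2") // example config change
          .build();
      // Today: this reopens every region of the table, so all of them pass through RIT.
      admin.modifyTable(modified);
      // Proposed lazy mode (hypothetical API, for illustration only): persist the new
      // descriptor without reopening regions; the change takes effect whenever a
      // region is reopened for other reasons (move, split, restart, ...).
      // admin.modifyTable(modified, false /* reopenRegions */);
    }
  }
}
{code}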



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27965) Change the label from InterfaceAudience.Private to InterfaceAudience.LimitedPrivate for org.apache.hadoop.hbase.Server.java

2023-07-14 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27965:
-
Fix Version/s: 2.6.0
   3.0.0-beta-1

> Change the label from InterfaceAudience.Private to 
> InterfaceAudience.LimitedPrivate for org.apache.hadoop.hbase.Server.java 
> 
>
> Key: HBASE-27965
> URL: https://issues.apache.org/jira/browse/HBASE-27965
> Project: HBase
>  Issue Type: Task
>  Components: Coprocessors
>Affects Versions: 2.4.4
>Reporter: Shubham Roy
>Assignee: Shubham Roy
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> Currently the class org.apache.hadoop.hbase.Server.java is marked 
> InterfaceAudience.Private.
> This prevents getting the shared ZooKeeper watcher (ZKWatcher) instance 
> from org.apache.hadoop.hbase.regionserver.RegionServerServices.java (which extends 
> org.apache.hadoop.hbase.Server.java) using the method 
> getZooKeeper().
>  
> This creates a problem when writing custom coprocessors because we don't 
> have a shared ZKWatcher instance.
>  
> The proposed solution is to use InterfaceAudience.LimitedPrivate.
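
A hedged sketch of the coprocessor use case being described, assuming the coprocessor has already obtained a RegionServerServices reference (which is the access pattern the LimitedPrivate change would make supported):
{code:java}
import org.apache.hadoop.hbase.regionserver.RegionServerServices;
import org.apache.hadoop.hbase.zookeeper.ZKWatcher;

// Illustrative only: 'services' would come from the coprocessor's environment on the
// hosting region server. Because RegionServerServices extends Server, the shared
// ZKWatcher can be reused instead of opening a new ZooKeeper session per coprocessor.
public final class SharedZkWatcherExample {
  private SharedZkWatcherExample() {
  }

  public static ZKWatcher sharedWatcher(RegionServerServices services) {
    return services.getZooKeeper(); // Server#getZooKeeper() exposes the shared watcher
  }
}
{code}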



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27957) HConnection (and ZookeeprWatcher threads) leak in case of AUTH_FAILED exception.

2023-06-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27957:
-
Component/s: Client

> HConnection (and ZookeeprWatcher threads) leak in case of AUTH_FAILED 
> exception.
> 
>
> Key: HBASE-27957
> URL: https://issues.apache.org/jira/browse/HBASE-27957
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 1.7.2, 2.4.17
>Reporter: Rushabh Shah
>Priority: Critical
>
> Observed this in a production environment running some version of the 1.7 release.
> The application didn't have the right keytab set up for authentication. The application 
> was trying to create an HConnection and the ZooKeeper server threw an AUTH_FAILED 
> exception.
> After a few hours of the application being in this state, we saw thousands of 
> zk-event-processor threads with the stack trace below.
> {noformat}
> "zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms 
> elapsed=41794.58s tid=0x7fd7805066d0 nid=0x1245 waiting on condition  
> [0x7fd75df01000]
>java.lang.Thread.State: WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native 
> Method)
> - parking to wait for  <0x7fd9874a85e0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628)
> {noformat}
> {code:java|title=ConnectionManager.java|borderStyle=solid}
> HConnectionImplementation(Configuration conf, boolean managed,
> ExecutorService pool, User user, String clusterId) throws IOException 
> {
> ...
> ...
> try {
>this.registry = setupRegistry();
>retrieveClusterId();
>...
>...
> } catch (Throwable e) {
>// avoid leaks: registry, rpcClient, ...
>LOG.debug("connection construction failed", e);
>close();
>throw e;
>  }
> {code}
> retrieveClusterId internally calls ZKConnectionRegistry#getClusterId
> {code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
>   private String clusterId = null;
>   @Override
>   public String getClusterId() {
> if (this.clusterId != null) return this.clusterId;
> // No synchronized here, worse case we will retrieve it twice, that's
> //  not an issue.
> try (ZooKeeperKeepAliveConnection zkw = 
> hci.getKeepAliveZooKeeperWatcher()) {
>   this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
>   if (this.clusterId == null) {
> LOG.info("ClusterId read in ZooKeeper is null");
>   }
> } catch (KeeperException | IOException e) {  --->  WE ARE SWALLOWING 
> THIS EXCEPTION AND RETURNING NULL. 
>   LOG.warn("Can't retrieve clusterId from Zookeeper", e);
> }
> return this.clusterId;
>   }
> {code}
> ZKConnectionRegistry#getClusterId threw the following exception. (Our logging 
> system trims stack traces longer than 5 lines.)
> {noformat}
> Cause: org.apache.zookeeper.KeeperException$AuthFailedException: 
> KeeperErrorCode = AuthFailed for /hbase/hbaseid
> StackTrace: 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
> org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
> {noformat}
> We should throw the KeeperException from ZKConnectionRegistry#getClusterId all 
> the way back to the HConnectionImplementation constructor, so that it closes all the 
> watcher threads and rethrows the exception to the caller.
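
A hedged sketch of that proposal, based on the code quoted above rather than any committed patch, and assuming the registry method is allowed to surface the failure as an IOException so the constructor's catch block can run close():
{code:java}
// Sketch only: stop swallowing the ZooKeeper failure so HConnectionImplementation's
// constructor can close() the registry/watcher threads and rethrow to the caller.
@Override
public String getClusterId() throws IOException {
  if (this.clusterId != null) {
    return this.clusterId;
  }
  try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
    this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
    if (this.clusterId == null) {
      LOG.info("ClusterId read in ZooKeeper is null");
    }
  } catch (KeeperException e) {
    // Propagate instead of returning null, e.g. on AUTH_FAILED, so connection
    // setup fails fast and no watcher threads are leaked.
    throw new IOException("Can't retrieve clusterId from Zookeeper", e);
  }
  return this.clusterId;
}
{code}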



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-29 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738226#comment-17738226
 ] 

Viraj Jasani edited comment on HBASE-27955 at 6/29/23 9:31 PM:
---

That is correct, the NPE is a code bug in the custom replication endpoint; however, 
the point I am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). The next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, then stays stuck and doesn't even get executed. The 
only clue I have is that the previous step of the procedure had the above NPE 
reported and was completed (the succ flag is set to false):
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, 
error);
this.succ = false;
  } else {
LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, 
targetServer);
this.succ = true;
  }
} {code}
Thread dumps had nothing reported that could indicate why 
POST_PEER_MODIFICATION was stuck. No INFO logs from POST_PEER_MODIFICATION step 
execution either.

 

Hence, if we could introduce rollback in RefreshPeerProcedure, that would help 
at least complete the procedure with a rollback rather than letting it stay stuck 
at the next step (POST_PEER_MODIFICATION).


was (Author: vjasani):
That is correct, NPE is code bug in the custom replication endpoint, however 
the point i am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). And the next step in the parent procedure i.e. 
POST_PEER_MODIFICATION would stay stuck and it doesn't even get executed. The 
only clue i have is that the previous step of the procedure had above NPE 
reported and it got completed (succ flag is modified to false)

 
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, 
error);
this.succ = false;
  } else {
LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, 
targetServer);
this.succ = true;
  }
} {code}
 

 

Thread dumps had nothing reported that could indicate why 
POST_PEER_MODIFICATION was stuck.

 

If we could introduce rollback in RefreshPeerProcedure, that could help at 
least complete the procedure with rollback rather than letting it stay stuck at 
next step (POST_PEER_MODIFICATION).
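
A purely illustrative, hedged sketch of the rollback idea described in this comment; the remoteError field and the parent-side check are hypothetical and not existing HBase code. The intent is that, instead of only flipping the succ flag, the remote failure is surfaced so the parent peer-modification procedure can fail and roll back rather than hang before POST_PEER_MODIFICATION.
{code:java}
// Hypothetical sketch only, not the actual procedure2 wiring.
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
    LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, error);
    this.succ = false;
    this.remoteError = error; // hypothetical: remember why the refresh failed
  } else {
    LOG.info("Refresh peer {} for {} on {} succeeded", peerId, type, targetServer);
    this.succ = true;
  }
}

// Hypothetical parent-side handling in the peer-modification procedure: mark the
// procedure as failed so the framework rolls it back instead of waiting forever.
// if (!refreshPeerSucceeded) {
//   setFailure("refresh-peer", remoteErrorFromChild);
// }
{code}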

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.14
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org

[jira] [Resolved] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27948.
--
Fix Version/s: 2.6.0
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean
> ---
>
> Key: HBASE-27948
> URL: https://issues.apache.org/jira/browse/HBASE-27948
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jing Yu
>Assignee: Jing Yu
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> Currently we only report the "memStoreSize" JMX metric in the sub=Memory bean. There 
> are "Memstore On-Heap Size" and "Memstore Off-Heap Size" values in the RS UI. It 
> would be useful to report them in JMX as well.
> In addition, the "memStoreSize" metric under sub=Memory is 0 for some reason 
> (while the one under sub=Server is not). Need to do some digging to see if it is 
> a bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27955:
-
Affects Version/s: (was: 2.4.17)

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.14
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at 
> org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750) {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27955:
-
Affects Version/s: 2.4.17

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at 
> org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750) {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27955:
-
Affects Version/s: 2.4.14

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.14, 2.4.17
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at 
> org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750) {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738226#comment-17738226
 ] 

Viraj Jasani commented on HBASE-27955:
--

That is correct, the NPE is a code bug in the custom replication endpoint; however, 
the point I am trying to make is: as soon as this NPE gets reported, 
RefreshPeerProcedure gets completed but not rolled back (rollback is not 
supported). The next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, then stays stuck and doesn't even get executed. The 
only clue I have is that the previous step of the procedure had the above NPE 
reported and was completed (the succ flag is set to false)

 
{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, 
error);
this.succ = false;
  } else {
LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, 
targetServer);
this.succ = true;
  }
} {code}
 

 

Thread dumps had nothing reported that could indicate why 
POST_PEER_MODIFICATION was stuck.

 

If we could introduce rollback in RefreshPeerProcedure, that could help at 
least complete the procedure with a rollback rather than letting it stay stuck at 
the next step (POST_PEER_MODIFICATION).

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -
>
> Key: HBASE-27955
> URL: https://issues.apache.org/jira/browse/HBASE-27955
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either by restarting 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
> replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
> {host},{port},1687053857180 failed
> java.lang.NullPointerException via 
> {host},{port},1687053857180:java.lang.NullPointerException: 
>     at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89) <= replication endpoint failure example
>     at xyz(Abc.java:79)     <= replication endpoint failure example
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at 
> org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at 
> org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750) {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27955:
-
Description: 
UpdatePeerConfigProcedure gets stuck when we see some failures in 
RefreshPeerProcedure. The only way to move forward is either by restarting 
active master or bypassing the stuck procedure.

 

For instance,
{code:java}
2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on 
{host},{port},1687053857180 failed
java.lang.NullPointerException via 
{host},{port},1687053857180:java.lang.NullPointerException: 
    at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException: 
    at xyz(Abc.java:89) <= replication endpoint failure example
    at xyz(Abc.java:79)     <= replication endpoint failure example
    at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
    at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
    at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
    at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750) {code}
RefreshPeerProcedure should support reporting this failure and rollback of the 
parent procedure.

  was:
UpdatePeerConfigProcedure gets stuck when we see some failures in 
RefreshPeerProcedure. The only way to move forward is either by restarting 
active master or bypassing the stuck procedure.

 

For instance,
{code:java}
2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
replication.RefreshPeerProcedure - Refresh peer core1.hbase1a_aws.prod5.uswest2 
for UPDATE_CONFIG on {host},{port},1687053857180 failed
java.lang.NullPointerException via 
{host},{port},1687053857180:java.lang.NullPointerException: 
    at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException: 
    at xyz(Abc.java:89) <= replication endpoint failure example
    at xyz(Abc.java:79)     <= replication endpoint failure example
    at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandle
