[jira] [Created] (HBASE-28757) Understand how supportplaintext property works in TLS setup.

2024-07-25 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28757:


 Summary: Understand how supportplaintext property works in TLS 
setup.
 Key: HBASE-28757
 URL: https://issues.apache.org/jira/browse/HBASE-28757
 Project: HBase
  Issue Type: Improvement
  Components: security
Affects Versions: 2.6.0
Reporter: Rushabh Shah


We are testing the TLS feature and I am confused about how the
hbase.server.netty.tls.supportplaintext property works.
Here is our current setup. This is a fresh cluster deployment.
hbase.server.netty.tls.enabled --> true
hbase.client.netty.tls.enabled --> true
hbase.server.netty.tls.supportplaintext --> false (we don't want to fall back to
Kerberos)
We still have our Kerberos-related configuration enabled:
hbase.security.authentication --> kerberos
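For reference, the same settings expressed programmatically (a sketch only; we actually set these in hbase-site.xml, and the property names are copied from the list above):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TlsConfigSketch {
  public static Configuration build() {
    // Values mirror the hbase-site.xml entries listed above.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.server.netty.tls.enabled", true);
    conf.setBoolean("hbase.client.netty.tls.enabled", true);
    // Intent: do not allow falling back to plaintext negotiation on the server side.
    conf.setBoolean("hbase.server.netty.tls.supportplaintext", false);
    conf.set("hbase.security.authentication", "kerberos");
    return conf;
  }
}
{code}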

*Our expectation:*
During regionserver startup, the regionserver will use TLS for authentication and
the communication will succeed.

*Actual observation:*
During regionserver startup, the HMaster authenticates the regionserver *via
Kerberos authentication* and the *regionserver's reportForDuty RPC fails*.

RS logs:
{noformat}
2024-07-25 16:59:55,098 INFO  [regionserver/regionserver-0:60020] 
regionserver.HRegionServer - reportForDuty to 
master=hmaster-0,6,1721926791062 with 
isa=regionserver-0/:60020, startcode=1721926793434

2024-07-25 16:59:55,548 DEBUG [RS-EventLoopGroup-1-2] ssl.SslHandler - [id: 
0xa48e3487, L:/:39837 - R:hmaster-0/:6] 
HANDSHAKEN: protocol:TLSv1.2 cipher suite:TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256

2024-07-25 16:59:55,578 DEBUG [RS-EventLoopGroup-1-2] 
security.UserGroupInformation - PrivilegedAction [as: hbase/regionserver-0. 
(auth:KERBEROS)][action: 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler$2@3769e55]
java.lang.Exception
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
at 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:161)
at 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:43)
...
...

2024-07-25 16:59:55,581 DEBUG [RS-EventLoopGroup-1-2] 
security.UserGroupInformation - PrivilegedAction [as: hbase/regionserver-0 
(auth:KERBEROS)][action: 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler$2@c6f0806]
java.lang.Exception
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896)
at 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:161)
at 
org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:43)
at 
org.apache.hbase.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)

2024-07-25 16:59:55,602 WARN  [regionserver/regionserver-0:60020] 
regionserver.HRegionServer - error telling master we are up
org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
address=hmaster-0:6 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:340)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:92)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:595)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:16398)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2997)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.lambda$run$2(HRegionServer.java:1084)
at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call 
to address=hmaster-0:6 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:233)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
at 

[jira] [Created] (HBASE-28515) Validate access check for each RegionServerCoprocessor Endpoint.

2024-04-11 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28515:


 Summary: Validate access check for each RegionServerCoprocessor 
Endpoint.
 Key: HBASE-28515
 URL: https://issues.apache.org/jira/browse/HBASE-28515
 Project: HBase
  Issue Type: Improvement
  Components: Coprocessors
Reporter: Rushabh Shah


Currently we enforce ADMIN permissions for each regionserver coprocessor
method; see HBASE-28508 for more details.
There can be regionserver endpoint implementations that don't want to require
calling users to have ADMIN permissions. So there needs to be a way for an RS
coproc implementation to override the access checks, with the default still
requiring ADMIN permissions.
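A minimal sketch of what such an override point could look like (hypothetical names; this interface does not exist in HBase today):
{code:java}
/**
 * Hypothetical sketch only -- not an existing HBase API. The idea: give each
 * regionserver endpoint an access hook whose default keeps today's behaviour
 * (require ADMIN), which an implementation may override to relax the check.
 */
public interface EndpointAccessPolicy {

  /** Returns true if the calling user may invoke this endpoint. */
  default boolean isCallAllowed(String requestName, boolean callerHasAdmin) {
    // Default: preserve current semantics -- only ADMIN callers pass.
    return callerHasAdmin;
  }
}

// An endpoint that does not want the ADMIN requirement would override the hook:
class RelaxedEndpointPolicy implements EndpointAccessPolicy {
  @Override
  public boolean isCallAllowed(String requestName, boolean callerHasAdmin) {
    return true; // allow any authenticated caller
  }
}
{code}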



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28508) Remove the need for ADMIN permissions for RSRpcServices#execRegionServerService

2024-04-09 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28508:


 Summary: Remove the need for ADMIN permissions for 
RSRpcServices#execRegionServerService
 Key: HBASE-28508
 URL: https://issues.apache.org/jira/browse/HBASE-28508
 Project: HBase
  Issue Type: Bug
  Components: acl
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We have introduced a new regionserver coproc within Phoenix, and all the
permission-related tests are failing with the following exception.
{noformat}
Caused by: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException):
 org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient 
permissions for user 'groupUser_N42' (global, action=ADMIN)
at 
org.apache.hadoop.hbase.security.access.AccessChecker.requireGlobalPermission(AccessChecker.java:152)
at 
org.apache.hadoop.hbase.security.access.AccessChecker.requirePermission(AccessChecker.java:125)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.requirePermission(RSRpcServices.java:1318)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.rpcPreCheck(RSRpcServices.java:584)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.execRegionServerService(RSRpcServices.java:3804)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45016)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
{noformat}

This check in
[RSRpcServices|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3815]
is failing:

{code}
  @Override
  public CoprocessorServiceResponse execRegionServerService(RpcController 
controller,
CoprocessorServiceRequest request) throws ServiceException {
rpcPreCheck("execRegionServerService");
return server.execRegionServerService(controller, request);
  }

  private void rpcPreCheck(String requestName) throws ServiceException {
try {
  checkOpen();
  requirePermission(requestName, Permission.Action.ADMIN);
} catch (IOException ioe) {
  throw new ServiceException(ioe);
}
  }
{code}

Why do we need ADMIN permissions to call a regionserver coproc? We don't require
ADMIN permissions for every region coproc; only some region coprocs
(compactionSwitch, clearRegionBlockCache) require ADMIN permissions to execute.

Can we change the permission to READ?
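For illustration, a sketch of the change being asked about (not a committed patch), reusing the methods shown in the snippet above:
{code:java}
  // Sketch only: relax the blanket ADMIN requirement for execRegionServerService
  // to READ (one of the options discussed above), keeping checkOpen() as-is.
  @Override
  public CoprocessorServiceResponse execRegionServerService(RpcController controller,
    CoprocessorServiceRequest request) throws ServiceException {
    try {
      checkOpen();
      requirePermission("execRegionServerService", Permission.Action.READ);
    } catch (IOException ioe) {
      throw new ServiceException(ioe);
    }
    return server.execRegionServerService(controller, request);
  }
{code}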



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28437) Region Server crash in our production environment.

2024-03-11 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28437:


 Summary: Region Server crash in our production environment.
 Key: HBASE-28437
 URL: https://issues.apache.org/jira/browse/HBASE-28437
 Project: HBase
  Issue Type: Bug
Reporter: Rushabh Shah


Recently we have been seeing a lot of RS crashes in our production environment,
each creating a core dump file and an hs_err_pid.log file.
HBase: hbase-2.5
Java: OpenJDK 1.8

Copying the contents of hs_err_pid.log below:
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f9fb1415ba2, pid=50172, tid=0x7f92a97ec700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.76.0.18-SA-linux64) 
(8.0_402-b06) (build 1.8.0_402-b06)
# Java VM: OpenJDK 64-Bit Server VM (25.402-b06 mixed mode linux-amd64 )
# Problematic frame:
# J 19801 C2 
org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
 (75 bytes) @ 0x7f9fb1415ba2 [0x7f9fb14159a0+0x202]
#
# Core dump written. Default location: /home/sfdc/core or core.50172
#
# If you would like to submit a bug report, please visit:
#   http://www.azul.com/support/
#

---  T H R E A D  ---

Current thread (0x7f9fa2d13000):  JavaThread "RS-EventLoopGroup-1-92" 
daemon [_thread_in_Java, id=54547, stack(0x7f92a96ec000,0x7f92a97ed000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
0x559869daf000

Registers:
RAX=0x7f9dbd8b6460, RBX=0x0008, RCX=0x0005c86b, 
RDX=0x7f9dbd8b6460
RSP=0x7f92a97eaf20, RBP=0x0002, RSI=0x7f92d225e970, 
RDI=0x0069
R8 =0x55986975f028, R9 =0x0064ffd8, R10=0x005f, 
R11=0x7f94a778b290
R12=0x7f9e62855ae8, R13=0x, R14=0x7f9e5a14b1e0, 
R15=0x7f9fa2d13000
RIP=0x7f9fb1415ba2, EFLAGS=0x00010216, CSGSFS=0x0033, 
ERR=0x0004
  TRAPNO=0x000e

Top of Stack: (sp=0x7f92a97eaf20)
0x7f92a97eaf20:   00690064ff79 7f9dbd8b6460
0x7f92a97eaf30:   7f9dbd8b6460 00570003
0x7f92a97eaf40:   7f94a778b290 000400010004
0x7f92a97eaf50:   0004d090c130 7f9db550
0x7f92a97eaf60:   000800040001 7f92a97eaf90
0x7f92a97eaf70:   7f92d0908648 0001
0x7f92a97eaf80:   0001 005c
0x7f92a97eaf90:   7f94ee8078d0 0206
0x7f92a97eafa0:   7f9db5545a00 7f9fafb63670
0x7f92a97eafb0:   7f9e5a13ed70 00690001
0x7f92a97eafc0:   7f93ab8965b8 7f93b9959210
0x7f92a97eafd0:   7f9db5545a00 7f9fb04b3e30
0x7f92a97eafe0:   7f9e5a13ed70 7f930001
0x7f92a97eaff0:   7f93ab8965b8 7f93a8ae3920
0x7f92a97eb000:   7f93b9959210 7f94a778b290
0x7f92a97eb010:   7f9b60707c20 7f93a8938c28
0x7f92a97eb020:   7f94ee8078d0 7f9b60708608
0x7f92a97eb030:   7f9b60707bc0 7f9b60707c20
0x7f92a97eb040:   0069 7f93ab8965b8
0x7f92a97eb050:   7f94a778b290 7f94a778b290
0x7f92a97eb060:   0005c80d0005c80c a828a590
0x7f92a97eb070:   7f9e5a13ed70 0001270e
0x7f92a97eb080:   7f9db5545790 01440022
0x7f92a97eb090:   7f95ddc800c0 7f93ab89a6c8
0x7f92a97eb0a0:   7f93ae65c270 7f9fb24af990
0x7f92a97eb0b0:   7f93ae65c290 7f93ae65c270
0x7f92a97eb0c0:   7f9e5a13ed70 7f92ca328528
0x7f92a97eb0d0:   7f9e5a13ed98 7f9e5e1e88b0
0x7f92a97eb0e0:   7f92ca32d870 7f9e5a13ed98
0x7f92a97eb0f0:   7f9e5e1e88b0 7f93b9956288
0x7f92a97eb100:   7f9e5a13ed70 7f9fb23c3aac
0x7f92a97eb110:   7f9317c9c8d0 7f9b60708608 

Instructions: (pc=0x7f9fb1415ba2)
0x7f9fb1415b82:   44 3b d7 0f 8d 6d fe ff ff 4c 8b 40 10 45 8b ca
0x7f9fb1415b92:   44 03 0c 24 c4 c1 f9 7e c3 4d 8b 5b 18 4d 63 c9
0x7f9fb1415ba2:   47 0f be 04 08 4d 85 db 0f 84 49 03 00 00 4d 8b
0x7f9fb1415bb2:   4b 08 48 b9 10 1c be 10 93 7f 00 00 4c 3b c9 0f 

Register to memory mapping:

RAX=0x7f9dbd8b6460 is an oop
java.nio.DirectByteBuffer 
 - klass: 'java/nio/DirectByteBuffer'
RBX=0x0008 is an unknown value
RCX=0x0005c86b is an unknown value
RDX=0x7f9dbd8b6460 is an oop
java.nio.DirectByteBuffer 
 - klass: 'java/nio/DirectByteBuffer'
RSP=0x7f92a97eaf20 is pointing into the stack for thread: 0x7f9fa2d13000
RBP=0x0002 is an unknown value
RSI=0x7f92d225e970 is pointing into metadata
RDI=0x0069 is an unknown value
R8 =0x55986975f028 is an unknown value
R9 =0x0064ffd8 is an unknown value
R10=0x005f is an unknown value
R11=0x7f94a778b290 is an oop
org.apache.hbase.thirdparty.io.netty.buffer.PooledUnsafeDirectByteBuf 
 - klass: 

[jira] [Resolved] (HBASE-28391) Remove the need for ADMIN permissions for listDecommissionedRegionServers

2024-02-27 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-28391.
--
Fix Version/s: 2.6.0
   2.4.18
   4.0.0-alpha-1
   2.7.0
   2.5.8
   3.0.0-beta-2
   Resolution: Fixed

> Remove the need for ADMIN permissions for listDecommissionedRegionServers
> -
>
> Key: HBASE-28391
> URL: https://issues.apache.org/jira/browse/HBASE-28391
> Project: HBase
>  Issue Type: Bug
>  Components: Admin
>Affects Versions: 2.4.17, 2.5.7
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 4.0.0-alpha-1, 2.7.0, 2.5.8, 3.0.0-beta-2
>
>
> Why do we need {{ADMIN}} permissions for
> {{AccessController#preListDecommissionedRegionServers}}?
> From Phoenix, we are calling {{Admin#getRegionServers(true)}} where the 
> argument {{excludeDecommissionedRS}} is set to true. Refer 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java#L1721-L1730].
> If {{excludeDecommissionedRS}}  is set to true and if we have 
> {{AccessController}} co-proc attached, it requires ADMIN permissions to 
> execute {{listDecommissionedRegionServers}} RPC. Refer 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java#L1205-L1207].
>  
> {code:java}
>   @Override
>   public void 
> preListDecommissionedRegionServers(ObserverContext
>  ctx)
> throws IOException {
> requirePermission(ctx, "listDecommissionedRegionServers", Action.ADMIN);
>   }
> {code}
> I understand that we need ADMIN permissions for
> _preDecommissionRegionServers_ and _preRecommissionRegionServer_ because they
> change the membership of regionservers, but I don't see any need for ADMIN
> permissions for _listDecommissionedRegionServers_. Do you think we can remove
> the need for ADMIN permissions for the _listDecommissionedRegionServers_ RPC?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28391) Remove the need for ADMIN permissions for listDecommissionedRegionServers

2024-02-21 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28391:


 Summary: Remove the need for ADMIN permissions for 
listDecommissionedRegionServers
 Key: HBASE-28391
 URL: https://issues.apache.org/jira/browse/HBASE-28391
 Project: HBase
  Issue Type: Bug
  Components: Admin
Affects Versions: 2.5.7, 2.4.17
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Why do we need {{ADMIN}} permissions for
{{AccessController#preListDecommissionedRegionServers}}?

From Phoenix, we are calling {{Admin#getRegionServers(true)}} where the
argument {{excludeDecommissionedRS}} is set to true. Refer
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java#L1721-L1730].
If {{excludeDecommissionedRS}} is set to true and the {{AccessController}}
co-proc is attached, ADMIN permissions are required to execute the
{{listDecommissionedRegionServers}} RPC. Refer
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java#L1205-L1207].
 
{code:java}
  @Override
  public void 
preListDecommissionedRegionServers(ObserverContext
 ctx)
throws IOException {
requirePermission(ctx, "listDecommissionedRegionServers", Action.ADMIN);
  }
{code}
I understand that we need ADMIN permissions for _preDecommissionRegionServers_
and _preRecommissionRegionServer_ because they change the membership of
regionservers, but I don't see any need for ADMIN permissions for
_listDecommissionedRegionServers_. Do you think we can remove the need for ADMIN
permissions for the _listDecommissionedRegionServers_ RPC?
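For context, the client-side call path looks roughly like this (a sketch; {{Admin#getRegionServers(boolean)}} is the branch-2.5 method linked above):
{code:java}
import java.util.Collection;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class GetRegionServersExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // excludeDecommissionedRS = true internally triggers the
      // listDecommissionedRegionServers RPC, which today requires global ADMIN
      // when the AccessController coproc is installed.
      Collection<ServerName> servers = admin.getRegionServers(true);
      System.out.println("Usable region servers: " + servers.size());
    }
  }
}
{code}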



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28293) Add metric for GetClusterStatus request count.

2024-01-05 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28293:


 Summary: Add metric for GetClusterStatus request count.
 Key: HBASE-28293
 URL: https://issues.apache.org/jira/browse/HBASE-28293
 Project: HBase
  Issue Type: Bug
Reporter: Rushabh Shah


We have been bitten multiple times by GetClusterStatus requests overwhelming the
HMaster's memory. It would be good to add a metric for the total
GetClusterStatus request count.

In almost all of our production incidents involving GetClusterStatus requests,
the HMaster ran out of memory because many clients called this RPC in parallel
and the response size is very big.

In hbase2 we have 
[ClusterMetrics.Option|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterMetrics.java#L164-L224]
 which can reduce the size of the response.
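For example, a caller that only needs a subset of the status can bound the response like this (a sketch using the Admin#getClusterMetrics overload that takes an option set):
{code:java}
import java.util.EnumSet;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.ClusterMetrics.Option;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClusterStatusExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Ask only for the master and the live server list instead of the full
      // status, which keeps the GetClusterStatus response small.
      ClusterMetrics metrics =
        admin.getClusterMetrics(EnumSet.of(Option.MASTER, Option.LIVE_SERVERS));
      System.out.println("Live region servers: " + metrics.getLiveServerMetrics().size());
    }
  }
}
{code}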

It would also be nice to add a metric indicating when the GetClusterStatus
response size exceeds some threshold (like 5 MB).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28204) Canary can take a lot more time if any region (except the first region) starts with delete markers

2023-11-15 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-28204.
--
Fix Version/s: 2.6.0
   2.4.18
   3.0.0-beta-1
   2.5.7
   Resolution: Fixed

> Canary can take a lot more time if any region (except the first region) starts
> with delete markers
> 
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> In CanaryTool.java, Canary reads only the first row of the region using 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses a [Scan with FirstRowKeyFilter for the
> table
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this only happens when the region is
> the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions which can have a high number of delete
> markers. Based on the current implementation, [RawScan is only
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of the
> table). RawScan is not used for the rest of the regions in the table. Also, if a
> region has all or a majority of its rows covered by delete markers, the Get
> operation can take a lot of time. This can cause timeouts for CanaryTool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-07 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-28184.
--
Resolution: Fixed

> Tailing the WAL is very slow if there are multiple peers.
> -
>
> Key: HBASE-28184
> URL: https://issues.apache.org/jira/browse/HBASE-28184
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.0
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> We noticed this in one of our production clusters, which has 4 peers.
> Due to a sudden ingestion of data, the size of the log queue increased to a peak
> of 506. We have configured the log roll size to 256 MB. Most of the edits in the
> WAL were from a table for which replication is disabled.
> So all the ReplicationSourceWALReader threads had to do was replay the WAL and
> NOT replicate it. Still, it took 12 hours to drain the queue.
> We took a few jstacks and found that ReplicationSourceWALReader was waiting to
> acquire rollWriterLock
> [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231]
> {noformat}
> "regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 
> nid=0xbd7f waiting on condition [0x7f37b4d19000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f3897a3e150> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
> at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102)
> at 
> org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128)
> at 
> org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown 
> Source)
> at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
> at 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
> at 
> java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
> at 
> java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> at 
> java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
> at 
> org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129)
> at 
> org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown
>  Source)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
> {noformat}
>  All the peers contend for this lock during every batch read.
> Look at the code snippet below. We guard this section with rollWriterLock in case
> we are replicating the active WAL file. But in our case we are NOT replicating
> the active WAL file, yet we still acquire this lock only to return
> OptionalLong.empty();
> {noformat}
>   /**
>* if the given {@code path} is being written currently, then return its 
> length.
>* 
>* This is used by replication to prevent replicating unacked log entries. 
> See
>* 

[jira] [Created] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-01 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28184:


 Summary: Tailing the WAL is very slow if there are multiple peers.
 Key: HBASE-28184
 URL: https://issues.apache.org/jira/browse/HBASE-28184
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.0.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We noticed this in one of our production clusters, which has 4 peers.

Due to a sudden ingestion of data, the size of the log queue increased to a peak of
506. We have configured the log roll size to 256 MB. Most of the edits in the WAL
were from a table for which replication is disabled.

So all the ReplicationSourceWALReader threads had to do was replay the WAL and
NOT replicate it. Still, it took 12 hours to drain the queue.

We took a few jstacks and found that ReplicationSourceWALReader was waiting to
acquire rollWriterLock
[here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231]
{noformat}
"regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 
nid=0xbd7f waiting on condition [0x7f37b4d19000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f3897a3e150> (a 
java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
at 
java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102)
at 
org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128)
at 
org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown 
Source)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at 
java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
at 
java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at 
java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
at 
org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129)
at 
org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown
 Source)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
{noformat}
 All the peers contend for this lock during every batch read.
Look at the code snippet below. We guard this section with rollWriterLock in case
we are replicating the active WAL file. But in our case we are NOT replicating
the active WAL file, yet we still acquire this lock only to return
OptionalLong.empty();
{noformat}
  /**
   * if the given {@code path} is being written currently, then return its 
length.
   * 
   * This is used by replication to prevent replicating unacked log entries. See
   * https://issues.apache.org/jira/browse/HBASE-14004 for more details.
   */
  @Override
  public OptionalLong getLogFileSizeIfBeingWritten(Path path) {
rollWriterLock.lock();
try {
   ...
   ...
} finally {
  rollWriterLock.unlock();
}
{noformat}
We can check the size of the log queue and, if it is greater than 1, return early
without acquiring the lock.
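A rough sketch of that idea (illustrative only; the queue-size check, its placement, and the helper name are assumptions, not the committed fix):
{code:java}
  @Override
  public OptionalLong getLogFileSizeIfBeingWritten(Path path) {
    // Assumption: if the replication log queue holds more than one WAL, the file
    // being read cannot be the currently-written one, so skip rollWriterLock.
    if (replicationLogQueueSize() > 1) {   // hypothetical helper
      return OptionalLong.empty();
    }
    rollWriterLock.lock();
    try {
      // ... existing logic: return the length only if {@code path} is the WAL
      // currently being written ...
      return OptionalLong.empty();
    } finally {
      rollWriterLock.unlock();
    }
  }
{code}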



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28045) Sort on store file size on hmaster page is broken.

2023-08-25 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-28045.
--
Resolution: Invalid

> Sort on store file size on hmaster page is broken.
> --
>
> Key: HBASE-28045
> URL: https://issues.apache.org/jira/browse/HBASE-28045
> Project: HBase
>  Issue Type: Bug
>  Components: UI
>Affects Versions: 2.5.2
>Reporter: Rushabh Shah
>Priority: Major
>  Labels: newbie, starter
> Attachments: Screenshot 2023-08-24 at 11.08.54 AM.png, Screenshot 
> 2023-08-25 at 1.02.49 PM.png
>
>
> An image is worth 1000 words.
> !Screenshot 2023-08-24 at 11.08.54 AM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28045) Sort on store file size on hmaster page is broken.

2023-08-24 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28045:


 Summary: Sort on store file size on hmaster page is broken.
 Key: HBASE-28045
 URL: https://issues.apache.org/jira/browse/HBASE-28045
 Project: HBase
  Issue Type: Bug
  Components: UI
Affects Versions: 2.5.2
Reporter: Rushabh Shah
 Attachments: Screenshot 2023-08-24 at 11.08.54 AM.png

An image is worth 1000 words.

!Screenshot 2023-08-24 at 11.08.54 AM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28039) Create metric for region in transition count per table.

2023-08-22 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-28039:


 Summary: Create metric for region in transition count per table.
 Key: HBASE-28039
 URL: https://issues.apache.org/jira/browse/HBASE-28039
 Project: HBase
  Issue Type: Bug
Reporter: Rushabh Shah


Currently we have ritCount and ritCountOverThreshold metrics for the whole
cluster. It would be nice to have the ritCount and ritCountOverThreshold metrics
per table. That would help in debugging failed queries for a given table due to
regions in transition (RIT).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27957) HConnection (and ZooKeeperWatcher threads) leak in case of AUTH_FAILED exception.

2023-06-30 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27957:


 Summary: HConnection (and ZooKeeperWatcher threads) leak in case of 
AUTH_FAILED exception.
 Key: HBASE-27957
 URL: https://issues.apache.org/jira/browse/HBASE-27957
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.17, 1.7.2
Reporter: Rushabh Shah


Observed this in a production environment running some version of the 1.7 release.
The application didn't have the right keytab set up for authentication. The
application was trying to create an HConnection and the zookeeper server threw an
AUTH_FAILED exception.
After a few hours of the application being in this state, we saw thousands of
zk-event-processor threads with the below stack trace.
{noformat}
"zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms 
elapsed=41794.58s tid=0x7fd7805066d0 nid=0x1245 waiting on condition  
[0x7fd75df01000]
   java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native Method)
- parking to wait for  <0x7fd9874a85e0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081)
at 
java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628)
{noformat}
{code:java|title=ConnectionManager.java|borderStyle=solid}
HConnectionImplementation(Configuration conf, boolean managed,
ExecutorService pool, User user, String clusterId) throws IOException {
...
...
try {
   this.registry = setupRegistry();
   retrieveClusterId();
   ...
   ...
} catch (Throwable e) {
   // avoid leaks: registry, rpcClient, ...
   LOG.debug("connection construction failed", e);
   close();
   throw e;
 }
{code}
retrieveClusterId internally calls ZKConnectionRegistry#getClusterId
{code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
  private String clusterId = null;

  @Override
  public String getClusterId() {
if (this.clusterId != null) return this.clusterId;
// No synchronized here, worse case we will retrieve it twice, that's
//  not an issue.
try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) 
{
  this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
  if (this.clusterId == null) {
LOG.info("ClusterId read in ZooKeeper is null");
  }
    } catch (KeeperException | IOException e) {
      // --->  WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
      LOG.warn("Can't retrieve clusterId from Zookeeper", e);
    }
return this.clusterId;
  }
{code}

ZKConnectionRegistry#getClusterId threw the following exception. (Our logging
system trims stack traces longer than 5 lines.)
{noformat}
Cause: org.apache.zookeeper.KeeperException$AuthFailedException: 
KeeperErrorCode = AuthFailed for /hbase/hbaseid
StackTrace: 
org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
{noformat}

We should propagate the KeeperException from ZKConnectionRegistry#getClusterId all
the way back to the HConnectionImplementation constructor so that it closes all the
watcher threads and throws the exception back to the caller.
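A sketch of that direction (illustrative, not the committed patch), based on the getClusterId code quoted above:
{code:java}
  @Override
  public String getClusterId() throws IOException {
    if (this.clusterId != null) {
      return this.clusterId;
    }
    try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
      this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
      if (this.clusterId == null) {
        LOG.info("ClusterId read in ZooKeeper is null");
      }
    } catch (KeeperException e) {
      // Propagate instead of swallowing, so HConnectionImplementation's catch block
      // runs close() and the ZooKeeper watcher threads are released.
      throw new IOException("Can't retrieve clusterId from Zookeeper", e);
    }
    return this.clusterId;
  }
{code}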





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27908) Can't get connection to ZooKeeper

2023-06-05 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-27908.
--
Resolution: Invalid

> Can't get connection to ZooKeeper
> -
>
> Key: HBASE-27908
> URL: https://issues.apache.org/jira/browse/HBASE-27908
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.4.13
>Reporter: Ibrar Ahmed
>Priority: Major
>
> I am using an HBase cluster along with Apache Kylin; the connection between the
> edge node and the HBase cluster is good.
> The following are the logs from the Kylin side, which show the error exception:
> {code:java}
> java.net.SocketTimeoutException: callTimeout=120, callDuration=1275361: 
> org.apache.hadoop.hbase.MasterNotRunningException: Can't get connection to 
> ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase 
>     at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:178)
>     at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4551)
>     at 
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:561)
>     at 
> org.apache.hadoop.hbase.client.HTable.getTableDescriptor(HTable.java:585)
>     at 
> org.apache.kylin.storage.hbase.steps.HFileOutputFormat3.configureIncrementalLoad(HFileOutputFormat3.java:328)
>     at 
> org.apache.kylin.storage.hbase.steps.CubeHFileJob.run(CubeHFileJob.java:101)
>     at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:144)
>     at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
>     at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>     at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
>     at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hadoop.hbase.MasterNotRunningException: 
> org.apache.hadoop.hbase.MasterNotRunningException: Can't get connection to 
> ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1618)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1638)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1795)
>     at 
> org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
>     at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:140)
>     ... 13 more
> Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Can't get 
> connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(ConnectionManager.java:971)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.access$400(ConnectionManager.java:566)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1567)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1609)
>     ... 17 more
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for /hbase
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:)
>     at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:425)
>     at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(ConnectionManager.java:960)
>     ... 20 more {code}
> The following are the logs from the HBase cluster master node, which accepts the
> connection from the edge node (Kylin):
> {code:java}
> 2023-06-05 10:00:30,336 [myid:0] - INFO 
> [CommitProcessor:0:NIOServerCnxn@1056] - Closed socket connection for client 
> /10.127.2.201:37328 which had sessionid 0x7311c000c
> 2023-06-05 13:14:48,346 [myid:0] - INFO 
> [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task 

[jira] [Resolved] (HBASE-26913) Replication Observability Framework

2023-04-20 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-26913.
--
Resolution: Fixed

> Replication Observability Framework
> ---
>
> Key: HBASE-26913
> URL: https://issues.apache.org/jira/browse/HBASE-26913
> Project: HBase
>  Issue Type: New Feature
>  Components: regionserver, Replication
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> In our production clusters, we have seen cases where data is present in the
> source cluster but not in the sink cluster, and one case where data is present
> in the sink cluster but not in the source cluster.
> We have internal tools that take an incremental backup every day on both the
> source and sink clusters and compare the hash of the data in both backups. We
> have seen many cases where the hashes don't match, which means the data is not
> consistent between source and sink for that given day. The Mean Time To Detect
> (MTTD) these inconsistencies is at least 2 days and requires a lot of manual
> debugging.
> We need a tool that reduces MTTD and requires less manual debugging.
> I have attached the design doc. Huge thanks to [~bharathv] for coming up with
> this design at my workplace.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27697) Create a dummy metric implementation in ConnectionImplementation.

2023-03-09 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27697:


 Summary: Create a dummy metric implementation in 
ConnectionImplementation.
 Key: HBASE-27697
 URL: https://issues.apache.org/jira/browse/HBASE-27697
 Project: HBase
  Issue Type: Bug
  Components: metrics
Affects Versions: 2.5.0
Reporter: Rushabh Shah


This Jira is for branch-2 only.

If the CLIENT_SIDE_METRICS_ENABLED_KEY conf is set to true, then we initialize
metrics to a MetricsConnection; otherwise it is set to null.
{code:java}
if (conf.getBoolean(CLIENT_SIDE_METRICS_ENABLED_KEY, false)) {
  this.metricsScope = MetricsConnection.getScope(conf, clusterId, this);
  this.metrics = MetricsConnection.getMetricsConnection(this.metricsScope,
    this::getBatchPool, this::getMetaLookupPool);
} else {
  this.metrics = null;
}
{code}

Whenever we want to update the metrics, we always have to do a null check. We can
improve this by creating a dummy metrics object with an empty implementation. When
we want to populate the metrics, we can check whether the metrics object is an
instance of the dummy metrics.
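A minimal sketch of that null-object idea (hypothetical names, not the actual branch-2 patch):
{code:java}
// Hypothetical interface standing in for the metrics surface used by the connection.
interface ConnectionMetricsSink {
  void incrMetaCacheHit();
  void incrMetaCacheMiss();

  // The "dummy" implementation: every method is a no-op.
  ConnectionMetricsSink NOOP = new ConnectionMetricsSink() {
    @Override public void incrMetaCacheHit() { /* no-op */ }
    @Override public void incrMetaCacheMiss() { /* no-op */ }
  };
}

// The connection then always holds a non-null sink:
//   this.metrics = clientMetricsEnabled ? realMetrics : ConnectionMetricsSink.NOOP;
// and call sites update unconditionally, with no null checks:
//   metrics.incrMetaCacheHit();
{code}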



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.

2022-11-03 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-27100.
--
Fix Version/s: 3.0.0-alpha-4
   Resolution: Fixed

> Add documentation for Replication Observability Framework in hbase book.
> 
>
> Key: HBASE-27100
> URL: https://issues.apache.org/jira/browse/HBASE-27100
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-26925) Create WAL event tracker table to track all the WAL events.

2022-11-03 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-26925.
--
Fix Version/s: 3.0.0-alpha-4
   Resolution: Fixed

> Create WAL event tracker table to track all the WAL events.
> ---
>
> Key: HBASE-26925
> URL: https://issues.apache.org/jira/browse/HBASE-26925
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> Design Doc: 
> [https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#]
> Create wal event tracker table to track WAL events. Whenever we roll the WAL, 
> we will save the WAL name, WAL length, region server, timestamp in a table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.

2022-11-03 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-27085.
--
Fix Version/s: 3.0.0-alpha-4
   Resolution: Fixed

> Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from 
> source cluster.
> --
>
> Key: HBASE-27085
> URL: https://issues.apache.org/jira/browse/HBASE-27085
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> This work is to create sink tracker table to persist tracker rows coming from 
> replication source cluster. 
> Create ReplicationMarkerChore to create replication marker rows periodically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27461) Add multiWAL support for Replication Observability framework.

2022-11-02 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27461:


 Summary: Add multiWAL support for Replication Observability 
framework.
 Key: HBASE-27461
 URL: https://issues.apache.org/jira/browse/HBASE-27461
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Reporter: Rushabh Shah


HBASE-26913 added a new framework for observing the health of the replication
subsystem. Currently we add the replication marker row to just one WAL (i.e. the
default WAL). We need to add support for the multi-WAL implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27383) Add dead region server to SplitLogManager#deadWorkers set as the first step.

2022-09-21 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27383:


 Summary: Add dead region server to SplitLogManager#deadWorkers set 
as the first step.
 Key: HBASE-27383
 URL: https://issues.apache.org/jira/browse/HBASE-27383
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.7.2, 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Currently we add a dead region server to the +SplitLogManager#deadWorkers+ set in
the SERVER_CRASH_SPLIT_LOGS state.
Consider a case where a region server is handling the split log task for the
hbase:meta table, SplitLogManager has exhausted all the retries and won't try any
more region servers, and the region server which is handling the split log task
has died.
We have a check in SplitLogManager where, if a region server is declared dead and
that region server is responsible for a split log task, then we forcefully
resubmit the split log task. See the code
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726].

But we add a region server to the SplitLogManager#deadWorkers set in the
[SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252]
 state.
Before that it runs the
[SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214]
 state and checks whether the hbase:meta table is up. In this case, the hbase:meta
table was not online, and that prevented SplitLogManager from adding this RS to
the deadWorkers list. This created a deadlock, and the hbase cluster was
completely down for an extended period of time until we failed over the active
HMaster. See HBASE-27382 for more details.

Improvements:
1. We should add a dead region server to the +SplitLogManager#deadWorkers+ list as
the first step.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27382) Cluster completely down due to wal splitting failing for hbase:meta table.

2022-09-21 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27382:


 Summary: Cluster completely down due to wal splitting failing for 
hbase:meta table.
 Key: HBASE-27382
 URL: https://issues.apache.org/jira/browse/HBASE-27382
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.7.2
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We are running some version of 1.7.2 in our production environment. We
encountered this issue recently.
We colocate the namenode and the region server holding the hbase:meta table on a
set of 5 master nodes. Coincidentally, the active namenode and the region server
holding the meta table were on the same physical node, and that node went down due
to a hardware issue. We have suboptimal HDFS-level timeouts configured, so whenever
the active namenode goes down, it takes around 12-15 minutes for the HDFS client
within HBase to connect to the new active namenode. So all the region servers had
problems connecting to the new active namenode for about 15 minutes.

Below is the sequence of events:

1. The host running the active namenode and hbase:meta went down at +2022-09-09
16:56:56,878+.
2. The HMaster started running ServerCrashProcedure at +2022-09-09 16:59:05,696+.
{noformat}
2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] 
procedure2.ProcedureExecutor - Procedure ServerCrashProcedure 
serverName=,61020,1662714013670, shouldSplitWal=true, 
carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to 
the store.

2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - 
Added=,61020,1662714013670 to dead servers, submitted shutdown 
handler to be executed meta=true

2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - Started 
processing ,61020,1662714013670; numProcessing=1
2022-09-09 16:59:05,712 INFO  [ProcedureExecutor-0] 
procedure.ServerCrashProcedure - Start processing crashed 
,61020,1662714013670
{noformat}

3. SplitLogManager created 2 split log tasks in zookeeper.

{noformat}
2022-09-09 16:59:06,049 INFO  [ProcedureExecutor-1] master.SplitLogManager - 
Started splitting 2 logs in 
[hdfs:///hbase/WALs/,61020,1662714013670-splitting]
 for [,61020,1662714013670]

2022-09-09 16:59:06,081 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - put up splitlog task at znode 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 16:59:06,093 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - put up splitlog task at znode 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662739251611.meta
{noformat}


4. The first split log task is more interesting: 
+/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta+

5. Since all the region servers were having problems connecting to the active
namenode, SplitLogManager tried a total of 4 times to assign this task (3
resubmits, configured by hbase.splitlog.max.resubmit) and then finally gave up.

{noformat}
-- try 1 -
2022-09-09 16:59:06,205 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662540522069

-- try 2 -

2022-09-09 17:01:06,642 INFO  [ager__ChoreService_1] 
coordination.SplitLogManagerCoordination - resubmitting task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 17:01:06,666 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - task not yet acquired 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 ver = 2

2022-09-09 17:01:06,715 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662530684713

-- try 3 -

2022-09-09 17:03:07,643 INFO  [ager__ChoreService_1] 
coordination.SplitLogManagerCoordination - resubmitting task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 17:03:07,687 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - task not yet acquired 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 ver = 4

2022-09-09 17:03:07,738 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662542355806


-- try 4 -
2022-09-09 

[jira] [Created] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.

2022-06-08 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27100:


 Summary: Add documentation for Replication Observability Framework 
in hbase book.
 Key: HBASE-27100
 URL: https://issues.apache.org/jira/browse/HBASE-27100
 Project: HBase
  Issue Type: Sub-task
  Components: documentation
Reporter: Rushabh Shah
Assignee: Rushabh Shah






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.

2022-06-02 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-27085:


 Summary: Create REPLICATION_SINK_TRACKER table to persist sentinel 
rows coming from source cluster.
 Key: HBASE-27085
 URL: https://issues.apache.org/jira/browse/HBASE-27085
 Project: HBase
  Issue Type: Sub-task
Reporter: Rushabh Shah
Assignee: Rushabh Shah


This work is to create sink tracker table to persist tracker rows coming from 
replication source cluster. 

Create ReplicationMarkerChore to create replication marker rows periodically.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-26963) ReplicationSource#removePeer hangs if we try to remove bad peer.

2022-04-20 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26963:


 Summary: ReplicationSource#removePeer hangs if we try to remove 
bad peer.
 Key: HBASE-26963
 URL: https://issues.apache.org/jira/browse/HBASE-26963
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.11
Reporter: Rushabh Shah


ReplicationSource#removePeer hangs if we try to remove a bad peer.

Steps to reproduce:
1. Set the config replication.source.regionserver.abort to false so that it doesn't
abort the regionserver.
2. Add a dummy peer.
3. Remove that peer.

The removePeer call will hang indefinitely until the test times out.
Attached is a patch to reproduce the above behavior.

I can see the following threads in the stack trace:


{noformat}
"RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1"
 #339 daemon prio=5 os_prio=31 tid=0x7f8caa
44a800 nid=0x22107 waiting on condition [0x7000107e5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown
 Source)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
{noformat}


{noformat}
"RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 
os_prio=31 tid=0x7f8ca82fa800 nid=0x22307 in Object.wait() 
[0x7000106e2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1260)
- locked <0x000799975ea0> (a java.lang.Thread)
at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

{noformat}
"Listener at localhost/55013" #20 daemon prio=5 os_prio=31 
tid=0x7f8caf95a000 nid=0x6703 waiting on condition [0x72
544000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
at 
org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
at 
org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
at 
org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)
{noformat}

The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is 
waiting for Admin#removeReplicationPeer.

The refreshPeer thread (PeerProcedureHandlerImpl#removePeer, #338), which is responsible 
for terminating the peer, is waiting for the ReplicationSource thread to terminate.

The ReplicationSource thread (#339) is sleeping. Notice that this thread's stack 
trace is inside the ReplicationSource#uncaughtException handler.
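
A stripped-down illustration of why the join hangs (plain Java, not HBase code): a worker 
that keeps sleeping and retrying, and that swallows the interrupt, never exits, so the 
thread calling join() on it waits forever.

{code:java}
// Illustrative only -- not HBase code. A retry loop that swallows the interrupt
// never exits, so a Threads.shutdown()-style join() blocks forever.
public class RemovePeerHangDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread source = new Thread(() -> {
      while (true) {                // keeps "retrying" the bad peer
        try {
          Thread.sleep(1000L);      // sleepForRetries-style backoff
        } catch (InterruptedException e) {
          // interrupt swallowed: the termination signal is lost
        }
      }
    });
    source.start();
    source.interrupt();             // roughly what terminate()/removePeer sends
    source.join();                  // hangs, just like the removePeer call
  }
}
{code}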

When we call ReplicationSourceManager#removePeer, we set the sourceRunning flag to 
false and send an interrupt signal to the ReplicationSource thread 

[jira] [Created] (HBASE-26957) Add support to hbase shell to remove coproc by its class name instead of coproc ID

2022-04-15 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26957:


 Summary: Add support to hbase shell to remove coproc by its class 
name instead of coproc ID
 Key: HBASE-26957
 URL: https://issues.apache.org/jira/browse/HBASE-26957
 Project: HBase
  Issue Type: Bug
  Components: Coprocessors, shell
Affects Versions: 1.7.1
Reporter: Rushabh Shah


The syntax for removing a coproc is as below:
  hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'

We have to use the coproc ID to remove a coproc from a given table.

Consider the following scenario. Due to some bug in a coproc, we have to remove 
a given coproc from all the tables in a cluster. Every table can have a different 
set of co-procs, and for a given co-proc class the coproc ID will not be the same 
across all the tables in a cluster. This gets even more complex if we want to 
remove the co-proc from all the production clusters. 

Instead, we can pass a co-proc class name to the alter table command. If a table 
has that co-proc it will be removed; otherwise nothing happens.
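
For reference, a rough Java sketch of the same cluster-wide cleanup against the 2.x 
Admin/TableDescriptor API (the class and helper method names are illustrative, and this 
is separate from the proposed shell change itself):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

// Illustrative helper: drop a coprocessor by class name from every table that
// carries it, instead of hunting down each table's "coprocessor$N" attribute key.
public final class RemoveCoprocByClassName {
  public static void remove(Admin admin, String coprocClass) throws IOException {
    for (TableDescriptor td : admin.listTableDescriptors()) {
      if (td.hasCoprocessor(coprocClass)) {
        admin.modifyTable(TableDescriptorBuilder.newBuilder(td)
            .removeCoprocessor(coprocClass)   // remove by class name, not by ID
            .build());
      }
    }
  }
}
{code}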



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26925) Create WAL event tracker table to track all the WAL events.

2022-04-04 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26925:


 Summary: Create WAL event tracker table to track all the WAL 
events.
 Key: HBASE-26925
 URL: https://issues.apache.org/jira/browse/HBASE-26925
 Project: HBase
  Issue Type: Sub-task
  Components: wal
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Design Doc: 
[https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#]

Create a WAL event tracker table to track WAL events. Whenever we roll the WAL, 
we will save the WAL name, WAL length, region server, and timestamp in a table.
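
A minimal sketch of what persisting one such roll event could look like (the row key 
and column names below are placeholders, not the schema from the design doc):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Placeholder schema: one row per WAL roll, keyed by server + timestamp + wal name.
public final class WalEventTrackerSketch {
  private static final byte[] INFO = Bytes.toBytes("info");

  public static void recordRoll(Table tracker, String regionServer, String walName,
      long walLength, long timestamp) throws IOException {
    Put put = new Put(Bytes.toBytes(regionServer + "/" + timestamp + "/" + walName));
    put.addColumn(INFO, Bytes.toBytes("wal_name"), Bytes.toBytes(walName));
    put.addColumn(INFO, Bytes.toBytes("wal_length"), Bytes.toBytes(walLength));
    put.addColumn(INFO, Bytes.toBytes("region_server"), Bytes.toBytes(regionServer));
    put.addColumn(INFO, Bytes.toBytes("timestamp"), Bytes.toBytes(timestamp));
    tracker.put(put);
  }
}
{code}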



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26913) Replication Observability Framework

2022-04-01 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26913:


 Summary: Replication Observability Framework
 Key: HBASE-26913
 URL: https://issues.apache.org/jira/browse/HBASE-26913
 Project: HBase
  Issue Type: New Feature
  Components: regionserver, Replication
Reporter: Rushabh Shah
Assignee: Rushabh Shah


In our production clusters, we have seen cases where data is present in the source 
cluster but not in the sink cluster, and one case where data is present in the sink 
cluster but not in the source cluster. 

We have internal tools that take an incremental backup every day on both the source 
and sink clusters and compare the hash of the data in both backups. We have seen many 
cases where the hashes don't match, which means the data is not consistent between 
source and sink for that day. The Mean Time To Detect (MTTD) these inconsistencies is 
at least 2 days and requires a lot of manual debugging.

We need a tool that reduces MTTD and requires less manual debugging.

I have attached the design doc. Huge thanks to [~bharathv] for coming up with this 
design at my workplace.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26905) ReplicationPeerManager#checkPeerExists should throw ReplicationPeerNotFoundException if peer doesn't exists

2022-03-29 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26905:


 Summary: ReplicationPeerManager#checkPeerExists should throw 
ReplicationPeerNotFoundException if peer doesn't exists
 Key: HBASE-26905
 URL: https://issues.apache.org/jira/browse/HBASE-26905
 Project: HBase
  Issue Type: Bug
  Components: Replication
Reporter: Rushabh Shah


ReplicationPeerManager#checkPeerExists should throw 
ReplicationPeerNotFoundException if the peer doesn't exist. Currently it throws 
a generic DoNotRetryIOException.
{code:java}
private ReplicationPeerDescription checkPeerExists(String peerId) throws DoNotRetryIOException {
  ReplicationPeerDescription desc = peers.get(peerId);
  if (desc == null) {
    throw new DoNotRetryIOException("Replication peer " + peerId + " does not exist");
  }
  return desc;
}
{code}
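
A sketch of the proposed change (ReplicationPeerNotFoundException already extends 
DoNotRetryIOException, so callers catching the broader type should keep working):

{code:java}
// Sketch of the proposed behavior: surface a peer-specific exception instead of
// the generic DoNotRetryIOException.
private ReplicationPeerDescription checkPeerExists(String peerId)
    throws ReplicationPeerNotFoundException {
  ReplicationPeerDescription desc = peers.get(peerId);
  if (desc == null) {
    throw new ReplicationPeerNotFoundException(peerId);
  }
  return desc;
}
{code}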



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26792) Implement ScanInfo#toString

2022-03-02 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26792:


 Summary: Implement ScanInfo#toString
 Key: HBASE-26792
 URL: https://issues.apache.org/jira/browse/HBASE-26792
 Project: HBase
  Issue Type: Improvement
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We don't have a ScanInfo#toString implementation. We use ScanInfo while creating a 
StoreScanner, which is used in preFlushScannerOpen in coprocessors.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26702) Make ageOfLastShip extend TimeHistogram instead of plain histogram.

2022-01-24 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26702:


 Summary: Make ageOfLastShip extend TimeHistogram instead of plain 
histogram.
 Key: HBASE-26702
 URL: https://issues.apache.org/jira/browse/HBASE-26702
 Project: HBase
  Issue Type: Improvement
  Components: metrics, Replication
Affects Versions: 2.3.7, 1.7.1, 3.0.0-alpha-3
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Currently the age-of-last-shipped metric is an instance of a plain Histogram type. 
[Here|https://github.com/apache/hbase/blob/master/hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSourceImpl.java#L58]
{quote}
   ageOfLastShippedOpHist = 
rms.getMetricsRegistry().getHistogram(SOURCE_AGE_OF_LAST_SHIPPED_OP);
{quote}

We can change it to a TimeHistogram so that we also get the range information. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26480) Close NamedQueueRecorder to allow HMaster/RS to shutdown gracefully

2021-11-22 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26480:


 Summary: Close NamedQueueRecorder to allow HMaster/RS to shutdown 
gracefully
 Key: HBASE-26480
 URL: https://issues.apache.org/jira/browse/HBASE-26480
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Saw one case in our production cluster where RS was not exiting. Saw this 
non-daemon thread in hung RS stack trace:

{noformat}
"main.slowlog.append-pool-pool1-t1" #26 prio=5 os_prio=31 
tid=0x7faf23bf7800 nid=0x6d07 waiting on condition [0x73f4d000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0004039e3840> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
at 
com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
at 
com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
        
This is coming from 
[NamedQueueRecorder|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/namequeues/NamedQueueRecorder.java#L65]
 implementation. 
This bug doesn't exist in branch-2 and master since the Disruptor initialization has 
changed and we also set daemon=true there. See [this 
code|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/namequeues/NamedQueueRecorder.java#L68]
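
Roughly, the branch-2 style of construction looks like the sketch below (plain LMAX 
Disruptor API with a placeholder event type; not the actual branch-2 code):

{code:java}
import java.util.concurrent.ThreadFactory;
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

// Sketch only: give the Disruptor a thread factory that marks its consumer
// threads as daemons, so the namedqueue consumer cannot block JVM exit.
public final class DaemonDisruptorSketch {
  static final class Event { Object payload; }   // placeholder event type

  public static Disruptor<Event> build() {
    ThreadFactory daemonFactory = runnable -> {
      Thread t = new Thread(runnable, "namedqueue.append-pool");
      t.setDaemon(true);                         // the key difference from branch-1
      return t;
    };
    return new Disruptor<>(Event::new, 1024, daemonFactory,
        ProducerType.MULTI, new BlockingWaitStrategy());
  }
}
{code}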
 
FYI [~vjasani] [~zhangduo]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26468) Region Server doesn't exit cleanly incase it crashes.

2021-11-18 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26468:


 Summary: Region Server doesn't exit cleanly incase it crashes.
 Key: HBASE-26468
 URL: https://issues.apache.org/jira/browse/HBASE-26468
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 1.6.0
Reporter: Rushabh Shah


Observed this in our production cluster running the 1.6 version.
The RS crashed for some reason but the process was still running. On debugging 
more, we found there was one non-daemon thread running that was not allowing the 
RS to stop cleanly. Our clusters are managed by Ambari and have auto-restart 
capability, but since the process was running and the pid file was present, Ambari 
couldn't do much either. There will always be some bug where we miss stopping a 
non-daemon thread, but there should be a maximum amount of time we wait before 
forcing the process to exit.

Relevant code: 
[HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java]


{code:java}
logProcessInfo(getConf());
HRegionServer hrs = HRegionServer.constructRegionServer(regionServerClass, conf);
hrs.start();
hrs.join();  // -> this should be a timed join
if (hrs.isAborted()) {
  throw new RuntimeException("HRegionServer Aborted");
}
{code}
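
A possible shape of the timed join, continuing the snippet above (the timeout value 
and the forced halt are illustrative, not a committed fix):

{code:java}
hrs.start();
hrs.join(5 * 60 * 1000L);        // timed join instead of join(): wait at most 5 minutes
if (hrs.isAlive()) {
  // a leaked non-daemon thread is keeping the JVM up; halt() is not blocked by it
  Runtime.getRuntime().halt(1);
}
if (hrs.isAborted()) {
  throw new RuntimeException("HRegionServer Aborted");
}
{code}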




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26442) TestReplicationEndpoint#testInterClusterReplication fails in branch-1

2021-11-10 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26442:


 Summary: TestReplicationEndpoint#testInterClusterReplication fails 
in branch-1
 Key: HBASE-26442
 URL: https://issues.apache.org/jira/browse/HBASE-26442
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.7.1
Reporter: Rushabh Shah
Assignee: Rushabh Shah


{noformat}
[INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ hbase-server ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.hadoop.hbase.replication.TestReplicationEndpoint
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 20.978 
s <<< FAILURE! - in org.apache.hadoop.hbase.replication.TestReplicationEndpoint
[ERROR] org.apache.hadoop.hbase.replication.TestReplicationEndpoint  Time 
elapsed: 3.921 s  <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.hbase.replication.TestReplicationEndpoint.tearDownAfterClass(TestReplicationEndpoint.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestReplicationEndpoint.tearDownAfterClass:88
[INFO] 
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26435) [branch-1] The log rolling request maybe canceled immediately in LogRoller due to a race

2021-11-09 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26435:


 Summary: [branch-1] The log rolling request maybe canceled 
immediately in LogRoller due to a race 
 Key: HBASE-26435
 URL: https://issues.apache.org/jira/browse/HBASE-26435
 Project: HBase
  Issue Type: Sub-task
  Components: wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah
 Fix For: 1.7.2


Saw this issue in our internal 1.6 branch.

The WAL was rolled, but the new WAL file was not writable, and it also logged the 
following error:
{noformat}
2021-11-03 19:20:19,503 WARN  [.168:60020.logRoller] hdfs.DFSClient - Error 
while syncing
java.io.IOException: Could not get block locations. Source file 
"/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389"
 - Aborting...
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)

2021-11-03 19:20:19,507 WARN  [.168:60020.logRoller] wal.FSHLog - pre-sync 
failed but an optimization so keep going
java.io.IOException: Could not get block locations. Source file 
"/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389"
 - Aborting...
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
{noformat}

Since the new WAL file was not writable, appends to that file started failing 
immediately after it was rolled.

{noformat}
2021-11-03 19:20:19,677 INFO  [.168:60020.logRoller] wal.FSHLog - Rolled WAL 
/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635965392022
 with entries=253234, filesize=425.67 MB; new WAL 
/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389


2021-11-03 19:20:19,690 WARN  [020.append-pool17-t1] wal.FSHLog - Append 
sequenceId=1962661783, requesting roll of WAL
java.io.IOException: Could not get block locations. Source file 
"/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389"
 - Aborting...
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)


2021-11-03 19:20:19,690 INFO  [.168:60020.logRoller] wal.FSHLog - Archiving 
hdfs://prod-EMPTY-hbase2a/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635960792837
 to 
hdfs://prod-EMPTY-hbase2a/hbase/oldWALs/hbase2a-dnds1-232-ukb.ops.sfdc.net%2C60020%2C1635567166484.1635960792837
{noformat}

We always reset the rollLog flag within the LogRoller thread after the rollWal call 
is complete.
The FSHLog#rollWriter method does many things, like replacing the writer and 
archiving old logs. If the append thread fails to write to the new file while the 
LogRoller thread is still cleaning old logs, we will miss that roll request, because 
LogRoller resets the flag to false only after the previous rollWriter call finishes, 
wiping out any roll requested while that call was in progress.
Relevant code: 
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java#L183-L203

We need to reset the rollLog flag before we start rolling the wal. 
This was fixed in branch-2 and master via HBASE-22684, but we didn't fix it in 
branch-1.
Also, branch-2 has the multi-WAL implementation, so the branch-2 fix cannot be 
applied cleanly to branch-1.
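
In sketch form (plain Java around an AtomicBoolean, not the actual LogRoller code), 
the proposed ordering looks like this:

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: clear the roll request *before* calling rollWriter(), so a roll
// requested while the previous roll is still archiving logs is not wiped out.
final class LogRollerSketch {
  interface WalLike { void rollWriter() throws IOException; }

  private final AtomicBoolean rollLog = new AtomicBoolean(false);
  private volatile boolean running = true;

  void requestRoll() {              // called e.g. by a failing append
    rollLog.set(true);
  }

  void rollLoop(WalLike wal) throws IOException {
    while (running) {               // the real roller waits/sleeps between checks
      if (rollLog.compareAndSet(true, false)) {   // reset BEFORE the roll starts
        wal.rollWriter();           // may take a while (replacing writer, archiving)
        // any roll requested during rollWriter() is still pending and will be
        // honored on the next iteration instead of being reset away afterwards
      }
    }
  }
}
{code}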



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26404) Update javadoc for CellUtil#createCell with tags methods.

2021-10-27 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26404:


 Summary: Update javadoc for CellUtil#createCell  with tags methods.
 Key: HBASE-26404
 URL: https://issues.apache.org/jira/browse/HBASE-26404
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.8
Reporter: Rushabh Shah


We have the following methods in the CellUtil class which are deprecated. We used 
to use these methods within custom coprocessors to create cells with custom tags. We 
deprecated them in the 2.0.0 release and created a new class called RawCell, which 
is LimitedPrivate with visibility set to COPROC. There is no reference to the 
RawCell#createCell(Cell cell, List<Tag> tags) method in the javadoc of the 
now-deprecated CellUtil#createCell methods. This is not user friendly. We should 
improve the javadoc of the CellUtil#createCell(Cell, tags) methods.

{noformat}
  /**
   * Note : Now only CPs can create cell with tags using the CP environment
   * @return A new cell which is having the extra tags also added to it.
   * @deprecated As of release 2.0.0, this will be removed in HBase 3.0.0.
   */
  @Deprecated
  public static Cell createCell(Cell cell, List<Tag> tags) {
    return PrivateCellUtil.createCell(cell, tags);
  }
{noformat}

{noformat}
  public static Cell createCell(Cell cell, byte[] tags)
  public static Cell createCell(Cell cell, byte[] value, byte[] tags)
{noformat}
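
The improvement could be as small as pointing the deprecation note at the 
replacement, e.g. (wording illustrative):

{code:java}
  /**
   * Note : Now only CPs can create cell with tags using the CP environment.
   * @return A new cell which is having the extra tags also added to it.
   * @deprecated As of release 2.0.0, this will be removed in HBase 3.0.0.
   *             Coprocessors should use {@link RawCell#createCell(Cell, List)}
   *             (LimitedPrivate, visibility COPROC) instead.
   */
  @Deprecated
  public static Cell createCell(Cell cell, List<Tag> tags) {
    return PrivateCellUtil.createCell(cell, tags);
  }
{code}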





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-26195) Data is present in replicated cluster but not present in primary cluster.

2021-09-08 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-26195.
--
Resolution: Fixed

> Data is present in replicated cluster but not present in primary cluster.
> -
>
> Key: HBASE-26195
> URL: https://issues.apache.org/jira/browse/HBASE-26195
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 1.7.0
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 1.8.0
>
>
> We encountered a case where we are seeing some rows (via Phoenix) in 
> replicated cluster but they are not present in source/active cluster.
> Triaging further we found memstore rollback logs in few of the region servers.
> {noformat}
> 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
> rollbackMemstore rolled back 23
> {noformat}
> Looking more into logs, found that there were some hdfs layer issues sync'ing 
> wal to hdfs.
> It was taking around 6 mins to sync wal. Logs below
> {noformat}
> 2021-07-28 14:19:30,511 WARN  [sync.0] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391210ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.1] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391148ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.2] hdfs.DataStreamer - Slow 
> waitForAckedSeqno took 391147ms (threshold=3ms). File being written: 
> /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
>  block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
> pipeline datanodes: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.0] wal.FSHLog - Slow sync cost: 391289 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.1] wal.FSHLog - Slow sync cost: 391227 
> ms, current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.1] wal.FSHLog - Requesting log roll 
> because we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, 
> current pipeline: 
> [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
>  
> DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 

[jira] [Created] (HBASE-26247) TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance doesn't read archived WAL file.

2021-09-01 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26247:


 Summary: 
TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance doesn't read 
archived WAL file. 
 Key: HBASE-26247
 URL: https://issues.apache.org/jira/browse/HBASE-26247
 Project: HBase
  Issue Type: Bug
Reporter: Rushabh Shah


TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance is testing the 
following scenario.
1. Create a new WAL file
2. Write 2 KVs to WAL file.
3. Close the WAL file.
4. Instantiate WALInputFormat#WALKeyRecordReader with the WAL created in step 1.
5. Read the first KV.
6. Archive the WAL file in oldWALs directory via rename operation.
7. Read the second KV. This tests that WALKeyRecordReader will encounter an FNFE, 
since the WAL file is no longer present in the original location, and will handle 
the FNFE by opening the WAL file from the archived location.

In step 7, the test expects to encounter an FNFE and open a new reader, but in 
reality it never encounters the FNFE. I think the reason is that during a rename 
operation HDFS just changes the internal metadata for the file name; the inode ID, 
HDFS blocks, and block locations remain the same. While reading the first KV, 
DFSInputStream caches all the HDFS block and location data, so it doesn't have to 
go back to the namenode to re-resolve the file name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26195) Data is present in replicated cluster but not visible on primary cluster.

2021-08-12 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26195:


 Summary: Data is present in replicated cluster but not visible on 
primary cluster.
 Key: HBASE-26195
 URL: https://issues.apache.org/jira/browse/HBASE-26195
 Project: HBase
  Issue Type: Bug
  Components: Replication, wal
Affects Versions: 1.7.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We encountered a case where we are seeing some rows (via Phoenix) in the replicated 
cluster but they are not present in the source/active cluster.

Triaging further, we found memstore rollback logs in a few of the region servers.
{noformat}
2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - 
rollbackMemstore rolled back 23
{noformat}

Looking more into the logs, we found that there were some HDFS-layer issues syncing 
the wal to HDFS. It was taking around 6 minutes to sync the wal. Logs below:

{noformat}
2021-07-28 14:19:30,511 WARN  [sync.0] hdfs.DataStreamer - Slow 
waitForAckedSeqno took 391210ms (threshold=3ms). File being written: 
/hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
 block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
pipeline datanodes: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
2021-07-28 14:19:30,589 WARN  [sync.1] hdfs.DataStreamer - Slow 
waitForAckedSeqno took 391148ms (threshold=3ms). File being written: 
/hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
 block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
pipeline datanodes: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
2021-07-28 14:19:30,589 WARN  [sync.2] hdfs.DataStreamer - Slow 
waitForAckedSeqno took 391147ms (threshold=3ms). File being written: 
/hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620,
 block: BP-958889176--1567030695029:blk_1689647875_616028364, Write 
pipeline datanodes: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
2021-07-28 14:19:30,591 INFO  [sync.0] wal.FSHLog - Slow sync cost: 391289 ms, 
current pipeline: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
2021-07-28 14:19:30,591 INFO  [sync.1] wal.FSHLog - Slow sync cost: 391227 ms, 
current pipeline: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
2021-07-28 14:19:30,591 WARN  [sync.1] wal.FSHLog - Requesting log roll because 
we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, current 
pipeline: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
2021-07-28 14:19:30,591 INFO  [sync.2] wal.FSHLog - Slow sync cost: 391227 ms, 
current pipeline: 
[DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK],
 
DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK],
 
DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
2021-07-28 14:19:30,591 WARN  [sync.2] wal.FSHLog - Requesting log roll because 
we exceeded slow 

[jira] [Created] (HBASE-26121) Formatter to convert from epoch time to human readable date format.

2021-07-26 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26121:


 Summary: Formatter to convert from epoch time to human readable 
date format.
 Key: HBASE-26121
 URL: https://issues.apache.org/jira/browse/HBASE-26121
 Project: HBase
  Issue Type: Improvement
  Components: shell
Reporter: Rushabh Shah


In the shell, we have custom formatters to convert bytes to Long/Int for long/int 
data type values.

Many times we store an epoch timestamp (event creation or update time) as a long in 
our table columns. Even after converting such a column to Long, the date is not in a 
human-readable format. We still have to convert the long into a date using bash 
tricks, which is not convenient to do for many columns. We can introduce a new 
format method called +toLongDate+ which signifies that we want to convert the bytes 
to a long first and then to a date.
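
Under the hood the conversion is just bytes -> long -> date. A minimal Java sketch of 
what +toLongDate+ would do (the output pattern and UTC zone are illustrative; the 
actual shell formatters are implemented in the shell's Ruby code):

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the proposed toLongDate conversion: bytes -> long -> readable date.
public final class ToLongDateSketch {
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of("UTC"));

  public static String toLongDate(byte[] value) {
    long epochMillis = Bytes.toLong(value);   // same first step as the existing toLong
    return FMT.format(Instant.ofEpochMilli(epochMillis));
  }
}
{code}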

Please let me know if any such functionality already exists that I am not aware of.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26106) AbstractFSWALProvider#getArchivedLogPath doesn't look for wal file in all oldWALs directory.

2021-07-21 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26106:


 Summary: AbstractFSWALProvider#getArchivedLogPath doesn't look for 
wal file in all oldWALs directory.
 Key: HBASE-26106
 URL: https://issues.apache.org/jira/browse/HBASE-26106
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 2.4.4, 3.0.0-alpha-1, 2.5.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Below is the code for AbstractFSWALProvider#getArchivedLogPath

{code:java}
  public static Path getArchivedLogPath(Path path, Configuration conf) throws IOException {
    Path rootDir = CommonFSUtils.getWALRootDir(conf);
    Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
    if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
      ServerName serverName = getServerNameFromWALDirectoryName(path);
      if (serverName == null) {
        LOG.error("Couldn't locate log: " + path);
        return path;
      }
      oldLogDir = new Path(oldLogDir, serverName.getServerName());
    }
    Path archivedLogLocation = new Path(oldLogDir, path.getName());
    final FileSystem fs = CommonFSUtils.getWALFileSystem(conf);

    if (fs.exists(archivedLogLocation)) {
      LOG.info("Log " + path + " was moved to " + archivedLogLocation);
      return archivedLogLocation;
    } else {
      LOG.error("Couldn't locate log: " + path);
      return path;
    }
  }
{code}

This method is called from the following places.
[AbstractFSWALProvider#openReader|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/AbstractFSWALProvider.java#L524]

[ReplicationSource#getFileSize|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L399]

[WALInputFormat.WALRecordReader#nextKeyValue|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALInputFormat.java#L220]

All of the above callers are trying to find the log in the archive path after they 
couldn't locate the wal in the WALs dir; none of them is involved in moving a log 
file to the archive directory.
But we look for the archived path under the per-server subdirectory only if the 
SEPARATE_OLDLOGDIR conf key is true, and in that case we never fall back to the base 
oldWALs directory.
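
One possible shape of a fix, as a sketch (not the committed patch): check the 
per-server subdirectory when SEPARATE_OLDLOGDIR is enabled, and fall back to the base 
oldWALs directory either way:

{code:java}
// Sketch only: check both candidate archive locations instead of exactly one.
public static Path getArchivedLogPath(Path path, Configuration conf) throws IOException {
  Path rootDir = CommonFSUtils.getWALRootDir(conf);
  Path baseOldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
  final FileSystem fs = CommonFSUtils.getWALFileSystem(conf);

  if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
    ServerName serverName = getServerNameFromWALDirectoryName(path);
    if (serverName != null) {
      Path perServer = new Path(new Path(baseOldLogDir, serverName.getServerName()),
          path.getName());
      if (fs.exists(perServer)) {
        LOG.info("Log " + path + " was moved to " + perServer);
        return perServer;
      }
    }
  }
  Path archivedLogLocation = new Path(baseOldLogDir, path.getName());
  if (fs.exists(archivedLogLocation)) {
    LOG.info("Log " + path + " was moved to " + archivedLogLocation);
    return archivedLogLocation;
  }
  LOG.error("Couldn't locate log: " + path);
  return path;
}
{code}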
Cc [~zhangduo] 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26103) conn.getBufferedMutator(tableName) leaks thread executors and other problems (for master branch))

2021-07-20 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26103:


 Summary: conn.getBufferedMutator(tableName) leaks thread executors 
and other problems (for master branch))
 Key: HBASE-26103
 URL: https://issues.apache.org/jira/browse/HBASE-26103
 Project: HBase
  Issue Type: Sub-task
  Components: Client
Affects Versions: 3.0.0-alpha-1
Reporter: Rushabh Shah
Assignee: Rushabh Shah






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26075:


 Summary: Replication is stuck due to zero length wal file in 
oldWALs directory.
 Key: HBASE-26075
 URL: https://issues.apache.org/jira/browse/HBASE-26075
 Project: HBase
  Issue Type: Bug
  Components: Replication, wal
Affects Versions: 1.7.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Recently we encountered a case where the size of the log queue was increasing to 
around 300 in a few region servers in our production environment.

There were 295 wals in the oldWALs directory for that region server and the 
*first file* was a 0 length file.

Replication was throwing the following error.

{noformat}
2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
 java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
SequenceFile
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ 
not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
at 
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842)
at 
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
... 1 more
{noformat}

We fixed similar error  via HBASE-25536 but the zero length file was in 
recovered sources.

There were more logs after the above stack trace.

{noformat}
2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
information about log 
hdfs:///hbase/WALs/
2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
was moved to hdfs:///hbase/oldWALs/
{noformat}


There is special logic in the ReplicationSourceWALReader thread to handle 0-length 
files, but we only look in the WALs directory and not in the oldWALs directory.

{code}
  private boolean handleEofException(Exception e, WALEntryBatch batch) {
    PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
    // Dump the log even if logQueue size is 1 if the source is from recovered Source
    // since we don't add current log to recovered source queue so it is safe to remove.
    if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
      (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
      Path head = queue.peek();
      try {
        if (fs.getFileStatus(head).getLen() == 0) {
          // head of the queue is an empty log file
          LOG.warn("Forcing removal of 0 length log in queue: {}", head);
          logQueue.remove(walGroupId);
          currentPosition = 0;
          if (batch != null) {
            // After we removed the WAL from the queue, we should try shipping
            // the existing batch of entries
            addBatchToShippingQueue(batch);
          }
          return true;
        }
      } 

[jira] [Created] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.

2021-05-27 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25932:


 Summary: TestWALEntryStream#testCleanClosedWALs test is failing.
 Key: HBASE-25932
 URL: https://issues.apache.org/jira/browse/HBASE-25932
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We are seeing the following test failure: TestWALEntryStream#testCleanClosedWALs.
 This test was added in HBASE-25924. I don't think the test failure has anything to 
do with the patch in HBASE-25924.
 Before HBASE-25924, we were *not* monitoring the _uncleanlyClosedWAL_ metric. In 
all the branches, we were not parsing the wal trailer when we closed the wal reader 
inside the ReplicationSourceWALReader thread. The root cause was that when we add an 
active WAL to ReplicationSourceWALReader, we cache the file size from when the wal 
was still being actively written. Once the wal was closed, replicated, and removed 
from WALEntryStream, we reset the ProtobufLogReader object but did not update the 
length of the wal, and that was causing EOF errors since the reader can't find the 
WALTrailer with the stale wal file length.

The fix applied nicely to branch-1, since it uses the FSHLog implementation which 
closes the WAL file synchronously.

But in branch-2 and master, we use the _AsyncFSWAL_ implementation and the closing 
of the wal file is done asynchronously (as the name suggests). This is causing the 
test to fail. Below is the test.
{code:java}
  @Test
  public void testCleanClosedWALs() throws Exception {
try (WALEntryStream entryStream = new WALEntryStream(
  logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
  assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
  appendToLogAndSync();
  assertNotNull(entryStream.next());
  log.rollWriter();  // ===> this does an asynchronous close of the WAL
  appendToLogAndSync();
  assertNotNull(entryStream.next());
  assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
}
  }
{code}
In the above code, when we roll the writer we don't close the old wal file 
immediately, so the ReplicationSourceWALReader thread is not able to get the updated 
wal file size, and that throws EOF errors.
 If I add a sleep of a few milliseconds (1 ms in my local env) between the 
rollWriter and appendToLogAndSync statements then the test passes, but this is *not* 
a proper fix, since we are just working around the race between the 
ReplicationSourceWALReader thread and the closing of the WAL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.

2021-05-26 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25924:


 Summary: Seeing a spike in uncleanlyClosedWALs metric.
 Key: HBASE-25924
 URL: https://issues.apache.org/jira/browse/HBASE-25924
 Project: HBase
  Issue Type: Bug
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Getting the following log line in all of our production clusters when 
WALEntryStream is dequeuing a WAL file:

{noformat}
 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - Reached 
the end of WAL file hdfs://. It was not closed cleanly, so we 
did not parse 8 bytes of data. This is normally ok.
{noformat}
The 8 bytes are usually the trailer size.

While dequeuing the WAL file from WALEntryStream, we reset the reader here:
[WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]

{code:java}
  private void tryAdvanceEntry() throws IOException {
    if (checkReader()) {
      readNextEntryAndSetPosition();
      if (currentEntry == null) { // no more entries in this log file - see if log was rolled
        if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
          // Before dequeueing, we should always get one more attempt at reading.
          // This is in case more entries came in after we opened the reader,
          // and a new log was enqueued while we were reading. See HBASE-6758
          resetReader(); // ---> HERE
          readNextEntryAndSetPosition();
          if (currentEntry == null) {
            if (checkAllBytesParsed()) { // now we're certain we're done with this log file
              dequeueCurrentLog();
              if (openNextLog()) {
                readNextEntryAndSetPosition();
              }
            }
          }
        } // no other logs, we've simply hit the end of the current open log. Do nothing
      }
    }
    // do nothing if we don't have a WAL Reader (e.g. if there's no logs in queue)
  }
{code}

In resetReader, we call the following methods: WALEntryStream#resetReader ---> 
ProtobufLogReader#reset ---> ProtobufLogReader#initInternal.
In ProtobufLogReader#initInternal, we try to create the whole reader object from 
scratch to see if any new data has been written.
We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength.
We calculate whether the trailer is present or not depending on fileLength.
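
The gist of a fix, in sketch form (not the actual patch), is to refresh the cached 
length from the filesystem whenever the reader is reset, so trailer detection uses 
the current file length rather than the length cached while the WAL was still being 
written:

{code:java}
// Sketch only: re-read the file length on every reset so a trailer written
// after the initial open is actually taken into account.
private long fileLength;

void resetTo(FileSystem fs, Path walPath) throws IOException {
  this.fileLength = fs.getFileStatus(walPath).getLen();  // current length, not the cached one
  // ... then re-open/seek the underlying stream as before, and decide whether a
  // WALTrailer can fit based on the refreshed fileLength
}
{code}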



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25893) Corruption in recovered WAL in WALSplitter

2021-05-17 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25893:


 Summary: Corruption in recovered WAL in WALSplitter
 Key: HBASE-25893
 URL: https://issues.apache.org/jira/browse/HBASE-25893
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Recently we encountered RS aborts due to NPE while replaying edits from split 
logs during region open.

{noformat}
2021-05-13 19:34:28,871 ERROR [:60020-17] handler.OpenRegionHandler - 
Failed open of 
region=,1619036437822.0556ab96be88000b6f5f3fad47938ccd., starting 
to roll back the global memstore size.
java.lang.NullPointerException
at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4682)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsForPaths(HRegion.java:4557)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:4470)
at 
org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:949)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:908)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7253)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7214)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7185)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7141)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7092)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:364)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:131)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

Tracing back how the corrupt wal was generated.

{noformat}
 2021-05-12 05:21:23,333 FATAL [:60020-0-Writer-1] wal.WALSplitter - 
556ab96be88000b6f5f3fad47938ccd/5039807= to log
 java.nio.channels.ClosedChannelException
 at 
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
 at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
 at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
 at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
 at 
org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
 at 
org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
 at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
 at 
org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1742)
 at 
org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1714)
 at 
org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1179)
 at 
org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1171)
 at 
org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1141)


2021-05-12 05:21:23,333 ERROR [:60020-0-Writer-1] wal.WALSplitter - 
Exiting thread
java.nio.channels.ClosedChannelException
at 
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
at 
org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
at 
org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
at 
org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1742)
at 

[jira] [Created] (HBASE-25887) Corrupt wal while region server is aborting.

2021-05-13 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25887:


 Summary: Corrupt wal while region server is aborting.
 Key: HBASE-25887
 URL: https://issues.apache.org/jira/browse/HBASE-25887
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We have seen a case in our production cluster where we ended up in corrupt wal. 
WALSplitter logged the below error
{noformat}
2021-05-12 00:42:46,786 FATAL [:60020-1] regionserver.HRegionServer - 
ABORTING region server HOST-B,60020,16207794418
88: Caught throwable while processing event RS_LOG_REPLAY
java.lang.NullPointerException
at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
at 
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
at 
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
at 
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
at 
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
at 
org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
Looking at the raw wal file, we could see that the last WALEdit contains the 
region id, table name, and sequence number, but the cells were not persisted.
 Looking at the logs of the RS that generated that corrupt wal file:
{noformat}
2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer 
in /hbase/WALs/HOST-A,60020,1620774393046
2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping 
rpc client
2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - 
regionserver/HOST-A/:60020 closing leases
2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - 
regionserver/HOST-A:/HOST-A:60020 closed leases
2021-05-11 23:29:22,198 WARN  [0020.append-pool8-t1] wal.FSHLog - Append 
sequenceId=7147823, requesting roll of WAL
java.nio.channels.ClosedChannelException
at 
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
at 
org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
at 
org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
These 2 lines are interesting.
{quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL 
writer in /hbase/WALs/HOST-A,60020,1620774393046
 
 
 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append 
sequenceId=7147823, requesting roll of WAL
 java.nio.channels.ClosedChannelException
{quote}
The append thread encountered java.nio.channels.ClosedChannelException while 
writing to wal file because the wal file was already closed.

This is the sequence of shutting down of threads when RS aborts.
{noformat}
  // With disruptor down, this is safe to let go.
  if (this.appendExecutor !=  null) this.appendExecutor.shutdown();

  // Tell our listeners that the log is closing
   ...
  if (LOG.isDebugEnabled()) {
LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
  }
  if (this.writer != null) {

[jira] [Created] (HBASE-25860) Add metric for successful wal roll requests.

2021-05-06 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25860:


 Summary: Add metric for successful wal roll requests.
 Key: HBASE-25860
 URL: https://issues.apache.org/jira/browse/HBASE-25860
 Project: HBase
  Issue Type: Improvement
  Components: metrics, wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah


We don't have any metric for the number of successful wal roll requests. If we had 
that metric, we could add alerts for when it is stuck for some reason.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25856) sorry for my mistake, could someone delete it.

2021-05-06 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-25856.
--
Resolution: Invalid

> sorry for my mistake, could someone delete it.
> --
>
> Key: HBASE-25856
> URL: https://issues.apache.org/jira/browse/HBASE-25856
> Project: HBase
>  Issue Type: Improvement
>Reporter: junwen yang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25679) Size of log queue metric is incorrect in branch-1

2021-03-18 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25679:


 Summary: Size of log queue metric is incorrect in branch-1
 Key: HBASE-25679
 URL: https://issues.apache.org/jira/browse/HBASE-25679
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.7.0
Reporter: Rushabh Shah


In HBASE-25539 I did some refactoring to add a new metric, "oldestWalAge", and tried 
to consolidate the updates to all the metrics related to the ReplicationSource class 
(size of log queue and oldest wal age) in one place. That refactoring introduced a 
bug where we decrement the size-of-log-queue metric twice whenever we remove a wal 
from the ReplicationSource queue.

We need to fix this only in branch-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25622) Result#compareResults should compare tags.

2021-03-01 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25622:


 Summary: Result#compareResults should compare tags.
 Key: HBASE-25622
 URL: https://issues.apache.org/jira/browse/HBASE-25622
 Project: HBase
  Issue Type: Improvement
  Components: Client
Affects Versions: 1.7.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


Today +Result#compareResults+ compares the two cells based on the following parameters.
{noformat}
for (int i = 0; i < res1.size(); i++) {
  if (!ourKVs[i].equals(replicatedKVs[i]) ||
      !CellUtil.matchingValue(ourKVs[i], replicatedKVs[i])) {
    throw new Exception("This result was different: "
        + res1.toString() + " compared to " + res2.toString());
  }
}
{noformat}

row, family, qualifier, timestamp, type, value.

We also need to compare tags to determine if both cells are equal or not.
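
A hedged sketch of what the extra check might look like, assuming the Cell exposes its raw tag block as in branch-1/branch-2 (the helper class is hypothetical):
{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.util.Bytes;

public final class TagComparisonSketch {
  private TagComparisonSketch() {}

  // Returns true if both cells carry byte-identical tag blocks.
  static boolean matchingTags(Cell a, Cell b) {
    return Bytes.equals(
        a.getTagsArray(), a.getTagsOffset(), a.getTagsLength(),
        b.getTagsArray(), b.getTagsOffset(), b.getTagsLength());
  }
}
{code}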
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25612) HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.

2021-02-26 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25612:


 Summary: HMaster should abort if ReplicationLogCleaner is not able 
to delete oldWALs.
 Key: HBASE-25612
 URL: https://issues.apache.org/jira/browse/HBASE-25612
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


In our production cluster, we encountered an issue where the number of files within the /hbase/oldWALs directory was growing exponentially, from a baseline of about 4000 to 15, at a rate of 333 files per minute.

On further investigation we found that the ReplicationLogCleaner thread was getting aborted since it was not able to talk to ZooKeeper. Stack trace below:
{noformat}
2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - 
replicationLogCleaner-0x302e05e0d8f, 
quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181,
 baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/replication/rs
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
 at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
 at 
org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
 at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
 at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

2021-02-25 23:05:01,149 WARN  [an-pool3-thread-1729] 
master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring.  
Reason: Failed to get stat of replication rs node
2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] 
master.ReplicationLogCleaner - 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/replication/rs
2021-02-25 23:05:01,150 WARN  [an-pool3-thread-1729] 
master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking 
deletable files
 {noformat}
 

{quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] 
master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. 
Reason: Failed to get stat of replication rs node
{quote}
 

This line is more alarming: the HMaster invoked the Abortable but simply ignored it and went about business as usual.

We have a max-files-per-directory configuration in the NameNode, which is set to 1M in our clusters. If this directory had reached that limit, it would have brought down the whole cluster.

We shouldn't ignore the Abortable; we should crash the HMaster when the Abortable is invoked.
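
A minimal sketch of the desired behavior, assuming the cleaner's Abortable is wired to the master; the forwarding class here is hypothetical, not the existing code:
{code:java}
import org.apache.hadoop.hbase.Abortable;

// Hypothetical wiring: instead of logging "received abort, ignoring",
// forward the abort to the master so the process actually goes down.
public class ForwardingAbortable implements Abortable {
  private final Abortable master; // e.g. the HMaster instance

  public ForwardingAbortable(Abortable master) {
    this.master = master;
  }

  @Override
  public void abort(String why, Throwable e) {
    master.abort("ReplicationLogCleaner requested abort: " + why, e);
  }

  @Override
  public boolean isAborted() {
    return master.isAborted();
  }
}
{code}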

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-25500) Add metric for age of oldest wal in region server replication queue.

2021-02-08 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah resolved HBASE-25500.
--
Resolution: Duplicate

Dup of HBASE-25539

> Add metric for age of oldest wal in region server replication queue.
> 
>
> Key: HBASE-25500
> URL: https://issues.apache.org/jira/browse/HBASE-25500
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics, regionserver, Replication
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0
>
>
> We have seen cases recently where we have wal from 2018 timestamp in our 
> recovered replication queue. We came across this un replicated wal while 
> debugging something else. We need to have metrics for the oldest wal in the 
> replication queue and have alerts if it exceeds some threshold. Clearly 2 
> years old wal is not desirable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25539) Add metric for age of oldest wal.

2021-01-29 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25539:


 Summary: Add metric for age of oldest wal.
 Key: HBASE-25539
 URL: https://issues.apache.org/jira/browse/HBASE-25539
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


In our production clusters, we have seen multiple cases where some WALs linger in the ZK replication queues for months and we have no insight into it. Recently we fixed one case where a WAL was getting stuck if it was 0 size and from old sources (HBASE-25536). It would be helpful to have a metric reporting the age of the oldest WAL so we can add an alert for monitoring purposes.
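
For illustration, a rough sketch of how the age could be derived, assuming the WAL creation timestamp is taken from the numeric suffix of the WAL file name; the helper below is hypothetical, not the proposed patch:
{code:java}
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

public final class OldestWalAgeSketch {
  private OldestWalAgeSketch() {}

  // walName is assumed to end with its creation timestamp,
  // e.g. "...%2C60020%2C1606126266791.1606852981112"
  static long ageOfWalMillis(String walName) {
    long creationTs = Long.parseLong(walName.substring(walName.lastIndexOf('.') + 1));
    return EnvironmentEdgeManager.currentTime() - creationTs;
  }
}
{code}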



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25536) Remove 0 length wal file from queue if it belongs to old sources.

2021-01-28 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25536:


 Summary: Remove 0 length wal file from queue if it belongs to old 
sources.
 Key: HBASE-25536
 URL: https://issues.apache.org/jira/browse/HBASE-25536
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah
 Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2


In our production clusters, we found one case where the RS does not remove a 0-length file from the replication queue (the in-memory queue, not the ZK replication queue) if the logQueue size is 1.
 Stack trace below:
{noformat}
2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] 
regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
 java.io.EOFException: 
hdfs://hbase/oldWALs/%2C60020%2C1606126266791.1606852981112 not a 
SequenceFile
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
Caused by: java.io.EOFException: 
hdfs://hbase/oldWALs/%2C60020%2C1606126266791.1606852981112 not a 
SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
at 
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842)
at 
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
at 
org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
... 1 more
{noformat}
The WAL in question is of length 0 (verified via the hadoop ls command) and is from recovered sources. There is just one log file in the queue (verified via a heap dump).

 We have logic to remove a 0-length log file from the queue when we encounter an EOFException and logQueue#size is greater than 1. Code snippet below.
{code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
  // if we get an EOF due to a zero-length log, and there are other logs in queue
  // (highly likely we've closed the current log), we've hit the max retries,
  // and autorecovery is enabled, then dump the log
  private void handleEofException(IOException e) {
    if ((e instanceof EOFException || e.getCause() instanceof EOFException)
        && logQueue.size() > 1 && this.eofAutoRecovery) {
      try {
        if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
          LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
          logQueue.remove();
          currentPosition = 0;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + logQueue.peek());
      }
    }
  }
{code}
This size check is valid for active sources, where the queue must retain at least one WAL file (the current WAL). But for recovered sources, where we don't add the current WAL file to the queue, we can skip the logQueue#size check; see the sketch below.
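
A sketch of the adjusted check, pulled out as a standalone method for readability; the recoveredSource flag stands in for however the reader distinguishes recovered queues and is an assumption, not the committed change:
{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.util.Queue;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class EofHandlingSketch {
  private EofHandlingSketch() {}

  // Sketch only: decide whether the zero-length WAL at the head of the queue may be dropped.
  // For recovered sources the current WAL is never in the queue, so the size > 1 guard is skipped.
  static boolean shouldDropZeroLengthWal(IOException e, Queue<Path> logQueue, FileSystem fs,
      boolean eofAutoRecovery, boolean recoveredSource) throws IOException {
    boolean eof = e instanceof EOFException || e.getCause() instanceof EOFException;
    if (!eof || !eofAutoRecovery || logQueue.isEmpty()) {
      return false;
    }
    if (!recoveredSource && logQueue.size() <= 1) {
      return false; // active source: keep at least the current WAL
    }
    return fs.getFileStatus(logQueue.peek()).getLen() == 0;
  }
}
{code}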



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25500) Add metric for age of oldest wal in region server replication queue.

2021-01-12 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25500:


 Summary: Add metric for age of oldest wal in region server 
replication queue.
 Key: HBASE-25500
 URL: https://issues.apache.org/jira/browse/HBASE-25500
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver, Replication
Reporter: Rushabh Shah
Assignee: Rushabh Shah
 Fix For: 3.0.0-alpha-1, 1.7.0


We have seen cases recently where we have a WAL with a 2018 timestamp in our recovered replication queue. We came across this unreplicated WAL while debugging something else. We need a metric for the oldest WAL in the replication queue and alerts if it exceeds some threshold. Clearly a 2-year-old WAL is not desirable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25390) CopyTable and Coprocessor based export tool should backup and restore cell tags.

2020-12-12 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25390:


 Summary: CopyTable and Coprocessor based export tool should backup 
and restore cell tags.
 Key: HBASE-25390
 URL: https://issues.apache.org/jira/browse/HBASE-25390
 Project: HBase
  Issue Type: Improvement
  Components: backuprestore
Reporter: Rushabh Shah


In HBASE-25246 we added support for the MapReduce-based Export/Import tool to backup/restore cell tags. The MapReduce-based export tool is not the only tool that takes a snapshot or backup of a given table.
We also have the Coprocessor-based Export and CopyTable tools, which take a backup of a given table. We need to add support for these 2 tools to save cell tags to file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25328) Make PrivateCellUtil annotation LimitedPrivate.

2020-11-24 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25328:


 Summary: Make PrivateCellUtil annotation LimitedPrivate.
 Key: HBASE-25328
 URL: https://issues.apache.org/jira/browse/HBASE-25328
 Project: HBase
  Issue Type: Improvement
Reporter: Rushabh Shah
Assignee: Rushabh Shah


In PHOENIX-6213 the phoenix project is using the Cell Tag feature to add some metadata to delete mutations. We are adding Cell Tags in a co-processor, but we need some util methods available in the +PrivateCellUtil+ class.

Below are the methods we need in phoenix.

1. +PrivateCellUtil#createCell(Cell cell, List<Tag> tags)+ is an API which accepts an existing Cell and a list of tags to create a new cell.

But RawCellBuilder doesn't have any builder method which accepts a cell. I need to explicitly convert my input cell by extracting all fields, use the builder methods (like setRow, setFamily, etc.), and then call build.

 

2. +PrivateCellUtil.getTags(Cell cell)+ returns a list of the existing tags, which I want to use when adding a new tag.

But RawCell#getTags() returns an Iterator<Tag>, which I then have to iterate over and, depending on whether the tags are byte-buffer backed or array backed, convert to a List since RawCellBuilder#setTags accepts a List of Tags. We are already doing this conversion in the PrivateCellUtil#getTags method.

All these conversion utility methods would need to be duplicated in the phoenix project as well; see the sketch below.
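
Roughly this is the conversion Phoenix would have to re-implement (a sketch, not the PrivateCellUtil code):
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hbase.RawCell;
import org.apache.hadoop.hbase.Tag;

public final class TagListSketch {
  private TagListSketch() {}

  // Materialize RawCell#getTags() into the List<Tag> that RawCellBuilder#setTags expects.
  static List<Tag> tagsAsList(RawCell cell) {
    List<Tag> tags = new ArrayList<>();
    Iterator<Tag> it = cell.getTags();
    while (it.hasNext()) {
      tags.add(it.next());
    }
    return tags;
  }
}
{code}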

 

Is it reasonable to make PrivateCellUtil LimitedPrivate with HBaseInterfaceAudience COPROC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25246) Backup/Restore hbase cell tags.

2020-11-04 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25246:


 Summary: Backup/Restore hbase cell tags.
 Key: HBASE-25246
 URL: https://issues.apache.org/jira/browse/HBASE-25246
 Project: HBase
  Issue Type: Improvement
  Components: backuprestore
Reporter: Rushabh Shah
Assignee: Rushabh Shah


In PHOENIX-6213 we are planning to add cell tags for Delete mutations. After a discussion with the hbase community on the dev mailing list, it was decided that we will pass the tags via an attribute in the Mutation object and persist them to hbase via a phoenix co-processor. The intention of PHOENIX-6213 is to store metadata in the Delete marker so that while running the Restore tool we can selectively restore certain Delete markers and ignore others. For that to happen we need to persist these tags during Backup and retrieve them in the Restore MR jobs (Import/Export tool).
Currently we don't persist the tags in Backup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25179) Assert format is incorrect in HFilePerformanceEvaluation class.

2020-10-12 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25179:


 Summary: Assert format is incorrect in HFilePerformanceEvaluation 
class.
 Key: HBASE-25179
 URL: https://issues.apache.org/jira/browse/HBASE-25179
 Project: HBase
  Issue Type: Improvement
  Components: Performance
Reporter: Rushabh Shah
Assignee: Rushabh Shah


[HFilePerformanceEvaluation 
|https://github.com/apache/hbase/blob/master/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java#L518]
 
The expected and actual arguments are interchanged.

{code:java}
PerformanceEvaluationCommons.assertValueSize(c.getValueLength(), ROW_LENGTH);
{code}
The first argument should be the expected value and the second should be the actual value.
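
So the fix is presumably just swapping the arguments, along the lines of:
{code:java}
// expected first, actual second
PerformanceEvaluationCommons.assertValueSize(ROW_LENGTH, c.getValueLength());
{code}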




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25118) Extend Cell Tags to Delete object.

2020-09-29 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25118:


 Summary: Extend Cell Tags to Delete object.
 Key: HBASE-25118
 URL: https://issues.apache.org/jira/browse/HBASE-25118
 Project: HBase
  Issue Type: Improvement
Reporter: Rushabh Shah
Assignee: Rushabh Shah


We want to track the source of mutations (especially Deletes) via Phoenix. We have multiple use cases which issue deletes, namely: customers deleting data, internal processes like GDPR compliance, and Phoenix TTL MR jobs. For every mutation we want to track the source of the operation which initiated the delete.

At my day job, we have a custom Backup/Restore tool.
For example: during GDPR compliance cleanup (let's say at time t0), we mistakenly deleted some customer data, and it is possible that the customer also deleted some data from their side (at time t1). To recover the mistakenly deleted data, we restore from the backup at time (t0 - 1). By doing this, we also recover the data that the customer intentionally deleted.
We need a way for the Restore tool to selectively recover data.

We want to leverage the Cell Tag feature for Delete mutations to store this metadata. Currently the Delete object doesn't support the Tag feature.
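
Until Delete supports tags directly, the interim approach discussed here and in HBASE-25246 is to carry the metadata as a mutation attribute; a sketch, where the attribute name and value are made up:
{code:java}
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSourceTaggingSketch {
  public static void main(String[] args) {
    Delete delete = new Delete(Bytes.toBytes("row-1"));
    // Hypothetical attribute name; a co-processor would later turn this into a cell tag.
    delete.setAttribute("delete-source", Bytes.toBytes("GDPR_COMPLIANCE_JOB"));
  }
}
{code}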



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25052) FastLongHistogram#getCountAtOrBelow method is broken.

2020-09-16 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-25052:


 Summary: FastLongHistogram#getCountAtOrBelow method is broken.
 Key: HBASE-25052
 URL: https://issues.apache.org/jira/browse/HBASE-25052
 Project: HBase
  Issue Type: Bug
  Components: metrics
Affects Versions: 2.2.3, 1.6.0, 2.3.0, 3.0.0-alpha-1
Reporter: Rushabh Shah


FastLongHistogram#getCountAtOrBelow method is broken.
If I revert HBASE-23245 then it works fine.
I wrote a small test case in TestHistogramImpl.java:

{code:java}
  @Test
  public void testAdd1() {
    HistogramImpl histogram = new HistogramImpl();
    for (int i = 0; i < 100; i++) {
      histogram.update(i);
    }
    Snapshot snapshot = histogram.snapshot();
    // This should return a count of 6 since we added 0, 1, 2, 3, 4, 5
    Assert.assertEquals(6, snapshot.getCountAtOrBelow(5));
  }
{code}

It fails as below:
{noformat}
java.lang.AssertionError: 
Expected :6
Actual  :100
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24983) Wrap ConnectionImplementation#locateRegionInMeta under operation timeout.

2020-09-05 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24983:


 Summary: Wrap ConnectionImplementation#locateRegionInMeta under 
operation timeout.
 Key: HBASE-24983
 URL: https://issues.apache.org/jira/browse/HBASE-24983
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 1.6.0
Reporter: Rushabh Shah


We have two config properties: hbase.client.operation.timeout and hbase.client.meta.operation.timeout.
Description of hbase.client.operation.timeout, which applies to non-meta tables:
{noformat}
Operation timeout is a top-level restriction (millisecond) that makes sure a 
blocking operation in Table will not be blocked more than this. In each 
operation, if rpc request fails because of timeout or other reason, it will 
retry until success or throw RetriesExhaustedException. But if the total time 
being blocking reach the operation timeout before retries exhausted, it will 
break early and throw SocketTimeoutException.
{noformat}

Most of the operations like get, put, and delete are wrapped under this timeout, but the scan operation is not. We need to wrap scan operations within the operation timeout as well.
More discussion in this PR thread:  
https://github.com/apache/hbase/pull/2322#discussion_r478687341
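
For illustration, a deadline-style wrapper of the kind the description implies; this is a generic sketch under assumed names, not the ConnectionImplementation code:
{code:java}
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public final class OperationTimeoutSketch {
  private OperationTimeoutSketch() {}

  // Retry op until it succeeds, but never block past operationTimeoutMs overall.
  static <T> T callWithOperationTimeout(Callable<T> op, long operationTimeoutMs) throws Exception {
    long deadline = System.currentTimeMillis() + operationTimeoutMs;
    Exception last = null;
    while (System.currentTimeMillis() < deadline) {
      try {
        return op.call();
      } catch (Exception e) {
        last = e; // a real client would back off before retrying
      }
    }
    throw new SocketTimeoutException("operation timeout exceeded after " + operationTimeoutMs
        + " ms" + (last != null ? ": " + last : ""));
  }
}
{code}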



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24957) ZKTableStateClientSideReader#isDisabledTable doesn't check if table exists or not.

2020-08-26 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24957:


 Summary: ZKTableStateClientSideReader#isDisabledTable doesn't 
check if table exists or not.
 Key: HBASE-24957
 URL: https://issues.apache.org/jira/browse/HBASE-24957
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah


The following bug exists only in branch-1 and below.

ZKTableStateClientSideReader#isDisabledTable returns false even if the table doesn't exist.

Below is the code snippet:

{code:title=ZKTableStateClientSideReader.java|borderStyle=solid}
  public static boolean isDisabledTable(final ZooKeeperWatcher zkw,
      final TableName tableName)
      throws KeeperException, InterruptedException {
    // ---> We should check here if state is null or not.
    ZooKeeperProtos.Table.State state = getTableState(zkw, tableName);
    return isTableState(ZooKeeperProtos.Table.State.DISABLED, state);
  }
}
{code}
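
A sketch of the missing guard, mirroring the snippet above; the choice of exception (and the resulting throws-clause change) is an assumption, not the agreed fix:
{code:java}
  public static boolean isDisabledTable(final ZooKeeperWatcher zkw,
      final TableName tableName)
      throws KeeperException, InterruptedException, TableNotFoundException {
    ZooKeeperProtos.Table.State state = getTableState(zkw, tableName);
    if (state == null) {
      // Sketch: the znode is missing, i.e. the table does not exist;
      // surface that instead of silently reporting "not disabled".
      throw new TableNotFoundException(tableName.getNameAsString());
    }
    return isTableState(ZooKeeperProtos.Table.State.DISABLED, state);
  }
{code}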

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24956) ConnectionManager#userRegionLock waits for lock indefinitely.

2020-08-26 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24956:


 Summary: ConnectionManager#userRegionLock waits for lock 
indefinitely.
 Key: HBASE-24956
 URL: https://issues.apache.org/jira/browse/HBASE-24956
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 1.3.2
Reporter: Rushabh Shah
Assignee: Rushabh Shah


One of our customers experienced high latencies (on the order of 3-4 minutes) for point lookup queries (we use Phoenix on top of hbase).

We have different threads sharing the same hconnection. It looks like multiple threads are stuck at the same place: 
[https://github.com/apache/hbase/blob/branch-1.3/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java#L1282]
 

We have set the following configuration parameters to ensure queries fail within a reasonable SLA:

1. hbase.client.meta.operation.timeout

2. hbase.client.operation.timeout

3. hbase.client.scanner.timeout.period

But since userRegionLock can wait for the lock indefinitely, the call will not fail within the SLA.
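
One possible mitigation sketch is to bound the wait on that lock by the caller's remaining operation time; the field and method names below are illustrative, not the branch-1.3 code:
{code:java}
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class UserRegionLockSketch {
  private final ReentrantLock userRegionLock = new ReentrantLock();

  // Instead of lock() blocking forever, fail the lookup once the caller's budget is spent.
  void lockUserRegion(long remainingOperationTimeoutMs) throws IOException, InterruptedException {
    if (!userRegionLock.tryLock(remainingOperationTimeoutMs, TimeUnit.MILLISECONDS)) {
      throw new SocketTimeoutException(
          "Timed out waiting for userRegionLock after " + remainingOperationTimeoutMs + " ms");
    }
  }
}
{code}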



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24615) MutableRangeHistogram#updateSnapshotRangeMetrics doesn't calculate the distribution for last bucket.

2020-06-22 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24615:


 Summary: MutableRangeHistogram#updateSnapshotRangeMetrics doesn't 
calculate the distribution for last bucket.
 Key: HBASE-24615
 URL: https://issues.apache.org/jira/browse/HBASE-24615
 Project: HBase
  Issue Type: Bug
  Components: metrics
Affects Versions: 1.3.7
Reporter: Rushabh Shah


We are not processing the distribution for the last bucket.

https://github.com/apache/hbase/blob/master/hbase-hadoop-compat/src/main/java/org/apache/hadoop/metrics2/lib/MutableRangeHistogram.java#L70

{code:java}
  public void updateSnapshotRangeMetrics(MetricsRecordBuilder metricsRecordBuilder,
      Snapshot snapshot) {
    long priorRange = 0;
    long cumNum = 0;

    final long[] ranges = getRanges();
    final String rangeType = getRangeType();
    // -> The bug lies here. We are not processing the last bucket.
    for (int i = 0; i < ranges.length - 1; i++) {
      long val = snapshot.getCountAtOrBelow(ranges[i]);
      if (val - cumNum > 0) {
        metricsRecordBuilder.addCounter(
            Interns.info(name + "_" + rangeType + "_" + priorRange + "-" + ranges[i], desc),
            val - cumNum);
      }
      priorRange = ranges[i];
      cumNum = val;
    }
    long val = snapshot.getCount();
    if (val - cumNum > 0) {
      metricsRecordBuilder.addCounter(
          Interns.info(name + "_" + rangeType + "_" + ranges[ranges.length - 1] + "-inf", desc),
          val - cumNum);
    }
  }
{code}
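
If the intent is that every configured boundary gets its own bucket before the trailing "-inf" bucket, the fix presumably just iterates over all of ranges; a sketch of the loop only, not the committed patch:
{code:java}
    // Sketch of the fix: iterate over every boundary so the last finite bucket
    // (priorRange - ranges[ranges.length - 1]) is emitted too; the trailing
    // "-inf" bucket then only counts values above the largest boundary.
    for (int i = 0; i < ranges.length; i++) {
      long val = snapshot.getCountAtOrBelow(ranges[i]);
      if (val - cumNum > 0) {
        metricsRecordBuilder.addCounter(
            Interns.info(name + "_" + rangeType + "_" + priorRange + "-" + ranges[i], desc),
            val - cumNum);
      }
      priorRange = ranges[i];
      cumNum = val;
    }
{code}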




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24576) Changing "whitelist" and "blacklist" in our docs and project.

2020-06-16 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24576:


 Summary: Changing "whitelist" and "blacklist" in our docs and 
project.
 Key: HBASE-24576
 URL: https://issues.apache.org/jira/browse/HBASE-24576
 Project: HBase
  Issue Type: Improvement
Reporter: Rushabh Shah


Replace instances of “whitelist” and “blacklist” in our project, trails, documentation and UI text. 
We can replace “blacklist” with blocklist, blocklisted, or block, and “whitelist” with allowlist, allowlisted, or allow.
At my current workplace they are suggesting we make this change. Google also has an exhaustive guide on writing inclusive documentation: 
https://developers.google.com/style/inclusive-documentation

There might be a few issues while making the change.
1. If these words are part of a config property name then all customers need to make the change.
2. There might be some client-server compatibility concerns if we change server-side variable/method names.

Creating this jira just to start the conversation. Please chip in with ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24520) Change the IA for MutableSizeHistogram and MutableTimeHistogram

2020-06-08 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24520:


 Summary: Change the IA for MutableSizeHistogram and 
MutableTimeHistogram
 Key: HBASE-24520
 URL: https://issues.apache.org/jira/browse/HBASE-24520
 Project: HBase
  Issue Type: Task
  Components: metrics
Reporter: Rushabh Shah


Currently the IA for MutableSizeHistogram and MutableTimeHistogram is Private. We want to use these classes in the phoenix project, and I thought we could leverage the existing HBase histogram implementation. IIUC classes with the Private IA can't be used in other projects. Proposing to make them LimitedPrivate and mark them HBaseInterfaceAudience.PHOENIX, as sketched below. Please suggest.
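
Concretely, the proposal amounts to something like this sketch (applied to the two histogram classes in hbase-hadoop-compat; the PHOENIX audience constant is assumed to exist or be added as part of this change):
{code:java}
import org.apache.hadoop.hbase.HBaseInterfaceAudience;
import org.apache.yetus.audience.InterfaceAudience;

// Sketch only: the same annotation change would be applied to
// MutableSizeHistogram and MutableTimeHistogram.
@InterfaceAudience.LimitedPrivate(HBaseInterfaceAudience.PHOENIX)
public class ExampleLimitedPrivateClass {
}
{code}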



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24502) hbase hfile command logs exception message at WARN level.

2020-06-03 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-24502:


 Summary: hbase hfile command logs exception message at WARN level.
 Key: HBASE-24502
 URL: https://issues.apache.org/jira/browse/HBASE-24502
 Project: HBase
  Issue Type: Bug
Affects Versions: master
Reporter: Rushabh Shah


Ran the following command:
{noformat}
 ./hbase hfile -f 
~/hbase-3.0.0-SNAPSHOT/tmp/hbase/data/default/emp/b1972be371596e074a1ae465782a209f/personal\
 data/0930965e5b914debb33a9be047efc493 -p
{noformat}

It logged the following WARN message on the console:
{noformat}
2020-06-03 09:23:45,485 WARN  [main] util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2020-06-03 09:23:45,871 WARN  [main] beanutils.FluentPropertyBeanIntrospector: 
Error when creating PropertyDescriptor for public final void 
org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)!
 Ignoring this property.
java.beans.IntrospectionException: bad write method arg count: public final 
void 
org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)
at 
java.beans.PropertyDescriptor.findPropertyType(PropertyDescriptor.java:657)
at 
java.beans.PropertyDescriptor.setWriteMethod(PropertyDescriptor.java:327)
at java.beans.PropertyDescriptor.(PropertyDescriptor.java:139)
at 
org.apache.commons.beanutils.FluentPropertyBeanIntrospector.createFluentPropertyDescritor(FluentPropertyBeanIntrospector.java:177)
at 
org.apache.commons.beanutils.FluentPropertyBeanIntrospector.introspect(FluentPropertyBeanIntrospector.java:140)
at 
org.apache.commons.beanutils.PropertyUtilsBean.fetchIntrospectionData(PropertyUtilsBean.java:2234)
at 
org.apache.commons.beanutils.PropertyUtilsBean.getIntrospectionData(PropertyUtilsBean.java:2215)
at 
org.apache.commons.beanutils.PropertyUtilsBean.getPropertyDescriptor(PropertyUtilsBean.java:950)
at 
org.apache.commons.beanutils.PropertyUtilsBean.isWriteable(PropertyUtilsBean.java:1466)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.isPropertyWriteable(BeanHelper.java:521)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.initProperty(BeanHelper.java:357)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.initBeanProperties(BeanHelper.java:273)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.initBean(BeanHelper.java:192)
at 
org.apache.commons.configuration2.beanutils.BeanHelper$BeanCreationContextImpl.initBean(BeanHelper.java:669)
at 
org.apache.commons.configuration2.beanutils.DefaultBeanFactory.initBeanInstance(DefaultBeanFactory.java:162)
at 
org.apache.commons.configuration2.beanutils.DefaultBeanFactory.createBean(DefaultBeanFactory.java:116)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:459)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:479)
at 
org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:492)
at 
org.apache.commons.configuration2.builder.BasicConfigurationBuilder.createResultInstance(BasicConfigurationBuilder.java:447)
at 
org.apache.commons.configuration2.builder.BasicConfigurationBuilder.createResult(BasicConfigurationBuilder.java:417)
at 
org.apache.commons.configuration2.builder.BasicConfigurationBuilder.getConfiguration(BasicConfigurationBuilder.java:285)
at 
org.apache.hadoop.metrics2.impl.MetricsConfig.loadFirst(MetricsConfig.java:119)
at 
org.apache.hadoop.metrics2.impl.MetricsConfig.create(MetricsConfig.java:98)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:478)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:188)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:163)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:62)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:58)
at 
org.apache.hadoop.hbase.metrics.BaseSourceImpl$DefaultMetricsSystemInitializer.init(BaseSourceImpl.java:54)
at 
org.apache.hadoop.hbase.metrics.BaseSourceImpl.(BaseSourceImpl.java:116)
at 
org.apache.hadoop.hbase.io.MetricsIOSourceImpl.(MetricsIOSourceImpl.java:46)
at 
org.apache.hadoop.hbase.io.MetricsIOSourceImpl.(MetricsIOSourceImpl.java:38)
at 
org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:94)
at org.apache.hadoop.hbase.io.MetricsIO.(MetricsIO.java:35)
at