[jira] [Created] (HBASE-28757) Understand how supportplaintext property works in TLS setup.
Rushabh Shah created HBASE-28757: Summary: Understand how supportplaintext property works in TLS setup. Key: HBASE-28757 URL: https://issues.apache.org/jira/browse/HBASE-28757 Project: HBase Issue Type: Improvement Components: security Affects Versions: 2.6.0 Reporter: Rushabh Shah We are testing the TLS feature and I am confused about how the hbase.server.netty.tls.supportplaintext property works. Here is our current setup. This is a fresh cluster deployment. hbase.server.netty.tls.enabled --> true hbase.client.netty.tls.enabled --> true hbase.server.netty.tls.supportplaintext --> false (We don't want to fall back to Kerberos) We still have our Kerberos-related configuration enabled. hbase.security.authentication --> kerberos *Our expectation:* During regionserver startup, the regionserver will use TLS for authentication and the communication will succeed. *Actual observation:* During regionserver startup, hmaster authenticates the regionserver *via kerberos authentication* and the *regionserver's reportForDuty RPC fails*. RS logs: {noformat} 2024-07-25 16:59:55,098 INFO [regionserver/regionserver-0:60020] regionserver.HRegionServer - reportForDuty to master=hmaster-0,6,1721926791062 with isa=regionserver-0/:60020, startcode=1721926793434 2024-07-25 16:59:55,548 DEBUG [RS-EventLoopGroup-1-2] ssl.SslHandler - [id: 0xa48e3487, L:/:39837 - R:hmaster-0/:6] HANDSHAKEN: protocol:TLSv1.2 cipher suite:TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 2024-07-25 16:59:55,578 DEBUG [RS-EventLoopGroup-1-2] security.UserGroupInformation - PrivilegedAction [as: hbase/regionserver-0. (auth:KERBEROS)][action: org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler$2@3769e55] java.lang.Exception at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896) at org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:161) at org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:43) ... ...
2024-07-25 16:59:55,581 DEBUG [RS-EventLoopGroup-1-2] security.UserGroupInformation - PrivilegedAction [as: hbase/regionserver-0 (auth:KERBEROS)][action: org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler$2@c6f0806] java.lang.Exception at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1896) at org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:161) at org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.channelRead0(NettyHBaseSaslRpcClientHandler.java:43) at org.apache.hbase.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) 2024-07-25 16:59:55,602 WARN [regionserver/regionserver-0:60020] regionserver.HRegionServer - error telling master we are up org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to address=hmaster-0:6 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:340) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:92) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:595) at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:16398) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2997) at org.apache.hadoop.hbase.regionserver.HRegionServer.lambda$run$2(HRegionServer.java:1084) at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187) at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079) Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to address=hmaster-0:6 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:233) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) at
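To make the setup above easier to reproduce, here is a minimal sketch of the same configuration expressed through the client Configuration API. The property names are the ones quoted in the description and the values mirror our deployment; the same keys could equally be set in hbase-site.xml.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TlsSetupSketch {
  public static void main(String[] args) {
    // Mirrors the fresh-cluster setup described in this issue.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.server.netty.tls.enabled", true);
    conf.setBoolean("hbase.client.netty.tls.enabled", true);
    // false = we do NOT want the server to fall back to plaintext/Kerberos RPC.
    conf.setBoolean("hbase.server.netty.tls.supportplaintext", false);
    // Kerberos-related configuration is still enabled.
    conf.set("hbase.security.authentication", "kerberos");
    System.out.println("supportplaintext = "
        + conf.getBoolean("hbase.server.netty.tls.supportplaintext", true));
  }
}
{code}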
[jira] [Created] (HBASE-28515) Validate access check for each RegionServerCoprocessor Endpoint.
Rushabh Shah created HBASE-28515: Summary: Validate access check for each RegionServerCoprocessor Endpoint. Key: HBASE-28515 URL: https://issues.apache.org/jira/browse/HBASE-28515 Project: HBase Issue Type: Improvement Components: Coprocessors Reporter: Rushabh Shah Currently we enforce ADMIN permissions for each regionserver coprocessor method. See HBASE-28508 for more details. There can be regionserver endpoint implementations which don't want to require calling users to have ADMIN permissions. So there needs to be a way for RS coproc implementations to override the access checks, with the default still requiring ADMIN permissions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
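A minimal sketch of the kind of opt-out this issue asks for. The interface name and the requireAdminForEndpointInvocation method are hypothetical, not an existing HBase API; only RegionServerCoprocessor is real.

{code:java}
import org.apache.hadoop.hbase.coprocessor.RegionServerCoprocessor;

/**
 * Hypothetical shape of the opt-out: an endpoint implementation could override a
 * default method to relax the ADMIN requirement for its own invocations, while the
 * default keeps today's behaviour (callers must hold global ADMIN).
 */
public interface AccessCheckedRegionServerCoprocessor extends RegionServerCoprocessor {

  /** Return false to let non-ADMIN users invoke this endpoint. */
  default boolean requireAdminForEndpointInvocation() {
    return true;
  }
}
{code}

RSRpcServices#execRegionServerService could then consult such a flag before calling requirePermission, instead of unconditionally requiring ADMIN.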
[jira] [Created] (HBASE-28508) Remove the need for ADMIN permissions for RSRpcServices#execRegionServerService
Rushabh Shah created HBASE-28508: Summary: Remove the need for ADMIN permissions for RSRpcServices#execRegionServerService Key: HBASE-28508 URL: https://issues.apache.org/jira/browse/HBASE-28508 Project: HBase Issue Type: Bug Components: acl Reporter: Rushabh Shah Assignee: Rushabh Shah We have introduced a new regionserver coproc within Phoenix and all the permission-related tests are failing with the following exception. {noformat} Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.security.AccessDeniedException): org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions for user 'groupUser_N42' (global, action=ADMIN) at org.apache.hadoop.hbase.security.access.AccessChecker.requireGlobalPermission(AccessChecker.java:152) at org.apache.hadoop.hbase.security.access.AccessChecker.requirePermission(AccessChecker.java:125) at org.apache.hadoop.hbase.regionserver.RSRpcServices.requirePermission(RSRpcServices.java:1318) at org.apache.hadoop.hbase.regionserver.RSRpcServices.rpcPreCheck(RSRpcServices.java:584) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execRegionServerService(RSRpcServices.java:3804) at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45016) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) {noformat} This check is failing. [RSRpcServices|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L3815]
{code}
@Override
public CoprocessorServiceResponse execRegionServerService(RpcController controller,
    CoprocessorServiceRequest request) throws ServiceException {
  rpcPreCheck("execRegionServerService");
  return server.execRegionServerService(controller, request);
}

private void rpcPreCheck(String requestName) throws ServiceException {
  try {
    checkOpen();
    requirePermission(requestName, Permission.Action.ADMIN);
  } catch (IOException ioe) {
    throw new ServiceException(ioe);
  }
}
{code}
Why do we need ADMIN permissions to call a region server coproc? We don't need ADMIN permissions to call all region co-procs; we require ADMIN permissions only to execute some region coprocs (compactionSwitch, clearRegionBlockCache). Can we change the permission to READ? -- This message was sent by Atlassian Jira (v8.20.10#820010)
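For illustration, a hedged sketch of what the READ variant asked about above could look like, written against the two methods quoted from RSRpcServices. This is only a sketch of the question, not the change that was eventually committed for this issue.

{code:java}
// Sketch only: relax the required action for execRegionServerService from ADMIN to READ.
private void rpcPreCheck(String requestName, Permission.Action action) throws ServiceException {
  try {
    checkOpen();
    requirePermission(requestName, action);
  } catch (IOException ioe) {
    throw new ServiceException(ioe);
  }
}

@Override
public CoprocessorServiceResponse execRegionServerService(RpcController controller,
    CoprocessorServiceRequest request) throws ServiceException {
  rpcPreCheck("execRegionServerService", Permission.Action.READ);
  return server.execRegionServerService(controller, request);
}
{code}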
[jira] [Created] (HBASE-28437) Region Server crash in our production environment.
Rushabh Shah created HBASE-28437: Summary: Region Server crash in our production environment. Key: HBASE-28437 URL: https://issues.apache.org/jira/browse/HBASE-28437 Project: HBase Issue Type: Bug Reporter: Rushabh Shah Recently we are seeing lot of RS crash in our production environment creating core dump file and hs_err_pid.log file. HBase: hbase-2.5 Java: openjdk 1.8 Copying contents from hs_err_pid.log below: {noformat} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f9fb1415ba2, pid=50172, tid=0x7f92a97ec700 # # JRE version: OpenJDK Runtime Environment (Zulu 8.76.0.18-SA-linux64) (8.0_402-b06) (build 1.8.0_402-b06) # Java VM: OpenJDK 64-Bit Server VM (25.402-b06 mixed mode linux-amd64 ) # Problematic frame: # J 19801 C2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x7f9fb1415ba2 [0x7f9fb14159a0+0x202] # # Core dump written. Default location: /home/sfdc/core or core.50172 # # If you would like to submit a bug report, please visit: # http://www.azul.com/support/ # --- T H R E A D --- Current thread (0x7f9fa2d13000): JavaThread "RS-EventLoopGroup-1-92" daemon [_thread_in_Java, id=54547, stack(0x7f92a96ec000,0x7f92a97ed000)] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x559869daf000 Registers: RAX=0x7f9dbd8b6460, RBX=0x0008, RCX=0x0005c86b, RDX=0x7f9dbd8b6460 RSP=0x7f92a97eaf20, RBP=0x0002, RSI=0x7f92d225e970, RDI=0x0069 R8 =0x55986975f028, R9 =0x0064ffd8, R10=0x005f, R11=0x7f94a778b290 R12=0x7f9e62855ae8, R13=0x, R14=0x7f9e5a14b1e0, R15=0x7f9fa2d13000 RIP=0x7f9fb1415ba2, EFLAGS=0x00010216, CSGSFS=0x0033, ERR=0x0004 TRAPNO=0x000e Top of Stack: (sp=0x7f92a97eaf20) 0x7f92a97eaf20: 00690064ff79 7f9dbd8b6460 0x7f92a97eaf30: 7f9dbd8b6460 00570003 0x7f92a97eaf40: 7f94a778b290 000400010004 0x7f92a97eaf50: 0004d090c130 7f9db550 0x7f92a97eaf60: 000800040001 7f92a97eaf90 0x7f92a97eaf70: 7f92d0908648 0001 0x7f92a97eaf80: 0001 005c 0x7f92a97eaf90: 7f94ee8078d0 0206 0x7f92a97eafa0: 7f9db5545a00 7f9fafb63670 0x7f92a97eafb0: 7f9e5a13ed70 00690001 0x7f92a97eafc0: 7f93ab8965b8 7f93b9959210 0x7f92a97eafd0: 7f9db5545a00 7f9fb04b3e30 0x7f92a97eafe0: 7f9e5a13ed70 7f930001 0x7f92a97eaff0: 7f93ab8965b8 7f93a8ae3920 0x7f92a97eb000: 7f93b9959210 7f94a778b290 0x7f92a97eb010: 7f9b60707c20 7f93a8938c28 0x7f92a97eb020: 7f94ee8078d0 7f9b60708608 0x7f92a97eb030: 7f9b60707bc0 7f9b60707c20 0x7f92a97eb040: 0069 7f93ab8965b8 0x7f92a97eb050: 7f94a778b290 7f94a778b290 0x7f92a97eb060: 0005c80d0005c80c a828a590 0x7f92a97eb070: 7f9e5a13ed70 0001270e 0x7f92a97eb080: 7f9db5545790 01440022 0x7f92a97eb090: 7f95ddc800c0 7f93ab89a6c8 0x7f92a97eb0a0: 7f93ae65c270 7f9fb24af990 0x7f92a97eb0b0: 7f93ae65c290 7f93ae65c270 0x7f92a97eb0c0: 7f9e5a13ed70 7f92ca328528 0x7f92a97eb0d0: 7f9e5a13ed98 7f9e5e1e88b0 0x7f92a97eb0e0: 7f92ca32d870 7f9e5a13ed98 0x7f92a97eb0f0: 7f9e5e1e88b0 7f93b9956288 0x7f92a97eb100: 7f9e5a13ed70 7f9fb23c3aac 0x7f92a97eb110: 7f9317c9c8d0 7f9b60708608 Instructions: (pc=0x7f9fb1415ba2) 0x7f9fb1415b82: 44 3b d7 0f 8d 6d fe ff ff 4c 8b 40 10 45 8b ca 0x7f9fb1415b92: 44 03 0c 24 c4 c1 f9 7e c3 4d 8b 5b 18 4d 63 c9 0x7f9fb1415ba2: 47 0f be 04 08 4d 85 db 0f 84 49 03 00 00 4d 8b 0x7f9fb1415bb2: 4b 08 48 b9 10 1c be 10 93 7f 00 00 4c 3b c9 0f Register to memory mapping: RAX=0x7f9dbd8b6460 is an oop java.nio.DirectByteBuffer - klass: 'java/nio/DirectByteBuffer' RBX=0x0008 is an unknown value RCX=0x0005c86b is an unknown value RDX=0x7f9dbd8b6460 is an oop java.nio.DirectByteBuffer 
- klass: 'java/nio/DirectByteBuffer' RSP=0x7f92a97eaf20 is pointing into the stack for thread: 0x7f9fa2d13000 RBP=0x0002 is an unknown value RSI=0x7f92d225e970 is pointing into metadata RDI=0x0069 is an unknown value R8 =0x55986975f028 is an unknown value R9 =0x0064ffd8 is an unknown value R10=0x005f is an unknown value R11=0x7f94a778b290 is an oop org.apache.hbase.thirdparty.io.netty.buffer.PooledUnsafeDirectByteBuf - klass:
[jira] [Resolved] (HBASE-28391) Remove the need for ADMIN permissions for listDecommissionedRegionServers
[ https://issues.apache.org/jira/browse/HBASE-28391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-28391. -- Fix Version/s: 2.6.0 2.4.18 4.0.0-alpha-1 2.7.0 2.5.8 3.0.0-beta-2 Resolution: Fixed > Remove the need for ADMIN permissions for listDecommissionedRegionServers > - > > Key: HBASE-28391 > URL: https://issues.apache.org/jira/browse/HBASE-28391 > Project: HBase > Issue Type: Bug > Components: Admin >Affects Versions: 2.4.17, 2.5.7 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 2.4.18, 4.0.0-alpha-1, 2.7.0, 2.5.8, 3.0.0-beta-2 > > > Why we need {{ADMIN}} permissions for > {{AccessController#preListDecommissionedRegionServers}} ? > From Phoenix, we are calling {{Admin#getRegionServers(true)}} where the > argument {{excludeDecommissionedRS}} is set to true. Refer > [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java#L1721-L1730]. > If {{excludeDecommissionedRS}} is set to true and if we have > {{AccessController}} co-proc attached, it requires ADMIN permissions to > execute {{listDecommissionedRegionServers}} RPC. Refer > [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java#L1205-L1207]. > > {code:java} > @Override > public void > preListDecommissionedRegionServers(ObserverContext > ctx) > throws IOException { > requirePermission(ctx, "listDecommissionedRegionServers", Action.ADMIN); > } > {code} > I understand that we need ADMIN permissions for > _preDecommissionRegionServers_ and _preRecommissionRegionServer_ because it > changes the membership of regionservers but I don’t see any need for ADMIN > permissions for _listDecommissionedRegionServers_. Do you think we can > remove need for ADMIN permissions for _listDecommissionedRegionServers_ RPC? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28391) Remove the need for ADMIN permissions for listDecommissionedRegionServers
Rushabh Shah created HBASE-28391: Summary: Remove the need for ADMIN permissions for listDecommissionedRegionServers Key: HBASE-28391 URL: https://issues.apache.org/jira/browse/HBASE-28391 Project: HBase Issue Type: Bug Components: Admin Affects Versions: 2.5.7, 2.4.17 Reporter: Rushabh Shah Assignee: Rushabh Shah Why do we need {{ADMIN}} permissions for {{AccessController#preListDecommissionedRegionServers}}? From Phoenix, we are calling {{Admin#getRegionServers(true)}} where the argument {{excludeDecommissionedRS}} is set to true. Refer [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java#L1721-L1730]. If {{excludeDecommissionedRS}} is set to true and we have the {{AccessController}} co-proc attached, it requires ADMIN permissions to execute the {{listDecommissionedRegionServers}} RPC. Refer [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java#L1205-L1207].
{code:java}
@Override
public void preListDecommissionedRegionServers(ObserverContext ctx)
    throws IOException {
  requirePermission(ctx, "listDecommissionedRegionServers", Action.ADMIN);
}
{code}
I understand that we need ADMIN permissions for _preDecommissionRegionServers_ and _preRecommissionRegionServer_ because they change the membership of regionservers, but I don't see any need for ADMIN permissions for _listDecommissionedRegionServers_. Do you think we can remove the need for ADMIN permissions for the _listDecommissionedRegionServers_ RPC? -- This message was sent by Atlassian Jira (v8.20.10#820010)
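For context, a small, self-contained example of the Phoenix-style client call that ends up hitting this check; connection settings are assumed to come from the ambient hbase-site.xml.

{code:java}
import java.util.Collection;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ListLiveRegionServers {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // excludeDecommissionedRS = true makes the client call the
      // listDecommissionedRegionServers RPC, which AccessController guards with ADMIN.
      Collection<ServerName> live = admin.getRegionServers(true);
      live.forEach(sn -> System.out.println(sn.getServerName()));
    }
  }
}
{code}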
[jira] [Created] (HBASE-28293) Add metric for GetClusterStatus request count.
Rushabh Shah created HBASE-28293: Summary: Add metric for GetClusterStatus request count. Key: HBASE-28293 URL: https://issues.apache.org/jira/browse/HBASE-28293 Project: HBase Issue Type: Bug Reporter: Rushabh Shah We have been bitten multiple times by GetClusterStatus requests overwhelming the HMaster's memory. It would be good to add a metric for the total GetClusterStatus request count. In almost all of our production incidents involving GetClusterStatus requests, HMasters were running out of memory because many clients called this RPC in parallel and the response size is very big. In hbase2 we have [ClusterMetrics.Option|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterMetrics.java#L164-L224] which can reduce the size of the response. It would also be nice to add another metric to indicate whether the response size of GetClusterStatus is greater than some threshold (like 5MB) -- This message was sent by Atlassian Jira (v8.20.10#820010)
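As a usage note for the ClusterMetrics.Option mitigation mentioned above, here is a small example that asks only for the fields it needs so the GetClusterStatus response stays small; the particular option set is chosen just for illustration.

{code:java}
import java.util.EnumSet;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.ClusterMetrics.Option;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SmallClusterStatusRequest {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Request only live servers and the master instead of the full cluster status.
      ClusterMetrics metrics =
          admin.getClusterMetrics(EnumSet.of(Option.LIVE_SERVERS, Option.MASTER));
      System.out.println("live servers: " + metrics.getLiveServerMetrics().size());
    }
  }
}
{code}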
[jira] [Resolved] (HBASE-28204) Canary can take lot more time If any region (except the first region) starts with delete markers
[ https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-28204. -- Fix Version/s: 2.6.0 2.4.18 3.0.0-beta-1 2.5.7 Resolution: Fixed > Canary can take lot more time If any region (except the first region) starts > with delete markers > > > Key: HBASE-28204 > URL: https://issues.apache.org/jira/browse/HBASE-28204 > Project: HBase > Issue Type: Bug > Components: canary >Reporter: Mihir Monani >Assignee: Mihir Monani >Priority: Major > Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7 > > > In CanaryTool.java, Canary reads only the first row of the region using > [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33] > for any region of the table. Canary uses [Scan with FirstRowKeyFilter for > table > scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530] > If the said region has empty start key (This will only happen when region is > the first region for a table) > With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- > RawScan was > [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534] > to improve performance for regions which can have high number of delete > markers. Based on currently implementation, [RawScan is only > enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519] > if region has empty start-key (or region is first region for the table). > RawScan doesn't work for rest of the regions in the table except first > region. Also If the region has all the rows or majority of the rows with > delete markers, Get Operation can take a lot of time. This is can cause > timeouts for CanaryTool. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-28184. -- Resolution: Fixed > Tailing the WAL is very slow if there are multiple peers. > - > > Key: HBASE-28184 > URL: https://issues.apache.org/jira/browse/HBASE-28184 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.0 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7 > > > Noticed in one of our production clusters which has 4 peers. > Due to sudden ingestion of data, the size of log queue increased to a peak of > 506. We have configured log roll size to 256 MB. Most of the edits in the WAL > were from a table for which replication is disabled. > So all ReplicationSourceWALReader thread had to do was to replay the WAL and > NOT replicate them. Still it took 12 hours to drain the queue. > Took few jstacks and found that ReplicationSourceWALReader was waiting to > acquire rollWriterLock > [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231] > {noformat} > "regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 > nid=0xbd7f waiting on condition [0x7f37b4d19000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f3897a3e150> (a > java.util.concurrent.locks.ReentrantLock$FairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202) > at > java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228) > at > java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) > at > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102) > at > org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128) > at > org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown > Source) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361) > at > java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) > at > java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at > java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) > at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536) > at > org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129) > at > org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown > Source) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260) > at > 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157) > {noformat} > All the peers will contend for this lock during every batch read. > Look at the code snippet below. We are guarding this section with > rollWriterLock if we are replicating the active WAL file. But in our case we > are NOT replicating active WAL file but still we acquire this lock only to > return OptionalLong.empty(); > {noformat} > /** >* if the given {@code path} is being written currently, then return its > length. >* >* This is used by replication to prevent replicating unacked log entries. > See >*
[jira] [Created] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
Rushabh Shah created HBASE-28184: Summary: Tailing the WAL is very slow if there are multiple peers. Key: HBASE-28184 URL: https://issues.apache.org/jira/browse/HBASE-28184 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.0.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Noticed in one of our production clusters which has 4 peers. Due to sudden ingestion of data, the size of log queue increased to a peak of 506. We have configured log roll size to 256 MB. Most of the edits in the WAL were from a table for which replication is disabled. So all ReplicationSourceWALReader thread had to do was to replay the WAL and NOT replicate them. Still it took 12 hours to drain the queue. Took few jstacks and found that ReplicationSourceWALReader was waiting to acquire rollWriterLock [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231] {noformat} "regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 nid=0xbd7f waiting on condition [0x7f37b4d19000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7f3897a3e150> (a java.util.concurrent.locks.ReentrantLock$FairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202) at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102) at org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128) at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown Source) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361) at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536) at org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129) at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown Source) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222) at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157) {noformat} All the peers will contend for this lock during every batch read. Look at the code snippet below. We are guarding this section with rollWriterLock if we are replicating the active WAL file. But in our case we are NOT replicating active WAL file but still we acquire this lock only to return OptionalLong.empty(); {noformat} /** * if the given {@code path} is being written currently, then return its length. * * This is used by replication to prevent replicating unacked log entries. See * https://issues.apache.org/jira/browse/HBASE-14004 for more details. */ @Override public OptionalLong getLogFileSizeIfBeingWritten(Path path) { rollWriterLock.lock(); try { ... ... } finally { rollWriterLock.unlock(); } {noformat} We can check the size of log queue and if it is greater than 1 then we can return early without acquiring the lock. -- This message was sent by Atlassian Jira (v8.20.10#820010)
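A hedged sketch of the early-return idea in the last paragraph. The helper below is illustrative only: the queue and path arguments stand in for the state the replication WAL reader already keeps, and this is not the patch that was committed.

{code:java}
import java.util.OptionalLong;
import java.util.Queue;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.wal.WALFileLengthProvider;

final class ActiveWalLength {
  private ActiveWalLength() {
  }

  /**
   * Only the newest WAL in the queue can be the file currently being written, so for
   * any older WAL we can answer "not being written" without taking rollWriterLock.
   */
  static OptionalLong lengthIfBeingWritten(Queue<Path> logQueue, Path currentPath,
      WALFileLengthProvider walFileLengthProvider) {
    if (logQueue.size() > 1) {
      return OptionalLong.empty();
    }
    return walFileLengthProvider.getLogFileSizeIfBeingWritten(currentPath);
  }
}
{code}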
[jira] [Resolved] (HBASE-28045) Sort on store file size on hmaster page is broken.
[ https://issues.apache.org/jira/browse/HBASE-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-28045. -- Resolution: Invalid > Sort on store file size on hmaster page is broken. > -- > > Key: HBASE-28045 > URL: https://issues.apache.org/jira/browse/HBASE-28045 > Project: HBase > Issue Type: Bug > Components: UI >Affects Versions: 2.5.2 >Reporter: Rushabh Shah >Priority: Major > Labels: newbie, starter > Attachments: Screenshot 2023-08-24 at 11.08.54 AM.png, Screenshot > 2023-08-25 at 1.02.49 PM.png > > > An image is worth 1000 words. > !Screenshot 2023-08-24 at 11.08.54 AM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28045) Sort on store file size on hmaster page is broken.
Rushabh Shah created HBASE-28045: Summary: Sort on store file size on hmaster page is broken. Key: HBASE-28045 URL: https://issues.apache.org/jira/browse/HBASE-28045 Project: HBase Issue Type: Bug Components: UI Affects Versions: 2.5.2 Reporter: Rushabh Shah Attachments: Screenshot 2023-08-24 at 11.08.54 AM.png An image is worth 1000 words. !Screenshot 2023-08-24 at 11.08.54 AM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28039) Create metric for region in transition count per table.
Rushabh Shah created HBASE-28039: Summary: Create metric for region in transition count per table. Key: HBASE-28039 URL: https://issues.apache.org/jira/browse/HBASE-28039 Project: HBase Issue Type: Bug Reporter: Rushabh Shah Currently we have ritCount and ritCountOverThreshold metrics for the whole cluster. It would be nice to have the ritCount and ritCountOverThreshold metrics per table. That would help in debugging failed queries for a given table due to RITs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27957) HConnection (and ZooKeeperWatcher threads) leak in case of AUTH_FAILED exception.
Rushabh Shah created HBASE-27957: Summary: HConnection (and ZooKeeperWatcher threads) leak in case of AUTH_FAILED exception. Key: HBASE-27957 URL: https://issues.apache.org/jira/browse/HBASE-27957 Project: HBase Issue Type: Bug Affects Versions: 2.4.17, 1.7.2 Reporter: Rushabh Shah Observed this in a production environment running some version of the 1.7 release. The application didn't have the right keytab set up for authentication. The application was trying to create an HConnection and the zookeeper server threw an AUTH_FAILED exception. After a few hours of the application in this state, we saw thousands of zk-event-processor threads with the stack trace below. {noformat} "zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms elapsed=41794.58s tid=0x7fd7805066d0 nid=0x1245 waiting on condition [0x7fd75df01000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native Method) - parking to wait for <0x7fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081) at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628) {noformat}
{code:java|title=ConnectionManager.java|borderStyle=solid}
HConnectionImplementation(Configuration conf, boolean managed,
    ExecutorService pool, User user, String clusterId) throws IOException {
  ...
  ...
  try {
    this.registry = setupRegistry();
    retrieveClusterId();
    ...
    ...
  } catch (Throwable e) {
    // avoid leaks: registry, rpcClient, ...
    LOG.debug("connection construction failed", e);
    close();
    throw e;
  }
{code}
retrieveClusterId internally calls ZKConnectionRegistry#getClusterId
{code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
private String clusterId = null;

@Override
public String getClusterId() {
  if (this.clusterId != null) return this.clusterId;
  // No synchronized here, worse case we will retrieve it twice, that's
  // not an issue.
  try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
    this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
    if (this.clusterId == null) {
      LOG.info("ClusterId read in ZooKeeper is null");
    }
  } catch (KeeperException | IOException e) {  // ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
    LOG.warn("Can't retrieve clusterId from Zookeeper", e);
  }
  return this.clusterId;
}
{code}
ZkConnectionRegistry#getClusterId threw the following exception. (Our logging system trims stack traces longer than 5 lines.)
{noformat} Cause: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid StackTrace: org.apache.zookeeper.KeeperException.create(KeeperException.java:126) org.apache.zookeeper.KeeperException.create(KeeperException.java:54) org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213) org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285) org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470) {noformat} We should throw KeeperException from ZKConnectionRegistry#getClusterId all the way back to HConnectionImplementation constructor to close all the watcher threads and throw the exception back to the caller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
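A hedged sketch of the proposed fix, written against the getClusterId() quoted above: let the ZooKeeper failure propagate so the HConnectionImplementation constructor's catch (Throwable) path runs close() and releases the watcher threads. Wrapping the KeeperException in an IOException (and letting the method declare it) is an assumption about how the signature could change, not the committed patch.

{code:java}
@Override
public String getClusterId() throws IOException {
  if (this.clusterId != null) {
    return this.clusterId;
  }
  try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
    this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
    if (this.clusterId == null) {
      LOG.info("ClusterId read in ZooKeeper is null");
    }
  } catch (KeeperException e) {
    // Propagate instead of returning null so the caller can clean up its resources.
    throw new IOException("Can't retrieve clusterId from Zookeeper", e);
  }
  return this.clusterId;
}
{code}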
[jira] [Resolved] (HBASE-27908) Can't get connection to ZooKeeper
[ https://issues.apache.org/jira/browse/HBASE-27908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-27908. -- Resolution: Invalid > Can't get connection to ZooKeeper > - > > Key: HBASE-27908 > URL: https://issues.apache.org/jira/browse/HBASE-27908 > Project: HBase > Issue Type: Bug > Components: build >Affects Versions: 1.4.13 >Reporter: Ibrar Ahmed >Priority: Major > > I am using Hbase cluster along with apache kylin, the connection between Edge > node and the Hbase cluster is good. > following are the logs from Kylin side which shows Error exception: > {code:java} > java.net.SocketTimeoutException: callTimeout=120, callDuration=1275361: > org.apache.hadoop.hbase.MasterNotRunningException: Can't get connection to > ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:178) > at > org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4551) > at > org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:561) > at > org.apache.hadoop.hbase.client.HTable.getTableDescriptor(HTable.java:585) > at > org.apache.kylin.storage.hbase.steps.HFileOutputFormat3.configureIncrementalLoad(HFileOutputFormat3.java:328) > at > org.apache.kylin.storage.hbase.steps.CubeHFileJob.run(CubeHFileJob.java:101) > at > org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:144) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179) > at > org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179) > at > org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.hadoop.hbase.MasterNotRunningException: > org.apache.hadoop.hbase.MasterNotRunningException: Can't get connection to > ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1618) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1638) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1795) > at > org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:140) > ... 
13 more > Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Can't get > connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(ConnectionManager.java:971) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.access$400(ConnectionManager.java:566) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1567) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1609) > ... 17 more > Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /hbase > at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:425) > at > org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(ConnectionManager.java:960) > ... 20 more {code} > Following are the logs from Hbase cluster master NOde which accepts the > connection from Edge NOde(Kylin): > {code:java} > 2023-06-05 10:00:30,336 [myid:0] - INFO > [CommitProcessor:0:NIOServerCnxn@1056] - Closed socket connection for client > /10.127.2.201:37328 which had sessionid 0x7311c000c > 2023-06-05 13:14:48,346 [myid:0] - INFO > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task
[jira] [Resolved] (HBASE-26913) Replication Observability Framework
[ https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-26913. -- Resolution: Fixed > Replication Observability Framework > --- > > Key: HBASE-26913 > URL: https://issues.apache.org/jira/browse/HBASE-26913 > Project: HBase > Issue Type: New Feature > Components: regionserver, Replication >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 2.6.0, 3.0.0-alpha-4 > > > In our production clusters, we have seen cases where data is present in > source cluster but not in the sink cluster and 1 case where data is present > in sink cluster but not in source cluster. > We have internal tools where we take incremental backup every day on both > source and sink clusters and we compare the hash of the data in both the > backups. We have seen many cases where hash doesn't match which means data is > not consistent between source and sink for that given day. The Mean Time To > Detect (MTTD) these inconsistencies is atleast 2 days and requires lot of > manual debugging. > We need some tool where we can reduce MTTD and requires less manual debugging. > I have attached design doc. Huge thanks to [~bharathv] to come up with this > design at my work place. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27697) Create a dummy metric implementation in ConnectionImplementation.
Rushabh Shah created HBASE-27697: Summary: Create a dummy metric implementation in ConnectionImplementation. Key: HBASE-27697 URL: https://issues.apache.org/jira/browse/HBASE-27697 Project: HBase Issue Type: Bug Components: metrics Affects Versions: 2.5.0 Reporter: Rushabh Shah This Jira is for branch-2 only. If the CLIENT_SIDE_METRICS_ENABLED_KEY conf is set to true, then we initialize metrics to a MetricsConnection; otherwise it is set to null.
{code:java}
if (conf.getBoolean(CLIENT_SIDE_METRICS_ENABLED_KEY, false)) {
  this.metricsScope = MetricsConnection.getScope(conf, clusterId, this);
  this.metrics = MetricsConnection.getMetricsConnection(this.metricsScope,
      this::getBatchPool, this::getMetaLookupPool);
} else {
  this.metrics = null;
}
{code}
Whenever we want to update metrics, we always do a null check. We can improve this by creating a dummy metrics object with an empty implementation. When we want to populate the metrics, we can check whether metrics is an instance of the dummy metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010)
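A hedged sketch of the null-object idea described above. MetricsConnection in branch-2 is a concrete class, so the exact mechanics (a no-op subclass or extracting an interface) are left open; the ConnectionMetrics/NoopConnectionMetrics names and the two counters are illustrative only.

{code:java}
interface ConnectionMetrics {
  void incrMetaCacheHit();
  void incrMetaCacheMiss();
}

final class NoopConnectionMetrics implements ConnectionMetrics {
  static final NoopConnectionMetrics INSTANCE = new NoopConnectionMetrics();

  @Override
  public void incrMetaCacheHit() {
    // no-op
  }

  @Override
  public void incrMetaCacheMiss() {
    // no-op
  }
}
{code}

Call sites could then invoke metrics.incrMetaCacheHit() unconditionally instead of guarding every update with a null check.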
[jira] [Resolved] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.
[ https://issues.apache.org/jira/browse/HBASE-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-27100. -- Fix Version/s: 3.0.0-alpha-4 Resolution: Fixed > Add documentation for Replication Observability Framework in hbase book. > > > Key: HBASE-27100 > URL: https://issues.apache.org/jira/browse/HBASE-27100 > Project: HBase > Issue Type: Sub-task > Components: documentation >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-26925) Create WAL event tracker table to track all the WAL events.
[ https://issues.apache.org/jira/browse/HBASE-26925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-26925. -- Fix Version/s: 3.0.0-alpha-4 Resolution: Fixed > Create WAL event tracker table to track all the WAL events. > --- > > Key: HBASE-26925 > URL: https://issues.apache.org/jira/browse/HBASE-26925 > Project: HBase > Issue Type: Sub-task > Components: wal >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-4 > > > Design Doc: > [https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#] > Create wal event tracker table to track WAL events. Whenever we roll the WAL, > we will save the WAL name, WAL length, region server, timestamp in a table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.
[ https://issues.apache.org/jira/browse/HBASE-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-27085. -- Fix Version/s: 3.0.0-alpha-4 Resolution: Fixed > Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from > source cluster. > -- > > Key: HBASE-27085 > URL: https://issues.apache.org/jira/browse/HBASE-27085 > Project: HBase > Issue Type: Sub-task >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-4 > > > This work is to create sink tracker table to persist tracker rows coming from > replication source cluster. > Create ReplicationMarkerChore to create replication marker rows periodically. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27461) Add multiWAL support for Replication Observability framework.
Rushabh Shah created HBASE-27461: Summary: Add multiWAL support for Replication Observability framework. Key: HBASE-27461 URL: https://issues.apache.org/jira/browse/HBASE-27461 Project: HBase Issue Type: Sub-task Components: regionserver Reporter: Rushabh Shah HBASE-26913 added a new framework for observing the health of the replication subsystem. Currently we add the replication marker row to just one WAL (i.e. the default WAL). We need to add support for the multi-WAL implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27383) Add dead region server to SplitLogManager#deadWorkers set as the first step.
Rushabh Shah created HBASE-27383: Summary: Add dead region server to SplitLogManager#deadWorkers set as the first step. Key: HBASE-27383 URL: https://issues.apache.org/jira/browse/HBASE-27383 Project: HBase Issue Type: Bug Affects Versions: 1.7.2, 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Currently we add a dead region server to the +SplitLogManager#deadWorkers+ set in the SERVER_CRASH_SPLIT_LOGS state. Consider a case where a region server is handling the split log task for the hbase:meta table and SplitLogManager has exhausted all the retries and won't try any more region servers. The region server which was handling the split log task has died. We have a check in SplitLogManager where, if a region server is declared dead and that region server is responsible for a split log task, then we forcefully resubmit the split log task. See the code [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726]. But we add a region server to the SplitLogManager#deadWorkers set in the [SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252] state. Before that it runs the [SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214] state and checks if the hbase:meta table is up. In this case, the hbase:meta table was not online and that prevented SplitLogManager from adding this RS to the deadWorkers list. This created a deadlock and the hbase cluster was completely down for an extended period of time until we failed over the active hmaster. See HBASE-27382 for more details. Improvements: 1. We should add a dead region server to the +SplitLogManager#deadWorkers+ list as the first step. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27382) Cluster completely down due to wal splitting failing for hbase:meta table.
Rushabh Shah created HBASE-27382: Summary: Cluster completely down due to wal splitting failing for hbase:meta table. Key: HBASE-27382 URL: https://issues.apache.org/jira/browse/HBASE-27382 Project: HBase Issue Type: Bug Affects Versions: 1.7.2 Reporter: Rushabh Shah Assignee: Rushabh Shah We are running some version of 1.7.2 in our production environment. We encountered this issue recently. We colocate namenode and region server holding hbase:meta table on a set of 5 master nodes. Co-incidentally active namenode and region server holding meta table were on the same physical node and that node went down due to hardware issue. We have sub optimal hdfs level timeouts configured so whenever active namenode goes down, it takes around 12-15 minutes for hdfs client within hbase to connect to new active namenode. So all the region servers were having problems for about 15 minutes to connect to new active namenode. Below are the sequence of events: 1. Host running active namenode and hbase:meta went down at +2022-09-09 16:56:56,878_ 2. HMaster started running ServerCrashProcedure at +2022-09-09 16:59:05,696+ {noformat} 2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] procedure2.ProcedureExecutor - Procedure ServerCrashProcedure serverName=,61020,1662714013670, shouldSplitWal=true, carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to the store. 2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - Added=,61020,1662714013670 to dead servers, submitted shutdown handler to be executed meta=true 2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - Started processing ,61020,1662714013670; numProcessing=1 2022-09-09 16:59:05,712 INFO [ProcedureExecutor-0] procedure.ServerCrashProcedure - Start processing crashed ,61020,1662714013670 {noformat} 3. SplitLogManager created 2 split log tasks in zookeeper. {noformat} 2022-09-09 16:59:06,049 INFO [ProcedureExecutor-1] master.SplitLogManager - Started splitting 2 logs in [hdfs:///hbase/WALs/,61020,1662714013670-splitting] for [,61020,1662714013670] 2022-09-09 16:59:06,081 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta 2022-09-09 16:59:06,093 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662739251611.meta {noformat} 4. The first split log task is more interesting: +/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta+ 5. Since all the region servers were having problems connecting to active namenode, SplitLogManager tried total of 4 times to assign this task (3 resubmits, configured by hbase.splitlog.max.resubmit) and then finally gave up. 
{noformat} -- try 1 - 2022-09-09 16:59:06,205 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta acquired by ,61020,1662540522069 -- try 2 - 2022-09-09 17:01:06,642 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta 2022-09-09 17:01:06,666 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta ver = 2 2022-09-09 17:01:06,715 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta acquired by ,61020,1662530684713 -- try 3 - 2022-09-09 17:03:07,643 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta 2022-09-09 17:03:07,687 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta ver = 4 2022-09-09 17:03:07,738 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta acquired by ,61020,1662542355806 -- try 4 - 2022-09-09
[jira] [Created] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.
Rushabh Shah created HBASE-27100: Summary: Add documentation for Replication Observability Framework in hbase book. Key: HBASE-27100 URL: https://issues.apache.org/jira/browse/HBASE-27100 Project: HBase Issue Type: Sub-task Components: documentation Reporter: Rushabh Shah Assignee: Rushabh Shah -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.
Rushabh Shah created HBASE-27085: Summary: Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster. Key: HBASE-27085 URL: https://issues.apache.org/jira/browse/HBASE-27085 Project: HBase Issue Type: Sub-task Reporter: Rushabh Shah Assignee: Rushabh Shah This work is to create sink tracker table to persist tracker rows coming from replication source cluster. Create ReplicationMarkerChore to create replication marker rows periodically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HBASE-26963) ReplicationSource#removePeer hangs if we try to remove bad peer.
Rushabh Shah created HBASE-26963: Summary: ReplicationSource#removePeer hangs if we try to remove bad peer. Key: HBASE-26963 URL: https://issues.apache.org/jira/browse/HBASE-26963 Project: HBase Issue Type: Bug Affects Versions: 2.4.11 Reporter: Rushabh Shah ReplicationSource#removePeer hangs if we try to remove bad peer. Steps to reproduce: 1. Set config replication.source.regionserver.abort to false so that it doesn't abort regionserver. 2. Add a dummy peer. 2. Remove that peer. RemovePeer call will hang indefinitely until the test times out. Attached a patch to reproduce the above behavior. I can see following threads in the stack trace: {noformat} "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1" #339 daemon prio=5 os_prio=31 tid=0x7f8caa 44a800 nid=0x22107 waiting on condition [0x7000107e5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown Source) at java.lang.Thread.dispatchUncaughtException(Thread.java:1959) {noformat} {noformat} "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 os_prio=31 tid=0x7f8ca82fa800 nid=0x22307 in Object.wait() [0x7000106e2000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1260) - locked <0x000799975ea0> (a java.lang.Thread) at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330) at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61) at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35) at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} {noformat} "Listener at localhost/55013" #20 daemon prio=5 os_prio=31 tid=0x7f8caf95a000 nid=0x6703 waiting on condition [0x72 544000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442) at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372) at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182) at org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861) at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74) at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66) {noformat} The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is waiting for Admin#removeReplicationPeer. The refreshPeer thread (PeerProcedureHandlerImpl#removePeer) responsible for terminating the peer (#338) is waiting for the ReplicationSource thread to be terminated. The ReplicationSource thread (#339) is in a sleeping state. Notice that this thread's stack trace is in the ReplicationSource#uncaughtException method. When we call ReplicationSourceManager#removePeer, we set the sourceRunning flag to false, send an interrupt signal to the ReplicationSource thread
[jira] [Created] (HBASE-26957) Add support to hbase shell to remove coproc by its class name instead of coproc ID
Rushabh Shah created HBASE-26957: Summary: Add support to hbase shell to remove coproc by its class name instead of coproc ID Key: HBASE-26957 URL: https://issues.apache.org/jira/browse/HBASE-26957 Project: HBase Issue Type: Bug Components: Coprocessors, shell Affects Versions: 1.7.1 Reporter: Rushabh Shah The syntax for removing a coproc is as below: hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1' We have to use the coproc ID to remove a coproc from a given table. Consider the following scenario. Due to some bug in a coproc, we have to remove a given coproc from all the tables in a cluster. Every table can have a different set of coprocs, and for a given coproc class the coproc ID will not be the same across all the tables in a cluster. This gets even more complex if we want to remove the coproc from all the production clusters. Instead we should be able to pass a coproc class name to the alter table command: if a table has that coproc it is removed, otherwise nothing happens. A sketch of how the same result can be achieved today with the Java Admin API is shown below. -- This message was sent by Atlassian Jira (v8.20.1#820001)
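As referenced above, a hedged sketch of how this can be done today with the Java Admin API; the coprocessor class name is hypothetical, and TableDescriptorBuilder#removeCoprocessor(String) is assumed to be available in the client version in use.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class RemoveCoprocByClassName {
  public static void main(String[] args) throws Exception {
    String className = "com.example.BuggyCoprocessor"; // hypothetical coproc class
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (TableDescriptor td : admin.listTableDescriptors()) {
        if (td.hasCoprocessor(className)) {
          // Rebuild the descriptor without the offending coprocessor and apply it.
          admin.modifyTable(TableDescriptorBuilder.newBuilder(td)
              .removeCoprocessor(className)
              .build());
        }
      }
    }
  }
}
{code}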
[jira] [Created] (HBASE-26925) Create WAL event tracker table to track all the WAL events.
Rushabh Shah created HBASE-26925: Summary: Create WAL event tracker table to track all the WAL events. Key: HBASE-26925 URL: https://issues.apache.org/jira/browse/HBASE-26925 Project: HBase Issue Type: Sub-task Components: wal Reporter: Rushabh Shah Assignee: Rushabh Shah Design Doc: [https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#] Create a WAL event tracker table to track WAL events. Whenever we roll the WAL, we will save the WAL name, WAL length, region server and timestamp in a table; a purely illustrative sketch of such a row follows. -- This message was sent by Atlassian Jira (v8.20.1#820001)
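As mentioned above, a purely illustrative sketch of what one tracker row could look like; the rowkey layout and column names are assumptions for illustration, not the schema from the linked design doc.
{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalEventTrackerRowSketch {
  /** Builds a hypothetical tracker row for a single WAL roll event. */
  public static Put walRollEvent(String regionServer, String walName, long walLength, long timestamp) {
    // Hypothetical rowkey: regionserver + timestamp + wal name, so rows sort per server by time.
    byte[] rowKey = Bytes.toBytes(regionServer + "," + timestamp + "," + walName);
    return new Put(rowKey)
        .addColumn(Bytes.toBytes("info"), Bytes.toBytes("wal_name"), Bytes.toBytes(walName))
        .addColumn(Bytes.toBytes("info"), Bytes.toBytes("wal_length"), Bytes.toBytes(walLength))
        .addColumn(Bytes.toBytes("info"), Bytes.toBytes("region_server"), Bytes.toBytes(regionServer));
  }
}
{code}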
[jira] [Created] (HBASE-26913) Replication Observability Framework
Rushabh Shah created HBASE-26913: Summary: Replication Observability Framework Key: HBASE-26913 URL: https://issues.apache.org/jira/browse/HBASE-26913 Project: HBase Issue Type: New Feature Components: regionserver, Replication Reporter: Rushabh Shah Assignee: Rushabh Shah In our production clusters, we have seen cases where data is present in the source cluster but not in the sink cluster, and one case where data is present in the sink cluster but not in the source cluster. We have internal tools that take an incremental backup every day on both source and sink clusters, and we compare the hash of the data in both backups. We have seen many cases where the hash doesn't match, which means the data is not consistent between source and sink for that given day. The Mean Time To Detect (MTTD) these inconsistencies is at least 2 days and requires a lot of manual debugging. We need some tool that reduces MTTD and requires less manual debugging. I have attached the design doc. Huge thanks to [~bharathv] for coming up with this design at my workplace. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26905) ReplicationPeerManager#checkPeerExists should throw ReplicationPeerNotFoundException if peer doesn't exist
Rushabh Shah created HBASE-26905: Summary: ReplicationPeerManager#checkPeerExists should throw ReplicationPeerNotFoundException if peer doesn't exist Key: HBASE-26905 URL: https://issues.apache.org/jira/browse/HBASE-26905 Project: HBase Issue Type: Bug Components: Replication Reporter: Rushabh Shah ReplicationPeerManager#checkPeerExists should throw ReplicationPeerNotFoundException if the peer doesn't exist. Currently it throws a generic DoNotRetryIOException; a sketch of the suggested change follows.
{code:java}
private ReplicationPeerDescription checkPeerExists(String peerId) throws DoNotRetryIOException {
  ReplicationPeerDescription desc = peers.get(peerId);
  if (desc == null) {
    throw new DoNotRetryIOException("Replication peer " + peerId + " does not exist");
  }
  return desc;
}
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
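A minimal sketch of the suggested change, assuming org.apache.hadoop.hbase.ReplicationPeerNotFoundException (which extends DoNotRetryIOException) is acceptable at this call site; it shows the direction of the proposal, not necessarily the committed patch.
{code:java}
private ReplicationPeerDescription checkPeerExists(String peerId) throws DoNotRetryIOException {
  ReplicationPeerDescription desc = peers.get(peerId);
  if (desc == null) {
    // ReplicationPeerNotFoundException extends DoNotRetryIOException, so the method signature is unchanged.
    throw new ReplicationPeerNotFoundException(peerId);
  }
  return desc;
}
{code}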
[jira] [Created] (HBASE-26792) Implement ScanInfo#toString
Rushabh Shah created HBASE-26792: Summary: Implement ScanInfo#toString Key: HBASE-26792 URL: https://issues.apache.org/jira/browse/HBASE-26792 Project: HBase Issue Type: Improvement Reporter: Rushabh Shah Assignee: Rushabh Shah We don't have a ScanInfo#toString implementation. We use ScanInfo while creating a StoreScanner, which is used in the preFlushScannerOpen coprocessor hook, so a readable toString would help debugging there; a rough sketch is below. -- This message was sent by Atlassian Jira (v8.20.1#820001)
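A rough, hedged sketch of the kind of toString that would help; the field names (family, minVersions, maxVersions, ttl, keepDeletedCells) are assumptions about ScanInfo's state used for illustration only.
{code:java}
@Override
public String toString() {
  // Hypothetical fields; include whatever ScanInfo actually holds.
  return new StringBuilder("ScanInfo{")
      .append("family=").append(Bytes.toString(family))
      .append(", minVersions=").append(minVersions)
      .append(", maxVersions=").append(maxVersions)
      .append(", ttl=").append(ttl)
      .append(", keepDeletedCells=").append(keepDeletedCells)
      .append('}').toString();
}
{code}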
[jira] [Created] (HBASE-26702) Make ageOfLastShip extend TimeHistogram instead of plain histogram.
Rushabh Shah created HBASE-26702: Summary: Make ageOfLastShip extend TimeHistogram instead of plain histogram. Key: HBASE-26702 URL: https://issues.apache.org/jira/browse/HBASE-26702 Project: HBase Issue Type: Improvement Components: metrics, Replication Affects Versions: 2.3.7, 1.7.1, 3.0.0-alpha-3 Reporter: Rushabh Shah Assignee: Rushabh Shah Currently the age of last ship metric is an instance of the plain Histogram type. [Here|https://github.com/apache/hbase/blob/master/hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSourceImpl.java#L58] {quote} ageOfLastShippedOpHist = rms.getMetricsRegistry().getHistogram(SOURCE_AGE_OF_LAST_SHIPPED_OP); {quote} We can change it to a TimeHistogram so that we also get the range information. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26480) Close NamedQueueRecorder to allow HMaster/RS to shutdown gracefully
Rushabh Shah created HBASE-26480: Summary: Close NamedQueueRecorder to allow HMaster/RS to shutdown gracefully Key: HBASE-26480 URL: https://issues.apache.org/jira/browse/HBASE-26480 Project: HBase Issue Type: Bug Affects Versions: 1.7.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Saw one case in our production cluster where the RS was not exiting. Saw this non-daemon thread in the hung RS stack trace: {noformat} "main.slowlog.append-pool-pool1-t1" #26 prio=5 os_prio=31 tid=0x7faf23bf7800 nid=0x6d07 waiting on condition [0x73f4d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0004039e3840> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47) at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56) at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159) at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} This is coming from the [NamedQueueRecorder|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/namequeues/NamedQueueRecorder.java#L65] implementation. This bug doesn't exist in branch-2 and master since the Disruptor initialization has changed and we also set daemon=true. See [this code|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/namequeues/NamedQueueRecorder.java#L68] A minimal illustration of why the daemon flag matters here is included below. FYI [~vjasani] [~zhangduo] -- This message was sent by Atlassian Jira (v8.20.1#820001)
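As referenced above, a minimal, self-contained illustration (plain JDK code, not the NamedQueueRecorder implementation) of why the daemon flag matters: a non-daemon executor thread keeps the JVM alive after main returns, while a daemon thread factory does not. The thread name is made up.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DaemonThreadDemo {
  public static void main(String[] args) {
    // Non-daemon worker (the branch-1 situation): the JVM cannot exit until this pool is shut down.
    ExecutorService nonDaemon = Executors.newSingleThreadExecutor();
    nonDaemon.submit(DaemonThreadDemo::parkForever);

    // Daemon worker (the branch-2 behaviour): the JVM may exit even while this thread is still parked.
    ExecutorService daemon = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "slowlog.append-pool-daemon");
      t.setDaemon(true);
      return t;
    });
    daemon.submit(DaemonThreadDemo::parkForever);

    // Without this explicit shutdown (what closing NamedQueueRecorder should achieve),
    // the non-daemon pool alone would keep the process from ever exiting.
    nonDaemon.shutdownNow();
  }

  private static void parkForever() {
    while (true) {
      try {
        Thread.sleep(1_000);
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}
{code}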
[jira] [Created] (HBASE-26468) Region Server doesn't exit cleanly in case it crashes.
Rushabh Shah created HBASE-26468: Summary: Region Server doesn't exit cleanly in case it crashes. Key: HBASE-26468 URL: https://issues.apache.org/jira/browse/HBASE-26468 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.6.0 Reporter: Rushabh Shah Observed this in our production cluster running the 1.6 version. The RS crashed due to some reason but the process was still running. On debugging more, we found out there was one non-daemon thread running and it was not allowing the RS to stop cleanly. Our clusters are managed by Ambari and have auto-restart capability, but since the process was running and the pid file was present, Ambari also couldn't do much. There will always be some bug where we miss stopping some thread, but there should be a maximum amount of time we wait before forcing the exit; a sketch of a bounded join follows. Relevant code: [HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java]
{code:java}
logProcessInfo(getConf());
HRegionServer hrs = HRegionServer.constructRegionServer(regionServerClass, conf);
hrs.start();
hrs.join(); // -> This should be a timed join.
if (hrs.isAborted()) {
  throw new RuntimeException("HRegionServer Aborted");
}
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
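As referenced above, a self-contained sketch of the bounded-join idea in plain Java (not the actual HRegionServerCommandLine change); the timeout value is an assumption.
{code:java}
public class TimedJoinSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread hrs = new Thread(() -> {
      // Stand-in for a region server that has crashed but leaked a thread and never returns.
      while (true) {
        try {
          Thread.sleep(1_000);
        } catch (InterruptedException e) {
          return;
        }
      }
    }, "regionserver");
    hrs.start();

    long shutdownTimeoutMs = 3_000L; // small value for the demo; a real cap would likely be minutes
    hrs.join(shutdownTimeoutMs);     // bounded wait instead of an unbounded hrs.join()
    if (hrs.isAlive()) {
      // Some non-daemon thread is still running; force the exit rather than hang forever.
      System.err.println("Region server did not stop within " + shutdownTimeoutMs + " ms, forcing exit");
      System.exit(1);
    }
  }
}
{code}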
[jira] [Created] (HBASE-26442) TestReplicationEndpoint#testInterClusterReplication fails in branch-1
Rushabh Shah created HBASE-26442: Summary: TestReplicationEndpoint#testInterClusterReplication fails in branch-1 Key: HBASE-26442 URL: https://issues.apache.org/jira/browse/HBASE-26442 Project: HBase Issue Type: Bug Affects Versions: 1.7.1 Reporter: Rushabh Shah Assignee: Rushabh Shah {noformat} [INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ hbase-server --- [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] Running org.apache.hadoop.hbase.replication.TestReplicationEndpoint [ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 20.978 s <<< FAILURE! - in org.apache.hadoop.hbase.replication.TestReplicationEndpoint [ERROR] org.apache.hadoop.hbase.replication.TestReplicationEndpoint Time elapsed: 3.921 s <<< FAILURE! java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.hbase.replication.TestReplicationEndpoint.tearDownAfterClass(TestReplicationEndpoint.java:88) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) [INFO] [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestReplicationEndpoint.tearDownAfterClass:88 [INFO] [ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0 {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26435) [branch-1] The log rolling request may be canceled immediately in LogRoller due to a race
Rushabh Shah created HBASE-26435: Summary: [branch-1] The log rolling request may be canceled immediately in LogRoller due to a race Key: HBASE-26435 URL: https://issues.apache.org/jira/browse/HBASE-26435 Project: HBase Issue Type: Sub-task Components: wal Affects Versions: 1.6.0 Reporter: Rushabh Shah Fix For: 1.7.2 Saw this issue in our internal 1.6 branch. The WAL was rolled but the new WAL file was not writable, and it also logged the following error {noformat} 2021-11-03 19:20:19,503 WARN [.168:60020.logRoller] hdfs.DFSClient - Error while syncing java.io.IOException: Could not get block locations. Source file "/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389" - Aborting... at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670) 2021-11-03 19:20:19,507 WARN [.168:60020.logRoller] wal.FSHLog - pre-sync failed but an optimization so keep going java.io.IOException: Could not get block locations. Source file "/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389" - Aborting... at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670) {noformat} Since the new WAL file was not writable, appends to that file started failing immediately after it was rolled. {noformat} 2021-11-03 19:20:19,677 INFO [.168:60020.logRoller] wal.FSHLog - Rolled WAL /hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635965392022 with entries=253234, filesize=425.67 MB; new WAL /hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389 2021-11-03 19:20:19,690 WARN [020.append-pool17-t1] wal.FSHLog - Append sequenceId=1962661783, requesting roll of WAL java.io.IOException: Could not get block locations. Source file "/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635967219389" - Aborting... at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670) 2021-11-03 19:20:19,690 INFO [.168:60020.logRoller] wal.FSHLog - Archiving hdfs://prod-EMPTY-hbase2a/hbase/WALs/,60020,1635567166484/%2C60020%2C1635567166484.1635960792837 to hdfs://prod-EMPTY-hbase2a/hbase/oldWALs/hbase2a-dnds1-232-ukb.ops.sfdc.net%2C60020%2C1635567166484.1635960792837 {noformat} We always reset the rollLog flag within the LogRoller thread after the rollWal call is complete. The FSHLog#rollWriter method does many things, like replacing the writer and archiving old logs. If the append thread fails to write to the new file while the LogRoller thread is cleaning old logs, we will miss the roll request, since LogRoller resets the flag to false while the previous rollWriter call is still in progress. Relevant code: https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java#L183-L203 We need to reset the rollLog flag before we start rolling the wal; a small ordering sketch is included below. This is fixed in branch-2 and master via HBASE-22684, but we didn't fix it in branch-1. Also, branch-2 has the multi-WAL implementation, so the fix cannot apply cleanly to branch-1. -- This message was sent by Atlassian Jira (v8.20.1#820001)
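As referenced above, a self-contained sketch of the ordering problem in plain Java (not the LogRoller code): if the roll-requested flag is cleared after the roll finishes, a request that arrives while the roll is in progress is lost; clearing the flag before rolling keeps that request visible.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class RollFlagOrderingSketch {
  private final AtomicBoolean rollRequested = new AtomicBoolean(false);

  /** What an append thread does when it hits an append/sync error: ask for a roll. */
  void requestRoll() {
    rollRequested.set(true);
  }

  /** Buggy ordering: the flag is cleared after the slow roll, so a request made during the roll is dropped. */
  void rollLoopBuggy() {
    if (rollRequested.get()) {
      doSlowRoll();             // requestRoll() may fire in here...
      rollRequested.set(false); // ...and is then silently wiped out, so no further roll happens.
    }
  }

  /** Fixed ordering: clear the flag before rolling; a request arriving during the roll survives. */
  void rollLoopFixed() {
    if (rollRequested.compareAndSet(true, false)) {
      doSlowRoll();             // a concurrent requestRoll() sets the flag again and is honoured next iteration.
    }
  }

  private void doSlowRoll() {
    // Stand-in for FSHLog#rollWriter: replace the writer, archive old logs, etc.
  }
}
{code}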
[jira] [Created] (HBASE-26404) Update javadoc for CellUtil#createCell with tags methods.
Rushabh Shah created HBASE-26404: Summary: Update javadoc for CellUtil#createCell with tags methods. Key: HBASE-26404 URL: https://issues.apache.org/jira/browse/HBASE-26404 Project: HBase Issue Type: Bug Affects Versions: 2.4.8 Reporter: Rushabh Shah We have the following methods in the CellUtil class which are deprecated. We used to use these methods within custom coprocessors to create cells with custom tags. We deprecated them in the 2.0.0 release and created a new class called RawCell which is LimitedPrivate with visibility set to COPROC. There is no reference to the RawCell#createCell(Cell cell, List<Tag> tags) method in the javadoc of the CellUtil#createCell methods which are now deprecated. This is not user friendly. We should improve the javadoc of the CellUtil#createCell(Cell, tags) methods; a sketch of the improved javadoc follows the snippets below.
{noformat}
/**
 * Note : Now only CPs can create cell with tags using the CP environment
 * @return A new cell which is having the extra tags also added to it.
 * @deprecated As of release 2.0.0, this will be removed in HBase 3.0.0.
 */
@Deprecated
public static Cell createCell(Cell cell, List<Tag> tags) {
  return PrivateCellUtil.createCell(cell, tags);
}
{noformat}
{noformat}
public static Cell createCell(Cell cell, byte[] tags)
public static Cell createCell(Cell cell, byte[] value, byte[] tags)
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
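As mentioned above, a sketch of what the improved javadoc could look like; the wording is a suggestion only, and the link target is the RawCell#createCell(Cell, List<Tag>) method the description points to.
{code:java}
/**
 * Note: as of 2.0.0 only coprocessors can create cells with tags, via the CP environment.
 * @return A new cell which also has the extra tags added to it.
 * @deprecated As of release 2.0.0, this will be removed in HBase 3.0.0. Within a coprocessor,
 *             use {@link RawCell#createCell(Cell, java.util.List)} instead (RawCell is
 *             LimitedPrivate with visibility set to COPROC).
 */
@Deprecated
public static Cell createCell(Cell cell, List<Tag> tags) {
  return PrivateCellUtil.createCell(cell, tags);
}
{code}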
[jira] [Resolved] (HBASE-26195) Data is present in replicated cluster but not present in primary cluster.
[ https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-26195. -- Resolution: Fixed > Data is present in replicated cluster but not present in primary cluster. > - > > Key: HBASE-26195 > URL: https://issues.apache.org/jira/browse/HBASE-26195 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 1.7.0 >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 1.8.0 > > > We encountered a case where we are seeing some rows (via Phoenix) in > replicated cluster but they are not present in source/active cluster. > Triaging further we found memstore rollback logs in few of the region servers. > {noformat} > 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - > rollbackMemstore rolled back 23 > {noformat} > Looking more into logs, found that there were some hdfs layer issues sync'ing > wal to hdfs. > It was taking around 6 mins to sync wal. Logs below > {noformat} > 2021-07-28 14:19:30,511 WARN [sync.0] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391210ms (threshold=3ms). File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,589 WARN [sync.1] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391148ms (threshold=3ms). File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,589 WARN [sync.2] hdfs.DataStreamer - Slow > waitForAckedSeqno took 391147ms (threshold=3ms). 
File being written: > /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, > block: BP-958889176--1567030695029:blk_1689647875_616028364, Write > pipeline datanodes: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. > 2021-07-28 14:19:30,591 INFO [sync.0] wal.FSHLog - Slow sync cost: 391289 > ms, current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] > 2021-07-28 14:19:30,591 INFO [sync.1] wal.FSHLog - Slow sync cost: 391227 > ms, current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] > 2021-07-28 14:19:30,591 WARN [sync.1] wal.FSHLog - Requesting log roll > because we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, > current pipeline: > [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], > > DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], > > DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] >
[jira] [Created] (HBASE-26247) TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance doesn't read archived WAL file.
Rushabh Shah created HBASE-26247: Summary: TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance doesn't read archived WAL file. Key: HBASE-26247 URL: https://issues.apache.org/jira/browse/HBASE-26247 Project: HBase Issue Type: Bug Reporter: Rushabh Shah TestWALRecordReader#testWALRecordReaderActiveArchiveTolerance is testing the following scenario.
1. Create a new WAL file.
2. Write 2 KVs to the WAL file.
3. Close the WAL file.
4. Instantiate WALInputFormat#WALKeyRecordReader with the WAL created in step 1.
5. Read the first KV.
6. Archive the WAL file to the oldWALs directory via a rename operation.
7. Read the second KV.
This is meant to test that WALKeyRecordReader encounters an FNFE, since the WAL file is no longer present in the original location, and handles the FNFE by opening the WAL file from the archived location. In step 7, the test expects to encounter an FNFE and open a new reader, but in reality it does not encounter an FNFE. I think the reason is that during a rename operation HDFS just changes the internal metadata for the file name; the inode ID, HDFS blocks and block locations remain the same. While reading the first KV, DFSInputStream caches all the HDFS block and location data, so it doesn't have to go to the NameNode to re-resolve the file name. A small illustration of this behaviour follows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
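A small illustration of the behaviour described above using the generic Hadoop FileSystem API (not the test itself); the paths are hypothetical and the file must already exist, but it shows that an already-open stream keeps working after the file is renamed, so no FileNotFoundException is raised.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenStreamSurvivesRename {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path wal = new Path("/tmp/WALs/example-wal");          // hypothetical "active" location
    Path archived = new Path("/tmp/oldWALs/example-wal");  // hypothetical archive location

    try (FSDataInputStream in = fs.open(wal)) {
      in.read();                // the first read resolves and caches block locations
      fs.rename(wal, archived); // archive the file while the stream is still open
      in.read();                // still succeeds: reads go by inode/blocks, not by file name
    }
  }
}
{code}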
[jira] [Created] (HBASE-26195) Data is present in replicated cluster but not visible on primary cluster.
Rushabh Shah created HBASE-26195: Summary: Data is present in replicated cluster but not visible on primary cluster. Key: HBASE-26195 URL: https://issues.apache.org/jira/browse/HBASE-26195 Project: HBase Issue Type: Bug Components: Replication, wal Affects Versions: 1.7.0 Reporter: Rushabh Shah Assignee: Rushabh Shah We encountered a case where we are seeing some rows (via Phoenix) in replicated cluster but they are not present in source/active cluster. Triaging further we found memstore rollback logs in few of the region servers. {noformat} 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23 {noformat} Looking more into logs, found that there were some hdfs layer issues sync'ing wal to hdfs. It was taking around 6 mins to sync wal. Logs below {noformat} 2021-07-28 14:19:30,511 WARN [sync.0] hdfs.DataStreamer - Slow waitForAckedSeqno took 391210ms (threshold=3ms). File being written: /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, block: BP-958889176--1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. 2021-07-28 14:19:30,589 WARN [sync.1] hdfs.DataStreamer - Slow waitForAckedSeqno took 391148ms (threshold=3ms). File being written: /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, block: BP-958889176--1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. 2021-07-28 14:19:30,589 WARN [sync.2] hdfs.DataStreamer - Slow waitForAckedSeqno took 391147ms (threshold=3ms). File being written: /hbase/WALs/,60020,1626191371499/%2C60020%2C1626191371499.1627480615620, block: BP-958889176--1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]. 
2021-07-28 14:19:30,591 INFO [sync.0] wal.FSHLog - Slow sync cost: 391289 ms, current pipeline: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] 2021-07-28 14:19:30,591 INFO [sync.1] wal.FSHLog - Slow sync cost: 391227 ms, current pipeline: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] 2021-07-28 14:19:30,591 WARN [sync.1] wal.FSHLog - Requesting log roll because we exceeded slow sync threshold; time=391227 ms, threshold=1 ms, current pipeline: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] 2021-07-28 14:19:30,591 INFO [sync.2] wal.FSHLog - Slow sync cost: 391227 ms, current pipeline: [DatanodeInfoWithStorage[:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]] 2021-07-28 14:19:30,591 WARN [sync.2] wal.FSHLog - Requesting log roll because we exceeded slow
[jira] [Created] (HBASE-26121) Formatter to convert from epoch time to human readable date format.
Rushabh Shah created HBASE-26121: Summary: Formatter to convert from epoch time to human readable date format. Key: HBASE-26121 URL: https://issues.apache.org/jira/browse/HBASE-26121 Project: HBase Issue Type: Improvement Components: shell Reporter: Rushabh Shah In the shell, we have custom formatters to convert from bytes to Long/Int for long/int data type values. Many times we store an epoch timestamp (event creation or update time) as a long in our table columns. Even after converting such a column to Long, the date is not in a human readable format. We still have to convert this long into a date using some bash tricks, and it is not convenient to do for many columns. We can introduce a new format method called +toLongDate+ which signifies that we want to convert the bytes to a long first and then to a date; a sketch of the conversion is below. Please let me know if any such functionality already exists and I am not aware of it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
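As mentioned above, a sketch of the conversion such a +toLongDate+ formatter would perform (illustrative Java, not the shell formatter itself).
{code:java}
import java.time.Instant;
import org.apache.hadoop.hbase.util.Bytes;

public class ToLongDateSketch {
  /** bytes -> long (epoch millis) -> human readable date, which is what a toLongDate formatter would do. */
  public static String toLongDate(byte[] value) {
    long epochMillis = Bytes.toLong(value);
    return Instant.ofEpochMilli(epochMillis).toString();
  }

  public static void main(String[] args) {
    byte[] stored = Bytes.toBytes(1620774393046L); // epoch millis as it might sit in a column
    System.out.println(toLongDate(stored));        // prints 2021-05-11T23:06:33.046Z
  }
}
{code}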
[jira] [Created] (HBASE-26106) AbstractFSWALProvider#getArchivedLogPath doesn't look for wal file in all oldWALs directory.
Rushabh Shah created HBASE-26106: Summary: AbstractFSWALProvider#getArchivedLogPath doesn't look for wal file in all oldWALs directory. Key: HBASE-26106 URL: https://issues.apache.org/jira/browse/HBASE-26106 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.4.4, 3.0.0-alpha-1, 2.5.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Below is the code for AbstractFSWALProvider#getArchivedLogPath:
{code:java}
public static Path getArchivedLogPath(Path path, Configuration conf) throws IOException {
  Path rootDir = CommonFSUtils.getWALRootDir(conf);
  Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
  if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
    ServerName serverName = getServerNameFromWALDirectoryName(path);
    if (serverName == null) {
      LOG.error("Couldn't locate log: " + path);
      return path;
    }
    oldLogDir = new Path(oldLogDir, serverName.getServerName());
  }
  Path archivedLogLocation = new Path(oldLogDir, path.getName());
  final FileSystem fs = CommonFSUtils.getWALFileSystem(conf);
  if (fs.exists(archivedLogLocation)) {
    LOG.info("Log " + path + " was moved to " + archivedLogLocation);
    return archivedLogLocation;
  } else {
    LOG.error("Couldn't locate log: " + path);
    return path;
  }
}
{code}
This method is called from the following places. [AbstractFSWALProvider#openReader|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/AbstractFSWALProvider.java#L524] [ReplicationSource#getFileSize|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L399] [WALInputFormat.WALRecordReader#nextKeyValue|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/WALInputFormat.java#L220] All of the above calls are trying to find the log in the archive path after they couldn't locate the wal in the walsDir, and they are not used for moving a log file to the archive directory. But we will look for the archived path within the serverName sub-directory only if the conf key is true; a sketch of checking both locations is included below. Cc [~zhangduo] -- This message was sent by Atlassian Jira (v8.3.4#803005)
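As referenced above, a hedged sketch of the direction this issue suggests (not the committed fix): look for the archived WAL both under oldWALs/<serverName>/ and directly under oldWALs/, since callers only use this method to locate an already-archived log.
{code:java}
public static Path getArchivedLogPath(Path path, Configuration conf) throws IOException {
  Path rootDir = CommonFSUtils.getWALRootDir(conf);
  Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
  final FileSystem fs = CommonFSUtils.getWALFileSystem(conf);
  // First try the per-server sub-directory, regardless of the SEPARATE_OLDLOGDIR setting.
  ServerName serverName = getServerNameFromWALDirectoryName(path);
  if (serverName != null) {
    Path archivedInServerDir = new Path(new Path(oldLogDir, serverName.getServerName()), path.getName());
    if (fs.exists(archivedInServerDir)) {
      return archivedInServerDir;
    }
  }
  // Then fall back to the root of the oldWALs directory.
  Path archivedInRoot = new Path(oldLogDir, path.getName());
  if (fs.exists(archivedInRoot)) {
    return archivedInRoot;
  }
  LOG.error("Couldn't locate log: " + path);
  return path; // keep the existing behaviour of returning the original path when not found
}
{code}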
[jira] [Created] (HBASE-26103) conn.getBufferedMutator(tableName) leaks thread executors and other problems (for master branch)
Rushabh Shah created HBASE-26103: Summary: conn.getBufferedMutator(tableName) leaks thread executors and other problems (for master branch) Key: HBASE-26103 URL: https://issues.apache.org/jira/browse/HBASE-26103 Project: HBase Issue Type: Sub-task Components: Client Affects Versions: 3.0.0-alpha-1 Reporter: Rushabh Shah Assignee: Rushabh Shah -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
Rushabh Shah created HBASE-26075: Summary: Replication is stuck due to zero length wal file in oldWALs directory. Key: HBASE-26075 URL: https://issues.apache.org/jira/browse/HBASE-26075 Project: HBase Issue Type: Bug Components: Replication, wal Affects Versions: 1.7.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Recently we encountered a case where size of log queue was increasing to around 300 in few region servers in our production environment. There were 295 wals in the oldWALs directory for that region server and the *first file* was a 0 length file. Replication was throwing the following error. {noformat} 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156) Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) ... 1 more {noformat} We fixed similar error via HBASE-25536 but the zero length file was in recovered sources. There were more logs after the above stack trace. 
{noformat} 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs:///hbase/WALs/ 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ was moved to hdfs:///hbase/oldWALs/ {noformat} There is special logic in the ReplicationSourceWALReader thread to handle 0 length files, but we only look in the WALs directory and not in the oldWALs directory.
{code}
private boolean handleEofException(Exception e, WALEntryBatch batch) {
  PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
  // Dump the log even if logQueue size is 1 if the source is from recovered Source
  // since we don't add current log to recovered source queue so it is safe to remove.
  if ((e instanceof EOFException || e.getCause() instanceof EOFException)
      && (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
    Path head = queue.peek();
    try {
      if (fs.getFileStatus(head).getLen() == 0) {
        // head of the queue is an empty log file
        LOG.warn("Forcing removal of 0 length log in queue: {}", head);
        logQueue.remove(walGroupId);
        currentPosition = 0;
        if (batch != null) {
          // After we removed the WAL from the queue, we should try shipping the existing batch of
          // entries
          addBatchToShippingQueue(batch);
        }
        return true;
      }
    }
{code}
[jira] [Created] (HBASE-25932) TestWALEntryStream#testCleanClosedWALs test is failing.
Rushabh Shah created HBASE-25932: Summary: TestWALEntryStream#testCleanClosedWALs test is failing. Key: HBASE-25932 URL: https://issues.apache.org/jira/browse/HBASE-25932 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4 Reporter: Rushabh Shah Assignee: Rushabh Shah We are seeing the following test failure: TestWALEntryStream#testCleanClosedWALs. This test was added in HBASE-25924. I don't think the test failure has anything to do with the patch in HBASE-25924; before HBASE-25924, we were *not* monitoring the _uncleanlyClosedWAL_ metric. In all the branches, we were not parsing the wal trailer when we close the wal reader inside the ReplicationSourceWALReader thread. The root cause: when we add an active WAL to ReplicationSourceWALReader, we cache the file size while the wal is still being actively written; once the wal is closed, replicated and removed from WALEntryStream, we reset the ProtobufLogReader object but we didn't update the length of the wal, and that was causing EOF errors since it can't find the WALTrailer with the stale wal file length. The fix applied nicely to branch-1 since we use the FSHLog implementation, which closes the WAL file synchronously. But in branch-2 and master we use the _AsyncFSWAL_ implementation and the closing of the wal file is done asynchronously (as the name suggests). This is causing the test to fail. Below is the test.
{code:java}
@Test
public void testCleanClosedWALs() throws Exception {
  try (WALEntryStream entryStream = new WALEntryStream(
      logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) {
    assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
    appendToLogAndSync();
    assertNotNull(entryStream.next());
    log.rollWriter(); // ===> This does an asynchronous close of wal.
    appendToLogAndSync();
    assertNotNull(entryStream.next());
    assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
  }
}
{code}
In the above code, when we roll the writer we don't close the old wal file immediately, so the ReplicationSourceWALReader thread is not able to get the updated wal file size and that is throwing EOF errors. If I add a sleep of a few milliseconds (1 ms in my local env) between the rollWriter and appendToLogAndSync statements, then the test passes, but this is *not* a proper fix since we would just be working around the race between the ReplicationSourceWALReaderThread and the closing of the WAL file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25924) Seeing a spike in uncleanlyClosedWALs metric.
Rushabh Shah created HBASE-25924: Summary: Seeing a spike in uncleanlyClosedWALs metric. Key: HBASE-25924 URL: https://issues.apache.org/jira/browse/HBASE-25924 Project: HBase Issue Type: Bug Reporter: Rushabh Shah Assignee: Rushabh Shah Getting the following log line in all of our production clusters when WALEntryStream is dequeuing a WAL file. {noformat} 2021-05-02 04:01:30,437 DEBUG [04901996] regionserver.WALEntryStream - Reached the end of WAL file hdfs://. It was not closed cleanly, so we did not parse 8 bytes of data. This is normally ok. {noformat} The 8 bytes are usually the trailer size. While dequeuing the WAL file from WALEntryStream, we reset the reader here: [WALEntryStream|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L199-L221]
{code:java}
private void tryAdvanceEntry() throws IOException {
  if (checkReader()) {
    readNextEntryAndSetPosition();
    if (currentEntry == null) { // no more entries in this log file - see if log was rolled
      if (logQueue.getQueue(walGroupId).size() > 1) { // log was rolled
        // Before dequeueing, we should always get one more attempt at reading.
        // This is in case more entries came in after we opened the reader,
        // and a new log was enqueued while we were reading. See HBASE-6758
        resetReader(); // ---> HERE
        readNextEntryAndSetPosition();
        if (currentEntry == null) {
          if (checkAllBytesParsed()) {
            // now we're certain we're done with this log file
            dequeueCurrentLog();
            if (openNextLog()) {
              readNextEntryAndSetPosition();
            }
          }
        }
      }
      // no other logs, we've simply hit the end of the current open log. Do nothing
    }
  }
  // do nothing if we don't have a WAL Reader (e.g. if there's no logs in queue)
}
{code}
In resetReader, we call the following methods: WALEntryStream#resetReader ---> ProtobufLogReader#reset ---> ProtobufLogReader#initInternal. In ProtobufLogReader#initInternal, we try to create the whole reader object from scratch to see if any new data has been written. We reset all the fields of ProtobufLogReader except for ReaderBase#fileLength. We calculate whether the trailer is present or not depending on the fileLength. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25893) Corruption in recovered WAL in WALSplitter
Rushabh Shah created HBASE-25893: Summary: Corruption in recovered WAL in WALSplitter Key: HBASE-25893 URL: https://issues.apache.org/jira/browse/HBASE-25893 Project: HBase Issue Type: Improvement Components: regionserver, wal Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Recently we encountered RS aborts due to NPE while replaying edits from split logs during region open. {noformat} 2021-05-13 19:34:28,871 ERROR [:60020-17] handler.OpenRegionHandler - Failed open of region=,1619036437822.0556ab96be88000b6f5f3fad47938ccd., starting to roll back the global memstore size. java.lang.NullPointerException at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4682) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsForPaths(HRegion.java:4557) at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:4470) at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:949) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:908) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7253) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7214) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7185) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7141) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7092) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:364) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:131) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} Tracing back how the corrupt wal was generated. 
{noformat} 2021-05-12 05:21:23,333 FATAL [:60020-0-Writer-1] wal.WALSplitter - 556ab96be88000b6f5f3fad47938ccd/5039807= to log java.nio.channels.ClosedChannelException at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331) at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151) at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543) at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104) at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218) at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128) at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1742) at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1714) at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1179) at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1171) at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1141) 2021-05-12 05:21:23,333 ERROR [:60020-0-Writer-1] wal.WALSplitter - Exiting thread java.nio.channels.ClosedChannelException at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331) at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151) at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543) at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104) at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218) at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128) at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1742) at
[jira] [Created] (HBASE-25887) Corrupt wal while region server is aborting.
Rushabh Shah created HBASE-25887: Summary: Corrupt wal while region server is aborting. Key: HBASE-25887 URL: https://issues.apache.org/jira/browse/HBASE-25887 Project: HBase Issue Type: Improvement Components: regionserver, wal Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah We have seen a case in our production cluster where we ended up in corrupt wal. WALSplitter logged the below error {noformat} 2021-05-12 00:42:46,786 FATAL [:60020-1] regionserver.HRegionServer - ABORTING region server HOST-B,60020,16207794418 88: Caught throwable while processing event RS_LOG_REPLAY java.lang.NullPointerException at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105) at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} Looking at the raw wal file, we could see that the last WALEdit contains the region id, tablename and sequence number but cells were not persisted. Looking at the logs of the RS that generated that corrupt wal file, {noformat} 2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046 2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping rpc client 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A/:60020 closing leases 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A:/HOST-A:60020 closed leases 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL java.nio.channels.ClosedChannelException at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331) at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151) at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543) at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104) at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218) at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} These 2 lines are interesting. {quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL java.nio.channels.ClosedChannelException {quote} The append thread encountered java.nio.channels.ClosedChannelException while writing to the wal file because the wal file was already closed. This is the sequence of shutting down threads when the RS aborts.
{noformat}
// With disruptor down, this is safe to let go.
if (this.appendExecutor != null) this.appendExecutor.shutdown();

// Tell our listeners that the log is closing ...
if (LOG.isDebugEnabled()) {
  LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
}
if (this.writer != null) {
[jira] [Created] (HBASE-25860) Add metric for successful wal roll requests.
Rushabh Shah created HBASE-25860: Summary: Add metric for successful wal roll requests. Key: HBASE-25860 URL: https://issues.apache.org/jira/browse/HBASE-25860 Project: HBase Issue Type: Improvement Components: metrics, wal Affects Versions: 1.6.0 Reporter: Rushabh Shah We don't have any metric for the number of successful wal roll requests. If we had that metric, then we could add alerts if it gets stuck for some reason. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25856) sorry for my mistake, could someone delete it.
[ https://issues.apache.org/jira/browse/HBASE-25856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-25856. -- Resolution: Invalid > sorry for my mistake, could someone delete it. > -- > > Key: HBASE-25856 > URL: https://issues.apache.org/jira/browse/HBASE-25856 > Project: HBase > Issue Type: Improvement >Reporter: junwen yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25679) Size of log queue metric is incorrect in branch-1
Rushabh Shah created HBASE-25679: Summary: Size of log queue metric is incorrect in branch-1 Key: HBASE-25679 URL: https://issues.apache.org/jira/browse/HBASE-25679 Project: HBase Issue Type: Improvement Affects Versions: 1.7.0 Reporter: Rushabh Shah In HBASE-25539 I did some refactoring to add a new metric "oldestWalAge" and tried to consolidate updates to all the metrics related to the ReplicationSource class (size of log queue and oldest wal age) in one place. That refactoring introduced a bug where we decrement the size of log queue metric twice whenever we remove a wal from the ReplicationSource queue. We need to fix this only in branch-1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25622) Result#compareResults should compare tags.
Rushabh Shah created HBASE-25622: Summary: Result#compareResults should compare tags. Key: HBASE-25622 URL: https://issues.apache.org/jira/browse/HBASE-25622 Project: HBase Issue Type: Improvement Components: Client Affects Versions: 1.7.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Today +Result#compareResults+ compares the 2 cells based on the following parameters: row, family, qualifier, timestamp, type and value.
{noformat}
for (int i = 0; i < res1.size(); i++) {
  if (!ourKVs[i].equals(replicatedKVs[i]) ||
      !CellUtil.matchingValue(ourKVs[i], replicatedKVs[i])) {
    throw new Exception("This result was different: " + res1.toString() + " compared to " + res2.toString());
  }
{noformat}
We also need to compare tags to determine if both cells are equal or not. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25612) HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.
Rushabh Shah created HBASE-25612: Summary: HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs. Key: HBASE-25612 URL: https://issues.apache.org/jira/browse/HBASE-25612 Project: HBase Issue Type: Improvement Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah In our production cluster, we encountered an issue where the number of files within the /hbase/oldWALs directory was growing exponentially from about 4000 baseline to 15 and growing at the rate of 333 files per minute. On further investigation we found that the ReplicationLogCleaner thread was getting aborted since it was not able to talk to zookeeper. Stack trace below: {noformat} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - replicationLogCleaner-0x302e05e0d8f, quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713) at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87) at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99) at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70) at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262) at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52) at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413) at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410) at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481) at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410) at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52) at org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node 2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] master.ReplicationLogCleaner - org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs 2021-02-25 23:05:01,150 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking deletable files {noformat} {quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. 
Reason: Failed to get stat of replication rs node {quote} This line is more scary: HMaster invoked the Abortable but just ignored it, and HMaster kept doing its business as usual. We have a max-files-per-directory configuration in the NameNode which is set to 1M in our clusters. If this directory had reached that limit, it would have brought down the whole cluster. We shouldn't ignore the Abortable and should crash the HMaster if the Abortable is invoked. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25500) Add metric for age of oldest wal in region server replication queue.
[ https://issues.apache.org/jira/browse/HBASE-25500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rushabh Shah resolved HBASE-25500. -- Resolution: Duplicate Dup of HBASE-25539 > Add metric for age of oldest wal in region server replication queue. > > > Key: HBASE-25500 > URL: https://issues.apache.org/jira/browse/HBASE-25500 > Project: HBase > Issue Type: Improvement > Components: metrics, regionserver, Replication >Reporter: Rushabh Shah >Assignee: Rushabh Shah >Priority: Major > Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0 > > > We have seen cases recently where we have wal from 2018 timestamp in our > recovered replication queue. We came across this un replicated wal while > debugging something else. We need to have metrics for the oldest wal in the > replication queue and have alerts if it exceeds some threshold. Clearly 2 > years old wal is not desirable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25539) Add metric for age of oldest wal.
Rushabh Shah created HBASE-25539: Summary: Add metric for age of oldest wal. Key: HBASE-25539 URL: https://issues.apache.org/jira/browse/HBASE-25539 Project: HBase Issue Type: Improvement Components: metrics, regionserver Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah In our production clusters, we have seen multiple cases where some of the wals linger in zk replication queues for months and we have no insight into it. Recently we fixed one case where a wal was getting stuck because it was 0 size and came from old sources (HBASE-25536). It would be helpful to have a metric that reports the age of the oldest wal so we can add an alert on it for monitoring purposes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25536) Remove 0 length wal file from queue if it belongs to old sources.
Rushabh Shah created HBASE-25536: Summary: Remove 0 length wal file from queue if it belongs to old sources. Key: HBASE-25536 URL: https://issues.apache.org/jira/browse/HBASE-25536 Project: HBase Issue Type: Improvement Components: Replication Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2 In our production clusters, we found one case where RS is not removing 0 length file from replication queue (in memory one not the zk replication queue) if the logQueue size is 1. Stack trace below: {noformat} 2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/%2C60020%2C1606126266791.1606852981112 not a SequenceFile at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147) Caused by: java.io.EOFException: hdfs://hbase/oldWALs/%2C60020%2C1606126266791.1606852981112 not a SequenceFile at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934) at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893) at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1842) at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1856) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.(SequenceFileLogReader.java:70) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177) at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265) at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198) at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108) ... 1 more {noformat} The wal in question is of length 0 (verified via hadoop ls command) and is from recovered sources. There is just 1 log file in the queue (verified via heap dump). We have logic to remove 0 length log file from queue when we encounter EOFException and logQueue#size is greater than 1. Code snippet below. 
{code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
// if we get an EOF due to a zero-length log, and there are other logs in queue
// (highly likely we've closed the current log), we've hit the max retries, and autorecovery is
// enabled, then dump the log
private void handleEofException(IOException e) {
  if ((e instanceof EOFException || e.getCause() instanceof EOFException)
      && logQueue.size() > 1 && this.eofAutoRecovery) {
    try {
      if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
        LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
        logQueue.remove();
        currentPosition = 0;
      }
    } catch (IOException ioe) {
      LOG.warn("Couldn't get file length information about log " + logQueue.peek());
    }
  }
}
{code}
This size check is valid for active sources, where we need to keep at least one wal file (the current wal file), but for recovered sources, where we don't add the current wal file to the queue, we can skip the logQueue#size check (see the sketch below). -- This message was sent by Atlassian Jira (v8.3.4#803005)
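A minimal sketch of the relaxed condition described in HBASE-25536 above. The `isRecoveredSource` flag is a hypothetical stand-in for however the reader can tell it is draining a recovered queue; it is not the actual field name in ReplicationSourceWALReader.
{code:java}
// Sketch only: allow dropping a 0-length log even when it is the only entry in the
// queue, provided the source is a recovered one (there is no "current" wal to protect).
private void handleEofException(IOException e) {
  boolean eof = e instanceof EOFException || e.getCause() instanceof EOFException;
  // isRecoveredSource is assumed; recovered queues never contain the live wal,
  // so a queue holding only an empty file is safe to drain completely.
  boolean sizeCheckPasses = logQueue.size() > 1 || isRecoveredSource;
  if (eof && sizeCheckPasses && this.eofAutoRecovery) {
    try {
      if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
        LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
        logQueue.remove();
        currentPosition = 0;
      }
    } catch (IOException ioe) {
      LOG.warn("Couldn't get file length information about log " + logQueue.peek());
    }
  }
}
{code}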
[jira] [Created] (HBASE-25500) Add metric for age of oldest wal in region server replication queue.
Rushabh Shah created HBASE-25500: Summary: Add metric for age of oldest wal in region server replication queue. Key: HBASE-25500 URL: https://issues.apache.org/jira/browse/HBASE-25500 Project: HBase Issue Type: Improvement Components: metrics, regionserver, Replication Reporter: Rushabh Shah Assignee: Rushabh Shah Fix For: 3.0.0-alpha-1, 1.7.0 We have seen cases recently where we had a wal with a 2018 timestamp in our recovered replication queue. We came across this un-replicated wal while debugging something else. We need a metric for the age of the oldest wal in the replication queue and alerts if it exceeds some threshold. Clearly a 2-year-old wal is not desirable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25390) CopyTable and Coprocessor based export tool should backup and restore cell tags.
Rushabh Shah created HBASE-25390: Summary: CopyTable and Coprocessor based export tool should backup and restore cell tags. Key: HBASE-25390 URL: https://issues.apache.org/jira/browse/HBASE-25390 Project: HBase Issue Type: Improvement Components: backuprestore Reporter: Rushabh Shah In HBASE-25246 we added support in the MapReduce based Export/Import tool to backup/restore cell tags. The MapReduce based export tool is not the only tool that takes a snapshot or backup of a given table; we also have the Coprocessor based Export and CopyTable tools, which take a backup of a given table. We need to add support in the above 2 tools to save cell tags to file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25328) Make PrivateCellUtil annotation LimitedPrivate.
Rushabh Shah created HBASE-25328: Summary: Make PrivateCellUtil annotation LimitedPrivate. Key: HBASE-25328 URL: https://issues.apache.org/jira/browse/HBASE-25328 Project: HBase Issue Type: Improvement Reporter: Rushabh Shah Assignee: Rushabh Shah In PHOENIX-6213 the phoenix project is using the Cell Tag feature to add some metadata to delete mutations. We are adding Cell Tags in a co-processor, but we need some util methods that are only available in the +PrivateCellUtil+ class. Below are the methods we need in phoenix. 1. +PrivateCellUtil#createCell(Cell cell, List<Tag> tags)+ accepts an existing Cell plus a list of tags and creates a new cell. But RawCellBuilder doesn't have any method which accepts an existing cell, so I need to explicitly convert my input cell by extracting all of its fields, calling the builder methods (like setRow, setFamily, etc.), and then calling build. 2. +PrivateCellUtil.getTags(Cell cell)+ returns a list of the existing tags, which I want to use and then add a new tag. But RawCell#getTags() returns an Iterator, which I then have to iterate over and, depending on whether the tags are byte-buffer backed or array backed, convert to a List since RawCellBuilder#setTags accepts a List of Tags. We are already doing this conversion in the PrivateCellUtil#getTags method. All these conversion utility methods need to be duplicated in the phoenix project as well (a rough sketch of the duplication is shown below). Is it reasonable to make PrivateCellUtil LimitedPrivate with HBaseInterfaceAudience as COPROC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
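To make the duplication described in HBASE-25328 above concrete, here is a rough sketch of rebuilding a cell with an extra tag via RawCellBuilder. How the builder instance is obtained (here simply passed in) and the wrapper class name are assumptions for illustration, not the actual Phoenix or HBase code.
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.RawCell;
import org.apache.hadoop.hbase.RawCellBuilder;
import org.apache.hadoop.hbase.Tag;

public final class CellTagSketch {
  // Sketch: copy every field of an existing cell into the builder and append one tag.
  static Cell withExtraTag(RawCellBuilder builder, Cell cell, Tag extraTag) {
    // Collect the existing tags by walking RawCell#getTags(), as described in the issue.
    List<Tag> tags = new ArrayList<>();
    if (cell instanceof RawCell) {
      Iterator<Tag> it = ((RawCell) cell).getTags();
      while (it.hasNext()) {
        tags.add(it.next());
      }
    }
    tags.add(extraTag);
    // Copy every field of the original cell into the builder by hand.
    builder.setRow(CellUtil.cloneRow(cell));
    builder.setFamily(CellUtil.cloneFamily(cell));
    builder.setQualifier(CellUtil.cloneQualifier(cell));
    builder.setTimestamp(cell.getTimestamp());
    builder.setType(cell.getType());
    builder.setValue(CellUtil.cloneValue(cell));
    builder.setTags(tags);
    return builder.build();
  }
}
{code}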
[jira] [Created] (HBASE-25246) Backup/Restore hbase cell tags.
Rushabh Shah created HBASE-25246: Summary: Backup/Restore hbase cell tags. Key: HBASE-25246 URL: https://issues.apache.org/jira/browse/HBASE-25246 Project: HBase Issue Type: Improvement Components: backuprestore Reporter: Rushabh Shah Assignee: Rushabh Shah In PHOENIX-6213 we are planning to add cell tags to Delete mutations. After a discussion with the hbase community on the dev mailing list, it was decided that we will pass the tags via an attribute on the Mutation object and persist them to hbase via a phoenix co-processor. The intention of PHOENIX-6213 is to store metadata in the Delete marker so that while running the Restore tool we can selectively restore certain Delete markers and ignore others. For that to happen we need to persist these tags during Backup and retrieve them in the Restore MR jobs (Import/Export tool). Currently we don't persist the tags in Backup. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25179) Assert format is incorrect in HFilePerformanceEvaluation class.
Rushabh Shah created HBASE-25179: Summary: Assert format is incorrect in HFilePerformanceEvaluation class. Key: HBASE-25179 URL: https://issues.apache.org/jira/browse/HBASE-25179 Project: HBase Issue Type: Improvement Components: Performance Reporter: Rushabh Shah Assignee: Rushabh Shah [HFilePerformanceEvaluation|https://github.com/apache/hbase/blob/master/hbase-server/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java#L518] The order of the expected and actual arguments is interchanged.
{code:java}
PerformanceEvaluationCommons.assertValueSize(c.getValueLength(), ROW_LENGTH);
{code}
The first argument should be the expected value and the second should be the actual value (a corrected call is sketched below). -- This message was sent by Atlassian Jira (v8.3.4#803005)
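A minimal sketch of the corrected call for HBASE-25179 above, assuming, as the description states, that PerformanceEvaluationCommons#assertValueSize takes the expected size first and the actual size second:
{code:java}
// Expected size (ROW_LENGTH) first, observed cell value length second.
PerformanceEvaluationCommons.assertValueSize(ROW_LENGTH, c.getValueLength());
{code}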
[jira] [Created] (HBASE-25118) Extend Cell Tags to Delete object.
Rushabh Shah created HBASE-25118: Summary: Extend Cell Tags to Delete object. Key: HBASE-25118 URL: https://issues.apache.org/jira/browse/HBASE-25118 Project: HBase Issue Type: Improvement Reporter: Rushabh Shah Assignee: Rushabh Shah We want to track the source of mutations (especially Deletes) via Phoenix. We have multiple use cases which perform deletes, namely: the customer deleting data, internal processes like GDPR compliance, and Phoenix TTL MR jobs. For every mutation we want to track the source of the operation which initiated the delete. At my day job, we have a custom Backup/Restore tool. For example: during GDPR compliance cleanup (let's say at time t0), we mistakenly deleted some customer data, and it is possible that the customer also deleted some data from their side (at time t1). To recover the mistakenly deleted data, we restore from the backup at time (t0 - 1). By doing this, we also recover the data that the customer intentionally deleted. We need a way for the Restore tool to selectively recover data. We want to leverage the Cell Tag feature for Delete mutations to store this metadata. Currently the Delete object doesn't support the Tag feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25052) FastLongHistogram#getCountAtOrBelow method is broken.
Rushabh Shah created HBASE-25052: Summary: FastLongHistogram#getCountAtOrBelow method is broken. Key: HBASE-25052 URL: https://issues.apache.org/jira/browse/HBASE-25052 Project: HBase Issue Type: Bug Components: metrics Affects Versions: 2.2.3, 1.6.0, 2.3.0, 3.0.0-alpha-1 Reporter: Rushabh Shah FastLongHistogram#getCountAtOrBelow method is broken. If I revert HBASE-23245 then it works fine. I wrote a small test case in TestHistogramImpl.java:
{code:java}
@Test
public void testAdd1() {
  HistogramImpl histogram = new HistogramImpl();
  for (int i = 0; i < 100; i++) {
    histogram.update(i);
  }
  Snapshot snapshot = histogram.snapshot();
  // This should return count as 6 since we added 0, 1, 2, 3, 4, 5
  Assert.assertEquals(6, snapshot.getCountAtOrBelow(5));
}
{code}
It fails as below: java.lang.AssertionError: Expected :6 Actual :100 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24983) Wrap ConnectionImplementation#locateRegionInMeta under operation timeout.
Rushabh Shah created HBASE-24983: Summary: Wrap ConnectionImplementation#locateRegionInMeta under operation timeout. Key: HBASE-24983 URL: https://issues.apache.org/jira/browse/HBASE-24983 Project: HBase Issue Type: Bug Components: Client Affects Versions: 1.6.0 Reporter: Rushabh Shah We have two config properties for this (hbase.client.operation.timeout and hbase.client.meta.operation.timeout); an example of setting them on the client is sketched below. Here is the description of hbase.client.operation.timeout, which applies to non-meta tables. {noformat} Operation timeout is a top-level restriction (millisecond) that makes sure a blocking operation in Table will not be blocked more than this. In each operation, if rpc request fails because of timeout or other reason, it will retry until success or throw RetriesExhaustedException. But if the total time being blocking reach the operation timeout before retries exhausted, it will break early and throw SocketTimeoutException. {noformat} Most operations like get, put, and delete are wrapped under this timeout, but the scan operation is not. We need to wrap scan operations within the operation timeout as well. More discussion in this PR thread: https://github.com/apache/hbase/pull/2322#discussion_r478687341 -- This message was sent by Atlassian Jira (v8.3.4#803005)
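A minimal example of setting the two client timeouts named in HBASE-24983 above; the 60s/30s values and the helper class name are illustrative only.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientTimeoutExample {
  public static Configuration withTimeouts() {
    Configuration conf = HBaseConfiguration.create();
    // Upper bound (ms) on any blocking operation against a user table.
    conf.setLong("hbase.client.operation.timeout", 60000L);
    // Separate, usually tighter, bound (ms) for operations against hbase:meta.
    conf.setLong("hbase.client.meta.operation.timeout", 30000L);
    return conf;
  }
}
{code}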
[jira] [Created] (HBASE-24957) ZKTableStateClientSideReader#isDisabledTable doesn't check if table exists or not.
Rushabh Shah created HBASE-24957: Summary: ZKTableStateClientSideReader#isDisabledTable doesn't check if table exists or not. Key: HBASE-24957 URL: https://issues.apache.org/jira/browse/HBASE-24957 Project: HBase Issue Type: Bug Components: Client Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah The following bug exists only in branch-1 and below. ZKTableStateClientSideReader#isDisabledTable returns false even if the table doesn't exist (one possible fix is sketched below). Below is the code snippet:
{code:title=ZKTableStateClientSideReader.java|borderStyle=solid}
public static boolean isDisabledTable(final ZooKeeperWatcher zkw, final TableName tableName)
    throws KeeperException, InterruptedException {
  ZooKeeperProtos.Table.State state = getTableState(zkw, tableName); // ---> We should check here if state is null or not.
  return isTableState(ZooKeeperProtos.Table.State.DISABLED, state);
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
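One possible shape of the null check suggested in HBASE-24957 above. Throwing TableNotFoundException when the table znode is missing is an assumption about the desired behavior, not the committed fix.
{code:java}
public static boolean isDisabledTable(final ZooKeeperWatcher zkw, final TableName tableName)
    throws KeeperException, InterruptedException, TableNotFoundException {
  ZooKeeperProtos.Table.State state = getTableState(zkw, tableName);
  if (state == null) {
    // No state znode for this table: surface the problem instead of silently returning false.
    throw new TableNotFoundException(tableName.getNameAsString());
  }
  return isTableState(ZooKeeperProtos.Table.State.DISABLED, state);
}
{code}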
[jira] [Created] (HBASE-24956) ConnectionManager#userRegionLock waits for lock indefinitely.
Rushabh Shah created HBASE-24956: Summary: ConnectionManager#userRegionLock waits for lock indefinitely. Key: HBASE-24956 URL: https://issues.apache.org/jira/browse/HBASE-24956 Project: HBase Issue Type: Bug Components: Client Affects Versions: 1.3.2 Reporter: Rushabh Shah Assignee: Rushabh Shah One of our customers experienced high latencies (on the order of 3-4 minutes) for point lookup queries (we use Phoenix on top of HBase). We have different threads sharing the same HConnection, and it looks like multiple threads are stuck at the same place: [https://github.com/apache/hbase/blob/branch-1.3/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java#L1282] We have set the following configuration parameters to ensure a query fails within a reasonable SLA: 1. hbase.client.meta.operation.timeout 2. hbase.client.operation.timeout 3. hbase.client.scanner.timeout.period But since userRegionLock can wait for the lock indefinitely, the call will not fail within the SLA (a bounded-wait sketch follows below). -- This message was sent by Atlassian Jira (v8.3.4#803005)
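A minimal sketch of bounding the lock acquisition described in HBASE-24956 above with the client's configured timeout; the class, field names, and exception choice are illustrative, not the actual ConnectionManager code.
{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedRegionLookupSketch {
  private final ReentrantLock userRegionLock = new ReentrantLock();
  private final long metaOperationTimeoutMs;  // e.g. value of hbase.client.meta.operation.timeout

  public BoundedRegionLookupSketch(long metaOperationTimeoutMs) {
    this.metaOperationTimeoutMs = metaOperationTimeoutMs;
  }

  void locateRegionInMeta() throws IOException, InterruptedException {
    // Wait at most the configured timeout instead of blocking indefinitely.
    if (!userRegionLock.tryLock(metaOperationTimeoutMs, TimeUnit.MILLISECONDS)) {
      throw new IOException("Failed to acquire userRegionLock within "
          + metaOperationTimeoutMs + " ms");
    }
    try {
      // ... perform the meta lookup under the lock ...
    } finally {
      userRegionLock.unlock();
    }
  }
}
{code}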
[jira] [Created] (HBASE-24615) MutableRangeHistogram#updateSnapshotRangeMetrics doesn't calculate the distribution for last bucket.
Rushabh Shah created HBASE-24615: Summary: MutableRangeHistogram#updateSnapshotRangeMetrics doesn't calculate the distribution for last bucket. Key: HBASE-24615 URL: https://issues.apache.org/jira/browse/HBASE-24615 Project: HBase Issue Type: Bug Components: metrics Affects Versions: 1.3.7 Reporter: Rushabh Shah We are not processing the distribution for the last bucket (a possible fix is sketched below). https://github.com/apache/hbase/blob/master/hbase-hadoop-compat/src/main/java/org/apache/hadoop/metrics2/lib/MutableRangeHistogram.java#L70
{code:java}
public void updateSnapshotRangeMetrics(MetricsRecordBuilder metricsRecordBuilder,
    Snapshot snapshot) {
  long priorRange = 0;
  long cumNum = 0;
  final long[] ranges = getRanges();
  final String rangeType = getRangeType();
  for (int i = 0; i < ranges.length - 1; i++) { // -> The bug lies here. We are not processing the last bucket.
    long val = snapshot.getCountAtOrBelow(ranges[i]);
    if (val - cumNum > 0) {
      metricsRecordBuilder.addCounter(
        Interns.info(name + "_" + rangeType + "_" + priorRange + "-" + ranges[i], desc),
        val - cumNum);
    }
    priorRange = ranges[i];
    cumNum = val;
  }
  long val = snapshot.getCount();
  if (val - cumNum > 0) {
    metricsRecordBuilder.addCounter(
      Interns.info(name + "_" + rangeType + "_" + ranges[ranges.length - 1] + "-inf", desc),
      val - cumNum);
  }
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
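A sketch of the fix implied by HBASE-24615 above: let the loop cover every configured range so the final bounded bucket gets its own counter, leaving the post-loop counter for the overflow ("-inf") bucket only. This is one straightforward reading of the bug, not necessarily the committed patch.
{code:java}
// Iterate over all ranges (note: i < ranges.length, not ranges.length - 1),
// so the last bounded bucket is emitted before the overflow bucket.
for (int i = 0; i < ranges.length; i++) {
  long val = snapshot.getCountAtOrBelow(ranges[i]);
  if (val - cumNum > 0) {
    metricsRecordBuilder.addCounter(
      Interns.info(name + "_" + rangeType + "_" + priorRange + "-" + ranges[i], desc),
      val - cumNum);
  }
  priorRange = ranges[i];
  cumNum = val;
}
// Overflow bucket: everything above the largest configured range.
long overflow = snapshot.getCount();
if (overflow - cumNum > 0) {
  metricsRecordBuilder.addCounter(
    Interns.info(name + "_" + rangeType + "_" + ranges[ranges.length - 1] + "-inf", desc),
    overflow - cumNum);
}
{code}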
[jira] [Created] (HBASE-24576) Changing "whitelist" and "blacklist" in our docs and project.
Rushabh Shah created HBASE-24576: Summary: Changing "whitelist" and "blacklist" in our docs and project. Key: HBASE-24576 URL: https://issues.apache.org/jira/browse/HBASE-24576 Project: HBase Issue Type: Improvement Reporter: Rushabh Shah Replace instances of “whitelist” and “blacklist” in our project, trails, documentation and UI text. We can replace blacklist with blocklist, blocklisted, or block, and whitelist with allowlist, allowlisted, or allow. At my current workplace they are suggesting we make this change. Google also has an exhaustive guide on writing inclusive documentation: https://developers.google.com/style/inclusive-documentation There might be a few issues while making the change. 1. If these words are part of a config property name then all customers need to make the change. 2. There might be some client-server compatibility issues if we change server-side variables/method names. Creating this jira just to start the conversation. Please chip in with ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24520) Change the IA for MutableSizeHistogram and MutableTimeHistogram
Rushabh Shah created HBASE-24520: Summary: Change the IA for MutableSizeHistogram and MutableTimeHistogram Key: HBASE-24520 URL: https://issues.apache.org/jira/browse/HBASE-24520 Project: HBase Issue Type: Task Components: metrics Reporter: Rushabh Shah Currently the IA (InterfaceAudience) for MutableSizeHistogram and MutableTimeHistogram is Private. We want to use these classes in the Phoenix project, and I thought we could leverage the existing HBase histogram implementation. IIUC, classes with a Private IA can't be used in other projects. Proposing to make them LimitedPrivate and mark them with HBaseInterfaceAudience.PHOENIX (see the annotation sketch below). Please suggest. -- This message was sent by Atlassian Jira (v8.3.4#803005)
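For reference, the annotation change proposed in HBASE-24520 above would look roughly like this on each histogram class. Sketch only: the class body is elided, and the import packages shown match current HBase code and may differ on older branches.
{code:java}
import org.apache.hadoop.hbase.HBaseInterfaceAudience;
import org.apache.yetus.audience.InterfaceAudience;

// Proposed audience change: visible to coprocessor/Phoenix consumers, not fully public.
@InterfaceAudience.LimitedPrivate(HBaseInterfaceAudience.PHOENIX)
public class MutableSizeHistogram extends MutableRangeHistogram {
  // ... existing implementation unchanged ...
}
{code}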
[jira] [Created] (HBASE-24502) hbase hfile command logs exception message at WARN level.
Rushabh Shah created HBASE-24502: Summary: hbase hfile command logs exception message at WARN level. Key: HBASE-24502 URL: https://issues.apache.org/jira/browse/HBASE-24502 Project: HBase Issue Type: Bug Affects Versions: master Reporter: Rushabh Shah Ran the following command. {noformat} ./hbase hfile -f ~/hbase-3.0.0-SNAPSHOT/tmp/hbase/data/default/emp/b1972be371596e074a1ae465782a209f/personal\ data/0930965e5b914debb33a9be047efc493 -p {noformat} It logged the following warn message on console. {noformat} 2020-06-03 09:23:45,485 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2020-06-03 09:23:45,871 WARN [main] beanutils.FluentPropertyBeanIntrospector: Error when creating PropertyDescriptor for public final void org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)! Ignoring this property. java.beans.IntrospectionException: bad write method arg count: public final void org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object) at java.beans.PropertyDescriptor.findPropertyType(PropertyDescriptor.java:657) at java.beans.PropertyDescriptor.setWriteMethod(PropertyDescriptor.java:327) at java.beans.PropertyDescriptor.(PropertyDescriptor.java:139) at org.apache.commons.beanutils.FluentPropertyBeanIntrospector.createFluentPropertyDescritor(FluentPropertyBeanIntrospector.java:177) at org.apache.commons.beanutils.FluentPropertyBeanIntrospector.introspect(FluentPropertyBeanIntrospector.java:140) at org.apache.commons.beanutils.PropertyUtilsBean.fetchIntrospectionData(PropertyUtilsBean.java:2234) at org.apache.commons.beanutils.PropertyUtilsBean.getIntrospectionData(PropertyUtilsBean.java:2215) at org.apache.commons.beanutils.PropertyUtilsBean.getPropertyDescriptor(PropertyUtilsBean.java:950) at org.apache.commons.beanutils.PropertyUtilsBean.isWriteable(PropertyUtilsBean.java:1466) at org.apache.commons.configuration2.beanutils.BeanHelper.isPropertyWriteable(BeanHelper.java:521) at org.apache.commons.configuration2.beanutils.BeanHelper.initProperty(BeanHelper.java:357) at org.apache.commons.configuration2.beanutils.BeanHelper.initBeanProperties(BeanHelper.java:273) at org.apache.commons.configuration2.beanutils.BeanHelper.initBean(BeanHelper.java:192) at org.apache.commons.configuration2.beanutils.BeanHelper$BeanCreationContextImpl.initBean(BeanHelper.java:669) at org.apache.commons.configuration2.beanutils.DefaultBeanFactory.initBeanInstance(DefaultBeanFactory.java:162) at org.apache.commons.configuration2.beanutils.DefaultBeanFactory.createBean(DefaultBeanFactory.java:116) at org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:459) at org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:479) at org.apache.commons.configuration2.beanutils.BeanHelper.createBean(BeanHelper.java:492) at org.apache.commons.configuration2.builder.BasicConfigurationBuilder.createResultInstance(BasicConfigurationBuilder.java:447) at org.apache.commons.configuration2.builder.BasicConfigurationBuilder.createResult(BasicConfigurationBuilder.java:417) at org.apache.commons.configuration2.builder.BasicConfigurationBuilder.getConfiguration(BasicConfigurationBuilder.java:285) at org.apache.hadoop.metrics2.impl.MetricsConfig.loadFirst(MetricsConfig.java:119) at org.apache.hadoop.metrics2.impl.MetricsConfig.create(MetricsConfig.java:98) at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:478) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:188) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:163) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:62) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:58) at org.apache.hadoop.hbase.metrics.BaseSourceImpl$DefaultMetricsSystemInitializer.init(BaseSourceImpl.java:54) at org.apache.hadoop.hbase.metrics.BaseSourceImpl.(BaseSourceImpl.java:116) at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.(MetricsIOSourceImpl.java:46) at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.(MetricsIOSourceImpl.java:38) at org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:94) at org.apache.hadoop.hbase.io.MetricsIO.(MetricsIO.java:35) at