[jira] [Resolved] (HBASE-28745) Default Zookeeper ConnectionRegistry APIs timeout should be less

2024-07-19 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28745.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Default Zookeeper ConnectionRegistry APIs timeout should be less
> 
>
> Key: HBASE-28745
> URL: https://issues.apache.org/jira/browse/HBASE-28745
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Divneet Kaur
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.11
>
>
> HBASE-28428 introduces a timeout for Zookeeper ConnectionRegistry APIs. 
> However, the default timeout value we have set is 60s. Given that connection 
> registry calls are metadata APIs, they should have a much lower timeout 
> value, including the default.
> Let's set the default timeout to 10s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28745) Default Zookeeper ConnectionRegistry APIs timeout should be less

2024-07-19 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28745:


 Summary: Default Zookeeper ConnectionRegistry APIs timeout should 
be less
 Key: HBASE-28745
 URL: https://issues.apache.org/jira/browse/HBASE-28745
 Project: HBase
  Issue Type: Sub-task
Reporter: Viraj Jasani


HBASE-28428 introduces a timeout for Zookeeper ConnectionRegistry APIs. However, 
the default timeout value we have set is 60s. Given that connection registry 
calls are metadata APIs, they should have a much lower timeout value, including 
the default.

Let's set the default timeout to 10s.
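As a rough, dependency-free illustration of the intent (not the actual HBase 
patch), the sketch below bounds a registry-style async call to 10 seconds; 
fetchClusterId() is a hypothetical stand-in for a ConnectionRegistry API that 
returns a CompletableFuture.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class RegistryTimeoutSketch {

  // Hypothetical stand-in for an async ConnectionRegistry metadata lookup.
  static CompletableFuture<String> fetchClusterId() {
    return CompletableFuture.supplyAsync(() -> "cluster-id");
  }

  public static void main(String[] args) throws Exception {
    String clusterId = fetchClusterId()
      // fail the future after 10s instead of waiting for the old 60s default
      .orTimeout(10, TimeUnit.SECONDS)
      .get(); // a timeout surfaces here as an ExecutionException
    System.out.println(clusterId);
  }
}
{code}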



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28428) Zookeeper ConnectionRegistry APIs should have timeout

2024-07-19 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28428.
--
Fix Version/s: 2.7.0
   2.6.1
   2.5.11
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Zookeeper ConnectionRegistry APIs should have timeout
> -
>
> Key: HBASE-28428
> URL: https://issues.apache.org/jira/browse/HBASE-28428
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.8
>Reporter: Viraj Jasani
>Assignee: Divneet Kaur
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.11
>
>
> We came across a couple of instances where an active master failover happens 
> around the same time as a Zookeeper leader failover, leaving the HBase client 
> stuck if one of its threads is blocked on one of the ConnectionRegistry rpc 
> calls. ConnectionRegistry APIs are wrapped with CompletableFuture. However, 
> their usages do not have any timeouts, which can leave the entire client 
> stuck indefinitely because we take some global locks. For instance, 
> _getKeepAliveMasterService()_ takes {_}masterLock{_}, so if getting the 
> active master from _masterAddressZNode_ gets stuck, we can block any admin 
> operation that needs {_}getKeepAliveMasterService(){_}.
>  
> Sample stacktrace that blocked all client operations that required table 
> descriptor from Admin:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
> org.apache.hadoop.hbase.client.MasterCallable.prepare
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
> org.apache.hadoop.hbase.client.HTable.getDescriptor
> org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
> org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
> org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
> org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
> org.apache.phoenix.execute.MutationState.sendBatch
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.commit
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.call.CallRunner.run
> org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
> Another similar incident is captured in PHOENIX-7233. In this case, 
> retrieving the clusterId from the ZNode got stuck and blocked the client from 
> being able to create any more HBase Connections. Stacktrace for reference:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> 

[jira] [Created] (HBASE-28741) Rpc ConnectionRegistry APIs should have timeout

2024-07-19 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28741:


 Summary: Rpc ConnectionRegistry APIs should have timeout
 Key: HBASE-28741
 URL: https://issues.apache.org/jira/browse/HBASE-28741
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.10, 2.4.18, 2.6.0
Reporter: Viraj Jasani


ConnectionRegistry APIs are some of the most basic metadata APIs; they determine 
how clients can interact with the servers after retrieving the required 
metadata. These APIs should time out quickly if they cannot serve metadata in 
time.

Similar to HBASE-28428, which introduces a timeout for Zookeeper 
ConnectionRegistry APIs, we should introduce a timeout (with the same timeout 
values) for Rpc ConnectionRegistry APIs. RpcConnectionRegistry uses the HBase 
RPC framework with hedged read fanout mode.

We have two options to introduce the timeout (a sketch of the first option 
follows below):
 # Use RetryTimer to keep a watch on the CompletableFuture and complete it 
exceptionally if the timeout is reached (a similar proposal to HBASE-28428).
 # Introduce a separate rpc timeout config for AbstractRpcBasedConnectionRegistry, 
as the rpc timeout for generic RPC operations (hbase.rpc.timeout) could be 
higher.
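A minimal sketch of the first option, using a plain ScheduledExecutorService in 
place of HBase's internal retry timer (the helper name and exception message 
are illustrative assumptions, not the committed implementation):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FutureTimeoutSketch {

  private static final ScheduledExecutorService TIMER =
    Executors.newSingleThreadScheduledExecutor();

  // Complete the given future exceptionally if it is still pending after timeoutMs.
  static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long timeoutMs) {
    ScheduledFuture<?> task = TIMER.schedule(
      () -> future.completeExceptionally(
        new TimeoutException("registry call timed out after " + timeoutMs + " ms")),
      timeoutMs, TimeUnit.MILLISECONDS);
    // cancel the timer task once the future finishes, whatever the outcome
    future.whenComplete((result, error) -> task.cancel(false));
    return future;
  }
}
{code}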



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28665) WALs not marked closed when there are errors in closing WALs

2024-07-11 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28665.
--
Fix Version/s: 2.7.0
   2.6.1
   2.5.10
 Hadoop Flags: Reviewed
   Resolution: Fixed

> WALs not marked closed when there are errors in closing WALs
> 
>
> Key: HBASE-28665
> URL: https://issues.apache.org/jira/browse/HBASE-28665
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.5.8
>Reporter: Kiran Kumar Maturi
>Assignee: Kiran Kumar Maturi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.7.0, 2.6.1, 2.5.10
>
>
> In our production clusters we have observed that when a WAL close fails, the 
> old WAL files are not marked as closed, which prevents them from being 
> cleaned up. When a WAL close fails in closeWriter, it increments the error 
> count.
> {code:java}
> Span span = Span.current();
>  try {
>   span.addEvent("closing writer");
>   writer.close();
>   span.addEvent("writer closed");
> } catch (IOException ioe) {
>   int errors = closeErrorCount.incrementAndGet();
>   boolean hasUnflushedEntries = isUnflushedEntries();
>   if (syncCloseCall && (hasUnflushedEntries || (errors > 
> this.closeErrorsTolerated))) {
> LOG.error("Close of WAL " + path + " failed. Cause=\"" + 
> ioe.getMessage() + "\", errors="
>   + errors + ", hasUnflushedEntries=" + hasUnflushedEntries);
> throw ioe;
>   }
>   LOG.warn("Riding over failed WAL close of " + path
> + "; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK", 
> ioe);
> }
> {code}
> When there are unflushed entries or the WAL close errors reach the tolerated 
> count, doReplaceWALWriter enters this code block
> {code:java}
> if (isUnflushedEntries() || closeErrorCount.get() >= 
> this.closeErrorsTolerated) {
>   try {
> closeWriter(this.writer, oldPath, true);
>   } finally {
> inflightWALClosures.remove(oldPath.getName());
>   }
> }
> {code}
>  but we don't mark the WAL as closed here, as we do in the block below: 
>   
> {code:java}
>   Writer localWriter = this.writer;
>   closeExecutor.execute(() -> {
> try {
>   closeWriter(localWriter, oldPath, false);
> } catch (IOException e) {
>   LOG.warn("close old writer failed", e);
> } finally {
>   // call this even if the above close fails, as there is no 
> other chance we can set
>   // closed to true, it will not cause big problems.
>  {color:red} markClosedAndClean(oldPath);{color}
>   inflightWALClosures.remove(oldPath.getName());
> }
>   });
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28049) RSProcedureDispatcher to log the request details during retries

2024-06-11 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28049.
--
Fix Version/s: 2.7.0
   3.0.0-beta-2
   2.6.1
   2.5.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RSProcedureDispatcher to log the request details during retries
> ---
>
> Key: HBASE-28049
> URL: https://issues.apache.org/jira/browse/HBASE-28049
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Khyati Vaghamshi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> As of today, RSProcedureDispatcher only logs the exception details for the 
> given RPC request; it does not log any other details. We should log:
>  * whether the request is for region open/close
>  * proc id, proc class names
>  * region name
>  
> Sample log without any of the above mentioned details:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28640) Hbase compaction is slow in 2.4.11 compared to hbase 1.x

2024-06-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28640.
--
Resolution: Duplicate

Duplicate of HBASE-28641

> Hbase compaction is slow in 2.4.11 compared to hbase 1.x
> 
>
> Key: HBASE-28640
> URL: https://issues.apache.org/jira/browse/HBASE-28640
> Project: HBase
>  Issue Type: Improvement
>  Components: Compaction
>Affects Versions: 2.4.11
>Reporter: Divesh Katta
>Priority: Major
> Attachments: image-2024-06-04-16-17-38-873.png, 
> image-2024-06-04-16-18-06-474.png, image-2024-06-04-16-18-26-818.png, 
> image-2024-06-04-16-18-55-672.png
>
>
> Hi Team,
> We built an HBase 2.4.11 cluster comprising HDFS and HBase components. 
> However, during our performance testing, we observed that the HBase 
> compaction process was taking longer than expected.
> With identical configurations, the HBase 1 cluster completes compaction 
> tasks in less time than the new HBase 2 cluster.
> Hbase1 cluster details:
> HBASE: 1.1.2
> Hbase2 cluster details
> HBASE: 2.4.11
> Please find the screenshot for the Compaction iteration and timeline:
> !image-2024-06-05-16-11-50-691.png!
> IN Hbase 1  COMPACTION TIME 
> In the HBase 1 cluster with the same set of configurations and tables, we 
> observed consistent behavior in terms of compaction time (3 hrs).
> Start Time: 1:30AM 
> End Time: 4:30AM
> !image-2024-06-04-16-17-38-873.png!
>  
> IN Hbase 2  COMPACTION TIME
> Start Time: 1:30AM
> End Time: 23:00PM+
> !image-2024-06-04-16-18-06-474.png!
>  
> Actions Taken:
> Tuning of HBase configurations related to compactions was performed initially 
> but didn't yield significant improvement.
> OS Mitigations were switched off and filesystem was migrated from ext4 to 
> xfs, yet compactions didn't improve as expected.
> Observations:
> We suspected that the absence of data encoding might be contributing to 
> longer compaction times in HBase 2.
> We scheduled two separate compactions for tables with and without 
> DATA_ENCODING enabled in HBase 2.
> In the screenshot below, the first compaction started at 1:30 AM for 15 
> tables (total size exceeding 220 TB) with DATA_ENCODING=FAST_DIFF enabled, 
> and this compaction completed by 2:30 AM. However, at 2:30 AM, we scheduled 
> another compaction for only 6 tables, totalling over 60 TB in size, but these 
> tables did not have DATA_ENCODING enabled, and this compaction took much 
> longer.
> !image-2024-06-04-16-18-26-818.png!
>  
> After enabling DATA_ENCODING for all the tables in HBase 2, we initiated 
> compaction at 1:30 AM, which completed by 4:30 AM (3 hrs).
> Start Time: 1:30AM
> End Time: 4:30 AM
> !image-2024-06-04-16-18-55-672.png!
> We noticed that tables with DATA_ENCODING=FAST_DIFF enabled underwent faster 
> compactions compared to those without.
> Upon comparing the debug logs of the two HBase 2 clusters, we discovered that 
> the cluster with DATA_ENCODING=FAST_DIFF enabled exhibited higher 
> throughput (average throughput of 89.54 MB/second), whereas the cluster with 
> DATA_ENCODING disabled showed lower throughput (average throughput of 8.19 
> MB/second).
>  
> #DATA_ENCODED ENABLED tables throughput
> 2024-04-02 03:11:58,956 INFO [regionserver/:16020-shortCompactions-0] 
> throttle.PressureAwareThroughputController: 
> 803e9ef64aec8e526837c0477cc48884#scr#compaction#34672 average throughput is 
> 89.54 MB/second, slept 0 time(s) and total slept time is 0 ms. 21 active 
> operations remaining, total limit is unlimited 2024-04-02 03:12:23,680 INFO 
> [regionserver/usr-Hbase 1201:16020-shortCompactions-7] 
> throttle.PressureAwareThroughputController: 
> 82565df649e3e6eb53a1e168435204db#scr#compaction#34677 average throughput is 
> 65.02 MB/second, slept 0 time(s) and total slept time is 0 ms. 21 active 
> operations remaining, total limit is unlimited
> 2024-04-02 02:45:46,538 DEBUG [regionserver/xx:16020-longCompactions-7] 
> compactions.Compactor: Compaction progress: 
> b746a46d64df25ca82571b10bb1e2c03#key#compaction#34484 128896600/371074603 
> (34.74%), rate=17151.92 KB/sec, throughputController is 
> DefaultCompactionThroughputController [maxThroughput=unlimited, 
> activeCompactions=22] 2024-04-02 02:45:47,291 DEBUG [regionserver/usr-Hbase 
> 1201:16020-shortCompactions-0] compactions.Compactor: Compaction 
> progress: a1338704db18a8fc613ad2c9a4561d65#key#compaction#34496 
> 90148941/110577879 (81.53%), rate=20950.64 KB/sec, throughputController is 
> DefaultCompactionThroughputController [maxThroughput=unlimited, 
> activeCompactions=22]
>  
> #DATA_ENCODED DISABLED tables throughput
> 2024-04-04 03:12:48,521 INFO [regionserver/xxx:16020-longCompactions-3] 
> throttle.PressureAwareThroughputController: 
> 

[jira] [Created] (HBASE-28638) RSProcedureDispatcher to fail-fast for connection closed errors

2024-06-04 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28638:


 Summary: RSProcedureDispatcher to fail-fast for connection closed 
errors
 Key: HBASE-28638
 URL: https://issues.apache.org/jira/browse/HBASE-28638
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.5.8
Reporter: Viraj Jasani
 Fix For: 3.0.0-beta-2, 2.6.1, 2.5.9


As per one of the recent incidents, some regions faced a 5+ minute availability 
drop because, before the active master could initiate SCP for the dead server, 
some region moves tried to assign regions on the already dead regionserver. 
Sometimes, due to transient issues, we see that the active master gets notified 
only after a few minutes (5+ minutes in this case).
{code:java}
2024-05-08 03:47:38,518 WARN  [RSProcedureDispatcher-pool-4790] 
procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed 
due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to 
address=host1:61020 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection 
closed, try=0, retrying... {code}
And, as we know, we have infinite retries here, so it kept going on.

 

Eventually, the SCP could be initiated only after the active master discovered 
the server as dead:
{code:java}
2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - 
Processing host1,61020,1713411866443; numProcessing=1

2024-05-08 03:50:01,038 INFO  [RegionServerTracker-0] 
master.RegionServerTracker - RegionServer ephemeral node deleted, processing 
expiration [host1,61020,1713411866443] {code}
leading to
{code:java}
2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] 
assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, 
state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, 
server=host1,61020,1713411866443 for region state=OPENING, 
location=host1,61020,1713411866443, table=T1, 
region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443 
is dead, SCP will interrupt us, give up {code}
This entire outage duration could be avoided if we fail fast on connection 
closed errors.
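A dependency-free sketch of the fail-fast idea (not the actual 
RSProcedureDispatcher code): classify the remote-call failure and stop retrying 
when the connection to the target server is already closed.
{code:java}
import java.io.IOException;

public final class FailFastSketch {

  // Walk the cause chain and match ConnectionClosedException by name, so this
  // illustration does not need a compile-time dependency on HBase classes.
  static boolean isConnectionClosed(Throwable t) {
    while (t != null) {
      if ("ConnectionClosedException".equals(t.getClass().getSimpleName())) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  // Dispatch decision: give up immediately instead of logging "try=N, retrying...".
  static boolean shouldRetry(IOException failure) {
    return !isConnectionClosed(failure);
  }
}
{code}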



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27938) Enable PE to load any custom implementation of tests at runtime

2024-05-15 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27938.
--
Fix Version/s: 2.7.0
   3.0.0-beta-2
   2.6.1
   2.5.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Enable PE to load any custom implementation of tests at runtime
> ---
>
> Key: HBASE-27938
> URL: https://issues.apache.org/jira/browse/HBASE-27938
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: Prathyusha
>Assignee: Prathyusha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.9
>
>
> Right now, adding any custom PE.Test implementation requires a compile-time 
> dependency on those new test classes in PE. This change enables PE to load 
> any custom test implementation at runtime and utilise the PE framework for 
> custom implementations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28366) Mis-order of SCP and regionServerReport results into region inconsistencies

2024-04-04 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28366.
--
Fix Version/s: 2.6.0
   2.4.18
   3.0.0-beta-2
   2.5.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Mis-order of SCP and regionServerReport results into region inconsistencies
> ---
>
> Key: HBASE-28366
> URL: https://issues.apache.org/jira/browse/HBASE-28366
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-2, 2.5.9
>
>
> If the regionserver is online but, due to a network issue, its ephemeral node 
> gets deleted in zookeeper, the active master schedules an SCP. However, if 
> the regionserver is alive, it can still send regionServerReport to the active 
> master. In the case where the SCP assigns regions that were previously hosted 
> on the old regionserver (which is still alive) to other regionservers, the 
> old rs can continue to send regionServerReport to the active master.
> Eventually this results in region inconsistencies because a region is alive 
> on two regionservers at the same time (though it's a temporary state, because 
> the rs will be aborted soon). While the old regionserver can have zookeeper 
> connectivity issues, it can still make rpc calls to the active master.
> Logs:
> SCP:
> {code:java}
> 2024-01-29 16:50:33,956 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=9812440 for 
> server1-114.xyz,61020,1706541866103 (carryingMeta=false) 
> server1-114.xyz,61020,1706541866103/CRASHED/regionCount=364/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5d5fc31[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2024-01-29 16:50:33,956 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor - Stored pid=9812440, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1-114.xyz,61020,1706541866103, splitWal=true, meta=false
> 2024-01-29 16:50:33,973 INFO [PEWorker-36] procedure.ServerCrashProcedure - 
> Splitting WALs pid=9812440, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, 
> locked=true; ServerCrashProcedure server1-114.xyz,61020,1706541866103, 
> splitWal=true, meta=false, isMeta: false
>  {code}
> As part of SCP, d743ace9f70d55f55ba1ecc6dc49a5cb was assigned to another 
> server:
>  
> {code:java}
> 2024-01-29 16:50:42,656 INFO [PEWorker-24] procedure.MasterProcedureScheduler 
> - Took xlock for pid=9818494, ppid=9812440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure 
> table=PLATFORM_ENTITY.PLATFORM_IMMUTABLE_ENTITY_DATA, 
> region=d743ace9f70d55f55ba1ecc6dc49a5cb, ASSIGN
> 2024-01-29 16:50:43,106 INFO [PEWorker-23] assignment.RegionStateStore - 
> pid=9818494 updating hbase:meta row=d743ace9f70d55f55ba1ecc6dc49a5cb, 
> regionState=OPEN, repBarrier=12867482, openSeqNum=12867482, 
> regionLocation=server1-65.xyz,61020,1706165574050
>  {code}
>  
> rs abort, after ~5 min:
> {code:java}
> 2024-01-29 16:54:27,235 ERROR [regionserver/server1-114:61020] 
> regionserver.HRegionServer - * ABORTING region server 
> server1-114.xyz,61020,1706541866103: Unexpected exception handling getData 
> *
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>     at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:414)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:403)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:367)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKNodeTracker.getData(ZKNodeTracker.java:180)
>     at 
> org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:152)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionServerStatusStub(HRegionServer.java:2892)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1352)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1142)
>  {code}
>  
> Several region transition failure report logs:
> {code:java}
> 2024-01-29 16:55:13,029 INFO  [_REGION-regionserver/server1-114:61020-0] 
> regionserver.HRegionServer - Failed report transition server { host_name: 
> "server1-114.xyz" port: 61020 

[jira] [Resolved] (HBASE-28424) Set correct Result to RegionActionResult for successful Put/Delete mutations

2024-03-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28424.
--
Fix Version/s: 2.6.0
   2.4.18
   3.0.0-beta-2
   2.5.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Set correct Result to RegionActionResult for successful Put/Delete mutations
> 
>
> Key: HBASE-28424
> URL: https://issues.apache.org/jira/browse/HBASE-28424
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Jing Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-2, 2.5.9
>
>
> While returning the response of multi(), RSRpcServices builds the 
> RegionActionResult with a Result or Exception (ClientProtos.ResultOrException). 
> It sets the Exception on this class in all cases where the operation fails, 
> with the corresponding exception types, e.g. NoSuchColumnFamilyException or 
> FailedSanityCheckException.
> In the case of the atomic mutations Increment and Append, we add the Result 
> object to ClientProtos.ResultOrException, which is used by the client to 
> retrieve results from the batch API: {_}Table#batch(List actions, Object[] 
> results){_}.
> Phoenix performs atomic mutation for Put using the _preBatchMutate()_ 
> endpoint. Hence, returning a Result object with ResultOrException is important 
> for returning the result back to the client as part of the atomic operation. 
> Even if Phoenix returns the OperationStatus (with a Result) to 
> MiniBatchOperationInProgress, since HBase uses an empty Result for the SUCCESS 
> case, the client would not be able to get the expected result.
> {code:java}
> case SUCCESS:
>   builder.addResultOrException(
> getResultOrException(ClientProtos.Result.getDefaultInstance(), index));
>   break; {code}
> If the OperationStatus returned by _Region#batchMutate_ has a valid Result 
> object, it should be used by RSRpcServices while returning the response.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28428) ConnectionRegistry APIs should have timeout

2024-03-07 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28428:


 Summary: ConnectionRegistry APIs should have timeout
 Key: HBASE-28428
 URL: https://issues.apache.org/jira/browse/HBASE-28428
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.8, 3.0.0-beta-1, 2.4.17
Reporter: Viraj Jasani


We came across a couple of instances where an active master failover happens 
around the same time as a Zookeeper leader failover, leaving the HBase client 
stuck if one of its threads is blocked on one of the ConnectionRegistry rpc 
calls. ConnectionRegistry APIs are wrapped with CompletableFuture. However, 
their usages do not have any timeouts, which can leave the entire client stuck 
indefinitely because we take some global locks. For instance, 
_getKeepAliveMasterService()_ takes {_}masterLock{_}, so if getting the active 
master from _masterAddressZNode_ gets stuck, we can block any admin operation 
that needs {_}getKeepAliveMasterService(){_}.
 
Sample stacktrace that blocked all client operations that required table 
descriptor from Admin:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.get
org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
org.apache.hadoop.hbase.client.MasterCallable.prepare
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
org.apache.hadoop.hbase.client.HTable.getDescriptor
org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
org.apache.phoenix.execute.MutationState.sendBatch
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.commit
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.call.CallRunner.run
org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
Another similar incident is captured in PHOENIX-7233. In this case, retrieving 
the clusterId from the ZNode got stuck and blocked the client from being able 
to create any more HBase Connections. Stacktrace for reference:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?

[jira] [Created] (HBASE-28424) Set correct Result to RegionActionResult for successful Put/Delete mutations

2024-03-06 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28424:


 Summary: Set correct Result to RegionActionResult for successful 
Put/Delete mutations
 Key: HBASE-28424
 URL: https://issues.apache.org/jira/browse/HBASE-28424
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


While returning the response of multi(), RSRpcServices builds the 
RegionActionResult with a Result or Exception (ClientProtos.ResultOrException). 
It sets the Exception on this class in all cases where the operation fails, 
with the corresponding exception types, e.g. NoSuchColumnFamilyException or 
FailedSanityCheckException.

In the case of the atomic mutations Increment and Append, we add the Result 
object to ClientProtos.ResultOrException, which is used by the client to 
retrieve results from the batch API: {_}Table#batch(List actions, Object[] 
results){_}.

Phoenix performs atomic mutation for Put using the _preBatchMutate()_ endpoint. 
Hence, returning a Result object with ResultOrException is important for 
returning the result back to the client as part of the atomic operation. Even 
if Phoenix returns the OperationStatus (with a Result) to 
MiniBatchOperationInProgress, since HBase uses an empty Result for the SUCCESS 
case, the client would not be able to get the expected result.
{code:java}
case SUCCESS:
  builder.addResultOrException(
getResultOrException(ClientProtos.Result.getDefaultInstance(), index));
  break; {code}
If the OperationStatus returned by _Region#batchMutate_ has a valid Result 
object, it should be used by RSRpcServices while returning the response.
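For context, a small client-side example of where the empty Result shows up 
today; the table name and column below are illustrative, and with the proposed 
change a Result populated via Region#batchMutate (e.g. by Phoenix's 
preBatchMutate) would become visible to the caller instead.
{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchResultExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("T1"))) {
      Put put = new Put(Bytes.toBytes("row1"))
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      List<Row> actions = Arrays.asList(put);
      Object[] results = new Object[actions.size()];
      table.batch(actions, results);
      // Today a successful Put comes back as an empty Result; with the fix, the
      // Result carried by OperationStatus would be returned to the caller.
      Result r = (Result) results[0];
      System.out.println("result empty? " + r.isEmpty());
    }
  }
}
{code}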



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28366) Mis-order of SCP and regionServerReport results into region inconsistencies

2024-02-13 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28366:


 Summary: Mis-order of SCP and regionServerReport results into 
region inconsistencies
 Key: HBASE-28366
 URL: https://issues.apache.org/jira/browse/HBASE-28366
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.5.7, 3.0.0-beta-1, 2.4.17
Reporter: Viraj Jasani


If the regionserver is online but, due to a network issue, its ephemeral node 
gets deleted in zookeeper, the active master schedules an SCP. However, if the 
regionserver is alive, it can still send regionServerReport to the active 
master. In the case where the SCP assigns regions that were previously hosted 
on the old regionserver (which is still alive) to other regionservers, the old 
rs can continue to send regionServerReport to the active master.

Eventually this results in region inconsistencies because a region is alive on 
two regionservers at the same time. While the old regionserver can have 
zookeeper connectivity issues, it can still make rpc calls to the active 
master.

 

Logs:

SCP:
{code:java}
2024-01-29 16:50:33,956 INFO [RegionServerTracker-0] 
assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=9812440 for 
server1-114.xyz,61020,1706541866103 (carryingMeta=false) 
server1-114.xyz,61020,1706541866103/CRASHED/regionCount=364/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5d5fc31[Write
 locks = 1, Read locks = 0], oldState=ONLINE.

2024-01-29 16:50:33,956 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor - Stored pid=9812440, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server1-114.xyz,61020,1706541866103, splitWal=true, meta=false

2024-01-29 16:50:33,973 INFO [PEWorker-36] procedure.ServerCrashProcedure - 
Splitting WALs pid=9812440, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, 
locked=true; ServerCrashProcedure server1-114.xyz,61020,1706541866103, 
splitWal=true, meta=false, isMeta: false
 {code}
As part of SCP, d743ace9f70d55f55ba1ecc6dc49a5cb was assigned to another server:

 
{code:java}
2024-01-29 16:50:42,656 INFO [PEWorker-24] procedure.MasterProcedureScheduler - 
Took xlock for pid=9818494, ppid=9812440, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure 
table=PLATFORM_ENTITY.PLATFORM_IMMUTABLE_ENTITY_DATA, 
region=d743ace9f70d55f55ba1ecc6dc49a5cb, ASSIGN

2024-01-29 16:50:43,106 INFO [PEWorker-23] assignment.RegionStateStore - 
pid=9818494 updating hbase:meta row=d743ace9f70d55f55ba1ecc6dc49a5cb, 
regionState=OPEN, repBarrier=12867482, openSeqNum=12867482, 
regionLocation=server1-65.xyz,61020,1706165574050
 {code}
 

 

rs abort, after ~5 min:

 
{code:java}
2024-01-29 16:54:27,235 ERROR [regionserver/server1-114:61020] 
regionserver.HRegionServer - * ABORTING region server 
server1-114.xyz,61020,1706541866103: Unexpected exception handling getData *
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:414)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:403)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:367)
    at 
org.apache.hadoop.hbase.zookeeper.ZKNodeTracker.getData(ZKNodeTracker.java:180)
    at 
org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:152)
    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionServerStatusStub(HRegionServer.java:2892)
    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1352)
    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1142)
 {code}
 

 

Several region transition failure report logs:

 
{code:java}
2024-01-29 16:55:13,029 INFO  [_REGION-regionserver/server1-114:61020-0] 
regionserver.HRegionServer - Failed report transition server { host_name: 
"server1-114.xyz" port: 61020 start_code: 1706541866103 } transition { 
transition_code: CLOSED region_info { region_id: 1671555604277 table_name { 
namespace: "default" qualifier: "TABLE1" } start_key: "abc" end_key: "xyz" 
offline: false split: false replica_id: 0 } proc_id: -1 }; retry (#0) 
immediately.
java.net.UnknownHostException: Call to address=master-server1.xyz:61000 failed 
on local exception: java.net.UnknownHostException: master-server1.xyz:61000 
could not be resolved
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at 

[jira] [Resolved] (HBASE-26352) Provide HBase upgrade guidelines from 1.6 to 2.4+ versions

2024-02-13 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26352.
--
Resolution: Implemented

> Provide HBase upgrade guidelines from 1.6 to 2.4+ versions
> --
>
> Key: HBASE-26352
> URL: https://issues.apache.org/jira/browse/HBASE-26352
> Project: HBase
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> Provide a reference guide under the section: 
> [https://hbase.apache.org/book.html#upgrade2.0.rolling.upgrades] 
> This should include the experience of performing an in-place rolling upgrade 
> (without downtime) from 1.6/1.7 to 2.4+ release versions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28356) RegionServer Canary can should use Scan just like Region Canary with option to enable Raw Scan

2024-02-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28356.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.8
   3.0.0-beta-2
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RegionServer Canary can should use Scan just like Region Canary with option 
> to enable Raw Scan
> --
>
> Key: HBASE-28356
> URL: https://issues.apache.org/jira/browse/HBASE-28356
> Project: HBase
>  Issue Type: Improvement
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
>
>
> While working on HBASE-28204 to improve the Region Canary, it came to notice 
> that the RegionServer canary uses the same code as the Region Canary to check 
> whether a region is accessible. Also, the RegionServer Canary doesn't have 
> Raw Scan enabled.
>  
> This JIRA aims to enable the Raw Scan option for the RegionServer Canary and 
> use Scan only, just like the Region Canary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28357) MoveWithAck#isSuccessfulScan for Region movement should use Region End Key for limiting scan to one region only.

2024-02-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28357.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.8
   3.0.0-beta-2
 Hadoop Flags: Reviewed
   Resolution: Fixed

> MoveWithAck#isSuccessfulScan for Region movement should use Region End Key 
> for limiting scan to one region only.
> 
>
> Key: HBASE-28357
> URL: https://issues.apache.org/jira/browse/HBASE-28357
> Project: HBase
>  Issue Type: Improvement
>  Components: Region Assignment
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
>
>
> Based on recent learnings and improvements to the HBase Canary in HBASE-28204 
> and HBASE-28356, I noticed that the MoveWithAck.java class also uses similar 
> code to check that a region is online after a region move.
>  
> {code:java}
>   private void isSuccessfulScan(RegionInfo region) throws IOException {
>     Scan scan = new 
> Scan().withStartRow(region.getStartKey()).setRaw(true).setOneRowLimit()
>       .setMaxResultSize(1L).setCaching(1).setFilter(new 
> FirstKeyOnlyFilter()).setCacheBlocks(false); {code}
> If the region that was moved is empty, then MoveWithAck#isSuccessfulScan() 
> will end up scanning the next region's key space, which is not the intent. If 
> multiple regions in sequence are empty, this could create too many 
> unnecessary scans. By setting withStopRow(endKeyOfRegion, false) on the scan 
> object, the scan can be bound to a single region only.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28204) Region Canary can take lot more time If any region (except the first region) starts with delete markers

2024-02-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28204.
--
Fix Version/s: 2.6.0
   (was: 2.7.0)
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Region Canary can take lot more time If any region (except the first region) 
> starts with delete markers
> ---
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
>
>
> In CanaryTool.java, Canary reads only the first row of the region using 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses [Scan with FirstRowKeyFilter for 
> table 
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this will only happen when the 
> region is the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions which can have a high number of delete 
> markers. Based on the current implementation, [RawScan is only 
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of 
> the table).
> RawScan doesn't work for the rest of the regions in the table, only for the 
> first region. Also, if all or a majority of the rows in the region carry 
> delete markers, the Get operation can take a lot of time. This can cause 
> timeouts for CanaryTool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2024-01-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28151.
--
Fix Version/s: 3.0.0-beta-2
 Hadoop Flags: Incompatible change,Reviewed
   Resolution: Fixed

> hbck -o should not allow bypassing pre transit check by default
> ---
>
> Key: HBASE-28151
> URL: https://issues.apache.org/jira/browse/HBASE-28151
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.4.17, 2.5.5
>Reporter: Viraj Jasani
>Assignee: Rahul Kumar
>Priority: Major
> Fix For: 3.0.0-beta-2
>
>
> When an operator uses hbck assigns or unassigns with "-o", the override will 
> also skip the pre transit checks. While this is one of the intentions of 
> "-o", the primary purpose should still be only to detach the existing 
> procedure from the RegionStateNode so that the newly scheduled assign proc 
> can take the exclusive region level lock.
> We should restrict bypassing preTransitCheck by providing it only as a site 
> config.
> Only if bypassing preTransitCheck is configured should any hbck "-o" be 
> allowed to bypass this check; otherwise, by default, it should go through the 
> check.
>  
> It is important to keep "unsetting the procedure from the RegionStateNode" 
> and "bypassing preTransitCheck" separate, so that when the cluster state is 
> bad we don't explicitly deteriorate it further. For example, if a region was 
> successfully split and the operator then performs "hbck assigns \{region} -o" 
> and it bypasses the transit check, the master would bring the region online, 
> and it could compact store files and archive the store file that is 
> referenced by the daughter region. This would not allow the daughter region 
> to come online.
> Let's introduce an hbase site config to allow bypassing preTransitCheck; it 
> should not be doable by the operator using hbck alone.
>  
> "-o" should mean "override" the procedure that is attached to the 
> RegionStateNode; it should not mean forcefully skipping region transition 
> validation checks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

2023-12-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28271:


 Summary: Infinite waiting on lock acquisition by snapshot can 
result in unresponsive master
 Key: HBASE-28271
 URL: https://issues.apache.org/jira/browse/HBASE-28271
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.7, 2.4.17, 3.0.0-alpha-4
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Attachments: image.png

When a region is stuck in transition for a significant time, any attempt to 
take a snapshot on the table keeps a master handler thread waiting forever. As 
part of creating a snapshot on an enabled or disabled table, a LockProcedure is 
executed to acquire the table level lock; but if any region of the table is in 
transition, the LockProcedure cannot complete, so the snapshot handler waits 
forever until the region transition completes and the table level lock can be 
acquired.

In cases where a region stays in RIT for a considerable time, if enough 
attempts are made by the client to create snapshots on the table, they can 
easily exhaust all handler threads, leading to a potentially unresponsive 
master. Attached is a sample thread dump.

Proposal: The snapshot handler should not stay stuck forever if it cannot take 
the table level lock; it should fail fast.

!image.png!
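A dependency-free illustration of the fail-fast proposal; the lock and the 60s 
timeout here are stand-ins, not the actual LockProcedure/snapshot handler API.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class SnapshotLockSketch {

  private final ReentrantLock tableLock = new ReentrantLock();

  void takeSnapshot(String table) throws InterruptedException {
    // Wait for the table level lock for a bounded time instead of forever, so a
    // region stuck in transition no longer pins a master handler thread.
    if (!tableLock.tryLock(60, TimeUnit.SECONDS)) {
      throw new IllegalStateException(
        "Could not acquire table lock for " + table + ", failing snapshot fast");
    }
    try {
      // ... snapshot work would go here ...
    } finally {
      tableLock.unlock();
    }
  }
}
{code}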



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-28204) Canary can take lot more time If any region (except the first region) starts with delete markers

2023-11-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-28204:
--

Reopening for revert.

> Canary can take lot more time If any region (except the first region) starts 
> with delete markers
> 
>
> Key: HBASE-28204
> URL: https://issues.apache.org/jira/browse/HBASE-28204
> Project: HBase
>  Issue Type: Bug
>  Components: canary
>Reporter: Mihir Monani
>Assignee: Mihir Monani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> In CanaryTool.java, Canary reads only the first row of the region using 
> [Get|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L520C33-L520C33]
>  for any region of the table. Canary uses [Scan with FirstRowKeyFilter for 
> table 
> scan|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L530]
>  if the said region has an empty start key (this will only happen when the 
> region is the first region of a table).
> With -[HBASE-16091|https://issues.apache.org/jira/browse/HBASE-16091]- 
> RawScan was 
> [implemented|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519-L534]
>  to improve performance for regions which can have a high number of delete 
> markers. Based on the current implementation, [RawScan is only 
> enabled|https://github.com/apache/hbase/blob/23c41560d58cc1353b8a466deacd02dfee9e6743/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L519]
>  if the region has an empty start key (i.e. the region is the first region of 
> the table).
> RawScan doesn't work for the rest of the regions in the table, only for the 
> first region. Also, if all or a majority of the rows in the region carry 
> delete markers, the Get operation can take a lot of time. This can cause 
> timeouts for CanaryTool.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28221) Introduce regionserver metric for delayed flushes

2023-11-27 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28221:


 Summary: Introduce regionserver metric for delayed flushes
 Key: HBASE-28221
 URL: https://issues.apache.org/jira/browse/HBASE-28221
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
 Fix For: 2.6.0, 3.0.0-beta-1


If compaction is disabled temporarily to allow the hdfs load to stabilize, we 
can forget to re-enable compaction. This can result in flushes getting delayed 
for the "hbase.hstore.blockingWaitTime" duration (90s by default). While 
flushes do happen eventually after waiting for the max blocking time, it is 
important to realize that no cluster can function well with compaction disabled 
for a significant amount of time.

 

Delayed flush logs:
{code:java}
LOG.warn("{} has too many store files({}); delaying flush up to {} ms",
  region.getRegionInfo().getEncodedName(), getStoreFileCount(region),
  this.blockingWaitTime); {code}
Suggestion: Introduce a regionserver metric (MetricsRegionServerSource) for the 
number of flushes getting delayed due to too many store files.
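A rough sketch of what the counter would track; the plain LongAdder and the 
name flushesDelayedCount are illustrative stand-ins for whatever 
MetricsRegionServerSource would actually expose.
{code:java}
import java.util.concurrent.atomic.LongAdder;

public class DelayedFlushMetricSketch {

  private final LongAdder flushesDelayedCount = new LongAdder();

  // Called from the flush path whenever a flush is delayed due to too many store
  // files, i.e. right where the "delaying flush up to {} ms" warning is logged.
  void onFlushDelayed(String encodedRegionName, int storeFileCount, long blockingWaitTimeMs) {
    flushesDelayedCount.increment();
  }

  long getFlushesDelayedCount() {
    return flushesDelayedCount.sum();
  }
}
{code}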



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28192) Master should recover if meta region state is inconsistent

2023-11-09 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28192:


 Summary: Master should recover if meta region state is inconsistent
 Key: HBASE-28192
 URL: https://issues.apache.org/jira/browse/HBASE-28192
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.6, 2.4.17
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7


During active master initialization, before we mark the master as active (i.e. 
{_}setInitialized(true){_}), we need both the meta and namespace regions 
online. If the region state of meta or namespace is inconsistent, the active 
master can get stuck in the initialization step:
{code:java}
private boolean isRegionOnline(RegionInfo ri) {
  RetryCounter rc = null;
  while (!isStopped()) {
...
...
...
// Check once-a-minute.
if (rc == null) {
  rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
}
Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
  }
  return false;
}
 {code}
In one of the recent outages, we observed that meta was online on a server, 
which was correctly reflected in the meta znode, but the server start time was 
different. This means that, as per the latest transition record, meta was 
marked online on the old server (the same server with an old start time). This 
kept active master initialization waiting forever, and some SCPs got stuck in 
the initial stage where they need to access the meta table before getting 
candidates for region moves.

The only way out of this outage was for the operator to schedule recoveries 
using hbck for the old server, which triggers an SCP for the old server address 
of meta. Since many SCPs were stuck, the processing of the new SCP too was 
taking some time; a manual restart of the active master triggered a failover, 
and the new master was able to complete the SCP for the old meta server, 
correcting the meta assignment details. This eventually marked the master as 
active, and only after this were we able to see the really large number of RITs 
that were hidden so far.

We need to let master recover from this state to avoid manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28151) hbck -o should not allow bypassing pre transit check by default

2023-10-12 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28151:


 Summary: hbck -o should not allow bypassing pre transit check by 
default
 Key: HBASE-28151
 URL: https://issues.apache.org/jira/browse/HBASE-28151
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


When an operator uses hbck assigns or unassigns with "-o", the override will 
also skip the pre transit checks. While this is one of the intentions of "-o", 
the primary purpose should still be only to detach the existing procedure from 
the RegionStateNode so that the newly scheduled assign proc can take the 
exclusive region level lock.

We should restrict bypassing preTransitCheck by providing it only as a site 
config.

Only if bypassing preTransitCheck is configured should any hbck "-o" be allowed 
to bypass this check; otherwise, by default, it should go through the check.

 

It is important to keep "unsetting the procedure from the RegionStateNode" and 
"bypassing preTransitCheck" separate, so that when the cluster state is bad we 
don't explicitly deteriorate it further. For example, if a region was 
successfully split and the operator then performs "hbck assigns \{region} -o" 
and it bypasses the transit check, the master would bring the region online, 
and it could compact store files and archive the store file that is referenced 
by the daughter region. This would not allow the daughter region to come 
online.

Let's introduce an hbase site config to allow bypassing preTransitCheck; it 
should not be doable by the operator using hbck alone.
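A hypothetical shape of the proposed gate; the property name below is purely 
illustrative, not the key that would actually be committed.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class HbckOverrideSketch {

  // Illustrative key only; the real config name would be decided in the patch.
  static final String BYPASS_KEY = "hbase.hbck.skip.pre.transit.check";

  // Run the pre transit check unless the operator passed -o AND the cluster
  // config explicitly allows bypassing it.
  static boolean shouldRunPreTransitCheck(Configuration conf, boolean overrideFlag) {
    boolean bypassAllowed = conf.getBoolean(BYPASS_KEY, false);
    return !(overrideFlag && bypassAllowed);
  }
}
{code}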



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28144) Canary publish read failure fails with NPE if region location is null

2023-10-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28144.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Canary publish read failure fails with NPE if region location is null
> -
>
> Key: HBASE-28144
> URL: https://issues.apache.org/jira/browse/HBASE-28144
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.5
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> A region with a null server name causes canary failures while publishing a read 
> failure, i.e. while updating the perServerFailuresCount map:
> {code:java}
> 2023-10-09 15:24:11 [CanaryMonitor-1696864805801] ERROR tool.Canary(1480): 
> Sniff region failed!
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionMonitor.run(CanaryTool.java:1478)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1837)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.incFailuresCountDetails(CanaryTool.java:327)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.publishReadFailure(CanaryTool.java:353)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:548)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:587)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:502)
>   at 
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:470)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   ... 1 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28144) Canary publish read failure fails with NPE if region location is null

2023-10-09 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28144:


 Summary: Canary publish read failure fails with NPE if region 
location is null
 Key: HBASE-28144
 URL: https://issues.apache.org/jira/browse/HBASE-28144
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.5.5
Reporter: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


A region with a null server name causes canary failures while publishing a read 
failure, i.e. while updating the perServerFailuresCount map:
{code:java}
2023-10-09 15:24:11 [CanaryMonitor-1696864805801] ERROR tool.Canary(1480): 
Sniff region failed!
java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionMonitor.run(CanaryTool.java:1478)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1837)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.incFailuresCountDetails(CanaryTool.java:327)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionStdOutSink.publishReadFailure(CanaryTool.java:353)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:548)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:587)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:502)
at 
org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:470)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more {code}
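
A minimal sketch of the null-safe counter update (field and method names are assumptions derived from the stack trace above, not the exact CanaryTool code):
{code:java}
// Hedged sketch: skip the per-server failure counter when the region location
// (server name) is unknown, so ConcurrentHashMap is never handed a null key.
private final ConcurrentMap<ServerName, LongAdder> perServerFailuresCount =
  new ConcurrentHashMap<>();

void incFailuresCountDetails(ServerName serverName, RegionInfo region) {
  // per-region failure accounting can still happen regardless of the location
  if (serverName == null) {
    return; // region location unknown; nothing to attribute to a server
  }
  perServerFailuresCount
    .computeIfAbsent(serverName, k -> new LongAdder())
    .increment();
}
{code}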



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28081) Snapshot working dir does not retain ACLs after snapshot commit phase

2023-09-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28081.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Snapshot working dir does not retain ACLs after snapshot commit phase
> -
>
> Key: HBASE-28081
> URL: https://issues.apache.org/jira/browse/HBASE-28081
> Project: HBase
>  Issue Type: Bug
>  Components: acl, test
>Reporter: Duo Zhang
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> TestSnapshotScannerHDFSAclController is failing 100% on flaky dashboard.
>  
> After the snapshot is committed from the working dir to the final destination (under 
> the /.hbase-snapshot dir), if the operation was an atomic rename, the working dir 
> (e.g. /hbase/.hbase-snapshot/.tmp) no longer preserves the ACLs that were derived 
> from the snapshot parent dir (e.g. /hbase/.hbase-snapshot) while creating the first 
> working snapshot dir. Hence, for the new working dir, we should ensure that we 
> preserve the ACLs from the snapshot parent dir.
> This would ensure that the final snapshot commit dir has the expected ACLs 
> regardless of whether we perform an atomic rename or a non-atomic copy operation 
> in the snapshot commit phase.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28050) RSProcedureDispatcher to fail-fast for krb auth failures

2023-09-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28050.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RSProcedureDispatcher to fail-fast for krb auth failures
> 
>
> Key: HBASE-28050
> URL: https://issues.apache.org/jira/browse/HBASE-28050
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> As discussed on the parent Jira, let's mark the remote procedures as failed when 
> we encounter a SaslException (GSS initiate failed), as this belongs to the 
> category of known IOExceptions where we are certain that the request has not 
> reached the target regionserver yet.
> This should help release dispatcher threads for other 
> ExecuteProceduresRemoteCall executions.
>  
> Example log:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
> procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed 
> due to java.io.IOException: Call to address=rs1:61020 failed on local 
> exception: java.io.IOException: 
> org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
> initiate failed, try=0, retrying...  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28076) NPE on initialization error in RecoveredReplicationSourceShipper

2023-09-14 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28076.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
 Hadoop Flags: Reviewed
   Resolution: Fixed

> NPE on initialization error in RecoveredReplicationSourceShipper
> 
>
> Key: HBASE-28076
> URL: https://issues.apache.org/jira/browse/HBASE-28076
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.4.17, 2.5.5
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Minor
> Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> When we run into problems starting RecoveredReplicationSourceShipper, we try 
> to stop the reader thread which we haven't initialized yet, resulting in an 
> NPE.
> {noformat}
> ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
> Unexpected exception in redacted currentPath=hdfs://redacted
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.terminate(RecoveredReplicationSourceShipper.java:100)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.getRecoveredQueueStartPos(RecoveredReplicationSourceShipper.java:87)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSourceShipper.getStartPosition(RecoveredReplicationSourceShipper.java:62)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$tryStartNewShipper$3(ReplicationSource.java:349)
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.tryStartNewShipper(ReplicationSource.java:341)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:601)
>         at java.lang.Thread.run(Thread.java:750)
> {noformat}
> A simple null check should fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28050) RSProcedureDispatcher to fail-fast for SaslException

2023-08-29 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28050:


 Summary: RSProcedureDispatcher to fail-fast for SaslException
 Key: HBASE-28050
 URL: https://issues.apache.org/jira/browse/HBASE-28050
 Project: HBase
  Issue Type: Sub-task
Reporter: Viraj Jasani


As discussed on the parent Jira, let's mark the remote procedures as failed when we 
encounter a SaslException (GSS initiate failed), as this belongs to the category 
of known IOExceptions where we are certain that the request has not reached the 
target regionserver yet.

This should help release dispatcher threads for other 
ExecuteProceduresRemoteCall executions.
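
A rough sketch of the fail-fast check (the unwrapping helper and the way it plugs into the dispatcher are assumptions, not the committed patch):
{code:java}
// Hedged sketch: walk the cause chain of the connection IOException and, if a
// SaslException ("GSS initiate failed") is found, let the dispatcher fail the
// remote call instead of rescheduling it, releasing the dispatcher thread.
// Note: in practice the SaslException may arrive wrapped as an
// org.apache.hadoop.ipc.RemoteException, so a production check may also need
// to inspect RemoteException#getClassName().
private static boolean isSaslFailure(Throwable t) {
  while (t != null) {
    if (t instanceof javax.security.sasl.SaslException) {
      return true;
    }
    t = t.getCause();
  }
  return false;
}

// inside scheduleForRetry(IOException e), before the normal backoff path:
//   if (isSaslFailure(e)) { return false; }  // caller then marks the procedures failed
{code}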



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28048) RSProcedureDispatcher to abort executing request after configurable retries

2023-08-28 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28048:


 Summary: RSProcedureDispatcher to abort executing request after 
configurable retries
 Key: HBASE-28048
 URL: https://issues.apache.org/jira/browse/HBASE-28048
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
Reporter: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


In a recent incident, we observed that RSProcedureDispatcher continues 
executing region open/close procedures with unbounded retries even in the 
presence of known failures like GSS initiate failure:

 
{code:java}
2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=0, retrying... {code}
 

 

If the remote execution results in IOException, the dispatcher attempts to 
schedule the procedure for further retries:

 
{code:java}
    private boolean scheduleForRetry(IOException e) {
      LOG.debug("Request to {} failed, try={}", serverName, 
numberOfAttemptsSoFar, e);
      // Should we wait a little before retrying? If the server is starting 
it's yes.
      ...
      ...
      ...
      numberOfAttemptsSoFar++;
      // Add some backoff here as the attempts rise otherwise if a stuck 
condition, will fill logs
      // with failed attempts. None of our backoff classes -- RetryCounter or 
ClientBackoffPolicy
      // -- fit here nicely so just do something simple; increment by 
rsRpcRetryInterval millis *
      // retry^2 on each try
      // up to max of 10 seconds (don't want to back off too much in case of 
situation change).
      submitTask(this,
        Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
this.numberOfAttemptsSoFar),
          10 * 1000),
        TimeUnit.MILLISECONDS);
      return true;
    }
 {code}
 

 

Even though we try to provide backoff while retrying, max wait time is 10s:

 
{code:java}
submitTask(this,
  Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * 
this.numberOfAttemptsSoFar),
10 * 1000),
  TimeUnit.MILLISECONDS); {code}
 

 

This results in an endless loop of retries, until either the underlying issue is 
fixed (e.g. the krb issue in this case) or the regionserver is killed and the ongoing 
open/close region procedure (and perhaps the entire SCP) for the affected 
regionserver is sidelined manually.
{code:java}
2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=217, retrying...
2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=193, retrying...
2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=266, retrying...
2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] 
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due 
to java.io.IOException: Call to address=rs1:61020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed, try=266, retrying...{code}
 

While external issues like "krb ticket expiry" require operator intervention, 
it is not prudent to fill up the active handlers with endless retries while 
attempting to execute RPCs against a single affected regionserver. This eventually 
leads to overall cluster state degradation, specifically in the event of 
multiple regionserver restarts resulting from planned activities.

One of the resolutions here would be:
 # Configure max retries as part of the ExecuteProceduresRequest (or it could be 
part of RemoteProcedureRequest)
 # This retry count should be used 
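A minimal sketch of the retry-cap idea above (the config key, its default, and the helper names are assumptions, not an agreed design):
{code:java}
// Hedged sketch: give up after a configurable number of attempts instead of
// retrying forever; 0 (the assumed default) keeps today's unbounded behavior.
private boolean scheduleForRetry(IOException e) {
  int maxAttempts = conf.getInt("hbase.regionserver.remote.procedure.max.retries", 0);
  if (maxAttempts > 0 && numberOfAttemptsSoFar >= maxAttempts) {
    LOG.warn("Giving up on {} after {} attempts", serverName, numberOfAttemptsSoFar, e);
    remoteCallFailed(e); // assumed hook to fail the remote procedures on this server
    return false;
  }
  numberOfAttemptsSoFar++;
  // ... existing backoff and submitTask(...) path continues here ...
  return true;
}
{code}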

[jira] [Resolved] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-08-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28042.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Snapshot corruptions due to non-atomic rename within same filesystem
> 
>
> Key: HBASE-28042
> URL: https://issues.apache.org/jira/browse/HBASE-28042
> Project: HBase
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Sequence of events that can lead to snapshot corruptions:
>  # Create snapshot using admin command
>  # Active master triggers async snapshot creation
>  # If the snapshot operation doesn't complete within 5 min, client gets 
> exception
> {code:java}
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
> 'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms 
>   {code}
>  # Client initiates snapshot deletion after this error
>  # In the snapshot completion/commit phase, the files are moved from tmp to 
> final dir.
>  # Snapshot delete and snapshot commit operations can cause corruption by 
> leaving incomplete metadata:
>  * [Snapshot commit] create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot delete from client]  delete 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
>  * [Snapshot commit]  create 
> "/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"
>  
> The changes introduced by HBASE-21098 perform an atomic rename for hbase 1 but 
> not for hbase 2
> {code:java}
>   public static void completeSnapshot(Path snapshotDir, Path workingDir, 
> FileSystem fs,
> FileSystem workingDirFs, final Configuration conf)
> throws SnapshotCreationException, IOException {
> LOG.debug(
>   "Sentinel is done, just moving the snapshot from " + workingDir + " to 
> " + snapshotDir);
> URI workingURI = workingDirFs.getUri();
> URI rootURI = fs.getUri();
> if (
>   (!workingURI.getScheme().equals(rootURI.getScheme()) || 
> workingURI.getAuthority() == null
> || !workingURI.getAuthority().equals(rootURI.getAuthority())
> || workingURI.getUserInfo() == null //always true for hdfs://{cluster}
> || !workingURI.getUserInfo().equals(rootURI.getUserInfo())
> || !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
> evaluated due to short circuit above
> && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
> true, conf) // non-atomic rename operation
> ) {
>   throw new SnapshotCreationException("Failed to copy working directory(" 
> + workingDir
> + ") to completed directory(" + snapshotDir + ").");
> }
>   } {code}
> whereas for hbase 1
> {code:java}
> // check UGI/userInfo
> if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
>   return true;
> }
> if (workingURI.getUserInfo() != null &&
> !workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
>   return true;
> }
>  {code}
> this causes shouldSkipRenameSnapshotDirectories() to return false if 
> workingURI and rootURI share the same filesystem, which would always lead to 
> atomic rename:
> {code:java}
> if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
> || !fs.rename(workingDir, snapshotDir))
>  && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
> conf)) {
>   throw new SnapshotCreationException("Failed to copy working directory(" + 
> workingDir
>   + ") to completed directory(" + snapshotDir + ").");
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28042) Snapshot corruptions due to non-atomic rename within same filesystem

2023-08-23 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28042:


 Summary: Snapshot corruptions due to non-atomic rename within same 
filesystem
 Key: HBASE-28042
 URL: https://issues.apache.org/jira/browse/HBASE-28042
 Project: HBase
  Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


Sequence of events that can lead to snapshot corruptions:
 # Create snapshot using admin command
 # Active master triggers async snapshot creation
 # If the snapshot operation doesn't complete within 5 min, client gets 
exception

{code:java}
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 
'T1_1691888405683_1691888440827_1' wasn't completed in expectedTime:60 ms   
{code}
 # Client initiates snapshot deletion after this error
 # In the snapshot completion/commit phase, the files are moved from tmp to 
final dir.
 # Snapshot delete and snapshot commit operations can cause corruption by 
leaving incomplete metadata:

 * [Snapshot commit] create 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
 * [Snapshot delete from client]  delete 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/.snapshotinfo"
 * [Snapshot commit]  create 
"/hbase/.hbase-snapshot/T1_1691888405683_1691888440827_1/data-manifest"

 

The changes introduced by HBASE-21098 perform an atomic rename for hbase 1 but 
not for hbase 2
{code:java}
  public static void completeSnapshot(Path snapshotDir, Path workingDir, 
FileSystem fs,
FileSystem workingDirFs, final Configuration conf)
throws SnapshotCreationException, IOException {
LOG.debug(
  "Sentinel is done, just moving the snapshot from " + workingDir + " to " 
+ snapshotDir);
URI workingURI = workingDirFs.getUri();
URI rootURI = fs.getUri();
if (
  (!workingURI.getScheme().equals(rootURI.getScheme()) || 
workingURI.getAuthority() == null
|| !workingURI.getAuthority().equals(rootURI.getAuthority())
|| workingURI.getUserInfo() == null //always true for hdfs://{cluster}
|| !workingURI.getUserInfo().equals(rootURI.getUserInfo())
|| !fs.rename(workingDir, snapshotDir)) //this condition isn't even 
evaluated due to short circuit above
&& !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, 
true, conf) // non-atomic rename operation
) {
  throw new SnapshotCreationException("Failed to copy working directory(" + 
workingDir
+ ") to completed directory(" + snapshotDir + ").");
}
  } {code}
whereas for hbase 1
{code:java}
// check UGI/userInfo
if (workingURI.getUserInfo() == null && rootURI.getUserInfo() != null) {
  return true;
}
if (workingURI.getUserInfo() != null &&
!workingURI.getUserInfo().equals(rootURI.getUserInfo())) {
  return true;
}
 {code}
this causes shouldSkipRenameSnapshotDirectories() to return false if workingURI 
and rootURI share the same filesystem, which would always lead to atomic rename:
{code:java}
if ((shouldSkipRenameSnapshotDirectories(workingURI, rootURI)
|| !fs.rename(workingDir, snapshotDir))
 && !FileUtil.copy(workingDirFs, workingDir, fs, snapshotDir, true, true, 
conf)) {
  throw new SnapshotCreationException("Failed to copy working directory(" + 
workingDir
  + ") to completed directory(" + snapshotDir + ").");
} {code}
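
A hedged sketch of a null-safe skip check that mirrors the hbase 1 behavior (an illustration of the intent, not necessarily the committed fix):
{code:java}
// Returns true when the two URIs point at different filesystems (fall back to
// copy); a null-safe userInfo comparison keeps same-filesystem commits (typical
// hdfs:// URIs have null userInfo) on the atomic rename path.
static boolean shouldSkipRenameSnapshotDirectories(URI workingURI, URI rootURI) {
  if (!Objects.equals(workingURI.getScheme(), rootURI.getScheme())) {
    return true;
  }
  if (!Objects.equals(workingURI.getAuthority(), rootURI.getAuthority())) {
    return true;
  }
  return !Objects.equals(workingURI.getUserInfo(), rootURI.getUserInfo());
}
{code}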
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28040) hbck2 bypass should provide an option to bypass existing top N procedures

2023-08-22 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-28040:


 Summary: hbck2 bypass should provide an option to bypass existing 
top N procedures
 Key: HBASE-28040
 URL: https://issues.apache.org/jira/browse/HBASE-28040
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


For the degraded cluster state where several SCPs and underlying TRSPs are 
stuck due to network issues, it becomes difficult to resolve RITs and recover 
regions from SCPs.

In order to bypass stuck procedures, we need to extract and then provide a list 
of proc ids from list_procedures or the procedures.jsp page. If we could provide an 
option to bypass the first N procedures listed on the procedures.jsp page, that 
would be really helpful.

Implementation steps:
 # Similar to BypassProcedureRequest, provide BypassTopNProcedureRequest whose 
only attribute is the value N
 # MasterRpcServices to provide a new API:
{code:java}
bypassProcedure(RpcController controller,
  MasterProtos.BypassTopNProcedureRequest request) {code}
 # Hbck to provide a utility to consume this master rpc
 # HBCK2 to use the new hbck utility for bypassing top N requests

 

For this new option, the top N procedures matter, hence they should follow a 
sorting order similar to the one used by procedures.jsp.
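
Until such an RPC exists, the intent can be illustrated with the existing Hbck client API (the procedure ids below are placeholders; with the proposed BypassTopNProcedureRequest the "top N" selection would move to the master side):
{code:java}
// Hedged illustration: bypass a hand-picked list of procedure ids via the
// existing Hbck#bypassProcedure(pids, waitTime, override, recursive) call.
// The ids stand in for the top N rows shown on procedures.jsp.
try (Connection conn = ConnectionFactory.createConnection(conf);
     Hbck hbck = conn.getHbck()) {
  List<Long> topProcIds = Arrays.asList(12345L, 12346L, 12347L); // placeholders
  List<Boolean> bypassed = hbck.bypassProcedure(topProcIds, 30_000, false, false);
  LOG.info("Bypassed: {}", bypassed);
}
{code}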



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28011) The logStats about LruBlockCache is not accurate

2023-08-09 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-28011.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> The logStats about LruBlockCache is not accurate
> 
>
> Key: HBASE-28011
> URL: https://issues.apache.org/jira/browse/HBASE-28011
> Project: HBase
>  Issue Type: Bug
>  Components: BlockCache
>Affects Versions: 2.4.13
> Environment: Centos 7.6
> HBase 2.4.13
>Reporter: guluo
>Assignee: guluo
>Priority: Minor
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> LruBlockCache.logStats would print info as follows:
> {code:java}
> INFO  [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=2.42 MB, 
> freeSize=3.20 GB, max=3.20 GB, blockCount=14, accesses=31200, hits=31164, 
> hitRatio=99.88%, , cachingAccesses=31179, cachingHits=31156, 
> cachingHitsRatio=99.93%, evictions=426355, evicted=0, evictedPerRun=0.0 {code}
> I think the description about *totalSize=2.42 MB* is not accurate; it 
> actually represents the used size of the BlockCache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27904) A random data generator tool leveraging bulk load.

2023-07-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27904.
--
Fix Version/s: 2.6.0
   Resolution: Fixed

> A random data generator tool leveraging bulk load.
> --
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
>  Issue Type: New Feature
>  Components: util
>Reporter: Himanshu Gwalani
>Assignee: Himanshu Gwalani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase leveraging bulk load. 
> Since bulk load skips the client write path, it's much faster to generate data 
> and use it for load/performance tests where client writes are not a mandate.
> {*}Example{*}: Any tooling over HBase that needs x TBs of an HBase table for load 
> testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any 
> pre-generated data as CSV/XML files as input.
> 2. Tooling should support pre-split tables (number of splits to be taken as 
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split with the number of splits as input)
> 2. The mapper of a custom Map Reduce job will generate random key-value pairs 
> and ensure that those are equally distributed across all regions of the table.
> 3. 
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
>  will be used to add a reducer to the MR job and create HFiles based on the key 
> value pairs generated by the mapper. 
> 4. Bulk load those HFiles to the respective regions of the table using 
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had a POC for this tool in our organization and tested it with an 11-node 
> HBase cluster (running HBase + Hadoop services). The tool generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool 
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 100 -split-count 
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27904) A random data generator tool leveraging bulk load.

2023-07-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27904:
--

re-opening for branch-2 backport

> A random data generator tool leveraging bulk load.
> --
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
>  Issue Type: New Feature
>  Components: util
>Reporter: Himanshu Gwalani
>Assignee: Himanshu Gwalani
>Priority: Major
> Fix For: 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase leveraging bulk load. 
> Since bulk load skips the client write path, it's much faster to generate data 
> and use it for load/performance tests where client writes are not a mandate.
> {*}Example{*}: Any tooling over HBase that needs x TBs of an HBase table for load 
> testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any 
> pre-generated data as CSV/XML files as input.
> 2. Tooling should support pre-split tables (number of splits to be taken as 
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split with the number of splits as input)
> 2. The mapper of a custom Map Reduce job will generate random key-value pairs 
> and ensure that those are equally distributed across all regions of the table.
> 3. 
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
>  will be used to add a reducer to the MR job and create HFiles based on the key 
> value pairs generated by the mapper. 
> 4. Bulk load those HFiles to the respective regions of the table using 
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had a POC for this tool in our organization and tested it with an 11-node 
> HBase cluster (running HBase + Hadoop services). The tool generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool 
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 100 -split-count 
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27948) Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean

2023-06-29 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27948.
--
Fix Version/s: 2.6.0
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Report memstore on-heap and off-heap size as jmx metrics in sub=Memory bean
> ---
>
> Key: HBASE-27948
> URL: https://issues.apache.org/jira/browse/HBASE-27948
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jing Yu
>Assignee: Jing Yu
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> Currently we only report the "memStoreSize" jmx metric in the sub=Memory bean. 
> There are "Memstore On-Heap Size" and "Memstore Off-Heap Size" in the RS UI. It 
> would be useful to report them in JMX.
> In addition, the "memStoreSize" metric under sub=Memory is 0 for some reason 
> (while the one under sub=Server is not). Need to do some digging to see if it is 
> a bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27955) RefreshPeerProcedure should be resilient to replication endpoint failures

2023-06-28 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27955:


 Summary: RefreshPeerProcedure should be resilient to replication 
endpoint failures
 Key: HBASE-27955
 URL: https://issues.apache.org/jira/browse/HBASE-27955
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani


UpdatePeerConfigProcedure gets stuck when we see some failures in 
RefreshPeerProcedure. The only way to move forward is either by restarting 
active master or bypassing the stuck procedure.

 

For instance,
{code:java}
2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] 
replication.RefreshPeerProcedure - Refresh peer core1.hbase1a_aws.prod5.uswest2 
for UPDATE_CONFIG on {host},{port},1687053857180 failed
java.lang.NullPointerException via 
{host},{port},1687053857180:java.lang.NullPointerException: 
    at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException: 
    at xyz(Abc.java:89) <= replication endpoint failure example
    at xyz(Abc.java:79)     <= replication endpoint failure example     
at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at 
org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
    at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
    at 
org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
    at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750) {code}
RefreshPeerProcedure should support reporting this failure and rollback of the 
parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27892) Report memstore on-heap and off-heap size as jmx metrics

2023-06-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27892.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
   4.0.0-alpha-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Report memstore on-heap and off-heap size as jmx metrics
> 
>
> Key: HBASE-27892
> URL: https://issues.apache.org/jira/browse/HBASE-27892
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Jing Yu
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1, 4.0.0-alpha-1
>
>
> Currently we only report "memStoreSize" metric in sub=RegionServer bean. I've 
> noticed a big discrepancy between this metric and the RS UI's "Memstore 
> On-Heap Size". It seems like "memStoreSize" is the overall data size, while 
> the on-heap size is coming from our heap estimation which includes POJO heap 
> overhead, etc.
> I have a regionserver with only 750mb of "memStoreSize", but the on-heap size 
> is over 1gb.  This is non-trivial for estimating overall heap size necessary 
> for a regionserver. Since we have the data, let's report it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27904) A random data generator tool leveraging bulk load.

2023-06-22 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27904.
--
Fix Version/s: (was: 2.6.0)
 Hadoop Flags: Reviewed
   Resolution: Fixed

> A random data generator tool leveraging bulk load.
> --
>
> Key: HBASE-27904
> URL: https://issues.apache.org/jira/browse/HBASE-27904
> Project: HBase
>  Issue Type: New Feature
>  Components: util
>Reporter: Himanshu Gwalani
>Assignee: Himanshu Gwalani
>Priority: Major
> Fix For: 3.0.0-beta-1
>
>
> As of now, there is no data generator tool in HBase leveraging bulk load. 
> Since bulk load skips the client write path, it's much faster to generate data 
> and use it for load/performance tests where client writes are not a mandate.
> {*}Example{*}: Any tooling over HBase that needs x TBs of an HBase table for load 
> testing.
> {*}Requirements{*}:
> 1. Tooling should generate RANDOM data on the fly and should not require any 
> pre-generated data as CSV/XML files as input.
> 2. Tooling should support pre-split tables (number of splits to be taken as 
> input).
> 3. Data should be UNIFORMLY distributed across all regions of the table.
> *High-level Steps*
> 1. A table will be created (pre-split with the number of splits as input)
> 2. The mapper of a custom Map Reduce job will generate random key-value pairs 
> and ensure that those are equally distributed across all regions of the table.
> 3. 
> [HFileOutputFormat2|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java]
>  will be used to add a reducer to the MR job and create HFiles based on the key 
> value pairs generated by the mapper. 
> 4. Bulk load those HFiles to the respective regions of the table using 
> [LoadIncrementalFiles|https://hbase.apache.org/2.2/devapidocs/org/apache/hadoop/hbase/tool/LoadIncrementalHFiles.html]
> *Results*
> We had a POC for this tool in our organization and tested it with an 11-node 
> HBase cluster (running HBase + Hadoop services). The tool generated:
> 1. *100* *GB* of data in *6 minutes*
> 2. *340 GB* of data in *13 minutes*
> 3. *3.5 TB* of data in *3 hours and 10 minutes*
> *Usage*
> hbase org.apache.hadoop.hbase.util.bulkdatagenerator.BulkDataGeneratorTool 
> -mapper-count 100 -table TEST_TABLE_1 -rows-per-mapper 100 -split-count 
> 100 -delete-if-exist -table-options "NORMALIZATION_ENABLED=false"
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27902) New async admin api to invoke coproc on multiple servers

2023-06-20 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27902.
--
Fix Version/s: 2.6.0
   2.4.18
   2.5.6
   3.0.0-beta-1
 Hadoop Flags: Reviewed
   Resolution: Fixed

> New async admin api to invoke coproc on multiple servers
> 
>
> Key: HBASE-27902
> URL: https://issues.apache.org/jira/browse/HBASE-27902
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Jing Yu
>Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> We can execute regionserver coproc on a given regionserver using:
> {code:java}
>  <S, R> CompletableFuture<R> coprocessorService(Function<RpcChannel, S> stubMaker,
>   ServiceCaller<S, R> callable, ServerName serverName); {code}
> We should also provide an api at admin layer that can invoke the given coproc 
> endpoint on multiple servers, to help execute the given function in parallel 
> across the entire cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27902) New async admin api to invoke coproc on multiple servers

2023-06-01 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27902:


 Summary: New async admin api to invoke coproc on multiple servers
 Key: HBASE-27902
 URL: https://issues.apache.org/jira/browse/HBASE-27902
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Jing Yu


We can execute regionserver coproc on a given regionserver using:
{code:java}
 <S, R> CompletableFuture<R> coprocessorService(Function<RpcChannel, S> stubMaker,
  ServiceCaller<S, R> callable, ServerName serverName); {code}
We should also provide an api at admin layer that can invoke the given coproc 
endpoint on multiple servers, to help execute the given function in parallel 
across the entire cluster.
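
One possible shape for the multi-server variant, sketched on top of the existing per-server call (the method name and signature below are assumptions, not the final API; the RpcChannel import is the shaded or unshaded protobuf type depending on the HBase version):
{code:java}
// Hedged sketch: fan the existing per-server coprocessorService call out over a
// set of regionservers and collect one future per server.
<S, R> Map<ServerName, CompletableFuture<R>> coprocessorServiceOnServers(
    AsyncAdmin admin, Function<RpcChannel, S> stubMaker,
    ServiceCaller<S, R> callable, Collection<ServerName> serverNames) {
  Map<ServerName, CompletableFuture<R>> futures = new HashMap<>();
  for (ServerName serverName : serverNames) {
    futures.put(serverName, admin.coprocessorService(stubMaker, callable, serverName));
  }
  return futures;
}
{code}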



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27535) Separate slowlog thresholds for scans vs other requests

2023-04-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27535.
--
Fix Version/s: 2.6.0
   3.0.0-alpha-4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Separate slowlog thresholds for scans vs other requests
> ---
>
> Key: HBASE-27535
> URL: https://issues.apache.org/jira/browse/HBASE-27535
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: slowlog
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> Scans by their nature are able to more efficiently pull back larger response 
> sizes than gets. They also may take longer to execute than other request 
> types. We should make it possible to configure a separate threshold for 
> response time and response size for scans. This will allow us to tune down 
> the thresholds for others without adding unnecessary noise for requests which 
> are known to be slower/bigger.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27536) Include more request information in slowlog for Scans

2023-04-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27536.
--
Fix Version/s: 2.6.0
   3.0.0-alpha-4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Include more request information in slowlog for Scans
> -
>
> Key: HBASE-27536
> URL: https://issues.apache.org/jira/browse/HBASE-27536
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: slowlog
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> Currently the slowlog only includes a barebones text format of the underlying 
> protobuf Message fields. This is not a great UX for 2 reasons:
>  # Most of the proto fields don't mirror the actual API names in our requests 
> (Scan, Get, etc).
>  # The chosen data is often not enough to actually infer anything about the 
> request
> Any of the API class's toString method would be a much better representation 
> of the request. On the server side, we already have to turn the protobuf 
> Message into an actual API class in order to serve the request in 
> RSRpcServices. Given slow logs should be a very small percent of total 
> requests, I think we should do a similar parsing in SlowLogQueueService. Or 
> better yet, perhaps we can pass the already parsed request into the queue at 
> the start to avoid the extra work. 
> When hydrating a SlowLogPayload with this request information, I believe we 
> should use {{Operation's toMap(int maxCols)}} method. Adding this onto the 
> SlowLogPayload as a map (or list of key/values) will make it easier to 
> consume via downstream automation. Alternatively we could use 
> {{{}toJSON(){}}}.
> We should also include any attributes from the queries, as those may aid 
> tracing at the client level.
> Edit: because of nuance related to handling multis and the adequacy of info 
> available for gets/puts, we're scoping this issue down to focus on improving 
> the information available on Scan slowlogs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27684) Client metrics for user region lock related behaviors.

2023-03-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27684.
--
Fix Version/s: 2.4.17
   Resolution: Fixed

> Client metrics for user region lock related behaviors.
> --
>
> Key: HBASE-27684
> URL: https://issues.apache.org/jira/browse/HBASE-27684
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.0.0
>Reporter: Victor Li
>Assignee: Victor Li
>Priority: Major
> Fix For: 2.6.0, 2.4.17, 2.5.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27684) Client metrics for user region lock related behaviors.

2023-03-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27684:
--

Reopening for branch-2.4 backport.

> Client metrics for user region lock related behaviors.
> --
>
> Key: HBASE-27684
> URL: https://issues.apache.org/jira/browse/HBASE-27684
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.0.0
>Reporter: Victor Li
>Assignee: Victor Li
>Priority: Major
> Fix For: 2.6.0, 2.5.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27684) Client metrics for user region lock related behaviors.

2023-03-20 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27684.
--
Fix Version/s: 2.6.0
   2.5.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Client metrics for user region lock related behaviors.
> --
>
> Key: HBASE-27684
> URL: https://issues.apache.org/jira/browse/HBASE-27684
> Project: HBase
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 2.0.0
>Reporter: Victor Li
>Assignee: Victor Li
>Priority: Major
> Fix For: 2.6.0, 2.5.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27671) Client should not be able to restore/clone a snapshot after it's TTL has expired

2023-03-20 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27671.
--
Resolution: Fixed

> Client should not be able to restore/clone a snapshot after it's TTL has 
> expired
> 
>
> Key: HBASE-27671
> URL: https://issues.apache.org/jira/browse/HBASE-27671
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.2
> Environment: ENV- HBase 2.5.2
>Reporter: Ashok shetty
>Assignee: Nihal Jain
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
> Attachments: Screenshot 2023-03-15 at 8.20.31 PM.png
>
>
> Steps:
> precondition : set hbase.master.cleaner.snapshot.interval to 5 min in 
> hbase-site.xml
> 1. create a table t1, put some data
> 2. create a snapshot 'snapt1' with a TTL of 1 min
> let the TTL expire
> 3. disable and drop table t1
> 4. restore snapshot t1
> Actual : restore snapshot successful 
> Expected : the restore operation should fail and report that the specified 
> snapshot's TTL has expired and it can't be restored
> Note : this can be considered an improvement point 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27671) Client should not be able to restore/clone a snapshot after it's TTL has expired

2023-03-20 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27671:
--

Backporting to branch-2.4

> Client should not be able to restore/clone a snapshot after it's TTL has 
> expired
> 
>
> Key: HBASE-27671
> URL: https://issues.apache.org/jira/browse/HBASE-27671
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.2
> Environment: ENV- HBase 2.5.2
>Reporter: Ashok shetty
>Assignee: Nihal Jain
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: Screenshot 2023-03-15 at 8.20.31 PM.png
>
>
> Steps:
> precondition : set hbase.master.cleaner.snapshot.interval to 5 min in 
> hbase-site.xml
> 1. create a table t1, put some data
> 2. create a snapshot 'snapt1' with a TTL of 1 min
> let the TTL expire
> 3. disable and drop table t1
> 4. restore snapshot t1
> Actual : restore snapshot successful 
> Expected : the restore operation should fail and report that the specified 
> snapshot's TTL has expired and it can't be restored
> Note : this can be considered an improvement point 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27671) Client should not be able to restore/clone a snapshot after it's TTL has expired

2023-03-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27671.
--
Fix Version/s: 2.6.0
   3.0.0-alpha-4
   2.5.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Client should not be able to restore/clone a snapshot after it's TTL has 
> expired
> 
>
> Key: HBASE-27671
> URL: https://issues.apache.org/jira/browse/HBASE-27671
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.2
> Environment: ENV- HBase 2.5.2
>Reporter: Ashok shetty
>Assignee: Nihal Jain
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.4
>
> Attachments: Screenshot 2023-03-15 at 8.20.31 PM.png
>
>
> Steps:
> precondition : set hbase.master.cleaner.snapshot.interval to 5 min in 
> hbase-site.xml
> 1. create a table t1, put some data
> 2. create a snapshot 'snapt1' with a TTL of 1 min
> let the TTL expire
> 3. disable and drop table t1
> 4. restore snapshot t1
> Actual : restore snapshot successful 
> Expected : the restore operation should fail and report that the specified 
> snapshot's TTL has expired and it can't be restored
> Note : this can be considered an improvement point 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27529) Provide RS coproc ability to attach WAL extended attributes to mutations at replication sink

2023-01-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27529.
--
Hadoop Flags: Reviewed
Release Note: 
New regionserver coproc endpoints that can be used by coproc at the replication 
sink cluster if WAL has extended attributes. 
Using the new endpoints, WAL extended attributes can be transferred to Mutation 
attributes at the replication sink cluster.
  Resolution: Fixed

> Provide RS coproc ability to attach WAL extended attributes to mutations at 
> replication sink
> 
>
> Key: HBASE-27529
> URL: https://issues.apache.org/jira/browse/HBASE-27529
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.16, 2.5.3
>
>
> HBase provides coproc ability to enhance WALKey attributes (a.k.a. WAL 
> annotations) in order for the replication sink cluster to build required 
> metadata with the mutations. The endpoint is preWALAppend(). This ability was 
> provided by HBASE-22622. The map of extended attributes is optional and hence 
> not directly used by hbase internally. 
> For any hbase downstreamers to build CDC (Change Data Capture) like 
> functionality, they might require additional metadata beyond the ones 
> already used by hbase (replication scope, list of cluster ids, seq id, 
> table name, region id etc). For instance, Phoenix uses many additional 
> attributes like tenant id, schema name, table type etc.
> We already have this extended map of attributes available in WAL protobuf, to 
> provide us the capability to (de)serialize it. While creating new 
> ReplicateWALEntryRequest from the list of WAL entries, we are able to 
> serialize the additional attributes. Similarly, at the replication sink side, 
> the deserialized WALEntry has the extended attributes available.
> At the sink cluster, we should be able to attach the deserialized extended 
> attributes to the newly generated mutations so that the peer cluster can 
> utilize the mutation attributes to re-build required metadata.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27529) Attach WAL extended attributes to mutations at replication sink

2022-12-12 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27529:


 Summary: Attach WAL extended attributes to mutations at 
replication sink
 Key: HBASE-27529
 URL: https://issues.apache.org/jira/browse/HBASE-27529
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.16, 2.5.3


HBase provides coproc ability to enhance WALKey attributes (a.k.a. WAL 
annotations) in order for the replication sink cluster to build required 
metadata with the mutations. The endpoint is preWALAppend(). This ability was 
provided by HBASE-22622. The map of extended attributes is optional and hence 
not directly used by hbase internally. 

For any hbase downstreamers to build CDC (Change Data Capture) like 
functionality, they might require additional metadata beyond the ones 
already used by hbase (replication scope, list of cluster ids, seq id, 
table name, region id etc). For instance, Phoenix uses many additional 
attributes like tenant id, schema name, table type etc.
We already have this extended map of attributes available in the WAL protobuf, 
which gives us the capability to (de)serialize it. While creating a new 
ReplicateWALEntryRequest from the list of WAL entries, we are able to serialize 
the additional attributes. Similarly, at the replication sink side, the 
deserialized WALEntry has the extended attributes available.

At the sink cluster, we should be able to attach the deserialized extended 
attributes to the newly generated mutations so that the peer cluster can 
utilize the mutation attributes to re-build required metadata.
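
As a rough illustration of the sink-side copy (assuming the WALKey extended-attributes accessor from HBASE-22622; this is not the exact coproc hook added by this change):
{code:java}
// Hedged sketch: copy WALKey extended attributes onto the mutation rebuilt at
// the replication sink, so downstream logic can read them as ordinary
// mutation attributes.
void attachWalExtendedAttributes(WALKey walKey, Mutation mutation) {
  Map<String, byte[]> attrs = walKey.getExtendedAttributes();
  if (attrs != null) {
    attrs.forEach(mutation::setAttribute); // Mutation#setAttribute(String, byte[])
  }
}
{code}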



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27466) hbase client metrics per user specified identity on hconnections.

2022-12-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27466.
--
Resolution: Fixed

> hbase client metrics per user specified identity on hconnections.
> -
>
> Key: HBASE-27466
> URL: https://issues.apache.org/jira/browse/HBASE-27466
> Project: HBase
>  Issue Type: Improvement
>  Components: Client
>Reporter: Victor Li
>Assignee: Victor Li
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.3
>
>
> At present, hbase client metrics are per individual hconnection, with a 
> pre-configured scope in conf, or the cluster ID plus a connection ID that has a 
> generated hash code as suffix. If a client has more than one connection and 
> the scope configuration is common among all connections, the metrics might 
> override each other.
> I am proposing that connections share a common metrics object if the 
> connections have a common configured scope.
> If a connection identity is not provided, client metrics will continue 
> working per hconnection with a connection ID as its scope, i.e. no behavior 
> change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27502) Regionservers aborted as mvcc read point is less than max seq id derived from .seqid files

2022-12-01 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27502.
--
  Assignee: Viraj Jasani
Resolution: Workaround

It turns out this issue happened because the snapshot was created by a MapReduce 
job that was still using hbase 1.x binaries on the classpath, and HBASE-21977 is 
not backported to branch-1.

HBASE-21977 prevents creating a new seqid for the region open/close that happens 
as part of creating snapshot scanners, but this fix is only available for hbase 
2.x clusters.

 

[~shahrs87] [~syuanjiang] 

> Regionservers aborted as mvcc read point is less than max seq id derived from 
> .seqid files
> --
>
> Key: HBASE-27502
> URL: https://issues.apache.org/jira/browse/HBASE-27502
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.4.15
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> The HBase cluster was recently upgraded from 1.6 to 2.4.14/15. The cluster 
> doesn't have much traffic. 4-5 days after this upgrade, suddenly 144 out 
> of ~150 regionservers were aborted with {*}java.io.IOException: The new max 
> sequence id {} is less than the old max sequence id {}{*}.
> After restarting the regionservers, things were back to normal.
>  
> Sequence of events for the first regionserver that was aborted:
> Table snapshot was created, hence region went through snapshot subprocedure.
> {code:java}
> 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] 
> snapshot.SnapshotManifest - Storing 
> 'TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.'
>  region-info for snapshot=SNAPSHOT_TABLE01_1668560404026_1668560447300_0
> 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] 
> snapshot.FlushSnapshotSubprocedure - Starting snapshot operation on 
> TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
> 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] 
> snapshot.SnapshotManifest - Adding snapshot references for 
> [hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a]
>  hfiles
> 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] 
> snapshot.SnapshotManifest - Adding reference for file (1/1): 
> hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a
>  for snapshot=SNAPSHOT_TABLE01_1668560404026_1668560447300_0
> 2022-11-16 01:03:11,562 DEBUG [532)-snapshot-pool-0] 
> snapshot.FlushSnapshotSubprocedure - Closing snapshot operation on 
> TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
> 2022-11-16 01:03:11,562 DEBUG [532)-snapshot-pool-0] 
> snapshot.FlushSnapshotSubprocedure - ... SkipFlush Snapshotting region 
> TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
>  completed.
>  {code}
>  
> After 6+ hr, major compaction of the table was triggered.
> Logs from RS c01-dabc11-12-xyz.abcxyz:
> {code:java}
> 2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HStore - 
> Starting compaction of 
> [hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a]
>  into 
> tmpdir=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/.tmp,
>  totalSize=939.0 M
> 2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HRegion - 
> Starting compaction of d773829ad7e76202cccac6fbc314091b/0 in 
> TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
> {code}
>  
>  
> Region split is triggered by CompactSplit.
> Logs from RS c01-dabc11-12-xyz.abcxyz:
> {code:java}
> 2022-11-16 07:38:03,570 DEBUG [0-shortCompactions-0] 
> regionserver.CompactSplit - Splitting 
> TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.,
>  compactionQueue=(longCompactions=0:shortCompactions=0), splitQueue=0
> 2022-11-16 

[jira] [Created] (HBASE-27502) Regionservers aborted as new seq id is less than max seq id derived from .seqid files

2022-11-22 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27502:


 Summary: Regionservers aborted as new seq id is less than max seq 
id derived from .seqid files
 Key: HBASE-27502
 URL: https://issues.apache.org/jira/browse/HBASE-27502
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.15
Reporter: Viraj Jasani


The HBase cluster was recently upgraded from 1.6 to 2.4.14/15. The cluster 
doesn't have much traffic. 4-5 days after this upgrade, suddenly 144 out of 
~150 regionservers were aborted with {*}java.io.IOException: The new max 
sequence id {} is less than the old max sequence id {}{*}.

After restarting the regionservers, things were back to normal.

 

Sequence of events for the first regionserver that was aborted (all servers 
aborted due to same reason):

Major compaction of one of the tables (TABLE01) was triggered.

Logs from RS c01-dabc11-12-xyz.abcxyz:

 
{code:java}
2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HStore - 
Starting compaction of 
[hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a]
 into 
tmpdir=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/.tmp,
 totalSize=939.0 M

2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HRegion - 
Starting compaction of d773829ad7e76202cccac6fbc314091b/0 in 
TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
{code}
 

 

Region split is triggered by CompactSplit.

Logs from RS c01-dabc11-12-xyz.abcxyz:

 
{code:java}
2022-11-16 07:38:03,570 DEBUG [0-shortCompactions-0] regionserver.CompactSplit 
- Splitting 
TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.,
 compactionQueue=(longCompactions=0:shortCompactions=0), splitQueue=0

2022-11-16 07:38:03,848 INFO  [abc11-12-xyz:61020-0] regionserver.HRegion - 
Closing region 
TABLE01,00DJ003PwkV05J00681gsU\x08\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.

2022-11-16 07:38:03,860 DEBUG [2cccac6fbc314091b.-1] backup.HFileArchiver - 
Archived from FileableStoreFile, 
hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a
 to 
hdfs://c01/hbase/archive/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a

2022-11-16 07:38:03,881 DEBUG [abc11-12-xyz:61020-0] regionserver.HRegion - 
Region close journal for d773829ad7e76202cccac6fbc314091b:
Waiting for close lock at 1668584283848Running coprocessor pre-close hooks at 
1668584283848Disabling compacts and flushes for region at 
1668584283848Disabling writes for close at 1668584283848Writing region close 
event to WAL at 1668584283876 (+28 ms)

2022-11-16 07:38:03,881 WARN  [abc11-12-xyz:61020-0] 
handler.UnassignRegionHandler - Fatal error occurred while closing region 
d773829ad7e76202cccac6fbc314091b, aborting...
java.io.IOException: The new max sequence id 1963762 is less than the old max 
sequence id 1963764
    at 
org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
    at 
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:118)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
Leading to RS abort:

 
{code:java}
2022-11-16 07:38:03,889 ERROR [abc11-12-xyz:61020-0] regionserver.HRegionServer 
- * ABORTING region server c01-dabc11-12-xyz.abcxyz,61020,1668064189532: 
Failed to close region d773829ad7e76202cccac6fbc314091b and can not recover 
*
java.io.IOException: The new max sequence id 1963762 is less than the old max 
sequence id 1963764
    at 
org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
    at 

[jira] [Resolved] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27100.
--
Fix Version/s: (was: 3.0.0-alpha-4)
   Resolution: Implemented

> Add documentation for Replication Observability Framework in hbase book.
> 
>
> Key: HBASE-27100
> URL: https://issues.apache.org/jira/browse/HBASE-27100
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27100) Add documentation for Replication Observability Framework in hbase book.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27100:
--

> Add documentation for Replication Observability Framework in hbase book.
> 
>
> Key: HBASE-27100
> URL: https://issues.apache.org/jira/browse/HBASE-27100
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27085.
--
Resolution: Implemented

> Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from 
> source cluster.
> --
>
> Key: HBASE-27085
> URL: https://issues.apache.org/jira/browse/HBASE-27085
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> This work is to create sink tracker table to persist tracker rows coming from 
> replication source cluster. 
> Create ReplicationMarkerChore to create replication marker rows periodically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-26925) Create WAL event tracker table to track all the WAL events.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26925.
--
Fix Version/s: (was: 3.0.0-alpha-4)
   Resolution: Implemented

> Create WAL event tracker table to track all the WAL events.
> ---
>
> Key: HBASE-26925
> URL: https://issues.apache.org/jira/browse/HBASE-26925
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>
> Design Doc: 
> [https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#]
> Create wal event tracker table to track WAL events. Whenever we roll the WAL, 
> we will save the WAL name, WAL length, region server, timestamp in a table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27085) Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from source cluster.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27085:
--

> Create REPLICATION_SINK_TRACKER table to persist sentinel rows coming from 
> source cluster.
> --
>
> Key: HBASE-27085
> URL: https://issues.apache.org/jira/browse/HBASE-27085
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> This work is to create sink tracker table to persist tracker rows coming from 
> replication source cluster. 
> Create ReplicationMarkerChore to create replication marker rows periodically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-26925) Create WAL event tracker table to track all the WAL events.

2022-11-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-26925:
--

> Create WAL event tracker table to track all the WAL events.
> ---
>
> Key: HBASE-26925
> URL: https://issues.apache.org/jira/browse/HBASE-26925
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 3.0.0-alpha-4
>
>
> Design Doc: 
> [https://docs.google.com/document/d/14oZ5ssY28hvJaQD_Jg9kWX7LfUKUyyU2PCA93PPzVko/edit#]
> Create wal event tracker table to track WAL events. Whenever we roll the WAL, 
> we will save the WAL name, WAL length, region server, timestamp in a table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27398) Remove dumping of EOFException while reading WAL with ProtobufLogReader

2022-09-28 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27398:


 Summary: Remove dumping of EOFException while reading WAL with 
ProtobufLogReader
 Key: HBASE-27398
 URL: https://issues.apache.org/jira/browse/HBASE-27398
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.6.0, 2.5.1, 3.0.0-alpha-4, 2.4.15


The log processing tooling that extracts and analyzes Exceptions from 
regionserver logs keeps picking up EOFException dumps that the 
ProtobufLogReader implementation emits while resetting the seek back to the 
original position.

Common logs:
{code:java}
2022-09-28 17:02:00,288 DEBUG [20%2C1664323516467,1] wal.ProtobufLogReader - 
Encountered a malformed edit, seeking back to last good position in file, from 
187159 to 187158
java.io.EOFException: Partial PB while reading WAL, probably an unexpected EOF, 
ignoring. current offset=187159
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:390)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:258)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.tryAdvanceStreamAndCreateWALBatch(ReplicationSourceWALReader.java:251)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:148)
 {code}
{code:java}
2022-09-28 11:02:10,792 DEBUG [20%2C1664323193648,1] wal.ProtobufLogReader - 
Encountered a malformed edit, seeking back to last good position in file, from 
112026775 to 112026303
java.io.EOFException: EOF  while reading 296 WAL KVs; started reading at 
112026367 and read up to 112026775
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:418)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:258)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
Caused by: java.io.EOFException: Only read 6
at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:406)
... 7 more {code}
After looking at these logs, it seems that dumping the EOFException even at 
DEBUG level is not helping much, because resetting the seek to a different 
position is expected to happen often. We should stop dumping the EOFException 
in these cases.
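
A minimal sketch of the direction, assuming we keep the one-line DEBUG message 
and simply stop attaching the exception as a throwable; the class, method, and 
variable names below are illustrative, not the actual ProtobufLogReader fields.
{code:java}
import java.io.EOFException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch only: report the seek-back at DEBUG using just the exception
// message, instead of dumping the full EOFException stack trace, since seeking
// back to the last good position is an expected condition.
public final class WalSeekBackLogging {
  private static final Logger LOG = LoggerFactory.getLogger(WalSeekBackLogging.class);

  static void logSeekBack(long badOffset, long lastGoodOffset, EOFException cause) {
    LOG.debug("Encountered a malformed edit, seeking back to last good position "
      + "in file, from {} to {} ({})", badOffset, lastGoodOffset, cause.getMessage());
  }
}
{code}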



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27384) Backport HBASE-27064 to branch 2.4

2022-09-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27384.
--
Fix Version/s: 2.4.15
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Backport  HBASE-27064 to branch 2.4
> ---
>
> Key: HBASE-27384
> URL: https://issues.apache.org/jira/browse/HBASE-27384
> Project: HBase
>  Issue Type: Sub-task
>  Components: Normalizer
>Affects Versions: 2.4.14
>Reporter: Aman Poonia
>Assignee: Aman Poonia
>Priority: Minor
> Fix For: 2.4.15
>
>
> {*}Error: 
> java.util.ConcurrentModificationException{*}{{{}java.util.concurrent.ExecutionException:
>  java.util.ConcurrentModificationException at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) at 
> org.apache.hadoop.hbase.master.normalizer.TestRegionNormalizerWorkQueue.testTake(TestRegionNormalizerWorkQueue.java:211)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at 
> org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:39) at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.lang.Thread.run(Thread.java:750) Caused by: 
> java.util.ConcurrentModificationException at 
> java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) 
> at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742) at 
> org.apache.hadoop.hbase.master.normalizer.RegionNormalizerWorkQueue.take(RegionNormalizerWorkQueue.java:192)
>  at 
> org.apache.hadoop.hbase.master.normalizer.TestRegionNormalizerWorkQueue.lambda$testTake$3(TestRegionNormalizerWorkQueue.java:192)
>  at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>  at 
> java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1632)
>  at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) 
> at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175){}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27112) Investigate Netty resource usage limits

2022-07-13 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-27112:
--

> Investigate Netty resource usage limits
> ---
>
> Key: HBASE-27112
> URL: https://issues.apache.org/jira/browse/HBASE-27112
> Project: HBase
>  Issue Type: Sub-task
>  Components: IPC/RPC
>Affects Versions: 2.5.0
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: Image 7-11-22 at 10.12 PM.jpg, Image 7-12-22 at 10.45 
> PM.jpg
>
>
> We leave Netty level resource limits unbounded. The number of threads to use 
> for the event loop is default 0 (unbounded). The default for 
> io.netty.eventLoop.maxPendingTasks is INT_MAX. 
> We don't do that for our own RPC handlers. We have a notion of maximum 
> handler pool size, with a default of 30, typically raised in production by 
> the user. We constrain the depth of the request queue in multiple ways: 
> limits on the number of queued calls, limits on the total size of call data 
> that can be queued (to avoid memory usage overrun), CoDel conditioning of the 
> call queues if it is enabled, and so on.
> Under load, can we pile up an excess of pending request state, such as direct 
> buffers containing request bytes, at the netty layer because of downstream 
> resource limits? Those limits will act as a bottleneck, as intended, and 
> previously would also have applied backpressure through RPC, because 
> SimpleRpcServer had thread limits ("hbase.ipc.server.read.threadpool.size", 
> default 10); Netty, in comparison, may be able to queue up a lot more, 
> because it has been optimized to prefer concurrency.
> Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
> (unbounded). I don't know what it can actually get up to in production, 
> because we lack the metric, but there are diminishing returns once threads 
> exceed cores, so a reasonable default here could be 
> Runtime.getRuntime().availableProcessors() instead of unbounded.
> maxPendingTasks probably should not be INT_MAX, but that may matter less.
> The tasks here are:
> - Instrument netty level resources to understand better actual resource 
> allocations under load. Investigate what we need to plug in where to gain 
> visibility. 
> - Where instrumentation designed for this issue can be implemented as low 
> overhead metrics, consider formally adding them as a metric. 
> - Based on the findings from this instrumentation, consider and implement 
> next steps. The goal would be to limit concurrency at the Netty layer in such 
> a way that performance is still good, and under load we don't balloon 
> resource usage at the Netty layer.
> If the instrumentation and experimental results indicate no changes are 
> necessary, we can close this as Not A Problem or WontFix. 
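> 
> As a hedged sketch of one possible outcome, the event loop threads could be 
> bounded to the core count via the property named above. This is illustrative 
> only: the property name is taken from this description and the chosen value 
> is an assumption, not a vetted recommendation.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> 
> public final class NettyEventLoopBoundExample {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     // Bound the Netty RPC server event loop threads to the number of cores
>     // instead of the unbounded default of 0; treat this value as an assumption.
>     conf.setInt("hbase.netty.eventloop.rpcserver.thread.count",
>       Runtime.getRuntime().availableProcessors());
>   }
> }
> {code}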



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27183) Support regionserver to connect to HMaster proxy port

2022-07-09 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27183.
--
Fix Version/s: (was: 3.0.0-alpha-4)
   Resolution: Won't Fix

The requirement of having a different bind port vs proxy port is being 
reviewed again. Will reopen this if required, once we get more clarity on the 
networking requirements.

> Support regionserver to connect to HMaster proxy port
> -
>
> Key: HBASE-27183
> URL: https://issues.apache.org/jira/browse/HBASE-27183
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> Regionservers get the active master address from the Zookeeper/Master 
> registry and try to make RPC calls to the master.
> For security reasons, regionservers might need to connect to a different 
> proxy port of the master rather than its original port retrieved from 
> Zookeeper.
> Configs:
>  # hbase.master.expose.proxy.port: Master can use this config (int) to expose 
> new proxy port on active and backup master znodes.
>  # hbase.client.consume.master.proxy.port: Clients/Regionservers can use this 
> config (boolean) to determine whether to connect to active master on new 
> proxy port that master has exposed or continue using original port of master 
> for connection.
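> 
> For reference, a hedged sketch of how the two proposed properties above would 
> have been wired together. Since this issue ended up Won't Fix, these keys are 
> part of the proposal only and should not be expected in shipped releases; the 
> port value is hypothetical.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> 
> public final class MasterProxyPortProposalExample {
>   public static void main(String[] args) {
>     // Proposal only (issue resolved as Won't Fix): the master would publish an
>     // extra proxy port on its znodes, and clients/regionservers would opt in
>     // to connecting through it.
>     Configuration masterConf = HBaseConfiguration.create();
>     masterConf.setInt("hbase.master.expose.proxy.port", 16100);
> 
>     Configuration clientConf = HBaseConfiguration.create();
>     clientConf.setBoolean("hbase.client.consume.master.proxy.port", true);
>   }
> }
> {code}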



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27183) Support regionserver to connect to HMaster proxy port

2022-07-07 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27183:


 Summary: Support regionserver to connect to HMaster proxy port
 Key: HBASE-27183
 URL: https://issues.apache.org/jira/browse/HBASE-27183
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14


Regionservers get the active master address from the Zookeeper/Master registry 
and try to make RPC calls to the master.

For security reasons, regionservers might need to connect to a different proxy 
port of the master rather than its original port retrieved from Zookeeper. We 
should support this case by introducing a new config.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-18045) Add ' -o ConnectTimeout=10' to the ssh command we use in ITBLL chaos monkeys

2022-07-07 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-18045.
--
Fix Version/s: 3.0.0-alpha-4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add ' -o ConnectTimeout=10' to the ssh command we use in ITBLL chaos monkeys
> 
>
> Key: HBASE-18045
> URL: https://issues.apache.org/jira/browse/HBASE-18045
> Project: HBase
>  Issue Type: Improvement
>  Components: integration tests
>Reporter: Michael Stack
>Assignee: Narasimha Sharma
>Priority: Trivial
> Fix For: 3.0.0-alpha-4
>
>
> Monkeys hang on me in long running tests. I've not spent too much time on it 
> since it is rare enough, but I just went through a spate of them. When a 
> monkey-kill ssh hangs, all killing stops, which can give a false sense of 
> victory when you wake up in the morning and your job 'passed'. I also see 
> monkeys kill all servers in a cluster and fail to bring them back, which 
> causes the job to fail as no one is serving data. The latter may actually be 
> another issue, but for the former I've had some success adding -o 
> ConnectTimeout=10 as an option on ssh. You can do it easily enough via config 
> but this issue is to suggest that we add it in code.
> Here is how you add it via config if interested:
> <property>
>   <name>hbase.it.clustermanager.ssh.opts</name>
>   <value>-o ConnectTimeout=10</value>
> </property>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27175) Failure to cleanup WAL split dir log should be at INFO level

2022-07-06 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27175.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

> Failure to cleanup WAL split dir log should be at INFO level
> 
>
> Key: HBASE-27175
> URL: https://issues.apache.org/jira/browse/HBASE-27175
> Project: HBase
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Ujjawal Kumar
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> As part of the SCP, after we are done splitting WALs, we try removing the 
> -splitting dirs, but if the dir doesn't exist, dfs#delete fails with an IOE. 
> Since we are aware of this case, we just log the message and don't 
> interrupt the SCP, because the failure is handled gracefully.
> Hence, failure to remove the "-splitting" dir should not be logged at WARN 
> level; let's convert this to INFO level:
> {code:java}
> LOG.warn("Remove WAL directory for {} failed, ignore...{}", serverName, 
> e.getMessage()); {code}
>  
> Any other genuine failure to remove the splitting dir is covered by the log 
> below anyway, so we don't have to worry about keeping the above-mentioned log 
> at WARN level; we already say "ignore..." in the log message.
> {code:java}
> if (!fs.delete(splitDir, false)) {
>   LOG.warn("Failed delete {}, contains {}", splitDir, fs.listFiles(splitDir, 
> true));
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27175) Failure to cleanup WAL split dir log should be at INFO level

2022-07-02 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27175:


 Summary: Failure to cleanup WAL split dir log should be at INFO 
level
 Key: HBASE-27175
 URL: https://issues.apache.org/jira/browse/HBASE-27175
 Project: HBase
  Issue Type: Task
Reporter: Viraj Jasani
 Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14


As part of the SCP, after we are done splitting WALs, we try removing the 
-splitting dirs, but if the dir doesn't exist, dfs#delete fails with an IOE. 
Since we are aware of this case, we just log the message and don't interrupt 
the SCP, because the failure is handled gracefully.

Hence, failure to remove the "-splitting" dir should not be logged at WARN 
level; let's convert this to INFO level:
{code:java}
LOG.warn("Remove WAL directory for {} failed, ignore...{}", serverName, 
e.getMessage()); {code}
 

Any other genuine failure to remove the splitting dir is covered by the log 
below anyway, so we don't have to worry about keeping the above-mentioned log 
at WARN level; we already say "ignore..." in the log message.
{code:java}
if (!fs.delete(splitDir, false)) {
  LOG.warn("Failed delete {}, contains {}", splitDir, fs.listFiles(splitDir, 
true));
} {code}
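
A minimal sketch of the intended change, assuming the call site stays the same 
and only the level drops to INFO; the class and method names are illustrative 
only.
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.ServerName;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch: the same message as the WARN above, emitted at INFO instead.
public final class SplitDirCleanupLogging {
  private static final Logger LOG = LoggerFactory.getLogger(SplitDirCleanupLogging.class);

  static void logIgnoredCleanupFailure(ServerName serverName, IOException e) {
    LOG.info("Remove WAL directory for {} failed, ignore...{}", serverName, e.getMessage());
  }
}
{code}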



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27150) TestMultiRespectsLimits consistently failing

2022-06-23 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27150.
--
Fix Version/s: (was: 2.5.0)
   (was: 2.4.13)
   (was: 3.0.0-alpha-4)
   Resolution: Duplicate

> TestMultiRespectsLimits consistently failing
> 
>
> Key: HBASE-27150
> URL: https://issues.apache.org/jira/browse/HBASE-27150
> Project: HBase
>  Issue Type: Test
>Affects Versions: 2.4.12
>Reporter: Viraj Jasani
>Priority: Major
>
> TestMultiRespectsLimits#testBlockMultiLimits is consistently failing:
> {code:java}
> Error Messageexceptions (0) should be greater than 
> 0Stacktracejava.lang.AssertionError: exceptions (0) should be greater than 0
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at 
> org.apache.hadoop.hbase.test.MetricsAssertHelperImpl.assertCounterGt(MetricsAssertHelperImpl.java:191)
>   at 
> org.apache.hadoop.hbase.client.TestMultiRespectsLimits.testBlockMultiLimits(TestMultiRespectsLimits.java:185)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  {code}
> Reports:
> [https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3377/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]
> [https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3378/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]
> [https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3376/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-27150) TestMultiRespectsLimits consistently failing

2022-06-22 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27150:


 Summary: TestMultiRespectsLimits consistently failing
 Key: HBASE-27150
 URL: https://issues.apache.org/jira/browse/HBASE-27150
 Project: HBase
  Issue Type: Test
Affects Versions: 2.4.12
Reporter: Viraj Jasani
 Fix For: 2.5.0, 2.4.13, 3.0.0-alpha-4


TestMultiRespectsLimits#testBlockMultiLimits is consistently failing:
{code:java}
Error Messageexceptions (0) should be greater than 
0Stacktracejava.lang.AssertionError: exceptions (0) should be greater than 0
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at 
org.apache.hadoop.hbase.test.MetricsAssertHelperImpl.assertCounterGt(MetricsAssertHelperImpl.java:191)
at 
org.apache.hadoop.hbase.client.TestMultiRespectsLimits.testBlockMultiLimits(TestMultiRespectsLimits.java:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
 {code}
Reports:

[https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3377/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]

[https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3378/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]

[https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/branch-2.4/3376/testReport/junit/org.apache.hadoop.hbase.client/TestMultiRespectsLimits/testBlockMultiLimits/]

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27098) Fix link for field comments

2022-06-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27098.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-4
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix link for field comments
> ---
>
> Key: HBASE-27098
> URL: https://issues.apache.org/jira/browse/HBASE-27098
> Project: HBase
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-4
>
>
> Fix link for field `REJECT_BATCH_ROWS_OVER_THRESHOLD` comments.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27117) Update the method comments for RegionServerAccounting

2022-06-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27117.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Update the method comments for RegionServerAccounting
> -
>
> Key: HBASE-27117
> URL: https://issues.apache.org/jira/browse/HBASE-27117
> Project: HBase
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>
> After HBASE-15787, the return value type of 
> RegionServerAccounting#isAboveHighWaterMark and 
> RegionServerAccounting#isAboveLowWaterMark are no longer boolean.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HBASE-27108) Revert HBASE-25709

2022-06-11 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-27108:


 Summary: Revert HBASE-25709
 Key: HBASE-27108
 URL: https://issues.apache.org/jira/browse/HBASE-27108
 Project: HBase
  Issue Type: Task
Affects Versions: 2.4.11
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13


HBASE-25709 has caused regression for large rows scan results and since the 
change has already been released to 2.4.11, creating this Jira to track it's 
revert.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27092) Regionserver table on Master UI is broken

2022-06-07 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27092.
--
Fix Version/s: 2.6.0
   3.0.0-alpha-3
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Regionserver table on Master UI is broken
> -
>
> Key: HBASE-27092
> URL: https://issues.apache.org/jira/browse/HBASE-27092
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 3.0.0-alpha-2
>Reporter: Nick Dimiduk
>Assignee: Tao Li
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-3
>
> Attachments: 27092.jpg, image-2022-06-06-23-31-13-425.png
>
>
> Playing around with pseudo-distributed mode, I see that we've broken the 
> region servers table.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26985) SecureBulkLoadManager will set wrong permission if umask too strict

2022-06-02 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26985.
--
Fix Version/s: 2.6.0
   3.0.0-alpha-3
   2.4.13
   2.5.1
 Hadoop Flags: Reviewed
 Assignee: Zhang Dongsheng
   Resolution: Fixed

> SecureBulkLoadManager will set wrong permission if umask too strict
> ---
>
> Key: HBASE-26985
> URL: https://issues.apache.org/jira/browse/HBASE-26985
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.4.11
>Reporter: Zhang Dongsheng
>Assignee: Zhang Dongsheng
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-3, 2.4.13, 2.5.1
>
> Attachments: HBASE-26985.patch
>
>
> SecureBulkLoadManager will create baseStagingDir if it does not exist. The 
> start method uses fs.mkdirs(baseStagingDir, PERM_HIDDEN); to create the 
> directory with permission -rwx--x--x. BUT if the umask is too strict, such as 
> 077, the directory will be created with 0700, which is too strict for GROUP 
> and OTHER users to have execute permission.
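> 
> A hedged sketch of the fix direction: since mkdirs() applies the process umask 
> to the requested permission, follow it with an explicit setPermission() so the 
> staging dir ends up 0711 even under a strict umask such as 077. The class and 
> method names below are illustrative, not the actual SecureBulkLoadManager code.
> {code:java}
> import java.io.IOException;
> 
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.permission.FsPermission;
> 
> public final class StagingDirPermissionExample {
>   // -rwx--x--x
>   private static final FsPermission PERM_HIDDEN = new FsPermission((short) 0711);
> 
>   static void ensureStagingDir(FileSystem fs, Path baseStagingDir) throws IOException {
>     if (!fs.exists(baseStagingDir)) {
>       fs.mkdirs(baseStagingDir, PERM_HIDDEN);
>     }
>     // mkdirs() is subject to the umask, so chmod explicitly to guarantee 0711.
>     fs.setPermission(baseStagingDir, PERM_HIDDEN);
>   }
> }
> {code}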



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27018) Add a tool command list_liveservers

2022-05-19 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27018.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add a tool command list_liveservers
> ---
>
> Key: HBASE-27018
> URL: https://issues.apache.org/jira/browse/HBASE-27018
> Project: HBase
>  Issue Type: New Feature
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
> Attachments: image-2022-05-10-08-34-33-711.png
>
>
> To make it easier for us to query the live region servers, we can add a 
> command `list_liveservers`. There are already `list_deadServers` and 
> `list_Decommissioned_regionServers`.
> !image-2022-05-10-08-34-33-711.png|width=457,height=123!
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-25465) Use javac --release option for supporting cross version compilation

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-25465.
--
Fix Version/s: 2.4.13
   Resolution: Fixed

> Use javac --release option for supporting cross version compilation
> ---
>
> Key: HBASE-25465
> URL: https://issues.apache.org/jira/browse/HBASE-25465
> Project: HBase
>  Issue Type: Improvement
>  Components: create-release
>Affects Versions: 3.0.0-alpha-3
>Reporter: Andrew Kyle Purtell
>Assignee: Duo Zhang
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>
> See https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/
> {quote}
>  the Java compiler’s --release parameter, which was introduced via JEP 247 
> ("Compile for Older Platform Versions"), added to the platform also in JDK 9. 
> In contrast to the more widely known pair of --source and --target, the 
> --release switch will ensure that only byte code is produced which actually 
> will be usable with the specified Java version. For this purpose, the JDK 
> contains the signature data for all supported Java versions (stored in the 
> $JAVA_HOME/lib/ct.sym file).
> {quote}
> Using one JDK (i.e. Java 11) to build Java 8-and-up and Java 11-and-up 
> compatible release artifacts would reduce some sources of accidental 
> complexity, assuming the --release parameter actually works as advertised. To 
> produce Java 8-and-up compatible artifacts, supply --release=8. To produce 
> Java 11-and-up compatible release artifacts, supply --release=11. Maven 
> activations based on JDK version and command line defined profiles can 
> control what --release parameter, if any, should be passed to the compiler. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26855) Delete unnecessary dependency on jaxb-runtime jar

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26855.
--
Fix Version/s: 2.4.13
   Resolution: Fixed

> Delete unnecessary dependency on jaxb-runtime jar
> -
>
> Key: HBASE-26855
> URL: https://issues.apache.org/jira/browse/HBASE-26855
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Nick Dimiduk
>Assignee: Nick Dimiduk
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>
> Since we've moved to using only shaded versions of our jersey stuff, we have 
> no need for this explicit dependency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26523) Upgrade hbase-thirdparty dependency to 4.0.1

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26523.
--
Fix Version/s: 2.4.13
   Resolution: Fixed

> Upgrade hbase-thirdparty dependency to 4.0.1
> 
>
> Key: HBASE-26523
> URL: https://issues.apache.org/jira/browse/HBASE-26523
> Project: HBase
>  Issue Type: Task
>  Components: thirdparty
>Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HBASE-25465) Use javac --release option for supporting cross version compilation

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-25465:
--

Reopening for 2.4 backport

> Use javac --release option for supporting cross version compilation
> ---
>
> Key: HBASE-25465
> URL: https://issues.apache.org/jira/browse/HBASE-25465
> Project: HBase
>  Issue Type: Improvement
>  Components: create-release
>Affects Versions: 3.0.0-alpha-3
>Reporter: Andrew Kyle Purtell
>Assignee: Duo Zhang
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> See https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/
> {quote}
>  the Java compiler’s --release parameter, which was introduced via JEP 247 
> ("Compile for Older Platform Versions"), added to the platform also in JDK 9. 
> In contrast to the more widely known pair of --source and --target, the 
> --release switch will ensure that only byte code is produced which actually 
> will be usable with the specified Java version. For this purpose, the JDK 
> contains the signature data for all supported Java versions (stored in the 
> $JAVA_HOME/lib/ct.sym file).
> {quote}
> Using one JDK (i.e. Java 11) to build Java 8-and-up and Java 11-and-up 
> compatible release artifacts would reduce some sources of accidental 
> complexity, assuming the --release parameter actually works as advertised. To 
> produce Java 8-and-up compatible artifacts, supply --release=8. To produce 
> Java 11-and-up compatible release artifacts, supply --release=11. Maven 
> activations based on JDK version and command line defined profiles can 
> control what --release parameter, if any, should be passed to the compiler. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HBASE-26855) Delete unnecessary dependency on jaxb-runtime jar

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-26855:
--

Reopening for 2.4 backport

> Delete unnecessary dependency on jaxb-runtime jar
> -
>
> Key: HBASE-26855
> URL: https://issues.apache.org/jira/browse/HBASE-26855
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Nick Dimiduk
>Assignee: Nick Dimiduk
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> Since we've moved to using only shaded versions of our jersey stuff, we have 
> no need for this explicit dependency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (HBASE-26523) Upgrade hbase-thirdparty dependency to 4.0.1

2022-05-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reopened HBASE-26523:
--

Reopening for 2.4 backport

> Upgrade hbase-thirdparty dependency to 4.0.1
> 
>
> Key: HBASE-26523
> URL: https://issues.apache.org/jira/browse/HBASE-26523
> Project: HBase
>  Issue Type: Task
>  Components: thirdparty
>Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27003) Optimize log format for PerformanceEvaluation

2022-05-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27003.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Optimize log format for PerformanceEvaluation
> -
>
> Key: HBASE-27003
> URL: https://issues.apache.org/jira/browse/HBASE-27003
> Project: HBase
>  Issue Type: Improvement
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
> Attachments: image-2022-05-06-18-13-50-763.png, 
> image-2022-05-06-18-14-28-578.png, image-2022-05-06-18-15-13-913.png
>
>
> The logs in PerformanceEvaluation look a little confusing to new users; we 
> should optimize the format.
> Before:
> !image-2022-05-06-18-13-50-763.png|width=787,height=156!
> After:
> !image-2022-05-06-18-15-13-913.png|width=674,height=147!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27015) Fix log format for ServerManager

2022-05-08 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27015.
--
Fix Version/s: 3.0.0-alpha-3
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix log format for ServerManager
> 
>
> Key: HBASE-27015
> URL: https://issues.apache.org/jira/browse/HBASE-27015
> Project: HBase
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
> Fix For: 3.0.0-alpha-3
>
> Attachments: ServerManagerLog.jpg
>
>
> A space is missing from the ServerManager log.
> !ServerManagerLog.jpg|width=919,height=128!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26994) MasterFileSystem create directory without permission check

2022-05-08 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26994.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Hadoop Flags: Reviewed
   Resolution: Fixed

> MasterFileSystem create directory without permission check
> --
>
> Key: HBASE-26994
> URL: https://issues.apache.org/jira/browse/HBASE-26994
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.12
>Reporter: Zhang Dongsheng
>Assignee: Zhang Dongsheng
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
> Attachments: HBASE-26994.patch
>
>
> The methods checkStagingDir and checkSubDir first check whether the directory 
> exists; if not, they create it with a special permission. If it exists, they 
> call setPermission on the directory. BUT when it does not exist and is newly 
> created, we still need to set the special permission on that directory.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-27000) Block cache stats (Misses Caching) display error in RS web UI

2022-05-06 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-27000.
--
Fix Version/s: 2.5.0
   3.0.0-alpha-3
   2.4.13
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Block cache stats (Misses Caching) display error in RS web UI
> -
>
> Key: HBASE-27000
> URL: https://issues.apache.org/jira/browse/HBASE-27000
> Project: HBase
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.13
>
> Attachments: image-2022-05-05-20-11-47-884.png
>
>
> Block cache stats (Misses Caching) display error in RS web UI.
> !image-2022-05-05-20-11-47-884.png|width=547,height=303!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HBASE-26712) Balancer encounters NPE in rare case

2022-02-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26712.
--
Fix Version/s: 2.6.0
   (was: 2.4.10)
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks for the contribution [~comnetwork].

> Balancer encounters NPE in rare case
> 
>
> Key: HBASE-26712
> URL: https://issues.apache.org/jira/browse/HBASE-26712
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.9
>Reporter: Viraj Jasani
>Assignee: chenglei
>Priority: Major
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3
>
>
>  
> {code:java}
> ERROR [ster-1:6.Chore.1] hbase.ScheduledChore - Caught error
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.balance(AssignmentManager.java:758)
>     at 
> org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:1834)
>     at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1797)
>     at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
>     at 
> org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
>     at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:153)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at 
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>  {code}
> Let's fix this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26752) Fix flappy test TestSimpleRegionNormalizerOnCluster.java

2022-02-13 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26752.
--
Fix Version/s: 1.7.2
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix flappy test TestSimpleRegionNormalizerOnCluster.java
> 
>
> Key: HBASE-26752
> URL: https://issues.apache.org/jira/browse/HBASE-26752
> Project: HBase
>  Issue Type: Bug
>  Components: Normalizer
>Affects Versions: 1.7.1
>Reporter: Aman Poonia
>Assignee: Aman Poonia
>Priority: Minor
> Fix For: 1.7.2
>
>
> TestSimpleRegionNormalizerOnCluster.java can hang after HBASE-26744.
> The assumption that the order of the HTable list is sorted is wrong, so 
> relying on that order can cause the test to hang or be inaccurate.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26712) Balancer encounters NPE in rare case

2022-01-26 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26712:


 Summary: Balancer encounters NPE in rare case
 Key: HBASE-26712
 URL: https://issues.apache.org/jira/browse/HBASE-26712
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.9
Reporter: Viraj Jasani


 
{code:java}
ERROR [ster-1:6.Chore.1] hbase.ScheduledChore - Caught error
java.lang.NullPointerException
    at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.balance(AssignmentManager.java:758)
    at 
org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:1834)
    at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1797)
    at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
    at 
org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:153)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at 
org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
 {code}
Let's fix this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26708) Netty Leak detected and eventually results in OutOfDirectMemoryError

2022-01-25 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26708:


 Summary: Netty Leak detected and eventually results in 
OutOfDirectMemoryError
 Key: HBASE-26708
 URL: https://issues.apache.org/jira/browse/HBASE-26708
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.6
Reporter: Viraj Jasani


Under constant data ingestion, using the default Netty-based RpcServer and 
RpcClient implementations results in OutOfDirectMemoryError, supposedly caused 
by leaks detected by Netty's LeakDetector.
{code:java}
2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] util.ResourceLeakDetector 
- java:115)
  
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
  
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
  
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
  
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
  
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
  
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
  
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  java.lang.Thread.run(Thread.java:748)
 {code}
{code:java}
2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] util.ResourceLeakDetector 
- 
apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
  
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
  
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
  
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
  
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
  
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
  
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
  
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  java.lang.Thread.run(Thread.java:748)
 {code}
And finally handlers are removed from the pipeline due to 
OutOfDirectMemoryError:
{code:java}
2022-01-25 17:36:28,657 WARN  
{code}

[jira] [Resolved] (HBASE-26657) ProfileServlet should move the output location to hbase specific directory

2022-01-11 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26657.
--
Fix Version/s: 2.5.0
   2.6.0
   3.0.0-alpha-3
   2.4.10
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks for the review [~weichiu] 

> ProfileServlet should move the output location to hbase specific directory
> --
>
> Key: HBASE-26657
> URL: https://issues.apache.org/jira/browse/HBASE-26657
> Project: HBase
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.10
>
>
> Since ProfileServlet is forked and used by several projects, we should allow 
> the HBase-specific profile servlet to use an hbase-specific profiler output 
> location rather than the common location: "${java.io.tmpdir}/prof-output".



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26657) ProfileServlet should move the output location to hbase specific directory

2022-01-10 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26657:


 Summary: ProfileServlet should move the output location to hbase 
specific directory
 Key: HBASE-26657
 URL: https://issues.apache.org/jira/browse/HBASE-26657
 Project: HBase
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Since ProfileServlet is forked and used by several projects, we should allow 
the HBase-specific profile servlet to use an hbase-specific profiler output 
location rather than the common location: "${java.io.tmpdir}/prof-output".
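
A minimal sketch of the proposed change, assuming ProfileServlet keeps its output directory in a static constant (the directory name below is illustrative, not necessarily the committed value):
{code:java}
// Illustrative only: keep the profiler output under an hbase-specific
// subdirectory so it does not collide with forks of ProfileServlet that
// other projects run on the same host.
private static final String OUTPUT_DIR =
    System.getProperty("java.io.tmpdir") + File.separator + "prof-output-hbase";
{code}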



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26596) region_mover should gracefully ignore null response from RSGroupAdmin#getRSGroupOfServer

2021-12-17 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26596:


 Summary: region_mover should gracefully ignore null response from 
RSGroupAdmin#getRSGroupOfServer
 Key: HBASE-26596
 URL: https://issues.apache.org/jira/browse/HBASE-26596
 Project: HBase
  Issue Type: Bug
  Components: mover, rsgroup
Affects Versions: 1.7.1
Reporter: Viraj Jasani


If the regionserver has any non-daemon thread running even after its own 
shutdown, that thread can prevent a clean JVM exit and the regionserver can be 
stuck in a zombie state. We recently provided a workaround for this in 
HBASE-26468: the regionserver exit hook waits 30s for all non-daemon threads to 
stop before terminating the JVM abnormally.

However, if the regionserver is stuck in such a state, region_mover unload fails with:
{code:java}
NoMethodError: undefined method `getName` for nil:NilClass
  getSameRSGroupServers at /bin/region_mover.rb:503
 __ensure__ at /bin/region_mover.rb:313 
  unloadRegions at /bin/region_mover.rb:310   
 (root) at /bin/region_mover.rb:572   
 {code}
This happens if the cluster has RSGroup enabled and the given server is already 
stopped, hence RSGroupAdmin#getRSGroupOfServer returns null (the server is no 
longer running, so it is not part of any RSGroup). region_mover should tolerate 
this null response and exit gracefully from the unloadRegions() call.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26459) HMaster should move non-meta region only if meta is ONLINE

2021-12-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26459.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

Thanks for the contribution [~xytss123].

> HMaster should move non-meta region only if meta is ONLINE
> --
>
> Key: HBASE-26459
> URL: https://issues.apache.org/jira/browse/HBASE-26459
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.7.1
>Reporter: Viraj Jasani
>Assignee: Yutong Xiao
>Priority: Major
> Fix For: 1.7.2
>
>
> Any non-meta region movement depends on meta's availability, hence it is 
> important to wait for meta to be assigned and available for scan before 
> attempting to move a non-meta region.
> This is already handled well by SCP (ServerCrashProcedure) on HBase 1.x and 
> 2.x versions. However, for 1.x versions, the HMaster#move API doesn't check 
> for meta being available before attempting to move a non-meta region.
> On the other hand, 2.x versions already have TransitRegionStateProcedure 
> (TRSP) in place, which uses the lock _LockState.LOCK_EVENT_WAIT_ in case meta 
> is not yet assigned and loaded in AssignmentManager's memory:
> {code:java}
> @Override
> protected boolean waitInitialized(MasterProcedureEnv env) {
>   if (TableName.isMetaTableName(getTableName())) {
> return false;
>   }
>   // First we need meta to be loaded, and second, if meta is not online then 
> we will likely to
>   // fail when updating meta so we wait until it is assigned.
>   AssignmentManager am = env.getAssignmentManager();
>   return am.waitMetaLoaded(this) || am.waitMetaAssigned(this, getRegion());
> }
>  {code}
> For 1.x versions, it is recommended to introduce a configurable wait time in 
> the master's region move API so that non-meta region movement waits until the 
> meta region is available. If meta remains in transition after the wait time 
> elapses, we should fail fast and avoid the non-meta region move.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26468) Region Server doesn't exit cleanly incase it crashes.

2021-11-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26468.
--
Fix Version/s: (was: 2.3.8)
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks for this nice contribution [~shahrs87] and thanks for the reviews 
[~zhangduo] [~gjacoby].

> Region Server doesn't exit cleanly incase it crashes.
> -
>
> Key: HBASE-26468
> URL: https://issues.apache.org/jira/browse/HBASE-26468
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.6.0
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2, 1.7.2, 2.4.9
>
>
> Observed this in our production cluster running the 1.6 version.
> The RS crashed for some reason but the process was still running. On debugging 
> further, we found one non-daemon thread still running that was not allowing 
> the RS to exit cleanly. Our clusters are managed by Ambari and have 
> auto-restart capability. But since the process was running and the pid file 
> was present, Ambari couldn't do much either. There will always be some bug 
> where we miss stopping some non-daemon thread. The shutdown hook will not be 
> called unless one of the following two conditions is met (the JVM shuts down 
> in response to two kinds of events):
> # The program exits normally, when the last non-daemon thread exits or when 
> the exit (equivalently, System.exit) method is invoked, or
> # The virtual machine is terminated in response to a user interrupt, such as 
> typing ^C, or a system-wide event, such as user logoff or system shutdown.
> Consider the first condition: the last non-daemon thread exits or the exit 
> method is invoked.
> Below is the code snippet from 
> [HRegionServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServerCommandLine.java#L51]
> {code:java}
>   private int start() throws Exception {
> try {
>   if (LocalHBaseCluster.isLocal(conf)) {
>  // Ignore this.
>   } else {
> HRegionServer hrs = 
> HRegionServer.constructRegionServer(regionServerClass, conf);
> hrs.start();
> hrs.join();
> if (hrs.isAborted()) {
>   throw new RuntimeException("HRegionServer Aborted");
> }
>   }
> } catch (Throwable t) {
>   LOG.error("Region server exiting", t);
>   return 1;
> }
> return 0;
>   }
> {code}
> Within HRegionServer, there is a subtle difference between when a server is 
> aborted v/s when it is stopped. If it is stopped, then isAborted will return 
> false and it will exit with return code 0.
> Below is the code from 
> [ServerCommandLine.java|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ServerCommandLine.java#L147]
> {code:java}
>   public void doMain(String args[]) {
> try {
>   int ret = ToolRunner.run(HBaseConfiguration.create(), this, args);
>   if (ret != 0) {
> System.exit(ret);
>   }
> } catch (Exception e) {
>   LOG.error("Failed to run", e);
>   System.exit(-1);
> }
>   }
> {code}
> If the return code is 0, then it won't call System.exit. This means the JVM 
> will wait to run the shutdown hook until all non-daemon threads have stopped, 
> which means an infinite wait if we don't stop all non-daemon threads cleanly.
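
A minimal sketch of the kind of bounded-wait guard discussed here; the helper nonDaemonThreadsRunning() and the 30s grace period are assumptions for illustration, not the committed patch:
{code:java}
// Illustrative only: after the region server thread finishes, give lingering
// non-daemon threads a bounded grace period and then force the JVM down so
// shutdown cannot be held hostage by a leaked thread.
hrs.join();
long deadline = System.currentTimeMillis() + 30_000L;
while (nonDaemonThreadsRunning() && System.currentTimeMillis() < deadline) {
  Thread.sleep(1_000L);
}
if (nonDaemonThreadsRunning()) {
  LOG.error("Non-daemon threads still alive after the grace period, halting JVM");
  Runtime.getRuntime().halt(1);
}
{code}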



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26466) Immutable timeseries usecase - Create new region rather than split existing one

2021-11-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26466:


 Summary: Immutable timeseries usecase - Create new region rather 
than split existing one
 Key: HBASE-26466
 URL: https://issues.apache.org/jira/browse/HBASE-26466
 Project: HBase
  Issue Type: Brainstorming
Reporter: Viraj Jasani


For the immutable data insertion usecase (specifically time-series data), the 
region split mechanism doesn't seem to provide good availability when the 
ingestion rate is very high. When we ingest a lot of data, the region split 
policy tries to split the given hot region based on size (either the size of 
all stores combined, or the size of any single store exceeding the configured 
max file size) if we consider the default {_}SteppingSplitPolicy{_}. The latest 
hot regions tend to receive all the latest inserts. When the region is split, 
the first half of the region (say daughterA) stays on the same server, whereas 
the second half (daughterB), which is likely to become another hot region 
because all new updates go to the second half in sequential write fashion, is 
moved out to other servers in the cluster. Hence, once the new daughter region 
is created, client traffic is redirected to another server. Client requests 
pile up while the region split is in progress until the new daughters come 
alive, and once done, the client has to query meta for the updated daughter 
region and redirect traffic to the new server.

If we could have a configurable region creation strategy that 1) keeps splits 
disabled for the given table, and 2) creates a new region dynamically with a 
lexicographically higher start key on the same server and updates its own 
region boundary, the client would have to look up meta only once and could 
continue ingestion without any degraded SLA caused by region split transitions. 
A rough sketch of the first half of this idea follows.
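
A minimal sketch of the first half of the idea, keeping splits disabled for the table via the existing DisabledRegionSplitPolicy (the table and column family names are just examples):
{code:java}
// Illustrative only: disable automatic splits for a time-series table so a
// "create the next region" strategy could take over region creation instead.
TableDescriptor td = TableDescriptorBuilder
    .newBuilder(TableName.valueOf("timeseries_events"))
    .setRegionSplitPolicyClassName(DisabledRegionSplitPolicy.class.getName())
    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
    .build();
admin.createTable(td);
{code}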

Note: a region split might also encounter complications, requiring the 
procedure to be rolled back from some step or to continue with internal 
retries, further delaying ingestion from clients.

 

There are some complications around updating a live region's start and end 
keys, as this key range is immutable. We could brainstorm ideas around making 
them optionally mutable and around the issues that would create. For instance, 
a client might continue writing data to the region with the updated end key, 
but those writes would fail, and the client would then look up meta for the 
updated key-space range of the table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26459) HMaster should move non-meta region only if meta is ONLINE

2021-11-17 Thread Viraj Jasani (Jira)
Viraj Jasani created HBASE-26459:


 Summary: HMaster should move non-meta region only if meta is ONLINE
 Key: HBASE-26459
 URL: https://issues.apache.org/jira/browse/HBASE-26459
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.7.1
Reporter: Viraj Jasani
 Fix For: 1.7.2


Any non-meta region movement depends on meta's availability, hence it is 
important to wait for meta to be assigned and available for scan before 
attempting to move a non-meta region.

This is already handled well by SCP (ServerCrashProcedure) on HBase 1.x and 2.x 
versions. However, for 1.x versions, the HMaster#move API doesn't check for 
meta being available before attempting to move a non-meta region.

On the other hand, 2.x versions already have TransitRegionStateProcedure (TRSP) 
in place, which uses the lock _LockState.LOCK_EVENT_WAIT_ in case meta is not 
yet assigned and loaded in AssignmentManager's memory:
{code:java}
@Override
protected boolean waitInitialized(MasterProcedureEnv env) {
  if (TableName.isMetaTableName(getTableName())) {
return false;
  }
  // First we need meta to be loaded, and second, if meta is not online then we 
will likely to
  // fail when updating meta so we wait until it is assigned.
  AssignmentManager am = env.getAssignmentManager();
  return am.waitMetaLoaded(this) || am.waitMetaAssigned(this, getRegion());
}
 {code}
For 1.x versions, it is recommended to introduce a configurable wait time in 
the master's region move API so that non-meta region movement waits until the 
meta region is available. If meta remains in transition after the wait time 
elapses, we should fail fast and avoid the non-meta region move. A rough sketch 
of such a bounded wait follows.
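
A minimal sketch of such a bounded wait for branch-1, assuming a hypothetical config key and placement inside the master's move path (this is not the committed patch):
{code:java}
// Illustrative only: the config key and the surrounding move path are assumptions.
long waitMillis = conf.getLong("hbase.master.move.wait.for.meta.millis", 60000L);
long deadline = EnvironmentEdgeManager.currentTime() + waitMillis;
while (!assignmentManager.getRegionStates()
    .isRegionOnline(HRegionInfo.FIRST_META_REGIONINFO)) {
  if (EnvironmentEdgeManager.currentTime() > deadline) {
    // Fail fast instead of moving a non-meta region while meta is in transition.
    throw new HBaseIOException("hbase:meta is not online, refusing to move the region");
  }
  Thread.sleep(100);
}
{code}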



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26433) Rollback from ZK-less to ZK-based assignment could produce inconsistent state - doubly assigned regions

2021-11-09 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-26433.
--
Hadoop Flags: Reviewed
  Resolution: Fixed

Thanks for the reviews [~apurtell] [~gjacoby] [~dmanning].

> Rollback from ZK-less to ZK-based assignment could produce inconsistent state 
> - doubly assigned regions
> ---
>
> Key: HBASE-26433
> URL: https://issues.apache.org/jira/browse/HBASE-26433
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.1
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 1.7.2
>
>
> By enabling the config {_}hbase.assignment.usezk.migrating{_}, we initiate 
> the transition of an HBase 1.x cluster from the default ZK-based region 
> assignment to ZK-less region assignment. Once the migration is enabled, any 
> subsequent region transition adds two additional CQs in meta: info:sn and 
> info:state. The workflow that adds new CQs in meta should be the only 
> workflow reading them (unless coordination among multiple workflows is 
> required), however that is not the case here. Reading info:sn and info:state 
> to rebuild user region states in the RegionStateStore data structure is a 
> hidden bug because the read is not restricted to ZK-less region assignment.
> What are the effects?
> After enabling ZK-less migration, if we revert it, info:state and info:sn 
> are not reverted. Moreover, the new active master rebuilds the region states 
> in memory and uses this info. So if all regions have consistent info:sn 
> values (i.e. consistent with info:server and info:serverstartcode), nothing 
> goes wrong, and this is likely what happens when we revert the config with a 
> rolling restart of masters. However, after this config revert, if any region 
> moves, only info:server and info:serverstartcode get updated while the 
> info:sn and info:state values stay the same. Because of the missing 
> condition, a subsequent active master restart would try to rebuild and assign 
> regions as per info:sn, but those regions are already OPEN on info:server, 
> hence we get doubly assigned regions.
> We need a two-part fix for this:
>  # Guard the reading of info:sn and info:state with proper conditions (a 
> rough sketch follows below).
>  # Once active master init is complete, if ZK-based region assignment is 
> enabled and the redundant CQs are present in meta (info:sn and info:state), 
> delete them all.
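
A minimal sketch of what the first part could look like, assuming hypothetical helper names and the branch-1 config keys mentioned above (this is not the committed patch):
{code:java}
// Illustrative only: consume info:sn / info:state only when ZK-less (or
// migrating) assignment is in use, otherwise rebuild the state from
// info:server and info:serverstartcode and ignore the stale columns.
boolean useZkAssignment = conf.getBoolean("hbase.assignment.usezk", true);
boolean migrating = conf.getBoolean("hbase.assignment.usezk.migrating", false);
RegionState state;
if (!useZkAssignment || migrating) {
  state = buildStateFromSnAndStateColumns(result);  // hypothetical helper
} else {
  state = buildStateFromServerColumns(result);      // hypothetical helper
}
{code}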



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

