[jira] [Created] (HBASE-27493) Allow namespace admins to clone snapshots created by them
Szabolcs Bukros created HBASE-27493:
---------------------------------------

             Summary: Allow namespace admins to clone snapshots created by them
                 Key: HBASE-27493
                 URL: https://issues.apache.org/jira/browse/HBASE-27493
             Project: HBase
          Issue Type: Improvement
          Components: snapshots
    Affects Versions: 2.5.1, 3.0.0-alpha-3
            Reporter: Szabolcs Bukros
            Assignee: Szabolcs Bukros

Creating a snapshot requires table admin permissions, but cloning it requires global admin permissions unless the user owns the snapshot and wants to recreate the original table the snapshot was based on, using the same table name. This puts unnecessary load on the few people who have global admin permissions. I would like to relax this rule a bit and allow the owner of a snapshot to clone it into any namespace where they have admin permissions, regardless of the table name used.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (HBASE-27494) Client meta cache clear by exception metrics are missing some cases
Bryan Beaudreault created HBASE-27494:
--------------------------------------

             Summary: Client meta cache clear by exception metrics are missing some cases
                 Key: HBASE-27494
                 URL: https://issues.apache.org/jira/browse/HBASE-27494
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault
            Assignee: Briana Augenreich

MetricsConnection has metrics for meta cache clears by server and region, and also metrics by the exception type that triggered the clear. The metric by exception type is currently missing at least one instance (in AsyncRequestFutureImpl) where the cache is cleared due to an exception. We should audit the cache clear calls and ensure all the appropriate ones are tracked by exception as well.
[jira] [Created] (HBASE-27495) Improve HFileLinkCleaner to validate back reference links ahead the next traverse
Tak-Lon (Stephen) Wu created HBASE-27495:
-----------------------------------------

             Summary: Improve HFileLinkCleaner to validate back reference links ahead the next traverse
                 Key: HBASE-27495
                 URL: https://issues.apache.org/jira/browse/HBASE-27495
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.6.0, 3.0.0-alpha-4, 2.5.2
            Reporter: Tak-Lon (Stephen) Wu
            Assignee: Tak-Lon (Stephen) Wu

We found a race in the CleanerChore related to back reference links. When the HFileLinkCleaner runs for a file, it can make two decisions depending on the file type:
- HFiles: the cleaner only checks whether the .links-<> directory is present and contains files.
- Back reference links: the cleaner checks whether the forward link is still present in the data directory.

The order in which the cleaner checks these two file types matters. When the back reference is checked first, the cleaner can remove both the reference and the HFile from the archive; however, when it first runs for the HFile, only the back reference is removed. In that case the HFile is only deleted in the next iteration of the CleanerChore, which can be very slow if the list of files is huge, for example when using an object store.
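The order dependence described above can be sketched with a toy model. The file names and the pass() helper below are hypothetical stand-ins, not HFileLinkCleaner's real API; only the two decision rules mirror the report: an HFile is deletable only once its back reference directory is empty, and a back reference is deletable once the forward link is gone.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the CleanerChore ordering issue: depending on which file type
// the cleaner visits first, cleanup takes one pass or two.
public class CleanerOrderSketch {

  // Archive contents: one HFile and the back reference link that points at it.
  static final String HFILE = "archive/hfile-1";
  static final String BACK_REF = "archive/.links-hfile-1/backref";

  /** Runs one cleaner pass over the archive in the given order; returns files removed. */
  static int pass(Set<String> archive, boolean forwardLinkExists, String... order) {
    int removed = 0;
    for (String file : order) {
      if (!archive.contains(file)) continue;
      if (file.equals(BACK_REF)) {
        // A back reference is deletable once the forward link is gone.
        if (!forwardLinkExists) { archive.remove(file); removed++; }
      } else {
        // An HFile is deletable only when no back reference remains for it.
        if (!archive.contains(BACK_REF)) { archive.remove(file); removed++; }
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    // Back reference checked first: both files go in a single pass.
    Set<String> a = new HashSet<>(Set.of(HFILE, BACK_REF));
    int first = pass(a, false, BACK_REF, HFILE);

    // HFile checked first: only the back reference goes; the HFile must wait
    // for the next CleanerChore iteration.
    Set<String> b = new HashSet<>(Set.of(HFILE, BACK_REF));
    int second = pass(b, false, HFILE, BACK_REF);

    System.out.println(first + " " + second); // prints "2 1"
  }
}
```

Validating the back reference ahead of the HFile check, as the summary proposes, corresponds to the single-pass ordering.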
[jira] [Created] (HBASE-27496) Limit size of plans produced by SimpleRegionNormalizer
Charles Connell created HBASE-27496:
---------------------------------------

             Summary: Limit size of plans produced by SimpleRegionNormalizer
                 Key: HBASE-27496
                 URL: https://issues.apache.org/jira/browse/HBASE-27496
             Project: HBase
          Issue Type: Improvement
          Components: Normalizer
            Reporter: Charles Connell

My company (HubSpot) is starting to use {{SimpleRegionNormalizer}}. We turn the normalizer switch on for 30 minutes each day, when our database traffic is at a low point. We're using the {{hbase.normalizer.throughput.max_bytes_per_sec}} setting to create a rate limit. I've found that while {{SimpleRegionNormalizer}} only produces new plans for 30 minutes each day, the plans often take many hours to execute. This leads to region splits, merges, and moves occurring in our HBase clusters during hours we'd prefer them not to.

I propose two new settings:
* {{hbase.normalizer.merge.plans_size_limit.mb}}
* {{hbase.normalizer.split.plans_size_limit.mb}}

These will allow HBase administrators to limit the total size of the plans produced by a run of {{SimpleRegionNormalizer}}, and thereby approximately bound how long it takes to execute the plans. Because the time to execute plans is primarily determined by a per-byte rate limit, I propose that the new settings also work on a byte basis. This makes it feasible to reason about how the rate limit and the size limits interact.
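As a concrete illustration of how the size limits and the rate limit would interact, the proposal could look like the hbase-site.xml fragment below. Note that the two plans_size_limit properties are only proposed in this issue and do not exist in HBase; the values are illustrative.

```xml
<!-- Existing setting: execute normalizer plans at ~20 MB/s. -->
<property>
  <name>hbase.normalizer.throughput.max_bytes_per_sec</name>
  <value>20971520</value>
</property>

<!-- Proposed settings (hypothetical, per HBASE-27496): cap the total size of
     merge/split plans produced per run. With the 20 MB/s limit above, a run
     capped at 10240 MB of merge plans would take roughly
     10240 MB / 20 MB/s = 512 seconds (~8.5 minutes) to execute. -->
<property>
  <name>hbase.normalizer.merge.plans_size_limit.mb</name>
  <value>10240</value>
</property>
<property>
  <name>hbase.normalizer.split.plans_size_limit.mb</name>
  <value>10240</value>
</property>
```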
[jira] [Resolved] (HBASE-27399) Add config for setting a max actions per normalizer run
     [ https://issues.apache.org/jira/browse/HBASE-27399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-27399.
---------------------------------------
    Resolution: Duplicate

> Add config for setting a max actions per normalizer run
> --------------------------------------------------------
>
>                 Key: HBASE-27399
>                 URL: https://issues.apache.org/jira/browse/HBASE-27399
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> When you enable the normalizer for the first time on an existing table, the
> next time it runs it can do a massive amount of merges/splits. This can be
> painful for callers.
> Additionally, if someone were to write a large amount of Puts or Deletes it
> may cause a merge/split storm as well.
> We should add a config which allows operators to limit the amount of actions
> per run. This way the normalizer can work towards its plan gracefully over
> the course of a few hours rather than all at once.
[jira] [Created] (HBASE-27497) Add a note for RegionMerge tool.
Karthik Palanisamy created HBASE-27497:
---------------------------------------

             Summary: Add a note for RegionMerge tool.
                 Key: HBASE-27497
                 URL: https://issues.apache.org/jira/browse/HBASE-27497
             Project: HBase
          Issue Type: Bug
          Components: hbck2
            Reporter: Karthik Palanisamy

NOTE: Do not perform region merge operations on Phoenix salted tables. It will affect the region boundaries and produce incorrect query results.
[jira] [Created] (HBASE-27498) Observed lot of threads blocked in ConnectionImplementation.getKeepAliveMasterService
Vaibhav Joshi created HBASE-27498:
----------------------------------

             Summary: Observed lot of threads blocked in ConnectionImplementation.getKeepAliveMasterService
                 Key: HBASE-27498
                 URL: https://issues.apache.org/jira/browse/HBASE-27498
             Project: HBase
          Issue Type: Bug
          Components: Client
    Affects Versions: 2.5.0
            Reporter: Vaibhav Joshi
         Attachments: Screenshot 2022-11-16 at 10.06.59 AM.png

Recently we observed that a lot of threads are blocked in "ConnectionImplementation.getKeepAliveMasterService" during some initialization stages of a rolling restart workflow. During the rolling restart we make RPC calls to the Master using RpcRetryingCallerImpl, so as part of initialization each thread calls "ConnectionImplementation.getKeepAliveMasterService". Internally this method does an RPC call within a synchronized block to check whether the master is running (mss.isMasterRunning). Lots of threads are in a blocked state due to the following synchronized block:

synchronized (masterLock) {
  if (!isKeepAliveMasterConnectedAndRunning(this.masterServiceState)) {
    MasterServiceStubMaker stubMaker = new MasterServiceStubMaker();
    this.masterServiceState.stub = stubMaker.makeStub();
  }
  resetMasterServiceState(this.masterServiceState);
}

The Thread Dump Analyzer (2.4) warns: "A lot of threads are waiting for this monitor to become available again. This might indicate a congestion. You also should analyze other locks blocked by threads waiting for this monitor as there might be much more threads waiting for it." Please check the attached screenshot.

!Screenshot 2022-11-16 at 10.06.59 AM.png|width=1639,height=971!
"pool-11-thread-158" #313 prio=5 os_prio=0 tid=0x55b88bcb8800 nid=0x404e waiting for monitor entry [0x7fa48aa86000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService(ConnectionImplementation.java:1336)
        - waiting to lock <0x0005d30ecb68> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster(ConnectionImplementation.java:1327)
        at org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:57)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:103)
        at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3019)
        at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3011)
        at org.apache.hadoop.hbase.client.HBaseAdmin.move(HBaseAdmin.java:1458)
        at org.apache.hadoop.hbase.util.MoveWithoutAck.call(MoveWithoutAck.java:58)
        at org.apache.hadoop.hbase.util.MoveWithoutAck.call(MoveWithoutAck.java:33)

*Proposal:* We can optimize this flow as follows:
1. Use double-checked locking around "isKeepAliveMasterConnectedAndRunning(this.masterServiceState)" so that threads don't race for the monitor when the master is already running.
2. "isKeepAliveMasterConnectedAndRunning()" should reuse the globally cached isMasterRunning state instead of doing an expensive RPC call for each thread.

Note: The "master" branch uses "AsyncConnectionImpl", so apparently we don't have this issue there.
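Point 1 of the proposal could be sketched as below. This is a minimal standalone model, not the real ConnectionImplementation code: the stub is a plain Object stand-in, and isConnectedAndRunning() substitutes for isKeepAliveMasterConnectedAndRunning() reading cached state instead of issuing an RPC. The key shape is the double-checked lock: a lock-free fast path, then a re-check under the monitor.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of double-checked locking for the keep-alive master service.
public class KeepAliveSketch {

  static final Object masterLock = new Object();
  static volatile Object stub; // volatile so the fast path sees a fully built stub
  static final AtomicInteger stubsBuilt = new AtomicInteger();

  static boolean isConnectedAndRunning() {
    // Stand-in for isKeepAliveMasterConnectedAndRunning(): consult cached
    // state rather than making an RPC under the lock.
    return stub != null;
  }

  static Object getKeepAliveMasterService() {
    // Fast path: no monitor acquisition when the stub is already usable.
    if (isConnectedAndRunning()) {
      return stub;
    }
    synchronized (masterLock) {
      // Re-check under the lock: another thread may have built the stub
      // while we were waiting for the monitor.
      if (!isConnectedAndRunning()) {
        stubsBuilt.incrementAndGet();
        stub = new Object(); // stand-in for MasterServiceStubMaker.makeStub()
      }
      return stub;
    }
  }

  public static void main(String[] args) throws Exception {
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(KeepAliveSketch::getKeepAliveMasterService);
      threads[i].start();
    }
    for (Thread t : threads) t.join();
    // The stub is built exactly once; subsequent callers take the
    // lock-free fast path and never block on masterLock.
    System.out.println(stubsBuilt.get()); // prints "1"
  }
}
```

Once the stub is built, callers return from the volatile read without ever contending on masterLock, which is exactly the congestion the thread dump shows.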