[jira] [Created] (HBASE-28399) region size can be wrong from RegionSizeCalculator
ruanhui created HBASE-28399: --- Summary: region size can be wrong from RegionSizeCalculator Key: HBASE-28399 URL: https://issues.apache.org/jira/browse/HBASE-28399 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 3.0.0-beta-1 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-beta-2

The RegionSizeCalculator calculates the region size in bytes as follows:

{code:java}
private static final long MEGABYTE = 1024L * 1024L;
long regionSizeBytes = ((long) regionLoad.getStoreFileSize().get(Size.Unit.MEGABYTE)) * MEGABYTE;
{code}

However, this calculation loses precision, because the cast to long drops the fractional part of the megabyte value. For example, the result of

{code:java}
((long) new Size(1, Size.Unit.BYTE).get(Size.Unit.MEGABYTE)) * MEGABYTE
{code}

is 0. This produces a TableInputSplit with a length of 0 even though the split actually contains a small amount of data. Such a TableInputSplit is ignored if spark.hadoopRDD.ignoreEmptySplits is enabled.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
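The truncation described in HBASE-28399 can be reproduced without HBase on the classpath. The following is a minimal sketch, assuming plain arithmetic as a stand-in for the HBase Size class. The class name RegionSizeSketch and the helper toUnit are hypothetical, and reading the size in bytes is only one possible remedy, not necessarily the one applied in the actual patch.

```java
// Stand-alone sketch: plain arithmetic stands in for the HBase Size class.
class RegionSizeSketch {
  static final long MEGABYTE = 1024L * 1024L;

  // Converts a byte count to the given unit as a double (hypothetical helper).
  static double toUnit(long bytes, long bytesPerUnit) {
    return (double) bytes / bytesPerUnit;
  }

  // The reported pattern: read megabytes, cast to long, scale back to bytes.
  static long viaTruncatedMegabytes(long bytes) {
    return ((long) toUnit(bytes, MEGABYTE)) * MEGABYTE;
  }

  // One possible alternative: read the size in bytes, so the cast drops nothing.
  static long viaByteUnit(long bytes) {
    return (long) toUnit(bytes, 1L);
  }

  public static void main(String[] args) {
    System.out.println(viaTruncatedMegabytes(1L));         // 0
    System.out.println(viaTruncatedMegabytes(1_572_864L)); // 1048576 (1.5 MB rounds down to 1 MB)
    System.out.println(viaByteUnit(1L));                   // 1
  }
}
```

Any region smaller than one megabyte is reported as empty by the first method, which is exactly the case that produces zero-length splits.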
[jira] [Created] (HBASE-28195) set start row as prefix if a scan with PrefixFilter
ruanhui created HBASE-28195: --- Summary: set start row as prefix if a scan with PrefixFilter Key: HBASE-28195 URL: https://issues.apache.org/jira/browse/HBASE-28195 Project: HBase Issue Type: Improvement Components: Filters Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-beta-1 If a scan uses a PrefixFilter, we can set the start row to the prefix. The scan then skips the rows that sort before the prefix instead of reading and filtering them out.
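The effect of HBASE-28195 can be illustrated on a sorted set of row keys. This is a minimal sketch, assuming String keys in a TreeSet as a stand-in for HBase's byte-array rows. The class PrefixScanSketch and the method rowsVisited are hypothetical and are not part of the HBase API.

```java
import java.util.Arrays;
import java.util.TreeSet;

class PrefixScanSketch {
  // Counts the rows a scan visits when it starts at startRow (inclusive) and
  // stops at the first row that sorts after every row with the prefix.
  static int rowsVisited(TreeSet<String> rows, String startRow, String prefix) {
    int visited = 0;
    for (String row : rows.tailSet(startRow, true)) {
      visited++;
      if (row.compareTo(prefix) > 0 && !row.startsWith(prefix)) {
        break; // past the prefix range, nothing further can match
      }
    }
    return visited;
  }

  public static void main(String[] args) {
    TreeSet<String> rows = new TreeSet<>(Arrays.asList("a1", "a2", "b1", "b2", "c1"));
    System.out.println(rowsVisited(rows, "", "b"));  // 5: starts at the beginning of the table
    System.out.println(rowsVisited(rows, "b", "b")); // 3: starts at the prefix
  }
}
```

The two rows that sort before the prefix are never touched once the start row is set, which is the reduction in filtered data the issue refers to.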
[jira] [Created] (HBASE-28194) New Splittable Meta
ruanhui created HBASE-28194: --- Summary: New Splittable Meta Key: HBASE-28194 URL: https://issues.apache.org/jira/browse/HBASE-28194 Project: HBase Issue Type: New Feature Components: meta, Region Assignment Reporter: ruanhui This issue tracks the attempt to land a solution for splittable meta.
[jira] [Created] (HBASE-28116) Move snapshot storage from filesystem to a separated HBase table
ruanhui created HBASE-28116: --- Summary: Move snapshot storage from filesystem to a separated HBase table Key: HBASE-28116 URL: https://issues.apache.org/jira/browse/HBASE-28116 Project: HBase Issue Type: New Feature Components: snapshots Reporter: ruanhui

As we know, rename and list are very expensive operations on object storage, and snapshots in HBase currently rely on both. For example, when taking a snapshot, we first write the snapshot description and the data manifest file to a temporary directory, then commit it with a rename. When listing all snapshots, we scan the snapshot directory to find all completed snapshots. So maybe we can introduce a new snapshot storage that uses an HBase table. Here are a few points where we may benefit:

1. It makes HBase easier to deploy on object storage such as S3.
2. It makes snapshots faster and more lightweight. In the current filesystem-based implementation, consolidating the snapshot manifest means first listing all region manifests with a thread pool, reading their content and then deleting them. When the number of regions is large, this may take a long time. By comparison, reads and writes on HBase tables are more lightweight than reads and writes on HDFS files.
3. It is likely to reduce the number of small HDFS files.
[jira] [Created] (HBASE-28080) correct span name in AbstractRpcBasedConnectionRegistry#getActiveMaster
ruanhui created HBASE-28080: --- Summary: correct span name in AbstractRpcBasedConnectionRegistry#getActiveMaster Key: HBASE-28080 URL: https://issues.apache.org/jira/browse/HBASE-28080 Project: HBase Issue Type: Bug Components: Client Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Assignee: ruanhui Fix For: 4.0.0-alpha-1

It looks like the span name does not correspond to what is actually done: the method is getActiveMaster, but the span is named ".getClusterId".

  public CompletableFuture<ServerName> getActiveMaster() {
    return tracedFuture(
      () -> this
        .<GetActiveMasterResponse> call(
          (c, s, d) -> s.getActiveMaster(c, GetActiveMasterRequest.getDefaultInstance(), d),
          GetActiveMasterResponse::hasServerName, "getActiveMaster()")
        .thenApply(resp -> ProtobufUtil.toServerName(resp.getServerName())),
      getClass().getSimpleName() + ".getClusterId");
  }
[jira] [Created] (HBASE-28015) rpc handler can get stuck on LruBlockCache
ruanhui created HBASE-28015: --- Summary: rpc handler can get stuck on LruBlockCache Key: HBASE-28015 URL: https://issues.apache.org/jira/browse/HBASE-28015 Project: HBase Issue Type: Bug Components: BlockCache Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Fix For: 4.0.0-alpha-1

We found lots of read handlers stuck in LruBlockCache#getBlock; this may be caused by a bug in the JDK 8 ConcurrentHashMap. To make the common case fast, I think we had better get the entry and check it before calling ConcurrentHashMap#computeIfAbsent.

"RpcServer.priority.RWQ.Fifo.scan.handler=190,queue=57,port=60020" #1807 daemon prio=5 os_prio=0 cpu=9703.28ms elapsed=88160.93s tid=0x7f38d338a800 nid=0x8f4 waiting for monitor entry [0x7f0af4baa000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1760)
	- waiting to lock <0x7f2fc6495fe0> (a java.util.concurrent.ConcurrentHashMap$Node)
	at org.apache.hadoop.hbase.io.hfile.LruBlockCache.getBlock(LruBlockCache.java:538)
	at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.getBlock(CombinedBlockCache.java:88)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1124)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1300)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:331)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:679)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:631)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.backwardSeek(StoreFileScanner.java:561)
	at org.apache.hadoop.hbase.regionserver.ReversedKeyValueHeap.backwardSeek(ReversedKeyValueHeap.java:117)
	at org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.backwardSeek(ReversedStoreScanner.java:134)
	at org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.seekAsDirection(ReversedStoreScanner.java:94)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekOrSkipToNextColumn(StoreScanner.java:821)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:727)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7515)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7683)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7447)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3403)
	- locked <0x7f2ff1fc8f40> (a org.apache.hadoop.hbase.regionserver.ReversedRegionScannerImpl)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3662)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45253)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:447)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:136)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

   Locked ownable synchronizers:
	- None
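The "get and check first" idea from HBASE-28015 can be sketched generically. This is a minimal illustration, assuming a plain ConcurrentHashMap cache. The class FastPathCacheSketch and its getOrLoad method are hypothetical and do not reproduce the LruBlockCache code or the actual patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class FastPathCacheSketch<K, V> {
  private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

  // Fast path: a plain get() does not take the bin lock. Only a miss falls
  // through to computeIfAbsent, which may block on the bin's monitor.
  V getOrLoad(K key, Function<? super K, ? extends V> loader) {
    V cached = map.get(key);
    if (cached != null) {
      return cached;
    }
    return map.computeIfAbsent(key, loader);
  }

  public static void main(String[] args) {
    FastPathCacheSketch<String, String> cache = new FastPathCacheSketch<>();
    AtomicInteger loads = new AtomicInteger();
    Function<String, String> loader = k -> {
      loads.incrementAndGet();
      return "block-" + k;
    };
    System.out.println(cache.getOrLoad("k1", loader)); // block-k1
    System.out.println(cache.getOrLoad("k1", loader)); // block-k1
    System.out.println(loads.get());                   // 1
  }
}
```

The second call is served by the lock-free get, so readers hitting an already cached key do not contend on the monitor shown in the thread dump.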
[jira] [Created] (HBASE-27988) NPE in AddPeerProcedure recovery
ruanhui created HBASE-27988: --- Summary: NPE in AddPeerProcedure recovery Key: HBASE-27988 URL: https://issues.apache.org/jira/browse/HBASE-27988 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 4.0.0-alpha-1 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4

AddPeerProcedure restores the syncReplicationPeerLock when it is replayed during master recovery. However, the replicationPeerManager has not been initialized yet when procedures are replayed, which causes a NullPointerException and makes the master abort.

{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  // ..
  if (peerConfig.isSyncReplication()) {
    if (!env.getReplicationPeerManager().tryAcquireSyncReplicationPeerLock()) {
      throw new IllegalStateException(
        "Can not acquire sync replication peer lock for peer " + peerId);
    }
  }
}
{code}
[jira] [Created] (HBASE-27984) NPE in MigrateReplicationQueueFromZkToTableProcedure recovery
ruanhui created HBASE-27984: --- Summary: NPE in MigrateReplicationQueueFromZkToTableProcedure recovery Key: HBASE-27984 URL: https://issues.apache.org/jira/browse/HBASE-27984 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Fix For: 4.0.0-alpha-1

MigrateReplicationQueueFromZkToTableProcedure restores the disabled state of the replication log cleaner barrier when it is replayed during master recovery:

{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  if (getCurrentState() == getInitialState()) {
    // do not need to disable log cleaner or acquire lock if we are in the initial state, later
    // when executing the procedure we will try to disable and acquire.
    return;
  }
  if (!env.getReplicationPeerManager().getReplicationLogCleanerBarrier().disable()) {
    throw new IllegalStateException("can not disable log cleaner, this should not happen");
  }
}
{code}

However, the replicationPeerManager has not been initialized yet when procedures are replayed, which causes a NullPointerException and makes the master abort. Maybe it is better to add a check after the replicationPeerManager is initialized to determine whether the replication log cleaner barrier needs to be disabled?
[jira] [Created] (HBASE-27968) add JvmPauseMonitor in hbase-client
ruanhui created HBASE-27968: --- Summary: add JvmPauseMonitor in hbase-client Key: HBASE-27968 URL: https://issues.apache.org/jira/browse/HBASE-27968 Project: HBase Issue Type: New Feature Components: Client Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-beta-1 Many of our users integrate hbase-client into frameworks such as Spring Boot, and a JvmPauseMonitor would help them find GC problems.
[jira] [Created] (HBASE-27967) introduce a ConnectionLimitHandler to limit the number of concurrent connections to the Server
ruanhui created HBASE-27967: --- Summary: introduce a ConnectionLimitHandler to limit the number of concurrent connections to the Server Key: HBASE-27967 URL: https://issues.apache.org/jira/browse/HBASE-27967 Project: HBase Issue Type: New Feature Components: IPC/RPC Affects Versions: 3.0.0-alpha-4 Reporter: ruanhui Fix For: 3.0.0-beta-1

Unreasonable retries from clients can leave the HBase server unable to accept or create new connections, as the "Too many open files" errors in the log show, and the server then hangs. We can consider introducing a ConnectionLimitHandler, similar to the one in Cassandra, in our NettyRpcServer to protect the HBase servers.

ERROR [master:store-WAL-Roller] master.HMaster: * ABORTING master hmaster,6,1679921578648: IOE in log roller *
java.net.SocketException: Call From hmaster/hmaster to namenode:9000 failed on socket exception: java.net.SocketException: Too many open files; For more details see: [http://wiki.apache.org/hadoop/SocketException]
java.io.IOException: Too many open files
	at java.base/sun.nio.ch.Net.accept(Native Method)
	at java.base/sun.nio.ch.ServerSocketChannelImpl.implAccept(ServerSocketChannelImpl.java:425)
	at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:391)
	at org.apt java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:376)
	at jdk.proxy2/jdk.proxy2.$Proxy24.getFileInfo(Unknown Source)
	at jdk.internal.reflect.GeneratedMethodAccessor139.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:376)
	at jdk.proxy2/jdk.proxy2.$Proxy24.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1753)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1617)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1614)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1629)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1713)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getNewPath(AbstractFSWAL.java:582)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:843)
	at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:268)
	at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:187)
Caused by: java.net.SocketException: Too many open files
	at java.base/sun.nio.ch.Net.socket0(Native Method)
	at java.base/sun.nio.ch.Net.socket(Net.java:524)
	at java.base/sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:146)
	at java.base/sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:129)
	at java.base/sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:77)
	at java.base/java.nio.channels.SocketChannel.open(SocketChannel.java:192)
	at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:656)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:812)
	at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
	at org.apache.hadoop.ipc.Client.call(Client.java:1452)
[jira] [Created] (HBASE-27905) Directly schedule procedures that do not need to acquire locks
ruanhui created HBASE-27905: --- Summary: Directly schedule procedures that do not need to acquire locks Key: HBASE-27905 URL: https://issues.apache.org/jira/browse/HBASE-27905 Project: HBase Issue Type: Improvement Components: proc-v2 Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 Currently, the procedure scheduler does not schedule any other procedure from a queue while a procedure holds that queue's exclusive lock, even procedures that do not require any lock. We would prefer such lock-free procedures to be executed directly, without waiting for the procedure that holds the exclusive lock to finish. Otherwise, if the procedure holding the exclusive lock gets stuck, the procedures that do not need the lock wait forever as well.
[jira] [Created] (HBASE-27885) expose metaCacheHits in MetricsConnection
ruanhui created HBASE-27885: --- Summary: expose metaCacheHits in MetricsConnection Key: HBASE-27885 URL: https://issues.apache.org/jira/browse/HBASE-27885 Project: HBase Issue Type: New Feature Components: Client Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27855) Support dynamic adjustment of flusher count
ruanhui created HBASE-27855: --- Summary: Support dynamic adjustment of flusher count Key: HBASE-27855 URL: https://issues.apache.org/jira/browse/HBASE-27855 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27844) changed type names to avoid conflicts with built-in types
ruanhui created HBASE-27844: --- Summary: changed type names to avoid conflicts with built-in types Key: HBASE-27844 URL: https://issues.apache.org/jira/browse/HBASE-27844 Project: HBase Issue Type: Improvement Components: build Affects Versions: 2.5.4 Reporter: ruanhui Assignee: ruanhui Some compilers resolve Builder to java.lang.Thread.Builder instead of the Builder in the protobuf classes, which causes a compilation failure. We should avoid conflicts with built-in class names.
[jira] [Created] (HBASE-27463) Reset sizeOfLogQueue when refresh replication source
ruanhui created HBASE-27463: --- Summary: Reset sizeOfLogQueue when refresh replication source Key: HBASE-27463 URL: https://issues.apache.org/jira/browse/HBASE-27463 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 When refreshing replication sources, we don't clear the metric. That may leave the sizeOfLogQueue metric with a wrong value.
[jira] [Created] (HBASE-27458) Use ReadWriteLock for region scanner readpoint map
ruanhui created HBASE-27458: --- Summary: Use ReadWriteLock for region scanner readpoint map Key: HBASE-27458 URL: https://issues.apache.org/jira/browse/HBASE-27458 Project: HBase Issue Type: Improvement Components: Scanners Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 Attachments: jstack-2.png

Currently we manage the concurrency between the RegionScanner and getSmallestReadPoint by synchronizing on the scannerReadPoints object. In production, we find that many read threads are blocked on this monitor under a heavy read load.

We need to get the smallest read point when we
a. flush a memstore
b. compact a memstore/storefile
c. do a delta operation like increment/append

These operations are usually far less frequent than read requests. An exclusive lock is rather expensive here, because all a region scanner needs to do is calculate its read point and put it into the scanner read point map, which is thread-safe. Multiple read threads can do this in parallel without synchronization.

Based on the above, maybe we can replace the synchronized block with a read-write lock. This will improve read performance if the bottleneck is this synchronization.

!jstack.png!
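The locking scheme proposed in HBASE-27458 can be sketched as follows. This is a minimal illustration, assuming a simplified registry. The class ReadPointRegistrySketch and its methods are hypothetical and only mirror the idea, not the HRegion implementation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadPointRegistrySketch {
  private final ConcurrentHashMap<Object, Long> scannerReadPoints = new ConcurrentHashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile long currentReadPoint;

  ReadPointRegistrySketch(long initialReadPoint) {
    this.currentReadPoint = initialReadPoint;
  }

  // Stand-in for the read point moving forward as writes complete.
  void advance(long newReadPoint) {
    currentReadPoint = newReadPoint;
  }

  // Scanners take the shared lock, so many of them can register in parallel;
  // the map itself is already thread-safe.
  long register(Object scanner) {
    lock.readLock().lock();
    try {
      long readPoint = currentReadPoint;
      scannerReadPoints.put(scanner, readPoint);
      return readPoint;
    } finally {
      lock.readLock().unlock();
    }
  }

  void unregister(Object scanner) {
    scannerReadPoints.remove(scanner);
  }

  // Flush, compaction and increment/append take the exclusive lock, so no
  // scanner can register while the minimum is being computed.
  long smallestReadPoint() {
    lock.writeLock().lock();
    try {
      long min = currentReadPoint;
      for (long readPoint : scannerReadPoints.values()) {
        min = Math.min(min, readPoint);
      }
      return min;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

The roles are inverted relative to the lock names: the frequent scanner registrations share the read lock, while the rare smallest-read-point calculation is the "writer" that needs exclusivity.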
[jira] [Created] (HBASE-27445) result of DirectMemoryUtils#getDirectMemorySize may be wrong
ruanhui created HBASE-27445: --- Summary: result of DirectMemoryUtils#getDirectMemorySize may be wrong Key: HBASE-27445 URL: https://issues.apache.org/jira/browse/HBASE-27445 Project: HBase Issue Type: Bug Components: UI Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 If a JVM option is specified more than once, the last occurrence takes effect. For example, if we set -Xms30g -Xmx30g -XX:MaxDirectMemorySize=40g -XX:MaxDirectMemorySize=50g, MaxDirectMemorySize is set to 50g.
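The "last occurrence wins" rule from HBASE-27445 can be sketched as a small argument scan. This is a minimal illustration. The class LastFlagWinsSketch and its method are hypothetical and are not the DirectMemoryUtils code. In a running JVM the argument list could be obtained from ManagementFactory.getRuntimeMXBean().getInputArguments(), while the sketch takes it as a parameter to stay testable.

```java
import java.util.Arrays;
import java.util.List;

class LastFlagWinsSketch {
  static final String FLAG = "-XX:MaxDirectMemorySize=";

  // Returns the raw value of the last -XX:MaxDirectMemorySize argument,
  // or null if the flag is not present at all.
  static String lastMaxDirectMemorySize(List<String> jvmArgs) {
    String value = null;
    for (String arg : jvmArgs) {
      if (arg.startsWith(FLAG)) {
        value = arg.substring(FLAG.length()); // keep overwriting: last one wins
      }
    }
    return value;
  }

  public static void main(String[] args) {
    List<String> jvmArgs = Arrays.asList("-Xms30g", "-Xmx30g",
      "-XX:MaxDirectMemorySize=40g", "-XX:MaxDirectMemorySize=50g");
    System.out.println(lastMaxDirectMemorySize(jvmArgs)); // 50g
  }
}
```

A parser that returns on the first match would report 40g here, which does not match the value the JVM actually uses.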
[jira] [Created] (HBASE-27355) Separate meta read requests from master and client
ruanhui created HBASE-27355: --- Summary: Separate meta read requests from master and client Key: HBASE-27355 URL: https://issues.apache.org/jira/browse/HBASE-27355 Project: HBase Issue Type: Improvement Components: IPC/RPC Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 If a single region has a large number of store files or HDFS responds slowly, region transitions can be slow, and retrying clients may put a lot of pressure on the server hosting meta. This may block the master's system read requests. Maybe we can give master requests a special priority to isolate meta read requests from the master and from clients.
[jira] [Created] (HBASE-27325) the bulkload max call queue size can be update to a wrong value
ruanhui created HBASE-27325: --- Summary: the bulkload max call queue size can be update to a wrong value Key: HBASE-27325 URL: https://issues.apache.org/jira/browse/HBASE-27325 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 The configKey can be wrong, because name.toLowerCase(Locale.ROOT).contains("bulkLoad") is always false: the lower-cased name can never contain the upper-case 'L'.
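The always-false check from HBASE-27325 is easy to reproduce in isolation. This is a minimal sketch. The class BulkLoadNameCheckSketch, its methods and the queue name in main are hypothetical, and comparing against an all-lower-case literal is only one possible fix.

```java
import java.util.Locale;

class BulkLoadNameCheckSketch {
  // The reported pattern: the lower-cased name is compared against a literal
  // that contains an upper-case 'L', so the result is always false.
  static boolean buggyCheck(String name) {
    return name.toLowerCase(Locale.ROOT).contains("bulkLoad");
  }

  // One possible fix: compare against an all-lower-case literal.
  static boolean fixedCheck(String name) {
    return name.toLowerCase(Locale.ROOT).contains("bulkload");
  }

  public static void main(String[] args) {
    String queueName = "BulkLoadQueue"; // hypothetical executor name
    System.out.println(buggyCheck(queueName)); // false
    System.out.println(fixedCheck(queueName)); // true
  }
}
```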
[jira] [Created] (HBASE-27320) hide some sensitive configuration information in the UI
ruanhui created HBASE-27320: --- Summary: hide some sensitive configuration information in the UI Key: HBASE-27320 URL: https://issues.apache.org/jira/browse/HBASE-27320 Project: HBase Issue Type: Improvement Components: UI Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 In the discussion about how to store the keystore/truststore password securely, [~bbeaudreault] mentioned, and I quote: "I agree that it seems insecure to put it directly into the hbase-site.xml. Another reason is due to the RS UI which (helpfully) can print the entire site configuration. We’d need to make sure the password is excluded from that, but better to remove it from site xml altogether". I also feel that some sensitive information is exposed in the UI. For example, if we set superuser in hbase-site.xml, non-admin users can obtain the superuser information and impersonate a superuser to perform operations on the cluster that they are not permitted to perform. So I think we should hide this sensitive information in the UI.
[jira] [Created] (HBASE-27305) add an option to skip file splitting when bulkload hfiles
ruanhui created HBASE-27305: --- Summary: add an option to skip file splitting when bulkload hfiles Key: HBASE-27305 URL: https://issues.apache.org/jira/browse/HBASE-27305 Project: HBase Issue Type: Improvement Components: tooling Affects Versions: 3.0.0-alpha-3 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-4 When bulk loading hfiles, if the key range of an hfile does not match the key range of a region, the BulkLoadHFilesTool splits the hfile so that the key range of each new file fits the key range of a region. If many files need to be split, the load on the BulkLoadHFilesTool becomes very high. Sometimes we want to avoid this and simply fail so that new hfiles can be regenerated. Here we introduce a new option: when the above situation is encountered, an exception is thrown and the calling client handles it.
[jira] [Created] (HBASE-27158) Add namespace column family to UNDELETABLE_META_COLUMNFAMILIES
ruanhui created HBASE-27158: --- Summary: Add namespace column family to UNDELETABLE_META_COLUMNFAMILIES Key: HBASE-27158 URL: https://issues.apache.org/jira/browse/HBASE-27158 Project: HBase Issue Type: Improvement Components: proc-v2 Affects Versions: 2.4.12 Reporter: ruanhui Fix For: 3.0.0-alpha-4 If we delete the namespace family from hbase:meta, the cluster can also run into problems. So I think we should also add the namespace family to the list of families that cannot be deleted.
[jira] [Created] (HBASE-27157) Potential race condition in WorkerAssigner
ruanhui created HBASE-27157: --- Summary: Potential race condition in WorkerAssigner Key: HBASE-27157 URL: https://issues.apache.org/jira/browse/HBASE-27157 Project: HBase Issue Type: Bug Components: proc-v2 Affects Versions: 2.4.12 Reporter: ruanhui Fix For: 3.0.0-alpha-2 Multiple SplitWALProcedures share the same WorkerAssigner instance, so there is a potential race condition because the suspend and wake methods are not synchronized.
[jira] [Created] (HBASE-26974) Introduce a LogRollProcedure
ruanhui created HBASE-26974: --- Summary: Introduce a LogRollProcedure Key: HBASE-26974 URL: https://issues.apache.org/jira/browse/HBASE-26974 Project: HBase Issue Type: Improvement Components: backup&restore, proc-v2 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-3

The current log rolling for all regionservers is based on ZK. Here is an attempt to reimplement it with procedure v2.

Here are some requirements for the implementation. The procedure can be introduced as a new feature. It should remain fully compatible with previous implementations. Also, it should be possible to disable this feature through the configuration. Currently we only use the log roll procedure when running a backup job, so I think all code logic should be implemented in the hbase-backup module as much as possible (I'm not sure if this is the right way to do it. If you have any suggestions, please let me know).

Here are some details about the implementation.

LogRollProcedure
The LogRollProcedure is used to roll the WAL for all the regionservers in the cluster. It acquires the shared lock of the backup system table.

RSLogRollProcedure
The RSLogRollProcedure is used to schedule an RSLogRollRemoteProcedure for each regionserver. When the subprocedure returns, the RSLogRollProcedure checks the log rolling result in the backup system table. If it failed, the RSLogRollProcedure schedules a new RSLogRollRemoteProcedure to retry.

RSLogRollRemoteProcedure
The RSLogRollRemoteProcedure is used to send the log roll request to the remote server.

This is only the first version of the implementation; any suggestions and feedback are appreciated.
[jira] [Created] (HBASE-26961) cache region locations when getAllRegionLocations() for branch-2.2+
ruanhui created HBASE-26961: --- Summary: cache region locations when getAllRegionLocations() for branch-2.2+ Key: HBASE-26961 URL: https://issues.apache.org/jira/browse/HBASE-26961 Project: HBase Issue Type: Improvement Components: Client Affects Versions: 2.4.11, 2.3.7, 2.2.7 Reporter: ruanhui Assignee: ruanhui Fix For: 2.4.12 backport HBASE-26942 for branch-2.2, branch-2.3 and branch-2.4 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HBASE-26942) cache region locations when getAllRegionLocations()
ruanhui created HBASE-26942: --- Summary: cache region locations when getAllRegionLocations() Key: HBASE-26942 URL: https://issues.apache.org/jira/browse/HBASE-26942 Project: HBase Issue Type: Improvement Components: Client Affects Versions: 2.4.11, 3.0.0-alpha-2 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-3 When we get all of a table's region locations from meta, we can cache the result.
[jira] [Created] (HBASE-26867) Introduce a FlushProcedure
ruanhui created HBASE-26867: --- Summary: Introduce a FlushProcedure Key: HBASE-26867 URL: https://issues.apache.org/jira/browse/HBASE-26867 Project: HBase Issue Type: New Feature Components: proc-v2 Reporter: ruanhui Assignee: ruanhui Fix For: 2.6.0, 3.0.0-alpha-3 Reimplement proc-v1 based flush procedure in proc-v2. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26859) Split TestSnapshotProcedure to several smaller tests
ruanhui created HBASE-26859: --- Summary: Split TestSnapshotProcedure to several smaller tests Key: HBASE-26859 URL: https://issues.apache.org/jira/browse/HBASE-26859 Project: HBase Issue Type: Improvement Components: proc-v2, snapshots Affects Versions: 2.4.11 Reporter: ruanhui Fix For: 3.0.0-alpha-3, 2.4.12 TestSnapshotProcedure is too big and times out easily.
[jira] [Created] (HBASE-26842) TestSnapshotProcedure fails in branch-2
ruanhui created HBASE-26842: --- Summary: TestSnapshotProcedure fails in branch-2 Key: HBASE-26842 URL: https://issues.apache.org/jira/browse/HBASE-26842 Project: HBase Issue Type: Bug Components: proc-v2, snapshots Reporter: ruanhui Fix For: 2.4.11 We still use the original implementation of Admin in branch-2, which is different from the AdminOverAsyncAdmin in the master branch. This patch tries to introduce the snapshot procedure to the original Admin client implementation in branch-2.
[jira] [Created] (HBASE-26769) add archive directory, old WAL directory and disabled tables space usage information to metrics
ruanhui created HBASE-26769: --- Summary: add archive directory, old WAL directory and disabled tables space usage information to metrics Key: HBASE-26769 URL: https://issues.apache.org/jira/browse/HBASE-26769 Project: HBase Issue Type: Improvement Components: metrics Reporter: ruanhui Currently we don't have space usage information for the archive directory, the old WAL directory and disabled tables. This patch adds that information to the metrics system.
[jira] [Created] (HBASE-26716) NPE caused by converting uppercase hostname to lowercase in RegionMover
ruanhui created HBASE-26716: --- Summary: NPE caused by converting uppercase hostname to lowercase in RegionMover Key: HBASE-26716 URL: https://issues.apache.org/jira/browse/HBASE-26716 Project: HBase Issue Type: Improvement Components: util Affects Versions: 2.4.9 Reporter: ruanhui

In HBASE-19456, we introduced case insensitivity in RegionMover and converted uppercase hostnames to lowercase. But this may prevent us from getting the rsgroup info of the server being unloaded, because the addresses in HBase are case sensitive. This makes org.apache.hadoop.hbase.util.TestRegionMoverWithRSGroupEnable fail.

2022-01-27T20:53:31,948 INFO [Time-limited test] util.TestRegionMoverWithRSGroupEnable(127): Unloading {*}VM{*}-154-75-centos
2022-01-27T20:53:31,959 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=49232] master.MasterRpcServices(3011): rsGroupInfo of {*}vm{*}-154-75-centos:39126 is null
2022-01-27T20:53:31,961 INFO [pool-332-thread-1] util.RegionMover(419): rsgroup of {*}vm{*}-154-75-centos:39126 is null
2022-01-27T20:53:31,961 ERROR [pool-332-thread-1] util.RegionMover(471): Error while unloading regions
java.lang.NullPointerException: null
	at org.apache.hadoop.hbase.util.RegionMover.lambda$unloadRegions$3(RegionMover.java:421) ~[classes/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
[jira] [Created] (HBASE-26610) RSRollLogTask didn't call coprocessor when request roll log in backup
ruanhui created HBASE-26610: --- Summary: RSRollLogTask didn't call coprocessor when request roll log in backup Key: HBASE-26610 URL: https://issues.apache.org/jira/browse/HBASE-26610 Project: HBase Issue Type: Improvement Components: backup&restore Affects Versions: 2.4.9 Reporter: ruanhui Assignee: ruanhui -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26554) Introduce a new parameter in jmx servlet to exclude the specific mbean
ruanhui created HBASE-26554: --- Summary: Introduce a new parameter in jmx servlet to exclude the specific mbean Key: HBASE-26554 URL: https://issues.apache.org/jira/browse/HBASE-26554 Project: HBase Issue Type: Improvement Components: metrics Affects Versions: 2.4.8 Reporter: ruanhui Assignee: ruanhui Fix For: 3.0.0-alpha-2 Many of our regionservers serve over a thousand regions, so the metrics payload is large. I tried to exclude some huge mbeans, such as 'Hadoop:service=HBase,name=RegionServer,sub=Regions', with a regex, but did not succeed. So I propose a new parameter 'excl' in the jmx servlet to exclude a specific bean or beans. -- This message was sent by Atlassian Jira (v8.20.1#820001)
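The exclusion idea in HBASE-26554 can be sketched with the standard javax.management.ObjectName pattern matching. This is only an illustration under the assumption that each 'excl' value is an ObjectName or ObjectName pattern; the class MBeanExcludeSketch and its filter method are hypothetical, and the actual servlet change may parse the parameter differently.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper: drops MBean names matched by any exclusion pattern. */
final class MBeanExcludeSketch {
  static List<String> filter(List<String> names, List<String> excl) {
    try {
      List<ObjectName> patterns = new ArrayList<>();
      for (String e : excl) {
        patterns.add(new ObjectName(e));
      }
      List<String> kept = new ArrayList<>();
      for (String n : names) {
        ObjectName name = new ObjectName(n);
        boolean excluded = false;
        for (ObjectName p : patterns) {
          // apply() is an equality check for plain names and a pattern match otherwise.
          if (p.apply(name)) {
            excluded = true;
            break;
          }
        }
        if (!excluded) {
          kept.add(n);
        }
      }
      return kept;
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException("invalid MBean name or pattern", e);
    }
  }
}
```

Passing 'Hadoop:service=HBase,name=RegionServer,sub=Regions' as the only exclusion would drop just that bean, while a property-list pattern such as 'Hadoop:service=HBase,*' would drop every bean carrying service=HBase in the Hadoop domain.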
[jira] [Resolved] (HBASE-26485) Introduce a method to clean restore directory after Snapshot Scan
[ https://issues.apache.org/jira/browse/HBASE-26485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruanhui resolved HBASE-26485. - Resolution: Fixed > Introduce a method to clean restore directory after Snapshot Scan > - > > Key: HBASE-26485 > URL: https://issues.apache.org/jira/browse/HBASE-26485 > Project: HBase > Issue Type: Improvement > Components: snapshots > Reporter: ruanhui > Assignee: ruanhui > Priority: Minor > > SnapshotScan is widely used in our company. However, after the snapshot scan > job finishes, the restore directory is not cleaned up, which may put a lot of > pressure on HDFS over time. So maybe we can introduce a method for > users to clean the snapshot restore directory after the job. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26485) Introduce a method to clean restore directory after Snapshot Scan
ruanhui created HBASE-26485: --- Summary: Introduce a method to clean restore directory after Snapshot Scan Key: HBASE-26485 URL: https://issues.apache.org/jira/browse/HBASE-26485 Project: HBase Issue Type: Improvement Components: snapshots Reporter: ruanhui Assignee: ruanhui SnapshotScan is widely used in our company. However, after the snapshot scan job finishes, the restore directory is not cleaned up, which may put a lot of pressure on HDFS over time. So maybe we can introduce a method for users to clean the snapshot restore directory after the job. -- This message was sent by Atlassian Jira (v8.20.1#820001)
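As a rough picture of the cleanup proposed in HBASE-26485, the sketch below deletes a restore directory recursively once a job is done. It uses java.nio.file on a local path as a stand-in for HDFS, and the class RestoreDirCleanerSketch and its method are hypothetical names for this example, not the method that was added to HBase.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: removes a restore directory and everything under it. */
final class RestoreDirCleanerSketch {
  static void deleteRecursively(Path restoreDir) {
    if (!Files.exists(restoreDir)) {
      return; // Nothing to clean.
    }
    try (Stream<Path> paths = Files.walk(restoreDir)) {
      // Reverse order visits children before their parent directory.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A job would call this in its final step, after the scan has finished reading from the restored snapshot files.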
[jira] [Created] (HBASE-26323) introduce a SnapshotProcedure
ruanhui created HBASE-26323: --- Summary: introduce a SnapshotProcedure Key: HBASE-26323 URL: https://issues.apache.org/jira/browse/HBASE-26323 Project: HBase Issue Type: New Feature Components: proc-v2, snapshots Reporter: ruanhui

Currently, snapshots in HBase use ZooKeeper (zk) as the coordinator. This has some limitations:
a. A snapshot may fail when a region server crashes.
b. A snapshot may fail when the master restarts.
c. Only one snapshot per table can be taken at a time.
d. Snapshot verification is handled by the master, which may take a long time when the table has a large number of regions, for example 1.

Since we now have the procedure v2 framework, it is possible to solve the above problems. So here is a procedure-v2-based snapshot implementation. It has the following goals:
a. A snapshot can continue when a region server crashes.
b. A snapshot can continue when the master restarts.
c. More than one snapshot per table can be taken at a time.
d. Region servers can be used to verify the snapshot, which speeds up the procedure.

Here are some details about the implementation.

*SnapshotProcedure* SnapshotProcedure is used to take a snapshot of a table. It acquires a shared table lock on the snapshot table and holds the shared lock during suspend and yield.

*SnapshotRegionProcedure* SnapshotRegionProcedure is used to take a snapshot of a specific region of the snapshot table. It acquires an exclusive region lock and releases the lock during suspend and yield. Before dispatching the remote snapshot operation to a region server, it checks whether the target region is in RIT. If it is, the procedure sleeps for some time and retries.

*SnapshotVerifyProcedure* SnapshotVerifyProcedure is used to send a snapshot verify request to a region server. If the snapshot is corrupted, it notifies the parent snapshot procedure to retry. If the remote region server has crashed, it chooses another online server and retries.

I would be very grateful for any advice and guidance. Is anyone interested in taking a look?
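The failover behavior described for SnapshotVerifyProcedure can be pictured with a minimal sketch. Everything here is hypothetical: VerifyFailoverSketch, verifyOnAnyServer, and the sendVerify callback are invented for illustration and are not part of the procedure v2 framework, which also handles persistence, suspension, and locking that this sketch ignores.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the procedure v2 API. */
final class VerifyFailoverSketch {
  /**
   * Sends the verify request to each online server in turn.
   * sendVerify returns false when the server is unreachable (assumed crashed).
   * Returns the server that accepted the request, or empty if none did.
   */
  static Optional<String> verifyOnAnyServer(
      List<String> onlineServers, Predicate<String> sendVerify) {
    for (String server : onlineServers) {
      if (sendVerify.test(server)) {
        return Optional.of(server);
      }
      // The server did not respond: fall through and try the next online server.
    }
    return Optional.empty();
  }
}
```

An empty result would correspond to the case where no online server could take the request, which a real procedure would presumably handle by waiting and retrying later.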
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26166) table list in master ui has a minor bug
ruanhui created HBASE-26166: --- Summary: table list in master ui has a minor bug Key: HBASE-26166 URL: https://issues.apache.org/jira/browse/HBASE-26166 Project: HBase Issue Type: Bug Components: UI Affects Versions: 2.4.5 Reporter: ruanhui Attachments: image-2021-08-03-13-09-24-030.png !image-2021-08-03-13-09-24-030.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25880) remove files from filesCompacting when clear compaction queues
ruanhui created HBASE-25880: --- Summary: remove files from filesCompacting when clear compaction queues Key: HBASE-25880 URL: https://issues.apache.org/jira/browse/HBASE-25880 Project: HBase Issue Type: Bug Components: Compaction Reporter: ruanhui Assignee: ruanhui When clearing the compaction queues, we only clear the workQueue of the ThreadPoolExecutor, but the files in the queued compaction requests remain in the filesCompacting list. Maybe we should remove them as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
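The bookkeeping gap described in HBASE-25880 can be shown with a small sketch. It uses a plain BlockingQueue in place of a real ThreadPoolExecutor work queue so that the behavior is deterministic, and the class CompactionQueueSketch with its Request type is hypothetical; it only mirrors the names workQueue and filesCompacting from the report.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical model of a compaction queue plus its "files in use" set. */
final class CompactionQueueSketch {
  static final class Request implements Runnable {
    final List<String> files;

    Request(List<String> files) {
      this.files = files;
    }

    @Override
    public void run() {
      // The actual compaction work is omitted in this sketch.
    }
  }

  private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();
  private final Set<String> filesCompacting = new HashSet<>();

  synchronized void submit(List<String> files) {
    filesCompacting.addAll(files);
    workQueue.add(new Request(files));
  }

  /** Clears queued requests and also releases the files they had reserved. */
  synchronized void clearQueue() {
    List<Runnable> drained = new ArrayList<>();
    workQueue.drainTo(drained);
    for (Runnable r : drained) {
      if (r instanceof Request) {
        filesCompacting.removeAll(((Request) r).files);
      }
    }
  }

  synchronized Set<String> filesCompacting() {
    return new HashSet<>(filesCompacting);
  }
}
```

Draining the queue into a list, rather than calling clear(), is what lets the sketch see which requests were dropped and release exactly their files.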
[jira] [Created] (HBASE-25102) fix replication.stats.thread.period.seconds default setting bug
ruanhui created HBASE-25102: --- Summary: fix replication.stats.thread.period.seconds default setting bug Key: HBASE-25102 URL: https://issues.apache.org/jira/browse/HBASE-25102 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.3.2, 2.2.6 Reporter: ruanhui replication.stats.thread.period.seconds is specified in seconds, while the default TimeUnit used is TimeUnit.MILLISECONDS. -- This message was sent by Atlassian Jira (v8.3.4#803005)
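The unit mismatch in HBASE-25102 comes down to passing a value in seconds to code that interprets it as milliseconds. A minimal sketch of an explicit conversion follows; StatsPeriodSketch and periodMillis are hypothetical names, and the value used in the note below is only an example, not the actual default of the setting.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: converts a period configured in seconds to milliseconds. */
final class StatsPeriodSketch {
  static long periodMillis(long configuredSeconds) {
    // Convert explicitly instead of relying on the scheduler's default unit.
    return TimeUnit.SECONDS.toMillis(configuredSeconds);
  }
}
```

For example, a configured period of 300 seconds becomes 300000 milliseconds; without the conversion, a millisecond-based scheduler would run the task every 0.3 seconds instead.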