[
https://issues.apache.org/jira/browse/HBASE-28545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Klay updated HBASE-28545:
-------------------------
Component/s: master
> Worker thread gets stuck in an infinite retry when TRUNCATE command runs into
> RowTooBigException
> ------------------------------------------------------------------------------------------------
>
> Key: HBASE-28545
> URL: https://issues.apache.org/jira/browse/HBASE-28545
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.4.17, 2.5.8
> Reporter: Klay
> Priority: Major
> Attachments: hbase-master.log
>
>
> Hi, I am testing the reliability of HBase to see whether it works normally or
> properly handles exceptions under different configurations.
> When +hbase.table.max.rowsize+ is set to a relatively small value, creating an
> empty table and then truncating it gets a worker thread stuck in an infinite
> retry loop that generates massive logs.
> h1. Reproduce
> This bug can be reproduced deterministically with the following steps:
> 1. Set hbase.table.max.rowsize to a relatively small value (745) in hbase-site.xml:
> {code:java}
> <property>
>   <name>hbase.table.max.rowsize</name>
>   <value>745</value>
> </property>
> {code}
> 2. Start up an HBase 2.5.8 cluster (1 HMaster, 2 RegionServers) and execute the
> following commands from the hbase shell:
> {code:java}
> create 'uuid0f06811e5d11431fa9d4a543705b8747',
>   {NAME => 'uuidad7c09870a6346598bbca56cb97c4c67', VERSIONS => 4, COMPRESSION => 'GZ', BLOOMFILTER => 'NONE', IN_MEMORY => 'false'},
>   {NAME => 'uuidd2c5ad4df7b844afb4c2ac92b224dedd', VERSIONS => 4, COMPRESSION => 'NONE', BLOOMFILTER => 'NONE', IN_MEMORY => 'true'},
>   {NAME => 'uuid6e21f5f492eb4b9ba0e431b9a0666dae', VERSIONS => 4, COMPRESSION => 'NONE', BLOOMFILTER => 'NONE', IN_MEMORY => 'true'},
>   {NAME => 'uuidfb110629fba44915956e41609b68ebe9', VERSIONS => 2, COMPRESSION => 'NONE', BLOOMFILTER => 'ROW', IN_MEMORY => 'false'},
>   {NAME => 'uuid4943640bf1f6425582ac9b3572d95964', VERSIONS => 4, COMPRESSION => 'GZ', BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}
>
> truncate 'uuid0f06811e5d11431fa9d4a543705b8747'
> {code}
> The *truncate* command times out after 10 minutes. However, the PEWorker thread
> still gets stuck in an *infinite retry* loop and keeps writing massive WARN
> output to the HMaster logs:
> {code:java}
> ➜ log git:(main) ✗ ls
> SecurityAuth.audit
> hbase--master-09e0b9e25e7c.log
> hbase--master-09e0b9e25e7c.log.1
> hbase--master-09e0b9e25e7c.log.2
> hbase--master-09e0b9e25e7c.log.3
> hbase--master-09e0b9e25e7c.log.4
> hbase--master-09e0b9e25e7c.log.5
> hbase--master-09e0b9e25e7c.log.6
> hbase--master-09e0b9e25e7c.log.7
> hbase--master-09e0b9e25e7c.log.8
> hbase--master-09e0b9e25e7c.log.9
> hbase--master-09e0b9e25e7c.log.10
> hbase--master-09e0b9e25e7c.log.11
> hbase--master-09e0b9e25e7c.log.12
> hbase--master-09e0b9e25e7c.log.13
> hbase--master-09e0b9e25e7c.log.14
> hbase--master-09e0b9e25e7c.log.15
> hbase--master-09e0b9e25e7c.log.16
> hbase--master-09e0b9e25e7c.log.17
> hbase--master-09e0b9e25e7c.log.18
> hbase--master-09e0b9e25e7c.log.19
> hbase--master-09e0b9e25e7c.log.20
> hbase--master-09e0b9e25e7c.out
> hbase--zookeeper-09e0b9e25e7c.log
> hbase--zookeeper-09e0b9e25e7c.out
> {code}
> The logs contain the following repeated exception:
> {code:java}
> 2024-04-23T02:52:16,144 WARN [PEWorker-14] procedure.TruncateTableProcedure: Retriable error trying to truncate table=uuid0f06811e5d11431fa9d4a543705b8747 state=TRUNCATE_TABLE_CLEAR_FS_LAYOUT
> org.apache.hadoop.hbase.regionserver.RowTooBigException: org.apache.hadoop.hbase.regionserver.RowTooBigException: Max row size allowed: 745, but the row is bigger than that, the row info: uuid0f06811e5d11431fa9d4a543705b8747,,1713840709844.c03450373b32cba95a2dd3139f0a2201./info:state/1713840716052/Put/vlen=6/seqid=16, already have process row cells = 6, it belong to region = hbase:meta,,1.1588230740
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:256)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:243)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2654)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2577)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45002)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
>   at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
>   at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source) ~[?:?]
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_362]
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_362]
>   at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:110) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:100) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:280) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:265) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:126) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:103) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.client.HTable.get(HTable.java:390) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.client.HTable.lambda$get$0(HTable.java:363) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) ~[hbase-common-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.client.HTable.get(HTable.java:363) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.MetaTableAccessor.get(MetaTableAccessor.java:248) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.MetaTableAccessor.getRegionResult(MetaTableAccessor.java:344) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.MetaTableAccessor.getMergeRegions(MetaTableAccessor.java:374) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.master.procedure.DeleteTableProcedure.deleteFromFs(DeleteTableProcedure.java:283) ~[hbase-server-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:103) ~[hbase-server-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:44) ~[hbase-server-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:942) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1443) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:77) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:2091) ~[hbase-procedure-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) ~[hbase-common-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2118) ~[hbase-procedure-2.5.8.jar:2.5.8]
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException: org.apache.hadoop.hbase.regionserver.RowTooBigException: Max row size allowed: 745, but the row is bigger than that, the row info: uuid0f06811e5d11431fa9d4a543705b8747,,1713840709844.c03450373b32cba95a2dd3139f0a2201./info:state/1713840716052/Put/vlen=6/seqid=16, already have process row cells = 6, it belong to region = hbase:meta,,1.1588230740
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:256)
>   at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:243)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2654)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2577)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45002)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>   at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
>   at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:388) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:199) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:220) ~[hbase-client-2.5.8.jar:2.5.8]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[hbase-shaded-netty-4.1.5.jar:?]
>   at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362]
> {code}
> I have attached an example HMaster log. (As the trace shows, the failing call
> is the Get that MetaTableAccessor.getRegionResult() issues against hbase:meta,
> invoked from DeleteTableProcedure.deleteFromFs() during
> TRUNCATE_TABLE_CLEAR_FS_LAYOUT.)
> h1. Root Cause
> If I decrease the length of the table name and column family names in the
> create-table command, truncate works normally. This indicates the symptom is
> triggered whenever a table's metadata row in hbase:meta is larger than the
> configured max row size.
> E.g., even with the default max row size (1GB), creating a table with a very
> large metadata footprint (lots of column families with long names) might still
> cause a similar infinite loop.
> This usually won't happen with the default configuration. However, users are
> more likely to run into this exception if they set the max row size to a
> smaller value and create a table with large metadata.
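> To check this, a standalone probe along the following lines can report the
> size of the table's rows in hbase:meta. This is my own illustrative code, not
> HBase code; it assumes the hbase-client library on the classpath and the
> cluster configured as above:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class MetaRowSizeProbe {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = HBaseConfiguration.create();
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>         Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
>       // Scan the hbase:meta rows belonging to the test table and print each
>       // row's total cell size. A row at or above hbase.table.max.rowsize
>       // (745 bytes here) is what makes the master's Get fail with
>       // RowTooBigException.
>       Scan scan = new Scan().setStartStopRowForPrefixScan(
>         Bytes.toBytes("uuid0f06811e5d11431fa9d4a543705b8747,"));
>       try (ResultScanner scanner = meta.getScanner(scan)) {
>         for (Result r : scanner) {
>           System.out.println(Bytes.toStringBinary(r.getRow()) + " -> "
>             + Result.getTotalSizeOfCells(r) + " bytes");
>         }
>       }
>     }
>   }
> }
> {code}
> With the five-column-family schema above, the region's info row easily exceeds
> 745 bytes, which matches the exception message.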
> h1. Fix
> A better error handler might work:
> * This infinite loop could be avoided by adding a retry limit when the
> truncate procedure keeps running into RowTooBigException (see the sketch
> below).
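> Here is a minimal, self-contained sketch of the bounded-retry idea. All names
> are hypothetical and this is not actual HBase code; the real change would
> live in the catch block of TruncateTableProcedure.executeFromState() that
> emits the WARN shown above:
> {code:java}
> import java.io.IOException;
>
> public class BoundedRetrySketch {
>   interface Step {
>     void run() throws IOException;
>   }
>
>   // Retry a "retriable" step at most maxRetries times, then surface a hard
>   // failure instead of looping forever, analogous to calling setFailure()
>   // on the procedure once the limit is hit.
>   static void runWithRetryLimit(Step step, int maxRetries) throws IOException {
>     for (int attempt = 1; ; attempt++) {
>       try {
>         step.run();
>         return; // step succeeded
>       } catch (IOException e) {
>         if (attempt >= maxRetries) {
>           throw new IOException("Retry limit " + maxRetries + " exceeded", e);
>         }
>         // Otherwise log and retry, as the current WARN-and-retry path does.
>       }
>     }
>   }
>
>   public static void main(String[] args) {
>     try {
>       // Stand-in for the TRUNCATE_TABLE_CLEAR_FS_LAYOUT step that keeps
>       // failing deterministically with RowTooBigException.
>       runWithRetryLimit(() -> {
>         throw new IOException("RowTooBigException stand-in");
>       }, 3);
>     } catch (IOException e) {
>       System.out.println("Failed permanently: " + e.getMessage());
>     }
>   }
> }
> {code}
> A real patch would likely also want backoff, and to distinguish genuinely
> transient errors from deterministic ones like this RowTooBigException.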
> I am still examining the exact root cause and could try to provide a patch.
> Any suggestions are welcome. Thank you.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)