Klay created HBASE-28545:
----------------------------

             Summary: Worker thread gets stuck in an infinite retry when 
TRUNCATE command runs into RowTooBigException
                 Key: HBASE-28545
                 URL: https://issues.apache.org/jira/browse/HBASE-28545
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.5.8, 2.4.17
            Reporter: Klay
         Attachments: hbase-master.log

Hi, I am testing the reliability of HBase to see whether it works normally, or at least handles exceptions properly, under various configurations.

When +hbase.table.max.rowsize+ is set to a relatively small value, creating an empty table and then truncating it gets a PEWorker thread stuck in an infinite retry loop that generates massive logs.
h1. Reproduce

This bug can be reproduced deterministically with the following steps:

1. Set hbase.table.max.rowsize to a relatively small value, 745, in hbase-site.xml (the value is in bytes; the default is 1 GB):
{code:xml}
    <property>
        <name>hbase.table.max.rowsize</name>
        <value>745</value>
    </property>
{code}
2. Start up an HBase 2.5.8 cluster (1 HMaster, 2 RegionServers) and execute the following commands from the HBase shell:
{code:ruby}
create 'uuid0f06811e5d11431fa9d4a543705b8747', {NAME => 
'uuidad7c09870a6346598bbca56cb97c4c67', VERSIONS => 4, COMPRESSION => 'GZ', 
BLOOMFILTER => 'NONE', IN_MEMORY => 'false'}, {NAME => 
'uuidd2c5ad4df7b844afb4c2ac92b224dedd', VERSIONS => 4, COMPRESSION => 'NONE', 
BLOOMFILTER => 'NONE', IN_MEMORY => 'true'}, {NAME => 
'uuid6e21f5f492eb4b9ba0e431b9a0666dae', VERSIONS => 4, COMPRESSION => 'NONE', 
BLOOMFILTER => 'NONE', IN_MEMORY => 'true'}, {NAME => 
'uuidfb110629fba44915956e41609b68ebe9', VERSIONS => 2, COMPRESSION => 'NONE', 
BLOOMFILTER => 'ROW', IN_MEMORY => 'false'}, {NAME => 
'uuid4943640bf1f6425582ac9b3572d95964', VERSIONS => 4, COMPRESSION => 'GZ', 
BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}

truncate 'uuid0f06811e5d11431fa9d4a543705b8747' {code}
The *truncate* command times out in the shell after 10 minutes. However, the PEWorker thread still gets stuck in an *infinite retry* and keeps generating massive WARN messages in the HMaster logs:
{code:bash}
➜  log git:(main) ✗ ls
SecurityAuth.audit
hbase--master-09e0b9e25e7c.log
hbase--master-09e0b9e25e7c.log.1
hbase--master-09e0b9e25e7c.log.2
hbase--master-09e0b9e25e7c.log.3
hbase--master-09e0b9e25e7c.log.4
hbase--master-09e0b9e25e7c.log.5
hbase--master-09e0b9e25e7c.log.6
hbase--master-09e0b9e25e7c.log.7
hbase--master-09e0b9e25e7c.log.8
hbase--master-09e0b9e25e7c.log.9
hbase--master-09e0b9e25e7c.log.10
hbase--master-09e0b9e25e7c.log.11
hbase--master-09e0b9e25e7c.log.12
hbase--master-09e0b9e25e7c.log.13
hbase--master-09e0b9e25e7c.log.14
hbase--master-09e0b9e25e7c.log.15
hbase--master-09e0b9e25e7c.log.16
hbase--master-09e0b9e25e7c.log.17
hbase--master-09e0b9e25e7c.log.18
hbase--master-09e0b9e25e7c.log.19
hbase--master-09e0b9e25e7c.log.20
hbase--master-09e0b9e25e7c.out
hbase--zookeeper-09e0b9e25e7c.log
hbase--zookeeper-09e0b9e25e7c.out {code}
 

The logs are filled with repetitions of the following exception:

 
{code:java}
2024-04-23T02:52:16,144 WARN  [PEWorker-14] procedure.TruncateTableProcedure: 
Retriable error trying to truncate table=uuid0f06811e5d11431fa9d4a543705b8747 
state=TRUNCATE_TABLE_CLEAR_FS_LAYOUT
org.apache.hadoop.hbase.regionserver.RowTooBigException: 
org.apache.hadoop.hbase.regionserver.RowTooBigException: Max row size allowed: 
745, but the row is bigger than that, the row info: 
uuid0f06811e5d11431fa9d4a543705b8747,,1713840709844.c03450373b32cba95a2dd3139f0a2201./info:state/1713840716052/Put/vlen=6/seqid=16,
 already have process row cells = 6, it belong to region = 
hbase:meta,,1.1588230740
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
        at 
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:256)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:243)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2654)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2577)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45002)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)       
 at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source) 
~[?:?]
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 ~[?:1.8.0_362]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
~[?:1.8.0_362]
        at 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:110)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:100)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:280)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:265)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:126)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:103)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:390) 
~[hbase-client-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.client.HTable.lambda$get$0(HTable.java:363) 
~[hbase-client-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) 
~[hbase-common-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:363) 
~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.MetaTableAccessor.get(MetaTableAccessor.java:248) 
~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.MetaTableAccessor.getRegionResult(MetaTableAccessor.java:344)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.MetaTableAccessor.getMergeRegions(MetaTableAccessor.java:374)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.master.procedure.DeleteTableProcedure.deleteFromFs(DeleteTableProcedure.java:283)
 ~[hbase-server-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:103)
 ~[hbase-server-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:44)
 ~[hbase-server-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:942) 
~[hbase-procedure-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1765)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1443)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:77)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:2091)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) 
~[hbase-common-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2118)
 ~[hbase-procedure-2.5.8.jar:2.5.8]
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException: 
org.apache.hadoop.hbase.regionserver.RowTooBigException: Max row size allowed: 
745, but the row is bigger than that, the row info: 
uuid0f06811e5d11431fa9d4a543705b8747,,1713840709844.c03450373b32cba95a2dd3139f0a2201./info:state/1713840716052/Put/vlen=6/seqid=16,
 already have process row cells = 6, it belong to region = 
hbase:meta,,1.1588230740
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
        at 
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:145)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.populateResult(RegionScannerImpl.java:342)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextInternal(RegionScannerImpl.java:513)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.nextRaw(RegionScannerImpl.java:278)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:256)
        at 
org.apache.hadoop.hbase.regionserver.RegionScannerImpl.next(RegionScannerImpl.java:243)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2654)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2577)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45002)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
        at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)       
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:388)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) 
~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420) 
~[hbase-client-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) 
~[hbase-client-2.5.8.jar:2.5.8]
        at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) 
~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:199)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:220)
 ~[hbase-client-2.5.8.jar:2.5.8]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at 
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[hbase-shaded-netty-4.1.5.jar:?]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362]
{code}
I have attached an example HMaster log (hbase-master.log).
h1. Root Cause

If I decrease the length of the table name and column family names in the create table command, the truncate works normally. This indicates that the symptom can be triggered whenever a table's metadata row in hbase:meta is larger than the max row size. In this reproduction the meta row key (uuid0f06811e5d11431fa9d4a543705b8747,,1713840709844.c03450373b32cba95a2dd3139f0a2201.) is already 85 bytes and is repeated in every cell of the row, so the six cells read before the failure, plus the serialized RegionInfo in info:regioninfo, can easily exceed the 745-byte limit.
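
For contrast, a minimal sketch of the same scenario with short names (the table/family names 't1', 'f1', 'f2' are hypothetical), which, per the observation above, should truncate normally under the same 745-byte limit because its meta row stays small:
{code:ruby}
# Same hbase.table.max.rowsize = 745; only the name lengths differ.
create 't1', {NAME => 'f1', VERSIONS => 4}, {NAME => 'f2', VERSIONS => 4}
truncate 't1'  # completes normally: the meta row for 't1' stays under the limit
{code}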

E.g., even with the default max row size (1 GB), creating a table with a very large metadata footprint (many column families with long names) might still trigger the same infinite-loop problem.

This usually won't happen with the default configuration. However, users are more likely to run into this exception if they set the max row size to a smaller value and create tables with large metadata.
h1. Fix

Better error handling might help:
 * The infinite loop could be avoided by adding a retry limit when the truncate procedure keeps running into RowTooBigException (see the sketch below).
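
Below is a minimal, self-contained sketch of that bounded-retry idea. It is not the actual TruncateTableProcedure code; the class name, the Step interface, and the MAX_RETRIES value are all hypothetical and only illustrate the pattern:
{code:java}
import java.io.IOException;

/** Illustrative only: bound the retries that the PEWorker currently repeats forever. */
public final class BoundedRetry {
  private static final int MAX_RETRIES = 10; // made-up limit for illustration

  @FunctionalInterface
  interface Step {
    void run() throws IOException;
  }

  static void runWithBoundedRetries(Step step) throws IOException {
    int attempts = 0;
    while (true) {
      try {
        step.run();
        return;
      } catch (IOException e) { // RowTooBigException is an IOException subclass
        if (++attempts > MAX_RETRIES) {
          // Fail the operation and surface the cause instead of retrying forever.
          throw new IOException("Giving up after " + MAX_RETRIES + " retries", e);
        }
        // Otherwise log and retry, as the worker does today, but now bounded.
      }
    }
  }
}
{code}
Alternatively, since RowTooBigException extends DoNotRetryIOException, the procedure could arguably treat it as non-retriable and fail fast rather than count retries.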

I am still examining the exact root cause, and I can try to provide a patch. Any suggestions are welcome. Thank you.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
