[ https://issues.apache.org/jira/browse/HBASE-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
huginn updated HBASE-29381: --------------------------- Priority: Major (was: Minor) > Repeated submission of subProcedures leads to concurrency issues > ---------------------------------------------------------------- > > Key: HBASE-29381 > URL: https://issues.apache.org/jira/browse/HBASE-29381 > Project: HBase > Issue Type: Bug > Components: snapshots > Affects Versions: 2.4.11 > Reporter: huginn > Priority: Major > Labels: pull-request-available > Attachments: HBASE-29381.patch > > > When a subProcedure has completed but procedure has not yet completed, if a > new different procedure is submitted during this window, it may lead to the > repeated submission of the current subProcedure. As a result, duplicated > subProcedure remain active when procedure has completed, which will cause > concurrent conflicts. > Here are some logs related to an issue that occurred during taking snapshot. > *Master* > {code:java} > 2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure: > Starting procedure 'snapshotName__1745863344' > acquire > 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure: > Procedure 'snapshotName__1745863344' starting 'acquire' > 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] > procedure.ZKProcedureCoordinator: Creating acquire > znode:/hbase/online-snapshot/acquired/snapshotName__1745863344 > 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] > procedure.ZKProcedureCoordinator: Watching for acquire > node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1 > 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] > procedure.ZKProcedureCoordinator: Watching for acquire > node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2 > 2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0] > procedure.ZKProcedureCoordinator: Node created: > /hbase/online-snapshot/acquired/snapshotName__1745863344/rs2 > 2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0] > procedure.ZKProcedureCoordinator: Node created: > /hbase/online-snapshot/acquired/snapshotName__1745863344/rs1 > reach > 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure: > Procedure 'snapshotName__1745863344' starting 'in-barrier' execution. > 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] > procedure.ZKProcedureCoordinator: Creating reached barrier zk > node:/hbase/online-snapshot/reached/snapshotName__1745863344 > 2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0] > procedure.ZKProcedureCoordinator: Node created: > /hbase/online-snapshot/reached/snapshotName__1745863344/rs1 > 2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0] > procedure.ZKProcedureCoordinator: Node created: > /hbase/online-snapshot/reached/snapshotName__1745863344/rs2 > complete > 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure: > Procedure 'snapshotName__1745863344' execution completed > 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] > procedure.ZKProcedureUtil: Clearing all znodes for procedure > snapshotName__1745863344including nodes /hbase/online-snapshot/acquired > /hbase/online-snapshot/reached /hbase/online-snapshot/abort > 2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master] > snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for > snapshotName__1745863344 > 2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master] > snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table > Test:table1 completed > > abort > 2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0] > procedure.ZKProcedureCoordinator: Node created: > /hbase/online-snapshot/abort/snapshotName__1745863344 > 2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0] > procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777 > {code} > *rs2* > {code:java} > acquire > 2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0] > procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already > running. Bailing out > 2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0] > procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already > running. Bailing out > reach > 2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received created > event:/hbase/online-snapshot/reached/snapshotName__1745863344 > 2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0] > procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 > is still present, removing > 2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0] > procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already > running. Bailing out > 2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0] > procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 > is still present, removing > clean > 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received created > event:/hbase/online-snapshot/abort/snapshotName__1745863344 > 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] > procedure.ZKProcedureMemberRpcs: Received procedure abort children changed > event: /hbase/online-snapshot/abort > abort > 2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0] > procedure.ProcedureMember: Propagating foreign exception to subprocedure > snapshotName__1745863344 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777 > {code} > > *Client* > {code:java} > 2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7] > Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: > Snapshot \{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } > had an error. Procedure snapshotName__1745863344 \{ waiting=[] done=[] } > at > org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402) > at > org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105) > at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85) > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777 > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395) > at > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82) > at > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330) > at > org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391) > ... 6 more > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777 > {code} > We can see that during subProcedure has completed but procedure has not yet > completed, this completed old subProcedure is removed and a duplicated > subProcedure is submitted. After procedure has completed, the duplicated > subProcedures is active and generating region manifest in the working > directory. When taking snapshot has completed, master cleans out the working > directory which causes duplicated subProcedure to fail in generating the > region manifest and throw an exception. In the end, the snapshot is > successfully created, but the subProcedure reportes an exception. The root > cause of all this is that after the members have reached the barrier, the > coordinator does not clean up the corresponding acquired barrier, which > results in the repeated submission of subProcedure. > In addition, I find that during the entire process, there is no watch set on > node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is > unnecessary to create this node to trigger its watch event before cleaning it > up. -- This message was sent by Atlassian Jira (v8.20.10#820010)