[ 
https://issues.apache.org/jira/browse/HBASE-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huginn updated HBASE-29381:
---------------------------
    Priority: Major  (was: Minor)

> Repeated submission of subProcedures leads to concurrency issues
> ----------------------------------------------------------------
>
>                 Key: HBASE-29381
>                 URL: https://issues.apache.org/jira/browse/HBASE-29381
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 2.4.11
>            Reporter: huginn
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HBASE-29381.patch
>
>
> When a subProcedure has completed but procedure has not yet completed, if a 
> new different procedure is submitted during this window, it may lead to the 
> repeated submission of the current subProcedure. As a result, duplicated 
> subProcedure remain active when procedure has completed, which will cause 
> concurrent conflicts.
> Here are some logs related to an issue that occurred during taking snapshot.
> *Master*
> {code:java}
> 2025-04-29 02:03:24,716 INFO [proc-coordinator-pool-25] procedure.Procedure: 
> Starting procedure 'snapshotName__1745863344'
> acquire
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' starting 'acquire'
> 2025-04-29 02:03:24,716 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Creating acquire 
> znode:/hbase/online-snapshot/acquired/snapshotName__1745863344
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Watching for acquire 
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> 2025-04-29 02:03:24,718 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Watching for acquire 
> node:/hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,670 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs2
> 2025-04-29 02:03:27,908 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/acquired/snapshotName__1745863344/rs1
> reach
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' starting 'in-barrier' execution.
> 2025-04-29 02:03:28,133 DEBUG [proc-coordinator-pool-25] 
> procedure.ZKProcedureCoordinator: Creating reached barrier zk 
> node:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:30,477 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs1
> 2025-04-29 02:03:30,652 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/reached/snapshotName__1745863344/rs2
> complete
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] procedure.Procedure: 
> Procedure 'snapshotName__1745863344' execution completed
> 2025-04-29 02:03:30,768 INFO [proc-coordinator-pool-25] 
> procedure.ZKProcedureUtil: Clearing all znodes for procedure 
> snapshotName__1745863344including nodes /hbase/online-snapshot/acquired 
> /hbase/online-snapshot/reached /hbase/online-snapshot/abort
> 2025-04-29 02:03:30,906 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
> snapshot.EnabledTableSnapshotHandler: Done waiting - online snapshot for 
> snapshotName__1745863344
> 2025-04-29 02:03:31,750 INFO [MASTER_SNAPSHOT_OPERATIONS-master] 
> snapshot.TakeSnapshotHandler: Snapshot snapshotName__1745863344 of table 
> Test:table1 completed
>  
> abort
> 2025-04-29 02:03:31,186 DEBUG [zk-event-processor-pool-0] 
> procedure.ZKProcedureCoordinator: Node created: 
> /hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:31,218 DEBUG [zk-event-processor-pool-0] 
> procedure.ProcedureCoordinator: abort procedure snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> *rs2*
> {code:java}
> acquire
> 2025-04-29 02:03:26,069 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,083 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> 2025-04-29 02:03:26,522 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:26,540 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> reach
> 2025-04-29 02:03:28,134 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received created 
> event:/hbase/online-snapshot/reached/snapshotName__1745863344
> 2025-04-29 02:03:29,910 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:29,927 WARN [zk-event-processor-pool-0] 
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 
> is still present, removing
> 2025-04-29 02:03:30,208 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,215 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Subproc 'snapshotName__1745863344' is already 
> running. Bailing out
> 2025-04-29 02:03:30,610 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed 
> event: /hbase/online-snapshot/acquired
> 2025-04-29 02:03:30,616 WARN [zk-event-processor-pool-0] 
> procedure.ProcedureMember: A completed old subproc snapshotName__1745863344 
> is still present, removing
> clean
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received created 
> event:/hbase/online-snapshot/abort/snapshotName__1745863344
> 2025-04-29 02:03:30,770 INFO [zk-event-processor-pool-0] 
> procedure.ZKProcedureMemberRpcs: Received procedure abort children changed 
> event: /hbase/online-snapshot/abort
> abort
> 2025-04-29 02:03:31,037 ERROR [zk-event-processor-pool-0] 
> procedure.ProcedureMember: Propagating foreign exception to subprocedure 
> snapshotName__1745863344
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
>  
> *Client*
> {code:java}
> 2025-04-29 02:03:36,289 ERROR c.s.d.h.c.CommonUtils [RoutineTaskExecutor-7] 
> Error message org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: 
> Snapshot \{ ss=snapshotName__1745863344 table=table1 type=SKIPFLUSH ttl=0 } 
> had an error.  Procedure snapshotName__1745863344 \{ waiting=[] done=[] }
>         at 
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:402)
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1266)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
> Caused by: 
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via 
> rs2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3250)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:718)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:704)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3294)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:1130)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:667)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:637)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:605)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1249)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3395)
>         at 
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
>         at 
> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:330)
>         at 
> org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:391)
>         ... 6 more
> Caused by: 
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/.hbase-snapshot/.tmp/snapshotName__1745863344/region-manifest.abcdef122344556667777
> {code}
> We can see that during subProcedure has completed but procedure has not yet 
> completed, this completed old subProcedure is removed and a duplicated 
> subProcedure is submitted. After procedure has completed, the duplicated 
> subProcedures is active and generating region manifest in the working 
> directory.  When taking snapshot has completed, master cleans out the working 
> directory which causes duplicated subProcedure to fail in generating the 
> region manifest and throw an exception. In the end,  the snapshot is 
> successfully created, but the subProcedure reportes an exception. The root 
> cause of all this is that after the members have reached the barrier, the 
> coordinator does not clean up the corresponding acquired barrier, which 
> results in the repeated submission of subProcedure.
> In addition, I find that during the entire process, there is no watch set on 
> node /hbase/online-snapshot/acquired/snapshotName__1745863344 , so it is 
> unnecessary to create this node to trigger its watch event before cleaning it 
> up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to