[ https://issues.apache.org/jira/browse/HDDS-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794305#comment-17794305 ]
Sammi Chen edited comment on HDDS-9342 at 12/7/23 3:50 PM:
-----------------------------------------------------------
Together with [~wanghongbing] and [~z-bb], we found the root cause of this. Three major issues show up in this case.

The first is a long full GC which caused the OM to terminate itself:
{code:java}
2023-10-24 11:40:08,542 [JvmPauseMonitor0] WARN org.apache.ratis.util.JvmPauseMonitor: JvmPauseMonitor-om2: Detected pause in JVM or host machine approximately 131.539s with 131.735s GC time.
GC pool 'ParNew' had collection(s): count=1 time=1422ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=130313ms
...
2023-10-24 11:40:08,552 [JvmPauseMonitor0] ERROR org.apache.ratis.server.RaftServer: om2: JVM pause detected 131.539s longer than the close-threshold 120s, shutting down
...
{code}
The second is that the gap between the snapshot index taken during OM shutdown and the commitIndex in Raft is far too large, which forces many raft log entries to be replayed at the next OM start, and which also caused the third issue:
{code:java}
2023-10-24 11:40:08,557 [om2-impl-thread2] INFO org.apache.ratis.server.impl.StateMachineUpdater: om2@group-197E298202B9-StateMachineUpdater: set stopIndex = 405823932
...
2023-10-24 11:40:13,773 [om2@group-197E298202B9-StateMachineUpdater] INFO org.apache.ratis.server.impl.StateMachineUpdater: om2@group-197E298202B9-StateMachineUpdater: Took a snapshot at index 206114616
2023-10-24 11:40:13,776 [om2@group-197E298202B9-StateMachineUpdater] INFO org.apache.ratis.server.impl.StateMachineUpdater: om2@group-197E298202B9-StateMachineUpdater: snapshotIndex: updateIncreasingly 205866698 -> 206114616
...
2023-10-24 12:02:58,532 [om2-impl-thread1] INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-197E298202B9-SegmentedRaftLogWorker: flushIndex: setUnconditionally 0 -> 405823932
2023-10-24 12:02:58,532 [om2-impl-thread1] INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-197E298202B9-SegmentedRaftLogWorker: safeCacheEvictIndex: setUnconditionally 0 -> 206114616
2023-10-24 12:02:58,619 [om2-impl-thread1] INFO org.apache.ratis.server.raftlog.RaftLog: om2@group-197E298202B9-SegmentedRaftLog: commitIndex: updateToMax old=206114616, new=405823931, updated? true
2023-10-24 12:02:58,619 [om2-impl-thread1] INFO org.apache.ratis.server.raftlog.RaftLog: om2@group-197E298202B9-SegmentedRaftLog: purgeIndex: updateIncreasingly -1 -> 206094557
{code}
The third is the OM start failure caused by a raft log replay failure:
{code:java}
2023-10-24 12:02:59,086 [OM StateMachine ApplyTransaction Thread - 0] ERROR org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Terminating with exit status 1: Request cmdType: RenameKey traceID: "" clientId: "client-9300289B92FF" userInfo { userName: "weisuonan" remoteAddress: "192.168.0.1" hostName: "bigdata-apache-tool1.xxx.xxx.com" } version: 3 renameKeyRequest { keyArgs { volumeName: "vol-bigdata-offline" bucketName: "weisuonan" keyName: "checkative/english-book_Ozone" dataSize: 0 modificationTime: 1697627113034 } toKeyName: ".Trash/weisuonan/Current/vol-bigdata-offline/weisuonan/checkative/english-book_Ozone1697627113026" } failed with exception
java.lang.IllegalArgumentException: Trying to set updateID to 206114651 which is not greater than the current value of 405823931 for OMKeyInfo{volume='vol-bigdata-offline', bucket='weisuonan', key='checkative/english-book_Ozone', dataSize='23124', creationTime='1698118516546', objectID='-9223371932963850495', parentID='-9223372036604311807', replication='RATIS/THREE', fileChecksum='null}
	at org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
	at org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.renameKey(OMKeyRenameRequestWithFSO.java:286)
	at org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequestWithFSO.validateAndUpdateCache(OMKeyRenameRequestWithFSO.java:219)
	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:382)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
After analysis, all three issues share one root cause. HDDS-2680 introduced logic in OzoneManagerStateMachine that calculates the lastAppliedTermIndex based on two maps, applyTransactionMap and ratisTransactionMap. Any write request arriving from RATIS through applyTransaction adds its trxLogIndex to applyTransactionMap. Any write request flushed by OzoneManagerDoubleBuffer#flushBatch has its trxLogIndex removed from applyTransactionMap when flushBatch calls ozoneManagerRatisSnapShot.updateLastAppliedIndex(flushedEpochs). If a write request from RATIS never goes through OzoneManagerDoubleBuffer#flushBatch, its trxLogIndex is left in applyTransactionMap forever. Since lastAppliedIndex can only be updated incrementally, any trxLogIndex not confirmed by an OzoneManagerDoubleBuffer flush stops lastAppliedIndex from advancing past it; write requests after that unconfirmed one can still be flushed, but their trxLogIndexes are added to ratisTransactionMap, which therefore grows bigger and bigger. This explains both the full GC (the ever-growing ratisTransactionMap) and the huge gap between the snapshot index and the commit index.
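To make the stalling behavior concrete, here is a minimal sketch of the index tracking described above. It is a simplification, not the actual OzoneManagerStateMachine code: it uses two sets in place of the real maps, and the names are illustrative.
{code:java}
// Minimal sketch of the lastAppliedIndex tracking described above;
// a simplification, not the actual OzoneManagerStateMachine implementation.
import java.util.concurrent.ConcurrentSkipListSet;

public class AppliedIndexTracker {
  // Indexes handed to applyTransaction but not yet confirmed by a flush
  // (plays the role of applyTransactionMap).
  private final ConcurrentSkipListSet<Long> unflushed = new ConcurrentSkipListSet<>();
  // Flushed indexes that cannot advance lastAppliedIndex yet because an
  // earlier index is still unconfirmed (plays the role of ratisTransactionMap).
  private final ConcurrentSkipListSet<Long> flushedOutOfOrder = new ConcurrentSkipListSet<>();
  private volatile long lastAppliedIndex = -1;

  public void onApplyTransaction(long trxLogIndex) {
    unflushed.add(trxLogIndex);
  }

  public synchronized void onFlushed(long trxLogIndex) {
    unflushed.remove(trxLogIndex);
    flushedOutOfOrder.add(trxLogIndex);
    // Advance only while the next expected index has already been flushed;
    // an index still sitting in 'unflushed' blocks the advance.
    while (flushedOutOfOrder.contains(lastAppliedIndex + 1)) {
      flushedOutOfOrder.remove(lastAppliedIndex + 1);
      lastAppliedIndex++;
    }
    // If some index i is never flushed (e.g. the request failed before it
    // reached the double buffer), lastAppliedIndex stalls at i - 1 and every
    // later flushed index piles up in flushedOutOfOrder: unbounded growth.
  }

  public long getLastAppliedIndex() {
    return lastAppliedIndex;
  }
}
{code}
In the logs above, the snapshot index appears pinned near 206114616 while the raft commitIndex reached 405823931, which matches this stalling pattern.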
How can a write request end up never being confirmed by an OzoneManagerDoubleBuffer flush? Here is one case reproduced locally:
T1: create bucket1.
T2: client1 sends a delete bucket "bucket1" request to OM. OM verifies that bucket1 exists, then sends the request to RATIS.
T3: client2 sends a create key "bucket1/key1" request to OM. OM verifies that bucket1 exists, then sends the request to RATIS.
T4: OzoneManagerStateMachine executes delete bucket "bucket1" successfully and returns the response to client1.
T5: OzoneManagerStateMachine executes the create key "bucket1/key1" request; "bucket1" cannot be found, the execution fails, and a failure is returned to client2.
In T5, the failure stack is
{code:java}
2023-10-18 19:04:10,131 [OM StateMachine ApplyTransaction Thread - 0] WARN org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Failed to write, Exception occurred BUCKET_NOT_FOUND
org.apache.hadoop.ozone.om.exceptions.OMException: Bucket not found: s3v/prod-voyager
	at org.apache.hadoop.ozone.om.OzoneManagerUtils.reportNotFound(OzoneManagerUtils.java:87)
	at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketInfo(OzoneManagerUtils.java:72)
	at org.apache.hadoop.ozone.om.OzoneManagerUtils.resolveBucketInfoLink(OzoneManagerUtils.java:148)
	at org.apache.hadoop.ozone.om.OzoneManagerUtils.getResolvedBucketInfo(OzoneManagerUtils.java:124)
	at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketLayout(OzoneManagerUtils.java:106)
	at org.apache.hadoop.ozone.om.request.BucketLayoutAwareOMKeyRequestFactory.createRequest(BucketLayoutAwareOMKeyRequestFactory.java:230)
	at org.apache.hadoop.ozone.om.ratis.utils.OzoneManagerRatisUtils.createClientRequest(OzoneManagerRatisUtils.java:336)
	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:380)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
In OzoneManagerStateMachine.runCommand, when an IOException is thrown out of OzoneManagerRequestHandler.handleWriteRequest, it constructs an OMResponse and returns it to the client, but it does not add the response to OzoneManagerDoubleBuffer, so OzoneManagerDoubleBuffer is never aware of this request and its trxLogIndex. The consequence is that this trxLogIndex stays in applyTransactionMap forever. The failed OM logs confirmed that.
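For illustration, here is a sketch of that error path. The stand-in types and method bodies are simplified, not the actual Ozone classes; only the shape of the control flow follows the description above.
{code:java}
// Illustrative sketch of the runCommand error path described above;
// simplified stand-in types, not the actual Ozone implementation.
import java.io.IOException;

class RunCommandSketch {
  static class OMResponse {
    final boolean success;
    final String message;
    OMResponse(boolean success, String message) {
      this.success = success;
      this.message = message;
    }
  }

  // Stand-in for OzoneManagerRequestHandler.handleWriteRequest. On the
  // success path the real handler's result reaches OzoneManagerDoubleBuffer,
  // whose flushBatch later confirms trxLogIndex back to the state machine.
  OMResponse handleWriteRequest(long trxLogIndex) throws IOException {
    throw new IOException("Bucket not found");  // the T5 scenario above
  }

  OMResponse runCommand(long trxLogIndex) {
    try {
      return handleWriteRequest(trxLogIndex);
    } catch (IOException e) {
      // An error response is built and returned to the client directly.
      // Nothing reaches OzoneManagerDoubleBuffer, so no flush ever confirms
      // trxLogIndex: it stays in applyTransactionMap forever and pins
      // lastAppliedIndex.
      return new OMResponse(false, e.getMessage());
    }
  }
}
{code}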
Created HDDS-9876 to fix the issue. cc [~szetszwo], [~ritesh]

> OM restart failed due to transactionLogIndex smaller than current updateID
> --------------------------------------------------------------------------
>
>                 Key: HDDS-9342
>                 URL: https://issues.apache.org/jira/browse/HDDS-9342
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, OM HA
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Assignee: Sammi Chen
>            Priority: Critical
>         Attachments: HDDS-9342_testUpdateId.patch, HDDS-9342_testUpdateId_reproduce.patch, clipboard_image_1700795744614.png, om.shutdown-20230922.log
>
> OM restart failed, log as follows:
> create failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863625 which is not greater than the current value of 2901863627 for OMKeyInfo{volume='vol-xxx', bucket='xxx', key='user/xxx/platform/xxx', dataSize='268435456', creationTime='1695088210914', objectID='-9223371293977687808', parentID='0', replication='RATIS/THREE', fileChecksum='null}
> 	at org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> 	at org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareFileInfo(OMKeyRequest.java:665)
> 	at org.apache.hadoop.ozone.om.request.key.OMKeyRequest.prepareKeyInfo(OMKeyRequest.java:623)
> 	at org.apache.hadoop.ozone.om.request.file.OMFileCreateRequest.validateAndUpdateCache(OMFileCreateRequest.java:255)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> rename failed:
> {noformat}
> java.lang.IllegalArgumentException: Trying to set updateID to 2901863669 which is not greater than the current value of 3076345041 for OMKeyInfo{volume='vol-xxx', bucket='xxx', key='checkative/xxx', dataSize='23124', creationTime='1695380440059', objectID='-9223371249310446848', parentID='0', replication='RATIS/THREE', fileChecksum='null}
> 	at org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:105)
> 	at org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:190)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:311)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerRouterRequestHandler.handleWriteRequest(OzoneManagerRouterRequestHandler.java:806)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:535)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:326)
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
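For reference, the guard that both quoted stacks trip in WithObjectID.setUpdateID is, in essence, a monotonicity check on updateID. A minimal sketch of that behavior, reconstructed from the error message (simplified, not the exact Ozone source):
{code:java}
// Minimal sketch of the monotonic updateID guard; reconstructed from the
// error message above, not the exact WithObjectID implementation.
public class WithObjectIDSketch {
  private long updateID = -1;

  public void setUpdateID(long newId) {
    if (newId <= updateID) {
      throw new IllegalArgumentException(String.format(
          "Trying to set updateID to %d which is not greater than the "
              + "current value of %d for %s", newId, updateID, this));
    }
    updateID = newId;
  }
}
{code}
Replaying raft entries from a stale snapshot re-applies old transactions against key metadata whose updateID already reflects much later transactions, so the check fires and the OM terminates on start.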