[ https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035970#comment-17035970 ]
Li Cheng commented on HDDS-3004:
--------------------------------

When I bounce the leader OM one more time, it turns out the new leader cannot be ready for over 5 hours.

2020-02-13 11:32:00,707 [IPC Server handler 165 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 165 on 9862, call Call#49 Retry#16 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:58760
org.apache.hadoop.ozone.om.exceptions.OMLeaderNotReadyException: om3@group-02A030565101 is in LEADER state but not ready yet.
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.processReply(OzoneManagerRatisServer.java:177)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:136)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:162)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:118)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
    at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

2020-02-13 14:53:10,393 [IPC Server handler 174 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 174 on 9862, call Call#99 Retry#16 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:59780
org.apache.hadoop.ozone.om.exceptions.OMLeaderNotReadyException: om3@group-02A030565101 is in LEADER state but not ready yet.
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.processReply(OzoneManagerRatisServer.java:177)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:136)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:162)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:118)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
    at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

No s3 writes can be successful.
[root@VM_50_210_centos ~]# ~/.local/bin/aws s3api --endpoint http://localhost:9878 put-object --bucket ozone-test-reproduce-123 --key test.in --body test.txt

An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
[root@VM_50_210_centos ~]# ~/.local/bin/aws s3api --endpoint http://localhost:9878 put-object --bucket ozone-test-reproduce-123 --key test.in --body test.txt

An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error

> S3 gateway write failure when leader OM is down and OM HA is enabled
> --------------------------------------------------------------------
>
>                 Key: HDDS-3004
>                 URL: https://issues.apache.org/jira/browse/HDDS-3004
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: om
>    Affects Versions: 0.4.0
>            Reporter: Li Cheng
>            Priority: Major
>
> Use S3 gateway to keep writing data into a specific s3 gateway endpoint.
> After the writer starts to work, I kill the OM process on the OM leader host.
> After that, the s3 gateway can never allow writing data and keeps reporting
> InternalError for all new coming keys.
> Process Process-488:
> S3UploadFailedError: Failed to upload ./20191204/file1056.dat to
> ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-489:
> S3UploadFailedError: Failed to upload ./20191204/file9631.dat to
> ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-490:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-491:
> S3UploadFailedError: Failed to upload ./20191204/file4220.dat to
> ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-492:
> S3UploadFailedError: Failed to upload ./20191204/file5523.dat to
> ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
> Process Process-493:
> S3UploadFailedError: Failed to upload ./20191204/file7520.dat to
> ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500)
> when calling the PutObject operation (reached max retries: 4): Internal
> Server Error
>
> That's a partial list and note that all keys are different. I also tried
> re-enabling the OM process on the previous leader OM, but it doesn't help
> since the leader has changed.
> Also attach partial OM logs:
> 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>     at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>     at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
> org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>     at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>
>
> Also attach the ozone-site.xml config to enable OM HA:
> <property>
>   <name>ozone.om.service.ids</name>
>   <value>OMHA</value>
> </property>
> <property>
>   <name>ozone.om.nodes.OMHA</name>
>   <value>om1,om2,om3</value>
> </property>
> <property>
>   <name>ozone.om.node.id</name>
>   <value>om1</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om1</name>
>   <value>9.134.50.210:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om2</name>
>   <value>9.134.51.215:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.OMHA.om3</name>
>   <value>9.134.51.25:9862</value>
> </property>
> <property>
>   <name>ozone.om.ratis.enable</name>
>   <value>true</value>
> </property>
> <property>
>   <name>ozone.enabled</name>
>   <value>true</value>
>   <tag>OZONE, REQUIRED</tag>
>   <description>
>     Status of the Ozone Object Storage service is enabled.
>     Set to true to enable Ozone.
>     Set to false to disable Ozone.
>     Unless this value is set to true, Ozone services will not be started in
>     the cluster.
>     Please note: By default ozone is disabled on a hadoop cluster.
>   </description>
> </property>
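
When checking an HA setup like the one quoted above, it can help to confirm that every node listed under `ozone.om.nodes.<serviceId>` has a matching `ozone.om.address.<serviceId>.<nodeId>` entry. The following is a minimal sketch in Python, not part of Ozone: the helper names `load_properties` and `om_addresses` are hypothetical, and the `<configuration>` root element is assumed here only so that the property fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# Properties copied from the ozone-site.xml excerpt in the issue description.
# The <configuration> wrapper is an assumption made for this sketch.
OZONE_SITE = """
<configuration>
  <property><name>ozone.om.service.ids</name><value>OMHA</value></property>
  <property><name>ozone.om.nodes.OMHA</name><value>om1,om2,om3</value></property>
  <property><name>ozone.om.address.OMHA.om1</name><value>9.134.50.210:9862</value></property>
  <property><name>ozone.om.address.OMHA.om2</name><value>9.134.51.215:9862</value></property>
  <property><name>ozone.om.address.OMHA.om3</name><value>9.134.51.25:9862</value></property>
</configuration>
"""


def load_properties(xml_text):
    """Return a {name: value} dict for every <property> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def om_addresses(props):
    """Map (service id, node id) to the configured address, or None if it is missing."""
    result = {}
    for service_id in props.get("ozone.om.service.ids", "").split(","):
        service_id = service_id.strip()
        if not service_id:
            continue
        for node in props.get("ozone.om.nodes." + service_id, "").split(","):
            node = node.strip()
            if not node:
                continue
            key = "ozone.om.address.{}.{}".format(service_id, node)
            result[(service_id, node)] = props.get(key)
    return result


if __name__ == "__main__":
    for (service_id, node), address in om_addresses(load_properties(OZONE_SITE)).items():
        print(service_id, node, address if address else "MISSING ADDRESS")
```

A `None` value in the returned mapping would indicate a node that is listed but has no address configured. This only checks the consistency of the config keys quoted in the issue; it says nothing about why the new leader stays in the not-ready state.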