[ https://issues.apache.org/jira/browse/HDDS-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057833#comment-17057833 ]
Shashikant Banerjee edited comment on HDDS-3086 at 3/12/20, 12:15 PM:
----------------------------------------------------------------------

2020-02-27 14:50:15,865 [Thread-1361] INFO client.GrpcClientProtocolService (GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=162*, seq=0, Watch-ALL_COMMITTED(152), Message:<EMPTY>, reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=162, FAILED org.apache.ratis.protocol.NotLeaderException: Server 11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

2020-02-27 14:50:15,876 [Thread-1371] INFO client.GrpcClientProtocolService (GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=151*, seq=0, Watch-ALL_COMMITTED(135), Message:<EMPTY>, reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=151, FAILED org.apache.ratis.protocol.NotLeaderException: Server 11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

It looks like, on getting a NotLeaderException, the Raft client is not retrying on a different server. This is because the handleIOException function is not synchronised: it can be called from different threads using the same Raft client instance, as in the example quoted above (cid=162 and cid=151 fail on Thread-1361 and Thread-1371 for the same client), and each call can change the leaderID field in the RaftClientImpl instance.
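
To make the suspected race concrete, here is a minimal sketch of a leader field that is safe to update from several threads. It is not the RaftClientImpl code and not a proposed patch: LeaderTracker, getLeaderId and onNotLeader are hypothetical names, server IDs are plain strings, and ignoring a failure reported by a server that is no longer the current target is an assumption about the desired behaviour.

```java
// Hypothetical sketch only; none of these names exist in Ratis.
final class LeaderTracker {
  // Guarded by "this": read and written only inside synchronized methods.
  private String leaderId;

  LeaderTracker(String initialLeaderId) {
    this.leaderId = initialLeaderId;
  }

  /** Returns the server that the next attempt should be sent to. */
  synchronized String getLeaderId() {
    return leaderId;
  }

  /**
   * Handles a not-leader failure reported by failedServer.
   * The target is switched only if failedServer is still the current target
   * and a new leader was suggested, so a late failure from an older request
   * cannot overwrite a newer choice (assumed policy).
   * Returns true if the target was changed.
   */
  synchronized boolean onNotLeader(String failedServer, String suggestedLeader) {
    if (suggestedLeader == null || !failedServer.equals(leaderId)) {
      return false;
    }
    leaderId = suggestedLeader;
    return true;
  }
}
```

With this shape, two threads that both fail against the same server would each call onNotLeader with that server's ID; the first call switches the target and the second becomes a no-op instead of racing on the field.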
Even for the same call Id, it seems to be retrying on the same leader:

2020-03-11 13:49:41,230 [Thread-950] DEBUG impl.UnorderedAsync (UnorderedAsync.java:lambda$sendRequestWithRetry$4(98)) - *client-4364850591EA: attempt #2 failed RaftClientRequest:client-4364850591EA->7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759, cid=142*, seq=0, Watch-ALL_COMMITTED(135), null with {} org.apache.ratis.protocol.NotLeaderException: Server 7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759 is not the leader 2e82b5c1-6fda-4b17-971e-32def0d5960a:172.17.0.2:46771

2020-03-11 13:49:41,230 [Thread-950] DEBUG client.RaftClient (RaftClientImpl.java:handleIOException(365)) - client-4364850591EA: suggested new leader: 2e82b5c1-6fda-4b17-971e-32def0d5960a. Failed RaftClientRequest:*client-4364850591EA->7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759, cid=142*, seq=0, Watch-ALL_COMMITTED(135), null with {} org.apache.ratis.protocol.NotLeaderException: Server 7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759 is not the leader 2e82b5c1-6fda-4b17-971e-32def0d5960a:172.17.0.2:46771

[~swagle], can you have a look at this?
cc [~msingh]

was (Author: shashikant):

2020-02-27 14:50:15,865 [Thread-1361] INFO client.GrpcClientProtocolService (GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=162*, seq=0, Watch-ALL_COMMITTED(152), Message:<EMPTY>, reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=162, FAILED org.apache.ratis.protocol.NotLeaderException: Server 11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

2020-02-27 14:50:15,876 [Thread-1371] INFO client.GrpcClientProtocolService (GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=151*, seq=0, Watch-ALL_COMMITTED(135), Message:<EMPTY>, reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1, cid=151, FAILED org.apache.ratis.protocol.NotLeaderException: Server 11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

Looks like, on getting a NotLeaderException, the raft client is not retrying on a different server. This is because handleIOException function is not synchronised and can get called in different threads using the same raft client instance as the example quoted here and thereby changing the leaderID field in RaftClientImpl instance.

[~swagle], can you have a look at this?
cc [~msingh]

> Failure running integration test it-freon
> ------------------------------------------
>
> Key: HDDS-3086
> URL: https://issues.apache.org/jira/browse/HDDS-3086
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: freon
> Reporter: Supratim Deka
> Assignee: Siddharth Wagle
> Priority: Major
> Attachments: debug_output.zip,
>   org.apache.hadoop.fs.ozone.contract.ITestOzoneContractDistCp-output.txt,
>   org.apache.hadoop.ozone.freon.TestDataValidateWithDummyContainers-output.txt,
>   org.apache.hadoop.ozone.freon.TestRandomKeyGenerator-output.txt,
>   org.apache.hadoop.ozone.freon.TestRandomKeyGenerator.txt
>
> Observed a time-out during pr-check/it-freon for HDDS-2940. Failure appears unrelated to the changes in the patch.
>
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.193 s - in org.apache.hadoop.ozone.freon.TestDataValidateWithUnsafeByteOperations
> [INFO] Running org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 30.559 s - in org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [INFO]
> [INFO] Results:
> [INFO]
> [WARNING] Tests run: 16, Failures: 0, Errors: 0, Skipped: 3
> [INFO]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 28:58 min
> [INFO] Finished at: 2020-02-26T17:55:42Z
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There was a timeout or other error in the fork -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org