[ 
https://issues.apache.org/jira/browse/HDDS-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057833#comment-17057833
 ] 

Shashikant Banerjee edited comment on HDDS-3086 at 3/12/20, 12:15 PM:
----------------------------------------------------------------------

2020-02-27 14:50:15,865 [Thread-1361] INFO client.GrpcClientProtocolService 
(GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed 
RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=162*, seq=0, Watch-ALL_COMMITTED(152), Message:<EMPTY>, 
reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=162, FAILED org.apache.ratis.protocol.NotLeaderException: Server 
11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, 
commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 
906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

2020-02-27 14:50:15,876 [Thread-1371] INFO client.GrpcClientProtocolService 
(GrpcClientProtocolService.java:lambda$processClientRequest$0(283)) - Failed 
RaftClientRequest:*client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=151*, seq=0, Watch-ALL_COMMITTED(135), Message:<EMPTY>, 
reply=RaftClientReply:client-E254C6160E81->11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1,
 cid=151, FAILED org.apache.ratis.protocol.NotLeaderException: Server 
11efd80a-6381-4dbb-8880-31de3a16794c@group-271AC8B241F1 is not the leader 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:172.17.0.2:46155, logIndex=0, 
commits[11efd80a-6381-4dbb-8880-31de3a16794c:c127, 
ac77a15b-49b5-4ecf-b448-2cbf40bbc057:c153, 
906a077a-ded3-4fb3-9302-78f8cc56c8ac:c153]

It looks like the raft client does not retry on a different server after 
getting a NotLeaderException. The likely cause is that the handleIOException 
function is not synchronized: as in the example quoted here, it can be called 
from different threads that share the same raft client instance, and each call 
can change the leaderID field in the RaftClientImpl instance.
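To make the suspected race concrete, here is a minimal sketch of one way to guard a shared leader field. It is not the Ratis code: the class LeaderTracker, the method onNotLeader, and the server names are all hypothetical and only stand in for the leaderID handling in RaftClientImpl. The idea is that the update is synchronized and applied only when the server that rejected the request is still the one the client believes to be the leader, so a late failure from another thread cannot overwrite a newer value.

```java
/**
 * Hypothetical stand-in for the leader bookkeeping of a raft client.
 * None of these names come from Ratis; this only illustrates the idea.
 */
final class LeaderTracker {
    // Shared by every thread that uses the same client instance.
    private String leaderId;

    LeaderTracker(String initialLeader) {
        this.leaderId = initialLeader;
    }

    synchronized String getLeaderId() {
        return leaderId;
    }

    /**
     * Called when a request sent to failedServer was rejected with a
     * "not the leader" reply that suggested suggestedLeader.
     *
     * The leader is switched only if failedServer is still the server this
     * client believes to be the leader. A stale failure report from another
     * thread is ignored instead of overwriting a newer value.
     *
     * @return true if this call changed the leader
     */
    synchronized boolean onNotLeader(String failedServer, String suggestedLeader) {
        if (suggestedLeader == null || !failedServer.equals(leaderId)) {
            return false;
        }
        leaderId = suggestedLeader;
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        LeaderTracker tracker = new LeaderTracker("server-A");

        // Two watch requests fail on server-A at about the same time and
        // both report server-B as the suggested leader.
        Thread t1 = new Thread(() -> tracker.onNotLeader("server-A", "server-B"));
        Thread t2 = new Thread(() -> tracker.onNotLeader("server-A", "server-B"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        System.out.println("leader = " + tracker.getLeaderId());
    }
}
```

With this shape, whichever thread reports the failure first switches the leader, and the second report becomes a no-op. Whether the actual fix should be a synchronized method or something else is an open question for the assignee.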

Even for the same call ID, it seems to be retrying on the same server instead 
of the suggested leader:

2020-03-11 13:49:41,230 [Thread-950] DEBUG impl.UnorderedAsync 
(UnorderedAsync.java:lambda$sendRequestWithRetry$4(98)) - *client-4364850591EA: 
attempt #2 failed 
RaftClientRequest:client-4364850591EA->7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759,
 cid=142*, seq=0, Watch-ALL_COMMITTED(135), null with {}
 org.apache.ratis.protocol.NotLeaderException: Server 
7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759 is not the leader 
2e82b5c1-6fda-4b17-971e-32def0d5960a:172.17.0.2:46771

2020-03-11 13:49:41,230 [Thread-950] DEBUG client.RaftClient 
(RaftClientImpl.java:handleIOException(365)) - client-4364850591EA: suggested 
new leader: 2e82b5c1-6fda-4b17-971e-32def0d5960a. Failed 
RaftClientRequest:*client-4364850591EA->7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759,
 cid=142*, seq=0, Watch-ALL_COMMITTED(135), null with {}
 org.apache.ratis.protocol.NotLeaderException: Server 
7924fb7f-519b-472c-bc71-e98527b4bbed@group-E21385DEE759 is not the leader 
2e82b5c1-6fda-4b17-971e-32def0d5960a:172.17.0.2:46771

[~swagle], can you have a look at this?
cc [~msingh]


> Failure running integration test it-freon 
> ------------------------------------------
>
>                 Key: HDDS-3086
>                 URL: https://issues.apache.org/jira/browse/HDDS-3086
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: freon
>            Reporter: Supratim Deka
>            Assignee: Siddharth Wagle
>            Priority: Major
>         Attachments: debug_output.zip, 
> org.apache.hadoop.fs.ozone.contract.ITestOzoneContractDistCp-output.txt, 
> org.apache.hadoop.ozone.freon.TestDataValidateWithDummyContainers-output.txt, 
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator-output.txt, 
> org.apache.hadoop.ozone.freon.TestRandomKeyGenerator.txt
>
>
> Observed a time-out during pr-check/it-freon for HDDS-2940. Failure appears 
> unrelated to the changes in the patch. 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.193 
> s - in org.apache.hadoop.ozone.freon.TestDataValidateWithUnsafeByteOperations
> [INFO] Running org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
> 30.559 s - in org.apache.hadoop.ozone.freon.TestFreonWithDatanodeRestart
> [INFO] 
> [INFO] Results:
> [INFO] 
> [WARNING] Tests run: 16, Failures: 0, Errors: 0, Skipped: 3
> [INFO] 
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time:  28:58 min
> [INFO] Finished at: 2020-02-26T17:55:42Z
> [INFO] 
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) 
> on project hadoop-ozone-integration-test: There was a timeout or other error 
> in the fork -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
