[ https://issues.apache.org/jira/browse/KUDU-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-2748.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

Will fixed this in commit 28c706722.

> Leader master erroneously tries to tablet copy to a follower master due to 
> race at startup
> ------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2748
>                 URL: https://issues.apache.org/jira/browse/KUDU-2748
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Will Berkeley
>            Assignee: Will Berkeley
>            Priority: Major
>             Fix For: 1.10.0
>
>
> I was investigating KUDU-2734 and ran into a weird situation. The test runs 
> with 3 masters and changes the value of a flag on the masters. To effect the 
> change, it restarts the masters. Suppose the masters are labelled A, B, and 
> C. Somewhat rarely (e.g. 8% of the time when run in TSAN with 8 stress 
> threads), the following happens:
> 1. A and B are restarted successfully. They form a quorum and elect a leader 
> (say A).
> 2. C is in the process of restarting. The ConsensusService is registered and 
> C is accepting RPCs.
> 3. A sends C an UpdateConsensus RPC. However, C is still in the process of 
> starting and has not yet initialized the systable. As a result, C responds 
> to the UpdateConsensus call with TABLET_NOT_FOUND, even though the proper 
> response should be SERVICE_UNAVAILABLE.
> 4. A interprets TABLET_NOT_FOUND to mean that C needs to be copied to, and it 
> tries forever to tablet copy to C. The copies never start because tablet copy 
> is not implemented for masters.
> 5. C finishes its startup but does not receive UpdateConsensus from A because 
> A is sending StartTabletCopy requests. C calls pre-elections endlessly.
> This effectively means the cluster is running with two masters until there is 
> a leadership change. This caused the flakiness of 
> KsckRemoteTest.TestClusterWithLocation because C never recognizes the 
> leadership of A, so Ksck master consensus checks fail.
> A regular tablet on a tablet server is not vulnerable to this. It's specific 
> to how the master starts up.
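
The sequence in steps 1-5 can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kudu's code and not necessarily what the fixing commit does: the classes `Leader` and `Follower`, the `distinguish_startup` switch, and the `simulate` helper are all hypothetical names invented for illustration. It shows why a TABLET_NOT_FOUND reply during startup can wedge the leader into sending tablet copy requests forever, while a SERVICE_UNAVAILABLE reply lets it simply retry.

```python
from enum import Enum, auto


class Code(Enum):
    OK = auto()
    TABLET_NOT_FOUND = auto()
    SERVICE_UNAVAILABLE = auto()


class Follower:
    """Toy stand-in for master C (hypothetical, not Kudu code)."""

    def __init__(self, distinguish_startup):
        self.systable_ready = False
        # Assumption: when True, the follower reports "still starting"
        # instead of "tablet not found" while the systable is uninitialized.
        self.distinguish_startup = distinguish_startup

    def update_consensus(self):
        if self.systable_ready:
            return Code.OK
        if self.distinguish_startup:
            return Code.SERVICE_UNAVAILABLE
        return Code.TABLET_NOT_FOUND


class Leader:
    """Toy stand-in for master A (hypothetical, not Kudu code)."""

    def __init__(self):
        self.needs_tablet_copy = False

    def tick(self, follower):
        """Run one heartbeat period and return the name of the RPC sent."""
        if self.needs_tablet_copy:
            # Tablet copy is not implemented for masters, so in this model
            # the copy never completes and the flag is never cleared.
            return "StartTabletCopy"
        resp = follower.update_consensus()
        if resp is Code.TABLET_NOT_FOUND:
            self.needs_tablet_copy = True
        # On SERVICE_UNAVAILABLE the leader just retries on the next tick.
        return "UpdateConsensus"


def simulate(distinguish_startup, ticks=5, ready_after=2):
    """Return the list of RPCs the leader sends over `ticks` periods."""
    leader, follower = Leader(), Follower(distinguish_startup)
    sent = []
    for t in range(ticks):
        if t == ready_after:
            follower.systable_ready = True  # C finishes starting up
        sent.append(leader.tick(follower))
    return sent
```

With `simulate(False)` the leader sends one UpdateConsensus and then only StartTabletCopy, even after the follower becomes ready, which matches step 5. With `simulate(True)` every period sends UpdateConsensus, so the follower starts receiving updates as soon as its systable is initialized.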



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
