[ https://issues.apache.org/jira/browse/KUDU-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-2748.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

Will Berkeley fixed this in commit 28c706722.

> Leader master erroneously tries to tablet copy to a follower master due to
> race at startup
> ------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2748
>                 URL: https://issues.apache.org/jira/browse/KUDU-2748
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Will Berkeley
>            Assignee: Will Berkeley
>            Priority: Major
>             Fix For: 1.10.0
>
>
> I was investigating KUDU-2734 and ran into a weird situation. The test runs
> with 3 masters and changes the value of a flag on the masters. To effect the
> change, it restarts the masters. Suppose the masters are labelled A, B, and
> C. Somewhat rarely (e.g. 8% of the time when run in TSAN with 8 stress
> threads), the following happens:
> 1. A and B are restarted successfully. They form a quorum and elect a leader
> (say A).
> 2. C is in the process of restarting. The ConsensusService is registered and
> C is accepting RPCs.
> 3. A sends C an UpdateConsensus RPC. However, C is still in the process of
> starting and has not yet initialized the systable. As a result, C responds to
> the UpdateConsensus call with TABLET_NOT_FOUND, even though the proper
> response would be SERVICE_UNAVAILABLE.
> 4. A interprets TABLET_NOT_FOUND to mean that C needs to be copied to, and it
> tries forever to tablet copy to C. The copies never start because tablet copy
> is not implemented for masters.
> 5. C finishes its startup but does not receive UpdateConsensus from A because
> A is sending StartTabletCopy requests. C calls pre-elections endlessly.
> This effectively means the cluster is running with two masters until there is
> a leadership change. This caused the flakiness of
> KsckRemoteTest.TestClusterWithLocation because C never recognizes the
> leadership of A, so Ksck master consensus checks fail.
> A regular tablet on a tablet server is not vulnerable to this. It's specific
> to how the master starts up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
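
The startup race in steps 1-5 of the description can be illustrated with a small toy model. This is only a sketch of the reported behavior, not Kudu's code. All class, method, and parameter names below (FollowerMaster, LeaderMaster, simulate, report_unavailable_while_starting) are hypothetical and chosen for illustration. The message does not say what commit 28c706722 actually changes. The sketch only contrasts the reported TABLET_NOT_FOUND response with the SERVICE_UNAVAILABLE response the description calls proper.

```python
from enum import Enum, auto

# Labels for what the leader sends in each round (illustrative strings only).
UPDATE_CONSENSUS = "UpdateConsensus"
TABLET_COPY = "tablet copy request"


class Status(Enum):
    OK = auto()
    TABLET_NOT_FOUND = auto()
    SERVICE_UNAVAILABLE = auto()


class FollowerMaster:
    """Toy stand-in for master C. Hypothetical, not a Kudu class."""

    def __init__(self, report_unavailable_while_starting):
        self.systable_initialized = False
        self.recognizes_leader = False
        # Assumption: a flag to switch between the reported behavior (False)
        # and the response the description says would be proper (True).
        self.report_unavailable_while_starting = report_unavailable_while_starting

    def finish_startup(self):
        self.systable_initialized = True

    def update_consensus(self):
        if not self.systable_initialized:
            if self.report_unavailable_while_starting:
                return Status.SERVICE_UNAVAILABLE
            return Status.TABLET_NOT_FOUND
        self.recognizes_leader = True
        return Status.OK

    def tablet_copy(self):
        # Tablet copy is not implemented for masters, so it never starts.
        return False


class LeaderMaster:
    """Toy stand-in for master A. Hypothetical, not a Kudu class."""

    def __init__(self):
        self.thinks_follower_needs_copy = False

    def tick(self, follower):
        """Run one round toward the follower and return what was sent."""
        if self.thinks_follower_needs_copy:
            follower.tablet_copy()
            return TABLET_COPY
        status = follower.update_consensus()
        if status is Status.TABLET_NOT_FOUND:
            # Step 4: the leader concludes the follower must be copied to.
            self.thinks_follower_needs_copy = True
        return UPDATE_CONSENSUS


def simulate(report_unavailable_while_starting, rounds=5, startup_done_at=2):
    leader = LeaderMaster()
    follower = FollowerMaster(report_unavailable_while_starting)
    sent = []
    for i in range(rounds):
        if i == startup_done_at:
            follower.finish_startup()
        sent.append(leader.tick(follower))
    return sent, follower.recognizes_leader


if __name__ == "__main__":
    print(simulate(report_unavailable_while_starting=False))
    print(simulate(report_unavailable_while_starting=True))
```

In this model, the first call shows the leader switching permanently to tablet copy requests after a single TABLET_NOT_FOUND, so the follower never recognizes the leader even after its startup completes. The second call shows the leader retrying UpdateConsensus until the follower is ready.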