[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438143#comment-16438143 ]

Mike Percy commented on KUDU-2323:
----------------------------------

Based on a conversation with Todd in Slack, the underlying issue still looks like a race between the master adding and removing a replica in the cluster. A delayed or retried DeleteTablet() RPC seems more likely under heavy load, and adding back the same node after evicting it seems more likely the smaller the cluster is. However, we still have not been able to reproduce this particular bug.

> NON_VOTER replica flapping (repeatedly added and evicted)
> ---------------------------------------------------------
>
>                 Key: KUDU-2323
>                 URL: https://issues.apache.org/jira/browse/KUDU-2323
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Assignee: Alexey Serbin
>            Priority: Major
>
> In running a YCSB stress workload I see a tablet got into some state where
> the master flapped back and forth adding and then removing a replica as a
> NON_VOTER:
> {code}
> I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.725723 28045 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.752959 28052 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.767974 28047 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.772202 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.291569 28046 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.296468 28046 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.328945 28045 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.339675 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.387465 28045 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.394716 28047 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.398644 28047 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.405082 28047 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.409888 28048 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.414216 28046 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.417915 28048 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.423548 28048 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.453407 28045 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:36.552772 28048 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:58:01.300199 28053 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:58:01.426921 28046 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 22:01:37.779790 28051 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> {code}
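
The master log quoted in the description alternates REMOVE_PEER and ADD_PEER:NON_VOTER for a single tablet. As a rough aid for spotting the same pattern in other master logs, here is a minimal Python sketch that counts remove-then-add cycles per tablet. It is not part of Kudu; the function name `count_flaps` and the regular expression are assumptions modelled only on the line format shown in this ticket, and the input is expected to have one log entry per line (the email wrapping undone).

```python
import re
from collections import Counter

# Pattern modelled on the lines quoted in this ticket, for example:
# I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending
#   ChangeConfig:REMOVE_PEER on tablet <32 hex chars> (attempt 1)
LINE_RE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) \d+ "
    r"catalog_manager\.cc:\d+\] "
    r"Sending ChangeConfig:(?P<op>REMOVE_PEER|ADD_PEER:NON_VOTER) "
    r"on tablet (?P<tablet>[0-9a-f]{32})"
)


def count_flaps(lines):
    """Count REMOVE_PEER -> ADD_PEER:NON_VOTER transitions per tablet."""
    last_op = {}       # tablet id -> most recent operation seen
    flaps = Counter()  # tablet id -> number of remove-then-add cycles
    for line in lines:
        # Tolerate surrounding whitespace and an email quote prefix ("> ").
        match = LINE_RE.match(line.strip().lstrip("> "))
        if not match:
            continue
        tablet, op = match["tablet"], match["op"]
        if last_op.get(tablet) == "REMOVE_PEER" and op == "ADD_PEER:NON_VOTER":
            flaps[tablet] += 1
        last_op[tablet] = op
    return flaps


# First four entries of the log from the issue description.
SAMPLE = """\
I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
""".splitlines()

if __name__ == "__main__":
    for tablet, cycles in count_flaps(SAMPLE).items():
        print(f"{tablet}: {cycles} remove/add cycles")
```

A tablet with many cycles in a short window is a candidate for the flapping described here; the counter alone does not say why the same peer keeps being chosen.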
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416530#comment-16416530 ]

Mike Percy commented on KUDU-2323:
----------------------------------

The analysis in my previous comment was wrong. We actually Close() the PeerManager and then reinitialize it when we change the config in RaftConsensus::RefreshConsensusQueueAndPeersUnlocked(). So I'm not really sure what could cause the behavior seen in the description of this ticket.
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414801#comment-16414801 ]

Mike Percy commented on KUDU-2323:
----------------------------------

While the speed of this cycle was likely fixed by the patch for KUDU-2320, it appears there is no code path to remove a TrackedPeer when it gets evicted. While this could cause a minor resource leak until a leader was evicted in a 3-2-3 world, in a 3-4-3 world it affects last_communication_time and can therefore cause a downed NON_VOTER to be considered FAILED as soon as it is added to the config.

Maybe this is also interacting with KUDU-2354, in which there are certain cases that can cause a catalog manager task to endlessly retry adding a new replica.
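
The hypothesis in the comment above can be illustrated with a toy model. Note that the later comment in this thread (focusedCommentId=16416530) reports that this analysis was wrong because the PeerManager is closed and reinitialized on a config change, so the sketch shows only what the suspected mechanism would have looked like. It is plain Python, not Kudu code; `ToyQueue`, its methods, and `FAILURE_TIMEOUT_S` are hypothetical names and values chosen for illustration.

```python
FAILURE_TIMEOUT_S = 300.0  # hypothetical threshold, not a Kudu default


class ToyQueue:
    """Toy stand-in for a leader-side table of tracked peers."""

    def __init__(self):
        # peer uuid -> time of last successful communication, in seconds
        self.last_communication_time = {}

    def track_peer(self, uuid, now):
        # An existing entry is reused rather than reset, which is what
        # makes a leftover entry harmful in this model.
        self.last_communication_time.setdefault(uuid, now)

    def untrack_peer_leaky(self, uuid):
        # Hypothesized behavior: eviction leaves the entry in place.
        pass

    def untrack_peer_clean(self, uuid):
        # Behavior without the suspected leak: eviction drops the entry.
        self.last_communication_time.pop(uuid, None)

    def is_failed(self, uuid, now):
        return now - self.last_communication_time[uuid] > FAILURE_TIMEOUT_S


def readd_after_eviction(leaky):
    """Track a peer at t=0, evict it, re-add it at t=1000, report health."""
    queue = ToyQueue()
    queue.track_peer("peer-a", now=0.0)
    if leaky:
        queue.untrack_peer_leaky("peer-a")
    else:
        queue.untrack_peer_clean("peer-a")
    queue.track_peer("peer-a", now=1000.0)
    return queue.is_failed("peer-a", now=1000.0)
```

In this model `readd_after_eviction(leaky=True)` reports the re-added peer as failed immediately, because the stale timestamp from before the eviction is still in the table, while `readd_after_eviction(leaky=False)` does not.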
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407272#comment-16407272 ]

Mike Percy commented on KUDU-2323:
----------------------------------

It seems this should have been partially fixed by the fix for KUDU-2320.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373030#comment-16373030 ]

Jean-Daniel Cryans commented on KUDU-2323:
------------------------------------------

Or [~aserbin].
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372471#comment-16372471 ]

Todd Lipcon commented on KUDU-2323:
-----------------------------------

Not 100% following what's going on, but it seems like what happened is the following:
- the NON_VOTER peer had fallen behind the WAL retention
- it got evicted by the leader
- the master decided to add back the same node that just got evicted.
-- it sends a DELETE request to the TS to delete the old (un-catch-up-able) replica
-- simultaneously it sends an ADD_PEER request to the leader to add back the same peer as a non-voter

The two things race, and in this case for some reason the TS Admin Service was fairly blocked up and the Delete request took 10+ seconds to eventually get through. Thus, the leader received the ADD_PEER before the replica had been deleted, causing it to add back the same stale replica as before. Of course as soon as it connected it saw that it was stale and decided to evict it.

What I'm not 100% sure of is why it kept choosing the same peer over and over again. The cluster has 5 servers so seems like there should have been another choice. Perhaps due to the load balancing algorithm this one was chosen overwhelmingly often?

[~mpercy] do you mind taking a look since this might be a regression since 1.6?
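
The race described in Todd Lipcon's comment above comes down to the order in which two master-issued requests take effect. The following Python sketch is a deterministic toy model of that ordering, written only to make the sequence easier to follow. It is not Kudu code; the event names, the peer name `ts-1`, and the function `apply_events` are hypothetical.

```python
def apply_events(events):
    """Apply master-issued requests in the order they take effect.

    Returns the final NON_VOTER set in the leader's config and a list of
    notes describing what happened at each step.
    """
    ts_has_stale_replica = True  # the TS still holds the un-catch-up-able replica
    non_voters = set()           # NON_VOTER peers in the leader's config
    notes = []
    for event in events:
        if event == "DELETE_TABLET":
            # The delete request finally gets through on the tablet server.
            ts_has_stale_replica = False
            notes.append("stale replica deleted on TS")
        elif event == "ADD_PEER":
            non_voters.add("ts-1")
            if ts_has_stale_replica:
                # The leader connects, sees the replica is stale, evicts it.
                non_voters.discard("ts-1")
                notes.append("added stale replica, evicted again")
            else:
                notes.append("added peer with no stale replica present")
    return non_voters, notes


# Delayed delete: ADD_PEER reaches the leader first, so the peer flaps.
flapping = apply_events(["ADD_PEER", "DELETE_TABLET"])

# Delete lands first: the peer stays in the config.
settled = apply_events(["DELETE_TABLET", "ADD_PEER"])
```

In the first ordering the peer ends up evicted again, which matches one REMOVE_PEER/ADD_PEER pair in the log; repeating it reproduces the flapping only if the master keeps picking the same server, which is the part the comment leaves open.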