[
https://issues.apache.org/jira/browse/KUDU-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713618#comment-15713618
]
Todd Lipcon commented on KUDU-1779:
-----------------------------------
Here are some notes on the state of the tablet:
- on all three replicas, the transaction tracker is at the limit
-- the tracker has 5-10 transactions which have been assigned OpIds and many
other LEADER transactions that have no OpId assigned yet (i.e. they are still
in the prepare queue)
--- these were submitted to the prepare queue at some point in the past when
the replica was LEADER; in many cases the replica is no longer LEADER, and we
know the operations will eventually fail once they make it out of the prepare
queue
-- the ops themselves have some row key conflicts: one op has been prepared
and is waiting to be replicated, while another op is blocked waiting to obtain
the same row key lock. This has stalled the prepare thread, preventing any
further progress
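The head-of-line blocking described above can be modeled with a small sketch (a hypothetical single-threaded prepare loop, not Kudu's actual tablet code): because preparation is serialized and row locks are held until an in-flight op finishes replication, one blocked op starves everything queued behind it.

```cpp
#include <deque>
#include <set>
#include <string>

// Hypothetical model of a serialized prepare queue: each op must take
// its row lock before it can be prepared, and locks are only released
// once the op finishes replication. A single held lock therefore stalls
// the entire queue (head-of-line blocking).
struct PrepareQueue {
  std::set<std::string> held_row_locks;  // locks held by in-flight ops
  std::deque<std::string> pending;       // row keys of queued ops

  // Returns the number of ops prepared before the queue stalled.
  int DrainUntilBlocked() {
    int prepared = 0;
    while (!pending.empty()) {
      const std::string& row = pending.front();
      if (held_row_locks.count(row)) {
        break;  // blocked on a held row lock; no further progress
      }
      held_row_locks.insert(row);  // prepared; lock held until replication
      pending.pop_front();
      ++prepared;
    }
    return prepared;
  }
};
```

In the state above, the prepared-but-unreplicated op holds the lock, and replication cannot complete because the tracker is full, so the queue never drains.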
- the leader's queue state looks like this:
{code}
Peer: 2d52a82cb62c4064842cf628d01083fe
Is new: false
Last received: 151.174779
Next index: 174780
Last known committed idx: 174762
Last exchange result: ERROR
Needs tablet copy: false
Peer: 1d2e1b44012e419392279a8e674638cb
Is new: false
Last received: 145.174776
Next index: 174777
Last known committed idx: 174762
Last exchange result: ERROR
Needs tablet copy: false
(LEADER) Peer: c101883d3e82496989a5f9f667c30e38
Is new: false
Last received: 153.174786
Next index: 174787
Last known committed idx: 174762
Last exchange result: SUCCESS
Needs tablet copy: false
Consensus queue metrics:
Only Majority Done Ops: 174762
In Progress Ops: 24
Cache: LogCacheStats(num_ops=1115, bytes=54426033)
{code}
... its two followers are in the 'ERROR' state because every attempt to
replicate to them gets a CANNOT_PREPARE error.
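For reference, with three voters the majority-replicated watermark should be the largest index present on at least two peers. A minimal sketch of that calculation (a hypothetical helper, not Kudu's actual PeerMessageQueue implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: given each peer's last_received index, the
// majority-replicated watermark is the largest index known to be on a
// majority of peers. Sorting descending, that is the entry at position
// (majority_size - 1).
int64_t MajorityWatermark(std::vector<int64_t> last_received) {
  std::sort(last_received.begin(), last_received.end(),
            std::greater<int64_t>());
  const size_t majority = last_received.size() / 2 + 1;
  return last_received[majority - 1];
}
```

Fed the 'Last received' indexes above (174779, 174776, 174786), this yields 174779; the queue instead reports 174762 because the ERROR responses never feed into the calculation.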
I captured an RPC exchange on a replica using /tracing.html:
{code}
request:
preceding: 151.174779
committed_index: 174762
ops: 151.174780 - 153.174786
response:
code: CANNOT_PREPARE:
last_received: 151.174779
last_received_current_leader: 0.0
last_committed: 174762
{code}
The follower is logging:
{code}
I1201 17:06:01.941049 111261 raft_consensus.cc:866] T
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161
FOLLOWER]: Deduplicated request from leader. Original:
144.174762->[144.174763-161.174790] Dedup: 151.174779->[151.174780-161.174790]
W1201 17:06:01.941187 111261 raft_consensus.cc:1218] T
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161
FOLLOWER]: Could not prepare transaction for op: 151.174780. Suppressed 10
other warnings. Status for this op: Service unavailable: Transaction failed,
tablet 9f7e54d8b37645c084a46fc6b27ddd34 transaction memory consumption
(67108557) has exceeded its limit (67108864) or the limit of an ancestral
tracker
I1201 17:06:01.941339 111261 raft_consensus.cc:1232] T
9f7e54d8b37645c084a46fc6b27ddd34 P 2d52a82cb62c4064842cf628d01083fe [term 161
FOLLOWER]: Rejecting Update request from peer c101883d3e82496989a5f9f667c30e38
for term 161. Could not prepare a single transaction due to: Service
unavailable: Transaction failed, tablet 9f7e54d8b37645c084a46fc6b27ddd34
transaction memory consumption (67108557) has exceeded its limit (67108864) or
the limit of an ancestral tracker
{code}
The leader is logging:
{code}
W1201 17:06:34.255569 182530 wire_protocol.cc:150] Unknown error code in
status:
W1201 17:06:34.255590 182530 consensus_peers.cc:328] T
9f7e54d8b37645c084a46fc6b27ddd34 P c101883d3e82496989a5f9f667c30e38 -> Peer
2d52a82cb62c4064842cf628d01083fe (ve1120.halxg.cloudera.com:7050): Couldn't
send request to peer 2d52a82cb62c4064842cf628d01083fe for tablet
9f7e54d8b37645c084a46fc6b27ddd34. Status: Runtime error: (unknown error code).
Retrying in the next heartbeat period. Already trie
{code}
So it seems like there are a few things going wrong here:
- when the follower responds with 'CANNOT_PREPARE', the leader should still
take that follower's 'last_received' index into account in order to update the
majority-replicated watermark. Note that here the majority 'last_received' is
index 174779, but the queue watermark is calculated at 174762.
- the leader does not understand the error response being sent back, and logs
that odd 'Unknown error code in status' message. This may contribute to the
issue above.
- we seem to be lacking appropriate test coverage of cases where the
transaction tracker gets full
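A sketch of the first fix (hypothetical types; Kudu's real response handling lives in the consensus queue code and is more involved): record the follower's last_received before taking the error path, so that CANNOT_PREPARE responses still advance the inputs to the watermark calculation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical response/peer structs for illustration only; Kudu's
// real types are more involved.
enum class ErrorCode { NONE, CANNOT_PREPARE };

struct PeerStatus {
  int64_t last_received = 0;  // feeds the majority watermark
  bool last_exchange_ok = false;
};

struct ConsensusResponse {
  ErrorCode error = ErrorCode::NONE;
  int64_t last_received = 0;
};

// Sketch of the proposed behavior: even on CANNOT_PREPARE, the follower
// has durably received the ops up to last_received, so advance the
// peer's watermark input before marking the exchange as failed.
void HandleResponse(const ConsensusResponse& resp, PeerStatus* peer) {
  peer->last_received = std::max(peer->last_received, resp.last_received);
  peer->last_exchange_ok = (resp.error == ErrorCode::NONE);
}
```

With this behavior, the commit index could advance past 174779 despite the ERROR exchanges, letting the in-flight transactions commit and the trackers drain.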
> Consensus "stuck" with all transaction trackers are at limit
> ------------------------------------------------------------
>
> Key: KUDU-1779
> URL: https://issues.apache.org/jira/browse/KUDU-1779
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.1.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> In a stress cluster, I saw one tablet get "stuck" in the following state:
> - the transaction_tracker on all three replicas is "full" (no more can be
> submitted)
> - leader elections proceed just fine, but no leader is able to advance the
> commit index
> The issue seems to be that a replica will respond with 'CANNOT_PREPARE' when
> its transaction tracker is full. The leader then ignores this response, and
> doesn't advance the majority-replicated watermark. The transaction tracker
> stays full forever because the in-flight transactions can't get committed.
> Notes to follow.