On Fri, Mar 1, 2019 at 11:14 AM Han Zhou <zhou...@gmail.com> wrote:
> On Fri, Mar 1, 2019 at 11:13 AM Han Zhou <zhou...@gmail.com> wrote:
> >
> > On Tue, Feb 26, 2019 at 1:57 PM Han Zhou <zhou...@gmail.com> wrote:
> > >
> > > I have been debugging OVSDB Raft scale testing problems recently,
> > > and here is a major finding. I thought there were different
> > > problems but it turned out that the only major problem is the
> > > prerequisite checking mechanism. My conclusion is that we'd better
> > > abandon the prerequisite checking and change to the alternative
> > > approach mentioned by Ben at an earlier OVN meeting: the follower
> > > sends the operations requested by clients directly to leader, without
> > > any prerequisite assumption. Here is the detailed analysis.
> > >
> > > The test environment is for ovn-scale-test, with 3-node clusters for
> > > NB and SB. There are 1k HVs connecting to the SB DB, connections
> > > randomly distributed to the 3 cluster nodes. The test creates 10k
> > > lports and binds them on the 1k HVs. The bottleneck is in SB DB only,
> > > because of the big number of clients and the continuous write requests
> > > from the clients (port-binding updates).
> > >
> > > The first problem I encountered was that the leader re-election
> > > happened a lot during the testing because the leader wasn't able to
> > > respond to heartbeat messages in time, due to the CPU load. The ideal
> > > solution would be a dedicated thread for Raft message handling,
> > > but I tried to circumvent the problem and see what happens by
> > > increasing the election base time from 1024ms to 10000ms. The
> > > heartbeat messages are sent at 1/3 of the expiry time, so now they
> > > are sent out roughly every 3.3 sec.
> > >
> > > With the change, the leader re-election problem was solved. However,
> > > it revealed the real reason the CPU load was high. The CPU load
> > > stayed at 100% on the leader, and the rate of INFO logs on
> > > prerequisite check failures was extremely high. For example:
> > > 2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages
> > > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > > rate
> > > 2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid
> > > b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite
> > > 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
> > > 2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages
> > > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > > rate
> > > 2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid
> > > c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite
> > > a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
> > > 2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages
> > > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > > rate
> > > 2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid
> > > 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite
> > > def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request
> > >
> > > This is ~25K failures per second (about 300K dropped messages every
> > > 12 seconds). It is caused by the prereq check failures and the
> > > continuous retries from the followers.
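
As a quick sanity check on that rate, the rate-limiter summary lines quoted
above can be turned into a messages-per-second figure. This is only a small
Python sketch: the regular expression and the name dropped_rates are
hypothetical, chosen for illustration, and the result counts only the
messages the logger reports as dropped, so it is a rough estimate rather
than an exact failure count.

```python
import re

# Rate-limiter summary lines copied from the leader log quoted above
# (re-joined onto one line each).
LOG_LINES = [
    "2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages "
    "in last 12 seconds (most recently, 0 seconds ago) due to excessive rate",
    "2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages "
    "in last 12 seconds (most recently, 0 seconds ago) due to excessive rate",
    "2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages "
    "in last 12 seconds (most recently, 0 seconds ago) due to excessive rate",
]

# Hypothetical pattern for the summary line; it matches the text above only.
DROPPED_RE = re.compile(r"Dropped (\d+) log messages in last (\d+) seconds")


def dropped_rates(lines):
    """Return dropped messages per second for each matching summary line."""
    rates = []
    for line in lines:
        match = DROPPED_RE.search(line)
        if match:
            count, seconds = int(match.group(1)), int(match.group(2))
            rates.append(count / seconds)
    return rates


rates = dropped_rates(LOG_LINES)
average_rate = sum(rates) / len(rates)
print(f"average dropped messages per second: {average_rate:.0f}")
```

For the three samples above this comes out at roughly 25K messages per
second.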
> > >
> > > My first reaction was to avoid the continuous retry from followers by
> > > checking if the db's prereq has changed, since it makes no sense to
> > > retry with the same prereq as the previous failed attempt. I made the
> > > change and tried again. It did solve the CPU problem, but revealed two
> > > new problems.
> > >
> > > Problem 1):
> > > There were still hundreds of prerequisite check failures per second
> > > in the leader's logs, and the whole scale test execution was
> > > extremely slow, just as in the previous high CPU load situation (6
> > > hours to complete the whole test). By checking Raft messages in debug
> > > logs, I found out that this is because there are so many clients
> > > requesting updates to the DB, and each follower can pile up a batch of
> > > transactions from different clients and then send them one by one to
> > > the leader, all with *the same prereq*. In this case, even if there is
> > > no racing with other nodes, only the first request from a given
> > > follower succeeds, and the later ones are certain to fail the prereq
> > > check, because the first transaction proposed by that follower
> > > causes the leader to update its expected prereq. I tried another
> > > patch to solve the problem, by updating the follower's "prereq" to
> > > the "result" - the eid of the last proposed commit - so that the
> > > *batch* of transactions proposed by the same follower would either
> > > all succeed if there is no racing from other nodes, or all fail if
> > > there is racing. This patch worked, and the
> > > whole test completed in 2 hours, similar to the result when using
> > > standalone DB. However, I realized that this trick violates basic
> > > transactional correctness. The proposed commit should be based on a
> > > previously committed version. If the previously proposed transaction
> > > in the *batch* is not in committed state, setting its eid as prereq is
> > > wrong. Going further, the attempt to propose a batch of transactions
> > > seems to be wrong - each transaction should be examined on top of
> > > previously committed transactions (unless we are doing concurrency
> > > control such as MVCC, but obviously that is not the case in the
> > > current OVSDB design). So, the prereq has to be the last applied eid
> > > on the follower side (as is implemented today), to ensure correctness.
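
To make the batching effect in problem 1 concrete, here is a toy Python
model of the leader-side check. It is not OVSDB code: ToyLeader,
execute_command and propose_batch are hypothetical names, and the only
behavior modeled is "accept the request if its prereq equals the leader's
current eid, then move the current eid to the request's result".

```python
import uuid


class ToyLeader:
    """Hypothetical stand-in for the leader's prereq check (not OVSDB code)."""

    def __init__(self, initial_eid):
        self.current_eid = initial_eid
        self.failures = 0

    def execute_command(self, prereq, result):
        # Reject the proposal unless it was built on the current entry.
        if prereq != self.current_eid:
            self.failures += 1
            return False
        # Accepting the proposal moves the expected prereq forward.
        self.current_eid = result
        return True


def propose_batch(leader, follower_applied_eid, batch_size):
    """Send batch_size proposals that all carry the same prereq."""
    outcomes = []
    for _ in range(batch_size):
        result = uuid.uuid4()  # eid of the newly proposed entry
        outcomes.append(leader.execute_command(follower_applied_eid, result))
    return outcomes


p1 = uuid.uuid4()
leader = ToyLeader(p1)
outcomes = propose_batch(leader, p1, 3)
print(outcomes, leader.failures)
```

With no other node racing, only the first proposal of the batch is
accepted; the remaining two fail because the leader's expected prereq has
already moved on.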
> > >
> > > Because of this, the only optimization we can do with the prereq
> > > checking mechanism is probably proposing commits to leader only when:
> > > a) the raft applied index is the same as the latest raft eid in the
> > > follower's raft log - because otherwise it indicates that new
> > > proposals have already happened from other nodes, and it would just
> > > waste CPU to propose a commit which would fail the prereq check; and
> > > b) there is no pending proposal from the current follower node that
> > > has been sent to the leader with the same prereq - because otherwise
> > > the following proposal would fail the prereq check, too.
> >
> > I submitted a patch for a), which is low-hanging fruit. b) is not
> > difficult to do, but it does not seem to provide much value since a)
> > already solves the CPU problem (though not the latency, "problem 2"
> > described below).
> Sorry, I forgot to paste the link to the patch:
> https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/356891.html
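
The two conditions a) and b) can be sketched as a single guard on the
follower side. This is a Python illustration only, not the linked patch:
should_propose and its parameters are hypothetical, and condition a) is
modeled by comparing the eid of the last applied entry with the eid of the
last entry in the follower's log.

```python
def should_propose(last_applied_eid, last_log_eid, pending_prereqs):
    """Decide whether proposing a commit to the leader can possibly succeed.

    last_applied_eid: eid of the last entry applied on this follower.
    last_log_eid:     eid of the last entry in this follower's raft log.
    pending_prereqs:  set of prereqs used by proposals still in flight.
    """
    # a) The log already holds entries that are not applied yet, so a
    #    proposal based on last_applied_eid would fail the prereq check.
    if last_applied_eid != last_log_eid:
        return False
    # b) An in-flight proposal already uses this prereq, so a second one
    #    with the same prereq would fail once the first is accepted.
    if last_applied_eid in pending_prereqs:
        return False
    return True


print(should_propose("e1", "e1", set()))
print(should_propose("e1", "e2", set()))
print(should_propose("e1", "e1", {"e1"}))
```

Only the first call returns True; the second fails condition a) and the
third fails condition b).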
> >
> > >
> > > This optimization can solve the CPU load problem, but it
> > > would not solve the performance problem. The scale test would still
> > > be slow, due to problem 2) described below.
> > >
> > > Problem 2):
> > > There is a latency introduced by the prereq check mechanism. Take a
> > > simple scenario as an example: assume there is no racing with other
> > > nodes, and follower A is currently the only node proposing
> > > commits. Assume there are 3 transactions T1 - T3 accumulated in the
> > > current main loop iteration to be sent to the leader. The Raft message
> > > sequence would be as follows:
> > >
> > > 1. Follower A: send CommandExec(T1) using prereq P1
> > > 2. Leader: handle T1, send AppendReq(T1) to all followers
> > > 3. Follower A: receive AppendReq(T1), update local log_end (but not
> > > updating commit_index or last_applied), reply to leader
> > > 4. Leader: receive replies from a majority, reply success to Follower A
> > > for the CommandExec(T1), with the new commit_index
> > > 5. Follower A: handle reply for CommandExec(T1). However, it does
> > > nothing but some cleanup. The commit_index in the reply does not seem
> > > to be used. (This is correct since Raft doesn't suggest any action at
> > > this point)
> > > 6. ... 1/3 heartbeat timer reached ...
> > > 7. Leader: send AppendReq(heartbeat) with the new commit_index to 
> > > follower A
> > > 8. Follower A: handle AppendReq(heartbeat) and update local
> > > commit_index, and now follower A can apply T1 to its local committed
> > > data, and T2 can be proposed to leader with the new prereq P2.
> > > 9. Follower A: send CommandExec(T2) using prereq P2.
> > > ...
> > >
> > > As we can see, just to complete T1 there is a latency caused by the
> > > heartbeat timer at step 6, and T2 cannot start until step 8 (it can
> > > start earlier, as currently implemented, but that only results in
> > > prereq check failures and useless retries wasting more CPU
> > > cycles). So the batch of transactions in follower A has to be
> > > executed one by one, each waiting at least one heartbeat interval
> > > (1/3 of the election timer). This causes big latencies and very low
> > > performance overall. An optimization may be done at step 5, when the
> > > follower receives the CommandExec reply, to move forward the local
> > > commit_index without waiting for the next heartbeat. However, I am
> > > not sure if it introduces any problem since it is not something
> > > mentioned in the Raft paper.
> > >

I submitted the patch that solves problem 2) here:
It immediately sends out the next heartbeat to all followers to sync
the new commit index, without waiting for the next heartbeat interval. It
is better than carrying commit_index in the CommandExecute reply because
the CommandExecute reply goes only to the follower that requested the
update. The heartbeat is sent to all followers so no one will be left
behind. It is also cleaner to keep CommandExecute and AppendRequest
independent. This patch still doesn't execute the batch of requests in
a pipelined way (which can only be achieved by avoiding the prereq check
mechanism), but the major latency problem is solved (with a very minimal
change). I ran the OVN scale test again with this patch (and adjusted
ELECTION_BASE to 10000 to avoid leader re-election during CPU spikes
such as when doing a snapshot); the overall CPU usage was not high and
the whole test finished in 2 hours, just like the standalone situation.
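
A back-of-the-envelope model of the latency in problem 2, written as a
Python sketch. Everything in it is an assumption made for illustration: the
function names, the 1 ms round-trip time, and the simplification that
without an immediate heartbeat each transaction waits one full heartbeat
interval (1/3 of the election base time) before the next one can be
proposed. It is not a measurement of the patch.

```python
def heartbeat_interval_ms(election_base_ms):
    """Heartbeats are sent at 1/3 of the election base time."""
    return election_base_ms / 3


def batch_latency_ms(num_txns, election_base_ms, rtt_ms, immediate_heartbeat):
    """Toy estimate for a batch executed strictly one transaction at a time.

    Each transaction costs one round trip for the CommandExec/AppendReq
    exchange, plus the wait until the follower learns the new commit index.
    """
    if immediate_heartbeat:
        # Assumed: the leader syncs the commit index right away.
        sync_wait_ms = rtt_ms
    else:
        # Assumed: the follower waits for the next periodic heartbeat.
        sync_wait_ms = heartbeat_interval_ms(election_base_ms)
    return num_txns * (rtt_ms + sync_wait_ms)


slow = batch_latency_ms(3, 10000, 1, immediate_heartbeat=False)
fast = batch_latency_ms(3, 10000, 1, immediate_heartbeat=True)
print(f"periodic heartbeat: {slow:.0f} ms, immediate heartbeat: {fast} ms")
```

Under these assumptions, three transactions take about 10 seconds when
each one waits for the periodic heartbeat, against a few milliseconds when
the commit index is synced immediately. The execution is still serial in
both cases, which matches the point that the batch is not pipelined.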

> > > Because of the above problems, I believe the alternative approach
> > > would be better than the prereq check mechanism and solve the problems
> > > easily. Instead of sending the final result to leader, followers can
> > > simply forward the "transaction" RPC from client to leader, and let
> > > leader perform the ovsdb_execute_compose() and ovsdb_txn_precommit()
> > > on top of the latest data on leader. This way, both problems are
> > > solved:
> > >
> > > For problem 1), since leader will do the real transaction parsing and
> > > execution, the prerequisite is not needed - in fact there is no
> > > defined order between the transactions forwarded by followers, since
> > > they originally came from different clients, so leader has the freedom
> > > to apply them in any order it considers as consistent. For problem 2),
> > > a batch of requests can be sent to the leader without delay.

> > > Only the
> > > final commit_index update for the last transaction in the batch will
> > > experience the delay caused by the heartbeat timer (similar to step
> > > 6 above), which seems to be an inherent problem of the Raft design?
The Raft paper doesn't mention this explicitly (at least I didn't find it),
but now I think it is not an inherent problem, because as demonstrated
by the patch it can be solved easily.
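
For comparison, the alternative approach (the leader executes forwarded
transactions on top of its own latest data) can be sketched in Python as
follows. This is a toy under stated assumptions: ToyExecutingLeader and
bind_port are hypothetical names, a transaction is modeled as a plain
function over a dict of rows, and nothing here calls the real
ovsdb_execute_compose() or ovsdb_txn_precommit().

```python
class ToyExecutingLeader:
    """Hypothetical leader that runs forwarded transactions itself."""

    def __init__(self):
        self.data = {}      # latest committed rows, e.g. port -> chassis
        self.committed = 0  # number of committed transactions

    def execute(self, txn):
        # Run the transaction against a copy of the latest data, so an
        # exception raised by txn would leave the committed data untouched.
        working = dict(self.data)
        txn(working)
        self.data = working
        self.committed += 1
        return True


def bind_port(port, chassis):
    """Build a toy 'transaction' that records a port binding."""
    def txn(rows):
        rows[port] = chassis
    return txn


leader = ToyExecutingLeader()
# A batch forwarded by one follower; no prereq is attached to any request.
batch = [
    bind_port("lport1", "hv1"),
    bind_port("lport2", "hv1"),
    bind_port("lport3", "hv2"),
]
results = [leader.execute(txn) for txn in batch]
print(results, leader.data)
```

Because each transaction is evaluated against whatever the leader has
committed so far, the whole batch is accepted in one pass and the follower
does not need to wait between requests.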

> > >
> > > There are side-effects of this alternative approach. The leader will
> > > do a little more work than before, but compared with the wasted CPU
> > > for handling the useless failure retries this is nothing. Even if the
> > > previously mentioned optimizations are implemented for the prereq
> > > mechanism to avoid the CPU waste, the alternative approach would still
> > > beat the prereq mechanism because of the overall performance advantage
> > > (because there is no delay in sending requests to the leader).
> > >
> > > Ben, please correct me if I misunderstood anything and suggest the next 
> > > steps.
> > >

Now that the two major problems related to the prereq check are both
alleviated significantly by relatively simple changes, I think
changing the prereq check mechanism to do-everything-on-leader is
less urgent. Whether it is still worth doing will depend on the
trade-off between the gains (even fewer conflicts and retries,
pipelined execution, etc.) and the side-effects (leader doing more
work).
