[
https://issues.apache.org/jira/browse/ZOOKEEPER-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227482#comment-16227482
]
ASF GitHub Bot commented on ZOOKEEPER-2684:
-------------------------------------------
Github user kfirlevari commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/411#discussion_r148119340
--- Diff:
src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
@@ -246,33 +246,51 @@ public void run() {
}
/*
- * Check if request is pending, if so, update it with
the
- * committed info
+ * Check if request is pending, if so, update it with
the committed info
*/
LinkedList<Request> sessionQueue = pendingRequests
.get(request.sessionId);
if (sessionQueue != null) {
// If session queue != null, then it is also not
empty.
Request topPending = sessionQueue.poll();
if (request.cxid != topPending.cxid) {
- LOG.error(
- "Got cxid 0x"
- +
Long.toHexString(request.cxid)
- + " expected 0x" +
Long.toHexString(
- topPending.cxid)
- + " for client session id "
- + Long.toHexString(request.sessionId));
- throw new IOException("Error: unexpected cxid
for"
- + "client session");
+ // TL;DR - we should not encounter this
scenario often under normal load.
+ // We pass the commit to the next processor
and put the pending back with a warning.
+
+ // Generally, we can get commit requests that
are not at the queue head after
+ // a session moved (see ZOOKEEPER-2684). Let's
denote the previous server of the session
+ // with A, and the server that the session
moved to with B (keep in mind that it is
+ // possible that the session already moved
from B to a new server C, and maybe C=A).
+ // 1. If request.cxid < topPending.cxid : this
means that the session requested this update
+ // from A, then moved to B (i.e., which is
us), and now B receives the commit
+ // for the update after the session already
performed several operations in B
+ // (and therefore its cxid is higher than that
old request).
+ // 2. If request.cxid > topPending.cxid : this
means that the session requested an updated
+ // from B with cxid that is bigger than the
one we know therefore in this case we
+ // are A, and we lost the connection to the
session. Given that we are waiting for a commit
+ // for that update, it means that we already
sent the request to the leader and it will
+ // be committed at some point (in this case
the order of cxid won't follow zxid, since zxid
+ // is an increasing order). It is not safe for
us to delete the session's queue at this
+ // point, since it is possible that the
session has newer requests in it after it moved
+ // back to us. We just leave the queue as it
is, and once the commit arrives (for the old
+ // request), the finalRequestProcessor will
see a closed cnxn handle, and just won't send a
+ // response.
+ // Also note that we don't have a local
session, therefore we treat the request
+ // like any other commit for a remote request,
i.e., we perform the update without sending
+ // a response.
+
+ LOG.warn("Got request " + request +
+ " but we are expecting request " +
topPending);
+ sessionQueue.addFirst(topPending);
+ } else {
+ // We want to send to the next processor our
version of the request,
+ // since it contains the session information
that is needed
+ // for post update processing (e.g., using
request.cnxn we send a response to the client).
+ topPending.setHdr(request.getHdr());
--- End diff --
Updated accordingly
> Fix a crashing bug in the mixed workloads commit processor
> ----------------------------------------------------------
>
> Key: ZOOKEEPER-2684
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2684
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.6.0
> Environment: with pretty heavy load on a real cluster
> Reporter: Ryan Zhang
> Assignee: Ryan Zhang
> Priority: Blocker
> Attachments: ZOOKEEPER-2684.patch
>
>
> We deployed our build with ZOOKEEPER-2024 and it quickly started to crash
> with the following error
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:24:42,305 - ERROR
> [CommitProcessor:2]
> -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
> – Got cxid 0x119fa expected 0x11fc5 for client session id 1009079ba470055
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:32:04,746 - ERROR
> [CommitProcessor:2]
> -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
> – Got cxid 0x698 expected 0x928 for client session id 4002eeb3fd0009d
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:34:46,648 - ERROR
> [CommitProcessor:2]
> -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
> – Got cxid 0x8904 expected 0x8f34 for client session id 51b8905c90251
> atla-buh-05-sr1.prod.twttr.net: 2017-01-18 22:43:46,834 - ERROR
> [CommitProcessor:2]
> -org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:268)
> – Got cxid 0x3a8d expected 0x3ebc for client session id 2051af11af900cc
> clearly something is not right in the new commit processor per session queue
> implementation.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)