[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193507#comment-13193507 ]
Jeremy Stribling commented on ZOOKEEPER-1367:
---------------------------------------------

QA was able to reproduce it. The good news is that NTP is working properly, so the logs should be a bit saner; also, I was able to salvage all the logs for the node that has the incorrectly-unexpired session. The bad news is that I don't have any usable logs for one of the nodes, and for the other I have only the most recent logs. I hope this helps at least a little.

This time around, all three nodes are live when the problem happens. There are three nodes: 90.0.0.1/208/node1, 90.0.0.2/30/node2, 90.0.0.3/119/node3. node2 is the one in the bad state:

{noformat}
[zk: 90.0.0.2:2888(CONNECTED) 0] ls /election/zkrsm
[nominee0000000034, nominee0000000035, nominee0000000032, nominee0000000036]
[zk: 90.0.0.2:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000032
90.0.0.1:7777
cZxid = 0xe00000017
ctime = Wed Jan 25 20:53:15 UTC 2012
mZxid = 0xe00000017
mtime = Wed Jan 25 20:53:15 UTC 2012
pZxid = 0xe00000017
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1e3516a4bb770000
dataLength = 14
numChildren = 0
{noformat}

/election/zkrsm/nominee0000000032 is owned by a session (0x1e3516a4bb770000) that should be expired. Connecting to the other two nodes shows only the latest three znodes.

I have uploaded a tar file containing the datadirs of all three nodes, the logs for all runs of node2, and the logs of the latest run of node3, to: http://pdos.lcs.mit.edu/~strib/ZOOKEEPER-1367-2.tgz (80 MB zipped, 2.8 GB unzipped)

Here's how the logs break down:

* run0.tgz: You can ignore this one; it's from a preliminary run that was not part of the cluster.
* run1.tgz: This is where the cluster is built up using restarts. In node2's case, it is first started with two nodes (node1 and node2). Then that QuorumPeer is shut down and restarted with a configuration of all three nodes.
* run[2-8].tgz: Each represents a cluster restart. No cluster re-configurations are done during this time.
Our whole application process is brought down and then back up using the same ZK configuration.
* run7.tgz: This is the run where the bad session was created and live:

{noformat}
2012-01-25 20:53:09,948 505755 [ProcessThread(sid:119 cport:-1):] TRACE org.apache.zookeeper.server.PrepRequestProcessor - :Psessionid:0x1e3516a4bb770000 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
{noformat}

* run8.tgz: The session created in the previous run of the system lives on, and is never expired as far as I can tell:

{noformat}
2012-01-25 20:53:27,950 2929 [pool-1-thread-2] TRACE org.apache.zookeeper.server.persistence.FileTxnSnapLog - playLog --- create session in log: 1e3516a4bb770000 with timeout: 6000
{noformat}

I have the matching run8.log for node3, but it does not mention the bad session at all. I have no usable logs for node1.

I also managed to do a dump against each node while the system was live. Here are the results:

{noformat}
# telnet 90.0.0.1 2888
Trying 90.0.0.1...
Connected to 90.0.0.1.
Escape character is '^]'.
dump
SessionTracker dump:
Session Sets (3):
0 expire at Thu Jan 26 00:09:30 UTC 2012:
1 expire at Thu Jan 26 00:09:33 UTC 2012:
    0x773516a5076a0000
2 expire at Thu Jan 26 00:09:36 UTC 2012:
    0xd03516a4f7960000
    0x1e3516a4f8520001
ephemeral nodes dump:
Sessions with Ephemerals (3):
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
# telnet 90.0.0.2 2888
Trying 90.0.0.2...
Connected to 90.0.0.2.
Escape character is '^]'.
dump
SessionTracker dump:
org.apache.zookeeper.server.quorum.LearnerSessionTracker@1479a6b
ephemeral nodes dump:
Sessions with Ephemerals (4):
0x1e3516a4bb770000:
    /election/cassandra/nominee0000000032
    /election/role-cluster-mgmt_cluster/nominee0000000032
    /election/role-cluster-persistence_server/nominee0000000032
    /election/role-cluster-dht_node/nominee0000000032
    /election/dht-reconciliation/nominee0000000032
    /election/zkrsm/nominee0000000032
    /election/zk_live_members/nominee0000000032
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
# telnet 90.0.0.3 2888
Trying 90.0.0.3...
Connected to 90.0.0.3.
Escape character is '^]'.
dump
SessionTracker dump:
org.apache.zookeeper.server.quorum.LearnerSessionTracker@5960a2f1
ephemeral nodes dump:
Sessions with Ephemerals (3):
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
{noformat}

The system is still in that state right now; if anyone needs more info from the live system, please ask me for it in the next hour or so.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill
> all three, and then restart just two of them.
> Sometimes we notice that on one of the restarted servers, ephemeral nodes
> from previous sessions do not get deleted, while on the other server they
> do. We are effectively running 3.4.2, though technically we are running
> 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for
> 3.4.1 with the patches for ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node
> (90.0.0.221, zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee0000000011]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid
> 251), I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee0000000006 nominee0000000010 nominee0000000011
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
> 90.0.0.221:7777
> cZxid = 0x30000014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x30000014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x30000014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220000
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
> 90.0.0.223:7777
> cZxid = 0x200000cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x200000cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x200000cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients
> connected directly to that server. Note that this cluster is configured to
> have all three servers still, the third one being down (90.0.0.223, zkid
> 162).
> I captured the data/snapshot directories for the two live servers. When I
> start single-node servers using each directory, I can briefly see that the
> inconsistent data is present in those logs, though the ephemeral nodes seem
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the
> failure. I think we can reproduce it regularly if you need more info.
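A footnote that may help whoever digs through the dumps: if I understand ZooKeeper 3.4.x correctly (SessionTrackerImpl.initializeNextSession()), the high byte of a 64-bit session id is the id of the server that created the session, so the session ids in the dumps can be traced back to a server. The helper below is my own sketch, not anything in the ZooKeeper API; the server ids 208/30/119 are the ones from the cluster described in the comment.

```python
# Sketch only: decode which server minted a ZooKeeper session id,
# assuming the 3.4.x scheme where SessionTrackerImpl places the
# server id in the most significant byte of the 64-bit session id.

def session_server_id(session_id: int) -> int:
    """Return the server id stored in the top byte of a session id."""
    return (session_id >> 56) & 0xFF

# Session ids taken from the dumps above:
for sid in (0x1e3516a4bb770000,   # -> 30  (node2): the unexpired session
            0x773516a5076a0000,   # -> 119 (node3)
            0xd03516a4f7960000,   # -> 208 (node1)
            0x1e3516a4f8520001):  # -> 30  (node2)
    print(hex(sid), "->", session_server_id(sid))
```

Notably, the stuck session 0x1e3516a4bb770000 decodes to server id 30, i.e. node2 itself, the node holding the bad state.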