[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191568#comment-13191568
 ] 

Jeremy Stribling commented on ZOOKEEPER-1367:
---------------------------------------------

Thanks very much for your thoughts.  Inline:

{quote}
One thing to note - znodes 6, 10, and 11 are each created in a different epoch 
(2, 3, and 4, respectively). So in each case leadership must have changed hands. Also 
note each has a different ephemeralOwner session id.
{quote}

Yeah, that's not surprising.  Because of the way we embed ZooKeeper, we need to 
restart all the ZK servers every time a new node comes up, which probably 
triggers new internal ZK elections.
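
(For my own reference, I'm assuming the epoch here is just the high 32 bits of 
the cZxid, with the counter in the low 32 bits; a quick sanity check on znode 
11's cZxid of 0x400000027:

{noformat}
# Assumed zxid layout: high 32 bits = leader epoch, low 32 bits = counter.
$ printf 'epoch=%d counter=0x%x\n' $((0x400000027 >> 32)) $((0x400000027 & 0xffffffff))
epoch=4 counter=0x27
{noformat}

which matches the epoch 4 you mention, so I'm reading the numbers the same way.)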

{quote}
Also notice the ctime of each of these znodes - znode 10 was created against a 
leader that's 10-20 minutes behind the other server(s). That said, it's not 
likely this is causing the issue but it is a variable. Might be good for you to 
rule it out by running ntp or something to update all the hosts.
{quote}

That's a good catch, and probably something I should have mentioned earlier.  I 
thought ZK didn't care much about wall-clock time, but the skew obviously makes 
the logs harder to correlate.  Here are the relative times of the three servers 
during the test:

{noformat}
90.0.0.221: Thu Jan 19 18:42:00 UTC 2012 (base)
90.0.0.222: Thu Jan 19 18:26:55 UTC 2012 (-15:05)
90.0.0.223: Thu Jan 19 18:17:50 UTC 2012 (-24:10)
{noformat}

Could that affect ZK correctness?
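
If the skew could matter, we can sync the clocks before the next run; I'm 
assuming something like the following on each of the three hosts would be 
enough (ntpdate against any NTP server reachable from the test network):

{noformat}
# One-shot clock sync, run as root on each of the three hosts.
ntpdate -u pool.ntp.org
{noformat}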

{quote}
Anything special about these server hosts? For example are they virtualized? 
Anything else you can think of out of the ordinary - ordinary being the servers 
are run on dedicated non-virtualized hosts.
{quote}

Since they run as part of our automated QA testing, they are virtualized, three 
or five to a box on ESX.  Again, this hasn't changed since we were using 3.3.3, 
but I understand it's not the desired setup.  We run on bare metal in 
production; this is just for testing.

{quote}
Are you using multi for this or the pre-3.4 api only? Ie did you upgrade to 
3.4, start using multi, then see this issue? Or just update to 3.4 from 3.3 
with essentially no changes and see it start happening?
{quote}

Pre-3.4 API only, no multi commands.  We made no changes to our application 
code after the upgrade.

{quote}
If you use 4letterwords to query the server (use 'dump' in this case) when 
you're in this bad state, what does it say about the ephemeralOwner session id 
of the ephemeral znode that shouldn't be there but is? Run this against each of 
the live servers.
{quote}

I have not tried this.  If I can wrangle the QA team into re-running the test, 
I will give this a try and report back.
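
For the record, I'm assuming the way to run 'dump' is to send the four-letter 
word to the client port with nc, along these lines (2181 is an assumption -- 
I'll substitute whatever clientPort we actually configure):

{noformat}
# Send the 'dump' 4letterword to each live server, per the suggestion above.
echo dump | nc 90.0.0.221 2181
echo dump | nc 90.0.0.222 2181
{noformat}

Then I can compare the ephemeralOwner ids of the stale znodes against the 
session ids each server reports.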
                
> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
> all three, and then restart just two of them.  Sometimes we notice that on 
> one of the restarted servers, ephemeral nodes from previous sessions do not 
> get deleted, while on the other server they do.  We are effectively running 
> 3.4.2, though technically we are running 3.4.1 with the patch manually 
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee0000000011]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777 
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee0000000006   nominee0000000010   nominee0000000011
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777 
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
> 90.0.0.221:7777 
> cZxid = 0x30000014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x30000014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x30000014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220000
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
> 90.0.0.223:7777 
> cZxid = 0x200000cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x200000cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x200000cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients 
> connected directly to that server.  Note that this cluster is configured to 
> have all three servers still, the third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers.  When 
> I start single-node servers using each directory, I can briefly see that the 
> inconsistent data is present in those logs, though the ephemeral nodes seem 
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the 
> failure.  I think we can reproduce it regularly if you need more info.
