On Thu, Aug 19, 2010 at 7:51 AM, Vishal K<vishalm...@gmail.com> wrote:
Hi,
I remember Ben had opened a jira for clock jumps earlier:
https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon
to
have clocks jump forward in virtualized environments.
It is desirable to modify ZooKeeper to handle this situation (as much as
possible) internally. It would need to be done for both client - server
connections and server - server connections. One obvious solution is to
retry a few times (send ping) after getting a timeout. Another way is to
count the number of pings that have been sent after receiving the
timeout.
If number of pings do not match the expected number (say 5 ping attempt
should be finished for a 5 sec timeout), then wait till all the pings are
finished. In effect do not completely rely on the clock. Any comments?
-Vishal
On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan<qing...@gmail.com> wrote:
Oh.. our servers are also running in a virtualized environment.
On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite<waite....@gmail.com>
wrote:
Hi,
I have tripped over similar problems testing Red Hat Cluster in
virtualised
environments. I don't know whether recent linux kernels have
improved
their
interaction with VMWare, but in our environments clock drift caused
by
lost
ticks can be substantial, requiring NTP to sometimes jump the clock
rather
than control acceleration. In one of our internal production rigs,
the
local NTP servers themselves were virtualised - causing absolute
mayhem
when
heavy loads hit the other guests on the same physical hosts.
The effect on RHCS (v2.0) is quite dramatic. A forward jump in time
by
10
seconds always causes a member to prematurely time-out on a network
read,
causing the member to drop out and trigger a cluster reconfiguration.
Apparently NTP is integrated with RHCS version 3, but I don't know
what
is
meant by that.
I guess this post is not entirely relevent to ZK, but I am just
making
the
point that virtualisation (of NTP servers and or clients) can cause
repeated
premature timeouts. On Linux, I believe that there is a class of
timers
provided that is immune to this, but I doubt that there is a platform
independent way of coping with this.
My 2p.
regards,
Martin
On 18 August 2010 16:53, Patrick Hunt<ph...@apache.org> wrote:
Do you expect the time to be "wrong" frequently? If ntp is running
it
should never get out of sync more than a small amount. As long as
this
is
less than ~your timeout you should be fine.
Patrick
On 08/18/2010 01:04 AM, Qing Yan wrote:
Hi,
The testcase is fairly simple. We have a client which connects
to
ZK,
registers an ephemeral node and watches on it. Now change the
client
machine's time - session killed..
Here is the log:
*2010-08-18 04:24:57,782 INFO
com.taobao.timetunnel2.cluster.service.AgentService: Host name
kgbtest1.corp.alimama.com
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:zookeeper.version=3.2.2-888565, built on 12/08/2009
21:51
GMT
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:host.name=kgbtest1.corp.alimama.com
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.version=1.6.0_13
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.vendor=Sun Microsystems Inc.
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.home=/usr/java/jdk1.6.0_13/jre
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/server:/usr/java/jdk1.6.0_13/jre/lib/amd64:/usr/java/jdk1.6.0_13/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.io.tmpdir=/tmp
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:java.compiler=<NA>
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:os.name=Linux
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:os.arch=amd64
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:os.version=2.6.18-164.el5
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:user.name=admin
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:user.home=/home/admin
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper:
Client
environment:user.dir=/home/admin/TimeTunnel2/cluster/log
2010-08-18 04:24:57,790 INFO org.apache.zookeeper.ZooKeeper:
Initiating
client connection, connectString=
xentest10-vm5.corp.alimama.com:2181,
xentest10-vm6.corp.alimama.com:2181,
xentest10-vm9.corp.alimama.com:2181
sessionTimeout=600000
watcher=com.taobao.timetunnel2.cluster.service.agentserv...@48d6c16c
2010-08-18 04:24:57,791 INFO org.apache.zookeeper.ClientCnxn:
zookeeper.disableAutoWatchReset is false
2010-08-18 04:24:57,817 INFO org.apache.zookeeper.ClientCnxn:
Attempting
connection to server
xentest10-vm5.corp.alimama.com/10.254.5.160:2181
2010-08-18 04:24:57,818 INFO org.apache.zookeeper.ClientCnxn:
Priming
connection to java.nio.channels.SocketChannel[connected local=/
10.254.2.160:48536 remote=
xentest10-vm5.corp.alimama.com/10.254.5.160:2181]
2010-08-18 04:24:57,821 INFO org.apache.zookeeper.ClientCnxn:
Server
connection successful
2010-08-18 04:24:57,832 INFO
com.taobao.timetunnel2.cluster.service.AgentService: Register...
2010-08-18 04:24:57,834 DEBUG org.apache.zookeeper.ClientCnxn:
Reading
reply
sessionid:0x12a8413f71a0003, packet:: clientPath:null
serverPath:null
finished:false header:: -8,101 replyHeader:: -8,-1,0 request::
0,v{},v{},v{} response:: null
2010-08-18 04:24:57,846 DEBUG org.apache.zookeeper.ClientCnxn:
Reading
reply
sessionid:0x12a8413f71a0003, packet:: clientPath:null
serverPath:null
finished:false header:: 1,1 replyHeader:: 1,8589934649,-110
request::
'/agent,,v{s{31,s{'world,'anyone}}},0 response::
2010-08-18 04:24:57,853 DEBUG
com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator:
Path
/agent
already exist
2010-08-18 04:24:57,882 DEBUG org.apache.zookeeper.ClientCnxn:
Reading
reply
sessionid:0x12a8413f71a0003, packet:: clientPath:null
serverPath:null
finished:false header:: 2,1 replyHeader:: 2,8589934650,-110
request::
'/agent/status,,v{s{31,s{'world,'anyone}}},0 response::
2010-08-18 04:24:57,882 DEBUG
com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator:
Path
/agent/status already exist
2010-08-18 04:24:57,886 DEBUG org.apache.zookeeper.ClientCnxn:
Reading
reply
sessionid:0x12a8413f71a0003, packet:: clientPath:null
serverPath:null
finished:false header:: 3,1 replyHeader:: 3,8589934651,0
request::
'/agent/status/kgbtest1.corp.alimama.com
,,v{s{31,s{'world,'anyone}}},1
response:: '/agent/status/kgbtest1.corp.alimama.com
2010-08-18 04:24:57,886 DEBUG
com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator:
Path
/agent/status/kgbtest1.corp.alimama.com creation OK
2010-08-18 04:24:57,886 INFO
com.taobao.timetunnel2.cluster.service.AgentService: Watch for ->
/agent/status/kgbtest1.corp.alimama.com
2010-08-18 04:24:57,892 DEBUG org.apache.zookeeper.ClientCnxn:
Reading
reply
sessionid:0x12a8413f71a0003, packet:: clientPath:null
serverPath:null
finished:false header:: 4,4 replyHeader:: 4,-1,0 request::
'/agent/status/
kgbtest1.corp.alimama.com,T response::
,s{8589934651,8589934651,1282116165777,1282116165777,0,0,0,84024764344696835,0,0,8589934651}
2010-08-18 04:24:57,892 INFO
com.taobao.timetunnel2.cluster.service.AgentService: Data changed
@/agent/status/kgbtest1.corp.alimama.com
2010-08-18 04:25:11,226 DEBUG org.apache.zookeeper.ClientCnxn: Got
ping
response for sessionid:0x12a8413f71a0003 after 1ms
2010-08-18 04:25:24,560 DEBUG org.apache.zookeeper.ClientCnxn: Got
ping
response for sessionid:0x12a8413f71a0003 after 2ms
2010-08-18 04:25:37,893 DEBUG org.apache.zookeeper.ClientCnxn: Got
ping
response for sessionid:0x12a8413f71a0003 after 1ms
2010-08-18 04:19:44,853 DEBUG org.apache.zookeeper.ClientCnxn: Got
notification sessionid:0x12a8413f71a0003
2010-08-18 04:19:44,854 DEBUG org.apache.zookeeper.ClientCnxn: Got
WatchedEvent: Znode change. Path:
/agent/status/kgbtest1.corp.alimama.comType: NodeDeleted for
sessionid
0x12a8413f71a0003
2010-08-18 04:19:44,854 DEBUG
com.taobao.timetunnel2.cluster.zookeeper.operation.PathDataWatcher:
getChildren event =>WatchedEvent: Znode change. Path:
/agent/status/
kgbtest1.corp.alimama.com Type: NodeDeleted
2010-08-18 04:19:44,855 WARN org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x12a8413f71a0003 to
sun.nio.ch.selectionkeyi...@7ca3d4cf
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0
lim=4 cap=4]
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(zookeeper:ClientCnxn.java):701)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):945)
2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn:
Attempting
connection to server
xentest10-vm6.corp.alimama.com/10.254.5.161:2181
2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn:
Priming
connection to java.nio.channels.SocketChannel[connected local=/
10.254.2.160:45776 remote=
xentest10-vm6.corp.alimama.com/10.254.5.161:2181]
2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn:
Server
connection successful
2010-08-18 04:19:45,593 WARN org.apache.zookeeper.ClientCnxn:
Exception
closing session 0x12a8413f71a0003 to
sun.nio.ch.selectionkeyi...@2fdb7df8
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(zookeeper:ClientCnxn.java):589)
at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(zookeeper:ClientCnxn.java):709)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):945)
* I guess ZK might rely on timestamp to keep sessions alive,
but
we
have
NTP daemon running so machine time can get changed
automatically, is there a conflict?