Hello Mahadev, 2010/3/18 Mahadev Konar <maha...@yahoo-inc.com>: > Hi Lukasz, > I looked at the logs and havent been able to determine how the session > move can happen. > > > You should go ahead and open a jira for this! Please attach the logs to > that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of > now.
I created issue for this https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as 3.2.2 as this is version we are using. > Also, please go ahead and attach non filtered logs to the jira for the same > timeline as you had done earlier (by non filtered I mean not grepping for > the session id). Also, do attach the earlier logs you had emailed. I attached logs in jira. > Also, may I know what function are you using to print that statement > > > Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper: > ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session > moved to another server, so operation is ignored' info added to jira > > (If you can cut and paste that code on the jira that you create, it would be > helpful!)... > > Thanks > mahadev > > > On 3/17/10 2:56 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote: > >> Ok. >> >> I analysed some logs from time when problem last occurred. >> >> brokensession.txt - logs concerning session which had a problem >> leader.txt - logs concerning leader election (i grepped for Leader in >> zookeeper.log) >> >> Some observations below >> - Network glitch which resulted in problem occurred at about 22:08. >> - From what I see since 17:48 node2 was the leader and it did not >> change later yesterday. >> - Client was connected to node2 since 17:50 >> - At around 22:09 client tried to connect to every node (1,2,3). >> Connections to node1 and node3 were closed >> with exception "Exception causing close of session 0x22767e1c9630000 >> due to java.io.IOException: Read error". >> Connection to node2 stood alive. >> - All subsequent operations were refused with ZSESSIONMOVED error. >> Error visible both on client and on server side. >> >> Some more clarifications concerning questions from different posts: >> >> 1. Saying we have huge packet drop I mean we have huge packet drop for >> a local network. We are still investigating source of it. From time to >> time we observe that machines loose contact with others for few >> seconds. >> >> 2. We do not use any fancy network tricks. All machines are in single >> vlan and use permanently assigned IP addresses. >> >> Let me know if anything more can help. >> >> PS. Should I already create JIRA issue for this or is not concrete enough? >> >> Regards, Łukasz >> >> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <ph...@apache.org> wrote: >>> Yea, that's great. (no complaints on my end, just wasn't sure what you >>> meant, wanted to make sure I was clear.). >>> >>> Can you identify some sort of pattern? We're discussing on our end, but this >>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're >>> looking for a direction to hunt... anything you could add would help. >>> >>> Is it possible for you to upgrade your client/server to 3.2.2 and report the >>> next failure, including any server/client logs if they are available? You >>> should create a JIRA for this in order to track the details as they emerge. >>> (incl attaching the logs, etc...) >>> >>> Patrick >>> >>> Łukasz Osipiuk wrote: >>>> >>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <ph...@apache.org> wrote: >>>>> >>>>> We'll probably need the ZK server/client logs to hunt this down. Can you >>>>> tell if the MOVED happens in some particular scenario, say you are >>>>> connected >>>>> to a follower and move to a leader, or perhaps you are connected to >>>>> server >>>>> A, get disconnected and reconnected to server A? .... is there some >>>>> pattern >>>>> that could help us understand what's causing this? >>>>> >>>>>> Session gets broken after zookeeper library reconnects to cluster, >>>>>> although most of reconnection leave session if working state. >>>>> >>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first >>>>> about >>>>> about the session is broken after reconnect, what's the second part >>>>> saying?) >>>> >>>> Sorry for my English. >>>> >>>> I meant that it only happens from time to time. We get reconnects >>>> quite often due to large packet loss in our network, but most of the >>>> time they do not break the session. >>>> >>>> Is it clear now? >>>> >>>> Regards, Łukasz Osipiuk >>>> >>>>> Regards, >>>>> >>>>> Patrick >>>>> >>>>> Łukasz Osipiuk wrote: >>>>>> >>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <ph...@apache.org> wrote: >>>>>>> >>>>>>> Łukasz Osipiuk wrote: >>>>>>>> >>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <ph...@apache.org> wrote: >>>>>>>>> >>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do >>>>>>>>> this >>>>>>>>> by >>>>>>>>> running the "stat" command against each of your servers and look at >>>>>>>>> the >>>>>>>>> very >>>>>>>>> top of the output (we include the version of zk server there). >>>>>>>>> http://bit.ly/dglVld >>>>>>>> >>>>>>>> Thanks for hint. >>>>>>>> >>>>>>>> We had 3.2.2 jars installed on all server - I checked it previously >>>>>>>> but apparently our script which restarts instance >>>>>>>> after installing .deb does not work very well and node2 and node3 were >>>>>>>> running version 3.2.1. >>>>>>>> Could using 3.2.1 be source of the problem? >>>>>>>> >>>>>>> I heard this story where customer service for dell asks customers to >>>>>>> replug >>>>>>> in the mouse because it might be "a dusty connection". Invariably when >>>>>>> ppl >>>>>>> look at the back of the box they find that the mouse is not actually >>>>>>> plugged >>>>>>> in. Brilliant ideas from CS perspective. ;-) >>>>>>> >>>>>>> I think that issue was fixed in 3.2.1, so I don't think that's an issue >>>>>>> (version). >>>>>> >>>>>> Damn :( >>>>>> >>>>>>>>> Are you using synchronous or async operations in your client? >>>>>>>> >>>>>>>> We are only using async operations. >>>>>>> >>>>>>> Ok. good info. >>>>>>> >>>>>>>>> I see snippets from the server logs, but I don't see anything from >>>>>>>>> you >>>>>>>>> client logs. It would really help if you could zip up all the logs >>>>>>>>> (server >>>>>>>>> and client) and provide them. >>>>>>>> >>>>>>>> We do not really have much logging on application side :(. Failed >>>>>>>> operations are logged with error code (ZSESSIONEXPIRED in our case). >>>>>>>> We also log session state changes but we noticed only changes between >>>>>>>> 1 >>>>>>>> and 3. >>>>>>>> Unfortunately we do not have logs coming directly from zookeeper >>>>>>>> library as it logs to stderr/stdout and we do not store any of them at >>>>>>>> the moment. >>>>>>>> >>>>>>> I see, in this case I think that the log messages theyselves are the >>>>>>> problem, ie we are warning about this, but it's not a real problem. We >>>>>>> warn >>>>>>> on things that we don't like but can handle, we error if we >>>>>>> don'tlike/can'thandle. >>>>>>> >>>>>>> MOVED basically means that a client initiated a request on one server, >>>>>>> then >>>>>>> moved to another server before the request could be processed. >>>>>>> >>>>>>> In your case you are using async requests, that can actually make this >>>>>>> problem show up more (more "in flight" requests vs sync operation). >>>>>>> >>>>>>> So if you are not seeing MOVED on the client then this is fine. If you >>>>>>> see >>>>>>> them in the server log it means what I said above. >>>>>>> >>>>>>> That make sense, or am I missing something? >>>>>> >>>>>> It seems I was what I wrote was not clear. We are noticing errors on >>>>>> client side. Any operation after >>>>>> session gets into broken state fails with ZSESSIONMOVE error code (I >>>>>> wrote ZSESSIONEXPIRED by mistake in previous post). >>>>>> We get ZSESSIONMOVE from callback on rc parameter. >>>>>> >>>>>> Session gets broken after zookeeper library reconnects to cluster, >>>>>> although most of reconnection leave session if working state. >>>>>> >>>>>> Regards, Łukasz >>>>>> >>>>>> >>>>>>> Patrick >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Regards, Łukasz >>>>>>>> >>>>>>>>> Patrick >>>>>>>>> >>>>>>>>> Łukasz Osipiuk wrote: >>>>>>>>>> >>>>>>>>>> not really - it happens occasionally - every few days :( >>>>>>>>>> I believe it is somewhat connected with our network environment >>>>>>>>>> which >>>>>>>>>> suffers from some packet loss which leads to >>>>>>>>>> connection timeouts. >>>>>>>>>> >>>>>>>>>> I can switch on some more logging if you can lead me which >>>>>>>>>> categories >>>>>>>>>> are worth to enable DEBUG for them. >>>>>>>>>> >>>>>>>>>> Regards, Łukasz Osipiuk >>>>>>>>>> >>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <br...@yahoo-inc.com> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of >>>>>>>>>>> reproducing >>>>>>>>>>> the problem? >>>>>>>>>>> >>>>>>>>>>> thanx >>>>>>>>>>> ben >>>>>>>>>>> >>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote: >>>>>>>>>>>> >>>>>>>>>>>> nope. >>>>>>>>>>>> >>>>>>>>>>>> I always pass 0 as clientid. >>>>>>>>>>>> >>>>>>>>>>>> Łukasz >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed<br...@yahoo-inc.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set to >>>>>>>>>>>>> something >>>>>>>>>>>>> other than null? >>>>>>>>>>>>> >>>>>>>>>>>>> ben >>>>>>>>>>>>> >>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi everyone! >>>>>>>>>>>>> >>>>>>>>>>>>> I am writing to this group because recently we are getting some >>>>>>>>>>>>> strange errors with our production zookeeper setup. >>>>>>>>>>>>> >>>>>>>>>>>>> From time to time we are observing that our client application >>>>>>>>>>>>> (C++ >>>>>>>>>>>>> based) disconnects from zookeeper (session state is changed to >>>>>>>>>>>>> 1) >>>>>>>>>>>>> and >>>>>>>>>>>>> reconnects (state changed to 3). >>>>>>>>>>>>> This itself is not a problem - usually application continues to >>>>>>>>>>>>> run >>>>>>>>>>>>> without problems after reconnect. >>>>>>>>>>>>> But from time to time after above happens all subsequent >>>>>>>>>>>>> operations >>>>>>>>>>>>> start to return ZSESSIONMOVED error. To make it work again we >>>>>>>>>>>>> have >>>>>>>>>>>>> to >>>>>>>>>>>>> restart application (which creates new zookeeper session). >>>>>>>>>>>>> >>>>>>>>>>>>> I noticed that in 3.2.0 introduced a bug >>>>>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are >>>>>>>>>>>>> using >>>>>>>>>>>>> zookeeper v. 3.2.2. >>>>>>>>>>>>> I just noticed that app at compile time used 3.2.0 library but >>>>>>>>>>>>> patches >>>>>>>>>>>>> fixing bug 449 did not touch C client lib so I believe that our >>>>>>>>>>>>> problems are not >>>>>>>>>>>>> related with that. >>>>>>>>>>>>> >>>>>>>>>>>>> In zookeeper logs at moment which initiated the problem with >>>>>>>>>>>>> client >>>>>>>>>>>>> application I have >>>>>>>>>>>>> >>>>>>>>>>>>> node1: >>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to >>>>>>>>>>>>> /10.1.112.61:37197 lastZxid 42992576502 >>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session >>>>>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO >>>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished >>>>>>>>>>>>> init >>>>>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception >>>>>>>>>>>>> causing >>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>>> Read >>>>>>>>>>>>> error >>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@857] - closing >>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>>> local=/10.1.112.62:2181 >>>>>>>>>>>>> remote=/10.1.112.61:37197] >>>>>>>>>>>>> >>>>>>>>>>>>> node2: >>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception >>>>>>>>>>>>> causing >>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>>> Read >>>>>>>>>>>>> error >>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing >>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>>> local=/10.1.112.63:2181 >>>>>>>>>>>>> remote=/10.1.112.61:60693] >>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to >>>>>>>>>>>>> /10.1.112.61:48336 lastZxid 42992576502 >>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session >>>>>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO >>>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished >>>>>>>>>>>>> init >>>>>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>>>>> >>>>>>>>>>>>> node3: >>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception >>>>>>>>>>>>> causing >>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>>> Read >>>>>>>>>>>>> error >>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO >>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing >>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>>> local=/10.1.112.64:2181 >>>>>>>>>>>>> remote=/10.1.112.61:57309] >>>>>>>>>>>>> >>>>>>>>>>>>> and then lots of entries like this >>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN >>>>>>>>>>>>> [ProcessThread:-1:preprequestproces...@402] - Got exception when >>>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create >>>>>>>>>>>>> cxid:0x4b9e9e49 >>>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>>>>> /locks/9871253/lock-8589943989- >>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>>>>> KeeperErrorCode = Session moved >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Session >>>>>>>>>>>>> TrackerImpl.java:231) >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepReque >>>>>>>>>>>>> stProcessor.java:211) >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestPro >>>>>>>>>>>>> cessor.java:114) >>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN >>>>>>>>>>>>> [ProcessThread:-1:preprequestproces...@402] - Got exception when >>>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create >>>>>>>>>>>>> cxid:0x4b9f6603 >>>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>>>>> /locks/1665960/lock-8589961006- >>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>>>>> KeeperErrorCode = Session moved >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Session >>>>>>>>>>>>> TrackerImpl.java:231) >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepReque >>>>>>>>>>>>> stProcessor.java:211) >>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestPro >>>>>>>>>>>>> cessor.java:114) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> To workaround disconnections I am going to increase session >>>>>>>>>>>>> timeout >>>>>>>>>>>>> from 5 to 15 seconds but event if it helps at all it is just a >>>>>>>>>>>>> workaround. >>>>>>>>>>>>> >>>>>>>>>>>>> Do you have an idea where is the source of my problem. >>>>>>>>>>>>> >>>>>>>>>>>>> Regards, Łukasz Osipiuk >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >>>> >>> >> >> > > -- -- Łukasz Osipiuk mailto:luk...@osipiuk.net