Hi Lukasz,

I looked at the logs and haven't been able to determine how the session move can happen.
You should go ahead and open a JIRA for this, and attach the logs to it. Mark it for 3.3.0 until we can say that it isn't a problem as of now. Please also attach unfiltered logs for the same timeline as before (by unfiltered I mean not grepping for the session id), as well as the earlier logs you had emailed.

Also, may I know what function you are using to print this statement:

Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper: ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session moved to another server, so operation is ignored'

(If you can cut and paste that code into the JIRA you create, it would be helpful!)

Thanks
mahadev

On 3/17/10 2:56 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:

> Ok.
>
> I analysed some logs from the time when the problem last occurred.
>
> brokensession.txt - logs concerning the session which had the problem
> leader.txt - logs concerning leader election (I grepped for "Leader" in
> zookeeper.log)
>
> Some observations below:
> - The network glitch which resulted in the problem occurred at about 22:08.
> - From what I see, node2 had been the leader since 17:48 and that did not
>   change later yesterday.
> - The client had been connected to node2 since 17:50.
> - At around 22:09 the client tried to connect to every node (1, 2, 3).
>   The connections to node1 and node3 were closed with the exception
>   "Exception causing close of session 0x22767e1c9630000 due to
>   java.io.IOException: Read error". The connection to node2 stayed alive.
> - All subsequent operations were refused with the ZSESSIONMOVED error,
>   visible both on the client and on the server side.
>
> Some more clarifications concerning questions from other posts:
>
> 1. Saying we have huge packet drop, I mean huge for a local network. We
> are still investigating its source. From time to time we observe that
> machines lose contact with the others for a few seconds.
>
> 2. We do not use any fancy network tricks. All machines are in a single
> vlan and use permanently assigned IP addresses.
>
> Let me know if anything more can help.
>
> PS. Should I already create a JIRA issue for this, or is it not concrete
> enough yet?
>
> Regards, Łukasz
>
> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <ph...@apache.org> wrote:
>> Yea, that's great. (No complaints on my end, I just wasn't sure what you
>> meant and wanted to make sure I was clear.)
>>
>> Can you identify some sort of pattern? We're discussing this on our end,
>> but this is the first time we've heard of such an issue (outside the
>> 3.2.0 bug) and we're looking for a direction to hunt... anything you
>> could add would help.
>>
>> Is it possible for you to upgrade your client/server to 3.2.2 and report
>> the next failure, including any server/client logs if they are available?
>> You should create a JIRA for this in order to track the details as they
>> emerge (incl. attaching the logs, etc.).
>>
>> Patrick
>>
>> Łukasz Osipiuk wrote:
>>>
>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <ph...@apache.org> wrote:
>>>>
>>>> We'll probably need the ZK server/client logs to hunt this down. Can
>>>> you tell if the MOVED happens in some particular scenario, say you are
>>>> connected to a follower and move to a leader, or perhaps you are
>>>> connected to server A, get disconnected and reconnect to server A?
>>>> .... Is there some pattern that could help us understand what's
>>>> causing this?
>>>>
>>>>> The session gets broken after the zookeeper library reconnects to the
>>>>> cluster, although most reconnections leave the session in working
>>>>> state.
>>>>
>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first
>>>> part about the session being broken after reconnect; what's the second
>>>> part saying?)
>>>
>>> Sorry for my English.
>>>
>>> I meant that it only happens from time to time. We get reconnects quite
>>> often due to large packet loss in our network, but most of the time
>>> they do not break the session.
>>>
>>> Is it clear now?
>>>
>>> Regards, Łukasz Osipiuk
>>>
>>>> Regards,
>>>>
>>>> Patrick
>>>>
>>>> Łukasz Osipiuk wrote:
>>>>>
>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>
>>>>>> Łukasz Osipiuk wrote:
>>>>>>>
>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do
>>>>>>>> this by running the "stat" command against each of your servers
>>>>>>>> and looking at the very top of the output (we include the version
>>>>>>>> of the zk server there). http://bit.ly/dglVld
>>>>>>>
>>>>>>> Thanks for the hint.
>>>>>>>
>>>>>>> We had the 3.2.2 jars installed on all servers - I checked it
>>>>>>> previously - but apparently our script which restarts the instance
>>>>>>> after installing the .deb does not work very well, and node2 and
>>>>>>> node3 were running version 3.2.1.
>>>>>>> Could using 3.2.1 be the source of the problem?
>>>>>>>
>>>>>> I heard this story where customer service for Dell asks customers to
>>>>>> replug the mouse because it might be "a dusty connection".
>>>>>> Invariably, when people look at the back of the box they find that
>>>>>> the mouse is not actually plugged in. Brilliant idea from a CS
>>>>>> perspective. ;-)
>>>>>>
>>>>>> I think that issue was fixed in 3.2.1, so I don't think the version
>>>>>> is the problem.
>>>>>
>>>>> Damn :(
>>>>>
>>>>>>>> Are you using synchronous or async operations in your client?
>>>>>>>
>>>>>>> We are only using async operations.
>>>>>>
>>>>>> Ok, good info.
>>>>>>
>>>>>>>> I see snippets from the server logs, but I don't see anything from
>>>>>>>> your client logs. It would really help if you could zip up all the
>>>>>>>> logs (server and client) and provide them.
>>>>>>>
>>>>>>> We do not really have much logging on the application side :(.
>>>>>>> Failed operations are logged with the error code (ZSESSIONEXPIRED
>>>>>>> in our case).
>>>>>>> We also log session state changes, but we noticed only changes
>>>>>>> between 1 and 3.
>>>>>>> Unfortunately we do not have logs coming directly from the
>>>>>>> zookeeper library, as it logs to stderr/stdout and we do not store
>>>>>>> any of that at the moment.
>>>>>>>
>>>>>> I see. In this case I think that the log messages themselves are the
>>>>>> problem, i.e. we are warning about this, but it's not a real
>>>>>> problem. We warn on things that we don't like but can handle; we
>>>>>> error if we don't like / can't handle.
>>>>>>
>>>>>> MOVED basically means that a client initiated a request on one
>>>>>> server, then moved to another server before the request could be
>>>>>> processed.
>>>>>>
>>>>>> In your case you are using async requests; that can actually make
>>>>>> this problem show up more (more "in flight" requests vs. sync
>>>>>> operation).
>>>>>>
>>>>>> So if you are not seeing MOVED on the client then this is fine. If
>>>>>> you see them in the server log it means what I said above.
>>>>>>
>>>>>> Does that make sense, or am I missing something?
>>>>>
>>>>> It seems what I wrote was not clear. We are noticing errors on the
>>>>> client side. Any operation after the session gets into the broken
>>>>> state fails with the ZSESSIONMOVED error code (I wrote
>>>>> ZSESSIONEXPIRED by mistake in the previous post).
>>>>> We get ZSESSIONMOVED from the rc parameter in the callback.
>>>>>
>>>>> The session gets broken after the zookeeper library reconnects to the
>>>>> cluster, although most reconnections leave the session in working
>>>>> state.
>>>>>
>>>>> Regards, Łukasz
>>>>>
>>>>>
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Regards, Łukasz
>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>>
>>>>>>>>> Not really - it happens occasionally, every few days :(
>>>>>>>>> I believe it is somewhat connected with our network environment,
>>>>>>>>> which suffers from some packet loss, which leads to connection
>>>>>>>>> timeouts.
>>>>>>>>>
>>>>>>>>> I can switch on some more logging if you can tell me which
>>>>>>>>> categories are worth enabling DEBUG for.
>>>>>>>>>
>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>
>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <br...@yahoo-inc.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Weird, this does sound like a bug. Do you have a reliable way of
>>>>>>>>>> reproducing the problem?
>>>>>>>>>>
>>>>>>>>>> thanx
>>>>>>>>>> ben
>>>>>>>>>>
>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>
>>>>>>>>>>> Nope.
>>>>>>>>>>>
>>>>>>>>>>> I always pass 0 as the clientid.
>>>>>>>>>>>
>>>>>>>>>>> Łukasz
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <br...@yahoo-inc.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Do you ever use zookeeper_init() with the clientid field set
>>>>>>>>>>>> to something other than null?
>>>>>>>>>>>>
>>>>>>>>>>>> ben
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>
>>>>>>>>>>>> I am writing to this group because recently we have been
>>>>>>>>>>>> getting some strange errors with our production zookeeper
>>>>>>>>>>>> setup.
>>>>>>>>>>>>
>>>>>>>>>>>> From time to time we observe that our client application (C++
>>>>>>>>>>>> based) disconnects from zookeeper (the session state changes
>>>>>>>>>>>> to 1) and reconnects (the state changes to 3).
>>>>>>>>>>>> This itself is not a problem - usually the application
>>>>>>>>>>>> continues to run without problems after the reconnect.
>>>>>>>>>>>> But from time to time, after the above happens, all subsequent
>>>>>>>>>>>> operations start to return the ZSESSIONMOVED error. To make it
>>>>>>>>>>>> work again we have to restart the application (which creates a
>>>>>>>>>>>> new zookeeper session).
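[The session states (1 and 3) and the error codes discussed in this thread map to constants in the C client's zookeeper.h. A minimal sketch of how the rc handed to an async completion callback might be classified follows; the numeric values are copied from the 3.2.x headers so the fragment compiles without the library installed, and `is_session_fatal` and `create_completion` are illustrative names, not part of the ZooKeeper API:]

```c
#include <assert.h>
#include <stdio.h>

/* Values copied from the C client's zookeeper.h (3.2.x) so this sketch is
 * self-contained; verify them against your installed headers. */
enum {
    ZOK               = 0,
    ZCONNECTIONLOSS   = -4,   /* transient: the library reconnects itself */
    ZOPERATIONTIMEOUT = -7,   /* transient */
    ZSESSIONEXPIRED   = -112, /* fatal: a new session is required */
    ZSESSIONMOVED     = -118  /* the error discussed in this thread */
};

/* Session states seen by the watcher; the "1" and "3" above are
 * ZOO_CONNECTING_STATE and ZOO_CONNECTED_STATE respectively. */

/* Illustrative helper: normally ZSESSIONMOVED is transient (a request
 * raced a reconnect), but this thread describes it persisting, and the
 * only recovery observed was creating a new session - hence "fatal". */
static int is_session_fatal(int rc)
{
    return rc == ZSESSIONEXPIRED || rc == ZSESSIONMOVED;
}

/* Shape of a string completion callback as passed to zoo_acreate();
 * a real handler would retry transient errors or rebuild the handle. */
static void create_completion(int rc, const char *value, const void *data)
{
    (void)data;
    if (rc == ZOK)
        printf("created %s\n", value);
    else if (is_session_fatal(rc))
        fprintf(stderr, "rc=%d: close the handle, zookeeper_init() anew\n", rc);
    else
        fprintf(stderr, "rc=%d: transient, safe to retry\n", rc);
}
```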
>>>>>>>>>>>>
>>>>>>>>>>>> I noticed that 3.2.0 introduced a bug
>>>>>>>>>>>> (http://issues.apache.org/jira/browse/ZOOKEEPER-449), but we
>>>>>>>>>>>> are using zookeeper v3.2.2.
>>>>>>>>>>>> I just noticed that the app was compiled against the 3.2.0
>>>>>>>>>>>> library, but the patches fixing bug 449 did not touch the C
>>>>>>>>>>>> client lib, so I believe our problems are not related to that.
>>>>>>>>>>>>
>>>>>>>>>>>> In the zookeeper logs, at the moment which initiated the
>>>>>>>>>>>> problem with the client application, I have:
>>>>>>>>>>>>
>>>>>>>>>>>> node1:
>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]
>>>>>>>>>>>>
>>>>>>>>>>>> node2:
>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>
>>>>>>>>>>>> node3:
>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]
>>>>>>>>>>>>
>>>>>>>>>>>> and then lots of entries like this:
>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffffffffffffffe txntype:unknown /locks/9871253/lock-8589943989-
>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>         at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffffffffffffffe txntype:unknown /locks/1665960/lock-8589961006-
>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>         at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>>
>>>>>>>>>>>> To work around the disconnections I am going to increase the
>>>>>>>>>>>> session timeout from 5 to 15 seconds, but even if that helps
>>>>>>>>>>>> at all, it is just a workaround.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have an idea where the source of my problem is?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Łukasz Osipiuk
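[A closing note that can help when correlating logs like the ones above: in the 3.x servers, SessionTrackerImpl.initializeNextSession() places the id of the server that created a session in the most significant byte of the session id. This is an implementation detail rather than a documented guarantee, and the helper name below is made up, but it tells you at a glance which server a session in a MOVED log originated on:]

```c
#include <assert.h>
#include <stdint.h>

/* The high byte of a ZooKeeper session id carries the sid of the server
 * that created the session (see SessionTrackerImpl.initializeNextSession
 * in the 3.x servers). Implementation detail, not a public API guarantee. */
static int session_owner_sid(uint64_t session_id)
{
    return (int)(session_id >> 56);
}
```

[Applied to this thread: 0x324dcc1ba580085 from the logs just above decodes to server 3, and 0x22767e1c9630000 from the later report decodes to server 2, consistent with that client having been connected to node2.]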