[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847085#action_12847085
 ] 

Benjamin Reed commented on ZOOKEEPER-710:
-----------------------------------------

lukasz thank you for the logs! they have really helped pinpoint the problem. it 
appears that message delays are the cause of the problem. it appears that what 
is happening is that the client is connecting to a follower that is running 
very slow. the follower is slow processing the reconnect request and forwarding 
it to the leader. for example, the following shows a case where it takes 7 
seconds for the follower to process a reestablish connection request.

{quote}
zookeeper-node3.log.2010-03-16:2010-03-16 14:21:41,267 - INFO  
[NIOServerCxn.Factory:2181:nioserverc...@964] - Finished init of 
0x224ea5675b4007b valid:true
zookeeper-node3.log.2010-03-16:2010-03-16 14:21:48,937 - INFO  
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 
0x224ea5675b4007b
{quote}

the problem is that within that 7 seconds a client can drop the connection to 
the follower and establish a connection to the leader. if the renew session 
request is delayed until after the client reestablishes the connection with the 
leader, the session will get marked as managed by the previous follower and all 
requests set to the leader will get MOVED error, which is what you are seeing.

the problem is that we consider the MOVED error to be a request error rather 
than a connection error. the leader needs to drop the connection when it sees 
the MOVED error so that the client can reestablish the session properly with 
another machine.

> permanent ZSESSIONMOVED error after client app reconnects to zookeeper cluster
> ------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-710
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-710
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: app1.log.2010-03-16.gz, app2.log.2010-03-16.gz, 
> zookeeper-node1.log.2010-03-16.gz, zookeeper-node2.log.2010-03-16.gz, 
> zookeeper-node3.log.2010-03-16.gz
>
>
> Originally problem was described on Users mailing list starting with this 
> [post|http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201003.mbox/<3b910d891003160743k38e2e7c9y830b182d88396...@mail.gmail.com>].
> Below I restate it in more organized form.
> We occasionally (few times a day) observe that our client application 
> disconnects from Zookeeper cluster.
> Application is written in C++ and we are using libzookeeper_mt library. In 
> version 3.2.2.
> The disconnects we are observing are probably related to some problems with 
> our network infrastructure - we are observing periods with great packet loss 
> between machines in our DC. 
> Sometimes after client application (i.e. zookeeper library) reconnects to 
> zookeeper cluster we are observing that all subsequent requests return 
> ZSESSIONMOVED error. Restarting client app helps - we always pass 0 as 
> clientid to zookeeper_init function so old session is not reused.
> On 16-03-2010 we observed few occurences of problem. Example ones:
> - 22:08; client IP 10.1.112.60 (app1); sessionID 0x22767e1c9630000
> - 14:21; client IP 10.1.112.61 (app2); sessionID 0x324dcc1ba580085
> I attach logs of cluster and application nodes (only stuff concerining 
> zookeeper):
> - [^zookeeper-node1.log.2010-03-16.gz] - logs of zookeepr cluster node 1 
> 10.1.112.62
> - [^zookeeper-node2.log.2010-03-16.gz] - logs of zookeepr cluster node 2 
> 10.1.112.63
> - [^zookeeper-node3.log.2010-03-16.gz] - logs of zookeepr cluster node 3 
> 10.1.112.64
> - [^app1.log.2010-03-16.gz] - application logs of app1 10.1.112.60
> - [^app2.log.2010-03-16.gz] - application logs of app2 10.1.112.61
> I also made some analysis of case at 22:08:
> - Network glitch which resulted in problem occurred at about 22:08.
> - From what I see since 17:48 node2 was the leader and it did not
> change later yesterday.
> - Client was connected to node2 since 17:50
> - At around 22:09 client tried to connect to every node (1,2,3).
> Connections to node1 and node3 were closed
>  with exception "Exception causing close of session 0x22767e1c9630000
> due to java.io.IOException: Read error".
>  Connection to node2 stood alive.
> - All subsequent operations were refused with ZSESSIONMOVED error.
> Error visible both on client and on server side.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to