[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-1057:
-------------------------------------

    Fix Version/s: 3.4.0

> zookeeper c-client, connection to offline server fails to successfully 
> fallback to second zk host
> -------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1057
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1057
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.3.1, 3.3.2, 3.3.3
>         Environment: snowdutyrise-lm ~/-> uname -a
> Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 
> PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
> also observed on:
> 2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
>            Reporter: Woody Anderson
>             Fix For: 3.3.4, 3.4.0
>
>
> Hello, I'm a contributor for the node.js zookeeper module: 
> https://github.com/yfinkelstein/node-zookeeper
> i'm using zk 3.3.3 for the purposes of this issue, but i have validated it 
> fails on 3.3.1 and 3.3.2
> i'm having an issue when trying to connect when one of my zookeeper servers 
> is offline.
> if the first server attempted is online, all is good.
> if the offline server is attempted first, then the client is never able to 
> connect to _any_ server.
> inside zookeeper.c a connection loss (-4) is received, the socket is closed 
> and buffers are cleaned up. it then attempts the next server in the list and 
> creates a new socket (which gets the same fd as the previously closed socket), 
> but connecting fails, and it continues to fail seemingly forever.
> The nature of this "fail" is not that it gets -4 connection loss errors, but 
> that zookeeper_interest doesn't find anything going on on the socket before 
> the user provided timeout kicks things out. I don't want to have to wait 5 
> minutes, even if i could make myself.
> this is the message that follows the connection loss:
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket 
> [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection 
> timed out (exceeded timeout by 3ms)
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest 
> returned error: -7 - operation timeout
> While investigating, i decided to comment out close(zh->fd) in handle_error 
> (zookeeper.c#1153)
> now everything works (obviously i'm leaking an fd). Connecting to the second 
> host works immediately.
> this is the behavior i'm looking for, though i clearly don't want to leak the 
> fd, so i'm wondering why the fd re-use is causing this issue.
> close() is not returning an error (i checked even though current code assumes 
> success).
> i'm on osx 10.6.7
> i tried setting SO_LINGER via setsockopt (though i didn't want that to be a 
> solution), it didn't work.
> full debug traces are included in issue here: 
> https://github.com/yfinkelstein/node-zookeeper/issues/6
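
The descriptor reuse mentioned in the report can be reproduced in isolation. The sketch below is not taken from zookeeper.c, and the helper name fd_is_reused_after_close is hypothetical. It only shows that, in a single-threaded process, a socket opened right after close() normally gets the same descriptor number, because POSIX hands out the lowest available descriptor. It does not diagnose why the reconnect in the report times out.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only (not part of the ZooKeeper
 * C client). Opens a TCP socket, closes it, opens a second one, and
 * reports whether the second socket received the same descriptor number.
 *
 * Returns 1 if the number was reused, 0 if it was not, -1 on error.
 * Assumes no other thread opens or closes descriptors in between.
 */
int fd_is_reused_after_close(void)
{
    int first = socket(AF_INET, SOCK_STREAM, 0);
    if (first < 0)
        return -1;

    /* Check close() explicitly, as the reporter did. */
    if (close(first) != 0)
        return -1;

    int second = socket(AF_INET, SOCK_STREAM, 0);
    if (second < 0)
        return -1;

    /* POSIX allocates the lowest free descriptor, so this is usually true. */
    int reused = (second == first);

    close(second);
    return reused;
}
```

Calling fd_is_reused_after_close() from a small main() and printing the result is enough to check the behavior on a given platform. Any code that tracks a connection by descriptor number alone cannot tell the old socket from the new one, which may be worth keeping in mind when reading the traces linked above.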

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
