[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar updated ZOOKEEPER-1057:
-------------------------------------

    Fix Version/s: 3.4.0

> zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
> -------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1057
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1057
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.3.1, 3.3.2, 3.3.3
>         Environment: snowdutyrise-lm ~/-> uname -a
> Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
> also observed on:
> 2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
>            Reporter: Woody Anderson
>             Fix For: 3.3.4, 3.4.0
>
>
> Hello, I'm a contributor for the node.js zookeeper module:
> https://github.com/yfinkelstein/node-zookeeper
> I'm using zk 3.3.3 for the purposes of this issue, but I have validated that
> it also fails on 3.3.1 and 3.3.2.
> I'm having an issue when trying to connect when one of my zookeeper servers
> is offline.
> If the first server attempted is online, all is good.
> If the offline server is attempted first, then the client is never able to
> connect to _any_ server.
> Inside zookeeper.c a connection loss (-4) is received, the socket is closed
> and buffers are cleaned up. It then attempts the next server in the list,
> creates a new socket (which gets the same fd as the previously closed socket)
> and connecting fails, and it continues to fail seemingly forever.
> The nature of this "fail" is not that it gets -4 connection loss errors, but
> that zookeeper_interest doesn't find anything going on on the socket before
> the user-provided timeout kicks things out. I don't want to have to wait 5
> minutes, even if I could make myself.
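The descriptor reuse mentioned above is expected on its own: POSIX-style systems hand out the lowest free descriptor number, so a socket() call made right after close() in a single-threaded process normally returns the same number. A minimal C sketch of just that behavior, outside ZooKeeper (the helper name reopened_fd_matches is hypothetical and is not part of zookeeper.c):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of the ZooKeeper C client.
 * Returns 1 if a socket opened right after close() gets the same
 * descriptor number as the one just closed, 0 if it gets a different
 * number, and -1 if any system call fails. */
int reopened_fd_matches(void)
{
    /* Open a TCP socket and remember its descriptor number. */
    int first = socket(AF_INET, SOCK_STREAM, 0);
    if (first < 0)
        return -1;

    /* Check the close() result instead of assuming success. */
    if (close(first) != 0)
        return -1;

    /* Open a second socket; no other descriptor was opened in between. */
    int second = socket(AF_INET, SOCK_STREAM, 0);
    if (second < 0)
        return -1;

    int same = (second == first);
    close(second);
    return same;
}
```

Called from a single-threaded program, reopened_fd_matches() is expected to return 1 on typical systems. This only shows that seeing the same fd number again is normal; it does not by itself explain why the reconnect stalls.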
> This is the message that follows the connection loss:
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 3ms)
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout
> While investigating, I decided to comment out close(zh->fd) in handle_error
> (zookeeper.c#1153).
> Now everything works (obviously I'm leaking an fd). Connecting to the second
> host works immediately.
> This is the behavior I'm looking for, though I clearly don't want to leak the
> fd, so I'm wondering why the fd re-use is causing this issue.
> close() is not returning an error (I checked, even though the current code
> assumes success).
> I'm on osx 10.6.7.
> I tried adding a setsockopt SO_LINGER (though I didn't want that to be a
> solution); it didn't work.
> Full debug traces are included in the issue here:
> https://github.com/yfinkelstein/node-zookeeper/issues/6

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira