[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
    Attachment: ZOOKEEPER-1057.patch

The test is simpler and looks better if integrated into TestClient.cc. The attached patch can be applied to both trunk and branch 3.4. With this version, the test case passes for the single-threaded version, but for the multithreaded version it hangs forever (or at least for more than a few minutes).

zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
-------------------------------------------------------------------------------------------------

                Key: ZOOKEEPER-1057
                URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1057
            Project: ZooKeeper
         Issue Type: Bug
         Components: c client
   Affects Versions: 3.3.1, 3.3.2, 3.3.3
        Environment: snowdutyrise-lm ~/- uname -a
Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
also observed on:
2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
           Reporter: Woody Anderson
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch

Hello, I'm a contributor for the node.js zookeeper module: https://github.com/yfinkelstein/node-zookeeper

i'm using zk 3.3.3 for the purposes of this issue, but i have validated it fails on 3.3.1 and 3.3.2

i'm having an issue when trying to connect when one of my zookeeper servers is offline. if the first server attempted is online, all is good. if the offline server is attempted first, then the client is never able to connect to _any_ server.

inside zookeeper.c a connection loss (-4) is received, the socket is closed and buffers are cleaned up. it then attempts the next server in the list and creates a new socket (which gets the same fd as the previously closed socket), but connecting fails, and it continues to fail seemingly forever.

The nature of this failure is not that it gets -4 connection loss errors, but that zookeeper_interest doesn't find anything going on on the socket before the user-provided timeout kicks things out. I don't want to have to wait 5 minutes, even if i could make myself.

this is the message that follows the connection loss:

2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 3ms)
2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

While investigating, i decided to comment out close(zh->fd) in handle_error (zookeeper.c#1153). now everything works (obviously i'm leaking an fd): connection to the second host works immediately. this is the behavior i'm looking for, though i clearly don't want to leak the fd, so i'm wondering why the fd re-use is causing this issue. close() is not returning an error (i checked, even though the current code assumes success).

i'm on osx 10.6.7

i tried adding a setsockopt so_linger (though i didn't want that to be a solution), it didn't work.

full debug traces are included in the issue here: https://github.com/yfinkelstein/node-zookeeper/issues/6

--
This message was sent by Atlassian JIRA (v6.1.4#6159)
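The reporter notes that the socket created for the second host gets the same fd as the one that was just closed. On POSIX systems that is normally expected, because a newly opened descriptor takes the lowest unused number. The C sketch below only illustrates that with plain sockets. It is not code from zookeeper.c, `fd_is_reused` is a hypothetical name, and it assumes a single-threaded process in which nothing else opens or closes descriptors in between.

```c
#include <assert.h>      /* only needed by callers that assert on the result */
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP socket, close it, open another one,
 * and report whether the second socket received the same descriptor
 * number as the first.
 * Returns 1 if the number was reused, 0 if not, -1 on a socket error. */
int fd_is_reused(void)
{
    int first, second, same;

    first = socket(AF_INET, SOCK_STREAM, 0);
    if (first < 0)
        return -1;
    if (close(first) != 0)          /* check close(), as the reporter did */
        return -1;

    second = socket(AF_INET, SOCK_STREAM, 0);
    if (second < 0)
        return -1;

    same = (first == second);
    close(second);
    return same;
}
```

Under the stated assumptions a call to `fd_is_reused()` is expected to return 1, so seeing the same descriptor number after close() is not by itself a sign that close() failed.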
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
    Attachment: ZOOKEEPER-1057.patch

The attached patch has a proposed test case that passes on both trunk and 3.4. It was my mistake: there was one zookeeper_close too many in the last patch.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
    Attachment: ZOOKEEPER-1057.patch

... and now with the deterministic connection order, to make sure it is not just luck that it was working. I am very sorry for the spam, I think I need the holidays.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
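Connection order matters for a test like this: if the client permutes the host list before connecting, a run in which the live server happens to come first never exercises the fallback path. A minimal C sketch of that idea follows. It does not call the ZooKeeper library; `order_hosts`, `offline_host_tried_first`, and the host strings are hypothetical, and the shuffle is an ordinary Fisher-Yates pass standing in for whatever the client does internally.

```c
#include <assert.h>      /* only needed by callers that assert on the result */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: shuffle `hosts` in place (Fisher-Yates) unless
 * `deterministic` is non-zero, in which case the order is left alone. */
void order_hosts(const char **hosts, size_t count, int deterministic,
                 unsigned seed)
{
    size_t i;

    if (deterministic || count < 2)
        return;

    srand(seed);
    for (i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);   /* 0 <= j <= i */
        const char *tmp = hosts[i];
        hosts[i] = hosts[j];
        hosts[j] = tmp;
    }
}

/* Returns 1 if the offline host is still the first one that would be
 * tried after ordering, 0 otherwise. The addresses are made up. */
int offline_host_tried_first(int deterministic, unsigned seed)
{
    const char *offline = "offline.example:2181";
    const char *online = "online.example:2181";
    const char *hosts[2];

    hosts[0] = offline;
    hosts[1] = online;
    order_hosts(hosts, 2, deterministic, seed);
    return hosts[0] == offline;
}
```

With `deterministic` set, the offline host is always tried first, so a passing test really did go through the fallback. Without it, the result depends on the seed and a pass may only mean the live host was picked first.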
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michi Mutsuzaki updated ZOOKEEPER-1057:
---------------------------------------
    Priority: Blocker  (was: Critical)

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
    Attachment: ZOOKEEPER-1057-b3.4.patch

ZOOKEEPER-1057-b3.4 is the port of Michi's test case to the 3.4 branch. It fails for me. It uses a standalone server instead of TestQuorumServer, since I figured that we just need a server listening on one port to test this.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Germán Blanco updated ZOOKEEPER-1057:
-------------------------------------
    Attachment: ZOOKEEPER-1057.patch

The trunk version of the b3.4 patch, for whatever it is worth. I guess it will work just like Michi's patch.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Blocker
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057-b3.4.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michi Mutsuzaki updated ZOOKEEPER-1057:
---------------------------------------
    Attachment: ZOOKEEPER-1057.patch

This patch adds a test to validate that the c client gets connected to the second server in the list if the first server is down when zookeeper_init is called.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Critical
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057.patch
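The scenario this test targets (first server in the list down, second one up) can be reproduced without ZooKeeper at all. The sketch below uses plain POSIX non-blocking sockets. It is not the attached patch and not the client's code, and every function name in it (`try_connect`, `connect_with_fallback`, `bind_loopback`, `demo_first_host_down`) is made up for illustration. It assumes a loopback interface and that connecting to a port with no listener fails promptly.

```c
#include <assert.h>      /* only needed by callers that assert on the result */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One non-blocking connect attempt on a fresh socket.
 * Returns a connected fd, or -1 after closing the socket on failure. */
static int try_connect(const struct sockaddr_in *addr, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int flags, rc;

    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    rc = connect(fd, (const struct sockaddr *)addr, sizeof(*addr));
    if (rc != 0 && errno != EINPROGRESS) {
        close(fd);                       /* refused immediately */
        return -1;
    }
    if (rc != 0) {                       /* connect still in progress */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
        int err = 0;
        socklen_t len = sizeof(err);

        /* Wait for completion, then read the real outcome from SO_ERROR. */
        if (poll(&pfd, 1, timeout_ms) <= 0 ||
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 ||
            err != 0) {
            close(fd);
            return -1;
        }
    }
    return fd;
}

/* Try each host in order; store the connected fd in *out_fd and return
 * the index of the host that accepted, or -1 if none did. */
int connect_with_fallback(const struct sockaddr_in *addrs, size_t count,
                          int timeout_ms, int *out_fd)
{
    size_t i;

    for (i = 0; i < count; i++) {
        int fd = try_connect(&addrs[i], timeout_ms);
        if (fd >= 0) {
            *out_fd = fd;
            return (int)i;
        }
    }
    return -1;
}

/* Bind a loopback socket to an ephemeral port and optionally listen.
 * The chosen address is written back into *addr. */
static int bind_loopback(struct sockaddr_in *addr, int do_listen)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) != 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) != 0 ||
        (do_listen && listen(fd, 1) != 0)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* First host has no listener, second host is listening.
 * Returns the index that connected, or -2 if the setup failed. */
int demo_first_host_down(void)
{
    struct sockaddr_in hosts[2];
    int dead = bind_loopback(&hosts[0], 0);
    int live = bind_loopback(&hosts[1], 1);
    int fd = -1, idx;

    if (dead < 0 || live < 0) {
        if (dead >= 0) close(dead);
        if (live >= 0) close(live);
        return -2;
    }
    close(dead);                 /* port reserved above now has no listener */
    idx = connect_with_fallback(hosts, 2, 1000, &fd);
    if (fd >= 0)
        close(fd);
    close(live);
    return idx;
}
```

`demo_first_host_down()` should return 1 when the fallback works. Each attempt uses its own socket and reads the outcome from `SO_ERROR` after `poll`, so a refused first attempt leaves no state behind for the second one.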
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michi Mutsuzaki updated ZOOKEEPER-1057:
---------------------------------------
    Attachment: ZOOKEEPER-1057.patch

Trying again.

                Key: ZOOKEEPER-1057
           Assignee: Michi Mutsuzaki
           Priority: Critical
            Fix For: 3.4.6, 3.5.0
        Attachments: ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-1057:
-------------------------------------
    Fix Version/s:     (was: 3.3.4)
                       (was: 3.4.0)
                   3.5.0

Not a blocker.

                Key: ZOOKEEPER-1057
            Fix For: 3.5.0

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-1057:
-------------------------------------
    Fix Version/s: 3.4.0

                Key: ZOOKEEPER-1057
            Fix For: 3.3.4, 3.4.0