[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884560#action_12884560
 ] 

Patrick Hunt commented on ZOOKEEPER-801:


Travis, I see the address 10.209.44.182 *2140 times* in the netstat file you 
provided. 

Can you provide some insight on this client - what binding is it using: C or 
Java? Python, Ruby, something else? Also, if this client fails (say it 
segfaults or exits for whatever reason), is it automatically restarted by a 
supervisor? Basically I'm trying to figure out why this client would be 
attempting so many connections to the server - the client libs we provide 
typically hold off a second or so between connection attempts precisely so 
that we don't overload the server in this way. However, if the client were 
being restarted repeatedly, or if there were a bug in the application code 
that established connections in a tight loop (I've seen a couple of client-side 
bugs like this now, where the client code thought it had failed to get a 
connection and therefore created a new ZK client, which caused a DoS on the 
server because thousands of ZK clients were created by each client process - 
this is one of the reasons we added the maxClientCnxns check), or a bug in our 
client library, that could explain why we see so many sockets in the TIME_WAIT 
state in the netstat file.
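
(For illustration only - a minimal python sketch of the failure mode described 
above, not taken from any attached client code; create_zk_client is a 
placeholder for whatever binding-specific call opens a new session. The point 
is the holdoff between attempts and reusing a single handle, as opposed to 
creating new clients in a tight loop.)

{noformat}
import time

RETRY_HOLDOFF_SECS = 1.0   # the bundled client libs wait roughly this long between attempts

def connect_with_holdoff(create_zk_client):
    """Open one session at a time, pausing between failed attempts.

    The buggy pattern is the opposite: deciding (incorrectly) that the
    connection failed and immediately calling create_zk_client() again in a
    tight loop, so each process piles up thousands of sessions and sockets.
    """
    while True:
        try:
            return create_zk_client()       # placeholder factory for a new session
        except Exception:
            time.sleep(RETRY_HOLDOFF_SECS)  # hold off so the server is not hammered
{noformat}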





 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
 Attachments: client-errors.txt, zookeeper-out-of-files.netstat


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers; however, they were getting 
 hammered and I believe sessions were timing out. I shut down ~2k clients 
 (a lightweight python app that simply sets one data watch and takes an action 
 when it changes), at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregation we see 130 errors in a ~13-minute sample 
 period. This behavior across thousands of clients was effectively a DDoS 
 attack against Zookeeper. Making exponential backoff the default behavior 
 seems appropriate looking at this data.
 Full text of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 A long GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball by just using the defaults; I don't think any changes should be made 
 related to that lack of tuning.
 Exponential backoff does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions, and backoff would be 
 a great preventive measure to get out of a dogpile situation.
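 (To make the backoff suggestion concrete - a generic sketch, not tied to any 
 particular binding, of exponential backoff with full jitter and a cap; the 
 connect callable below is a placeholder for whatever call opens a session.)

{noformat}
import random
import time

def backoff_delays(base=1.0, cap=60.0):
    """Yield sleep intervals of roughly 1s, 2s, 4s, ... capped, each jittered."""
    delay = base
    while True:
        yield random.uniform(0, delay)   # full jitter keeps clients from moving in lockstep
        delay = min(cap, delay * 2)

def retry_connect(connect, base=1.0, cap=60.0):
    """Call connect() until it succeeds, sleeping per the schedule above."""
    for pause in backoff_delays(base, cap):
        try:
            return connect()             # placeholder for the binding's connect call
        except Exception:
            time.sleep(pause)
{noformat}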
 Patrick's message:
 
 Hi 

[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884579#action_12884579
 ] 

Patrick Hunt commented on ZOOKEEPER-801:


Travis, one more thing:

{noformat}
egrep 10.209.44.182 zookeeper-out-of-files.netstat |egrep EST |wc -l
66
{noformat}

10.209.44.182 has 66 connections established to this server. Why so many? It 
looks like a potential problem with the client-side code - should it be 
opening more than one session concurrently?



[jira] Updated: (ZOOKEEPER-800) zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE

2010-07-02 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-800:
---

Fix Version/s: 3.3.2
   3.4.0

penciling in for 3.3.2

 zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE
 ---

 Key: ZOOKEEPER-800
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-800
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.1
Reporter: Michi Mutsuzaki
Priority: Minor
 Fix For: 3.3.2, 3.4.0


 This happened when I called zoo_add_auth() immediately after 
 zookeeper_init(). It took me a while to figure out that authentication 
 actually failed since zoo_add_auth() returned ZOK. It should return 
 ZINVALIDSTATE instead. 
 --Michi

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Hudson build is back to normal : ZooKeeper-trunk #864

2010-07-02 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/864/




[jira] Commented: (ZOOKEEPER-778) ability to add a watch on a setData or create call

2010-07-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884625#action_12884625
 ] 

Hudson commented on ZOOKEEPER-778:
--

Integrated in ZooKeeper-trunk #864 (See 
[http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/864/])


 ability to add a watch on a setData or create call
 --

 Key: ZOOKEEPER-778
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-778
 Project: Zookeeper
  Issue Type: Improvement
  Components: c client, java client, server
Reporter: Woody Anderson
Priority: Minor

 It is often desirable to set a watch when creating a node or setting data on 
 a node. Currently, you have to add a watch after the create/set with another 
 api call, which incurs extra cost, and a window of unobserved state change.
 This would seem to be an easy addition to the server/client libs, but I'm 
 not sure if there are reasons this was never proposed or developed.
 I am currently most concerned with a data watch in these two scenarios, but I 
 would imagine other users might be interested in registering a children watch 
 immediately upon creation.
 This change would require adding new method signatures in the clients for 
 create and setData that take watchers, and some changes to the protocol, as 
 the SetDataRequest and CreateRequest objects would need watch flags.
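 (To make the race window concrete - a sketch of today's two-call pattern, 
 using the legacy zkpython binding purely as an example; the proposal above 
 would let the watch be registered atomically as part of the create or setData 
 call itself.)

{noformat}
import zookeeper   # legacy zkpython binding, used here only to illustrate the pattern

OPEN_ACL_UNSAFE = [{"perms": 0x1f, "scheme": "world", "id": "anyone"}]  # 0x1f == all perms

def data_watch(handle, event_type, state, path):
    print("watch fired for %s" % path)

def create_and_watch(handle, path, value):
    # Today this takes two round trips: create the node, then register the
    # watch with a second call. Another client can change or delete the node
    # in between, and that change goes unobserved - the window this issue
    # proposes to close.
    zookeeper.create(handle, path, value, OPEN_ACL_UNSAFE, 0)
    zookeeper.get(handle, path, data_watch)
{noformat}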

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-802) Improved LogGraph filters + documentation

2010-07-02 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated ZOOKEEPER-802:
-

Status: Patch Available  (was: Open)

 Improved LogGraph filters + documentation
 -

 Key: ZOOKEEPER-802
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-802
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
Priority: Minor
 Attachments: ZOOKEEPER-802.patch


 The log filtering mechanism has been improved and extended to work with 
 message logs. Also, the documentation has been moved into the forrest 
 documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-802) Improved LogGraph filters + documentation

2010-07-02 Thread Ivan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Kelly updated ZOOKEEPER-802:
-

Attachment: ZOOKEEPER-802.patch


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-802) Improved LogGraph filters + documentation

2010-07-02 Thread Ivan Kelly (JIRA)
Improved LogGraph filters + documentation
-

 Key: ZOOKEEPER-802
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-802
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Ivan Kelly
Assignee: Ivan Kelly
Priority: Minor


The log filtering mechanism has been improved and extended to work with message 
logs. Also, the documentation has been moved into the forrest documentation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-802) Improved LogGraph filters + documentation

2010-07-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884727#action_12884727
 ] 

Hadoop QA commented on ZOOKEEPER-802:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448577/ZOOKEEPER-802.patch
  against trunk revision 958096.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/129/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/129/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/129/console

This message is automatically generated.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-784) server-side functionality for read-only mode

2010-07-02 Thread Sergey Doroshenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Doroshenko updated ZOOKEEPER-784:


Status: Open  (was: Patch Available)

 server-side functionality for read-only mode
 

 Key: ZOOKEEPER-784
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-784
 Project: Zookeeper
  Issue Type: Sub-task
Reporter: Sergey Doroshenko
Assignee: Sergey Doroshenko
 Attachments: ZOOKEEPER-784.patch, ZOOKEEPER-784.patch, 
 ZOOKEEPER-784.patch, ZOOKEEPER-784.patch, ZOOKEEPER-784.patch


 As per http://wiki.apache.org/hadoop/ZooKeeper/GSoCReadOnlyMode , create 
 ReadOnlyZooKeeperServer which comes into play when peer is partitioned.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-784) server-side functionality for read-only mode

2010-07-02 Thread Sergey Doroshenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Doroshenko updated ZOOKEEPER-784:


Attachment: ZOOKEEPER-784.patch

In this patch:
* the client seeks out a r/w server if it is connected to a r/o server, 
backing off exponentially (see the sketch after this list)
* the fake session (the one assigned by the r/o server) is transitioned to a 
valid session upon connection to a r/w server, transparently to the user
* the client receives notifications about r/o mode via the default watcher
* more tests
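
(Not the patch itself - just a rough illustration of the first bullet, with 
hypothetical helpers connect_to, is_read_only and close standing in for 
binding-specific calls: while attached to a read-only server, keep probing for 
a read/write one, backing off exponentially with jitter between probes.)

{noformat}
import random
import time

def seek_read_write_server(servers, connect_to, is_read_only, close,
                           base=1.0, cap=60.0):
    """Stay usable on a r/o server while hunting for a r/w one."""
    handle = connect_to(random.choice(servers))   # may land on a partitioned, r/o peer
    delay = base
    while is_read_only(handle):
        time.sleep(random.uniform(0, delay))      # exponential backoff with jitter
        delay = min(cap, delay * 2)
        candidate = connect_to(random.choice(servers))
        if is_read_only(candidate):
            close(candidate)                      # keep the existing r/o session for reads
        else:
            close(handle)                         # found a r/w server; switch over
            handle = candidate
    return handle
{noformat}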



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-784) server-side functionality for read-only mode

2010-07-02 Thread Sergey Doroshenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Doroshenko updated ZOOKEEPER-784:


Status: Patch Available  (was: Open)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-784) server-side functionality for read-only mode

2010-07-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884753#action_12884753
 ] 

Hadoop QA commented on ZOOKEEPER-784:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448584/ZOOKEEPER-784.patch
  against trunk revision 958096.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 13 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/130/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/130/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/130/console

This message is automatically generated.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884769#action_12884769
 ] 

Travis Crawford commented on ZOOKEEPER-801:
---

@Patrick - So this morning while investigating I discovered that yes, the 
client you pointed out was indeed misbehaving. It was a new application that 
had a bug.

I talked with a couple of teams and basically everyone is using a different 
Scala client. We've decided http://github.com/jcorwin/zookeeper-client is our 
best Scala client and are forking it from the main twitter account so everyone 
knows to use it.

I looked around the docs and didn't see a place to list the Scala client; 
please let me know if we should list this somewhere.

Closing this issue as it's basically a client bug on our side. Thanks for the 
comments and help tracking this down.


[jira] Resolved: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford resolved ZOOKEEPER-801.
---

  Assignee: Travis Crawford
Resolution: Not A Problem

 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout on the sessions?
 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
 few weeks to fix a couple critical issues (which don't seem related to what 
 you saw). If we can identify the problem here we should be able to include it 
 in any fix release we do.
 fixing something like 517 might help, but it's not clear how we got to this 
 state in the first place. fixing 517 might not have any effect if the root 
 cause is not addressed. 662 has only ever been reported once afaik, and we 
 weren't able to identify the root cause for that one.
 One thing we might also consider is 

[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884779#action_12884779
 ] 

Patrick Hunt commented on ZOOKEEPER-801:


Ah, that's great to hear. ;-)

I would really appreciate it, though, if you (Travis) could enter a new JIRA. 
Basically, I consider it a serious problem that the server failed in this 
situation. Obviously fixing ZOOKEEPER-517 would help here, but really we need 
to ensure that misbehaving clients cannot bring down the service. That's a 
serious bug. When you file the new JIRA, if you could describe what the 
problem was on the client side, that will help us craft a test to verify the 
fix.

Btw, I did add a new wiki page to capture available bindings:
http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings

However, I encourage you (or anyone) to submit the code back as a contrib.

Regards.


[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884775#action_12884775
 ] 

Mahadev konar commented on ZOOKEEPER-801:
-

Travis, can you elaborate on the misbehaving client? What was its exact 
behavior? We just want to make sure whether it was indeed the misbehaving 
client, or whether we need to fix something on our side as well.



[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-07-02 Thread Abmar Barros (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abmar Barros updated ZOOKEEPER-702:
---

Attachment: bertier-pseudo.txt
chen-pseudo.txt
phiaccrual-pseudo.txt

Added more comments to the pseudo-code.

 GSoC 2010: Failure Detector Model
 -

 Key: ZOOKEEPER-702
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
 Project: Zookeeper
  Issue Type: Wish
Reporter: Henry Robinson
Assignee: Abmar Barros
 Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
 chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch


 Failure Detector Module
 Possible Mentor
 Henry Robinson (henry at apache dot org)
 Requirements
 Java, some distributed systems knowledge, comfort implementing distributed 
 systems protocols
 Description
 ZooKeeper servers detect the failure of other servers and clients by 
 counting the number of 'ticks' for which they don't get a heartbeat from 
 other machines. This is the 'timeout' method of failure detection and works 
 very well; however, it is possible that it is too aggressive and not easily 
 tuned for some more unusual ZooKeeper installations (such as in a wide-area 
 network, or even in a mobile ad-hoc network).
 This project would abstract the notion of failure detection to a dedicated 
 Java module, and implement several failure detectors to compare and contrast 
 their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
 phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
 is much more tunable and has some very interesting properties. This is a 
 great project if you are interested in distributed algorithms, or want to 
 help re-factor some of ZooKeeper's internal code.
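 (A toy sketch of the phi-accrual idea from the linked paper, using the common 
 simplification of an exponential inter-arrival model; the real detectors in 
 the attached pseudo-code files are more elaborate.)

{noformat}
import math
import time

class PhiAccrualDetector(object):
    """Track heartbeat inter-arrival times and report suspicion as a
    continuous phi value instead of a binary timeout."""

    def __init__(self, alpha=0.1):
        self._last = None
        self._mean = 1.0      # running mean inter-arrival time, in seconds
        self._alpha = alpha   # exponential moving-average weight

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self._last is not None:
            interval = now - self._last
            self._mean = (1 - self._alpha) * self._mean + self._alpha * interval
        self._last = now

    def phi(self, now=None):
        """phi = -log10(P(no heartbeat for this long)) under an exponential model."""
        if self._last is None:
            return 0.0
        now = time.time() if now is None else now
        return ((now - self._last) / self._mean) * math.log10(math.e)

# e.g. suspect the peer once phi crosses a threshold such as 8
{noformat}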

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)
Improve defenses against misbehaving clients


 Key: ZOOKEEPER-803
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Travis Crawford


This issue is in response to ZOOKEEPER-801. The short version is that a small 
number of buggy clients opened thousands of connections and caused Zookeeper to 
fail.

The misbehaving client did not correctly handle expired sessions, creating a 
new connection each time. The huge number of connections exacerbated the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-803:
--

Attachment: connection-bugfix-diff.png

This diff shows a bug where the client developer confused disconnections with 
expired sessions. In the zookeeper programming model, clients reconnect 
automatically when disconnected. However, should the session expire, the 
application is responsible for reconnecting.

In this case the developer attempted to throttle reconnects; however, due to a 
bug, the application created a new connection each time.

A small number of clients running the buggy code took down a 3-node Zookeeper 
cluster by exhausting the 65k file descriptor limit. It only recovered after 
shutting down the clients, restarting the Zookeepers, and then restarting the 
well-behaved clients.
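
(A minimal sketch of that distinction, assuming the legacy zkpython binding - 
zookeeper.init/close plus the SESSION_EVENT and EXPIRED_SESSION_STATE 
constants; this is not the code from the attached diff. A plain disconnect 
needs no application action, while an expired session gets exactly one 
replacement handle, throttled.)

{noformat}
import time
import zookeeper   # assumption: the legacy zkpython binding used by the python clients here

SERVERS = "10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181"
MIN_SECS_BETWEEN_SESSIONS = 5.0   # throttle creation of brand-new sessions

class ZKSession(object):
    def __init__(self):
        self.expired = False
        self.handle = zookeeper.init(SERVERS, self._watcher, 10000)  # 10s session timeout

    def _watcher(self, handle, event_type, state, path):
        # A plain disconnect is retried by the library itself; only session
        # expiration means this handle is permanently dead.
        if event_type == zookeeper.SESSION_EVENT and state == zookeeper.EXPIRED_SESSION_STATE:
            self.expired = True

def run():
    session = ZKSession()
    last_created = time.time()
    while True:
        if session.expired and time.time() - last_created >= MIN_SECS_BETWEEN_SESSIONS:
            zookeeper.close(session.handle)   # drop the dead handle before making a new one
            session = ZKSession()             # exactly one replacement session, throttled
            last_created = time.time()
        time.sleep(1)
{noformat}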


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-07-02 Thread Abmar Barros (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abmar Barros updated ZOOKEEPER-702:
---

Attachment: ZOOKEEPER-702.patch

Patch changes are:
* Created appMessageReceived() and appMessageSent() methods. These methods 
allow the Failure Detector to use application messages as heartbeats, which 
reflects the ZooKeeper case.
* Added command line options to configure failure detector method and its 
parameters.
* Unit tests expanded to cover the new methods.
* Enhanced javadocs for each Failure Detector implementation.
* Added the commons-math dependency to ivy.xml.

Next things to work on are:
* Adapt server side code to use the proposed FD interface
* Start experimenting


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Wiki page to catalog Client Bindings for ZooKeeper

2010-07-02 Thread Patrick Hunt

I created this wiki page to start cataloging zk client bindings:
http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings

feel free to update if you would like to publicize your binding.

Regards,

Patrick


[jira] Commented: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884829#action_12884829
 ] 

Travis Crawford commented on ZOOKEEPER-803:
---

Maybe 8-10 clients were running the buggy code. Not too many.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884828#action_12884828
 ] 

Patrick Hunt commented on ZOOKEEPER-803:


Thanks for this - approximately how many clients are we talking about?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.