[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921244#action_12921244 ] Alexandre Hardy commented on ZOOKEEPER-885: ---
{quote} Hi Alexandre, When you load the machines running the zookeeper servers by running the dd command, how much time elapses between running dd and observing the connections expiring? I have not been able to reproduce it, and I wonder how long the problem takes to manifest. {quote}
Hi Flavio, Problems usually start occurring after about 30 seconds. I have also tested on some other machines and the response varies somewhat. I suspect that the 30 seconds is dictated by some queue that needs to fill up before significant disk traffic is initiated. I think that the speed at which this queue is processed determines how likely it is that zookeeper will fail to respond to a ping. Please also see my responses to Patrick's questions.

Zookeeper drops connections under moderate IO load -- Key: ZOOKEEPER-885 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.2.2, 3.3.1 Environment: Debian (Lenny), 1Gb RAM, swap disabled, 100Mb heap for zookeeper Reporter: Alexandre Hardy Priority: Critical Attachments: tracezklogs.tar.gz, tracezklogs.tar.gz, WatcherTest.java, zklogs.tar.gz

A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain the connection when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command: {noformat} dd if=/dev/urandom of=/dev/mapper/nimbula-test {noformat} The {{dd}} command transferred data at a rate of about 4Mb/s.
The same thing happens with {noformat} dd if=/dev/zero of=/dev/mapper/nimbula-test {noformat} It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations. Although the documentation states that minimal competing IO load should be present on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921251#action_12921251 ] Alexandre Hardy commented on ZOOKEEPER-885: --- Hi Patrick,
{quote} 1) you are applying the load using dd to all three servers at the same time, is that correct? (not just to 1 server) {quote}
Correct. If {{dd}} is run on only one machine then the likelihood of disconnects is reduced. Unfortunately our typical scenario would involve load on all three machines.
{quote} 2) /dev/mapper indicates some sort of lvm setup, can you give more detail on that? (fyi http://ubuntuforums.org/showthread.php?t=646340) {quote}
Yes, we have an LVM setup on a single spindle. The nimbula-test logical volume is 10G in size and (obviously) shares the same spindle as the root and log (/var/log) partitions.
{quote} 3) you mentioned that this: echo 5 > /proc/sys/vm/dirty_ratio echo 5 > /proc/sys/vm/dirty_background_ratio resulted in stability in this test, can you tell us what this was set to initially? Check out this article: http://lwn.net/Articles/216853/ {quote}
The initial value for {{/proc/sys/vm/dirty_ratio}} is 20, and the initial value for {{/proc/sys/vm/dirty_background_ratio}} is 10. These machines have 1G of RAM, and thus are less susceptible to the problems mentioned in http://lwn.net/Articles/216853/ (as I see it). I have run a more complete benchmark with random IO instead of {{dd}} sequential IO, testing session timeouts and the effect of {{dirty_ratio}} settings. I will attach that separately. {{dirty_ratio}} seems to help with the {{dd}} test but has much less influence in the random IO test.
{quote} I notice you are running a bigmem kernel. What's the total memory size? How large a heap have you assigned to the ZK server? (jvm) {quote}
We have 1G on each machine in this test system and a 100M heap size for each zookeeper server.
{quote} 4) Can you verify whether or not the JVM is swapping?
Any chance that the server JVM is swapping, which is causing the server to pause, which then causes the clients to time out? This seems to me like it would fit the scenario, especially given that when you turn the dirty_ratio down you see stability increase (the time it would take to complete the flush would decrease, meaning that the server can respond before the client times out). {quote}
I'm not entirely sure of all the JVM internals, but all swap space on the Linux system was disabled, so no swapping by the Linux kernel would happen. I'm not sure whether the JVM does any swapping of its own. I concur with your analysis. What puzzles me is why the system would even get into a state where the zookeeper server would have to wait so long for a disk flush. In the case of {{dd if=/dev/urandom}} the IO rate is quite low, and there should (I think) be more than enough IOPS available for zookeeper to flush data to disk in time. Even if the IO scheduling results in this scenario, it is still not clear to me why zookeeper would fail to respond to a ping. My only conclusion at this stage is that responding to a ping requires information to be flushed to disk. Is this correct? Referring to your private e-mail:
{quote} The weird thing here is that there should be no delay for these pings. {quote}
This would indicate to me that the ping response should not be dependent on any disk IO. Thanks for all the effort in looking into this!
[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Hardy updated ZOOKEEPER-885: -- Attachment: benchmark.csv
Here are some results with different settings of {{dirty_ratio}}, {{dirty_bytes}} (finer control), session timeouts, and IO priorities (set with {{ionice}}). The status field indicates success or failure: 0=success and anything else is failure, where failure means a zookeeper session disconnected (but did not necessarily expire). Each test ran for a maximum of 5 minutes. A test could also fail if it failed to connect to the zookeeper servers within the first 60 seconds; unfortunately the return code did not differentiate properly between these cases. However, pauses of about 4 seconds were allowed between tests, during which all IO operations (by the test program) were stopped. This should have allowed the system to stabilize somewhat. In this test the session timeout had the most significant influence on stability (not too surprising) and the other system settings had far less influence. We have settled on a 60s session timeout for the moment (to achieve the stability we need under the IO loads we are experiencing), but it would be great if we could reduce this a bit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
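As a rough illustration of how the status field in the attached benchmark.csv might be tallied, here is a sketch; the column names and file layout are invented for the example (the issue only states that status 0 means success and anything else a disconnect):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical summary of a benchmark CSV with a "status" column:
// 0 = success, non-zero = a zookeeper session disconnected.
public class BenchmarkSummary {
    public static void main(String[] args) throws IOException {
        // Tiny inline sample standing in for benchmark.csv.
        String csv = "session_timeout_ms,dirty_ratio,status\n"
                   + "10000,20,1\n"
                   + "60000,5,0\n"
                   + "60000,20,0\n";
        int total = 0, failures = 0;
        BufferedReader r = new BufferedReader(new StringReader(csv));
        r.readLine();                              // skip the header row
        for (String line; (line = r.readLine()) != null; ) {
            total++;
            String[] fields = line.split(",");
            if (!fields[2].equals("0")) failures++; // non-zero status = disconnect
        }
        System.out.println(failures + " failures out of " + total + " runs");
    }
}
```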
[jira] Commented: (ZOOKEEPER-893) ZooKeeper high cpu usage when invalid requests
[ https://issues.apache.org/jira/browse/ZOOKEEPER-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921286#action_12921286 ] Thijs Terlouw commented on ZOOKEEPER-893: - I'll try to write a test case, but I'm not exactly sure yet how to integrate it. I need to look at some example unit tests first :)

ZooKeeper high cpu usage when invalid requests -- Key: ZOOKEEPER-893 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-893 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.3.1 Environment: Linux 2.6.16, 4x Intel(R) Xeon(R) CPU X3320 @ 2.50GHz, java version 1.6.0_17, Java(TM) SE Runtime Environment (build 1.6.0_17-b04), Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode) Reporter: Thijs Terlouw Assignee: Thijs Terlouw Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-893.patch Original Estimate: 1h Remaining Estimate: 1h

When ZooKeeper receives certain illegally formed messages on the internal communication port (:4181 by default), it's possible for ZooKeeper to enter an infinite loop which causes 100% cpu usage. It's related to ZOOKEEPER-427, but that patch does not resolve all issues. From src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java, the two affected parts:
===
int length = msgLength.getInt();
if (length <= 0) {
    throw new IOException("Invalid packet length: " + length);
}
===
===
while (message.hasRemaining()) {
    temp_numbytes = channel.read(message);
    if (temp_numbytes < 0) {
        throw new IOException("Channel eof before end");
    }
    numbytes += temp_numbytes;
}
===
How to replicate this bug: perform an nmap portscan against your zookeeper server: nmap -sV -n your.ip.here -p4181 and wait for a while until you see some messages in the logfile; then you will see 100% cpu usage. It does not recover from this situation. With my patch, it does not occur anymore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
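The failure mode in the second fragment is a read loop that never checks for end-of-stream: if the peer disconnects mid-message, read() returns -1 forever and the loop spins at 100% cpu. A minimal standalone sketch of the defensive loop (a simplified reconstruction for illustration, not the actual patch) shows it raising an error instead:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Illustration: a channel that hits EOF before the full message arrives must
// throw, rather than letting the while loop spin forever on read() == -1.
public class ReadLoopDemo {
    static void readFully(ReadableByteChannel ch, ByteBuffer msg) throws IOException {
        while (msg.hasRemaining()) {
            int n = ch.read(msg);
            if (n < 0) {                           // EOF before the message is complete
                throw new IOException("Channel eof before end");
            }
        }
    }

    public static void main(String[] args) {
        // A fake channel that delivers 2 bytes and then reports end-of-stream.
        ReadableByteChannel truncated = new ReadableByteChannel() {
            private int remaining = 2;
            public int read(ByteBuffer dst) {
                if (remaining == 0) return -1;     // end of stream
                dst.put((byte) 0);
                remaining--;
                return 1;
            }
            public boolean isOpen() { return true; }
            public void close() {}
        };
        try {
            readFully(truncated, ByteBuffer.allocate(8)); // expects 8 bytes
            System.out.println("no exception");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```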
[jira] Commented: (ZOOKEEPER-881) ZooKeeperServer.loadData loads database twice
[ https://issues.apache.org/jira/browse/ZOOKEEPER-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921306#action_12921306 ] Hudson commented on ZOOKEEPER-881: -- Integrated in ZooKeeper-trunk #967 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/967/]) ZOOKEEPER-881. ZooKeeperServer.loadData loads database twice (jared cantwell via breed) ZooKeeperServer.loadData loads database twice - Key: ZOOKEEPER-881 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-881 Project: Zookeeper Issue Type: Bug Components: server Reporter: Jared Cantwell Assignee: Jared Cantwell Priority: Trivial Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-881.patch zkDb.loadDataBase() is called twice at the beginning of loadData(). It shouldn't have any negative effects, but is unnecessary. A patch should be trivial. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Branch hudson job created for ZOOKEEPER-823
Patrick Hunt: I created a branch in SVN and a new Hudson job to track the ZOOKEEPER-823 patch. Getting this one done has been tough, Thomas has been working on it but we are seeing intermittent failures. lmk if you have any questions. Hi Patrick, thank you for the branch. Is there a way for me to commit to this branch? Or if somebody else commits on this branch, how should I send patches? If I can commit, is there a way I could use Git to work on this branch? I've not yet used git-svn to do commits. Have a nice weekend, Thomas Koch, http://www.koch.ro
[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921340#action_12921340 ] Vishal K commented on ZOOKEEPER-851: Hi Henry, Any comments? Thanks. -Vishal ZK lets any node to become an observer -- Key: ZOOKEEPER-851 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.1 Reporter: Vishal K Priority: Critical Fix For: 3.4.0 I had a 3 node cluster running. The zoo.cfg on each contained 3 entries as shown below: tickTime=2000 dataDir=/var/zookeeper clientPort=2181 initLimit=5 syncLimit=2 server.0=10.150.27.61:2888:3888 server.1=10.150.27.62:2888:3888 server.2=10.150.27.63:2888:3888 I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I created another entry for that node and started the zk server. The zoo.cfg on the first 3 nodes was left unchanged. The fourth node was able to join the cluster even though the first 3 nodes had no idea about the fourth node. zoo.cfg on the fourth node: tickTime=2000 dataDir=/var/zookeeper clientPort=2181 initLimit=5 syncLimit=2 server.0=10.150.27.61:2888:3888 server.1=10.150.27.62:2888:3888 server.2=10.150.27.63:2888:3888 server.3=10.17.117.71:2888:3888 It looks like 10.17.117.71 is becoming an observer in this case. I was expecting that the leader would reject 10.17.117.71. # telnet 10.17.117.71 2181 Trying 10.17.117.71... Connected to 10.17.117.71. Escape character is '^]'. stat Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT Clients: /10.17.117.71:37297[1](queued=0,recved=1,sent=0) Latency min/avg/max: 0/0/0 Received: 3 Sent: 2 Outstanding: 0 Zxid: 0x20065 Mode: follower Node count: 288 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: What's the QA strategy of ZooKeeper?
Hi Benjamin, thank you for your response. Please find some comments inline. Benjamin Reed: code quality is important, and there are things we should keep in mind, but in general i really don't like the idea of risking code breakage because of a gratuitous code cleanup. we should be watching out for these things when patches get submitted or when new things go in. I didn't want to say it that clearly, but especially the new Netty code, both on the client and server side, is IMHO an example of new code in very bad shape. The client code patch even changes the FindBugs configuration to exclude the new code from the FindBugs checks. i think this is in line with what pat was saying. just to expand a bit. in my opinion clean up refactorings have the following problems: 1) you risk breaking things in production for a potential future maintenance advantage. If your code is already in such bad shape that every change includes considerable risk of breaking something, then you are already in trouble. With every new feature (or bugfix!) you also risk breaking something. If you don't have the attitude of permanent refactoring to improve the code quality, you will inevitably lower the maintainability of your code with every new feature. New features will build on the dirty concepts already in the code and therefore make it more expensive to ever clean things up. 2) there is always subjectivity: quality code for one code quality zealot is often seen as a bad design by another code quality zealot. unless there is an objective reason to do it, don't. I don't agree. IMHO, the area of subjectivity in code quality is actually very small compared to hard established standards of quality metrics and best practices. I believe that my list given in the first mail of this thread gives examples of rather objective guidelines. 3) you may cleanup the wrong way. you may restructure to make the current code clean and then end up rewriting and refactoring again to change the logic. Yes.
Refactoring isn't easy, but it is necessary. Only over time do you come to understand your domain better and find better structures. Over time you introduce features that make the code grow, so that it should be split up into smaller units that the human brain can still handle. i think we can mitigate 1) by only doing it when necessary. as a corollary we can mitigate 2) and 3) by only doing refactoring/cleanups when motivated by some new change: fix a bug, increased performance, new feature, etc. I agree that refactoring should be carefully planned and done in small steps. Therefore I collected each refactoring item for the java client in a small separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that can individually be discussed, reviewed and tested. Have a nice weekend after Hadoop World! Thomas Koch, http://www.koch.ro
[jira] Commented: (ZOOKEEPER-445) Potential bug in leader code
[ https://issues.apache.org/jira/browse/ZOOKEEPER-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921360#action_12921360 ] Vishal K commented on ZOOKEEPER-445: Is this a bug? If not, we should just close this one. Potential bug in leader code Key: ZOOKEEPER-445 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-445 Project: Zookeeper Issue Type: Bug Components: server Environment: Linux fortiz-desktop 2.6.27-7-generic #1 SMP Fri Oct 24 06:42:44 UTC 2008 i686 GNU/Linux java version 1.6.0_10 Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing) Reporter: Manos Kapritsos Priority: Minor Fix For: 3.4.0 Original Estimate: 0.33h Remaining Estimate: 0.33h There is a suspicious line in server/quorum/Leader.java:226. It reads: if (stop) { LOG.info("exception while shutting down acceptor: " + e); stop = true; } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
FLE implementation should be improved to use non-blocking sockets - Key: ZOOKEEPER-900 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900 Project: Zookeeper Issue Type: Bug Reporter: Vishal K Assignee: Flavio Junqueira Priority: Critical From earlier email exchanges: 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SenderWorker and RecvWorker threads since they block IO to the respective peer. b) The blocking IO problem is not just restricted to connectOne(), but also appears in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also the code has an inherent cycle.
initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify. Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
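The bounded-wait connect described in (a) can be sketched with a non-blocking SocketChannel and a Selector rather than a helper thread plus timer; this is a minimal illustrative sketch, not the actual ZooKeeper patch:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Sketch: put an upper bound on the time spent waiting for a connect,
// so the caller is never stuck in a blocking connect indefinitely.
public class TimedConnect {
    static boolean connectWithTimeout(InetSocketAddress addr, long timeoutMs) {
        try (SocketChannel ch = SocketChannel.open()) {
            ch.configureBlocking(false);
            if (ch.connect(addr)) return true;            // connected immediately
            try (Selector sel = Selector.open()) {
                ch.register(sel, SelectionKey.OP_CONNECT);
                if (sel.select(timeoutMs) == 0) return false; // timed out
                return ch.finishConnect();                // throws if refused
            }
        } catch (IOException e) {
            return false;                                 // treat failure as "not connected"
        }
    }

    public static void main(String[] args) {
        // 192.0.2.1 is a reserved TEST-NET-1 address, so this connect cannot
        // succeed; the call returns false instead of blocking indefinitely.
        boolean ok = connectWithTimeout(new InetSocketAddress("192.0.2.1", 3888), 100);
        System.out.println("connected: " + ok);
    }
}
```

The same Selector could also drive the accepts in the Listener, so a stalled peer cannot block receiveConnection() for everyone else.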
Re: What's the QA strategy of ZooKeeper?
Hi, I would like to add my two cents here. I would suggest staying away from code cleanup unless it is absolutely necessary. I would also like to extend this discussion to understand the amount of testing/QA to be performed before a release. How do we currently qualify a release? Recently, we have run into issues in ZK that I believe should have been caught by some basic testing before the release. I will be honest in saying that, unfortunately, these bugs have resulted in questions being raised by several people in our organization about our choice of using ZooKeeper. Nevertheless, our product group really thinks that ZK is a cool technology, but we need to focus on making it robust before adding major new features to it. I would suggest to: 1. Look at current bugs and see why existing tests did not uncover these bugs, and improve those tests. 2. Look at places that need more tests and broadcast it to the community. Follow up with test development. 3. Have a crisp release QA strategy for each release. 4. Improve API documentation as well as code documentation so that the API usage is clear and debugging is made easier. Comments? Thanks. -Vishal On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote: Hi Benjamin, thank you for your response. Please find some comments inline. Benjamin Reed: code quality is important, and there are things we should keep in mind, but in general i really don't like the idea of risking code breakage because of a gratuitous code cleanup. we should be watching out for these things when patches get submitted or when new things go in. I didn't want to say it that clearly, but especially the new Netty code, both on the client and server side, is IMHO an example of new code in very bad shape. The client code patch even changes the FindBugs configuration to exclude the new code from the FindBugs checks. i think this is in line with what pat was saying. just to expand a bit.
in my opinion clean up refactorings have the following problems: 1) you risk breaking things in production for a potential future maintenance advantage. If your code is already in such bad shape that every change includes considerable risk of breaking something, then you are already in trouble. With every new feature (or bugfix!) you also risk breaking something. If you don't have the attitude of permanent refactoring to improve the code quality, you will inevitably lower the maintainability of your code with every new feature. New features will build on the dirty concepts already in the code and therefore make it more expensive to ever clean things up. 2) there is always subjectivity: quality code for one code quality zealot is often seen as a bad design by another code quality zealot. unless there is an objective reason to do it, don't. I don't agree. IMHO, the area of subjectivity in code quality is actually very small compared to hard established standards of quality metrics and best practices. I believe that my list given in the first mail of this thread gives examples of rather objective guidelines. 3) you may cleanup the wrong way. you may restructure to make the current code clean and then end up rewriting and refactoring again to change the logic. Yes. Refactoring isn't easy, but it is necessary. Only over time do you come to understand your domain better and find better structures. Over time you introduce features that make the code grow, so that it should be split up into smaller units that the human brain can still handle. i think we can mitigate 1) by only doing it when necessary. as a corollary we can mitigate 2) and 3) by only doing refactoring/cleanups when motivated by some new change: fix a bug, increased performance, new feature, etc. I agree that refactoring should be carefully planned and done in small steps.
Therefore I collected each refactoring item for the java client in a small separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that can individually be discussed, reviewed and tested. Have a nice weekend after Hadoop World! Thomas Koch, http://www.koch.ro
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921412#action_12921412 ] Benjamin Reed commented on ZOOKEEPER-885: - we are having problems reproducing this. can you give a bit more detail on the machines you are using? what are the cpu and memory sizes? also, what is the throughput of dd if=/dev/zero of=/dev/mapper/nimbula-test? is there just one disk, where nimbula-test is a partition on that disk and you have another partition for the snapshots and logs? even if you don't have swap space, code pages can be discarded and loaded on demand, so that could be a potential problem. what does /proc/meminfo look like? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921459#action_12921459 ] Dave Wright commented on ZOOKEEPER-885: --- I don't think the cause of this is much of a mystery, as we experienced similar problems when we had the zookeeper files on the same filesystem as an IO-heavy application that was doing buffered IO. Quite simply, when zookeeper does a sync on its own files, it causes the entire filesystem to sync, flushing any buffered data from the IO-heavy application and freezing the ZK server process for long enough for heartbeats to time out. When you say "moderate IO load" I'm curious what the bottleneck is: the dd command will copy data as fast as possible, so if you're only getting 4MB/sec, the underlying device must be pretty slow, which would further explain why a sync() request would take a while to complete. The only fix we've seen is to put the ZK files on their own device, although you may be able to fix it with a different partition on the same device. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
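The stall Dave describes can be observed directly by timing the fsync that ZooKeeper issues on its transaction log; this is a self-contained sketch (not ZooKeeper code) that times a FileChannel.force() call, which on a filesystem with lots of dirty buffered data from another writer can block for seconds:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Measures how long an fsync (FileChannel.force) takes. Under heavy
// competing buffered IO on the same filesystem, this number grows, which
// is exactly the freeze that causes heartbeats to time out.
public class FsyncTiming {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("zk-fsync-demo", ".log");
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("txn".getBytes()));
            long t0 = System.nanoTime();
            ch.force(true);                    // flush data and metadata to disk
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("fsync took " + ms + " ms");
        } finally {
            Files.delete(p);
        }
    }
}
```

Running this while a dd like the one above is writing to the same filesystem makes the contrast with an idle disk visible.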
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921467#action_12921467 ] Flavio Junqueira commented on ZOOKEEPER-885: I'm not sure it is that simple, Dave. The problem is that pings do not require writes to disk, and in the scenario that Alexandre describes, there are only pings being processed. Why is the background I/O load affecting the processing of ZooKeeper? And in particular, why are sessions expiring as a consequence of this background I/O load? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921471#action_12921471 ] Patrick Hunt commented on ZOOKEEPER-885: Re what Dave just mentioned (fsync issues on an ext3 volume), this is a good read: http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ ZK is in a similar situation. The perplexing thing here, though, is that the client is just doing pings, so really there should be no writes/syncs to disk for ZK. That's what I'm stuck on.
Re: What's the QA strategy of ZooKeeper?
Well said Vishal. I really like the points you put forth!!! Agree on all the points, but again, all the points you mention require commitment from folks like you. It's a pretty hard task to test all the corner cases of ZooKeeper; I'd expect everyone to pitch in for testing a release. We should definitely work towards a plan. You should go ahead and create a jira for the QA plan, and we should all pitch in with what should be tested.

Thanks
mahadev

On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:

Hi,

I would like to add my few cents here. I would suggest staying away from code cleanup unless it is absolutely necessary. I would also like to extend this discussion to understand the amount of testing/QA to be performed before a release. How do we currently qualify a release? Recently, we have run into issues in ZK that I believe should have been caught by some basic testing before the release. I will be honest in saying that, unfortunately, these bugs have resulted in questions being raised by several people in our organization about our choice of using ZooKeeper. Nevertheless, our product group really thinks that ZK is a cool technology, but we need to focus on making it robust before adding major new features to it. I would suggest to:

1. Look at current bugs, see why existing tests did not uncover them, and improve those tests.
2. Look at places that need more tests and broadcast it to the community. Follow up with test development.
3. Have a crisp release QA strategy for each release.
4. Improve API documentation as well as code documentation so that the API usage is clear and debugging is made easier.

Comments? Thanks.
-Vishal

On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:

Hi Benjamin, thank you for your response. Please find some comments inline.
Benjamin Reed:
> code quality is important, and there are things we should keep in mind, but in general i really don't like the idea of risking code breakage because of a gratuitous code cleanup. we should be watching out for these things when patches get submitted or when new things go in.

I didn't want to say it that clearly, but especially the new Netty code, on both the client and server side, is IMHO an example of new code in very bad shape. The client code patch even changes the FindBugs configuration to exclude the new code from the FindBugs checks.

> i think this is in line with what pat was saying. just to expand a bit. in my opinion clean up refactorings have the following problems: 1) you risk breaking things in production for a potential future maintenance advantage.

If your code is already in such bad shape that every change includes considerable risk of breaking something, then you already are in trouble. With every new feature (or bugfix!) you also risk breaking something. If you don't have the attitude of permanent refactoring to improve the code quality, you will inevitably lower the maintainability of your code with every new feature. New features will build on the dirty concepts already in the code and therefore make it more expensive to ever clean things up.

> 2) there is always subjectivity: quality code for one code quality zealot is often seen as a bad design by another code quality zealot. unless there is an objective reason to do it, don't.

I don't agree. IMHO, the area of subjectivity in code quality is actually very small compared to the hard, established standards of quality metrics and best practices. I believe that my list given in the first mail of this thread gives examples of rather objective guidelines.

> 3) you may cleanup the wrong way. you may restructure to make the current code clean and then end up rewriting and refactoring again to change the logic.

Yes. Refactoring isn't easy, but it is necessary. Only over time do you better understand your domain and find better structures. Over time you introduce features that let the code grow so that it should better be split up into smaller units that the human brain can still handle.

> i think we can mitigate 1) by only doing it when necessary. as a corollary we can mitigate 2) and 3) by only doing refactoring/cleanups when motivated by some new change: fix a bug, increased performance, new feature, etc.

I agree that refactoring should be carefully planned and done in small steps. Therefore I collected each refactoring item for the java client in a small separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that can individually be discussed, reviewed and tested.

Have a nice weekend after Hadoop World!

Thomas Koch, http://www.koch.ro
Re: What's the QA strategy of ZooKeeper?
> Recently, we have run into issues in ZK that I believe should have been caught by some basic testing before the release

Vishal, can you be more specific? Pointing out the specific JIRAs you entered would be very valuable. Don't worry about hurting our feelings or anything; without this type of feedback we can't address the specific issues and their underlying problems.

Regards,
Patrick
Re: What's the QA strategy of ZooKeeper?
i think we have a very different perspective on the quality issue:

> I didn't want to say it that clearly, but especially the new Netty code, on both the client and server side, is IMHO an example of new code in very bad shape. The client code patch even changes the FindBugs configuration to exclude the new code from the FindBugs checks.

great. fixing the code and refactoring before a patch goes in is the perfect time to do it! please give feedback and help make the patch better. there is a reason to exclude checks (which is why such excludes exist), but if we can avoid them we should. before a patch is applied is exactly the time to do cleanup.

> If your code is already in such bad shape that every change includes considerable risk of breaking something, then you already are in trouble. With every new feature (or bugfix!) you also risk breaking something. If you don't have the attitude of permanent refactoring to improve the code quality, you will inevitably lower the maintainability of your code with every new feature. New features will build on the dirty concepts already in the code and therefore make it more expensive to ever clean things up.

cleaning up code to add a new feature is a great time to clean up the code.

> Yes. Refactoring isn't easy, but it is necessary. Only over time do you better understand your domain and find better structures. Over time you introduce features that let the code grow so that it should better be split up into smaller units that the human brain can still handle.

it is the "but necessary" that i disagree with. there is plenty of code that could be cleaned up and made to look a lot nicer, but we shouldn't touch it unless we are fixing something else or adding a new feature. it's pretty lame to explain to someone that the bug that was introduced by a code change was motivated by a desire to make the code cleaner. any code change runs the risk of breakage, thus changing code simply for cleanliness is not worth the risk. ben
[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jared Cantwell updated ZOOKEEPER-804: - Attachment: ZOOKEEPER-804-1.patch Not sure what the protocol is here, but I'm gonna go ahead and attach a revised patch. c unit tests failing due to assertion cptr failed --- Key: ZOOKEEPER-804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.4.0 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel) Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch I'm seeing this frequently: [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed. [exec] make: *** [run-check] Aborted [exec] Zookeeper_simpleSystem::testHangingClient Mahadev can you take a look?
Re: What's the QA strategy of ZooKeeper?
Hi Mahadev,

Yes, this will require effort from the community. On a related note, how many of the contributors do you think are dedicated (either partially or fully) to ZK development/testing? This will help in understanding how much effort will be required for tasks and how we can prioritize them based on resources. I will create a JIRA for this with some initial thoughts. However, the list would have to come from folks who are currently familiar with parts of the ZK code.

Thanks.
-Vishal
Re: What's the QA strategy of ZooKeeper?
Hi Patrick,

On Fri, Oct 15, 2010 at 2:22 PM, Patrick Hunt ph...@apache.org wrote:
> Vishal, can you be more specific? Pointing out the specific JIRAs you entered would be very valuable. Don't worry about hurting our feelings or anything; without this type of feedback we can't address the specific issues and their underlying problems.

Here's a list of a few issues:
- Leader election taking a long time to complete - https://issues.apache.org/jira/browse/ZOOKEEPER-822
- Last processed zxid set prematurely while establishing leadership - https://issues.apache.org/jira/browse/ZOOKEEPER-790
- FLE implementation should be improved to use non-blocking sockets - ZOOKEEPER-900
- ZK lets any node to become an observer - https://issues.apache.org/jira/browse/ZOOKEEPER-851
Re: What's the QA strategy of ZooKeeper?
I broadly agree with Ben - all meaningful code changes carry a risk of destabilization (otherwise software development would be very easy), so we should guard against improving cleanliness only for its own sake. At the point where bad code gets in the way of fixing bugs or adding features, I think it's very worthwhile to 'lazily' clean code. I did this with the observers patch - reworked some of the class hierarchies to improve encapsulation and make it easier to add new implementations.

The netty patch is a good test case for this approach. If we feel that reworking the structure of the existing server cnxn code will make it significantly easier to add a second implementation that adheres to the same interface, then I say that such a refactoring is worthwhile, but even then only if it's straightforward to make the changes while convincing ourselves that the behaviour of the new implementation is consistent with the old.

Thomas, do comment on the patch itself! That's the very best way to make sure your concerns get heard and addressed.

cheers,
Henry

--
Henry Robinson
Software Engineer
Cloudera
415-994-6679
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921528#action_12921528 ] Dave Wright commented on ZOOKEEPER-885: --- Has it been verified that ZK is doing no disk activity at all during that time? What about log file writes? What about sessions being established/torn down (which would cause syncs)?
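These questions can be answered empirically on Linux. The sketch below is hypothetical: the real check would target the ZooKeeper server's PID (for example via `pgrep -f QuorumPeerMain`), but here the script falls back to its own PID so the commands are demonstrable anywhere.

```shell
#!/bin/sh
# Hypothetical diagnostics: does the process write to disk at all while idle?
# ZK_PID should be the ZooKeeper server's PID, e.g. $(pgrep -f QuorumPeerMain);
# we use our own PID here so the sketch is self-contained.
ZK_PID=$$

# Sample the kernel's per-process I/O counters twice. If write_bytes does not
# grow between samples, the process issued no disk writes in the interval.
grep -E 'read_bytes|write_bytes' /proc/$ZK_PID/io
sleep 1
grep -E 'read_bytes|write_bytes' /proc/$ZK_PID/io

# Deeper checks, not run here:
#   strace -f -e trace=fsync,fdatasync,write -p $ZK_PID   # syscall-level trace
#   iostat -x 5                                           # device-level view
```

A flat write_bytes counter would rule out log-file writes and session-teardown syncs as the cause, pointing instead at scheduling or flush stalls outside the process itself.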
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921557#action_12921557 ] Flavio Junqueira commented on ZOOKEEPER-885: I've been running it and there is no traffic to the disk while the clients are watching. We generate a snapshot every snapCount transactions, but given that there are no transactions generated, no transaction is appended to the log and no new snapshot is written.
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921558#action_12921558 ] Michi Mutsuzaki commented on ZOOKEEPER-804:
---
Jared, thank you for volunteering to fix this. I haven't had a chance to take care of it.

--Michi

c unit tests failing due to assertion cptr failed
-------------------------------------------------

                 Key: ZOOKEEPER-804
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
             Project: Zookeeper
          Issue Type: Bug
          Components: c client
    Affects Versions: 3.4.0
         Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
            Reporter: Patrick Hunt
            Assignee: Michi Mutsuzaki
            Priority: Critical
             Fix For: 3.3.2, 3.4.0
         Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch

I'm seeing this frequently:

[exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
[exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
[exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
[exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK
[exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed.
[exec] make: *** [run-check] Aborted
[exec] Zookeeper_simpleSystem::testHangingClient

Mahadev can you take a look?
Re: What's the QA strategy of ZooKeeper?
On Fri, Oct 15, 2010 at 12:11 PM, Henry Robinson <he...@cloudera.com> wrote:
> The netty patch is a good test case for this approach. If we feel that
> reworking the structure of the existing server cnxn code will make it
> significantly easier to add a second implementation that adheres to the
> same interface, then I say that such a refactoring is worthwhile, but even
> then only if it's straightforward to make the changes while convincing
> ourselves that the behaviour of the new implementation is consistent with
> the old. Thomas, do comment on the patch itself! That's the very best way
> to make sure your concerns get heard and addressed.

Well really the _best_ way IMO is to both comment and submit a patch. ;-) And this is just what Thomas is doing, so kudos to him for the effort! Vishal is doing this as well for many of the issues he's found, so thanks to him also. We do appreciate you guys jumping in to help. Lack of contributors is one of the things we've been missing, and addressing that opens the door to some of these improvements being suggested.

Wrt the netty patch, the approach Ben and I took was to refactor sufficiently to add support for NIO/Netty/... while minimizing breakage. This is already a big patch, esp given that the code is not really as clean to begin with (complex too). Perfect situation, no. But the intent was to further clean things up once the original patch was reviewed/committed. Trying to do a huge refactoring in one shot (one patch) is not a good idea imo. Already these patches are too large.

Perhaps the lesson learned here is that we should have just created a special branch from the get go, applied a number of smaller patches to that branch, then eventually merged back into the trunk once it was fully baked.

Patrick
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921573#action_12921573 ] Patrick Hunt commented on ZOOKEEPER-804:
---
Thanks Jared, I missed your question on IRC (you had quit before I got back). What you've submitted is fine; in general, though, you should just submit a patch against the current svn (so a diff against code which already has the original patch committed to it). Michi, can you (also Ben/Mahadev) review this and let me know if it's ok to commit? I'll commit it if you guys sign off. Thanks all!
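Patrick's "diff against code which already has the original patch committed" workflow can be sketched with plain diff; the file names and contents below are made up for illustration, not taken from the issue:

```shell
# Sketch: produce a follow-up patch against a tree that already
# contains the first fix, so reviewers can apply it directly.
cd "$(mktemp -d)"

# Baseline as it exists in current svn (original patch already in).
printf 'int process() {\n    return 0;\n}\n' > zookeeper.c.orig

# Working copy with the follow-up change applied.
printf 'int process() {\n    return 1;\n}\n' > zookeeper.c

# A unified diff against the current baseline is the patch to attach.
# (diff exits 1 when the files differ, so tolerate that status.)
diff -u zookeeper.c.orig zookeeper.c > followup.patch || true
grep -c '^@@' followup.patch   # one hunk
```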
[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-804:
---
Status: Patch Available (was: Reopened)
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921583#action_12921583 ] Michi Mutsuzaki commented on ZOOKEEPER-804:
---
The patch looks good, but I can't seem to apply it. Has anybody seen this error?

$ cd branch-3.3
$ patch -p0 < ZOOKEEPER-804.patch
patching file src/c/src/zookeeper.c
Hunk #1 FAILED at 1947.
1 out of 1 hunk FAILED -- saving rejects to file src/c/src/zookeeper.c.rej

--Michi
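The "Hunk #1 FAILED" Michi hits is what patch reports when the target tree has diverged from the tree the diff was taken against (here, a diff made against trunk being applied to branch-3.3). A toy reproduction, with throwaway files standing in for the real sources:

```shell
# patch locates each hunk by its context lines; where the target
# file has diverged from the diff's baseline, the hunk is rejected
# and saved to a *.rej file. All files here are throwaway stand-ins.
cd "$(mktemp -d)"
printf 'a\nb\nc\n' > base.txt
printf 'a\nB\nc\n' > new.txt
diff -u base.txt new.txt > fix.patch || true

# Applies cleanly where the tree matches the diff's baseline...
cp base.txt trunk.txt
patch -s trunk.txt < fix.patch

# ...but fails where the line being changed has itself diverged,
# leaving the rejected hunk in branch.txt.rej.
printf 'a\nX\nc\n' > branch.txt
patch -s branch.txt < fix.patch || echo "hunk rejected"
ls branch.txt.rej
```

Regenerating the diff against a checkout of the branch being patched (or applying it to the branch the diff was made from) avoids the rejects.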
Re: What's the QA strategy of ZooKeeper?
actually, the other way of doing the netty patch (since i'm scared of merges) would be to do a refactor cleanup patch with an eye toward netty, and then another patch to actually add netty. that would have been nice because the first patch would allow us to more easily make sure that NIO wasn't broken, and with the second we could focus more on the netty addition.

ben