[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921244#action_12921244
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

{quote}
Hi Alexandre, when you load the machines running the zookeeper servers by 
running the dd command, how much time elapses between running dd and observing 
the connections expiring? I haven't been able to reproduce it, and I wonder how 
long the problem takes to manifest.
{quote}

Hi Flavio,

Problems usually start occurring after about 30 seconds. I have also tested on 
some other machines, and the response varies somewhat. I suspect that the 30 
seconds is dictated by some queue that needs to fill up before significant disk 
traffic is initiated. I think that the speed at which this queue is processed 
determines how likely it is that zookeeper will fail to respond to a ping.

Please also see responses to Patrick's questions.

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921251#action_12921251
 ] 

Alexandre Hardy commented on ZOOKEEPER-885:
---

Hi Patrick,

{quote}
1) you are applying the load using dd to all three servers at the same time, is 
that correct? (not just to 1 server)
{quote}
Correct. If {{dd}} is run on only one machine then the likelihood of 
disconnects is reduced. Unfortunately our typical scenario would involve load 
on all three machines.

{quote}
2) /dev/mapper indicates some sort of lvm setup, can you give more detail on 
that? (fyi http://ubuntuforums.org/showthread.php?t=646340)
{quote}
Yes, we have an lvm setup on a single spindle. The nimbula-test logical volume 
is 10G in size and (obviously) shares the same spindle as the root and log 
(/var/log) partitions. 

{quote}
3) you mentioned that this:

echo 5 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio

resulted in stability in this test; can you tell us what these were set to 
initially?

Checkout this article: http://lwn.net/Articles/216853/
{quote}

The initial value for {{/proc/sys/vm/dirty_ratio}} is 20, and the initial value 
for {{/proc/sys/vm/dirty_background_ratio}} is 10. These machines have 1G of 
RAM, and thus are less susceptible to the problems mentioned in 
http://lwn.net/Articles/216853/ (as I see it). I have run a more complete 
benchmark that uses random IO instead of {{dd}} sequential IO to test session 
timeouts and the effect of the {{dirty_ratio}} settings. I will attach that 
separately. {{dirty_ratio}} seems to help with the {{dd}} test but has much 
less influence in the random IO test.

{quote}
I notice you are running a bigmem kernel. What's the total memory size? How 
large a heap have you assigned to the ZK server (jvm)?
{quote}
We have 1G on each machine in this test system and 100M heap size for each 
zookeeper server.

{quote}
4) Can you verify whether or not the JVM is swapping? Any chance that the 
server JVM is swapping, which is causing the server to pause, which then causes 
the clients to time out? This seems to me like it would fit the scenario - esp 
given that when you turn the dirty_ratio down you see stability increase (the 
time it would take to complete the flush would decrease, meaning that the 
server can respond before the client times out).
{quote}
I'm not entirely sure of all the JVM internals, but all swap space on the linux 
system was disabled, so no swapping by the linux kernel would happen. I'm not 
sure whether the JVM does any swapping of its own.
I concur with your analysis. What puzzles me is why the system would even get 
into a state where the zookeeper server has to wait so long for a disk 
flush. In the case of {{dd if=/dev/urandom}} the IO rate is quite low, and 
there should (I think) be more than enough IOPS available for zookeeper to 
flush data to disk in time. Even if the IO scheduling results in this scenario, 
it is still not clear to me why zookeeper would fail to respond to a ping. My 
only conclusion at this stage is that responding to a ping requires information 
to be flushed to disk. Is this correct?

Referring to your private e-mail:
{quote}
 The weird thing here is that there should be no delay for these pings.
{quote}
This would indicate to me that the ping response should not be dependent on any 
disk IO.

Thanks for all the effort in looking into this!

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. 

[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Alexandre Hardy (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Hardy updated ZOOKEEPER-885:
--

Attachment: benchmark.csv

Here are some results with different settings of {{dirty_ratio}}, 
{{dirty_bytes}} (finer control), session timeouts, and io priorities (set with 
{{ionice}}). 

The status field indicates success or failure: 0 means success, and anything 
else is failure, where failure means a zookeeper session disconnected (but did 
not necessarily expire).

Each test ran for a maximum of 5 minutes. A test could also fail if it failed 
to connect to the zookeeper servers within the first 60 seconds; unfortunately 
the return code did not differentiate properly between these cases. However, 
pauses of about 4 seconds were allowed between tests, during which all IO 
operations (by the test program) were stopped. This should have allowed the 
system to stabilize somewhat.

In this test the session timeout had the most significant influence on 
stability (not too surprising) and the other system settings had far less 
influence.

We have settled on a 60s session timeout for the moment (to achieve the 
stability we need under the IO loads we are experiencing), but it would be 
great if we could reduce this a bit.
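
(Editorial aside, not part of the original comment: the session timeout is 
requested on the client side when the handle is created, and the server 
negotiates the final value within its configured bounds, by default between 
2x and 20x the tickTime. A minimal sketch of a client requesting a 60s 
timeout; the connect string is made up:)
{noformat}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionTimeoutExample {
    public static void main(String[] args) throws Exception {
        // Request a 60s session timeout (milliseconds). The server negotiates
        // the final value within [minSessionTimeout, maxSessionTimeout],
        // which default to 2*tickTime and 20*tickTime.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 60000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        System.out.println("event: " + event);
                    }
                });
        System.out.println("negotiated timeout: "
                + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
{noformat}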



 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-893) ZooKeeper high cpu usage when invalid requests

2010-10-15 Thread Thijs Terlouw (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921286#action_12921286
 ] 

Thijs Terlouw commented on ZOOKEEPER-893:
-

I'll try to write a test case, but I'm not exactly sure yet how to integrate 
it. I need to look at some example unit tests first :)

 ZooKeeper high cpu usage when invalid requests
 --

 Key: ZOOKEEPER-893
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-893
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.1
 Environment: Linux 2.6.16
 4x Intel(R) Xeon(R) CPU X3320  @ 2.50GHz
 java version 1.6.0_17
 Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
 Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
Reporter: Thijs Terlouw
Assignee: Thijs Terlouw
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-893.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 When ZooKeeper receives certain illegally formed messages on the internal 
 communication port (:4181 by default), it's possible for ZooKeeper to enter 
 an infinite loop which causes 100% cpu usage. It's related to ZOOKEEPER-427, 
 but that patch does not resolve all issues.
 from: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java 
 the two affected parts:
 ===
 int length = msgLength.getInt();
 if (length <= 0) {
     throw new IOException("Invalid packet length: " + length);
 }
 ===
 ===
 while (message.hasRemaining()) {
     temp_numbytes = channel.read(message);
     if (temp_numbytes < 0) {
         throw new IOException("Channel eof before end");
     }
     numbytes += temp_numbytes;
 }
 ===
 How to replicate this bug:
 perform an nmap portscan against your zookeeper server: nmap -sV -n 
 your.ip.here -p4181
 Wait a while until you see some messages in the logfile, and then you will 
 see 100% cpu usage. It does not recover from this situation. With my patch, 
 it does not occur anymore.
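
 (Aside, not part of the report: a minimal sketch of one defensive variant of 
 such a read loop, which bounds consecutive zero-byte reads so a stalled 
 channel cannot spin the thread at 100% cpu. The class name and threshold are 
 illustrative, not the actual patch.)
 {noformat}
 import java.io.IOException;
 import java.nio.ByteBuffer;
 import java.nio.channels.ReadableByteChannel;

 final class SafeRead {
     /** Read until the buffer is full; fail on EOF or on a stalled channel. */
     static int readFully(ReadableByteChannel channel, ByteBuffer message)
             throws IOException {
         int numbytes = 0;
         int emptyReads = 0;
         while (message.hasRemaining()) {
             int n = channel.read(message);
             if (n < 0) {
                 throw new IOException("Channel eof before end");
             }
             if (n == 0 && ++emptyReads > 1000) {
                 // A blocking channel should not return 0 repeatedly; give up
                 // instead of looping forever and burning cpu.
                 throw new IOException("Channel stalled before end");
             }
             if (n > 0) {
                 emptyReads = 0;
                 numbytes += n;
             }
         }
         return numbytes;
     }
 }
 {noformat}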

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-881) ZooKeeperServer.loadData loads database twice

2010-10-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921306#action_12921306
 ] 

Hudson commented on ZOOKEEPER-881:
--

Integrated in ZooKeeper-trunk #967 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/967/])
ZOOKEEPER-881. ZooKeeperServer.loadData loads database twice (jared 
cantwell via breed)


 ZooKeeperServer.loadData loads database twice
 -

 Key: ZOOKEEPER-881
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-881
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Trivial
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-881.patch


 zkDb.loadDataBase() is called twice at the beginning of loadData().  It 
 shouldn't have any negative effects, but it is unnecessary.   A patch should 
 be trivial.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Branch hudson job created for ZOOKEEPER-823

2010-10-15 Thread Thomas Koch
Patrick Hunt:
 I created a branch in SVN and a new Hudson job to track the ZOOKEEPER-823
 patch. Getting this one done has been tough, Thomas has been working on it
 but we are seeing intermittent failures.

 lmk if you have any questions.
Hi Patrick,

thank you for the branch. Is there a way I could commit to this branch? 
Or, if somebody else commits on this branch, how should I send patches?
If I can commit, is there a way I could use git to work on this branch? 
I've not yet used git-svn to do commits.

Have a nice weekend,

Thomas Koch, http://www.koch.ro


[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer

2010-10-15 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921340#action_12921340
 ] 

Vishal K commented on ZOOKEEPER-851:


Hi Henry,

Any comments?

Thanks.
-Vishal

 ZK lets any node to become an observer
 --

 Key: ZOOKEEPER-851
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal K
Priority: Critical
 Fix For: 3.4.0


 I had a 3 node cluster running. The zoo.cfg on each contained the 3 server 
 entries shown below:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I 
 created another entry for that node and started the zk server. The zoo.cfg on 
 the first 3 nodes was left unchanged. The fourth node was able to join the 
 cluster even though the 3 nodes had no idea about the fourth node.
 zoo.cfg on fourth node:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 server.3=10.17.117.71:2888:3888
 It looks like 10.17.117.71 is becoming an observer in this case. I was 
 expecting that the leader would reject 10.17.117.71.
 # telnet 10.17.117.71 2181
 Trying 10.17.117.71...
 Connected to 10.17.117.71.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
 Clients:
  /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
 Latency min/avg/max: 0/0/0
 Received: 3
 Sent: 2
 Outstanding: 0
 Zxid: 0x20065
 Mode: follower
 Node count: 288

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Thomas Koch
Hi Benjamin,

thank you for your response. Please find some comments inline.

Benjamin Reed:
   code quality is important, and there are things we should keep in
 mind, but in general i really don't like the idea of risking code
 breakage because of a gratuitous code cleanup. we should be watching out
 for these things when patches get submitted or when new things go in.
I didn't want to say it that clearly, but the new Netty code especially, on 
both the client and server side, is IMHO an example of new code in very bad 
shape. The client code patch even changes the FindBugs configuration to 
exclude the new code from the FindBugs checks.

 i think this is inline with what pat was saying. just to expand a bit.
 in my opinion clean up refactorings have the following problems:
 
 1) you risk breaking things in production for a potential future
 maintenance advantage.
If your code is already in such bad shape that every change includes 
considerable risk of breaking something, then you are already in trouble. With 
every new feature (or bugfix!) you also risk breaking something.
If you don't have the attitude of permanent refactoring to improve the code 
quality, you will inevitably lower the maintainability of your code with every 
new feature. New features will build on the dirty concepts already in the code 
and therefore make it more expensive to ever clean things up.

 2) there is always subjectivity: quality code for one code quality
 zealot is often seen as a bad design by another code quality zealot.
 unless there is an objective reason to do it, don't.
I don't agree. IMHO, the area of subjectivity in code quality is actually very 
small compared to well-established standards of quality metrics and best 
practices. I believe that the list given in the first mail of this thread 
gives examples of rather objective guidelines.

 3) you may cleanup the wrong way. you may restructure to make the
 current code clean and then end up rewriting and refactoring again to
 change the logic.
Yes. Refactoring isn't easy, but it is necessary. Only over time do you better 
understand your domain and find better structures. Over time you introduce 
features that let the code grow, so that it should be split up into smaller 
units that the human brain can still handle.

 i think we can mitigate 1) by only doing it when necessary. as a
 corollary we can mitigate 2) and 3) by only doing refactoring/cleanups
 when motivated by some new change: fix a bug, increased performance, new
 feature, etc.
I agree that refactoring should be carefully planned and done in small steps. 
Therefore I collected each refactoring item for the java client as a small 
separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835, so each 
can be individually discussed, reviewed and tested.

Have a nice weekend after Hadoop World!

Thomas Koch, http://www.koch.ro


[jira] Commented: (ZOOKEEPER-445) Potential bug in leader code

2010-10-15 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921360#action_12921360
 ] 

Vishal K commented on ZOOKEEPER-445:


Is this a bug? If not, we should just close this one.

 Potential bug in leader code
 

 Key: ZOOKEEPER-445
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-445
 Project: Zookeeper
  Issue Type: Bug
  Components: server
 Environment: Linux fortiz-desktop 2.6.27-7-generic #1 SMP Fri Oct 24 
 06:42:44 UTC 2008 i686 GNU/Linux
 java version 1.6.0_10
 Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
 Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing)
Reporter: Manos Kapritsos
Priority: Minor
 Fix For: 3.4.0

   Original Estimate: 0.33h
  Remaining Estimate: 0.33h

 There is a suspicious line in server/quorum/Leader.java:226. It reads
 if (stop) {
     LOG.info("exception while shutting down acceptor: " + e);
     stop = true;
 }
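
 (An editorial guess, not from the issue: as written, the assignment is 
 redundant, since stop is already true inside the block. Presumably the intent 
 was the inverted guard, e.g. the following hypothetical fix, not the 
 committed patch:)
 {noformat}
 // Hypothetical fix: only log and set the flag when a shutdown
 // was not already requested.
 if (!stop) {
     LOG.info("exception while shutting down acceptor: " + e);
     stop = true;
 }
 {noformat}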

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets

2010-10-15 Thread Vishal K (JIRA)
FLE implementation should be improved to use non-blocking sockets
-

 Key: ZOOKEEPER-900
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
 Project: Zookeeper
  Issue Type: Bug
Reporter: Vishal K
Assignee: Flavio Junqueira
Priority: Critical


From earlier email exchanges:
1. Blocking connects and accepts:

a) The first problem is in manager.toSend(). This invokes connectOne(), which 
does a blocking connect. While testing, I changed the code so that connectOne() 
starts a new thread called AsyncConnect. AsyncConnect.run() does a 
socketChannel.connect(). After starting AsyncConnect, connectOne starts a 
timer. connectOne continues with normal operations if the connection is 
established before the timer expires; otherwise, when the timer expires it 
interrupts the AsyncConnect thread and returns. In this way, I can have an 
upper bound on the amount of time we need to wait for connect to succeed. Of 
course, this was a quick fix for my testing. Ideally, we should use a Selector 
to do non-blocking connects/accepts. I am planning to do that later once we at 
least have a quick fix for the problem and consensus from others for the real 
fix (this problem is a big blocker for us). Note that it is OK to do blocking 
IO in the SenderWorker and RecvWorker threads since they do blocking IO to 
their respective peer.

b) The blocking IO problem is not just restricted to connectOne(), but also 
occurs in receiveConnection(). The Listener thread calls receiveConnection() 
for each incoming connection request. receiveConnection does blocking IO to 
get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back 
to the peer that had sent the connection request. All of this is happening 
from the Listener. In short, if a peer fails after initiating a connection, 
the Listener thread won't be able to accept connections from other peers, 
because it would be stuck in read() or connectOne(). Also, the code has an 
inherent cycle. initiateConnection() and receiveConnection() will have to be 
very carefully synchronized; otherwise, we could run into deadlocks. This code 
is going to be difficult to maintain/modify.

Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822
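
(Not from the original report: a minimal sketch of the kind of selector-based, 
time-bounded connect suggested above; the peer address and 5-second timeout 
are made up. Unlike the thread-plus-timer workaround, nothing here can block 
indefinitely:)
{noformat}
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {
    public static void main(String[] args) throws Exception {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        // Hypothetical peer; connect() returns immediately in
        // non-blocking mode.
        boolean connected =
                channel.connect(new InetSocketAddress("10.0.0.1", 3888));
        if (!connected) {
            Selector selector = Selector.open();
            channel.register(selector, SelectionKey.OP_CONNECT);
            // Wait at most 5 seconds for the connect to complete.
            connected = selector.select(5000) > 0 && channel.finishConnect();
            selector.close();
        }
        if (connected) {
            System.out.println("connected");
        } else {
            System.out.println("connect timed out");
            channel.close();
        }
    }
}
{noformat}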

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Vishal K
Hi,

I would like to add my few cents here.

I would suggest staying away from code cleanup unless it is absolutely
necessary.

I would also like to extend this discussion to understand the amount of
testing/QA to be performed before a release. How do we currently qualify a
release?

Recently, we have run into issues in ZK that I believe should have been caught
by some basic testing before the release. I will be honest in saying that,
unfortunately, these bugs have resulted in questions being raised by several
people in our organization about our choice of using ZooKeeper.
Nevertheless, our product group really thinks that ZK is a cool technology,
but we need to focus on making it robust before adding major new features to
it.

I would suggest to:
1. Look at current bugs and see why existing tests did not uncover these bugs,
and improve those tests.
2. Look at places that need more tests and broadcast it to the community.
Follow-up with test development.
3. Have a crisp release QA strategy for each release.
4. Improve API documentation as well as code documentation so that the API
usage is clear and debugging is made easier.

Comments?

Thanks.
-Vishal

On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:

 Hi Benjamin,

 thank you for your response. Please find some comments inline.

 Benjamin Reed:
code quality is important, and there are things we should keep in
  mind, but in general i really don't like the idea of risking code
  breakage because of a gratuitous code cleanup. we should be watching out
  for these things when patches get submitted or when new things go in.
 I didn't want to say it that clear, but especially the new Netty code, both
 on
 client and server side is IMHO an example of new code in very bad shape.
 The
 client code patch even changes the FindBugs configuration to exclude the
 new
 code from the FindBugs checks.

  i think this is inline with what pat was saying. just to expand a bit.
  in my opinion clean up refactorings have the following problems:
 
  1) you risk breaking things in production for a potential future
  maintenance advantage.
 If your code is already in such a bad shape, that every change includes
 considerable risk to break something, then you already are in trouble. With
 every new feature (or bugfix!) you also risk to break something.
 If you don't have the attitude of permanent refactoring to improve the code
 quality, you will inevitably lower the maintainability of your code with
 every
 new feature. New features will build on the dirty concepts already in the
 code
 and therfor make it more expensive to ever clean things up.

  2) there is always subjectivity: quality code for one code quality
  zealot is often seen as a bad design by another code quality zealot.
  unless there is an objective reason to do it, don't.
 I don't agree. IMHO, the area of subjectivism in code quality is actually
 very
 small compared to hard established standards of quality metrics and best
 practices. I believe that my list given in the first mail of this thread
 gives
 examples of rather objective guidelines.

  3) you may cleanup the wrong way. you may restructure to make the
  current code clean and then end up rewriting and refactoring again to
  change the logic.
 Yes. Refactoring isn't easy, but necessary. Only over time you better
 understand your domain and find better structures. Over time you introduce
 features that let code grow so that it should better be split up in smaller
 units that the human brain can still handle.

  i think we can mitigate 1) by only doing it when necessary. as a
  corollary we can mitigate 2) and 3) by only doing refactoring/cleanups
  when motivated by some new change: fix a bug, increased performance, new
  feature, etc.
 I agree that refactoring should be carefully planned and done in small
 steps.
 Therefor I collected each refactoring item for the java client in a small
 separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that
 can
 individually be discussed, reviewed and tested.

 Have a nice weekend after Hadoop World!

 Thomas Koch, http://www.koch.ro



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921412#action_12921412
 ] 

Benjamin Reed commented on ZOOKEEPER-885:
-

we are having problems reproducing this. can you give a bit more detail on the 
machines you are using? what are the cpu and memory sizes? also, what is the 
throughput of dd if=/dev/zero of=/dev/mapper/nimbula-test? is there just one 
disk, where nimbula-test is a partition on that disk and you have another 
partition for the snapshots and logs?

even if you don't have swap space, code pages can be discarded and loaded on 
demand, so that could be a potential problem. what does /proc/meminfo look like?

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Dave Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921459#action_12921459
 ] 

Dave Wright commented on ZOOKEEPER-885:
---

I don't think the cause of this is much of a mystery, as we experienced similar 
problems when we had the zookeeper files on the same filesystem as an IO-heavy 
application that was doing buffered IO. Quite simply, when zookeeper does a 
sync on its own files, it causes the entire filesystem to sync, flushing any 
buffered data from the IO-heavy application and freezing the ZK server process 
long enough for heartbeats to time out. 

When you say moderate IO load, I'm curious what the bottleneck is: the dd 
command will copy data as fast as possible, so if you're only getting 4MB/sec, 
the underlying device must be pretty slow, which would further explain why a 
sync() request would take a while to complete. 

The only fix we've seen is to put the ZK files on their own device, although 
you may be able to fix it with a different partition on the same device. 
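
(An editorial sketch, not from the thread: ZooKeeper's standard dataLogDir 
option supports exactly this separation; /zookeeper-log below is a 
hypothetical mount point for the dedicated device:)
{noformat}
# zoo.cfg sketch: keep the write-ahead transaction log on its own
# device so competing buffered IO cannot stall ZK's fsyncs.
dataDir=/var/zookeeper
dataLogDir=/zookeeper-log
{noformat}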

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921467#action_12921467
 ] 

Flavio Junqueira commented on ZOOKEEPER-885:


I'm not sure it is that simple, Dave. The problem is that pings do not require 
writes to disk, and in the scenario that Alexandre describes, there are only 
pings being processed. Why is the background I/O load affecting ZooKeeper's 
processing? And in particular, why are sessions expiring as a consequence of 
this background I/O load?

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921471#action_12921471
 ] 

Patrick Hunt commented on ZOOKEEPER-885:


re what Dave just mentioned (fsync issues on ext3 volume), this is a good read:
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
ZK is in a similar situation.

The perplexing thing here though is that the client is just doing pings, so 
really there should be no writes/syncs to disk for ZK. That's what I'm stuck on.

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimal load, with a number of clients watching 
 exactly one node, will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documentation states that there should be minimal competing IO 
 load on the zookeeper server, it seems reasonable that moderate IO should not 
 cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Mahadev Konar
Well said Vishal.

I really like the points you put forth!!!

Agree on all the points, but again, all the points you mention require
commitment from folks like you. It's a pretty hard task to test all the
corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a
release. We should definitely work towards a plan. You should go ahead and
create a jira for the QA plan. We should all pitch in with what all should
be tested.

Thanks
mahadev

On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:

 Hi,
 
 I would like to add my few cents here.
 
 I would suggest to stay away from code cleanup unless it is absolutely
 necessary.
 
 I would also like to extend this discussion to understand the amount of
 testing/QA to be performed before a release. How do we currently qualify a
 release?
 
 Recently, we have ran into issues in ZK that I believe should have caught by
 some basic testing before the release. I will be honest in saying that,
 unfortunately, these bugs have resulted in questions being raised by several
 people in our organization about our choice of using ZooKeeper.
 Nevertheless, our product group really thinks that ZK is a cool technology,
 but we need to focus on making it robust before adding major new features to
 it.
 
 I would suggest to:
 1. Look at current bugs and see why existing test did not uncover these bugs
 and improve those tests.
 2. Look at places that need more tests and broadcast it to the community.
 Follow-up with test development.
 3. Have a crisp release QA strategy for each release.
 4. Improve API documentation as well as code documentation so that the API
 usage is clear and debugging is made easier.
 
 Comments?
 
 Thanks.
 -Vishal
 
 On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:
 
 Hi Benjamin,
 
 thank you for your response. Please find some comments inline.
 
 Benjamin Reed:
   code quality is important, and there are things we should keep in
 mind, but in general i really don't like the idea of risking code
 breakage because of a gratuitous code cleanup. we should be watching out
 for these things when patches get submitted or when new things go in.
 I didn't want to say it that clear, but especially the new Netty code, both
 on
 client and server side is IMHO an example of new code in very bad shape.
 The
 client code patch even changes the FindBugs configuration to exclude the
 new
 code from the FindBugs checks.
 
 i think this is inline with what pat was saying. just to expand a bit.
 in my opinion clean up refactorings have the following problems:
 
 1) you risk breaking things in production for a potential future
 maintenance advantage.
 If your code is already in such a bad shape, that every change includes
 considerable risk to break something, then you already are in trouble. With
 every new feature (or bugfix!) you also risk to break something.
 If you don't have the attitude of permanent refactoring to improve the code
 quality, you will inevitably lower the maintainability of your code with
 every
 new feature. New features will build on the dirty concepts already in the
 code
 and therfor make it more expensive to ever clean things up.
 
 2) there is always subjectivity: quality code for one code quality
 zealot is often seen as a bad design by another code quality zealot.
 unless there is an objective reason to do it, don't.
 I don't agree. IMHO, the area of subjectivism in code quality is actually
 very
 small compared to hard established standards of quality metrics and best
 practices. I believe that my list given in the first mail of this thread
 gives
 examples of rather objective guidelines.
 
 3) you may cleanup the wrong way. you may restructure to make the
 current code clean and then end up rewriting and refactoring again to
 change the logic.
 Yes. Refactoring isn't easy, but necessary. Only over time you better
 understand your domain and find better structures. Over time you introduce
 features that let code grow so that it should better be split up in smaller
 units that the human brain can still handle.
 
 i think we can mitigate 1) by only doing it when necessary. as a
 corollary we can mitigate 2) and 3) by only doing refactoring/cleanups
 when motivated by some new change: fix a bug, increased performance, new
 feature, etc.
 I agree that refactoring should be carefully planned and done in small
 steps.
 Therefor I collected each refactoring item for the java client in a small
 separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that
 can
 individually be discussed, reviewed and tested.
 
 Have a nice weekend after Hadoop World!
 
 Thomas Koch, http://www.koch.ro
 
 



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Patrick Hunt
 Recently, we have run into issues in ZK that I believe should have been caught
by some basic testing before the release

Vishal, can you be more specific? Pointing out the specific JIRAs that you
entered would be very valuable. Don't worry about hurting our feelings or
anything; without this type of feedback we can't address the specific issues
and their underlying problems.

Regards,

Patrick

On Fri, Oct 15, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com wrote:

 Well said Vishal.

 I really like the points you put forth!!!

 Agree on all the points, but again, all the point you mention require
 commitment from folks like you. Its a pretty hard task to test all the
 corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a
 release. We should definitely work towards a plan. You should go ahead and
 create a jira for the QA plan. We should all pitch in with what all should
 be tested.

 Thanks
 mahadev

 On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:

  Hi,
 
  I would like to add my few cents here.
 
  I would suggest to stay away from code cleanup unless it is absolutely
  necessary.
 
  I would also like to extend this discussion to understand the amount of
  testing/QA to be performed before a release. How do we currently qualify
 a
  release?
 
  Recently, we have ran into issues in ZK that I believe should have caught
 by
  some basic testing before the release. I will be honest in saying that,
  unfortunately, these bugs have resulted in questions being raised by
 several
  people in our organization about our choice of using ZooKeeper.
  Nevertheless, our product group really thinks that ZK is a cool
 technology,
  but we need to focus on making it robust before adding major new features
 to
  it.
 
  I would suggest to:
  1. Look at current bugs and see why existing test did not uncover these
 bugs
  and improve those tests.
  2. Look at places that need more tests and broadcast it to the community.
  Follow-up with test development.
  3. Have a crisp release QA strategy for each release.
  4. Improve API documentation as well as code documentation so that the
 API
  usage is clear and debugging is made easier.
 
  Comments?
 
  Thanks.
  -Vishal
 
  On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:
 
  Hi Benjamin,
 
  thank you for your response. Please find some comments inline.
 
  Benjamin Reed:
code quality is important, and there are things we should keep in
  mind, but in general i really don't like the idea of risking code
  breakage because of a gratuitous code cleanup. we should be watching
 out
  for these things when patches get submitted or when new things go in.
  I didn't want to say it that clear, but especially the new Netty code,
 both
  on
  client and server side is IMHO an example of new code in very bad shape.
  The
  client code patch even changes the FindBugs configuration to exclude the
  new
  code from the FindBugs checks.
 
  i think this is inline with what pat was saying. just to expand a bit.
  in my opinion clean up refactorings have the following problems:
 
  1) you risk breaking things in production for a potential future
  maintenance advantage.
  If your code is already in such a bad shape, that every change includes
  considerable risk to break something, then you already are in trouble.
 With
  every new feature (or bugfix!) you also risk to break something.
  If you don't have the attitude of permanent refactoring to improve the
 code
  quality, you will inevitably lower the maintainability of your code with
  every
  new feature. New features will build on the dirty concepts already in
 the
  code
  and therfor make it more expensive to ever clean things up.
 
  2) there is always subjectivity: quality code for one code quality
  zealot is often seen as a bad design by another code quality zealot.
  unless there is an objective reason to do it, don't.
  I don't agree. IMHO, the area of subjectivism in code quality is
 actually
  very
  small compared to hard established standards of quality metrics and best
  practices. I believe that my list given in the first mail of this thread
  gives
  examples of rather objective guidelines.
 
  3) you may cleanup the wrong way. you may restructure to make the
  current code clean and then end up rewriting and refactoring again to
  change the logic.
  Yes. Refactoring isn't easy, but necessary. Only over time you better
  understand your domain and find better structures. Over time you
 introduce
  features that let code grow so that it should better be split up in
 smaller
  units that the human brain can still handle.
 
  i think we can mitigate 1) by only doing it when necessary. as a
  corollary we can mitigate 2) and 3) by only doing refactoring/cleanups
  when motivated by some new change: fix a bug, increased performance,
 new
  feature, etc.
  I agree that refactoring should be carefully planned and done in small
  steps.
  Therefor I collected each refactoring 

Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Benjamin Reed

 i think we have a very different perspective on the quality issue:



I didn't want to say it that clear, but especially the new Netty code, both on 
client and server side is IMHO an example of new code in very bad shape. The
client code patch even changes the FindBugs configuration to exclude the new
code from the FindBugs checks.

great. fixing the code and refactoring before a patch goes in is the 
perfect time to do it! please give feedback and help make the patch 
better. there is a reason to exclude checks (which is why there are such 
excludes), but if we can avoid them we should. before a patch is applied 
is exactly the time to do cleanup.

If your code is already in such a bad shape, that every change includes
considerable risk to break something, then you already are in trouble. With
every new feature (or bugfix!) you also risk to break something.
If you don't have the attitude of permanent refactoring to improve the code
quality, you will inevitably lower the maintainability of your code with every
new feature. New features will build on the dirty concepts already in the code
and therfor make it more expensive to ever clean things up.

cleaning up code to add a new feature is a great time to clean up the code.

Yes. Refactoring isn't easy, but necessary. Only over time you better
understand your domain and find better structures. Over time you introduce
features that let code grow so that it should better be split up in smaller
units that the human brain can still handle.

it is the "but necessary" that i disagree with. there is plenty of code 
that could be cleaned up and made to look a lot nicer, but we shouldn't 
touch it unless we are fixing something else or adding a new feature. 
it's pretty lame to explain to someone that a bug introduced by a code 
change was motivated by a desire to make the code cleaner. any code 
change runs the risk of breakage, thus changing code simply for 
cleanliness is not worth the risk.


ben


[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-15 Thread Jared Cantwell (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jared Cantwell updated ZOOKEEPER-804:
-

Attachment: ZOOKEEPER-804-1.patch

Not sure what the protocol is here, but I'm gonna go ahead and attach a revised 
patch.

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Vishal K
Hi Mahadev,

Yes, this will require effort from the community. On a related note, how many
of the contributors do you think are dedicated (either partially or fully) to
ZK development/testing? This will help in understanding how much effort will
be required for tasks and how we can prioritize them based on resources.

I will create a JIRA for this with some initial thoughts. However, the list
would have to come from folks who are currently familiar with parts of ZK
code.

Thanks.
-Vishal

On Fri, Oct 15, 2010 at 2:14 PM, Mahadev Konar maha...@yahoo-inc.com wrote:

 Well said Vishal.

 I really like the points you put forth!!!

 Agree on all the points, but again, all the point you mention require
 commitment from folks like you. Its a pretty hard task to test all the
 corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a
 release. We should definitely work towards a plan. You should go ahead and
 create a jira for the QA plan. We should all pitch in with what all should
 be tested.

 Thanks
 mahadev

 On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:

  Hi,
 
  I would like to add my few cents here.
 
  I would suggest to stay away from code cleanup unless it is absolutely
  necessary.
 
  I would also like to extend this discussion to understand the amount of
  testing/QA to be performed before a release. How do we currently qualify
 a
  release?
 
  Recently, we have ran into issues in ZK that I believe should have caught
 by
  some basic testing before the release. I will be honest in saying that,
  unfortunately, these bugs have resulted in questions being raised by
 several
  people in our organization about our choice of using ZooKeeper.
  Nevertheless, our product group really thinks that ZK is a cool
 technology,
  but we need to focus on making it robust before adding major new features
 to
  it.
 
  I would suggest to:
  1. Look at current bugs and see why existing test did not uncover these
 bugs
  and improve those tests.
  2. Look at places that need more tests and broadcast it to the community.
  Follow-up with test development.
  3. Have a crisp release QA strategy for each release.
  4. Improve API documentation as well as code documentation so that the
 API
  usage is clear and debugging is made easier.
 
  Comments?
 
  Thanks.
  -Vishal
 
  On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:
 
  Hi Benjamin,
 
  thank you for your response. Please find some comments inline.
 
  Benjamin Reed:
code quality is important, and there are things we should keep in
  mind, but in general i really don't like the idea of risking code
  breakage because of a gratuitous code cleanup. we should be watching
 out
  for these things when patches get submitted or when new things go in.
  I didn't want to say it that clear, but especially the new Netty code,
 both
  on
  client and server side is IMHO an example of new code in very bad shape.
  The
  client code patch even changes the FindBugs configuration to exclude the
  new
  code from the FindBugs checks.
 
  i think this is inline with what pat was saying. just to expand a bit.
  in my opinion clean up refactorings have the following problems:
 
  1) you risk breaking things in production for a potential future
  maintenance advantage.
  If your code is already in such a bad shape, that every change includes
  considerable risk to break something, then you already are in trouble.
 With
  every new feature (or bugfix!) you also risk to break something.
  If you don't have the attitude of permanent refactoring to improve the
 code
  quality, you will inevitably lower the maintainability of your code with
  every
  new feature. New features will build on the dirty concepts already in
 the
  code
  and therfor make it more expensive to ever clean things up.
 
  2) there is always subjectivity: quality code for one code quality
  zealot is often seen as a bad design by another code quality zealot.
  unless there is an objective reason to do it, don't.
  I don't agree. IMHO, the area of subjectivism in code quality is
 actually
  very
  small compared to hard established standards of quality metrics and best
  practices. I believe that my list given in the first mail of this thread
  gives
  examples of rather objective guidelines.
 
  3) you may cleanup the wrong way. you may restructure to make the
  current code clean and then end up rewriting and refactoring again to
  change the logic.
  Yes. Refactoring isn't easy, but necessary. Only over time you better
  understand your domain and find better structures. Over time you
 introduce
  features that let code grow so that it should better be split up in
 smaller
  units that the human brain can still handle.
 
  i think we can mitigate 1) by only doing it when necessary. as a
  corollary we can mitigate 2) and 3) by only doing refactoring/cleanups
  when motivated by some new change: fix a bug, increased performance,
 new
  feature, etc.
  I agree that 

Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Vishal K
Hi Patrick,

On Fri, Oct 15, 2010 at 2:22 PM, Patrick Hunt ph...@apache.org wrote:

  Recently, we have ran into issues in ZK that I believe should have caught
 by some basic testing before the release

 Vishal, can you be more specific, point out specific JIRAs that you entered
 would be very valuable. Don't worry about hurting our feelings or anything,
 without this type of feedback we can't address the specific issues and
 their
 underlying problems.


Here's a list of a few issues:
Leader election taking a long time to complete -
https://issues.apache.org/jira/browse/ZOOKEEPER-822
Last processed zxid set prematurely while establishing leadership -
https://issues.apache.org/jira/browse/ZOOKEEPER-790
FLE implementation should be improved to use non-blocking sockets -
https://issues.apache.org/jira/browse/ZOOKEEPER-900
ZK lets any node to become an observer -
https://issues.apache.org/jira/browse/ZOOKEEPER-851


 Regards,

 Patrick

 On Fri, Oct 15, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com
 wrote:

  Well said Vishal.
 
  I really like the points you put forth!!!
 
  Agree on all the points, but again, all the points you mention require
  commitment from folks like you. It's a pretty hard task to test all the
  corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a
  release. We should definitely work towards a plan. You should go ahead and
  create a jira for the QA plan. We should all pitch in on what should
  be tested.
 
  Thanks
  mahadev
 
  On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:
 
   Hi,
  
   I would like to add my few cents here.
  
    I would suggest staying away from code cleanup unless it is absolutely
   necessary.
  
   I would also like to extend this discussion to understand the amount of
   testing/QA to be performed before a release. How do we currently
 qualify
  a
   release?
  
    Recently, we have run into issues in ZK that I believe should have been
    caught by some basic testing before the release. I will be honest in
    saying that, unfortunately, these bugs have resulted in questions being
    raised by several people in our organization about our choice of using
    ZooKeeper. Nevertheless, our product group really thinks that ZK is a
    cool technology, but we need to focus on making it robust before adding
    major new features to it.
  
    I would suggest that we:
    1. Look at current bugs, see why existing tests did not uncover these
    bugs, and improve those tests.
    2. Look at places that need more tests and broadcast them to the
    community. Follow up with test development.
    3. Have a crisp release QA strategy for each release.
    4. Improve API documentation as well as code documentation so that the
    API usage is clear and debugging is made easier.
  
   Comments?
  
   Thanks.
   -Vishal
  
   On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:
  
   Hi Benjamin,
  
   thank you for your response. Please find some comments inline.
  
   Benjamin Reed:
 code quality is important, and there are things we should keep in
   mind, but in general i really don't like the idea of risking code
   breakage because of a gratuitous code cleanup. we should be watching
  out
   for these things when patches get submitted or when new things go in.
    I didn't want to say it that clearly, but especially the new Netty code,
    on both the client and server side, is IMHO an example of new code in
    very bad shape. The client code patch even changes the FindBugs
    configuration to exclude the new code from the FindBugs checks.

    i think this is in line with what pat was saying. just to expand a bit.
    in my opinion clean up refactorings have the following problems:
  
   1) you risk breaking things in production for a potential future
   maintenance advantage.
    If your code is already in such bad shape that every change includes
    considerable risk of breaking something, then you are already in trouble.
    With every new feature (or bugfix!) you also risk breaking something.
    If you don't have an attitude of permanent refactoring to improve the
    code quality, you will inevitably lower the maintainability of your code
    with every new feature. New features will build on the dirty concepts
    already in the code and therefore make it more expensive to ever clean
    things up.
  
   2) there is always subjectivity: quality code for one code quality
   zealot is often seen as a bad design by another code quality zealot.
   unless there is an objective reason to do it, don't.
    I don't agree. IMHO, the area of subjectivity in code quality is actually
    very small compared to well-established quality metrics and best
    practices. I believe that the list I gave in the first mail of this
    thread provides examples of rather objective guidelines.
  
    3) you may clean up the wrong way. you may restructure to make the
   current code clean and then end up rewriting and refactoring again to
   change 

Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Henry Robinson
I broadly agree with Ben: all meaningful code changes carry a risk of
destabilization (otherwise software development would be very easy), so we
should guard against improving cleanliness only for its own sake. At the
point where bad code gets in the way of fixing bugs or adding features, I
think it's very worthwhile to 'lazily' clean code.

I did this with the observers patch - reworked some of the class hierarchies
to improve encapsulation and make it easier to add new implementations.

The netty patch is a good test case for this approach. If we feel that
reworking the structure of the existing server cnxn code will make it
significantly easier to add a second implementation that adheres to the same
interface, then I say that such a refactoring is worthwhile, but even then
only if it's straightforward to make the changes while convincing ourselves
that the behaviour of the new implementation is consistent with the old.

Thomas, do comment on the patch itself! That's the very best way to make
sure your concerns get heard and addressed.

cheers,
Henry

On 15 October 2010 11:37, Benjamin Reed br...@yahoo-inc.com wrote:

  i think we have a very different perspective on the quality issue:



  I didn't want to say it that clearly, but especially the new Netty code,
 on both the client and server side, is IMHO an example of new code in very
 bad shape. The client code patch even changes the FindBugs configuration to
 exclude the new code from the FindBugs checks.

  great. fixing the code and refactoring before a patch goes in is the
 perfect time to do it! please give feedback and help make the patch better.
 there can be a reason to exclude checks (which is why such excludes exist),
 but if we can avoid them we should. before a patch is applied is exactly the
 time to do cleanup.

  If your code is already in such bad shape that every change includes
 considerable risk of breaking something, then you are already in trouble.
 With every new feature (or bugfix!) you also risk breaking something.
 If you don't have an attitude of permanent refactoring to improve the code
 quality, you will inevitably lower the maintainability of your code with
 every new feature. New features will build on the dirty concepts already in
 the code and therefore make it more expensive to ever clean things up.

 adding a new feature is a great time to clean up the code.

  Yes. Refactoring isn't easy, but it is necessary. Only over time do you
 better understand your domain and find better structures. Over time you
 introduce features that make the code grow, so that it is better split into
 smaller units that the human brain can still handle.

  it is the "but necessary" that i disagree with. there is plenty of code
 that could be cleaned up and made to look a lot nicer, but we shouldn't
 touch it unless we are fixing something else or adding a new feature. it's
 pretty lame to explain to someone that a bug introduced by a code change
 was motivated by a desire to make the code cleaner. any code change runs
 the risk of breakage, thus changing code simply for cleanliness is not
 worth the risk.

 ben




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Dave Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921528#action_12921528
 ] 

Dave Wright commented on ZOOKEEPER-885:
---

Has it been verified that ZK is doing no disk activity at all during that time? 
What about log file writes? What about sessions being established/torn down 
(which would cause syncs)?

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimum load, with a number of clients watching
 exactly one node, will fail to maintain the connection when the machine is
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the
 connection instability we have experienced. Clients performed no other read
 or mutation operations.
 Although the documentation states that only minimal competing IO load should
 be present on the zookeeper server, it seems reasonable that moderate IO
 should not cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-15 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921557#action_12921557
 ] 

Flavio Junqueira commented on ZOOKEEPER-885:


I've been running it, and there is no traffic to the disk while the clients are 
watching. We generate a snapshot every snapCount transactions, but given that 
no transactions are being generated, nothing is appended to the log and no new 
snapshot is written.
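
For anyone who wants to double-check this on their own setup, here is a rough 
sketch of how one could verify that the server is quiescent; the tools and 
paths below are illustrative choices of mine (the dataDir in particular is 
just an example, not the reporter's configuration):

{noformat}
# 1. Per-device write throughput, refreshed every second, while dd runs:
iostat -dx 1

# 2. Watch the ZooKeeper data directory for any file activity
#    (assumes dataDir=/var/lib/zookeeper in zoo.cfg -- adjust as needed):
inotifywait -m -r /var/lib/zookeeper

# 3. Or trace write/fsync system calls of the server JVM directly:
strace -f -e trace=write,fsync -p $(pgrep -f QuorumPeerMain)
{noformat}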

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimum load, with a number of clients watching
 exactly one node, will fail to maintain the connection when the machine is
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running; the machines were set up to test the
 connection instability we have experienced. Clients performed no other read
 or mutation operations.
 Although the documentation states that only minimal competing IO load should
 be present on the zookeeper server, it seems reasonable that moderate IO
 should not cause problems in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-15 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921558#action_12921558
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-804:
---

Jared, thank you for volunteering to fix this. I haven't had a chance to take 
care of it.

--Michi

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Patrick Hunt
On Fri, Oct 15, 2010 at 12:11 PM, Henry Robinson he...@cloudera.com wrote:

 The netty patch is a good test case for this approach. If we feel that
 reworking the structure of the existing server cnxn code will make it
 significantly easier to add a second implementation that adheres to the
 same
 interface, then I say that such a refactoring is worthwhile, but even then
 only if it's straightforward to make the changes while convincing ourselves
 that the behaviour of the new implementation is consistent with the old.

 Thomas, do comment on the patch itself! That's the very best way to make
 sure your concerns get heard and addressed.


Well, really the _best_ way IMO is to both comment and submit a patch. ;-)

And this is just what Thomas is doing, so kudos to him for the effort!
Vishal is doing this as well for many of the issues he's found, so thanks to
him also. We do appreciate you guys jumping in to help. A lack of contributors
is one of the things that has been holding us back, and addressing it opens
the door to some of the improvements being suggested.

Wrt the netty patch, the approach Ben and I took was to refactor
sufficiently to add support for NIO/Netty/... while minimizing breakage.
This is already a big patch, esp given that the code is not really clean
to begin with (and complex, too). A perfect situation? No. But the intent was
to further clean things up once the original patch was reviewed/committed.
Trying to do a huge refactoring in one shot (one patch) is not a good idea
imo; these patches are already too large. Perhaps the lesson learned here is
that we should have created a special branch from the get-go, applied a
number of smaller patches to that branch, and then eventually merged back
into the trunk once it was fully baked.
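
Purely to illustrate that workflow (hypothetical commands, with the repository
URL abbreviated as $REPO rather than the real path):

REPO=https://svn.apache.org/repos/asf/zookeeper    # illustrative placeholder
# create the feature branch up front:
svn copy $REPO/trunk $REPO/branches/netty-refactor \
    -m "feature branch for the netty work"
svn checkout $REPO/branches/netty-refactor netty-refactor
cd netty-refactor
# ...commit a series of small, separately reviewed patches here...
# once the branch is fully baked, reintegrate it from a trunk working copy:
cd ../trunk
svn merge --reintegrate $REPO/branches/netty-refactor
svn commit -m "merge the netty-refactor branch back into trunk"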

Patrick


[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-15 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921573#action_12921573
 ] 

Patrick Hunt commented on ZOOKEEPER-804:


Thanks Jared, I missed your question on IRC (you had quit before I got back). 
What you've submitted is fine; in general, though, you should just submit a 
patch against the current svn (i.e., a diff against code that already has the 
original patch committed to it).

Michi, can you (also ben/mahadev) review this and let me know if it's ok to 
commit? I'll commit it if you guys sign off. Thanks all!
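
(To spell that out with a hypothetical sequence of commands -- $REPO stands in 
for the repository URL, which I've abbreviated:)

{noformat}
# check out the current code, which already has the original patch committed:
svn checkout $REPO/trunk zk-trunk
cd zk-trunk
# make the follow-up changes, then generate the diff to attach to the jira:
svn diff > ZOOKEEPER-804-1.patch
{noformat}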


 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-15 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-804:
---

Status: Patch Available  (was: Reopened)

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-15 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921583#action_12921583
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-804:
---

The patch looks good, but I can't seem to apply it. Has anybody seen this error?

$ cd branch-3.3
$ patch -p0 < ZOOKEEPER-804.patch
patching file src/c/src/zookeeper.c
Hunk #1 FAILED at 1947.
1 out of 1 hunk FAILED -- saving rejects to file src/c/src/zookeeper.c.rej
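
(A hypothetical way to narrow this down, using the same paths as above:)

{noformat}
# a dry run reports failing hunks without touching the working copy:
patch -p0 --dry-run < ZOOKEEPER-804.patch
# inspect the rejected hunk; a failed hunk usually means the patch was
# generated against a different revision (e.g. trunk rather than branch-3.3):
cat src/c/src/zookeeper.c.rej
{noformat}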

--Michi

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : 
 elapsed 25687 : OK
  [exec] zktest-mt: 
 /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: 
 zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev can you take a look?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: What's the QA strategy of ZooKeeper?

2010-10-15 Thread Benjamin Reed
 actually, the other way of doing the netty patch (since i'm scared of 
merges) would be to do a refactoring/cleanup patch with an eye toward 
netty, and then another patch to actually add netty. that would have 
been nice because the first patch would allow us to more easily make 
sure that NIO wasn't broken, and with the second we could focus on the 
netty addition.


ben

On 10/15/2010 03:07 PM, Patrick Hunt wrote:

On Fri, Oct 15, 2010 at 12:11 PM, Henry Robinson he...@cloudera.com wrote:


The netty patch is a good test case for this approach. If we feel that
reworking the structure of the existing server cnxn code will make it
significantly easier to add a second implementation that adheres to the
same
interface, then I say that such a refactoring is worthwhile, but even then
only if it's straightforward to make the changes while convincing ourselves
that the behaviour of the new implementation is consistent with the old.

Thomas, do comment on the patch itself! That's the very best way to make
sure your concerns get heard and addressed.


Well, really the _best_ way IMO is to both comment and submit a patch. ;-)

And this is just what Thomas is doing, so kudos to him for the effort!
Vishal is doing this as well for many of the issues he's found, so thanks to
him also. We do appreciate you guys jumping in to help. A lack of contributors
is one of the things that has been holding us back, and addressing it opens
the door to some of the improvements being suggested.

Wrt the netty patch, the approach Ben and I took was to refactor
sufficiently to add support for NIO/Netty/... while minimizing breakage.
This is already a big patch, esp given that the code is not really clean
to begin with (and complex, too). A perfect situation? No. But the intent was
to further clean things up once the original patch was reviewed/committed.
Trying to do a huge refactoring in one shot (one patch) is not a good idea
imo; these patches are already too large. Perhaps the lesson learned here is
that we should have created a special branch from the get-go, applied a
number of smaller patches to that branch, and then eventually merged back
into the trunk once it was fully baked.


Patrick