[ https://issues.apache.org/jira/browse/CURATOR-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709562#comment-14709562 ]
J D commented on CURATOR-233:
-----------------------------

Hi Mike,

Thanks for looking at this issue.

Summary:
1. The test case above actually does show the bug
2. Pointer to the error in the source
3. Due to race conditions, the test case may not always work

1. The output from your test is actually showing the problem (the output messages from the test are maybe a bit unclear, as there are no explicit fail/success messages).

    09:25:05: 1 entered
    09:25:05: 1 sleeping
    09:25:05: 0 entered
    09:25:05: 0 trying to leave
    09:25:15: 1 woke up
    09:25:15: 1 trying to leave
    09:25:15: 0 left
    09:25:15: 1 left

Client 0 is supposed to leave the 2nd barrier within 1 second (at 09:25:06), whether or not there are enough members. In reality, client 0 only leaves after 10 seconds, together with client 1, once client 1 has woken up and joined the 2nd barrier.

Method used:

    //From method description: Leave the barrier and block until all members have left or the timeout has elapsed
    public synchronized boolean leave(long maxWait, TimeUnit unit) { }

2. Pointer to the error in the source code

The bug is in the following method:

    private boolean internalLeave(long startMs, boolean hasMaxWait, long maxWaitMs) throws Exception
    {
        ...
        if (hasMaxWait) {
            long elapsed = System.currentTimeMillis() - startMs;
            long thisWaitMs = maxWaitMs - elapsed;
            if (thisWaitMs <= 0)
                // Setting the result to false is not sufficient! Without a break
                // statement we are stuck in the loop forever (or until enough
                // members have joined the second barrier)
                result = false;
            else
                wait(thisWaitMs);
        }
        ...
    }

(Note: Unfortunately, when I took a quick look at the code after discovering the issue, I had the impression that simply adding a break statement is not a sufficient/correct fix. I do not remember the problem with that solution in detail, but either some edge cases were not covered or some cleanup needed to be done.)

3. The unit test may not work every time due to race conditions.
After your post above I ran a few more tests to confirm the issue. Surprisingly, there was one case where the test failed (i.e. the issue did not occur and the program behaved correctly). I suspect the actual outcome depends on the ordering of the actions of clients 0 and 1. This may make it more difficult to create a proper unit test for this issue.

    17:37:01: 1 trying to enter
    17:37:01: 0 trying to enter
    17:37:02: 1 entered
    17:37:02: 0 entered
    17:37:02: 0 trying to leave
    17:37:02: 1 sleeping
    17:37:12: 1 woke up
    17:37:12: 1 trying to leave
    17:37:12: 0 left
    17:37:12: 1 left

0 was supposed to leave at 17:37:03 (1 second after entry) but only left 9 seconds later.

    17:38:29: 1 trying to enter
    17:38:29: 0 trying to enter
    17:38:30: 0 entered
    17:38:30: 0 trying to leave
    17:38:30: 1 entered
    17:38:30: 1 sleeping
    17:38:31: 0 left
    17:38:40: 1 woke up
    17:38:40: 1 trying to leave
    17:38:40: 1 left

0 left at 17:38:31 (1 second after entry), that is correct!!!

    17:41:26: 0 trying to enter
    17:41:26: 1 trying to enter
    17:41:27: 1 entered
    17:41:27: 0 entered
    17:41:27: 0 trying to leave
    17:41:27: 1 sleeping
    17:41:37: 1 woke up
    17:41:37: 1 trying to leave
    17:41:38: 0 left
    17:41:38: 1 left

0 was supposed to leave at 17:41:28 (1 second after entry) but only left 10 seconds later.

    17:42:07: 0 trying to enter
    17:42:07: 1 trying to enter
    17:42:07: 1 entered
    17:42:07: 1 sleeping
    17:42:07: 0 entered
    17:42:07: 0 trying to leave
    17:42:17: 1 woke up
    17:42:17: 1 trying to leave
    17:42:17: 0 left
    17:42:17: 1 left

0 was supposed to leave at 17:42:08 (1 second after entry) but only left 9 seconds later.

Again, thanks for looking at this issue, and please let me know if you need any additional information or support.
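To make the expected timeout behaviour from point 2 concrete, here is a minimal sketch of a timed wait loop that returns as soon as the deadline has passed. It is not Curator's code: the class TimedLeaveSketch, the allLeft flag, and markAllLeft() are hypothetical stand-ins for the barrier's real state, and the sketch leaves out any ZooKeeper node cleanup, which (as noted above) a real fix may well need.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the leave loop; NOT the Curator implementation.
final class TimedLeaveSketch {
    // Assumption: stands in for "all members have left the barrier".
    private boolean allLeft = false;

    // Called by another thread once every member has left.
    synchronized void markAllLeft() {
        allLeft = true;
        notifyAll();
    }

    // Returns true if all members left, false if maxWait elapsed first.
    synchronized boolean leave(long maxWait, TimeUnit unit) throws InterruptedException {
        long startMs = System.currentTimeMillis();
        long maxWaitMs = unit.toMillis(maxWait);
        while (!allLeft) {
            long elapsed = System.currentTimeMillis() - startMs;
            long thisWaitMs = maxWaitMs - elapsed;
            if (thisWaitMs <= 0) {
                // Exit the loop on timeout instead of only setting a flag.
                return false;
            }
            // The loop re-checks the condition, so spurious wakeups are harmless.
            wait(thisWaitMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        TimedLeaveSketch barrier = new TimedLeaveSketch();
        long start = System.currentTimeMillis();
        boolean left = barrier.leave(1, TimeUnit.SECONDS);
        long took = System.currentTimeMillis() - start;
        System.out.println("left=" + left + " after about " + took + " ms");
    }
}
```

With nobody calling markAllLeft(), leave(1, TimeUnit.SECONDS) should return false after roughly one second, which is the behaviour client 0 is expected to show in the logs above.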
Best Regards,
J D

> Bug in double barrier
> ---------------------
>
>                 Key: CURATOR-233
>                 URL: https://issues.apache.org/jira/browse/CURATOR-233
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.8.0
>            Reporter: J D
>             Fix For: awaiting-response
>
>         Attachments: DoubleBarrierClient.java, DoubleBarrierTester.java
>
>
> Hi,
> I think I discovered a bug in the internalLeave method of the double barrier
> implementation.
> When a client is told to leave the barrier after maxWait it does not do so. A
> flag is set but the client does not leave the barrier, instead it keeps
> iterating through the control loop and drives CPU usage to 100%.
> I have attached an example.
> Best regards
> Lianro

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)