[ 
https://issues.apache.org/jira/browse/CURATOR-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709562#comment-14709562
 ] 

J D commented on CURATOR-233:
-----------------------------

Hi Mike,

Thanks for looking at this issue.

1. Test case above actually does show the bug
2. Pointer to error in source
3. Due to race conditions, the test case may not always work


1. The output from your test is actually showing the problem (the output 
messages from the test are maybe a bit unclear as there are no clear 
fail/success messages)

09:25:05: 1 entered
09:25:05: 1 sleeping
09:25:05: 0 entered
09:25:05: 0 trying to leave
09:25:15: 1 woke up
09:25:15: 1 trying to leave
09:25:15: 0 left
09:25:15: 1 left

Client 0 is supposed to leave the 2nd barrier within 1 second (at 9:25:06), no 
matter if there are enough members or not.
But in reality client 0 only leaves after 10 seconds. Together with client 1 
after client 1 has woken up and joined the 2nd barrier.

Method used:

//From method description: Leave the barrier and block until all members have 
left or the timeout has elapsed
public synchronized boolean leave(long maxWait, TimeUnit unit) 
{
}



2. Pointer to error in source code
The bug is in the following method

private boolean internalLeave(long startMs, boolean hasMaxWait, long maxWaitMs) 
throws Exception
{
...
                if (hasMaxWait)
                {
                    long elapsed = System.currentTimeMillis() - startMs;
                    long thisWaitMs = maxWaitMs - elapsed;
                    if (thisWaitMs <= 0)
                        result = false; //Setting the result to false is not 
sufficient! Without a break statement we are stuck in the loop forever (or 
until enough members have joined the second barrier)
                    else
                        wait(thisWaitMs);
                }
...
}

(Note: Unfortunately, after a quick look at the code when I discovered the 
issue, at that time I had the impression it is not sufficient/correct to simply 
add a break statement to fix the bug. I do not remember the problem with that 
solution in detail but either there were some edge cases not covered or some 
cleanup that needed to be done.)


3. The unit test may not work every time due to race conditions. After your 
post above I made a few more tests to confirm the issue. Surprisingly, there 
was one case where the test failed (i.e. the issue did not occur and the 
program behaved correctly). I suspect the actual outcome depends on the 
ordering of client 0/1's actions. This discovery may make it more difficult to 
create a proper unit test for this issue.


17:37:01: 1 trying to enter
17:37:01: 0 trying to enter
17:37:02: 1 entered
17:37:02: 0 entered
17:37:02: 0 trying to leave
17:37:02: 1 sleeping
17:37:12: 1 woke up
17:37:12: 1 trying to leave
17:37:12: 0 left
17:37:12: 1 left
0 was supposed to leave at 17:37:03 (1 second after entry) but only left 9 
seconds later

17:38:29: 1 trying to enter
17:38:29: 0 trying to enter
17:38:30: 0 entered
17:38:30: 0 trying to leave
17:38:30: 1 entered
17:38:30: 1 sleeping
17:38:31: 0 left
17:38:40: 1 woke up
17:38:40: 1 trying to leave
17:38:40: 1 left
0 left at 17:38:31 (1 second after entry), that is correct!!!

17:41:26: 0 trying to enter
17:41:26: 1 trying to enter
17:41:27: 1 entered
17:41:27: 0 entered
17:41:27: 0 trying to leave
17:41:27: 1 sleeping
17:41:37: 1 woke up
17:41:37: 1 trying to leave
17:41:38: 0 left
17:41:38: 1 left
0 was supposed to leave at 17:41:28 (1 second after entry) but only left 9 
seconds later


17:42:07: 0 trying to enter
17:42:07: 1 trying to enter
17:42:07: 1 entered
17:42:07: 1 sleeping
17:42:07: 0 entered
17:42:07: 0 trying to leave
17:42:17: 1 woke up
17:42:17: 1 trying to leave
17:42:17: 0 left
17:42:17: 1 left
0 was supposed to leave at 17:42:08 (1 second after entry) but only left 9 
seconds later


Again, thanks for looking at this issue and please let me know if you need any 
additional information or support.

Best Regards,

J D

> Bug in double barrier
> ---------------------
>
>                 Key: CURATOR-233
>                 URL: https://issues.apache.org/jira/browse/CURATOR-233
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.8.0
>            Reporter: J D
>             Fix For: awaiting-response
>
>         Attachments: DoubleBarrierClient.java, DoubleBarrierTester.java
>
>
> Hi,
> I think I discovered a bug in the internalLeave method of the double barrier 
> implementation.
> When a client is told to leave the barrier after maxWait it does not do so. A 
> flag is set but the client does not leave the barrier, instead it keeps 
> iterating through the control loop and drives CPU usage to 100%.
> I have attached an example.
> Best regards
> Lianro



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to