[jira] Updated: (ZOOKEEPER-362) Issues with FLENewEpochTest

2009-04-03 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-362:
-

Attachment: ZOOKEEPER-362.patch

This patch fixes the problem in the description. More concretely, it does the 
following:

1- It synchronizes QuorumCnxManager::connectOne so that there are no competing 
connections to the same server;
2- It doesn't remove an existing connection in 
QuorumCnxManager::receiveConnection when winning the challenge;
3- It eliminates the second definition of ss in QuorumCnxManager::Listener. 
This was a pretty silly bug (my fault of course);
4- It adds a deadline to the semaphores in FLENewEpochTest so that the test 
doesn't wait indefinitely;
5- If thread 0 finishes before thread 1, then thread 1 initiates a new round 
after waiting for 1s. This is what happens in a real deployment as a follower 
gives up on its elected leader if the elected leader takes too long to 
acknowledge its leadership. As we don't run the follower/leader part of the 
code in this test, moving to the next round doesn't happen automatically.
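
As an illustration of item 4, a bounded wait on a semaphore might look like 
the sketch below. It is not taken from the patch: the class name DeadlineWait, 
the helper awaitWithDeadline, and the timeout values are hypothetical. It only 
relies on java.util.concurrent.Semaphore.tryAcquire with a timeout.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not code from ZOOKEEPER-362.patch.
final class DeadlineWait {
    /**
     * Waits for one permit for at most timeoutMs milliseconds.
     * Returns true if a permit was acquired, false if the deadline passed.
     */
    static boolean awaitWithDeadline(Semaphore s, long timeoutMs)
            throws InterruptedException {
        return s.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore finished = new Semaphore(0);

        // Nobody releases a permit, so the wait gives up after 100 ms.
        System.out.println(awaitWithDeadline(finished, 100)); // prints "false"

        // Once a permit has been released, the wait succeeds immediately.
        finished.release();
        System.out.println(awaitWithDeadline(finished, 100)); // prints "true"
    }
}
```

A test using this pattern can fail with a clear message when the helper 
returns false, instead of hanging the whole test run.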

 Issues with FLENewEpochTest
 ---

 Key: ZOOKEEPER-362
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: Flavio Paiva Junqueira
 Fix For: 3.2.0

 Attachments: ZOOKEEPER-362.patch


 I have been able to identify two reasons that cause FLENewEpochTest to fail:
 1- There is a race condition that is triggered when two peers try to 
 establish a connection to each other for leader election. Basically, if they 
 start at roughly the same time, the server with the highest id will try to 
 open two connections. The two competing connections cause one notification 
 message to be lost. This message happens to be critical for this two-process 
 scenario; 
 2- The code to shut down a peer is not working well with the unit tests. For 
 this particular unit test, we need to be able to shut down a peer completely 
 to check the situation the test tries to reproduce. However, it seems that in 
 some runs timing causes the other peers to believe it is still alive, and 
 they end up electing it. This peer, however, eventually shuts down and leader 
 election fails.
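
 One way to rule out two competing connections to the same server is to 
 make the connect step mutually exclusive and have it check for an existing 
 connection first. The sketch below is only an illustration of that idea. 
 ConnectionTable, its fields, and the string stand-in for a socket are 
 hypothetical, and this is not the actual QuorumCnxManager code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a connection manager; not QuorumCnxManager itself.
final class ConnectionTable {
    // Maps a server id to its connection; a String stands in for a socket.
    private final Map<Long, String> connections = new HashMap<>();
    private int opened = 0;

    /**
     * Opens a connection to server sid unless one already exists.
     * The method is synchronized, so two threads cannot both pass the
     * existence check and open competing connections to the same server.
     * Returns true only if this call opened the connection.
     */
    synchronized boolean connectOne(long sid) {
        if (connections.containsKey(sid)) {
            return false; // a connection already exists, keep it
        }
        connections.put(sid, "conn-to-" + sid);
        opened++;
        return true;
    }

    /** Number of connections opened so far. */
    synchronized int openedCount() {
        return opened;
    }
}
```

 With this shape, two threads racing on connectOne for the same server id 
 leave exactly one connection in the table.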

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-362) Issues with FLENewEpochTest

2009-04-03 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-362:
-

Attachment: ZOOKEEPER-362.patch

Thanks, Ben. I've fixed the log calls in this new patch.

 Issues with FLENewEpochTest
 ---

 Key: ZOOKEEPER-362
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: Flavio Paiva Junqueira
 Fix For: 3.2.0

 Attachments: ZOOKEEPER-362.patch, ZOOKEEPER-362.patch





[jira] Updated: (ZOOKEEPER-362) Issues with FLENewEpochTest

2009-04-03 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-362:
-

Status: Patch Available  (was: Open)

Re-submitting...

 Issues with FLENewEpochTest
 ---

 Key: ZOOKEEPER-362
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: Flavio Paiva Junqueira
 Fix For: 3.2.0

 Attachments: ZOOKEEPER-362.patch, ZOOKEEPER-362.patch





[jira] Updated: (ZOOKEEPER-362) Issues with FLENewEpochTest

2009-04-03 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-362:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

+1 to the patch. I just committed this. Thanks, Flavio.


 Issues with FLENewEpochTest
 ---

 Key: ZOOKEEPER-362
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.2.0

 Attachments: ZOOKEEPER-362.patch, ZOOKEEPER-362.patch

