[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-27 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-512:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

I just committed this. thanks flavio.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-22 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Status: Patch Available  (was: Open)

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-22 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

Fixing log message and uploading new patch.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-21 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-512:


Status: Open  (was: Patch Available)

cancelling patch for flavio to update the logging.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-08 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Status: Patch Available  (was: Open)

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Fix Version/s: (was: 3.2.1)

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

I've found a corner that I was not expecting but was an easy fix. Basically 
receiveConnection was passing an invalid server identifier to connectOne which 
was throwing a NPE and was propagating to the Listener. The listener would 
consequently die, and would stop participating in leader election. The fix is 
just not to proceed with the connection when that happens.

This patch is working fine for my test. It is important to note, though, that 
if we are overwhelming servers so much (clients are hammering the system and 
connections are failing), then there will be periods in which there will be no 
leader. The important invariant to satisfy is that the system converges to a 
live state once it stabilizes. In my tests, I observe periods with no leader 
when clients are hammering the servers with requests, but they converge to a 
leader soon after the clients stop. Of course, if we have no injected faults, 
the clients requests are executed just fine (there is always a leader). This is 
the behavior I expect to see.

At the same time, although I think it was a good idea to test such an extreme 
case, I'm still not convinced that this test is realistic. It would be great if 
we could model the cases this fault injection is trying to emulate to make sure 
they are really expected cases. 

Also, I don't see a good way of introducing a unit test for such extreme cases. 
In fact, I'm not even sure it would make sense to test only leader election 
under such extreme conditions. 

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
> ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-25 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

New patch fixes two issues:

# Adds finally block to method run of RecvWorker;
# Replaces take() with poll() in SendWorker(). This fixes a race condition that 
can easily manifest when the system goes haywire (see Pat's aspects for 
examples).

This patch should apply to both 3.2 and trunk.



> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-25 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

I realized that the logs that Pat posted do not contain the same messages 
indicating that the processes exited. My understanding is that the finally 
block had some effect and now the test is exposing  a potential new issue. 

Just in case, I'm uploading a simple patch for testing. It compiles fine for me.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Assignee: Flavio Paiva Junqueira
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-24 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

Based on feedback I got on how to handle socket exceptions, I have added a 
finally block that closes the channel in RecvWorker. I haven't been able to 
reproduce the problem reliably yet, but I can't see any issue after applying 
this patch. It would be good if Pat could run it to see if the problem still 
manifests.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Attachment: t5_aj.tar.gz

sample test environment. start.sh is setup to cause failures at .5% probability 
(each net operation)

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Attachment: log3_debug.tar.gz

debug logs for similar situation as previous.

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-20 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Attachment: logs2.tar.gz

take a look at logs2, this is similar fault injection model, however I'm now:

sock.close()
throw IOException

rather than just throwing the ioexception. otw basically the same test as 
before.

Notice that 1 drops off alot earlier than the rest (seems due to it's server id 
being the lowest?)


> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, logs.tar.gz, logs2.tar.gz
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-19 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Attachment: logs.tar.gz
jst.txt

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
> Attachments: jst.txt, logs.tar.gz
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-19 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Component/s: server
 quorum

> FLE election fails to elect leader
> --
>
> Key: ZOOKEEPER-512
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.2.0
>Reporter: Patrick Hunt
>Priority: Blocker
> Fix For: 3.2.1, 3.3.0
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected 
> into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
> - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in 
> particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.