[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections

2010-09-02 Thread Thomas Koch (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Koch updated ZOOKEEPER-823:
--

Attachment: ZOOKEEPER-823.patch

changes:

- call ClientCnxn.cleanup() from ClientCnxnSocketNIO.cleanup(), was lost during 
the refactoring
- cleaned the formatting changes to make the patch smaller

Now there are only three failures left:
NettyNettySuiteTest - ACLTest.testAcls
KeeperErrorCode = ConnectionLoss for /0
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /0
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:640)
at org.apache.zookeeper.test.ACLTest.testAcls(ACLTest.java:104)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:51)

When I run the whole suite in Eclipse as a JUnit test, it does not fail.

NettyNettySuiteHammerTest - The log doesn't tell me anything, I assume it's 
just the same as in NettyNettySuiteTest

NioNettySuiteTest - ClientTest.testClientCleanup
open fds after test are not significantly higher than before
junit.framework.AssertionFailedError: open fds after test are not significantly 
higher than before
at 
org.apache.zookeeper.test.ClientTest.testClientCleanup(ClientTest.java:731)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:51)

When I run the whole suite in Eclipse, the test still fails; however, when I run 
only ClientTest.testClientCleanup on its own, it no longer fails.

I would really appreciate it if you could help me from here on. I have 
double-checked, and in parts triple-checked, the refactoring.

 update ZooKeeper java client to optionally use Netty for connections
 

 Key: ZOOKEEPER-823
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823
 Project: Zookeeper
  Issue Type: New Feature
  Components: java client
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, 
 ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch


 This jira will port the client side connection code to use netty rather than 
 direct nio.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections

2010-09-02 Thread Thomas Koch (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Koch updated ZOOKEEPER-823:
--

Status: Patch Available  (was: Open)

 update ZooKeeper java client to optionally use Netty for connections
 

 Key: ZOOKEEPER-823
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823
 Project: Zookeeper
  Issue Type: New Feature
  Components: java client
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, 
 ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch


 This jira will port the client side connection code to use netty rather than 
 direct nio.




Build failed in Hudson: ZooKeeper-trunk #922

2010-09-02 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/922/

--
[...truncated 169648 lines...]
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:50594
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioserverc...@791] - Processing 
stat command from /127.0.0.1:50594
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO  
[Thread-295:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO  
[Thread-295:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:50594 (no session established for client)
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO  [main:quorumb...@195] - 
127.0.0.1:11237 is accepting client connections
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11238
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:45703
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioserverc...@791] - Processing 
stat command from /127.0.0.1:45703
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO  
[Thread-296:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:33,416 [myid:] - INFO  
[Thread-296:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:45703 (no session established for client)
[junit] 2010-09-02 10:53:33,416 [myid:] - INFO  [main:quorumb...@195] - 
127.0.0.1:11238 is accepting client connections
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:57052
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:57052
[junit] 2010-09-02 10:53:33,418 [myid:] - INFO  
[Thread-297:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:57052 (no session established for client)
[junit] 2010-09-02 10:53:33,668 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:57053
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:57053
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO  
[Thread-298:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:57053 (no session established for client)
[junit] 2010-09-02 10:53:33,919 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:57054
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:57054
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO  
[Thread-299:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:57054 (no session established for client)
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:57055
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:57055
[junit] 2010-09-02 10:53:34,172 [myid:] - INFO  
[Thread-300:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:57055 (no session established for client)
[junit] 2010-09-02 10:53:34,422 [myid:] - INFO  [main:clientb...@225] - 
connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:34,422 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - 
Accepted socket connection from /127.0.0.1:57056
[junit] 2010-09-02 10:53:34,423 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing 
stat command from /127.0.0.1:57056
[junit] 2010-09-02 10:53:34,423 [myid:] - INFO  
[Thread-301:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:34,424 [myid:] - INFO  
[Thread-301:nioserverc...@967] - Closed socket connection for client 
/127.0.0.1:57056 (no session established for client)

[jira] Created: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-02 Thread Alex Baranau (JIRA)
Add alternative search-provider to ZK site
--

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Alex Baranau
Priority: Minor


Use search-hadoop.com service to make available search in ZK sources, MLs, 
wiki, etc.
This was initially proposed on user mailing list. The search service was 
already added in site's skin (common for all Hadoop related projects) before so 
this issue is about enabling it for ZK. The ultimate goal is to use it at all 
Hadoop's sub-projects' sites.




[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-02 Thread Alex Baranau (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Baranau updated ZOOKEEPER-860:
---

Attachment: ZOOKEEPER-860.patch

Attached patch which enables search-hadoop search service for site

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
Reporter: Alex Baranau
Priority: Minor
 Attachments: ZOOKEEPER-860.patch


 Use search-hadoop.com service to make available search in ZK sources, MLs, 
 wiki, etc.
 This was initially proposed on user mailing list. The search service was 
 already added in site's skin (common for all Hadoop related projects) before 
 so this issue is about enabling it for ZK. The ultimate goal is to use it at 
 all Hadoop's sub-projects' sites.




About symbol table of Zookeeper c client

2010-09-02 Thread Qian Ye
Hi all:

I'm writing an application in C which needs to link both memcached's lib and
zookeeper's c client lib. I found a symbol table conflict, because both libs
provide an implementation (recordio.h/c) of the function htonll. It seems that
some functions of the zookeeper c client, which can be accessed externally but
are used internally, have simple names. I think this will cause symbol table
conflicts from time to time, and we should do something about it, e.g. add a
specific prefix to these functions.

thx

-- 
With Regards!

Ye, Qian


[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-02 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905528#action_12905528
 ] 

Vishal K commented on ZOOKEEPER-822:


Hi Flavio,

I was planning to send out a mail explaining the problems in the FLE 
implementation that I have found so far. For now, I will put the info here. We 
can create new JIRAs if needed. I am waiting to hear back from our legal 
department to resolve copyright issues so that I can share my fixes as well.

1. Blocking connects and accepts:
You are right, when the node is down TCP timeouts rule.

a) The first problem is in manager.toSend(). This invokes connectOne(), which 
does a blocking connect. While testing, I changed the code so that connectOne() 
starts a new thread called AsyncConnect. AsyncConnect.run() does a 
socketChannel.connect(). After starting AsyncConnect, connectOne starts a 
timer. connectOne continues with normal operations if the connection is 
established before the timer expires, otherwise, when the timer expires it 
interrupts AsyncConnect() thread and returns. In this way, I can have an upper 
bound on the amount of time we need to wait for connect to succeed. Of course, 
this was a quick fix for my testing. Ideally, we should use Selector to do 
non-blocking connects/accepts. I am planning to do that later once we at least 
have a quick fix for the problem and consensus from others for the real fix 
(this problem is a big blocker for us). Note that it is OK to do blocking IO in 
SenderWorker and RecvWorker threads since they only block IO to their 
respective peer.

b) The blocking IO problem is not just restricted to connectOne(), but also in 
receiveConnection(). The Listener thread calls receiveConnection() for each 
incoming connection request. receiveConnection does blocking IO to get peer's 
info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that 
had sent the connection request. All of this is happening from the Listener. In 
short, if a peer fails after initiating a connection, the Listener thread won't 
be able to accept connections from other peers, because it would be stuck in 
read() or connectOne(). Also the code has an inherent cycle. 
initiateConnection() and receiveConnection() will have to be very carefully 
synchronized; otherwise, we could run into deadlocks. This code is going to be 
difficult to maintain/modify.
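
The quick fix described in 1 a) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: the class name BoundedConnect, the connectOne(addr, timeoutMs) signature, and the use of Thread.join() as the timer are assumptions made for the example. Only the AsyncConnect name and the interrupt-on-expiry idea come from the description above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

/** Hypothetical helper (not ZooKeeper code): puts an upper bound on a blocking connect. */
class BoundedConnect {

    /** Worker thread that performs the blocking connect. */
    static final class AsyncConnect extends Thread {
        private final InetSocketAddress addr;
        private volatile SocketChannel channel; // set only if the connect succeeded

        AsyncConnect(InetSocketAddress addr) {
            this.addr = addr;
            setDaemon(true);
        }

        @Override
        public void run() {
            SocketChannel ch = null;
            try {
                ch = SocketChannel.open();
                ch.connect(addr); // blocking; an interrupt closes the channel
                channel = ch;
            } catch (IOException | RuntimeException e) {
                // Refused, unreachable, or interrupted: report failure via a null result.
                if (ch != null) {
                    try { ch.close(); } catch (IOException ignored) { }
                }
            }
        }

        SocketChannel result() {
            return channel;
        }
    }

    /** Returns a connected channel, or null if no connection was made within timeoutMs. */
    static SocketChannel connectOne(InetSocketAddress addr, long timeoutMs)
            throws InterruptedException {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        AsyncConnect worker = new AsyncConnect(addr);
        worker.start();
        worker.join(timeoutMs);   // plays the role of the timer
        if (worker.isAlive()) {
            worker.interrupt();   // timer expired: abort the blocking connect
            worker.join();        // wait for the worker to finish cleaning up
        }
        return worker.result();
    }
}
```

A caller would treat a null result the same way as a failed connect and move on to the next peer. A Selector-based, non-blocking version would avoid the extra thread per connection attempt.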

2. Buggy senderWorkerMap handling:
The code that manages senderWorkerMap is very buggy. It is causing multiple 
election rounds. While debugging I found that sometimes after FLE a node will 
have its senderWorkerMap empty even if it has SenderWorker and RecvWorker threads 
for each peer.

a) The receiveConnection() method calls the finish() method, which removes an 
entry from the map. Additionally, the thread itself calls finish() which could 
remove the newly added entry from the map. In short, receiveConnection is 
causing the exact condition that you mentioned above.

b) Apart from the bug in finish(), receiveConnection is making an entry in 
senderWorkerMap at the wrong place. Here's the buggy code:
    SendWorker vsw = senderWorkerMap.get(sid);
    senderWorkerMap.put(sid, sw);
    if (vsw != null)
        vsw.finish();
It makes an entry for the new thread and then calls finish, which causes the 
new thread to be removed from the Map. The old thread will also get terminated 
since finish() will interrupt the thread.
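
The ordering problem in 2 b) can be reproduced in isolation. The sketch below is a stand-in and not the real QuorumCnxManager code: the Worker class and both register methods are hypothetical, and finish() is assumed to remove whatever entry is stored under its sid, as described above. Swapping the two steps is only one possible fix.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the senderWorkerMap bookkeeping. */
class WorkerMapDemo {
    static final ConcurrentHashMap<Long, Worker> senderWorkerMap = new ConcurrentHashMap<>();

    /** Stand-in for SendWorker: finish() removes the entry stored under its sid. */
    static final class Worker {
        final long sid;

        Worker(long sid) {
            this.sid = sid;
        }

        void finish() {
            senderWorkerMap.remove(sid);
        }
    }

    /** Same order as the quoted snippet: put the new worker, then finish the old one. */
    static void registerBuggy(long sid, Worker sw) {
        Worker vsw = senderWorkerMap.get(sid);
        senderWorkerMap.put(sid, sw);
        if (vsw != null) {
            vsw.finish(); // removes the entry that was just added for sw
        }
    }

    /** Reordered: finish the old worker first, then record the new one. */
    static void registerReordered(long sid, Worker sw) {
        Worker vsw = senderWorkerMap.get(sid);
        if (vsw != null) {
            vsw.finish();
        }
        senderWorkerMap.put(sid, sw);
    }
}
```

With the buggy order, registering a second worker for the same sid leaves the map without an entry for that peer. With the reordered version the newest worker stays in the map.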

3. Race condition in receiveConnection and initiateConnection:

*In theory*, two peers can keep disconnecting each other's connection.

Example:
T0: Peer 0 initiates a connection (request 1)
T1: Peer 1 receives connection from peer 0
T2: Peer 1 calls receiveConnection()
T2: Peer 0 closes connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
T3: Peer 1 terminates older connection to peer 0
T4: Peer 1 calls connectOne() which starts new sendWorker threads for peer 0
T5: Peer 1 kills connection created in T3 because it receives another (request 
2) connect request from 0

The problem here is that while Peer 0 is accepting a connection from Peer 1 it 
can also be initiating a connection to Peer 1. So if they hit the right 
frequencies they could sit in a connect/disconnect loop and cause multiple 
rounds of leader election.

I think the cause here is again blocking connects()/accepts(). A peer starts to 
take action (to kill existing threads and start new threads) as soon as a 
connection is established at the *TCP level*. That is, it does not give us any 
control to synchronize connects and accepts. We could use non-blocking connects 
and accepts. This will allow us to a) tell a 

election recipe

2010-09-02 Thread Eric van Orsouw
Hi there,

I would like to use zookeeper to implement an election scheme.
There is a recipe on the homepage, but it is relatively complex.
I was wondering what was wrong with the following pseudo code;

forever {
    zookeeper.create -e /election my_ip_address
    if creation succeeded then {
        // do the leader thing
    } else {
        // wait for change in /election using watcher mechanism
    }
}

My assumption is that the recipe is more elaborate in order to eliminate the 
flood of requests if the leader falls away.
But if there are only a handful of leader candidates, then that should not 
pose a problem.

Is this correct, or am I missing something?

Thanks,
Eric
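
The behaviour of the pseudo code above can be modelled without a ZooKeeper server. The sketch below is a single-threaded, in-memory simulation: the class SimpleElectionSim and all of its methods are hypothetical, and an AtomicReference merely stands in for the ephemeral /election node. It is meant to show how many create requests each leader change triggers, not how to talk to ZooKeeper.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** In-memory model of the simple scheme; not a ZooKeeper client. */
class SimpleElectionSim {
    /** Stands in for the ephemeral /election node; null means the node does not exist. */
    private final AtomicReference<String> electionNode = new AtomicReference<>();
    private int createAttempts = 0; // single-threaded use only

    /** Models "zookeeper.create -e /election my_ip_address": succeeds only if the node is absent. */
    boolean tryCreate(String candidate) {
        createAttempts++;
        return electionNode.compareAndSet(null, candidate);
    }

    /** Models the ephemeral node vanishing when the leader's session ends. */
    void leaderFallsAway() {
        electionNode.set(null);
    }

    /** Every waiting candidate reacts to the change by retrying the create. */
    String runRound(List<String> waitingCandidates) {
        for (String candidate : waitingCandidates) {
            tryCreate(candidate);
        }
        return electionNode.get();
    }

    int createAttempts() {
        return createAttempts;
    }
}
```

In this model every waiting candidate issues one create request each time the leader goes away, so the load grows with the number of candidates. The recipe on the homepage avoids this by having each candidate watch only its predecessor among sequential ephemeral nodes. With a handful of candidates the difference is small.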


[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-02 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-860:
---

   Assignee: Alex Baranau
Component/s: documentation

Please provide a link to the discussion thread referenced. Also link to the 
other Hadoop (sub)project jiras implementing this change. Thanks.

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
 Attachments: ZOOKEEPER-860.patch


 Use search-hadoop.com service to make available search in ZK sources, MLs, 
 wiki, etc.
 This was initially proposed on user mailing list. The search service was 
 already added in site's skin (common for all Hadoop related projects) before 
 so this issue is about enabling it for ZK. The ultimate goal is to use it at 
 all Hadoop's sub-projects' sites.




[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site

2010-09-02 Thread Alex Baranau (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Baranau updated ZOOKEEPER-860:
---

Description: 
Use search-hadoop.com service to make available search in ZK sources, MLs, 
wiki, etc.
This was initially proposed on user mailing list 
(http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already added 
in site's skin (common for all Hadoop related projects) before (as a part of 
[AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this issue is 
about enabling it for ZK. The ultimate goal is to use it at all Hadoop's 
sub-projects' sites.

  was:
Use search-hadoop.com service to make available search in ZK sources, MLs, 
wiki, etc.
This was initially proposed on user mailing list. The search service was 
already added in site's skin (common for all Hadoop related projects) before so 
this issue is about enabling it for ZK. The ultimate goal is to use it at all 
Hadoop's sub-projects' sites.


Updated description.

So far, JIRA issues have been created for the following Hadoop-related projects:

https://issues.apache.org/jira/browse/AVRO-626 (committed)
https://issues.apache.org/jira/browse/HBASE-2886 (committed)

https://issues.apache.org/jira/browse/ZOOKEEPER-860
https://issues.apache.org/jira/browse/HIVE-1611
https://issues.apache.org/jira/browse/HDFS-1367

I'm also about to create issues for the following (discussions have already been initiated):
* Hadoop TLP
* Common
* Chukwa
* MapReduce
* Pig

Please let me know if more information is needed.

 Add alternative search-provider to ZK site
 --

 Key: ZOOKEEPER-860
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Assignee: Alex Baranau
Priority: Minor
 Attachments: ZOOKEEPER-860.patch


 Use search-hadoop.com service to make available search in ZK sources, MLs, 
 wiki, etc.
 This was initially proposed on user mailing list 
 (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already 
 added in site's skin (common for all Hadoop related projects) before (as a 
 part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this 
 issue is about enabling it for ZK. The ultimate goal is to use it at all 
 Hadoop's sub-projects' sites.




[jira] Created: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.

2010-09-02 Thread Erwin Tam (JIRA)
Missing the test SSL certificate used for running junit tests.
--

 Key: ZOOKEEPER-861
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861
 Project: Zookeeper
  Issue Type: Bug
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
Priority: Minor
 Fix For: 3.4.0


The Hedwig code checked into Apache is missing a test SSL certificate file used 
for running the server junit tests.  We need this file otherwise the tests that 
use this (e.g. TestHedwigHub) will fail.




[jira] Created: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.

2010-09-02 Thread Erwin Tam (JIRA)
Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size.  
Make these a server config parameter instead.
-

 Key: ZOOKEEPER-862
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862
 Project: Zookeeper
  Issue Type: Improvement
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
 Fix For: 3.4.0


Hedwig code right now when using Bookkeeper as the persistence store is 
hardcoding the number of bookie servers in the ensemble and quorum size.  This 
is used the first time a ledger is created.  This should be exposed as a server 
configuration parameter instead.
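
As a rough illustration of what such a configuration parameter could look like, the sketch below reads both sizes from a java.util.Properties object and falls back to defaults. It is not the attached patch: the class LedgerSizing, the property names ensemble_size and quorum_size, and the default values are all hypothetical and chosen only for this example.

```java
import java.util.Properties;

/** Hypothetical holder for ledger sizing read from server configuration. */
class LedgerSizing {
    // Property names and defaults are assumptions made for this example.
    static final String ENSEMBLE_KEY = "ensemble_size";
    static final String QUORUM_KEY = "quorum_size";
    static final int DEFAULT_ENSEMBLE = 3;
    static final int DEFAULT_QUORUM = 2;

    final int ensembleSize;
    final int quorumSize;

    private LedgerSizing(int ensembleSize, int quorumSize) {
        this.ensembleSize = ensembleSize;
        this.quorumSize = quorumSize;
    }

    /** Reads both sizes from conf, using the defaults when a key is missing. */
    static LedgerSizing fromProperties(Properties conf) {
        int ensemble = Integer.parseInt(
                conf.getProperty(ENSEMBLE_KEY, String.valueOf(DEFAULT_ENSEMBLE)).trim());
        int quorum = Integer.parseInt(
                conf.getProperty(QUORUM_KEY, String.valueOf(DEFAULT_QUORUM)).trim());
        // A quorum cannot be larger than the ensemble it is drawn from.
        if (ensemble < 1 || quorum < 1 || quorum > ensemble) {
            throw new IllegalArgumentException(
                    "invalid sizes: ensemble=" + ensemble + ", quorum=" + quorum);
        }
        return new LedgerSizing(ensemble, quorum);
    }
}
```

The code that creates a ledger for the first time would then take ensembleSize and quorumSize from this object instead of using literals.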




[jira] Updated: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.

2010-09-02 Thread Erwin Tam (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erwin Tam updated ZOOKEEPER-861:


Attachment: server.p12

Uploading the binary SSL certificate file used for doing junit tests.

 Missing the test SSL certificate used for running junit tests.
 --

 Key: ZOOKEEPER-861
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861
 Project: Zookeeper
  Issue Type: Bug
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
Priority: Minor
 Fix For: 3.4.0

 Attachments: server.p12


 The Hedwig code checked into Apache is missing a test SSL certificate file 
 used for running the server junit tests.  We need this file otherwise the 
 tests that use this (e.g. TestHedwigHub) will fail.




[jira] Updated: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.

2010-09-02 Thread Erwin Tam (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erwin Tam updated ZOOKEEPER-861:


Status: Patch Available  (was: Open)

 Missing the test SSL certificate used for running junit tests.
 --

 Key: ZOOKEEPER-861
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861
 Project: Zookeeper
  Issue Type: Bug
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
Priority: Minor
 Fix For: 3.4.0

 Attachments: server.p12, ZOOKEEPER-861.patch


 The Hedwig code checked into Apache is missing a test SSL certificate file 
 used for running the server junit tests.  We need this file otherwise the 
 tests that use this (e.g. TestHedwigHub) will fail.




[jira] Updated: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.

2010-09-02 Thread Erwin Tam (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erwin Tam updated ZOOKEEPER-862:


Status: Patch Available  (was: Open)

Fix so the bookkeeper ledgers created will use a server configuration parameter 
to determine the ensemble and quorum size instead of hardcoding it.

 Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size.  
 Make these a server config parameter instead.
 -

 Key: ZOOKEEPER-862
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862
 Project: Zookeeper
  Issue Type: Improvement
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-862.patch


 Hedwig code right now when using Bookkeeper as the persistence store is 
 hardcoding the number of bookie servers in the ensemble and quorum size.  
 This is used the first time a ledger is created.  This should be exposed as a 
 server configuration parameter instead.




[jira] Updated: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.

2010-09-02 Thread Erwin Tam (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erwin Tam updated ZOOKEEPER-862:


Attachment: ZOOKEEPER-862.patch

 Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size.  
 Make these a server config parameter instead.
 -

 Key: ZOOKEEPER-862
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862
 Project: Zookeeper
  Issue Type: Improvement
  Components: contrib-hedwig
Reporter: Erwin Tam
Assignee: Erwin Tam
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-862.patch


 Hedwig code right now when using Bookkeeper as the persistence store is 
 hardcoding the number of bookie servers in the ensemble and quorum size.  
 This is used the first time a ledger is created.  This should be exposed as a 
 server configuration parameter instead.




[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client

2010-09-02 Thread Camille Fournier (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Camille Fournier updated ZOOKEEPER-844:
---

Fix Version/s: 3.3.2

 handle auth failure in java client
 --

 Key: ZOOKEEPER-844
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
 Project: Zookeeper
  Issue Type: Improvement
  Components: java client
Affects Versions: 3.3.1
Reporter: Camille Fournier
Assignee: Camille Fournier
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-844.patch


 ClientCnxn.java currently has the following code:
   if (replyHdr.getXid() == -4) {
       // -2 is the xid for AuthPacket
       // TODO: process AuthPacket here
       if (LOG.isDebugEnabled()) {
           LOG.debug("Got auth sessionid:0x"
                   + Long.toHexString(sessionId));
       }
       return;
   }
 Auth failures appear to cause the server to disconnect but the client never 
 gets a proper state change or notification that auth has failed, which makes 
 handling this scenario very difficult as it causes the client to go into a 
 loop of sending bad auth, getting disconnected, trying to reconnect, sending 
 bad auth again, over and over. 
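
 One way the missing notification might look is sketched below. This is a self-contained illustration and not the attached patch: the class AuthReplySketch, the State enum, the StateListener interface, and the error code value are stand-ins introduced for the example. Only the xid value -4 is taken from the snippet above.

```java
/** Hypothetical model of surfacing an auth failure to the application. */
class AuthReplySketch {
    static final int AUTH_XID = -4;          // xid checked in the quoted snippet
    static final int AUTH_FAILED_ERR = -115; // assumed error code for a failed auth; verify before relying on it

    enum State { CONNECTED, AUTH_FAILED }

    /** Hypothetical callback through which the application learns about state changes. */
    interface StateListener {
        void onStateChange(State newState);
    }

    private State state = State.CONNECTED;
    private final StateListener listener;

    AuthReplySketch(StateListener listener) {
        this.listener = listener;
    }

    /** Returns true if the reply was an auth reply and has been consumed here. */
    boolean handleReply(int xid, int err) {
        if (xid != AUTH_XID) {
            return false; // not an auth reply; the normal reply path handles it
        }
        if (err == AUTH_FAILED_ERR) {
            state = State.AUTH_FAILED;
            listener.onStateChange(state); // tell the application instead of returning silently
        }
        return true;
    }

    /** A reconnect loop could check this so it stops resending bad credentials. */
    boolean shouldReconnect() {
        return state != State.AUTH_FAILED;
    }
}
```

 The point of the sketch is that the failure becomes a terminal state the application can observe, which breaks the send-auth, disconnect, reconnect loop described above.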




Re: (ZOOKEEPER-844) handle auth failure in java client

2010-09-02 Thread Fournier, Camille F. [Tech]
Hi all,

I would like to submit this patch into the 3.3 branch as well, since we are 
probably going to go into production with 3.3 and I'd rather not do a 
production release with a patched version of ZK if possible. I added a patch 
for this fix against the 3.3 branch to this ticket. Any idea of the odds of 
getting this into the 3.3.2 release?

Thanks,
Camille

-Original Message-
From: Giridharan Kesavan (JIRA) [mailto:j...@apache.org] 
Sent: Tuesday, August 31, 2010 7:25 PM
To: Fournier, Camille F. [Tech]
Subject: [jira] Updated: (ZOOKEEPER-844) handle auth failure in java client


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated ZOOKEEPER-844:
-

Status: Patch Available  (was: Open)

 handle auth failure in java client
 --

 Key: ZOOKEEPER-844
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
 Project: Zookeeper
  Issue Type: Improvement
  Components: java client
Affects Versions: 3.3.1
Reporter: Camille Fournier
Assignee: Camille Fournier
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-844.patch


 ClientCnxn.java currently has the following code:
   if (replyHdr.getXid() == -4) {
       // -2 is the xid for AuthPacket
       // TODO: process AuthPacket here
       if (LOG.isDebugEnabled()) {
           LOG.debug("Got auth sessionid:0x"
                   + Long.toHexString(sessionId));
       }
       return;
   }
 Auth failures appear to cause the server to disconnect but the client never 
 gets a proper state change or notification that auth has failed, which makes 
 handling this scenario very difficult as it causes the client to go into a 
 loop of sending bad auth, getting disconnected, trying to reconnect, sending 
 bad auth again, over and over. 




[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client

2010-09-02 Thread Camille Fournier (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Camille Fournier updated ZOOKEEPER-844:
---

Attachment: ZOOKEEPER332-844

Patch for ZooKeeper 3.3.1 branch

 handle auth failure in java client
 --

 Key: ZOOKEEPER-844
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844
 Project: Zookeeper
  Issue Type: Improvement
  Components: java client
Affects Versions: 3.3.1
Reporter: Camille Fournier
Assignee: Camille Fournier
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844


 ClientCnxn.java currently has the following code:
   if (replyHdr.getXid() == -4) {
       // -2 is the xid for AuthPacket
       // TODO: process AuthPacket here
       if (LOG.isDebugEnabled()) {
           LOG.debug("Got auth sessionid:0x"
                   + Long.toHexString(sessionId));
       }
       return;
   }
 Auth failures appear to cause the server to disconnect but the client never 
 gets a proper state change or notification that auth has failed, which makes 
 handling this scenario very difficult as it causes the client to go into a 
 loop of sending bad auth, getting disconnected, trying to reconnect, sending 
 bad auth again, over and over. 




Problems in FLE implementation

2010-09-02 Thread Vishal K
Hi All,

I had posted this message as a comment for ZOOKEEPER-822. I thought it might
be a good idea to give it wider attention so that it will be easier to
collect feedback.

I found a few problems in the FLE implementation while debugging
https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email
below might require some background. If necessary, please browse the JIRA. I
have patches for 1. a) and 2). I will send them out soon.

1. Blocking connects and accepts:

a) The first problem is in manager.toSend(). This invokes connectOne(),
which does a blocking connect. While testing, I changed the code so that
connectOne() starts a new thread called AsyncConnect. AsyncConnect.run()
does a socketChannel.connect(). After starting AsyncConnect, connectOne
starts a timer. connectOne continues with normal operations if the
connection is established before the timer expires, otherwise, when the
timer expires it interrupts AsyncConnect() thread and returns. In this way,
I can have an upper bound on the amount of time we need to wait for connect
to succeed. Of course, this was a quick fix for my testing. Ideally, we
should use Selector to do non-blocking connects/accepts. I am planning to do
that later once we at least have a quick fix for the problem and consensus
from others for the real fix (this problem is big blocker for us). Note that
it is OK to do blocking IO in SenderWorker and RecvWorker threads since they
block IO to the respective peer.
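
To make the quick fix in a) concrete, here is a minimal sketch of a connect with an upper time bound. It is not the patch itself: BoundedConnect, its connect() helper and the executor are hypothetical names, and a Future with a timeout stands in for the separate timer described above.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not ZooKeeper code: a blocking connect with a time bound. */
final class BoundedConnect {
    // Daemon threads play the role of the AsyncConnect thread from the email.
    private static final ExecutorService CONNECTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "AsyncConnect");
        t.setDaemon(true);
        return t;
    });

    /** Returns a connected channel, or null if the attempt failed or timed out. */
    static SocketChannel connect(SocketAddress addr, long timeoutMs)
            throws InterruptedException {
        Future<SocketChannel> attempt = CONNECTOR.submit(() -> {
            SocketChannel ch = SocketChannel.open();
            try {
                ch.connect(addr); // still blocking, but off the caller's thread
                return ch;
            } catch (IOException e) {
                ch.close();
                throw e;
            }
        });
        try {
            return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupting a thread blocked in SocketChannel.connect() closes
            // the channel, so the worker does not stay stuck.
            if (!attempt.cancel(true)) {
                // The attempt finished just as the timeout fired; use its result.
                try {
                    return attempt.get();
                } catch (ExecutionException late) {
                    return null;
                }
            }
            return null;
        } catch (ExecutionException e) {
            return null; // connect failed, e.g. connection refused
        }
    }
}
```

A caller such as connectOne() would treat a null result the same way as a failed connect and move on, which is what bounds the wait.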

b) The blocking IO problem is not restricted to connectOne(); it is also
present in receiveConnection(). The Listener thread calls receiveConnection()
for each incoming connection request. receiveConnection() does blocking IO to
get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back
to the peer that had sent the connection request. All of this is happening
from the Listener. In short, if a peer fails after initiating a connection,
the Listener thread won't be able to accept connections from other peers,
because it would be stuck in read() or connectOne(). Also, the code has an
inherent cycle. initiateConnection() and receiveConnection() will have to be
very carefully synchronized; otherwise, we could run into deadlocks. This
code is going to be difficult to maintain/modify.
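
One way to picture a Listener that cannot be stalled by a single peer is sketched below. This is only an illustration under assumptions, not the existing listener code: HandoffListener, the handshake pool and the onPeer callback are hypothetical, and the handshake is assumed to be a single 8-byte server id sent by the initiating peer.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.function.BiConsumer;

/** Hypothetical listener: accept() never waits on a peer's handshake. */
final class HandoffListener implements Runnable {
    private final ServerSocket server;
    private final ExecutorService handshakePool;
    private final int readTimeoutMs;
    private final BiConsumer<Long, Socket> onPeer; // takes ownership of the socket

    HandoffListener(ServerSocket server, ExecutorService handshakePool,
                    int readTimeoutMs, BiConsumer<Long, Socket> onPeer) {
        this.server = server;
        this.handshakePool = handshakePool;
        this.readTimeoutMs = readTimeoutMs;
        this.onPeer = onPeer;
    }

    @Override
    public void run() {
        while (!server.isClosed()) {
            final Socket s;
            try {
                s = server.accept();
            } catch (IOException e) {
                return; // server socket closed or failed
            }
            // Hand the blocking read to another thread so the accept loop
            // goes straight back to accept().
            handshakePool.execute(() -> handshake(s));
        }
    }

    private void handshake(Socket s) {
        try {
            s.setSoTimeout(readTimeoutMs); // bound the wait for the peer's id
            // Assumption: the peer sends its server id as one 8-byte long.
            long sid = new DataInputStream(s.getInputStream()).readLong();
            onPeer.accept(sid, s);
        } catch (IOException e) {
            // Stalled or broken peer: drop this connection only.
            try {
                s.close();
            } catch (IOException ignored) {
                // nothing more to do
            }
        }
    }
}
```

With this shape, a peer that connects and then goes silent ties up one pool thread until the read timeout, while connections from other peers are still accepted.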

2. Buggy senderWorkerMap handling:
The code that manages senderWorkerMap is very buggy. It is causing multiple
election rounds. While debugging, I found that sometimes after FLE a node
will have its senderWorkerMap empty even if it has SendWorker and RecvWorker
threads for each peer.

a) The receiveConnection() method calls the finish() method, which removes
an entry from the map. Additionally, the thread itself calls finish() which
could remove the newly added entry from the map. In short, receiveConnection
is causing the exact condition that you mentioned above.

b) Apart from the bug in finish(), receiveConnection is making an entry in
senderWorkerMap at the wrong place. Here's the buggy code:
    SendWorker vsw = senderWorkerMap.get(sid);
    senderWorkerMap.put(sid, sw);
    if (vsw != null)
        vsw.finish();
It makes an entry for the new thread and then calls finish, which causes the
new thread to be removed from the Map. The old thread will also get
terminated since finish() will interrupt the thread.
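
The bookkeeping problem in 2 b) can be reproduced in isolation. The sketch below is a stand-in, not the real classes: Worker models only the map handling of SendWorker, finishByKey() is my assumption of how finish() behaves (remove whatever is stored under the sid), and finishIfCurrent() is one possible fix that uses the two-argument ConcurrentMap.remove(key, value).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical model of the senderWorkerMap bookkeeping; no threads involved. */
final class WorkerRegistry {
    static final class Worker {
        final long sid;
        final ConcurrentMap<Long, Worker> map;
        volatile boolean running = true;

        Worker(long sid, ConcurrentMap<Long, Worker> map) {
            this.sid = sid;
            this.map = map;
        }

        /** Assumed buggy behaviour: removes whatever is stored under sid. */
        void finishByKey() {
            running = false;
            map.remove(sid);
        }

        /** Safer variant: removes the entry only if it still maps to this worker. */
        void finishIfCurrent() {
            running = false;
            map.remove(sid, this);
        }
    }

    /** Registers the new worker first, then retires the old one, as in the email. */
    static void replace(ConcurrentMap<Long, Worker> map, Worker fresh, boolean safe) {
        Worker old = map.put(fresh.sid, fresh);
        if (old != null) {
            if (safe) {
                old.finishIfCurrent();
            } else {
                old.finishByKey();
            }
        }
    }

    public static void main(String[] args) {
        ConcurrentMap<Long, Worker> map = new ConcurrentHashMap<>();
        replace(map, new Worker(1L, map), false);
        replace(map, new Worker(1L, map), false);
        System.out.println("by key:     " + map.containsKey(1L));

        map.clear();
        replace(map, new Worker(1L, map), true);
        replace(map, new Worker(1L, map), true);
        System.out.println("if current: " + map.containsKey(1L));
    }
}
```

With removal by key, the second replace() leaves the map empty even though a new worker was just registered, which matches the empty map seen after FLE. The conditional remove keeps the new entry no matter whether put() or finish() runs first.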

3. Race condition in receiveConnection and initiateConnection:

*In theory*, two peers can keep disconnecting each other's connection.

Example:
T0: Peer 0 initiates a connection (request 1).
T1: Peer 1 receives the connection from Peer 0.
T2: Peer 1 calls receiveConnection().
T2: Peer 0 closes its connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates a connection to Peer 1 from manager.toSend()
    (request 2).
T3: Peer 1 terminates the older connection to Peer 0.
T4: Peer 1 calls connectOne(), which starts new SendWorker threads for
    Peer 0.
T5: Peer 1 kills the connection created in T3 because it receives
    another connect request (request 2) from Peer 0.

The problem here is that while Peer 0 is accepting a connection from Peer 1
it can also be initiating a connection to Peer 1. So if they hit the right
frequencies they could sit in a connect/disconnect loop and cause multiple
rounds of leader election.

I think the cause here is again blocking connect()/accept() calls. A peer
starts to take action (killing existing threads and starting new ones) as
soon as a connection is established at the TCP level. That is, it does not
give us any control to synchronize connects and accepts. We could use
non-blocking connects and accepts. This would allow us to a) tell a thread
not to initiate a connection because the listener is about to accept a
connection from the remote peer (using the isAcceptable() and
isConnectable() methods of SelectionKey) and b) prevent a thread from
initiating multiple connect requests to the same
peer. It will simplify