[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Attachment: ZOOKEEPER-823.patch

Changes:
- call ClientCnxn.cleanup() from ClientCnxnSocketNIO.cleanup(); this call was lost during the refactoring
- cleaned up the formatting changes to make the patch smaller

Now there are only three failures left:

NettyNettySuiteTest - ACLTest.testAcls
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /0
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:640)
at org.apache.zookeeper.test.ACLTest.testAcls(ACLTest.java:104)
at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:51)

When I run the whole suite in Eclipse as a JUnit test, it does not fail.

NettyNettySuiteHammerTest - the log doesn't tell me anything; I assume it's just the same as in NettyNettySuiteTest.

NioNettySuiteTest - ClientTest.testClientCleanup
junit.framework.AssertionFailedError: open fds after test are not significantly higher than before
at org.apache.zookeeper.test.ClientTest.testClientCleanup(ClientTest.java:731)
at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:51)

When I run the whole suite in Eclipse, the test still fails; however, when I run only ClientTest.testClientCleanup alone, it does not fail anymore.

I would really appreciate it if you could help me from here on. I double-checked, partly triple-checked, the refactoring.
update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Status: Patch Available (was: Open) update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: ZooKeeper-trunk #922
See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/922/ -- [...truncated 169648 lines...]
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:50594
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11237:nioserverc...@791] - Processing stat command from /127.0.0.1:50594
[junit] 2010-09-02 10:53:33,413 [myid:] - INFO [Thread-295:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO [Thread-295:nioserverc...@967] - Closed socket connection for client /127.0.0.1:50594 (no session established for client)
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO [main:quorumb...@195] - 127.0.0.1:11237 is accepting client connections
[junit] 2010-09-02 10:53:33,414 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11238
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:45703
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11238:nioserverc...@791] - Processing stat command from /127.0.0.1:45703
[junit] 2010-09-02 10:53:33,415 [myid:] - INFO [Thread-296:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:33,416 [myid:] - INFO [Thread-296:nioserverc...@967] - Closed socket connection for client /127.0.0.1:45703 (no session established for client)
[junit] 2010-09-02 10:53:33,416 [myid:] - INFO [main:quorumb...@195] - 127.0.0.1:11238 is accepting client connections
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:57052
[junit] 2010-09-02 10:53:33,417 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing stat command from /127.0.0.1:57052
[junit] 2010-09-02 10:53:33,418 [myid:] - INFO [Thread-297:nioserverc...@967] - Closed socket connection for client /127.0.0.1:57052 (no session established for client)
[junit] 2010-09-02 10:53:33,668 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:57053
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing stat command from /127.0.0.1:57053
[junit] 2010-09-02 10:53:33,669 [myid:] - INFO [Thread-298:nioserverc...@967] - Closed socket connection for client /127.0.0.1:57053 (no session established for client)
[junit] 2010-09-02 10:53:33,919 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:57054
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing stat command from /127.0.0.1:57054
[junit] 2010-09-02 10:53:33,920 [myid:] - INFO [Thread-299:nioserverc...@967] - Closed socket connection for client /127.0.0.1:57054 (no session established for client)
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:57055
[junit] 2010-09-02 10:53:34,171 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing stat command from /127.0.0.1:57055
[junit] 2010-09-02 10:53:34,172 [myid:] - INFO [Thread-300:nioserverc...@967] - Closed socket connection for client /127.0.0.1:57055 (no session established for client)
[junit] 2010-09-02 10:53:34,422 [myid:] - INFO [main:clientb...@225] - connecting to 127.0.0.1 11239
[junit] 2010-09-02 10:53:34,422 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioservercnxnfact...@196] - Accepted socket connection from /127.0.0.1:57056
[junit] 2010-09-02 10:53:34,423 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11239:nioserverc...@791] - Processing stat command from /127.0.0.1:57056
[junit] 2010-09-02 10:53:34,423 [myid:] - INFO [Thread-301:nioservercnxn$statcomm...@645] - Stat command output
[junit] 2010-09-02 10:53:34,424 [myid:] - INFO [Thread-301:nioserverc...@967] - Closed socket connection for client /127.0.0.1:57056 (no session established for client)
[jira] Created: (ZOOKEEPER-860) Add alternative search-provider to ZK site
Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Reporter: Alex Baranau Priority: Minor Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list. The search service was already added in site's skin (common for all Hadoop related projects) before so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site
[ https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Baranau updated ZOOKEEPER-860: --- Attachment: ZOOKEEPER-860.patch Attached patch which enables search-hadoop search service for site Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Reporter: Alex Baranau Priority: Minor Attachments: ZOOKEEPER-860.patch Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list. The search service was already added in site's skin (common for all Hadoop related projects) before so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
About symbol table of Zookeeper c client
Hi all: I'm writing an application in C which needs to link both memcached's lib and zookeeper's C client lib. I found a symbol table conflict, because both libs provide an implementation (recordio.h/c) of the function htonll. It seems that some functions of the zookeeper C client, which can be accessed externally but are only used internally, have very simple names. I think this will cause symbol table conflicts from time to time, and we should do something about it, e.g. add a specific prefix to these functions. thx -- With Regards! Ye, Qian
[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905528#action_12905528 ] Vishal K commented on ZOOKEEPER-822: Hi Flavio, I was planning to send out a mail explaining the problems in the FLE implementation that I have found so far. For now, I will put the info here. We can create new JIRAs if needed. I am waiting to hear back from our legal department to resolve copyright issues so that I can share my fixes as well. 1. Blocking connects and accepts: You are right, when the node is down TCP timeouts rule. a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use a Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SenderWorker and RecvWorker threads, since each does blocking IO only to its respective peer. b) The blocking IO problem is not just restricted to connectOne(), but is also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener.
In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle: initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify. 2. Buggy senderWorkerMap handling: The code that manages senderWorkerMap is very buggy. It is causing multiple election rounds. While debugging I found that sometimes after FLE a node will have its senderWorkerMap empty even if it has SenderWorker and RecvWorker threads for each peer. a) The receiveConnection() method calls the finish() method, which removes an entry from the map. Additionally, the thread itself calls finish(), which could remove the newly added entry from the map. In short, receiveConnection is causing the exact condition that you mentioned above. b) Apart from the bug in finish(), receiveConnection is making an entry in senderWorkerMap at the wrong place. Here's the buggy code:

SendWorker vsw = senderWorkerMap.get(sid);
senderWorkerMap.put(sid, sw);
if (vsw != null)
    vsw.finish();

It makes an entry for the new thread and then calls finish(), which causes the new thread to be removed from the map. The old thread will also get terminated, since finish() will interrupt the thread. 3. Race condition in receiveConnection and initiateConnection: *In theory*, two peers can keep disconnecting each other's connection. Example:

T0: Peer 0 initiates a connection (request 1)
T1: Peer 1 receives connection from peer 0
T2: Peer 1 calls receiveConnection()
T2: Peer 0 closes connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
T3: Peer 1 terminates older connection to peer 0
T4: Peer 1 calls connectOne() which starts new SendWorker threads for peer 0
T5: Peer 1 kills connection created in T3 because it receives another (request 2) connect request from 0

The problem here is that while Peer 0 is accepting a connection from Peer 1 it can also be initiating a connection to Peer 1. So if they hit the right frequencies they could sit in a connect/disconnect loop and cause multiple rounds of leader election. I think the cause here is again blocking connects()/accepts(). A peer starts to take action (to kill existing threads and start new threads) as soon as a connection is established at the *TCP level*. That is, it does not give us any control to synchronize connects and accepts. We could use non-blocking connects and accepts. This will allow us to a) tell a
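The finish()/put ordering hazard described in point 2 above can be reproduced with a small self-contained model. The classes and names below are illustrative only, not the real QuorumCnxManager code; the sketched fix is to install the new worker first and have finish() remove its map entry only when the map still points at that exact worker, so tearing down an old worker can never evict the new one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Self-contained model of the senderWorkerMap hazard (not the real
// QuorumCnxManager classes). finish() uses the conditional two-argument
// remove(key, value), which is a no-op if another worker replaced this one.
public class WorkerMapSketch {
    static final ConcurrentMap<Long, Worker> senderWorkerMap = new ConcurrentHashMap<>();

    static class Worker {
        final long sid;
        volatile boolean finished = false;

        Worker(long sid) { this.sid = sid; }

        void finish() {
            finished = true;
            // Only removes the entry if the map still maps sid -> this.
            senderWorkerMap.remove(sid, this);
        }
    }

    static Worker install(long sid) {
        Worker sw = new Worker(sid);
        Worker old = senderWorkerMap.put(sid, sw);  // install new worker first
        if (old != null) {
            old.finish();  // finishing the old worker leaves the new entry intact
        }
        return sw;
    }
}
```

With this ordering, installing a replacement worker for the same sid terminates the old one without ever leaving the map empty, which is exactly the failure mode observed after FLE.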
election recipe
Hi there, I would like to use zookeeper to implement an election scheme. There is a recipe on the homepage, but it is relatively complex. I was wondering what is wrong with the following pseudo code:

forever {
    zookeeper.create -e /election my_ip_address
    if creation succeeded then {
        // do the leader thing
    } else {
        // wait for change in /election using watcher mechanism
    }
}

My assumption is that the recipe is more elaborate to eliminate the flood of requests if the leader falls away. But if there are only a handful of leader candidates, that should not pose a problem. Is this correct, or am I missing out on something? Thanks, Eric
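For reference, the pseudo code above maps almost directly onto the ZooKeeper Java client API. The sketch below is illustrative only (the class name, doLeaderThing(), and the use of a latch are mine; error handling is minimal, and it needs a live ensemble to run, so it is not a drop-in implementation):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of the pseudo code from the mail, not the documented recipe.
public class SimpleElection {
    private final ZooKeeper zk;
    private final byte[] myAddress;

    public SimpleElection(ZooKeeper zk, byte[] myAddress) {
        this.zk = zk;
        this.myAddress = myAddress;
    }

    public void run() throws KeeperException, InterruptedException {
        while (true) {
            try {
                // Ephemeral: the znode vanishes when our session dies, freeing the slot.
                zk.create("/election", myAddress,
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                doLeaderThing();  // creation succeeded: we are the leader
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is leader: block until /election is deleted.
                final CountDownLatch gone = new CountDownLatch(1);
                Watcher w = new Watcher() {
                    public void process(WatchedEvent event) {
                        if (event.getType() == Event.EventType.NodeDeleted) {
                            gone.countDown();
                        }
                    }
                };
                if (zk.exists("/election", w) != null) {
                    gone.await();  // woken by the deletion watch
                }
                // else: leader died between create() and exists(); retry create.
            }
        }
    }

    private void doLeaderThing() { /* application-specific */ }
}
```

As Eric suspects, the cost relative to the documented recipe is the herd effect: every waiting candidate gets the deletion event and races to create() again, which is harmless with only a handful of candidates.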
[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site
[ https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-860: --- Assignee: Alex Baranau Component/s: documentation Please provide a link to the discussion thread referenced. Also link to the other Hadoop (sub)project jiras implementing this change. Thanks. Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Alex Baranau Assignee: Alex Baranau Priority: Minor Attachments: ZOOKEEPER-860.patch Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list. The search service was already added in site's skin (common for all Hadoop related projects) before so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site
[ https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Baranau updated ZOOKEEPER-860: --- Description: Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already added in site's skin (common for all Hadoop related projects) before (as a part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. was: Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list. The search service was already added in site's skin (common for all Hadoop related projects) before so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. Updated description. Currently created JIRA issues for next Hadoop-related projects: https://issues.apache.org/jira/browse/AVRO-626 (committed) https://issues.apache.org/jira/browse/HBASE-2886 (committed) https://issues.apache.org/jira/browse/ZOOKEEPER-860 https://issues.apache.org/jira/browse/HIVE-1611 https://issues.apache.org/jira/browse/HDFS-1367 I'm about to create issues also for (discussions have been already initiated): * Hadoop TLP * Common * Chukwa * MapReduce * Pig Please, let me know if more information is needed. Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Alex Baranau Assignee: Alex Baranau Priority: Minor Attachments: ZOOKEEPER-860.patch Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list (http://search-hadoop.com/m/sTZ4Y1BVKWg1). 
The search service was already added in site's skin (common for all Hadoop related projects) before (as a part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.
Missing the test SSL certificate used for running junit tests. -- Key: ZOOKEEPER-861 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861 Project: Zookeeper Issue Type: Bug Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Priority: Minor Fix For: 3.4.0 The Hedwig code checked into Apache is missing a test SSL certificate file used for running the server junit tests. We need this file otherwise the tests that use this (e.g. TestHedwigHub) will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.
Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead. - Key: ZOOKEEPER-862 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862 Project: Zookeeper Issue Type: Improvement Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Fix For: 3.4.0 Hedwig code right now when using Bookkeeper as the persistence store is hardcoding the number of bookie servers in the ensemble and quorum size. This is used the first time a ledger is created. This should be exposed as a server configuration parameter instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erwin Tam updated ZOOKEEPER-861: Attachment: server.p12 Uploading the binary SSL certificate file used for doing junit tests. Missing the test SSL certificate used for running junit tests. -- Key: ZOOKEEPER-861 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861 Project: Zookeeper Issue Type: Bug Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Priority: Minor Fix For: 3.4.0 Attachments: server.p12 The Hedwig code checked into Apache is missing a test SSL certificate file used for running the server junit tests. We need this file otherwise the tests that use this (e.g. TestHedwigHub) will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erwin Tam updated ZOOKEEPER-861: Status: Patch Available (was: Open) Missing the test SSL certificate used for running junit tests. -- Key: ZOOKEEPER-861 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861 Project: Zookeeper Issue Type: Bug Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Priority: Minor Fix For: 3.4.0 Attachments: server.p12, ZOOKEEPER-861.patch The Hedwig code checked into Apache is missing a test SSL certificate file used for running the server junit tests. We need this file otherwise the tests that use this (e.g. TestHedwigHub) will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erwin Tam updated ZOOKEEPER-862: Status: Patch Available (was: Open) Fix so the bookkeeper ledgers created will use a server configuration parameter to determine the ensemble and quorum size instead of hardcoding it. Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead. - Key: ZOOKEEPER-862 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862 Project: Zookeeper Issue Type: Improvement Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Fix For: 3.4.0 Attachments: ZOOKEEPER-862.patch Hedwig code right now when using Bookkeeper as the persistence store is hardcoding the number of bookie servers in the ensemble and quorum size. This is used the first time a ledger is created. This should be exposed as a server configuration parameter instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-862) Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erwin Tam updated ZOOKEEPER-862: Attachment: ZOOKEEPER-862.patch Hedwig created ledgers with hardcoded Bookkeeper ensemble and quorum size. Make these a server config parameter instead. - Key: ZOOKEEPER-862 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-862 Project: Zookeeper Issue Type: Improvement Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Fix For: 3.4.0 Attachments: ZOOKEEPER-862.patch Hedwig code right now when using Bookkeeper as the persistence store is hardcoding the number of bookie servers in the ensemble and quorum size. This is used the first time a ledger is created. This should be exposed as a server configuration parameter instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Camille Fournier updated ZOOKEEPER-844: --- Fix Version/s: 3.3.2 handle auth failure in java client -- Key: ZOOKEEPER-844 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844 Project: Zookeeper Issue Type: Improvement Components: java client Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-844.patch ClientCnxn.java currently has the following code:

if (replyHdr.getXid() == -4) {
    // -2 is the xid for AuthPacket
    // TODO: process AuthPacket here
    if (LOG.isDebugEnabled()) {
        LOG.debug("Got auth sessionid:0x" + Long.toHexString(sessionId));
    }
    return;
}

Auth failures appear to cause the server to disconnect but the client never gets a proper state change or notification that auth has failed, which makes handling this scenario very difficult, as it causes the client to go into a loop of sending bad auth, getting disconnected, trying to reconnect, sending bad auth again, over and over. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
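For context, what callers need here is a distinct state change to react to instead of the silent disconnect loop. A client-side default watcher along these lines could then break the retry cycle. This is only a sketch of how such a fix might be consumed, assuming the client is made to deliver Watcher.Event.KeeperState.AuthFailed (the class name and latch-style flag are mine):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Sketch only: a default watcher that latches auth failure so the caller can
// stop reconnecting with the same bad credentials. Assumes the client
// surfaces KeeperState.AuthFailed, which is what this issue asks for.
public class AuthAwareWatcher implements Watcher {
    private volatile boolean authFailed = false;

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.AuthFailed) {
            authFailed = true;  // fatal for this session; do not retry
        }
    }

    public boolean isAuthFailed() {
        return authFailed;
    }
}
```

An application would pass this watcher to the ZooKeeper constructor and check isAuthFailed() before any reconnect attempt; without the server-side state change this issue describes, the flag would simply never be set.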
Re: (ZOOKEEPER-844) handle auth failure in java client
Hi all, I would like to submit this patch into the 3.3 branch as well, since we are probably going to go into production with 3.3 and I'd rather not do a production release with a patched version of ZK if possible. I added a patch for this fix against the 3.3 branch to this ticket. Any idea of the odds of getting this into the 3.3.2 release? Thanks, Camille

-----Original Message-----
From: Giridharan Kesavan (JIRA) [mailto:j...@apache.org]
Sent: Tuesday, August 31, 2010 7:25 PM
To: Fournier, Camille F. [Tech]
Subject: [jira] Updated: (ZOOKEEPER-844) handle auth failure in java client

[ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giridharan Kesavan updated ZOOKEEPER-844: - Status: Patch Available (was: Open) handle auth failure in java client -- Key: ZOOKEEPER-844 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844 Project: Zookeeper Issue Type: Improvement Components: java client Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.4.0 Attachments: ZOOKEEPER-844.patch ClientCnxn.java currently has the following code:

if (replyHdr.getXid() == -4) {
    // -2 is the xid for AuthPacket
    // TODO: process AuthPacket here
    if (LOG.isDebugEnabled()) {
        LOG.debug("Got auth sessionid:0x" + Long.toHexString(sessionId));
    }
    return;
}

Auth failures appear to cause the server to disconnect but the client never gets a proper state change or notification that auth has failed, which makes handling this scenario very difficult, as it causes the client to go into a loop of sending bad auth, getting disconnected, trying to reconnect, sending bad auth again, over and over. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-844) handle auth failure in java client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Camille Fournier updated ZOOKEEPER-844: --- Attachment: ZOOKEEPER332-844 Patch for ZooKeeper 3.3.1 branch handle auth failure in java client -- Key: ZOOKEEPER-844 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844 Project: Zookeeper Issue Type: Improvement Components: java client Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844 ClientCnxn.java currently has the following code:

if (replyHdr.getXid() == -4) {
    // -2 is the xid for AuthPacket
    // TODO: process AuthPacket here
    if (LOG.isDebugEnabled()) {
        LOG.debug("Got auth sessionid:0x" + Long.toHexString(sessionId));
    }
    return;
}

Auth failures appear to cause the server to disconnect but the client never gets a proper state change or notification that auth has failed, which makes handling this scenario very difficult, as it causes the client to go into a loop of sending bad auth, getting disconnected, trying to reconnect, sending bad auth again, over and over. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Problems in FLE implementation
Hi All, I had posted this message as a comment for ZOOKEEPER-822. I thought it might be a good idea to give it wider attention so that it will be easier to collect feedback. I found a few problems in the FLE implementation while debugging for: https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email below might require some background. If necessary, please browse the JIRA. I have a patch for 1. a) and 2). I will send them out soon. 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use a Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SenderWorker and RecvWorker threads, since each does blocking IO only to its respective peer. b) The blocking IO problem is not just restricted to connectOne(), but is also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener.
In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle: initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify. 2. Buggy senderWorkerMap handling: The code that manages senderWorkerMap is very buggy. It is causing multiple election rounds. While debugging I found that sometimes after FLE a node will have its senderWorkerMap empty even if it has SenderWorker and RecvWorker threads for each peer. a) The receiveConnection() method calls the finish() method, which removes an entry from the map. Additionally, the thread itself calls finish(), which could remove the newly added entry from the map. In short, receiveConnection is causing the exact condition that you mentioned above. b) Apart from the bug in finish(), receiveConnection is making an entry in senderWorkerMap at the wrong place. Here's the buggy code:

SendWorker vsw = senderWorkerMap.get(sid);
senderWorkerMap.put(sid, sw);
if (vsw != null)
    vsw.finish();

It makes an entry for the new thread and then calls finish(), which causes the new thread to be removed from the map. The old thread will also get terminated, since finish() will interrupt the thread. 3. Race condition in receiveConnection and initiateConnection: *In theory*, two peers can keep disconnecting each other's connection. Example:

T0: Peer 0 initiates a connection (request 1)
T1: Peer 1 receives connection from peer 0
T2: Peer 1 calls receiveConnection()
T2: Peer 0 closes connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
T3: Peer 1 terminates older connection to peer 0
T4: Peer 1 calls connectOne() which starts new SendWorker threads for peer 0
T5: Peer 1 kills connection created in T3 because it receives another (request 2) connect request from 0

The problem here is that while Peer 0 is accepting a connection from Peer 1 it can also be initiating a connection to Peer 1. So if they hit the right frequencies they could sit in a connect/disconnect loop and cause multiple rounds of leader election. I think the cause here is again blocking connects()/accepts(). A peer starts to take action (to kill existing threads and start new threads) as soon as a connection is established at the *TCP level*. That is, it does not give us any control to synchronize connects and accepts. We could use non-blocking connects and accepts. This will allow us to a) tell a thread to not initiate a connection because the listener is about to accept a connection from the remote peer (use the isAcceptable() and isConnectable() methods of SelectionKey) and b) prevent a thread from initiating multiple connect requests to the same peer. It will simplify
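To make the non-blocking-connect proposal concrete, here is a minimal sketch using plain java.nio. This is not the actual QuorumCnxManager code; the class and method names are mine, and it only shows the bounded-wait part, not the isAcceptable()/isConnectable() coordination:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Sketch of a bounded, Selector-based connect: the caller waits at most
// timeoutMs for the TCP handshake instead of blocking indefinitely.
public class NonBlockingConnect {
    public static SocketChannel connect(InetSocketAddress addr, long timeoutMs)
            throws IOException {
        SocketChannel ch = SocketChannel.open();
        ch.configureBlocking(false);
        Selector selector = Selector.open();
        try {
            ch.register(selector, SelectionKey.OP_CONNECT);
            boolean connected = ch.connect(addr);   // may complete immediately
            if (!connected) {
                selector.select(timeoutMs);         // bounded wait for the handshake
                connected = ch.finishConnect();     // throws if the peer refused
            }
            if (!connected) {                       // timer expired: abandon attempt
                ch.close();
                return null;
            }
            return ch;                              // still in non-blocking mode
        } catch (IOException e) {
            ch.close();
            throw e;
        } finally {
            selector.close();                       // deregisters the channel
        }
    }
}
```

The returned channel is still non-blocking; a caller in the style of connectOne() would presumably switch it back to blocking mode before handing it to the SendWorker/RecvWorker threads, which (as noted above) may safely do blocking IO to their single peer.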