[jira] [Updated] (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reed Wanderman-Milne updated ZOOKEEPER-900:
    Attachment: ZOOKEEPER-900-part2.patch

I've attached a patch that fixes the blocking issue in connectOne(). I've moved much of the connection logic into SendWorker, so all the socket operations are done on a separate thread. Some of the code in the two connectOne() methods was moved to SendWorker.connectToServer. Additionally, receiveConnection() and initiateConnection() were moved to connectOne(). As a result, connectOne() shouldn't wait for the connection to be established before returning.

One consequence of this is that SendWorker.finish() may block until the connection is made if it's called before a connection is established (since both finish() and SendWorker.establishConnection() are synchronized). This is better than blocking in connectOne(), but does anyone have any ideas to fix this?

FLE implementation should be improved to use non-blocking sockets
    Key: ZOOKEEPER-900
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
    Project: ZooKeeper
    Issue Type: Bug
    Reporter: Vishal Kher
    Assignee: Vishal Kher
    Priority: Critical
    Fix For: 3.5.1
    Attachments: ZOOKEEPER-900-part2.patch, ZOOKEEPER-900.patch, ZOOKEEPER-900.patch1, ZOOKEEPER-900.patch2

From earlier email exchanges:

1. Blocking connects and accepts:

a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne() starts a timer. connectOne() continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires, it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed.

Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later, once we at least have a quick fix for the problem and consensus from others on the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SendWorker and RecvWorker threads, since each blocks only on IO to its respective peer.

b) The blocking IO problem is not restricted to connectOne(); it also affects receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection() does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that sent the connection request. All of this happens on the Listener thread. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it will be stuck in read() or connectOne().

Also, the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.

Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
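The Selector-based connect that the ZOOKEEPER-900 description proposes as the "real fix" can be sketched roughly as follows. This is only a minimal illustration using standard java.nio calls, not the code from any of the attached patches or from QuorumCnxManager; the class name NonBlockingConnect, the method connect, and the timeoutMs parameter are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of ZooKeeper: bounds the time spent
// waiting for an outgoing connection by using a non-blocking channel.
final class NonBlockingConnect {

    /**
     * Tries to connect to addr for at most timeoutMs milliseconds.
     * Returns the connected (still non-blocking) channel, or null if the
     * attempt timed out or failed.
     */
    static SocketChannel connect(InetSocketAddress addr, long timeoutMs) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            if (channel.connect(addr)) {
                return channel; // connected immediately (possible on loopback)
            }
            channel.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            long remainingMs;
            while ((remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime())) > 0) {
                // select() waits at most remainingMs, so the caller's wait is bounded.
                if (selector.select(remainingMs) > 0 && channel.finishConnect()) {
                    return channel;
                }
                selector.selectedKeys().clear();
            }
        } catch (IOException e) {
            // Connection refused, host unreachable, etc.: this sketch treats
            // these the same way as a timeout.
        }
        channel.close();
        return null;
    }
}
```

A caller could invoke NonBlockingConnect.connect(new InetSocketAddress(host, port), 5000) and treat a null result as "peer not reachable right now". A fuller design would probably share one Selector across all peers (and register OP_ACCEPT for the Listener) rather than opening a selector per connection attempt, as is done here for brevity.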
[jira] [Commented] (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130830#comment-14130830 ]

Reed Wanderman-Milne commented on ZOOKEEPER-900:

Hi,

I'm wondering if there's any progress on this JIRA. I'm running into an issue similar to that of ZOOKEEPER-1678, which can be solved by fixing this. If no one is working on it, I'd be happy to take a stab at it.

[~vishalmlst]'s patch added a timeout for connections to other peers, but it still appears that only one connection can be processed at a time. Additionally, in connectOne(long), a lock on the QuorumPeer is held, preventing other threads from accessing it. Both of these issues seem to contribute to ZOOKEEPER-1678.

[~vishalmlst] suggested in an earlier comment moving the socket operations to SendWorker and RecvWorker, which would prevent socket operations from blocking other connections.

Let me know what your thoughts are. Thanks!

FLE implementation should be improved to use non-blocking sockets
    Key: ZOOKEEPER-900
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
    Project: ZooKeeper
    Issue Type: Bug
    Reporter: Vishal Kher
    Assignee: Vishal Kher
    Priority: Critical
    Fix For: 3.5.1
    Attachments: ZOOKEEPER-900.patch, ZOOKEEPER-900.patch1, ZOOKEEPER-900.patch2
[jira] [Updated] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reed Wanderman-Milne updated ZOOKEEPER-1660:
    Attachment: ZOOKEEPER-1660-v3.patch

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0
    Attachments: ZOOKEEPER-1660-v2.patch, ZOOKEEPER-1660-v3.patch, ZOOKEEPER-1660.patch

Update user manual with reconfiguration info.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109669#comment-14109669 ]

Reed Wanderman-Milne commented on ZOOKEEPER-1660:

Hi Alex,

That was a good change; the overview is much clearer now. Note that I had to reformat the paper citation slightly, since it appears DocBook doesn't support line breaks. Let me know if there are any more changes to the Google Doc.

The patch contains a reference to the reconfig page from the Administrator's Guide (in the section Configuration Parameters), so a reader should be able to figure out how to upgrade to 3.5.0.

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0
    Attachments: ZOOKEEPER-1660-v2.patch, ZOOKEEPER-1660-v3.patch, ZOOKEEPER-1660.patch

Update user manual with reconfiguration info.
[jira] [Updated] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reed Wanderman-Milne updated ZOOKEEPER-1660:
    Attachment: ZOOKEEPER-1660-v2.patch

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0
    Attachments: ZOOKEEPER-1660-v2.patch, ZOOKEEPER-1660.patch

Update user manual with reconfiguration info.
[jira] [Commented] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108195#comment-14108195 ]

Reed Wanderman-Milne commented on ZOOKEEPER-1660:

Hi Alex,

Thanks for the updates. I made the changes, except for adding the comment about local sessions (which I can add later if necessary). Maybe we should move the "Upgrading to 3.5.0" section to the Administrator's Guide page; it doesn't seem directly related to dynamic reconfig. What do you think?

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0
    Attachments: ZOOKEEPER-1660-v2.patch, ZOOKEEPER-1660.patch

Update user manual with reconfiguration info.
[jira] [Updated] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reed Wanderman-Milne updated ZOOKEEPER-1660:
    Attachment: ZOOKEEPER-1660.patch

Here's a draft of the new documentation. I had to make some minor formatting changes from the Google Doc, since the Forrest DocBook plugin doesn't support some formatting options.

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0
    Attachments: ZOOKEEPER-1660.patch

Update user manual with reconfiguration info.
[jira] [Commented] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102974#comment-14102974 ]

Reed Wanderman-Milne commented on ZOOKEEPER-1660:

I'll start working on the Forrest doc, then. I'll have it done in a few days.

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Reed Wanderman-Milne
    Priority: Blocker
    Fix For: 3.5.0

Update user manual with reconfiguration info.
[jira] [Created] (ZOOKEEPER-1991) zkServer.sh returns with a zero exit status when a ZooKeeper process is already running
Reed Wanderman-Milne created ZOOKEEPER-1991:

    Summary: zkServer.sh returns with a zero exit status when a ZooKeeper process is already running
    Key: ZOOKEEPER-1991
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1991
    Project: ZooKeeper
    Issue Type: Bug
    Components: scripts
    Affects Versions: 3.4.6
    Reporter: Reed Wanderman-Milne
    Priority: Minor

If ZooKeeper is started with zkServer.sh and an error is shown that a ZooKeeper process is already running, the command returns with an exit status of 0, whereas it should end with a non-zero exit status. Example:

    $ bin/zkServer.sh start
    JMX enabled by default
    Using config: /home/reed/zookeeper/bin/../conf/zoo.cfg
    Starting zookeeper ... already running as process 25063.
    $ echo $?
    0

This can make it difficult for automated scripts to check whether creating a new ZooKeeper process was successful, since they won't catch the case where a user accidentally started it earlier.
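To illustrate why the exit status matters, here is a minimal sketch of the kind of automated check the report has in mind, written in Java with the standard ProcessBuilder API. The class name ExitStatusCheck and its methods are hypothetical and not part of ZooKeeper. With the 3.4.6 behavior described above, startedSuccessfully would report success even in the "already running" case, because the script exits with 0.

```java
import java.io.IOException;

// Hypothetical launcher helper, not part of ZooKeeper.
final class ExitStatusCheck {

    /** Runs the command, forwards its output to this process, and returns its exit status. */
    static int run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }

    /** Treats only a zero exit status as a successful start. */
    static boolean startedSuccessfully(int exitStatus) {
        return exitStatus == 0;
    }
}
```

A deployment tool could call ExitStatusCheck.startedSuccessfully(ExitStatusCheck.run("bin/zkServer.sh", "start")) and abort on false, which only works as intended if the script returns a non-zero status when it does not actually start a new process.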
[jira] [Commented] (ZOOKEEPER-1660) Add documentation for dynamic reconfiguration
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073718#comment-14073718 ]

Reed Wanderman-Milne commented on ZOOKEEPER-1660:

I spoke to [~shralex] and agreed to create the Forrest docs once the Google Doc is updated to its near-final version.

Add documentation for dynamic reconfiguration
    Key: ZOOKEEPER-1660
    URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1660
    Project: ZooKeeper
    Issue Type: Sub-task
    Components: documentation
    Affects Versions: 3.5.0
    Reporter: Alexander Shraer
    Assignee: Alexander Shraer
    Priority: Blocker
    Fix For: 3.5.0

Update user manual with reconfiguration info.