[ https://issues.apache.org/jira/browse/ZOOKEEPER-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295278#comment-15295278 ]
Alexander Shraer commented on ZOOKEEPER-2366: --------------------------------------------- let me try to clarify: - as you know, in ZK operations fail only (a) at the leader and (b) during pre-processing. Here the situation is different - (a) the reconfig could be trying to alter a follower's port, which may be taken (the leader has no way to know that), and (b) this is detected during commit time when the change is attempted. At that point, the operation is committed for all practical purposes (quorum already accepted), there is no way to abort it. What can we do without redesigning ZK ? my suggestion is to keep logging an error and not propagate it up (because you can't do anything about it anyway) but of course to close the port, which is what the bug was opened for and what the patch does. The alternative is perhaps to bring down that server because its state can't reflect a committed operation, but I think its an overkill :) - Why I said that the new "return false" in Leader.java isn't exercised by the test: because the test scenario is trying to change a follower's port, so processReconfig fails to change it at the follower, the leader will execute processReconfig just fine, since all it does is update the config in memory. - returning false in Leader.java is wrong. My comments can probably be better there, but false has a different meaning - it indicates that there are not enough acks. Operations do not fail at the commit stage. - The test can be improved a bit to also try to reconfig the leader's client port and see what happens when its taken. Right now it only checks for follower. Its possible to do that without adding too much code if you do the flow in a loop - once for leader and once for follower. I think its also ok without this extension though. > Reconfiguration of client port causes a socket leak > --------------------------------------------------- > > Key: ZOOKEEPER-2366 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2366 > Project: ZooKeeper > Issue Type: Bug > Components: quorum > Affects Versions: 3.5.0 > Reporter: Timothy Ward > Assignee: Flavio Junqueira > Priority: Blocker > Fix For: 3.5.2 > > Attachments: ZOOKEEPER-2366.patch, ZOOKEEPER-2366.patch, > zookeeper.patch > > > The NIOServerCnxnFactory reconfigure method can leak server sockets, and > hence make ports unusable until the JVM restarts: > The first line of the method takes a reference to the current > ServerSocketChannel and then the next line replaces it. The subsequent > interactions with the server socket can fail (for example if the > reconfiguration tries to bind to an in-use port). If they fail *before* the > call to oldSS.close() then oldSS is *never* closed. This holds that port open > forever, and prevents the user from rolling back to the previous port! > The code from reconfigure is shown below: > ServerSocketChannel oldSS = ss; > try { > this.ss = ServerSocketChannel.open(); > ss.socket().setReuseAddress(true); > LOG.info("binding to port " + addr); > ss.socket().bind(addr); > ss.configureBlocking(false); > acceptThread.setReconfiguring(); > oldSS.close(); > acceptThread.wakeupSelector(); > try { > acceptThread.join(); > } catch (InterruptedException e) { > LOG.error("Error joining old acceptThread when > reconfiguring client port " + e.getMessage()); > } > acceptThread = new AcceptThread(ss, addr, selectorThreads); > acceptThread.start(); > } catch(IOException e) { > LOG.error("Error reconfiguring client port to " + addr + " " + > e.getMessage()); > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)