[ https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440552#comment-13440552 ]
Serg Shnerson commented on CASSANDRA-4571:
------------------------------------------

It seems that the bug is related to Java NIO internals (and possibly to the Thrift framework). Please read https://forums.oracle.com/forums/thread.jspa?threadID=1146235 for more details and share your thoughts.

From the topic:

"I am submitting this post to highlight a possible NIO "gotcha" in multithreaded applications and pose a couple of questions. We have observed file descriptor resource leakage (eventually leading to server failure) in a server process using NIO within the excellent framework written by Ronny Standtke (http://nioframework.sourceforge.net). Platform is JDK1.6.0_05 on RHEL4. I don't think that this is the same issue as that in connection with TCP CLOSED sockets reported elsewhere - What leaks here are descriptors connected to Unix domain sockets.

In the framework, SelectableChannels registered in a selector are select()-ed in a single thread that handles data transfer to clients of the selector channels, executing in different threads. When a client shuts down its connection (invoking key.cancel() and key.channel.close()) eventually we get to JRE AbstractInterruptibleChannel::close() and SocketChannelImpl::implCloseSelectableChannel() which does the preClose() - via JNI this dup2()s a statically maintained descriptor (attached to a dummy Unix domain socket) onto the underlying file descriptor (as discussed by Alan Bateman (http://mail.openjdk.java.net/pipermail/core-libs-dev/2008-January/000219.html)).

The problem occurs when the select() thread runs at the same time and the cancelled key is seen by SelectorImpl::processDeregisterQueue(). Eventually (in our case) EPollSelectorImpl::implDereg() tests the "channel closed" flag set by AbstractInterruptibleChannel::close() (this is not read-protected by a lock) and executes channel.kill() which closes the underlying file descriptor.
If this happens before the preClose() in the other thread, the out-of-sequence dup2() leaks the file descriptor, attached to the UNIX domain socket.

In the framework mentioned, we don't particularly want to add locking in the select() thread as this would impact other clients of the selector - alternatively a fix is to simply comment out the key.cancel(). channel.close() does the cancel() for us anyway, but after the close()/preClose() has completed, so the select() processing then occurs in the right sequence. (I am notifying Ronny Standtke of this issue independently)."

See also the following links for more information:
http://stackoverflow.com/questions/7038688/java-nio-causes-file-descriptor-leak
http://mail-archives.apache.org/mod_mbox/tomcat-users/201201.mbox/%3CCAJkSUv-DDKTCQ-pD7W=qovmph1dxexovcr+3mcgu05cqpt7...@mail.gmail.com%3E
http://www.apacheserver.net/HBase-Thrift-for-CDH3U3-leaking-file-descriptors-socket-at1580921.htm

> Strange permament socket descriptors increasing leads to "Too many open files"
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4571
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4571
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.2
>         Environment: CentOS 5.8 Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux.
> java version "1.6.0_33"
> Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
>            Reporter: Serg Shnerson
>            Priority: Critical
>
> On a two-node cluster we found a strange, steady increase in the number of socket descriptors.
> lsof -n | grep java shows many rows like:
>
> java 8380 cassandra 113r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 114r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 115r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 116r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 117r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 118r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 119r unix 0xffff8101a374a080 938348482 socket
> java 8380 cassandra 120r unix 0xffff8101a374a080 938348482 socket
>
> The number of these rows increases constantly. After about 24 hours this leads to the "Too many open files" error.
> We use the PHPCassa client. The load is not high (around 50kb/s on write).
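
For reference, the workaround described in the quoted forum post (close the channel and skip the explicit key.cancel()) could look roughly like the sketch below. This is only an illustration under assumptions: the class ChannelCloser and the method closeChannelOnly are hypothetical names, not code from Thrift, Cassandra, or the NIO framework, and whether Thrift's server code follows the cancel-then-close pattern at all has not been checked here.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;

/** Hypothetical helper; not part of Thrift, Cassandra, or the NIO framework. */
final class ChannelCloser {
    private ChannelCloser() {}

    /**
     * Closes the channel behind a selection key WITHOUT calling key.cancel() first.
     * Closing a selectable channel cancels its keys as part of close(), after the
     * channel-specific close work has run.
     *
     * @return true if the channel closed cleanly, false if close() threw
     */
    static boolean closeChannelOnly(SelectionKey key) {
        try {
            // Deliberately no key.cancel() here: per the quoted post, cancelling
            // first lets the selector thread race with the close in this thread.
            key.channel().close();
            return true;
        } catch (IOException e) {
            // Assumption: the caller only needs to know that the close failed.
            return false;
        }
    }
}
```

After closeChannelOnly(key) returns, key.isValid() should report false, because the close itself cancels the key; the selector then removes the key on its next selection operation.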