[ 
https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440552#comment-13440552
 ] 

Serg Shnerson commented on CASSANDRA-4571:
------------------------------------------

It seems that the bug is related to Java NIO internals (and possibly to the Thrift 
framework). Please read 
https://forums.oracle.com/forums/thread.jspa?threadID=1146235 for more details 
and share your thoughts.
From the topic: "I am submitting this post to highlight a possible NIO "gotcha" in 
multithreaded applications and pose a couple of questions. We have observed 
file descriptor resource leakage (eventually leading to server failure) in a 
server process using NIO within the excellent framework written by Ronny 
Standtke (http://nioframework.sourceforge.net). Platform is JDK1.6.0_05 on 
RHEL4. I don't think that this is the same issue as that in connection with 
TCP CLOSED sockets reported elsewhere - what leaks here are descriptors 
connected to Unix domain sockets.

In the framework, SelectableChannels registered in a selector are select()-ed 
in a single thread that handles data transfer to clients of the selector 
channels, executing in different threads. When a client shuts down its 
connection (invoking key.cancel() and key.channel().close()), eventually we get to 
JRE AbstractInterruptibleChannel::close() and 
SocketChannelImpl::implCloseSelectableChannel() which does the preClose() - via 
JNI this dup2()s a statically maintained descriptor (attached to a dummy Unix 
domain socket) onto the underlying file descriptor (as discussed by Alan 
Bateman 
(http://mail.openjdk.java.net/pipermail/core-libs-dev/2008-January/000219.html)).
 The problem occurs when the select() thread runs at the same time and the 
cancelled key is seen by SelectorImpl::processDeregisterQueue(). Eventually (in 
our case) EPollSelectorImpl::implDereg() tests the "channel closed" flag set by 
AbstractInterruptibleChannel::close() (this is not read-protected by a lock) 
and executes channel.kill() which closes the underlying file descriptor. If 
this happens before the preClose() in the other thread, the out-of-sequence 
dup2() leaks the file descriptor, attached to the UNIX domain socket.

In the framework mentioned, we don't particularly want to add locking in the 
select() thread as this would impact other clients of the selector - 
alternatively a fix is to simply comment out the key.cancel(). channel.close() 
does the cancel() for us anyway, but after the close()/preClose() has 
completed, so the select() processing then occurs in the right sequence. (I am 
notifying Ronny Standtke of this issue independently)."
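
For reference, here is a minimal sketch of the two shutdown orderings the quoted 
post describes, assuming a hypothetical handler that owns a SocketChannel and 
its SelectionKey (class and method names are illustrative only, not taken from 
the nioframework or from Cassandra):

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.SocketChannel;

    // Hypothetical handler owning a channel registered with a selector.
    public class ConnectionHandler {
        private final SocketChannel channel;
        private final SelectionKey key;

        public ConnectionHandler(SocketChannel channel, SelectionKey key) {
            this.channel = channel;
            this.key = key;
        }

        // Problematic order from the post: cancelling the key first lets the
        // selector thread see the cancelled key, deregister it and kill() the
        // channel before this thread reaches preClose(), so the later dup2()
        // leaks a descriptor attached to the dummy Unix domain socket.
        public void shutdownRacy() throws IOException {
            key.cancel();
            channel.close();
        }

        // Workaround from the post: skip the explicit cancel(). close() runs
        // preClose() first and only then cancels the key, so the selector
        // thread deregisters the channel in the right order.
        public void shutdownSafe() throws IOException {
            channel.close();
        }
    }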

See also the following links for more information:
http://stackoverflow.com/questions/7038688/java-nio-causes-file-descriptor-leak
http://mail-archives.apache.org/mod_mbox/tomcat-users/201201.mbox/%3CCAJkSUv-DDKTCQ-pD7W=qovmph1dxexovcr+3mcgu05cqpt7...@mail.gmail.com%3E
http://www.apacheserver.net/HBase-Thrift-for-CDH3U3-leaking-file-descriptors-socket-at1580921.htm
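
If it helps with reproducing, the descriptor growth can also be watched from 
inside the JVM. A rough observation aid (JDK/HotSpot-specific: the 
com.sun.management MXBean cast only works on Unix JVMs, and this is not 
Cassandra code):

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Log the process's open descriptor count once a minute. A steadily
    // growing count, together with lsof showing the same Unix domain socket
    // repeated, matches the leak reported in this ticket.
    public class FdWatcher {
        public static void main(String[] args) throws InterruptedException {
            UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                System.out.println("open fds: " + os.getOpenFileDescriptorCount()
                        + " / max: " + os.getMaxFileDescriptorCount());
                Thread.sleep(60000);
            }
        }
    }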

                
> Strange permanent socket descriptors increasing leads to "Too many open files"
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4571
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4571
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.2
>         Environment: CentOS 5.8 Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 
> 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux. 
> java version "1.6.0_33"
> Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
>            Reporter: Serg Shnerson
>            Priority: Critical
>
> On the two-node cluster we found a strange increase in open socket 
> descriptors. lsof -n | grep java shows many rows like: "
> java       8380 cassandra  113r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  114r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  115r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  116r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  117r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  118r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  119r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  120r     unix 0xffff8101a374a080            938348482 socket
> " And number of this rows constantly increasing. After about 24 hours this 
> situation leads to error.
> We use PHPCassa client. Load is not so high (aroud ~50kb/s on write). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

