[ https://issues.apache.org/jira/browse/HADOOP-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032441#comment-15032441 ]
Alan Burlison commented on HADOOP-12487:
----------------------------------------

Hmph, must have missed them, I'll respin - thanks.

> DomainSocket.close() assumes incorrect Linux behaviour
> ------------------------------------------------------
>
>                 Key: HADOOP-12487
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12487
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: net
>    Affects Versions: 2.7.1
>         Environment: Linux Solaris
>            Reporter: Alan Burlison
>            Assignee: Alan Burlison
>         Attachments: HADOOP-12487.001.patch, HADOOP-12487.002.patch, shutdown.c
>
>
> I'm getting a test failure in TestDomainSocket.java, in the
> testSocketAcceptAndClose test. That test creates a socket which one thread
> waits on in DomainSocket.accept() whilst a second thread sleeps for a short
> time before closing the same socket with DomainSocket.close().
> DomainSocket.close() first calls shutdown0() on the socket before calling
> close0() - both of those are thin wrappers around the corresponding libc socket
> calls. DomainSocket.close() contains the following comment, explaining the
> logic involved:
> {code}
>           // Calling shutdown on the socket will interrupt blocking system
>           // calls like accept, write, and read that are going on in a
>           // different thread.
> {code}
> Unfortunately that relies on non-standards-compliant Linux behaviour. I've
> written a simple C test case that replicates the scenario above:
> # ThreadA opens, binds, listens and accepts on a socket, waiting for
> connections.
> # Some time later ThreadB calls shutdown on the socket ThreadA is waiting in
> accept on.
> Here is what happens:
> On Linux, the shutdown call in ThreadB succeeds and the accept call in
> ThreadA returns with EINVAL.
> On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA
> continues to wait in accept.
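A minimal sketch of the two-thread scenario described above, written for illustration only; it is not the attached shutdown.c. The names probe_shutdown_on_listener and struct probe_result, the socket path under /tmp, the 200 ms delays and the choice of SHUT_RDWR are all assumptions. It records whatever shutdown() and accept() report instead of assuming either platform's behaviour, and it cancels the accepting thread if it is still blocked so that the probe always terminates.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result record; the field names are illustrative only. */
struct probe_result {
    int shutdown_rc;     /* return value of shutdown() in "ThreadB" */
    int shutdown_errno;  /* errno if shutdown() failed, else 0 */
    int accept_returned; /* 1 if accept() returned on its own, 0 if still blocked */
    int accept_errno;    /* errno if accept() returned an error, else 0 */
};

struct accept_args {
    int listen_fd;
    int fd;   /* value returned by accept(), -1 if it never returned */
    int err;  /* errno captured right after accept() */
};

/* "ThreadA": block in accept() on the listening socket. */
static void *accept_thread(void *p) {
    struct accept_args *a = p;
    a->fd = accept(a->listen_fd, NULL, NULL);
    a->err = (a->fd < 0) ? errno : 0;
    return a;
}

static void sleep_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 0 if the probe ran, -1 if socket or thread setup failed. */
int probe_shutdown_on_listener(struct probe_result *out) {
    struct sockaddr_un addr;
    struct accept_args args;
    pthread_t tid;
    void *ret = NULL;
    int fd;

    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* Assumed scratch path; any writable location would do. */
    snprintf(addr.sun_path, sizeof(addr.sun_path),
             "/tmp/shutdown_probe.%ld", (long)getpid());
    unlink(addr.sun_path);

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        unlink(addr.sun_path);
        return -1;
    }

    args.listen_fd = fd;
    args.fd = -1;
    args.err = 0;
    if (pthread_create(&tid, NULL, accept_thread, &args) != 0) {
        close(fd);
        unlink(addr.sun_path);
        return -1;
    }

    /* "ThreadB": give ThreadA time to block in accept(), then shut down. */
    sleep_ms(200);
    out->shutdown_rc = shutdown(fd, SHUT_RDWR);
    out->shutdown_errno = (out->shutdown_rc < 0) ? errno : 0;

    /* Give accept() time to react. accept() is a cancellation point, so
     * cancelling ends the thread if it is still blocked; if the thread has
     * already finished, the cancel request has no effect on its result. */
    sleep_ms(200);
    pthread_cancel(tid);
    pthread_join(tid, &ret);
    if (ret != PTHREAD_CANCELED) {
        out->accept_returned = 1;
        out->accept_errno = args.err;
    }

    if (args.fd >= 0)
        close(args.fd);
    close(fd);
    unlink(addr.sun_path);
    return 0;
}
```

Calling probe_shutdown_on_listener(&r) and printing r.shutdown_errno and r.accept_errno with strerror() should show which of the two behaviours described above a given platform exhibits. The program needs to be linked with the platform's threads library.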
> Relevant POSIX manpages:
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html
> The POSIX shutdown manpage says:
> "The shutdown() function shall cause all or part of a full-duplex connection
> on the socket associated with the file descriptor socket to be shut down."
> ...
> "\[ENOTCONN] The socket is not connected."
> Page 229 & 303 of "UNIX System V Network Programming" say:
> "shutdown can only be called on sockets that have been previously connected"
> "The socket \[passed to accept that] fd refers to does not participate in the
> connection. It remains available to receive further connect indications"
> That is pretty clear: sockets being waited on with accept are not connected
> by definition. Nor is the accept socket connected when a client connects
> to it; it is the socket returned by accept that is connected to the client.
> Therefore the Solaris behaviour of failing the shutdown call is correct.
> In order to get the required behaviour of ThreadB causing ThreadA to exit the
> accept call with an error, the correct way is for ThreadB to call close on
> the socket that ThreadA is waiting on in accept.
> On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA
> fails and returns EBADF.
> On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in
> accept until there is an incoming connection. That accept returns
> successfully. However subsequent accept calls on the same socket return EBADF.
> The Linux behaviour is fundamentally broken in three places:
> # Allowing shutdown to succeed on an unconnected socket is incorrect.
> # Returning a successful accept on a closed file descriptor is incorrect,
> especially as future accept calls on the same socket fail.
> # Once shutdown has been called on the socket, calling close on the socket
> fails with EBADF.
> That is incorrect: shutdown should just prevent further IO on the socket,
> not close it.
> The real issue though is that there's no single way of doing this that works
> on both Solaris and Linux, so there will need to be platform-specific code in
> Hadoop to cater for the Linux brokenness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)