[ 
https://issues.apache.org/jira/browse/THRIFT-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231305#comment-14231305
 ] 

Qiao Mu commented on THRIFT-2789:
---------------------------------

I finally reproduced the issue and fixed it. Sergey was right about the pipe, 
although his patch contains some unrelated code. I've uploaded a cleaner 
version.

The root cause is that TNonblockingServer::TConnection::Task::run() throws a 
TException when notifyIOThread returns false. ThreadManager then silently 
ignores the exception (the comment in ThreadManager says "XXX need to log this", 
but it never does). So from the user's point of view there is no error, only a 
connection that never returns, and the only workaround is a timeout.
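
To illustrate why the failure is invisible, here is a minimal sketch of a 
worker loop that records task exceptions instead of dropping them. It is not 
ThreadManager's code; runTasks and the error list are hypothetical names used 
only for this example.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a worker loop. Each task runs inside a
// try/catch; the messages are collected so the caller can log them.
// An empty catch block here would hide the failure from everyone.
std::vector<std::string> runTasks(
    const std::vector<std::function<void()>>& tasks) {
  std::vector<std::string> errors;
  for (const auto& task : tasks) {
    assert(task && "task must be callable");
    try {
      task();
    } catch (const std::exception& e) {
      errors.push_back(e.what());
    } catch (...) {
      errors.push_back("unknown exception");
    }
  }
  return errors;
}
```

With an empty catch block in place of the two handlers, the second task in a 
list like { ok, throws } would fail without leaving any trace, which matches 
what I observed.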

When the IOThread is under high load, it is common for notifyIOThread to fail. 
More specifically, the send call inside TNonblockingIOThread::notify returns 
-1 with errno set to EAGAIN.
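
The EAGAIN behaviour can be reproduced outside Thrift. The sketch below 
assumes a Unix socket pair as a stand-in for the notification pipe and a 
pointer-sized message; fillUntilSendFails and SendResult are hypothetical 
names, not Thrift code. Nobody reads from the other end, which mimics an IO 
thread that is too busy to drain its notifications.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct SendResult {
  long sent;  // number of messages accepted before the first failure
  int err;    // errno reported by the failing call
};

// Writes pointer-sized messages into a non-blocking socket pair that
// is never read, until send() fails, and reports the errno.
SendResult fillUntilSendFails() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return SendResult{0, errno};
  }

  // Make the writing end non-blocking so a full buffer fails fast.
  int flags = fcntl(fds[1], F_GETFL, 0);
  assert(flags != -1);
  int rc = fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);
  assert(rc != -1);
  (void)rc;

  void* msg = nullptr;  // placeholder for a connection pointer
  SendResult result{0, 0};
  for (;;) {
    ssize_t n = send(fds[1], &msg, sizeof(msg), 0);
    if (n == -1) {
      result.err = errno;  // expected: EAGAIN once the buffer is full
      break;
    }
    ++result.sent;
  }

  close(fds[0]);
  close(fds[1]);
  return result;
}
```

Once the socket buffer fills up, send fails immediately instead of blocking, 
so the caller has to decide what to do with the message it could not deliver.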

The patch closes the connection in that case, as Sergey's did. It is simple 
and sufficient for me. I also tried select with a short timeout, but it did 
not work well. 
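
The idea behind the patch can be sketched as follows. This is not the patch 
itself: FakeConnection and finishTask are hypothetical names, and the notify 
callback stands in for notifyIOThread so that a failure can be simulated.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a server connection.
struct FakeConnection {
  bool closed = false;
  std::function<bool()> notifyIOThread;  // returns false on failure
  void close() { closed = true; }
};

// Called when a worker has finished processing. If the IO thread
// cannot be notified, close the connection so the socket is released
// rather than leaving it open with nobody responsible for it.
// Returns true when the connection was handed back to the IO thread.
bool finishTask(FakeConnection& conn) {
  assert(conn.notifyIOThread && "notify callback must be set");
  if (conn.notifyIOThread()) {
    return true;
  }
  conn.close();
  return false;
}
```

The client sees a dropped connection and can retry, instead of waiting on a 
request that will never be answered.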

This bug has existed for a very long time and is still not fixed. Could 
somebody please look into this issue?

> TNonblockingServer leaks socket FD's under load
> -----------------------------------------------
>
>                 Key: THRIFT-2789
>                 URL: https://issues.apache.org/jira/browse/THRIFT-2789
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library
>            Reporter: Sergey
>         Attachments: 
> 0001-Close-connection-when-failed-to-notify-IO-thread.patch, D10015.diff
>
>
> I checked 0.9.2 and 1.0, but code didn't seem to change in 1.2 either.
> Problem is that network threads and worker threads use non-blocking socket 
> (pipe) to communicate. Under heavy load writes to that pipe might fail with 
> EAGAIN. While 'notifyIOThread' method carefully checks for the error and 
> communicates the result via return value, not all callers check result of 
> 'notify'.
> Generally it's hard to tell what appropriate handling of such a failure would 
> be, but it's clear sockets shouldn't leak. Please use attached patch for the 
> reference, but I do not insist what I did there is the best way to fix the 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
