[ https://issues.apache.org/jira/browse/THRIFT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743941#comment-13743941 ]
Jake Farrell commented on THRIFT-1795: -------------------------------------- Updated fix version > Race condition in TThreadedServerPool java implementation > --------------------------------------------------------- > > Key: THRIFT-1795 > URL: https://issues.apache.org/jira/browse/THRIFT-1795 > Project: Thrift > Issue Type: Bug > Components: Java - Library > Affects Versions: 0.6.1, 0.7, 0.8, 0.9 > Environment: RHEL 5.x derivatives, SUSE linux are the systems where > this is intermittently reproducible > Reporter: Venkat Ranganathan > Assignee: Jake Farrell > Fix For: 0.9.1 > > Attachments: Thrift-1795.diff > > > While running unit tests on flume[1], it was discovered that there have been > intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux > environments (or the failures where more pronounced in those environments). > While trying to debug this issue and removing Flume specific issues, I think > there is a race condition in Thrift TThreadPool server implementation. We can > reduce the problem to the basic issue > The server is written in as below in its serve() method > stopped_ = false > while (!stopped_) > try > serve request > except > log message > The stop method (executed typically from another thread) is implemented thus > stopped_ = false > interrupt and cleanup server transport > The basic thought is that stopped_ being volatile, the writes to stopped_ > will always be seen by the reading thread and there is no other > synchronization required, essentially relying on the JSR 133 volatile > description. > The problem is that there is no check in the server run method whether it was > asked to stop or not > For example, the Flume unit test, once again reduced to what is needed for > this discussion, does the following for testing the flume source lifecycle > server.start -- which means starting a thread, whose run method will call > TThreadPoolServer.serve described above > server.stop > It so happens that on Redhat Linux 5.x systems or more specifically with > Linux kernels before the CFQ scheduler, there is a higher chance that > server.stop can run before server thread's serve method is invoked. This > means that the setting of stopped_ flag in the stop method is overwritten in > the serve method. This causes the server to enter an infinite loop and > eventually, the test timeout kicks in > There is an easy fix - just set stopped_ to false in the constructor. But > then it probably violates the contract on the state of a constructed server > object. I see that in couple of other TServer implementations, stopped_ is > initialize to true and only during the listen processing, stopped_ is set to > false. > If preserving this semantics is important, we should introduce another > volatile variable (say started_), and carefully set started_ and stopped_ in > places where we are not sure stopped_ might be overwritten, and check for > both flags also in key locations (right after a thread starts with the > server() method. > This same issue can be evidenced in other variations of TServer > implementations also. > I can provide a patch if it is agreed that this is a bug that needs to be > fixed. I can create a patch introducing a new started_ flag (which respects > the current behavior in some of the classes where stopped_ is initialized to > true), or initialize the Servers to initialize stopped_ to false and then not > reset the stopped_ flag during the server execution (which is simplere and > easy to fix, but makes a server just constructed to be not in stopped state). > I don't know how important it is to maintain that behavior or if there is > any reliance on that behavior > Thanks > Venkat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira