[ 
https://issues.apache.org/jira/browse/THRIFT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538582#comment-13538582
 ] 

Carl Yeksigian commented on THRIFT-1795:
----------------------------------------

+1

LGTM
                
> Race condition in TThreadedServerPool java implementation
> ---------------------------------------------------------
>
>                 Key: THRIFT-1795
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1795
>             Project: Thrift
>          Issue Type: Bug
>          Components: Java - Library
>    Affects Versions: 0.6.1, 0.7, 0.8, 0.9
>         Environment: RHEL 5.x derivatives, SUSE linux are the systems where 
> this is intermittently reproducible
>            Reporter: Venkat Ranganathan
>         Attachments: Thrift-1795.diff
>
>
> While running unit tests on flume[1], it was discovered that there have been 
> intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux 
> environments (or the failures where more pronounced in those environments).   
> While trying to debug this issue and removing Flume specific issues,  I think 
> there is a race condition in Thrift TThreadPool server implementation. We can 
> reduce the problem to the basic issue
> The server is written in as below in its serve() method
> stopped_ = false
> while (!stopped_) 
>    try
>         serve request
>    except
>        log message
> The stop method (executed typically from another thread) is implemented thus
> stopped_ = false
> interrupt and cleanup server transport
> The basic thought is that stopped_ being volatile, the writes to stopped_ 
> will always be seen by the reading thread and there is no other 
> synchronization required, essentially relying on the JSR 133 volatile 
> description.
> The problem is that there is no check in the server run method whether it was 
> asked to stop or not
> For example, the Flume unit test, once again reduced to what is needed for 
> this discussion, does the following for testing the flume source lifecycle
> server.start -- which means starting a thread, whose run method will call  
> TThreadPoolServer.serve described above
> server.stop
> It so happens that on Redhat Linux 5.x systems or more specifically with 
> Linux kernels before the CFQ scheduler, there is a higher chance that 
> server.stop can run before server thread's serve method is invoked.   This 
> means that the setting of stopped_ flag in the stop method is overwritten in 
> the serve method.    This causes the server to enter an infinite loop and 
> eventually, the test timeout kicks in
> There is an easy fix - just set stopped_ to false in the constructor.   But 
> then it probably violates the contract on the state of a constructed server 
> object.   I see that in couple of other TServer implementations, stopped_ is 
> initialize to true and only during the listen processing, stopped_ is set to 
> false.
> If preserving this semantics is important, we should introduce another 
> volatile variable (say started_),  and carefully set started_ and stopped_ in 
> places where we are not sure stopped_ might be overwritten, and check for 
> both flags also in key locations (right after a thread starts with the 
> server() method.
> This same issue can be evidenced in other variations of TServer 
> implementations also.
> I can provide a patch if it is agreed that this is a bug that needs to be 
> fixed.  I can create a patch introducing a new started_ flag (which respects 
> the current behavior in some of the classes where stopped_ is initialized to 
> true), or initialize the Servers to initialize stopped_ to false and then not 
> reset the stopped_ flag during the server execution (which is simplere and 
> easy to fix, but makes a server just constructed to be not in stopped state). 
>   I don't know how important it is to maintain that behavior or if there is 
> any reliance on that behavior
> Thanks
> Venkat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to