As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere - surely we can resolve it here as well in an acceptable manner.
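For reference, the shape of the pattern we use elsewhere is roughly the sketch below - illustrative names only, not actual Open MPI symbols. The listener polls a wake-up pipe alongside the real descriptor, so the owner can force poll() to return and then pthread_join() without ever touching pthread_cancel(). Note that scif_poll() only accepts SCIF endpoints, not ordinary file descriptors, so btl/scif would need either a dedicated signaling endpoint or a bounded timeout in place of the pipe.

/*
 * Minimal sketch of a listener thread that shuts down cleanly without
 * pthread_cancel().  A wake-up pipe is polled alongside the real
 * descriptor, so the owner can make poll() return before joining.
 * All names are illustrative, not actual Open MPI symbols.
 */
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static int wakeup_pipe[2];                 /* created with pipe() at init */
static volatile bool time_to_exit = false;

static void *listener(void *arg)
{
    int listen_fd = *(int *) arg;
    struct pollfd fds[2] = {
        { .fd = listen_fd,      .events = POLLIN },
        { .fd = wakeup_pipe[0], .events = POLLIN },
    };

    while (!time_to_exit) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;                  /* interrupted: retry */
            break;                         /* poll error: give up */
        }
        if (fds[1].revents & POLLIN) {
            break;                         /* owner asked us to exit */
        }
        if (fds[0].revents & (POLLHUP | POLLERR | POLLNVAL)) {
            break;                         /* descriptor closed under us */
        }
        if (fds[0].revents & POLLIN) {
            /* ... accept and service the connection ... */
        }
    }
    return NULL;
}

static void listener_shutdown(pthread_t thread)
{
    time_to_exit = true;
    (void) write(wakeup_pipe[1], "x", 1);  /* wake the poll() */
    pthread_join(thread, NULL);            /* cannot hang: poll was woken */
}

The key property is that the join is always preceded by an explicit wake-up, so it cannot hang even when closing the descriptor by itself does not make the poll return - which, per Gilles' #4615 test case, is exactly what happens with scif_poll() on the older MPSS.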
Nathan?

On May 13, 2014, at 7:33 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> scif_poll(...) is called with an infinite timeout.
>
> A quick fix would be to use a finite timeout (1s? 10s? more?).
> The obvious drawback is that the thread has to wake up every xxx
> seconds, and that would be for nothing 99.9% of the time.
>
> My analysis (see #4615) is that the crash occurs when the btl/scif
> module is unloaded from memory (e.g. dlclose()) while the scif thread
> is still running.
>
> Gilles
>
> On 2014/05/14 11:25, Ralph Castain wrote:
>> It could be a bug in the software stack, though I wouldn't count on
>> it. Unfortunately, pthread_cancel is known to have bad side effects,
>> and so we avoid its use.
>>
>> The key here is that the thread must detect that the file descriptor
>> has closed and exit, or use some other method for detecting that it
>> should terminate. We do this in multiple other places in the code,
>> without using pthread_cancel and without hanging. So it is certainly
>> doable.
>>
>> I don't know the specifics of why Nathan's code is having trouble
>> exiting, but I suspect that a simple solution - not involving
>> pthread_cancel - can be readily developed.
>>
>> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> I would like to comment on r31738:
>>>
>>>> There is no reason to cancel the listening thread. It should die
>>>> automatically when the file descriptor is closed.
>>>
>>> I could not agree more.
>>>
>>>> It is sufficient to just wait for the thread to exit with pthread
>>>> join.
>>>
>>> Unfortunately, at least in my test environment (an outdated MPSS
>>> 2.1), it is *not* :-(
>>>
>>> This is what I described in #4615
>>> https://svn.open-mpi.org/trac/ompi/ticket/4615
>>> in which I attached scif_hang.c, which demonstrates that (at least
>>> in my environment) scif_poll(...) does *not* return after
>>> scif_close(...) is called, and hence the scif pthread never ends.
>>>
>>> This is likely a bug in MPSS, and it might have been fixed in a more
>>> recent release.
>>>
>>> Nathan, could you try scif_hang in your environment and report the
>>> MPSS version you are running?
>>>
>>> Bottom line, and once again: in my test environment, pthread_join(...)
>>> without pthread_cancel(...) might cause a hang when the btl/scif
>>> module is released.
>>>
>>> Assuming the bug is in an old MPSS and has been fixed in recent
>>> releases, what is the Open MPI policy?
>>> a) test the MPSS version, and either call pthread_cancel() or do
>>>    *not* call pthread_join() if a buggy MPSS is detected?
>>> b) display an error/warning if a buggy MPSS is detected?
>>> c) do not call pthread_join() at all? /* a SIGSEGV might occur with
>>>    an older MPSS, but it is in MPI_Finalize() so the impact is
>>>    limited */
>>> d) do nothing and let the btl/scif module hang; this is *not* an
>>>    Open MPI problem after all?
>>> e) something else?
>>>
>>> Gilles
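If it turns out there is nothing we can add to the scif_poll() set to wake the thread, then Gilles' finite-timeout quick fix looks like a reasonable fallback. Here is a minimal sketch of that variant - again illustrative names only, not the actual btl/scif code, and assuming the documented poll()-style scif_poll() interface (struct scif_pollepd, SCIF_POLLIN, millisecond timeout) from <scif.h>:

/*
 * Sketch of the finite-timeout workaround (illustrative names, not the
 * actual btl/scif code).  Instead of blocking forever, the thread wakes
 * up once a second to re-check a shutdown flag, since on a buggy MPSS
 * closing the endpoint alone may never make scif_poll() return.
 */
#include <pthread.h>
#include <scif.h>
#include <stdbool.h>

static volatile bool module_exiting = false;   /* set before the join */

static void *scif_listener(void *arg)
{
    scif_epd_t epd = *(scif_epd_t *) arg;
    struct scif_pollepd pollepd = { .epd = epd, .events = SCIF_POLLIN };

    while (!module_exiting) {
        /* 1000 ms instead of -1 (infinite): 99.9% of the wakeups do
         * nothing, but the thread can never hang in MPI_Finalize(). */
        int rc = scif_poll(&pollepd, 1, 1000);
        if (rc < 0) {
            break;                 /* poll error: give up */
        }
        if (rc > 0 && (pollepd.revents & SCIF_POLLIN)) {
            /* ... scif_accept() and service the connection ... */
        }
    }
    return NULL;
}

/* At finalize: module_exiting = true; pthread_join(thread, NULL);
 * worst case we wait about one second for the thread to notice. */

That trades a bounded delay in MPI_Finalize() (one second worst case with these numbers) for never hanging, regardless of the MPSS version.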