As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere - surely we can resolve it here as well in an acceptable manner.
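For reference, the shape of the pattern we use elsewhere is roughly the sketch below - illustrative names only, not actual Open MPI symbols. The listener polls a wake-up pipe alongside the real descriptor, so the owner can force poll() to return and then pthread_join() without ever touching pthread_cancel(). Note that scif_poll() only accepts SCIF endpoints, not ordinary file descriptors, so btl/scif would need either a dedicated signaling endpoint or a bounded timeout in place of the pipe.

/*
 * Minimal sketch of a listener thread that shuts down cleanly without
 * pthread_cancel().  A wake-up pipe is polled alongside the real
 * descriptor, so the owner can make poll() return before joining.
 * All names are illustrative, not actual Open MPI symbols.
 */
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static int wakeup_pipe[2];                 /* created with pipe() at init */
static volatile bool time_to_exit = false;

static void *listener(void *arg)
{
    int listen_fd = *(int *) arg;
    struct pollfd fds[2] = {
        { .fd = listen_fd,      .events = POLLIN },
        { .fd = wakeup_pipe[0], .events = POLLIN },
    };

    while (!time_to_exit) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;                  /* interrupted: retry */
            break;                         /* poll error: give up */
        }
        if (fds[1].revents & POLLIN) {
            break;                         /* owner asked us to exit */
        }
        if (fds[0].revents & (POLLHUP | POLLERR | POLLNVAL)) {
            break;                         /* descriptor closed under us */
        }
        if (fds[0].revents & POLLIN) {
            /* ... accept and service the connection ... */
        }
    }
    return NULL;
}

static void listener_shutdown(pthread_t thread)
{
    time_to_exit = true;
    (void) write(wakeup_pipe[1], "x", 1);  /* wake the poll() */
    pthread_join(thread, NULL);            /* cannot hang: poll was woken */
}

The key property is that the join is always preceded by an explicit wake-up, so it cannot hang even when closing the descriptor by itself does not make the poll return - which, per Gilles' #4615 test case, is exactly what happens with scif_poll() on the older MPSS.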
Nathan?

On May 13, 2014, at 7:33 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> scif_poll(...) is called with an infinite timeout.
>
> A quick fix would be to use a finite timeout (1s? 10s? more?).
> The obvious drawback is that the thread has to wake up every xxx
> seconds, and that would be for nothing 99.9% of the time.
>
> My analysis (see #4615) is that the crash occurs when the btl/scif
> module is unloaded from memory (e.g. dlclose()) while the scif thread
> is still running.
>
> Gilles
>
> On 2014/05/14 11:25, Ralph Castain wrote:
>> It could be a bug in the software stack, though I wouldn't count on
>> it. Unfortunately, pthread_cancel is known to have bad side effects,
>> and so we avoid its use.
>>
>> The key here is that the thread must detect that the file descriptor
>> has closed and exit, or use some other method for detecting that it
>> should terminate. We do this in multiple other places in the code,
>> without using pthread_cancel and without hanging. So it is certainly
>> doable.
>>
>> I don't know the specifics of why Nathan's code is having trouble
>> exiting, but I suspect that a simple solution - not involving
>> pthread_cancel - can be readily developed.
>>
>> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> I would like to comment on r31738:
>>>
>>>> There is no reason to cancel the listening thread. It should die
>>>> automatically when the file descriptor is closed.
>>>
>>> I could not agree more.
>>>
>>>> It is sufficient to just wait for the thread to exit with pthread
>>>> join.
>>>
>>> Unfortunately, at least in my test environment (an outdated MPSS
>>> 2.1), it is *not* :-(
>>>
>>> This is what I described in #4615
>>> https://svn.open-mpi.org/trac/ompi/ticket/4615
>>> in which I attached scif_hang.c, which demonstrates that (at least
>>> in my environment) scif_poll(...) does *not* return after
>>> scif_close(...) is called, and hence the scif pthread never ends.
>>>
>>> This is likely a bug in MPSS, and it might have been fixed in a more
>>> recent release.
>>>
>>> Nathan, could you try scif_hang in your environment and report the
>>> MPSS version you are running?
>>>
>>> Bottom line, and once again: in my test environment, pthread_join(...)
>>> without pthread_cancel(...) might cause a hang when the btl/scif
>>> module is released.
>>>
>>> Assuming the bug is in an old MPSS and has been fixed in recent
>>> releases, what is the Open MPI policy?
>>> a) test the MPSS version, and either call pthread_cancel() or do
>>>    *not* call pthread_join() if a buggy MPSS is detected?
>>> b) display an error/warning if a buggy MPSS is detected?
>>> c) do not call pthread_join() at all? /* a SIGSEGV might occur with
>>>    an older MPSS, but it is in MPI_Finalize() so the impact is
>>>    limited */
>>> d) do nothing and let the btl/scif module hang; this is *not* an
>>>    Open MPI problem after all?
>>> e) something else?
>>>
>>> Gilles
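If it turns out there is nothing we can add to the scif_poll() set to wake the thread, then Gilles' finite-timeout quick fix looks like a reasonable fallback. Here is a minimal sketch of that variant - again illustrative names only, not the actual btl/scif code, and assuming the documented poll()-style scif_poll() interface (struct scif_pollepd, SCIF_POLLIN, millisecond timeout) from <scif.h>:

/*
 * Sketch of the finite-timeout workaround (illustrative names, not the
 * actual btl/scif code).  Instead of blocking forever, the thread wakes
 * up once a second to re-check a shutdown flag, since on a buggy MPSS
 * closing the endpoint alone may never make scif_poll() return.
 */
#include <pthread.h>
#include <scif.h>
#include <stdbool.h>

static volatile bool module_exiting = false;   /* set before the join */

static void *scif_listener(void *arg)
{
    scif_epd_t epd = *(scif_epd_t *) arg;
    struct scif_pollepd pollepd = { .epd = epd, .events = SCIF_POLLIN };

    while (!module_exiting) {
        /* 1000 ms instead of -1 (infinite): 99.9% of the wakeups do
         * nothing, but the thread can never hang in MPI_Finalize(). */
        int rc = scif_poll(&pollepd, 1, 1000);
        if (rc < 0) {
            break;                 /* poll error: give up */
        }
        if (rc > 0 && (pollepd.revents & SCIF_POLLIN)) {
            /* ... scif_accept() and service the connection ... */
        }
    }
    return NULL;
}

/* At finalize: module_exiting = true; pthread_join(thread, NULL);
 * worst case we wait about one second for the thread to notice. */

That trades a bounded delay in MPI_Finalize() (one second worst case with these numbers) for never hanging, regardless of the MPSS version.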