Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-15 Thread Gilles Gouaillardet
Nathan, this had no effect on my environment :-( i am not sure you can reuse mca_btl_scif_module.scif_fd with connect(); i had to use a new scif fd for that. then i ran into another glitch: if the listen thread does not scif_accept() the connection, the scif_connect() will take 30 seconds (defa

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Nathan Hjelm
That is exactly how I decided to fix it. It looks like it is working. Please try r31755 when you get a chance. -Nathan On Thu, May 15, 2014 at 12:03:53AM +0900, Gilles Gouaillardet wrote: >Nathan, > >> Looks like this is a scif bug. From the documentation: > >and from the source co

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread George Bosilca
There seems to be a consensus on the fact that closing an fd should trigger the return from poll. Unfortunately this assumption is wrong, and not condoned by any documentation available online. To be more clear, all documentation I know tends to point in the opposite direction: it is unwise to c

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Nathan Hjelm
On Wed, May 14, 2014 at 07:55:54AM -0700, Ralph Castain wrote: > Couple of suggestions: > > * detect that this is an older scif lib and just don't build or enable the > scif btl > > * have a flag that indicates "you should exit", and then tickle the fd so > scif_poll exits Thinking along these

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread George Bosilca
It sounds more like a suboptimal usage of the pthread cancellation helpers than a real issue with pthread_cancel itself. I do agree the usage is not necessarily straightforward even for a veteran coder, but the related issues belong to the realm of implementation not at the conceptual lev

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Gilles Gouaillardet
Nathan, > Looks like this is a scif bug. From the documentation: and from the source code, scif_poll(...) simply calls poll(...) at least in MPSS 2.1 > Since that is not the case I will look through the documentation and see if there is a way other than pthread_cancel. what about : - use a

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Ralph Castain
Couple of suggestions: * detect that this is an older scif lib and just don't build or enable the scif btl * have a flag that indicates "you should exit", and then tickle the fd so scif_poll exits Ralph On May 14, 2014, at 7:45 AM, Nathan Hjelm wrote: > Looks like this is a scif bug. From t
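
Editor's sketch of the second suggestion above (an exit flag plus a "tickle" of a polled fd). This is only an illustration under stated assumptions: it uses plain poll() and a pipe rather than scif_poll() and a SCIF endpoint, and every name in it (should_exit, wake_pipe, listen_fd, listener, stop_listener, run_wakeup_demo) is hypothetical and not taken from the scif BTL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical names; none of them come from the Open MPI sources. */
static atomic_int should_exit;   /* the "you should exit" flag */
static int wake_pipe[2];         /* [0] is polled, [1] is written to tickle */
static int listen_fd = -1;       /* stand-in for a real listening endpoint;
                                    poll() ignores negative descriptors */

static void *listener(void *arg)
{
    (void)arg;
    struct pollfd fds[2];
    fds[0].fd = wake_pipe[0];
    fds[0].events = POLLIN;
    fds[1].fd = listen_fd;
    fds[1].events = POLLIN;

    while (!atomic_load(&should_exit)) {
        fds[0].revents = 0;
        fds[1].revents = 0;
        /* Infinite timeout: the thread only wakes when something happens. */
        if (poll(fds, 2, -1) < 0 && errno != EINTR) {
            break;
        }
        /* A readable wake pipe sends us back to the flag check above.
           A real listener would accept on fds[1] here. */
    }
    return NULL;
}

/* Set the flag first, then tickle the pipe so poll() returns. */
static int stop_listener(pthread_t tid)
{
    char byte = 0;
    atomic_store(&should_exit, 1);
    if (write(wake_pipe[1], &byte, 1) != 1) {
        return -1;
    }
    return pthread_join(tid, NULL);
}

/* Returns 0 when the listener was woken and joined cleanly. */
int run_wakeup_demo(void)
{
    pthread_t tid;
    atomic_store(&should_exit, 0);
    if (pipe(wake_pipe) != 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, listener, NULL) != 0) {
        return -1;
    }
    int rc = stop_listener(tid);
    close(wake_pipe[0]);
    close(wake_pipe[1]);
    return rc;
}
```

Because the flag is set before the byte is written, the listener exits whether the tickle arrives before or after it enters poll(). Whether a SCIF endpoint can be tickled the same way is exactly what the later messages in this thread discuss.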

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Nathan Hjelm
Looks like this is a scif bug. From the documentation: scif_poll() waits for one of a set of endpoints to become ready to perform an I/O operation; it is syntactically and semantically very similar to poll(). The SCIF functions on which scif_poll() waits are scif_accept(), scif_send(), and sc

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
+1 - seen it before, and you'll find warnings across many software sites about this problem. Easy to have the main program segfault by touching the wrong thing after a cancel unless all the stars are properly aligned in the various libraries. On May 13, 2014, at 7:56 PM, Paul Hargrove wrote:

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere - surely we can resolve it here as well in an acceptable manner. Nathan? On May 13, 2014, at 7:33 PM, Gilles Gouaillardet wrote: > Ralph, > > scif_poll(...) is called with an infinite timeout. >

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Paul Hargrove
George, Just my USD0.02: With pthreads many system calls (mostly those that might block) become "cancellation points" where the implementation checks if the calling thread has been cancelled. This means that a thread making any of those calls may simply never return (calling pthread_exit() intern
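
Editor's sketch of the cancellation-point behaviour described above, assuming default (deferred) cancellation and using plain poll() as the blocking call. The names (cleanup_ran, release_buffer, blocked_worker, run_cancel_demo) are hypothetical. The point it tries to show is that the code after the blocking call never runs once the thread is cancelled, so any resource release has to be registered with pthread_cleanup_push().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
static atomic_int cleanup_ran;

/* Cleanup handler: runs when the thread is cancelled inside poll(). */
static void release_buffer(void *buf)
{
    free(buf);
    atomic_store(&cleanup_ran, 1);
}

static void *blocked_worker(void *arg)
{
    (void)arg;
    void *buf = malloc(64);

    pthread_cleanup_push(release_buffer, buf);
    /* poll() is a cancellation point; with no fds and an infinite
       timeout this call only ends when the thread is cancelled
       (or a signal interrupts it). */
    poll(NULL, 0, -1);
    /* Not reached on cancellation; pops and runs the handler otherwise. */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns 0 if the thread was cancelled and its handler ran. */
int run_cancel_demo(void)
{
    pthread_t tid;
    void *ret = NULL;

    atomic_store(&cleanup_ran, 0);
    if (pthread_create(&tid, NULL, blocked_worker, NULL) != 0) {
        return -1;
    }
    if (pthread_cancel(tid) != 0) {
        return -1;
    }
    if (pthread_join(tid, &ret) != 0) {
        return -1;
    }
    return (ret == PTHREAD_CANCELED && atomic_load(&cleanup_ran)) ? 0 : -1;
}
```

Without the push/pop pair the 64-byte buffer would leak on cancellation; the same applies to locks or descriptors held by a library the thread happens to be inside, which is the kind of side effect raised elsewhere in the thread.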

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Ralph, scif_poll(...) is called with an infinite timeout. a quick fix would be to use a finite timeout (1s ? 10s ? more ?) the obvious drawback is the thread has to wake up every xxx seconds and that would be for nothing 99.9% of the time. my analysis (see #4615) is the crash occurs when the btl
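
Editor's sketch of the finite-timeout quick fix mentioned above, again with plain poll() standing in for scif_poll(). The 50 ms timeout is chosen only to keep the demo short (the message suggests values on the order of seconds), and the names (POLL_TIMEOUT_MS, stop_requested, idle_wakeups, poller, run_timeout_demo) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define POLL_TIMEOUT_MS 50       /* demo value, not a recommendation */

/* Hypothetical names, for illustration only. */
static atomic_int stop_requested;
static atomic_int idle_wakeups;  /* counts wakeups that found nothing to do */

static void *poller(void *arg)
{
    (void)arg;
    /* A negative fd is ignored by poll(), so every call simply times out. */
    struct pollfd pfd = { .fd = -1, .events = POLLIN, .revents = 0 };

    while (!atomic_load(&stop_requested)) {
        int n = poll(&pfd, 1, POLL_TIMEOUT_MS);
        if (n == 0) {
            /* Timed out: this is the wasted wakeup the message refers to. */
            atomic_fetch_add(&idle_wakeups, 1);
        }
    }
    return NULL;
}

/* Runs the poller for about 200 ms and returns the idle wakeup count,
   or -1 on error. */
int run_timeout_demo(void)
{
    pthread_t tid;
    struct timespec pause = { 0, 200 * 1000 * 1000 };

    atomic_store(&stop_requested, 0);
    atomic_store(&idle_wakeups, 0);
    if (pthread_create(&tid, NULL, poller, NULL) != 0) {
        return -1;
    }
    nanosleep(&pause, NULL);
    atomic_store(&stop_requested, 1);   /* noticed within one timeout period */
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    return atomic_load(&idle_wakeups);
}
```

The trade-off is visible in the counter: shutdown latency is bounded by the timeout, but the thread wakes up repeatedly even when no connection ever arrives.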

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread George Bosilca
I heard multiple references to pthread_cancel being known to have bad side effects. Can somebody educate me on this topic please? Thanks, George. On Tue, May 13, 2014 at 10:25 PM, Ralph Castain wrote: > It could be a bug in the software stack, though I wouldn't count on it. > Unfortunat

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
It could be a bug in the software stack, though I wouldn't count on it. Unfortunately, pthread_cancel is known to have bad side effects, and so we avoid its use. The key here is that the thread must detect that the file descriptor has closed and exit, or use some other method for detecting that