Nathan,
this had no effect on my environment :-(
i am not sure you can reuse mca_btl_scif_module.scif_fd with connect()
i had to use a new scif fd for that.
then i ran into another glitch: if the listen thread does not
scif_accept() the connection,
the scif_connect() will take 30 seconds (defa
That is exactly how I decided to fix it. It looks like it is
working. Please try r31755 when you get a chance.
-Nathan
On Thu, May 15, 2014 at 12:03:53AM +0900, Gilles Gouaillardet wrote:
>Nathan,
>
>> Looks like this is a scif bug. From the documentation:
>
>and from the source co
There seems to be a consensus on the fact that closing an fd should trigger the
return from poll. Unfortunately this assumption is wrong, and not condoned by
any documentation available online.
To be more clear, all the documentation I know of tends to point in the opposite
direction: it is unwise to c
On Wed, May 14, 2014 at 07:55:54AM -0700, Ralph Castain wrote:
> Couple of suggestions:
>
> * detect that this is an older scif lib and just don't build or enable the
> scif btl
>
> * have a flag that indicates "you should exit", and then tickle the fd so
> scif_poll exits
Thinking along these
It sounds more like a suboptimal usage of the pthread cancelation
helpers than a real issue with the pthread_cancel itself. I do agree
the usage is not necessarily straightforward even for a veteran coder,
but the related issues belong to the realm of implementation,
not at the conceptual lev
Nathan,
> Looks like this is a scif bug. From the documentation:
and from the source code, scif_poll(...) simply calls poll(...)
at least in MPSS 2.1
> Since that is not the case I will look through the documentation and see
> if there is a way other than pthread_cancel.
what about :
- use a
Couple of suggestions:
* detect that this is an older scif lib and just don't build or enable the scif
btl
* have a flag that indicates "you should exit", and then tickle the fd so
scif_poll exits
Ralph
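
A minimal sketch of the second suggestion (an exit flag plus a "tickle" of the polled fd), written against plain POSIX poll() and a pipe rather than scif_poll(). The struct and function names (listener, listener_start, listener_stop) are hypothetical and do not come from the scif btl; how the equivalent wake-up would be done on a SCIF endpoint is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical listener state; these names are illustrative only. */
struct listener {
    int wake_pipe[2];        /* [0] is polled by the thread, [1] is written to wake it */
    atomic_bool should_exit; /* the "you should exit" flag */
    pthread_t thread;
};

static void *listener_main(void *arg)
{
    struct listener *l = arg;
    struct pollfd pfd = { .fd = l->wake_pipe[0], .events = POLLIN };

    while (!atomic_load(&l->should_exit)) {
        /* Infinite timeout: the thread sleeps until an fd becomes readable. */
        int rc = poll(&pfd, 1, -1);
        if (rc < 0 && errno != EINTR) {
            break;
        }
        /* A real listener would also poll its listening endpoint and accept here. */
    }
    return NULL;
}

int listener_start(struct listener *l)
{
    assert(l != NULL);
    if (pipe(l->wake_pipe) != 0) {
        return -1;
    }
    atomic_init(&l->should_exit, false);
    if (pthread_create(&l->thread, NULL, listener_main, l) != 0) {
        close(l->wake_pipe[0]);
        close(l->wake_pipe[1]);
        return -1;
    }
    return 0;
}

int listener_stop(struct listener *l)
{
    char byte = 0;

    assert(l != NULL);
    /* Set the flag first, then tickle the fd so poll() returns and sees it. */
    atomic_store(&l->should_exit, true);
    if (write(l->wake_pipe[1], &byte, 1) != 1) {
        return -1;
    }
    if (pthread_join(l->thread, NULL) != 0) {
        return -1;
    }
    /* The fds are closed only after the thread has exited. */
    close(l->wake_pipe[0]);
    close(l->wake_pipe[1]);
    return 0;
}
```

Calling listener_start() and later listener_stop() on the same struct shuts the thread down without pthread_cancel and without closing an fd that another thread is still polling.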
On May 14, 2014, at 7:45 AM, Nathan Hjelm wrote:
> Looks like this is a scif bug. From t
Looks like this is a scif bug. From the documentation:
scif_poll() waits for one of a set of endpoints to become ready to perform an
I/O operation;
it is syntactically and semantically very similar to poll(). The SCIF
functions on which
scif_poll() waits are scif_accept(), scif_send(), and sc
+1 - seen it before, and you'll find warnings across many software sites about
this problem. Easy to have the main program segfault by touching the wrong
thing after a cancel unless all the stars are properly aligned in the various
libraries.
On May 13, 2014, at 7:56 PM, Paul Hargrove wrote:
As I said, this isn't the only thread that faces this issue, and we have
resolved it elsewhere - surely we can resolve it here as well in an acceptable
manner.
Nathan?
On May 13, 2014, at 7:33 PM, Gilles Gouaillardet wrote:
> Ralph,
>
> scif_poll(...) is called with an infinite timeout.
>
George,
Just my USD0.02:
With pthreads many system calls (mostly those that might block) become
"cancellation points" where the implementation checks if the calling thread
has been cancelled.
This means that a thread making any of those calls may simply never return
(calling pthread_exit() intern
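
To illustrate the cancellation-point behaviour described above, here is a small sketch using only standard pthread calls. The worker takes a mutex, blocks in sleep() (a cancellation point), and never returns from it once cancelled; the only code that still runs in that thread is the handler registered with pthread_cleanup_push(). The names worker, unlock_on_cancel and cancel_demo are made up for this example.

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int cleanup_ran = 0;

/* Runs in the cancelled thread while it is being torn down. */
static void unlock_on_cancel(void *arg)
{
    cleanup_ran = 1;
    pthread_mutex_unlock(arg);
}

static void *worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&lock);
    pthread_cleanup_push(unlock_on_cancel, &lock);
    for (;;) {
        sleep(1); /* cancellation point: a pending cancel is acted on here */
    }
    /* Never reached; kept because push and pop must be lexically paired. */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns 0 if the worker was cancelled and its handler released the mutex. */
int cancel_demo(void)
{
    pthread_t t;
    void *result = NULL;

    if (pthread_create(&t, NULL, worker, NULL) != 0) {
        return -1;
    }
    if (pthread_cancel(t) != 0) {
        return -1;
    }
    if (pthread_join(t, &result) != 0) {
        return -1;
    }
    if (result != PTHREAD_CANCELED || !cleanup_ran) {
        return -1;
    }
    /* Without the cleanup handler the mutex would still be held here. */
    if (pthread_mutex_trylock(&lock) != 0) {
        return -1;
    }
    pthread_mutex_unlock(&lock);
    return 0;
}
```

If the push/pop pair is removed, the cancelled thread dies holding the mutex and the trylock fails, which is the kind of leftover state that tends to bite the rest of the program later.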
Ralph,
scif_poll(...) is called with an infinite timeout.
a quick fix would be to use a finite timeout (1s ? 10s ? more ?)
the obvious drawback is the thread has to wake up every xxx seconds and
that would be for
nothing 99.9% of the time.
my analysis (see #4615) is the crash occurs when the btl
I heard multiple references to pthread_cancel being known to have bad
side effects. Can somebody educate me on this topic please?
Thanks,
George.
On Tue, May 13, 2014 at 10:25 PM, Ralph Castain wrote:
> It could be a bug in the software stack, though I wouldn't count on it.
> Unfortunat
It could be a bug in the software stack, though I wouldn't count on it.
Unfortunately, pthread_cancel is known to have bad side effects, and so we
avoid its use.
The key here is that the thread must detect that the file descriptor has closed
and exit, or use some other method for detecting that