Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Mike Dubman
should be fixed. thanks On Tue, May 13, 2014 at 2:53 AM, Joshua Ladd wrote: > Yes. Will look into it. > > Josh > > > On Mon, May 12, 2014 at 6:01 PM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > >> Ah; I guess the tags aren't getting pulled over. >> >> Mellanox -- can you check into

Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Jeff Squyres (jsquyres)
Hmm. The last tag I see on github is still 1.7.2. On May 13, 2014, at 2:11 AM, Mike Dubman wrote: > should be fixed. > thanks > > > On Tue, May 13, 2014 at 2:53 AM, Joshua Ladd wrote: > Yes. Will look into it. > > Josh > > > On Mon, May 12, 2014 at 6:01 PM, Jeff Squyres (jsquyres) > wr

Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Friedley, Andrew
I see a v1.8.1, but no v1.8.0, is that correct? Andrew > -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff > Squyres (jsquyres) > Sent: Tuesday, May 13, 2014 3:15 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] OMPI v1.8.x git tags? > > Hmm.

Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Jeff Squyres (jsquyres)
I think Mellanox is still working on it. On May 13, 2014, at 10:57 AM, "Friedley, Andrew" wrote: > I see a v1.8.1, but no v1.8.0, is that correct? > > Andrew > >> -Original Message- >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff >> Squyres (jsquyres) >> Sent: Tu

[OMPI devel] Minutes of Open MPI ConCall Meeting - Tuesday, May 13, 2014

2014-05-13 Thread Rolf vandeVaart
Open MPI 1.6: - Release was waiting on https://svn.open-mpi.org/trac/ompi/ticket/3079 but during meeting we decided it was not necessary. Therefore, Jeff will go ahead and roll Open MPI 1.6.6 RC1. Open MPI 1.8: - Several tickets have been applied. Some discussion about other

[OMPI devel] Non-uniform BTL problems in: openib, tcp, sctp, portals4, vader, scif

2014-05-13 Thread Jeff Squyres (jsquyres)
I notice that BTLs are not checking the return value from ompi_modex_recv() for OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that modex key). In the BTL context, NOT_FOUND means that that peer process doesn't have this BTL, so this local peer process should probabl

[OMPI devel] 1.6.6rc1 tarball posted

2014-05-13 Thread Jeff Squyres (jsquyres)
Now that the 1.8 series is out, we're going to do one final release in the 1.6.x series, just so that the few bug fixes that came in after 1.6.5 can get out into the world (for those who are unable to upgrade to the v1.8 series). 1.6.6rc1 has been posted: http://www.open-mpi.org/software/om

[OMPI devel] opal_free_list_t annoyance

2014-05-13 Thread Nathan Hjelm
While tracking down memory leaks in components I ran into an interesting issue. osc/rdma uses an opal_free_list_t (not an ompi_free_list_t) for buffer fragments. The fragment class allocates a buffer in its constructor and frees the buffer in its destructor. The problem is that the item con

[OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Folks, I would like to comment on r31738: > There is no reason to cancel the listening thread. It should die > automatically when the file descriptor is closed. I could not agree more. > It is sufficient to just wait for the thread to exit with pthread join. Unfortunately, at least in my test envi

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
It could be a bug in the software stack, though I wouldn't count on it. Unfortunately, pthread_cancel is known to have bad side effects, and so we avoid its use. The key here is that the thread must detect that the file descriptor has closed and exit, or use some other method for detecting that
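One common "other method" for getting a blocked thread to notice a shutdown request is a wake-up pipe. The sketch below is only an illustration built on plain POSIX poll() and pipes; it is not the btl/scif code, and the names waiter_t, waiter_main, and wakeup_demo are hypothetical. The thread polls both its working descriptor and the read end of a pipe with an infinite timeout, and the owner writes one byte to the pipe before joining.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical state for the illustration; not taken from Open MPI. */
typedef struct {
    int work_fd;     /* descriptor the thread normally services */
    int wake_fd;     /* read end of the wake-up pipe */
    int exit_reason; /* 1 = shutdown requested, 2 = work_fd event, -1 = error */
} waiter_t;

static void *waiter_main(void *arg)
{
    waiter_t *w = arg;
    struct pollfd pfds[2] = {
        { .fd = w->work_fd, .events = POLLIN },
        { .fd = w->wake_fd, .events = POLLIN },
    };

    assert(w != NULL);
    for (;;) {
        int rc = poll(pfds, 2, -1);          /* infinite timeout */
        if (rc < 0) {
            if (errno == EINTR) continue;    /* interrupted by a signal: retry */
            w->exit_reason = -1;
            break;
        }
        if (pfds[1].revents) { w->exit_reason = 1; break; } /* asked to stop */
        if (pfds[0].revents) { w->exit_reason = 2; break; } /* real work */
    }
    return NULL;
}

/* Starts the thread, requests shutdown through the pipe, and joins it.
 * Returns the thread's exit_reason, or -1 on setup failure. */
static int wakeup_demo(void)
{
    int work_pipe[2], wake_pipe[2];
    pthread_t tid;
    waiter_t w;
    int reason = -1;

    if (pipe(work_pipe) != 0) return -1;
    if (pipe(wake_pipe) != 0) {
        close(work_pipe[0]); close(work_pipe[1]);
        return -1;
    }
    w.work_fd = work_pipe[0];
    w.wake_fd = wake_pipe[0];
    w.exit_reason = 0;

    if (pthread_create(&tid, NULL, waiter_main, &w) == 0) {
        if (write(wake_pipe[1], "x", 1) == 1 && pthread_join(tid, NULL) == 0) {
            reason = w.exit_reason;
        }
    }
    close(work_pipe[0]); close(work_pipe[1]);
    close(wake_pipe[0]); close(wake_pipe[1]);
    return reason;
}
```

The thread never wakes up unless there is something to do, and no pthread_cancel is involved. Whether scif_poll() can wait on an ordinary pipe descriptor alongside SCIF endpoints is not something this sketch establishes.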

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread George Bosilca
I heard multiple references to pthread_cancel being known to have bad side effects. Can somebody educate me on this topic please? Thanks, George. On Tue, May 13, 2014 at 10:25 PM, Ralph Castain wrote: > It could be a bug in the software stack, though I wouldn't count on it. > Unfortunat

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Ralph, scif_poll(...) is called with an infinite timeout. A quick fix would be to use a finite timeout (1s? 10s? more?). The obvious drawback is that the thread has to wake up every xxx seconds, and that would be for nothing 99.9% of the time. My analysis (see #4615) is that the crash occurs when the btl
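The finite-timeout idea can be sketched roughly as follows. This is a minimal illustration that uses POSIX poll() as a stand-in for scif_poll(); the listener_t type, its fields, and both function names are hypothetical and do not come from the btl/scif source. The thread re-checks a stop flag after every timeout, so the owner can set the flag and join without cancelling.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical listener state for the illustration. */
typedef struct {
    int fd;                  /* descriptor the thread waits on */
    int timeout_ms;          /* finite poll timeout */
    atomic_bool stop;        /* set by the owner to request exit */
    atomic_int idle_wakeups; /* timeouts that found nothing to do */
} listener_t;

static void *listener_main(void *arg)
{
    listener_t *l = arg;
    struct pollfd pfd = { .fd = l->fd, .events = POLLIN };

    while (!atomic_load(&l->stop)) {
        int rc = poll(&pfd, 1, l->timeout_ms);
        if (rc == 0) {                       /* timed out: loop re-checks stop */
            atomic_fetch_add(&l->idle_wakeups, 1);
            continue;
        }
        if (rc < 0 && errno == EINTR) continue;
        break; /* error or activity; a real listener would handle the event */
    }
    return NULL;
}

/* Runs the listener until it has timed out at least twice, then asks it
 * to stop and joins it. Returns the pthread_join() result, or -1 on
 * setup failure. */
static int run_listener_briefly(int timeout_ms, int *idle_wakeups)
{
    int fds[2];
    pthread_t tid;
    listener_t l;
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000L }; /* 1 ms */
    int rc;

    assert(timeout_ms > 0);
    if (pipe(fds) != 0) return -1;
    l.fd = fds[0];
    l.timeout_ms = timeout_ms;
    atomic_init(&l.stop, false);
    atomic_init(&l.idle_wakeups, 0);

    if (pthread_create(&tid, NULL, listener_main, &l) != 0) {
        close(fds[0]); close(fds[1]);
        return -1;
    }
    while (atomic_load(&l.idle_wakeups) < 2) {
        nanosleep(&nap, NULL);               /* nothing ever arrives on fd */
    }
    atomic_store(&l.stop, true);             /* seen within one timeout period */
    rc = pthread_join(tid, NULL);
    *idle_wakeups = atomic_load(&l.idle_wakeups);
    close(fds[0]); close(fds[1]);
    return rc;
}
```

The idle_wakeups counter makes the drawback mentioned above visible: every timeout is a wake-up that did no useful work, and shutdown can be delayed by up to one timeout period.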

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Paul Hargrove
George, Just my USD0.02: With pthreads many system calls (mostly those that might block) become "cancellation points" where the implementation checks if the calling thread has been cancelled. This means that a thread making any of those calls may simply never return (calling pthread_exit() intern
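The effect of a cancellation point can be shown with a small, self-contained sketch. It assumes the default deferred cancellation mode and uses read() on an empty pipe as the blocking call; worker_t, release, worker_main, and cancel_demo are hypothetical names for the illustration. The statement after read() is never executed once the thread is cancelled, and only the handler registered with pthread_cleanup_push() gets a chance to run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical worker state for the illustration. */
typedef struct {
    int fd;                 /* read end of a pipe that never receives data */
    int cleanup_ran;        /* set by the cleanup handler */
    int returned_from_read; /* set only if read() returns normally */
} worker_t;

static void release(void *arg)
{
    ((worker_t *)arg)->cleanup_ran = 1;
}

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    char c;
    ssize_t n;

    assert(w != NULL);
    pthread_cleanup_push(release, w);
    n = read(w->fd, &c, 1);     /* cancellation point: blocks forever */
    (void)n;
    w->returned_from_read = 1;  /* skipped when the thread is cancelled */
    pthread_cleanup_pop(0);
    return NULL;
}

/* Cancels a worker blocked in read(). Returns 1 if the thread's exit
 * status is PTHREAD_CANCELED, 0 if it exited normally, -1 on failure. */
static int cancel_demo(worker_t *w)
{
    int fds[2];
    pthread_t tid;
    void *ret = NULL;
    int result = -1;

    if (pipe(fds) != 0) return -1;
    w->fd = fds[0];
    w->cleanup_ran = 0;
    w->returned_from_read = 0;

    if (pthread_create(&tid, NULL, worker_main, w) == 0) {
        pthread_cancel(tid);    /* acted on at the next cancellation point */
        if (pthread_join(tid, &ret) == 0) {
            result = (ret == PTHREAD_CANCELED) ? 1 : 0;
        }
    }
    close(fds[0]); close(fds[1]);
    return result;
}
```

Any state the thread modified before the blocking call and did not protect with a cleanup handler is left as it was at the moment of cancellation, which is one way the rest of the program can end up touching inconsistent data.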

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere - surely we can resolve it here as well in an acceptable manner. Nathan? On May 13, 2014, at 7:33 PM, Gilles Gouaillardet wrote: > Ralph, > > scif_poll(...) is called with an infinite timeout. >

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
+1 - seen it before, and you'll find warnings across many software sites about this problem. Easy to have the main program segfault by touching the wrong thing after a cancel unless all the stars are properly aligned in the various libraries. On May 13, 2014, at 7:56 PM, Paul Hargrove wrote: