Re: [OMPI devel] OMPI v1.8.x git tags?
Should be fixed. Thanks.

On Tue, May 13, 2014 at 2:53 AM, Joshua Ladd wrote:
> Yes. Will look into it.
>
> Josh
>
> On Mon, May 12, 2014 at 6:01 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> Ah; I guess the tags aren't getting pulled over.
>>
>> Mellanox -- can you check into this?
>>
>> On May 12, 2014, at 5:52 PM, "Friedley, Andrew" <andrew.fried...@intel.com> wrote:
>>> Hi,
>>>
>>> I'm looking at the OMPI svn mirror on github and there don't appear to
>>> be any release tags for v1.8.x; the most recent appears to be v1.7.2.
>>> Are there any plans to add tags for any more releases?
>>>
>>> Thanks,
>>>
>>> Andrew
Re: [OMPI devel] OMPI v1.8.x git tags?
Hmm. The last tag I see on github is still 1.7.2.

On May 13, 2014, at 2:11 AM, Mike Dubman wrote:
> should be fixed.
> thanks

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] OMPI v1.8.x git tags?
I see a v1.8.1, but no v1.8.0 -- is that correct?

Andrew

> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
> Sent: Tuesday, May 13, 2014 3:15 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] OMPI v1.8.x git tags?
>
> Hmm. The last tag I see on github is still 1.7.2.
Re: [OMPI devel] OMPI v1.8.x git tags?
I think Mellanox is still working on it.

On May 13, 2014, at 10:57 AM, "Friedley, Andrew" wrote:
> I see a v1.8.1, but no v1.8.0, is that correct?

--
Jeff Squyres
jsquy...@cisco.com
[OMPI devel] Minutes of Open MPI ConCall Meeting - Tuesday, May 13, 2014
Open MPI 1.6:
- The release was waiting on https://svn.open-mpi.org/trac/ompi/ticket/3079, but during the meeting we decided that ticket is not necessary. Jeff will therefore go ahead and roll Open MPI 1.6.6 RC1.

Open MPI 1.8:
- Several tickets have been applied. There was some discussion about other tickets, but the details are too numerous to capture here.
- Still having issues with 0-sized messages and MPI_Alltoallw. I think this is being tracked with ticket https://svn.open-mpi.org/trac/ompi/ticket/4506. Jeff will poke a few folks to get things moving for a fix of that issue.
- Still leaking memory in some component. Nathan is looking at that issue and hopes to have a fix soon.
- Everyone is encouraged to review their CMRs and change the owner after review is done.

Other:
- RFC: autogen.sh removal is approved. Bye, bye autogen.sh.

Round Table
[OMPI devel] Non-uniform BTL problems in: openib, tcp, sctp, portals4, vader, scif
I noticed that BTLs are not checking the return value from ompi_modex_recv() for OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that modex key). In the BTL context, NOT_FOUND means that the peer process doesn't have this BTL, so the local process should probably mark that peer as unreachable in add_procs(). This affects both the trunk and the v1.8 branch.

The BTLs listed above are not checking/handling ompi_modex_recv() returning OPAL_ERR_DATA_VALUE_NOT_FOUND properly. Most of these BTLs do something like this:

    module_add_procs() {
        loop over the peers {
            proc = proc_create(...)
            if (NULL == proc) error!
        }
    }

    proc_create(...) {
        if (ompi_modex_recv() != OMPI_SUCCESS) return NULL;
        ...
    }

The fix is to make proc_create() return something a bit more expressive, so that add_procs() can tell the difference between "error!" and "you can't reach this peer". I fixed this in the usnic BTL back in late March, but forgot to bring it to everyone's attention -- oops. See https://svn.open-mpi.org/trac/ompi/ticket/4442

--
Jeff Squyres
jsquy...@cisco.com
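A sketch of the suggested fix, assuming a hypothetical BTL named "foo" (the function and structure names are illustrative only, not the actual usnic patch; see ticket #4442 for the real change). proc_create() returns a status code instead of a bare pointer, so add_procs() can tell "unreachable" apart from "error":

    /* Hypothetical: proc_create() distinguishes "peer has no modex key
     * for this BTL" (unreachable, not an error) from real failures. */
    static int mca_btl_foo_proc_create(ompi_proc_t *ompi_proc,
                                       mca_btl_foo_proc_t **proc_out)
    {
        void *modex_data;
        size_t size;
        int rc;

        *proc_out = NULL;
        rc = ompi_modex_recv(&mca_btl_foo_component.super.btl_version,
                             ompi_proc, &modex_data, &size);
        if (OPAL_ERR_DATA_VALUE_NOT_FOUND == rc) {
            /* Peer did not publish this BTL's modex key: it does not
             * run this BTL, so it is unreachable -- but not an error. */
            return OMPI_ERR_UNREACH;
        } else if (OMPI_SUCCESS != rc) {
            return rc;  /* a genuine error */
        }
        /* ... allocate *proc_out and parse modex_data ... */
        return OMPI_SUCCESS;
    }

    /* In module_add_procs(): skip unreachable peers instead of failing. */
    rc = mca_btl_foo_proc_create(procs[i], &btl_proc);
    if (OMPI_ERR_UNREACH == rc) {
        continue;            /* leave this peer's reachability bit unset */
    } else if (OMPI_SUCCESS != rc) {
        return rc;           /* real error: abort add_procs() */
    }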
[OMPI devel] 1.6.6rc1 tarball posted
Now that the 1.8 series is out, we're going to do one final release in the 1.6.x series, just so that the few bug fixes that came in after 1.6.5 can get out into the world (for those who are unable to upgrade to the v1.8 series).

1.6.6rc1 has been posted:

    http://www.open-mpi.org/software/ompi/v1.6/

There will be no MS Windows binary version posted for 1.6.6.

Please test!

Here's the list of changes since 1.6.5:

- Prevent integer overflow in datatype creation. Thanks to Gilles
  Gouaillardet for identifying the problem and providing a preliminary
  version of the patch.
- Ensure help-opal-hwloc-base.txt is included in distribution
  tarballs. Thanks to Gilles Gouaillardet for supplying the patch.
- Correctly handle the invalid status for NULL and inactive requests.
  Thanks to KAWASHIMA Takahiro for submitting the initial patch.
- Fixed MPI_STATUS_SIZE Fortran issue when used with 8-byte Fortran
  INTEGERs and 4-byte C ints. Since this issue affects ABI, it is only
  enabled if Open MPI is configured with
  --enable-abi-breaking-fortran-status-i8-fix. Thanks to Jim Parker
  for supplying the initial patch.
- Fix datatype issue for sending from the middle of non-contiguous data.
- Fixed a failure with pty support. Thanks to Michal Pecio for the patch.
- Fixed debugger support for direct-launched jobs.
- Fix MPI_IS_THREAD_MAIN to return the correct value. Thanks to
  Lisandro Dalcin for pointing out the issue.
- Update VT to 5.14.4.4:
  - Fix C++-11 issue.
  - Fix support for building RPMs on Fedora with CUDA libraries.
- Add openib part number for ConnectX3-Pro HCA.
- Ensure we check that all resolved IP addresses are local.
- Fix MPI_COMM_SPAWN via rsh when mpirun is on a different server.
- Add Gentoo "sandbox" memory hooks override.

--
Jeff Squyres
jsquy...@cisco.com
[OMPI devel] opal_free_list_t annoyance
While tracking down memory leaks in components, I ran into an interesting issue: osc/rdma uses an opal_free_list_t (not an ompi_free_list_t) for buffer fragments. The fragment class allocates a buffer in its constructor and frees that buffer in its destructor. The problem is that the item constructor is called but the destructor never is.

I looked into the issue and I see what is happening. When growing the free list, we call the constructor for each item we allocate (see opal_free_list.c:113), but the free list destructor does not invoke the item destructor. This is different from ompi_free_list_t, which does invoke the destructor on each constructed item.

The question is: is this difference intentional? It seems a little odd that the free list does not call the item destructor given that it calls the constructor. If it is intentional, is there a reason for this behavior? If not, I plan on "fixing" the opal_free_list_t destructor to call the item destructor.

-Nathan
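To make the asymmetry concrete, here is a minimal, self-contained sketch of the pattern in plain C with hypothetical names (OPAL's actual OBJ/class machinery is paraphrased away). The constructor runs when the list grows; if teardown never runs the matching destructor, every buffer leaks:

    #include <stdio.h>
    #include <stdlib.h>

    #define FRAG_SIZE 4096
    #define NUM_ITEMS 4

    /* Hypothetical stand-in for the osc/rdma fragment class: the
     * constructor allocates a buffer that only the destructor frees. */
    typedef struct { void *buffer; } frag_t;

    static void frag_constructor(frag_t *f) { f->buffer = malloc(FRAG_SIZE); }
    static void frag_destructor(frag_t *f)  { free(f->buffer); }

    int main(void)
    {
        frag_t items[NUM_ITEMS];

        /* Growing the list: the constructor runs for every new item
         * (analogous to the ctor call in opal_free_list.c). */
        for (int i = 0; i < NUM_ITEMS; i++) {
            frag_constructor(&items[i]);
        }

        /* The crux: ompi_free_list_t runs this loop in its destructor;
         * opal_free_list_t (at the time of this post) does not, so
         * FRAG_SIZE bytes per item are leaked at teardown. */
        for (int i = 0; i < NUM_ITEMS; i++) {
            frag_destructor(&items[i]);
        }

        puts("ctor/dtor called symmetrically -- no leak");
        return 0;
    }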
[OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
Folks,

I would like to comment on r31738:

> There is no reason to cancel the listening thread. It should die
> automatically when the file descriptor is closed.

I could not agree more.

> It is sufficient to just wait for the thread to exit with pthread join.

Unfortunately, at least in my test environment (an outdated MPSS 2.1), it is *not* :-(

This is what I described in #4615 (https://svn.open-mpi.org/trac/ompi/ticket/4615), to which I attached scif_hang.c (a sketch of the idea follows below). It shows that, at least in my environment, scif_poll(...) does *not* return after scif_close(...) is called, and hence the scif pthread never ends. This is likely a bug in MPSS, and it may have been fixed in a more recent release.

Nathan, could you try scif_hang in your environment and report the MPSS version you are running?

Bottom line, and once again: in my test environment, pthread_join(...) without pthread_cancel(...) can cause a hang when the btl/scif module is released.

Assuming the bug is in old MPSS and has been fixed in recent releases, what is the Open MPI policy?

a) test the MPSS version, and call pthread_cancel() or skip pthread_join() if a buggy MPSS is detected?
b) display an error/warning if a buggy MPSS is detected?
c) do not call pthread_join() at all? /* SIGSEGV might occur with older MPSS, but it is in MPI_Finalize() so the impact is limited */
d) do nothing and let the btl/scif module hang; this is *not* an Open MPI problem after all?
e) something else?

Gilles
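For reference, a minimal sketch along the lines of the attached scif_hang.c (the attachment itself is not reproduced here; this assumes the MPSS scif.h API and may differ in detail from Gilles's reproducer):

    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <scif.h>

    static scif_epd_t listen_epd;

    /* Listening thread: blocks in scif_poll() with an infinite timeout,
     * just as the btl/scif listening thread does. */
    static void *listen_thread(void *arg)
    {
        struct scif_pollepd pollepd = {
            .epd = listen_epd, .events = SCIF_POLLIN, .revents = 0
        };
        /* On the buggy MPSS, this never returns even after the main
         * thread calls scif_close(listen_epd). */
        scif_poll(&pollepd, 1, -1);
        return NULL;
    }

    int main(void)
    {
        pthread_t thread;

        listen_epd = scif_open();
        scif_bind(listen_epd, 0);      /* let SCIF choose a port */
        scif_listen(listen_epd, 1);

        pthread_create(&thread, NULL, listen_thread, NULL);
        sleep(1);                      /* let the thread block in scif_poll() */

        scif_close(listen_epd);        /* should wake scif_poll() up... */
        pthread_join(thread, NULL);    /* ...but hangs on the buggy MPSS */
        puts("scif_poll() returned after scif_close() -- no hang here");
        return 0;
    }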
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
It could be a bug in the software stack, though I wouldn't count on it. Unfortunately, pthread_cancel is known to have bad side effects, and so we avoid its use.

The key here is that the thread must detect that the file descriptor has closed and exit, or use some other method for detecting that it should terminate. We do this in multiple other places in the code, without using pthread_cancel and without hanging, so it is certainly doable.

I don't know the specifics of why Nathan's code is having trouble exiting, but I suspect that a simple solution -- not involving pthread_cancel -- can be readily developed.

On May 13, 2014, at 7:18 PM, Gilles Gouaillardet wrote:
> bottom line, and once again: in my test environment, pthread_join(...)
> without pthread_cancel(...) can cause a hang when the btl/scif module
> is released.
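One common "other method" for this is the self-pipe trick. Below is a generic, hedged sketch using poll(2) -- whether a SCIF endpoint can be mixed into a poll(2) fd set this way is not established here; the sketch only shows the shutdown pattern, with hypothetical names:

    #include <poll.h>
    #include <pthread.h>
    #include <unistd.h>

    static int shutdown_pipe[2];  /* [0]: polled by the thread; [1]: main's end */

    static void *progress_thread(void *arg)
    {
        int service_fd = *(int *)arg;
        struct pollfd fds[2] = {
            { .fd = service_fd,       .events = POLLIN },
            { .fd = shutdown_pipe[0], .events = POLLIN },
        };

        for (;;) {
            poll(fds, 2, -1);              /* block forever: no wakeup spin */
            if (fds[1].revents & POLLIN) {
                break;                     /* main wrote a byte: time to exit */
            }
            if (fds[0].revents & POLLIN) {
                /* ... service the descriptor ... */
            }
        }
        return NULL;
    }

    /* Main-thread shutdown: the write() is guaranteed to wake the poll()
     * above, so the join cannot hang and pthread_cancel() is never needed. */
    static void stop_progress_thread(pthread_t thread)
    {
        char byte = 1;
        (void) write(shutdown_pipe[1], &byte, 1);
        pthread_join(thread, NULL);
    }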
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
I have heard multiple references to pthread_cancel being known to have bad side effects. Can somebody educate me on this topic, please?

Thanks,
George.

On Tue, May 13, 2014 at 10:25 PM, Ralph Castain wrote:
> Unfortunately, pthread_cancel is known to have bad side effects, and
> so we avoid its use.
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
Ralph,

scif_poll(...) is called with an infinite timeout. A quick fix would be to use a finite timeout (1s? 10s? more?); the obvious drawback is that the thread has to wake up every xxx seconds, and that wake-up would be for nothing 99.9% of the time.

My analysis (see #4615) is that the crash occurs when the btl/scif module is unloaded from memory (e.g. via dlclose()) while the scif thread is still running.

Gilles
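For completeness, a hedged sketch of that finite-timeout workaround (hypothetical names; the real btl/scif loop is more involved):

    #include <stdbool.h>
    #include <scif.h>

    /* Set by the main thread before pthread_join(). volatile keeps the
     * polling thread re-reading it; real code would use an atomic. */
    static volatile bool module_exiting = false;

    static void *listen_thread(void *arg)
    {
        scif_epd_t epd = *(scif_epd_t *)arg;
        struct scif_pollepd pollepd = {
            .epd = epd, .events = SCIF_POLLIN, .revents = 0
        };

        while (!module_exiting) {
            /* Wake up once per second instead of blocking forever, so
             * the thread notices module_exiting even when scif_poll()
             * is never woken by scif_close() (the buggy-MPSS case). */
            int rc = scif_poll(&pollepd, 1, 1000 /* ms */);
            if (rc > 0 && (pollepd.revents & SCIF_POLLIN)) {
                /* ... scif_accept() and handle the new connection ... */
            }
        }
        return NULL;
    }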
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
George,

Just my USD 0.02:

With pthreads, many system calls (mostly those that might block) become "cancellation points" where the implementation checks whether the calling thread has been cancelled. This means that a thread making any of those calls may simply never return (calling pthread_exit() internally) unless extra work has been done to prevent this default behavior. That makes it very hard to write code that properly cleans up its resources, including (but not limited to) file descriptors and malloc()ed memory. Even if Open MPI is written very carefully, one cannot assume that all the libraries it calls (and their dependencies, etc.) are written to deal properly with cancellation.

-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department
Lawrence Berkeley National Laboratory
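A self-contained illustration of the hazard Paul describes (a hypothetical demo, not Open MPI code): read() is a cancellation point, so the cancelled thread below "never returns" from it, and without the cleanup handler the malloc()ed buffer would leak:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void cleanup(void *buf)
    {
        free(buf);   /* runs if the thread is cancelled inside read() */
        puts("cleanup handler ran");
    }

    static void *worker(void *arg)
    {
        char *buf = malloc(4096);
        pthread_cleanup_push(cleanup, buf);

        /* read() is a cancellation point: pthread_cancel() can make
         * this call never return -- the thread exits from inside it.
         * Without the handler above, buf (and any fd or lock held
         * here) would leak. */
        read(STDIN_FILENO, buf, 4096);

        pthread_cleanup_pop(1);  /* pop and run handler on normal exit */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);                /* let the worker block in read() */
        pthread_cancel(t);
        pthread_join(t, NULL);
        return 0;
    }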
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
As I said, this isn't the only thread that faces this issue, and we have resolved it elsewhere -- surely we can resolve it here as well in an acceptable manner. Nathan?

On May 13, 2014, at 7:33 PM, Gilles Gouaillardet wrote:
> scif_poll(...) is called with an infinite timeout. A quick fix would
> be to use a finite timeout (1s? 10s? more?); the obvious drawback is
> that the thread has to wake up every xxx seconds, and that wake-up
> would be for nothing 99.9% of the time.
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
+1 -- seen it before, and you'll find warnings across many software sites about this problem. It is easy to have the main program segfault by touching the wrong thing after a cancel, unless all the stars are properly aligned in the various libraries.

On May 13, 2014, at 7:56 PM, Paul Hargrove wrote:
> With pthreads, many system calls (mostly those that might block)
> become "cancellation points" where the implementation checks whether
> the calling thread has been cancelled.