Re: [OMPI devel] ORTE headers in OPAL source
The first variable can probably be moved to opal pretty easily. It is used when we need to fully shut down the BTLs and re-init them on continue. We do not have to do that for tcp (since we leave the sockets open), but we do have to do it for IB, for example.

The second call is a bit tricky, since it leaves a 'note' about a file that needs to be created (touch'ed) on restart in order for the sm BTL component to restart properly. For sm we leave the shared memory file open and in place when we checkpoint, since on 'continue' we just keep using it. But on restart the file will no longer be there, which can cause the process to crash when restarted. So just before restart we touch the file, then clean up the old reference and the old (newly touch'ed) file during the restart INC when the process is being rebuilt.

So that is what that call is doing: just writing the name of the file into the metadata for the snapshot. Then opal_restart will touch the file just before calling the CRS component to restart the process. So we just need to replace it with a call that sets this data in the metadata file. Take a look in the CRS components and the CR infrastructure to see how they are writing to the snapshot metadata (they might do it directly). Unfortunately, I have been away from that code long enough to not easily remember how to do it.

Let me know if that gives you enough to move forward on.

Thanks,
Josh

On Fri, Oct 17, 2014 at 9:15 AM, Adrian Reber wrote:
> Josh,
>
> I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are
> two uses of orte code:
>
>     if (orte_cr_continue_like_restart)
>
> and
>
>     /* On restart we need the old file names to exist (not necessarily
>      * contain content) so the CRS component does not fail when searching
>      * for these old file handles. The restart procedure will make sure
>      * these files get cleaned up appropriately.
>      */
>     orte_sstore.set_attr(orte_sstore_handle_current,
>                          SSTORE_METADATA_LOCAL_TOUCH,
>                          mca_btl_sm_component.sm_seg->shmem_ds.seg_name);
>
> Do you have an idea how to fix those two? The first variable
> orte_cr_continue_like_restart could probably be moved but I am not sure
> how to handle the sstore call.
>
> Adrian
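The touch-on-restart mechanism Josh describes above can be sketched in plain C. This is only an illustration under stated assumptions: the function names, the one-line-per-file "touch <path>" metadata format, and the metadata file itself are all hypothetical and are not the Open MPI sstore, CRS, or opal_restart interfaces. It shows the two halves of the idea: leaving a note at checkpoint time, and creating the empty files just before restart.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checkpoint-side helper: append a "touch <path>" note to a
 * plain-text metadata file. The format is invented for this sketch and is
 * NOT the real snapshot metadata format. */
static int note_touch_on_restart(const char *metadata_path, const char *seg_name)
{
    FILE *fp = fopen(metadata_path, "a");
    if (NULL == fp) {
        return -1;
    }
    int rc = (fprintf(fp, "touch %s\n", seg_name) < 0) ? -1 : 0;
    if (0 != fclose(fp)) {
        rc = -1;
    }
    return rc;
}

/* Create the file if it is missing; existing content is left untouched. */
static int touch_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0) {
        return -1;
    }
    return close(fd);
}

/* Hypothetical restart-side helper: read the metadata file and touch every
 * file it names. Returns the number of files touched, or -1 on error.
 * Paths containing whitespace are not handled by this simple parser. */
static int touch_noted_files(const char *metadata_path)
{
    char line[4096];
    char path[4096];
    int count = 0;
    FILE *fp = fopen(metadata_path, "r");
    if (NULL == fp) {
        return -1;
    }
    while (NULL != fgets(line, sizeof(line), fp)) {
        if (1 == sscanf(line, "touch %4095s", path)) {
            if (0 != touch_file(path)) {
                fclose(fp);
                return -1;
            }
            ++count;
        }
    }
    fclose(fp);
    return count;
}
```

In this sketch the checkpoint path would call note_touch_on_restart() with the shared memory segment name, and the restart tool would call touch_noted_files() before handing control to the restart component. Cleaning up the touched file afterwards, which the message assigns to the restart INC, is left out.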
Re: [OMPI devel] ORTE headers in OPAL source
Josh,

I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are two uses of orte code:

    if (orte_cr_continue_like_restart)

and

    /* On restart we need the old file names to exist (not necessarily
     * contain content) so the CRS component does not fail when searching
     * for these old file handles. The restart procedure will make sure
     * these files get cleaned up appropriately.
     */
    orte_sstore.set_attr(orte_sstore_handle_current,
                         SSTORE_METADATA_LOCAL_TOUCH,
                         mca_btl_sm_component.sm_seg->shmem_ds.seg_name);

Do you have an idea how to fix those two? The first variable orte_cr_continue_like_restart could probably be moved but I am not sure how to handle the sstore call.

Adrian

On Sat, Aug 09, 2014 at 08:46:31AM -0500, Josh Hursey wrote:
> Those calls should be protected with the CR FT #define - If I remember
> correctly. We were using the sstore to track the shared memory file names
> so we could clean them up on restart.
>
> I'm not sure if the sstore framework is necessary in this location, since
> we should be able to tell opal_crs and it will do the right thing. I can
> try to look at it early next week if someone doesn't get to it before then.
>
> -- Josh

--
Adrian Reber
http://lisas.de/~adrian/
Re: [OMPI devel] ORTE headers in OPAL source
I have seen it. I am still waiting for things to settle down before I start fixing the FT code (again ;-)

Adrian

On Mon, Aug 11, 2014 at 01:40:33PM +, Jeff Squyres (jsquyres) wrote:
> Ah, I see.
>
> Ok -- add it to the list of
> FT-things-to-be-fixed-before-FT-can-be-supported-again (which I think Josh
> just did :-) ).
>
> Also: Adrian -- FYI. :-)
>
> On Aug 11, 2014, at 9:05 AM, George Bosilca wrote:
>
> > I just checked the code and noticed that all the usages of the sstore are
> > protected by an OPAL_ENABLE_FT_CR define. As we are not supporting FT, I
> > don't think this is something we should spend time fixing right now.
> >
> > George.
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Adrian Reber
http://lisas.de/~adrian/
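George's observation quoted above, that the sstore usages sit behind an OPAL_ENABLE_FT_CR define, can be illustrated with a small C sketch. Only the macro name OPAL_ENABLE_FT_CR comes from the thread; the functions record_touch_file() and sm_prepare_checkpoint() are hypothetical stand-ins, and defaulting the macro to 0 is an assumption made so the snippet compiles on its own.

```c
#include <stdio.h>

/* In a real build the configure system would set this; default it to
 * "off" here so the sketch is self-contained (assumption). */
#ifndef OPAL_ENABLE_FT_CR
#define OPAL_ENABLE_FT_CR 0
#endif

#if OPAL_ENABLE_FT_CR
/* Hypothetical stand-in for the upper-layer call; NOT the ORTE sstore API. */
static int record_touch_file(const char *seg_name)
{
    printf("would record %s in the snapshot metadata\n", seg_name);
    return 1;
}
#endif

/* Hypothetical lower-layer routine. Returns 1 if the FT hook ran,
 * 0 if it was compiled out. */
static int sm_prepare_checkpoint(const char *seg_name)
{
#if OPAL_ENABLE_FT_CR
    return record_touch_file(seg_name);
#else
    (void) seg_name; /* unused when FT support is compiled out */
    return 0;
#endif
}
```

With the guard off, the upper-layer call disappears from the compiled object, which is why it does no harm in non-FT builds. The dependency is still written into the lower-layer source file, though, which is the layering concern Jeff raised.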
Re: [OMPI devel] ORTE headers in OPAL source
Those calls should be protected with the CR FT #define - If I remember correctly. We were using the sstore to track the shared memory file names so we could clean them up on restart.

I'm not sure if the sstore framework is necessary in this location, since we should be able to tell opal_crs and it will do the right thing. I can try to look at it early next week if someone doesn't get to it before then.

-- Josh

On Sat, Aug 9, 2014 at 7:06 AM, Jeff Squyres (jsquyres) wrote:
> I think you're making a joke, right...?
>
> I see direct calls to ORTE sstore functionality in all three.

--
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
Re: [OMPI devel] ORTE headers in OPAL source
I think you're making a joke, right...?

I see direct calls to ORTE sstore functionality in all three.

On Aug 8, 2014, at 5:42 PM, George Bosilca wrote:
> These are harmless. They are only used when FT is enabled which should rarely
> be the case.
>
> George.
>
> On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) wrote:
> > Here's a few ORTE headers in OPAL source -- can respective owners clean these
> > up? Thanks.
> >
> > mca/btl/smcuda/btl_smcuda.c
> > 63:#include "orte/mca/sstore/sstore.h"
> >
> > mca/btl/sm/btl_sm.c
> > 62:#include "orte/mca/sstore/sstore.h"
> >
> > mca/mpool/sm/mpool_sm_module.c
> > 34:#include "orte/mca/sstore/sstore.h"

--
Jeff Squyres
jsquy...@cisco.com