Re: [OMPI devel] ORTE headers in OPAL source

2014-10-19 Thread Josh Hursey
The first variable can probably be moved to opal pretty easily. It is
used when we need to fully shut down the BTLs and re-init them on continue.
We do not have to do that for tcp (since we leave the sockets open), but do
have to do that for IB, for example.

The second call is a bit tricky since it leaves a 'note' about a file
that needs to be created (touch'ed) on restart in order for the sm BTL
component to restart properly. For sm we leave the shared memory file open
and in place when we checkpoint, since on 'continue' we just keep using it.
But on restart the file will no longer be there, which can cause the process
to crash when restarted. So just before restart we touch the file, then
clean up the old reference and the old (newly touch'ed) file during the
restart INC when the process is being rebuilt.

So that is what that call is doing, just writing the name of the file into
the metadata for the snapshot. Then opal_restart will touch the file just
before calling the CRS component to restart the process. So we just need to
replace it with a call that sets this data in the metadata file. Take a
look in the CRS components and the CR infrastructure to see how they are
writing to the snapshot metadata (they might do it directly).
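
A minimal sketch of the note-then-touch flow described above, assuming a
hypothetical plain-text metadata file that holds one "touch <path>" line per
file. The names note_touch and replay_touch_notes and the file format are
illustrative assumptions; this is not the sstore or CRS API, only the idea.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* At checkpoint time: append a "touch <path>" note to the metadata file.
 * The line format is an assumption made for this sketch. */
int note_touch(const char *meta_path, const char *file_path)
{
    assert(meta_path != NULL && file_path != NULL);

    FILE *fp = fopen(meta_path, "a");
    if (fp == NULL) {
        return -1;
    }
    int rc = (fprintf(fp, "touch %s\n", file_path) < 0) ? -1 : 0;
    if (fclose(fp) != 0) {
        rc = -1;
    }
    return rc;
}

/* Just before restart: make sure every noted file exists.
 * Returns the number of files touched, or -1 on error. */
int replay_touch_notes(const char *meta_path)
{
    assert(meta_path != NULL);

    FILE *fp = fopen(meta_path, "r");
    if (fp == NULL) {
        return -1;
    }

    char line[4096];
    int count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';      /* strip the newline */
        if (strncmp(line, "touch ", 6) != 0) {
            continue;                          /* not a touch note */
        }
        /* Append mode creates the file if missing and never truncates it. */
        FILE *t = fopen(line + 6, "a");
        if (t == NULL) {
            fclose(fp);
            return -1;
        }
        fclose(t);
        count++;
    }
    fclose(fp);
    return count;
}
```

In this sketch the BTL would call note_touch() with the shared memory segment
name at checkpoint time, and the restart tool would call replay_touch_notes()
before handing control to the CRS component. The file only has to exist; it
does not need any content.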

Unfortunately, I have been away from that code long enough to not easily
remember how to do it. Let me know if that gives you enough to move forward
on.

Thanks,
Josh



On Fri, Oct 17, 2014 at 9:15 AM, Adrian Reber wrote:

> Josh,
>
> I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are
> two uses of orte code:
>
> if (orte_cr_continue_like_restart)
>
> and
>
>  /* On restart we need the old file names to exist (not necessarily
>   * contain content) so the CRS component does not fail when  searching
>   * for these old file handles. The restart procedure will make sure
>   * these files get cleaned up appropriately.
>   */
>  orte_sstore.set_attr(orte_sstore_handle_current,
>   SSTORE_METADATA_LOCAL_TOUCH,
>   mca_btl_sm_component.sm_seg->shmem_ds.seg_name);
>
>
> Do you have an idea how to fix those two? The first variable
> orte_cr_continue_like_restart could probably be moved but I am not sure
> how to handle the sstore call.
>
> Adrian
>
>

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-17 Thread Adrian Reber
Josh,

I had a look at the code (e.g., opal/mca/btl/sm/btl_sm.c) and there are
two uses of orte code:

if (orte_cr_continue_like_restart)

and

 /* On restart we need the old file names to exist (not necessarily
  * contain content) so the CRS component does not fail when  searching
  * for these old file handles. The restart procedure will make sure
  * these files get cleaned up appropriately.
  */
 orte_sstore.set_attr(orte_sstore_handle_current,
  SSTORE_METADATA_LOCAL_TOUCH,
  mca_btl_sm_component.sm_seg->shmem_ds.seg_name);


Do you have an idea how to fix those two? The first variable
orte_cr_continue_like_restart could probably be moved but I am not sure
how to handle the sstore call.

Adrian


On Sat, Aug 09, 2014 at 08:46:31AM -0500, Josh Hursey wrote:
> Those calls should be protected with the CR FT #define - If I remember
> correctly. We were using the sstore to track the shared memory file names
> so we could clean them up on restart.
> 
> I'm not sure if the sstore framework is necessary in this location, since
> we should be able to tell opal_crs and it will do the right thing. I can
> try to look at it early next week if someone doesn't get to it before then.
> 
> -- Josh
> 
> 
> 


Adrian

-- 
Adrian Reber http://lisas.de/~adrian/
ink, n.:
A villainous compound of tannogallate of iron, gum-arabic,
and water, chiefly used to facilitate the infection of
idiocy and promote intellectual crime.
-- H.L. Mencken


Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
I have seen it. I am still waiting for things to settle down before I
start fixing the FT code ( again ;-)

Adrian

On Mon, Aug 11, 2014 at 01:40:33PM +, Jeff Squyres (jsquyres) wrote:
> Ah, I see.
> 
> Ok -- add it to the list of 
> FT-things-to-be-fixed-before-FT-can-be-supported-again (which I think Josh 
> just did :-) ).
> 
> Also: Adrian -- FYI.  :-)
> 
> 
> On Aug 11, 2014, at 9:05 AM, George Bosilca wrote:
> 
> > I just checked the code and noticed that all the usages of the sstore are 
> > protected by an OPAL_ENABLE_FT_CR define. As we are not supporting FT, I 
> > don't think this is something we should spend time fixing right now.
> > 
> >   George.
> > 
> > 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/

Adrian

-- 
Adrian Reber http://lisas.de/~adrian/
Authentic:
Indubitably true, in somebody's opinion.


Re: [OMPI devel] ORTE headers in OPAL source

2014-08-09 Thread Josh Hursey
Those calls should be protected with the CR FT #define, if I remember
correctly. We were using the sstore to track the shared memory file names
so we could clean them up on restart.
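
A minimal sketch of what "protected with the CR FT #define" means in practice,
assuming the guard is the OPAL_ENABLE_FT_CR macro mentioned elsewhere in this
thread. The function sketch_track_segment and its body are hypothetical and
only show the guard pattern; the real code calls into the sstore framework.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption for this sketch: treat FT as disabled when the build system
 * has not set the macro. */
#ifndef OPAL_ENABLE_FT_CR
#define OPAL_ENABLE_FT_CR 0
#endif

/* Returns 1 if the checkpoint/restart bookkeeping ran,
 * 0 if it was compiled out. */
int sketch_track_segment(const char *seg_name)
{
    assert(seg_name != NULL);
#if OPAL_ENABLE_FT_CR
    /* Only FT-enabled builds compile this branch; this is where the
     * snapshot-store call would live. */
    printf("would record %s for cleanup on restart\n", seg_name);
    return 1;
#else
    (void)seg_name;  /* nothing to do without FT support */
    return 0;
#endif
}
```

With a guard like this, a build without FT never compiles the guarded call,
although an unguarded #include of the ORTE header would still leave the
layering problem in the source.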

I'm not sure if the sstore framework is necessary in this location, since
we should be able to tell opal_crs and it will do the right thing. I can
try to look at it early next week if someone doesn't get to it before then.

-- Josh



On Sat, Aug 9, 2014 at 7:06 AM, Jeff Squyres (jsquyres) wrote:

> I think you're making a joke, right...?
>
> I see direct calls to ORTE sstore functionality in all three.
>
>
>
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] ORTE headers in OPAL source

2014-08-09 Thread Jeff Squyres (jsquyres)
I think you're making a joke, right...?

I see direct calls to ORTE sstore functionality in all three.




On Aug 8, 2014, at 5:42 PM, George Bosilca wrote:

> These are harmless. They are only used when FT is enabled which should rarely 
> be the case.
> 
>   George.
> 
> 
> 
> On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) wrote:
> Here's a few ORTE headers in OPAL source -- can respective owners clean these 
> up?  Thanks.
> 
> -
> mca/btl/smcuda/btl_smcuda.c
> 63:#include "orte/mca/sstore/sstore.h"
> 
> mca/btl/sm/btl_sm.c
> 62:#include "orte/mca/sstore/sstore.h"
> 
> mca/mpool/sm/mpool_sm_module.c
> 34:#include "orte/mca/sstore/sstore.h"
> -
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15570.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15571.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/