[OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Samuel K. Gutierrez

WHAT: New System V shared memory component.

WHY: https://svn.open-mpi.org/trac/ompi/ticket/1320

WHERE:
M  ompi/mca/btl/sm/btl_sm.c
M  ompi/mca/btl/sm/btl_sm_component.c
M  ompi/mca/btl/sm/btl_sm.h
M  ompi/mca/mpool/sm/mpool_sm_component.c
M  ompi/mca/mpool/sm/mpool_sm.h
M  ompi/mca/mpool/sm/mpool_sm_module.c
A  ompi/mca/common/sm/configure.m4
A  ompi/mca/common/sm/common_sm_sysv.h
A  ompi/mca/common/sm/common_sm_windows.c
A  ompi/mca/common/sm/common_sm_windows.h
A  ompi/mca/common/sm/common_sm.c
A  ompi/mca/common/sm/common_sm_sysv.c
A  ompi/mca/common/sm/common_sm.h
M  ompi/mca/common/sm/common_sm_mmap.c
M  ompi/mca/common/sm/common_sm_mmap.h
M  ompi/mca/common/sm/Makefile.am
M  ompi/mca/common/sm/help-mpi-common-sm.txt
M  ompi/mca/coll/sm/coll_sm_module.c
M  ompi/mca/coll/sm/coll_sm.h

WHEN: Upon acceptance.

TIMEOUT: Tuesday, June 8, 2010 (after devel concall).

HOW:
MCA mpi: parameter "mpi_common_sm" (current value: ,
  data source: default value)
  Which shared memory support will be used.  Valid
  values: sysv,mmap - or a comma delimited combination
  of them (order dependent).  The first component that
  is successfully selected is used.
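
The order-dependent selection described above can be illustrated with a small sketch. This is not the code from the patch: select_first_sm(), the probe callback type, and the two sample probes are hypothetical names invented here, and the sketch assumes only that each listed backend is tried in order and the first one that initializes is kept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe callback: returns non-zero if the named
 * shared memory backend can be used on this system. */
typedef int (*sm_probe_fn)(const char *name);

/* Sample probes for illustration only. */
int probe_sysv_and_mmap(const char *name)
{
    return strcmp(name, "sysv") == 0 || strcmp(name, "mmap") == 0;
}

int probe_mmap_only(const char *name)
{
    return strcmp(name, "mmap") == 0;
}

/* Walk a comma-delimited list such as "sysv,mmap" from left to right.
 * The first entry accepted by the probe is copied into out.
 * Returns 0 on success, -1 if no entry could be selected. */
int select_first_sm(const char *list, sm_probe_fn probe,
                    char *out, size_t out_len)
{
    assert(list != NULL && probe != NULL && out != NULL && out_len > 0);

    const char *p = list;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current entry */
        if (len > 0 && len < out_len) {
            memcpy(out, p, len);
            out[len] = '\0';
            if (probe(out)) {
                return 0;               /* first success wins */
            }
        }
        p += len;
        if (*p == ',') {
            p++;                        /* step over the delimiter */
        }
    }
    out[0] = '\0';
    return -1;
}
```

With the list "sysv,mmap" and a probe that rejects sysv, the function falls through to mmap; reversing the list to "mmap,sysv" makes mmap the first choice whenever it is available.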

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory







Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Samuel K. Gutierrez

Doh!

bitbucket repository: http://bitbucket.org/samuelkgutierrez/ompi_sysv_sm

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Graham, Richard L.
Can you be a bit more explicit, please?
I do not want this on our systems, so as long as this is a compile-time
decision, and as long as this does not degrade the performance of the current
sm device, I will not object.

Rich




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Samuel K. Gutierrez

Hi Rich,

I'll add a configure-time option.  This addition does not negatively  
impact the performance of the current sm component.


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory





Re: [OMPI devel] BTL add procs errors

2010-06-01 Thread Jeff Squyres
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:

> Just curious - your proposed fix sounds exactly like what was done in the 
> OPAL SOS work. Are you therefore proposing to use SOS to provide a more 
> informational status return?

No, I think Sylvain's talking about slightly modifying the existing mechanism:

1. Return OMPI_SUCCESS: bml then obeys whatever is in the connectivity bitmask 
-- even if the bitmask indicates that this BTL can't talk to anyone.

2. Return != OMPI_SUCCESS: treat the problem as a fatal error.

I think Sylvain's point is that OMPI_SUCCESS can be returned for non-fatal 
errors if a BTL just wants to be ignored.  In such cases, the BTL can just set 
its connectivity mask to 0. This will allow OMPI to continue the job.  

For example, if verbs is borked (e.g., can't create CQ's), it can return a 
connectivity mask of 0 and OMPI_SUCCESS.  The BML is then free to fail over to 
some other BTL.

But if a malloc() fails down in some BTL, then the job is hosed anyway -- so 
why not return != OMPI_SUCCESS and try to abort cleanly?

For sites that want to treat verbs failures as fatal, we can add a new MCA 
param either in the openib BTL that says "treat all init failures as fatal to 
the job" or perhaps a new MCA param in R2 that says "if the connectivity map 
for BTL  is empty, abort the job".  Or something like that.
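
A rough sketch of the two return cases discussed above, written as a stand-alone toy rather than as the real BML code. Everything in it is an assumption made for illustration: the btl_result_t structure, the SKETCH_SUCCESS and SKETCH_ERROR constants, the 32-bit reachability mask, and bml_assign_peers() itself. The real R2 BML uses its own bitmap type and endpoint structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the OMPI status codes; values chosen for this sketch. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)

/* Hypothetical summary of what one BTL reported from add_procs. */
typedef struct {
    const char *name;       /* e.g. "openib", "tcp" */
    int         rc;         /* status returned by the BTL */
    uint32_t    reachable;  /* bit i set => peer i is reachable */
} btl_result_t;

/* BTLs are listed in priority order. A non-success status is fatal.
 * A success status with an empty mask just means "ignore me", so the
 * next BTL gets a chance. A peer that nobody can reach is an error.
 * On success, owner[i] holds the index of the BTL chosen for peer i. */
int bml_assign_peers(const btl_result_t *btls, size_t nbtls,
                     size_t npeers, int *owner)
{
    assert(btls != NULL && owner != NULL && npeers <= 32);

    for (size_t p = 0; p < npeers; p++) {
        owner[p] = -1;                      /* not yet reachable */
    }
    for (size_t b = 0; b < nbtls; b++) {
        if (btls[b].rc != SKETCH_SUCCESS) {
            return SKETCH_ERROR;            /* case 2: fatal error */
        }
        for (size_t p = 0; p < npeers; p++) {
            /* case 1: obey the mask, even if it is all zeros */
            if (owner[p] < 0 && ((btls[b].reachable >> p) & 1u)) {
                owner[p] = (int)b;
            }
        }
    }
    for (size_t p = 0; p < npeers; p++) {
        if (owner[p] < 0) {
            return SKETCH_ERROR;            /* some peer is unreachable */
        }
    }
    return SKETCH_SUCCESS;
}
```

In this toy model, an openib entry that reports success with a mask of 0 followed by a tcp entry that reaches every peer yields a clean failover to tcp, while the same openib entry with no tcp entry behind it ends in an error because peers are left unreachable.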

> If so, then it would seem the only real dispute here is: is there -any- 
> condition whereby a given BTL should have the authority to tell OMPI to 
> terminate an application, even if other BTLs could still function?

I think his cited example was if malloc() fails.

I could see some sites wanting to abort if their high-speed network was down 
(e.g., MX or openib BTLs failed to init) -- they wouldn't want OMPI to fail 
over to TCP.  The flip side of this argument is that the sysadmin could set 
"btl = ^tcp" in the system file, and then if openib/mx fails, the BML will 
abort because some peers won't be reachable.

> I understand that the current code may not yet support that operation, but I 
> do believe that was the intent of the design. So only the case where -all- 
> BTLs say "I can't do it" would result in termination. Rather than change that 
> design, I believe the intent is to work towards completing that 
> implementation. In the interim, it would seem most sensible to me that we add 
> an MCA param that specifies the termination behavior (i.e., attempt to 
> continue or terminate on first fatal BTL error).

Agreed.

I think that there are multiple different exit conditions from a BTL init:

1. BTL succeeded in initializing, and some peers are reachable
2. BTL succeeded in initializing, and no peers are reachable
3. BTL failed to initialize, but that failure is localized to the BTL (e.g., 
openib failed to create a CQ)
4. BTL failed to initialize, and the error is global in nature (e.g., malloc() 
fail)

I think it might be a site-specific decision as to whether to abort the job for 
condition 3 or not.  Today we default to not failing and pair that with an 
indirect method of failing (i.e., setting btl=^tcp).
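
The four exit conditions and the site-specific choice for condition 3 could be written down roughly as follows. The enum names, decide_action(), and the abort_on_local_failure flag are all hypothetical; the flag stands in for the kind of MCA parameter proposed in this thread, which does not exist under that name.

```c
#include <stdbool.h>

/* The four BTL init outcomes listed above (names invented here). */
typedef enum {
    BTL_INIT_OK_SOME_PEERS,   /* 1. initialized, some peers reachable */
    BTL_INIT_OK_NO_PEERS,     /* 2. initialized, no peers reachable   */
    BTL_INIT_LOCAL_FAILURE,   /* 3. failed, but only this BTL is hit  */
    BTL_INIT_GLOBAL_FAILURE   /* 4. failed in a way that dooms the job */
} btl_init_outcome_t;

typedef enum {
    ACTION_USE_BTL,
    ACTION_IGNORE_BTL,
    ACTION_ABORT_JOB
} bml_action_t;

/* abort_on_local_failure models the site policy for condition 3.
 * false reproduces the default described above: skip the BTL and
 * let other BTLs take over. */
bml_action_t decide_action(btl_init_outcome_t outcome,
                           bool abort_on_local_failure)
{
    switch (outcome) {
    case BTL_INIT_OK_SOME_PEERS:
        return ACTION_USE_BTL;
    case BTL_INIT_OK_NO_PEERS:
        return ACTION_IGNORE_BTL;
    case BTL_INIT_LOCAL_FAILURE:
        return abort_on_local_failure ? ACTION_ABORT_JOB
                                      : ACTION_IGNORE_BTL;
    case BTL_INIT_GLOBAL_FAILURE:
    default:
        return ACTION_ABORT_JOB;
    }
}
```

Only condition 3 depends on the policy flag; the other three outcomes map to a fixed action in this sketch.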

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] BTL add procs errors

2010-06-01 Thread Jeff Squyres
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:

> In my case, the error happens in:
>   mca_btl_openib_add_procs()
>     mca_btl_openib_size_queues()
>       adjust_cq()
>         ibv_create_cq_compat()
>           ibv_create_cq()

Can you nail this down any further?  If I modify adjust_cq() to always return 
OMPI_ERROR, I see the openib BTL fail over properly to the TCP BTL.

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Jeff Squyres
On Jun 1, 2010, at 1:35 PM, Graham, Richard L. wrote:

> Can you be a bit more explicit, please ?

Sam has sent several prior RFCs on this subject.  I believe he was asking for 
final testing before bringing it into the trunk.

> I do not want this on our systems, so as long as this is a compile time 
> decision

Just curious -- why don't you want it on your systems?

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-01 Thread Samuel K. Gutierrez

Hi all,

Configure option added: --enable-sysv (default: disabled).

For sysv testing purposes, please enable.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory
