Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

Couldn't have explained it better. Thanks, Jeff, for the summary!

On Tue, 1 Jun 2010, Jeff Squyres wrote:


On May 31, 2010, at 10:27 AM, Ralph Castain wrote:

Just curious - your proposed fix sounds exactly like what was done in 
the OPAL SOS work. Are you therefore proposing to use SOS to provide a 
more informational status return?


No, I think Sylvain's talking about slightly modifying the existing 
mechanism:


1. Return OMPI_SUCCESS: bml then obeys whatever is in the connectivity 
bitmask -- even if the bitmask indicates that this BTL can't talk to 
anyone.


2. Return != OMPI_SUCCESS: treat the problem as a fatal error.

I think Sylvain's point is that OMPI_SUCCESS can be returned for 
non-fatal errors if a BTL just wants to be ignored.  In such cases, the 
BTL can just set its connectivity mask to 0. This will allow OMPI to 
continue the job.


For example, if verbs is borked (e.g., can't create CQ's), it can return 
a connectivity mask of 0 and OMPI_SUCCESS.  The BML is then free to fail 
over to some other BTL.


But if a malloc() fails down in some BTL, then the job is hosed anyway 
-- so why not return != OMPI_SUCCESS and try to abort cleanly?
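
For example, a minimal sketch of that convention (the names, types, and error-code values below are simplified stand-ins for illustration, not the real BTL interface):

#include <stdlib.h>
#include <string.h>

#define OMPI_SUCCESS              0
#define OMPI_ERR_OUT_OF_RESOURCE  (-2)   /* placeholder value */

typedef struct { unsigned char bits[32]; } reach_bitmap_t;  /* peers this BTL can reach */

static int device_init_ok(void) { return 0; }  /* pretend CQ creation failed */

int my_btl_add_procs(size_t nprocs, reach_bitmap_t *reachable)
{
    void *frag_pool = malloc(1 << 20);
    if (NULL == frag_pool) {
        /* Global problem (out of memory): the whole job is in trouble,
           so tell the BML via a real error code and let it abort cleanly. */
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    if (!device_init_ok()) {
        /* Failure local to this BTL (e.g., the CQ could not be created):
           claim to reach nobody and return OMPI_SUCCESS; the BML then
           simply ignores this BTL and fails over to another one. */
        memset(reachable->bits, 0, sizeof(reachable->bits));
        free(frag_pool);
        return OMPI_SUCCESS;
    }

    /* Normal path: mark every peer reachable (sketch only; frag_pool
       would be kept for later use by a real BTL). */
    for (size_t i = 0; i < nprocs && i < 8 * sizeof(reachable->bits); ++i) {
        reachable->bits[i / 8] |= (unsigned char)(1u << (i % 8));
    }
    return OMPI_SUCCESS;
}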


For sites that want to treat verbs failures as fatal, we can add a new 
MCA param, either in the openib BTL that says "treat all init failures as 
fatal to the job" or perhaps a new MCA param in R2 that says "if the 
connectivity map for a given BTL is empty, abort the job".  Or something 
like that.


If so, then it would seem the only real dispute here is: is there -any- 
condition whereby a given BTL should have the authority to tell OMPI to 
terminate an application, even if other BTLs could still function?


I think his cited example was if malloc() fails.

I could see some sites wanting to abort if their high-speed network was 
down (e.g., MX or openib BTLs failed to init) -- they wouldn't want OMPI 
to fail over to TCP.  The flip side of this argument is that the 
sysadmin could set "btl = ^tcp" in the system file, and then if 
openib/mx fails, the BML will abort because some peers won't be 
reachable.


I understand that the current code may not yet support that operation, 
but I do believe that was the intent of the design. So only the case 
where -all- BTLs say "I can't do it" would result in termination. 
Rather than change that design, I believe the intent is to work towards 
completing that implementation. In the interim, it would seem most 
sensible to me that we add an MCA param that specifies the termination 
behavior (i.e., attempt to continue or terminate on first fatal BTL 
error).


Agreed.

I think that there are multiple different exit conditions from a BTL 
init:


1. BTL succeeded in initializing, and some peers are reachable
2. BTL succeeded in initializing, and no peers are reachable
3. BTL failed to initialize, but that failure is localized to the BTL (e.g., openib failed to create a CQ)
4. BTL failed to initialize, and the error is global in nature (e.g., malloc() fail)


I think it might be a site-specific decision as to whether to abort the 
job for condition 3 or not.  Today we default to not aborting, paired 
with an indirect method of forcing a failure (i.e., setting btl=^tcp).


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






[OMPI devel] Wrong documentation for MPI_Comm_size manual page

2010-06-02 Thread Number Cruncher
I'm working on some intercommunicator stuff at the moment. According to 
MPI-2.2 standard:
"An inter-communication is a point-to-point communication between 
processes in different groups" [Section 6.6]


yet the "man" page for MPI_Comm_size reads:
"If the communicator  is  an  **intra-communicator**  (enables  
communication  between  two groups),  this  function returns the size of 
the local group"


Shouldn't that be **inter-communicator**?

Thanks,
Simon


Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

On Tue, 1 Jun 2010, Jeff Squyres wrote:


On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:


In my case, the error happens in:
  mca_btl_openib_add_procs()
    mca_btl_openib_size_queues()
      adjust_cq()
        ibv_create_cq_compat()
          ibv_create_cq()


Can you nail this down any further?  If I modify adjust_cq() to always 
return OMPI_ERROR, I see the openib BTL fail over properly to the TCP 
BTL.

It must be because create_cq actually creates CQs. Try applying this 
patch, which makes ibv_create_cq_compat() *not* create the CQs and return an 
error instead:


diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c  Fri May 28 14:50:25 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib.c  Wed Jun 02 10:56:57 2010 +0200
@@ -146,6 +146,7 @@
 int cqe, void *cq_context, struct ibv_comp_channel *channel,
 int comp_vector)
 {
+return OMPI_ERROR;
 #if OMPI_IBV_CREATE_CQ_ARGS == 3
 return ibv_create_cq(context, cqe, channel);
 #else


You should see MPI_Init complete nicely and your application segfault on 
the next MPI operation.


Sylvain


Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread George Bosilca
I don't have any IB nodes, but I'm interested to see how this happens. What I 
would like to understand here is how we get back into the openib code if 
add_procs failed for that BTL ...

  george.

On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote:

> On Tue, 1 Jun 2010, Jeff Squyres wrote:
> 
>> On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
>> 
>>> In my case, the error happens in :
>>>   mca_btl_openib_add_procs()
>>> mca_btl_openib_size_queues()
>>>   adjust_cq()
>>> ibv_create_cq_compat()
>>>   ibv_create_cq()
>> 
>> Can you nail this down any further?  If I modify adjust_cq() to always 
>> return OMPI_ERROR, I see the openib BTL fail over properly to the TCP BTL.
> It must be because create_cq actually creates cqs. Try to apply this patch 
> which makes create_cq_compat() *not* creates the cqs and return an error 
> instead :
> 
> diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
> --- a/ompi/mca/btl/openib/btl_openib.c  Fri May 28 14:50:25 2010 +0200
> +++ b/ompi/mca/btl/openib/btl_openib.c  Wed Jun 02 10:56:57 2010 +0200
> @@ -146,6 +146,7 @@
> int cqe, void *cq_context, struct ibv_comp_channel *channel,
> int comp_vector)
> {
> +return OMPI_ERROR;
> #if OMPI_IBV_CREATE_CQ_ARGS == 3
> return ibv_create_cq(context, cqe, channel);
> #else
> 
> 
> You should see MPI_Init complete nicely and your application segfault on the 
> next MPI operation.
> 
> Sylvain
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread George Bosilca
I think adding support for sysv shared memory is a good thing. However, I have 
some strong objections over the implementation in the hg tree. Here are 2 of 
the major ones:

1) The sysv shared memory creation is __atomic__, based on the flags used. 
Therefore, all the RML message exchange is totally useless.

2) The whole code is replicated in the 3 files (mmap, sysv and windows), even 
the common parts. However, in the sysv case most of the comments have been 
modified to remove all capital letters. I'm in favor of extracting all the 
common parts and moving them into a separate file. Only the really different 
parts (a small portion of the init and finalize) should be kept in the 
individual files.

  george.

On Jun 1, 2010, at 19:26 , Samuel K. Gutierrez wrote:

> Hi all,
> 
> Configure option added: --enable-sysv (default: disabled).
> 
> For sysv testing purposes, please enable.
> 
> Thanks!
> 
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
> On Jun 1, 2010, at 11:11 AM, Samuel K. Gutierrez wrote:
> 
>> Doh!
>> 
>> bitbucket repository: http://bitbucket.org/samuelkgutierrez/ompi_sysv_sm
>> 
>> Thanks,
>> 
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>> 
>> 
>> On Jun 1, 2010, at 11:08 AM, Samuel K. Gutierrez wrote:
>> 
>>> WHAT: New System V shared memory component.
>>> 
>>> WHY: https://svn.open-mpi.org/trac/ompi/ticket/1320
>>> 
>>> WHERE:
>>> M  ompi/mca/btl/sm/btl_sm.c
>>> M  ompi/mca/btl/sm/btl_sm_component.c
>>> M  ompi/mca/btl/sm/btl_sm.h
>>> M  ompi/mca/mpool/sm/mpool_sm_component.c
>>> M  ompi/mca/mpool/sm/mpool_sm.h
>>> M  ompi/mca/mpool/sm/mpool_sm_module.c
>>> A  ompi/mca/common/sm/configure.m4
>>> A  ompi/mca/common/sm/common_sm_sysv.h
>>> A  ompi/mca/common/sm/common_sm_windows.c
>>> A  ompi/mca/common/sm/common_sm_windows.h
>>> A  ompi/mca/common/sm/common_sm.c
>>> A  ompi/mca/common/sm/common_sm_sysv.c
>>> A  ompi/mca/common/sm/common_sm.h
>>> M  ompi/mca/common/sm/common_sm_mmap.c
>>> M  ompi/mca/common/sm/common_sm_mmap.h
>>> M  ompi/mca/common/sm/Makefile.am
>>> M  ompi/mca/common/sm/help-mpi-common-sm.txt
>>> M  ompi/mca/coll/sm/coll_sm_module.c
>>> M  ompi/mca/coll/sm/coll_sm.h
>>> 
>>> WHEN: Upon acceptance.
>>> 
>>> TIMEOUT: Tuesday, June 8, 2010 (after devel concall).
>>> 
>>> HOW:
>>> MCA mpi: parameter "mpi_common_sm" (current value: ,
>>>data source: default value)
>>>Which shared memory support will be used. Valid
>>>values: sysv,mmap - or a comma delimited combination
>>>of them (order dependent).  The first component that
>>>is successfully selected is used.
>>> 
>>> Thanks!
>>> 
>>> --
>>> Samuel K. Gutierrez
>>> Los Alamos National Laboratory
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Wrong documentation for MPI_Comm_size manual page

2010-06-02 Thread Jeff Squyres
Absolutely correct.  I've fixed it on the dev trunk and filed tickets to get 
the fix moved into the release branches.

Thanks!


On Jun 2, 2010, at 4:41 AM, Number Cruncher wrote:

> I'm working on some intercommunicator stuff at the moment. According to
> MPI-2.2 standard:
> "An inter-communication is a point-to-point communication between
> processes in different groups" [Section 6.6]
> 
> yet the "man" page for MPI_Comm_size reads:
> "If the communicator  is  an  **intra-communicator**  (enables 
> communication  between  two groups),  this  function returns the size of
> the local group"
> 
> Shouldn't that be **inter-communicator**?
> 
> Thanks,
> Simon
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote:

> It must be because create_cq actually creates cqs. Try to apply this
> patch which makes create_cq_compat() *not* creates the cqs and return an
> error instead :
> 
> diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
> --- a/ompi/mca/btl/openib/btl_openib.c  Fri May 28 14:50:25 2010 +0200
> +++ b/ompi/mca/btl/openib/btl_openib.c  Wed Jun 02 10:56:57 2010 +0200
> @@ -146,6 +146,7 @@
>   int cqe, void *cq_context, struct ibv_comp_channel *channel,
>   int comp_vector)
>   {
> +return OMPI_ERROR;
>   #if OMPI_IBV_CREATE_CQ_ARGS == 3
>   return ibv_create_cq(context, cqe, channel);
>   #else
> 

Don't you mean return NULL?  This function is supposed to return a (struct 
ibv_cq *).

> You should see MPI_Init complete nicely and your application segfault on
> the next MPI operation.

That wouldn't surprise me if you return OMPI_ERROR here, since it's expecting a 
pointer return value (OMPI_ERROR != NULL, so the error check from 
ibv_create_cq_compat() won't detect the problem properly).  
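
For reference, the forced-failure hack presumably intended here is the same hunk with NULL instead of OMPI_ERROR:

diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c  Fri May 28 14:50:25 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib.c  Wed Jun 02 10:56:57 2010 +0200
@@ -146,6 +146,7 @@
 int cqe, void *cq_context, struct ibv_comp_channel *channel,
 int comp_vector)
 {
+return NULL;
 #if OMPI_IBV_CREATE_CQ_ARGS == 3
 return ibv_create_cq(context, cqe, channel);
 #else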

Sidenote: why did we call it ibv_create_cq_compat()?  That seems like a 
namespace violation, and is quite confusing.  :-\

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:

> I think adding support for sysv shared memory is a good thing. However, I 
> have some strong objections over the implementation in the hg tree. Here are 
> 2 of the major ones:
> 
> 1) the sysv shared memory creation is __atomic__ based on the flags used. 
> Therefore, all the RML messages exchange is totally useless.

Not sure what you mean here.  common/sm may create new shmem segments at any 
time (e.g., during coll sm).  The RML message exchange is to ensure that only 1 
process creates and initializes the segment and then all the others just attach 
to it.

The initializing of the segment after it is created/attached could be pipelined 
a little more.  E.g., since the init has an atomically-set flag indicating when 
it's done, the root could create the seg, signal the others that they can 
attach, and then do the init -- the non-root procs can wait for the flag to change 
atomically to know when the seg has been initialized.  Is that what you're 
referring to?
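
A tiny stand-alone sketch of that create / attach / initialize-then-set-flag pattern (it uses an anonymous shared mapping and fork() purely for illustration; the real common/sm code uses a file-backed segment plus RML messages):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    volatile int initialized;   /* flipped to 1 once the data is ready */
    char         data[64];
} seg_t;

int main(void)
{
    seg_t *seg = mmap(NULL, sizeof(*seg), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == seg) { perror("mmap"); return 1; }
    seg->initialized = 0;

    if (0 == fork()) {                    /* "non-root" process */
        while (!seg->initialized) {       /* wait for the flag ...            */
            ;                             /* (a real impl would yield/progress) */
        }
        printf("attached process sees: %s\n", seg->data);
        _exit(0);
    }

    /* "Root" process: the segment exists and the other process has already
       attached; now do the (possibly slow) initialization ...              */
    strcpy(seg->data, "hello from the root");
    __sync_synchronize();                 /* publish the data (GCC full barrier) */
    seg->initialized = 1;                 /* ... and only then set the flag.     */

    wait(NULL);
    munmap(seg, sizeof(*seg));
    return 0;
}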

> 2) the whole code is replicated in the 3 files (mmap, sysv and windows), even 
> the common parts. However in the sysv case most of the comments have been 
> modified to remove all capitals letter. I'm in favor of extracting all the 
> common parts and moving them in a special file. What should be kept in the 
> particular files should only be the really different parts (small part of the 
> init and finalize).

Sam -- are the common parts really common?  I.e., could they be factored out?  
Or are they "just different enough" that factoring them out would be a PITA?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: move hwloc code base to opal/hwloc

2010-06-02 Thread Jeff Squyres
To follow up on this RFC...

We discussed this RFC on the weekly call and no one seemed to hate it.  But 
there was a desire to:

a) be able to compile out hwloc for environments that don't want/need it (e.g., 
embedded environments)
b) have some degree of isolation in case hwloc ever dies
c) have some commonality of hwloc support (e.g., a central copy of the topology 
as an OPAL global variable, etc.)

The agreed-on compromise was to have a small set of OPAL wrappers that hide the 
real hwloc API.  I.e., the OPAL/ORTE/OMPI code bases would use the OPAL 
wrappers, not hwloc itself.  This allows OMPI to cleanly compile out hwloc 
(e.g., return OPAL_ERR_NOT_AVAILABLE when hwloc is compiled out) both for 
platforms that do not want hwloc support and for platforms where hwloc is not 
supported.
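
For illustration, one such wrapper could look something like this (the wrapper name, the OPAL_HAVE_HWLOC guard, and the error-code values are made up for this sketch, not the eventual API):

#define OPAL_ERROR              (-1)    /* placeholder value */
#define OPAL_ERR_NOT_AVAILABLE  (-16)   /* placeholder value */

#ifdef OPAL_HAVE_HWLOC
#include <hwloc.h>
#endif

int opal_hwloc_num_cores(void)
{
#ifdef OPAL_HAVE_HWLOC
    hwloc_topology_t topo;
    int ncores;

    if (0 != hwloc_topology_init(&topo)) return OPAL_ERROR;
    if (0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return OPAL_ERROR;
    }
    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topo);
    return ncores;
#else
    /* hwloc compiled out (e.g., an embedded build): callers get a clean
       "not available" error instead of a hard dependency on hwloc. */
    return OPAL_ERR_NOT_AVAILABLE;
#endif
}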

The ball is in my court to come up with a decent OPAL subset of the hwloc API 
that makes sense.  On the one hand, the hwloc API is huge because it has many, 
many accessors for all different kinds of access patterns.  But OTOH, we 
probably don't need all those accessors, even if having a smaller set of 
accessors may mean slightly less convenient/efficient access to the hwloc data. 
 

I'll try to strike a balance and come back to the community with a proposal.




On May 13, 2010, at 8:35 PM, Jeff Squyres wrote:

> WHAT: hwloc is currently embedded in opal/mca/paffinity/hwloc/hwloc -- move 
> it to be a first class citizen in opal/hwloc.
> 
> WHY: Let other portions of the OPAL, ORTE, and OMPI code bases use hwloc 
> services (remember that hwloc provides detailed topology information, not 
> just processor binding).
> 
> WHERE: Move opal/mca/paffinity/hwloc/hwloc to opal/hwloc, and adjust 
> associated configury
> 
> WHEN: For v1.5.1
> 
> TIMEOUT: Tuesday call, May 25
> 
> -
> 
> MORE DETAILS:
> 
> The hwloc code base is *much* more powerful and useful than PLPA -- it 
> provides a wealth of information that PLPA did not.  Specifically: hwloc 
> provides data structures detailing the internal topology of a server.  You 
> can see cache line sizes, NUMA layouts, sockets, cores, hardware threads, 
> ...etc.
> 
> This information should be available to the entire OMPI code base -- not just 
> locked up in a paffinity component.  Putting hwloc up in opal/hwloc makes it 
> available everywhere.  Developers can just call hwloc_, and OMPI's build 
> system will automatically do all the right symbol-shifting if the embedded 
> hwloc is used in OMPI (and not symbol-shift if an external hwloc is used, 
> obviously).  It's magically delicious!
> 
> One immediate use that I'd like to see is to have the openib BTL use hwloc's 
> ibv functionality to find "nearby" HCAs (right now, you can only do this with 
> rankfiles).
> 
> I can foresee other components using cache line size information to help tune 
> performance (e.g., sm btl and sm coll...?).
> 
> To be clear: there will still be an hwloc paffinity component.  It just won't 
> embed its own copy of hwloc anymore.  It'll use the hwloc services provided 
> by the OMPI build system, just like the rest of the OPAL / ORTE / OMPI code 
> bases.
> 
> There will also be an option to compile hwloc out altogether -- some stubs 
> will be left that return ERR_NOT_SUPPORTED, or somesuch (details TBD).  The 
> reason for this is that there are some systems where processor affinity and 
> NUMA information aren't relevant (e.g., embedded systems).  Memory footprint 
> is key in such systems; hwloc would simply take up valuable RAM.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Remove all other paffinity components

2010-06-02 Thread Jeff Squyres
To follow up on this RFC...

This RFC also got discussed on the weekly call (and in several other 
discussions).  Again, no one seemed to hate it.  That being said, hwloc still 
needs a bit more soak time; I just committed the 32 bit fix the other day.

So this one will happen eventually (i.e., #1, below -- #2 is the other RFC).  
It'll probably be off in an hg branch at first, and then I'll bring the results 
to the community before bringing it back into the trunk.


On May 18, 2010, at 8:50 AM, Jeff Squyres wrote:

> On May 18, 2010, at 8:31 AM, Terry Dontje wrote:
> 
>> The above sounds like you are replacing the whole paffinity framework with 
>> hwloc.  Is that true?  Or is the hwloc accessors you are talking about 
>> non-paffinity related?
> 
> Good point; these have all gotten muddled in the email chain.  Let me 
> re-state everything in one place in an attempt to be clear:
> 
> 1. Split paffinity into two frameworks (because some OS's support one and not 
> the other):
>  - binding: just for getting and setting processor affinity
>  - hwmap: just for mapping (board, socket, core, hwthread) <--> OS processor 
> ID
>  --> Note that hwmap will be an expansion of the current paffinity 
> capabilities
> 
> 2. Add hwloc to opal
>  - Commit the hwloc tree to opal/util/hwloc (or somesuch)
>  - Have the ability to configure hwloc out (e.g., for embedded environments)
>  - Add a dozen or two hwloc wrappers in opal/util/hwloc.c|h
>  - The rest of the OPAL/ORTE/OMPI trees *only call these wrapper functions* 
> -- they do not call hwloc directly
>  - These wrappers will call the back-end hwloc functions or return 
> OPAL_ERR_NOT_SUPPORTED (or somesuch) if hwloc is not available
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Samuel K. Gutierrez

On Jun 2, 2010, at 7:28 AM, Jeff Squyres wrote:


On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:

I think adding support for sysv shared memory is a good thing.  
However, I have some strong objections over the implementation in  
the hg tree. Here are 2 of the major ones:


1) the sysv shared memory creation is __atomic__ based on the flags  
used. Therefore, all the RML messages exchange is totally useless.


Not sure what you mean here.  common/sm may create new shmem  
segments at any time (e.g., during coll sm).  The RML message  
exchange is to ensure that only 1 process creates and initializes  
the segment and then all the others just attach to it.


The initializing of the segment after it is created/attached could  
be pipelined a little more.  E.g, since the init has an atomicly-set  
flag indicating when it's done, the root could create the seg,  
signal the others that they can attach, and then do the init -- the  
non-root procs can wait for flag to change atomicly to know when the  
seg has been initialized).  Is that what you're referring to?


2) the whole code is replicated in the 3 files (mmap, sysv and  
windows), even the common parts. However in the sysv case most of  
the comments have been modified to remove all capitals letter.
I'm in favor of extracting all the common parts and moving them in  
a special file. What should be kept in the particular files should  
only be the really different parts (small part of the init and  
finalize).


Sam -- are the common parts really common?  I.e., could they be  
factored out?  Or are they "just different enough" that factoring  
them out would be a PITA?


I'm sure some refactoring could be done - let me take a look.
--
Samuel K. Gutierrez
Los Alamos National Laboratory



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread George Bosilca

On Jun 2, 2010, at 09:28 , Jeff Squyres wrote:

> On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
> 
>> I think adding support for sysv shared memory is a good thing. However, I 
>> have some strong objections over the implementation in the hg tree. Here are 
>> 2 of the major ones:
>> 
>> 1) the sysv shared memory creation is __atomic__ based on the flags used. 
>> Therefore, all the RML messages exchange is totally useless.
> 
> Not sure what you mean here.  common/sm may create new shmem segments at any 
> time (e.g., during coll sm).  The RML message exchange is to ensure that only 
> 1 process creates and initializes the segment and then all the others just 
> attach to it.

Absolutely not! The RML messaging is not about initializing the shared memory 
segment. As stated in my original text, it has only one purpose: to ensure that 
the file used by mmap is created atomically. The code for Windows does not 
exchange any RML messages, as the function to allocate the shared memory region 
provided by the OS is atomic (exactly as the sysv one is).

> The initializing of the segment after it is created/attached could be 
> pipelined a little more.  E.g, since the init has an atomicly-set flag 
> indicating when it's done, the root could create the seg, signal the others 
> that they can attach, and then do the init -- the non-root procs can wait for 
> flag to change atomicly to know when the seg has been initialized).  Is that 
> what you're referring to?

This is actually how the whole thing works today. As an example, look at 
the sm BTL in file btl_sm.c, line 541.

  george.

> 
>> 2) the whole code is replicated in the 3 files (mmap, sysv and windows), 
>> even the common parts. However in the sysv case most of the comments have 
>> been modified to remove all capitals letter. I'm in favor of extracting all 
>> the common parts and moving them in a special file. What should be kept in 
>> the particular files should only be the really different parts (small part 
>> of the init and finalize).
> 
> Sam -- are the common parts really common?  I.e., could they be factored out? 
>  Or are they "just different enough" that factoring them out would be a PITA?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:

> > Not sure what you mean here.  common/sm may create new shmem segments at 
> > any time (e.g., during coll sm).  The RML message exchange is to ensure 
> > that only 1 process creates and initializes the segment and then all the 
> > others just attach to it.
> 
> Absolutely not! The RML messaging is not about initializing the shared memory 
> segment. As stated on my original text it has only one purpose: to ensure the 
> file used by mmap is created atomically. The code for Windows do not exchange 
> any RML messages as the function to allocate the shared memory region 
> provided by the OS is atomic (exactly as the sysv one).

I thought that Sam said that it was important that only 1 process 
shmctl/IPC_RMID...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

On Wed, 2 Jun 2010, Jeff Squyres wrote:


Don't you mean return NULL?  This function is supposed to return a (struct 
ibv_cq *).

Oops. My bad. Yes, it should return NULL. And it seems that if I make 
ibv_create_cq always return NULL, the scenario described by George works 
smoothly: OMPI_ERROR returned => bitmask cleared => connectivity problem 
=> stop or TCP fallback. The problem is more complicated than I thought.


But it helped me make progress on why I'm crashing: in my case, only a subset of 
the processes have their create_cq fail. The others work fine, so they 
request a QP creation, and my process, which failed over to TCP, starts 
creating a QP ... and crashes.


If you replace:
    return NULL;
with:
    if (atoi(getenv("OMPI_COMM_WORLD_RANK")) == 26)
        return NULL;
(yes, that's ugly, but it's just to debug the problem) and run on -say- 32 
processes, you should be able to reproduce the bug. Well, unless I'm 
mistaken again.


The crash stack should look like this :
#0  0x003d0d605a30 in ibv_cmd_create_qp () from /usr/lib64/libibverbs.so.1
#1  0x7f28b44e049b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
#2  0x003d0d609a42 in ibv_create_qp () from /usr/lib64/libibverbs.so.1
#3  0x7f28b6be6e6e in qp_create_one () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#4  0x7f28b6be78a4 in oob_module_start_connect () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#5  0x7f28b6be7fbb in rml_recv_cb () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#6  0x7f28b8c56868 in orte_rml_recv_msg_callback () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_rml_oob.so
#7  0x7f28b8a4cf96 in mca_oob_tcp_msg_recv_complete () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#8  0x7f28b8a4e2c2 in mca_oob_tcp_peer_recv_handler () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#9  0x7f28b9496898 in opal_event_base_loop () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#10 0x7f28b948ace9 in opal_progress () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#11 0x7f28b9951ed5 in ompi_request_default_wait_all () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libmpi.so.0

This new finding may change everything. Of course, stopping at the BML 
level still "solves" the problem, but maybe we can fix this more properly 
within the openib BTL. Unless this is a general 
out-of-band connection protocol issue.


Sylvain



Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:

> But it made me progress on why I'm crashing : in my case, only a subset of
> processes have their create_cq fail.

Ah, this is the key.  If I have one process (out of many) fail the create_cq() 
function, I get a segv during finalize.  I'll dig.

> This new advance may change everything. Of course, stopping at the bml
> level still "solves" the problem, but maybe we can fix this more properly
> within the openib BTL. Unless this is a general
> out-of-band-connection-protocol issue ().

I don't think this is an OOB CPC issue.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Samuel K. Gutierrez

On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:


On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:

Not sure what you mean here.  common/sm may create new shmem  
segments at any time (e.g., during coll sm).  The RML message  
exchange is to ensure that only 1 process creates and initializes  
the segment and then all the others just attach to it.


Absolutely not! The RML messaging is not about initializing the  
shared memory segment. As stated on my original text it has only  
one purpose: to ensure the file used by mmap is created atomically.  
The code for Windows do not exchange any RML messages as the  
function to allocate the shared memory region provided by the OS is  
atomic (exactly as the sysv one).


I thought that Sam said that it was important that only 1 process  
shmctl/IPC_RMID...?


Hi George,

We are using RML messaging in the sysv code to exchange the shared  
memory ID (generated by exactly one process).  I'm not sure how we  
would go about passing along the shared memory ID without RML, but any  
ideas are greatly appreciated.
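
For what it's worth, here is a stand-alone sketch of why the ID has to be passed around at all: with IPC_PRIVATE the kernel hands back an identifier that only the creator knows, so some out-of-band channel (RML, in the sysv component) has to carry that integer to the processes that want to attach.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Creator: ask the kernel for a brand-new segment. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    printf("created shmid %d -- any process that learns this integer can attach\n",
           shmid);

    /* Peer side (same integer, normally received over RML or similar): */
    void *addr = shmat(shmid, NULL, 0);
    if ((void *) -1 == addr) { perror("shmat"); return 1; }

    /* ... use the memory ... */

    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);   /* exactly one process should do this */
    return 0;
}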


Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Ashley Pittman

On 2 Jun 2010, at 16:49, Jeff Squyres wrote:

> On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
> 
>> But it made me progress on why I'm crashing : in my case, only a subset of
>> processes have their create_cq fail.
> 
> Ah, this is the key.  If I have one process (out of many) fail the 
> create_cq() function, I get a segv during finalize.  I'll dig.

Is there an assumption that if process A claims to be able to communicate with 
process B, then process B can also communicate with process A?  It almost sounds 
like the code needs to do an allreduce on the bitmask returned by the BTLs.
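
To illustrate the idea (written as a toy MPI program purely to show the semantics; inside Open MPI itself this would have to happen over the modex/OOB, not via MPI collectives), each process could contribute its row of the connectivity matrix and everyone could then keep only the symmetric entries:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* matrix[i*size + j] == 1 means "process i claims it can reach j". */
    unsigned char *matrix = calloc((size_t)size * size, 1);
    for (int j = 0; j < size; ++j) {
        matrix[rank * size + j] = 1;             /* my row: everyone reachable ...   */
    }
    if (1 == rank) matrix[rank * size + 0] = 0;  /* ... except 1 cannot reach 0      */

    /* OR-reduce so every process ends up with the full matrix. */
    MPI_Allreduce(MPI_IN_PLACE, matrix, size * size,
                  MPI_UNSIGNED_CHAR, MPI_BOR, MPI_COMM_WORLD);

    /* A peer j is usable only if both directions agree. */
    for (int j = 0; j < size; ++j) {
        int symmetric = matrix[rank * size + j] && matrix[j * size + rank];
        if (0 == rank) printf("rank 0 <-> rank %d usable: %d\n", j, symmetric);
    }

    free(matrix);
    MPI_Finalize();
    return 0;
}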

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:

> > Ah, this is the key.  If I have one process (out of many) fail the 
> > create_cq() function, I get a segv during finalize.  I'll dig.
> 
> Is there an assumption that if process A claims to be able to communicate 
> with process B that process B can also communicate with process A.  It almost 
> sounds like the code needs to do a allreduce on the bitmask returned by the 
> btls.

Actually, this is exactly the case (I just dug into the code and verified this).

In this case, we're already well beyond the point where we synchronized and 
decided who can connect to whom.  I.e., the modex is already done -- the openib 
BTL in process X has decided that it is available and has advertised its RDMACM 
CPC and OOB CPC contact info.

But then later in process X during the openib BTL add_procs, something fails.  
So the openib clears the connect bits and transparently fails over to TCP.  No 
problem.

The problem is the other peers who think that they can still connect to process 
X via the openib BTL.

1. In this case, the openib BTL was not finalized, so there was a stub still 
there listening on the RDMACM CPC.  When another process tried to connect to 
X's RDMACM CPC port, Bad Things happened (because it was only half setup) and 
we segv'ed.

Obviously, this should be fixed.  "Fixed" in this case probably means closing 
down the RDMACM CPC listening port.  But then that leads to another form of 
Badness.

2. If the openib BTL cleanly shuts down and is *not* still listening on its 
modex-advertised RDMACM CPC contact port, then if some other process tries to 
contact process X via the modex info, it'll fail.  This will then be judged to 
be a fatal error.  Failover in the BML will simply have delayed the job abort 
until someone tries to contact X via the openib BTL.

I think that the majority of this discussion about the BML failure (or not) 
behavior assumed that *all* processes had the same failure (at least: *I* 
assumed this).  But if only *some* of the processes fail a given BTL add_procs, 
we have a problem because we're beyond the point of deciding who can connect to 
whom.  Shutting down a single BTL module at that point will create an 
inconsistency of the distributed data.

That seems wrong.

What to do?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread George Bosilca

On Jun 2, 2010, at 12:18 , Jeff Squyres wrote:

> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
> 
>>> Ah, this is the key.  If I have one process (out of many) fail the 
>>> create_cq() function, I get a segv during finalize.  I'll dig.
>> 
>> Is there an assumption that if process A claims to be able to communicate 
>> with process B that process B can also communicate with process A.  It 
>> almost sounds like the code needs to do a allreduce on the bitmask returned 
>> by the btls.
> 
> Actually, this is exactly the case (I just dug into the code and verified 
> this).
> 
> In this case, we're already well beyond the point where we synchronized and 
> decided who can connect to whom.  I.e., the modex is already done -- the 
> openib BTL in process X has decided that it is available and has advertised 
> its RDMACM CPC and OOB CPC contact info.
> 
> But then later in process X during the openib BTL add_procs, something fails. 
>  So the openib clears the connect bits and transparently fails over to TCP.  
> No problem.
> 
> The problem is the other peers who think that they can still connect to 
> process X via the openib BTL.
> 
> 1. In this case, the openib BTL was not finalized, so there was a stub still 
> there listening on the RDMACM CPC.  When another process tried to connect to 
> X's RDMACM CPC port, Bad Things happened (because it was only half setup) and 
> we segv'ed.
> 
> Obviously, this should be fixed.  "Fixed" in this case probably means closing 
> down the RDMACM CPC listening port.  But then that leads to another form of 
> Badness.

I wonder how this is possible. If a process X fails to connect to Y, how can Y 
succeed in connecting to X? Please enlighten me ...

> 
> 2. If the openib BTL cleanly shuts down and is *not* still listening on its 
> modex-advertised RDMACM CPC contact port, then if some other process tries to 
> contact process X via the modex info, it'll fail.  This will then be judged 
> to be a fatal error.  Failover in the BML will simply have delayed the job 
> abort until someone tries to contact X via the openib BTL.

Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one 
and the connection fails, then the PML will automatically try to use the next 
available BTL, so it will eventually fail over to TCP (if available).

> 
> I think that the majority of this discussion about the BML failure (or not) 
> behavior assumed that *all* processes had the same failure (at least: *I* 
> assumed this).  But if only *some* of the processes fail a given BTL 
> add_procs, we have a problem because we're beyond the point of deciding who 
> can connect to whom.  Shutting down a single BTL module at that point will 
> create an inconsistency of the distributed data.

We did assume that at least the errors are symmetric, i.e. if A fails to 
connect to B then B will fail when trying to connect to A. However, if there 
are other BTLs, the connection is supposed to smoothly move over to some other 
BTL. As an example, in the MX BTL, if two nodes have MX support but do not 
share the same mapper, add_procs will silently fail.

  george.

> 
> That seems wrong.
> 
> What to do?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Eugene Loh

George Bosilca wrote:

We did assume that at least the errors are symmetric, i.e. if A fails 
to connect to B then B will fail when trying to connect to A.


I've not been following this thread closely, but thought I'd add a comment.

It used to be that the sm BTL could fail asymmetrically.  A shared-memory 
area would be allocated and processes would start to allocate resources 
within it.  At some point, the shared area would be 
exhausted.  So, some processes were set up to communicate with others, but 
the others would not be able to communicate back via the same BTL.  I 
think this led to much brokenness.  (E.g., how would a process return an 
sm fragment to a sender?)


At this point, my recollection of those issues is very fuzzy.

In any case, I think those issues went away with the shared-memory work 
I did a while back.  The size of the area is now computed to be large 
enough that each process's initial allocation would succeed.


Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread George Bosilca
How about ftok? The init function takes a file_name as argument, and this file 
name is unique per instance of the shared memory region we want to create. We 
can use this file name with ftok to create a unique key_t that can be used by 
shmget to retrieve the shared memory identifier.
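
A small sketch of that approach (the file name is a made-up placeholder; note that ftok() stats the file, so it must already exist):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    const char *backing_file = "/tmp/ompi_sysv_example";   /* hypothetical per-job file */

    /* Same file + same proj_id => same key in every process. */
    key_t key = ftok(backing_file, 1);
    if ((key_t) -1 == key) { perror("ftok"); return 1; }

    /* Every process can now call shmget() on its own; the kernel maps
       the key to a single segment, so no shmid exchange is needed. */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    void *addr = shmat(shmid, NULL, 0);
    if ((void *) -1 == addr) { perror("shmat"); return 1; }

    printf("key 0x%lx -> shmid %d\n", (unsigned long)key, shmid);

    shmdt(addr);
    /* In a real run, exactly one designated process would do:
       shmctl(shmid, IPC_RMID, NULL); */
    return 0;
}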

  george.

On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote:

> On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:
> 
>> On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
>> 
 Not sure what you mean here.  common/sm may create new shmem segments at 
 any time (e.g., during coll sm).  The RML message exchange is to ensure 
 that only 1 process creates and initializes the segment and then all the 
 others just attach to it.
>>> 
>>> Absolutely not! The RML messaging is not about initializing the shared 
>>> memory segment. As stated on my original text it has only one purpose: to 
>>> ensure the file used by mmap is created atomically. The code for Windows do 
>>> not exchange any RML messages as the function to allocate the shared memory 
>>> region provided by the OS is atomic (exactly as the sysv one).
>> 
>> I thought that Sam said that it was important that only 1 process 
>> shmctl/IPC_RMID...?
> 
> Hi George,
> 
> We are using RML messaging in the sysv code to exchange the shared memory ID 
> (generated by exactly one process).  I'm not sure how we would go about 
> passing along the shared memory ID without RML, but any ideas are greatly 
> appreciated.
> 
> Thanks,
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Samuel K. Gutierrez

Hi George,

That may work - I'll try it.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Jun 2, 2010, at 10:59 AM, George Bosilca wrote:

How about ftok ? The init function takes a file_name as argument,  
and this file name is unique per instance of the shared memory  
region we want to create. We can use this file name with ftok to  
create a unique key_t that can be used by shmget to retrieve the  
shared memory identifier.


 george.

On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote:


On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:


On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:

Not sure what you mean here.  common/sm may create new shmem  
segments at any time (e.g., during coll sm).  The RML message  
exchange is to ensure that only 1 process creates and  
initializes the segment and then all the others just attach to it.


Absolutely not! The RML messaging is not about initializing the  
shared memory segment. As stated on my original text it has only  
one purpose: to ensure the file used by mmap is created  
atomically. The code for Windows do not exchange any RML messages  
as the function to allocate the shared memory region provided by  
the OS is atomic (exactly as the sysv one).


I thought that Sam said that it was important that only 1 process  
shmctl/IPC_RMID...?


Hi George,

We are using RML messaging in the sysv code to exchange the shared  
memory ID (generated by exactly one process).  I'm not sure how we  
would go about passing along the shared memory ID without RML, but  
any ideas are greatly appreciated.


Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:

> > 1. In this case, the openib BTL was not finalized, so there was a stub 
> > still there listening on the RDMACM CPC.  When another process tried to 
> > connect to X's RDMACM CPC port, Bad Things happened (because it was only 
> > half setup) and we segv'ed.
> >
> > Obviously, this should be fixed.  "Fixed" in this case probably means 
> > closing down the RDMACM CPC listening port.  But then that leads to another 
> > form of Badness.
> 
> I wonder how this is possible. If a process X fails to connect to Y, how can 
> Y succeed to connect to X ? Please enlighten me ...

It doesn't.  Process X segvs after it goes into the RDMACM CPC accept code 
(because the openib BTL was only half setup).

> > 2. If the openib BTL cleanly shuts down and is *not* still listening on its 
> > modex-advertised RDMACM CPC contact port, then if some other process tries 
> > to contact process X via the modex info, it'll fail.  This will then be 
> > judged to be a fatal error.  Failover in the BML will simply have delayed 
> > the job abort until someone tries to contact X via the openib BTL.
> 
> Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one 
> and the connection fails, then the PML will automatically try to use the next 
> available BTL, so it will eventually fail over TCP (if available).

Yes, there is a timeout.  I forget offhand what we do if the timeout occurs.  
We probably report the connect failure in the "normal" way, but I don't know 
that for sure.

> > I think that the majority of this discussion about the BML failure (or not) 
> > behavior assumed that *all* processes had the same failure (at least: *I* 
> > assumed this).  But if only *some* of the processes fail a given BTL 
> > add_procs, we have a problem because we're beyond the point of deciding who 
> > can connect to whom.  Shutting down a single BTL module at that point will 
> > create an inconsistency of the distributed data.
> 
> We did assume that at least the errors are symmetric, i.e. if A fails to 
> connect to B then B will fail when trying to connect to A. However, if there 
> are other BTL the connection is supposed to smoothly move over some other 
> BTL. As an example in the MX BTL, if two nodes have MX support, but they do 
> not share the same mapper the add_procs will silently fails.

This sounds dodgy and icky.  We have to wait for a connect timeout to fail over 
to the next BTL?

How long is the typical/default TCP timeout?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Jeff Squyres
Yes, I think the mmap code in the sm btl actually has a sync point inside 
add_procs such that when the root allocs and sets up the area, it'll locally 
broadcast a "yes, we're good -- mmap attach and let's continue" or "bad things 
happened; sm btl is broke" message.

But I am not confident about the other BTLs.


On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote:

> George Bosilca wrote:
> 
> > We did assume that at least the errors are symmetric, i.e. if A fails
> > to connect to B then B will fail when trying to connect to A.
> 
> I've not been following this thread closely, but thought I'd add a comment.
> 
> It used to be that the sm BTL could fail asymmetrically.  A shared
> memory could be allocated and processes start to allocate resources
> within shared memory.  At some point, the shared area would be
> exhausted.  So, some processes were set up to communicate to others, but
> the others would not be able to communicate back via the same BTL.  I
> think this led to much brokenness.  (E.g., how would a process return a
> sm fragment to a sender?)
> 
> At this point, my recollection of those issues is very fuzzy.
> 
> In any case, I think those issues went away with the shared-memory work
> I did a while back.  The size of the area is now computed to be large
> enough that each process's initial allocation would succeed.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Jeff Squyres
Don't forget that the RML is also used to broadcast the success/failure of the 
creation of the shared memory segment.

If the RML goes away, be sure that you can still determine that without hanging.

Personally, I still don't see the problem with using the RML stuff...


On Jun 2, 2010, at 1:07 PM, Samuel K. Gutierrez wrote:

> Hi George,
> 
> That may work - I'll try it.
> 
> Thanks!
> 
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
> On Jun 2, 2010, at 10:59 AM, George Bosilca wrote:
> 
> > How about ftok ? The init function takes a file_name as argument, 
> > and this file name is unique per instance of the shared memory 
> > region we want to create. We can use this file name with ftok to 
> > create a unique key_t that can be used by shmget to retrieve the 
> > shared memory identifier.
> >
> >  george.
> >
> > On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote:
> >
> >> On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:
> >>
> >>> On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
> >>>
> > Not sure what you mean here.  common/sm may create new shmem 
> > segments at any time (e.g., during coll sm).  The RML message 
> > exchange is to ensure that only 1 process creates and 
> > initializes the segment and then all the others just attach to it.
> 
>  Absolutely not! The RML messaging is not about initializing the 
>  shared memory segment. As stated on my original text it has only 
>  one purpose: to ensure the file used by mmap is created 
>  atomically. The code for Windows do not exchange any RML messages 
>  as the function to allocate the shared memory region provided by 
>  the OS is atomic (exactly as the sysv one).
> >>>
> >>> I thought that Sam said that it was important that only 1 process 
> >>> shmctl/IPC_RMID...?
> >>
> >> Hi George,
> >>
> >> We are using RML messaging in the sysv code to exchange the shared 
> >> memory ID (generated by exactly one process).  I'm not sure how we 
> >> would go about passing along the shared memory ID without RML, but 
> >> any ideas are greatly appreciated.
> >>
> >> Thanks,
> >> --
> >> Samuel K. Gutierrez
> >> Los Alamos National Laboratory
> >>
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to:
> >>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>
> >>>
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: System V Shared Memory for Open MPI

2010-06-02 Thread Samuel K. Gutierrez

Good point - I forgot about that.

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Jun 2, 2010, at 11:40 AM, Jeff Squyres wrote:

Don't forget that the RML is also used to broadcast the success/ 
failure of the creation of the shared memory segment.


If the RML goes away, be sure that you can still determine that  
without hanging.


Personally, I still don't see the problem with using the RML stuff...


On Jun 2, 2010, at 1:07 PM, Samuel K. Gutierrez wrote:


Hi George,

That may work - I'll try it.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Jun 2, 2010, at 10:59 AM, George Bosilca wrote:


How about ftok ? The init function takes a file_name as argument,
and this file name is unique per instance of the shared memory
region we want to create. We can use this file name with ftok to
create a unique key_t that can be used by shmget to retrieve the
shared memory identifier.

george.

On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote:


On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:


On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:


Not sure what you mean here.  common/sm may create new shmem
segments at any time (e.g., during coll sm).  The RML message
exchange is to ensure that only 1 process creates and
initializes the segment and then all the others just attach to  
it.


Absolutely not! The RML messaging is not about initializing the
shared memory segment. As stated on my original text it has only
one purpose: to ensure the file used by mmap is created
atomically. The code for Windows do not exchange any RML messages
as the function to allocate the shared memory region provided by
the OS is atomic (exactly as the sysv one).


I thought that Sam said that it was important that only 1 process
shmctl/IPC_RMID...?


Hi George,

We are using RML messaging in the sysv code to exchange the shared
memory ID (generated by exactly one process).  I'm not sure how we
would go about passing along the shared memory ID without RML, but
any ideas are greatly appreciated.

Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Eugene Loh

Jeff Squyres wrote:


Yes, I think the mmap code in the sm btl actually has a sync point inside add_procs that when the 
root allocs and sets up the area, it'll locally broadcast a "yes, we're good -- mmap attach 
and let's continue" or "bad things happened; sm btl is broke" message.
 

Yes, that's great.  But my point was that (it used to be that) after 
that point, processes would start eating chunks out of that shared area 
and for large proc counts the last allocations would fail.  (The size of 
the shared area was poorly chosen and happened to be insufficient.)  So, 
despite the sync point you describe, some procs would succeed at 
mca_btl_sm_add_procs() while others would not.  This particular case is 
now, I believe, resolved.  It just seemed at the time like a case where 
the upper layers were making assumptions that were inconsistent with 
what the sm BTL was providing.



But I am not confident about the other BTLs.

On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote:
 


George Bosilca wrote:
   


We did assume that at least the errors are symmetric, i.e. if A fails
to connect to B then B will fail when trying to connect to A.
 


I've not been following this thread closely, but thought I'd add a comment.

It used to be that the sm BTL could fail asymmetrically.  A shared
memory could be allocated and processes start to allocate resources
within shared memory.  At some point, the shared area would be
exhausted.  So, some processes were set up to communicate to others, but
the others would not be able to communicate back via the same BTL.  I
think this led to much brokenness.  (E.g., how would a process return a
sm fragment to a sender?)

At this point, my recollection of those issues is very fuzzy.

In any case, I think those issues went away with the shared-memory work
I did a while back.  The size of the area is now computed to be large
enough that each process's initial allocation would succeed.
   



[OMPI devel] RFC: openib BTL failover

2010-06-02 Thread Rolf vandeVaart

WHAT: New PML called "bfo" (BTL Fail Over) that supports failover between
two or more openib BTLs.  New configurable code in the openib BTL that works
with the bfo to do failover.  Note this only works when we have two or more
openib BTLs.  This does not fail over to another BTL, such as tcp.

TO CONFIGURE:
--enable-openib-failover

TO RUN:
--mca pml bfo
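
For example (paths, build options, and process counts below are placeholders):

  shell$ ./configure --enable-openib-failover ...
  shell$ make all install
  shell$ mpirun --mca pml bfo -np 4 ./my_mpi_app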

TIMEOUT:
June 16, 2010

ADDITIONAL DETAILS:
The design relies on the BTL to call back into the PML with each
fragment that fails so the PML can decide what needs to be done.
No additional message tracking or software acknowledgements are
added, so that we have minimal impact on latency.  Testing has
shown no measurable effect.

When errors are detected on the BTL, it is no longer used.  No effort
is made to bring it back if the problems get corrected.  If it gets
fixed before the next job starts, then it will be used by the next
job.

Under normal conditions, these changes have no effect whatsoever on the
trunk, as the bfo PML is never selected and the failover support is
not configured into the openib BTL.  Every effort was made to minimize
the changes in the openib BTL.  The main changes are contained in two
new files that only get compiled when the --enable-openib-failover flag
is set.  The other changes consist of about 75 new lines in various
openib BTL files.

The bitbucket version is at:
http://bitbucket.org/rolfv/rfc-failover

Here are the files that would be added/changed.

BTL LAYER
M   ompi/mca/btl/btl.h
M   ompi/mca/btl/base/btl_base_mca.c
M   ompi/mca/btl/openib/btl_openib_component.c
M   ompi/mca/btl/openib/btl_openib.c
M   ompi/mca/btl/openib/btl_openib.h
M   ompi/mca/btl/openib/btl_openib_endpoint.h
M   ompi/mca/btl/openib/btl_openib_mca.c
A   ompi/mca/btl/openib/btl_openib_failover.c
A   ompi/mca/btl/openib/btl_openib_failover.h
M   ompi/mca/btl/openib/btl_openib_frag.h
M   ompi/mca/btl/openib/Makefile.am
M   ompi/config/ompi_check_openib.m4

PML LAYER
A   ompi/mca/pml/bfo
A   ompi/mca/pml/bfo/pml_bfo_comm.h
A   ompi/mca/pml/bfo/pml_bfo_sendreq.c
A   ompi/mca/pml/bfo/pml_bfo_isend.c
A   ompi/mca/pml/bfo/pml_bfo_component.c
A   ompi/mca/pml/bfo/Makefile.in
A   ompi/mca/pml/bfo/help-mpi-pml-bfo.txt
A   ompi/mca/pml/bfo/pml_bfo_recvfrag.h
A   ompi/mca/pml/bfo/pml_bfo_progress.c
A   ompi/mca/pml/bfo/pml_bfo_sendreq.h
A   ompi/mca/pml/bfo/pml_bfo_component.h
A   ompi/mca/pml/bfo/pml_bfo_failover.c
A   ompi/mca/pml/bfo/pml_bfo_recvreq.c
A   ompi/mca/pml/bfo/pml_bfo_irecv.c
A   ompi/mca/pml/bfo/pml_bfo_failover.h
A   ompi/mca/pml/bfo/pml_bfo_recvreq.h
A   ompi/mca/pml/bfo/pml_bfo_iprobe.c
A   ompi/mca/pml/bfo/pml_bfo.c
A   ompi/mca/pml/bfo/post_configure.sh
A   ompi/mca/pml/bfo/pml_bfo_hdr.h
A   ompi/mca/pml/bfo/pml_bfo_rdmafrag.c
A   ompi/mca/pml/bfo/pml_bfo_rdma.c
A   ompi/mca/pml/bfo/configure.params
A   ompi/mca/pml/bfo/pml_bfo.h
A   ompi/mca/pml/bfo/pml_bfo_rdmafrag.h
A   ompi/mca/pml/bfo/pml_bfo_rdma.h
A   ompi/mca/pml/bfo/.windows
A   ompi/mca/pml/bfo/Makefile.am
A   ompi/mca/pml/bfo/pml_bfo_comm.c
A   ompi/mca/pml/bfo/pml_bfo_start.c
A   ompi/mca/pml/bfo/pml_bfo_recvfrag.c