Re: [OMPI devel] SM init failures

2009-03-31 Thread Eugene Loh

Jeff Squyres wrote:

FWIW, George found what looks like a race condition in the sm init
code today -- it looks like we don't call maffinity anywhere in the
sm btl startup, so we're not actually guaranteed that the memory is
local to any particular process(or) (!).  This race shouldn't cause
segvs, though; it should only mean that memory is potentially farther
away than we intended.


Is this that business that came up recently on one of these mail lists 
about setting the memory node to -1 rather than using the value we know 
it should be?  In mca_mpool_sm_alloc(), I do see a call to 
opal_maffinity_base_bind().


The central question is: does "first touch" mean both read and
write?  I.e., is the first process that either reads *or* writes to a
given location considered "first touch"?  Or is it only the first write?


So, maybe the strategy is to create the shared area, have each process 
initialize its portion (FIFOs and free lists), have all processes sync, 
and then move on.  That way, you know all memory will be written by the 
appropriate owner before it's read by anyone else.  First-touch 
ownership will be proper and we won't be dependent on zero-filled pages.
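
For what it's worth, a minimal sketch of that scheme outside of Open MPI
(all names and sizes here are made up; this is not the sm BTL code): each
process writes only its own slice of the mapped file before a barrier, so
first-touch placement follows the owner, and nobody reads a peer's slice
until after the sync.

/* first_touch_sketch.c -- "init your own slice, sync, then read peers".
 * Illustrative only; build with: gcc first_touch_sketch.c -o sketch -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 4
#define SLICE  (1 << 20)                   /* 1 MiB per process */

int main(void)
{
    /* Backing file for the shared area (could just as well live in /dev/shm). */
    char path[] = "/tmp/sm_sketch_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0 || ftruncate(fd, (off_t)NPROCS * SLICE) != 0) {
        perror("setup"); return 1;
    }
    char *area = mmap(NULL, (size_t)NPROCS * SLICE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (MAP_FAILED == area) { perror("mmap"); return 1; }

    /* Process-shared barrier so everyone syncs after initializing. */
    pthread_barrier_t *bar = mmap(NULL, sizeof(*bar), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_barrierattr_t attr;
    pthread_barrierattr_init(&attr);
    pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_barrier_init(bar, &attr, NPROCS);

    for (int rank = 0; rank < NPROCS; ++rank) {
        if (0 == fork()) {
            /* First touch: the owner writes its own slice... */
            memset(area + (size_t)rank * SLICE, rank, SLICE);
            /* ...and nobody reads a peer's slice until everyone is done. */
            pthread_barrier_wait(bar);
            char peek = area[(size_t)((rank + 1) % NPROCS) * SLICE];
            printf("rank %d sees peer byte %d\n", rank, peek);
            _exit(0);
        }
    }
    for (int rank = 0; rank < NPROCS; ++rank) wait(NULL);
    unlink(path);
    return 0;
}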


The big question in my mind remains that we don't seem to know how to 
reproduce the failure (segv) that we're trying to fix.  I, personally, 
am reluctant to stick fixes into the code for problems I can't observe.


Re: [OMPI devel] SM init failures

2009-03-31 Thread Sylvain Jeaugey
Sorry to continue off-topic, but going to System V shm would, for me,
be like going back in time.


System V shared memory used to be the main way to do shared memory in
MPICH and, from my (little) experience, it was truly painful:
 - Cleanup issues: does shmctl(IPC_RMID) handle _all_ cases (even
kill -9)?
 - Naming issues: shm segments are identified by a 32-bit key,
potentially causing conflicts between applications, or between layers
of the same application, on one node.
 - Space issues: the total shm size on a system is bounded by
/proc/sys/kernel/shmmax, requiring admin configuration and causing
conflicts between MPI applications running on the same node.


Mmap'ed files can be given a descriptive, per-job name, preventing naming
issues. On Linux, they can be allocated in /dev/shm to avoid filesystem
traffic, and space is not limited the way System V segments are.
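
For comparison, a minimal sketch of this named-mmap approach (the segment
name here is invented for illustration, and older glibc needs -lrt):
shm_open() creates a named object under /dev/shm, any process on the node
can map it by name, and shm_unlink() removes the name so the memory goes
away once the last process unmaps it.

/* shm_sketch.c -- named POSIX shared memory under /dev/shm (sketch only). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/myapp-job1234-node07";   /* hypothetical per-job name */
    size_t size = 4 * 1024 * 1024;

    /* One process creates and sizes the segment... */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

    /* ...and any process on the node can mmap it by name. */
    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == seg) { perror("mmap"); return 1; }
    memset(seg, 0, size);

    /* Removing the name: the memory lives until the last unmap, even if
     * the creating process dies without cleaning up. */
    shm_unlink(name);
    munmap(seg, size);
    close(fd);
    return 0;
}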


Sylvain

On Mon, 30 Mar 2009, Tim Mattox wrote:


I've been lurking on this conversation, and I am again left with the impression
that the underlying shared memory configuration based on sharing a file
is flawed.  Why not use a System V shared memory segment without a
backing file as I described in ticket #1320?

On Mon, Mar 30, 2009 at 1:34 PM, George Bosilca  wrote:

Then it looks like the safest solution is to use either the ftruncate or
the lseek method and then touch the first byte of every memory page.
Unfortunately, I see two problems with this. First, there is a clear
performance hit on startup time. And second, we will have to find a
pretty smart way to do this or we will completely break the memory affinity
stuff.
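
As a reference point, a hedged sketch of the two extension idioms plus the
"touch one byte per page" pass (the file name and sizes are made up):

/* extend_sketch.c -- extend a backing file, then force page allocation by
 * touching one byte per page.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/extend_XXXXXX";
    int fd = mkstemp(path);
    off_t size = 8 * 1024 * 1024;
    if (fd < 0 || ftruncate(fd, size) != 0) {   /* Option A: ftruncate */
        perror("setup"); return 1;
    }
    /* Option B would be: lseek(fd, size - 1, SEEK_SET); write(fd, "", 1);
     * Both leave a sparse file on most filesystems. */

    char *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == p) { perror("mmap"); return 1; }

    /* Touch the first byte of every page so backing storage is allocated
     * now rather than faulting (and possibly failing) later.  This is also
     * where memory affinity matters: whichever process touches a page
     * first determines its NUMA placement. */
    long pagesize = sysconf(_SC_PAGESIZE);
    for (off_t off = 0; off < size; off += pagesize) {
        p[off] = 0;
    }

    munmap(p, (size_t)size);
    close(fd);
    unlink(path);
    return 0;
}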

 george.

On Mar 30, 2009, at 13:24 , Iain Bason wrote:



On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:


But don't we need the whole area to be zero filled?


It will be zero-filled on demand using the lseek/touch method.  However,
the OS may not reserve space for the skipped pages or disk blocks.  Thus one
could still get out of memory or file system full errors at arbitrary
points.  Presumably one could also get segfaults from an mmap'ed segment
whose pages couldn't be allocated when the demand came.

Iain








--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/




Re: [OMPI devel] SM init failures

2009-03-31 Thread Jeff Squyres

On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:


> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any particular process(or) (!).  This race shouldn't cause
> segvs, though; it should only mean that memory is potentially farther
> away than we intended.

Is this that business that came up recently on one of these mail lists
about setting the memory node to -1 rather than using the value we know
it should be?  In mca_mpool_sm_alloc(), I do see a call to
opal_maffinity_base_bind().



No, it was a different thing -- but we missed the call to maffinity in  
mpool sm.  So that might make George's point moot (I see he still  
hasn't chimed in yet on this thread, perhaps that's why ;-) ).


To throw a little flame on the fire -- I notice the following from an  
MTT run last night:


[svbu-mpi004:17172] *** Process received signal ***
[svbu-mpi004:17172] Signal: Segmentation fault (11)
[svbu-mpi004:17172] Signal code: Invalid permissions (2)
[svbu-mpi004:17172] Failing at address: 0x2a98a3f080
[svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
[svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
[svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
[svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
[svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
[svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
[svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
[svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
[svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
[svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
[svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
[svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
[svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
[svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
[svbu-mpi004:17172] *** End of error message ***

Notice the "invalid permissions" message.  I didn't notice that  
before, but perhaps I wasn't looking.


I also see that this is under coll_tuned, not coll_hierarch (i.e.,  
*not* during MPI_INIT -- it's in a barrier).



> The central question is: does "first touch" mean both read and
> write?  I.e., is the first process that either reads *or* writes to a
> given location considered "first touch"?  Or is it only the first
> write?

So, maybe the strategy is to create the shared area, have each process
initialize its portion (FIFOs and free lists), have all processes sync,
and then move on.  That way, you know all memory will be written by the
appropriate owner before it's read by anyone else.  First-touch
ownership will be proper and we won't be dependent on zero-filled
pages.




That was what George was getting at yesterday -- there's a section in
the btl sm startup where you're setting up your own FIFOs.  But then
there's a section later where you're looking at your peers' FIFOs.
There's no synchronization between these two points -- when you're
looking at a peer's FIFO, you can tell whether that peer has set up
yet by whether its FIFO pointer is NULL.  If it's NULL, you loop and
try again (until it's not NULL).  This is what George thought might be
"bad" from a maffinity standpoint -- but perhaps this is moot if mpool
sm is calling maffinity bind.



The big question in my mind remains that we don't seem to know how to
reproduce the failure (segv) that we're trying to fix.  I, personally,
am reluctant to stick fixes into the code for problems I can't  
observe.





Well, we *can* observe them -- I can reproduce them at a very low rate  
in my MTT runs.  We just don't understand the problem yet to know how  
to reproduce them manually.  To be clear: I'm violently agreeing with  
you: I want to fix the problem, but it would be much mo' betta to  
*know* that we fixed the problem rather than "well, it doesn't seem to  
be happening anymore."  :-)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SM init failures

2009-03-31 Thread Jeff Squyres

On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:


Sorry to continue off-topic, but going to System V shm would, for me,
be like going back in time.

System V shared memory used to be the main way to do shared memory in
MPICH and, from my (little) experience, it was truly painful:
  - Cleanup issues: does shmctl(IPC_RMID) handle _all_ cases (even
kill -9)?
  - Naming issues: shm segments are identified by a 32-bit key,
potentially causing conflicts between applications, or between layers
of the same application, on one node.
  - Space issues: the total shm size on a system is bounded by
/proc/sys/kernel/shmmax, requiring admin configuration and causing
conflicts between MPI applications running on the same node.



Indeed.  The one saving grace here is that the cleanup issues  
apparently can be solved on Linux with a special flag that indicates  
"automatically remove this shmem when all processes attaching to it  
have died."  That was really the impetus for [re-]investigating sysv  
shm.  I, too, remember the sysv pain because we used it in LAM, too...
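
Assuming the "special flag" refers to the Linux behavior where a segment
already marked for deletion is destroyed as soon as the last attached
process detaches or dies, a minimal sketch of the trick looks like this
(the id would have to be passed to the other processes out of band, and
some other Unixes refuse new attaches to a removal-marked segment):

/* sysv_sketch.c -- mark a SysV segment for removal right after attaching,
 * so the kernel frees it when the last attacher exits (even via kill -9). */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 4 * 1024 * 1024;

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    void *seg = shmat(id, NULL, 0);
    if ((void *) -1 == seg) { perror("shmat"); return 1; }

    /* Mark for removal now: the id disappears from the system tables, but
     * the memory stays usable until the last process detaches. */
    if (shmctl(id, IPC_RMID, NULL) != 0) { perror("shmctl"); return 1; }

    /* ... use the segment ... */

    shmdt(seg);
    return 0;
}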


--
Jeff Squyres
Cisco Systems



[OMPI devel] custom btl

2009-03-31 Thread Roberto Ammendola
Hi all, I am developing a btl module for a custom interconnect board (we
call it apelink; it's an academic project), and I am porting the module
from the 1.2 branch (where it used to work) to the 1.3 branch. Two issues:


1) The use of pls_rsh_agent is said to be deprecated. How do I spawn the
jobs using rsh, then?


2) Although compilation is fine, I get

[gozer1:18640] mca: base: component_find: "mca_btl_apelink" does not
appear to be a valid btl MCA dynamic component (ignored)

even with a plain ompi_info command. Probably something changed in the 1.3
branch regarding DSOs that I need to account for in my btl. Any hint?


thanks
roberto

--
__

Roberto Ammendola    INFN - Roma II - APE group
tel: +39-0672594504  email: roberto.ammend...@roma2.infn.it
Via della Ricerca Scientifica 1 - 00133 Roma
__



Re: [OMPI devel] custom btl

2009-03-31 Thread Jeff Squyres

On Mar 31, 2009, at 11:15 AM, Roberto Ammendola wrote:

Hi all, I am developing a btl module for a custom interconnect board
(we call it apelink; it's an academic project), and I am porting the
module from the 1.2 branch (where it used to work) to the 1.3 branch.
Two issues:

1) The use of pls_rsh_agent is said to be deprecated. How do I spawn
the jobs using rsh, then?



The "pls" framework was replaced by the "plm" framework.  So  
"plm_rsh_agent" should work.  It defaults to "ssh : rsh" meaning that  
it'll look for ssh in your path, if it finds it, it will use it; if  
not, it'll look for rsh in your path, if it finds it, it will use it.   
If not, it'll fail.
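
For illustration, forcing rsh on the 1.3 branch should look something like
the following (the MCA parameter name is the real one; the rest of the
command line is just an example):

  mpirun --mca plm_rsh_agent rsh -np 4 ./a.out

or, equivalently, setting OMPI_MCA_plm_rsh_agent=rsh in the environment.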



2) Although compilation is fine, I get

[gozer1:18640] mca: base: component_find: "mca_btl_apelink" does not
appear to be a valid btl MCA dynamic component (ignored)

even with a plain ompi_info command. Probably something changed in the 1.3
branch regarding DSOs that I need to account for in my btl. Any hint?




This is likely due to dlopen failing with your component -- the most  
common reason for this is a missing/unresolvable symbol.  There's  
unfortunately a bug in libtool that doesn't show you the exact symbol  
that is unresolvable -- it instead may give a misleading error such as  
"file not found".  :-(


The way I have gotten around it before is to edit libltdl and add a  
printf.  :-(  Try this patch -- it compiles for me but I haven't  
tested it:


--- opal/libltdl/loaders/dlopen.c.~1~	2009-03-27 08:06:52.0 -0400
+++ opal/libltdl/loaders/dlopen.c	2009-03-31 11:50:05.0 -0400
@@ -195,6 +195,9 @@

   if (!module)
     {
+      const char *error;
+      LT__GETERROR(error);
+      fprintf(stderr, "Can't dlopen %s: %s\n", filename, error);
       DL__SETERROR (CANNOT_OPEN);
     }
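
A standalone check can also help before patching libltdl: dlopen() the
component directly and print dlerror(), which usually names the missing
symbol.  This is only a hedged sketch -- the path below is an example, and
you may need to dlopen libmpi.so / libopen-pal.so first so that symbols
they provide can resolve:

/* dltest.c -- try to dlopen a component and show the real dlerror().
 * Build with something like: gcc dltest.c -o dltest -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Example path only -- pass the real component on the command line. */
    const char *path = (argc > 1) ? argv[1]
                                  : "/path/to/lib/openmpi/mca_btl_apelink.so";

    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (NULL == handle) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }
    printf("dlopen(%s) succeeded\n", path);
    dlclose(handle);
    return 0;
}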



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SM init failures

2009-03-31 Thread Eugene Loh

Jeff Squyres wrote:


On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:


> FWIW, George found what looks like a race condition in the sm init
> code today -- it looks like we don't call maffinity anywhere in the
> sm btl startup, so we're not actually guaranteed that the memory is
> local to any particular process(or) (!).  This race shouldn't cause
> segvs, though; it should only mean that memory is potentially farther
> away than we intended.

Is this that business that came up recently on one of these mail lists
about setting the memory node to -1 rather than using the value we know
it should be?  In mca_mpool_sm_alloc(), I do see a call to
opal_maffinity_base_bind().


No, it was a different thing -- but we missed the call to maffinity 
in  mpool sm.  So that might make George's point moot (I see he still  
hasn't chimed in yet on this thread, perhaps that's why ;-) ).


To throw a little flame on the fire -- I notice the following from an  
MTT run last night:


[stack trace quoted from the previous message elided]

Notice the "invalid permissions" message.  I didn't notice that  
before, but perhaps I wasn't looking.


I also see that this is under coll_tuned, not coll_hierarch (i.e.,  
*not* during MPI_INIT -- it's in a barrier).


Yes, actually these happen "a lot".  (I've been spending time looking at 
IU_Sif/r20880 MTT stack traces.)


If the stack trace has MPI_Init in it, it seems to be going through 
mca_coll_hierarch.


Otherwise, the seg fault is in a collective call as you note -- could be 
MPI_Allgather, Barrier, Bcast, and I imagine there are others -- then 
mca_coll_tuned and eventually down to the sm BTL.


There are also quite a few orphaned(?) stack traces: just a segfault
and a single-level stack along the lines of

[ 0] /lib/libpthread.so


> The central question is: does "first touch" mean both read and
> write?  I.e., is the first process that either reads *or* writes to a
> given location considered "first touch"?  Or is it only the first
> write?

So, maybe the strategy is to create the shared area, have each process
initialize its portion (FIFOs and free lists), have all processes sync,
and then move on.  That way, you know all memory will be written by the
appropriate owner before it's read by anyone else.  First-touch
ownership will be proper and we won't be dependent on zero-filled
pages.


That was what George was getting at yesterday -- there's a section in
the btl sm startup where you're setting up your own FIFOs.  But then
there's a section later where you're looking at your peers' FIFOs.
There's no synchronization between these two points -- when you're
looking at a peer's FIFO, you can tell whether that peer has set up
yet by whether its FIFO pointer is NULL.  If it's NULL, you loop and
try again (until it's not NULL).  This is what George thought might be
"bad" from a maffinity standpoint -- but perhaps this is moot if mpool
sm is calling maffinity bind.


The thing I was wondering about was memory barriers.  E.g., you 
initialize stuff and then post the FIFO pointer.  The other guy sees the 
FIFO pointer before the initialized memory.
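
A hedged sketch of the publish-and-poll handshake with explicit ordering,
using generic C11 atomics rather than the actual sm BTL / OPAL barrier
macros: the owner fills in its FIFO and only then publishes the pointer
with release semantics; a peer spins until the pointer is non-NULL and
loads it with acquire semantics, so it cannot observe the pointer without
also seeing the initialized contents.

/* fifo_publish_sketch.c -- illustrative only, not Open MPI code. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int head;
    int tail;
    /* ... queue storage would follow ... */
} fifo_t;

/* One published slot per peer, living in the shared segment. */
typedef struct {
    _Atomic(fifo_t *) fifo;          /* NULL until the owner finishes init */
} peer_slot_t;

/* Owner: initialize the FIFO first, then publish the pointer.  The release
 * store guarantees that peers who see the pointer also see the contents. */
void publish_my_fifo(peer_slot_t *my_slot, fifo_t *my_fifo)
{
    my_fifo->head = 0;
    my_fifo->tail = 0;
    atomic_store_explicit(&my_slot->fifo, my_fifo, memory_order_release);
}

/* Peer: spin until the pointer is non-NULL.  The acquire load pairs with
 * the release store above, so reading head/tail afterwards is safe. */
fifo_t *wait_for_peer_fifo(peer_slot_t *peer_slot)
{
    fifo_t *f;
    while (NULL == (f = atomic_load_explicit(&peer_slot->fifo,
                                             memory_order_acquire))) {
        ;   /* real code would back off or call progress here */
    }
    return f;
}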



The big question in my mind remains that we don't seem to know how to
reproduce the failure (segv) that we're trying to fix.  I, personally,
am reluctant to stick fixes into the code for problems I can't  observe.


Well, we *can* observe them -- I can reproduce them at a very low rate
in my MTT runs.  We just don't understand the problem yet to know how
to reproduce them manually.  To be clear: I'm violently agreeing with
you: I want to fix the problem, but it would be much mo' betta to
*know* that we fixed the problem rather than "well, it doesn't seem to
be happening anymore."  :-)

[OMPI devel] mallopt fixes

2009-03-31 Thread Jeff Squyres
Ok, I've done a bunch of development and testing on the hg branch with  
all the mallopt fixes, etc., and I'm fairly confident that it's  
working properly.  I plan to put this stuff back into the trunk  
tomorrow by noonish US Eastern if no one finds any problems with it:


http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/mallopt/

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SM init failures

2009-03-31 Thread Jeff Squyres

On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:


The thing I was wondering about was memory barriers.  E.g., you
initialize stuff and then post the FIFO pointer.  The other guy sees
the FIFO pointer before the initialized memory.




We do do memory barriers during that SM startup sequence.  I haven't  
checked in a while, but I thought we were doing the right kinds of  
barriers in the right order...


But George mentioned on the call today that they may have found the  
issue, but they're testing it.  He didn't explain what the issue was  
in case he was wrong.  ;-)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SM init failures

2009-03-31 Thread Eugene Loh

Jeff Squyres wrote:


On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:


The thing I was wondering about was memory barriers.  E.g., you
initialize stuff and then post the FIFO pointer.  The other guy sees
the FIFO pointer before the initialized memory.


We do do memory barriers during that SM startup sequence.  I haven't  
checked in a while, but I thought we were doing the right kinds of  
barriers in the right order...


There are certainly *some* barriers.  The particular scenario I asked 
about didn't seem protected against (IMHO), but I certainly don't 
understand these issues and remain cautious about any guesses I make 
until I can demonstrate the problem and a solution.


Regarding "demonstrating the problem", I see the Sun MTT logs show some 
number of Init errors without mca_coll_hierarch involved.  I'll try 
rerunning on the same machines and see if I can trigger the problem.


But George mentioned on the call today that they may have found the  
issue, but they're testing it.  He didn't explain what the issue was  
in case he was wrong.  ;-)


'kay.