You are right, Jeff. For security reasons, the child is not allowed to share registered memory with the parent.
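To make that concrete, here is a minimal sketch (illustrative only, not code from the thread) of the scenario discussed below: register a buffer after ibv_fork_init(), fork, and have the child write to the registered pages. With fork support enabled, libibverbs marks the registered pages MADV_DONTFORK, so the child does not inherit them and the write is expected to fault; the exact outcome depends on the kernel and libibverbs versions. Device selection and error handling are pared down to keep the sketch short.

----
/* Sketch: fork after registering memory with verbs fork support enabled.
 * The child's write to buf is expected to fault, because the registered
 * pages are marked MADV_DONTFORK and are not inherited by the child. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (ibv_fork_init() != 0) {                /* enable fork support */
        perror("ibv_fork_init");
        return 1;
    }

    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
    struct ibv_pd      *pd  = ibv_alloc_pd(ctx);

    /* page-aligned buffer so only this allocation sits on the pinned pages */
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    char  *buf = NULL;
    if (posix_memalign((void **) &buf, len, len) != 0)
        return 1;
    memset(buf, 0, len);

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

    pid_t pid = fork();
    if (pid == 0) {
        buf[0] = 3;                            /* child: expected to SIGSEGV */
        printf("child wrote to the registered buffer\n");
        _exit(0);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    printf("child %s\n",
           WIFSIGNALED(status) ? "was killed by a signal" : "exited normally");

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
----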
On Fri, Apr 24, 2015 at 9:20 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Does the child process end up with valid memory in the buffer in that sample? Back when I paid attention to verbs (which was admittedly a long time ago), the sample I pasted would segv...
>
> > On Apr 24, 2015, at 9:40 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> >
> > ibv_fork_init() will set a special madvise() flag (IBV_DONTFORK/DOFORK) on registered/locked pages so that they stay pinned in the parent (and are not copy-on-write'd) across fork(), and it will maintain a refcount for cleanup.
> >
> > I think a minimal kernel version (2.6.x) that supports these flags is required.
> >
> > I can check internally if you think the behavior is different.
> >
> > On Fri, Apr 24, 2015 at 1:41 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > Mike --
> >
> > What happens when you do this?
> >
> > ----
> > ibv_fork_init();
> >
> > int *buffer = malloc(...);
> > ibv_reg_mr(buffer, ...);
> >
> > if (fork() == 0) {
> >     // in the child
> >     *buffer = 3;
> >     // ...
> > }
> > ----
> >
> > > On Apr 24, 2015, at 2:54 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> > >
> > > btw, ompi master now calls ibv_fork_init() before initializing the btl/mtl/oob frameworks, so all fork fears should be addressed.
> > >
> > > On Fri, Apr 24, 2015 at 4:37 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > Disable the memory manager / don't use leave pinned. Then you can fork/exec without fear (because only MPI will have registered memory -- it'll never leave user buffers registered after MPI communications finish).
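One way that suggestion is commonly put into practice is to turn off leave-pinned behavior through an MCA parameter. The sketch below is illustrative only and depends on the Open MPI version: it sets OMPI_MCA_mpi_leave_pinned in the environment before MPI_Init, which should be equivalent to passing --mca mpi_leave_pinned 0 to mpirun. It does not show disabling the memory manager itself, which is typically a build-time choice.

----
/* Sketch: ask Open MPI not to leave user buffers registered after
 * communications complete, so a later fork()/system() in application
 * code does not touch pinned pages.  OMPI_MCA_* environment variables
 * are read at MPI_Init time; passing the parameter to mpirun works too. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    setenv("OMPI_MCA_mpi_leave_pinned", "0", 1);   /* before MPI_Init */

    MPI_Init(&argc, &argv);
    /* ... communicate, then fork()/system() as needed ... */
    MPI_Finalize();
    return 0;
}
----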
> > > > On Apr 23, 2015, at 9:25 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> > > >
> > > > Jeff,
> > > >
> > > > this is kind of a LANL thing. Jack and I are working offline. Any suggestions about openib and fork/exec may be useful, however... and don't say no to fork/exec, at least not if you dream of MPI in the data center.
> > > >
> > > > On Apr 23, 2015 10:49 AM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > > I am using a "homecooked" cluster at LANL, ~500 cores. There are a whole bunch of Fortran system calls doing the copying and pasting. The full code is attached here, a bunch of if-then statements for user options. Thanks for the help.
> > > >
> > > > --Jack Galloway
> > > >
> > > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
> > > > Sent: Thursday, April 23, 2015 8:15 AM
> > > > To: Open MPI Users
> > > > Subject: Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes
> > > >
> > > > Hi Jack,
> > > >
> > > > Are you using a system at LANL? Maybe I could try to reproduce the problem on the system you are using. The system call stuff adds a certain bit of zest to the problem. Does the app make Fortran system calls to do the copying and pasting?
> > > >
> > > > Howard
> > > >
> > > > On Apr 22, 2015 4:24 PM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > >
> > > > I have an MPI program that is fairly straightforward: essentially "initialize, 2 sends from master to slaves, 2 receives on slaves, do a bunch of system calls for copying/pasting, then run a serial code on each MPI task, tidy up and call MPI_Finalize".
> > > >
> > > > This seems straightforward, but I'm not getting MPI_Finalize to work correctly. Below is a snapshot of the program, without all the system copy/paste/call-external-code work, which I've rolled up in "do codish stuff" type statements.
> > > >
> > > > ----
> > > > program mpi_finalize_break
> > > >
> > > >   !<variable declarations>
> > > >
> > > >   call MPI_INIT(ierr)
> > > >   icomm = MPI_COMM_WORLD
> > > >   call MPI_COMM_SIZE(icomm,nproc,ierr)
> > > >   call MPI_COMM_RANK(icomm,rank,ierr)
> > > >
> > > >   !<do codish stuff for a while>
> > > >
> > > >   if (rank == 0) then
> > > >     !<set up some stuff, then call MPI_SEND in a loop over the number of slaves>
> > > >     call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
> > > >     call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
> > > >   else
> > > >     call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > > >     call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > > >     !<do codish stuff for a while>
> > > >   endif
> > > >
> > > >   print*, "got here4", rank
> > > >   call MPI_BARRIER(icomm,ierr)
> > > >   print*, "got here5", rank, ierr
> > > >   call MPI_FINALIZE(ierr)
> > > >   print*, "got here6"
> > > >
> > > > end program mpi_finalize_break
> > > > ----
> > > >
> > > > Now the problem I am seeing occurs around the "got here4", "got here5" and "got here6" statements. I get the appropriate number of print statements, with corresponding ranks, for "got here4" as well as for "got here5". Meaning, the master and all the slaves (rank 0, and all other ranks) got to the barrier call, through the barrier call, and to MPI_FINALIZE, reporting 0 for ierr on all of them. However, when it gets to "got here6", after the MPI_FINALIZE, I get all kinds of weird behavior. Sometimes I get one less "got here6" than I expect, sometimes eight less (it varies); the program then hangs forever, never closing, and leaves an orphaned process on one (or more) of the compute nodes.
> > > >
> > > > I am running this on an InfiniBand-backbone machine, with the NFS server shared over InfiniBand (NFS over RDMA). I'm trying to determine why the MPI_BARRIER call works fine, yet MPI_FINALIZE ends up with random orphaned runs (not the same node, nor the same number of orphans, every time). I'm guessing it is related to the various system calls to cp, mv, ./run_some_code, cp, mv, but wasn't sure if it may be related to the speed of InfiniBand too, as all this happens fairly quickly. I could have the wrong intuition as well. Anybody have thoughts? I could post the whole code if helpful, but I believe this condensed version captures it. I'm running Open MPI 1.8.4 compiled against ifort 15.0.2, with Mellanox adapters running firmware 2.9.1000. This is the Mellanox firmware available through yum with CentOS 6.5, kernel 2.6.32-504.8.1.el6.x86_64.
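A stripped-down C analogue of the Fortran skeleton above (hypothetical, not Jack's code) may help isolate whether the system() call, which forks under the hood, is what interacts badly with registered memory in the openib BTL; the "true" command is a placeholder for the real copy/run-external-code steps.

----
/* Hypothetical minimal reproducer: a little point-to-point traffic,
 * a system() call standing in for the copy/run-external-code steps,
 * then the barrier and MPI_Finalize where the hang is observed. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nproc, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (rank == 0) {
        for (int n = 1; n < nproc; n++)
            MPI_Send(&value, 1, MPI_INT, n, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    system("true");                 /* stand-in for cp/mv/./run_some_code */

    printf("got here4 %d\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("got here5 %d\n", rank);
    MPI_Finalize();
    printf("got here6 %d\n", rank);
    return 0;
}
----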
> > > > ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> > > >           inet addr:192.168.6.254  Bcast:192.168.6.255  Mask:255.255.255.0
> > > >           inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
> > > >           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> > > >           RX packets:10952 errors:0 dropped:0 overruns:0 frame:0
> > > >           TX packets:9805 errors:0 dropped:625413 overruns:0 carrier:0
> > > >           collisions:0 txqueuelen:256
> > > >           RX bytes:830040 (810.5 KiB)  TX bytes:643212 (628.1 KiB)
> > > >
> > > > hca_id: mlx4_0
> > > >         transport:              InfiniBand (0)
> > > >         fw_ver:                 2.9.1000
> > > >         node_guid:              0002:c903:0057:e7fc
> > > >         sys_image_guid:         0002:c903:0057:e7ff
> > > >         vendor_id:              0x02c9
> > > >         vendor_part_id:         26428
> > > >         hw_ver:                 0xB0
> > > >         board_id:               MT_0D90110009
> > > >         phys_port_cnt:          1
> > > >                 port:   1
> > > >                         state:          PORT_ACTIVE (4)
> > > >                         max_mtu:        4096 (5)
> > > >                         active_mtu:     4096 (5)
> > > >                         sm_lid:         1
> > > >                         port_lid:       2
> > > >                         port_lmc:       0x00
> > > >                         link_layer:     InfiniBand
> > > >
> > > > This problem only occurs in this simple implementation, thus my thinking it is tied to the system calls. I run several other, much larger, much more robust MPI codes without issue on the machine. Thanks for the help.
> > > >
> > > > --Jack
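As a side note on the adapter output quoted above: the attributes that ibv_devinfo prints can also be queried programmatically, which is handy when collecting diagnostics from inside an application. A small sketch follows (illustrative only; "first device, port 1" is an assumption, and error handling is minimal) using ibv_query_device() and ibv_query_port().

----
/* Sketch: query the first HCA's attributes and port 1 state, roughly the
 * information ibv_devinfo printed above. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_device_attr dev_attr;
    ibv_query_device(ctx, &dev_attr);
    printf("fw_ver: %s  phys_port_cnt: %d\n",
           dev_attr.fw_ver, dev_attr.phys_port_cnt);

    struct ibv_port_attr port_attr;
    ibv_query_port(ctx, 1, &port_attr);
    printf("port 1 state: %d  lid: %d  sm_lid: %d  active_mtu: %d\n",
           port_attr.state, port_attr.lid, port_attr.sm_lid,
           port_attr.active_mtu);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
----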