Try adjusting this: oob_tcp_peer_retries = 10 to be oob_tcp_peer_retries = 1000. It should have given you an error if this failed, but let's give it a try anyway. You might also check whether you are hitting memory limitations. If so, or even just as an experiment, try reducing the value of coll_sync_barrier_before.

Ralph
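For reference, a minimal sketch of the two lines this would change in the openmpi-mca-param.conf quoted below; the lowered coll_sync_barrier_before value of 500 is only an illustrative starting point, not a value given in the thread:

    # in $sysconfdir/openmpi-mca-param.conf (propagate the change to both nodes)
    oob_tcp_peer_retries = 1000
    # only if memory limitations are suspected; 500 is an arbitrary lower value to test
    coll_sync_barrier_before = 500

Lowering coll_sync_barrier_before makes the sync component insert barriers more frequently, which limits how much unmatched collective traffic can pile up in memory between synchronization points.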
On Mon, Jul 20, 2009 at 9:17 AM, Steven Dale <steven_d...@hc-sc.gc.ca> wrote:

> Okay, now the plot is just getting weirder.
>
> I implemented most of the changes you recommend below. We are not running panasas, and our network is GB ethernet only, so I left the openib parameters out as well. I also recompiled with the switches suggested in the tlcc directory for the non-panasas file.
>
> Now our test case will run on 10 nodes with 160 permutations, which is a step forward. It does however still crash with a routed:binomial error on 10 nodes with 1600 permutations after about 14 minutes. With 800 permutations, it runs quite happily as well.
>
> The current openmpi-mca-param.conf is now:
>
> # $sysconf is a directory on a local disk, it is likely that changes
> # to this file will need to be propagated to other nodes. If $sysconf
> # is a directory that is shared via a networked filesystem, changes to
> # this file will be visible to all nodes that share this $sysconf.
>
> # The format is straightforward: one per line, mca_param_name =
> # rvalue. Quoting is ignored (so if you use quotes or escape
> # characters, they'll be included as part of the value). For example:
>
> # Disable run-time MPI parameter checking
> # mpi_param_check = 0
>
> # Note that the value "~/" will be expanded to the current user's home
> # directory. For example:
>
> # Change component loading path
> # component_path = /usr/local/lib/openmpi:~/my_openmpi_components
>
> # See "ompi_info --param all all" for a full listing of Open MPI MCA
> # parameters available and their default values.
> orte_abort_timeout = 10
> opal_set_max_sys_limits = 1
> orte_no_session_dirs = /usr,/users,/home,/hcadmin
> orte_tmpdir_base = /tmp
> orte_allocation_required = 1
> coll_sync_priority = 100
> coll_sync_barrier_before = 1000
> coll_hierarch_priority = 90
> oob_tcp_if_include=eth3
> oob_tcp_peer_retries = 10
> oob_tcp_disable_family = IPv6
> oob_tcp_listen_mode = listen_thread
> oob_tcp_sndbuf = 65536
> oob_tcp_rcvbuf = 65536
> btl = sm,tcp,self
> ## Setup MPI options
> mpi_show_handle_leaks = 0
> mpi_warn_on_fork = 1
>
> The current compilation looks like this:
>
> #!/bin/sh
>
> # Takes about 20-25 minutes
>
> PATH=$PATH:/usr/local/bin:;export PATH
> LDFLAGS="-m64"
> CFLAGS="-m64"
> CXXFLAGS="-m64"
> FCFLAGS="-m64"
> FFLAGS="-m64"
>
> # Build and install OpenMPI
>
> cd openmpi/openmpi-1.3.3
>
> sh ./configure --enable-dlopen=no --enable-binaries=yes --enable-shared=yes \
>   --enable-ipv6=no --enable-ft-thread=no \
>   --enable-mca-no-build=crs,filem,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm \
>   --with-slurm=yes --with-io-romio-flags="--with-file-system=ufs+nfs" \
>   --with-memory-manager=ptmalloc2 --with-wrapper-ldflags="-m64" \
>   --with-wrapper-cxxflags="-m64" --with-wrapper-fcflags="-m64" \
>   --with-wrapper-fflags="-m64"
>
> make
> make install
>
> ____________________
> Steve Dale
> Senior Platform Analyst
> Health Canada
>
> Ralph Castain <r...@open-mpi.org>
> Sent by: users-boun...@open-mpi.org
> 07/17/2009 10:35 AM
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Possible openmpi bug?
>
> Okay, just checking the obvious. :-)
>
> We regularly run with the exact same configuration here (i.e., slurm + 16 cpus/node) without problem on jobs that are both short and long, so it seems doubtful that it would be an OMPI bug. However, it is possible, as the difference could be due to configuration and/or parameter settings. We have seen some site-specific problems that are easily resolved with parameter changes.
>
> You might take a look at our (LANL's) platform files for our slurm-based system and see if they help. You will find them in the tarball at
>
> contrib/platform/lanl/tlcc
>
> Specifically, since you probably aren't running panasas (?), look at the optimized-nopanasas and optimized-nopanasas.conf (they are a pair) files to see how we configure the system for build, and the mca params we use to execute applications. If you can, I would suggest giving them a try (adjusting as required for your setup - e.g., you may not want the -m64 flags) and see if it resolves the problem.
>
> Ralph
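A minimal sketch of what configuring from one of those platform files could look like, assuming the 1.3.3 tarball layout and that this configure accepts --with-platform with a path relative to the source tree (the -m64 settings come from the file itself and can be adjusted as Ralph notes above):

    # hypothetical invocation; verify the exact path inside your tarball first
    cd openmpi/openmpi-1.3.3
    ./configure --with-platform=contrib/platform/lanl/tlcc/optimized-nopanasas
    make
    make install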
> On Jul 17, 2009, at 7:15 AM, Steven Dale wrote:
>
> I think it unlikely that it's a time-limit thing. Firstly, slurm is set up with no time limit on jobs, and we get the same behaviour whether or not slurm is in the picture. In addition, we've run several other much larger jobs with a greater number of permutations and they complete fine.
>
> This job takes about 5-10 minutes to run. We've run jobs that take a week or more, and the individual R processes can be seen to run for days at a time, and they run fine.
>
> In addition, I'd find it hard to believe (although I concede the possibility) that jobs entirely self-contained within the same box run slower than jobs which span 2 boxes over the network (14 cpus vs 17 cpus, for example).
>
> ____________________
> Steve Dale
> Senior Platform Analyst
> Health Canada
> Phone: (613)-948-4910
> E-mail: steven_d...@hc-sc.gc.ca
>
> Ralph Castain <r...@open-mpi.org>
> Sent by: users-boun...@open-mpi.org
> 07/17/2009 01:13 AM
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Possible openmpi bug?
>
> From what I can see, it looks like your job is being terminated - something is killing mpirun. Is it possible that the job runs slowly enough on 14 or fewer cpus that it simply isn't completing within your specified time limit?
>
> The lifeline message simply indicates that a process self-aborted because it lost contact with its local daemon - in this case, mpirun (as that is always daemon 0) - which means that the daemon was terminated for some reason.
>
> On Jul 16, 2009, at 11:15 AM, Steven Dale wrote:
>
> Here is my situation:
>
> 2 Dell R900's with 16 cpus each and 64 GB RAM
> OS: SuSE SLES 10 SP2 patched up to date
> R version 2.9.1
> Rmpi version 0.5-7
> snow version 0.3-3
> maanova library version 1.14.0
> openmpi version 1.3.3
> slurm version 2.0.3
>
> With a given set of R code, we get abnormal exits when using 14 or fewer cpus. When using 15 or more, the job completes normally. The error is a variation on:
>
> [pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost
>
> during the array permutations.
>
> Increasing the number of permutations above 200 also produces similar results.
> The R code is executed with a typical command line for 14 cpus being:
>
> sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_d...@hc-sc.gc.ca /usr/local/bin/R --no-save
>
> Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The network is GB Ethernet copper tcp/ip.
>
> I believe this to be an openmpi error/bug due to the routed:binomial message. This also had the same results with openmpi-1.3.2, R 2.9.0, maanova 1.12 and slurm 2.0.1.
>
> No non-default MCA parameters are set.
> LD_LIBRARY_PATH=/usr/local/lib.
> Configuration done with defaults.
>
> Any ideas are welcome.
>
> ____________________
> Steve Dale
> <bugrep.tar.bz2>
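As an illustrative aside not taken from the thread: when the job goes through sbatch rather than a direct mpirun, an MCA parameter under test can also be injected per job through Open MPI's OMPI_MCA_<param_name> environment variables instead of editing the shared configuration file, for example:

    # hypothetical test submission; the value 1000 mirrors the suggestion at the top of
    # this thread, and whether the variable reaches every MPI process depends on how
    # slurm propagates the submission environment to the tasks
    export OMPI_MCA_oob_tcp_peer_retries=1000
    sbatch -n 14 -i ./Rtest.txt /usr/local/bin/R --no-save

Environment settings take precedence over values in openmpi-mca-param.conf for that run, so the shared file can stay at its site-wide defaults while experimenting.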