Re: [OMPI users] Open MPI program cannot complete
thanks

I got:

-bash-3.2$ padb -Ormgr=pbs -Q 48516.cystorm2
$VAR1 = {};
Job 48516.cluster is not active

Actually, the job is running.

Any help is appreciated.

thanks

Jinxu Ding

Oct. 27 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Tue, 26 Oct 2010 23:18:57 +0100
> To: dtustud...@hotmail.com
>
>
> The "^M: bad interpreter" tells me that you've downloaded the file in Windows
> and have got dos-based new-lines in the file.
>
> Assuming it's installed on your machine run "dos2unix padb" and it'll remove
> them, if that doesn't work save the file using a unix based email program. I
> hope this helps you when we finally get it working!
>
> Ashley.
>
> On 26 Oct 2010, at 22:14, Jack Bryan wrote:
>
> > Hi,
> >
> > I put your attached padb in mypath and also set it up in env variable.
> > I got this:
> >
> > -bash-3.2$ padb -Ormgr=pbs -Q 48494.cystorm2
> > -bash: /mypath/padb_patch_2010_10_26/padb: /usr/bin/perl^M: bad
> > interpreter: No such file or directory
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct. 26 2010
> >
> >
> > Subject: Re: [OMPI users] Open MPI program cannot complete
> > From: ash...@pittman.co.uk
> > Date: Tue, 26 Oct 2010 08:39:56 +0100
> > CC: tomview...@yahoo.com
> > To: dtustud...@hotmail.com
> >
> >
> > Sorry, I forgot to attach it last night.
> >
> > --
> > Ashley Pittman, Bath, UK.
> >
> > Padb - A parallel job inspection tool for cluster computing
> > http://padb.pittman.org.uk
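The "^M: bad interpreter" error quoted above comes from a carriage return at the end of the "#!/usr/bin/perl" line. As a minimal sketch, assuming dos2unix is not installed, the carriage returns can be detected and stripped with standard tools. The function names and the output file name are made up for illustration; they are not part of padb.

```shell
#!/bin/bash
# Hypothetical helpers, not part of padb.

# Succeeds if the first line of file "$1" ends in a carriage return
# (i.e. the script was saved with DOS line endings).
has_crlf_shebang() {
    cr=$(printf '\r')
    head -n 1 "$1" | grep -q "${cr}\$"
}

# Writes a copy of "$1" with every carriage return removed to "$2".
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Possible use (padb.unix is an invented name):
#   has_crlf_shebang padb && strip_cr padb padb.unix && chmod +x padb.unix
```

Writing to a second file rather than editing in place keeps the original around in case the download was damaged in some other way.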
Re: [OMPI users] Open MPI program cannot complete
From: dtustud...@hotmail.com
To: ash...@pittman.co.uk
Subject: RE: [OMPI users] Open MPI program cannot complete
List-Post: users@lists.open-mpi.org
Date: Mon, 25 Oct 2010 16:53:32 -0600

thanks

But, I cannot see the attachment in the email.

Would you please send it again, and also copy it to my other email address:

tomview...@yahoo.com

thanks

Oct. 25 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 23:41:32 +0100
> To: dtustud...@hotmail.com
>
>
> Thanks, that tells me a lot.
>
> Try the attached padb, I've added the patch for you and removed the -w option.
> Can you run it and send me back the output please.
>
> Ashley.
>
> On 25 Oct 2010, at 23:29, Jack Bryan wrote:
>
> > Thanks
> >
> > Here is the output:
> >
> > -bash-3.2$ qstat -fB
> > Server: clusterName
> >     server_state = Active
> >     scheduling = True
> >     total_jobs = 26
> >     state_count = Transit:0 Queued:7 Held:0 Waiting:0 Running:18 Exiting:0
> >     acl_hosts = clustername
> >     default_queue = normal
> >     log_events = 511
> >     mail_from = adm
> >     query_other_jobs = True
> >     resources_assigned.nodect = 246
> >     scheduler_iteration = 600
> >     node_check_rate = 150
> >     tcp_timeout = 6
> >     mom_job_sync = True
> >     pbs_version = 2.4.2
> >     keep_completed = 300
> >     submit_hosts = clusterName
> >     next_job_number = 48293
> >     net_counter = 2 9 6
> >
> > -bash-3.2$ qstat -w -n
> > qstat: invalid option -- w
> >
> >
> > Which line should I put the
> > -
> > --- padb (revision 401)
> > +++ padb (working copy)
> > @@ -2824,6 +2824,7 @@
> >  foreach my $server (@servers) {
> >  pbs_get_lqsub( $user, $server ); # get job list by qsub
> >  }
> > + print Dumper \%pbs_tabjobs;
> >  return \%pbs_tabjobs;
> >  }
> >
> >
> > in the bin file padb
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct. 25 2010
> >
> > > Subject: Re: [OMPI users] Open MPI program cannot complete
> > > From: ash...@pittman.co.uk
> > > Date: Mon, 25 Oct 2010 22:54:21 +0100
> > > To: dtustud...@hotmail.com
> > >
> > >
> > > [off list]
> > >
> > > The PBS support was added by a third-party so I've not used it in anger
> > > myself, it appears you are doing the correct thing as far as I can tell.
> > >
> > > Can you send me the output of the following two commands and also apply
> > > the patch below to padb (you can do this just in the bin dir - it's a
> > > perl script) and send me the output when you run that as well?
> > >
> > > qstat -fB
> > > qstat -w -n
> > >
> > > --- padb (revision 401)
> > > +++ padb (working copy)
> > > @@ -2824,6 +2824,7 @@
> > >  foreach my $server (@servers) {
> > >  pbs_get_lqsub( $user, $server ); # get job list by qsub
> > >  }
> > > + print Dumper \%pbs_tabjobs;
> > >  return \%pbs_tabjobs;
> > >  }
> > >
> > > On 25 Oct 2010, at 22:30, Jack Bryan wrote:
> > >
> > > > Thanks
> > > >
> > > > I have downloaded
> > > > http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz
> > > >
> > > > and followed the instructions of INSTALL file and installed it at
> > > > /mypath/padb32
> > > >
> > > > But, I got:
> > > >
> > > > -bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster
> > > > Job 48279.cluster is not active
> > > >
> > > > Actually, the job was running.
> > > >
> > > > I have installed
> > > > bin at
> > > >
> > > > /mypath/padb32/bin
> > > >
> > > >
> > > > libexec at
> > > > /lustre/jxding/padb32/libexec
> > > >
> > > > When I installed it, I used
> > > >
> > > > ./configure --prefix=/mypath/padb32
> > > >
> > > > I got
> > > > -
> > > >
> > > > checking for a BSD-c
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded

http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz

and followed the instructions of INSTALL file and installed it at /mypath/padb32

But, I got:

-bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster
Job 48279.cluster is not active

Actually, the job was running.

I have installed bin at

/mypath/padb32/bin

libexec at
/lustre/jxding/padb32/libexec

When I installed it, I used

./configure --prefix=/mypath/padb32

I got
-
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking whether gcc and cc understand -c and -o together... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: executing depfiles commands
---
-bash-3.2$ make
Making all in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
gcc -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"padb\" -DVERSION=\"3.2-beta1\" -I. -Wall -g -O2 -MT minfo-minfo.o -MD -MP -MF .deps/minfo-minfo.Tpo -c -o minfo-minfo.o `test -f 'minfo.c' || echo './'`minfo.c
minfo.c: In function ‘find_sym’:
minfo.c:158: warning: dereferencing type-punned pointer will break strict-aliasing rules
minfo.c: In function ‘main’:
minfo.c:649: warning: type-punning to incomplete type might break strict-aliasing rules
minfo.c:650: warning: type-punning to incomplete type might break strict-aliasing rules
mv -f .deps/minfo-minfo.Tpo .deps/minfo-minfo.Po
gcc -Wall -g -O2 -ldl -o minfo minfo-minfo.o
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `all-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-
-bash-3.2$ make install
Making install in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
test -z "/lustre/jxding/padb32/bin" || /bin/mkdir -p "/mypath/padb32/bin"
 /usr/bin/install -c padb '/lustre/jxding/padb32/bin'
test -z "/lustre/jxding/padb32/libexec" || /bin/mkdir -p "/mypath/padb32/libexec"
 /usr/bin/install -c minfo '/lustre/jxding/padb32/libexec'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Nothing to be done for `install-exec-am'.
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-bash-3.2$ make installcheck
Making installcheck in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Nothing to be done for `installcheck'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `installcheck-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
--
Is there something wrong with what I have done ?

Any help is appreciated.

thanks

Jack

Oct. 25 2010

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 20:40:18 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
>
> On 25 Oct 2010, at 20:18, Jack Bryan wrote:
>
> > Thanks
> > I have downloaded
> > http://padb.googlecode.com/files/padb-3.0.tgz
> >
> > and compile it.
> >
> > But, no user manual, I can not use it by padb -aQ.
>
> The -a flag is a shortcut to all jobs, if you are providing a jobid (which is
> normally numeric) then don't set the -a flag.
>
> > Do you have use manual about how to use it ?
>
> In my previous mail I was assuming you were using orte to launch the jobs but
> if you are using PBS then you'll need to use the 3.2 beta as the PBS code
Re: [OMPI users] Open MPI program cannot complete
Can you install MPI on your local machine? As someone said earlier, you don't need a cluster to run MPI. You can run MPI with multiple processes on a single computer.

On Mon, Oct 25, 2010 at 12:40 PM, Ashley Pittman wrote:

>
> On 25 Oct 2010, at 20:18, Jack Bryan wrote:
>
> > Thanks
> > I have downloaded
> > http://padb.googlecode.com/files/padb-3.0.tgz
> >
> > and compile it.
> >
> > But, no user manual, I can not use it by padb -aQ.
>
> The -a flag is a shortcut to all jobs, if you are providing a jobid (which
> is normally numeric) then don't set the -a flag.
>
> > Do you have use manual about how to use it ?
>
> In my previous mail I was assuming you were using orte to launch the jobs
> but if you are using PBS then you'll need to use the 3.2 beta as the PBS
> code is new, alternatively you could find the host where the PBS script
> itself runs and check if the "ompi-ps" command gives you any output, if it
> does then you could run it from there giving it the orte jobid.
>
> A bit of background about resource managers (in which I'm including orte
> and PBS), padb supports many resource managers and tries to automatically
> detect which ones you have installed on your system. If you don't specify
> one then it'll see what is installed, if there is more than one resource
> manager installed then it'll see which of them claim to have active jobs -
> if only one resource manager meets this criteria then it'll pick that one -
> hence 99% of the time it should just work. If more than one resource
> manager claims to have active jobs then padb will refuse to run but ask the
> user to specify one explicitly.
>
> You should try the following in order once you have 3.2 installed.
>
> padb -Ormgr=pbs -Q
>
> Or - find the node where the PBS script is being executed, check that the
> ompi-ps command is returning the jobid and then run
>
> padb -Ormgr=orte -Q
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI program cannot complete
On 25 Oct 2010, at 20:18, Jack Bryan wrote:

> Thanks
> I have downloaded
> http://padb.googlecode.com/files/padb-3.0.tgz
>
> and compile it.
>
> But, no user manual, I can not use it by padb -aQ.

The -a flag is a shortcut for all jobs. If you are providing a jobid (which is normally numeric) then don't set the -a flag.

> Do you have use manual about how to use it ?

In my previous mail I was assuming you were using orte to launch the jobs, but if you are using PBS then you'll need to use the 3.2 beta as the PBS code is new. Alternatively you could find the host where the PBS script itself runs and check if the "ompi-ps" command gives you any output; if it does then you could run padb from there, giving it the orte jobid.

A bit of background about resource managers (in which I'm including orte and PBS): padb supports many resource managers and tries to automatically detect which ones you have installed on your system. If you don't specify one then it'll see what is installed. If there is more than one resource manager installed then it'll see which of them claim to have active jobs; if only one resource manager meets this criterion then it'll pick that one, hence 99% of the time it should just work. If more than one resource manager claims to have active jobs then padb will refuse to run and ask the user to specify one explicitly.

You should try the following in order once you have 3.2 installed.

padb -Ormgr=pbs -Q

Or - find the node where the PBS script is being executed, check that the ompi-ps command is returning the jobid and then run

padb -Ormgr=orte -Q

Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
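The selection rule described in this message can be sketched as a small shell function. This only illustrates the rule as stated here; it is not padb's actual code (padb is a Perl script). The function name pick_rmgr and the "name:active_jobs" argument format are invented for the example, and the behaviour when no resource manager reports active jobs is an assumption, since the message does not cover that case.

```shell
#!/bin/bash
# Illustrative only: mimics the selection rule described in the message,
# NOT padb's real implementation. Each argument is "name:active_jobs",
# one per installed resource manager (an invented input format).
pick_rmgr() {
    if [ "$#" -eq 1 ]; then
        # Only one resource manager installed: nothing to choose between.
        echo "${1%%:*}"
        return 0
    fi
    active=""
    count=0
    for entry in "$@"; do
        name=${entry%%:*}
        njobs=${entry##*:}
        if [ "$njobs" -gt 0 ]; then
            active=$name
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 1 ]; then
        # Exactly one manager claims active jobs: pick it.
        echo "$active"
        return 0
    fi
    # Several candidates: refuse and ask for an explicit choice.
    # Zero candidates is treated the same way (an assumption).
    echo "cannot choose a resource manager, pass -Ormgr= explicitly" >&2
    return 1
}
```

For example, pick_rmgr orte:0 pbs:3 selects pbs, while pick_rmgr orte:2 pbs:3 fails. On a system like that the choice has to be made by hand, as in the invocation that appears elsewhere in this thread: padb -Ormgr=pbs -Q 48279.cluster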
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded
http://padb.googlecode.com/files/padb-3.0.tgz

and compiled it.

But there is no user manual, and I cannot use it with padb -aQ.

./padb -aQ myjob
padb: Error: --all incompatible with specific ids

Actually, myjob is running in the queue.

Do you have a user manual about how to use it ?

thanks

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 18:08:32 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
>
> On 25 Oct 2010, at 17:26, Jack Bryan wrote:
>
> > Thanks, the problem is still there.
> >
> > I used:
> >
> > Only process 0 returns. Other processes are still stuck in
> > MPI_Finalize().
> >
> > Any help is appreciated.
>
> You can use the command "padb -aQ" to show you the message queues for your
> application, you'll need to download and install padb then simply run your
> job, allow it to hang and then run padb - it'll show you the message queues
> for each rank that it can find processes for (the ones that haven't exited).
> If this isn't any help run "padb -axt" for the stack traces and send the
> output to this list.
>
> The web-site is in my signature or there is a new beta release out this week
> at http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
Re: [OMPI users] Open MPI program cannot complete
thanks But, the code is too long. Jack Oct. 25 2010 > Date: Mon, 25 Oct 2010 14:08:54 -0400 > From: g...@ldeo.columbia.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Your job may be queued, not executing, because there are no > resources available, all nodes are busy. > Try qstat -a. > > Posting a code snippet with all your MPI calls may prove effective. > You might get a trove of advice for a thrift of effort. > > Jeff Squyres wrote: > > Check the man page for qsub for proper use. > > > > > > On Oct 25, 2010, at 1:49 PM, Jack Bryan wrote: > > > >> thanks > >> > >> I use > >> qsub -I nsga2_job.sh > >> qsub: waiting for job 48270.clusterName to start > >> > >> By qstat > >> I found the job name is none and no results show up. > >> > >> No shell prompt appear, the command line is hang there , no response. > >> > >> Any help is appreciated. > >> > >> Thanks > >> > >> Jack > >> > >> Oct. 25 2010 > >> > >>> From: jsquy...@cisco.com > >>> Date: Mon, 25 Oct 2010 13:39:30 -0400 > >>> To: us...@open-mpi.org > >>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>> > >>> Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > >>> "qsub -I ..." ? > >>> > >>> Then you get a shell prompt with your allocated cores and can run stuff > >>> interactively. I don't know if your site allows this, but interactive > >>> debugging here might be *significantly* easier than try to automate some > >>> debugging. > >>> > >>> > >>> On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > >>> > >>>> thanks > >>>> > >>>> I have to use #PBS to submit any jobs in my cluster. > >>>> I cannot use command line to hang a job on my cluster. 
> >>>> > >>>> this is my script: > >>>> -- > >>>> #!/bin/bash > >>>> #PBS -N jobname > >>>> #PBS -l walltime=00:08:00,nodes=1 > >>>> #PBS -q queuename > >>>> COMMAND=/mypath/myprog > >>>> NCORES=5 > >>>> > >>>> cd $PBS_O_WORKDIR > >>>> NODES=`cat $PBS_NODEFILE | wc -l` > >>>> NPROC=$(( $NCORES * $NODES )) > >>>> > >>>> mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > >>>> > >>>> --- > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> And how to get ZOMBIE_PID from the script ? > >>>> > >>>> Any help is appreciated. > >>>> > >>>> thanks > >>>> > >>>> Oct. 25 2010 > >>>> > >>>> Date: Mon, 25 Oct 2010 19:24:35 +0200 > >>>> From: j...@59a2.org > >>>> To: us...@open-mpi.org > >>>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>>> > >>>> On Mon, Oct 25, 2010 at 19:07, Jack Bryan <dtustud...@hotmail.com> wrote: > >>>> I need to use #PBS parallel job script to submit a job on MPI cluster. > >>>> > >>>> Is it not possible to reproduce locally? Most clusters have a way to > >>>> submit an interactive job (which would let you start this thing and then > >>>> inspect individual processes). Ashley's Padb suggestion will certainly > >>>> be better in a non-interactive environment. > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> > >>>> Is control returning to your script after rank 0 has exited? In that > >>>> case, you can just put this on the next line. > >>>> > >>>> How to get the ZOMBIE_PID ? > >>>> > >>>> "ps" from the command line, or getpid() from C code. 
> >>>> > >>>> Jed > >>>> > >>>> ___ users mailing list > >>>> us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> ___ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >>> -- > >>> Jeff Squyres > >>> jsquy...@cisco.com > >>> For corporate legal information go to: > >>> http://www.cisco.com/web/about/doing_business/legal/cri/ > >>> > >>> > >>> ___ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Your job may be queued, not executing, because there are no resources available, all nodes are busy. Try qstat -a. Posting a code snippet with all your MPI calls may prove effective. You might get a trove of advice for a thrift of effort. Jeff Squyres wrote: Check the man page for qsub for proper use. On Oct 25, 2010, at 1:49 PM, Jack Bryan wrote: thanks I use qsub -I nsga2_job.sh qsub: waiting for job 48270.clusterName to start By qstat I found the job name is none and no results show up. No shell prompt appear, the command line is hang there , no response. Any help is appreciated. Thanks Jack Oct. 25 2010 From: jsquy...@cisco.com Date: Mon, 25 Oct 2010 13:39:30 -0400 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, "qsub -I ..." ? Then you get a shell prompt with your allocated cores and can run stuff interactively. I don't know if your site allows this, but interactive debugging here might be *significantly* easier than try to automate some debugging. On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: thanks I have to use #PBS to submit any jobs in my cluster. I cannot use command line to hang a job on my cluster. this is my script: -- #!/bin/bash #PBS -N jobname #PBS -l walltime=00:08:00,nodes=1 #PBS -q queuename COMMAND=/mypath/myprog NCORES=5 cd $PBS_O_WORKDIR NODES=`cat $PBS_NODEFILE | wc -l` NPROC=$(( $NCORES * $NODES )) mpirun -np $NPROC --mca btl self,sm,openib $COMMAND --- Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script ? And how to get ZOMBIE_PID from the script ? Any help is appreciated. thanks Oct. 25 2010 Date: Mon, 25 Oct 2010 19:24:35 +0200 From: j...@59a2.org To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete On Mon, Oct 25, 2010 at 19:07, Jack Bryan <dtustud...@hotmail.com> wrote: I need to use #PBS parallel job script to submit a job on MPI cluster. 
Is it not possible to reproduce locally? Most clusters have a way to submit an interactive job (which would let you start this thing and then inspect individual processes). Ashley's Padb suggestion will certainly be better in a non-interactive environment. Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script ? Is control returning to your script after rank 0 has exited? In that case, you can just put this on the next line. How to get the ZOMBIE_PID ? "ps" from the command line, or getpid() from C code. Jed ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Check the man page for qsub for proper use. On Oct 25, 2010, at 1:49 PM, Jack Bryan wrote: > thanks > > I use > qsub -I nsga2_job.sh > qsub: waiting for job 48270.clusterName to start > > By qstat > I found the job name is none and no results show up. > > No shell prompt appear, the command line is hang there , no response. > > Any help is appreciated. > > Thanks > > Jack > > Oct. 25 2010 > > > From: jsquy...@cisco.com > > Date: Mon, 25 Oct 2010 13:39:30 -0400 > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > > "qsub -I ..." ? > > > > Then you get a shell prompt with your allocated cores and can run stuff > > interactively. I don't know if your site allows this, but interactive > > debugging here might be *significantly* easier than try to automate some > > debugging. > > > > > > On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > > > > > thanks > > > > > > I have to use #PBS to submit any jobs in my cluster. > > > I cannot use command line to hang a job on my cluster. > > > > > > this is my script: > > > -- > > > #!/bin/bash > > > #PBS -N jobname > > > #PBS -l walltime=00:08:00,nodes=1 > > > #PBS -q queuename > > > COMMAND=/mypath/myprog > > > NCORES=5 > > > > > > cd $PBS_O_WORKDIR > > > NODES=`cat $PBS_NODEFILE | wc -l` > > > NPROC=$(( $NCORES * $NODES )) > > > > > > mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > > > > > > --- > > > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > > ZOMBIE_PID) in the script ? > > > And how to get ZOMBIE_PID from the script ? > > > > > > Any help is appreciated. > > > > > > thanks > > > > > > Oct. 
25 2010 > > > > > > Date: Mon, 25 Oct 2010 19:24:35 +0200 > > > From: j...@59a2.org > > > To: us...@open-mpi.org > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > On Mon, Oct 25, 2010 at 19:07, Jack Bryan <dtustud...@hotmail.com> wrote: > > > I need to use #PBS parallel job script to submit a job on MPI cluster. > > > > > > Is it not possible to reproduce locally? Most clusters have a way to > > > submit an interactive job (which would let you start this thing and then > > > inspect individual processes). Ashley's Padb suggestion will certainly be > > > better in a non-interactive environment. > > > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > > ZOMBIE_PID) in the script ? > > > > > > Is control returning to your script after rank 0 has exited? In that > > > case, you can just put this on the next line. > > > > > > How to get the ZOMBIE_PID ? > > > > > > "ps" from the command line, or getpid() from C code. > > > > > > Jed > > > > > > ___ users mailing list > > > us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > -- > > Jeff Squyres > > jsquy...@cisco.com > > For corporate legal information go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Open MPI program cannot complete
thanks I use qsub -I nsga2_job.shqsub: waiting for job 48270.clusterName to start By qstatI found the job name is none and no results show up. No shell prompt appear, the command line is hang there , no response. Any help is appreciated. Thanks Jack Oct. 25 2010 > From: jsquy...@cisco.com > Date: Mon, 25 Oct 2010 13:39:30 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > "qsub -I ..." ? > > Then you get a shell prompt with your allocated cores and can run stuff > interactively. I don't know if your site allows this, but interactive > debugging here might be *significantly* easier than try to automate some > debugging. > > > On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > > > thanks > > > > I have to use #PBS to submit any jobs in my cluster. > > I cannot use command line to hang a job on my cluster. > > > > this is my script: > > -- > > #!/bin/bash > > #PBS -N jobname > > #PBS -l walltime=00:08:00,nodes=1 > > #PBS -q queuename > > COMMAND=/mypath/myprog > > NCORES=5 > > > > cd $PBS_O_WORKDIR > > NODES=`cat $PBS_NODEFILE | wc -l` > > NPROC=$(( $NCORES * $NODES )) > > > > mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > > > > --- > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > ZOMBIE_PID) in the script ? > > And how to get ZOMBIE_PID from the script ? > > > > Any help is appreciated. > > > > thanks > > > > Oct. 25 2010 > > > > Date: Mon, 25 Oct 2010 19:24:35 +0200 > > From: j...@59a2.org > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > On Mon, Oct 25, 2010 at 19:07, Jack Bryan <dtustud...@hotmail.com> wrote: > > I need to use #PBS parallel job script to submit a job on MPI cluster. > > > > Is it not possible to reproduce locally? 
Most clusters have a way to > > submit an interactive job (which would let you start this thing and then > > inspect individual processes). Ashley's Padb suggestion will certainly be > > better in a non-interactive environment. > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > ZOMBIE_PID) in the script ? > > > > Is control returning to your script after rank 0 has exited? In that case, > > you can just put this on the next line. > > > > How to get the ZOMBIE_PID ? > > > > "ps" from the command line, or getpid() from C code. > > > > Jed > > > > ___ users mailing list > > us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
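The job script quoted above derives NPROC from the number of lines in $PBS_NODEFILE. On Torque/PBS that file normally has one line per allocated processor slot, so a host name is repeated when more than one slot per node is requested, and the line count is then a slot count rather than a node count. A minimal sketch of counting both; the function names are invented for illustration and the nodefile path is passed in as an argument:

```shell
#!/bin/bash
# Hypothetical helpers for inspecting a PBS nodefile (path passed as "$1").

# Number of lines = number of allocated slots.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Number of distinct host names = number of nodes.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Possible use inside a job script:
#   NODES=$(count_nodes "$PBS_NODEFILE")
#   SLOTS=$(count_slots "$PBS_NODEFILE")
```

With nodes=1, as in the quoted script, the file would be expected to hold a single line, so NPROC simply equals NCORES there.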
Re: [OMPI users] Open MPI program cannot complete
On Mon, Oct 25, 2010 at 19:35, Jack Bryan wrote:

> I have to use #PBS to submit any jobs in my cluster.
> I cannot use command line to hang a job on my cluster.

You don't need a cluster to run MPI jobs. Can you run the job on whatever your development machine is? Does it hang there? PBS interactive jobs are started with qsub -I.

> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid
> ZOMBIE_PID) in the script ?

On the line after "mpirun ...", assuming that control returns to there after the hang. You didn't answer whether that was the case.

> And how to get ZOMBIE_PID from the script ?

Simplest is "pgrep $COMMAND", or use ps.

Jed
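Jed's suggestion might look like the sketch below, placed after the "mpirun ..." line of the job script. It assumes that control really does return to the script while some ranks are still alive, and that gdb and pgrep are available on the compute node. One caveat: by default pgrep matches the process name rather than the full path, so "pgrep $COMMAND" with COMMAND=/mypath/myprog will likely match nothing; the sketch uses the basename instead. The function name dump_stuck_ranks and the DEBUGGER override are invented for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a backtrace for every surviving process of a
# program owned by the current user. "$1" is the program path, e.g. $COMMAND.
# Set DEBUGGER=echo for a dry run that only prints the gdb arguments.
dump_stuck_ranks() {
    local prog_name pid
    prog_name=$(basename "$1")
    # -x asks for an exact match on the process name. The kernel truncates
    # process names to 15 characters, so longer names need "pgrep -f" instead.
    for pid in $(pgrep -u "$(id -u)" -x "$prog_name"); do
        echo "=== backtrace for PID $pid ($prog_name) ==="
        ${DEBUGGER:-gdb} --batch -ex 'bt full' -ex 'info reg' -pid "$pid"
    done
}

# In the job script, after the mpirun line:
#   mpirun -np $NPROC --mca btl self,sm,openib $COMMAND
#   dump_stuck_ranks "$COMMAND"
```

This only sees processes on the node where the script itself runs; ranks on other nodes would need the same command run there, which is where padb is the more convenient tool.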
Re: [OMPI users] Open MPI program cannot complete
Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, "qsub -I ..." ? Then you get a shell prompt with your allocated cores and can run stuff interactively. I don't know if your site allows this, but interactive debugging here might be *significantly* easier than try to automate some debugging. On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > thanks > > I have to use #PBS to submit any jobs in my cluster. > I cannot use command line to hang a job on my cluster. > > this is my script: > -- > #!/bin/bash > #PBS -N jobname > #PBS -l walltime=00:08:00,nodes=1 > #PBS -q queuename > COMMAND=/mypath/myprog > NCORES=5 > > cd $PBS_O_WORKDIR > NODES=`cat $PBS_NODEFILE | wc -l` > NPROC=$(( $NCORES * $NODES )) > > mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > > --- > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > ZOMBIE_PID) in the script ? > And how to get ZOMBIE_PID from the script ? > > Any help is appreciated. > > thanks > > Oct. 25 2010 > > Date: Mon, 25 Oct 2010 19:24:35 +0200 > From: j...@59a2.org > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > On Mon, Oct 25, 2010 at 19:07, Jack Bryan <dtustud...@hotmail.com> wrote: > I need to use #PBS parallel job script to submit a job on MPI cluster. > > Is it not possible to reproduce locally? Most clusters have a way to submit > an interactive job (which would let you start this thing and then inspect > individual processes). Ashley's Padb suggestion will certainly be better in > a non-interactive environment. > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > ZOMBIE_PID) in the script ? > > Is control returning to your script after rank 0 has exited? In that case, > you can just put this on the next line. > > How to get the ZOMBIE_PID ? > > "ps" from the command line, or getpid() from C code. 
>
> Jed
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
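A side note on the quoted job script, offered as a sketch rather than a diagnosis. On Torque-style PBS installations the file named by $PBS_NODEFILE usually holds one line per allocated core, so `wc -l` on it counts cores rather than nodes. If that holds on this cluster, requesting "nodes=1:ppn=5" would make the script's NCORES * NODES arithmetic start 25 ranks instead of 5. The helper names count_slots and count_nodes below are hypothetical.

```shell
#!/bin/bash
# Sketch only: count_slots and count_nodes are hypothetical helper names.
# Assumes a Torque-style node file: one host name per allocated core.

# Number of allocated cores (lines in the node file).
count_slots() {
    echo $(( $(wc -l < "$1") ))
}

# Number of distinct hosts in the node file.
count_nodes() {
    echo $(( $(sort -u "$1" | wc -l) ))
}

# In the job script (after requesting e.g. "#PBS -l nodes=1:ppn=5"):
#   NPROC=$(count_slots "$PBS_NODEFILE")
#   mpirun -np "$NPROC" --mca btl self,sm,openib "$COMMAND"
```

Whether the site's scheduler writes the node file this way is an assumption worth checking with `cat $PBS_NODEFILE` inside an interactive job.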
Re: [OMPI users] Open MPI program cannot complete
On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote:

> I need to use #PBS parallel job script to submit a job on MPI cluster.

Is it not possible to reproduce locally? Most clusters have a way to submit an interactive job (which would let you start this thing and then inspect individual processes). Ashley's Padb suggestion will certainly be better in a non-interactive environment.

> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid
> ZOMBIE_PID) in the script ?

Is control returning to your script after rank 0 has exited? In that case, you can just put this on the next line.

> How to get the ZOMBIE_PID ?

"ps" from the command line, or getpid() from C code.

Jed
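If control does not return to the script because mpirun itself never exits, the next line is never reached. One possible workaround, sketched below, is to start the command in the background and poll it. The names run_with_watchdog and WATCHED_PID are hypothetical, and the return code 124 for a timeout is an arbitrary choice borrowed from timeout(1).

```shell
#!/bin/bash
# Sketch only: run_with_watchdog and WATCHED_PID are hypothetical names.

# Usage: run_with_watchdog SECONDS COMMAND [ARGS...]
# Starts COMMAND in the background and polls it once per second.
# Returns COMMAND's exit status if it finishes within SECONDS; otherwise
# returns 124 and leaves COMMAND running so a debugger can be attached.
run_with_watchdog() {
    local limit=$1 elapsed=0
    shift
    "$@" &
    WATCHED_PID=$!
    while kill -0 "$WATCHED_PID" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            echo "PID $WATCHED_PID still running after ${limit}s" >&2
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$WATCHED_PID"
}

# In the job script, instead of calling mpirun directly:
#   run_with_watchdog 300 mpirun -np "$NPROC" --mca btl self,sm,openib "$COMMAND"
#   # on return code 124, attach gdb to the stuck ranks as suggested in this thread
```

The hung processes are deliberately not killed on timeout, since the point is to inspect them. The time limit should stay below the job's walltime so the scheduler does not kill the job first.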
Re: [OMPI users] Open MPI program cannot complete
On 25 Oct 2010, at 17:26, Jack Bryan wrote:

> Thanks, the problem is still there.
>
> I used:
>
> Only process 0 returns. Other processes are still stuck in
> MPI_Finalize().
>
> Any help is appreciated.

You can use the command "padb -aQ" to show you the message queues for your application. You'll need to download and install padb, then simply run your job, allow it to hang, and then run padb. It'll show you the message queues for each rank that it can find processes for (the ones that haven't exited).

If this isn't any help, run "padb -axt" for the stack traces and send the output to this list.

The web-site is in my signature or there is a new beta release out this week at http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] Open MPI program cannot complete
thanks,

Would you tell me how to use (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in MPI ?

I need to use a #PBS parallel job script to submit a job on the MPI cluster.

Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script ?

How to get the ZOMBIE_PID ?

thanks

Any help is appreciated.

Jack

Oct. 25 2010

Date: Mon, 25 Oct 2010 19:01:38 +0200
From: j...@59a2.org
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

On Mon, Oct 25, 2010 at 18:26, Jack Bryan <dtustud...@hotmail.com> wrote:

Thanks, the problem is still there.

This really doesn't prove that there are no outstanding asynchronous requests, but perhaps you know that there are not, despite not being able to post a complete test case here. I suggest attaching a debugger and getting a stack trace from the zombies (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID).

Jed
Re: [OMPI users] Open MPI program cannot complete
Thanks, the problem is still there.

I used:

cout << "In main(), I am rank " << myRank << " , I am before MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl ;
MPI_Barrier(MPI_COMM_WORLD);
cout << "In main(), I am rank " << myRank << " , I am before MPI_Finalize() and after MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl ;
MPI_Finalize();
cout << "In main(), I am rank " << myRank << " , I am after MPI_Finalize(), then return 0 . \n\n" << endl ;
return 0 ;

Only process 0 returns. Other processes are still stuck in MPI_Finalize().

Any help is appreciated.

JACK

Oct. 25 2010

From: solarbik...@gmail.com
Date: Mon, 25 Oct 2010 08:27:19 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

I think I got this problem before. Put a mpi_barrier(mpi_comm_world) before mpi_finalize for all processes. For me, mpi terminates nicely only when all processes are calling mpi_finalize at the same time. So I do it for all my programs.

On Mon, Oct 25, 2010 at 7:13 AM, Jack Bryan <dtustud...@hotmail.com> wrote:

Thanks,

But, I have put a mpi_waitall(request) before

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;

If the above sentence has been printed out, it means that all requests have been checked and finished. right ?

What may be the possible reasons for that stuck ?

Any help is appreciated.

Jack

Oct. 25 2010

Date: Mon, 25 Oct 2010 05:32:44 -0400
From: terry.don...@oracle.com
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

So what you are saying is *all* the ranks have entered MPI_Finalize and only a subset has exited per placing prints before and after MPI_Finalize. Good. So my guess is that the processes stuck in MPI_Finalize have a prior MPI request outstanding that for whatever reason is unable to complete. So I would first look at all the MPI requests and make sure they completed.
--td On 10/25/2010 02:38 AM, Jack Bryan wrote: thanks I found a problem: I used: cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; MPI_Finalize(); cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl; return 0; I can get the output " I am rank 0 (1, 2, ) I am before MPI_Finalize() ". and " I am rank 0 I am after MPI_Finalize() " But, other processes do not printed out "I am rank ... I am after MPI_Finalize()" . It is weird. The process has reached the point just before MPI_Finalize(), why they are hanged there ? Are there other better ways to check this ? Any help is appreciated. thanks Jack Oct. 25 2010 From: solarbik...@gmail.com Date: Sun, 24 Oct 2010 19:47:54 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete how do you know all process call mpi_finalize? did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is maybe you had some MPI calls within your snippets that hangs your program, thus some of your processes never called mpi_finalize. On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com> wrote: Thanks, But, my code is too long to be posted. What are the common reasons of this kind of problems ? Any help is appreciated. Jack Oct. 24 2010 > From: g...@ldeo.columbia.edu > Date: Sun, 24 Oct 2010 18:09:52 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > Your co
Re: [OMPI users] Open MPI program cannot complete
I think I got this problem before. Put a mpi_barrier(mpi_comm_world) before mpi_finalize for all processes. For me, mpi terminates nicely only when all process are calling mpi_finalize the same time. So I do it for all my programs. On Mon, Oct 25, 2010 at 7:13 AM, Jack Bryan <dtustud...@hotmail.com> wrote: > Thanks, > But, I have put a mpi_waitall(request) before > > cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; > > If the above sentence has been printed out, it means that all requests have > been checked and finished. right ? > > What may be the possible reasons for that stuck ? > > Any help is appreciated. > > Jack > > Oct. 25 2010 > * > * > -- > Date: Mon, 25 Oct 2010 05:32:44 -0400 > From: terry.don...@oracle.com > > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > So what you are saying is *all* the ranks have entered MPI_Finalize and > only a subset has exited per placing prints before and after MPI_Finalize. > Good. So my guess is that the processes stuck in MPI_Finalize have a prior > MPI request outstanding that for whatever reason is unable to complete. So > I would first look at all the MPI requests and make sure they completed. > > --td > > On 10/25/2010 02:38 AM, Jack Bryan wrote: > > thanks > I found a problem: > > I used: > > cout << " I am rank " << rank << " I am before MPI_Finalize()" << > endl; > MPI_Finalize(); > cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl; > return 0; > > I can get the output " I am rank 0 (1, 2, ) I am before > MPI_Finalize() ". > > and > " I am rank 0 I am after MPI_Finalize() " > But, other processes do not printed out "I am rank ... I am after > MPI_Finalize()" . > > It is weird. The process has reached the point just before > MPI_Finalize(), why they are hanged there ? > > Are there other better ways to check this ? > > Any help is appreciated. > > thanks > > Jack > > Oct. 
25 2010 > > -- > From: solarbik...@gmail.com > Date: Sun, 24 Oct 2010 19:47:54 -0700 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > how do you know all process call mpi_finalize? did you have all of them > print out something before they call mpi_finalize? I think what Gustavo is > getting at is maybe you had some MPI calls within your snippets that hangs > your program, thus some of your processes never called mpi_finalize. > > On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com>wrote: > > Thanks, > > But, my code is too long to be posted. > > What are the common reasons of this kind of problems ? > > Any help is appreciated. > > Jack > > Oct. 24 2010 > > > From: g...@ldeo.columbia.edu > > Date: Sun, 24 Oct 2010 18:09:52 -0400 > > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > Hi Jack > > > > Your code snippet is too terse, doesn't show the MPI calls. > > It is hard to guess what is the problem this way. > > > > Gus Correa > > On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > > > > > Thanks for the reply. > > > But, I use mpi_waitall() to make sure that all MPI communications have > been done before a process call MPI_Finalize() and returns. > > > > > > Any help is appreciated. > > > > > > thanks > > > > > > Jack > > > > > > Oct. 24 2010 > > > > > > > From: g...@ldeo.columbia.edu > > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > > To: us...@open-mpi.org > > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > > > Hi Jack > > > > > > > > It may depend on "do some things". > > > > Does it involve MPI communication? > > > > > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > > > > > > > Gus Correa > > > > > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > > > > > Hi > > > > > > > > > > I got a problem of open MPI. > > > > > > > > > > My program has 5 processes. 
> > > > > > > > > > All of them can run MPI_Fina
Re: [OMPI users] Open MPI program cannot complete
Thanks, But, I have put a mpi_waitall(request) before cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; If the above sentence has been printed out, it means that all requests have been checked and finished. right ? What may be the possible reasons for that stuck ? Any help is appreciated. Jack Oct. 25 2010 List-Post: users@lists.open-mpi.org Date: Mon, 25 Oct 2010 05:32:44 -0400 From: terry.don...@oracle.com To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete Message body So what you are saying is *all* the ranks have entered MPI_Finalize and only a subset has exited per placing prints before and after MPI_Finalize. Good. So my guess is that the processes stuck in MPI_Finalize have a prior MPI request outstanding that for whatever reason is unable to complete. So I would first look at all the MPI requests and make sure they completed. --td On 10/25/2010 02:38 AM, Jack Bryan wrote: thanks I found a problem: I used: cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; MPI_Finalize(); cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl; return 0; I can get the output " I am rank 0 (1, 2, ) I am before MPI_Finalize() ". and " I am rank 0 I am after MPI_Finalize() " But, other processes do not printed out "I am rank ... I am after MPI_Finalize()" . It is weird. The process has reached the point just before MPI_Finalize(), why they are hanged there ? Are there other better ways to check this ? Any help is appreciated. thanks Jack Oct. 25 2010 From: solarbik...@gmail.com Date: Sun, 24 Oct 2010 19:47:54 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete how do you know all process call mpi_finalize? did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is maybe you had some MPI calls within your snippets that hangs your program, thus some of your processes never called mpi_finalize. 
On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com> wrote: Thanks, But, my code is too long to be posted. What are the common reasons of this kind of problems ? Any help is appreciated. Jack Oct. 24 2010 > From: g...@ldeo.columbia.edu > Date: Sun, 24 Oct 2010 18:09:52 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > Your code snippet is too terse, doesn't show the MPI calls. > It is hard to guess what is the problem this way. > > Gus Correa > On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > > > Thanks for the reply. > > But, I use mpi_waitall() to make sure that all MPI communications have been done before a process call MPI_Finalize() and returns. > > > > Any help is appreciated. > > > > thanks > > > > Jack > > > > Oct. 24 2010 > > > > > From: g...@ldeo.columbia.edu > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > To: us...@open-mpi.org > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > Hi Jack
Re: [OMPI users] Open MPI program cannot complete
So what you are saying is *all* the ranks have entered MPI_Finalize and only a subset has exited per placing prints before and after MPI_Finalize. Good. So my guess is that the processes stuck in MPI_Finalize have a prior MPI request outstanding that for whatever reason is unable to complete. So I would first look at all the MPI requests and make sure they completed. --td On 10/25/2010 02:38 AM, Jack Bryan wrote: thanks I found a problem: I used: cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; MPI_Finalize(); cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl; return 0; I can get the output " I am rank 0 (1, 2, ) I am before MPI_Finalize() ". and " I am rank 0 I am after MPI_Finalize() " But, other processes do not printed out "I am rank ... I am after MPI_Finalize()" . It is weird. The process has reached the point just before MPI_Finalize(), why they are hanged there ? Are there other better ways to check this ? Any help is appreciated. thanks Jack Oct. 25 2010 -------- From: solarbik...@gmail.com Date: Sun, 24 Oct 2010 19:47:54 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete how do you know all process call mpi_finalize? did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is maybe you had some MPI calls within your snippets that hangs your program, thus some of your processes never called mpi_finalize. On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com <mailto:dtustud...@hotmail.com>> wrote: Thanks, But, my code is too long to be posted. What are the common reasons of this kind of problems ? Any help is appreciated. Jack Oct. 
24 2010 > From: g...@ldeo.columbia.edu <mailto:g...@ldeo.columbia.edu> > Date: Sun, 24 Oct 2010 18:09:52 -0400 > To: us...@open-mpi.org <mailto:us...@open-mpi.org> > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > Your code snippet is too terse, doesn't show the MPI calls. > It is hard to guess what is the problem this way. > > Gus Correa > On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > > > Thanks for the reply. > > But, I use mpi_waitall() to make sure that all MPI communications have been done before a process call MPI_Finalize() and returns. > > > > Any help is appreciated. > > > > thanks > > > > Jack > > > > Oct. 24 2010 > > > > > From: g...@ldeo.columbia.edu <mailto:g...@ldeo.columbia.edu> > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > To: us...@open-mpi.org <mailto:us...@open-mpi.org> > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > Hi Jack > > > > > > It may depend on "do some things". > > > Does it involve MPI communication? > > > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > > > > > Gus Correa > > > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > > > Hi > > > > > > > > I got a problem of open MPI. > > > > > > > > My program has 5 processes. > > > > > > > > All of them can run MPI_Finalize() and return 0. > > > > > > > > But, the whole program cannot be completed. > > > > > > > > In the MPI cluster job queue, it is strill in running status. > > > > > > > > If I use 1 process to run it, no problem. > > > > > > > > Why ? > > > > > > > > My program: > > > > > > > > int main (int argc, char **argv) > > > > { > > > > > > > > MPI_Init(, ); > > > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > > > MPI_Comm_size(MPI_COMM_WORLD, ); > > > > MPI_Comm world; > > > > world = MPI_COMM_WORLD; > > > > > > > > if (myRank == 0) > > > > { > > > > do some things. > > > > } > > > > > > > > if (myRank
Re: [OMPI users] Open MPI program cannot complete
thanks

I found a problem:

I used:

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;
MPI_Finalize();
cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl;
return 0;

I can get the output " I am rank 0 (1, 2, ) I am before MPI_Finalize() ".

and " I am rank 0 I am after MPI_Finalize() "

But, other processes do not print out "I am rank ... I am after MPI_Finalize()" .

It is weird. The processes have reached the point just before MPI_Finalize(), why are they hung there ?

Are there other better ways to check this ?

Any help is appreciated.

thanks

Jack

Oct. 25 2010

From: solarbik...@gmail.com
Date: Sun, 24 Oct 2010 19:47:54 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

how do you know all process call mpi_finalize? did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is maybe you had some MPI calls within your snippets that hangs your program, thus some of your processes never called mpi_finalize.

On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com> wrote:

Thanks,

But, my code is too long to be posted.

What are the common reasons of this kind of problems ?

Any help is appreciated.

Jack

Oct. 24 2010

> From: g...@ldeo.columbia.edu
> Date: Sun, 24 Oct 2010 18:09:52 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> Hi Jack
>
> Your code snippet is too terse, doesn't show the MPI calls.
> It is hard to guess what is the problem this way.
>
> Gus Correa
> On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote:
>
> > Thanks for the reply.
> > But, I use mpi_waitall() to make sure that all MPI communications have been
> > done before a process call MPI_Finalize() and returns.
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct.
24 2010 > > > > > From: g...@ldeo.columbia.edu > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > To: us...@open-mpi.org > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > Hi Jack > > > > > > It may depend on "do some things". > > > Does it involve MPI communication? > > > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > > > > > Gus Correa > > > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > > > Hi > > > > > > > > I got a problem of open MPI. > > > > > > > > My program has 5 processes. > > > > > > > > All of them can run MPI_Finalize() and return 0. > > > > > > > > But, the whole program cannot be completed. > > > > > > > > In the MPI cluster job queue, it is strill in running status. > > > > > > > > If I use 1 process to run it, no problem. > > > > > > > > Why ? > > > > > > > > My program: > > > > > > > > int main (int argc, char **argv) > > > > { > > > > > > > > MPI_Init(, ); > > > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > > > MPI_Comm_size(MPI_COMM_WORLD, ); > > > > MPI_Comm world; > > > > world = MPI_COMM_WORLD; > > > > > > > > if (myRank == 0) > > > > { > > > > do some things. > > > > } > > > > > > > > if (myRank != 0) > > > > { > > > > do some things. > > > > MPI_Finalize(); > > > > return 0 ; > > > > } > > > > if (myRank == 0) > > > > { > > > > MPI_Finalize(); > > > > return 0; > > > > } > > > > > > > > } > > > > > > > > And, some output files get wrong codes, which can not be readible. > > > > In 1-process case, the program can print correct results to these > > > > output files . > > > > > > > > Any help is appreciated. > > > > > > > > thanks > > > > > > > > Jack > > > > > > > > Oct. 
24 2010

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI program cannot complete
thanks I used: cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl; MPI_Finalize(); return 0; I can get the output " I am rank 0 (1, 2, ) I am before MPI_Finalize() ". Are there other better ways to check this ? Any help is appreciated. thanks Jack Oct. 25 2010 From: solarbik...@gmail.com List-Post: users@lists.open-mpi.org Date: Sun, 24 Oct 2010 19:47:54 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI program cannot complete how do you know all process call mpi_finalize? did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is maybe you had some MPI calls within your snippets that hangs your program, thus some of your processes never called mpi_finalize. On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan <dtustud...@hotmail.com> wrote: Thanks, But, my code is too long to be posted. What are the common reasons of this kind of problems ? Any help is appreciated. Jack Oct. 24 2010 > From: g...@ldeo.columbia.edu > Date: Sun, 24 Oct 2010 18:09:52 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > Your code snippet is too terse, doesn't show the MPI calls. > It is hard to guess what is the problem this way. > > Gus Correa > On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > > > Thanks for the reply. > > But, I use mpi_waitall() to make sure that all MPI communications have been > > done before a process call MPI_Finalize() and returns. > > > > Any help is appreciated. > > > > thanks > > > > Jack > > > > Oct. 24 2010 > > > > > From: g...@ldeo.columbia.edu > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > To: us...@open-mpi.org > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > Hi Jack > > > > > > It may depend on "do some things". > > > Does it involve MPI communication? > > > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? 
> > > > > > Gus Correa > > > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > > > Hi > > > > > > > > I got a problem of open MPI. > > > > > > > > My program has 5 processes. > > > > > > > > All of them can run MPI_Finalize() and return 0. > > > > > > > > But, the whole program cannot be completed. > > > > > > > > In the MPI cluster job queue, it is strill in running status. > > > > > > > > If I use 1 process to run it, no problem. > > > > > > > > Why ? > > > > > > > > My program: > > > > > > > > int main (int argc, char **argv) > > > > { > > > > > > > > MPI_Init(, ); > > > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > > > MPI_Comm_size(MPI_COMM_WORLD, ); > > > > MPI_Comm world; > > > > world = MPI_COMM_WORLD; > > > > > > > > if (myRank == 0) > > > > { > > > > do some things. > > > > } > > > > > > > > if (myRank != 0) > > > > { > > > > do some things. > > > > MPI_Finalize(); > > > > return 0 ; > > > > } > > > > if (myRank == 0) > > > > { > > > > MPI_Finalize(); > > > > return 0; > > > > } > > > > > > > > } > > > > > > > > And, some output files get wrong codes, which can not be readible. > > > > In 1-process case, the program can print correct results to these > > > > output files . > > > > > > > > Any help is appreciated. > > > > > > > > thanks > > > > > > > > Jack > > > > > > > > Oct. 
24 2010

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI program cannot complete
Thanks, But, my code is too long to be posted. What are the common reasons of this kind of problems ? Any help is appreciated. Jack Oct. 24 2010 > From: g...@ldeo.columbia.edu > Date: Sun, 24 Oct 2010 18:09:52 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > Your code snippet is too terse, doesn't show the MPI calls. > It is hard to guess what is the problem this way. > > Gus Correa > On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > > > Thanks for the reply. > > But, I use mpi_waitall() to make sure that all MPI communications have been > > done before a process call MPI_Finalize() and returns. > > > > Any help is appreciated. > > > > thanks > > > > Jack > > > > Oct. 24 2010 > > > > > From: g...@ldeo.columbia.edu > > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > > To: us...@open-mpi.org > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > > > Hi Jack > > > > > > It may depend on "do some things". > > > Does it involve MPI communication? > > > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > > > > > Gus Correa > > > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > > > Hi > > > > > > > > I got a problem of open MPI. > > > > > > > > My program has 5 processes. > > > > > > > > All of them can run MPI_Finalize() and return 0. > > > > > > > > But, the whole program cannot be completed. > > > > > > > > In the MPI cluster job queue, it is strill in running status. > > > > > > > > If I use 1 process to run it, no problem. > > > > > > > > Why ? > > > > > > > > My program: > > > > > > > > int main (int argc, char **argv) > > > > { > > > > > > > > MPI_Init(, ); > > > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > > > MPI_Comm_size(MPI_COMM_WORLD, ); > > > > MPI_Comm world; > > > > world = MPI_COMM_WORLD; > > > > > > > > if (myRank == 0) > > > > { > > > > do some things. > > > > } > > > > > > > > if (myRank != 0) > > > > { > > > > do some things. 
> > > > MPI_Finalize(); > > > > return 0 ; > > > > } > > > > if (myRank == 0) > > > > { > > > > MPI_Finalize(); > > > > return 0; > > > > } > > > > > > > > } > > > > > > > > And, some output files get wrong codes, which can not be readible. > > > > In 1-process case, the program can print correct results to these > > > > output files . > > > > > > > > Any help is appreciated. > > > > > > > > thanks > > > > > > > > Jack > > > > > > > > Oct. 24 2010 > > > > > > > > ___ > > > > users mailing list > > > > us...@open-mpi.org > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Hi Jack Your code snippet is too terse, doesn't show the MPI calls. It is hard to guess what is the problem this way. Gus Correa On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote: > Thanks for the reply. > But, I use mpi_waitall() to make sure that all MPI communications have been > done before a process call MPI_Finalize() and returns. > > Any help is appreciated. > > thanks > > Jack > > Oct. 24 2010 > > > From: g...@ldeo.columbia.edu > > Date: Sun, 24 Oct 2010 17:31:11 -0400 > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > Hi Jack > > > > It may depend on "do some things". > > Does it involve MPI communication? > > > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > > > Gus Correa > > > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > > > Hi > > > > > > I got a problem of open MPI. > > > > > > My program has 5 processes. > > > > > > All of them can run MPI_Finalize() and return 0. > > > > > > But, the whole program cannot be completed. > > > > > > In the MPI cluster job queue, it is strill in running status. > > > > > > If I use 1 process to run it, no problem. > > > > > > Why ? > > > > > > My program: > > > > > > int main (int argc, char **argv) > > > { > > > > > > MPI_Init(, ); > > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > > MPI_Comm_size(MPI_COMM_WORLD, ); > > > MPI_Comm world; > > > world = MPI_COMM_WORLD; > > > > > > if (myRank == 0) > > > { > > > do some things. > > > } > > > > > > if (myRank != 0) > > > { > > > do some things. > > > MPI_Finalize(); > > > return 0 ; > > > } > > > if (myRank == 0) > > > { > > > MPI_Finalize(); > > > return 0; > > > } > > > > > > } > > > > > > And, some output files get wrong codes, which can not be readible. > > > In 1-process case, the program can print correct results to these output > > > files . > > > > > > Any help is appreciated. > > > > > > thanks > > > > > > Jack > > > > > > Oct. 
24 2010
Re: [OMPI users] Open MPI program cannot complete
Thanks for the reply. But, I use mpi_waitall() to make sure that all MPI communications have been done before a process call MPI_Finalize() and returns. Any help is appreciated. thanks Jack Oct. 24 2010 > From: g...@ldeo.columbia.edu > Date: Sun, 24 Oct 2010 17:31:11 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Hi Jack > > It may depend on "do some things". > Does it involve MPI communication? > > Also, why not put MPI_Finalize();return 0 outside the ifs? > > Gus Correa > > On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote: > > > Hi > > > > I got a problem of open MPI. > > > > My program has 5 processes. > > > > All of them can run MPI_Finalize() and return 0. > > > > But, the whole program cannot be completed. > > > > In the MPI cluster job queue, it is strill in running status. > > > > If I use 1 process to run it, no problem. > > > > Why ? > > > > My program: > > > > int main (int argc, char **argv) > > { > > > > MPI_Init(, ); > > MPI_Comm_rank(MPI_COMM_WORLD, ); > > MPI_Comm_size(MPI_COMM_WORLD, ); > > MPI_Comm world; > > world = MPI_COMM_WORLD; > > > > if (myRank == 0) > > { > > do some things. > > } > > > > if (myRank != 0) > > { > > do some things. > > MPI_Finalize(); > > return 0 ; > > } > > if (myRank == 0) > > { > > MPI_Finalize(); > > return 0; > > } > > > > } > > > > And, some output files get wrong codes, which can not be readible. > > In 1-process case, the program can print correct results to these output > > files . > > > > Any help is appreciated. > > > > thanks > > > > Jack > > > > Oct. 24 2010 > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Hi Jack

It may depend on "do some things".
Does it involve MPI communication?

Also, why not put MPI_Finalize();return 0 outside the ifs?

Gus Correa

On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote:

> Hi
>
> I have a problem with Open MPI.
>
> My program has 5 processes.
>
> All of them can run MPI_Finalize() and return 0.
>
> But the whole program cannot complete.
>
> In the MPI cluster job queue, it is still in running status.
>
> If I use 1 process to run it, no problem.
>
> Why?
>
> My program:
>
> int main (int argc, char **argv)
> {
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
>     MPI_Comm_size(MPI_COMM_WORLD, );
>     MPI_Comm world;
>     world = MPI_COMM_WORLD;
>
>     if (myRank == 0)
>     {
>         do some things.
>     }
>
>     if (myRank != 0)
>     {
>         do some things.
>         MPI_Finalize();
>         return 0 ;
>     }
>     if (myRank == 0)
>     {
>         MPI_Finalize();
>         return 0;
>     }
>
> }
>
> Also, some output files contain wrong codes and are not readable.
> In the 1-process case, the program prints correct results to these output
> files.
>
> Any help is appreciated.
>
> thanks
>
> Jack
>
> Oct. 24 2010
[OMPI users] Open MPI program cannot complete
Hi

I have a problem with Open MPI.

My program has 5 processes.

All of them can run MPI_Finalize() and return 0.

But the whole program cannot complete.

In the MPI cluster job queue, it is still in running status.

If I use 1 process to run it, no problem.

Why?

My program:

int main (int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, );
    MPI_Comm world;
    world = MPI_COMM_WORLD;

    if (myRank == 0){
        do some things.
    }

    if (myRank != 0){
        do some things.
        MPI_Finalize();
        return 0 ;
    }
    if (myRank == 0){
        MPI_Finalize();
        return 0;
    }
}

Also, some output files contain wrong codes and are not readable.
In the 1-process case, the program prints correct results to these output files.

Any help is appreciated.

thanks

Jack

Oct. 24 2010