Re: [OMPI users] check point restart
Thanks Lloyd, Ralph. Regarding Ralph's comment, >I don't understand the comment about printing and recompiling. Usually, people just have the app >write its intermediate results to a file, and provide a cmd line option .. right, I shouldn't have written "recompile". Writing intermediate results to a file probably wouldn't increase the communications overhead that much; I was just wondering if there might be something simpler. Erik
[OMPI users] check point restart
I run MPI on an NSF computer. One of the conditions of use is that jobs are limited to 24 hours' duration, to provide a democratic allotment of time to its users. A long program can require many restarts, so it becomes necessary to store the state of the program in memory, print it, recompile, and read the state back in to start again. I seem to remember a simpler approach (checkpoint restart?) in which the state of the .exe code is saved and then simply restarted from its current position. Is there something like this for restarting an MPI program? Thanks, Erik -- Erik Nelson Howard Hughes Medical Institute 6001 Forest Park Blvd., Room ND10.124 Dallas, Texas 75235-9050 p : 214 645 5981 f : 214 645 5948
Re: [OMPI users] qsub error
yep, runs well now. On Sat, Feb 16, 2013 at 6:50 AM, Jeff Squyres (jsquyres) wrote: > Glad you got it working! > > On Feb 15, 2013, at 6:53 PM, Erik Nelson wrote: > > I may have deleted any responses to this message. In either case, we appear to have fixed the problem by installing a more current version of openmpi. > > [snip: the original report is archived below] > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Erik Nelson Howard Hughes Medical Institute 6001 Forest Park Blvd., Room ND10.124 Dallas, Texas 75235-9050 p : 214 645 5981 f : 214 645 5948
Re: [OMPI users] qsub error
I may have deleted any responses to this message. In either case, we appear to have fixed the problem by installing a more current version of openmpi. On Thu, Feb 14, 2013 at 2:27 PM, Erik Nelson wrote: > [snip: the original report is archived below] -- Erik Nelson Howard Hughes Medical Institute 6001 Forest Park Blvd., Room ND10.124 Dallas, Texas 75235-9050 p : 214 645 5981 f : 214 645 5948
[OMPI users] qsub error
I'm encountering an error using qsub that none of us can figure out. MPI C++ programs seem to run fine when executed from the command line, but for some reason when I submit them through the queue I get a strange error message:

[compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied (13)

The compute node 3-12 doesn't matter (the error can be generated from any of the nodes, and I'm guessing that 3-12 is the parent node here). To check whether there was some problem with my own code, I created a simple 'hello world' program (see attached files). Again, the program runs fine from the command line but fails in qsub with the same sort of error message. I have included (i) the code, (ii) the job script for qsub, and (iii) the ".o" file from qsub for the "hello world" program. These don't look like MPI errors, but rather some conflict with, maybe, secure communication across nodes. Is there something simple I can do to fix this? Thanks, Erik Nelson Howard Hughes Medical Institute 6001 Forest Park Blvd., Room ND10.124 Dallas, Texas 75235-9050 p : 214 645 5981 f : 214 645 5948

#include <stdio.h>
#include <string.h>
#include "/opt/openmpi/include/mpi.h"

#define bufdim 128

int main(int argc, char *argv[])
{
    char buffer[bufdim];
    char id_str[32];

    // mpi :
    MPI::Init(argc, argv);
    MPI::Status status;

    int size;
    int rank;
    int tag;

    size = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();
    tag = 0;

    if (rank == 0) {
        printf("%d: we have %d processors\n", rank, size);
        int i;
        i = 1;
        // (the archived message was truncated from here on; the remainder is
        // the standard hello-world send/recv pattern reconstructed from context)
        for (; i < size; i++) {
            MPI::COMM_WORLD.Recv(buffer, bufdim, MPI::CHAR, i, tag, status);
            printf("%d: %s\n", rank, buffer);
        }
    }
    else {
        sprintf(id_str, "processor %d ", rank);
        strcpy(buffer, id_str);
        strcat(buffer, "reporting");
        MPI::COMM_WORLD.Send(buffer, strlen(buffer) + 1, MPI::CHAR, 0, tag);
    }

    MPI::Finalize();
    return 0;
}

hello.job Description: Binary data
hello.job.o5822590 Description: Binary data
Re: [OMPI users] restricting a job to a set of hosts
Reuti, >-nolocal is IMO an option where you want to execute the `mpirun` on your local login machine and want the MPI >processes to be allocated somewhere in the cluster, in case you don't have any queuing system around to manage >the resources. Yes, this is exactly my understanding of the -nolocal option. Otherwise, by specifying an 'image set' of processors, everything gets 'mapped' to some subset of processors in the image set. Again, thanks for your response. On Fri, Jul 27, 2012 at 5:15 AM, Reuti wrote: > Am 27.07.2012 um 03:21 schrieb Ralph Castain: > > > Application processes will *only* be placed on nodes included in the allocation. The -nolocal flag is intended to ensure that no application processes are started on the same node as mpirun in the case where that node is included in the allocation. This happens, for example, with Torque, where mpirun is executed on one of the allocated nodes. > > But the behavior is the same in Torque and SGE. The jobscript is executed on one of the elected exechosts (neither the submit host, nor the qmaster host [unless they are exechosts too]) and so eligible to be used too. In no case should -nolocal be used. > > -nolocal is IMO an option where you want to execute the `mpirun` on your local login machine and want the MPI processes to be allocated somewhere in the cluster, in case you don't have any queuing system around to manage the resources. > > -- Reuti > > > I believe SGE doesn't do that - and so the allocation won't include the submit host, in which case you don't need -nolocal. > > [snip: the earlier messages in this thread are archived below]
Re: [OMPI users] restricting a job to a set of hosts
I see. Thank you both for the prompt replies. On Thu, Jul 26, 2012 at 8:21 PM, Ralph Castain wrote: > Application processes will *only* be placed on nodes included in the allocation. The -nolocal flag is intended to ensure that no application processes are started on the same node as mpirun in the case where that node is included in the allocation. This happens, for example, with Torque, where mpirun is executed on one of the allocated nodes. > > I believe SGE doesn't do that - and so the allocation won't include the submit host, in which case you don't need -nolocal. > > [snip: the earlier messages in this thread are archived below]
Re: [OMPI users] restricting a job to a set of hosts
I was under the impression that the -nolocal option keeps processes off the submit host (since there may be hundreds or thousands of jobs submitted at any time, and we don't want this host to be overloaded). My understanding of what you said in your last email is that, by listing the hosts, I automatically send all processes (parent and child, or master and slave if you prefer) to the specified list of hosts. Reading your email below, it looks like this was the correct understanding. On Thu, Jul 26, 2012 at 5:20 PM, Reuti wrote: > Am 26.07.2012 um 23:58 schrieb Erik Nelson: > > [snip] > > Depends what you refer to by "parent node". I assume you mean the submit host. This is never included in any created selection of SGE unless it's an execution host too. > > The master host of the parallel job (i.e. the one where the jobscript with the `mpiexec` is running) will be used as a normal machine from MPI's point of view. > > -- Reuti > > [snip: the earlier messages in this thread are archived below]
Re: [OMPI users] restricting a job to a set of hosts
Reuti, Thank you. Our queue is backed up, so it will take a little while before I can try this. I assume that by specifying the nodes this way, I don't need to add -nolocal (and that adding it would confuse the system). In other words, qsub will try to put the parent node somewhere in this set. Is this the idea? Erik On Thu, Jul 26, 2012 at 4:48 PM, Reuti wrote: > Am 26.07.2012 um 23:33 schrieb Erik Nelson: > > [snip] > > > where jobfile contains the command > > > > mpirun -np 101 -nolocal ./executable > > I would leave -nolocal out here. > > $ qsub -l "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe mpich 101 jobfile.job > > -- Reuti > > [snip: the original question is archived below]
[OMPI users] restricting a job to a set of hosts
I have a purely parallel job that runs ~100 processes. Each process has ~identical overhead, so the speed of the program is dominated by the slowest processor. For this reason, I would like to restrict the job to a specific set of identical (fast) processors on our cluster. I read the FAQ on -hosts and -hostfile, but it is still unclear to me what effect these directives will have in a queuing environment. Currently, I submit the job using the "qsub" command in the "sge" environment as: qsub -pe mpich 101 jobfile.job where jobfile contains the command mpirun -np 101 -nolocal ./executable I would like to restrict the job to nodes compute-5-1 to compute-5-32 on our machine, each containing 8 CPUs (slots). How do I go about this? Thanks, Erik -- Erik Nelson Howard Hughes Medical Institute 6001 Forest Park Blvd., Room ND10.124 Dallas, Texas 75235-9050 p : 214 645 5981 f : 214 645 5948