Re: [OMPI users] Running on crashing nodes
In a word, no. If a node crashes, OMPI will abort the currently-running job if it had processes on that node. There is no current ability to "ride-thru" such an event. That said, there is work being done to support "ride-thru". Most of that is in the current developer's code trunk, and more is coming, but I wouldn't consider it production-quality just yet. Specifically, the code that does what you specify below is done and works. It is recovery of the MPI job itself (collectives, lost messages, etc.) that remains to be completed.

On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau wrote:
> Dear users,
>
> Our cluster has a number of nodes with a high probability of crashing, so it happens quite often that calculations stop due to one node going down. Maybe you know whether it is possible to block the crashed nodes at run time when running with OpenMPI? I am asking whether it is possible in principle to program such behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious about is the following:
>
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node goes down
> 3. The code realizes that the node is down (the results are lost) and excludes it from the list of nodes to run its tasks on
> 4. At a later moment the user restarts the crashed node
> 5. The code notices that the node is up again and puts it back on the list of active nodes
>
> Regards,
> Andrei
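For reference, the one piece an application can already control at the MPI level is how errors are reported: installing MPI_ERRORS_RETURN on the communicator makes a failed call return an error code instead of aborting through the default MPI_ERRORS_ARE_FATAL handler. A minimal sketch follows; note that with Open MPI of this vintage the runtime will still normally terminate the whole job when a node disappears, so this only changes error reporting and is not a substitute for the ride-thru work described above.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, rc, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report failures as return codes instead of aborting immediately. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: broadcast failed: %s\n", rank, msg);
        /* What, if anything, can be salvaged here is up to the application;
         * the standard gives no recovery guarantees after a failure. */
    }

    MPI_Finalize();
    return 0;
}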
Re: [OMPI users] function fgets hangs a mpi program when it is used ompi-ps command
ompi-ps talks to mpirun to get the info, and then pretty-prints it to stderr. Best guess is that it is having problems contacting mpirun. Are you running it on the same node as mpirun (a requirement, unless you pass it the full contact info)? Check the ompi-ps man page and also "ompi-ps -h" to ensure you are running it correctly. There may be options that would help to figure out what is wrong (I forget what they all are). On Thu, Sep 23, 2010 at 12:21 PM, Matheus Bersot Siqueira Barros < matheusberso...@gmail.com> wrote: > Jeff and Ralph, > > Thank you for your reply. > > 1) I'm not running on machines with OpenFabrics. > > 2) In my example, ompi-ps prints a maximum 82 bytes per line. Even so, I > augment to 300 bytes per line to be sure that it is not the problem. > > char mystring [300]; > ... > fgets (mystring , 300 , pFile); > > 2) When I run ps, it shows just two process: ps and bash. > PID TTY TIME CMD > 1961 pts/500:00:00 bash > 2154 pts/500:00:00 ps > > But when I run ps -a -l, it appears my program(test.run) and other > processes. I put below just the information related to my program. > > F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD > 0 S 1000 1841 1840 0 80 0 - 18054 pipe_w pts/000:00:00 test.run > 0 S 1000 1842 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run > 0 S 1000 1843 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run > 0 S 1000 1844 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run > > pipe_s = wait state on read/write against a pipe. > > So, with that command I concluded that one mpi process is waiting for the > read of a pipe. > > The problem still persists. > > Thanks, > Matheus. > > > On Wed, Sep 22, 2010 at 11:24 AM, Ralph Castainwrote: > >> Printouts of less than 100 bytes would be unusual...but possible >> >> >> On Wed, Sep 22, 2010 at 8:15 AM, Jeff Squyres wrote: >> >>> Are you running on machines with OpenFabrics devices (that Open MPI is >>> using)? >>> >>> Is ompi-ps printing 100 bytes or more? >>> >>> What does ps show when your program is hung? >>> >>> >>> >>> On Sep 17, 2010, at 3:13 PM, Matheus Bersot Siqueira Barros wrote: >>> >>> > Open MPI Version = 1.4.2 >>> > OS = Ubuntu 10.04 LTS and CentOS 5.3 >>> > >>> > When I run the mpi program below in the terminal, the function fgets >>> hangs. >>> > How do I know it? I do a printf before and later the call of fgets and >>> only the message "before fgets()" is showed. >>> > >>> > However, when I run the same program at Eclipse 3.6 with CDT >>> 7.0.0.201006141710 or using gdb it runs normally. >>> > If you change the command in the function popen to another one(for >>> instance: "ls -l"), it will run correctly. >>> > >>> > I use the following commands to compile and run the program: >>> > >>> > compile : mpicc teste.c -o teste.run >>> > >>> > run : mpirun -np 4 ./teste.run >>> > >>> > >>> > Does anyone know why the program behaves like that? >>> > >>> > Thanks in advance, >>> > >>> > Matheus Bersot. 
>>> > >>> > MPI_PROGRAM: >>> > >>> > #include >>> > #include "mpi.h" >>> > >>> > int main(int argc, char *argv[]) >>> > { >>> >int rank, nprocs; >>> >FILE * pFile = NULL; >>> >char mystring [100]; >>> > >>> > MPI_Init(,); >>> > MPI_Comm_size(MPI_COMM_WORLD,); >>> > MPI_Comm_rank(MPI_COMM_WORLD,); >>> > >>> >if(rank == 0) >>> >{ >>> >pFile = popen ("ompi-ps" , "r"); >>> >if (pFile == NULL) perror ("Error opening file"); >>> >else { >>> > while(!feof(pFile)) >>> > { >>> >printf("before fgets()\n"); >>> >fgets (mystring , 100 , pFile); >>> >printf("after fgets()\n"); >>> >puts (mystring); >>> > } >>> > pclose (pFile); >>> >} >>> > } >>> > >>> > MPI_Finalize(); >>> >return 0; >>> > } >>> > ___ >>> > users mailing list >>> > us...@open-mpi.org >>> > http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > > -- > - > "In moments of crisis, only the inspiration is more important than > knowledge." > (Albert Einstein) > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
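For reference, a compilable version of the test program quoted above (the archived listing dropped the header names and the address-of arguments in transit). It also loops on the return value of fgets() rather than on feof(), which is the usual idiom and avoids processing the last line twice; the reported hang on popen("ompi-ps") can still occur, since nothing here changes how the pipe is read.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, nprocs;
    FILE *pFile = NULL;
    char mystring[300];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Run ompi-ps and read its standard output through a pipe. */
        pFile = popen("ompi-ps", "r");
        if (pFile == NULL) {
            perror("Error opening pipe");
        } else {
            /* fgets() returns NULL at end-of-file or on error, so looping on
             * it directly avoids the feof() pitfall of handling the last
             * line twice. */
            while (fgets(mystring, sizeof(mystring), pFile) != NULL) {
                fputs(mystring, stdout);
            }
            pclose(pFile);
        }
    }

    MPI_Finalize();
    return 0;
}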
Re: [OMPI users] [openib] segfault when using openib btl
Eloi, I am curious about your problem. Can you tell me what size of job it is? Does it always fail on the same bcast, or same process? Eloi Gaudry wrote: Hi Nysal, Thanks for your suggestions. I'm now able to get the checksum computed and redirected to stdout, thanks (I forgot the "-mca pml_base_verbose 5" option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum) but I 'll let you know when I am. I've got two others question, which may be related to the error observed: 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI somehow depends on the btl being used (i.e. if I'm using openib, may I use the same number of MPI_Comm object as with tcp) ? Is there something as MPI_COMM_MAX in OpenMPI ? 2/ the segfaults only appears during a mpi collective call, with very small message (one int is being broadcast, for instance) ; i followed the guidelines given at http://icl.cs.utk.edu/open- mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build of OpenMPI asserts if I use a different min-size that 255. Anyway, if I deactivate eager_rdma, the segfaults remains. Does the openib btl handle very small message differently (even with eager_rdma deactivated) than tcp ? Others on the list does coalescing happen with non-eager_rdma? If so then that would possibly be one difference between the openib btl and tcp aside from the actual protocol used. is there a way to make sure that large messages and small messages are handled the same way ? Do you mean so they all look like eager messages? How large of messages are we talking about here 1K, 1M or 10M? --td Regards, Eloi On Friday 17 September 2010 17:57:17 Nysal Jan wrote: Hi Eloi, Create a debug build of OpenMPI (--enable-debug) and while running with the csum PML add "-mca pml_base_verbose 5" to the command line. This will print the checksum details for each fragment sent over the wire. I'm guessing it didnt catch anything because the BTL failed. The checksum verification is done in the PML, which the BTL calls via a callback function. In your case the PML callback is never called because the hdr->tag is invalid. So enabling checksum tracing also might not be of much use. Is it the first Bcast that fails or the nth Bcast and what is the message size? I'm not sure what could be the problem at this moment. I'm afraid you will have to debug the BTL to find out more. --Nysal On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudrywrote: Hi Nysal, thanks for your response. I've been unable so far to write a test case that could illustrate the hdr->tag=0 error. Actually, I'm only observing this issue when running an internode computation involving infiniband hardware from Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software. I checked, double-checked, and rechecked again every MPI use performed during a parallel computation and I couldn't find any error so far. The fact that the very same parallel computation run flawlessly when using tcp (and disabling openib support) might seem to indicate that the issue is somewhere located inside the openib btl or at the hardware/driver level. I've just used the "-mca pml csum" option and I haven't seen any related messages (when hdr->tag=0 and the segfaults occurs). Any suggestion ? Regards, Eloi On Friday 17 September 2010 16:03:34 Nysal Jan wrote: Hi Eloi, Sorry for the delay in response. I haven't read the entire email thread, but do you have a test case which can reproduce this error? 
Without that it will be difficult to nail down the cause. Just to clarify, I do not work for an iwarp vendor. I can certainly try to reproduce it on an IB system. There is also a PML called csum, you can use it via "-mca pml csum", which will checksum the MPI messages and verify it at the receiver side for any data corruption. You can try using it to see if it is able to catch anything. Regards --Nysal On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry wrote: Hi Nysal, I'm sorry to intrrupt, but I was wondering if you had a chance to look at this error. Regards, Eloi -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959 -- Forwarded message -- From: Eloi Gaudry To: Open MPI Users Date: Wed, 15 Sep 2010 16:27:43 +0200 Subject: Re: [OMPI users] [openib] segfault when using openib btl Hi, I was wondering if anybody got a chance to have a look at this issue. Regards, Eloi On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote: Hi Jeff, Please find enclosed the output (valgrind.out.gz) from /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
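In the absence of a standalone test case, a skeleton of the pattern being described (a very small, one-int broadcast repeated many times) might serve as a starting point. The iteration count and binary name are arbitrary, and there is no guarantee this triggers the hdr->tag=0 failure; running it once over openib (e.g. mpirun -np 2 --host pbn10,pbn11 --mca btl openib,self ./bcast_loop) and once with --mca btl tcp,self, optionally adding -mca pml csum on a debug build as Nysal suggests, would at least separate the transport from the application.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i, value;
    const int iterations = 100000;   /* arbitrary choice for a sketch */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < iterations; i++) {
        value = (rank == 0) ? i : -1;
        /* One int per broadcast, i.e. the very small payload described above. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("completed %d broadcasts\n", iterations);

    MPI_Finalize();
    return 0;
}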
Re: [OMPI users] "self scheduled" work & mpi receive???
Hi Ambrose, I'm interested in you work, i have a app to convert for myself and i don't know enough the MPI structure and syntaxe to make it... So if you wanna share your app i'm interested in taking a look at it!! Thanks and have a nice day!! Mikael Lavoie 2010/9/23 Lewis, Ambrose J.> Hi All: > > I’ve written an openmpi program that “self schedules” the work. > > The master task is in a loop chunking up an input stream and handing off > jobs to worker tasks. At first the master gives the next job to the next > highest rank. After all ranks have their first job, the master waits via an > MPI receive call for the next free worker. The master parses out the rank > from the MPI receive and sends the next job to this node. The jobs aren’t > all identical, so they run for slightly different durations based on the > input data. > > > > When I plot a histogram of the number of jobs each worker performed, the > lower mpi ranks are doing much more work than the higher ranks. For > example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 2. > My guess is that openmpi returns the lowest rank from the MPI Recv when > I’ve got MPI_ANY_SOURCE set and multiple sends have happened since the last > call. > > > > Is there a different Recv call to make that will spread out the data > better? > > > > THANXS! > > amb > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] "self scheduled" work & mpi receive???
That's a great suggestion...Thanks! amb -Original Message- From: users-boun...@open-mpi.org on behalf of Bowen Zhou Sent: Thu 9/23/2010 1:18 PM To: Open MPI Users Subject: Re: [OMPI users] "self scheduled" work & mpi receive??? > Hi All: > > I've written an openmpi program that "self schedules" the work. > > The master task is in a loop chunking up an input stream and handing off > jobs to worker tasks. At first the master gives the next job to the > next highest rank. After all ranks have their first job, the master > waits via an MPI receive call for the next free worker. The master > parses out the rank from the MPI receive and sends the next job to this > node. The jobs aren't all identical, so they run for slightly different > durations based on the input data. > > > > When I plot a histogram of the number of jobs each worker performed, the > lower mpi ranks are doing much more work than the higher ranks. For > example, in a 120 process run, rank 1 did 32 jobs while rank 119 only > did 2. My guess is that openmpi returns the lowest rank from the MPI > Recv when I've got MPI_ANY_SOURCE set and multiple sends have happened > since the last call. > > > > Is there a different Recv call to make that will spread out the data better? > > How about using MPI_Irecv? Let the master issue an MPI_Irecv for each worker and call MPI_Test to get the list of idle workers, then choose one from the idle list by some randomization? > > THANXS! > > amb > > > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
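A rough sketch of the suggested approach on the master side: pre-post one MPI_Irecv per worker, let MPI_Waitsome report which workers have finished (a loop over MPI_Test, as suggested above, would work the same way), and hand each of them a new job before re-arming its receive. The job payload, the initial round of sends, and the shutdown protocol are placeholders and are not shown.

#include <stdlib.h>
#include <mpi.h>

static void master_loop(int nprocs, int jobs_left)
{
    int nworkers = nprocs - 1;              /* workers are ranks 1..nprocs-1 */
    MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
    int *result = malloc(nworkers * sizeof(int));
    int *idx    = malloc(nworkers * sizeof(int));
    int i, ndone;

    for (i = 0; i < nworkers; i++)          /* worker i has rank i + 1 */
        MPI_Irecv(&result[i], 1, MPI_INT, i + 1, 0, MPI_COMM_WORLD, &reqs[i]);

    while (jobs_left > 0) {
        /* Blocks until at least one worker has reported back; idx holds the
         * indices of all receives that completed in this call. */
        MPI_Waitsome(nworkers, reqs, &ndone, idx, MPI_STATUSES_IGNORE);
        for (i = 0; i < ndone && jobs_left > 0; i++) {
            int w = idx[i];
            int job = jobs_left--;          /* placeholder "job" payload */
            MPI_Send(&job, 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD);
            /* Re-arm the receive for this worker's next completion. */
            MPI_Irecv(&result[w], 1, MPI_INT, w + 1, 0, MPI_COMM_WORLD, &reqs[w]);
        }
    }

    free(reqs); free(result); free(idx);
}

Serving every worker that reports in, rather than always the lowest matching rank, already spreads the jobs more evenly across workers.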
Re: [OMPI users] function fgets hangs a mpi program when it is used ompi-ps command
Jeff and Ralph, Thank you for your reply. 1) I'm not running on machines with OpenFabrics. 2) In my example, ompi-ps prints a maximum 82 bytes per line. Even so, I augment to 300 bytes per line to be sure that it is not the problem. char mystring [300]; ... fgets (mystring , 300 , pFile); 2) When I run ps, it shows just two process: ps and bash. PID TTY TIME CMD 1961 pts/500:00:00 bash 2154 pts/500:00:00 ps But when I run ps -a -l, it appears my program(test.run) and other processes. I put below just the information related to my program. F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 S 1000 1841 1840 0 80 0 - 18054 pipe_w pts/000:00:00 test.run 0 S 1000 1842 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run 0 S 1000 1843 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run 0 S 1000 1844 1840 0 80 0 - 18053 poll_s pts/000:00:00 test.run pipe_s = wait state on read/write against a pipe. So, with that command I concluded that one mpi process is waiting for the read of a pipe. The problem still persists. Thanks, Matheus. On Wed, Sep 22, 2010 at 11:24 AM, Ralph Castainwrote: > Printouts of less than 100 bytes would be unusual...but possible > > > On Wed, Sep 22, 2010 at 8:15 AM, Jeff Squyres wrote: > >> Are you running on machines with OpenFabrics devices (that Open MPI is >> using)? >> >> Is ompi-ps printing 100 bytes or more? >> >> What does ps show when your program is hung? >> >> >> >> On Sep 17, 2010, at 3:13 PM, Matheus Bersot Siqueira Barros wrote: >> >> > Open MPI Version = 1.4.2 >> > OS = Ubuntu 10.04 LTS and CentOS 5.3 >> > >> > When I run the mpi program below in the terminal, the function fgets >> hangs. >> > How do I know it? I do a printf before and later the call of fgets and >> only the message "before fgets()" is showed. >> > >> > However, when I run the same program at Eclipse 3.6 with CDT >> 7.0.0.201006141710 or using gdb it runs normally. >> > If you change the command in the function popen to another one(for >> instance: "ls -l"), it will run correctly. >> > >> > I use the following commands to compile and run the program: >> > >> > compile : mpicc teste.c -o teste.run >> > >> > run : mpirun -np 4 ./teste.run >> > >> > >> > Does anyone know why the program behaves like that? >> > >> > Thanks in advance, >> > >> > Matheus Bersot. >> > >> > MPI_PROGRAM: >> > >> > #include >> > #include "mpi.h" >> > >> > int main(int argc, char *argv[]) >> > { >> >int rank, nprocs; >> >FILE * pFile = NULL; >> >char mystring [100]; >> > >> > MPI_Init(,); >> > MPI_Comm_size(MPI_COMM_WORLD,); >> > MPI_Comm_rank(MPI_COMM_WORLD,); >> > >> >if(rank == 0) >> >{ >> >pFile = popen ("ompi-ps" , "r"); >> >if (pFile == NULL) perror ("Error opening file"); >> >else { >> > while(!feof(pFile)) >> > { >> >printf("before fgets()\n"); >> >fgets (mystring , 100 , pFile); >> >printf("after fgets()\n"); >> >puts (mystring); >> > } >> > pclose (pFile); >> >} >> > } >> > >> > MPI_Finalize(); >> >return 0; >> > } >> > ___ >> > users mailing list >> > us...@open-mpi.org >> > http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- - "In moments of crisis, only the inspiration is more important than knowledge." (Albert Einstein)
Re: [OMPI users] Question about Asynchronous collectives
CC stands for any Collective Communication operation. Every CC occurs on some communicator. Every CC is issued (basically, the thread the call is on enters the call) at some point in time. If two threads are issuing CC calls on the same communicator, the issue order can become ambiguous, so making CC calls from different threads but on the same communicator is generally unsafe. There is debate about whether it can be made safe by forcing some kind of thread serialization, but since the MPI standard does not discuss thread serialization, the best advice is to use a different communicator for each thread and be sure you have control of issue order.

When CC calls appear in some static order in a block of code that has no branches, issue order is simple to recognize. An example like this can cause problems unless you are sure every process has the same condition:

if (condition) {
    MPI_Ibcast; MPI_Ireduce
} else {
    MPI_Ireduce; MPI_Ibcast
}

If some ranks take the if and some ranks take the else, there is an "issue order" problem. (I do not have any idea why someone would do this.)

Dick

Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363

From: Gabriele Fatigati To: Open MPI Users Date: 09/23/2010 01:02 PM Subject: Re: [OMPI users] Question about Asynchronous collectives Sent by: users-boun...@open-mpi.org

Sorry Richard, what is CC issue order on the communicator? In particular, what does "CC" mean?

2010/9/23 Richard Treumann:

request_1 and request_2 are just local variable names. The only thing that determines matching order is CC issue order on the communicator. At each process, some CC is issued first and some CC is issued second. The first issued CC at each process will try to match the first issued CC at the other processes. By this rule,

rank 0: MPI_Ibcast; MPI_Ibcast
Rank 1: MPI_Ibcast; MPI_Ibcast

is well defined, and

rank 0: MPI_Ibcast; MPI_Ireduce
Rank 1: MPI_Ireduce; MPI_Ibcast

is incorrect.

I do not agree with Jeff on this below. The Proc 1 case where the MPI_Waits are reversed simply requires the MPI implementation to make progress on both MPI_Ibcast operations in the first MPI_Wait. The second MPI_Wait call will simply find that the first MPI_Ibcast is already done. The second MPI_Wait call becomes, effectively, a query function.

proc 0:
MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

proc 1:
MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

That may/will deadlock.
Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 From: Jeff Squyres To: Open MPI Users List-Post: users@lists.open-mpi.org Date: 09/23/2010 10:13 AM Subject: Re: [OMPI users] Question about Asynchronous collectives Sent by: users-boun...@open-mpi.org On Sep 23, 2010, at 10:00 AM, Gabriele Fatigati wrote: > to be sure, if i have one processor who does: > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast > > it means that i can't have another process who does the follow: > > MPI_IBcast(MPI_COMM_WORLD, request_2) // firt Bcast for another process > MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast for another process > > Because first Bcast of second process matches with first Bcast of first process, and it's wrong. If you did a "waitall" on both requests, it would probably work because MPI would just "figure it out". But if you did something like: proc 0: MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast MPI_Wait(_1, ...); MPI_Wait(_2, ...); proc 1: MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast MPI_Wait(_1, ...); MPI_Wait(_2, ...); That may/will deadlock. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Tecnologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.itTel: +39
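To make the issue-order rule concrete, here is a small sketch using the proposed MPI-3 nonblocking collectives discussed in this thread (they are not provided by Open MPI 1.4.x, so this is illustrative only). Every rank issues the broadcast first and the reduce second, so the two outstanding operations match unambiguously; once the issue order is the same everywhere, completing them with MPI_Waitall, or with two MPI_Wait calls in either order, is fine.

#include <mpi.h>

/* Sketch only: MPI_Ibcast/MPI_Ireduce follow the draft MPI-3 interface. */
void two_collectives(int rank)
{
    int a = 0, b = 1, sum = 0;
    MPI_Request req[2];

    if (rank == 0) { a = 42; }

    /* Same issue order on every rank: broadcast first, reduce second. */
    MPI_Ibcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Ireduce(&b, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &req[1]);

    /* Completion order is free once the issue order matches everywhere. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* Reversing the *issue* order (reduce before bcast) on only some ranks
     * would be the mismatched case described above. */
}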
[OMPI users] How to know which process is running on which core?
Hi all, I'm new to the list. I don't know if this topic has been treated before. My question is: is there a way in the OMPI library to report which process is running on which core in an SMP system? I need to know processor affinity for optimization issues. Regards, Fernando Saez
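One way to get this on Linux, assuming glibc 2.6 or later: each rank can ask the kernel with sched_getcpu() and print the answer next to its rank and hostname. This reports the core the process happens to be executing on at that instant; unless the processes are actually bound (for example with mpirun --mca mpi_paffinity_alone 1), the scheduler is free to move them afterwards.

#define _GNU_SOURCE
#include <sched.h>     /* sched_getcpu(), Linux/glibc specific */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Core the calling process is executing on right now. */
    printf("rank %d on %s is running on core %d\n", rank, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}

From outside the process, "ps -o pid,psr,comm" on Linux reports the processor each process last ran on.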
[OMPI users] Porting Open MPI to ARM: How essential is the opal_sys_timer_get_cycles() function?
Dear Open MPI, How essential is Open MPI's opal_sys_timer_get_cycles() function? It apparently needs to access a timestamp register directly. That is a trivial operation in PPC (mftb) or x86 (tsc), but the ARM processor apparently doesn't have a similar function in its instruction set. Is it critical that opal_sys_timer_get_cycles() be written in assembly? Would a hack written in C suffice? Sincerely yours, Ken Mighell Kenneth Mighell, Scientist National Optical Astronomy Observatory 950 North Cherry Avenue Tucson, AZ 85719 U.S.A. email: mighell_at_[hidden] voice: (520) 318-8391
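A C stand-in is certainly writable as a first step for a port; whether its overhead and resolution are acceptable for the places OPAL reads the cycle counter is a question for the developers. The sketch below (the function name is made up for illustration and is not the actual OPAL symbol) returns monotonic nanoseconds from clock_gettime() in place of a cycle count; gettimeofday() would be the coarser, more portable fallback.

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Illustrative fallback "timer" for a target with no cheap cycle-count
 * instruction: return monotonic nanoseconds instead of CPU cycles.
 * Link with -lrt on older glibc. */
static inline uint64_t sys_timer_get_cycles_fallback(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

The trade-off is that every read is a function or system call rather than a single register read, which matters if the timer is consulted in hot paths.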
Re: [OMPI users] "self scheduled" work & mpi receive???
Hi All: I’ve written an openmpi program that “self schedules” the work. The master task is in a loop chunking up an input stream and handing off jobs to worker tasks. At first the master gives the next job to the next highest rank. After all ranks have their first job, the master waits via an MPI receive call for the next free worker. The master parses out the rank from the MPI receive and sends the next job to this node. The jobs aren’t all identical, so they run for slightly different durations based on the input data. When I plot a histogram of the number of jobs each worker performed, the lower mpi ranks are doing much more work than the higher ranks. For example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 2. My guess is that openmpi returns the lowest rank from the MPI Recv when I’ve got MPI_ANY_SOURCE set and multiple sends have happened since the last call. Is there a different Recv call to make that will spread out the data better? How about using MPI_Irecv? Let the master issue an MPI_Irecv for each worker and call MPI_Test to get the list of idle workers, then choose one from the idle list by some randomization? THANXS! amb ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Question about Asynchronous collectives
Sorry Richard, what is CC issue order on the communicator?, in particular, "CC", what does it mean? 2010/9/23 Richard Treumann> > request_1 and request_2 are just local variable names. > > The only thing that determines matching order is CC issue order on the > communicator. At each process, some CC is issued first and some CC is > issued second. The first issued CC at each process will try to match the > first issued CC at the other processes. By this rule, > rank 0: > MPI_Ibcast; MPI_Ibcast > Rank 1; > MPI_Ibcast; MPI_Ibcast > is well defined and > > rank 0: > MPI_Ibcast; MPI_Ireduce > Rank 1; > MPI_Ireducet; MPI_Ibcast > is incorrect. > > I do not agree with Jeff on this below. The Proc 1 case where the > MPI_Waits are reversed simply requires the MPI implementation to make > progress on both MPI_Ibcast operations in the first MPI_Wait. The second > MPI_Wait call will simply find that the first MPI_Ibcast is already done. > The second MPI_Wait call becomes, effectively, a query function. > > proc 0: > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast > MPI_Wait(_1, ...); > MPI_Wait(_2, ...); > > proc 1: > MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast > MPI_Wait(_1, ...); > MPI_Wait(_2, ...); > > That may/will deadlock. > > > > > > Dick Treumann - MPI Team > IBM Systems & Technology Group > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 > Tele (845) 433-7846 Fax (845) 433-8363 > > > > From: > Jeff Squyres > To: Open MPI Users Date: 09/23/2010 10:13 AM Subject: Re: > [OMPI users] Question about Asynchronous collectives Sent by: > users-boun...@open-mpi.org > -- > > > > On Sep 23, 2010, at 10:00 AM, Gabriele Fatigati wrote: > > > to be sure, if i have one processor who does: > > > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast > > > > it means that i can't have another process who does the follow: > > > > MPI_IBcast(MPI_COMM_WORLD, request_2) // firt Bcast for another process > > MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast for another process > > > > Because first Bcast of second process matches with first Bcast of first > process, and it's wrong. > > If you did a "waitall" on both requests, it would probably work because MPI > would just "figure it out". But if you did something like: > > proc 0: > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast > MPI_Wait(_1, ...); > MPI_Wait(_2, ...); > > proc 1: > MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast > MPI_Wait(_1, ...); > MPI_Wait(_2, ...); > > That may/will deadlock. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Tecnologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.itTel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] "self scheduled" work & mpi receive???
Hi Lewis, On Thu, Sep 23, 2010 at 9:38 AM, Lewis, Ambrose J.wrote: > Hi All: > > I’ve written an openmpi program that “self schedules” the work. > > The master task is in a loop chunking up an input stream and handing off > jobs to worker tasks. At first the master gives the next job to the next > highest rank. After all ranks have their first job, the master waits via an > MPI receive call for the next free worker. The master parses out the rank > from the MPI receive and sends the next job to this node. The jobs aren’t > all identical, so they run for slightly different durations based on the > input data. > > > > When I plot a histogram of the number of jobs each worker performed, the > lower mpi ranks are doing much more work than the higher ranks. For > example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 2. > My guess is that openmpi returns the lowest rank from the MPI Recv when > I’ve got MPI_ANY_SOURCE set and multiple sends have happened since the last > call. > What is the time taken by each computation ? It is possible that computation time for longer tasks is much greater than computation time for shorter tasks ? > > > Is there a different Recv call to make that will spread out the data better? > > > > THANXS! > > amb > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Question about Asynchronous collectives
request_1 and request_2 are just local variable names. The only thing that determines matching order is CC issue order on the communicator. At each process, some CC is issued first and some CC is issued second. The first issued CC at each process will try to match the first issued CC at the other processes. By this rule,

rank 0: MPI_Ibcast; MPI_Ibcast
Rank 1: MPI_Ibcast; MPI_Ibcast

is well defined, and

rank 0: MPI_Ibcast; MPI_Ireduce
Rank 1: MPI_Ireduce; MPI_Ibcast

is incorrect.

I do not agree with Jeff on this below. The Proc 1 case where the MPI_Waits are reversed simply requires the MPI implementation to make progress on both MPI_Ibcast operations in the first MPI_Wait. The second MPI_Wait call will simply find that the first MPI_Ibcast is already done. The second MPI_Wait call becomes, effectively, a query function.

proc 0:
MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

proc 1:
MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

That may/will deadlock.

Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363

From: Jeff Squyres To: Open MPI Users Date: 09/23/2010 10:13 AM Subject: Re: [OMPI users] Question about Asynchronous collectives Sent by: users-boun...@open-mpi.org

On Sep 23, 2010, at 10:00 AM, Gabriele Fatigati wrote:
> to be sure, if i have one processor who does:
>
> MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
> MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
>
> it means that i can't have another process who does the following:
>
> MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast for another process
> MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast for another process
>
> Because first Bcast of second process matches with first Bcast of first process, and it's wrong.

If you did a "waitall" on both requests, it would probably work because MPI would just "figure it out". But if you did something like:

proc 0:
MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

proc 1:
MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

That may/will deadlock.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Question about Asynchronous collectives
On Sep 23, 2010, at 10:00 AM, Gabriele Fatigati wrote:
> to be sure, if i have one processor who does:
>
> MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
> MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
>
> it means that i can't have another process who does the following:
>
> MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast for another process
> MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast for another process
>
> Because first Bcast of second process matches with first Bcast of first process, and it's wrong.

If you did a "waitall" on both requests, it would probably work because MPI would just "figure it out". But if you did something like:

proc 0:
MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

proc 1:
MPI_IBcast(MPI_COMM_WORLD, request_2) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast
MPI_Wait(&request_1, ...);
MPI_Wait(&request_2, ...);

That may/will deadlock.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Question about Asynchronous collectives
Mm, to be sure, if i have one processor who does: MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast it means that i can't have another process who does the follow: MPI_IBcast(MPI_COMM_WORLD, request_2) // firt Bcast for another process MPI_IBcast(MPI_COMM_WORLD, request_1) // second Bcast for another process Because first Bcast of second process matches with first Bcast of first process, and it's wrong. Is it right? 2010/9/23 Jeff Squyres> On Sep 23, 2010, at 6:28 AM, Gabriele Fatigati wrote: > > > i'm studing the interfaces of new collective routines in next MPI-3, and > i've read that new collectives haven't any tag. > > Correct. > > > So all collective operations must follow the ordering rules for > collective calls. > > Also correct. > > > From what i understand, this means that i can't use: > > > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast > > No, not quite right. You can have multiple outstanding ibcast's -- they'll > just be satisfied in the same order in all participating MPI processes. > > > but is it possible to do this: > > > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > > MPI_IReducet(MPI_COMM_WORLD, request_2) // othwer collective > > Correct -- this is also possible. > > More generally, you can have multiple outstanding non-blocking collectives > on a single communicator -- it doesn't matter if they are the same or > different collective operations. They will each be unique instances and will > be satisfied in order. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Tecnologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.itTel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] Question about Asynchronous collectives
On Sep 23, 2010, at 6:28 AM, Gabriele Fatigati wrote: > i'm studing the interfaces of new collective routines in next MPI-3, and i've > read that new collectives haven't any tag. Correct. > So all collective operations must follow the ordering rules for collective > calls. Also correct. > From what i understand, this means that i can't use: > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast No, not quite right. You can have multiple outstanding ibcast's -- they'll just be satisfied in the same order in all participating MPI processes. > but is it possible to do this: > > MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast > MPI_IReducet(MPI_COMM_WORLD, request_2) // othwer collective Correct -- this is also possible. More generally, you can have multiple outstanding non-blocking collectives on a single communicator -- it doesn't matter if they are the same or different collective operations. They will each be unique instances and will be satisfied in order. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
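When keeping the issue order identical on every rank is awkward (for example, when the collectives come from logically independent parts of the code or from different threads), the option suggested elsewhere in this thread is to give each stream its own communicator via MPI_Comm_dup, since matching is decided per communicator. A sketch, again using the draft MPI-3 MPI_Ibcast interface and therefore illustrative only:

#include <mpi.h>

/* One duplicated communicator per independent stream of collectives, so two
 * concurrent broadcasts can never be matched against each other. */
void independent_bcasts(int rank)
{
    MPI_Comm comm_a, comm_b;
    int x = 0, y = 0;
    MPI_Request req[2];

    MPI_Comm_dup(MPI_COMM_WORLD, &comm_a);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_b);

    if (rank == 0) { x = 1; y = 2; }

    /* Matching is per communicator, so the relative order of these two
     * calls no longer has to be the same on every rank. */
    MPI_Ibcast(&x, 1, MPI_INT, 0, comm_a, &req[0]);
    MPI_Ibcast(&y, 1, MPI_INT, 0, comm_b, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    MPI_Comm_free(&comm_a);
    MPI_Comm_free(&comm_b);
}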
[OMPI users] "self scheduled" work & mpi receive???
Hi All: I've written an openmpi program that "self schedules" the work. The master task is in a loop chunking up an input stream and handing off jobs to worker tasks. At first the master gives the next job to the next highest rank. After all ranks have their first job, the master waits via an MPI receive call for the next free worker. The master parses out the rank from the MPI receive and sends the next job to this node. The jobs aren't all identical, so they run for slightly different durations based on the input data. When I plot a histogram of the number of jobs each worker performed, the lower mpi ranks are doing much more work than the higher ranks. For example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 2. My guess is that openmpi returns the lowest rank from the MPI Recv when I've got MPI_ANY_SOURCE set and multiple sends have happened since the last call. Is there a different Recv call to make that will spread out the data better? THANXS! amb
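For reference, a bare-bones sketch of the dispatch loop being described: the master blocks in MPI_Recv with MPI_ANY_SOURCE, reads the sender's rank out of the status, and sends that worker the next job. Job contents, the initial hand-out, and termination are placeholders. Note that the MPI standard does not guarantee fairness when several sends are simultaneously matchable by a wildcard receive, which is consistent with the imbalance reported above.

#include <mpi.h>

static void dispatch(int jobs_left)
{
    int result, worker, job;
    MPI_Status status;

    while (jobs_left > 0) {
        /* Wait for whichever worker reports back first. */
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &status);
        worker = status.MPI_SOURCE;   /* the rank that just became idle */
        job = jobs_left--;            /* placeholder job payload */
        MPI_Send(&job, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
    }
}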
[OMPI users] Running on crashing nodes
Dear users, Our cluster has a number of nodes with a high probability of crashing, so it happens quite often that calculations stop due to one node going down. Maybe you know whether it is possible to block the crashed nodes at run time when running with OpenMPI? I am asking whether it is possible in principle to program such behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious about is the following:

1. A code starts its tasks via mpirun on several nodes
2. At some moment one node goes down
3. The code realizes that the node is down (the results are lost) and excludes it from the list of nodes to run its tasks on
4. At a later moment the user restarts the crashed node
5. The code notices that the node is up again and puts it back on the list of active nodes

Regards, Andrei
[OMPI users] Question about Asynchronous collectives
Dear all, I'm studying the interfaces of the new collective routines in the upcoming MPI-3, and I've read that the new collectives don't have any tag. So all collective operations must follow the ordering rules for collective calls. From what I understand, this means that I can't use:

MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IBcast(MPI_COMM_WORLD, request_2) // second Bcast

but is it possible to do this:

MPI_IBcast(MPI_COMM_WORLD, request_1) // first Bcast
MPI_IReduce(MPI_COMM_WORLD, request_2) // other collective

In other words, I can't overlap the same collective more than once on one communicator, but is it possible with different collectives? Thanks a lot.

--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it Tel: +39 051 6171722
g.fatigati [AT] cineca.it
Re: [OMPI users] PathScale problems persist
You should probably take this up with Pathscale's support team. On Sep 23, 2010, at 3:56 AM, Rafael Arco Arredondo wrote: > I am using GCC 4.x: > > $ pathCC -v > PathScale(TM) Compiler Suite: Version 3.2 > Built on: 2008-06-16 16:41:38 -0700 > Thread model: posix > GNU gcc version 4.2.0 (PathScale 3.2 driver) > > $ pathCC -show-defaults > Optimization level and compilation target: > -O2 -mcpu=opteron -m64 -msse -msse2 -mno-sse3 -mno-3dnow -mno-sse4a > -gnu4 > > And I also tried with mpiCC -gnu4 to be totally sure. It's rather weird > that I get this error and Ake does not... > > I configured Open MPI with PathScale with the following line, by the > way: > > ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64 > --with-sge --enable-static CC=pathcc CXX=pathCC F77=pathf90 F90=pathf90 > FC=pathf90 > > And with GCC: > > ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64 > --with-sge --enable-static > > It's not an Infiniband or SGE issue. I also tried with all processes > running on the same node and without SGE. > > Best regards, > > Rafa > > On Wed, 2010-09-22 at 14:54 +0200, Ake Sandgren wrote: >> On Wed, 2010-09-22 at 14:16 +0200, Ake Sandgren wrote: >>> On Wed, 2010-09-22 at 07:42 -0400, Jeff Squyres wrote: This is a problem with the Pathscale compiler and old versions of >> GCC. See: >> http://www.open-mpi.org/faq/?category=building#pathscale-broken-with-mpi-c%2B%2B-api I note that you said you're already using GCC 4.x, but it's not >> clear from your text whether pathscale is using that compiler or a >> different GCC on the back-end. If you can confirm that pathscale *is* >> using GCC 4.x on the back-end, then this is worth reporting to the >> pathscale support people. >>> >>> I have no problem running the code below compiled with openmpi 1.4.2 >> and >>> pathscale 3.2. >> >> And i should of course have specified that this is with a GCC4.x >> backend. > -- > Rafael Arco Arredondo > Centro de Servicios de Informática y Redes de Comunicaciones > Campus de Fuentenueva - Edificio Mecenas > Universidad de Granada > E-18071 Granada Spain > Tel: +34 958 241010 Ext:31114 E-mail: rafaa...@ugr.es > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] PathScale problems persist
I am using GCC 4.x: $ pathCC -v PathScale(TM) Compiler Suite: Version 3.2 Built on: 2008-06-16 16:41:38 -0700 Thread model: posix GNU gcc version 4.2.0 (PathScale 3.2 driver) $ pathCC -show-defaults Optimization level and compilation target: -O2 -mcpu=opteron -m64 -msse -msse2 -mno-sse3 -mno-3dnow -mno-sse4a -gnu4 And I also tried with mpiCC -gnu4 to be totally sure. It's rather weird that I get this error and Ake does not... I configured Open MPI with PathScale with the following line, by the way: ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64 --with-sge --enable-static CC=pathcc CXX=pathCC F77=pathf90 F90=pathf90 FC=pathf90 And with GCC: ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64 --with-sge --enable-static It's not an Infiniband or SGE issue. I also tried with all processes running on the same node and without SGE. Best regards, Rafa On Wed, 2010-09-22 at 14:54 +0200, Ake Sandgren wrote: > On Wed, 2010-09-22 at 14:16 +0200, Ake Sandgren wrote: > > On Wed, 2010-09-22 at 07:42 -0400, Jeff Squyres wrote: > > > This is a problem with the Pathscale compiler and old versions of > GCC. See: > > > > > > > http://www.open-mpi.org/faq/?category=building#pathscale-broken-with-mpi-c%2B%2B-api > > > > > > I note that you said you're already using GCC 4.x, but it's not > clear from your text whether pathscale is using that compiler or a > different GCC on the back-end. If you can confirm that pathscale *is* > using GCC 4.x on the back-end, then this is worth reporting to the > pathscale support people. > > > > I have no problem running the code below compiled with openmpi 1.4.2 > and > > pathscale 3.2. > > And i should of course have specified that this is with a GCC4.x > backend. -- Rafael Arco Arredondo Centro de Servicios de Informática y Redes de Comunicaciones Campus de Fuentenueva - Edificio Mecenas Universidad de Granada E-18071 Granada Spain Tel: +34 958 241010 Ext:31114 E-mail: rafaa...@ugr.es