Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t
(Sorry for the delay, I missed the C/R question in the mail)

On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
> On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
>> | > 2) I have installed blcr V0.8.2 but when I try to build OMPI and I point to the
>> | > full installation it complains it cannot find it. Note that I build BLCR with
>> | > GCC but I am building OMPI with Intel compilers (V11.1)
>> |
>> | Can you be more specific here?
>>
>> I pointed to the installation path for BLCR but config complained that it
>> couldn't find it. If BLCR is only needed for checkpoint / restart then we can
>> live without it. Is BLCR needed for suspend/resume of mpi jobs ?
>
> You mean suspend with ctrl-Z? If so, correct -- BLCR is *only* used for
> checkpoint/restart. Ctrl-Z just uses the SIGTSTP functionality.

So BLCR is used for the checkpoint/restart functionality in Open MPI. We have a
webpage with some more details and examples at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/

You should be able to suspend/resume an Open MPI job using SIGSTOP/SIGCONT
without the C/R functionality. We have an FAQ item that talks about how to
enable this functionality:
  http://www.open-mpi.org/faq/?category=running#suspend-resume

You can combine the C/R and the SIGSTOP/SIGCONT functionality so that when you
'suspend' a job a checkpoint is taken and the process is stopped. You can
continue the job by sending SIGCONT as normal. Additionally, this way if the
job needs to be terminated for some reason (e.g., memory footprint,
maintenance), it can be safely terminated and restarted from the checkpoint.
I have an example of how this works at the link below:
  http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-stop

As far as C/R integration with schedulers/resource managers, I know that the
BLCR folks have been working with Torque to better integrate Open
MPI+BLCR+Torque. If this is of interest, you might want to check with them on
the progress of that project.

-- Josh
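The suspend/resume path described above relies only on ordinary job-control signals. As a minimal sketch, assuming a POSIX system: a forked child stands in for one process of a job, and the parent stops and continues it with SIGSTOP/SIGCONT. The function name suspend_resume_demo is made up for this illustration, and nothing here involves Open MPI or BLCR.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if the child was observed stopped and then
 * continued, -1 otherwise. */
int suspend_resume_demo(void)
{
    int status = 0;
    int ok = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause();               /* child: idle until it is killed */
    }

    /* Suspend the child and wait until the kernel reports it as stopped. */
    if (kill(child, SIGSTOP) == 0 &&
        waitpid(child, &status, WUNTRACED) == child && WIFSTOPPED(status)) {
        /* Resume it and wait until the kernel reports it as continued. */
        if (kill(child, SIGCONT) == 0 &&
            waitpid(child, &status, WCONTINUED) == child &&
            WIFCONTINUED(status))
            ok = 1;
    }

    /* Clean up the stand-in process. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return ok ? 0 : -1;
}
```

Unlike SIGSTOP, the SIGTSTP signal sent by Ctrl-Z can be caught by a process, which is what makes it possible for a launcher to react to it.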
Re: [OMPI users] Building from the SRPM version creates an rpm with stripped libraries
Thanks Ashley,

that did work, though I must say that

  %define __strip /bin/true

is NOT very intuitive! I did get my symbols in the needed libraries, but
unfortunately, at least for the compiler I used to build, I still have a
typedef undefined, and that also prevents that method of launching TV. But we
have our own workarounds for that.

Cheers,
PeterT

Ashley Pittman wrote:
> This is a standard rpm feature although like most things it can be disabled.
> According to this mail and its replies the two %defines below will prevent
> stripping and the building of debuginfo rpms.
>
> http://lists.rpm.org/pipermail/rpm-list/2009-January/000122.html
>
> %define debug_package %{nil}
> %define __strip /bin/true
>
> Ashley.
>
> On 25 May 2010, at 00:25, Peter Thompson wrote:
>> I have a user who prefers building rpm's from the srpm. That's okay, but for
>> debugging via TotalView it creates a version with the openmpi .so files
>> stripped and we can't gain control of the processes when launched via
>> mpirun -tv. I've verified this with my own build of a 1.4.1 rpm which I then
>> installed and noticed the same behavior that the user reports. I was hoping
>> to give them some advice as to how to avoid the stripping, as it appears
>> that the actual build of those libraries is done with -g and everything
>> looks fine. But I can't figure out in the build (from the log file I
>> created) just where that stripping takes place, or how to get around it if
>> need be. The best guess I have is that it may be happening at the very end
>> when an rpm-tmp file is executed, but that file has disappeared so I don't
>> really know what it does. I thought it might be apparent in the spec file,
>> but it's certainly not apparent to me! Any help or advice would be
>> appreciated.
Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t
Hi Josh,

thanks for the reply. Please see below ...

On 05/26/10 09:24, Josh Hursey wrote:
> (Sorry for the delay, I missed the C/R question in the mail)
>
> On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
>> On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
>>> | > 2) I have installed blcr V0.8.2 but when I try to build OMPI and I point to the
>>> | > full installation it complains it cannot find it. Note that I build BLCR with
>>> | > GCC but I am building OMPI with Intel compilers (V11.1)
>>> |
>>> | Can you be more specific here?
>>>
>>> I pointed to the installation path for BLCR but config complained that it
>>> couldn't find it. If BLCR is only needed for checkpoint / restart then we
>>> can live without it. Is BLCR needed for suspend/resume of mpi jobs ?
>>
>> You mean suspend with ctrl-Z? If so, correct -- BLCR is *only* used for
>> checkpoint/restart. Ctrl-Z just uses the SIGTSTP functionality.
>
> So BLCR is used for the checkpoint/restart functionality in Open MPI. We have
> a webpage with some more details and examples at the link below:
>   http://osl.iu.edu/research/ft/ompi-cr/
>
> You should be able to suspend/resume an Open MPI job using SIGSTOP/SIGCONT
> without the C/R functionality. We have an FAQ item that talks about how to
> enable this functionality:
>   http://www.open-mpi.org/faq/?category=running#suspend-resume
>
> You can combine the C/R and the SIGSTOP/SIGCONT functionality so that when
> you 'suspend' a job a checkpoint is taken and the process is stopped. You can
> continue the job by sending SIGCONT as normal. Additionally, this way if the
> job needs to be terminated for some reason (e.g., memory footprint,
> maintenance), it can be safely terminated and restarted from the checkpoint.
> I have an example of how this works at the link below:
>   http://osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-stop
>
> As far as C/R integration with schedulers/resource managers, I know that the
> BLCR folks have been working with Torque to better integrate Open
> MPI+BLCR+Torque. If this is of interest, you might want to check with them on
> the progress of that project.

So suspend/resume of OpenMPI jobs does not require BLCR. OK, so I will proceed
w/o it.

best regards,

Michael

> -- Josh

--
Michael E. Thomadakis, Ph.D.
Senior Lead Supercomputer Engineer/Res
Texas A&M University, Supercomputing Center
Teague Research Center, 104B
College Station, TX 77843, USA
E-mail: miket AT tamu DOT edu
web: http://alphamike.tamu.edu
Voice: 979-862-3931
FAX: 979-847-8643
Re: [OMPI users] Some Questions on Building OMPI on Linux Em64t
Hi Jeff,

thanks for the reply. Please see below.

And a new question: How do you handle thread/task and memory affinity? Do you
pass the requested affinity desires to the batch scheduler and then let it
issue the specific placements for threads to the nodes ?

This is something we are concerned about, as we are running multiple jobs on
the same node and we don't want to oversubscribe cores by binding their threads
inadvertently.

Looking at ompi_info

  $ ompi_info | grep -i aff
      MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
      MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)

does this mean we have the full affinity support included or do I need to
involve HWLOC in any way ?

On 05/25/10 08:35, Jeff Squyres wrote:
> On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
>> |> 1) high-resolution timers: how do I specify the HRT linux timers in the
>> |> --with-timer=TYPE
>> |> line of ./configure ?
>> |
>> | You shouldn't need to do anything; the "linux" timer component of Open MPI
>> | should get automatically selected. You should be able to see this in the
>> | stdout of Open MPI's "configure", and/or if you run ompi_info | grep timer
>> | -- there should only be one entry: linux.
>>
>> If nothing is mentioned, will it by default select 'linux' timers?
>
> Yes.
>
>> Or I have to specify in the configure --with-timer=linux ?
>
> Nope. The philosophy of Open MPI is that whenever possible, we try to choose
> a sensible default. It never hurts to double check, but we try to do the
> Right Thing whenever it's possible to automatically choose it (within reason,
> of course).
>
> You can also check the output of ompi_info -- ompi_info tells you lots of
> things about your Open MPI installation.
>
>> I actually spent some time looking around in the source trying to see which
>> actual timer is the base. Is this a high-resolution timer such as a POSIX
>> timers (timer_gettime or clock_nanosleep, etc.) or Intel processor's TSC ?
>> I am just trying to stay away from gettimeofday()
>
> Understood.
>
> Ugh; I just poked into the code -- it's complicated how we resolve the timer
> functions. It looks like we put in the infrastructure into getting high
> resolution timers, but at least for Linux, we don't use it (the code falls
> back to gettimeofday). It looks like we're only using the high-resolution
> timers on AIX (!) and Solaris.
>
> Patches would be greatly appreciated; I'd be happy to walk someone through
> what to do.

Which HRtimer is recommended for a Linux environment ? timer_gettime usually
gives decent resolution and it is portable. I don't want to promise anything as
I am already bogged down with several ongoing projects. You can give me *brief*
instructions to see if this can be squeezed in.

...

>> Just as a feedback from one of the many HPC centers, for us it is most
>> important to have
>> a) a light-weight efficient MPI stack which makes the underlying IB h/w
>>    capabilities available and
>> b) it can smoothly cooperate with a batch scheduler / resource manager so
>>    that a mixture of jobs get a decent allocation of the cluster resources.
>
> Cool; good to know. We try to make these things very workable in Open MPI --
> it's been a goal from day 1 to integrate with job schedulers, etc. And
> without high performance, we wouldn't have much to talk about.
>
> Please be sure to let us know of questions / problems / etc. I admit that
> we're sometimes a little slow to answer on the users list, but we do the best
> we can. So don't hesitate to bump us if we don't reply.
>
> Thanks!

Thanks again...

michael
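As a rough illustration of the kind of timer discussed in this thread, here is a minimal sketch of reading a high-resolution clock on Linux without gettimeofday(). It uses the POSIX calls clock_gettime() and clock_getres() with CLOCK_MONOTONIC. That is a swap-in for the timer_gettime suggestion above, since timer_gettime reports the time remaining on an armed POSIX timer rather than the current time. The helper names hr_wtime and hr_wtick are made up for this example; this is not Open MPI's timer component.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: seconds since an arbitrary fixed point, read from the
 * monotonic clock. Returns a negative value if the clock cannot be read. */
double hr_wtime(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: resolution of that clock in seconds, or a negative
 * value on failure. */
double hr_wtick(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec * 1e-9;
}
```

A region would be timed by calling hr_wtime() before and after it and subtracting the two values. On glibc releases older than 2.17 the program may need to be linked with -lrt.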
Re: [OMPI users] Deadlock question
On May 24, 2010, at 19:42 , Gijsbert Wiesenekker wrote:
> My MPI program consists of a number of processes that send 0 or more messages
> (using MPI_Isend) to 0 or more other processes. The processes check
> periodically if messages are available to be processed. It was running fine
> until I increased the message size, and I got deadlock problems. Googling
> taught me I was running into a classic deadlock problem (see for example
> http://www.cs.ucsb.edu/~hnielsen/cs140/mpi-deadlocks.html). The workarounds
> suggested like changing the order of MPI_Send and MPI_Recv do not work in my
> case, as it could be that one processor does not send any message at all to
> the other processes, so MPI_Recv would wait indefinitely.
> Any suggestions on how to avoid deadlock in this case?
>
> Thanks,
> Gijsbert
>

An approach that seems to work in my case is the following:
I was using separate message-tags for 'update_message' and 'no_more_messages'.
All these were sent asynchronously. The receive code in pseudo-code looked
like:
--
if (probe_for_update_message() == FALSE)
{
  if (probe_for_no_more_messages() == TRUE)
  {
    //we are done
  }
  else
  {
    //do some work
  }
}
else
{
  //process update message
}
--
The problem with this receive code was that in between the
probe_for_update_message() and the probe_for_no_more_messages() a processor
could send several update messages, followed by 'no_more_messages', so I still
needed to check for any pending update messages after a
probe_for_no_more_messages(), which complicated handling deadlock.

So I first created a special update message that signals 'no_more_messages',
which simplified the receive code to:
--
//probe_for_update_message() returns INVALID if no more messages,
//TRUE if message, FALSE if not
if ((result = probe_for_update_message()) == INVALID)
{
  //we are done
}
else if (result == TRUE)
{
  //process update message
}
else //result == FALSE
{
  //do some work
}
--
Now to deal with the deadlock I first created a function
recv_update_message() that probes for update messages and pushes them onto a
FIFO queue (for several reasons I cannot process the update message right
away). In pseudo-code:
--
int recv_update_message()
{
  int result;
  if ((result = probe_for_update_message()) == TRUE)
    queue(update_message);
  return(result);
}
--
The asynchronous send code in pseudo-code looks like:
--
MPI_Isend(update_message, &request);
while(TRUE)
{
  //deal with deadlock
  //I assume my deadlocks are caused by running out of system buffer space
  //hopefully polling pending update messages frees up buffer space
  recv_update_message();
  MPI_Test(&request, &flag);
  if (flag) break;
}
--
The asynchronous receive code in pseudo-code looks like:
--
//first check the FIFO queue
if (dequeue(update_message))
  return(TRUE);
else
{
  int result;
  if ((result = recv_update_message()) == INVALID)
    return(INVALID);
  if (result == TRUE)
    dequeue(update_message);
  return(result);
}
--
As a further refinement I use a queue per processor, and recv_update_message
first tries to receive messages for the least used queues, but if deadlock is
detected it tries to receive messages for all queues:
--
MPI_Isend(update_message, &request);
nwaitx = 16; //threshold for deadlock
nwait = 0;
while(TRUE)
{
  if (nwait > 2 * nwaitx)
  {
    printf("possible deadlock detected\n");
    nwaitx = nwait;
    recv_update_message(all_queues);
  }
  else
  {
    recv_update_message(least_used_queues_only);
  }
  MPI_Test(&request, &flag);
  if (flag) break;
  nwait++;
}
--

Gijsbert
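The queueing logic described above can be sketched in plain C without any MPI calls. This is only an illustration under stated assumptions: the network is replaced by a scripted array (struct inbox), messages are plain ints, and the NO_MORE_MESSAGES sentinel value, the enum names and the fixed queue capacity are all made up for the example. Real code would implement probe_for_update_message() with MPI probe and receive calls. The sketch also assumes the 'no more messages' marker stays pending when probed, so that detecting it while polling during a send does not lose it.

```c
#include <stddef.h>

/* Tri-state probe result, mirroring INVALID / FALSE / TRUE in the pseudo-code. */
enum probe_result { PROBE_DONE = -1, PROBE_NONE = 0, PROBE_MESSAGE = 1 };

#define NO_MORE_MESSAGES (-1)  /* hypothetical sentinel payload */
#define QUEUE_CAP 64           /* hypothetical fixed capacity */

/* Fixed-size FIFO of update messages that cannot be processed right away. */
struct fifo { int items[QUEUE_CAP]; size_t head, count; };

int fifo_push(struct fifo *q, int msg)
{
    if (q->count == QUEUE_CAP) return 0;              /* full */
    q->items[(q->head + q->count) % QUEUE_CAP] = msg;
    q->count++;
    return 1;
}

int fifo_pop(struct fifo *q, int *msg)
{
    if (q->count == 0) return 0;                      /* empty */
    *msg = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 1;
}

/* Scripted stand-in for the network (assumption, replaces MPI). */
struct inbox { const int *msgs; size_t len, pos; };

enum probe_result probe_for_update_message(struct inbox *in, int *msg)
{
    if (in->pos >= in->len) return PROBE_NONE;
    if (in->msgs[in->pos] == NO_MORE_MESSAGES)
        return PROBE_DONE;          /* sentinel is left in place on purpose */
    *msg = in->msgs[in->pos++];
    return PROBE_MESSAGE;
}

/* Called while a send is pending: move one incoming message into the queue. */
enum probe_result recv_update_message(struct inbox *in, struct fifo *q)
{
    int msg;
    enum probe_result r = probe_for_update_message(in, &msg);

    if (r == PROBE_MESSAGE && !fifo_push(q, msg)) {
        in->pos--;                  /* queue full: leave message in the inbox */
        return PROBE_NONE;
    }
    return r;
}

/* Receive side: serve queued messages first, then look at the network. */
enum probe_result next_update_message(struct inbox *in, struct fifo *q, int *msg)
{
    enum probe_result r;

    if (fifo_pop(q, msg)) return PROBE_MESSAGE;
    r = recv_update_message(in, q);
    if (r == PROBE_MESSAGE) fifo_pop(q, msg);
    return r;
}
```

In the send loop, recv_update_message() would be called between MPI_Test() polls, and the main loop would call next_update_message() to pick up work in arrival order.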