Re: [OMPI devel] behavior of the r2 component
Many thanks for the fast reply!!! I checked again, but it doesn't become clear. :(

> Looks like you missed the bitmap.

I ignored the bitmap on purpose. The only lines where the bitmap is changed are in the loop where all BTL modules are iterated, something like this:

    for each btl {
        ompi_bitmap_clear_all_bits(reachable);                 /* line 229 */
        rc = btl->btl_add_procs(btl, n_new_procs, new_procs,
                                btl_endpoints, reachable);     /* line 232 */
    }

So when the add_procs function of the r2 component returns, the bitmap holds the information about which processes are reachable by the *last* BTL -- here, the BTL with the lowest exclusivity. I could not imagine what purpose that should have, so I ignored it.

> Every time one of the endpoints is reachable, the corresponding bit in
> the bitmap is set to one.

With "endpoint is reachable" do you mean that the process is reachable? I believe the r2 function shows a different behavior: the bitmap only holds the information from the last BTL.

I want to add here that I'm not too familiar with C, so I think I made a mistake in my last mail. mca_bml_r2_add_procs() creates a new array of processes, holding only the processes which are really new -- but it does NOT return it. (I was confused by the pointers. Sorry.) The endpoints in the bml_endpoints array correspond to the processes in the new array, so they do not correspond to the processes in the array the upper level holds. With the bitmap it is the same.

> The upper level re-parses the bitmap and it will detect the number of
> registered BTLs.

Sorry, but I don't understand this.

The more I think about it, the more I believe that the behavior of the add_procs function in the bml framework should be something like this: when the function returns, procs holds all processes; bml_endpoints holds the endpoints, each corresponding to one process in the procs array; and the corresponding bit in the bitmap is set when the BML can reach that process. Is that right?
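In case it helps, here is a toy model of the bitmap contract I have in mind. It is only a sketch: plain C with an integer bitmask standing in for ompi_bitmap_t, and the reachability values are made up.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of the expected contract: each BTL reports which of the
       processes it can reach, and the caller's bitmap should end up as
       the union over all BTLs.  A uint64_t stands in for ompi_bitmap_t;
       the reachability values below are invented. */
    #define NPROCS 4
    #define NBTLS  2

    int main(void) {
        /* what each btl_add_procs() call would report */
        uint64_t btl_reach[NBTLS] = { 0x5 /* procs 0 and 2 */,
                                      0xA /* procs 1 and 3 */ };
        uint64_t reachable = 0;

        for (int b = 0; b < NBTLS; b++) {
            /* the current r2 code amounts to: reachable = btl_reach[b];
               which keeps only the last BTL's information */
            reachable |= btl_reach[b];   /* expected: accumulate the union */
        }
        for (int i = 0; i < NPROCS; i++)
            printf("proc %d reachable: %s\n", i,
                   ((reachable >> i) & 1) ? "yes" : "no");
        return 0;
    }

With the union, all four processes come out reachable; with the overwrite, only processes 1 and 3 (the last BTL's set) would.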
Many thanks!!
André

> Thanks,
>   george.

On May 24, 2006, at 6:12 AM, Andre Lichei wrote:

> Hello,
>
> currently I'm working on the r2 component of the bml framework. When I
> tried to get a deeper understanding of the component, I had
> difficulties figuring out how the add_procs function should behave. So
> my question is: how should the function behave, and what is the
> purpose of the bml_endpoints array? An explanation of my difficulties
> follows.
>
> The add_procs function is implemented in bml_r2.c and starts at line 164:
>
>     mca_bml_r2_add_procs(size_t nprocs,
>                          struct ompi_proc_t** procs,
>                          struct mca_bml_base_endpoint_t** bml_endpoints,
>                          struct ompi_bitmap_t* reachable)
>
> When I first read it, it seemed that the function accepts an array of
> ompi_proc_t structs and returns an array of the same size which
> contains one bml_endpoint for every process in the procs array.
>
> At the beginning of the function (lines 193 to 204) there is a loop
> checking whether there are processes which are not new. If this is the
> case, the existing bml_endpoint is selected and stored in the endpoint
> array. New processes are stored in a different array. This means that
> if all processes are known, the function behaves as described above.
>
> When there are new processes, the procs array is overwritten with the
> newly created array of new processes (line 210). This array may be
> shorter (when there was at least one known process), so the number of
> elements, nprocs, is overwritten too (line 211). But this number is
> not a pointer, so the calling function cannot notice the change.
>
> Now new bml_endpoints are created and stored in the bml_endpoints
> array. But they are stored at the position the process has in the new
> array! (line 271) So existing entries may be overwritten.
>
> Example: The function receives an array with 4 processes (process 0 to
> 3). Process 2 is already known, so in the first loop the bml_endpoint
> of process 2 is stored at bml_endpoints[2]. Also, a new array is
> created containing processes 0, 1, and 3. This new array replaces the
> procs array. Then, for all three processes, bml_endpoints are created
> and stored at bml_endpoints[0], [1], and [2], so the existing entry
> (bml_endpoints[2]) is overwritten. The bml_endpoints array now
> contains only three valid elements, but the calling function still has
> 4 as the count, because the new number can't be returned.
>
> So my question again: is this the intended behavior, or is it a bug?
> How should the function behave?
>
> Thanks,
> André
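For concreteness, the overwrite described in the example above can be reproduced with a toy program. This is only an illustration: plain ints stand in for ompi_proc_t and the endpoint structures, and the 100/200 values are arbitrary markers.

    #include <stdio.h>

    int main(void) {
        int nprocs = 4;
        int known[4]     = { 0, 0, 1, 0 };  /* process 2 is already known */
        int endpoints[4] = { 0, 0, 0, 0 };  /* 0 = no endpoint stored */
        int new_procs[4];
        int n_new = 0;

        /* first loop: reuse the endpoints of known processes (stored at
           the process's index in the caller's array) and collect the
           new processes into a separate, shorter array */
        for (int i = 0; i < nprocs; i++) {
            if (known[i])
                endpoints[i] = 100 + i;   /* existing endpoint of proc 2 */
            else
                new_procs[n_new++] = i;   /* new_procs becomes {0, 1, 3} */
        }

        /* second loop: like r2, store each new endpoint at the index
           the process has in new_procs, not in the caller's array */
        for (int j = 0; j < n_new; j++)
            endpoints[j] = 200 + new_procs[j];

        for (int i = 0; i < nprocs; i++)
            printf("endpoints[%d] = %d\n", i, endpoints[i]);
        return 0;
    }

The output shows endpoints[2] holding process 3's endpoint (the entry for the known process 2 has been clobbered) and endpoints[3] never being filled in -- exactly the mismatch described above.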
Re: [OMPI devel] Oversubscription/Scheduling Bug
> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Donohue
> Sent: Wednesday, May 24, 2006 10:27 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Oversubscription/Scheduling Bug
>
> I'm using OpenMPI 1.0.2 (in case it makes a difference)
>
> $ mpirun -np 2 --hostfile test --host psd.umd.edu --mca
> mpi_yield_when_idle 1 --mca orte_debug 1 hostname 2>&1 | grep yield
> [psd:30325] pls:rsh: /usr/bin/ssh orted
> --debug --bootproxy 1 --name --num_procs 2
> --vpid_start 0 --nodename --universe
> paul@psd:default-universe-30325 --nsreplica
> "0.0.0;tcp://128.8.96.50:35281" --gprreplica
> "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
> [psd:30325] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [psd:30325] pls:rsh: executing: orted --debug --bootproxy 1
> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
> psd.umd.edu --universe paul@psd:default-universe-30325
> --nsreplica "0.0.0;tcp://128.8.96.50:35281" --gprreplica
> "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
> $
>
> When it runs the worker processes, it passes --mpi-call-yield
> 0 to the workers even though I set mpi_yield_when_idle to 1

This actually winds up in a comedy of errors. The end result is that mpi_yield_when_idle *is* set to 1 in the MPI processes.

1. Strictly speaking, you're right that the rsh pls should probably not be setting that variable when *not* oversubscribing. More specifically, we should only set it to 1 when we are oversubscribing. But by point 3 (below), this is actually harmless.

2. The orted gets the option "--mpi-call-yield 0", but it does do the Right Thing, actually -- it only sets the MCA parameter to 1 if the argument to --mpi-call-yield is > 0. Hence, in this case, it does not set the MCA parameter.

3. mpirun and the orted bundle up MCA parameters from the mpirun command line and environment and seed them in the newly-spawned processes. As such, mpirun command line and environment MCA parameters override anything that the orted may have set (e.g., via --mpi-call-yield). This is actually by design.

You can see this by slightly modifying your test command -- run "env" instead of "hostname". You'll see that the environment variable OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on the mpirun command line, regardless of a) whether you're oversubscribing or not, and b) whatever is passed in through the orted. I'm trying to think of a case where this will not be true, and I think it's only platforms where we don't use the orted (e.g., Red Storm, where oversubscription is not an issue).

> I tried testing 4 processes on a 2-way SMP as well.
> One pair of processes is waiting on STDIN.
> The other pair of processes is running calculations.
> First, I ran only the calculations without the STDIN
> processes - 35.5 second run time
> Then I ran both pairs of processes, using slots=2 in my
> hostfile, and mpi_yield_when_idle=1 for both pairs - 25
> minute run time
> Then I ran both pairs of processes, using slots=1 in my
> hostfile - 48 second run time

This is quite fishy. Note that the processes blocking on STDIN should not be affected by the MPI yield setting -- the MPI yield setting is *only* in effect when you're waiting for progress in an MPI function (e.g., in MPI_SEND or MPI_RECV or the like). So:

- on a 2-way SMP
- if you have N processes running
- 2 of which are blocking in MPI calls
- (N-2) of which are blocking on STDIN

Note that Open MPI's "blocking" calls usually spin trying to make progress.
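As a simplified sketch of what such a spinning blocking call looks like (the names here are illustrative, not the actual Open MPI code path; the function pointer stands in for Open MPI's progress engine):

    #include <sched.h>

    /* Simplified sketch of how a blocking MPI call waits for completion.
       The progress engine is polled in a tight loop; with yielding
       enabled, every pass also gives the CPU away voluntarily. */
    static void wait_for_completion(volatile int *complete,
                                    void (*progress)(void),
                                    int yield_when_idle) {
        while (!*complete) {
            progress();          /* poll all transports; spins when idle */
            if (yield_when_idle)
                sched_yield();   /* voluntarily give up the timeslice */
        }
    }

With yielding enabled, every otherwise-idle pass through the loop costs a system call, which is why constant yielding can be expensive.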
So in the above scenario, you'll have 2 MPI processes spinning heavily and probably fully utilizing both CPUs. The other (N-2) processes should not be a factor.

So the question is -- why does setting mpi_yield_when_idle to 1 take so much time? I'm guessing that it's doing exactly what it's supposed to be doing -- lots and lots of yielding (although I agree that a difference of 48 seconds -> 25 minutes seems a bit excessive). The constant yielding could be quite expensive. Are your 2 processes doing a lot of very large communications with each other?

> > Good point. I'll update the FAQ later today; thanks!
>
> Sweet! It would probably be worth mentioning
> mpi_yield_when_idle=1 in there too - it took some digging for
> me to find that option
> (After it's fixed, of course ;-) )

Will do.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
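My reading of how points 2 and 3 above interact, expressed as code (this is an assumed model, not the actual orted source; the function name is made up):

    #include <stdlib.h>

    /* Assumed model (not the real orted code): the orted acts only on a
       positive --mpi-call-yield argument, and never clobbers a value
       that mpirun already seeded into the spawned environment. */
    static void handle_mpi_call_yield(int arg) {
        if (arg > 0) {
            /* overwrite flag 0: an existing OMPI_MCA_mpi_yield_when_idle
               (e.g., from "--mca mpi_yield_when_idle 1") wins */
            setenv("OMPI_MCA_mpi_yield_when_idle", "1", 0);
        }
    }

Under this model, "--mpi-call-yield 0" is effectively a no-op and the value from the mpirun command line survives into the MPI processes, matching the behavior described above.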
Re: [OMPI devel] Oversubscription/Scheduling Bug
On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote:

> You can see this by slightly modifying your test command -- run "env"
> instead of "hostname". You'll see that the environment variable
> OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on
> the mpirun command line, regardless of a) whether you're
> oversubscribing or not, and b) whatever is passed in through the orted.

While Jeff is correct that the parameter informing the MPI process that it should idle when it's not busy is correctly set, it turns out that we are ignoring this parameter inside the MPI process. I'm looking into this and hope to have a fix this afternoon.

Brian
Re: [OMPI devel] Oversubscription/Scheduling Bug
On Fri, 26 May 2006, Brian W. Barrett wrote:

> On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote:
>
> > You can see this by slightly modifying your test command -- run "env"
> > instead of "hostname". You'll see that the environment variable
> > OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in
> > on the mpirun command line, regardless of a) whether you're
> > oversubscribing or not, and b) whatever is passed in through the
> > orted.
>
> While Jeff is correct that the parameter informing the MPI process
> that it should idle when it's not busy is correctly set, it turns out
> that we are ignoring this parameter inside the MPI process. I'm
> looking into this and hope to have a fix this afternoon.

Mea culpa. Jeff's right that in a normal application we are set up to call sched_yield() when idle if the user sets mpi_yield_when_idle to 1, regardless of what is in the hostfile. The problem with my test case was that, for various reasons, my test code was never actually "idling" - there were always things moving along, so our progress engine was deciding that the process should not be idled.

Can you share your test code at all? I'm wondering if something similar is happening with your code. It doesn't sound like it should be "always working", but I'm wondering if you're triggering some corner case we haven't thought of.

Brian

--
Brian Barrett
Graduate Student, Open Systems Lab, Indiana University
http://www.osl.iu.edu/~brbarret/
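A rough sketch of the condition Brian describes (illustrative names, not the actual Open MPI source): the process only counts as "idling" -- and therefore only yields -- when a pass through the progress engine completes no events.

    #include <sched.h>

    /* Illustrative sketch (not the real Open MPI source): yield only
       when a polling pass found nothing to do, i.e. when the process is
       truly idle rather than merely between pieces of work. */
    static void progress_once(int (*poll_all_devices)(void),
                              int yield_when_idle) {
        int completed = poll_all_devices();  /* events completed this pass */
        if (0 == completed && yield_when_idle)
            sched_yield();                   /* nothing moving: idle the CPU */
    }

In a test where "there were always things moving along", completed stays positive and sched_yield() is never reached, which matches the behavior Brian saw.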