[OMPI devel] behavior of the r2 component
Hello,

I'm currently working on the r2 component of the bml framework. While trying to get a deeper understanding of the component, I had difficulty figuring out how the add_procs function should behave. So my question is: how should the function behave, and what is the purpose of the bml_endpoints array? An explanation of my difficulties follows.

The add_procs function is implemented in bml_r2.c and starts at line 164:

    mca_bml_r2_add_procs(size_t nprocs,
                         struct ompi_proc_t** procs,
                         struct mca_bml_base_endpoint_t** bml_endpoints,
                         struct ompi_bitmap_t* reachable)

On first reading, it seems that the function accepts an array of ompi_proc_t structs and returns an array of the same size which contains one bml_endpoint for every process in the procs array.

At the beginning of the function (lines 193 to 204) a loop checks whether any of the processes are already known. For a known process, the existing bml_endpoint is selected and stored in the endpoint array; new processes are collected in a different array. So if all processes are already known, the function behaves as described above.

When there are new processes, however, the procs array is overwritten with the newly created array of new processes (line 210). This array may be shorter (when at least one process was already known), so nprocs is overwritten too (line 211). But nprocs is passed by value, not by pointer, so the calling function cannot notice the change. New bml_endpoints are then created and stored in the bml_endpoints array, but at the position the process has in the new, filtered array (line 271), so existing entries may be overwritten.

Example: the function receives an array with 4 processes (process 0 to 3), of which process 2 is already known. In the first loop the bml_endpoint of process 2 is stored at bml_endpoints[2]. A new array containing processes 0, 1, and 3 is also created, and it replaces the procs array. Then bml_endpoints are created for all three processes and stored at bml_endpoints[0..2], so the existing entry bml_endpoints[2] is overwritten. The bml_endpoints array now contains only three valid elements, but the caller still believes it holds 4, because the new count cannot be returned.

So my question again: is this the intended behavior, or is it a bug? How should the function behave?

Thanks,
André
Re: [OMPI devel] Oversubscription/Scheduling Bug
Paul -- Many thanks for your detailed report. I apparently missed a whole boatload of e-mails on 2 May due to a problem with my mail client. Deep apologies for missing this mail! :-( More information below.

> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Donohue
> Sent: Friday, May 05, 2006 10:47 PM
> To: de...@open-mpi.org
> Subject: [OMPI devel] Oversubscription/Scheduling Bug
>
> I would like to be able to start a non-oversubscribed run of a program in OpenMPI
> as if it were oversubscribed, so that the processes run in Degraded Mode, such that
> I have the option to start an additional simultaneous run on the same nodes if
> necessary. (Basically, I have a program that will ask for some data, run for a
> while, then print some results, then stop and ask for more data. It takes some time
> to collect and input the additional data, so I would like to be able to start
> another instance of the program which can be running while I'm inputting data to
> the first instance, and can be inputting while the first instance is running.)
>
> Since I have single-processor nodes, the obvious solution would be to set slots=0
> for each of my nodes, so that using 1 slot for every run causes the nodes to be
> oversubscribed. However, it seems that slots=0 is treated like slots=infinity, so
> my processes run in Aggressive Mode, and I lose the ability to oversubscribe my
> node using two independent processes.

I'd prefer to keep slots=0 synonymous with "infinity", if only for historical reasons (it's also less code to change :-) ).

> So, I tried setting '--mca mpi_yield_when_idle 1', since this sounded like it was
> meant to force Degraded Mode. But it didn't seem to do anything - my processes
> still ran in Aggressive Mode. I skimmed through the source code real quick, and it
> doesn't look like mpi_yield_when_idle is ever actually used.

Are you sure? How did you test this?
I just did a few tests and it seems to work fine for me. The MCA param "mpi_yield_when_idle" is actually used within the OPAL layer (the name is somewhat of an abstraction break -- it reflects the fact that the progression engine used to be up in the MPI layer; it got put in OPAL when the entire source code tree was split into OPAL, ORTE, and OMPI) in opal/runtime/opal_progress.c.

You can check whether this param is set by using the mpi_show_mca_params MCA parameter. Setting this parameter to 1 will make all MPI processes display the current values of their MCA parameters on stderr. For example:

    shell% mpirun -np 1 --mca mpi_show_mca_params 1 hello |& grep yield
    [foo.example.com:23206] mpi_yield_when_idle=0
    shell% mpirun -np 1 --mca mpi_yield_when_idle 1 --mca mpi_show_mca_params 1 hello |& grep yield
    [foo.example.com:23213] mpi_yield_when_idle=1

It may be difficult to tell whether this behavior is working properly because, by definition, if you're in an oversubscribed situation (assuming that all your processes are trying to fully utilize the CPU), the entire system could be running pretty slowly anyway. The difference between aggressive and degraded mode is that we call yield() in the middle of tight progression loops in OMPI. Hence, if you're oversubscribed, this actually gives other processes a chance of being scheduled / run by the OS. For example, if you oversubscribe and don't have this param set, because OMPI uses tight repetitive loops for progression, you will typically see one process completely hog the CPU for a long, long time before the OS finally lets another be scheduled.

I just did a small test: running 3 processes on a 2-way SMP, where each MPI process sends a short message around in a ring pattern 100 times:

    mpi_yield_when_idle=1 : 1.4 seconds running time
    mpi_yield_when_idle=0 : 22.8 seconds running time

So it can make a big difference. But don't expect it to completely mitigate the effects of oversubscription.
> I also noticed another bug in the scheduler:
> hostfile:
>   A slots=2 max-slots=2
>   B slots=2 max-slots=2
> 'mpirun -np 5' quits with an over-subscription error
> 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever

Yoinks; this is definitely a bug. I've filed a bug in our tracker to get this fixed. Thanks for reporting it.

> And finally, on http://www.open-mpi.org/faq/?category=tuning
> - 11. How do I tell Open MPI to use processor and/or memory affinity?
> It mentions that OpenMPI will automatically disable processor affinity on
> oversubscribed nodes.

Correct.

> When I first read it, I made the assumption that processor affinity and Degraded
> Mode were incompatible. However, it seems that independent non-oversubscribed
> processes running in Degraded Mode work fine with processor affinity - it's only
> actually oversubscribed processes which have problems. A note that Degraded Mode
> and Processor Affinity work together
Re: [OMPI devel] Oversubscription/Scheduling Bug
> > Since I have single-processor nodes, the obvious solution would be to set
> > slots=0 for each of my nodes, so that using 1 slot for every run causes the
> > nodes to be oversubscribed. However, it seems that slots=0 is treated like
> > slots=infinity, so my processes run in Aggressive Mode, and I lose the ability
> > to oversubscribe my node using two independent processes.
>
> I'd prefer to keep the slots=0 synonymous to "infinity", if only for historical
> reasons (it's also less code to change :-) ).

Understandable. 'slots=0' mapping to 'infinity' is a useful feature, I think. I only mentioned it because I figured I should give justification as to why mpi_yield_when_idle working properly was necessary (since it is not possible to duplicate its functionality by mucking with the slots value).

> > So, I tried setting '--mca mpi_yield_when_idle 1', since this sounded like it
> > was meant to force Degraded Mode. But it didn't seem to do anything - my
> > processes still ran in Aggressive Mode. I skimmed through the source code real
> > quick, and it doesn't look like mpi_yield_when_idle is ever actually used.
>
> Are you sure? How did you test this?
I'm using OpenMPI 1.0.2 (in case it makes a difference).

$ mpirun -np 2 --hostfile test --host psd.umd.edu --mca mpi_yield_when_idle 1 --mca orte_debug 1 hostname 2>&1 | grep yield
[psd:30325] pls:rsh: /usr/bin/ssh orted --debug --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" --gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
[psd:30325] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[psd:30325] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename psd.umd.edu --universe paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" --gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
$

When it runs the worker processes, it passes --mpi-call-yield 0 to the workers even though I set mpi_yield_when_idle to 1. Perhaps this has something to do with it (lines 689-703 of orte/mca/pls/rsh/pls_rsh_module.c):

    /* set the progress engine schedule for this node.
     * if node_slots is set to zero, then we default to
     * NOT being oversubscribed */
    if (ras_node->node_slots > 0 &&
        opal_list_get_size(&rmaps_node->node_procs) > ras_node->node_slots) {
        if (mca_pls_rsh_component.debug) {
            opal_output(0, "pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (%d %d)",
                        ras_node->node_slots,
                        opal_list_get_size(&rmaps_node->node_procs));
        }
        free(argv[call_yield_index]);
        argv[call_yield_index] = strdup("1");
    } else {
        if (mca_pls_rsh_component.debug) {
            opal_output(0, "pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0");
        }
        free(argv[call_yield_index]);
        argv[call_yield_index] = strdup("0");
    }

It looks like mpi_yield_when_idle is ignored and only slots are taken into account...
> It may be difficult to tell if this behavior is working properly because, by
> definition, if you're in an oversubscribed situation (assuming that all your
> processes are trying to fully utilize the CPU), the entire system could be running
> pretty slowly anyway.

In my case (fortunately? unfortunately?), it's fairly obvious whether Degraded Mode or Aggressive Mode is being used, since one process is idle (waiting for user input) while the other one is running. Even though the node is actually oversubscribed, in Degraded Mode the running process should be able to use most of the CPU, since the idle process isn't doing much.

> I just did a small test: running 3 processes on a 2-way SMP. Each MPI process
> sends a short message around in a ring pattern 100 times:

I tried testing 4 processes on a 2-way SMP as well. One pair of processes waits on STDIN; the other pair runs calculations.

First, I ran only the calculations without the STDIN processes - 35.5 second run time
Then I ran both pairs of processes, using slots=2 in my hostfile and mpi_yield_when_idle=1 for both pairs - 25 minute run time
Then I ran both pairs of processes, using slots=1 in my hostfile - 48 second run time

Pretty drastic difference ;-)

> > I also noticed another bug in the scheduler:
> > hostfile:
> >   A slots=2 max-slots=2
> >   B slots=2 max-slots=2
> > 'mpirun -np 5' quits with an over-subscription error
> > 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever
>
> Yoinks; this is definitely a bug. I've filed a bug in our tracker to get this
> fixed. Thanks for reporting it.
Re: [OMPI devel] behavior of the r2 component
Looks like you missed the bitmap. Every time one of the endpoints is reachable, the corresponding bit in the bitmap is set to one. The upper level reparses the bitmap and will detect the number of registered BTLs.

Thanks,
george.

On May 24, 2006, at 6:12 AM, Andre Lichei wrote:

> Hello currently I'm working at the r2 component of the bml framework. When I tried
> to get a deeper understanding of the component I experienced difficulties to
> figure out how the add_proc function should behave. [...]
>
> So my question again. Is this the intended behavior or is it a bug? How should the
> function behave?
>
> Thanks,
> André

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel