[OMPI devel] behavior of the r2 component

2006-05-24 Thread Andre Lichei

Hello

I'm currently working on the r2 component of the bml framework. While 
trying to get a deeper understanding of the component, I had difficulty 
figuring out how the add_procs function should behave. So my question is: 
how should the function behave, and what is the purpose of the 
bml_endpoints array? An explanation of my difficulties follows.


The add_procs function is implemented in bml_r2.c and starts at line 164:

mca_bml_r2_add_procs(size_t nprocs,
                     struct ompi_proc_t** procs,
                     struct mca_bml_base_endpoint_t** bml_endpoints,
                     struct ompi_bitmap_t* reachable)


On first reading, it seems that the function accepts an array of 
ompi_proc_t structs and returns an array of the same size containing 
one bml_endpoint for every process in the procs array.
At the beginning of the function (lines 193 to 204) there is a loop 
checking whether any of the processes are already known. If so, the 
existing bml_endpoint is looked up and stored in the bml_endpoints 
array; new processes are collected in a separate array. This means that 
if all processes are already known, the function behaves as described 
above.
When there are new processes, the procs array is overwritten with the 
newly created array of new processes (line 210). This array may be 
shorter (when at least one process was already known), so the value of 
nprocs is overwritten too (line 211). But nprocs is not passed by 
pointer, so the calling function cannot see the new value.
New bml_endpoints are then created and stored in the bml_endpoints 
array -- but at the index the process has in the *new* array! (line 271) 
So existing entries may be overwritten.


Example:
The function receives an array of 4 processes (processes 0 to 3), and 
process 2 is already known. In the first loop, the existing bml_endpoint 
of process 2 is stored at bml_endpoints[2], and a new array containing 
processes 0, 1, and 3 is created. This new array replaces the procs 
array. Then bml_endpoints are created for all three new processes and 
stored at bml_endpoints[0], [1], and [2], so the existing entry at 
bml_endpoints[2] is overwritten.
As a result, the bml_endpoints array contains only three valid entries, 
but the caller still believes there are 4, because the new count cannot 
be returned.
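
To make the indexing problem concrete, here is a minimal standalone C 
program (an illustration only, not the actual bml_r2 code; plain ints 
stand in for procs and endpoints) that mimics the two loops described 
above:

#include <stdio.h>
#include <stddef.h>

#define NPROCS 4

int main(void)
{
    int known[NPROCS]     = { 0, 0, 1, 0 };        /* process 2 is already known */
    int endpoints[NPROCS] = { -1, -1, -1, -1 };    /* -1 means "no endpoint yet"  */

    int new_procs[NPROCS];
    size_t n_new = 0;

    /* First loop: reuse endpoints of known processes, collect the rest. */
    for (size_t i = 0; i < NPROCS; ++i) {
        if (known[i]) {
            endpoints[i] = 100 + (int)i;           /* pretend we looked it up */
        } else {
            new_procs[n_new++] = (int)i;           /* compacted array: 0, 1, 3 */
        }
    }

    /* Second loop: create endpoints for new processes, but store them at
     * their index in the *compacted* array -- this clobbers endpoints[2]. */
    for (size_t j = 0; j < n_new; ++j) {
        endpoints[j] = 200 + new_procs[j];
    }

    for (size_t i = 0; i < NPROCS; ++i) {
        printf("bml_endpoints[%zu] = %d\n", i, endpoints[i]);
    }
    return 0;
}

Running it shows the endpoint created for process 3 landing at index 2, 
clobbering the entry of the already-known process 2, while index 3 is 
never filled in.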


So, my question again: is this the intended behavior, or is it a bug? How 
should the function behave?


Thanks,
André




Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-05-24 Thread Jeff Squyres (jsquyres)
Paul --

Many thanks for your detailed report.  I apparently missed a whole
boatload of e-mails on 2 May due to a problem with my mail client.  Deep
apologies for missing this mail!  :-(

More information below.


> -----Original Message-----
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Donohue
> Sent: Friday, May 05, 2006 10:47 PM
> To: de...@open-mpi.org
> Subject: [OMPI devel] Oversubscription/Scheduling Bug
> 
> I would like to be able to start a non-oversubscribed run of 
> a program in OpenMPI as if it were oversubscribed, so that 
> the processes run in Degraded Mode, such that I have the 
> option to start an additional simultaneous run on the same 
> nodes if necessary.
> (Basically, I have a program that will ask for some data, run 
> for a while, then print some results, then stop and ask for 
> more data.  It takes some time to collect and input the 
> additional data, so I would like to be able to start another 
> instance of the program which can be running while i'm 
> inputting data to the first instance, and can be inputting 
> while the first instance is running).
> 
> Since I have single-processor nodes, the obvious solution 
> would be to set slots=0 for each of my nodes, so that using 1 
> slot for every run causes the nodes to be oversubscribed.  
> However, it seems that slots=0 is treated like 
> slots=infinity, so my processes run in Aggressive Mode, and I 
> lose the ability to oversubscribe my node using two 
> independent processes.

I'd prefer to keep slots=0 synonymous with "infinity", if only for
historical reasons (it's also less code to change :-) ).

> So, I tried setting '--mca mpi_yield_when_idle 1', since this 
> sounded like it was meant to force Degraded Mode.  But, it 
> didn't seem to do anything - my processes still ran in 
> Aggressive Mode.  I skimmed through the source code real 
> quick, and it doesn't look like mpi_yield_when_idle is ever 
> actually used.

Are you sure?  How did you test this?

I just did a few tests and it seems to work fine for me.  The MCA param
"mpi_yield_when_idle" is actually used within the OPAL layer, in
opal/runtime/opal_progress.c (the name is somewhat of an abstraction
break -- it reflects the fact that the progression engine used to be up
in the MPI layer; it was moved into OPAL when the entire source code
tree was split into OPAL, ORTE, and OMPI).

You can check for whether this param is set or not by using the
mpi_show_mca_params MCA parameter.  Setting this parameter to 1 will
make all MPI processes display the current values for their MCA
parameters to stderr.  For example:

-
shell% mpirun -np 1 --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23206] mpi_yield_when_idle=0
shell% mpirun -np 1 --mca mpi_yield_when_idle 1 --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23213] mpi_yield_when_idle=1
-

It may be difficult to tell if this behavior is working properly
because, by definition, if you're in an oversubscribed situation
(assuming that all your processes are trying to fully utilize the CPU),
the entire system could be running pretty slowly anyway.

The difference between aggressive and degraded mode is that we call
yield() in the middle of tight progression loops in OMPI.  Hence, if
you're oversubscribed, this actually gives other processes a chance of
being scheduled / run by the OS.  For example, if you oversubscribe and
don't have this param set, because OMPI uses tight repetitive loops for
progression, you will typically see one process completely hogging the
CPU for a long, long time before the OS finally lets another be
scheduled.  

I just did a small test: running 3 processes on a 2-way SMP.  Each MPI
process sends a short message around in a ring pattern 100 times:

- mpi_yield_when_idle=1 : 1.4 seconds running time
- mpi_yield_when_idle=0 : 22.8 seconds running time

So it can make a big difference.  But don't expect it to completely
mitigate the effects of oversubscription.
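
As a rough illustration of where that difference comes from (a 
simplified, self-contained sketch, not the actual opal_progress code; 
the polling function is made up), the whole question is whether a tight 
progression loop gives the CPU back when a poll finds nothing to do:

#include <sched.h>
#include <stdio.h>

static int yield_when_idle = 1;          /* what mpi_yield_when_idle controls */

/* Stand-in for polling all BTLs: pretend an event arrives now and then. */
static int poll_for_events(int iteration)
{
    return (iteration % 1000 == 0) ? 1 : 0;
}

int main(void)
{
    int handled = 0;

    for (int i = 0; handled < 5; ++i) {
        int events = poll_for_events(i);
        if (events > 0) {
            ++handled;                   /* progress was made */
        } else if (yield_when_idle) {
            sched_yield();               /* degraded mode: let another process run */
        }
        /* aggressive mode: fall straight through and poll again immediately */
    }

    printf("handled %d events\n", handled);
    return 0;
}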

> I also noticed another bug in the scheduler:
> hostfile:
>  A slots=2 max-slots=2
>  B slots=2 max-slots=2
> 'mpirun -np 5' quits with an over-subscription error
> 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever

Yoinks; this is definitely a bug.  I've filed a bug in our tracker to
get this fixed.  Thanks for reporting it.

> And finally, on http://www.open-mpi.org/faq/?category=tuning 
> - 11. How do I tell Open MPI to use processor and/or memory affinity?
> It mentions that OpenMPI will automatically disable processor 
> affinity on oversubscribed nodes.  When I first read it, I 

Correct.

> made the assumption that processor affinity and Degraded Mode 
> were incompatible.  However, it seems that independent 
> non-oversubscribed processes running in Degraded Mode work 
> fine with processor affinity - it's only actually 
> oversubscribed processes which have problems.  A note that 
> Degraded Mode and Processor Affinity work tog

Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-05-24 Thread Paul Donohue
> > Since I have single-processor nodes, the obvious solution 
> > would be to set slots=0 for each of my nodes, so that using 1 
> > slot for every run causes the nodes to be oversubscribed.  
> > However, it seems that slots=0 is treated like 
> > slots=infinity, so my processes run in Aggressive Mode, and I 
> > lose the ability to oversubscribe my node using two 
> > independent processes.
> I'd prefer to keep slots=0 synonymous with "infinity", if only for
> historical reasons (it's also less code to change :-) ).
Understandable. 'slots=0' mapping to 'infinity' is a useful feature, I think.  I 
only mentioned it because I figured I should justify why mpi_yield_when_idle 
needs to work properly (since its functionality cannot be duplicated by mucking 
with the slots value).

> > So, I tried setting '--mca mpi_yield_when_idle 1', since this 
> > sounded like it was meant to force Degraded Mode.  But, it 
> > didn't seem to do anything - my processes still ran in 
> > Aggressive Mode.  I skimmed through the source code real 
> > quick, and it doesn't look like mpi_yield_when_idle is ever 
> > actually used.
> Are you sure?  How did you test this?

I'm using OpenMPI 1.0.2 (in case it makes a difference)

$ mpirun -np 2 --hostfile test --host psd.umd.edu --mca mpi_yield_when_idle 1 
--mca orte_debug 1 hostname 2>&1 | grep yield
[psd:30325] pls:rsh: /usr/bin/ssh  orted --debug --bootproxy 1 
--name  --num_procs 2 --vpid_start 0 --nodename  --universe 
paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" 
--gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
[psd:30325] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[psd:30325] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 
--num_procs 2 --vpid_start 0 --nodename psd.umd.edu --universe 
paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" 
--gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
$

When it runs the worker processes, it passes --mpi-call-yield 0 to the workers 
even though I set mpi_yield_when_idle to 1

Perhaps this has something to do with it:
(lines 689-703 of orte/mca/pls/rsh/pls_rsh_module.c)

    /* set the progress engine schedule for this node.
     * if node_slots is set to zero, then we default to
     * NOT being oversubscribed
     */
    if (ras_node->node_slots > 0 &&
        opal_list_get_size(&rmaps_node->node_procs) > ras_node->node_slots) {
        if (mca_pls_rsh_component.debug) {
            opal_output(0, "pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (%d %d)",
                        ras_node->node_slots,
                        opal_list_get_size(&rmaps_node->node_procs));
        }
        free(argv[call_yield_index]);
        argv[call_yield_index] = strdup("1");
    } else {
        if (mca_pls_rsh_component.debug) {
            opal_output(0, "pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0");
        }
        free(argv[call_yield_index]);
        argv[call_yield_index] = strdup("0");
    }

It looks like the user-supplied mpi_yield_when_idle value is ignored here and 
only the slot count is taken into account...
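
For what it's worth, here is a standalone sketch of the decision logic I 
would have expected (purely speculative, not a patch against the actual 
pls_rsh code; the function and parameter names are made up): force the 
yield flag on when the node is oversubscribed, and otherwise fall back to 
whatever the user set via --mca mpi_yield_when_idle.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical decision logic, for illustration only.  "user_yield_setting"
 * stands in for the value the user passed via --mca mpi_yield_when_idle;
 * "slots" and "nprocs_on_node" stand in for ras_node->node_slots and the
 * number of processes mapped onto the node. */
static int choose_call_yield(size_t slots, size_t nprocs_on_node,
                             int user_yield_setting)
{
    /* Oversubscribed: always yield, regardless of what the user asked for. */
    if (slots > 0 && nprocs_on_node > slots) {
        return 1;
    }
    /* Not oversubscribed: honor the user's explicit request instead of
     * unconditionally forcing 0, which is what the quoted code does. */
    return user_yield_setting ? 1 : 0;
}

int main(void)
{
    /* My case: 1 slot, 1 process, user asked for yield-when-idle. */
    printf("--mpi-call-yield %d\n", choose_call_yield(1, 1, 1));
    /* Oversubscribed case: 2 slots, 3 processes, user setting irrelevant. */
    printf("--mpi-call-yield %d\n", choose_call_yield(2, 3, 0));
    return 0;
}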

> It may be difficult to tell if this behavior is working properly
> because, by definition, if you're in an oversubscribed situation
> (assuming that all your processes are trying to fully utilize the CPU),
> the entire system could be running pretty slowly anyway.

In my case (fortunately? unfortunately?), it's fairly obvious whether Degraded 
mode or Aggressive mode is being used, since one process is idle (waiting for 
user input) while the other one is running.  Even though the node is actually 
oversubscribed, in Degraded mode the running process should be able to use 
most of the CPU, since the idle process isn't doing much.

> I just did a small test: running 3 processes on a 2-way SMP.  Each MPI
> process sends a short message around in a ring pattern 100 times:

I tried testing 4 processes on a 2-way SMP as well.
One pair of processes is waiting on STDIN.
The other pair of processes is running calculations.

First, I ran only the calculations without the STDIN processes - 35.5 second 
run time
Then I ran both pairs of processes, using slots=2 in my hostfile, and 
mpi_yield_when_idle=1 for both pairs - 25 minute run time
Then I ran both pairs of processes, using slots=1 in my hostfile - 48 second 
run time

Pretty drastic difference ;-)

> > I also noticed another bug in the scheduler:
> > hostfile:
> >  A slots=2 max-slots=2
> >  B slots=2 max-slots=2
> > 'mpirun -np 5' quits with an over-subscription error
> > 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever
> Yoinks; this is definitely a bug.  I've filed a bug in our tracker to
> get this fixed.  Thanks for reporting it.

Re: [OMPI devel] behavior of the r2 component

2006-05-24 Thread George Bosilca
Looks like you missed the bitmap. Every time one of the endpoints is 
reachable, the corresponding bit in the bitmap is set to one. The upper 
level then reparses the bitmap and detects the number of registered BTLs.
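
To illustrate the pattern being described (a self-contained sketch for 
illustration only, not the actual r2 or ompi_bitmap code; a plain 
uint32_t stands in for the ompi_bitmap_t):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define NPROCS 4

static void set_bit(uint32_t *bitmap, size_t i) { *bitmap |= (1u << i); }
static int  is_set(uint32_t bitmap, size_t i)   { return (bitmap >> i) & 1u; }

int main(void)
{
    uint32_t reachable = 0;                           /* stands in for ompi_bitmap_t */
    int reachable_via_btl[NPROCS] = { 1, 1, 0, 1 };   /* pretend proc 2 has no BTL   */

    /* "add_procs": mark every process we managed to reach. */
    for (size_t i = 0; i < NPROCS; ++i) {
        if (reachable_via_btl[i]) {
            set_bit(&reachable, i);
        }
    }

    /* Upper layer: reparse the bitmap to find the usable endpoints. */
    for (size_t i = 0; i < NPROCS; ++i) {
        printf("proc %zu: %s\n", i, is_set(reachable, i) ? "reachable" : "not reachable");
    }
    return 0;
}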


  Thanks,
george.

On May 24, 2006, at 6:12 AM, Andre Lichei wrote:


> Hello
> 
> I'm currently working on the r2 component of the bml framework. While
> trying to get a deeper understanding of the component, I had difficulty
> figuring out how the add_procs function should behave. So my question
> is: how should the function behave, and what is the purpose of the
> bml_endpoints array? An explanation of my difficulties follows.
> 
> The add_procs function is implemented in bml_r2.c and starts at line 164:
> 
> mca_bml_r2_add_procs(size_t nprocs,
>                      struct ompi_proc_t** procs,
>                      struct mca_bml_base_endpoint_t** bml_endpoints,
>                      struct ompi_bitmap_t* reachable)
> 
> On first reading, it seems that the function accepts an array of
> ompi_proc_t structs and returns an array of the same size containing
> one bml_endpoint for every process in the procs array.
> At the beginning of the function (lines 193 to 204) there is a loop
> checking whether any of the processes are already known. If so, the
> existing bml_endpoint is looked up and stored in the bml_endpoints
> array; new processes are collected in a separate array. This means that
> if all processes are already known, the function behaves as described
> above.
> When there are new processes, the procs array is overwritten with the
> newly created array of new processes (line 210). This array may be
> shorter (when at least one process was already known), so the value of
> nprocs is overwritten too (line 211). But nprocs is not passed by
> pointer, so the calling function cannot see the new value.
> New bml_endpoints are then created and stored in the bml_endpoints
> array -- but at the index the process has in the *new* array! (line 271)
> So existing entries may be overwritten.
> 
> Example:
> The function receives an array of 4 processes (processes 0 to 3), and
> process 2 is already known. In the first loop, the existing
> bml_endpoint of process 2 is stored at bml_endpoints[2], and a new
> array containing processes 0, 1, and 3 is created. This new array
> replaces the procs array. Then bml_endpoints are created for all three
> new processes and stored at bml_endpoints[0], [1], and [2], so the
> existing entry at bml_endpoints[2] is overwritten.
> As a result, the bml_endpoints array contains only three valid entries,
> but the caller still believes there are 4, because the new count cannot
> be returned.
> 
> So, my question again: is this the intended behavior, or is it a bug?
> How should the function behave?
> 
> Thanks,
> André

