[OMPI devel] Oversubscription/Scheduling Bug

2006-05-05 Thread Paul Donohue
I would like to be able to start a non-oversubscribed run of a program in 
Open MPI as if it were oversubscribed, so that the processes run in Degraded 
Mode and I have the option to start an additional simultaneous run on the 
same nodes if necessary.
(Basically, I have a program that will ask for some data, run for a while, 
print some results, then stop and ask for more data.  It takes some time to 
collect and input the additional data, so I would like to be able to start 
another instance of the program which can be running while I'm inputting data 
to the first instance, and can be inputting while the first instance is 
running).

Since I have single-processor nodes, the obvious solution would be to set 
slots=0 for each of my nodes, so that using 1 slot for every run causes the 
nodes to be oversubscribed.  However, it seems that slots=0 is treated like 
slots=infinity, so my processes run in Aggressive Mode, and I lose the ability 
to oversubscribe my node using two independent processes.

So, I tried setting '--mca mpi_yield_when_idle 1', since this sounded like it 
was meant to force Degraded Mode.  But, it didn't seem to do anything - my 
processes still ran in Aggressive Mode.  I skimmed through the source code real 
quick, and it doesn't look like mpi_yield_when_idle is ever actually used.

So, could either slots=0 be changed to really mean slots=0, or could 
mpi_yield_when_idle be implemented so I can force my processes to run in 
Degraded Mode?
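
In case it helps clarify what I'm after, here is a minimal, self-contained 
sketch (my own toy code, not Open MPI's; all names are mine) of what I 
understand the two modes to mean at the process level: with yield-when-idle 
set, every pass through the progress loop that finds nothing to do gives up 
the CPU via sched_yield(), so a co-scheduled process can run.

/* Toy sketch of Degraded vs. Aggressive Mode -- not Open MPI code. */
#include <sched.h>
#include <stdbool.h>

static bool yield_when_idle = true;    /* stands in for mpi_yield_when_idle */

static bool progress_once(void)
{
    /* Poll the network / message queues; return true if anything advanced.
     * Stubbed out here: pretend there was nothing to do. */
    return false;
}

static void wait_for_event(bool *done)
{
    while (!*done) {
        if (!progress_once() && yield_when_idle) {
            sched_yield();             /* Degraded Mode: give up the CPU */
        }
        /* Aggressive Mode: just fall through and poll again */
    }
}

int main(void)
{
    bool done = true;                  /* nothing to wait for in this demo */
    wait_for_event(&done);
    return 0;
}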


I also noticed another bug in the scheduler:
hostfile:
 A slots=2 max-slots=2
 B slots=2 max-slots=2
'mpirun -np 5' quits with an over-subscription error
'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever


And finally, regarding FAQ item 11 on http://www.open-mpi.org/faq/?category=tuning 
("How do I tell Open MPI to use processor and/or memory affinity?"):
It mentions that Open MPI will automatically disable processor affinity on 
oversubscribed nodes.  When I first read it, I assumed that processor affinity 
and Degraded Mode were incompatible.  However, it seems that independent 
non-oversubscribed processes running in Degraded Mode work fine with processor 
affinity - it's only actually oversubscribed processes that have problems.  A 
note that Degraded Mode and processor affinity work together, even though 
processor affinity and oversubscription do not, would be nice.

Thanks a ton!
-Paul


Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-05-24 Thread Paul Donohue
> > Since I have single-processor nodes, the obvious solution 
> > would be to set slots=0 for each of my nodes, so that using 1 
> > slot for every run causes the nodes to be oversubscribed.  
> > However, it seems that slots=0 is treated like 
> > slots=infinity, so my processes run in Aggressive Mode, and I 
> > lose the ability to oversubscribe my node using two 
> > independent processes.
> I'd prefer to keep slots=0 synonymous with "infinity", if only for
> historical reasons (it's also less code to change :-) ).
Understandable. 'slots=0' mapping to 'infinity' is a useful feature, I think.  I 
only mentioned it because I figured I should give some justification for why 
getting mpi_yield_when_idle to work properly was necessary (since its 
functionality cannot be duplicated by mucking with the slots value).

> > So, I tried setting '--mca mpi_yield_when_idle 1', since this 
> > sounded like it was meant to force Degraded Mode.  But, it 
> > didn't seem to do anything - my processes still ran in 
> > Aggressive Mode.  I skimmed through the source code real 
> > quick, and it doesn't look like mpi_yield_when_idle is ever 
> > actually used.
> Are you sure?  How did you test this?

I'm using Open MPI 1.0.2 (in case it makes a difference)

$ mpirun -np 2 --hostfile test --host psd.umd.edu --mca mpi_yield_when_idle 1 
--mca orte_debug 1 hostname 2>&1 | grep yield
[psd:30325] pls:rsh: /usr/bin/ssh  orted --debug --bootproxy 1 
--name  --num_procs 2 --vpid_start 0 --nodename  --universe 
paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" 
--gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
[psd:30325] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[psd:30325] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 
--num_procs 2 --vpid_start 0 --nodename psd.umd.edu --universe 
paul@psd:default-universe-30325 --nsreplica "0.0.0;tcp://128.8.96.50:35281" 
--gprreplica "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
$

When it runs the worker processes, it passes --mpi-call-yield 0 to the workers 
even though I set mpi_yield_when_idle to 1.

Perhaps this has something to do with it:
(lines 689-703 of orte/mca/pls/rsh/pls_rsh_module.c)
/* set the progress engine schedule for this node.
 * if node_slots is set to zero, then we default to
 * NOT being oversubscribed
 */
if (ras_node->node_slots > 0 &&
    opal_list_get_size(&rmaps_node->node_procs) > ras_node->node_slots) {
    if (mca_pls_rsh_component.debug) {
        opal_output(0, "pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (%d %d)",
                    ras_node->node_slots,
                    opal_list_get_size(&rmaps_node->node_procs));
    }
    free(argv[call_yield_index]);
    argv[call_yield_index] = strdup("1");
} else {
    if (mca_pls_rsh_component.debug) {
        opal_output(0, "pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0");
    }
    free(argv[call_yield_index]);
    argv[call_yield_index] = strdup("0");
}

It looks like mpi_yield_when_idle is ignored and only slots are taken into 
account...
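
If it helps, here is a small, self-contained sketch of the decision I would 
expect the launcher to make: pass --mpi-call-yield 1 either when the user 
asked for it explicitly or when the node really is oversubscribed, instead of 
looking at the slot counts alone.  The variable names are mine, not Open 
MPI's, and 'user_forced_yield' stands in for however the launcher would learn 
that the user explicitly set mpi_yield_when_idle to 1.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical decision helper -- my own sketch, not Open MPI code. */
static int choose_call_yield(bool user_forced_yield,
                             int node_slots, int procs_on_node)
{
    bool oversubscribed = (node_slots > 0 && procs_on_node > node_slots);
    return (user_forced_yield || oversubscribed) ? 1 : 0;
}

int main(void)
{
    /* My situation: 1 slot, 1 process on the node, but mpi_yield_when_idle
     * set to 1 on the command line -- I'd want --mpi-call-yield 1 here. */
    printf("--mpi-call-yield %d\n", choose_call_yield(true, 1, 1));

    /* What the quoted check effectively computes for the same node. */
    printf("--mpi-call-yield %d\n", choose_call_yield(false, 1, 1));
    return 0;
}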

> It may be difficult to tell if this behavior is working properly
> because, by definition, if you're in an oversubscribed situation
> (assuming that all your processes are trying to fully utilize the CPU),
> the entire system could be running pretty slowly anyway.

In my case (fortunately? unfortunately?), it's fairly obvious whether Degraded 
Mode or Aggressive Mode is being used, since one process is idle (waiting for 
user input) while the other one is running.  Even though the node is actually 
oversubscribed, in Degraded Mode the running process should be able to use 
most of the CPU, since the idle process isn't doing much.

> I just did a small test: running 3 processes on a 2-way SMP.  Each MPI
> process sends a short message around in a ring pattern 100 times:

I tried testing 4 processes on a 2-way SMP as well.
One pair of processes is waiting on STDIN.
The other pair of processes is running calculations.

First, I ran only the calculations, without the STDIN processes: 35.5 second run time.
Then I ran both pairs of processes, using slots=2 in my hostfile and 
mpi_yield_when_idle=1 for both pairs: 25 minute run time.
Then I ran both pairs of processes, using slots=1 in my hostfile: 48 second run time.

Pretty drastic difference ;-)
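
In case it's useful, here is a rough, self-contained sketch of the kind of 
test I ran (a reconstruction for illustration only, not my actual application; 
the compute loop and iteration count are made up).

/* toy_pair.c -- one mpirun of this program with the argument "input" sits
 * waiting on STDIN; a second mpirun with "compute" runs a CPU-bound loop
 * and reports its run time.  Build: mpicc toy_pair.c -o toy_pair */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int rank;
    long i;
    double t0, t1, x = 0.0;
    char line[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc > 1 && strcmp(argv[1], "input") == 0) {
        /* the "idle" pair: rank 0 blocks on stdin and does no real work */
        if (rank == 0 && fgets(line, sizeof(line), stdin) == NULL) {
            /* EOF: nothing to read, just fall through */
        }
    } else {
        /* the "busy" pair: a CPU-bound loop, timed with MPI_Wtime() */
        t0 = MPI_Wtime();
        for (i = 0; i < 200000000L; i++) {
            x += 1.0 / (double)(i + 1);
        }
        t1 = MPI_Wtime();
        printf("rank %d: %.2f seconds (sum %f)\n", rank, t1 - t0, x);
    }

    /* Ranks that reach this barrier early will spin in it unless they are
     * running in Degraded Mode -- which is exactly the effect being measured. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Running 'mpirun -np 2 --hostfile test ./toy_pair compute' by itself, and then 
again with 'mpirun -np 2 --hostfile test ./toy_pair input' started alongside 
it, reproduces the comparison above.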

> > I also noticed another bug in the scheduler:
> > hostfile:
> >  A slots=2 max-slots=2
> >  B slots=2 max-slots=2
> > 'mpirun -np 5' quits with an over-subscription error
> > 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever
> Yoinks; this is definitely a bug.  I've filed a bug in our tracker to
> get this fixed.  Than

Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-06-05 Thread Paul Donohue
Sorry Brian and Jeff - I sent you chasing after something of a red herring...

After much more testing and banging my head on the desk trying to figure this 
one out, it turns out '--mca mpi_yield_when_idle 1' on the command line does 
actually work properly for me...  The one or two times I had previously tried 
using the command line argument, my app (by unfortunate coincidence - it took 
me a long time to figure this one out) happened to run slowly for completely 
unrelated reasons.

However, instead of typing the command line argument each time, for the bulk of 
my testing I was putting 'mpi_yield_when_idle = 1' in 
/usr/local/etc/openmpi-mca-params.conf on the machine I ran 'mpirun' from.  I 
didn't update that file on each of my worker nodes - only on the node I was 
running 'mpirun' from.  I had assumed that this would have the same effect as 
typing '--mca mpi_yield_when_idle 1' on the command line: mpirun would read 
/usr/local/etc/openmpi-mca-params.conf, import all of the parameters, and then 
propagate those parameters to the worker nodes as if they had been typed on 
the command line.  Apparently, in reality, orted reads 
/usr/local/etc/openmpi-mca-params.conf on the local node where orted is 
actually running, and entries in the file on the node where 'mpirun' is run are 
not propagated.  Is this a bug or an undocumented feature? ;)

Sorry to have wasted your time chasing the wrong problem...
-Paul

On Fri, May 26, 2006 at 01:09:22PM -0400, Brian W. Barrett wrote:
> On Fri, 26 May 2006, Brian W. Barrett wrote:
> 
> > On Fri, 26 May 2006, Jeff Squyres (jsquyres) wrote:
> >
> >> You can see this by slightly modifying your test command -- run "env"
> >> instead of "hostname".  You'll see that the environment variable
> >> OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on
> >> the mpirun command line, regardless of a) whether you're oversubscribing
> >> or not, and b) whatever is passed in through the orted.
> >
> > While Jeff is correct that the parameter informing the MPI process that it
> > should idle when it's not busy is correctly set, it turns out that we are
> > ignoring this parameter inside the MPI process.  I'm looking into this and
> > hope to have a fix this afternoon.
> 
> Mea culpa.  Jeff's right that in a normal application, we are setting up 
> to call sched_yield() when idle if the user sets mpi_yield_when_idle to 1, 
> regardless of what is in the hostfile.  The problem with my test case was 
> that for various reasons, my test code was never actually "idling" - there 
> were always things moving along, so our progress engine was deciding that 
> the process should not be idled.
> 
> Can you share your test code at all?  I'm wondering if something similar 
> is happening with your code.  It doesn't sound like it should be "always 
> working", but I'm wondering if you're triggering some corner case we 
> haven't thought of.
> 
> Brian
> 
> -- 
> Brian Barrett
> Graduate Student, Open Systems Lab, Indiana University
> http://www.osl.iu.edu/~brbarret/


Re: [OMPI devel] Oversubscription/Scheduling Bug

2006-06-05 Thread Paul Donohue
> You make a good point about the values in that file, though -- I'll add
> some information to the FAQ that such config files are only valid on the
> nodes where they can be seen (i.e., that mpirun does not bundle up all
> these files and send them to remote nodes during mpirun).  Sorry for the
> confusion!
It would probably be helpful to add a note about this to the comments in the 
default copy of the config files as well.
Thanks a bunch!
-Paul