Hi,

I've been running into an odd issue with orte-dvm (Hi Ralph ;-) when trying to 
limit the size of the created VM. For that I use "--mca orte_max_vm_size", 
which in general seems to work.

In this example I have a PBS job of 4 nodes and want to run the DVM on fewer 
than 4 nodes.
If I create the VM with size 3 or 4 (max_vm_size 1 and 0 respectively) 
everything works as expected.
However, when I create a VM of size 1 or 2 (max_vm_size 3 and 2 respectively) I 
get the stack trace below once I use orte-submit to start something within the 
VM.
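
For reference, the failing sequence looks roughly like the sketch below. The 
URI file name dvm_uri and the "hostname" test program are placeholders, and the 
sketch assumes the usual --report-uri / --hnp handshake between orte-dvm and 
orte-submit; exact options may differ between builds.

```shell
# Sketch only: dvm_uri and "hostname" are placeholders, not taken from the actual run.
# Start the DVM inside the 4-node PBS job with a size limit that triggers the abort.
orte-dvm --mca orte_max_vm_size 2 --report-uri dvm_uri &

# Give the DVM a moment to write its URI file, then submit a job into it.
sleep 5
orte-submit --hnp file:dvm_uri -np 1 hostname
```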

[nid01280:02498] [[39239,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SPAWN_JOB_CMD
orte-dvm: ../../../../../src/ompi/opal/class/opal_list.h:547: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.
[nid01280:02498] *** Process received signal ***
[nid01280:02498] Signal: Aborted (6)
[nid01280:02498] Signal code:  (-6)
[nid01280:02498] [ 0] /lib64/libpthread.so.0(+0xf810)[0x2ba3e274a810]
[nid01280:02498] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2ba3e298b885]
[nid01280:02498] [ 2] /lib64/libc.so.6(abort+0x181)[0x2ba3e298ce61]
[nid01280:02498] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x2ba3e2984740]
[nid01280:02498] [ 4] /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-rte.so.0(+0x83f16)[0x2ba3e1687f16]
[nid01280:02498] [ 5] /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-rte.so.0(orte_plm_base_setup_virtual_machine+0x473)[0x2ba3e16907fe]
[nid01280:02498] [ 6] /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/openmpi/mca_plm_alps.so(+0x274d)[0x2ba3e666574d]
[nid01280:02498] [ 7] /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xd81)[0x2ba3e198cee1]
[nid01280:02498] [ 8] /global/homes/m/marksant/openmpi/edison/installed/HEAD/bin/orte-dvm[0x402e20]
[nid01280:02498] [ 9] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2ba3e2977c36]
[nid01280:02498] [10] /global/homes/m/marksant/openmpi/edison/installed/HEAD/bin/orte-dvm[0x401d19]
[nid01280:02498] *** End of error message ***
[nid05888:25419] [[39239,0],1]:../../../../../../src/ompi/orte/mca/errmgr/default_orted/errmgr_default_orted.c(251) updating exit status to 1


Some questions:
- Am I understanding the purpose of orte_max_vm_size correctly?
- If so, then it seems some refcounting is off. I'm not sure where to start 
looking, though ...
- I would prefer a more flexible way of specifying the size of the VM, but 
currently the orte-dvm command line parameters are limited. Would it be a 
matter of copying over some of the "-host" parameter handling from orterun, or 
is it more involved?
- Can I also configure the -host, -np, etc. parameters via MCA settings?
- What's the magic combination of parameters to get all information out of 
ompi_info? I can't find a way to even "find" the orte_max_vm_size parameter 
with it, although I know it exists.

Thanks!

Mark
