Hi Ralph,

Sorry for the late reply, something along the lines of "swamped" ;-)

> On 03 Sep 2015, at 16:04 , Ralph Castain <r...@open-mpi.org> wrote:
> The purpose of orte_max_vm_size is to subdivide the allocation - i.e., for a 
> given mpirun execution, you can specify to only use a certain number of the 
> allocated nodes. If you want to further limit the VM to specific nodes in the 
> allocation, then you would use the -host option.

*nods* Thanks, that's also how I interpreted it.

> It’s a little more complicated for your use-case as orte-dvm defines the VM, 
> not orte-submit. The latter simply tells orte-dvm to launch an application - 
> the daemons have already been established by orte-dvm and cannot change. So 
> if you want to set up orte-dvm and then submit to only some of the nodes, you 
> would have to use the -host option. Note that -host supports an extended 
> syntax for this purpose - you can ask for a specific number of “empty” nodes, 
> you can tell it to use only so many slots on a node, etc.

Ack. My question originated from running the dvm on a limited set.

> I’m confused by your examples because the max_vm_size values don’t seem 
> right. If you have a VM of size 1 or 2, then max_vm_size can only be 1 or 2. 
> You can’t have a max_vm_size larger than the number of available nodes. This 
> is probably the source of the problem you are seeing - I can add some 
> protection to ensure this doesn’t happen.

I screwed up my write-up; the actual calls were correct, but I understand your 
confusion :-)
(In my code I have a "reservation size", which I mixed up with the VM size in 
my original mail.)
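
To make the constraint above concrete, here is a minimal Python sketch of the
kind of bounds check being discussed. It is a toy model and not ORTE code: the
function name select_vm_nodes, the node names, the choice of taking the first
N nodes, and treating None as "no cap" are all assumptions made for
illustration.

```python
from typing import List, Optional


def select_vm_nodes(allocated: List[str],
                    max_vm_size: Optional[int] = None) -> List[str]:
    """Return the nodes a VM would span under a size cap (toy model)."""
    if max_vm_size is None:
        # Assumption: no cap means the VM spans the whole allocation.
        return list(allocated)
    if max_vm_size < 1:
        raise ValueError("max_vm_size must be at least 1")
    if max_vm_size > len(allocated):
        # The "protection" idea: reject a cap larger than the allocation.
        raise ValueError(
            f"max_vm_size {max_vm_size} exceeds the "
            f"{len(allocated)} allocated nodes"
        )
    # Assumption: the VM takes the first max_vm_size nodes, in order.
    return allocated[:max_vm_size]


# Hypothetical 4-node allocation, as in the PBS job described in the thread.
nodes = ["node0", "node1", "node2", "node3"]
vm = select_vm_nodes(nodes, 2)
```

With a 4-node allocation, a cap of 2 yields a 2-node VM, and a cap of 5 is
rejected up front instead of failing later.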

> We don’t appear to support either -host or -np as MCA params.
> I’m not sure -np would make sense,

I probably agree with that.

> but we could add a param for -host.

Yeah, that would help.

> We do have a param for the default hostfile, but that probably wouldn’t help 
> here.

I was expecting such a thing actually, that also raised my MCA question.

> We can certainly extend the orte-dvm and orte-submit cmd lines. I only 
> brought over a minimal set at first in order to get things running quickly, 
> but no problem with increasing capability. Just a question of finding a 
> little time.

Fully understandable!

> For ompi_info, try doing “ompi_info -l 9” to get the full output of params.

Right, I tried that. Either I don't understand it completely or it doesn't work 
as expected, as I can't get e.g. "orte_max_vm_size" to show up in that output.

(I also believe that -all sets the level to 9 already)

Thanks!

Mark


> 
> 
>> On Sep 3, 2015, at 5:08 AM, Mark Santcroos <mark.santcr...@rutgers.edu> 
>> wrote:
>> 
>> Hi,
>> 
>> I've been running into a funny issue with orte-dvm (Hi Ralph ;-) 
>> when trying to define the size of the created VM. For that I use "--mca 
>> orte_max_vm_size", which in general seems to work.
>> 
>> In this example I have a PBS job of 4 nodes and want to run the DVM on < 4 
>> nodes.
>> If I create the VM with size 3 or 4 (max_vm_size 1 and 0 respectively) 
>> everything works as expected.
>> However, when I create a VM of size 1 or 2 (max_vm_size 3 and 2 
>> respectively) I get the stack trace below once I use orte-submit to start 
>> something within the VM.
>> 
>> [nid01280:02498] [[39239,0],0] orted:comm:process_commands() Processing 
>> Command: ORTE_DAEMON_SPAWN_JOB_CMD
>> orte-dvm: ../../../../../src/ompi/opal/class/opal_list.h:547: 
>> _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.
>> [nid01280:02498] *** Process received signal ***
>> [nid01280:02498] Signal: Aborted (6)
>> [nid01280:02498] Signal code:  (-6)
>> [nid01280:02498] [ 0] /lib64/libpthread.so.0(+0xf810)[0x2ba3e274a810]
>> [nid01280:02498] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2ba3e298b885]
>> [nid01280:02498] [ 2] /lib64/libc.so.6(abort+0x181)[0x2ba3e298ce61]
>> [nid01280:02498] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x2ba3e2984740]
>> [nid01280:02498] [ 4] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-rte.so.0(+0x83f16)[0x2ba3e1687f16]
>> [nid01280:02498] [ 5] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-rte.so.0(orte_plm_base_setup_virtual_machine+0x473)[0x2ba3e16907fe]
>> [nid01280:02498] [ 6] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/openmpi/mca_plm_alps.so(+0x274d)[0x2ba3e666574d]
>> [nid01280:02498] [ 7] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xd81)[0x2ba3e198cee1]
>> [nid01280:02498] [ 8] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/bin/orte-dvm[0x402e20]
>> [nid01280:02498] [ 9] 
>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x2ba3e2977c36]
>> [nid01280:02498] [10] 
>> /global/homes/m/marksant/openmpi/edison/installed/HEAD/bin/orte-dvm[0x401d19]
>> [nid01280:02498] *** End of error message ***
>> [nid05888:25419] 
>> [[39239,0],1]:../../../../../../src/ompi/orte/mca/errmgr/default_orted/errmgr_default_orted.c(251)
>>  updating exit status to 1
>> 
>> 
>> Some questions:
>> - Am I understanding the purpose of orte_max_vm_size correctly?
>> - If so, then it seems some refcounting is off. Not sure where to start 
>> looking though ...
>> - I would rather have a bit more flexible way of specifying the size of the 
>> VM, but currently the orte-dvm command line parameters are limited. Would it 
>> be a matter of copying over some of the "-host" parameter stuff from 
>> orte-run or is it more involved?
>> - Can I configure the -host, -np, etc parameters also via MCA settings?
>> - What's the magic combination of parameters to get all information out of 
>> ompi_info? I can't find a way to even "find" the orte_max_vm_size 
>> parameter in its output, while I know it exists.
>> 
>> Thanks!
>> 
>> Mark
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/09/17930.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17934.php
