Oswin,

One more thing: can you run

pbsdsh -v hostname

before invoking mpirun?
Hopefully that prints the three hostnames.

Then you can run

ldd `which pbsdsh`

to see which libtorque.so it is linked against.
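
For example, all three checks can go in a single job script (a minimal
sketch; the resource line just mirrors the earlier -l nodes=3:ppn=1
submission, and the grep only trims the ldd output):

#!/bin/bash
#PBS -l nodes=3:ppn=1
# TM-based launcher: each allocated node should echo its hostname
pbsdsh -v hostname
# check which libtorque.so pbsdsh resolves at run time
ldd `which pbsdsh` | grep libtorque
# then the failing case, for comparison
mpirun hostname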

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>Hi Gilles,
>
>There you go:
>
>[zbh251@a00551 ~]$ cat $PBS_NODEFILE
>a00551.science.domain
>a00554.science.domain
>a00553.science.domain
>[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca 
>plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>[a00551.science.domain:18889] mca: base: components_register: 
>registering framework ess components
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component pmi
>[a00551.science.domain:18889] mca: base: components_register: component 
>pmi has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tool
>[a00551.science.domain:18889] mca: base: components_register: component 
>tool has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component env
>[a00551.science.domain:18889] mca: base: components_register: component 
>env has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component hnp
>[a00551.science.domain:18889] mca: base: components_register: component 
>hnp has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component singleton
>[a00551.science.domain:18889] mca: base: components_register: component 
>singleton register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:18889] mca: base: components_register: component 
>slurm has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18889] mca: base: components_register: component 
>tm has no register or open function
>[a00551.science.domain:18889] mca: base: components_open: opening ess 
>components
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component pmi
>[a00551.science.domain:18889] mca: base: components_open: component pmi 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tool
>[a00551.science.domain:18889] mca: base: components_open: component tool 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component env
>[a00551.science.domain:18889] mca: base: components_open: component env 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component hnp
>[a00551.science.domain:18889] mca: base: components_open: component hnp 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component singleton
>[a00551.science.domain:18889] mca: base: components_open: component 
>singleton open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:18889] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18889] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18889] mca:base:select: Auto-selecting ess 
>components
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[pmi]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tool]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[env]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[hnp]
>[a00551.science.domain:18889] mca:base:select:(  ess) Query of component 
>[hnp] set priority to 100
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[singleton]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[slurm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Selected component 
>[hnp]
>[a00551.science.domain:18889] mca: base: close: component pmi closed
>[a00551.science.domain:18889] mca: base: close: unloading component pmi
>[a00551.science.domain:18889] mca: base: close: component tool closed
>[a00551.science.domain:18889] mca: base: close: unloading component tool
>[a00551.science.domain:18889] mca: base: close: component env closed
>[a00551.science.domain:18889] mca: base: close: unloading component env
>[a00551.science.domain:18889] mca: base: close: component singleton 
>closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>singleton
>[a00551.science.domain:18889] mca: base: close: component slurm closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>slurm
>[a00551.science.domain:18889] mca: base: close: component tm closed
>[a00551.science.domain:18889] mca: base: close: unloading component tm
>[a00551.science.domain:18889] mca: base: components_register: 
>registering framework plm components
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component isolated
>[a00551.science.domain:18889] mca: base: components_register: component 
>isolated has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component rsh
>[a00551.science.domain:18889] mca: base: components_register: component 
>rsh register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:18889] mca: base: components_register: component 
>slurm register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18889] mca: base: components_register: component 
>tm register function successful
>[a00551.science.domain:18889] mca: base: components_open: opening plm 
>components
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component isolated
>[a00551.science.domain:18889] mca: base: components_open: component 
>isolated open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component rsh
>[a00551.science.domain:18889] mca: base: components_open: component rsh 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:18889] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18889] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18889] mca:base:select: Auto-selecting plm 
>components
>[a00551.science.domain:18889] mca:base:select:(  plm) Querying component 
>[isolated]
>[a00551.science.domain:18889] mca:base:select:(  plm) Query of component 
>[isolated] set priority to 0
>[a00551.science.domain:18889] mca:base:select:(  plm) Querying component 
>[rsh]
>[a00551.science.domain:18889] [[INVALID],INVALID] plm:rsh_lookup on 
>agent ssh : rsh path NULL
>[a00551.science.domain:18889] mca:base:select:(  plm) Query of component 
>[rsh] set priority to 10
>[a00551.science.domain:18889] mca:base:select:(  plm) Querying component 
>[slurm]
>[a00551.science.domain:18889] mca:base:select:(  plm) Querying component 
>[tm]
>[a00551.science.domain:18889] mca:base:select:(  plm) Query of component 
>[tm] set priority to 75
>[a00551.science.domain:18889] mca:base:select:(  plm) Selected component 
>[tm]
>[a00551.science.domain:18889] mca: base: close: component isolated 
>closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>isolated
>[a00551.science.domain:18889] mca: base: close: component rsh closed
>[a00551.science.domain:18889] mca: base: close: unloading component rsh
>[a00551.science.domain:18889] mca: base: close: component slurm closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>slurm
>[a00551.science.domain:18889] plm:base:set_hnp_name: initial bias 18889 
>nodename hash 2226275586
>[a00551.science.domain:18889] plm:base:set_hnp_name: final jobfam 34937
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive start comm
>[a00551.science.domain:18889] mca: base: components_register: 
>registering framework ras components
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component loadleveler
>[a00551.science.domain:18889] mca: base: components_register: component 
>loadleveler register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component simulator
>[a00551.science.domain:18889] mca: base: components_register: component 
>simulator register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:18889] mca: base: components_register: component 
>slurm register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18889] mca: base: components_register: component 
>tm register function successful
>[a00551.science.domain:18889] mca: base: components_open: opening ras 
>components
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component loadleveler
>[a00551.science.domain:18889] mca: base: components_open: component 
>loadleveler open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component simulator
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:18889] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18889] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18889] mca:base:select: Auto-selecting ras 
>components
>[a00551.science.domain:18889] mca:base:select:(  ras) Querying component 
>[loadleveler]
>[a00551.science.domain:18889] [[34937,0],0] ras:loadleveler: NOT 
>available for selection
>[a00551.science.domain:18889] mca:base:select:(  ras) Querying component 
>[simulator]
>[a00551.science.domain:18889] mca:base:select:(  ras) Querying component 
>[slurm]
>[a00551.science.domain:18889] mca:base:select:(  ras) Querying component 
>[tm]
>[a00551.science.domain:18889] mca:base:select:(  ras) Query of component 
>[tm] set priority to 100
>[a00551.science.domain:18889] mca:base:select:(  ras) Selected component 
>[tm]
>[a00551.science.domain:18889] mca: base: close: unloading component 
>loadleveler
>[a00551.science.domain:18889] mca: base: close: unloading component 
>simulator
>[a00551.science.domain:18889] mca: base: close: component slurm closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>slurm
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_job
>[a00551.science.domain:18889] [[34937,0],0] ras:base:allocate
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>got hostname a00551.science.domain
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>not found -- added to list
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>got hostname a00554.science.domain
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>not found -- added to list
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>got hostname a00553.science.domain
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:allocate:discover: 
>not found -- added to list
>[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert 
>inserting 3 nodes
>[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert 
>updating HNP [a00551.science.domain] info to 1 slots
>[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert node 
>a00554.science.domain slots 1
>[a00551.science.domain:18889] [[34937,0],0] ras:base:node_insert node 
>a00553.science.domain slots 1
>
>======================   ALLOCATED NODES   ======================
>       a00551: slots=1 max_slots=0 slots_inuse=0 state=UP
>       a00554.science.domain: slots=1 max_slots=0 slots_inuse=0 state=UP
>       a00553.science.domain: slots=1 max_slots=0 slots_inuse=0 state=UP
>=================================================================
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm creating 
>map
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm add new 
>daemon [[34937,0],1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm assigning 
>new daemon [[34937,0],1] to node a00554.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm add new 
>daemon [[34937,0],2]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setup_vm assigning 
>new daemon [[34937,0],2] to node a00553.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching vm
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: final top-level 
>argv:
>       orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm 
>-mca ess_base_jobid 2289631232 -mca ess_base_vpid <template> -mca 
>ess_base_num_procs 3 -mca orte_hnp_uri 
>2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374
> 
>--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca 
>ras_base_verbose 10
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching on node 
>a00554.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: executing:
>       orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm 
>-mca ess_base_jobid 2289631232 -mca ess_base_vpid 1 -mca 
>ess_base_num_procs 3 -mca orte_hnp_uri 
>2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374
> 
>--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca 
>ras_base_verbose 10
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: launching on node 
>a00553.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:tm: executing:
>       orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm 
>-mca ess_base_jobid 2289631232 -mca ess_base_vpid 2 -mca 
>ess_base_num_procs 3 -mca orte_hnp_uri 
>2289631232.0;usock;tcp://130.226.12.194:59413;tcp6://[fe80::225:90ff:feeb:f6d5]:46374
> 
>--mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca 
>ras_base_verbose 10
>[a00551.science.domain:18889] [[34937,0],0] plm:tm:launch: finished 
>spawning orteds
>[a00551.science.domain:18894] mca: base: components_register: 
>registering framework ess components
>[a00551.science.domain:18894] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18894] mca: base: components_register: component 
>tm has no register or open function
>[a00551.science.domain:18894] mca: base: components_open: opening ess 
>components
>[a00551.science.domain:18894] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18894] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18894] mca:base:select: Auto-selecting ess 
>components
>[a00551.science.domain:18894] mca:base:select:(  ess) Querying component 
>[tm]
>[a00551.science.domain:18894] mca:base:select:(  ess) Query of component 
>[tm] set priority to 30
>[a00551.science.domain:18894] mca:base:select:(  ess) Selected component 
>[tm]
>[a00551.science.domain:18894] ess:tm setting name
>[a00551.science.domain:18894] ess:tm set name to [[34937,0],1]
>[a00551.science.domain:18895] mca: base: components_register: 
>registering framework ess components
>[a00551.science.domain:18895] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18895] mca: base: components_register: component 
>tm has no register or open function
>[a00551.science.domain:18895] mca: base: components_open: opening ess 
>components
>[a00551.science.domain:18895] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18895] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18895] mca:base:select: Auto-selecting ess 
>components
>[a00551.science.domain:18895] mca:base:select:(  ess) Querying component 
>[tm]
>[a00551.science.domain:18895] mca:base:select:(  ess) Query of component 
>[tm] set priority to 30
>[a00551.science.domain:18895] mca:base:select:(  ess) Selected component 
>[tm]
>[a00551.science.domain:18895] ess:tm setting name
>[a00551.science.domain:18895] ess:tm set name to [[34937,0],2]
>[a00551.science.domain:18894] mca: base: components_register: 
>registering framework plm components
>[a00551.science.domain:18894] mca: base: components_register: found 
>loaded component rsh
>[a00551.science.domain:18894] mca: base: components_register: component 
>rsh register function successful
>[a00551.science.domain:18894] mca: base: components_open: opening plm 
>components
>[a00551.science.domain:18894] mca: base: components_open: found loaded 
>component rsh
>[a00551.science.domain:18894] mca: base: components_open: component rsh 
>open function successful
>[a00551.science.domain:18894] mca:base:select: Auto-selecting plm 
>components
>[a00551.science.domain:18894] mca:base:select:(  plm) Querying component 
>[rsh]
>[a00551.science.domain:18894] [[34937,0],1] plm:rsh_lookup on agent ssh 
>: rsh path NULL
>[a00551.science.domain:18894] mca:base:select:(  plm) Query of component 
>[rsh] set priority to 10
>[a00551.science.domain:18894] mca:base:select:(  plm) Selected component 
>[rsh]
>[a00551.science.domain:18894] [[34937,0],1] setting up session dir with
>       tmpdir: UNDEF
>       host a00551
>[a00551.science.domain:18894] [[34937,0],1] bind() failed on error 
>Address already in use (98)
>[a00551.science.domain:18894] [[34937,0],1] ORTE_ERROR_LOG: Error in 
>file oob_usock_component.c at line 228
>[a00551.science.domain:18894] [[34937,0],1] plm:rsh_setup on agent ssh : 
>rsh path NULL
>[a00551.science.domain:18894] [[34937,0],1] plm:base:receive start comm
>[a00551.science.domain:18895] mca: base: components_register: 
>registering framework plm components
>[a00551.science.domain:18895] mca: base: components_register: found 
>loaded component rsh
>[a00551.science.domain:18895] mca: base: components_register: component 
>rsh register function successful
>[a00551.science.domain:18895] mca: base: components_open: opening plm 
>components
>[a00551.science.domain:18895] mca: base: components_open: found loaded 
>component rsh
>[a00551.science.domain:18895] mca: base: components_open: component rsh 
>open function successful
>[a00551.science.domain:18895] mca:base:select: Auto-selecting plm 
>components
>[a00551.science.domain:18895] mca:base:select:(  plm) Querying component 
>[rsh]
>[a00551.science.domain:18895] [[34937,0],2] plm:rsh_lookup on agent ssh 
>: rsh path NULL
>[a00551.science.domain:18895] mca:base:select:(  plm) Query of component 
>[rsh] set priority to 10
>[a00551.science.domain:18895] mca:base:select:(  plm) Selected component 
>[rsh]
>[a00551.science.domain:18895] [[34937,0],2] setting up session dir with
>       tmpdir: UNDEF
>       host a00551
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>from daemon [[34937,0],1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>from daemon [[34937,0],1] on node a00551
>[a00551.science.domain:18895] [[34937,0],2] bind() failed on error 
>Address already in use (98)
>[a00551.science.domain:18895] [[34937,0],2] ORTE_ERROR_LOG: Error in 
>file oob_usock_component.c at line 228
>[a00551.science.domain:18889] [[34937,0],0] RECEIVED TOPOLOGY FROM NODE 
>a00551
>[a00551.science.domain:18889] [[34937,0],0] ADDING TOPOLOGY PER USER 
>REQUEST TO NODE a00554.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>completed for daemon [[34937,0],1] at contact 
>2289631232.1;tcp://130.226.12.194:46861;tcp6://[fe80::225:90ff:feeb:f6d5]:33227
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>recvd 2 of 3 reported daemons
>[a00551.science.domain:18895] [[34937,0],2] plm:rsh_setup on agent ssh : 
>rsh path NULL
>[a00551.science.domain:18895] [[34937,0],2] plm:base:receive start comm
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>from daemon [[34937,0],2]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>from daemon [[34937,0],2] on node a00551
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>completed for daemon [[34937,0],2] at contact 
>2289631232.2;tcp://130.226.12.194:38146;tcp6://[fe80::225:90ff:feeb:f6d5]:44834
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_report_launch 
>recvd 3 of 3 reported daemons
>[a00551.science.domain:18889] [[34937,0],0] plm:base:setting topo to 
>that from node a00554.science.domain
>[a00551.science.domain:18889] [[34937,0],0] complete_setup on job 
>[34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:launch_apps for job 
>[34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing 
>msg
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc 
>state command from [[34937,0],1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for job [34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for vpid 1 state RUNNING exit_code 0
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done 
>processing commands
>a00551.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing 
>msg
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc 
>state command from [[34937,0],2]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for job [34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for vpid 2 state RUNNING exit_code 0
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done 
>processing commands
>[a00551.science.domain:18889] [[34937,0],0] plm:base:launch wiring up 
>iof for job [34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:launch job 
>[34937,1] is not a dynamic spawn
>a00551.science.domain
>a00551.science.domain
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing 
>msg
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc 
>state command from [[34937,0],1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for job [34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done 
>processing commands
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive processing 
>msg
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive update proc 
>state command from [[34937,0],2]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for job [34937,1]
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive got 
>update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive done 
>processing commands
>[a00551.science.domain:18889] [[34937,0],0] plm:base:orted_cmd sending 
>orted_exit commands
>[a00551.science.domain:18894] [[34937,0],1] plm:base:receive stop comm
>[a00551.science.domain:18894] mca: base: close: component rsh closed
>[a00551.science.domain:18894] mca: base: close: unloading component rsh
>[a00551.science.domain:18895] [[34937,0],2] plm:base:receive stop comm
>[a00551.science.domain:18895] mca: base: close: component rsh closed
>[a00551.science.domain:18895] mca: base: close: unloading component rsh
>[a00551.science.domain:18895] mca: base: close: component tm closed
>[a00551.science.domain:18895] mca: base: close: unloading component tm
>[a00551.science.domain:18894] mca: base: close: component tm closed
>[a00551.science.domain:18894] mca: base: close: unloading component tm
>[a00551.science.domain:18889] [[34937,0],0] ras:tm:finalize: success 
>(nothing to do)
>[a00551.science.domain:18889] mca: base: close: unloading component tm
>[a00551.science.domain:18889] [[34937,0],0] plm:base:receive stop comm
>[a00551.science.domain:18889] mca: base: close: component tm closed
>[a00551.science.domain:18889] mca: base: close: unloading component tm
>[a00551.science.domain:18889] mca: base: close: component hnp closed
>[a00551.science.domain:18889] mca: base: close: unloading component hnp
>
>
>Cheers,
>Oswin
>
>On 2016-09-08 12:13, Gilles Gouaillardet wrote:
>> Oswin,
>> 
>> 
>> Can you please run again (one task per physical node) with
>> 
>> mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca
>> ras_base_verbose 10 hostname
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> 
>> On 9/8/2016 6:42 PM, Oswin Krause wrote:
>>> Hi,
>>> 
>>> I reconfigured to non-NUMA, so that each physical node appears only 
>>> once. Still no success, but the nodefile now looks better. I still get 
>>> the errors:
>>> 
>>> [a00551.science.domain:18021] [[34768,0],1] bind() failed on error 
>>> Address already in use (98)
>>> [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in 
>>> file oob_usock_component.c at line 228
>>> [a00551.science.domain:18022] [[34768,0],2] bind() failed on error 
>>> Address already in use (98)
>>> [a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in 
>>> file oob_usock_component.c at line 228
>>> 
>>> (btw: for some reason the bind errors were missing earlier. sorry!)
>>> 
>>> PBS_NODEFILE
>>> a00551.science.domain
>>> a00554.science.domain
>>> a00553.science.domain
>>> -----------------------
>>> mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>> [a00551.science.domain:18097] mca: base: components_register: 
>>> registering framework plm components
>>> [a00551.science.domain:18097] mca: base: components_register: found 
>>> loaded component isolated
>>> [a00551.science.domain:18097] mca: base: components_register: 
>>> component isolated has no register or open function
>>> [a00551.science.domain:18097] mca: base: components_register: found 
>>> loaded component rsh
>>> [a00551.science.domain:18097] mca: base: components_register: 
>>> component rsh register function successful
>>> [a00551.science.domain:18097] mca: base: components_register: found 
>>> loaded component slurm
>>> [a00551.science.domain:18097] mca: base: components_register: 
>>> component slurm register function successful
>>> [a00551.science.domain:18097] mca: base: components_register: found 
>>> loaded component tm
>>> [a00551.science.domain:18097] mca: base: components_register: 
>>> component tm register function successful
>>> [a00551.science.domain:18097] mca: base: components_open: opening plm 
>>> components
>>> [a00551.science.domain:18097] mca: base: components_open: found loaded 
>>> component isolated
>>> [a00551.science.domain:18097] mca: base: components_open: component 
>>> isolated open function successful
>>> [a00551.science.domain:18097] mca: base: components_open: found loaded 
>>> component rsh
>>> [a00551.science.domain:18097] mca: base: components_open: component 
>>> rsh open function successful
>>> [a00551.science.domain:18097] mca: base: components_open: found loaded 
>>> component slurm
>>> [a00551.science.domain:18097] mca: base: components_open: component 
>>> slurm open function successful
>>> [a00551.science.domain:18097] mca: base: components_open: found loaded 
>>> component tm
>>> [a00551.science.domain:18097] mca: base: components_open: component tm 
>>> open function successful
>>> [a00551.science.domain:18097] mca:base:select: Auto-selecting plm 
>>> components
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Querying 
>>> component [isolated]
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Query of 
>>> component [isolated] set priority to 0
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Querying 
>>> component [rsh]
>>> [a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on 
>>> agent ssh : rsh path NULL
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Query of 
>>> component [rsh] set priority to 10
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Querying 
>>> component [slurm]
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Querying 
>>> component [tm]
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Query of 
>>> component [tm] set priority to 75
>>> [a00551.science.domain:18097] mca:base:select:(  plm) Selected 
>>> component [tm]
>>> [a00551.science.domain:18097] mca: base: close: component isolated 
>>> closed
>>> [a00551.science.domain:18097] mca: base: close: unloading component 
>>> isolated
>>> [a00551.science.domain:18097] mca: base: close: component rsh closed
>>> [a00551.science.domain:18097] mca: base: close: unloading component 
>>> rsh
>>> [a00551.science.domain:18097] mca: base: close: component slurm closed
>>> [a00551.science.domain:18097] mca: base: close: unloading component 
>>> slurm
>>> [a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 
>>> 18097 nodename hash 2226275586
>>> [a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 
>>> 34561
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive start 
>>> comm
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating 
>>> map
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
>>> daemon [[34561,0],1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm 
>>> assigning new daemon [[34561,0],1] to node a00554.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
>>> daemon [[34561,0],2]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm 
>>> assigning new daemon [[34561,0],2] to node a00553.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching vm
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: final top-level 
>>> argv:
>>>     orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid <template> -mca 
>>> ess_base_num_procs 3 -mca orte_hnp_uri 
>>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
>>>  
>>> --mca plm_base_verbose 10
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node 
>>> a00554.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: executing:
>>>     orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid 1 -mca 
>>> ess_base_num_procs 3 -mca orte_hnp_uri 
>>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
>>>  
>>> --mca plm_base_verbose 10
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching on node 
>>> a00553.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm: executing:
>>>     orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>> tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid 2 -mca 
>>> ess_base_num_procs 3 -mca orte_hnp_uri 
>>> 2264989696.0;usock;tcp://130.226.12.194:35939;tcp6://[fe80::225:90ff:feeb:f6d5]:35904
>>>  
>>> --mca plm_base_verbose 10
>>> [a00551.science.domain:18097] [[34561,0],0] plm:tm:launch: finished 
>>> spawning orteds
>>> [a00551.science.domain:18102] mca: base: components_register: 
>>> registering framework plm components
>>> [a00551.science.domain:18102] mca: base: components_register: found 
>>> loaded component rsh
>>> [a00551.science.domain:18102] mca: base: components_register: 
>>> component rsh register function successful
>>> [a00551.science.domain:18102] mca: base: components_open: opening plm 
>>> components
>>> [a00551.science.domain:18102] mca: base: components_open: found loaded 
>>> component rsh
>>> [a00551.science.domain:18102] mca: base: components_open: component 
>>> rsh open function successful
>>> [a00551.science.domain:18102] mca:base:select: Auto-selecting plm 
>>> components
>>> [a00551.science.domain:18102] mca:base:select:(  plm) Querying 
>>> component [rsh]
>>> [a00551.science.domain:18102] [[34561,0],1] plm:rsh_lookup on agent 
>>> ssh : rsh path NULL
>>> [a00551.science.domain:18102] mca:base:select:(  plm) Query of 
>>> component [rsh] set priority to 10
>>> [a00551.science.domain:18102] mca:base:select:(  plm) Selected 
>>> component [rsh]
>>> [a00551.science.domain:18102] [[34561,0],1] bind() failed on error 
>>> Address already in use (98)
>>> [a00551.science.domain:18102] [[34561,0],1] ORTE_ERROR_LOG: Error in 
>>> file oob_usock_component.c at line 228
>>> [a00551.science.domain:18102] [[34561,0],1] plm:rsh_setup on agent ssh 
>>> : rsh path NULL
>>> [a00551.science.domain:18102] [[34561,0],1] plm:base:receive start 
>>> comm
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch from daemon [[34561,0],1]
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch from daemon [[34561,0],1] on node a00551
>>> [a00551.science.domain:18097] [[34561,0],0] RECEIVED TOPOLOGY FROM 
>>> NODE a00551
>>> [a00551.science.domain:18097] [[34561,0],0] ADDING TOPOLOGY PER USER 
>>> REQUEST TO NODE a00554.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch completed for daemon [[34561,0],1] at 
>>> contact 
>>> 2264989696.1;tcp://130.226.12.194:52354;tcp6://[fe80::225:90ff:feeb:f6d5]:60904
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch recvd 2 of 3 reported daemons
>>> [a00551.science.domain:18103] mca: base: components_register: 
>>> registering framework plm components
>>> [a00551.science.domain:18103] mca: base: components_register: found 
>>> loaded component rsh
>>> [a00551.science.domain:18103] mca: base: components_register: 
>>> component rsh register function successful
>>> [a00551.science.domain:18103] mca: base: components_open: opening plm 
>>> components
>>> [a00551.science.domain:18103] mca: base: components_open: found loaded 
>>> component rsh
>>> [a00551.science.domain:18103] mca: base: components_open: component 
>>> rsh open function successful
>>> [a00551.science.domain:18103] mca:base:select: Auto-selecting plm 
>>> components
>>> [a00551.science.domain:18103] mca:base:select:(  plm) Querying 
>>> component [rsh]
>>> [a00551.science.domain:18103] [[34561,0],2] plm:rsh_lookup on agent 
>>> ssh : rsh path NULL
>>> [a00551.science.domain:18103] mca:base:select:(  plm) Query of 
>>> component [rsh] set priority to 10
>>> [a00551.science.domain:18103] mca:base:select:(  plm) Selected 
>>> component [rsh]
>>> [a00551.science.domain:18103] [[34561,0],2] bind() failed on error 
>>> Address already in use (98)
>>> [a00551.science.domain:18103] [[34561,0],2] ORTE_ERROR_LOG: Error in 
>>> file oob_usock_component.c at line 228
>>> [a00551.science.domain:18103] [[34561,0],2] plm:rsh_setup on agent ssh 
>>> : rsh path NULL
>>> [a00551.science.domain:18103] [[34561,0],2] plm:base:receive start 
>>> comm
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch from daemon [[34561,0],2]
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch from daemon [[34561,0],2] on node a00551
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch completed for daemon [[34561,0],2] at 
>>> contact 
>>> 2264989696.2;tcp://130.226.12.194:41272;tcp6://[fe80::225:90ff:feeb:f6d5]:35343
>>> [a00551.science.domain:18097] [[34561,0],0] 
>>> plm:base:orted_report_launch recvd 3 of 3 reported daemons
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:setting topo to 
>>> that from node a00554.science.domain
>>>  Data for JOB [34561,1] offset 0
>>> 
>>>  ========================   JOB MAP   ========================
>>> 
>>>  Data for node: a00551    Num slots: 1    Max slots: 0    Num procs: 1
>>>      Process OMPI jobid: [34561,1] App: 0 Process rank: 0 Bound: 
>>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> 
>>>  Data for node: a00554.science.domain    Num slots: 1    Max slots: 0  
>>>   Num procs: 1
>>>      Process OMPI jobid: [34561,1] App: 0 Process rank: 1 Bound: 
>>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> 
>>>  Data for node: a00553.science.domain    Num slots: 1    Max slots: 0  
>>>   Num procs: 1
>>>      Process OMPI jobid: [34561,1] App: 0 Process rank: 2 Bound: 
>>> socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>> 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>> socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>> 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 
>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>> 
>>>  =============================================================
>>> [a00551.science.domain:18097] [[34561,0],0] complete_setup on job 
>>> [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch_apps for 
>>> job [34561,1]
>>> [1,0]<stdout>:a00551.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive 
>>> processing msg
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update 
>>> proc state command from [[34561,0],2]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for job [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for vpid 2 state RUNNING exit_code 0
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done 
>>> processing commands
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive 
>>> processing msg
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update 
>>> proc state command from [[34561,0],1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for job [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for vpid 1 state RUNNING exit_code 0
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done 
>>> processing commands
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch wiring up 
>>> iof for job [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:launch job 
>>> [34561,1] is not a dynamic spawn
>>> [1,2]<stdout>:a00551.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive 
>>> processing msg
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update 
>>> proc state command from [[34561,0],2]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for job [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done 
>>> processing commands
>>> [1,1]<stdout>:a00551.science.domain
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive 
>>> processing msg
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive update 
>>> proc state command from [[34561,0],1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for job [34561,1]
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive got 
>>> update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive done 
>>> processing commands
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:orted_cmd sending 
>>> orted_exit commands
>>> [a00551.science.domain:18102] [[34561,0],1] plm:base:receive stop comm
>>> [a00551.science.domain:18102] mca: base: close: component rsh closed
>>> [a00551.science.domain:18102] mca: base: close: unloading component 
>>> rsh
>>> [a00551.science.domain:18103] [[34561,0],2] plm:base:receive stop comm
>>> [a00551.science.domain:18103] mca: base: close: component rsh closed
>>> [a00551.science.domain:18103] mca: base: close: unloading component 
>>> rsh
>>> [a00551.science.domain:18097] [[34561,0],0] plm:base:receive stop comm
>>> [a00551.science.domain:18097] mca: base: close: component tm closed
>>> [a00551.science.domain:18097] mca: base: close: unloading component tm
>>> 
>>> 
>>> Best,
>>> Oswin
>>> 
>>> On 2016-09-08 10:33, Oswin Krause wrote:
>>>> Hi Gilles, Hi Ralph,
>>>> 
>>>> I have just rebuilt Open MPI; there is quite a lot more information
>>>> now. As I said, I did not tinker with the PBS_NODEFILE. I think the
>>>> issue might be NUMA here. I can try to go through the process,
>>>> reconfigure to non-NUMA, and see whether that works. The issue might
>>>> be that the node allocation looks like this:
>>>> 
>>>> a00551.science.domain-0
>>>> a00552.science.domain-0
>>>> a00551.science.domain-1
>>>> 
>>>> and the trailing -0/-1 part then gets stripped, which leads to the
>>>> issue. Not sure whether this makes sense, but this is my explanation.
>>>> 
>>>> Here is the output:
>>>> $PBS_NODEFILE
>>>> /var/lib/torque/aux//285.a00552.science.domain
>>>> PBS_NODEFILE
>>>> a00551.science.domain
>>>> a00553.science.domain
>>>> a00551.science.domain
>>>> ---------
>>>> [a00551.science.domain:16986] mca: base: components_register:
>>>> registering framework plm components
>>>> [a00551.science.domain:16986] mca: base: components_register: found
>>>> loaded component isolated
>>>> [a00551.science.domain:16986] mca: base: components_register:
>>>> component isolated has no register or open function
>>>> [a00551.science.domain:16986] mca: base: components_register: found
>>>> loaded component rsh
>>>> [a00551.science.domain:16986] mca: base: components_register:
>>>> component rsh register function successful
>>>> [a00551.science.domain:16986] mca: base: components_register: found
>>>> loaded component slurm
>>>> [a00551.science.domain:16986] mca: base: components_register:
>>>> component slurm register function successful
>>>> [a00551.science.domain:16986] mca: base: components_register: found
>>>> loaded component tm
>>>> [a00551.science.domain:16986] mca: base: components_register:
>>>> component tm register function successful
>>>> [a00551.science.domain:16986] mca: base: components_open: opening plm 
>>>> components
>>>> [a00551.science.domain:16986] mca: base: components_open: found 
>>>> loaded
>>>> component isolated
>>>> [a00551.science.domain:16986] mca: base: components_open: component
>>>> isolated open function successful
>>>> [a00551.science.domain:16986] mca: base: components_open: found 
>>>> loaded
>>>> component rsh
>>>> [a00551.science.domain:16986] mca: base: components_open: component
>>>> rsh open function successful
>>>> [a00551.science.domain:16986] mca: base: components_open: found 
>>>> loaded
>>>> component slurm
>>>> [a00551.science.domain:16986] mca: base: components_open: component
>>>> slurm open function successful
>>>> [a00551.science.domain:16986] mca: base: components_open: found 
>>>> loaded
>>>> component tm
>>>> [a00551.science.domain:16986] mca: base: components_open: component 
>>>> tm
>>>> open function successful
>>>> [a00551.science.domain:16986] mca:base:select: Auto-selecting plm 
>>>> components
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Querying
>>>> component [isolated]
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Query of
>>>> component [isolated] set priority to 0
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Querying 
>>>> component [rsh]
>>>> [a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on
>>>> agent ssh : rsh path NULL
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Query of
>>>> component [rsh] set priority to 10
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Querying 
>>>> component [slurm]
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Querying 
>>>> component [tm]
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Query of
>>>> component [tm] set priority to 75
>>>> [a00551.science.domain:16986] mca:base:select:(  plm) Selected 
>>>> component [tm]
>>>> [a00551.science.domain:16986] mca: base: close: component isolated 
>>>> closed
>>>> [a00551.science.domain:16986] mca: base: close: unloading component 
>>>> isolated
>>>> [a00551.science.domain:16986] mca: base: close: component rsh closed
>>>> [a00551.science.domain:16986] mca: base: close: unloading component 
>>>> rsh
>>>> [a00551.science.domain:16986] mca: base: close: component slurm 
>>>> closed
>>>> [a00551.science.domain:16986] mca: base: close: unloading component 
>>>> slurm
>>>> [a00551.science.domain:16986] plm:base:set_hnp_name: initial bias
>>>> 16986 nodename hash 2226275586
>>>> [a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam 
>>>> 33770
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive start 
>>>> comm
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm 
>>>> creating map
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new
>>>> daemon [[33770,0],1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
>>>> assigning new daemon [[33770,0],1] to node a00553.science.domain
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level 
>>>> argv:
>>>>     orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>>> tm
>>>> -mca ess_base_jobid 2213150720 -mca ess_base_vpid <template> -mca
>>>> ess_base_num_procs 2 -mca orte_hnp_uri
>>>> 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821
>>>>  
>>>> --mca plm_base_verbose 10
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node
>>>> a00553.science.domain
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm: executing:
>>>>     orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>>> tm
>>>> -mca ess_base_jobid 2213150720 -mca ess_base_vpid 1 -mca
>>>> ess_base_num_procs 2 -mca orte_hnp_uri
>>>> 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821
>>>>  
>>>> --mca plm_base_verbose 10
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:tm:launch: finished
>>>> spawning orteds
>>>> [a00551.science.domain:16986] [[33770,0],0]
>>>> plm:base:orted_report_launch from daemon [[33770,0],1]
>>>> [a00551.science.domain:16986] [[33770,0],0]
>>>> plm:base:orted_report_launch from daemon [[33770,0],1] on node a00551
>>>> [a00551.science.domain:16986] [[33770,0],0] RECEIVED TOPOLOGY FROM 
>>>> NODE a00551
>>>> [a00551.science.domain:16986] [[33770,0],0] ADDING TOPOLOGY PER USER
>>>> REQUEST TO NODE a00553.science.domain
>>>> [a00551.science.domain:16986] [[33770,0],0]
>>>> plm:base:orted_report_launch completed for daemon [[33770,0],1] at
>>>> contact
>>>> 2213150720.1;tcp://130.226.12.194:38025;tcp6://[fe80::225:90ff:feeb:f6d5]:39080
>>>>  
>>>> [a00551.science.domain:16986] [[33770,0],0]
>>>> plm:base:orted_report_launch recvd 2 of 2 reported daemons
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:setting topo to
>>>> that from node a00553.science.domain
>>>>  Data for JOB [33770,1] offset 0
>>>> 
>>>>  ========================   JOB MAP   ========================
>>>> 
>>>>  Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 
>>>> 2
>>>>      Process OMPI jobid: [33770,1] App: 0 Process rank: 0 Bound: 
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>      Process OMPI jobid: [33770,1] App: 0 Process rank: 1 Bound: 
>>>> socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> 
>>>>  Data for node: a00553.science.domain    Num slots: 1    Max slots: 0 
>>>>    Num procs: 1
>>>>      Process OMPI jobid: [33770,1] App: 0 Process rank: 2 Bound: 
>>>> socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> 
>>>>  =============================================================
>>>> [a00551.science.domain:16986] [[33770,0],0] complete_setup on job 
>>>> [33770,1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch_apps for
>>>> job [33770,1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive 
>>>> processing msg
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive update
>>>> proc state command from [[33770,0],1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
>>>> update_proc_state for job [33770,1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
>>>> update_proc_state for vpid 2 state RUNNING exit_code 0
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive done
>>>> processing commands
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive 
>>>> processing msg
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive update
>>>> proc state command from [[33770,0],1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
>>>> update_proc_state for job [33770,1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive got
>>>> update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive done
>>>> processing commands
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch wiring up
>>>> iof for job [33770,1]
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:launch job
>>>> [33770,1] is not a dynamic spawn
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:orted_cmd 
>>>> sending
>>>> orted_exit commands
>>>> [a00551.science.domain:16986] [[33770,0],0] plm:base:receive stop 
>>>> comm
>>>> [a00551.science.domain:16986] mca: base: close: component tm closed
>>>> [a00551.science.domain:16986] mca: base: close: unloading component 
>>>> tm
>>>> 
>>>> 
>>>> 
>>>> On 2016-09-08 10:18, Gilles Gouaillardet wrote:
>>>>> Ralph,
>>>>> 
>>>>> 
>>>>> I am not sure I am reading you correctly, so let me clarify.
>>>>> 
>>>>> 
>>>>> I did not hack $PBS_NODEFILE for fun or profit; I was simply trying
>>>>> to reproduce an issue I could not reproduce otherwise.
>>>>> 
>>>>> /* my job submitted with -l nodes=3:ppn=1 does not start if there
>>>>> are only two nodes available, whereas the same user job does start
>>>>> on two nodes */
>>>>> 
>>>>> Thanks for the explanation of the Torque internals; I acknowledge
>>>>> that my hack was incomplete and not a valid one.
>>>>> 
>>>>> 
>>>>> I re-read the email that started this thread and found the
>>>>> information I was looking for:
>>>>> 
>>>>> 
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>> 
>>>>> 
>>>>> So, assuming the end user did not edit his $PBS_NODEFILE, and Torque
>>>>> is correctly configured and not busted, then
>>>>> 
>>>>>> Torque indeed always provides an ordered file - the only way you 
>>>>>> can get an unordered one is for someone to edit it
>>>>> might be updated to
>>>>> 
>>>>> "Torque used to always provide an ordered file, but recent versions
>>>>> might not do that."
>>>>> 
>>>>> 
>>>>> Makes sense?
>>>>> 
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> 
>>>>> On 9/8/2016 4:57 PM, r...@open-mpi.org wrote:
>>>>>> Someone has done some work there since I last did, but I can see 
>>>>>> the issue. Torque indeed always provides an ordered file - the only 
>>>>>> way you can get an unordered one is for someone to edit it, and 
>>>>>> that is forbidden - i.e., you get what you deserve because you are 
>>>>>> messing around with a system-defined file :-)
>>>>>> 
>>>>>> The problem is that Torque internally assigns a “launch ID” which 
>>>>>> is just the integer position of the nodename in the PBS_NODEFILE. 
>>>>>> So if you modify that position, then we get the wrong index - and 
>>>>>> everything goes down the drain from there. In your example, 
>>>>>> n1.cluster changed index from 3 to 2 because of your edit. Torque 
>>>>>> thinks that index 2 is just another reference to n0.cluster, and so 
>>>>>> we merrily launch a daemon onto the wrong node.
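>>>>>> 
>>>>>> To see that mapping concretely (illustration only, not actual
>>>>>> Torque code), you can print each nodefile line next to its
>>>>>> position:
>>>>>> 
>>>>>> nl -ba $PBS_NODEFILE
>>>>>> 
>>>>>> Reorder the lines and the same hostname shows up at a different
>>>>>> index, which is exactly what happened here.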
>>>>>> 
>>>>>> They have a good reason for doing things this way. It allows you to 
>>>>>> launch a process against each launch ID, and the pattern will 
>>>>>> reflect the original qsub request in what we would call a map-by 
>>>>>> slot round-robin mode. This maximizes the use of shared memory, and 
>>>>>> is expected to provide good performance for a range of apps.
>>>>>> 
>>>>>> Lesson to be learned: never, ever muddle around with a 
>>>>>> system-generated file. If you want to modify where things go, then 
>>>>>> use one or more of the mpirun options to do so. We give you lots 
>>>>>> and lots of knobs for just that reason.
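>>>>>> 
>>>>>> For example (illustration only, one knob among many): if you want
>>>>>> ranks spread across nodes rather than packed by slot, something
>>>>>> like
>>>>>> 
>>>>>> mpirun --map-by node ./a.out
>>>>>> 
>>>>>> does that without ever touching the system-generated file.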
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet 
>>>>>>> <gil...@rist.or.jp> wrote:
>>>>>>> 
>>>>>>> Ralph,
>>>>>>> 
>>>>>>> 
>>>>>>> There might be an issue within Open MPI.
>>>>>>> 
>>>>>>> 
>>>>>>> On the cluster I used, hostname returns the FQDN, and
>>>>>>> $PBS_NODEFILE uses the FQDN too.
>>>>>>> 
>>>>>>> My $PBS_NODEFILE has one line per task, and it is ordered,
>>>>>>> 
>>>>>>> e.g.
>>>>>>> 
>>>>>>> n0.cluster
>>>>>>> 
>>>>>>> n0.cluster
>>>>>>> 
>>>>>>> n1.cluster
>>>>>>> 
>>>>>>> n1.cluster
>>>>>>> 
>>>>>>> 
>>>>>>> In my Torque script, I rewrote the machinefile like this:
>>>>>>> 
>>>>>>> n0.cluster
>>>>>>> 
>>>>>>> n1.cluster
>>>>>>> 
>>>>>>> n0.cluster
>>>>>>> 
>>>>>>> n1.cluster
>>>>>>> 
>>>>>>> and updated the PBS_NODEFILE environment variable to point to my new file.
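>>>>>>> 
>>>>>>> Concretely, the hack was along these lines (a sketch; the copy
>>>>>>> path is arbitrary):
>>>>>>> 
>>>>>>> cp $PBS_NODEFILE /tmp/my_nodefile
>>>>>>> # reorder the lines by hand: n0 n1 n0 n1 instead of n0 n0 n1 n1
>>>>>>> export PBS_NODEFILE=/tmp/my_nodefile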
>>>>>>> 
>>>>>>> 
>>>>>>> Then I invoked
>>>>>>> 
>>>>>>> mpirun hostname
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> In the first case, 2 tasks run on n0 and 2 tasks run on n1;
>>>>>>> in the second case, 4 tasks run on n0 and none on n1.
>>>>>>> 
>>>>>>> So I am thinking we might not support an unordered $PBS_NODEFILE.
>>>>>>> 
>>>>>>> As a reminder, the submit command was
>>>>>>> qsub -l nodes=3:ppn=1
>>>>>>> but for reasons unknown to me, only two nodes were allocated (two
>>>>>>> slots on the first one, one on the second one), and if I
>>>>>>> understand correctly, $PBS_NODEFILE was not ordered
>>>>>>> (e.g. n0 n1 n0 and *not* n0 n0 n1).
>>>>>>> 
>>>>>>> I tried to reproduce this without hacking $PBS_NODEFILE, but my
>>>>>>> jobs hang in the queue if only two nodes with 16 slots each are
>>>>>>> available and I request
>>>>>>> -l nodes=3:ppn=1
>>>>>>> I guess this is a different scheduler configuration, and I cannot
>>>>>>> change that.
>>>>>>> 
>>>>>>> Could you please have a look at this?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Gilles
>>>>>>> 
>>>>>>> On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
>>>>>>>> The usual cause of this problem is that the nodename in the 
>>>>>>>> machinefile is given as a00551, while Torque is assigning the 
>>>>>>>> node name as a00551.science.domain. Thus, mpirun thinks those are 
>>>>>>>> two separate nodes and winds up spawning an orted on its own 
>>>>>>>> node.
>>>>>>>> 
>>>>>>>> You might try ensuring that your machinefile uses the exact
>>>>>>>> same names as provided in your allocation.
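>>>>>>>> 
>>>>>>>> A quick way to compare the two (just a sanity check):
>>>>>>>> 
>>>>>>>> sort -u $PBS_NODEFILE
>>>>>>>> hostname; hostname -f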
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet 
>>>>>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Thanks for the logs
>>>>>>>>> 
>>>>>>>>> From what I see now, it looks like a00551 is running both
>>>>>>>>> mpirun and orted, though it should only run mpirun, and orted
>>>>>>>>> should run only on a00553
>>>>>>>>> 
>>>>>>>>> I will check the code and see what could be happening here
>>>>>>>>> 
>>>>>>>>> Btw, what is the output of
>>>>>>>>> hostname
>>>>>>>>> hostname -f
>>>>>>>>> on a00551?
>>>>>>>>> 
>>>>>>>>> Out of curiosity, is a previous version of Open MPI (e.g.
>>>>>>>>> v1.10.4) installed and running correctly on your cluster?
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Gilles
>>>>>>>>> 
>>>>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>>>>>> Hi Gilles,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the hint with the machinefile. I know it is not
>>>>>>>>>> equivalent, and I do not intend to use that approach. I just
>>>>>>>>>> wanted to know whether I could start the program successfully
>>>>>>>>>> at all.
>>>>>>>>>> 
>>>>>>>>>> Outside Torque (4.2), rsh seems to be used, which works fine,
>>>>>>>>>> prompting for a password if no Kerberos ticket is present.
>>>>>>>>>> 
>>>>>>>>>> Here is the output:
>>>>>>>>>> [zbh251@a00551 ~]$ mpirun -V
>>>>>>>>>> mpirun (Open MPI) 2.0.1
>>>>>>>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>>>>>>>                  MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, 
>>>>>>>>>> Component
>>>>>>>>>> v2.0.1)
>>>>>>>>>>                  MCA ras: simulator (MCA v2.1.0, API v2.0.0, 
>>>>>>>>>> Component
>>>>>>>>>> v2.0.1)
>>>>>>>>>>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, 
>>>>>>>>>> Component
>>>>>>>>>> v2.0.1)
>>>>>>>>>>                  MCA ras: tm (MCA v2.1.0, API v2.0.0, Component 
>>>>>>>>>> v2.0.1)
>>>>>>>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 
>>>>>>>>>> --tag-output
>>>>>>>>>> -display-map hostname
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register:
>>>>>>>>>> registering framework plm components
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> found
>>>>>>>>>> loaded component isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> component
>>>>>>>>>> isolated has no register or open function
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> found
>>>>>>>>>> loaded component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> component
>>>>>>>>>> rsh register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> found
>>>>>>>>>> loaded component slurm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> component
>>>>>>>>>> slurm register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> found
>>>>>>>>>> loaded component tm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_register: 
>>>>>>>>>> component
>>>>>>>>>> tm register function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: 
>>>>>>>>>> opening plm
>>>>>>>>>> components
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found 
>>>>>>>>>> loaded
>>>>>>>>>> component isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: 
>>>>>>>>>> component
>>>>>>>>>> isolated open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found 
>>>>>>>>>> loaded
>>>>>>>>>> component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: 
>>>>>>>>>> component rsh
>>>>>>>>>> open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found 
>>>>>>>>>> loaded
>>>>>>>>>> component slurm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: 
>>>>>>>>>> component
>>>>>>>>>> slurm open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: found 
>>>>>>>>>> loaded
>>>>>>>>>> component tm
>>>>>>>>>> [a00551.science.domain:04104] mca: base: components_open: 
>>>>>>>>>> component tm
>>>>>>>>>> open function successful
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting 
>>>>>>>>>> plm
>>>>>>>>>> components
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying 
>>>>>>>>>> component
>>>>>>>>>> [isolated]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of 
>>>>>>>>>> component
>>>>>>>>>> [isolated] set priority to 0
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying 
>>>>>>>>>> component
>>>>>>>>>> [rsh]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of 
>>>>>>>>>> component
>>>>>>>>>> [rsh] set priority to 10
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying 
>>>>>>>>>> component
>>>>>>>>>> [slurm]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying 
>>>>>>>>>> component
>>>>>>>>>> [tm]
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of 
>>>>>>>>>> component
>>>>>>>>>> [tm] set priority to 75
>>>>>>>>>> [a00551.science.domain:04104] mca:base:select:( plm) Selected 
>>>>>>>>>> component
>>>>>>>>>> [tm]
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component 
>>>>>>>>>> isolated
>>>>>>>>>> closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading 
>>>>>>>>>> component
>>>>>>>>>> isolated
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component rsh 
>>>>>>>>>> closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading 
>>>>>>>>>> component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component slurm 
>>>>>>>>>> closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading 
>>>>>>>>>> component
>>>>>>>>>> slurm
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register:
>>>>>>>>>> registering framework plm components
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register: 
>>>>>>>>>> found
>>>>>>>>>> loaded component rsh
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_register: 
>>>>>>>>>> component
>>>>>>>>>> rsh register function successful
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: 
>>>>>>>>>> opening plm
>>>>>>>>>> components
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: found 
>>>>>>>>>> loaded
>>>>>>>>>> component rsh
>>>>>>>>>> [a00551.science.domain:04109] mca: base: components_open: 
>>>>>>>>>> component rsh
>>>>>>>>>> open function successful
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting 
>>>>>>>>>> plm
>>>>>>>>>> components
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Querying 
>>>>>>>>>> component
>>>>>>>>>> [rsh]
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Query of 
>>>>>>>>>> component
>>>>>>>>>> [rsh] set priority to 10
>>>>>>>>>> [a00551.science.domain:04109] mca:base:select:( plm) Selected 
>>>>>>>>>> component
>>>>>>>>>> [rsh]
>>>>>>>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on 
>>>>>>>>>> error
>>>>>>>>>> Address already in use (98)
>>>>>>>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: 
>>>>>>>>>> Error in
>>>>>>>>>> file oob_usock_component.c at line 228
>>>>>>>>>> Data for JOB [53688,1] offset 0
>>>>>>>>>> 
>>>>>>>>>> ========================   JOB MAP ========================
>>>>>>>>>> 
>>>>>>>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num 
>>>>>>>>>> procs: 2
>>>>>>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: 
>>>>>>>>>> socket
>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>> 2[hwt
>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>> socket
>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>> 7[hwt
>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] 
>>>>>>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: 
>>>>>>>>>> socket
>>>>>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 
>>>>>>>>>> 12[hwt
>>>>>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], 
>>>>>>>>>> socket
>>>>>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 
>>>>>>>>>> 17[hwt
>>>>>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB] 
>>>>>>>>>> Data for node: a00553.science.domain    Num slots: 1    Max 
>>>>>>>>>> slots: 0    Num
>>>>>>>>>> procs: 1
>>>>>>>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: 
>>>>>>>>>> socket
>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>> 2[hwt
>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>> socket
>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>> 7[hwt
>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..] 
>>>>>>>>>> =============================================================
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on 
>>>>>>>>>> job
>>>>>>>>>> [53688,1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive 
>>>>>>>>>> update proc
>>>>>>>>>> state command from [[53688,0],1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive 
>>>>>>>>>> got
>>>>>>>>>> update_proc_state for job [53688,1]
>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive 
>>>>>>>>>> update proc
>>>>>>>>>> state command from [[53688,0],1]
>>>>>>>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive 
>>>>>>>>>> got
>>>>>>>>>> update_proc_state for job [53688,1]
>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>> [a00551.science.domain:04109] mca: base: close: component rsh 
>>>>>>>>>> closed
>>>>>>>>>> [a00551.science.domain:04109] mca: base: close: unloading 
>>>>>>>>>> component rsh
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: component tm 
>>>>>>>>>> closed
>>>>>>>>>> [a00551.science.domain:04104] mca: base: close: unloading 
>>>>>>>>>> component tm
>>>>>>>>>> 
>>>>>>>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> Which version of Open MPI are you running ?
>>>>>>>>>>> 
>>>>>>>>>>> I noted that although you are asking for three nodes and one
>>>>>>>>>>> task per node, you have been allocated only two nodes.
>>>>>>>>>>> I do not know whether this is related to the issue.
>>>>>>>>>>> 
>>>>>>>>>>> Note that if you use the machinefile, a00551 has two slots
>>>>>>>>>>> (since it appears twice in the machinefile), but a00553 has 20
>>>>>>>>>>> slots (since it appears only once, the number of slots is
>>>>>>>>>>> automatically detected).
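>>>>>>>>>>> 
>>>>>>>>>>> If you want explicit slot counts, you can state them in the
>>>>>>>>>>> hostfile instead, e.g.
>>>>>>>>>>> 
>>>>>>>>>>> a00551.science.domain slots=2
>>>>>>>>>>> a00553.science.domain slots=1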
>>>>>>>>>>> 
>>>>>>>>>>> Can you run
>>>>>>>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>>>>>>>> so we can confirm tm is used?
>>>>>>>>>>> 
>>>>>>>>>>> Before invoking mpirun, you might want to clean up the ompi
>>>>>>>>>>> directory in /tmp.
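>>>>>>>>>>> 
>>>>>>>>>>> For example, something along these lines; the exact session
>>>>>>>>>>> directory name is an assumption here, so list it first:
>>>>>>>>>>> 
>>>>>>>>>>> ls -d /tmp/openmpi-sessions-*
>>>>>>>>>>> rm -rf /tmp/openmpi-sessions-$USER*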
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> 
>>>>>>>>>>> Gilles
>>>>>>>>>>> 
>>>>>>>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I am currently trying to set up Open MPI in Torque. Open MPI
>>>>>>>>>>>> is built with tm support. Torque is correctly assigning nodes,
>>>>>>>>>>>> and I can run MPI programs on single nodes just fine. The
>>>>>>>>>>>> problem starts when processes are split between nodes.
>>>>>>>>>>>> 
>>>>>>>>>>>> For example, I create an interactive session with Torque and
>>>>>>>>>>>> start a program by
>>>>>>>>>>>> 
>>>>>>>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>>>>>>>> mpirun --tag-output -display-map hostname
>>>>>>>>>>>> 
>>>>>>>>>>>> which leads to
>>>>>>>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on 
>>>>>>>>>>>> error
>>>>>>>>>>>> Address already in use (98)
>>>>>>>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: 
>>>>>>>>>>>> Error in
>>>>>>>>>>>> file oob_usock_component.c at line 228
>>>>>>>>>>>> Data for JOB [65415,1] offset 0
>>>>>>>>>>>> 
>>>>>>>>>>>> ========================   JOB MAP ========================
>>>>>>>>>>>> 
>>>>>>>>>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num 
>>>>>>>>>>>> procs: 2
>>>>>>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 2[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 7[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>  
>>>>>>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 
>>>>>>>>>>>> 1[core 12[hwt
>>>>>>>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>>>>>>>>>>>> 0-1]], socket
>>>>>>>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 
>>>>>>>>>>>> 1[core 17[hwt
>>>>>>>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>>>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>>>>>>  
>>>>>>>>>>>> Data for node: a00553.science.domain    Num slots: 1    Max 
>>>>>>>>>>>> slots: 0    Num
>>>>>>>>>>>> procs: 1
>>>>>>>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 2[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 7[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>  
>>>>>>>>>>>> =============================================================
>>>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> If I log in on a00551 and launch using the hostfile given by
>>>>>>>>>>>> $PBS_NODEFILE, everything works:
>>>>>>>>>>>> 
>>>>>>>>>>>> (from within the interactive session)
>>>>>>>>>>>> echo $PBS_NODEFILE
>>>>>>>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>>>>>>>> cat $PBS_NODEFILE
>>>>>>>>>>>> a00551.science.domain
>>>>>>>>>>>> a00553.science.domain
>>>>>>>>>>>> a00551.science.domain
>>>>>>>>>>>> 
>>>>>>>>>>>> (from within the separate login)
>>>>>>>>>>>> mpirun --hostfile 
>>>>>>>>>>>> /var/lib/torque/aux//278.a00552.science.domain -np 3
>>>>>>>>>>>> --tag-output -display-map hostname
>>>>>>>>>>>> 
>>>>>>>>>>>> Data for JOB [65445,1] offset 0
>>>>>>>>>>>> 
>>>>>>>>>>>> ========================   JOB MAP ========================
>>>>>>>>>>>> 
>>>>>>>>>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num 
>>>>>>>>>>>> procs: 2
>>>>>>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 2[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 7[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>  
>>>>>>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 
>>>>>>>>>>>> 1[core 12[hwt
>>>>>>>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>>>>>>>>>>>> 0-1]], socket
>>>>>>>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 
>>>>>>>>>>>> 1[core 17[hwt
>>>>>>>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>>>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>>>>>>>  
>>>>>>>>>>>> Data for node: a00553.science.domain    Num slots: 20    Max 
>>>>>>>>>>>> slots: 0    Num
>>>>>>>>>>>> procs: 1
>>>>>>>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 
>>>>>>>>>>>> Bound: socket
>>>>>>>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 2[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], 
>>>>>>>>>>>> socket
>>>>>>>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>>>>>>>>>>>> 7[hwt
>>>>>>>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>>>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>>>>>>>  
>>>>>>>>>>>> =============================================================
>>>>>>>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>>>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>>>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>>>>>>> 
>>>>>>>>>>>> I am kind of lost as to what is going on here. Does anyone
>>>>>>>>>>>> have an idea? I seriously suspect the Kerberos authentication
>>>>>>>>>>>> we have to work with is the problem, but I fail to see how it
>>>>>>>>>>>> should affect the sockets.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
