Hi Ralph,
I checked -cpus-per-proc in openmpi-1.7.4a1r29646. It works just as I
want: it adjusts the number of procs on each node by dividing the slot
count by the number of threads. I think my problem is solved now, using
-cpus-per-proc - thank you very much.

Regarding the oversubscription problem, I checked that NPROCS was
really 8 by printing it out.

SCRIPT:
echo mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog

OUTPUT:
mpirun -machinefile pbs_hosts -np 8 -report-bindings -bind-to core Myprog
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

By the way, how did you verify the problem? It looks to me as if you
ran the job directly from the command line:

[rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname

In my environment, such a direct run without a Torque script also works
fine. Anyway, as I already told you, my problem itself is solved, so I
think the priority of checking this is very low.

tmishima

> FWIW: I verified that this works fine under a slurm allocation of 2
> nodes, each with 12 slots. I filled the node without getting an
> "oversubscribed" error message:
>
> [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
> [bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
> [bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
> [bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
> bend001
> bend001
> bend001
>
> where
>
> [rhc@bend001 svn-trunk]$ cat hosts
> bend001 slots=12
>
> The only way I get the "out of resources" error is if I ask for more
> processes than I have slots - i.e., I give it the hosts file as shown,
> but ask for 13 or more processes.
>
> BTW: note one important issue with cpus-per-proc, as shown above.
> Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one
> of my procs wound up being split across the two sockets (2 cores on
> each). That's about the worst situation you can have.
>
> So a word of caution: it is up to the user to ensure that the mapping
> is "good". We just do what you asked us to do.
>
> On Nov 13, 2013, at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Guess I don't see why modifying the allocation is required - we have
> mapping options that should support such things. If you specify the
> total number of procs you want, and cpus-per-proc=4, it should do the
> same thing, I would think. You'd get 2 procs on the 8-slot nodes, 8 on
> the 32-slot nodes, and up to 6 on the 64-slot nodes (since you
> specified np=16). So I guess I don't understand the issue.
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not
> just assuming wc -l got that value), then it shouldn't think it is
> oversubscribed. I'll take a look under a slurm allocation, as that is
> all I can access.
>
> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Our cluster consists of three types of nodes. They have 8, 32 and 64
> slots respectively. Since the performance of each core is almost the
> same, mixed use of these nodes is possible.
>
> Furthermore, in this case, for a hybrid application with
> openmpi+openmp, modification of the hostfile is necessary, as follows:
>
> #PBS -l nodes=1:ppn=32+4:ppn=8
> export OMP_NUM_THREADS=4
> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>
> That's why I want to do that. Of course I know that if I give up mixed
> use, -npernode is better for this purpose. (The script I showed you
> first is just a simplified one, to clarify the problem.)
>
> tmishima
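(A minimal sketch of what that "modify" step might look like - "modify"
is the poster's own helper, so this one-liner is purely illustrative;
it assumes $PBS_NODEFILE lists one line per slot and that
OMP_NUM_THREADS divides each node's slot count evenly:)

# keep every OMP_NUM_THREADS-th line per node, e.g. 64 nodefile lines -> 16
awk -v n="$OMP_NUM_THREADS" 'count[$1]++ % n == 0' $PBS_NODEFILE > pbs_hosts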
> Why do it the hard way? I'll look at the FAQ, because that definitely
> isn't a recommended thing to do - better to use -host to specify the
> subset, or just specify the desired mapping using all the various
> mappers we provide.
>
> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Sorry for the cross-post. The nodefile is very simple and consists of
> 8 lines:
>
> node08
> node08
> node08
> node08
> node08
> node08
> node08
> node08
>
> Therefore, NPROCS=8.
>
> My aim is to modify the allocation, as you pointed out. According to
> the Open MPI FAQ, a proper subset of the hosts allocated to the
> Torque / PBS Pro job should be allowed.
>
> tmishima
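(For reference, a sketch of the kind of subsetting being discussed -
hypothetical numbers, not from the thread; it hands mpirun only four of
the eight allocated slots:)

head -4 $PBS_NODEFILE > pbs_hosts   # keep a proper subset of the allocation
mpirun -machinefile pbs_hosts -np 4 Myprog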
> Please - can you answer my question on script2? What is the value of
> NPROCS?
>
> Why would you want to do it this way? Are you planning to modify the
> allocation?? That generally is a bad idea, as it can confuse the
> system.
>
> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Since what I really want is to run script2 correctly, please let us
> concentrate on script2.
>
> I'm not an expert on the internals of openmpi; what I can do is just
> observation from the outside. I suspect these lines are strange,
> especially the last one:
>
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> These lines come from this part of orte_rmaps_base_get_target_nodes
> in rmaps_base_support_fns.c:
>
> } else if (node->slots <= node->slots_inuse &&
>            (ORTE_MAPPING_NO_OVERSUBSCRIBE &
>             ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>     /* remove the node as fully used */
>     OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>                          "%s Removing node %s slots %d inuse %d",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                          node->name, node->slots, node->slots_inuse));
>     opal_list_remove_item(allocated_nodes, item);
>     OBJ_RELEASE(item);  /* "un-retain" it */
>
> I wonder why node->slots and node->slots_inuse are both 0, which I can
> read from the line "Removing node node08 slots 0 inuse 0" above.
>
> Or, I'm not sure, but should
> "else if (node->slots <= node->slots_inuse &&" be
> "else if (node->slots < node->slots_inuse &&" ?
>
> tmishima
>
> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Yes, node08 has 8 slots, but the number of processes I run is also 8:
> >
> > #PBS -l nodes=node08:ppn=8
> >
> > Therefore, I think it should allow this allocation. Is that right?
>
> Correct
>
> > My question is why script1 works and script2 does not. They are
> > almost the same:
> >
> > #PBS -l nodes=node08:ppn=8
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> >
> > #SCRIPT1
> > mpirun -report-bindings -bind-to core Myprog
> >
> > #SCRIPT2
> > mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> This version is not only reading the PBS allocation, but also invoking
> the hostfile filter on top of it. Different code path. I'll take a
> look - it should still match up, assuming NPROCS=8. Any possibility
> that it is a different number? I don't recall, but aren't there some
> extra lines in the nodefile - e.g., comments?
>
> > tmishima
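(A quick way to test the "extra lines" theory raised above -
illustrative only; the grep counts the lines that are neither comments
nor blank, which is the number -np should receive:)

echo "NPROCS=$NPROCS"                  # what wc -l reported
cat -A pbs_hosts                       # make any hidden characters visible
grep -vc -e '^#' -e '^$' pbs_hosts     # host lines excluding comments/blanks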
> I guess here's my confusion. If you are using only one node, and that
> node has 8 allocated slots, then we will not allow you to run more
> than 8 processes on that node unless you specifically provide the
> --oversubscribe flag. This is because you are operating in a managed
> environment (in this case, under Torque), and so we treat the
> allocation as "mandatory" by default.
>
> I suspect that is the issue here, in which case the system is behaving
> as it should.
>
> Is the above accurate?
>
> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> It has nothing to do with LAMA, as you aren't using that mapper.
>
> How many nodes are in this allocation?
>
> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Hi Ralph, here is some additional information.
>
> This is the main part of the output after adding "-mca
> rmaps_base_verbose 50":
>
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> From this result, I guess it is related to oversubscription. So I
> added "-oversubscribe" and reran; then it worked well, as shown below:
>
> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> [node08.cluster:27019]     node: node08 daemon: 0
> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>
> I think something is wrong with the treatment of oversubscription,
> which might be related to "#3893: LAMA mapper has problems".
>
> tmishima
>
> Hmmm...looks like we aren't getting your allocation. Can you rerun and
> add -mca ras_base_verbose 50?
>
> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Hi Ralph,
>
> Here is the output of "-mca plm_base_verbose 5":
>
> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> Here, openmpi's configuration is as follows:
>
> ./configure \
> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> --with-tm \
> --with-verbs \
> --disable-ipv6 \
> --disable-vt \
> --enable-debug \
> CC=pgcc CFLAGS="-tp k8-64e" \
> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> F77=pgfortran FFLAGS="-tp k8-64e" \
> FC=pgfortran FCFLAGS="-tp k8-64e"
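(For anyone reproducing this: the three verbosity levels used at
different points in this thread can be combined in a single run - a
sketch using the poster's file and program names:)

mpirun -machinefile pbs_hosts -np $NPROCS \
       -mca plm_base_verbose 5 -mca ras_base_verbose 50 -mca rmaps_base_verbose 50 \
       -report-bindings -bind-to core Myprog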
> > > > > > -------------------------------------------------------------------------- > > Here, openmpi's configuration is as follows: > > ./configure \ > --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \ > --with-tm \ > --with-verbs \ > --disable-ipv6 \ > --disable-vt \ > --enable-debug \ > CC=pgcc CFLAGS="-tp k8-64e" \ > CXX=pgCC CXXFLAGS="-tp k8-64e" \ > F77=pgfortran FFLAGS="-tp k8-64e" \ > FC=pgfortran FCFLAGS="-tp k8-64e" > > Hi Ralph, > > Okey, I can help you. Please give me some time to report the > output. > > Tetsuya Mishima > > I can try, but I have no way of testing Torque any more - so > all > I > can > do > is a code review. If you can build --enable-debug and add > -mca > plm_base_verbose 5 to your cmd line, I'd appreciate seeing > the > output. > > > On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp > wrote: > > > > Hi Ralph, > > Thank you for your quick response. > > I'd like to report one more regressive issue about Torque > support > of > openmpi-1.7.4a1r29646, which might be related to "#3893: > LAMA > mapper > has problems" I reported a few days ago. > > The script below does not work with openmpi-1.7.4a1r29646, > although it worked with openmpi-1.7.3 as I told you before. > > #!/bin/sh > #PBS -l nodes=node08:ppn=8 > export OMP_NUM_THREADS=1 > cd $PBS_O_WORKDIR > cp $PBS_NODEFILE pbs_hosts > NPROCS=`wc -l < pbs_hosts` > mpirun -machinefile pbs_hosts -np ${NPROCS} > -report-bindings > -bind-to > core > Myprog > > If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it > works > fine. > Since this happens without lama request, I guess it's not > the > problem > in lama itself. Anyway, please look into this issue as > well. > > Regards, > Tetsuya Mishima > > Done - thanks! > > On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp > wrote: > > > > Dear openmpi developers, > > I got a segmentation fault in traial use of > openmpi-1.7.4a1r29646 > built > by > PGI13.10 as shown below: > > [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 > -cpus-per-proc > 2 > -report-bindings mPre > [manage.cluster:23082] MCW rank 2 bound to socket 0[core > 4 > [hwt > 0]], > socket > 0[core 5[hwt 0]]: [././././B/B][./././././.] > [manage.cluster:23082] MCW rank 3 bound to socket 1[core > 6 > [hwt > 0]], > socket > 1[core 7[hwt 0]]: [./././././.][B/B/./././.] > [manage.cluster:23082] MCW rank 0 bound to socket 0[core > 0 > [hwt > 0]], > socket > 0[core 1[hwt 0]]: [B/B/./././.][./././././.] > [manage.cluster:23082] MCW rank 1 bound to socket 0[core > 2 > [hwt > 0]], > socket > 0[core 3[hwt 0]]: [././B/B/./.][./././././.] > [manage:23082] *** Process received signal *** > [manage:23082] Signal: Segmentation fault (11) > [manage:23082] Signal code: Address not mapped (1) > [manage:23082] Failing at address: 0x34 > [manage:23082] *** End of error message *** > Segmentation fault (core dumped) > > [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun > core.23082 > GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1) > Copyright (C) 2009 Free Software Foundation, Inc. > ... > Core was generated by `mpirun -np 4 -cpus-per-proc 2 > -report-bindings > mPre'. > Program terminated with signal 11, Segmentation fault. 
> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> (gdb) where
> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> #1 0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> #2 0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> #3 0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> #4 0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> #5 0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> #6 0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> (gdb) quit
>
> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary - it
> takes the address of a field of peer just after the test has
> established that peer is NULL - and it causes the segfault. (The same
> assignment is done again at line 635, once peer has been allocated.)
>
> 624     /* lookup the corresponding process */
> 625     peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> 626     if (NULL == peer) {
> 627         ui64 = (uint64_t*)(&peer->name);
> 628         opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> 629                             "%s mca_oob_tcp_recv_connect: connection from new peer",
> 630                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
> 632         peer->mod = mod;
> 633         peer->name = hdr->origin;
> 634         peer->state = MCA_OOB_TCP_ACCEPTING;
> 635         ui64 = (uint64_t*)(&peer->name);
> 636         if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> 637             OBJ_RELEASE(peer);
> 638             return;
> 639         }
>
> Please fix this mistake in the next release.
> Regards,
> Tetsuya Mishima