It has nothing to do with LAMA as you aren't using that mapper. How many nodes are in this allocation?
On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph, here is some additional information.
>
> This is the main part of the output after adding "-mca rmaps_base_verbose 50":
>
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> From this result, I guess it's related to oversubscription.
> So I added "-oversubscribe" and reran; it then worked well, as shown below:
>
> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> [node08.cluster:27019]     node: node08 daemon: 0
> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>
> I think something is wrong with the treatment of oversubscription, which
> might be related to "#3893: LAMA mapper has problems".
>
> tmishima
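The verbose output above traces a two-pass "map by slot" scheme: a first pass that places processes only up to the declared slot count, and a second pass that runs only when oversubscription is permitted. A minimal sketch of that shape, in illustrative C only (map_by_slot and struct node are inventions for this sketch, not Open MPI's actual rmaps_rr code; the node name and counts echo the log):

/* Two-pass map-by-slot sketch: first pass respects declared slots,
 * second pass runs only when oversubscription is allowed. */
#include <stdio.h>

struct node { const char *name; int slots; int nprocs; };

static int map_by_slot(struct node *n, int num_procs, int allow_oversub)
{
    int rank = 0;

    /* first pass: place procs only up to the declared slot count */
    while (rank < num_procs && n->nprocs < n->slots) {
        printf("assigning rank %d to node %s\n", rank, n->name);
        n->nprocs++;
        rank++;
    }
    if (rank == num_procs) {
        return 0;
    }

    /* node is full; without -oversubscribe the mapper gives up here,
     * matching the "already filled" error seen in this thread */
    if (!allow_oversub) {
        return -1;
    }

    /* second pass: oversubscribe the node with the remaining procs */
    printf("node %s is full - performing second pass\n", n->name);
    while (rank < num_procs) {
        printf("assigning rank %d to node %s\n", rank, n->name);
        n->nprocs++;
        rank++;
    }
    return 0;
}

int main(void)
{
    struct node n = { "node08", 1, 0 };        /* slots 1, as in the log */
    return map_by_slot(&n, 8, 1) == 0 ? 0 : 1; /* num_procs 8 -> second pass */
}

Note that in the failing run the node was reported with "slots 0", so without -oversubscribe not even rank 0 could be placed; the underlying question is why the Torque allocation was read as zero slots at all.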
>
>> Hmmm...looks like we aren't getting your allocation. Can you rerun and
>> add -mca ras_base_verbose 50?
>>
>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph,
>>>
>>> Here is the output of "-mca plm_base_verbose 5":
>>>
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>>
>>> Here, Open MPI's configuration is as follows:
>>>
>>> ./configure \
>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>> --with-tm \
>>> --with-verbs \
>>> --disable-ipv6 \
>>> --disable-vt \
>>> --enable-debug \
>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>
>>>> Hi Ralph,
>>>>
>>>> Okay, I can help you. Please give me some time to report the output.
>>>>
>>>> Tetsuya Mishima
>>>>
>>>>> I can try, but I have no way of testing Torque any more - so all I can do
>>>>> is a code review. If you can build --enable-debug and add -mca
>>>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>>>>>
>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> Thank you for your quick response.
>>>>>>
>>>>>> I'd like to report one more regression in the Torque support of
>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
>>>>>> has problems", which I reported a few days ago.
>>>>>>
>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>> although it worked with openmpi-1.7.3 as I told you before.
>>>>>>
>>>>>> #!/bin/sh
>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>> export OMP_NUM_THREADS=1
>>>>>> cd $PBS_O_WORKDIR
>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>
>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
>>>>>> Since this happens without a lama request, I guess the problem is not
>>>>>> in lama itself. Anyway, please look into this issue as well.
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
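A note on the script above: a machinefile made by copying $PBS_NODEFILE lists a host once per slot, so nodes=node08:ppn=8 yields eight "node08" lines and NPROCS=8. A small sketch of that counting rule (hypothetical code, not Open MPI's hostfile parser; it assumes a single-host file like this one):

/* Count slots for a single-host machinefile: each repetition of the
 * host name contributes one slot, so the copied $PBS_NODEFILE from
 * nodes=node08:ppn=8 should yield node08 with 8 slots. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256], host[256] = "";
    int slots = 0;
    FILE *f = fopen("pbs_hosts", "r");   /* file name from the script above */

    if (NULL == f) {
        perror("pbs_hosts");
        return 1;
    }
    while (NULL != fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\r\n")] = '\0';
        if ('\0' == line[0]) {
            continue;                       /* skip blank lines */
        }
        if ('\0' == host[0]) {
            strncpy(host, line, sizeof(host) - 1);
        }
        if (0 == strcmp(host, line)) {
            slots++;                        /* one more slot for this node */
        }
    }
    fclose(f);
    printf("node %s: %d slots\n", host, slots);
    return 0;
}

Given that counting rule, the "Removing node node08 slots 0 inuse 0" line in the earlier verbose output suggests the machinefile's slots never reached the mapper when -machinefile was combined with the Torque allocation.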
>>>>>>> Done - thanks!
>>>>>>>
>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>
>>>>>>>> Dear Open MPI developers,
>>>>>>>>
>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>>>>>>> built with PGI 13.10, as shown below:
>>>>>>>>
>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>> ...
>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>> (gdb) where
>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>     at ./event.c:1435
>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>> (gdb) quit
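The "Failing at address: 0x34" line is characteristic of a member access through a NULL pointer: &peer->name evaluates to NULL plus the member's offset, so any dereference then faults at that small address. A standalone illustration follows; the struct layout is an assumption chosen to reproduce the 0x34 offset, not the real mca_oob_tcp_peer_t:

/* Why the fault lands at the small address 0x34: taking &peer->name
 * through a NULL peer yields just the member's offset (undefined
 * behavior, but this is what typical compilers emit), and dereferencing
 * it faults there.  fake_peer/fake_name are stand-ins for illustration. */
#include <stdio.h>
#include <stdint.h>

struct fake_name { uint32_t jobid, vpid; };  /* stand-in for orte_process_name_t */

struct fake_peer {
    char pad[0x34];          /* assumption: the name member sits at offset 0x34 */
    struct fake_name name;
};

int main(void)
{
    struct fake_peer *peer = NULL;
    uint64_t *ui64 = (uint64_t *)(&peer->name);     /* the line-627 pattern */

    printf("computed address: %p\n", (void *)ui64); /* prints 0x34 */
    /* reading *ui64 here would reproduce "Address not mapped" at 0x34 */
    return 0;
}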
>>>>>>>> The line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary,
>>>>>>>> and it is what causes the segfault:
>>>>>>>>
>>>>>>>> 624     /* lookup the corresponding process */
>>>>>>>> 625     peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>> 626     if (NULL == peer) {
>>>>>>>> 627         ui64 = (uint64_t*)(&peer->name);
>>>>>>>> 628         opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>                                 orte_oob_base_framework.framework_output,
>>>>>>>> 629                             "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>> 630                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>> 632         peer->mod = mod;
>>>>>>>> 633         peer->name = hdr->origin;
>>>>>>>> 634         peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>> 635         ui64 = (uint64_t*)(&peer->name);
>>>>>>>> 636         if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>> 637             OBJ_RELEASE(peer);
>>>>>>>> 638             return;
>>>>>>>> 639         }
>>>>>>>>
>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
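For clarity, the fix being requested is simply to delete the premature assignment on line 627: peer is still NULL at that point, and line 635 repeats the assignment once the object actually exists. A sketch of the corrected block, unchanged from the excerpt above except for the deleted line:

624     /* lookup the corresponding process */
625     peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
626     if (NULL == peer) {
            /* former line 627 deleted: peer is NULL here, so taking
               &peer->name before OBJ_NEW is the null-pointer access */
628         opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                                orte_oob_base_framework.framework_output,
629                             "%s mca_oob_tcp_recv_connect: connection from new peer",
630                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
632         peer->mod = mod;
633         peer->name = hdr->origin;
634         peer->state = MCA_OOB_TCP_ACCEPTING;
635         ui64 = (uint64_t*)(&peer->name);
636         if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
637             OBJ_RELEASE(peer);
638             return;
639         }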