Further information.
I first encountered this problem in openmpi-1.7.4.x, while openmpi-1.7.3 and
1.6.x work fine. My directory below is named "testbed-openmpi-1.7.3", but it
really contains 1.7.4a1r29646; I'm sorry if that is confusing.

[mishima@manage testbed-openmpi-1.7.3]$ ompi_info | grep "Open MPI:"
Open MPI: 1.7.4a1r29646

It's obvious that the cause lies in the difference between 1.7.3 and 1.7.4.x.

tmishima

> Indeed it should - most puzzling. I'll try playing with it on slurm using
> sbatch and see if I get the same behavior. Offhand, I can't see why the
> difference would exist unless somehow the script itself is taking one of
> the execution slots, and somehow Torque is accounting for it.
>
> Will have to explore and get back to you on a new email thread.
>
> Thanks
> Ralph
>
> On Nov 14, 2013, at 7:01 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > It's no problem to let it lie until the problem becomes serious again.
> >
> > So, this is just for your information.
> >
> > I agree with your opinion that the problem lies in the modified hostfile.
> > But strictly speaking, I think it is related simply to adding the
> > -hostfile option to mpirun in a Torque script.
> >
> > To make this clear, and to prove that I never modify the hostfile given
> > by Torque, I changed the script like this:
> >
> > [mishima@manage testbed-openmpi-1.7.3]$ cat myscript.sh
> > #!/bin/sh
> > #PBS -l nodes=node08:ppn=8
> > cd $PBS_O_WORKDIR
> > cat $PBS_NODEFILE
> > mpirun -machinefile $PBS_NODEFILE -report-bindings -bind-to core Myprog
> >
> > Here, $PBS_NODEFILE is the variable prepared by Torque, which contains
> > the allocated nodes. Furthermore, I removed "-np $NPROCS" so that the
> > number of processes given by Torque is used. In other words, I use
> > exactly the hostfile and nprocs given by Torque.
> >
> > Note that you have to submit the job to reproduce the problem, because a
> > direct run works. The output of this job is below:
> >
> > [mishima@manage testbed-openmpi-1.7.3]$ qsub myscript.sh
> > 7999.manage.cluster
> > [mishima@manage testbed-openmpi-1.7.3]$ cat myscript.sh.o7999
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > --------------------------------------------------------------------------
> > All nodes which are allocated for this job are already filled.
> > --------------------------------------------------------------------------
> >
> > As you can see, it still hits the oversubscription problem.
> > I know that the -hostfile option is unnecessary in a Torque script, but
> > the job should still run with this harmless option.
> >
> > tmishima
> >
> >> On Nov 14, 2013, at 3:25 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Hi Ralph,
> >>>
> >>> I checked -cpus-per-proc in openmpi-1.7.4a1r29646.
> >>> It works just as I want: it can adjust the nprocs on each node by
> >>> dividing by the number of threads.
> >>>
> >>> I think my problem is solved for now using -cpus-per-proc,
> >>> thank you very much.
> >>
> >> Happy that works for you!
> >>
> >>> Regarding the oversubscription problem, I checked that NPROCS was
> >>> really 8 by printing out the number.
> >>>
> >>> SCRIPT:
> >>> echo mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
> >>> mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
> >>>
> >>> OUTPUT:
> >>> mpirun -machinefile pbs_hosts -np 8 -report-bindings -bind-to core Myprog
> >>> --------------------------------------------------------------------------
> >>> All nodes which are allocated for this job are already filled.
> >>> --------------------------------------------------------------------------
> >>>
> >>> By the way, how did you verify the problem?
> >>> It looks to me like you ran the job directly from the command line:
> >>>
> >>> [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4
> >>> --report-bindings -hostfile hosts hostname
> >>>
> >>> In my environment, such a direct run without a Torque script also
> >>> works fine.
> >>
> >> Really? Your above cmd line is exactly the same as mine - a hardcoded
> >> value for np, passing in a machinefile (or hostfile - same thing) while
> >> in a matching allocation. The only difference I can see is that your
> >> hostfile may conflict with the detected allocation since you modified
> >> it. I suspect that is causing the confusion.
> >>
> >>> Anyway, as I already told you, my problem itself was solved, so I think
> >>> the priority of checking this is very low.
> >>
> >> I suspect there really isn't a bug here - the problem most likely lies
> >> in the modified hostfile working against the detected allocation. I'll
> >> let it lie for now and see if something reveals itself at a later date.
> >>
> >> Thanks!
> >> Ralph
> >>
> >>> tmishima
> >>>
> >>>> FWIW: I verified that this works fine under a slurm allocation of 2
> >>>> nodes, each with 12 slots. I filled the node without getting an
> >>>> "oversubscribed" error message.
> >>>>
> >>>> [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
> >>>> [bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
> >>>> [bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
> >>>> [bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
> >>>> bend001
> >>>> bend001
> >>>> bend001
> >>>>
> >>>> where
> >>>>
> >>>> [rhc@bend001 svn-trunk]$ cat hosts
> >>>> bend001 slots=12
> >>>>
> >>>> The only way I get the "out of resources" error is if I ask for more
> >>>> processes than I have slots - i.e., I give it the hosts file as shown,
> >>>> but ask for 13 or more processes.
> >>>>
> >>>> BTW: note one important issue with cpus-per-proc, as shown above.
> >>>> Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one
> >>>> of my procs wound up being split across the two sockets (2 cores on
> >>>> each). That's about the worst situation you can have.
> >>>>
> >>>> So a word of caution: it is up to the user to ensure that the mapping
> >>>> is "good". We just do what you asked us to do.
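To make Ralph's caution concrete, here is a small stand-alone sketch (added
for this write-up, not taken from the thread) that just redoes the arithmetic
of his example: 3 ranks at 4 cpus each, laid out consecutively over two 6-core
sockets. It does not call Open MPI at all; the three constants are the only
inputs.

    /* Illustration only: which ranks straddle a socket boundary when cores
     * are handed out consecutively, as in the -cpus-per-proc example above. */
    #include <stdio.h>

    int main(void)
    {
        const int nprocs = 3;            /* mpirun -n 3           */
        const int cpus_per_proc = 4;     /* --cpus-per-proc 4     */
        const int cores_per_socket = 6;  /* two 6-core sockets    */

        for (int rank = 0; rank < nprocs; rank++) {
            int first = rank * cpus_per_proc;
            int last  = first + cpus_per_proc - 1;
            if (first / cores_per_socket != last / cores_per_socket)
                printf("rank %d spans sockets %d and %d (cores %d-%d)\n",
                       rank, first / cores_per_socket,
                       last / cores_per_socket, first, last);
            else
                printf("rank %d fits in socket %d (cores %d-%d)\n",
                       rank, first / cores_per_socket, first, last);
        }
        return 0;
    }

Running it reports that rank 1 gets cores 4-7 and therefore spans both
sockets, matching the reported bindings above; picking cpus-per-proc so that
it evenly divides the cores per socket avoids the split under this layout.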
> >>>>
> >>>> On Nov 13, 2013, at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> Guess I don't see why modifying the allocation is required - we have
> >>>> mapping options that should support such things. If you specify the
> >>>> total number of procs you want, and cpus-per-proc=4, it should do the
> >>>> same thing I would think. You'd get 2 procs on the 8 slot nodes, 8 on
> >>>> the 32 slot nodes, and up to 6 on the 64 slot nodes (since you
> >>>> specified np=16). So I guess I don't understand the issue.
> >>>>
> >>>> Regardless, if NPROCS=8 (and you verified that by printing it out, not
> >>>> just assuming wc -l got that value), then it shouldn't think it is
> >>>> oversubscribed. I'll take a look under a slurm allocation, as that is
> >>>> all I can access.
> >>>>
> >>>> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Our cluster consists of three types of nodes. They have 8, 32 and 64
> >>>> slots respectively. Since the performance of each core is almost the
> >>>> same, mixed use of these nodes is possible.
> >>>>
> >>>> Furthermore, in this case, for a hybrid application with openmpi +
> >>>> openmp, the modification of the hostfile is necessary, as follows:
> >>>>
> >>>> #PBS -l nodes=1:ppn=32+4:ppn=8
> >>>> export OMP_NUM_THREADS=4
> >>>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> >>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
> >>>>
> >>>> That's why I want to do that.
> >>>>
> >>>> Of course I know that, if I give up mixed use, -npernode is better for
> >>>> this purpose.
> >>>>
> >>>> (The script I showed you first is just a simplified one to clarify the
> >>>> problem.)
> >>>>
> >>>> tmishima
> >>>>
> >>>> Why do it the hard way? I'll look at the FAQ because that definitely
> >>>> isn't a recommended thing to do - better to use -host to specify the
> >>>> subset, or just specify the desired mapping using all the various
> >>>> mappers we provide.
> >>>>
> >>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Sorry for the cross-post.
> >>>>
> >>>> The nodefile is very simple and consists of 8 lines:
> >>>>
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>>
> >>>> Therefore, NPROCS=8.
> >>>>
> >>>> My aim is to modify the allocation, as you pointed out. According to
> >>>> the Open MPI FAQ, a proper subset of the hosts allocated to the
> >>>> Torque / PBS Pro job should be allowed.
> >>>>
> >>>> tmishima
> >>>>
> >>>> Please - can you answer my question on script2? What is the value of
> >>>> NPROCS?
> >>>>
> >>>> Why would you want to do it this way? Are you planning to modify the
> >>>> allocation?? That generally is a bad idea as it can confuse the system.
> >>>>
> >>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Since what I really want is to run script2 correctly, please let us
> >>>> concentrate on script2.
> >>>>
> >>>> I'm not an expert on the internals of openmpi; all I can do is observe
> >>>> from the outside. I suspect these lines are strange, especially the
> >>>> last one.
> >>>>
> >>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>
> >>>> These lines come from this part of orte_rmaps_base_get_target_nodes
> >>>> in rmaps_base_support_fns.c:
> >>>>
> >>>>     } else if (node->slots <= node->slots_inuse &&
> >>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
> >>>>         /* remove the node as fully used */
> >>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
> >>>>                              "%s Removing node %s slots %d inuse %d",
> >>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> >>>>                              node->name, node->slots, node->slots_inuse));
> >>>>         opal_list_remove_item(allocated_nodes, item);
> >>>>         OBJ_RELEASE(item);  /* "un-retain" it */
> >>>>
> >>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
> >>>> from the above line "Removing node node08 slots 0 inuse 0".
> >>>>
> >>>> Or, I'm not sure, but should
> >>>> "else if (node->slots <= node->slots_inuse &&" be
> >>>> "else if (node->slots < node->slots_inuse &&" ?
> >>>>
> >>>> tmishima
> >>>>
> >>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
> >>>>
> >>>> #PBS -l nodes=node08:ppn=8
> >>>>
> >>>> Therefore, I think it should allow this allocation. Is that right?
> >>>>
> >>>> Correct
> >>>>
> >>>> My question is why script1 works and script2 does not. They are
> >>>> almost the same.
> >>>>
> >>>> #PBS -l nodes=node08:ppn=8
> >>>> export OMP_NUM_THREADS=1
> >>>> cd $PBS_O_WORKDIR
> >>>> cp $PBS_NODEFILE pbs_hosts
> >>>> NPROCS=`wc -l < pbs_hosts`
> >>>>
> >>>> #SCRIPT1
> >>>> mpirun -report-bindings -bind-to core Myprog
> >>>>
> >>>> #SCRIPT2
> >>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
> >>>>
> >>>> This version is not only reading the PBS allocation, but also invoking
> >>>> the hostfile filter on top of it. Different code path. I'll take a
> >>>> look - it should still match up assuming NPROCS=8. Any possibility
> >>>> that it is a different number? I don't recall, but aren't there some
> >>>> extra lines in the nodefile - e.g., comments?
> >>>>
> >>>> Myprog
> >>>>
> >>>> tmishima
> >>>>
> >>>> I guess here's my confusion. If you are using only one node, and that
> >>>> node has 8 allocated slots, then we will not allow you to run more
> >>>> than 8 processes on that node unless you specifically provide the
> >>>> --oversubscribe flag. This is because you are operating in a managed
> >>>> environment (in this case, under Torque), and so we treat the
> >>>> allocation as "mandatory" by default.
> >>>>
> >>>> I suspect that is the issue here, in which case the system is behaving
> >>>> as it should.
> >>>>
> >>>> Is the above accurate?
> >>>>
> >>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> It has nothing to do with LAMA as you aren't using that mapper.
> >>>>
> >>>> How many nodes are in this allocation?
> >>>>
> >>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph, here is some additional information.
> >>>>
> >>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50":
> >>>>
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>
> >>>> From this result, I guess it's related to oversubscription.
> >>>> So I added "-oversubscribe" and reran; it then worked well, as shown below:
> >>>>
> >>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >>>> [node08.cluster:27019] node: node08 daemon: 0
> >>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>>>
> >>>> I think something is wrong with the treatment of oversubscription,
> >>>> which might be related to "#3893: LAMA mapper has problems".
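As a rough stand-alone illustration (added here, not part of the original
exchange) of why that verbose output ends in the "already filled" abort, the
mock below re-implements only the filtering condition quoted earlier from
orte_rmaps_base_get_target_nodes(); the struct and function names are invented
for this sketch and are not Open MPI code. With slots reported as 0 and
nothing in use, the "<=" test already treats the node as fully used whenever
oversubscription is disallowed, whereas a node carrying the real Torque
allocation (slots=8) would survive the check.

    #include <stdbool.h>
    #include <stdio.h>

    /* Mock of the node fields used by the quoted condition. */
    struct mock_node { const char *name; int slots; int slots_inuse; };

    /* Mirrors "node->slots <= node->slots_inuse" combined with the
     * NO_OVERSUBSCRIBE mapping directive from the quoted code. */
    static bool keep_node(const struct mock_node *n, bool allow_oversubscribe)
    {
        if (!allow_oversubscribe && n->slots <= n->slots_inuse)
            return false;  /* removed as "fully used" */
        return true;
    }

    int main(void)
    {
        struct mock_node as_logged  = { "node08", 0, 0 };  /* what the log shows      */
        struct mock_node as_alloced = { "node08", 8, 0 };  /* what Torque handed over */

        printf("slots=0 inuse=0 kept: %s\n", keep_node(&as_logged,  false) ? "yes" : "no");
        printf("slots=8 inuse=0 kept: %s\n", keep_node(&as_alloced, false) ? "yes" : "no");
        printf("slots=0 inuse=0 kept with -oversubscribe: %s\n",
               keep_node(&as_logged, true) ? "yes" : "no");
        return 0;
    }

If this reading is right, the more telling question is why the node object
reaches the mapper with slots == 0 in the -machinefile case, rather than
whether "<=" should be "<".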
> >>>> tmishima
> >>>>
> >>>> Hmmm...looks like we aren't getting your allocation. Can you rerun
> >>>> and add -mca ras_base_verbose 50?
> >>>>
> >>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Here is the output of "-mca plm_base_verbose 5":
> >>>>
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> >>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> >>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>> --------------------------------------------------------------------------
> >>>> All nodes which are allocated for this job are already filled.
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> Here, openmpi's configuration is as follows:
> >>>>
> >>>> ./configure \
> >>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>> --with-tm \
> >>>> --with-verbs \
> >>>> --disable-ipv6 \
> >>>> --disable-vt \
> >>>> --enable-debug \
> >>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Okay, I can help you. Please give me some time to report the output.
> >>>>
> >>>> Tetsuya Mishima
> >>>>
> >>>> I can try, but I have no way of testing Torque any more - so all I can
> >>>> do is a code review. If you can build --enable-debug and add -mca
> >>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
> >>>>
> >>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Thank you for your quick response.
> >>>>
> >>>> I'd like to report one more regression in the Torque support of
> >>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>> has problems", which I reported a few days ago.
> >>>>
> >>>> The script below does not work with openmpi-1.7.4a1r29646, although it
> >>>> worked with openmpi-1.7.3 as I told you before.
> >>>>
> >>>> #!/bin/sh
> >>>> #PBS -l nodes=node08:ppn=8
> >>>> export OMP_NUM_THREADS=1
> >>>> cd $PBS_O_WORKDIR
> >>>> cp $PBS_NODEFILE pbs_hosts
> >>>> NPROCS=`wc -l < pbs_hosts`
> >>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>
> >>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> >>>> Since this happens without a lama request, I guess the problem is not
> >>>> in lama itself. Anyway, please look into this issue as well.
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>> Done - thanks!
> >>>>
> >>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Dear openmpi developers,
> >>>>
> >>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
> >>>> built by PGI 13.10, as shown below:
> >>>>
> >>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>> [manage:23082] *** Process received signal ***
> >>>> [manage:23082] Signal: Segmentation fault (11)
> >>>> [manage:23082] Signal code: Address not mapped (1)
> >>>> [manage:23082] Failing at address: 0x34
> >>>> [manage:23082] *** End of error message ***
> >>>> Segmentation fault (core dumped)
> >>>>
> >>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>> ...
> >>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>> Program terminated with signal 11, Segmentation fault.
> >>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>> (gdb) where
> >>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>> (gdb) quit
> >>>>
> >>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and
> >>>> it is what causes the segfault:
> >>>>
> >>>> 624             /* lookup the corresponding process */
> >>>> 625             peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>> 626             if (NULL == peer) {
> >>>> 627                 ui64 = (uint64_t*)(&peer->name);
> >>>> 628                 opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>> 629                                     "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>> 630                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>> 631                 peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>> 632                 peer->mod = mod;
> >>>> 633                 peer->name = hdr->origin;
> >>>> 634                 peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>> 635                 ui64 = (uint64_t*)(&peer->name);
> >>>> 636                 if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>> 637                     OBJ_RELEASE(peer);
> >>>> 638                     return;
> >>>> 639                 }
> >>>>
> >>>> Please fix this mistake in the next release.
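For reference, here is a sketch of how that branch might look with the
offending assignment removed. It simply mirrors the excerpt above and the
suggestion to drop line 627; it is not an official patch, it is not a
stand-alone program, and ui64 is assumed to be declared earlier in
recv_connect() as in the original file.

    peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
    if (NULL == peer) {
        /* note: no access to peer->name here - peer is still NULL */
        opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                            orte_oob_base_framework.framework_output,
                            "%s mca_oob_tcp_recv_connect: connection from new peer",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        peer = OBJ_NEW(mca_oob_tcp_peer_t);
        peer->mod = mod;
        peer->name = hdr->origin;
        peer->state = MCA_OOB_TCP_ACCEPTING;
        ui64 = (uint64_t*)(&peer->name);  /* safe: peer now points to a real object */
        if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
            OBJ_RELEASE(peer);
            return;
        }
    }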
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima