Further information.
I first encountered this problem in openmpi-1.7.4.x, while openmpi-1.7.3 and
1.6.x work fine. My directory below is named "testbed-openmpi-1.7.3", but it
really contains 1.7.4a1r29646; I'm sorry if that is confusing.

[mishima@manage testbed-openmpi-1.7.3]$ ompi_info | grep "Open MPI:"
Open MPI: 1.7.4a1r29646

It's obvious that the cause lies in the difference between 1.7.3 and 1.7.4.x.

tmishima

> Indeed it should - most puzzling. I'll try playing with it on slurm using
> sbatch and see if I get the same behavior. Offhand, I can't see why the
> difference would exist unless somehow the script itself is taking one of
> the execution slots, and somehow Torque is accounting for it.
>
> Will have to explore and get back to you on a new email thread.
>
> Thanks
> Ralph
>
> On Nov 14, 2013, at 7:01 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > It's no problem to let it lie until the problem becomes serious again.
> >
> > So, this is just for your information.
> >
> > I agree with your opinion that the problem lies in the modified hostfile.
> > But strictly speaking, I think it is related simply to adding the
> > -hostfile option to mpirun in a Torque script.
> >
> > To make this clear, and to prove that I never modify the hostfile given
> > by Torque, I changed the script like this:
> >
> > [mishima@manage testbed-openmpi-1.7.3]$ cat myscript.sh
> > #!/bin/sh
> > #PBS -l nodes=node08:ppn=8
> > cd $PBS_O_WORKDIR
> > cat $PBS_NODEFILE
> > mpirun -machinefile $PBS_NODEFILE -report-bindings -bind-to core Myprog
> >
> > Here, $PBS_NODEFILE is the variable prepared by Torque, which contains
> > the allocated nodes. Furthermore, I removed "-np $NPROCS" so that the
> > number of processes given by Torque is used. In other words, I use
> > exactly the hostfile and nprocs given by Torque.
> >
> > Note that you have to submit the job to reproduce the problem, because a
> > direct run works. The output of this job is below:
> >
> > [mishima@manage testbed-openmpi-1.7.3]$ qsub myscript.sh
> > 7999.manage.cluster
> > [mishima@manage testbed-openmpi-1.7.3]$ cat myscript.sh.o7999
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > node08
> > --------------------------------------------------------------------------
> > All nodes which are allocated for this job are already filled.
> > --------------------------------------------------------------------------
> >
> > As you can see, it still hits the oversubscription problem.
> > I know that the -hostfile option is unnecessary in a Torque script, but
> > the job should still run with this harmless option.
> >
> > tmishima
> >
> >> On Nov 14, 2013, at 3:25 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Hi Ralph,
> >>>
> >>> I checked -cpus-per-proc in openmpi-1.7.4a1r29646.
> >>> It works just as I want: it can adjust the nprocs on each node by
> >>> dividing by the number of threads.
> >>>
> >>> I think my problem is solved for now using -cpus-per-proc,
> >>> thank you very much.
> >>
> >> Happy that works for you!
> >>
> >>> Regarding the oversubscription problem, I checked that NPROCS was
> >>> really 8 by printing out the number.
> >>>
> >>> SCRIPT:
> >>> echo mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
> >>> mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
> >>>
> >>> OUTPUT:
> >>> mpirun -machinefile pbs_hosts -np 8 -report-bindings -bind-to core Myprog
> >>> --------------------------------------------------------------------------
> >>> All nodes which are allocated for this job are already filled.
> >>> --------------------------------------------------------------------------
> >>>
> >>> By the way, how did you verify the problem?
> >>> It looks to me like you ran the job directly from the command line:
> >>>
> >>> [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4
> >>> --report-bindings -hostfile hosts hostname
> >>>
> >>> In my environment, such a direct run without a Torque script also
> >>> works fine.
> >>
> >> Really? Your above cmd line is exactly the same as mine - a hardcoded
> >> value for np, passing in a machinefile (or hostfile - same thing) while
> >> in a matching allocation. The only difference I can see is that your
> >> hostfile may conflict with the detected allocation since you modified
> >> it. I suspect that is causing the confusion.
> >>
> >>> Anyway, as I already told you, my problem itself was solved, so I think
> >>> the priority of checking this is very low.
> >>
> >> I suspect there really isn't a bug here - the problem most likely lies
> >> in the modified hostfile working against the detected allocation. I'll
> >> let it lie for now and see if something reveals itself at a later date.
> >>
> >> Thanks!
> >> Ralph
> >>
> >>> tmishima
> >>>
> >>>> FWIW: I verified that this works fine under a slurm allocation of 2
> >>>> nodes, each with 12 slots. I filled the node without getting an
> >>>> "oversubscribed" error message.
> >>>>
> >>>> [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
> >>>> [bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
> >>>> [bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
> >>>> [bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
> >>>> bend001
> >>>> bend001
> >>>> bend001
> >>>>
> >>>> where
> >>>>
> >>>> [rhc@bend001 svn-trunk]$ cat hosts
> >>>> bend001 slots=12
> >>>>
> >>>> The only way I get the "out of resources" error is if I ask for more
> >>>> processes than I have slots - i.e., I give it the hosts file as shown,
> >>>> but ask for 13 or more processes.
> >>>>
> >>>> BTW: note one important issue with cpus-per-proc, as shown above.
> >>>> Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one
> >>>> of my procs wound up being split across the two sockets (2 cores on
> >>>> each). That's about the worst situation you can have.
> >>>>
> >>>> So a word of caution: it is up to the user to ensure that the mapping
> >>>> is "good". We just do what you asked us to do.
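To make Ralph's caution concrete, here is a small stand-alone sketch (added
for this write-up, not taken from the thread) that just redoes the arithmetic
of his example: 3 ranks at 4 cpus each, laid out consecutively over two 6-core
sockets. It does not call Open MPI at all; the three constants are the only
inputs.

    /* Illustration only: which ranks straddle a socket boundary when cores
     * are handed out consecutively, as in the -cpus-per-proc example above. */
    #include <stdio.h>

    int main(void)
    {
        const int nprocs = 3;            /* mpirun -n 3           */
        const int cpus_per_proc = 4;     /* --cpus-per-proc 4     */
        const int cores_per_socket = 6;  /* two 6-core sockets    */

        for (int rank = 0; rank < nprocs; rank++) {
            int first = rank * cpus_per_proc;
            int last  = first + cpus_per_proc - 1;
            if (first / cores_per_socket != last / cores_per_socket)
                printf("rank %d spans sockets %d and %d (cores %d-%d)\n",
                       rank, first / cores_per_socket,
                       last / cores_per_socket, first, last);
            else
                printf("rank %d fits in socket %d (cores %d-%d)\n",
                       rank, first / cores_per_socket, first, last);
        }
        return 0;
    }

Running it reports that rank 1 gets cores 4-7 and therefore spans both
sockets, matching the reported bindings above; picking cpus-per-proc so that
it evenly divides the cores per socket avoids the split under this layout.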
> >>>>
> >>>> On Nov 13, 2013, at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> Guess I don't see why modifying the allocation is required - we have
> >>>> mapping options that should support such things. If you specify the
> >>>> total number of procs you want, and cpus-per-proc=4, it should do the
> >>>> same thing I would think. You'd get 2 procs on the 8 slot nodes, 8 on
> >>>> the 32 slot nodes, and up to 6 on the 64 slot nodes (since you
> >>>> specified np=16). So I guess I don't understand the issue.
> >>>>
> >>>> Regardless, if NPROCS=8 (and you verified that by printing it out, not
> >>>> just assuming wc -l got that value), then it shouldn't think it is
> >>>> oversubscribed. I'll take a look under a slurm allocation, as that is
> >>>> all I can access.
> >>>>
> >>>> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Our cluster consists of three types of nodes. They have 8, 32 and 64
> >>>> slots respectively. Since the performance of each core is almost the
> >>>> same, mixed use of these nodes is possible.
> >>>>
> >>>> Furthermore, in this case, for a hybrid application with openmpi +
> >>>> openmp, the modification of the hostfile is necessary, as follows:
> >>>>
> >>>> #PBS -l nodes=1:ppn=32+4:ppn=8
> >>>> export OMP_NUM_THREADS=4
> >>>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
> >>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
> >>>>
> >>>> That's why I want to do that.
> >>>>
> >>>> Of course I know that, if I give up mixed use, -npernode is better for
> >>>> this purpose.
> >>>>
> >>>> (The script I showed you first is just a simplified one to clarify the
> >>>> problem.)
> >>>>
> >>>> tmishima
> >>>>
> >>>> Why do it the hard way? I'll look at the FAQ because that definitely
> >>>> isn't a recommended thing to do - better to use -host to specify the
> >>>> subset, or just specify the desired mapping using all the various
> >>>> mappers we provide.
> >>>>
> >>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Sorry for the cross-post.
> >>>>
> >>>> The nodefile is very simple and consists of 8 lines:
> >>>>
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>> node08
> >>>>
> >>>> Therefore, NPROCS=8.
> >>>>
> >>>> My aim is to modify the allocation, as you pointed out. According to
> >>>> the Open MPI FAQ, a proper subset of the hosts allocated to the
> >>>> Torque / PBS Pro job should be allowed.
> >>>>
> >>>> tmishima
> >>>>
> >>>> Please - can you answer my question on script2? What is the value of
> >>>> NPROCS?
> >>>>
> >>>> Why would you want to do it this way? Are you planning to modify the
> >>>> allocation?? That generally is a bad idea as it can confuse the system.
> >>>>
> >>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Since what I really want is to run script2 correctly, please let us
> >>>> concentrate on script2.
> >>>>
> >>>> I'm not an expert on the internals of openmpi; all I can do is observe
> >>>> from the outside. I suspect these lines are strange, especially the
> >>>> last one.
> >>>>
> >>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>
> >>>> These lines come from this part of orte_rmaps_base_get_target_nodes
> >>>> in rmaps_base_support_fns.c:
> >>>>
> >>>>     } else if (node->slots <= node->slots_inuse &&
> >>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
> >>>>         /* remove the node as fully used */
> >>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
> >>>>                              "%s Removing node %s slots %d inuse %d",
> >>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
> >>>>                              node->name, node->slots, node->slots_inuse));
> >>>>         opal_list_remove_item(allocated_nodes, item);
> >>>>         OBJ_RELEASE(item);  /* "un-retain" it */
> >>>>
> >>>> I wonder why node->slots and node->slots_inuse are 0, which I can read
> >>>> from the above line "Removing node node08 slots 0 inuse 0".
> >>>>
> >>>> Or, I'm not sure, but should
> >>>> "else if (node->slots <= node->slots_inuse &&" be
> >>>> "else if (node->slots < node->slots_inuse &&" ?
> >>>>
> >>>> tmishima
> >>>>
> >>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
> >>>>
> >>>> #PBS -l nodes=node08:ppn=8
> >>>>
> >>>> Therefore, I think it should allow this allocation. Is that right?
> >>>>
> >>>> Correct
> >>>>
> >>>> My question is why script1 works and script2 does not. They are
> >>>> almost the same.
> >>>>
> >>>> #PBS -l nodes=node08:ppn=8
> >>>> export OMP_NUM_THREADS=1
> >>>> cd $PBS_O_WORKDIR
> >>>> cp $PBS_NODEFILE pbs_hosts
> >>>> NPROCS=`wc -l < pbs_hosts`
> >>>>
> >>>> #SCRIPT1
> >>>> mpirun -report-bindings -bind-to core Myprog
> >>>>
> >>>> #SCRIPT2
> >>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
> >>>>
> >>>> This version is not only reading the PBS allocation, but also invoking
> >>>> the hostfile filter on top of it. Different code path. I'll take a
> >>>> look - it should still match up assuming NPROCS=8. Any possibility
> >>>> that it is a different number? I don't recall, but aren't there some
> >>>> extra lines in the nodefile - e.g., comments?
> >>>>
> >>>> Myprog
> >>>>
> >>>> tmishima
> >>>>
> >>>> I guess here's my confusion. If you are using only one node, and that
> >>>> node has 8 allocated slots, then we will not allow you to run more
> >>>> than 8 processes on that node unless you specifically provide the
> >>>> --oversubscribe flag. This is because you are operating in a managed
> >>>> environment (in this case, under Torque), and so we treat the
> >>>> allocation as "mandatory" by default.
> >>>>
> >>>> I suspect that is the issue here, in which case the system is behaving
> >>>> as it should.
> >>>>
> >>>> Is the above accurate?
> >>>>
> >>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> It has nothing to do with LAMA as you aren't using that mapper.
> >>>>
> >>>> How many nodes are in this allocation?
> >>>>
> >>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph, here is some additional information.
> >>>>
> >>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50":
> >>>>
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> >>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> >>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> >>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> >>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> >>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> >>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> >>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> >>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> >>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
> >>>>
> >>>> From this result, I guess it's related to oversubscription.
> >>>> So I added "-oversubscribe" and reran; it then worked well, as shown below:
> >>>>
> >>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> >>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> >>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> >>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> >>>> [node08.cluster:27019] node: node08 daemon: 0
> >>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> >>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> >>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> >>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
> >>>>
> >>>> I think something is wrong with the treatment of oversubscription,
> >>>> which might be related to "#3893: LAMA mapper has problems".
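As a rough stand-alone illustration (added here, not part of the original
exchange) of why that verbose output ends in the "already filled" abort, the
mock below re-implements only the filtering condition quoted earlier from
orte_rmaps_base_get_target_nodes(); the struct and function names are invented
for this sketch and are not Open MPI code. With slots reported as 0 and
nothing in use, the "<=" test already treats the node as fully used whenever
oversubscription is disallowed, whereas a node carrying the real Torque
allocation (slots=8) would survive the check.

    #include <stdbool.h>
    #include <stdio.h>

    /* Mock of the node fields used by the quoted condition. */
    struct mock_node { const char *name; int slots; int slots_inuse; };

    /* Mirrors "node->slots <= node->slots_inuse" combined with the
     * NO_OVERSUBSCRIBE mapping directive from the quoted code. */
    static bool keep_node(const struct mock_node *n, bool allow_oversubscribe)
    {
        if (!allow_oversubscribe && n->slots <= n->slots_inuse)
            return false;  /* removed as "fully used" */
        return true;
    }

    int main(void)
    {
        struct mock_node as_logged  = { "node08", 0, 0 };  /* what the log shows      */
        struct mock_node as_alloced = { "node08", 8, 0 };  /* what Torque handed over */

        printf("slots=0 inuse=0 kept: %s\n", keep_node(&as_logged,  false) ? "yes" : "no");
        printf("slots=8 inuse=0 kept: %s\n", keep_node(&as_alloced, false) ? "yes" : "no");
        printf("slots=0 inuse=0 kept with -oversubscribe: %s\n",
               keep_node(&as_logged, true) ? "yes" : "no");
        return 0;
    }

If this reading is right, the more telling question is why the node object
reaches the mapper with slots == 0 in the -machinefile case, rather than
whether "<=" should be "<".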
> >>>> tmishima
> >>>>
> >>>> Hmmm...looks like we aren't getting your allocation. Can you rerun
> >>>> and add -mca ras_base_verbose 50?
> >>>>
> >>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Here is the output of "-mca plm_base_verbose 5":
> >>>>
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> >>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> >>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> >>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> >>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> >>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> >>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> >>>> --------------------------------------------------------------------------
> >>>> All nodes which are allocated for this job are already filled.
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> Here, openmpi's configuration is as follows:
> >>>>
> >>>> ./configure \
> >>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> >>>> --with-tm \
> >>>> --with-verbs \
> >>>> --disable-ipv6 \
> >>>> --disable-vt \
> >>>> --enable-debug \
> >>>> CC=pgcc CFLAGS="-tp k8-64e" \
> >>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> >>>> F77=pgfortran FFLAGS="-tp k8-64e" \
> >>>> FC=pgfortran FCFLAGS="-tp k8-64e"
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Okay, I can help you. Please give me some time to report the output.
> >>>>
> >>>> Tetsuya Mishima
> >>>>
> >>>> I can try, but I have no way of testing Torque any more - so all I can
> >>>> do is a code review. If you can build --enable-debug and add -mca
> >>>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
> >>>>
> >>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Thank you for your quick response.
> >>>>
> >>>> I'd like to report one more regression in the Torque support of
> >>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>> has problems", which I reported a few days ago.
> >>>>
> >>>> The script below does not work with openmpi-1.7.4a1r29646, although it
> >>>> worked with openmpi-1.7.3 as I told you before.
> >>>>
> >>>> #!/bin/sh
> >>>> #PBS -l nodes=node08:ppn=8
> >>>> export OMP_NUM_THREADS=1
> >>>> cd $PBS_O_WORKDIR
> >>>> cp $PBS_NODEFILE pbs_hosts
> >>>> NPROCS=`wc -l < pbs_hosts`
> >>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>
> >>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> >>>> Since this happens without a lama request, I guess the problem is not
> >>>> in lama itself. Anyway, please look into this issue as well.
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>> Done - thanks!
> >>>>
> >>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>> Dear openmpi developers,
> >>>>
> >>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
> >>>> built by PGI 13.10, as shown below:
> >>>>
> >>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>> [manage:23082] *** Process received signal ***
> >>>> [manage:23082] Signal: Segmentation fault (11)
> >>>> [manage:23082] Signal code: Address not mapped (1)
> >>>> [manage:23082] Failing at address: 0x34
> >>>> [manage:23082] *** End of error message ***
> >>>> Segmentation fault (core dumped)
> >>>>
> >>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>> ...
> >>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>> Program terminated with signal 11, Segmentation fault.
> >>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>> 631         peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>> (gdb) where
> >>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>> (gdb) quit
> >>>>
> >>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and
> >>>> it is what causes the segfault:
> >>>>
> >>>> 624             /* lookup the corresponding process */
> >>>> 625             peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>> 626             if (NULL == peer) {
> >>>> 627                 ui64 = (uint64_t*)(&peer->name);
> >>>> 628                 opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>> 629                                     "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>> 630                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>> 631                 peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>> 632                 peer->mod = mod;
> >>>> 633                 peer->name = hdr->origin;
> >>>> 634                 peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>> 635                 ui64 = (uint64_t*)(&peer->name);
> >>>> 636                 if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>> 637                     OBJ_RELEASE(peer);
> >>>> 638                     return;
> >>>> 639                 }
> >>>>
> >>>> Please fix this mistake in the next release.
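For reference, here is a sketch of how that branch might look with the
offending assignment removed. It simply mirrors the excerpt above and the
suggestion to drop line 627; it is not an official patch, it is not a
stand-alone program, and ui64 is assumed to be declared earlier in
recv_connect() as in the original file.

    peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
    if (NULL == peer) {
        /* note: no access to peer->name here - peer is still NULL */
        opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
                            orte_oob_base_framework.framework_output,
                            "%s mca_oob_tcp_recv_connect: connection from new peer",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
        peer = OBJ_NEW(mca_oob_tcp_peer_t);
        peer->mod = mod;
        peer->name = hdr->origin;
        peer->state = MCA_OOB_TCP_ACCEPTING;
        ui64 = (uint64_t*)(&peer->name);  /* safe: peer now points to a real object */
        if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
            OBJ_RELEASE(peer);
            return;
        }
    }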
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima