FWIW, I do not get these segfaults when compiling 64-bit on Opteron. I can run the IMB suite (and other apps) to completion when using tcp,self.

(I did find that I missed the MPI_Allreduce count==0 case, which I just committed a fix for)
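
For context on the count==0 case: a reduction over zero elements has nothing to combine, so a guard would normally return success before touching either buffer. The following is only a minimal sketch of that guard; sum_reduce is a hypothetical stand-in written for illustration and is not the fix that was committed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise sum reduction; not Open MPI code. */
int sum_reduce(const int *in, int *out, int count)
{
    assert(count >= 0);            /* a negative count is a caller error */
    if (count == 0) {
        return 0;                  /* nothing to reduce: succeed without touching buffers */
    }
    assert(in != NULL && out != NULL);
    for (int i = 0; i < count; ++i) {
        out[i] += in[i];           /* accumulate the contribution into out */
    }
    return 0;
}
```

With this shape, sum_reduce(NULL, NULL, 0) returns 0 without dereferencing anything, which is the behaviour a zero-count path needs.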


On Aug 18, 2005, at 2:04 PM, Rainer Keller wrote:

Hello Brian,
sure, attached is output of ompi_info -a on:

model name      : AMD Opteron(tm) Processor 246

Linux c3-19 2.4.21-OC_NUMA_fix #4 SMP Tue Nov 30 16:03:38 CET 2004 x86_64 unknown

It's a SuSE SLES8 distribution with the following libc:

hpcraink@c3-19:~ > /lib64/libc.so.6
GNU C Library stable release version 2.2.5, by Roland McGrath et al.
Copyright (C) 1992-2001, 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.2.2 (SuSE Linux).
Compiled on a Linux 2.4.19 system on 2003-03-27.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        Berkeley DB glibc 2.1 compat library by Thorsten Kukuk
        linuxthreads-0.9 by Xavier Leroy
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Report bugs using the `glibcbug' script to <b...@gnu.org>.

Compilation was done with pgcc-5.2-4

CU,
raY

On Thursday 18 August 2005 20:05, Brian Barrett wrote:

Just to double check, can you run ompi_info and send me the results?

Thanks,

Brian

On Aug 18, 2005, at 10:45 AM, Rainer Keller wrote:

Hello,
I see the "same" (well, probably not exactly the same) thing here on
Opteron with 64-bit (-g and so on). I get:

#0  0x0000000040085160 in orte_sds_base_contact_universe ()
at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
29          return orte_sds_base_module->contact_universe();
(gdb) where
#0  0x0000000040085160 in orte_sds_base_contact_universe ()
at ../../../../../orte/mca/sds/base/sds_base_interface.c:29
#1  0x0000000040063e95 in orte_init_stage1 ()
at ../../../orte/runtime/orte_init_stage1.c:185
#2  0x0000000040017e7d in orte_system_init ()
at ../../../orte/runtime/orte_system_init.c:38
#3  0x00000000400148f5 in orte_init () at ../../../orte/runtime/orte_init.c:46
#4  0x000000004000dfc7 in main (argc=4, argv=0x7fbfffe8a8)
at ../../../../orte/tools/orterun/orterun.c:291
#5  0x0000002a95c0c017 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000004000bf2a in _start ()
(gdb)
(This is within mpirun.)

orte_sds_base_module is NULL here...
This is without a persistent orted; just mpirun...
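
To illustrate the failure mode in frame #0: calling through a module pointer that was never set is a NULL dereference. Below is a minimal sketch of a wrapper that reports an error instead of crashing. Everything in it (sds_module_t, sds_module, SDS_ERR_NOT_AVAILABLE, and the example_* names) is invented for illustration and is not the real ORTE definition.

```c
#include <stddef.h>

/* Hypothetical error code; the real code base has its own constants. */
#define SDS_ERR_NOT_AVAILABLE (-1)

/* Hypothetical module: a table of function pointers filled in by a component. */
typedef struct {
    int (*contact_universe)(void);
} sds_module_t;

/* Stays NULL until some component has been selected. */
sds_module_t *sds_module = NULL;

/* Wrapper that checks the module before calling through it. */
int sds_contact_universe(void)
{
    if (sds_module == NULL || sds_module->contact_universe == NULL) {
        return SDS_ERR_NOT_AVAILABLE;   /* no module selected: report, do not crash */
    }
    return sds_module->contact_universe();
}

/* Example component used only to exercise the wrapper. */
int example_contact(void) { return 0; }
sds_module_t example_module = { example_contact };
```

A guard like this only turns the crash into an error return; why no module was selected before the call would still need to be tracked down.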

CU,
ray

On Thursday 18 August 2005 16:57, Nathan DeBardeleben wrote:

FYI, this only happens when I let OMPI compile 64-bit on Linux. When I
throw in CFLAGS=FFLAGS=CXXFLAGS=-m32, then orted, my myriad of test
codes, mpirun, registry subscription codes, and JNI all work like a champ.
Something's wrong with the 64-bit build, it appears to me.
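
One quick way to confirm which mode a given set of flags actually produced is to compile a tiny probe with the same flags and look at the pointer width. A minimal sketch, where pointer_bits is a hypothetical helper name chosen for this example:

```c
#include <limits.h>

/* Width of a data pointer, in bits, for the ABI this file was compiled for.
   Expected to be 32 when built with -m32 and 64 for a native x86_64 build. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

Printing pointer_bits() from a small test program built once with CFLAGS=-m32 and once without shows whether the flag took effect.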

-- Nathan
Correspondence
-------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-------------------------------------------------------------------

Tim S. Woodall wrote:

Nathan,

I'll try to reproduce this sometime this week - but I'm pretty swamped.
Is Greg also seeing the same behavior?

Thanks,
Tim

Nathan DeBardeleben wrote:

To expand on this further, orte_init() seg faults on both bluesteel
(32bit linux) and sparkplug (64bit linux) equally.  The required
condition is that orted must be running first (which of course we
require for our work - a persistent orte daemon and registry).


[bluesteel]~/ptp > ./dump_info
Segmentation fault
[bluesteel]~/ptp > gdb dump_info
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

(gdb) run
Starting program: /home/ndebard/ptp/dump_info

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x000000000045997d in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x00000000004156a7 in orte_system_init () at orte_system_init.c:38
#3  0x00000000004151c7 in orte_init () at orte_init.c:46
#4  0x0000000000414cbb in main (argc=1, argv=0x7fbffff298) at dump_info.c:185
(gdb)


-- Nathan

Nathan DeBardeleben wrote:

Just to clarify:
1: no orted started (meaning that mpirun or the registry programs will
start one by themselves) causes those programs to lock up.
2: starting orted by hand (trying to get these programs to connect to
a centralized one) causes the connecting programs to segfault.

-- Nathan

Nathan DeBardeleben wrote:

So I dropped an .ompi_ignore into that directory, reconfigured, and the
compile worked (yay!).
However, not a lot of progress: mpirun locks up, and all my registry test
programs lock up as well.  If I start the orted by hand, then any of my
registry-calling programs segfault:

[sparkplug]~/ptp > gdb sub_test
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

(gdb) run
Starting program: /home/ndebard/ptp/sub_test

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00000000004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x00000000004155cf in orte_system_init () at orte_system_init.c:38
#3  0x00000000004150ef in orte_init () at orte_init.c:46
#4  0x00000000004148a1 in main (argc=1, argv=0x7fbffff178) at sub_test.c:60
(gdb)


Yes, I recompiled everything.

Here's an example of me trying something a little more complicated
(which I believe locks up for the same reason - something borked with
the registry interaction).


[sparkplug]~/ompi-test > bjssub -s 10000 -n 10 -i bash
Waiting for interactive job nodes.
(nodes 18 16 17 18 19 20 21 22 23 24 25)
Starting interactive job.
NODES=16,17,18,19,20,21,22,23,24,25
JOBID=18


so i got my nodes


ndebard@sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
ndebard@sparkplug:~/ompi-test> export OMPI_MCA_pls_bproc_seed_priority=101


and set these envvars, as we need to in order to use Greg's bproc;
without the 2nd export, the machine's load maxes out and it locks up.


ndebard@sparkplug:~/ompi-test> bpstat
Node(s)                            Status          Mode       User     Group
100-128                            down            ---------- root     root
0-15                               up              ---x------ vchandu  vchandu
16-25                              up              ---x------ ndebard  ndebard
26-27                              up              ---x------ root     root
28-30                              up              ---x--x--x root     root
ndebard@sparkplug:~/ompi-test> env | grep NODES
NODES=16,17,18,19,20,21,22,23,24,25


yes, i really have the nodes


ndebard@sparkplug:~/ompi-test> mpicc -o test-mpi test-mpi.c
ndebard@sparkplug:~/ompi-test>


recompile for good measure


ndebard@sparkplug:~/ompi-test> ls /tmp/openmpi-sessions-ndebard*
/bin/ls: /tmp/openmpi-sessions-ndebard*: No such file or directory


proof that there's no left over old directory


ndebard@sparkplug:~/ompi-test> mpirun -np 1 test-mpi


it never responds at this point - but I can kill it with ^C.


mpirun: killing job...
Killed
ndebard@sparkplug:~/ompi-test>


-- Nathan

Jeff Squyres wrote:

Is this what Tim Prins was working on?

On Aug 16, 2005, at 5:21 PM, Tim S. Woodall wrote:

I'm not sure why this is even building... Is someone working on this?
I thought we had .ompi_ignore files in this directory.

Tim

Nathan DeBardeleben wrote:

So I'm seeing all these nice emails about people developing on OMPI
today yet I can't get it to compile.  Am I out here in limbo on this or
are others in the same boat?  The errors I'm seeing are about some bproc
code calling undefined functions and they are linked again below.

-- Nathan

Nathan DeBardeleben wrote:

Back from training and trying to test this but now OMPI doesn't compile
at all:

gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include
-I../../../../include -I../../../.. -I../../../..
-I../../../../include -I../../../../opal -I../../../../orte
-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
-Werror-implicit-function-declaration -fno-strict-aliasing -MT
ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c
ras_lsf_bproc.c -o ras_lsf_bproc.o
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
ras_lsf_bproc.c:32: error: implicit declaration of function `orte_ras_base_node_insert'
ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
ras_lsf_bproc.c:37: error: implicit declaration of function `orte_ras_base_node_query'
make[4]: *** [ras_lsf_bproc.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/orte'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >


Clean SVN checkout this morning with configure:

[sparkplug]~/ompi > ./configure --enable-static --disable-shared
--without-threads --prefix=/home/ndebard/local/ompi
--with-devel-headers


-- Nathan

Brian Barrett wrote:

This is now fixed in SVN.  You should no longer need the
--build=i586...  hack to compile 32 bit code on Opterons.

Brian

On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:

On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:

We've got a 64bit Linux (SUSE) box here. For a variety of reasons (Java,
JNI, linking in with OMPI libraries, etc which I won't get into)
I need to compile OMPI 32 bit (or get 64bit versions of a lot of other
libraries).
I get various compile errors when I try different things, but first let
me explain the system we have:


<snip>


This goes on and on and on actually.  And the 'is incompatible with
i386:x86-64 output' looks to be repeated for every line before this
error which actually caused the Make to bomb.

Any suggestions at all?  Surely someone must have tried to force OMPI to
build in 32bit mode on a 64bit machine.


I don't think anyone has tried to build 32 bit on an Opteron, which is
the cause of the problems...

I think I know how to fix this, but it won't happen until later in the
weekend.  I can't think of a good workaround until then.  Well, one
possibility is to set the target like you were doing and disable ROMIO.
Actually, you'll also need to disable Fortran 77.  So something like:

./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77

might just do the trick.

Brian


--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller             email: kel...@hlrs.de
  High Performance Computing         Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)          Fax: ++49 (0)711-678 7626
  Nobelstrasse 19,  R. O0.030        http://www.hlrs.de/people/keller
  70550 Stuttgart

<ompi_info-2005.18.08.txt>