Re: [OMPI devel] Getting the number of nodes

2006-07-05 Thread Nathan DeBardeleben
I'm running this on my Mac, where I expected to get back only 
localhost.  I upgraded to 1.0.2 a little while back; until then I had 
been using one of the alphas (I think it was alpha 9 but I can't be 
sure), and at that point this function returned '1' on my Mac.


-- Nathan
Correspondence
-
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-



Ralph H Castain wrote:

Rc=0 indicates that the "get" function was successful, so this means that
there were no nodes on the NODE_SEGMENT. Were you running this in an
environment where nodes had been allocated to you? Or were you expecting to
find only "localhost" on the segment?

I'm not entirely sure, but I don't believe there have been significant
changes in 1.0.2 for some time. My guess is that something has changed on
your system as opposed to in the OpenMPI code you're using. Did you do an
update recently and then begin seeing this behavior? Your revision level is
1000+ behind the current repository, so my guess is that you haven't updated
for a while. Since 1.0.2 is under maintenance for bugs only, that shouldn't
be a problem. I'm just trying to understand why your function is doing
something different if the OpenMPI code you're using hasn't changed.

Ralph



On 7/5/06 2:40 PM, "Nathan DeBardeleben" <ndeb...@lanl.gov> wrote:

  

Open MPI: 1.0.2
   Open MPI SVN revision: r9571
  

The rc value returned by the 'get' call is '0'.
All I'm doing is calling init with my own daemon name, it's coming up
fine, then I immediately call this to figure out how many nodes are
associated with this machine.




Ralph H Castain wrote:


Hi Nathan

Could you tell us which version of the code you are using, and print out the
rc value that was returned by the "get" call? I see nothing obviously wrong
with the code, but much depends on what happened prior to this call too.

BTW: you might want to release the memory stored in the returned values - it
could represent a substantial memory leak.
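
A minimal sketch of the cleanup described above, assuming a "get"-style call that hands the caller a heap-allocated array of heap-allocated values. Everything here (`value_t`, `mock_get`, `count_and_release`) is a hypothetical stand-in and not the ORTE API; real registry values would need whatever release call the registry itself documents rather than a plain free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registry value; not the real ORTE type. */
typedef struct {
    int payload;
} value_t;

/* Hypothetical "get": allocates n values and returns them through the
 * out parameters, mimicking an API that transfers ownership to the
 * caller.  Returns 0 on success, -1 on allocation failure. */
static int mock_get(size_t n, size_t *cnt, value_t ***values)
{
    value_t **v;
    size_t i;

    *cnt = 0;
    *values = NULL;
    if (n == 0) {
        return 0;
    }
    v = calloc(n, sizeof *v);
    if (v == NULL) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        v[i] = malloc(sizeof **v);
        if (v[i] == NULL) {
            /* Undo the partial allocation before reporting failure. */
            while (i > 0) {
                free(v[--i]);
            }
            free(v);
            return -1;
        }
        v[i]->payload = (int)i;
    }
    *cnt = n;
    *values = v;
    return 0;
}

/* Count the returned values, then release each one and the array. */
static size_t count_and_release(size_t n)
{
    size_t cnt = 0, i;
    value_t **values = NULL;

    if (mock_get(n, &cnt, &values) != 0) {
        return 0;
    }
    assert(cnt == 0 || values != NULL);
    for (i = 0; i < cnt; i++) {
        free(values[i]);   /* release each element ... */
    }
    free(values);          /* ... and then the array that held them */
    return cnt;
}
```

The point of the sketch is only the ownership pattern: a function that just wants the count still has to release what the call returned before it returns.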

Ralph





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  


  


[OMPI devel] Getting the number of nodes

2006-07-05 Thread Nathan DeBardeleben
I used to use this code to get the number of nodes in a cluster / 
machine / whatever:

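int
get_num_nodes(void)
{
    int rc;
    size_t cnt;
    orte_gpr_value_t **values;

    /* The last two arguments were lost in the archive; &cnt and &values
     * are reconstructed from the declarations above. */
    rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
                      ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);

    if(rc != ORTE_SUCCESS) {
        return 0;
    }

    return cnt;
}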
This now returns '0' on my Mac when it used to return 1.  Is this not an 
acceptable way of doing this?  Is there a cleaner / better way these days?





Re: [O-MPI devel] Alpha 4 and job state transitions

2006-02-09 Thread Nathan DeBardeleben
I've coded a hacky workaround in our code to get past this.  Basically, 
I capture all of the state transitions, and on the first one fired for a 
job I fire the 'init' state internally in our tool.  Generally this 
occurs on one of the gate transitions, G1 or something.  It'll work this way.
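
The workaround can be sketched roughly as follows. This is an illustration under assumptions, not the actual tool code: the state names, the `MAX_JOBS` limit, and the functions `handle_init` and `on_state_change` are all hypothetical, and the real ORTE state constants and callback signature differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state codes; the real ORTE constants differ. */
enum job_state { STATE_INIT, STATE_GATE1, STATE_RUNNING, STATE_DONE };

#define MAX_JOBS 64

static bool init_seen[MAX_JOBS];   /* has INIT been handled for this job? */
static int  init_fired[MAX_JOBS];  /* how many times the INIT handler ran */

/* Tool-side INIT handler, e.g. where I/O forwarding would be registered. */
static void handle_init(int job)
{
    init_fired[job]++;
}

/* Callback invoked on every state transition reported for a job.
 * If the first transition seen is not INIT, INIT is synthesized first.
 * If the runtime does deliver INIT, it is still handled exactly once. */
static void on_state_change(int job, enum job_state state)
{
    assert(job >= 0 && job < MAX_JOBS);

    if (!init_seen[job]) {
        init_seen[job] = true;
        handle_init(job);
    }
    if (state == STATE_INIT) {
        return;  /* already handled above */
    }
    /* ... handle the remaining states here ... */
}
```

Written this way, the same code keeps working whether or not the runtime starts delivering the INIT transition again, because the handler is keyed on "first transition seen" rather than on INIT itself.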


Furthermore, we're telling our users to get your 1.0.2a4 (or whatever 
1.0.2 is available at the time).


The way I coded it, once you guys put this into the main branch and the 
INIT state resumes firing, my code will simply start working that much 
better.  I really only brought it up because I felt it was a bug you 
might not have been aware of.


Thanks all.




Jeff Squyres wrote:

Nathan --

Ralph and I talked about this and decided not to bring it over to the  
1.0 branch -- the fix uses new functionality that exists on the trunk  
and not in the 1.0 branch.  The fix could be re-crafted to use  
existing functionality on the 1.0 branch (we're really trying to only  
put bug fixes on the 1.0 branch -- not any new functionality) -- but  
we didn't know if you cared.  :-)


Do you mind if this fix stays on the trunk, or do you need it in the  
v1.0 branch?




On Feb 8, 2006, at 4:36 PM, Nathan DeBardeleben wrote:

  

Thanks Ralph.




Ralph H. Castain wrote:


Nathan

This should now be fixed on the trunk. Once it is checked out more
thoroughly, I'll ask that it be moved to the 1.0 branch. For now, you
might want to check out the trunk and verify it meets your needs.

Ralph





  


[O-MPI devel] Alpha 4 and job state transitions

2006-02-01 Thread Nathan DeBardeleben
This was happening on Alpha 1 as well but I upgraded today to Alpha 4 to 
see if it's gone away - it has not.


I register a callback on a spawn() inside ORTE.  That callback includes 
the current state and should be called as the job goes through those states.


I am now noticing that jobs never go through the INIT state.  They may 
also not go through others but definitely not ORTE_PROC_STATE_INIT.


I was registering the IOForwarding callback during the INIT phase so, 
consequently, I now do not have IOF.  There are other side effects: 
for example, jobs that I start seem to be perpetually in the 'starting' 
state and then, suddenly, they're done.


Can someone look into / comment on this please?

Thanks.




[O-MPI devel] Back to 32bit on 64bit machines...

2005-09-27 Thread Nathan DeBardeleben

So is this an error or am I configuring wrong?

Here's my configure:

[sparkplug]~/ompi > ./configure CFLAGS=-m32 FFLAGS=-m32 CXXFLAGS=-m32 
--without-threads --prefix=/home/ndebard/local/ompi 
--with-devel-headers --without-gm


I've also tried adding --build=i586-suse-linux, that didn't help either.
Basically the compile eventually ends here:

 g++ -DHAVE_CONFIG_H -I. -I. -I../../../include -I../../../include 
-I../../../include -I../../.. -I../../.. -I../../../include 
-I../../../opal -I../../../orte -I../../../ompi -m32 -g -Wall -Wundef 
-Wno-long-long -finline-functions -MT comm.lo -MD -MP -MF 
.deps/comm.Tpo -c comm.cc  -fPIC -DPIC -o .libs/comm.o
/bin/sh ../../../libtool --mode=link g++  -m32 -g -Wall -Wundef 
-Wno-long-long -finline-functions   -export-dynamic   -o libmpi_cxx.la 
-rpath /home/ndebard/local/ompi/lib  mpicxx.lo intercepts.lo comm.lo  
-lm  -lutil -lnsl
g++ -shared -nostdlib 
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib/crti.o 
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32/crtbeginS.o  
.libs/mpicxx.o .libs/intercepts.o .libs/comm.o  -lutil -lnsl 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/lib/../lib 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/lib 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib 
-L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../.. -L/lib/../lib 
-L/usr/lib/../lib /usr/lib64/libstdc++.so -lm -lc -lgcc_s_32 
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/32/crtendS.o 
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../lib/crtn.o  
-m32 -Wl,-soname -Wl,libmpi_cxx.so.0 -o .libs/libmpi_cxx.so.0.0.0

/usr/lib64/libstdc++.so: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make[3]: *** [libmpi_cxx.la] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/ompi/mpi/cxx'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/ompi/mpi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/ompi'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi >


I'm having problems I think might be 64bit related and want to prove it 
by building in 32bit mode.
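
One small way to confirm that a `-m32` build really produced 32-bit code is to compile a tiny helper with the same flags and look at the pointer width. This is only a sketch; the function names `pointer_bits` and `long_bits` are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Width of a data pointer, in bits, for the mode this file was built in. */
static int pointer_bits(void)
{
    int bits = (int)(sizeof(void *) * CHAR_BIT);

    assert(bits > 0);
    return bits;
}

/* Width of 'long', in bits; on Linux this follows the pointer width. */
static int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}
```

On x86-64 Linux, a default build should report 64 for both, and a build with `-m32` should report 32 for both. If the helper still reports 64 when built with the flags handed to configure, the flags are not reaching the compiler.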

Oh, here's some basics if it helps.


[sparkplug]~/ompi > cat /etc/issue

Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).


[sparkplug]~/ompi > uname -a
Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64 
x86_64 x86_64 GNU/Linux
[sparkplug]~/ompi > 






Re: [O-MPI devel] OMPI compile failing

2005-09-13 Thread Nathan DeBardeleben
I'm trying this on sparkplug.  I have no real desire to use GM, so if it 
can be disabled then that'd be great.





Tim S. Woodall wrote:


Nathan - What machine are you on?

Galen - have you tried GM w/ your changes?



 



[O-MPI devel] OMPI compile failing

2005-09-13 Thread Nathan DeBardeleben

Compiling I get:

 gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include 
-I../../../../include -I../../../../include -I../../../.. 
-I../../../.. -I../../../../include -I../../../../opal 
-I../../../../orte -I../../../../ompi -g -Wall -Wundef -Wno-long-long 
-Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment 
-pedantic -Werror-implicit-function-declaration -fno-strict-aliasing 
-MT btl_gm.lo -MD -MP -MF .deps/btl_gm.Tpo -c btl_gm.c  -fPIC -DPIC -o 
.libs/btl_gm.o

btl_gm.c: In function `mca_btl_gm_prepare_src':
btl_gm.c:237: error: `gm_btl' undeclared (first use in this function)
btl_gm.c:237: error: (Each undeclared identifier is reported only once
btl_gm.c:237: error: for each function it appears in.)
btl_gm.c: In function `mca_btl_gm_prepare_dst':
btl_gm.c:398: warning: ISO C89 forbids mixed declarations and code
btl_gm.c:404: error: structure has no member named `mpoo_retain'
btl_gm.c:381: warning: unused variable `gm_btl'
make[4]: *** [btl_gm.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/ompi/mca/btl/gm'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca/btl'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/ompi/dynamic-mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/ompi'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi > 


I've configured using the option I thought to disable this:


--enable-mca-no-build=ptl-gm


I even tried --enable-mca-no-build=btl-gm.
No luck.




[O-MPI devel] 64bit shared library problems

2005-09-12 Thread Nathan DeBardeleben
I've been having this problem for a week or so, and I've been asking 
other people to weigh in if they know what I'm doing wrong.  I've gotten 
nowhere on this, so I figure I'll finally drop it out on the list.  
First, here's the important info:

The machine:


[sparkplug]~ > cat /etc/issue

Welcome to SuSE Linux 9.1 (x86-64) - Kernel \r (\l).


[sparkplug]~ > uname -a
Linux sparkplug 2.6.10 #4 SMP Wed Jan 26 11:50:00 MST 2005 x86_64 
x86_64 x86_64 GNU/Linux


My versions of libtool, autoconf, automake:


[sparkplug]~ > libtool --version
ltmain.sh (GNU libtool) 1.5.20 (1.1220.2.287 2005/08/31 18:54:15)

Copyright (C) 2005  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.

[sparkplug]~ > autoconf --version
autoconf (GNU Autoconf) 2.59
Written by David J. MacKenzie and Akim Demaille.

Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.

[sparkplug]~ > automake --version
automake (GNU automake) 1.8.5
Written by Tom Tromey <tro...@redhat.com>.

Copyright 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.
[sparkplug]~ > 


My ompi version: 7322 - but this has been going on for a few days like I 
said and I've been updating a lot, with no progress.


Configured using:

$ ./configure --enable-static --disable-shared --without-threads 
--prefix=/home/ndebard/local/ompi --with-devel-headers 
--enable-mca-no-build=ptl-gm


Simple C file which I will compile into a shared library:


int test_compile(int x) {
int rc;

rc = orte_init(true);
printf("rc = %d\n", rc);

return x + 1;
}


Above file is named 'testlib.c'

OK, so let's build this:


[sparkplug]~/ompi-test > mpicc -c testlib.c
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
testlib.o: relocation R_X86_64_32 can not be used when making a shared
object; recompile with -fPIC
testlib.o: could not read symbols: Bad value
collect2: ld returned 1 exit status


OK so relocation problems.  Maybe I'll follow the directions and -fPIC 
my file myself:



[sparkplug]~/ompi-test > mpicc -c testlib.c -fPIC
[sparkplug]~/ompi-test > mpicc -shared -o libtestlib.so testlib.o
/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld:
/home/ndebard/local/ompi/lib/liborte.a(orte_init.o): relocation
R_X86_64_32 can not be used when making a shared object; recompile 
with -fPIC

/home/ndebard/local/ompi/lib/liborte.a: could not read symbols: Bad value
collect2: ld returned 1 exit status


OK, so I read this as a relocation problem in 'liborte.a'.  I unpacked 
liborte.a with 'ar' and checked some of the files with 'file', and it 
says 64bit.  I haven't yet written a script to check every file in here, 
but here's orte_init.o:



[sparkplug]~/<1>tmp > file orte_init.o
orte_init.o: ELF 64-bit LSB relocatable, AMD x86-64, version 1 (SYSV), 
not stripped


So that at least says it's 64bit.
And to confirm, my mpicc's 64bit too:


[sparkplug]~/<1>tmp > which mpicc
/home/ndebard/local/ompi/bin/mpicc
[sparkplug]~/<1>tmp > file /home/ndebard/local/ompi/bin/mpicc
/home/ndebard/local/ompi/bin/mpicc: ELF 64-bit LSB executable, AMD 
x86-64, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked 
(uses shared libs), not stripped
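
For checking every object file rather than a few, the same 32/64-bit information that 'file' prints can be read from the first bytes of each object. The sketch below is a hypothetical helper (`elf_class_bits` is not part of any library) that classifies a buffer holding the start of a file.

```c
#include <assert.h>
#include <stddef.h>

/* Return 32 or 64 for an ELF image, or -1 if the buffer is not ELF.
 * Byte 4 of the ELF identification (EI_CLASS) is 1 for 32-bit objects
 * and 2 for 64-bit objects. */
static int elf_class_bits(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    if (len < 5) {
        return -1;  /* too short to hold the magic plus the class byte */
    }
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F') {
        return -1;  /* missing ELF magic number */
    }
    switch (buf[4]) {
    case 1:
        return 32;
    case 2:
        return 64;
    default:
        return -1;  /* invalid or unknown class */
    }
}
```

A caller would read the first five bytes of each extracted .o file and pass them in. Note that the class only says whether an object is 32-bit or 64-bit; it does not show whether the object was compiled with -fPIC, which is what the linker message is asking for.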


Someone suggested I take out the '--disable-shared' from the configure 
line, so I did.  The result was the same.


So the result is that I can not build a shared library on a 64bit linux 
machine that uses orte calls.
So then I tried taking out the orte calls and instead use MPI calls.  
Sure, this function makes no sense but here it is now:



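#include "orte_config.h"
#include <mpi.h>  /* header name lost in the archive; mpi.h assumed */

int test_compile(int x) {
    int rank;  /* argument lost in the archive; a local rank is assumed */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    return x + 1;
}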


And now, when I try and make a shared object I get relocation errors:

/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: 
/home/ndebard/local/ompi/lib/libmpi.a(comm_init.o): relocation 
R_X86_64_32 can not be used when making a shared object; recompile 
with -fPIC

/home/ndebard/local/ompi/lib/libmpi.a: could not read symbols: Bad value


So... could perhaps the build be messed up and not be really using 64bit 
code?
Am I the only one seeing this?  It's a trivial test for those of you 
with access to a 64bit machine if you wouldn't mind testing for me.


Help would be greatly appreciated.


Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box

2005-08-18 Thread Nathan DeBardeleben
FYI, this only happens when I let OMPI compile 64bit on Linux.  When I 
throw in CFLAGS=FFLAGS=CXXFLAGS=-m32, then orted, my myriad test 
codes, mpirun, registry subscription codes, and JNI all work like a champ.

It appears to me that something is wrong with the 64bit build.




Tim S. Woodall wrote:


Nathan,

I'll try to reproduce this sometime this week - but I'm pretty swamped.
Is Greg also seeing the same behavior?

Thanks,
Tim


Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box

2005-08-17 Thread Nathan DeBardeleben
To expand on this further, orte_init() seg faults on both bluesteel 
(32bit linux) and sparkplug (64bit linux) equally.  The required 
condition is that orted must be running first (which of course we 
require for our work - a persistent orte daemon and registry).



[bluesteel]~/ptp > ./dump_info
Segmentation fault
[bluesteel]~/ptp > gdb dump_info
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and 
you are
welcome to change it and/or distribute copies of it under certain 
conditions.

Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for 
details.
This GDB was configured as "x86_64-suse-linux"...Using host 
libthread_db library "/lib64/tls/libthread_db.so.1".


(gdb) run
Starting program: /home/ndebard/ptp/dump_info

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x0045997d in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004156a7 in orte_system_init () at orte_system_init.c:38
#3  0x004151c7 in orte_init () at orte_init.c:46
#4  0x00414cbb in main (argc=1, argv=0x7fb298) at 
dump_info.c:185

(gdb)





Nathan DeBardeleben wrote:


Just to clarify:
 1: no orted started (meaning the MPIrun or registry programs will 
start one by themselves) causes those programs to lock up.
 2: starting orted by hand (trying to get these programs to connect to 
a centralized one) causes the connecting programs to seg fault.





Nathan DeBardeleben wrote:

 

So I dropped an .ompi_ignore into that directory, reconfigured, and 
compile worked (yay!).
However, not a lot of progress: mpirun locks up, all my registry test 
programs lock up as well.  If I start the orted by hand, then any of my 
registry calling programs cause segfault:




   


[sparkplug]~/ptp > gdb sub_test
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and 
you are
welcome to change it and/or distribute copies of it under certain 
conditions.

Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for 
details.
This GDB was configured as "x86_64-suse-linux"...Using host 
libthread_db library "/lib64/tls/libthread_db.so.1".


(gdb) run
Starting program: /home/ndebard/ptp/sub_test

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x004598a5 in orte_init_stage1 () at orte_init_stage1.c:419
#2  0x004155cf in orte_system_init () at orte_system_init.c:38
#3  0x004150ef in orte_init () at orte_init.c:46
#4  0x004148a1 in main (argc=1, argv=0x7fb178) at 
sub_test.c:60
(gdb) 
  

 


Yes, I recompiled everything.

Here's an example of me trying something a little more complicated 
(which I believe locks up for the same reason - something borked with 
the registry interaction).




   


[sparkplug]~/ompi-test > bjssub -s 1 -n 10 -i bash
Waiting for interactive job nodes.
(nodes 18 16 17 18 19 20 21 22 23 24 25)
Starting interactive job.
NODES=16,17,18,19,20,21,22,23,24,25
JOBID=18


   


so i got my nodes

  

 


ndebard@sparkplug:~/ompi-test> export OMPI_MCA_ptl_base_exclude=sm
ndebard@sparkplug:~/ompi-test> export 
OMPI_MCA_pls_bproc_seed_priority=101


   

and set these envvars, as we need to in order to use Greg's bproc; without 
the 2nd export, the machine's load maxes out and it locks up.


  

 


ndebard@sparkplug:~/ompi-test> bpstat
Node(s)  Status  Mode        User     Group
100-128  down    --          root     root
0-15     up      ---x--      vchandu  vchandu
16-25    up      ---x--      ndebard  ndebard
26-27    up      ---x--      root     root
28-30    up      ---x--x--x  root     root
ndebard@sparkplug:~/ompi-test> env | grep NODES

NODES=16,17,18,19,20,21,22,23,24,25


   

Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box

2005-08-16 Thread Nathan DeBardeleben
So I'm seeing all these nice emails about people developing on OMPI 
today, yet I can't get it to compile.  Am I out here in limbo on this, or 
are others in the same boat?  The errors I'm seeing are about some bproc 
code calling undefined functions; they are included again below.






 



Re: [O-MPI devel] OMPI 32bit on a 64bit Linux box

2005-08-16 Thread Nathan DeBardeleben
Back from training and trying to test this but now OMPI doesn't compile 
at all:


 gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include 
-I../../../../include -I../../../.. -I../../../.. 
-I../../../../include -I../../../../opal -I../../../../orte 
-I../../../../ompi -g -Wall -Wundef -Wno-long-long -Wsign-compare 
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -fno-strict-aliasing -MT 
ras_lsf_bproc.lo -MD -MP -MF .deps/ras_lsf_bproc.Tpo -c 
ras_lsf_bproc.c -o ras_lsf_bproc.o

ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_insert':
ras_lsf_bproc.c:32: error: implicit declaration of function 
`orte_ras_base_node_insert'

ras_lsf_bproc.c: In function `orte_ras_lsf_bproc_node_query':
ras_lsf_bproc.c:37: error: implicit declaration of function 
`orte_ras_base_node_query'

make[4]: *** [ras_lsf_bproc.lo] Error 1
make[4]: Leaving directory `/home/ndebard/ompi/orte/mca/ras/lsf_bproc'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/ndebard/ompi/orte/mca/ras'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/ndebard/ompi/orte/mca'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ndebard/ompi/orte'
make: *** [all-recursive] Error 1
[sparkplug]~/ompi > 


Clean SVN checkout this morning with configure:

[sparkplug]~/ompi > ./configure --enable-static --disable-shared 
--without-threads --prefix=/home/ndebard/local/ompi --with-devel-headers





Brian Barrett wrote:

This is now fixed in SVN.  You should no longer need the 
--build=i586...  hack to compile 32 bit code on Opterons.


Brian

On Aug 12, 2005, at 3:17 PM, Brian Barrett wrote:

 


On Aug 12, 2005, at 3:13 PM, Nathan DeBardeleben wrote:

   


We've got a 64bit Linux (SUSE) box here.  For a variety of reasons
(Java, JNI, linking in with OMPI libraries, etc., which I won't get into)
I need to compile OMPI 32 bit (or get 64bit versions of a lot of other
libraries).
I get various compile errors when I try different things, but first let
me explain the system we have:
 




   


This goes on and on and on actually.  And the 'is incompatible with
i386:x86-64 output' looks to be repeated for every line before this
error which actually caused the Make to bomb.

Any suggestions at all?  Surely someone must have tried to force OMPI to
build in 32bit mode on a 64bit machine.
 


I don't think anyone has tried to build 32 bit on an Opteron, which
is the cause of the problems...

I think I know how to fix this, but won't happen until later in the
weekend.  I can't think of a good workaround until then.  Well, one
possibility is to set the target like you were doing and disable
ROMIO.  Actually, you'll also need to disable Fortran 77.  So
something like:

  ./configure [usual options] --build=i586-suse-linux --disable-io-romio --disable-f77

might just do the trick.

Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/

