Re: [OMPI users] segfault on finalize

2009-09-28 Thread Thomas Ropars

You are right. An update fixes the problem.

Sorry.

Thomas

Jeff Squyres wrote:
It's a fairly strange place to get an error -- 
mca_base_param_finalize() is where we're tidying up command line 
parameters.


There were some memory bugs that have been fixed since r21970.  Can you 
update?



On Sep 25, 2009, at 9:49 AM, Thomas Ropars wrote:


Hi,

I'm using r21970 of the trunk on Linux 2.6.18-3-amd64 with gcc version
4.2.3 (Debian 4.2.3-2).

When I compile Open MPI with the default options, it works.
But if I use the --with-platform=optimized option, I get a segfault in
every program I run.

==3073==  Access not within mapped region at address 0x30
==3073==at 0x535544D: mca_base_param_finalize (in
/home/tropars/open-mpi/install/lib/libopen-pal.so.0.0.0)
==3073==by 0x5339D55: opal_finalize_util (in
/home/tropars/open-mpi/install/lib/libopen-pal.so.0.0.0)
==3073==by 0x4E5A228: ompi_mpi_finalize (in
/home/tropars/open-mpi/install/lib/libmpi.so.0.0.0)
==3073==by 0x400BF2: main (in /home/tropars/open-mpi/tests/ring)

Regards,

Thomas
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users








[OMPI users] Debugging OpenMPI calls

2009-09-28 Thread Aniruddha Marathe
Hello,

I am new to the Open MPI library and I am trying to step through common MPI
communication calls using gdb. I attach gdb to one of the processes
(using the steps mentioned on the Open MPI Debugging FAQ page) and set
a breakpoint on 'MPI_Barrier', expecting gdb to jump into the
definition of the MPI_Barrier function.

I've manually added the -g3 compilation flag to the Makefiles in some of
the directories that I thought relevant ({ROOT}/ompi/mpi/c etc). I
also specified the source file paths in gdb using the 'dir' command.
However, gdb is unable to jump into the appropriate source location
when it hits the breakpoint.

Could anyone please let me know if I am missing something here?

Thanks for looking into my post.

Regards,
Aniruddha


Re: [OMPI users] MPI_Irecv segmentation fault

2009-09-28 Thread Everette Clemmer
Yes I did; forgot to mention that in my last message. Most of the example code
I've seen online passes the buffer variable by reference...

I think I've gotten past the segfault at this point, but it looks like
the MPI_Isend is never completing. I have an MPI_Test() that sets a flag
immediately following the MPI_Irecv call, but the process seems to
hang before it gets to it. I'm not really sure why it wouldn't complete.
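
For reference, a minimal sketch of the non-blocking receive pattern under
discussion, with the buffer passed directly and the request completed via
MPI_Test/MPI_Wait. The source rank and tag follow the original snippet;
everything else is illustrative, not Everette's actual code:

    /* Sketch of the corrected non-blocking receive (illustrative only). */
    #include <mpi.h>

    void recv_func(void)
    {
        char        buffer[1];
        int         flag = 0;
        MPI_Request request;
        MPI_Status  status;

        /* Pass the array itself, not &buffer. */
        MPI_Irecv(buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &request);

        /* MPI_Test polls once and may find the message has not arrived yet. */
        MPI_Test(&request, &flag, &status);
        if (!flag) {
            MPI_Wait(&request, &status);  /* block until it actually completes */
        }
    }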

Everette

On Tue, Sep 22, 2009 at 9:24 AM, jody  wrote:
> Did you also change the "&buffer" to buffer in your MPI_Send call?
>
> Jody
>
> On Tue, Sep 22, 2009 at 1:38 PM, Everette Clemmer  wrote:
>> Hmm, tried changing MPI_Irecv( &buffer) to MPI_Irecv( buffer...)
>> and still no luck. Stack trace follows if that's helpful:
>>
>> prompt$ mpirun -np 2 ./display_test_debug
>> Sending 'q' from node 0 to node 1
>> [COMPUTER:50898] *** Process received signal ***
>> [COMPUTER:50898] Signal: Segmentation fault (11)
>> [COMPUTER:50898] Signal code:  (0)
>> [COMPUTER:50898] Failing at address: 0x0
>> [COMPUTER:50898] [ 0] 2   libSystem.B.dylib
>> 0x7fff87e280aa _sigtramp + 26
>> [COMPUTER:50898] [ 1] 3   ???
>> 0x 0x0 + 0
>> [COMPUTER:50898] [ 2] 4   GLUT
>> 0x000100024a21 glutMainLoop + 261
>> [COMPUTER:50898] [ 3] 5   display_test_debug
>> 0x00011444 xsMainLoop + 67
>> [COMPUTER:50898] [ 4] 6   display_test_debug
>> 0x00011335 main + 59
>> [COMPUTER:50898] [ 5] 7   display_test_debug
>> 0x00010d9c start + 52
>> [COMPUTER:50898] [ 6] 8   ???
>> 0x0001 0x0 + 1
>> [COMPUTER:50898] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
>> exited on signal 15 (Terminated).
>> 1 additional process aborted (not shown)
>>
>> Thanks,
>> Everette
>>
>>
>> On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren  
>> wrote:
>>> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
 Hey all,

 I'm getting a segmentation fault when I attempt to receive a single
 character via MPI_Irecv. Code follows:

 void recv_func() {
               if( !MASTER ) {
                       char            buffer[ 1 ];
                       int             flag;
                       MPI_Request request;
                       MPI_Status      status;

                       MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, 
 MPI_COMM_WORLD, &request);
>>>
>>> It should be MPI_Irecv(buffer, 1, ...)
>>>
 The segfault disappears if I comment out the MPI_Irecv call in
 recv_func so I'm assuming that there's something wrong with the
 parameters that I'm passing to it. Thoughts?
>>>
>>> --
>>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>>> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
>>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>>
>> --
>> - Everette
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
- Everette



Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-28 Thread Jeff Squyres

On Sep 27, 2009, at 1:45 PM, guosong wrote:


Hi Loh,
I used MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE, &provided);  
in my program and got provided = 0, which turns out to be  
MPI_THREAD_SINGLE. Does this mean that I cannot use the  
MPI_THREAD_MULTIPLE model?


Correct.

To get Open MPI to support MPI_THREAD_MULTIPLE, you need to configure  
and build it with the --enable-mpi-threads switch to OMPI's ./configure  
script.  We don't build MPI_THREAD_MULTIPLE support by default because  
it does add some performance overhead.
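
As a point of reference, a small sketch (illustrative, not from the original
post) of requesting MPI_THREAD_MULTIPLE and checking what the library
actually granted before doing any threaded communication:

    /* Sketch: verify the granted thread level at runtime. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available (provided = %d)\n",
                    provided);
            MPI_Finalize();
            return 1;
        }
        /* ... threaded MPI calls are safe from here on ... */
        MPI_Finalize();
        return 0;
    }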


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Debugging OpenMPI calls

2009-09-28 Thread Jeff Squyres

You might want to just configure Open MPI with:

  ./configure CFLAGS=-g3 ...

That will pass "-g3" to every Makefile in Open MPI.

FWIW: I do variants on this technique and gdb is always able to jump  
to the right source location if I "break MPI_Barrier" (for example).   
We actually have a "--enable-debug" option to OMPI's configure, but it  
does turn on a bunch of other debugging code that will definitely  
result in performance degradation at run-time (one of its side effects  
is to add "-g" to every Makefile).
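
For example, a trivial test program (hypothetical file name barrier_test.c,
not from the original post) that can be built against the debug-enabled
library and used with "break MPI_Barrier" in gdb:

    /* barrier_test.c - minimal program for checking that gdb can step
     * into MPI_Barrier when Open MPI is built with debug symbols. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: about to hit the barrier\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);  /* gdb should stop inside the MPI source here */
        MPI_Finalize();
        return 0;
    }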



On Sep 28, 2009, at 5:57 AM, Aniruddha Marathe wrote:


Hello,

I am new to OpenMPI library and I am trying to step through common MPI
communication calls using gdb. I attach gdb to one of the processes
(using the steps mentioned on the OpenMPI Debugging FAQ page) and set
a breakpoint on 'MPI_Barrier' and expect gdb to jump into the
definition of MPI_Barrier function.

I've manually added -g3 compilation flag to the Makefiles in some of
the directories that I thought relevant ({ROOT}/ompi/mpi/c etc). I
also specified the source file paths in gdb using the 'dir' command.
However, gdb is unable to jump into the appropriate source location
when it hits the breakpoint.

Could anyone please let me know if I am missing something here?

Thanks for looking into my post.

Regards,
Aniruddha
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-28 Thread guosong

Oh, thanks. I found that mpich2/gnu supports MPI_THREAD_MULTIPLE by default on 
my server. So if it supports MPI_THREAD_MULTIPLE, does it mean that I can run 
the program without a segmentation fault (if there are no other bugs ^_^)?

> From: jsquy...@cisco.com
> To: us...@open-mpi.org
> Date: Mon, 28 Sep 2009 11:28:31 -0400
> Subject: Re: [OMPI users] How to create multi-thread parallel program using 
> thread-safe send and recv?
> 
> On Sep 27, 2009, at 1:45 PM, guosong wrote:
> 
> > Hi Loh,
> > I used MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE, &provided); 
> > in my program and got provided = 0 which turns out to be the 
> > MPI_THREAD_SINGLE. Does this mean that I can not use 
> > MPI_THREAD_MULTIPLE model?
> 
> Correct.
> 
> To get Open MPI to support MPI_THREAD_MULTIPLE, you need to configure 
> and build it with the --enable-mpi-threads switch to OMPI's ./ 
> configure script. We don't build MPI_THREAD_MULTIPLE support by 
> default because it does add some performance overhead.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-28 Thread Jeff Squyres

On Sep 28, 2009, at 11:48 AM, guosong wrote:

Oh, thanks. I found that mpich2/gnu supports MPI_THREAD_MULTIPLE by  
default on my server. So if it supports MPI_THREAD_MULTIPLE, does it  
mean that I can run the program without segmentation fault (if there  
is no other bugs ^_^)


Hypothetically, yes.  :-)

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-28 Thread guosong

Thanks.

> From: jsquy...@cisco.com
> To: us...@open-mpi.org
> Date: Mon, 28 Sep 2009 11:49:36 -0400
> Subject: Re: [OMPI users] How to create multi-thread parallel program using 
> thread-safe send and recv?
> 
> On Sep 28, 2009, at 11:48 AM, guosong wrote:
> 
> > Oh, thanks. I found that mpich2/gnu supports MPI_THREAD_MULTIPLE by 
> > default on my server. So if it supports MPI_THREAD_MULTIPLE, does it 
> > mean that I can run the program without segmentation fault (if there 
> > is no other bugs ^_^)
> 
> Hypothetically, yes. :-)
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


[OMPI users] problem using openmpi with DMTCP

2009-09-28 Thread Kritiraj Sajadah
Dear All, 
  I am trying to integrate DMTCP with Open MPI. If I run a C 
application, it works fine. But when I execute the program using mpirun, it 
checkpoints the application but gives an error when restarting the application.

#
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain 
== AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
 id() = 2ab3f248-30933-4ac0d75a(99007)
 _sockDomain = 10
 _sockType = 1
 _sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain 
== AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
 id() = 2ab3f248-30943-4ac0d75c(99007)
 _sockDomain = 10
 _sockType = 1
 _sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; 
REASON='JWARNING(_real_dup2 ( oldFd, fd ) == fd) failed'
 oldFd = 537
 fd = 1
 (strerror((*__errno_location ( = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of 
/usr/lib/gconv/gconv-modules.cache into memory;
  _not_ file as it existed at time of checkpoint.
  Change mtcp_restart_nolibc.c:634 and re-compile, if you want different 
behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions; REASON='JASSERT(ret == 
0) failed'
 (strerror((*__errno_location ( = Invalid argument
 fds[0] = 6
 opt->first = 26
 opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#

Any suggestions are very welcome.

regards,

Raj





Re: [OMPI users] "Failed to find the following executable" problemunder Torque

2009-09-28 Thread Blosch, Edwin L
Thanks for the reply. I looked harder at the command invocation and I think I 
stumbled across an answer.  My actual mpirun command is invoked from a Python 
script using the subprocess module. When you create a subprocess, one of the 
options is "shell" and I had that set to False, causing the actual invocation 
to use spawn or exec (one of the variants) instead of system().

When I pass down the argument list as follows, mpirun fails with "cannot find 
executable named '--prefix /usr/mpi/intel/openmpi-1.2.8' "

  Command:  ['mpirun', '--prefix /usr/mpi/intel/openmpi-1.2.8', '-np 8', '--mca 
btl ^tcp', ' --mca mpi_leave_pinned 1', '--mca mpool_base_use_mem_hooks 1', '-x 
LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1', 
'/tmp/7852.fwnaeglingio/falconv4_ibm_openmpi', '-cycles', '10', '-ri', 
'restart.5000', '-ro', '/tmp/7852.fwnaeglingio/restart.5000']

whereas if I take the additional step of splitting the space-containing arguments into 
separate list elements, it works:

  Command:  ['mpirun', '--prefix', '/usr/mpi/intel/openmpi-1.2.8', 
'--machinefile', '/var/spool/torque/aux/7854.fwnaeglingio', '-np', '8', 
'--mca', 'btl', '^tcp', '--mca', 'mpi_leave_pinned', '1', '--mca', 
'mpool_base_use_mem_hooks', '1', '-x', 'LD_LIBRARY_PATH', '-x', 
'MPI_ENVIRONMENT=1', '/tmp/7854.fwnaeglingio/falconv4_ibm_openmpi', '-cycles', 
'10', '-ri', 'restart.5010', '-ro', '/tmp/7854.fwnaeglingio/restart.5010']


Somehow the handling of the argv list by orterun has changed in 1.2.8 as 
compared to 1.2.2-1, as the spawned command used to execute just fine.

I'm guessing the elements in argv used to be split on spaces first, before 
being parsed, whereas now they are not, resulting in the first string being 
reported as an unrecognized option.
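
A minimal C illustration of the underlying behaviour (the command and paths
are placeholders, not taken from the original script): exec-style process
creation passes each argv element to the program verbatim with no word
splitting, so an option and its value must be separate elements, whereas a
shell/system() invocation would have split them on whitespace first:

    /* Sketch: exec-family calls pass argv elements verbatim.  Passing
     * "--prefix /usr/mpi/intel/openmpi-1.2.8" as ONE element reaches mpirun
     * as a single unrecognized token; each token must be its own element. */
    #include <unistd.h>

    int main(void)
    {
        char *argv_ok[] = { "mpirun",
                            "--prefix", "/usr/mpi/intel/openmpi-1.2.8",
                            "-np", "8",
                            "./a.out",      /* placeholder executable */
                            (char *)0 };
        execvp(argv_ok[0], argv_ok);
        return 1;   /* reached only if the exec itself failed */
    }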


> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Saturday, September 26, 2009 8:24 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] "Failed to find the following executable"
> problemunder Torque
> 
> On Sep 25, 2009, at 7:55 AM, Blosch, Edwin L wrote:
> 
> > I'm having a problem running OpenMPI under Torque.  It complains
> > like there is a command syntax problem, but the three variations
> > below are all correct, best I can tell using mpirun -help.  The
> > environment in which the command executes, i.e. PATH and
> > LD_LIBRARY_PATH, is correct.  Torque is 2.3.x.  OpenMPI is 1.2.8.
> > OFED is 1.4.
> 
> Is your mpirun a script, perchance?  It's almost like the arguments
> that end up being passed are getting munged / re-ordered, and Bad
> Things happen such that the real mpirun under the covers gets confused.
> 
> > /usr/mpi/intel/openmpi-1.2.8/bin/mpirun -np 28 /tmp/43.fwnaeglingio/
> > falconv4_ibm_openmpi -cycles 100 -ri restart.0 -ro /tmp/
> > 43.fwnaeglingio/restart.0
> > 
> --
> > Failed to find the following executable:
> >
> > Host:   n8n26
> > Executable: -p
> 
> I don't even see -p in that argument list.  Where is it coming from?
> 
> A little background: OMPI's mpirun analyzes the command line tokens
> that are passed to it.  The first one that it doesn't recognize, it
> assumes is the executable to invoke.  In this case, OMPI's mpirun
> found a "-p" on the command line (not sure how that happened; perhaps /
> usr/mpi/intel/openmpi-1.2.8/bin/mpirun is not actually OMPI's mpirun,
> as I mentioned above...?) and tried to execute it.  But then there was
> no executable named "-p" to be found in the filesystem, then OMPI
> printed the error.
> 
> > mpirun --prefix /usr/mpi/intel/openmpi-1.2.8 --machinefile /var/
> > spool/torque/aux/45.fwnaeglingio -np 28 --mca btl ^tcp  --mca
> > mpi_leave_pinned 1 --mca mpool_base_use_mem_hooks 1 -x
> > LD_LIBRARY_PATH -x MPI_ENVIRONMENT /tmp/45.fwnaeglingio/
> > falconv4_ibm_openmpi -cycles 100 -ri restart.0 -ro /tmp/
> > 45.fwnaeglingio/restart.0
> > 
> --
> > Failed to find or execute the following executable:
> >
> > Host:   n8n27
> > Executable: --prefix /usr/mpi/intel/openmpi-1.2.8
> 
> Ditto on this one.  --prefix is a valid mpirun command line argument,
> so it should not have complained.
> 
> But then again, I confess to not remembering all the 1.2.x command
> line options; I don't remember if --prefix was introduced in the 1.2
> or 1.3 series...
> 
> > /usr/mpi/intel/openmpi-1.2.8/bin/mpirun -x LD_LIBRARY_PATH -x
> > MPI_ENVIRONMENT=1 /tmp/47.fwnaeglingio/falconv4_ibm_openmpi -cycles
> > 100 -ri restart.0 -ro /tmp/47.fwnaeglingio/restart.0
> > 
> --
> > Failed to find the following executable:
> >
> > Host:   n8n27
> > Executable: -
> 
> 
> Ditto to #1.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [btl_openib_component.c:1373:btl_openib_component_progress] error polling HP CQ with -2 errno says Success

2009-09-28 Thread Charles Wright

I've verified that ulimit -l is unlimited everywhere.

After further testing I think the errors are related to OFED, not Open MPI.
I've uninstalled the OFED that comes with SLES (1.4.0) and installed 
OFED 1.4.2 and 1.5-beta and I don't get the errors.


I got the idea to swap out OFED after reading this:
http://kerneltrap.org/mailarchive/openfabrics-general/2008/11/3/3903184

Under OFED 1.4.0 (from SLES 11) I had to set options mlx4_core msi_x=0 
in /etc/modprobe.conf.local to even get the mlx4 module to load.

I found that advice here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Under 1.4.2 and 1.5-Beta the modules load fine without mlx4_core 
msi_x=0 being set)


Now my problem is that with OFED 1.4.2 and 1.5-beta the system hangs, the 
GigE network stops working, and I have to power-cycle nodes to log in.


I'm going to try to get some help from the OFED mailing list now. 


Pavel Shamis (Pasha) wrote:

Very strange. MPI tries to access the CQ context and gets an immediate error.
Please make sure that your limits configuration is OK; take a look at 
this FAQ - 
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages


Pasha.


Charles Wright wrote:

Hello,
   I just got some new cluster hardware :)  :(

I can't seem to overcome an openib problem.
I get this at run time:

error polling HP CQ with -2 errno says Success

I've tried 2 different IB switches and multiple sets of nodes all on 
one switch or the other to try to eliminate the hardware.   (IPoIB 
pings work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors.  I'm not 
really sure what these errors mean or how to get rid of them.
My MPI application works if all the CPUs are on the same node (probably 
using only the self btl).


Any advice would be appreciated.  Thanks.

asnrcw@dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready


# Alabama Supercomputer Center - PBS Prologue
# Your job id is : 232035
# Your job name is : STDIN
# Your job's queue is : sysadm
# Your username for this job is : asnrcw
# Your group for this job is : analyst
# Your job used : #   8 CPUs on dmc101
#   8 CPUs on dmc102
#   8 CPUs on dmc103
#   8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009

asnrcw@dmc101:~> asnrcw@dmc101:~> asnrcw@dmc101:~> asnrcw@dmc101:~> 
asnrcw@dmc101:~> cd mpiprintrank

asnrcw@dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw@dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel 
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device] 
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error 
polling HP CQ with -2 errno says Success

error

Re: [OMPI users] Debugging OpenMPI calls

2009-09-28 Thread Aniruddha Marathe
Hi Jeff,

Thanks for the pointers. I tried with both CFLAGS=-g3 and --enable-debug
(separately); however, I am still unable to step into the MPI source. It
seems I am missing a small step somewhere.

I compiled my MPI application with the new library built with the above flags,
ran it, and attached gdb to one of the processes. The following are the steps
that I performed with gdb:

...
...
0x00110416 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc.i686
(gdb) dir /home/amarathe/mpi/svn_openmpi/ompi-trunk/ompi/mpi/c
Source directories searched:
/home/amarathe/mpi/svn_openmpi/ompi-trunk/ompi/mpi/c:$cdir:$cwd
(gdb) break MPI_Barrier
Breakpoint 1 at 0x155596


When gdb hits breakpoint 1, it stops at the address but cannot find the
source file for the 'MPI_Barrier' definition.


Breakpoint 1, 0x00155596 in PMPI_Barrier () from
/home/amarathe/mpi/openmpi/openmpi-1.3.3_install/lib/libmpi.so.0
(gdb) s
Single stepping until exit from function PMPI_Barrier,
which has no line number information.
main (argc=1, argv=0xbf9a1484) at smg2000.c:114
114   P  = num_procs;
(gdb)


Is this the right approach?

Thanks,
Aniruddha

On Mon, Sep 28, 2009 at 8:40 AM, Jeff Squyres  wrote:

> You might want to just configure Open MPI with:
>
>  ./configure CFLAGS=-g3 ...
>
> That will pass "-g3" to every Makefile in Open MPI.
>
> FWIW: I do variants on this technique and gdb is always able to jump to the
> right source location if I "break MPI_Barrier" (for example).  We actually
> have a "--enable-debug" option to OMPI's configure, but it does turn on a
> bunch of other debugging code that will definitely result in performance
> degradation at run-time (one of its side effects is to add "-g" to every
> Makefile).
>
>
>
> On Sep 28, 2009, at 5:57 AM, Aniruddha Marathe wrote:
>
>  Hello,
>>
>> I am new to OpenMPI library and I am trying to step through common MPI
>> communication calls using gdb. I attach gdb to one of the processes
>> (using the steps mentioned on the OpenMPI Debugging FAQ page) and set
>> a breakpoint on 'MPI_Barrier' and expect gdb to jump into the
>> definition of MPI_Barrier function.
>>
>> I've manually added -g3 compilation flag to the Makefiles in some of
>> the directories that I thought relevant ({ROOT}/ompi/mpi/c etc). I
>> also specified the source file paths in gdb using the 'dir' command.
>> However, gdb is unable to jump into the appropriate source location
>> when it hits the breakpoint.
>>
>> Could anyone please let me know if I am missing something here?
>>
>> Thanks for looking into my post.
>>
>> Regards,
>> Aniruddha
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] use additional interface for openmpi

2009-09-28 Thread worldeb

 Hi folks,

I want to use additional Ethernet interfaces on the nodes and the head node 
for Open MPI communication.
It is eth1 on the nodes and eth4 on the head node.
So how can I configure Open MPI?

If I add the following to the config file:
btl_base_include=tcp,sm,self
btl_tcp_if_include=eth1

will it work or not?
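
For reference, a minimal sketch of how such settings usually look in an
openmpi-mca-params.conf file; note that the BTL selection parameter is
normally spelled btl rather than btl_base_include, and since each host reads
its own local params file, the head node's copy would name eth4 instead
(the values below just mirror the ones in the question):

    # openmpi-mca-params.conf (sketch; interface names are site-specific)
    btl = tcp,sm,self
    btl_tcp_if_include = eth1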

And how does this work with the Torque batch system (the daemons listen on 
eth0 on all nodes)?

Thanx.


[OMPI users] Openmpi - Mac OS X SnowLeopard linking error

2009-09-28 Thread Pierre-Olivier Dallaire

Hi,

when compiling openmpi-1.3.3 with the GNU or PGI compilers, the following  
occurs:


libtool: link: gcc-4.2 -O3 -DNDEBUG -m64 -finline-functions -fno-strict-aliasing -fvisibility=hidden -o orte-iof orte-iof.o ../../../orte/.libs/libopen-rte.a /Users/podallaire/Downloads/openmpi-1.3.3/opal/.libs/libopen-pal.a -lutil

Undefined symbols:
"_orte_iof", referenced from:
_main in orte-iof.o
_abort_exit_callback in orte-iof.o
"_orte_routed", referenced from:
_orte_read_hnp_contact_file in libopen-rte.a(hnp_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a(rml_base_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a(rml_base_contact.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[2]: *** [orte-iof] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

From the following thread, it seems that an extra linking flag should  
be added: -all_load

See : 
http://www.pgroup.com/userforum/viewtopic.php?t=1594&sid=a9139f8d260d438afc74b5243e06679a

Anybody else had this problem ?

Thanks

PO




Re: [OMPI users] Openmpi - Mac OS X SnowLeopard linking error

2009-09-28 Thread Ralph Castain
Nope - I've been running on Snow Leopard almost since the day it came  
out without problem.


Key was that I had to re-install all my 3rd party software (e.g.,  
compilers) from Macports or wherever as none of the stuff I had  
installed on Leopard would work properly after the upgrade.


Didn't realize that until I found a thread on the Macports list where  
it was pointed out that you have to completely reinstall all such  
software after every major Mac OSX upgrade (i.e., from Tiger to  
Leopard to Snow Leopard).


On Sep 28, 2009, at 4:42 PM, Pierre-Olivier Dallaire wrote:


Hi,

when compiling openmpi-1.3.3 with GNU or PGI compilers, the  
following occurs :


ibtool: link: gcc-4.2 -O3 -DNDEBUG -m64 -finline-functions -fno- 
strict-aliasing -fvisibility=hidden -o orte-iof orte-iof.o ../../../ 
orte/.libs/libopen-rte.a /Users/podallaire/Downloads/openmpi-1.3.3/ 
opal/.libs/libopen-pal.a -lutil

Undefined symbols:
"_orte_iof", referenced from:
_main in orte-iof.o
_abort_exit_callback in orte-iof.o
"_orte_routed", referenced from:
_orte_read_hnp_contact_file in libopen-rte.a(hnp_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)

ld: symbol(s) not found
collect2: ld returned 1 exit status
make[2]: *** [orte-iof] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

From the following thread, it seems that an extra linking flag shoud  
be added : -all_load

See : 
http://www.pgroup.com/userforum/viewtopic.php?t=1594&sid=a9139f8d260d438afc74b5243e06679a

Anybody else had this problem ?

Thanks

PO


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Openmpi - Mac OS X SnowLeopard linking error

2009-09-28 Thread Pierre-Olivier Dallaire
This error only comes up when I try to build the Fortran wrappers; it  
will not fail if only building with gcc/g++.


I had to include -all_load in several individual Makefiles; using the  
env variable LIBS with ./configure does not work.


Thanks !

PO

On 2009-09-28, at 6:57 PM, Ralph Castain wrote:

Nope - I've been running on Snow Leopard almost since the day it  
came out without problem.


Key was that I had to re-install all my 3rd party software (e.g.,  
compilers) from Macports or wherever as none of the stuff I had  
installed on Leopard would work properly after the upgrade.


Didn't realize that until I found a thread on the Macports list  
where it was pointed out that you have to completely reinstall all  
such software after every major Mac OSX upgrade (i.e., from Tiger to  
Leopard to Snow Leopard).


On Sep 28, 2009, at 4:42 PM, Pierre-Olivier Dallaire wrote:


Hi,

when compiling openmpi-1.3.3 with GNU or PGI compilers, the  
following occurs :


ibtool: link: gcc-4.2 -O3 -DNDEBUG -m64 -finline-functions -fno- 
strict-aliasing -fvisibility=hidden -o orte-iof orte-iof.o ../../../ 
orte/.libs/libopen-rte.a /Users/podallaire/Downloads/openmpi-1.3.3/ 
opal/.libs/libopen-pal.a -lutil

Undefined symbols:
"_orte_iof", referenced from:
_main in orte-iof.o
_abort_exit_callback in orte-iof.o
"_orte_routed", referenced from:
_orte_read_hnp_contact_file in libopen-rte.a(hnp_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)

ld: symbol(s) not found
collect2: ld returned 1 exit status
make[2]: *** [orte-iof] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

From the following thread, it seems that an extra linking flag  
shoud be added : -all_load

See : 
http://www.pgroup.com/userforum/viewtopic.php?t=1594&sid=a9139f8d260d438afc74b5243e06679a

Anybody else had this problem ?

Thanks

PO


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Openmpi - Mac OS X SnowLeopard linking error

2009-09-28 Thread Ralph Castain

That may explain it - I never build Fortran (thank goodness).

On Sep 28, 2009, at 5:06 PM, Pierre-Olivier Dallaire wrote:

This error only comes out when I try to build the fortran wrappers /  
will not fail if only building with gcc/g++


I had to included -all_load in several individual Makefile / using  
the env variable LIBS with ./configure  does not work.


Thanks !

PO

On 2009-09-28, at 6:57 PM, Ralph Castain wrote:

Nope - I've been running on Snow Leopard almost since the day it  
came out without problem.


Key was that I had to re-install all my 3rd party software (e.g.,  
compilers) from Macports or wherever as none of the stuff I had  
installed on Leopard would work properly after the upgrade.


Didn't realize that until I found a thread on the Macports list  
where it was pointed out that you have to completely reinstall all  
such software after every major Mac OSX upgrade (i.e., from Tiger  
to Leopard to Snow Leopard).


On Sep 28, 2009, at 4:42 PM, Pierre-Olivier Dallaire wrote:


Hi,

when compiling openmpi-1.3.3 with GNU or PGI compilers, the  
following occurs :


ibtool: link: gcc-4.2 -O3 -DNDEBUG -m64 -finline-functions -fno- 
strict-aliasing -fvisibility=hidden -o orte-iof orte- 
iof.o ../../../orte/.libs/libopen-rte.a /Users/podallaire/ 
Downloads/openmpi-1.3.3/opal/.libs/libopen-pal.a -lutil

Undefined symbols:
"_orte_iof", referenced from:
_main in orte-iof.o
_abort_exit_callback in orte-iof.o
"_orte_routed", referenced from:
_orte_read_hnp_contact_file in libopen-rte.a(hnp_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)
_orte_rml_base_update_contact_info in libopen-rte.a 
(rml_base_contact.o)

ld: symbol(s) not found
collect2: ld returned 1 exit status
make[2]: *** [orte-iof] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

From the following thread, it seems that an extra linking flag  
shoud be added : -all_load

See : 
http://www.pgroup.com/userforum/viewtopic.php?t=1594&sid=a9139f8d260d438afc74b5243e06679a

Anybody else had this problem ?

Thanks

PO


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Debugging OpenMPI calls

2009-09-28 Thread Aniruddha Marathe
OK, it turned out to be a really stupid mistake.

Sorry for spamming and thanks for the help!

Regards,
Aniruddha

On Mon, Sep 28, 2009 at 11:28 AM, Aniruddha Marathe <
marathe.anirud...@gmail.com> wrote:

> Hi Jeff,
>
> Thanks for the pointers. I tried with both CFLAGS=-g3 and --enable-debug
> (separately), however, I am still unable to jump into the MPI source. It
> seems I am missing a small step(s) somewhere.
>
> I compiled my MPI application with the new library built with above flags,
> ran it and attached gdb to one of the processes. Following are the steps
> that I performed with gdb:
>
> ...
> ...
> 0x00110416 in __kernel_vsyscall ()
> Missing separate debuginfos, use: debuginfo-install glibc.i686
> (gdb) dir /home/amarathe/mpi/svn_openmpi/ompi-trunk/ompi/mpi/c
> Source directories searched:
> /home/amarathe/mpi/svn_openmpi/ompi-trunk/ompi/mpi/c:$cdir:$cwd
> (gdb) break MPI_Barrier
> Breakpoint 1 at 0x155596
>
>
> When gdb hits breakpoint 1, it jumps at the address but cannot find the
> source file for 'MPI_Barrier' definition.
>
>
> Breakpoint 1, 0x00155596 in PMPI_Barrier () from
> /home/amarathe/mpi/openmpi/openmpi-1.3.3_install/lib/libmpi.so.0
> (gdb) s
> Single stepping until exit from function PMPI_Barrier,
> which has no line number information.
> main (argc=1, argv=0xbf9a1484) at smg2000.c:114
> 114   P  = num_procs;
> (gdb)
>
>
> Is this the right approach?
>
> Thanks,
> Aniruddha
>
>
> On Mon, Sep 28, 2009 at 8:40 AM, Jeff Squyres  wrote:
>
>> You might want to just configure Open MPI with:
>>
>>  ./configure CFLAGS=-g3 ...
>>
>> That will pass "-g3" to every Makefile in Open MPI.
>>
>> FWIW: I do variants on this technique and gdb is always able to jump to
>> the right source location if I "break MPI_Barrier" (for example).  We
>> actually have a "--enable-debug" option to OMPI's configure, but it does
>> turn on a bunch of other debugging code that will definitely result in
>> performance degradation at run-time (one of its side effects is to add "-g"
>> to every Makefile).
>>
>>
>>
>> On Sep 28, 2009, at 5:57 AM, Aniruddha Marathe wrote:
>>
>>  Hello,
>>>
>>> I am new to OpenMPI library and I am trying to step through common MPI
>>> communication calls using gdb. I attach gdb to one of the processes
>>> (using the steps mentioned on the OpenMPI Debugging FAQ page) and set
>>> a breakpoint on 'MPI_Barrier' and expect gdb to jump into the
>>> definition of MPI_Barrier function.
>>>
>>> I've manually added -g3 compilation flag to the Makefiles in some of
>>> the directories that I thought relevant ({ROOT}/ompi/mpi/c etc). I
>>> also specified the source file paths in gdb using the 'dir' command.
>>> However, gdb is unable to jump into the appropriate source location
>>> when it hits the breakpoint.
>>>
>>> Could anyone please let me know if I am missing something here?
>>>
>>> Thanks for looking into my post.
>>>
>>> Regards,
>>> Aniruddha
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>