Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread W.Keegstra
Hi Gus, your first suggestion did the trick; it is working now. Thank you very much, and also thanks to Ralph for helping out. Wilko. On Fri, 31 Jul 2009 14:00:05 -0400, Gus Correa wrote: Hi Wilco, list: Two wild guesses: 1) Check if the pbs_mom daemon script on your nodes (in /etc/init.d on R

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Gus Correa
Hi Wilco, list: Two wild guesses: 1) Check whether the pbs_mom daemon script on your nodes (in /etc/init.d on RHEL/CentOS/Fedora-type Linux) sets the system limits properly, in particular the stack size. Something like this: ulimit -n 32768; ulimit -s unlimited; ulimit -l unlimited. We had problems
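Gus's suggestion, sketched as lines one might add near the top of an /etc/init.d/pbs_mom script (the path and values are his examples from the thread, not verified against any particular distribution):

```shell
# Hypothetical excerpt for /etc/init.d/pbs_mom: raise the limits
# before the daemon starts, so every job it spawns inherits them.
ulimit -n 32768      # max open file descriptors
ulimit -s unlimited  # stack size
ulimit -l unlimited  # max locked-in-memory size
```

Because pbs_mom, not the user's login shell, is the parent of Torque-launched processes, limits set only in shell dotfiles would not apply to TM-launched ranks.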

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Ralph Castain
If you are launching without Torque, then you will be launching with rsh or ssh, so yes, there will be some differences in the environment. For example, launching via ssh means that you pick up your remote .cshrc (or whatever shell flavor you use), while you don't when launching via Torque.
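Ralph's point can be checked directly: run a trivial command under both launch modes and compare what each rank inherits. A minimal sketch (the rank count is arbitrary, and the mpiexec path is assumed to be on PATH):

```shell
# Print, from every rank, the hostname and the stack-size limit the
# process inherited. Run this once inside a Torque job and once via
# an ssh-based launch, then diff the two outputs.
mpiexec -np 2 sh -c 'echo "$(hostname): stack=$(ulimit -s)"'
```

Any difference in the reported limits between the two runs would point at exactly the init-script issue discussed above.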

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread W.Keegstra
Hi, Sorry to bother you again, but the system limits are the default limits for openSuSE 11.1, which are, as far as I can see, the same as in openSuSE 10.0. Furthermore, if I specify the node parameter such that the job runs on only 1 node (either with 2 or 8 cores), it runs well. A few

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Ralph Castain
You might check with your sys admin - or check out the "ulimit" command. It depends on what the sys admin has set for the system limits. On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote: Hi, So far I don't have a core file. The weird thing is that the same job will run well when openmpi is compiled wit
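For anyone following the thread, the ulimit shell builtin shows the limits a shell (and anything launched from it) runs under; a quick sketch:

```shell
# All current soft limits for this shell:
ulimit -a
# Just the stack size (a number in KB, or "unlimited"):
ulimit -s
# The hard ceiling for the stack; non-root users cannot raise the
# soft limit above this value:
ulimit -Hs
```

Running these inside a Torque job (e.g. from the job script) shows the limits the MPI ranks actually get, which may differ from an interactive login.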

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Wilko Keegstra
Hi, So far I don't have a core file. The weird thing is that the same job will run well when openmpi is compiled without --with-tm. Is the amount of memory, or the number of open files, different in the two cases? How can I force unlimited resources for the job? Only then will I get a core file. Kind reg
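A common way to get the core file Wilko is missing is to raise the core-size limit inside the job script itself, just before the launch; a sketch (the binary name and rank count are placeholders):

```shell
# In the PBS job script, before launching the application:
ulimit -c unlimited          # allow core dumps of any size
mpiexec -np 10 ./your_app    # placeholder binary and rank count
# A crashing rank should then leave a file named "core" (or
# core.<pid>, depending on kernel settings) in its working directory.
```

This only raises the soft limit; if the hard limit for core files is 0 on the compute nodes, the admin has to change it.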

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Ralph Castain
Ummm...this log indicates that OMPI ran perfectly - it is your application that segfaulted. Can you run gdb (or your favorite debugger) against a core file from your app? It looks like something in your app is crashing. As far as I can tell, everything is working fine. We launch and wireup
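Ralph's debugging suggestion, sketched (the binary and core-file names are placeholders, and the app should be built with -g for useful symbols):

```shell
gdb ./your_app core.12345
# At the (gdb) prompt:
#   bt          # backtrace showing where the segfault happened
#   frame 0     # select the innermost frame
#   info locals # inspect local variables in that frame
```

The backtrace usually distinguishes a crash inside the application's own code from one inside the MPI library.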

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Wilko Keegstra
Hi, I have recompiled openmpi with the --enable-debug and --with-tm=/usr/local flags, and submitted the job to torque 2.3.7: #PBS -q cluster2 #PBS -l nodes=5:ppn=2 #PBS -N AlignImages #PBS -j oe /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 --debug-daemons -machinefi
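The quoted script is cut off in the archive at "-machinefi"; a sketch of what the full submission script likely looked like, where the machinefile argument and the application name at the end are assumptions:

```shell
#!/bin/sh
#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
# The -mca verbosity flags and --debug-daemons are from the quoted
# message; the machinefile argument and binary are assumed.
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
    -mca plm_base_verbose 5 --debug-daemons \
    -machinefile "$PBS_NODEFILE" ./AlignImages
```

Note that with TM support compiled in, Open MPI reads the Torque allocation itself, so an explicit machinefile is normally redundant under Torque.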

Re: [OMPI users] programs are segfaulting using Torque & OpenMPI

2009-07-31 Thread Ralph Castain
Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7 allocation, and the man page for tm_spawn? My only guess would be that something changed in those areas, as we don't really use anything else from Torque, and we run on Torque-based clusters in production every day. Not sure what
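For reference, a Torque PBS_NODEFILE for the nodes=5:ppn=2 request quoted earlier in the thread contains one line per allocated slot, with each host repeated ppn times; the hostnames below are made up:

```shell
# Inside a Torque job, Torque exports the file's path in $PBS_NODEFILE:
cat "$PBS_NODEFILE"
# Expected shape for nodes=5:ppn=2 (placeholder hostnames):
# node01
# node01
# node02
# node02
# ...
```

A nodefile that deviates from this shape (e.g. a changed format between Torque versions) is exactly the kind of difference Ralph is asking about.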