Hi Gus,
Your first suggestion did the trick; it is working now.
Thank you very much, and also thanks to Ralph for helping out.
Wilko
On Fri, 31 Jul 2009 14:00:05 -0400
Gus Correa wrote:
Hi Wilko, list
Two wild guesses:
1) Check if the pbs_mom daemon script on your nodes (in /etc/init.d on
RHEL/CentOS/Fedora types of Linux) sets the system limits properly,
in particular the stack size. Something like this:
ulimit -n 32768
ulimit -s unlimited
ulimit -l unlimited
We had problems
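A quick way to check which limits a running pbs_mom actually has (just a
sketch, assuming a Linux /proc filesystem and that you can run commands
on the node) is to look at the daemon's own limits:

cat /proc/$(pgrep -o pbs_mom)/limits

The "Max stack size" and "Max locked memory" rows should match what the
init script sets; processes started under pbs_mom, including your MPI
ranks, inherit those limits.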
If you are launching without Torque, then you will be launching with
rsh or ssh. So yes, there will be some differences in the environment.
For example, launching via ssh means that you pick up your
remote .cshrc (or whatever shell flavor you use), while you don't when
launching via Torque.
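If you want to see the difference concretely, a rough check (illustrative
only; the node name is made up) is to capture what an ssh launch sees and
diff it against the output of a batch job that prints the same thing:

ssh node01 'ulimit -a; env | sort' > ssh.env
echo 'ulimit -a; env | sort' | qsub -l nodes=1:ppn=1 -j oe -N tm-env
# then: diff ssh.env against the tm-env job's output file

Anything present only on the ssh side most likely comes from your remote
shell startup files.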
Hi,
Sorry to bother you again, but the system limits are the default limits
for openSuSE 11.1, which are, as far as I can see, the same as in
openSuSE 10.0.
Furthermore, if I specify the node parameter such that the job is
running on only 1 node (either with 2 or 8 cores) it runs well.
A few
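For reference, a sketch of the two resource requests being compared (the
node counts come from this thread; everything else about the job is
unchanged):

#PBS -l nodes=1:ppn=8    # single node: runs fine
#PBS -l nodes=5:ppn=2    # five nodes: segfaults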
You might check with your sys admin - or check out the "ulimit" cmd.
Depends on what the sys admin has set for system limits.
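For example (a sketch, assuming a bash job script; the application name
is a placeholder), you can try raising the limits inside the job script
itself, right before launching:

ulimit -c unlimited    # allow core files to be written
ulimit -s unlimited    # remove the stack size cap
/usr/local/bin/mpiexec ./your_app

Whether that takes effect depends on the hard limits the admin has
configured; a normal user can only raise a soft limit up to the hard
limit.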
On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote:
Hi,
So far I don't have a core file.
The weird thing is that the same job will run well when Open MPI
is compiled without --with-tm.
Is the amount of memory, or number of open files different in both
cases?
How can I force unlimited resources for the job?
Only then will I get a core file.
Kind regards,
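One way to confirm which of the two builds is actually being picked up
(a sketch using the standard ompi_info tool from the installation in
question) is to check whether the TM components are present:

/usr/local/bin/ompi_info | grep ": tm "

A build configured with --with-tm should list tm components for the ras
and plm frameworks; a build without it shows nothing.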
Ummm...this log indicates that OMPI ran perfectly - it is your
application that segfaulted.
Can you run gdb (or your favorite debugger) against a core file from
your app? It looks like something in your app is crashing.
As far as I can tell, everything is working fine. We launch and wireup
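If a core file does appear, a minimal debugger session would look
something like this (a sketch; the binary and core file names are
placeholders):

gdb ./your_app core
(gdb) bt
(gdb) quit

The bt backtrace should show whether the segfault happens inside the
application's own code or inside an MPI call.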
Hi,
I have recompiled Open MPI with the --enable-debug and --with-tm=/usr/local
flags, and submitted the job to Torque 2.3.7:
#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 --debug-daemons -machinefi
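For reference, the configure step behind that build would look roughly
like this (a sketch; the source directory and install prefix are
assumptions, only the two flags come from the message above):

cd openmpi-1.3.3    # hypothetical source tree
./configure --enable-debug --with-tm=/usr/local --prefix=/usr/local
make all install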
Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
allocation, and the man page for tm_spawn?
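Both are easy to capture (a sketch, assuming a bash job script and that
the Torque man pages are installed):

cat $PBS_NODEFILE > ~/pbs_nodefile.txt     # inside a Torque job
man tm_spawn | col -b > ~/tm_spawn.txt     # on a node with the man pages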
My only guess would be that something changed in those areas as we
don't really use anything else from Torque, and run on Torque-based
clusters in production every day. Not sure what