Hi Wilko, list

Two wild guesses:

1) Check whether the pbs_mom daemon script on your nodes (in /etc/init.d on RHEL/CentOS/Fedora-type Linux) sets the system limits properly,
in particular the stack size.  Something like this:

ulimit -n 32768       # max open file descriptors
ulimit -s unlimited   # stack size
ulimit -l unlimited   # max locked memory

We had problems with this in the past,
with programs segfaulting for no apparent reason (most of the time
the default stack size was too small).
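
To verify what limits the pbs_mom environment actually hands to a job,
a quick throwaway job along these lines will print them (just a sketch;
the queue name is taken from your job script, adjust as needed):

#PBS -q cluster2
#PBS -l nodes=1
#PBS -N CheckLimits
#PBS -j oe
ulimit -a    # prints the effective limits inside the Torque job environment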


2) Make sure the Torque libtm you linked Open MPI against is the one that
corresponds to your Torque 2.3.7 (i.e. --with-tm=/full/path/to/torque-2.3.7/library/directory).

If you have more than one version of Torque installed on your system,
using the full path will prevent picking up the wrong version.
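
For instance, a sketch assuming Torque is installed under /usr/local as
in your earlier mail:

./configure --with-tm=/usr/local --enable-debug
make
make install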

My $0.02
Gus Correa

On Fri, Jul 31, 2009 at 11:31 AM, Ralph Castain <r...@open-mpi.org> wrote:
You might check with your sys admin - or check out the "ulimit" cmd. It
depends on what the sys admin has set for the system limits.
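
For example, to let a segfaulting process leave a core file, you could
lift the core file size limit in the job script ahead of the mpiexec
line (just a sketch):

ulimit -c unlimited   # remove the core file size cap so a segfault dumps core
ulimit -a             # show all limits now in effect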


On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote:

Hi,

So far I don't have a core file.
The weird thing is that the same job runs fine when Open MPI
is compiled without --with-tm.
Is the amount of memory, or the number of open files, different in the
two cases?
How can I force unlimited resources for the job? Only then will I get a
core file.

kind regards,
Wilko

Ralph Castain wrote:
Ummm...this log indicates that OMPI ran perfectly - it is your
application that segfaulted.

Can you run gdb (or your favorite debugger) against a core file from
your app? It looks like something in your app is crashing.

As far as I can tell, everything is working fine. We launch and wire up
just fine, then detect that one of your processes has segfaulted - which
triggers us to kill the remaining processes and terminate the job.
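
A minimal sketch, using the binary from your job script (the core file
name and location vary with system settings):

gdb /pcs/programs/grip/bin/RunAlignmentMPI core
(gdb) bt    # print the stack trace at the point of the crash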


On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:

Hi,

I have recompiled Open MPI with the --enable-debug and
--with-tm=/usr/local
flags, and submitted the job to Torque 2.3.7:

#PBS -q cluster2
#PBS -l nodes=5:ppn=2
#PBS -N AlignImages
#PBS -j oe
/usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
    --debug-daemons -machinefile $PBS_NODEFILE \
    /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
    /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 \
    64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0

and the job crashed almost immediately. I have attached:
tm.3.gz, the job output AlignImages.o34.gz, and momlog-20090731.

I hope you can help me,
kind regards,
Wilko


Ralph Castain wrote:
Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
allocation, and the man page for tm_spawn?
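
A one-line job is enough to capture the node file, e.g. (a sketch
reusing the resource request from your script):

#PBS -l nodes=5:ppn=2
cat $PBS_NODEFILE    # the host list Torque allocated to this job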

My only guess would be that something changed in those areas, as we
don't really use anything else from Torque, and we run on Torque-based
clusters in production every day. Not sure what version we have here,
though I believe it is pretty current (will check).

You also might want to configure OMPI 1.3.3 with --enable-debug. You
could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
--debug-daemons on your mpirun cmd line to get step-by-step diagnostic
output of the interaction with Torque. Should give us some idea of where
the failure is occurring.

Ralph

On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:

Hi,

I have the following problem:

I am using Open MPI 1.3.3.

Programs submitted with mpiexec (directly and from scripts) run
fine.

Programs submitted through Torque 2.3.7, with Open MPI compiled with
--with-tm (and torque-devel installed), segfault.

Programs submitted through Torque 2.3.7 directly, with Open MPI compiled
without --with-tm (and NO torque-devel installed), run fine. However,
mpiexec programs started from a script (the script submitted through
Torque) only run on 1 node, so I need Open MPI compiled with --with-tm.

We also have a cluster running Open MPI 1.2.9 compiled without
--with-tm in combination with Torque 2.3.3, and everything runs fine
there: NO segfaults, and mpiexec from a script also runs on the nodes
selected at submission time.

I don't have errors in the log files, only in the job log file:


--------------------------------------------------------------------------
mpiexec noticed that process rank 7 with PID 3150 on node
rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Could anyone please help me?
Many thanks in advance,
Wilko Keegstra

--
+-------------------------------------------------------------+
| Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
| Groningen University       email: w.keegs...@rug.nl         |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4               phone: +31503634224              |
| 9747 AG GRONINGEN          fax  : +31503634800              |
| The Netherlands                                             |
+-------------------------------------------------------------+