Re: [OMPI devel] opal_util_register_stackhandlers()

2009-03-19 Thread Jeff Squyres

Thanks for the bug report!

I've just changed the behavior to emit a warning and *not* intercept a  
signal if the old signal action is neither SIG_DFL nor SIG_IGN.  The  
opal_signal MCA parameter can be set to determine which signals you  
want to intercept; it defaults to the integer values of SIGABRT,  
SIGBUS, SIGFPE, SIGSEGV on your system.
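
For illustration, the default described above could be produced and read
back as in the following C sketch.  It assumes the parameter value is a
comma-separated list of decimal signal numbers; that format, and the
helper names default_signal_list() and parse_signal_list(), are
assumptions made for this example and are not Open MPI functions.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format SIGABRT, SIGBUS, SIGFPE and SIGSEGV as a
   comma-separated list of integers.  Returns 0 on success, -1 if the
   buffer is too small. */
static int default_signal_list(char *buf, size_t len)
{
    int n = snprintf(buf, len, "%d,%d,%d,%d",
                     SIGABRT, SIGBUS, SIGFPE, SIGSEGV);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: parse a comma-separated list of signal numbers.
   Returns the number of entries stored in out, or -1 on malformed
   input or when more than max entries are present. */
static int parse_signal_list(const char *list, int *out, int max)
{
    int count = 0;
    const char *p = list;

    assert(list != NULL && out != NULL);

    while (*p != '\0') {
        char *end;
        long sig;

        errno = 0;
        sig = strtol(p, &end, 10);
        if (end == p || errno != 0 || sig <= 0 || sig > INT_MAX ||
            count >= max) {
            return -1;
        }
        out[count++] = (int)sig;

        if (*end == ',') {
            end++;
            if (*end == '\0') {
                return -1;          /* trailing comma */
            }
        } else if (*end != '\0') {
            return -1;              /* unexpected character */
        }
        p = end;
    }
    return count;
}
```

On Linux x86-64 the default string would typically come out as
"6,7,8,11", but the numbers differ between platforms, which is why the
sketch derives them from the SIG* constants rather than hard-coding them.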


We can probably get this in OMPI v1.3.2.


On Mar 19, 2009, at 11:13 AM, Kees Verstoep wrote:


Hi,

Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c
calls sigaction() with a third NULL argument, meaning you don't look
at possibly previously installed signal handlers, and always override
them with print_stackframe().

But there are actually realistic scenarios where an application actively
uses these signals, and also wants to use MPI.  As an example, the default
opal "signal" parameter settings are such that SIGSEGV is redirected.
Typically, indeed, SIGSEGV indicates a bug somewhere, and the stacktrace
from Open MPI is a nice bonus.   However, the Sun Java JDK uses SIGSEGV
to detect when stacks should be automatically extended, and it stops working
rather ungracefully when that handler gets replaced.

(BTW, we stumbled on this recently when we added an MPI backend for our
Ibis grid programming environment.  It took a bit of time to figure out
what was happening, since we got no usable stacktrace for the thread that
got bitten.  We suspected a bug in our native code mapping at first,
but MPICH did not have this problem).

In most cases, you can of course work around it by manually changing
the opal "signal" list, but it would be nicer if Open MPI would detect
the situation, and e.g. only install the stack printer when there is
no handler yet, or at least warn about the possible clash.

Thanks!
Kees Verstoep



--
Jeff Squyres
Cisco Systems



[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r20759

2009-03-19 Thread Jeff Squyres
There was a glitch in the SVN server this evening; you can tell that  
this r number is far lower than it should be.


IU is fixing it right now.  This commit will occur again with a new,  
higher SVN r number shortly...



Begin forwarded message:


From: 
Date: March 19, 2009 8:41:21 PM EDT
To: 
Subject: [OMPI svn-full] svn:open-mpi r20759
Reply-To: 

Author: jsquyres
Date: 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
New Revision: 20759
URL: https://svn.open-mpi.org/trac/ompi/changeset/20759

Log:
Per a comment on the users list, don't try to install our own signal
handlers if there are already non-default handlers installed.  Print a
warning if that situation arises.

'''NOTE:''' This is a definite target for OPAL_SOS conversion -- as it
is right now, this message will be displayed for ''every'' MPI
process.  We want this to be OPAL_SOS'ed when that becomes available
so that the error message can be aggregated nicely.

Added:
   trunk/opal/util/help-opal-util.txt
Text files modified:
   trunk/opal/util/Makefile.am  |    4 +++-
   trunk/opal/util/stacktrace.c |   22 ++++++++++++++++++++--
   2 files changed, 23 insertions(+), 3 deletions(-)

Modified: trunk/opal/util/Makefile.am
==============================================================================
--- trunk/opal/util/Makefile.am (original)
+++ trunk/opal/util/Makefile.am 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -9,7 +9,7 @@
 # University of Stuttgart.  All rights reserved.
 # Copyright (c) 2004-2005 The Regents of the University of California.
 # All rights reserved.
-# Copyright (c) 2007  Cisco Systems, Inc.  All rights reserved.
+# Copyright (c) 2007-2009 Cisco Systems, Inc.  All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -19,6 +19,8 @@

 SUBDIRS = keyval

+dist_pkgdata_DATA = help-opal-util.txt
+
 AM_LFLAGS = -Popal_show_help_yy
 LEX_OUTPUT_ROOT = lex.opal_show_help_yy


Added: trunk/opal/util/help-opal-util.txt
==============================================================================
--- (empty file)
+++ trunk/opal/util/help-opal-util.txt  2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -0,0 +1,25 @@
+# -*- text -*-
+#
+# Copyright (c) 2009 Cisco Systems, Inc.  All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# This is the US/English general help file for Open MPI.
+#
+[stacktrace signal override]
+Open MPI was insertting a signal handler for signal %d but noticed
+that there is already a non-default handler installer.  Open MPI's
+handler was therefore not installed; your job will continue.  This
+warning message will only be displayed once, even if Open MPI
+encounters this situation again.
+
+To avoid displaying this warning message, you can either not install
+the error handler for signal %d or you can have Open MPI not try to
+install its own signal handler for this signal by setting the
+"opal_signals" MCA parameter.
+
+  Signal: %d
+  Current opal_signals value: %s

Modified: trunk/opal/util/stacktrace.c
==============================================================================
--- trunk/opal/util/stacktrace.c (original)
+++ trunk/opal/util/stacktrace.c 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -38,6 +38,7 @@
 #include "opal/mca/backtrace/backtrace.h"
 #include "opal/constants.h"
 #include "opal/util/output.h"
+#include "opal/util/show_help.h"

 #ifndef _NSIG
 #define _NSIG 32
@@ -410,11 +411,12 @@
 int opal_util_register_stackhandlers (void)
 {
 #if OMPI_WANT_PRETTY_PRINT_STACKTRACE && ! defined(__WINDOWS__)
-    struct sigaction act;
+    struct sigaction act, old;
     char * string_value;
     char * tmp;
     char * next;
     int param, i;
+    bool showed_help = false;

     gethostname(stacktrace_hostname, sizeof(stacktrace_hostname));
     stacktrace_hostname[sizeof(stacktrace_hostname) - 1] = '\0';
@@ -459,10 +461,26 @@
         return OPAL_ERR_BAD_PARAM;
       }

-      ret = sigaction (sig, &act, NULL);
+      ret = sigaction (sig, &act, &old);
       if (ret != 0) {
         return OPAL_ERR_IN_ERRNO;
       }
+      if (SIG_IGN != old.sa_handler && SIG_DFL != old.sa_handler) {
+          if (!showed_help) {
+              /* JMS This is icky; there is no error message
+                 aggregation here so this message may be repeated for
+                 every single MPI process...  This should be replaced
+                 with OPAL_SOS when that is done so that it can be
+                 properly aggregated. */
+              opal_show_help("help-opal-util.txt",
+                             "stacktrace signal override",
+                             true, sig, sig, sig, string_value);
+              showed_help = true;
+          }
+          if (0 != sigaction(sig, &old, NULL)) {
+              return OPAL_ERR_IN_ERRNO;
+          }
+      }
     }
     free(string_value);
 #endif /* OMPI_WANT_P

Re: [OMPI devel] RFC: Final cleanup of included headers

2009-03-19 Thread Rainer Keller
Hi Ralph,
On Wednesday 18 March 2009 09:00:36 am Ralph Castain wrote:
> Could we hold off on this until after 1.3.2 is out the door and has a
> couple of days to stabilize? All these header file changes are making
> it more difficult to cleanly apply patches to the 1.3 branch.
Hmm, sure, we can hold off the big patch.
With the current plan, 1.3.2 should be out on 4/3.

However, I'd still like to be able to apply some intermediate (small!) steps.

> When we get past the next couple of weeks, the 1.3 branch should clear
> out the backlog of CMRs, and we should have the usual immediate "oops"
> fixes in to 1.3.2. Then this won't be such a problem.
However, it would be nice if you could test the patch on your systems before
I move it into trunk. I want to limit the "down-time" of trunk (there may be
a few places where additional headers are required, because unnecessary
headers were removed from lower-level headers).

Thanks,
Rainer
-- 

Rainer Keller, PhD         Tel: +1 (865) 241-6293
Oak Ridge National Lab     Fax: +1 (865) 241-4811
PO Box 2008 MS 6164        Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008   AIM/Skype: rusraink




[OMPI devel] opal_util_register_stackhandlers()

2009-03-19 Thread Kees Verstoep

Hi,

Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c
calls sigaction() with a third NULL argument, meaning you don't look
at possibly previously installed signal handlers, and always override
them with print_stackframe().

But there are actually realistic scenarios where an application actively
uses these signals, and also wants to use MPI.  As an example, the default
opal "signal" parameter settings are such that SIGSEGV is redirected.
Typically, indeed, SIGSEGV indicates a bug somewhere, and the stacktrace
from Open MPI is a nice bonus.   However, the Sun Java JDK uses SIGSEGV
to detect when stacks should be automatically extended, and it stops working
rather ungracefully when that handler gets replaced.

(BTW, we stumbled on this recently when we added an MPI backend for our
Ibis grid programming environment.  It took a bit of time to figure out
what was happening, since we got no usable stacktrace for the thread that
got bitten.  We suspected a bug in our native code mapping at first,
but MPICH did not have this problem).

In most cases, you can of course work around it by manually changing
the opal "signal" list, but it would be nicer if Open MPI would detect
the situation, and e.g. only install the stack printer when there is
no handler yet, or at least warn about the possible clash.
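
The "only install when there is no handler yet" idea could look roughly
like the following C sketch.  It first queries the current disposition by
passing NULL as the second argument to sigaction(), and installs the new
handler only if the signal is still at SIG_DFL or SIG_IGN.  The names
install_if_unclaimed(), trace_stub() and last_signal are hypothetical,
and this is not the code in stacktrace.c.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Records the last signal seen by the stand-in handler below. */
static volatile sig_atomic_t last_signal = 0;

/* Stand-in for a stack-printing handler (hypothetical). */
static void trace_stub(int sig)
{
    last_signal = sig;
}

/* Hypothetical helper, not an Open MPI function.
   Returns 1 if handler was installed, 0 if an application handler was
   already present and left untouched, -1 if sigaction() failed. */
static int install_if_unclaimed(int sig, void (*handler)(int))
{
    struct sigaction old, act;

    assert(handler != NULL);

    /* Query only: a NULL second argument leaves the disposition as is. */
    if (sigaction(sig, NULL, &old) != 0) {
        return -1;
    }

    /* A handler registered with SA_SIGINFO lives in sa_sigaction, so
       treat that flag as "already claimed" as well. */
    if ((old.sa_flags & SA_SIGINFO) ||
        (old.sa_handler != SIG_DFL && old.sa_handler != SIG_IGN)) {
        return 0;
    }

    memset(&act, 0, sizeof(act));
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;

    return (sigaction(sig, &act, NULL) == 0) ? 1 : -1;
}
```

A caller would invoke install_if_unclaimed() once per signal in the list
and print a warning whenever it returns 0.  There is still a small window
between the query and the install in which another thread could register
a handler, so this is a sketch of the idea rather than a complete fix.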

Thanks!
Kees Verstoep


[OMPI devel] Open MPI v1.3.1 released

2009-03-19 Thread Ralph Castain

The Open MPI Team, representing a consortium of research, academic,
and industry partners, is pleased to announce the release of Open MPI
version 1.3.1. This release is mainly a bug fix release over the v1.3.0
release, but there are a few new features.  We strongly recommend
that all users upgrade to version 1.3.1 if possible.

Version 1.3.1 can be downloaded from the main Open MPI web site or
any of its mirrors (mirrors will be updating shortly).

Here is a list of changes in v1.3.1 as compared to v1.3.0:

- Added "sync" coll component to allow users to synchronize every N
 collective operations on a given communicator.
- Increased the default values of the IB and RNR timeout MCA parameters.
- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler.
- Fix an error that prevented stdin from being forwarded if the
 rsh launcher was in use.  Thanks to Branden Moore for pointing out
 the problem.
- Correct a case where the added datatype is considered as contiguous but
 has gaps in the beginning.
- Fix an error that limited the number of comm_spawns that could
 simultaneously be running in some environments.
- Correct a corner case in OB1's GET protocol for long messages; the
 error could sometimes cause MPI jobs using the openib BTL to hang.
- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some
 new options to output to files and redirect output to xterm.  Thanks to
 Jody Weissmann for helping test out many of the new fixes and
 features.
- Fix SLURM race condition.
- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1.  Thanks to
 Lisandro Dalcin for the bug report.
- Fix the DSO build of tm PLM.
- Various fixes for size disparity between C int's and Fortran
 INTEGER's.  Thanks to Christoph van Wullen for the bug report.
- Ensure that mpirun exits with a non-zero exit status when daemons or
 processes abort or fail to launch.
- Various fixes to work around Intel (NetEffect) RNIC behavior.
- Various fixes for mpirun's --preload-files and --preload-binary
 options.
- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS.
- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you
 set the MCA parameter orte_forward_job_control to 1.
- Allow the sm BTL to allocate larger amounts of shared memory if
 desired (helpful for very large multi-core boxen).
- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX,
 leading to compile problems on some platforms.  Thanks to Andrea Iob
 for the bug report.
- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it
 was accidentally being ignored.
- Fix some run-time issues with the sctp BTL.
- Ensure that RTLD_NEXT exists before trying to use it (e.g., it
 doesn't exist on Cygwin).  Thanks to Gustavo Seabra for reporting
 the issue.
- Various fixes to VampirTrace, including fixing compile errors on
 some platforms.
- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in
 orterun.1 man page.  Thanks to Dirk Eddelbuettel for identifying the
 problem and submitting a patch.
- Implement the XML formatted output of stdout/stderr/stddiag.
- Fixed mpirun's -wdir switch to ensure that working directories for
 multiple app contexts are properly handled.  Thanks to Geoffroy
 Pignot for reporting the problem.
- Improvements to the MPI C++ integer constants:
 - Allow MPI::SEEK_* constants to be used as constants
 - Allow other MPI C++ constants to be used as array sizes
- Fix minor problem with orte-restart's command line options.  See
 ticket #1761 for details.  Thanks to Gregor Dschung for reporting
 the problem.




Re: [OMPI devel] 1.3.1rc5

2009-03-19 Thread Brad Benton
Things look good from the IBM side as well.  So, RM-approved for release.
--brad
ompi 1.3 co-release manager



On Thu, Mar 19, 2009 at 7:31 AM, Jeff Squyres  wrote:

> Looks good to cisco.  Ship it.
>
> I'm still seeing a very low incidence of the sm segv during startup (.01%
> -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm
> code for 1.3.2.
>
> --
> Jeff Squyres
> Cisco Systems


[OMPI devel] 1.3.1rc5

2009-03-19 Thread Jeff Squyres

Looks good to cisco.  Ship it.

I'm still seeing a very low incidence of the sm segv during startup
(.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in
Eugene's new sm code for 1.3.2.


--
Jeff Squyres
Cisco Systems