Re: [OMPI devel] opal_util_register_stackhandlers()
Thanks for the bug report!

I've just changed the behavior to emit a warning and *not* intercept a signal if the old signal action is neither SIG_DFL nor SIG_IGN. The opal_signal MCA parameter can be set to determine which signals you want to intercept; it defaults to the integer values of SIGABRT, SIGBUS, SIGFPE, SIGSEGV on your system.

We can probably get this in OMPI v1.3.2.

On Mar 19, 2009, at 11:13 AM, Kees Verstoep wrote:

> Hi,
>
> Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c calls sigaction() with a third NULL argument, meaning you don't look at possibly previously installed signal handlers, and always override them with print_stackframe(). But there are actually realistic scenarios where an application actively uses these signals, and also wants to use MPI.
>
> As an example, the default opal "signal" parameter settings are such that SIGSEGV is redirected. Typically, indeed, SIGSEGV indicates a bug somewhere, and the stacktrace from Open MPI is a nice bonus. However, the Sun Java JDK uses SIGSEGV to detect when stacks should be automatically extended, and it stops working rather ungracefully when that handler gets replaced.
>
> (BTW, we stumbled on this recently when we added an MPI backend for our Ibis grid programming environment. It took a bit of time to figure out what was happening, since we got no usable stacktrace for the thread that got bitten. We suspected a bug in our native code mapping at first, but MPICH did not have this problem).
>
> In most cases, you can of course work around it by manually changing the opal "signal" list, but it would be nicer if Open MPI would detect the situation, and e.g. only install the stack printer when there is no handler yet, or at least warn about the possible clash.
>
> Thanks!
> Kees Verstoep
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r20759
There was a glitch in the SVN server this evening; you can tell that this r number is far lower than it should be. IU is fixing it right now. This commit will occur again with a new, higher SVN r number shortly...

Begin forwarded message:

From:
Date: March 19, 2009 8:41:21 PM EDT
To:
Subject: [OMPI svn-full] svn:open-mpi r20759
Reply-To:

Author: jsquyres
Date: 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
New Revision: 20759
URL: https://svn.open-mpi.org/trac/ompi/changeset/20759

Log:
Per a comment on the users list, don't try to install our own signal handlers if there are already non-default handlers installed. Print a warning if that situation arises.

'''NOTE:''' This is a definite target for OPAL_SOS conversion -- as it is right now, this message will be displayed for ''every'' MPI process. We want this to be OPAL_SOS'ed when that becomes available so that the error message can be aggregated nicely.

Added:
   trunk/opal/util/help-opal-util.txt
Text files modified:
   trunk/opal/util/Makefile.am  |  4 +++-
   trunk/opal/util/stacktrace.c | 22 ++++++++++++++++++++--
   2 files changed, 23 insertions(+), 3 deletions(-)

Modified: trunk/opal/util/Makefile.am
==============================================================================
--- trunk/opal/util/Makefile.am (original)
+++ trunk/opal/util/Makefile.am 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -9,7 +9,7 @@
 #                         University of Stuttgart.  All rights reserved.
 # Copyright (c) 2004-2005 The Regents of the University of California.
 #                         All rights reserved.
-# Copyright (c) 2007      Cisco Systems, Inc.  All rights reserved.
+# Copyright (c) 2007-2009 Cisco Systems, Inc.  All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -19,6 +19,8 @@

 SUBDIRS = keyval

+dist_pkgdata_DATA = help-opal-util.txt
+
 AM_LFLAGS = -Popal_show_help_yy
 LEX_OUTPUT_ROOT = lex.opal_show_help_yy

Added: trunk/opal/util/help-opal-util.txt
==============================================================================
--- (empty file)
+++ trunk/opal/util/help-opal-util.txt 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -0,0 +1,25 @@
+# -*- text -*-
+#
+# Copyright (c) 2009 Cisco Systems, Inc.  All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# This is the US/English general help file for Open MPI.
+#
+[stacktrace signal override]
+Open MPI was insertting a signal handler for signal %d but noticed
+that there is already a non-default handler installer.  Open MPI's
+handler was therefore not installed; your job will continue.  This
+warning message will only be displayed once, even if Open MPI
+encounters this situation again.
+
+To avoid displaying this warning message, you can either not install
+the error handler for signal %d or you can have Open MPI not try to
+install its own signal handler for this signal by setting the
+"opal_signals" MCA parameter.
+
+  Signal: %d
+  Current opal_signals value: %s

Modified: trunk/opal/util/stacktrace.c
==============================================================================
--- trunk/opal/util/stacktrace.c (original)
+++ trunk/opal/util/stacktrace.c 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -38,6 +38,7 @@
 #include "opal/mca/backtrace/backtrace.h"
 #include "opal/constants.h"
 #include "opal/util/output.h"
+#include "opal/util/show_help.h"

 #ifndef _NSIG
 #define _NSIG 32
@@ -410,11 +411,12 @@
 int opal_util_register_stackhandlers (void)
 {
 #if OMPI_WANT_PRETTY_PRINT_STACKTRACE && ! defined(__WINDOWS__)
-    struct sigaction act;
+    struct sigaction act, old;
     char * string_value;
     char * tmp;
     char * next;
     int param, i;
+    bool showed_help = false;

     gethostname(stacktrace_hostname, sizeof(stacktrace_hostname));
     stacktrace_hostname[sizeof(stacktrace_hostname) - 1] = '\0';
@@ -459,10 +461,26 @@
           return OPAL_ERR_BAD_PARAM;
       }

-      ret = sigaction (sig, &act, NULL);
+      ret = sigaction (sig, &act, &old);
       if (ret != 0) {
         return OPAL_ERR_IN_ERRNO;
       }
+      if (SIG_IGN != old.sa_handler && SIG_DFL != old.sa_handler) {
+          if (!showed_help) {
+              /* JMS This is icky; there is no error message
+                 aggregation here so this message may be repeated for
+                 every single MPI process...  This should be replaced
+                 with OPAL_SOS when that is done so that it can be
+                 properly aggregated. */
+              opal_show_help("help-opal-util.txt",
+                             "stacktrace signal override",
+                             true, sig, sig, sig, string_value);
+              showed_help = true;
+          }
+          if (0 != sigaction(sig, &old, NULL)) {
+              return OPAL_ERR_IN_ERRNO;
+          }
+      }
     }
     free(string_value);
 #endif /* OMPI_WANT_P
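The second stacktrace.c hunk sits inside a loop over the signal list held in string_value, but the parsing itself falls outside the diff context shown here. As a rough illustration only, a comma-separated list of signal numbers such as "6,7,8,11" could be walked as in the sketch below. `parse_signal_list` is a hypothetical helper, and the accepted syntax is an assumption; the real Open MPI parser may accept a different format.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a comma-separated list of positive signal numbers, e.g. "6,7,8,11".
 * Stores up to `max` values in `out`.  Returns the number of values parsed,
 * or -1 if the string is malformed or holds more than `max` entries.
 */
static int parse_signal_list(const char *value, int *out, int max)
{
    const char *p = value;
    int count = 0;

    assert(value != NULL && out != NULL);

    while (*p != '\0') {
        char *next;
        long sig;

        errno = 0;
        sig = strtol(p, &next, 10);

        /* next == p means no digits were consumed at this position. */
        if (next == p || errno != 0 || sig <= 0 || sig > INT_MAX) {
            return -1;
        }
        if (count >= max) {
            return -1;
        }
        out[count++] = (int)sig;

        if (*next == ',') {
            next++;             /* step over the separator */
        } else if (*next != '\0') {
            return -1;          /* unexpected character after a number */
        }
        p = next;
    }
    return count;
}
```

Each parsed value would then be handed to the sigaction() logic in the hunk above. The sketch does not check values against the platform's signal range, which a real implementation would likely do.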
Re: [OMPI devel] RFC: Final cleanup of included headers
Hi Ralph,

On Wednesday 18 March 2009 09:00:36 am Ralph Castain wrote:
> Could we hold off on this until after 1.3.2 is out the door and has a
> couple of days to stabilize? All these header file changes are making
> it more difficult to cleanly apply patches to the 1.3 branch.

Hmm, sure, we can hold off on the big patch. With the current plan, 1.3.2 should be out on 4/3. However, I'd still like to be able to apply some intermediate (small!) steps in the meantime.

> When we get past the next couple of weeks, the 1.3 branch should clear
> out the backlog of CMRs, and we should have the usual immediate "oops"
> fixes in to 1.3.2. Then this won't be such a problem.

However, it would be nice if you could test the patch on your systems before I move it into trunk. I want to limit the "down-time" of trunk (there may be a few places where additional headers are required, because unnecessary headers were removed from lower-level headers).

Thanks,
Rainer
--
Rainer Keller, PhD          Tel: +1 (865) 241-6293
Oak Ridge National Lab      Fax: +1 (865) 241-4811
PO Box 2008 MS 6164         Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008    AIM/Skype: rusraink
[OMPI devel] opal_util_register_stackhandlers()
Hi,

Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c calls sigaction() with a third NULL argument, meaning you don't look at possibly previously installed signal handlers, and always override them with print_stackframe(). But there are actually realistic scenarios where an application actively uses these signals, and also wants to use MPI.

As an example, the default opal "signal" parameter settings are such that SIGSEGV is redirected. Typically, indeed, SIGSEGV indicates a bug somewhere, and the stacktrace from Open MPI is a nice bonus. However, the Sun Java JDK uses SIGSEGV to detect when stacks should be automatically extended, and it stops working rather ungracefully when that handler gets replaced.

(BTW, we stumbled on this recently when we added an MPI backend for our Ibis grid programming environment. It took a bit of time to figure out what was happening, since we got no usable stacktrace for the thread that got bitten. We suspected a bug in our native code mapping at first, but MPICH did not have this problem).

In most cases, you can of course work around it by manually changing the opal "signal" list, but it would be nicer if Open MPI would detect the situation, and e.g. only install the stack printer when there is no handler yet, or at least warn about the possible clash.

Thanks!
Kees Verstoep
[OMPI devel] Open MPI v1.3.1 released
The Open MPI Team, representing a consortium of research, academic, and industry partners, is pleased to announce the release of Open MPI version 1.3.1. This release is mainly a bug fix release over the v1.3.0 release, but there are a few new features. We strongly recommend that all users upgrade to version 1.3.1 if possible.

Version 1.3.1 can be downloaded from the main Open MPI web site or any of its mirrors (mirrors will be updating shortly).

Here is a list of changes in v1.3.1 as compared to v1.3.0:

- Added "sync" coll component to allow users to synchronize every N collective operations on a given communicator.
- Increased the default values of the IB and RNR timeout MCA parameters.
- Fix a compiler error noted by Mostyn Lewis with the PGI 8.0 compiler.
- Fix an error that prevented stdin from being forwarded if the rsh launcher was in use. Thanks to Branden Moore for pointing out the problem.
- Correct a case where the added datatype is considered as contiguous but has gaps in the beginning.
- Fix an error that limited the number of comm_spawns that could simultaneously be running in some environments.
- Correct a corner case in OB1's GET protocol for long messages; the error could sometimes cause MPI jobs using the openib BTL to hang.
- Fix a bunch of bugs in the IO forwarding (IOF) subsystem and add some new options to output to files and redirect output to xterm. Thanks to Jody Weissmann for helping test out many of the new fixes and features.
- Fix SLURM race condition.
- Fix MPI_File_c2f(MPI_FILE_NULL) to return 0, not -1. Thanks to Lisandro Dalcin for the bug report.
- Fix the DSO build of tm PLM.
- Various fixes for size disparity between C int's and Fortran INTEGER's. Thanks to Christoph van Wullen for the bug report.
- Ensure that mpirun exits with a non-zero exit status when daemons or processes abort or fail to launch.
- Various fixes to work around Intel (NetEffect) RNIC behavior.
- Various fixes for mpirun's --preload-files and --preload-binary options.
- Fix the string name in MPI::ERRORS_THROW_EXCEPTIONS.
- Add ability to forward SIGTSTP and SIGCONT to MPI processes if you set the MCA parameter orte_forward_job_control to 1.
- Allow the sm BTL to allocate larger amounts of shared memory if desired (helpful for very large multi-core boxen).
- Fix a few places where we used PATH_MAX instead of OMPI_PATH_MAX, leading to compile problems on some platforms. Thanks to Andrea Iob for the bug report.
- Fix mca_btl_openib_warn_no_device_params_found MCA parameter; it was accidentally being ignored.
- Fix some run-time issues with the sctp BTL.
- Ensure that RTLD_NEXT exists before trying to use it (e.g., it doesn't exist on Cygwin). Thanks to Gustavo Seabra for reporting the issue.
- Various fixes to VampirTrace, including fixing compile errors on some platforms.
- Fixed missing MPI_Comm_accept.3 man page; fixed minor issue in orterun.1 man page. Thanks to Dirk Eddelbuettel for identifying the problem and submitting a patch.
- Implement the XML formatted output of stdout/stderr/stddiag.
- Fixed mpirun's -wdir switch to ensure that working directories for multiple app contexts are properly handled. Thanks to Geoffroy Pignot for reporting the problem.
- Improvements to the MPI C++ integer constants:
  - Allow MPI::SEEK_* constants to be used as constants
  - Allow other MPI C++ constants to be used as array sizes
- Fix minor problem with orte-restart's command line options. See ticket #1761 for details. Thanks to Gregor Dschung for reporting the problem.
Re: [OMPI devel] 1.3.1rc5
Things look good from the IBM side as well. So, RM-approved for release.

--brad
ompi 1.3 co-release manager

On Thu, Mar 19, 2009 at 7:31 AM, Jeff Squyres wrote:
> Looks good to cisco. Ship it.
>
> I'm still seeing a very low incidence of the sm segv during startup (.01%
> -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm
> code for 1.3.2.
>
> --
> Jeff Squyres
> Cisco Systems
[OMPI devel] 1.3.1rc5
Looks good to cisco. Ship it.

I'm still seeing a very low incidence of the sm segv during startup (.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm code for 1.3.2.

--
Jeff Squyres
Cisco Systems