Re: [OMPI devel] segv in ompi_info
Good suggestion - fixed on trunk in r32189

On Jul 9, 2014, at 2:30 PM, Paul Hargrove wrote:

> So, I would suggest that ompi_info simply "signal(SIGPIPE, SIG_IGN);" to
> resolve this in a way not specific to mxm.
Re: [OMPI devel] segv in ompi_info
I agree with Gilles that there is not a "bug", but I believe that OMPI could
do something better.

First, I'll show that
a) this is not a new behavior
b) it is not limited to "less".

$ (strace ompi_info -a | grep -m1 btl) 2>&1 | grep -e 'Open MPI:' -e SIGPIPE
write(1, "Open MPI: 1.4.5\n", 32) = 32
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++

a) the ompi_info output says "Open MPI: 1.4.5" (thus not new by any stretch).
b) the "-m1" argument to the inner "grep" tells it to exit after the first match.

The "strace" is there to detect and report that SIGPIPE was received.
The outer grep picks out the relevant info from the flood of strace output.

So, the "issue" today seems to be that mxm is catching the signal and
producing a backtrace. This backtrace is NOT desirable behavior. It is
not intrinsically the "fault" of mxm, because there is no reason to believe
that ompi_info would never link to (or dlopen) another library that performs
backtraces.

So, I would suggest that ompi_info simply call "signal(SIGPIPE, SIG_IGN);" to
resolve this in a way not specific to mxm.

-Paul

On Wed, Jul 9, 2014 at 3:47 AM, Gilles Gouaillardet wrote:

> unless i miss something, i would conclude there is no bug
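
For readers unfamiliar with the effect of the suggested call, here is a minimal sketch, assuming a plain POSIX system. The helper name write_after_reader_exit is hypothetical; this is not code from Open MPI and not what r32189 actually contains. It ignores SIGPIPE and then writes to a pipe whose read end is already closed, which stands in for "less" having exited.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper (illustration only, not Open MPI code).
 * Ignores SIGPIPE, then writes to a pipe that has no reader.
 * Returns the errno reported by write(), 0 if the write unexpectedly
 * succeeded, or -1 if the setup itself failed. */
int write_after_reader_exit(void)
{
    int fds[2];

    /* With SIGPIPE ignored, writing to a broken pipe no longer kills the
     * process (or triggers a library's signal handler); write() fails
     * with EPIPE instead. */
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                  /* simulate the reader going away */

    ssize_t n = write(fds[1], "x", 1);
    int err = (n < 0) ? errno : 0;  /* capture errno before close() can change it */

    close(fds[1]);
    return err;
}
```

On a POSIX system the call is expected to return EPIPE. One consequence of this approach is that the program has to check its write or printf results itself; otherwise it keeps producing output that nobody reads.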
Re: [OMPI devel] segv in ompi_info
Mike,

how do you test? I cannot reproduce a bug:

if I run ompi_info -a -l 9 | less

and press 'q' at an early stage (i.e. before all output is written to the
pipe), then the less process exits, and ompi_info receives SIGPIPE and is
killed (which is normal Unix behaviour)

now if I press the spacebar until the end of the output (i.e. I get the (END)
message from less) and then press 'q', there is no problem.

strace -e signal ompi_info -a -l 9 | true
will cause ompi_info to receive a SIGPIPE

strace -e signal dd if=/dev/zero bs=1M count=1 | true
will cause dd to receive a SIGPIPE

unless I am missing something, I would conclude there is no bug

Cheers,

Gilles

On 2014/07/09 19:33, Mike Dubman wrote:

> mxm only intercept signals and prints the stacktrace.
> happens on trunk as well.
> only when "| less" is used.
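
The default behaviour described above can also be checked without strace. The following is a sketch under the assumption of a POSIX system; signal_killing_writer is a hypothetical name chosen for illustration. A child process writes to a pipe that has no reader, and the parent reports which signal ended it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (illustration only).
 * Forks a child that writes to a pipe with no reader while SIGPIPE has
 * its default disposition. Returns the number of the signal that killed
 * the child, 0 if the child exited normally, or -1 on setup failure. */
int signal_killing_writer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Close the read end before forking so that neither process holds a
     * reader; otherwise the child's write could succeed. */
    close(fds[0]);

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        signal(SIGPIPE, SIG_DFL);       /* make sure the default action applies */
        if (write(fds[1], "x", 1) < 0)  /* expected to raise SIGPIPE here */
            _exit(1);                   /* reached only if the signal was not fatal */
        _exit(0);
    }

    close(fds[1]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The returned value is expected to equal SIGPIPE (signal 13 on Linux), the same signal number that appears in the "Caught signal 13 (Broken pipe)" report.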
Re: [OMPI devel] segv in ompi_info
mxm only intercepts signals and prints the stacktrace.
happens on trunk as well.
only when "| less" is used.

On Tue, Jul 8, 2014 at 4:50 PM, Jeff Squyres (jsquyres) wrote:

> I'm unable to replicate. Please provide more detail...? Is this a
> problem in the MXM component?
Re: [OMPI devel] segv in ompi_info
I'm unable to replicate. Please provide more detail...? Is this a problem in the MXM component?

On Jul 8, 2014, at 9:20 AM, Mike Dubman wrote:

> $/usr/mpi/gcc/openmpi-1.8.2a1/bin/ompi_info -a -l 9|less
> Caught signal 13 (Broken pipe)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] segv in ompi_info
$ /usr/mpi/gcc/openmpi-1.8.2a1/bin/ompi_info -a -l 9|less
Caught signal 13 (Broken pipe)
backtrace
 2 0x00054cac mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:653
 3 0x00054e74 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:628
 4 0x0033fbe32920 killpg()  ??:0
 5 0x0033fbedb650 __write_nocancel()  interp.c:0
 6 0x0033fbe71d53 _IO_file_write@@GLIBC_2.2.5()  ??:0
 7 0x0033fbe73305 _IO_do_write@@GLIBC_2.2.5()  ??:0
 8 0x0033fbe719cd _IO_file_xsputn@@GLIBC_2.2.5()  ??:0
 9 0x0033fbe48410 _IO_vfprintf()  ??:0
10 0x0033fbe4f40a printf()  ??:0
11 0x0002bc84 opal_info_out()  /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:853
12 0x0002c6bb opal_info_show_mca_group_params()  /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:658
13 0x0002c882 opal_info_show_mca_group_params()  /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:716
14 0x0002cc13 opal_info_show_mca_params()  /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:742
15 0x0002d074 opal_info_do_params()  /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:485
16 0x0040167b main()  ??:0
17 0x0033fbe1ecdd __libc_start_main()  ??:0
18 0x00401349 _start()  ??:0
===