On 23/4/2014 6:00 PM, Matthew Knepley wrote:
On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <zon...@gmail.com> wrote:
Hi,
An update: I managed to compare the values by reducing the problem
size to a hundred-plus values. The matrix and vector are almost the same as in
my Win7 output.
Run in the debugger and get a stack trace.
Hi,
I used the -start_in_debugger option and it stops at this point:
Program received signal SIGFPE, Arithmetic exception.
VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
71 ierr = PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);
(gdb) where
#0 VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
#1 0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0,
z=0x7fff24cd7f40) at pbvec.c:15
#2 0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0,
val=0x7fff24cd7f40) at rvector.c:128
#3 0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
#4 0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
at itfunc.c:441
#5 0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38,
__ierr=0x384d8b8) at itfuncf.c:219
#6 0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
#7 0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
#8 0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
#9 0x00000000004093dc in main ()
(gdb)
Is this what you mean by a stack trace?
I have also used "bt full" and I have attached a more detailed output.
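A way to narrow this down could be to check the vectors for NaN or Inf before they reach the dot product. The sketch below is plain C, not PETSc code: first_nonfinite and checked_dot are hypothetical helper names, and it assumes the SIGFPE is caused by a non-finite entry already present in one of the vectors, which the trace above does not confirm.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not a PETSc routine): return the index of the first
 * NaN or Inf entry in v, or n if every entry is finite. */
static size_t first_nonfinite(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(v[i])) {
            return i;
        }
    }
    return n;
}

/* Plain sequential dot product that refuses non-finite input.
 * Returns 0 and stores the result in *out on success,
 * or -1 (leaving *out untouched) if x or y contains NaN/Inf. */
static int checked_dot(const double *x, const double *y, size_t n, double *out)
{
    double sum = 0.0;

    assert(out != NULL);
    if (first_nonfinite(x, n) != n || first_nonfinite(y, n) != n) {
        return -1;
    }
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    *out = sum;
    return 0;
}
```

Running a check like first_nonfinite on the raw array of the right-hand side and the initial guess just before the solve would show whether bad values enter the solver or are produced inside it.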
Matt
I also tried valgrind, but it aborts almost immediately:
valgrind --leak-check=yes ./a.out
==17603== Memcheck, a memory error detector.
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17603== For more details, rerun with: -v
==17603==
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
==17603== Your program just tried to execute an instruction that Valgrind
==17603== did not recognise. There are two possible reasons for this.
==17603== 1. Your program has a bug and erroneously jumped to a non-code
==17603== location. If you are running Memcheck and you just saw a
==17603== warning about a bad jump, it's probably your program's fault.
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17603== i.e. it's Valgrind's fault. If you think this is the case or
==17603== you are not sure, please let us know and we'll try to fix it.
==17603== Either way, Valgrind will now raise a SIGILL signal which will
==17603== probably kill your program.
forrtl: severe (168): Program Exception - illegal instruction
Image PC Routine Line Source
libifcore.so.5 0000000005DD0F0E Unknown Unknown Unknown
libifcore.so.5 0000000005DD0DC7 Unknown Unknown Unknown
a.out 0000000001CB4CBB Unknown Unknown Unknown
a.out 00000000004093DC Unknown Unknown Unknown
libc.so.6 000000369141D974 Unknown Unknown Unknown
a.out 00000000004092E9 Unknown Unknown Unknown
==17603==
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
==17603== For counts of detected errors, rerun with: -v
==17603== searching for pointers to 8 not-freed blocks.
==17603== checked 2,340,280 bytes.
==17603==
==17603== LEAK SUMMARY:
==17603== definitely lost: 0 bytes in 0 blocks.
==17603== possibly lost: 0 bytes in 0 blocks.
==17603== still reachable: 239 bytes in 8 blocks.
==17603== suppressed: 0 bytes in 0 blocks.
==17603== Reachable blocks (those to which a pointer was found) are not shown.
==17603== To see them, rerun with: --show-reachable=yes
Thank you
Yours sincerely,
TAY wee-beng
On 23/4/2014 5:18 PM, TAY wee-beng wrote:
Hi,
My code was found to give wrong answers on one of the clusters, even on a
single processor. No error message was given. It used to work fine.
I ran the debug version and it gives this error message:
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably
divide by zero
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames
------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
[0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
[0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
[0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: --------------------- Error Message
------------------------------------
[0]PETSC ERROR: Signal received!
[0]PETSC ERROR:
------------------------------------------------------------------------
It happens after KSPSolve is called. There was no problem on the other clusters. So how should
I debug this to find the error?
I tried to compare the input matrix and vector between the different clusters, but
there are too many values.
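When there are too many values to compare by eye, the comparison can be done by a small program. The sketch below assumes the matrix entries or vector from each cluster have already been read into two plain double arrays of equal length. count_mismatches is a hypothetical helper name, and the tolerances are illustrative, not values recommended by PETSc.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: count entries where a[i] and b[i] differ by more than
 * atol + rtol * max(|a[i]|, |b[i]|). Any NaN or Inf entry is counted as a
 * mismatch, since it is worth flagging on its own.
 * On return, *first_bad is the index of the first mismatch, or n if none. */
static size_t count_mismatches(const double *a, const double *b, size_t n,
                               double rtol, double atol, size_t *first_bad)
{
    size_t bad = 0;

    assert(a != NULL && b != NULL && first_bad != NULL);
    *first_bad = n;
    for (size_t i = 0; i < n; ++i) {
        int mismatch;

        if (!isfinite(a[i]) || !isfinite(b[i])) {
            mismatch = 1;
        } else {
            double mag_a = fabs(a[i]);
            double mag_b = fabs(b[i]);
            double scale = mag_a > mag_b ? mag_a : mag_b;
            mismatch = fabs(a[i] - b[i]) > atol + rtol * scale;
        }
        if (mismatch) {
            if (bad == 0) {
                *first_bad = i;
            }
            ++bad;
        }
    }
    return bad;
}
```

A relative tolerance is used because results from different compilers and machines rarely agree bit for bit. The first mismatching index gives a concrete entry to inspect on both clusters.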
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<stack.txt>