Re: [petsc-users] KSP breakdown in specific cluster (update)

TAY wee-beng Wed, 23 Apr 2014 18:45:07 -0700


On 24/4/2014 9:41 AM, Barry Smith wrote:

   The numbers like


         sum = 1.9762625833649862e-323

         rho = 1.9762625833649862e-323

         beta = 1.600807474747106e-316
         omega = 1.6910452843641213e-315
         d2 = 1.5718032521948665e-316

    are nonsense. They would generally indicate that something is wrong, but 
unfortunately don’t point to exactly what is wrong.
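One observation that may help narrow this down: every value listed above is far below the smallest normal double (about 2.2e-308), so all of them are subnormal numbers. Reinterpreting their bits as integers shows how small the underlying bit patterns are; `sum` and `rho` correspond to the integer 4. Values like these often appear when memory is read before it is initialized, or when integer data is read as a double, although this alone does not prove either. A minimal Python sketch (the helper name `double_bits` is made up for illustration):

```python
import struct
import sys

def double_bits(x):
    """Return the 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Values copied from the solver output quoted above.
values = {
    "sum": 1.9762625833649862e-323,
    "rho": 1.9762625833649862e-323,
    "beta": 1.600807474747106e-316,
    "omega": 1.6910452843641213e-315,
    "d2": 1.5718032521948665e-316,
}

for name, x in values.items():
    # A nonzero double smaller than sys.float_info.min is subnormal.
    subnormal = 0.0 < abs(x) < sys.float_info.min
    print(f"{name:6s} bits=0x{double_bits(x):016x} subnormal={subnormal}")
```

If the same quantities look reasonable on the machine where the code works, that would suggest the data is already corrupted before the solver sees it.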


Hi,

In that case, how do I troubleshoot? Any suggestions?

Thanks.



    Barry

On Apr 23, 2014, at 8:12 PM, TAY wee-beng <zon...@gmail.com> wrote:

On 23/4/2014 6:00 PM, Matthew Knepley wrote:

On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <zon...@gmail.com> wrote:
Hi,

Just an update: I managed to compare the values by reducing the problem 
size to a hundred-plus values. The matrix and vector are almost the same as 
in my Win7 output.

Run in the debugger and get a stack trace.

Hi,

I used the -start_in_debugger option and it stops at this point:

Program received signal SIGFPE, Arithmetic exception.
VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
71          ierr = PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);
(gdb) where
#0  VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
#1  0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0,
     z=0x7fff24cd7f40) at pbvec.c:15
#2  0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0,
     val=0x7fff24cd7f40) at rvector.c:128
#3  0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
#4  0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
     at itfunc.c:441
#5  0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38,
     __ierr=0x384d8b8) at itfuncf.c:219
#6  0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
#7  0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
#8  0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
#9  0x00000000004093dc in main ()
(gdb)

Is this what you mean by a stack trace?

I have also used "bt full" and I have attached a more detailed output.

    Matt

I also tried valgrind, but it aborts almost immediately:


valgrind --leak-check=yes ./a.out
==17603== Memcheck, a memory error detector.
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17603== For more details, rerun with: -v
==17603==
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
==17603== Your program just tried to execute an instruction that Valgrind
==17603== did not recognise.  There are two possible reasons for this.
==17603== 1. Your program has a bug and erroneously jumped to a non-code
==17603==    location.  If you are running Memcheck and you just saw a
==17603==    warning about a bad jump, it's probably your program's fault.
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17603==    i.e. it's Valgrind's fault.  If you think this is the case or
==17603==    you are not sure, please let us know and we'll try to fix it.
==17603== Either way, Valgrind will now raise a SIGILL signal which will
==17603== probably kill your program.
forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine Line        Source
libifcore.so.5     0000000005DD0F0E  Unknown Unknown  Unknown
libifcore.so.5     0000000005DD0DC7  Unknown Unknown  Unknown
a.out              0000000001CB4CBB  Unknown Unknown  Unknown
a.out              00000000004093DC  Unknown Unknown  Unknown
libc.so.6          000000369141D974  Unknown Unknown  Unknown
a.out              00000000004092E9  Unknown Unknown  Unknown
==17603==
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
==17603== For counts of detected errors, rerun with: -v
==17603== searching for pointers to 8 not-freed blocks.
==17603== checked 2,340,280 bytes.
==17603==
==17603== LEAK SUMMARY:
==17603==    definitely lost: 0 bytes in 0 blocks.
==17603==      possibly lost: 0 bytes in 0 blocks.
==17603==    still reachable: 239 bytes in 8 blocks.
==17603==         suppressed: 0 bytes in 0 blocks.
==17603== Reachable blocks (those to which a pointer was found) are not shown.
==17603== To see them, rerun with: --show-reachable=yes
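The bytes valgrind reports as unhandled (0xF 0xAE 0x85 0xF0) can be decoded by hand, which may help decide between the two reasons it lists. The sketch below decodes only the ModRM byte after the two-byte opcode 0F AE; the mnemonic table is written from memory of the Intel manual's "group 15" encoding and should be treated as an assumption to verify. The helper name `decode_0f_ae` is made up for illustration.

```python
# Bytes reported by valgrind: "unhandled instruction bytes: 0xF 0xAE 0x85 0xF0"
insn = bytes([0x0F, 0xAE, 0x85, 0xF0])

# Memory-operand forms of opcode 0F AE, selected by the ModRM reg field.
# Assumption: table reproduced from memory, check it against the Intel SDM.
GROUP15_MEM = {
    0: "FXSAVE", 1: "FXRSTOR", 2: "LDMXCSR", 3: "STMXCSR",
    4: "XSAVE", 5: "XRSTOR", 6: "XSAVEOPT", 7: "CLFLUSH",
}

def decode_0f_ae(code):
    """Decode the ModRM byte following 0F AE; returns (mnemonic, mod, reg, rm)."""
    if code[0] != 0x0F or code[1] != 0xAE:
        raise ValueError("not a 0F AE instruction")
    modrm = code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:
        raise ValueError("register form (fence instructions), not handled here")
    return GROUP15_MEM[reg], mod, reg, rm

print(decode_0f_ae(insn))
```

Under that table the bytes decode to FXSAVE with a memory operand, which is a legitimate instruction. That points toward reason 2 (the banner shows valgrind-3.2.1, copyright through 2006, not handling the instruction inside libifcore.so.5) rather than a bad jump in the program, so a newer valgrind may get further.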

Thank you

Yours sincerely,

TAY wee-beng

On 23/4/2014 5:18 PM, TAY wee-beng wrote:
Hi,

My code was found to be giving wrong answers on one of the clusters, even on a 
single processor. No error message was given. It used to work fine.

I ran the debug version and it gives this error message:

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
[0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
[0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
[0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Signal received!
[0]PETSC ERROR: ------------------------------------------------------------------------

It happens during KSPSolve. There was no problem on the other cluster. How should 
I debug to find the error?

I tried to compare the input matrix and vector between the different clusters, but 
there are too many values.
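When there are too many values to compare by eye, a small script can report only the entries that differ. The sketch below is one possible approach, not anything PETSc provides: it assumes each machine writes its matrix or vector entries to a text file with one value per line (a hypothetical dump format), and the function names and tolerances are illustrative.

```python
import math

def load_values(path):
    """Read one floating-point value per line (hypothetical dump format)."""
    with open(path) as fh:
        return [float(line) for line in fh if line.strip()]

def find_mismatches(a, b, rel_tol=1e-10, abs_tol=1e-14):
    """Return (index, a_i, b_i) for entries that differ beyond the tolerances.

    NaN never compares close to anything, so NaN entries are always reported.
    """
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a, b))
        if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Made-up data standing in for dumps from the two machines.
good = [1.0, 2.5, -3.0e-4, 0.0]
bad = [1.0, 2.5 + 1e-13, -3.1e-4, float("nan")]

for i, x, y in find_mismatches(good, bad):
    print(f"entry {i}: {x!r} vs {y!r}")
```

With real dumps the call would look like `find_mismatches(load_values("win7.txt"), load_values("cluster.txt"))`, where both file names are placeholders. Rounding-level differences are ignored by the tolerances, so anything reported is worth a closer look.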





--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

<stack.txt>
