Mark,

    I think it is ignoring the OpenMP pragmas I added. Can you please configure 
with the additional argument --with-openmp?

Barry

> On Feb 22, 2015, at 11:21 AM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> Barry, I get three errors with -ksp_converged_reason using your branch.
> 
> Thanks,
> Mark
> 
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> [82]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
> [82]PETSC ERROR: Argument out of range
> [82]PETSC ERROR: Too many pushes
> [82]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
> trouble shooting.
> [82]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016  GIT 
> Date: 2015-02-21 11:26:56 -0600
> [82]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named 
> nid03897 by madams Sun Feb 22 09:12:35 2015
> [82]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo" 
> --CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo" --download-parmetis 
> --download-metis --with-ssl=0 --with-threadsafety --with-log=0 --with-cc=cc 
> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 
> --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0 
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/ 
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++ 
> PETSC_ARCH=arch-xc30-optts-intel PETSC_DIR=/global/homes/m/madams/petsc-barry
> [82]PETSC ERROR: #1 PetscViewerPushFormat() line 144 in 
> /global/u2/m/madams/petsc-barry/src/sys/classes/viewer/interface/viewa.c
> [82]PETSC ERROR: #2 KSPReasonViewFromOptionsUnsafe() line 424 in 
> /global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
> [82]PETSC ERROR: #3 KSPSolve() line 592 in 
> /global/u2/m/madams/petsc-barry/src/ksp/ksp/interface/itfunc.c
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> 
>  [snip]
> 
> [13]PETSC ERROR: 
> ------------------------------------------------------------------------
> [13]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
> probably memory access out of range
> [13]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [13]PETSC ERROR: or see 
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [13]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X 
> to find memory corruption errors
> [13]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and 
> run 
> [13]PETSC ERROR: to get more information on the crash.
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> [13]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
> [13]PETSC ERROR: Signal received
> [13]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
> trouble shooting.
> [13]PETSC ERROR: Petsc Development GIT revision: v3.5.3-2014-g463f016  GIT 
> Date: 2015-02-21 11:26:56 -0600
> [13]PETSC ERROR: ../../epsi/XGCa/xgca on a arch-xc30-optts-intel named 
> nid00713 by madams Sun Feb 22 09:12:35 2015
> [13]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo" 
> --CXXOPTFLAGS="-fast -no-ipo" --FOPTFLAGS="-fast -no-ipo" --download-parmetis 
> --download-metis --with-ssl=0 --with-threadsafety --with-log=0 --with-cc=cc 
> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 
> --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0 
> --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/ 
> --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++ 
> PETSC_ARCH=arch-xc30-optts-intel PETSC_DIR=/global/homes/m/madams/petsc-barry
> [13]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> Rank 13 [Sun Feb 22 09:13:04 2015] [c3-0c2s2n1] application called 
> MPI_Abort(MPI_COMM_WORLD, 59) - process 13
> 
>  [snip]
> 
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> forrtl: error (76): Abort trap signal
> Image              PC                Routine            Line        Source    
>          
> xgca               0000000002E10A41  Unknown               Unknown  Unknown
> xgca               0000000002E0F197  Unknown               Unknown  Unknown
> xgca               0000000002DC5B24  Unknown               Unknown  Unknown
> xgca               0000000002DC5936  Unknown               Unknown  Unknown
> xgca               0000000002D59C64  Unknown               Unknown  Unknown
> xgca               0000000002D60BE1  Unknown               Unknown  Unknown
> xgca               00000000015217D0  Unknown               Unknown  Unknown
> xgca               000000000152178B  Unknown               Unknown  Unknown
> xgca               0000000002E32271  Unknown               Unknown  Unknown
> xgca               0000000002BDCE52  Unknown               Unknown  Unknown
> xgca               0000000002BACDE3  Unknown               Unknown  Unknown
> xgca               0000000000A6DAB9  Unknown               Unknown  Unknown
> xgca               0000000000A6D394  Unknown               Unknown  Unknown
> xgca               00000000015217D0  Unknown               Unknown  Unknown
> xgca               00000000008BEE48  Unknown               Unknown  Unknown
> xgca               00000000008BE456  Unknown               Unknown  Unknown
> xgca               0000000000F0E10F  Unknown               Unknown  Unknown
> xgca               0000000000F0A1F2  Unknown               Unknown  Unknown
> xgca               0000000000A373F2  Unknown               Unknown  Unknown
> xgca               0000000000581957  petsc_lu_solver_          973  
> collisionf2.F90
> xgca               000000000057EB05  col_f_picard_step         372  
> collisionf2.F90
> xgca               0000000000564D6A  col_f_core_s_             945  
> collisionf.F90
> xgca               000000000056325F  f_collision_singl         254  
> collisionf.F90
> xgca               0000000000560409  f_collision_singl         350  
> collisionf.F90
> xgca               0000000002B1EF43  Unknown               UnLinear col_f_ 
> solve converged due to CONVERGED_RTOL iterations 1
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> Linear col_f_ solve converged due to CONVERGED_RTOL iterations 1
> known  Unknown
> 
> 
> On Sat, Feb 21, 2015 at 12:30 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:
> 
> barry/threadsafety-kspconvergedreason-kspmonitor
> 
> Note that, as always, the monitoring and converged reasons for the various 
> threads will be printed jumbled up.
> 
> In your own code, make sure that any routines that use the default viewers 
> (like stdout) are in an omp critical section.
> 
>  Barry
> 
> > On Feb 20, 2015, at 8:22 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > OK, I have a code set up to test it, so feel free to make a branch and I 
> > can test it.
> > Mark
> >
> > On Fri, Feb 20, 2015 at 7:13 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:
> >
> >   Mark,
> >
> >    Yes, after looking at the code it does make sense. The reason is that 
> > Matt made me "improve" the -xxx_converged_reason to use viewers; but in 
> > your case there will be multiple threads (each associated with different 
> > KSP objects) each monkeying with the same (default) viewer thus possibly 
> > corrupting it.
> >
> >    I'll have to think a little bit about the best way to keep the 
> > functionality but be thread safe.
> >
> >   Barry
> >
> > > On Feb 20, 2015, at 5:57 PM, Mark Adams <mfad...@lbl.gov> wrote:
> > >
> > > Barry,
> > >
> > > > We had a problem with the thread-safe version and found, by pure luck, 
> > > > that apparently if we use -ksp_converged_reason we get a segv-type 
> > > > failure.  Does this sound sensible?
> > >
> > > > I can give you an executable and environment to run this on Edison if 
> > > > that is useful.
> > >
> > > Thanks,
> > > Mark
> > >
> > >
> > > On Tue, Feb 17, 2015 at 9:27 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:
> > >
> > >   You need to configure with --with-threadsafety and --with-log=0 and 
> > > --with-debugging=0
> > >
> > >   Eventually we'll support at least the debugging with thread safety.
> > >
> > >   Barry
> > >
> > > Not sure about that strange message from the cray system.
> > >
> > >
> > > > On Feb 17, 2015, at 8:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
> > > >
> > > > We have been testing master with a code that calls PETSc serial LU 
> > > > solvers from threads.  I have seen system messages with OMP (see way 
> > > > below) and Robert (cc'ed) reported this useful stack trace.
> > > >
> > > > I have not modified my (non-thread) build.  Perhaps I need to or are 
> > > > there PETSc runtime options?
> > > >
> > > > This is a Cray XC30 with Intel.
> > > >
> > > > Thanks,
> > > > Mark
> > > >
> > > > > [116]PETSC ERROR: Object is in wrong state
> > > > [116]PETSC ERROR: Logging event had unbalanced begin/end pairs
> > > > [116]PETSC ERROR: See 
> > > > http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble 
> > > > shooting.
> > > > [116]PETSC ERROR: Petsc Development GIT revision: v3.5.3-1570-gcaf1481  
> > > > GIT Date: 2015-02-07 17:34:17 -0600
> > > > [116]PETSC ERROR: ./xgca_petsc36_col on a arch-xc30-opt64-intel named 
> > > > nid05975 by rhager Tue Feb 17 10:46:32 2015
> > > > > [116]PETSC ERROR: Configure options --COPTFLAGS="-fast -no-ipo" 
> > > > > --CXXOPTFLAGS="-fast -no-ipoi" --FOPTFLAGS="-fast -no-ipo" 
> > > > > --download-hypre --download-superlu_dist --download-parmetis 
> > > > > --download-metis --with-ssl=0 --with-cc=cc --with-clib-autodetect=0 
> > > > > --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 
> > > > > --with-fc=ftn --with-fortranlib-autodetect=0 
> > > > > --with-hdf5-dir=/opt/cray/hdf5-parallel/1.8.13/intel/140/ 
> > > > > --with-shared-libraries=0 --with-x=0 --with-mpiexec=aprun LIBS=-lstdc++ 
> > > > > --with-64-bit-indices PETSC_ARCH=arch-xc30-opt64-intel 
> > > > > PETSC_DIR=/global/u2/m/madams/petsc_master
> > > > [116]PETSC ERROR: #1 PetscLogEventEndDefault() line 694 in 
> > > > /global/u2/m/madams/petsc_master/src/sys/logging/utils/eventlog.c
> > > > [116]PETSC ERROR: #2 MatLUFactorSymbolic() line 2894 in 
> > > > /global/u2/m/madams/petsc_master/src/mat/interface/matrix.c
> > > > [116]PETSC ERROR: #3 PCSetUp_LU() line 127 in 
> > > > /global/u2/m/madams/petsc_master/src/ksp/pc/impls/factor/lu/lu.c
> > > > [116]PETSC ERROR: #4 PCSetUp() line 918 in 
> > > > /global/u2/m/madams/petsc_master/src/ksp/pc/interface/precon.c
> > > > [116]PETSC ERROR: #5 KSPSetUp() line 306 in 
> > > > /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > > [116]PETSC ERROR: #6 KSPSolve() line 503 in 
> > > > /global/u2/m/madams/petsc_master/src/ksp/ksp/interface/itfunc.c
> > > >
> > > >
> > > > Other error message:
> > > >
> > > >
> > > > OMP: Error #13: Assertion failure at kmp_runtime.c(1588).
> > > > OMP: Hint: Please submit a bug report with this message, compile and 
> > > > run commands used, and machine configuration info including native 
> > > > compiler and operating system versions. Faster response will be 
> > > > obtained by including all program sources. For information on 
> > > > submitting this issue, please see 
> > > > http://www.intel.com/software/products/support/.
> > > > _pmiu_daemon(SIGCHLD): [NID 05979] [c7-3c0s6n3] [Tue Feb 17 15:14:43 
> > > > 2015] PE RANK 23 exit signal Killed
> > > > _pmiu_daemon(SIGCHLD): [NID 05976] [c7-3c0s6n0] [Tue Feb 17 15:14:43 
> > > > 2015] PE RANK 10 exit signal Killed
> > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: initiated application 
> > > > termination
> > > > [NID 05979] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 239]. Please contact admin for details. 
> > > > Killing pid 18637(xgca)
> > > > [NID 05976] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 73]. Please contact admin for details. 
> > > > Killing pid 15380(xgca)
> > > > [NID 05984] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 34636(xgca)
> > > > [NID 05988] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 59]. Please contact admin for details. 
> > > > Killing pid 38496(xgca)
> > > > [NID 06019] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 11132(xgca)
> > > > [NID 05980] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 8320(xgca)
> > > > [NID 05993] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 46182(xgca)
> > > > [NID 06020] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 249]. Please contact admin for details. 
> > > > Killing pid 23753(xgca)
> > > > [NID 05987] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 87]. Please contact admin for details. 
> > > > Killing pid 11254(xgca)
> > > > [NID 05986] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 41]. Please contact admin for details. 
> > > > Killing pid 6630(xgca)
> > > > [NID 05981] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 31]. Please contact admin for details. 
> > > > Killing pid 10520(xgca)
> > > > [NID 05999] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 7]. Please contact admin for details. 
> > > > Killing pid 1843(xgca)
> > > > [NID 05985] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 26498(xgca)
> > > > [NID 05998] 2015-02-17 15:14:43 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 209]. Please contact admin for details. 
> > > > Killing pid 20387(xgca)
> > > > [NID 05994] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 39462(xgca)
> > > > [NID 05983] 2015-02-17 15:14:53 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 18598(xgca)
> > > > [NID 05995] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 42322(xgca)
> > > > [NID 05996] 2015-02-17 15:14:54 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 34248(xgca)
> > > > [NID 05978] 2015-02-17 15:14:55 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 9483(xgca)
> > > > [NID 05975] 2015-02-17 15:14:56 Apid 10147992: Cray HSN detected 
> > > > critical error 0x4416[ptag 0]. Please contact admin for details. 
> > > > Killing pid 11470(xgca)
> > > > Application 10147992 exit codes: 137
> > > > Application 10147992 exit signals: Killed
> > > > Application 10147992 resources: utime ~2194s, stime ~199s, Rss ~488560, 
> > > > inblocks ~908164, outblocks ~2571652
> > > >
> > >
> > >
> >
> >
> 
> 
