Re: [petsc-dev] Petsc "make test" have more failures for --with-openmp=1

Eric Chamberland Tue, 02 Mar 2021 20:39:03 -0800


On 2021-03-02 10:59 p.m., Barry Smith wrote:

It could be related to MKL, but it could also be due to problems with Scalapack when used with OpenMP. Do you need Scalapack? Maybe you want to use it since it is used by MUMPS?

Yes, exactly for mumps!

#2: I can deal with that! :)
#3: I am not sure if this output is due to the way I configure OpenMPI/3.x:
$ ./configure --prefix=/opt/openmpi-3.x_debug --enable-debug --enable-picky CXXFLAGS=-std=c++14 --with-wrapper-cxxflags=-std=c++14 --with-cma
or this export:

export OMPI_MCA_plm_base_verbose=5
which I left there to track an intermittent bug at singleton startup (https://www.mail-archive.com/devel@lists.open-mpi.org/msg19568.html).
I will remove this now; 4 years later, it does not happen anymore... But I don't think it should be harmful for PETSc tests, should it?
I am guessing that this is just informational output that does not indicate a problem. But I am confused why it appears only occasionally; presumably it is related to the current state of the system.
But the PETSc tests have no way of knowing whether this type of output to stdout or stderr is "harmless" informational output or an indication of something being seriously broken. One way make test checks PETSc tests is to process the output and look for anything that is not "normal", and this is definitely not normal. I think you need to turn off the verbosity when running PETSc tests, and then hopefully this particular problem will go away.
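One way to turn the verbosity off for a test run without touching the login environment is to launch the tests with a scrubbed copy of the environment. The following is only a sketch: `quiet_env` is a hypothetical helper, and the pattern assumes that every variable shaped like `OMPI_MCA_<name>_verbose` controls diagnostic output only.

```python
import os
import re

# Assumption: variables named OMPI_MCA_<something>_verbose only control
# diagnostic chatter, so dropping them is safe for a test run.
VERBOSE_VAR = re.compile(r"^OMPI_MCA_.*_verbose$")


def quiet_env(env=None):
    """Return a copy of env without Open MPI verbosity variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if not VERBOSE_VAR.match(k)}


# Possible use (not executed here):
#   import subprocess
#   subprocess.run(["make", "test"], env=quiet_env(), check=False)
```

The original mapping is left untouched, so the verbose setting stays available for debugging sessions outside the test run.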
  ok tao_constrained_tutorials-toyf_1
not ok diff-tao_constrained_tutorials-toyf_1 # Error code: 1
#       0a1,17
#       > [zorg:09243] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
#       > [zorg:09243] plm:base:set_hnp_name: initial bias 9243 nodename hash 810220270
#       > [zorg:09243] plm:base:set_hnp_name: final jobfam 61119
#       > [zorg:09243] [[61119,0],0] plm:rsh_setup on agent ssh : rsh path NULL
#       > [zorg:09243] [[61119,0],0] plm:base:receive start comm
#       > [zorg:09243] [[61119,0],0] plm:base:setup_job
#       > [zorg:09243] [[61119,0],0] plm:base:setup_vm
#       > [zorg:09243] [[61119,0],0] plm:base:setup_vm creating map
#       > [zorg:09243] [[61119,0],0] setup:vm: working unmanaged allocation
#       > [zorg:09243] [[61119,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
#       > [zorg:09243] [[61119,0],0] plm:base:setup_vm only HNP in allocation
#       > [zorg:09243] [[61119,0],0] plm:base:setting slots for node zorg by cores
#       > [zorg:09243] [[61119,0],0] complete_setup on job [61119,1]
#       > [zorg:09243] [[61119,0],0] plm:base:launch_apps for job [61119,1]
#       > [zorg:09243] [[61119,0],0] plm:base:launch wiring up iof for job [61119,1]
#       > [zorg:09243] [[61119,0],0] plm:base:launch job [61119,1] is not a dynamic spawn
#       > [zorg:09243] [[61119,0],0] plm:base:launch [61119,1] registered
#       58a76,77
#       > [zorg:09243] [[61119,0],0] plm:base:orted_cmd sending orted_exit commands
#       > [zorg:09243] [[61119,0],0] plm:base:receive stop comm

Understood: here we use a workaround to filter these out for the stdout/stderr comparison we do in nightly tests.
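A filter of that kind could look roughly like the sketch below. It is not the workaround actually used in the nightly tests (that code is not shown in this thread); the function name and the regular expression are assumptions derived only from the line shapes visible in the diff above: a `[host:pid]` prefix followed either by a `[[jobfam,vpid],rank]` tag or by a `plm:` message.

```python
import re

# Assumed shape of Open MPI launcher chatter, based on the log excerpt:
#   [zorg:09243] [[61119,0],0] plm:base:setup_job
#   [zorg:09243] plm:base:set_hnp_name: final jobfam 61119
LAUNCHER_NOISE = re.compile(
    r"^\[\w[\w.-]*:\d+\] "
    r"(?:\[\[(?:\d+,\d+|INVALID)\],(?:\d+|INVALID)\] |plm:)"
)


def filter_launcher_noise(lines):
    """Drop lines that look like Open MPI launcher diagnostics."""
    return [line for line in lines if not LAUNCHER_NOISE.match(line)]
```

Applied to captured stdout/stderr before the comparison step, this leaves the program's own output (including `[0]PETSC ERROR:` lines, which do not match the prefix) in place.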

#4: I do have a default choice for L2 projections which uses HYPRE BoomerAMG preconditioning... that now ends with KSP_DIVERGED_INDEFINITE_PC, so this is definitely a problem...
#5: we do sometimes use SuperLU_DIST
I'm afraid that for the possible MUMPS, SuperLU_DIST and hypre problems you need to debug them one at a time by running the particular troublesome example in the debugger to determine the problem. It could also be due to the relationship between the MKL and the OpenMP implementation. I don't know exactly how MKL's multi-threaded code runs in relation to OpenMP, and certainly if the compiler is providing a different OpenMP than MKL is using it will not work.


Do the people who maintain all these libraries read petsc-dev? ;)

Under most circumstances if you are using MKL with threading and PETSc you likely only want to use one MKL thread since PETSc already handles the maximum parallelism with MPI and there are no "extra" processors available to parallelize the BLAS/LAPACK called from PETSc for more performance inside the MKL. This may not be true if you are using MUMPS which makes things far more complicated.
OpenMP is complicated in the context of PETSc and several external packages because different packages may use it in different ways that require different tuning and I won't know the tuning for each.
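As a rough illustration of the "one MKL thread" advice, the thread counts can be pinned through the environment passed to the launched processes. `MKL_NUM_THREADS` and `OMP_NUM_THREADS` are the usual MKL and OpenMP variables; the helper name is hypothetical, and the value 1 is only the common pure-MPI choice, which, as noted above, may not suit a MUMPS run.

```python
import os


def single_thread_env(env=None):
    """Copy env and pin MKL and OpenMP to one thread per MPI rank."""
    result = dict(os.environ if env is None else env)
    result["MKL_NUM_THREADS"] = "1"  # MKL's own thread-count variable
    result["OMP_NUM_THREADS"] = "1"  # standard OpenMP thread-count variable
    return result


# Possible use (not executed here):
#   import subprocess
#   subprocess.run(["make", "test"], env=single_thread_env(), check=False)
```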

OK: we do not use OpenMP either... we all rely on MPI for parallelism too... So if I could just compile for CUDA without it, I would be happy....

But do you think it could be turned on/off only for specific packages at configuration time? Given the bugs encountered, it is not attractive to activate it for all external packages...


regards,

Eric


  Barry

#6: we are starting to look at DD solvers like HPDDM...

So my killer question is: given the amount of work needed to have all these external packages fixed, is it possible to activate OpenMP only for the CUDA part?


Thanks,

Eric

On 2021-03-02 3:47 p.m., Barry Smith wrote:


  Eric,

    Thanks for the detailed information.

    I have cc:ed Pierre so he can look at the HPDDM failures.

On Mar 2, 2021, at 2:14 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
Hi,
It all started when I wanted to test PETSC/CUDA compatibility for our code.
I had to activate --with-openmp to configure with --with-cuda=1 successfully.

Certain packages like SuperLU_DIST require --with-openmp if using --with-cuda=1 but PETSc's own use of CUDA as well as some other packages do not require the --with-openmp.

I then saw that PETSC_HAVE_OPENMP is used at least in MUMPS (and some other places).
So, I configured and tested petsc with openmp activated, without CUDA.
The first thing I see is that our code CI pipelines now fail for many tests.
After looking deeper, it seems that PETSc itself fails many tests when I activate openmp!
Here are all the configurations I have results for, after/before activating OpenMP for PETSc:


There seem to be several distinct issues

1) failures inside Scalapack.

2) possibly slightly different convergence rates for some examples changing the number of iterations slightly in PETSc.

3) trouble initializing something outside of PETSc, almost for sure not related to PETSc


[zorg:08517] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
#       [zorg:08517] plm:base:set_hnp_name: initial bias 8517 nodename hash 810220270
#       [zorg:08517] plm:base:set_hnp_name: final jobfam 60385
#       [zorg:08517] [[60385,0],0] plm:rsh_setup on agent ssh : rsh path NULL
#       [zorg:08517] [[60385,0],0] plm:base:receive start comm
#       [zorg:08517] [[60385,0],0] plm:base:setup_job
#       [zorg:08517] [[60385,0],0] plm:base:setup_vm

4) problem with a hypre run: Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3, again not likely a PETSc issue but a hypre and OpenMP issue

5) Different results for inertia inside an external package
#       1c1
#       <  MatInertia: nneg: 17, nzero: 0, npos: 83
#       ---
#       >  MatInertia: nneg: 21, nzero: 0, npos: 79
         TEST arch-linux-c-debug/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
  ok ksp_ksp_tests-ex33_superlu_dist_2
not ok diff-ksp_ksp_tests-ex33_superlu_dist_2 # Error code: 1
#       1c1
#       <  MatInertia: nneg: 17, nzero: 0, npos: 83
#       ---
#       >  MatInertia: nneg: 25, nzero: 0, npos: 75


6) problems with the external package hpddm

not ok snes_tutorials-ex12_quad_hpddm_reuse_baij # Error code: 139
#         0 SNES Function norm 21.3344
#       [0]PETSC ERROR: ------------------------------------------------------------------------
#       [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
#       [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
#       [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
#       [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
#       [0]PETSC ERROR: likely location of problem given in stack below
#       [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
#       [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
#       [0]PETSC ERROR:       INSTEAD the line number of the start of the function
#       [0]PETSC ERROR:       is given.
#       [0]PETSC ERROR: [0] constructionMatrix line 313 /opt/petsc-main_debug/include/HPDDM_coarse_operator_impl.hpp
#       [0]PETSC ERROR: [0] construction line 256 /opt/petsc-main_debug/include/HPDDM_coarse_operator_impl.hpp
#       [0]PETSC ERROR: [0] buildTwo line 987 /opt/petsc-main_debug/include/HPDDM_schwarz.hpp
#       [0]PETSC ERROR: [0] next line 1130 /opt/petsc-main_debug/include/HPDDM_schwarz.hpp
#       [0]PETSC ERROR: [0] PCSetUp_HPDDM line 746 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/pc/impls/hpddm/hpddm.cxx
#       [0]PETSC ERROR: [0] PCSetUp line 974 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/pc/interface/precon.c
#       [0]PETSC ERROR: [0] KSPSetUp line 319 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
#       [0]PETSC ERROR: [0] KSPSolve_Private line 808 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
#       [0]PETSC ERROR: [0] KSPSolve line 1080 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/ksp/ksp/interface/itfunc.c
#       [0]PETSC ERROR: [0] SNESSolve_NEWTONLS line 144 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/snes/impls/ls/ls.c
#       [0]PETSC ERROR: [0] SNESSolve line 4533 /pmi/cmpbib/compilation_BIB_gcc_redhat_petsc-master_debug/COMPILE_AUTO/petsc-main-debug/src/snes/interface/snes.c
#       [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
#       [0]PETSC ERROR: Signal received
#       [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.

PETSc itself does not use OpenMP so turning on OpenMP for pure PETSc should generate no errors except possibly small changes in iteration rates due to the different way the floating point operations in MKL are done.

We don't see much use for OpenMP so rarely turn it on. What is your end goal, to use PETSc on CUDA (for which you can keep OpenMP off) or something else?



  Barry

==============================================================================

==============================================================================

For petsc/master + OpenMPI 4.0.4 + MKL 2019.4.243:

With OpenMP=1

https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.03.02.02h00m02s_make_test.log

https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.03.02.02h00m02s_configure.log

# -------------
#   Summary
# -------------
# FAILED snes_tutorials-ex12_quad_hpddm_reuse_baij 
diff-ksp_ksp_tests-ex33_superlu_dist_2 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0 
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1 
ksp_ksp_tutorials-ex50_tut_2 diff-ksp_ksp_tests-ex33_superlu_dist 
diff-snes_tutorials-ex56_hypre snes_tutorials-ex17_3d_q3_trig_elas 
snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij 
ksp_ksp_tutorials-ex5_superlu_dist_3 ksp_ksp_tutorials-ex5f_superlu_dist 
snes_tutorials-ex12_tri_parmetis_hpddm_baij diff-snes_tutorials-ex19_tut_3 
mat_tests-ex242_3 snes_tutorials-ex17_3d_q3_trig_vlap 
ksp_ksp_tutorials-ex5f_superlu_dist_3 snes_tutorials-ex19_superlu_dist 
diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre 
diff-ksp_ksp_tutorials-ex49_hypre_nullspace ts_tutorials-ex18_p1p1_xper_ref 
ts_tutorials-ex18_p1p1_xyper_ref snes_tutorials-ex19_superlu_dist_2 
ksp_ksp_tutorials-ex5_superlu_dist_2 
diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre 
ksp_ksp_tutorials-ex64_1 ksp_ksp_tutorials-ex5_superlu_dist 
ksp_ksp_tutorials-ex5f_superlu_dist_2
# success 8275/10003 tests (82.7%)
#*failed 33/10003*  tests (0.3%)

With OpenMP=0

https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.02.26.02h00m16s_make_test.log

https://giref.ulaval.ca/~cmpgiref/petsc-master-debug/2021.02.26.02h00m16s_configure.log

# -------------
#   Summary
# -------------
# FAILED tao_constrained_tutorials-tomographyADMM_6 
snes_tutorials-ex17_3d_q3_trig_elas mat_tests-ex242_3 
snes_tutorials-ex17_3d_q3_trig_vlap tao_leastsquares_tutorials-tomography_1 
tao_constrained_tutorials-tomographyADMM_5
# success 8262/9983 tests (82.8%)
#*failed 6/9983*  tests (0.1%)

==============================================================================

==============================================================================

For OpenMPI 3.1.x/master:

With OpenMP=1:

https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.03.01.22h00m01s_make_test.log

https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.03.01.22h00m01s_configure.log

# -------------
#   Summary
# -------------
# FAILED mat_tests-ex242_3 mat_tests-ex242_2 diff-mat_tests-ex219f_1 
diff-dm_tutorials-ex11f90_1 ksp_ksp_tutorials-ex5_superlu_dist_3 
diff-ksp_ksp_tutorials-ex49_hypre_nullspace 
ksp_ksp_tutorials-ex5f_superlu_dist_3 snes_tutorials-ex17_3d_q3_trig_vlap 
diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre 
diff-snes_tutorials-ex19_tut_3 diff-snes_tutorials-ex56_hypre 
diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre 
tao_leastsquares_tutorials-tomography_1 
tao_constrained_tutorials-tomographyADMM_4 
tao_constrained_tutorials-tomographyADMM_6 diff-tao_constrained_tutorials-toyf_1
# success 8142/9765 tests (83.4%)
#*failed 16/9765*  tests (0.2%)

With OpenMP=0:

https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.02.28.22h00m02s_make_test.log

https://giref.ulaval.ca/~cmpgiref/ompi_3.x/2021.02.28.22h00m02s_configure.log

# -------------
#   Summary
# -------------
# FAILED mat_tests-ex242_3 mat_tests-ex242_2 diff-mat_tests-ex219f_1 
diff-dm_tutorials-ex11f90_1 ksp_ksp_tutorials-ex56_2 
snes_tutorials-ex17_3d_q3_trig_vlap tao_leastsquares_tutorials-tomography_1 
tao_constrained_tutorials-tomographyADMM_4 diff-tao_constrained_tutorials-toyf_1
# success 8151/9767 tests (83.5%)
#*failed 9/9767*  tests (0.1%)

==============================================================================

==============================================================================

For OpenMPI 4.0.x/master:

With OpenMP=1:

https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.03.01.20h00m01s_make_test.log

https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.03.01.20h00m01s_configure.log

# FAILED snes_tutorials-ex17_3d_q3_trig_elas snes_tutorials-ex19_hypre 
ksp_ksp_tutorials-ex56_2 tao_leastsquares_tutorials-tomography_1 
tao_constrained_tutorials-tomographyADMM_5 mat_tests-ex242_3 
ksp_ksp_tutorials-ex55_hypre ksp_ksp_tutorials-ex5_superlu_dist_2 
tao_constrained_tutorials-tomographyADMM_6 snes_tutorials-ex56_hypre 
snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre 
ksp_ksp_tutorials-ex5f_superlu_dist_3 ksp_ksp_tutorials-ex34_hyprestruct 
diff-ksp_ksp_tutorials-ex49_hypre_nullspace 
snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre 
ksp_ksp_tutorials-ex5f_superlu_dist ksp_ksp_tutorials-ex5f_superlu_dist_2 
ksp_ksp_tutorials-ex5_superlu_dist snes_tutorials-ex19_tut_3 
snes_tutorials-ex19_superlu_dist ksp_ksp_tutorials-ex50_tut_2 
snes_tutorials-ex17_3d_q3_trig_vlap ksp_ksp_tutorials-ex5_superlu_dist_3 
snes_tutorials-ex19_superlu_dist_2 tao_constrained_tutorials-tomographyADMM_4 
ts_tutorials-ex26_2
# success 8125/9753 tests (83.3%)
#*failed 26/9753*  tests (0.3%)

With OpenMP=0

https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.02.28.20h00m04s_make_test.log

https://giref.ulaval.ca/~cmpgiref/ompi_4.x/2021.02.28.20h00m04s_configure.log

# FAILED mat_tests-ex242_3
# success 8174/9777 tests (83.6%)
#*failed 1/9777*  tests (0.0%)

==============================================================================

==============================================================================
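To see which failures appear only once OpenMP is enabled, the FAILED lists of each pair of summaries above can be compared as sets. The sketch below assumes the summary layout shown in these logs (a `# FAILED` line, possibly wrapped onto continuation lines, ended by the next line starting with `#`); the function names are made up for illustration.

```python
def failed_tests(summary_text):
    """Collect test names from the '# FAILED ...' block of a make test summary."""
    names = []
    in_failed = False
    for line in summary_text.splitlines():
        line = line.strip()
        if line.startswith("# FAILED"):
            in_failed = True
            line = line[len("# FAILED"):]
        elif line.startswith("#"):
            # Any other '#' line (e.g. '# success ...') ends the FAILED block.
            in_failed = False
        if in_failed:
            names.extend(line.split())
    return set(names)


def new_failures(with_openmp, without_openmp):
    """Tests that fail with OpenMP=1 but not with OpenMP=0, sorted by name."""
    return sorted(failed_tests(with_openmp) - failed_tests(without_openmp))
```

Note that `diff-<name>` and `<name>` are treated as different entries, matching how the summary itself lists them.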

Is that known and normal?

In all cases, I am using MKL and I suspect it may come from there... :/

I also saw a second problem, "make test" fails to compile petsc examples on older versions of MKL (but that's less important for me, I just upgraded to OneAPI to avoid this, but you may want to know):


https://giref.ulaval.ca/~cmpgiref/dernier_ompi/2021.03.02.02h16m01s_make_test.log

https://giref.ulaval.ca/~cmpgiref/dernier_ompi/2021.03.02.02h16m01s_configure.log

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42

