Thanks Barry,
Just to report:
I tried to switch to the proposed smoother by default in our code:
-pc_hypre_boomeramg_relax_type_all l1scaled-SOR/Jacobi
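In our code this boils down to something like the following sketch (the actual call site is elsewhere in our framework and not shown):

  // put the smoother choice in the options database before KSPSetFromOptions() runs
  ierr = PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_relax_type_all", "l1scaled-SOR/Jacobi");CHKERRQ(ierr);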
However, I get some failures, even though I compiled without --with-openmp=1.
[0]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[0]PETSC ERROR: Error in external library
[0]PETSC ERROR: Error in jac->setup(): error code 12
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.14.5, Mar 03, 2021
[0]PETSC ERROR:
/home/mefpp_ericc/GIREF/bin/Test.EstimationGradientHessien.dev on a
named rohan by ericc Wed Mar 17 11:05:23 2021
[0]PETSC ERROR: Configure options
--prefix=/opt/petsc-3.14.5_debug_openmpi-4.1.0 --with-mpi-compilers=1
--with-mpi-dir=/opt/openmpi-4.1.0 --with-cxx-dialect=C++14
--with-make-np=12 --with-shared-libraries=1 --with-debugging=yes
--with-memalign=64 --with-visibility=0 --with-64-bit-indices=0
--download-ml=yes --download-mumps=yes --download-superlu=yes
--download-hpddm=yes --download-slepc=yes --download-superlu_dist=yes
--download-parmetis=yes --download-ptscotch=yes --download-metis=yes
--download-strumpack=yes --download-suitesparse=yes --download-hypre=yes
--with-blaslapack-dir=/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64
--with-mkl_pardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/..
--with-mkl_cpardiso-dir=/opt/intel/oneapi/mkl/2021.1.1/env/..
--with-scalapack=1
--with-scalapack-include=/opt/intel/oneapi/mkl/2021.1.1/env/../include
--with-scalapack-lib="-L/opt/intel/oneapi/mkl/2021.1.1/env/../lib/intel64
-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
[0]PETSC ERROR: #1 PCSetUp_HYPRE() line 408 in
/tmp/petsc-3.14.5-debug/src/ksp/pc/impls/hypre/hypre.c
[0]PETSC ERROR: #2 PCSetUp() line 1009 in
/tmp/petsc-3.14.5-debug/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: #3 KSPSetUp() line 406 in
/tmp/petsc-3.14.5-debug/src/ksp/ksp/interface/itfunc.c
But it seems to happen only in some cases, actually with Hermite elements,
which have a lot of DOFs per vertex... It seems to work well
otherwise, with some differences in the results that I still have to analyze...
Do you think this might be a PETSc bug?
Is the error code from PETSc or from hypre?
(If it is from hypre, I suggest printing "hypre error code: 12" instead...)
Thanks,
Eric
On 2021-03-15 2:50 p.m., Barry Smith wrote:
I posted some information at the issue.
IMHO it is likely a bug in one or more of hypre's smoothers that
use OpenMP. We have never tested them before (and likely hypre has not
tested all the combinations) and so would not have seen the bug.
Hopefully they can just fix it.
Barry
I got the problem to occur with ex56 with 2 MPI ranks and 4 OpenMP
threads; with fewer than 4 threads it did not generate an
indefinite preconditioner.
On Mar 14, 2021, at 1:18 PM, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca
<mailto:eric.chamberl...@giref.ulaval.ca>> wrote:
Done:
https://github.com/hypre-space/hypre/issues/303
Maybe I will need some help on the PETSc side to answer their questions...
Eric
On 2021-03-14 3:44 a.m., Stefano Zampini wrote:
Eric
You should report these HYPRE issues upstream
https://github.com/hypre-space/hypre/issues
<https://github.com/hypre-space/hypre/issues>
On Mar 14, 2021, at 3:44 AM, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca
<mailto:eric.chamberl...@giref.ulaval.ca>> wrote:
For us it clearly creates problems in real computations...
I understand the need to have clean tests for PETSc, but for me, it
reveals that hypre isn't usable with more than one thread for now...
Another solution: force a single-threaded configuration for hypre
until this is fixed?
Eric
On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
-pc_hypre_boomeramg_relax_type_all Jacobi =>
Linear solve did not converge due to DIVERGED_INDEFINITE_PC
iterations 3
-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
OK, independently of the architecture it seems (Eric's Docker image
with 1 or 2 threads, or my macOS), but the contraction factor is higher
Linear solve converged due to CONVERGED_RTOL iterations 8
Linear solve converged due to CONVERGED_RTOL iterations 24
Linear solve converged due to CONVERGED_RTOL iterations 26
vs. currently:
Linear solve converged due to CONVERGED_RTOL iterations 7
Linear solve converged due to CONVERGED_RTOL iterations 9
Linear solve converged due to CONVERGED_RTOL iterations 10
Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?
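The latter would just mean running, for instance:

  OMP_NUM_THREADS=1 /usr/bin/gmake -f gmakefile test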
Thanks,
Pierre
On 13 Mar 2021, at 2:26 PM, Mark Adams <mfad...@lbl.gov
<mailto:mfad...@lbl.gov>> wrote:
Hypre uses a multiplicative smoother by default. It also has a
Chebyshev smoother; that with a Jacobi PC should be thread
invariant.
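(On the PETSc side that should be reachable with something like
-pc_hypre_boomeramg_relax_type_all Chebyshev, if I remember the option
values correctly.)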
Mark
On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <pie...@joliv.et
<mailto:pie...@joliv.et>> wrote:
On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <pie...@joliv.et
<mailto:pie...@joliv.et>> wrote:
Hello Eric,
I’ve made an “interesting” discovery, so I’ll put the list back in CC.
It appears that the following snippet of code, which uses
Allreduce() + a lambda function + MPI_IN_PLACE, is:
- Valgrind-clean with MPICH;
- Valgrind-clean with OpenMPI 4.0.5;
- not Valgrind-clean with OpenMPI 4.1.0.
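For reference, the pattern boils down to something like this minimal,
self-contained sketch (the reduction semantics here are made up for
illustration; the real code is in the attached ompi.cxx):

  #include <mpi.h>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // custom op built from a capture-less lambda: even entries are
    // MAX-combined, odd entries are AND-combined (illustrative only)
    MPI_Op op;
    MPI_Op_create([](void *in, void *inout, int *len, MPI_Datatype *) {
      int *a = static_cast<int *>(in), *b = static_cast<int *>(inout);
      for (int i = 0; i < *len; i += 2) {
        if (a[i] > b[i]) b[i] = a[i];
        b[i + 1] &= a[i + 1];
      }
    }, 1 /* commutative */, &op);
    int buf[2] = {rank, ~rank};
    // the combination at stake: MPI_IN_PLACE + user-defined MPI_Op
    MPI_Allreduce(MPI_IN_PLACE, buf, 2, MPI_INT, op, MPI_COMM_WORLD);
    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
  }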
I’m not sure who is to blame here, I’ll need to look at the
MPI specification for what is required by the implementors
and users in that case.
In the meantime, I’ll do the following:
- update config/BuildSystem/config/packages/OpenMPI.py to
use OpenMPI 4.1.0, see if any other error appears;
- provide a hotfix to bypass the segfaults;
I can confirm that splitting the single Allreduce with my own
MPI_Op into two Allreduces with MAX and BAND fixes the
segfaults with OpenMPI (*); see the sketch after this list.
- look at the hypre issue and whether they should be
deferred to the hypre team.
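The hotfix is essentially to replace the custom-op reduction above by
two built-in reductions, roughly:

  // workaround: two Allreduces with built-in ops instead of one custom MPI_Op
  MPI_Allreduce(MPI_IN_PLACE, &buf[0], 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &buf[1], 1, MPI_INT, MPI_BAND, MPI_COMM_WORLD);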
I don’t know if there is something wrong in hypre’s threading
or if it’s just a side effect of threading, but it seems that
the number of threads has a drastic effect on the quality of
the PC.
By default, it looks like there are two threads per process
with your Docker image.
If I force OMP_NUM_THREADS=1, then I get the same convergence
as in the output file.
Thanks,
Pierre
(*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
<https://gitlab.com/petsc/petsc/-/merge_requests/3712>
Thank you for the Docker files; they were really useful.
If you want to avoid oversubscription failures, you can edit
the file /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and
append the line:
localhost slots=12
If you want to increase the timeout limit of the PETSc test
suite for each test, you can add the extra flag TIMEOUT=180 to
your command line (the default is 60; units are seconds).
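For example, with the same gmake invocation as in the log quoted below:

  /usr/bin/gmake -f gmakefile test TIMEOUT=180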
Thanks, I’ll ping you on GitLab when I’ve got something
ready for you to try,
Pierre
<ompi.cxx>
On 12 Mar 2021, at 8:54 PM, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca
<mailto:eric.chamberl...@giref.ulaval.ca>> wrote:
Hi Pierre,
I now have a docker container reproducing the problems here.
Actually, if I look at
snes_tutorials-ex12_quad_singular_hpddm, it fails like this:
not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
# Initial guess
# L_2 Error: 0.00803099
# Initial Residual
# L_2 Residual: 1.09057
# Au - b = Au + F(0)
# Linear L_2 Residual: 1.09057
# [d470c54ce086:14127] Read -1, expected 4096, errno = 1
# [d470c54ce086:14128] Read -1, expected 4096, errno = 1
# [d470c54ce086:14129] Read -1, expected 4096, errno = 1
# [3]PETSC ERROR:
------------------------------------------------------------------------
# [3]PETSC ERROR: Caught signal number 11 SEGV:
Segmentation Violation, probably memory access out of range
# [3]PETSC ERROR: Try option -start_in_debugger or
-on_error_attach_debugger
# [3]PETSC ERROR: or see
https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
<https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind>
# [3]PETSC ERROR: or try http://valgrind.org
<http://valgrind.org/> on GNU/linux and Apple Mac OS X to
find memory corruption errors
# [3]PETSC ERROR: likely location of problem given in stack
below
# [3]PETSC ERROR: --------------------- Stack Frames
------------------------------------
# [3]PETSC ERROR: Note: The EXACT line numbers in the stack
are not available,
# [3]PETSC ERROR: INSTEAD the line number of the start of
the function
# [3]PETSC ERROR: is given.
# [3]PETSC ERROR: [3] buildTwo line 987
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: [3] next line 1130
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
# [3]PETSC ERROR: Signal received
# [3]PETSC ERROR: [0]PETSC ERROR:
------------------------------------------------------------------------
ex12_quad_hpddm_reuse_baij also fails, with a lot more "Read
-1, expected ..." messages, and I don't know where those come from...
Hypre (as in diff-snes_tutorials-ex56_hypre) is also
giving DIVERGED_INDEFINITE_PC failures...
Please see the 3 attached Docker files:
1) fedora_mkl_and_devtools: the Dockerfile which installs
Fedora 33 with the GNU compilers, MKL, and everything needed for development.
2) openmpi: the Dockerfile to build OpenMPI.
3) petsc: the last Dockerfile, which builds, installs, and tests PETSc.
I build the 3 like this:
docker build -t fedora_mkl_and_devtools -f
fedora_mkl_and_devtools .
docker build -t openmpi -f openmpi .
docker build -t petsc -f petsc .
Disclaimer: I am not a docker expert, so I may do things
that are not docker state-of-the-art, but I am open to
suggestions... ;)
I have just run it on my laptop (it took a long time), which does not
have enough cores, so many more tests failed (I should force
--oversubscribe but I don't know how). I will relaunch on
my workstation in a few minutes.
I will now test your branch! (sorry for the delay).
Thanks,
Eric
On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
Hi Pierre,
ok, that's interesting!
I will try to build a docker image by tomorrow and give
you the exact recipe to reproduce the bugs.
Eric
On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
On 11 Mar 2021, at 6:16 AM, Barry Smith
<bsm...@petsc.dev <mailto:bsm...@petsc.dev>> wrote:
Eric,
Sorry for not being more prompt. We still have
this in our active email, so you don't need to submit
individual issues. We'll try to get to them as soon as
we can.
Indeed, I’m still trying to figure this out.
I realized that some of my configure flags were different
from yours, e.g., no --with-memalign.
I’ve also added SuperLU_DIST to my installation.
Still, I can’t reproduce any issue.
I will continue looking into this; it appears I’m seeing
some Valgrind errors, but I don’t know if this is some
side effect of OpenMPI not being Valgrind-clean (last
time I checked, there was no error with MPICH).
Thank you for your patience,
Pierre
/usr/bin/gmake -f gmakefile test test-fail=1
Using MAKEFLAGS: test-fail=1
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
ok snes_tutorials-ex12_quad_hpddm_reuse_baij
ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
ok ksp_ksp_tests-ex33_superlu_dist_2
ok diff-ksp_ksp_tests-ex33_superlu_dist_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
ok ksp_ksp_tutorials-ex50_tut_2
ok diff-ksp_ksp_tutorials-ex50_tut_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
ok ksp_ksp_tests-ex33_superlu_dist
ok diff-ksp_ksp_tests-ex33_superlu_dist
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
ok snes_tutorials-ex56_hypre
ok diff-snes_tutorials-ex56_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
ok ksp_ksp_tutorials-ex56_2
ok diff-ksp_ksp_tutorials-ex56_2
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
ok snes_tutorials-ex17_3d_q3_trig_elas
ok diff-snes_tutorials-ex17_3d_q3_trig_elas
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
#srun: error: Unable to create step for job 1426755: More
processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran
required for this test
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
ok snes_tutorials-ex19_tut_3
ok diff-snes_tutorials-ex19_tut_3
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
ok snes_tutorials-ex17_3d_q3_trig_vlap
ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran
required for this test
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
ok snes_tutorials-ex19_superlu_dist
ok diff-snes_tutorials-ex19_superlu_dist
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
ok
snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
ok
diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
ok ksp_ksp_tutorials-ex49_hypre_nullspace
ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
ok snes_tutorials-ex19_superlu_dist_2
ok diff-snes_tutorials-ex19_superlu_dist_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
#srun: error: Unable to create step for job 1426755: More
processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
ok
snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
ok
diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
ok ksp_ksp_tutorials-ex64_1
ok diff-ksp_ksp_tutorials-ex64_1
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
#srun: error: Unable to create step for job 1426755: More
processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran
required for this test
Barry
On Mar 10, 2021, at 11:03 PM, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca
<mailto:eric.chamberl...@giref.ulaval.ca>> wrote:
Barry,
to get some follow-up on the --with-openmp=1 failures,
shall I open GitLab issues for:
a) all hypre failures giving DIVERGED_INDEFINITE_PC
b) all superlu_dist failures giving different results
with initia and "Exceeded timeout limit of 60 s"
c) hpddm failures "free(): invalid next size (fast)"
and "Segmentation Violation"
d) all of TAO's "Exceeded timeout limit of 60 s"
I don't see how I could do all this debugging by myself...
Thanks,
Eric
<fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42