Dear Patrick Sanan,

Thank you very much for your answer, and especially for your code. I was able to compile and run your code on 8 nodes with 20 processes per node. Below is the result:
    Testing with 160 MPI ranks reducing an array of size 32 (256 bytes)
    Running 5 burnin runs and 100 tests ... Done.
    For 100 runs with 5 burnin runs, on 160 MPI processes, min/max times over all ranks:
    MPI timer resolution:          1.0000e-06 seconds
    MPI timer resolution/#trials:  1.0000e-08 seconds
    B. Red. Only   (min/max): 8.850098e-06 / 8.890629e-06 seconds
    N.B. Red. Only (min/max): 1.725912e-05 / 1.733065e-05 seconds
    Loc. Only      (min/max): 2.364278e-04 / 2.374697e-04 seconds
    Blocking       (min/max): 2.650309e-04 / 2.650595e-04 seconds
    Non-Blocking   (min/max): 2.673984e-04 / 2.674508e-04 seconds
    Observe to see if the local time is enough to hide the reduction, and see if the reduction is indeed hidden

It appears that with this test the non-blocking computation is no faster than the blocking computation. I think I am missing some suitable Intel MPI environment settings. I am now thinking about using MPICH, which does not require any special environment settings for non-blocking collectives. Could you please let me know which MPI implementation (MPICH or Open MPI) you used in your tests?

Thanks again,
Viet

On Mon, Jan 25, 2021 at 7:47 PM Patrick Sanan <patrick.sa...@gmail.com> wrote:

> Sorry about the delay in responding, but I'll add a couple of points here:
>
> 1) It's important to have some reason to believe that pipelining will
> actually help your problem. Pipelined Krylov methods work by overlapping
> reductions with operator and preconditioner applications. So, to see
> speedup, the time for a reduction needs to be comparable to the time for
> the operator/preconditioner application. This will only be true in some
> cases - typical cases are when you have a large number of ranks/nodes, a
> slow network, or very fast operator/preconditioner applications (assuming
> that these require the same time on each rank - it's an interesting case
> when they don't, but unless you say otherwise I'll assume this doesn't
> apply to your use case).
>
> 2) As you're discovering, simply ensuring that asynchronous progress
> works at the pure MPI level isn't as easy as it might be, as it's so
> dependent on the MPI implementation.
>
> For both of these reasons, I suggest setting up a test that just directly
> uses MPI (which you can of course do from a PETSc-style code) and allows
> you to compare times for blocking and non-blocking reductions, overlapping
> some (useless) local work. You should make sure to run multiple iterations
> within the script, and also run the script multiple times on the cluster
> (bearing in mind that the performance may be affected by other users of
> the system).
>
> I attach an old script I found that I used to test some of these things,
> to give a more concrete idea of what I mean. Note that this was used early
> on in our own exploration of these topics, so I'm only offering it to give
> an idea, not as a meaningful benchmark in its own right.
>
> On 25.01.2021 at 09:17, Viet H.Q.H. <hqhv...@tohoku.ac.jp> wrote:
>
> Dear Barry,
>
> Thank you very much for your information.
>
> It seems complicated to set the environment variables that enable
> asynchronous progress and pin threads to cores when using Intel MPI:
>
>     $ export I_MPI_ASYNC_PROGRESS=1
>     $ export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>
>
> https://techdecoded.intel.io/resources/hiding-communication-latency-using-mpi-3-non-blocking-collectives/
>
> I'm still not sure how to get an appropriate "CPU list" when running MPI
> with multiple nodes and multiple processes on one node.
>
> Best,
> Viet.
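For concreteness, a minimal pure-MPI sketch of the kind of test Patrick describes might look like the following. This is not his attached script, just an illustration of the same idea: time a blocking MPI_Allreduce followed by local work against an MPI_Iallreduce overlapped with the same work. The names busywork, N, NWORK, NBURN, and NTRIAL are arbitrary placeholders, not tuned values; with Intel MPI one would export I_MPI_ASYNC_PROGRESS=1 (and optionally I_MPI_ASYNC_PROGRESS_PIN) before launching.

/* Sketch: compare a blocking reduction + local work against a
 * non-blocking reduction overlapped with the same local work.
 * Assumes an MPI-3 implementation; sizes are placeholders. */
#include <mpi.h>
#include <stdio.h>

#define N      32       /* reduction size, as in the test above */
#define NWORK  2000000  /* arbitrary amount of useless local work */
#define NBURN  5        /* untimed burn-in iterations */
#define NTRIAL 100

static double busywork(const double *w, long n)
{
  double s = 0.0;
  for (long i = 0; i < n; ++i) s += w[i % 1024] * 1.0000001;
  return s;
}

int main(int argc, char **argv)
{
  double in[N], out[N], w[1024], sink = 0.0;
  double t0, t_block = 0.0, t_nonblock = 0.0;
  int    rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (int i = 0; i < N; ++i)    in[i] = (double)rank + i;
  for (int i = 0; i < 1024; ++i) w[i]  = 1.0 + 1e-9 * i;

  for (int t = -NBURN; t < NTRIAL; ++t) {
    /* Blocking: reduction, then local work, strictly in sequence */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    sink += busywork(w, NWORK);
    if (t >= 0) t_block += MPI_Wtime() - t0;

    /* Non-blocking: start the reduction, do the same local work,
       then wait; if asynchronous progress works, the reduction
       should be (at least partly) hidden behind the local work */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Request req;
    MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    sink += busywork(w, NWORK);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (t >= 0) t_nonblock += MPI_Wtime() - t0;
  }

  if (rank == 0)
    printf("avg blocking: %e s, avg non-blocking: %e s (checksum %g)\n",
           t_block / NTRIAL, t_nonblock / NTRIAL, sink);
  MPI_Finalize();
  return 0;
}

If the non-blocking average is no better than the blocking one even though the "N.B. Red. Only" time is small relative to the local work, that points at missing asynchronous progress in the MPI implementation rather than at the code.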
On Sat, Jan 23, 2021 at 3:01 AM Barry Smith <bsm...@petsc.dev> wrote:

>>
>> https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html
>>
>> It states "and a partial support for non-blocking collectives (MPI_Ibcast,
>> MPI_Ireduce, and MPI_Iallreduce)." I do not know what partial support
>> means, but you can try setting the variables and see if that helps.
>>
>> On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <hqhv...@tohoku.ac.jp> wrote:
>>
>> Dear Victor and Barry,
>>
>> Thank you so much for your answers.
>>
>> I fixed the bug in the call to PetscCommSplitReductionBegin, as
>> commented on by Barry:
>>
>>     ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));
>>
>> It was also a mistake to set the vector size too small. I set a vector
>> size of 100000000 and ran the code on 4 nodes with 2 processes per node.
>> The result is as follows:
>>
>>     The time used for the asynchronous calculation: 0.022043
>>     + |u| = 10000.
>>     The time used for the synchronous calculation: 0.016188
>>     + |b| = 10000.
>>
>> Asynchronous computation still takes longer.
>>
>> I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the file
>> $PETSC_DIR/include/petscconf.h.
>>
>> I built PETSc using the following script:
>>
>> #!/usr/bin/bash
>> set -e
>> DATE="21.01.18"
>> MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"
>> MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"
>> INSTL_DIR="${HOME}/local/petsc-3.14.3"
>> BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"
>> PETSC_DIR="${HOME}/tmp/petsc"
>>
>> cd ${PETSC_DIR}
>> ./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR} \
>>   --with-fortran-bindings=0 --with-mpiexec=${MPIIT_DIR}/bin/mpiexec \
>>   --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} \
>>   --download-make --with-debugging=0 \
>>   COPTFLAGS='-O3 -march=native -mtune=native' \
>>   CXXOPTFLAGS='-O3 -march=native -mtune=native' \
>>   FOPTFLAGS='-O3 -march=native -mtune=native'
>>
>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all
>> make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install
>>
>> Intel MPI 2018 also complies with the MPI-3 standard.
>>
>> Are there specific settings for Intel MPI to obtain the performance of
>> the MPI_IALLREDUCE function?
>>
>> Sincerely,
>> Viet.
>>
>> On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>   ierr = VecNormBegin(u,NORM_2,&norm1);
>>>   ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>>
>>> How come you call this on Ax and not on u? For clarity, if nothing
>>> else, I think you should call it on u.
>>>
>>> comb.c has
>>>
>>> /*
>>>      Split phase global vector reductions with support for combining the
>>>    communication portion of several operations. Using MPI-1.1 support only
>>>
>>>      The idea for this and much of the initial code is contributed by
>>>    Victor Eijkhout.
>>>
>>>    Usage:
>>>      VecDotBegin(Vec,Vec,PetscScalar *);
>>>      VecNormBegin(Vec,NormType,PetscReal *);
>>>      ....
>>>      VecDotEnd(Vec,Vec,PetscScalar *);
>>>      VecNormEnd(Vec,NormType,PetscReal *);
>>>
>>>    Limitations:
>>>     - The order of the xxxEnd() functions MUST be in the same order
>>>       as the xxxBegin().
>>>       There is extensive error checking to try to insure that the
>>>       user calls the routines in the correct order
>>> */
>>>
>>> #include <petsc/private/vecimpl.h>    /*I "petscvec.h" I*/
>>>
>>> static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)
>>> {
>>>   PETSC_UNUSED PetscErrorCode ierr;
>>>
>>>   PetscFunctionBegin;
>>> #if defined(PETSC_HAVE_MPI_IALLREDUCE)
>>>   ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);
>>> #elif defined(PETSC_HAVE_MPIX_IALLREDUCE)
>>>   ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);
>>> #else
>>>   ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);
>>>   *request = MPI_REQUEST_NULL;
>>> #endif
>>>   PetscFunctionReturn(0);
>>> }
>>>
>>> So first check whether $PETSC_DIR/include/petscconf.h has
>>>
>>>   PETSC_HAVE_MPI_IALLREDUCE
>>>
>>> If it does not, then the standard blocking MPI reduce is called.
>>>
>>> If it is set, then any improvement depends on the implementation of
>>> Iallreduce inside the MPI you are using.
>>>
>>> Barry
>>>
>>> On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. <hqhv...@tohoku.ac.jp> wrote:
>>>
>>> Hello PETSc developers and supporters,
>>>
>>> I would like to confirm the performance of asynchronous computation,
>>> overlapping an inner-product computation with a matrix-vector
>>> multiplication, using the code below.
>>>
>>> PetscLogDouble tt1,tt2;
>>> KSP ksp;
>>> //ierr = VecSet(c,one);
>>> ierr = VecSet(c,one);
>>> ierr = VecSet(u,one);
>>> ierr = VecSet(b,one);
>>>
>>> ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
>>> ierr = KSP_MatMult(ksp,A,x,Ax);CHKERRQ(ierr);
>>>
>>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>> ierr = VecNormBegin(u,NORM_2,&norm1);
>>> ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax));
>>> ierr = KSP_MatMult(ksp,A,c,Ac);
>>> ierr = VecNormEnd(u,NORM_2,&norm1);
>>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>
>>> ierr = PetscPrintf(PETSC_COMM_WORLD,"The time used for the asynchronous calculation: %f\n",tt2-tt1);CHKERRQ(ierr);
>>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| = %g\n",(double)norm1);CHKERRQ(ierr);
>>>
>>> ierr = PetscTime(&tt1);CHKERRQ(ierr);
>>> ierr = VecNorm(b,NORM_2,&norm2);CHKERRQ(ierr);
>>> ierr = KSP_MatMult(ksp,A,c,Ac);
>>> ierr = PetscTime(&tt2);CHKERRQ(ierr);
>>>
>>> ierr = PetscPrintf(PETSC_COMM_WORLD,"The time used for the synchronous calculation: %f\n",tt2-tt1);CHKERRQ(ierr);
>>> ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| = %g\n",(double)norm2);CHKERRQ(ierr);
>>>
>>> On a cluster with two or four nodes, the asynchronous computation is
>>> always much slower than the synchronous computation:
>>>
>>>   The time used for the asynchronous calculation: 0.000203
>>>   + |u| = 100.
>>>   The time used for the synchronous calculation: 0.000006
>>>   + |b| = 100.
>>>
>>> Are there any necessary settings in MPI or PETSc to gain performance
>>> from asynchronous computation?
>>>
>>> Thank you very much for anything you can provide.
>>>
>>> Sincerely,
>>> Viet.
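For reference, the corrected split-phase pattern that Barry describes (begin the norm, begin the reduction on u's communicator rather than Ax's, overlap local work, then end the norm) might be sketched as follows. This is only an illustration, not code from the thread: the hypothetical helper NormOverlappedWithMatMult assumes its Mat and Vec arguments are created and assembled elsewhere, and MatMult stands in for the local work meant to hide the reduction.

#include <petscksp.h>

/* Sketch: overlap computing ||u|| with an unrelated matrix-vector
   product, using PETSc's split-phase reductions from comb.c.
   A, u, c, and Ac are assumed to be set up by the caller. */
static PetscErrorCode NormOverlappedWithMatMult(Mat A, Vec u, Vec c, Vec Ac, PetscReal *norm)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* Post the local part of the 2-norm; no communication yet */
  ierr = VecNormBegin(u, NORM_2, norm);CHKERRQ(ierr);
  /* Start the (ideally non-blocking) allreduce on u's communicator;
     note u here, not Ax, per Barry's comment above */
  ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)u));CHKERRQ(ierr);
  /* Work intended to hide the reduction */
  ierr = MatMult(A, c, Ac);CHKERRQ(ierr);
  /* Complete the norm; the xxxEnd() calls must match the order of
     the xxxBegin() calls, as the comb.c comment says */
  ierr = VecNormEnd(u, NORM_2, norm);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Whether the reduction is actually hidden still depends on the MPI library making asynchronous progress, which is exactly what the pure-MPI test sketched earlier in the thread is meant to isolate.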