> On Jan 21, 2021, at 9:11 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> I have tried it and it hangs, but that is expected. This is not something she 
> has prepared for.
> 
> I am working with Sherry on it.
> 
> And she is fine with just one thread, and suggests using one thread if 
> SuperLU is called from within a thread. 
> 
> Now that I think about it, I don't understand why she needs OpenMP if she can 
> live with OMP_NUM_THREADS=1.

 It is very possible it was just a coding decision by one of her students, and 
with a few ifdefs in her code she would not need the OpenMP, but I don't have 
the time or energy to check her code and design decisions.
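
(A minimal, hypothetical sketch of that kind of ifdef guard; the function name 
is illustrative, not from SuperLU_DIST's actual code:)

  #ifdef _OPENMP
  #include <omp.h>
  #endif

  /* compile the OpenMP-specific part only when OpenMP is enabled, so the
     same source also builds and runs as a plain single-threaded code */
  static int get_num_threads(void)
  {
  #ifdef _OPENMP
    return omp_get_max_threads();
  #else
    return 1;
  #endif
  }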

  Barry

> 
> Mark
> 
> 
> 
> On Thu, Jan 21, 2021 at 9:30 PM Barry Smith <bsm...@petsc.dev 
> <mailto:bsm...@petsc.dev>> wrote:
> 
> 
>> On Jan 21, 2021, at 5:37 PM, Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> 
>> This did not work. I verified that MPI_Init_thread is being called correctly 
>> and that MPI returns that it supports the highest level of thread safety.
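>> 
>> (A minimal, hypothetical check of what the MPI library actually granted; 
>> check_thread_level is an illustrative name, not from the code:)
>> 
>>   #include <mpi.h>
>> 
>>   /* call after MPI_Init_thread()/PetscInitialize() to see the granted level */
>>   static void check_thread_level(void)
>>   {
>>     int provided;
>>     MPI_Query_thread(&provided);
>>     if (provided < MPI_THREAD_MULTIPLE) {
>>       /* concurrent MPI calls from several threads are not guaranteed to work */
>>     }
>>   }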
>> 
>> I am going to ask ORNL. 
>> 
>> And if I use:
>> 
>> -fieldsplit_i1_ksp_norm_type none
>> -fieldsplit_i1_ksp_max_it 300
>> 
>> for all 9 "i" variables, I can run normal iterations on the 10th variable, 
>> in a 10 species problem, and it works perfectly with 10 threads.
>> 
>> So it is definitely that VecNorm is not thread safe.
>> 
>> And I want to call SuperLU_DIST, which uses OpenMP threads, but I don't want 
>> SuperLU itself to start using threads. Is there a way to tell SuperLU that 
>> there are no threads but still have PETSc use them?
> 
>   My interpretation and Satish's for many years is that SuperLU_DIST has to 
> be built with and use OpenMP in order to work with CUDA. 
> 
>   def formCMakeConfigureArgs(self):
>     args = config.package.CMakePackage.formCMakeConfigureArgs(self)
>     if self.openmp.found:
>       self.usesopenmp = 'yes'
>     else:
>       args.append('-DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=TRUE')
>     if self.cuda.found:
>       if not self.openmp.found:
>         raise RuntimeError('SuperLU_DIST GPU code currently requires OpenMP. Use --with-openmp=1')
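> 
> (So, assuming the usual options, a configure line along these lines; the rest 
> of the options are up to you:)
> 
>   ./configure --with-cuda=1 --with-openmp=1 --download-superlu_dist ...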
> 
> But this could be OK. You use OpenMP and then it uses OpenMP internally, each 
> doing its own business (what could go wrong? :-)).
> 
> Have you tried it?
> 
>   Barry
> 
> 
>> 
>> Thanks,
>> Mark
>> 
>> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> OK, the problem is probably:
>> 
>> PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>> 
>> There is an example that sets:
>> 
>> PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>> 
>> This is what I need.
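>> 
>> (A minimal sketch of doing that from the application, assuming the global is 
>> set before PetscInitialize(); the rest of the file is illustrative:)
>> 
>>   #include <petscsys.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>     PetscErrorCode ierr;
>> 
>>     /* ask PETSc's MPI_Init_thread() for full multithreaded MPI support;
>>        must be set before PetscInitialize() */
>>     PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>>     /* ... set up and run the threaded solves ... */
>>     ierr = PetscFinalize();
>>     return ierr;
>>   }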
>> 
>> 
>> 
>> 
>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> 
>> 
>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> Yes, the problem is that each KSP solver is running in an OMP thread (so at 
>> this point it only works for COMM_SELF, and it's Landau, so that is all I 
>> need). It looks like MPI reductions called with COMM_SELF are not thread safe 
>> (e.g., they could just say: this is one proc, so just copy send --> recv, but 
>> they don't).
>> 
>> Instead of using SELF, how about Comm_dup() for each thread?
>> 
>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>> Thanks, 
>> 
>> You would have to dup them all outside the OMP section, since it is not 
>> thread safe. Then each thread uses one, I think.
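>> 
>> (A minimal, hypothetical sketch of that setup; the helper name and array are 
>> illustrative, not from the actual code:)
>> 
>>   #include <mpi.h>
>>   #include <omp.h>
>>   #include <stdlib.h>
>> 
>>   /* duplicate one communicator per thread serially, before any OpenMP region
>>      runs, since MPI_Comm_dup() itself may not be thread safe */
>>   static MPI_Comm *dup_comms_per_thread(void)
>>   {
>>     int       nthreads = omp_get_max_threads();
>>     MPI_Comm *comms    = (MPI_Comm *)malloc(nthreads * sizeof(MPI_Comm));
>>     for (int t = 0; t < nthreads; t++) MPI_Comm_dup(MPI_COMM_SELF, &comms[t]);
>>     return comms;
>>   }
>> 
>>   /* later, inside the OpenMP section, each thread uses its own entry, e.g.
>>      KSPCreate(comms[omp_get_thread_num()], &subksp) on that thread */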
>> 
>> Yea sure. I do it in SetUp.
>> 
>> Well, that finally worked to get different Comms, but I still get the same 
>> problem. The number of iterations differs wildly. This is two species and two 
>> threads (13 SNES its, which is not deterministic). Way below is one thread (8 
>> its) and fairly uniform iteration counts.
>> 
>> Maybe this MPI is just not thread safe at all. Let me look into it.
>> Thanks anyway,
>> 
>>    0 SNES Function norm 4.974994975313e-03
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. Comms pc=0x67ad27c0 ksp=0x7ffe1600 newcomm=0x8014b6e0
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. Comms pc=0x67ad27c0 ksp=0x7fff70d0 newcomm=0x7ffe9980
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>     1 SNES Function norm 1.836376279964e-05
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>>     2 SNES Function norm 3.059930074740e-07
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>>     3 SNES Function norm 4.744275398121e-08
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>     4 SNES Function norm 4.014828563316e-08
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 456
>>     5 SNES Function norm 5.670836337808e-09
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>>     6 SNES Function norm 2.410421401323e-09
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>>     7 SNES Function norm 6.533948191791e-10
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 458
>>     8 SNES Function norm 1.008133815842e-10
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>>     9 SNES Function norm 1.690450876038e-11
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>    10 SNES Function norm 1.336383986009e-11
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 463
>>    11 SNES Function norm 1.873022410774e-12
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 113
>>    12 SNES Function norm 1.801834606518e-13
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>>    13 SNES Function norm 1.004397317339e-13
>>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>> 
>> 
>> 
>> 
>>     0 SNES Function norm 4.974994975313e-03
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>     1 SNES Function norm 1.836376279963e-05
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 380
>>     2 SNES Function norm 3.018499983019e-07
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 387
>>     3 SNES Function norm 1.826353175637e-08
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 391
>>     4 SNES Function norm 1.378600599548e-09
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 392
>>     5 SNES Function norm 1.077289085611e-10
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 394
>>     6 SNES Function norm 8.571891727748e-12
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>     7 SNES Function norm 6.897647643450e-13
>>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>     8 SNES Function norm 5.606434614114e-14
>>   Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>> 
>> 
>>    Matt
>> 
>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> It looks like PETSc is just too clever for me. I am trying to get a 
>> different MPI_Comm into each block, but PETSc is thwarting me:
>> 
>> It looks like you are using SELF. Is that what you want? Do you want a bunch 
>> of comms with the same group, but independent somehow? I am confused.
>> 
>>    Matt
>>  
>>   if (jac->use_openmp) {
>>     ierr          = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",
>>                 ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>   } else {
>>     ierr          = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>   }
>> 
>> produces:
>> 
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. Comms 0x660c6ad0 0x660c6ad0
>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. Comms 0x660c6ad0 0x660c6ad0
>> 
>> How can I work around this?
>> 
>> 
>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> 
>> 
>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <bsm...@petsc.dev 
>> <mailto:bsm...@petsc.dev>> wrote:
>> 
>> 
>>> On Jan 20, 2021, at 3:09 PM, Mark Adams <mfad...@lbl.gov 
>>> <mailto:mfad...@lbl.gov>> wrote:
>>> 
>>> So I put in a temporary hack to get the first Fieldsplit apply to NOT use 
>>> OMP and it sort of works. 
>>> 
>>> Preonly/lu is fine. GMRES calls vector creates/dups in every solve so that 
>>> is a big problem.
>> 
>>   It should definitely not be creating vectors in "every" solve. But it does 
>> do lazy allocation of the needed restart vectors, which may make it look like 
>> it is creating vectors in every solve. You can use -ksp_gmres_preallocate to 
>> force it to create all the restart vectors up front at KSPSetUp(). 
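>> 
>> (For example, on a hypothetical run line, something like
>> 
>>    -ksp_type gmres -ksp_gmres_preallocate
>> 
>> in the options would make KSPSetUp() allocate the full set of restart vectors 
>> up front instead of lazily during the solve.)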
>> 
>> Well, I run the first solve w/o OMP and I see Vec dups in cuSparse Vecs in 
>> the 2nd solve. 
>>  
>> 
>>   Why is creating vectors "at every solve" a problem? It is not thread safe 
>> I guess?
>> 
>> It dies when it looks at the options database, in a free inside the 
>> get-options method to be exact (see the stack trace below). 
>> 
>> ======= Backtrace: =========
>> /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>> 
>>  
>> 
>>> Richardson works except the convergence test gets confused, presumably 
>>> because MPI reductions with PETSC_COMM_SELF are not thread safe.
>> 
>>> 
>>> One fix for the norms might be to create each subdomain solver with a 
>>> different communicator.
>> 
>>    Yes, you could do that. It might actually be the correct thing to do also; 
>> if you have multiple threads calling MPI reductions on the same communicator, 
>> that would be a problem. Each KSP should get a new MPI_Comm. 
>> 
>> OK. I will only do this.
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>> 
> 
