Re: [petsc-dev] Putting more menu items at the top of petsc.org pages (how to?)

2023-02-21 Thread Zhang, Hong via petsc-dev
I think this is controlled by the theme we are using, which is 
pydata-sphinx-theme. It seems that the only way to do what you want is to 
modify the theme directly.

https://github.com/pydata/pydata-sphinx-theme/blob/main/src/pydata_sphinx_theme/__init__.py#L283

This function shows 5 toctree entries before collapsing the rest under More. Perhaps 
we can create our own fork of this theme and customize it.
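
That said, newer releases of pydata-sphinx-theme appear to expose an option for this. If 
the version we pin supports it, something like the following in our Sphinx conf.py might 
avoid a fork (the option name and the value 7 are assumptions about the theme version, 
untested):

html_theme = "pydata_sphinx_theme"
html_theme_options = {
    # number of top-level toctree links shown before the rest collapse into "More"
    "header_links_before_dropdown": 7,
}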

Hong (Mr.)

> On Feb 19, 2023, at 6:01 PM, Barry Smith  wrote:
> 
> 
>Googling "magic words sphinx menu" didn't help me so I ask here. Is there 
> a way to have more menu items at the top
> of the petsc.org pages instead of immediately going to More? I think I saw a 
> way before but cannot find the magic words.
> 
> 
> 
> 



[petsc-dev] GPU timers broken in main

2022-12-23 Thread Zhang, Hong via petsc-dev
GPU timers are currently broken in main. Event.GpuTime is always zero, so the 
GPU FLOPs reported in the log is zero too. Git bisect points to 
c708d6e3a1c9bc4418db993825b9337456e59b5c as the first bad commit. 

In this commit, the global variables in plog.c have two versions (one is 
thread-safe and the other is not), e.g. petsc_gtime_th and petsc_gtime. 
PetscLogGpuTimeBegin/End updates petsc_gtime to capture the GPU time, but 
petsc_gtime_th is used to update eventInfo->GpuTime for the log report. Stefano 
might want to take a look.

Hong

Re: [petsc-dev] Potential memory leak in PETSc - hypre interface when using Euclid

2022-10-27 Thread Zhang, Hong via petsc-dev
CCing Ruipeng. I think he can help with this.

Hong (Mr.)

> On Oct 27, 2022, at 3:53 PM, Barry Smith  wrote:
> 
> 
>  My quick examination of hypre.c shows the only relevant code in PETSc is 
> 
> PetscCall(PetscOptionsEList("-pc_hypre_boomeramg_smooth_type", "Enable more 
> complex smoothers", "None", HYPREBoomerAMGSmoothType, 
> PETSC_STATIC_ARRAY_LENGTH(HYPREBoomerAMGSmoothType), 
> HYPREBoomerAMGSmoothType[0], &indx, &flg));
>  if (flg) {
>jac->smoothtype = indx;
>PetscCallExternal(HYPRE_BoomerAMGSetSmoothType, jac->hsolver, indx + 6);
> 
> In other words PETSc just sends this option off to hypre and does not create 
> any objects or allocate any memory based on this option.
> 
> Thus my conclusion is the memory leak is within hypre. Likely valgrind would 
> locate the exact position easily.
> 
>> On Oct 27, 2022, at 4:27 PM, Emil Constantinescu via petsc-dev 
>>  wrote:
>> 
>> Hi there,
>> 
>> Tang Qi (LANL) reported a potential memory leak when using hypre/Euclid. 
>> Upon rudimentary testing, I could reproduce it for many examples in PETSc 
>> TS. The symptom is that memory usage (measured with top) increases with the 
>> number of time steps. Without Euclid, memory use does not increase.
>> 
>> For instance, one can try ex15 under TS:
>> 
>> ex15  -da_grid_x 50 -da_grid_y 50 -boundary 0 -ts_max_steps 20 -Jtype 1 
>> -ts_monitor -pc_type hypre -pc_hypre_boomeramg_smooth_type Euclid
>> 
>> I am not sure if it's PETSc - hypre that causes the memory use or hypre 
>> itself.
>> 
>> Can someone with more sophisticated tools take a look at it?
>> 
>> Emil
> 



Re: [petsc-dev] petsc4py, numpy's BLAS and PETSc's BLAS

2022-10-24 Thread Zhang, Hong via petsc-dev
The chances of these problems are very slim because almost nobody builds Numpy 
from source. I usually install it with pip. Pip-installed Numpy on Mac uses 
Openblas, which is shipped together with the numpy wheels. The official API to 
check which BLAS is used by Numpy is numpy.show_config(). However, it gives me 
false info on my laptop — the openblas libs do not really exist in 
/usr/local/lib.

openblas64__info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), 
('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), 
('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), 
('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), 
('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = 
AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

I think Numpy is actually using the following openblas lib:
/usr/local/lib/python3.10/site-packages/numpy//.dylibs/libopenblas64_.0.dylib

I feel it would be a big hassle to determine which BLAS Numpy is actually using, 
considering the different ways and platforms on which Numpy may be installed.
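
For what it is worth, here is a rough way to see which BLAS a pip-installed wheel 
actually ships (a sketch; the numpy/.dylibs and numpy.libs directory names are 
wheel-packaging conventions on macOS and manylinux, not a stable API):

import os, glob
import numpy as np

pkg_dir = os.path.dirname(np.__file__)
# macOS wheels bundle BLAS under numpy/.dylibs, manylinux wheels under numpy.libs
bundled = glob.glob(os.path.join(pkg_dir, ".dylibs", "*")) + \
          glob.glob(os.path.join(pkg_dir + ".libs", "*"))
print(bundled)
np.show_config()  # official API, but library_dirs may not point at the bundled copy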

Hong (Mr.)

On Oct 21, 2022, at 4:20 PM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


 When PETSc is built with petsc4py this brings along, in some way, the 
BLAS/LAPACK that numpy is using. Yet PETSc is free to bring in its own 
BLAS/LAPACK libraries.

 To be completely proper should we be having configure (when used with 
petsc4py) determine the BLAS/LAPACK that numpy is using and only using that for 
PETSc's BLAS/LAPACK needs? If not, why is it ok to have both sets hanging around? 
Jose's new https://gitlab.com/petsc/petsc/-/merge_requests/5737 seems to 
indicate possible problems with having both.

 Barry



Re: [petsc-dev] Enhancing the PETSc Developer Experience

2022-09-28 Thread Zhang, Hong via petsc-dev
Reminder
Please provide us with your input by answering the following questions and 
emailing me back by Friday, Sept. 30, if you have not already done so.

Hong, Getnet, and Jacob Faibussowitsch


From: Zhang, Hong 
Sent: Friday, September 23, 2022 3:46 PM
To: For users of the development version of PETSc 
Cc: Zhang, Hong ; Betrie, Getnet ; Jacob, 
Robert L. 
Subject: Enhancing the PETSc Developer Experience

Dear PETSc developers,

We are compiling a section on "Enhancing the Developer Experience" that will be 
a part of the "PETSc Strategic Planning" document. Please provide us with your 
inputs by answering the following questions and email me back by Friday, Sept. 
30.

1. What do developers like or dislike about PETSc?
   e.g., repository use, code review, testing infrastructure, documentation, 
and others.

2. What can we do to make PETSc more attractive to new developers and students, 
or to keep the experienced developers?

3. Any additional comments about "Enhancing the Developer Experience".

Note: The feedback is confidential. The docs will not include your name.
Thank you in advance for your time.

Hong, Getnet, and Jacob


[petsc-dev] Enhancing the PETSc Developer Experience

2022-09-23 Thread Zhang, Hong via petsc-dev
Dear PETSc developers,

We are compiling a section on "Enhancing the Developer Experience" that will be 
a part of the "PETSc Strategic Planning" document. Please provide us with your 
inputs by answering the following questions and email me back by Friday, Sept. 
30.

1. What do developers like or dislike about PETSc?
   e.g., repository use, code review, testing infrastructure, documentation, 
and others.

2. What can we do to make PETSc more attractive to new developers and students, 
or to keep the experienced developers?

3. Any additional comments about "Enhancing the Developer Experience".

Note: The feedback is confidential. The docs will not include your name.
Thank you in advance for your time.

Hong, Getnet, and Jacob


Re: [petsc-dev] MatProduct_AtB --with-scalar-type=complex

2022-07-15 Thread Zhang, Hong via petsc-dev
Pierre,
I believe you are headed in the right direction for debugging 
MatProductReplaceMats(). I'll investigate it and let you know the result.
Hong


From: Pierre Jolivet 
Sent: Friday, July 15, 2022 12:01 AM
To: Zhang, Hong 
Cc: Barry Smith ; For users of the development version of 
PETSc 
Subject: Re: [petsc-dev] MatProduct_AtB --with-scalar-type=complex

Barry,
MatTransposeMatMultSymbolic_SeqAIJ_SeqAIJ() is indeed called.
product->alg is default, square is PETSC_FALSE.

Hong,
I believe the issue comes from the fact that atb->updateAt is PETSC_FALSE in 
MatProductNumeric_AtB_SeqAIJ_SeqAIJ().
If the name of this variable is relevant to its purpose, I believe it should be 
set to PETSC_TRUE when calling MatProductReplaceMats() whenever A is changed.
I would prefer using MatProductReplaceMats() because I’m implementing the same 
MatConvert() as MatNormal for the Hermitian case and it’s the only way to reuse 
the symbolic product, cf. 
https://petsc.org/main/src/mat/impls/normal/normm.c.html#line315 in the case 
where --with-scalar-type=real
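
Spelling out the algebra: if the AtB product indeed applies only a transpose and no 
conjugation, then using conj(A) as the first factor gives (conj(A))^T A = A^H A, which 
is what the -correct true path is meant to produce; equivalently, conj(A^T conj(A)) = 
A^H A, which is the extra-MatConjugate workaround.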

Thanks,
Pierre

On 15 Jul 2022, at 5:52 AM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Pierre,
Our MatProductReplaceMats() is not well tested and might be buggy. I 
simplified your code without calling MatProductReplaceMats() and got correct 
results in the cases
./ex -product_view ::ascii_matlab -convert false/true -correct false
and
./ex -product_view ::ascii_matlab -convert false/true -correct true

My code is attached. I'll investigate MatProductReplaceMats().
Hong





From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Barry Smith mailto:bsm...@petsc.dev>>
Sent: Thursday, July 14, 2022 4:38 PM
To: Pierre Jolivet mailto:pie...@joliv.et>>
Cc: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MatProduct_AtB --with-scalar-type=complex


  Can you confirm whether MatTransposeMatMultSymbolic_SeqAIJ_SeqAIJ() ends up being 
called for you, and what path it takes inside that routine (it depends on the 
algorithm being used)?



> On Jul 14, 2022, at 4:30 PM, Pierre Jolivet 
> mailto:pie...@joliv.et>> wrote:
>
> Hello,
> In the following example, the SeqAIJ implementation of MatProduct_AtB produces 
> a different (and wrong) result compared to the SeqDense implementation or 
> MATLAB.
> I want to compute B = A^H A (where ^H is the Hermitian transpose).
> So I create a MatProduct with A and A.
> Duplicate A into another Mat which I conjugate.
> And I replace the first Mat of the product with this conjugate.
> I expect to get the proper result, which I don’t.
> Is the MatProduct_AtB implementation in the complex case not computing A^T B 
> (where ^T is the transpose)?
> For reference, here is how to properly compute A^H A with current main: 
> conj(A^H conj(A)) — so it requires an extra MatConjugate I’d like to avoid.
>
> Thanks,
> Pierre
>
> 
>
> $ ./ex -product_view ::ascii_matlab -A_view ::ascii_matlab -convert false
> %Mat Object: 1 MPI process
> %  type: seqdense
> % Size = 2 2
> Mat_0xc401_0 = zeros(2,2);
> Mat_0xc401_0 = [
> 7.2003197397953400e-01 + 6.1793966542126100e-02i 3.980919128602e-01 + 
> 7.3036588248200474e-02i
> 1.0022337819588500e-02 + 1.4463931936456476e-01i 1.0386628927366459e-01 + 
> 2.5078039364333193e-01i
> ];
> %Mat Object: 1 MPI process
> %  type: seqdense
> % Size = 2 2
> Mat_0xc401_1 = zeros(2,2);
> Mat_0xc401_1 = [
> 5.4328551781548817e-01 + 0.e+00i 3.2823965013353340e-01 + 
> 1.549814872689e-02i
> 3.2823965013353340e-01 + -1.549814872689e-02i 2.3724054059134142e-01 + 
> 0.e+00i
> ];
>
> $ ./ex -product_view ::ascii_matlab -convert true
> %Mat Object: 1 MPI process
> %  type: seqaij
> % Size = 2 2
> % Nonzeros = 4
> zzz = zeros(4,4);
> zzz = [
> 1 1  4.9380746380098023e-01 9.1886511660038694e-02
> 1 2  2.4666779825931440e-01 9.4705502650537468e-02
> 2 1  2.4666779825931440e-01 9.4705502650537468e-02
> 2 2  1.0079024247365802e-01 1.1019992594899400e-01
> ];
> Mat_0xc401_0 = spconvert(zzz);
>
> $ ./ex -product_view ::ascii_matlab -convert true -correct true
> %Mat Object: 1 MPI process
> %  type: seqaij
> % Size = 2 2
> % Nonzeros = 4
> zzz = zeros(4,4);
> zzz = [
> 1 1  5.4328551781548828e-01 -0.e+00
> 1 2  3.2823965013353340e-01 1.549814872696e-02
> 2 1  3.2823965013353340e-01 -1.549814872696e-02
> 2 2  2.3724054059134142e-01 -0.e+00
> ];
> Mat_0xc401_0 = spconvert(zzz);
>
> 





Re: [petsc-dev] MatProduct_AtB --with-scalar-type=complex

2022-07-14 Thread Zhang, Hong via petsc-dev
Pierre,
Our MatProductReplaceMats() is not well tested and might be buggy. I 
simplified your code without calling MatProductReplaceMats() and got correct 
results in the cases
./ex -product_view ::ascii_matlab -convert false/true -correct false
and
./ex -product_view ::ascii_matlab -convert false/true -correct true

My code is attached. I'll investigate MatProductReplaceMats().
Hong





From: petsc-dev  on behalf of Barry Smith 

Sent: Thursday, July 14, 2022 4:38 PM
To: Pierre Jolivet 
Cc: For users of the development version of PETSc 
Subject: Re: [petsc-dev] MatProduct_AtB --with-scalar-type=complex


  Can you confirm whether MatTransposeMatMultSymbolic_SeqAIJ_SeqAIJ() ends up being 
called for you, and what path it takes inside that routine (it depends on the 
algorithm being used)?



> On Jul 14, 2022, at 4:30 PM, Pierre Jolivet  wrote:
>
> Hello,
> In the following example, the SeqAIJ implementation of MatProduct_AtB produces 
> a different (and wrong) result compared to the SeqDense implementation or 
> MATLAB.
> I want to compute B = A^H A (where ^H is the Hermitian transpose).
> So I create a MatProduct with A and A.
> Duplicate A into another Mat which I conjugate.
> And I replace the first Mat of the product with this conjugate.
> I expect to get the proper result, which I don’t.
> Is the MatProduct_AtB implementation in the complex case not computing A^T B 
> (where ^T is the transpose)?
> For reference, here is how to properly compute A^H A with current main: 
> conj(A^H conj(A)) — so it requires an extra MatConjugate I’d like to avoid.
>
> Thanks,
> Pierre
>
> 
>
> $ ./ex -product_view ::ascii_matlab -A_view ::ascii_matlab -convert false
> %Mat Object: 1 MPI process
> %  type: seqdense
> % Size = 2 2
> Mat_0xc401_0 = zeros(2,2);
> Mat_0xc401_0 = [
> 7.2003197397953400e-01 + 6.1793966542126100e-02i 3.980919128602e-01 + 
> 7.3036588248200474e-02i
> 1.0022337819588500e-02 + 1.4463931936456476e-01i 1.0386628927366459e-01 + 
> 2.5078039364333193e-01i
> ];
> %Mat Object: 1 MPI process
> %  type: seqdense
> % Size = 2 2
> Mat_0xc401_1 = zeros(2,2);
> Mat_0xc401_1 = [
> 5.4328551781548817e-01 + 0.e+00i 3.2823965013353340e-01 + 
> 1.549814872689e-02i
> 3.2823965013353340e-01 + -1.549814872689e-02i 2.3724054059134142e-01 + 
> 0.e+00i
> ];
>
> $ ./ex -product_view ::ascii_matlab -convert true
> %Mat Object: 1 MPI process
> %  type: seqaij
> % Size = 2 2
> % Nonzeros = 4
> zzz = zeros(4,4);
> zzz = [
> 1 1  4.9380746380098023e-01 9.1886511660038694e-02
> 1 2  2.4666779825931440e-01 9.4705502650537468e-02
> 2 1  2.4666779825931440e-01 9.4705502650537468e-02
> 2 2  1.0079024247365802e-01 1.1019992594899400e-01
> ];
> Mat_0xc401_0 = spconvert(zzz);
>
> $ ./ex -product_view ::ascii_matlab -convert true -correct true
> %Mat Object: 1 MPI process
> %  type: seqaij
> % Size = 2 2
> % Nonzeros = 4
> zzz = zeros(4,4);
> zzz = [
> 1 1  5.4328551781548828e-01 -0.e+00
> 1 2  3.2823965013353340e-01 1.549814872696e-02
> 2 1  3.2823965013353340e-01 -1.549814872696e-02
> 2 2  2.3724054059134142e-01 -0.e+00
> ];
> Mat_0xc401_0 = spconvert(zzz);
>
> 

#include <petscmat.h>
static char help[] = "";

int main(int argc,char **args)
{
  PetscInt  n = 2;
  Mat   array[1],B,conjugate;
  PetscBool flg = PETSC_FALSE, correct = PETSC_FALSE;

  PetscCall(PetscInitialize(&argc,&args,NULL,help));
  PetscCall(MatCreateConstantDiagonal(PETSC_COMM_WORLD,n,n,n,n,-1.0,array));
  PetscCall(MatConvert(array[0],MATDENSE,MAT_INPLACE_MATRIX,array));
  PetscCall(MatSetRandom(array[0],NULL));
  PetscCall(MatViewFromOptions(array[0],NULL,"-A_view"));
  PetscCall(PetscOptionsGetBool(NULL,NULL,"-convert",&flg,NULL));
  PetscCall(PetscOptionsGetBool(NULL,NULL,"-correct",&correct,NULL));
  if (flg) {
PetscCall(MatConvert(array[0],MATAIJ,MAT_INPLACE_MATRIX,array));
//PetscCall(PetscOptionsGetBool(NULL,NULL,"-correct",&correct,NULL));
  }

  PetscCall(MatDuplicate(array[0], MAT_COPY_VALUES, &conjugate));
  PetscCall(MatConjugate(conjugate));
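  /* -correct false: request AtB with (A, A)                                   */
  /* -correct true:  request AtB with (conj(A), A), i.e. (conj(A))^T A = A^H A */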

  if (!correct) PetscCall(MatProductCreate(array[0],array[0],NULL,&B));
  else PetscCall(MatProductCreate(conjugate,array[0],NULL,&B));
  PetscCall(MatProductSetType(B,MATPRODUCT_AtB));
  PetscCall(MatProductSetFromOptions(B));
  PetscCall(MatProductSymbolic(B));
  PetscCall(MatProductNumeric(B));
  PetscCall(MatDestroy(&conjugate));

  PetscCall(MatViewFromOptions(B,NULL,"-product_view"));
  PetscCall(MatDestroy(&B));
  PetscCall(MatDestroy(array));
  PetscCall(PetscFinalize());
  return 0;
}


Re: [petsc-dev] odd log behavior

2022-05-17 Thread Zhang, Hong via petsc-dev
Python users including myself would love NaN since NaN is the default missing 
value marker for reasons of computational speed and convenience. For example, 
if you load these values into pandas, no extra code is needed to handle them. 
Other choices such as N/A would require some extra work for text replacement.
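
For example (a tiny illustrative sketch with made-up columns, not the real -log_view 
layout):

import io
import pandas as pd

# hypothetical two-row extract of a log table; the "nan" entry needs no special handling
text = "event time flops\nVecTDot nan 1.2e+01\nKSPSolve 4.05e-04 5.5e+01\n"
df = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(df["time"].mean())  # the NaN row is skipped by default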

The -nan looks a bit weird. There should be a way to get rid of the - sign.

Hong (Mr.)

> On Apr 26, 2022, at 10:52 AM, Jacob Faibussowitsch  
> wrote:
> 
> There is an automatic warning that shows when you do run with 
> `-log_view_gpu_time`, but perhaps there should also be an automatic warning 
> when *not* running with it. It is unfortunate that NaN is the value printed 
> as this implies a bug but AFAIK it is unavoidable (Barry can say more on this 
> though).
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> 
>> On Apr 26, 2022, at 09:48, Jose E. Roman  wrote:
>> 
>> You have to add -log_view_gpu_time
>> See https://gitlab.com/petsc/petsc/-/merge_requests/5056
>> 
>> Jose
>> 
>> 
>>> El 26 abr 2022, a las 16:39, Mark Adams  escribió:
>>> 
>>> I'm seeing this on Perlmutter with Kokkos-CUDA. Nans in most log timing 
>>> data except the two 'Solve' lines.
>>> Just cg/jacobi on snes/ex56.
>>> 
>>> Any ideas?
>>> 
>>> VecTDot                2 1.0   nan nan 1.20e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00 100
>>> VecNorm                2 1.0   nan nan 1.00e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00 100
>>> VecCopy                2 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00   0
>>> VecSet                 5 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00   0
>>> VecAXPY                4 1.0   nan nan 2.40e+01 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00 100
>>> VecPointwiseMult       1 1.0   nan nan 3.00e+00 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00 100
>>> KSPSetUp               1 1.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  -nan -nan  0 0.00e+00  0 0.00e+00   0
>>> KSPSolve               1 1.0 4.0514e-04 1.0 5.50e+01 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0 -nan  0 0.00e+00  0 0.00e+00 100
>>> SNESSolve              1 1.0 2.2128e-02 1.0 5.55e+05 1.0 0.0e+00 0.0e+00 0.0e+00 72 56  0  0  0 100 100  0  0  0    25 -nan  0 0.00e+00  0 0.00e+00   0
>> 
> 



Re: [petsc-dev] About the problem of Lagrange multiplier

2022-04-08 Thread Zhang, Hong via petsc-dev
Yahe,
What problem do you want to solve, a linear/nonlinear optimisation problem with 
equality constraints?
Hong

From: petsc-dev  on behalf of Barry Smith 

Sent: Friday, April 8, 2022 10:04 AM
To: 高亚贺 
Cc: petsc-dev@mcs.anl.gov 
Subject: Re: [petsc-dev] About the problem of Lagrange multiplier


How can Q be non-square? U has n entries, so presumably K is n by n. Q has 
the same number of rows as K, and from your definition of Q containing a_1  
a_n entries per row, Q has n columns. So Q is also n by n. If this is the case, 
then you have the same number of Lagrange multipliers as entries of u, so you 
can simply create a DMDA with twice as many degrees of freedom on each vertex: 
on each vertex the first half of the degrees of freedom are u and the second 
half are lambda. Note that this means u and lambda (and hence the matrix 
entries) are interlaced, but that is fine; writing all the u before all the 
lambda is only a convenience for human eyes, and any representation in the 
computer is fine.
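
A minimal petsc4py sketch of that layout (hypothetical sizes, and the petsc4py calls 
are illustrative only), with dof 0 holding u and dof 1 holding lambda at every vertex:

import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

n = 8  # hypothetical grid size
da = PETSc.DMDA().create([n, n], dof=2, stencil_width=1)
Kl = da.createMat()        # one operator holding the K and Q blocks, interlaced
x  = da.createGlobalVec()  # global vector ordered [u_0, lambda_0, u_1, lambda_1, ...]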

  Barry


On Apr 8, 2022, at 1:34 AM, 高亚贺 
mailto:gaoy...@buaa.edu.cn>> wrote:


Dear Mr./Ms.,


In fact, I want to solve a discretized equation like this

<1649395840117.png>


where K, U=[u1 u2 …un]T and F are fields that sit on the vertices, and can easily be 
created by ‘DMCreateMatrix’ or ‘DMCreateGlobalVector’. λ is the Lagrange 
multiplier vector. The augmented Q (non-square) is the constraint coefficient 
matrix and has the following form

<1649395860765.png>


The Q is employed here to satisfy the following constraints

<1649395881836.png>


So how do I build the entire system in-place in one big matrix (Kλ)? Could you 
give me more specific suggestions on this problem?


Thank you very much!


Best regards,

Yahe



-----Original Message-----
From: "Barry Smith" mailto:bsm...@petsc.dev>>
Sent: 2022-04-07 23:10:20 (Thursday)
To: "Matthew Knepley" mailto:knep...@gmail.com>>
Cc: "高亚贺" mailto:gaoy...@buaa.edu.cn>>, PETSc 
mailto:petsc-us...@mcs.anl.gov>>
Subject: Re: [petsc-users] question


  DMStag may also be useful for your needs (and far simpler to use than DMPLEX) 
depending on where your Lagrange multipliers live. Note that regardless you 
should not need to be copying entire large submatrices around into bigger 
matrices; you can build the entire system in-place in one big matrix. MatNest 
is also a possibility depending on exactly what you are doing.

  If you explain what your Lagrange multipliers are (the constraints) we may be 
able to make more specific suggestions.

Barry




On Apr 7, 2022, at 8:26 AM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Thu, Apr 7, 2022 at 8:16 AM 高亚贺 via petsc-users 
mailto:petsc-us...@mcs.anl.gov>> wrote:


Dear Mr./Ms.,


I have used ‘DMCreateMatrix’ to create a matrix K, and also the 
‘DMCreateGlobalVector’ to create two vectors U (to be solved) and F (right-hand 
side), i.e. KU=F. Now, I want to add some complex constraints to this system 
through lagrangian multiplier method, and the constraint matrix is Q. The KU=F 
transforms to

<1649328463919.png>

   How do I create Kλ, and how do I efficiently copy the values of K and Q into Kλ? Do the 
newly created Kλ and Fλ still retain the advantages of DMDA? Or do you have any 
other good suggestions for this kind of problem?

DMDA can only really handle collocated discretizations, meaning all fields sit 
on the vertices. If you can discretize your problem this way, then just give it 
two fields and assemble K_\lambda as normal. If not, then you might look at 
DMPlex which supports a wider range of discretizations.

  Thanks,

 Matt


Thank you very much!


Best regards,

A PETSc user


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/





Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-08 Thread Zhang, Hong via petsc-dev
Here is an interesting thread discussing the memory issue for PyTorch (which I 
think is also relevant to PETSc):

https://github.com/pytorch/pytorch/issues/12873

The memory overhead (for both CPU and GPU) of PyTorch is getting worse and 
worse as it evolves. A conjecture is that the CUDA kernels in the library are 
responsible for this. But the overhead for Tensorflow2 is just around 300MB 
(compared to 1.5GB for PyTorch).

According to the discussion, there has not been a good way to decrease the 
memory overhead for PyTorch yet. Someone noticed that “removing half of the 
CUDA kernels can reduce the memory usage by half."

Hong

On Jan 7, 2022, at 9:23 PM, Barry Smith  wrote:


  Doesn't Nvidia supply a "valgrind" like tool that will allow tracking memory 
usage? I'm pretty sure I've seen one; it should be able to show memory usage as 
a function of time so you can see where the memory is being allocated

  Barry


On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists 
across the entire running time of an application. cupm_initialize contributes 
0.36GB out of 0.73GB.

If I had to guess this may be the latent overhead of CUDA streams and events, 
but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of 
streams that is not freed until cudaDeviceReset() is called. Maybe they 
initialize this pool immediately on start-up of the context? AFAIK there is no 
way to disable or modify this behavior.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 13:23, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 
0.73GB CUDA memory and this overhead persists across the entire running time of 
an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is still 
unclear what takes the remaining 0.37GB.

The torch issue is really a mystery. If I import torch only and do some tensor 
operations on GPU, it consumes only 0.004GB CUDA memory.


On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


1. Commenting out  ierr = 
__initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
device/impls/cupm/cupmcontext.hpp:L199

CUDA memory: 1.575GB
CUDA memory without importing torch:  0.370GB

This has the same effect as commenting out L437-L440 in interface/device.cxx

2. Comment out these two:
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]

CUDA memory: 1.936GB
CUDA memory without importing torch:   0.730GB

On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

They had no influence to the memory usage.
???

Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in 
cupmdevice.cxx as well.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:18, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

I have tried all of these. They had no influence to the memory usage.

On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

Ok next things to try out in order:

1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this

2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
Comment this out

3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
Comment this out

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:02, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode 
CUPMContext::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith 
mailto:bsm...@petsc.dev

Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev
Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 
0.73GB CUDA memory and this overhead persists across the entire running time of 
an application. cupm_initialize contributes 0.36GB out of 0.73GB. It is still 
unclear what takes the remaining 0.37GB.

The torch issue is really a mystery. If I import torch only and do some tensor 
operations on GPU, it consumes only 0.004GB CUDA memory.
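
A torch-only variant of the attached test script (a sketch; nvidia_smi is the same NVML 
wrapper the script already uses) reproduces that number:

import torch
import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
x = torch.ones(1024, device='cuda')   # force torch to create its CUDA context
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory with torch only %.3fGB' % (info.used/1e9))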


On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


1. Commenting out  ierr = 
__initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
device/impls/cupm/cupmcontext.hpp:L199

CUDA memory: 1.575GB
CUDA memory without importing torch:  0.370GB

This has the same effect as commenting out L437-L440 in interface/device.cxx

2. Comment out these two:
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]

CUDA memory: 1.936GB
CUDA memory without importing torch:   0.730GB

On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

They had no influence to the memory usage.
???

Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in 
cupmdevice.cxx as well.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:18, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

I have tried all of these. They had no influence to the memory usage.

On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

Ok next things to try out in order:

1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this

2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
Comment this out

3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
Comment this out

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:02, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode 
CUPMContext::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  I don't think this is right. We want the device initialized by PETSc , we 
just don't want the cublas and cusolve stuff initialized. In order to see how 
much memory initializing the blas and solvers takes.

  So I think you need to comment things in cupminterface.hpp like cublasCreate 
and cusolverDnCreate.

  Urgh, I hate C++ where huge chunks of real code are in header files.



On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Hit send too early…

If you don’t want to comment out, you can also run with "-device_enable lazy" 
option. Normally this is the default behavior but if -log_view or -log_summary 
is provided this defaults to “-device_enable eager”. See 
src/sys/objects/device/interface/device.cxx:398

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:29, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

You need to go into the PetscInitialize() routine find where it loads the 
cublas and cusolve and comment out those lines then run with -log_view

Comment out

#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || 
PetscDefined(HAVE_SYCL))
  ierr = 
PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
#endif

At src/sys/objects/pinit.c:956

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:24, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


Without log_view it does not load any cuBLAS/cuSolve immediately with -log_view 
it loads all that stuff at startup. You need to go into the PetscInitialize() 
routine find where it loads the cublas and cusolve and comment out those lines 
then run

Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev

1. Commenting out  ierr = 
__initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in 
device/impls/cupm/cupmcontext.hpp:L199

CUDA memory: 1.575GB
CUDA memory without importing torch:  0.370GB

This has the same effect as commenting out L437-L440 in interface/device.cxx

2. Comment out these two:
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]

CUDA memory: 1.936GB
CUDA memory without importing torch:   0.730GB

On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

They had no influence to the memory usage.
???

Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in 
cupmdevice.cxx as well.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:18, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

I have tried all of these. They had no influence to the memory usage.

On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

Ok next things to try out in order:

1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this

2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
Comment this out

3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
Comment this out

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:02, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode 
CUPMContext::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  I don't think this is right. We want the device initialized by PETSc , we 
just don't want the cublas and cusolve stuff initialized. In order to see how 
much memory initializing the blas and solvers takes.

  So I think you need to comment things in cupminterface.hpp like cublasCreate 
and cusolverDnCreate.

  Urgh, I hate C++ where huge chunks of real code are in header files.



On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Hit send too early…

If you don’t want to comment out, you can also run with "-device_enable lazy" 
option. Normally this is the default behavior but if -log_view or -log_summary 
is provided this defaults to “-device_enable eager”. See 
src/sys/objects/device/interface/device.cxx:398

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:29, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

You need to go into the PetscInitialize() routine find where it loads the 
cublas and cusolve and comment out those lines then run with -log_view

Comment out

#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || 
PetscDefined(HAVE_SYCL))
  ierr = 
PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
#endif

At src/sys/objects/pinit.c:956

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:24, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


Without log_view it does not load any cuBLAS/cuSolve immediately with -log_view 
it loads all that stuff at startup. You need to go into the PetscInitialize() 
routine find where it loads the cublas and cusolve and comment out those lines 
then run with -log_view


On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(cai

Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev
I have tried all of these. They had no influence on the memory usage.

On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

Ok next things to try out in order:

1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;]
Put a PetscFunctionReturn(0); right after this

2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr = 
_devices[_defaultDevice]->configure();CHKERRQ(ierr);]
Comment this out

3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr = 
_devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
Comment this out

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 12:02, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode 
CUPMContext::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  I don't think this is right. We want the device initialized by PETSc , we 
just don't want the cublas and cusolve stuff initialized. In order to see how 
much memory initializing the blas and solvers takes.

  So I think you need to comment things in cupminterface.hpp like cublasCreate 
and cusolverDnCreate.

  Urgh, I hate C++ where huge chunks of real code are in header files.



On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Hit send too early…

If you don’t want to comment out, you can also run with "-device_enable lazy" 
option. Normally this is the default behavior but if -log_view or -log_summary 
is provided this defaults to “-device_enable eager”. See 
src/sys/objects/device/interface/device.cxx:398

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:29, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

You need to go into the PetscInitialize() routine find where it loads the 
cublas and cusolve and comment out those lines then run with -log_view

Comment out

#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || 
PetscDefined(HAVE_SYCL))
  ierr = 
PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
#endif

At src/sys/objects/pinit.c:956

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:24, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


Without log_view it does not load any cuBLAS/cuSolve immediately with -log_view 
it loads all that stuff at startup. You need to go into the PetscInitialize() 
routine find where it loads the cublas and cusolve and comment out those lines 
then run with -log_view


On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB


import torch
import sys
import os

import nvidia_smi
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

petsc4py_path = 
os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))










Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev
Initializing cutlass and cusolver does not affect the memory usage. I did the 
following to turn them off:

diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp 
b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
index 51fed809e4d..9a5f068323a 100644
--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
@@ -199,7 +199,7 @@ inline PetscErrorCode 
CUPMContext::setUp(PetscDeviceContext dctx) noexcept
 #if PetscDefined(USE_DEBUG)
   dci->timerInUse = PETSC_FALSE;
 #endif
-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
   PetscFunctionReturn(0);
 }

On Jan 7, 2022, at 10:53 AM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  I don't think this is right. We want the device initialized by PETSc , we 
just don't want the cublas and cusolve stuff initialized. In order to see how 
much memory initializing the blas and solvers takes.

  So I think you need to comment things in cupminterface.hpp like cublasCreate 
and cusolverDnCreate.

  Urgh, I hate C++ where huge chunks of real code are in header files.



On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

Hit send too early…

If you don’t want to comment out, you can also run with "-device_enable lazy" 
option. Normally this is the default behavior but if -log_view or -log_summary 
is provided this defaults to “-device_enable eager”. See 
src/sys/objects/device/interface/device.cxx:398

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:29, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

You need to go into the PetscInitialize() routine find where it loads the 
cublas and cusolve and comment out those lines then run with -log_view

Comment out

#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || 
PetscDefined(HAVE_SYCL))
  ierr = 
PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
#endif

At src/sys/objects/pinit.c:956

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 7, 2022, at 11:24, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


Without log_view it does not load any cuBLAS/cuSolve immediately with -log_view 
it loads all that stuff at startup. You need to go into the PetscInitialize() 
routine find where it loads the cublas and cusolve and comment out those lines 
then run with -log_view


On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB


import torch
import sys
import os

import nvidia_smi
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

petsc4py_path = 
os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))








Re: [petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev
Commenting out the block containing PetscDeviceContextXXX reduces the memory 
cost from 1.9GB to 1.5GB.
Commenting out PetscDeviceInitializeTypeFromOptions_Private() reduces it to 
0GB.

diff --git a/src/sys/objects/device/interface/device.cxx 
b/src/sys/objects/device/interface/device.cxx
index a682f16b696..1b2c7210dfe 100644
--- a/src/sys/objects/device/interface/device.cxx
+++ b/src/sys/objects/device/interface/device.cxx
@@ -422,7 +422,7 @@ PetscErrorCode 
PetscDeviceInitializeFromOptions_Internal(MPI_Comm comm)
 const auto deviceType = static_cast(i);
 auto initType = defaultInitType;

-ierr = 
PetscDeviceInitializeTypeFromOptions_Private(comm,deviceType,defaultDevice,defaultView,&initType);CHKERRQ(ierr);
+//ierr = 
PetscDeviceInitializeTypeFromOptions_Private(comm,deviceType,defaultDevice,defaultView,&initType);CHKERRQ(ierr);
 if (PetscDeviceConfiguredFor_Internal(deviceType) && (initType == 
PETSC_DEVICE_INIT_EAGER)) {
   initializeDeviceContextEagerly = PETSC_TRUE;
   deviceContextInitDevice= deviceType;
@@ -433,11 +433,13 @@ PetscErrorCode 
PetscDeviceInitializeFromOptions_Internal(MPI_Comm comm)

 /* somewhat inefficient here as the device context is potentially fully 
set up twice (once
  * when retrieved then the second time if setfromoptions makes changes) */
+/*
 ierr = PetscInfo1(PETSC_NULLPTR,"Eagerly initializing PetscDeviceContext 
with %s device\n",PetscDeviceTypes[deviceContextInitDevice]);CHKERRQ(ierr);
 ierr = 
PetscDeviceContextSetRootDeviceType_Internal(deviceContextInitDevice);CHKERRQ(ierr);
 ierr = PetscDeviceContextGetCurrentContext(&dctx);CHKERRQ(ierr);
 ierr = PetscDeviceContextSetFromOptions(comm,"root_",dctx);CHKERRQ(ierr);
 ierr = PetscDeviceContextSetUp(dctx);CHKERRQ(ierr);
+*/
   }
   PetscFunctionReturn(0);
 }
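
An alternative to editing the source, following the -device_enable suggestion elsewhere 
in this thread, would be to request lazy device initialization from the Python side (an 
untested sketch):

import sys
import petsc4py
# keep -log_view but ask PETSc not to initialize the device eagerly at startup
petsc4py.init(sys.argv + ['-device_enable', 'lazy'])
from petsc4py import PETSc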

On Jan 7, 2022, at 10:24 AM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


Without log_view it does not load any cuBLAS/cuSolve immediately with -log_view 
it loads all that stuff at startup. You need to go into the PetscInitialize() 
routine find where it loads the cublas and cusolve and comment out those lines 
then run with -log_view


On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB


import torch
import sys
import os

import nvidia_smi
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

petsc4py_path = 
os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))





[petsc-dev] PETSc init eats too much CUDA memory

2022-01-07 Thread Zhang, Hong via petsc-dev
When PETSc is initialized, it takes about 2GB CUDA memory. This is way too much 
for doing nothing. A test script is attached to reproduce the issue. If I 
remove the first line "import torch", PETSc consumes about 0.73GB, which is 
still significant. Does anyone have any idea about this behavior?

Thanks,
Hong


hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples 
(caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB


import torch
import sys
import os

import nvidia_smi
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

petsc4py_path = 
os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))



Re: [petsc-dev] DMPLEX cannot support two different edges for the same two vertices, hence DMPLEX cannot?

2021-12-01 Thread Zhang, Hong via petsc-dev
We are working on a traffic flow application in which the same two vertices are 
connected by at least two edges. I have not seen any problem yet, even in the 
case where the two vertices are located on different ranks.
Hong

From: Abhyankar, Shrirang G 
Sent: Wednesday, December 1, 2021 3:37 PM
To: Barry Smith ; Knepley, Matthew G (VISIT) 
; petsc-dev ; Zhang, Hong 
; Betrie, Getnet 
Subject: Re: DMPLEX cannot support two different edges for the same two 
vertices, hence DMPLEX cannot?


Barry,



“Is there anything we can do to support having multiple edges between the same 
two vertices”



Some of my power grid datasets have multiple edges between the same two 
vertices and I’ve not faced an issue with DMNetwork. However, all the data was 
read on rank 0 only (and then distributed).



Maybe the issue is with the edges being passed on different ranks?



Thanks,

Shri

From: Barry Smith 
Date: Wednesday, December 1, 2021 at 3:19 PM
To: "Knepley, Matthew G (VISIT)" , PETSc Development 
, "Abhyankar, Shrirang G" , 
"Zhang, Hong" , Getnet Betrie 
Subject: DMPLEX cannot support two different edges for the same two vertices, 
hence DMPLEX cannot?






   Matt,



 If DMPlexBuildFromCellListParallel() is called with two edges that have 
the same two vertices what will happen? It looks like it ends up with an 
incorrect PetscSF if the two edges are passed on different ranks. Hence the 
DMPLEX is not valid and produces garbage.



 Neurons can be connected to themselves which seems to be breaking DMPLEX 
and hence DMNETWORK.



 Is there anything we can do to support having multiple edges between the 
same two vertices? If not, is there a way we can have 
DMPlexBuildFromCellListParallel() generate an error automatically if there are 
such extra edges in the input data?



   Thanks



  Barry



In this work, the neurons are represented by vertices in the network and each 
synapse is a graph edge.






Re: [petsc-dev] I have started a new position

2021-09-13 Thread Zhang, Hong via petsc-dev
https://www.simonsfoundation.org/people/barry-smith/


From: Barry Smith 
Sent: Monday, September 13, 2021 11:35 AM
To: Zhang, Hong 
Cc: For users of the development version of PETSc 
Subject: Re: [petsc-dev] I have started a new position


  The center for computational mathematics, I won't understand a thing in the 
other centers.

  Barry


On Sep 13, 2021, at 11:15 AM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Barry,
https://en.wikipedia.org/wiki/Flatiron_Institute
Flatiron Institute - Wikipedia<https://en.wikipedia.org/wiki/Flatiron_Institute>
The Flatiron Institute is an internal research division of the Simons 
Foundation, launched in 2016. It comprises five centers for computational 
science: the Center for Computational Astrophysics (CCA); the Center for 
Computational Biology (CCB); the Center for Computational Quantum Physics 
(CCQ); the Center for Computational Mathematics (CCM); and the Center for 
Computational Neuroscience (CCN).
Which center will you be working with?
Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Barry Smith mailto:bsm...@petsc.dev>>
Sent: Monday, September 13, 2021 9:24 AM
To: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] I have started a new position


I have started a new research position at the Flatiron Institute and stopped my 
Argonne Associate position. The new position will allow me to continue to work 
on PETSc as well as work with new groups of people.

  Barry



Re: [petsc-dev] I have started a new position

2021-09-13 Thread Zhang, Hong via petsc-dev
Barry,
https://en.wikipedia.org/wiki/Flatiron_Institute
Flatiron Institute - Wikipedia
The Flatiron Institute is an internal research division of the Simons 
Foundation, launched in 2016. It comprises five centers for computational 
science: the Center for Computational Astrophysics (CCA); the Center for 
Computational Biology (CCB); the Center for Computational Quantum Physics 
(CCQ); the Center for Computational Mathematics (CCM); and the Center for 
Computational Neuroscience (CCN).
Which center will you be working with?
Hong

From: petsc-dev  on behalf of Barry Smith 

Sent: Monday, September 13, 2021 9:24 AM
To: For users of the development version of PETSc 
Subject: [petsc-dev] I have started a new position


I have started a new research position at the Flatiron Institute and stopped my 
Argonne Associate position. The new position will allow me to continue to work 
on PETSc as well as work with new groups of people.

  Barry



Re: [petsc-dev] DMNetwork static sizing

2021-04-06 Thread Zhang, Hong via petsc-dev
Shri,
You designed this approach. Is it intended or out of implementation convenience 
at the time?
Hong

From: petsc-dev  on behalf of Matthew Knepley 

Sent: Monday, April 5, 2021 5:47 AM
To: PETSc 
Subject: [petsc-dev] DMNetwork static sizing

Do we really need a configure-time constant for

struct _p_DMNetworkComponentHeader {
  PetscInt index;    /* index for user input global edge and vertex */
  PetscInt subnetid; /* Id for subnetwork */
  PetscInt ndata;    /* number of components */
  PetscInt size[PETSC_DMNETWORK_MAXIMUM_COMPONENTS_PER_POINT];
  PetscInt key[PETSC_DMNETWORK_MAXIMUM_COMPONENTS_PER_POINT];
  PetscInt offset[PETSC_DMNETWORK_MAXIMUM_COMPONENTS_PER_POINT];
  PetscInt nvar[PETSC_DMNETWORK_MAXIMUM_COMPONENTS_PER_POINT];         /* number of variables */
  PetscInt offsetvarrel[PETSC_DMNETWORK_MAXIMUM_COMPONENTS_PER_POINT]; /* offset from the first variable of the network point */
} PETSC_ATTRIBUTEALIGNED(PetscMax(sizeof(double), sizeof(PetscScalar)));

Can't we just allocate this struct when needed and carry the size along?

This design seems to go against the rest of what we do in PETSc?
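
A rough sketch of the dynamically sized alternative being suggested; the type and field names below are assumptions for illustration (and assume a recent PETSc with PetscCall()/PETSC_SUCCESS), not the actual DMNetwork layout:

#include <petscsys.h>

/* The per-point arrays are sized at creation time and that size travels with
   the header, instead of relying on a configure-time maximum. */
typedef struct {
  PetscInt  index;    /* index for user input global edge and vertex */
  PetscInt  subnetid; /* Id for subnetwork */
  PetscInt  ndata;    /* number of components attached to this point */
  PetscInt  maxcomps; /* allocated length of the arrays below */
  PetscInt *size, *key, *offset, *nvar, *offsetvarrel;
} DMNetworkComponentHeaderDynamic;

static PetscErrorCode HeaderCreate(PetscInt maxcomps, DMNetworkComponentHeaderDynamic *h)
{
  PetscFunctionBegin;
  h->ndata    = 0;
  h->maxcomps = maxcomps;
  PetscCall(PetscCalloc5(maxcomps, &h->size, maxcomps, &h->key, maxcomps, &h->offset, maxcomps, &h->nvar, maxcomps, &h->offsetvarrel));
  PetscFunctionReturn(PETSC_SUCCESS);
}

One plausible reason for the current fixed-size, aligned layout is that the header can be stored and communicated as a flat block; pointers would add packing/unpacking code, but nothing fundamental seems to require the static maximum.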

  Thanks,

 Matt

--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] MatTransposeMatMult() bug

2021-03-18 Thread Zhang, Hong via petsc-dev
Pierre,
This would be helpful to users. Thanks,
Hong


From: Pierre Jolivet 
Sent: Thursday, March 18, 2021 10:08 AM
To: Zhang, Hong 
Cc: For users of the development version of PETSc ; 
Patrick Sanan 
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

Thanks for the suggestion Hong.
I’ve been somehow putting on hold edits in the documentation 
(src/docs/website/documentation/changes/dev.html notwithstanding) because I’m 
not sure yet of what will be migrated or not.
But I’ll edit the page according to Patrick’s comment.

Thanks,
Pierre

On 18 Mar 2021, at 4:00 PM, Patrick Sanan 
mailto:patrick.sa...@gmail.com>> wrote:

Sorry about the current mess but that page is halfway migrated, so any updates 
should go here:
https://docs.petsc.org/en/main/install/externalsoftware_documentation/



Am 18.03.2021 um 15:22 schrieb Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>>:

Pierre,
This is an external package to petsc. Shall it be listed at
https://www.mcs.anl.gov/petsc/miscellaneous/external.html?
PETSc: External Software - 
anl.gov<https://www.mcs.anl.gov/petsc/miscellaneous/external.html>
PETSc interfaces to the following optional external software (installing 
packages) (manual pages):. AMD - Approximate minimum degree orderings.; BLAS 
and LAPACK; Chaco - a graph partitioning package.; ESSL - IBM's math library 
for fast sparse direct LU factorization. FFTW - Fastest Fourier Transform in 
the West, developed at MIT by Matteo Frigo and Steven G. Johnson.
Hong

From: Pierre Jolivet mailto:pie...@joliv.et>>
Sent: Thursday, March 18, 2021 1:16 AM
To: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Cc: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

https://www.sciencedirect.com/science/article/abs/pii/S089812212155
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPHPDDM.html
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PC/PCHPDDM.html
I need to update the PETSc user manual though, specifically with respect to 
systems with multiple right-hand sides.
But don’t worry, Stefano has sorted the bug out, which was due to a faulty 
MatSetFromOptions() in MatMAIJ, used by MatTransposeMatMult().

Thanks,
Pierre

On 17 Mar 2021, at 11:21 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

What is hpddm? I do not see its document.
Hong


From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, March 17, 2021 2:49 PM
To: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Cc: Pierre Jolivet mailto:pie...@joliv.et>>; For users of the 
development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

On Wed, Mar 17, 2021 at 3:27 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Pierre,
Do you mean a possible bug in C=AtB MatTransposeMatMult()?
Can you provide a stand-alone test without hpddm that reproduces this error?

Hong, you should be able to just configure with --download-hpddm and then run 
that ex76 test.

  Thanks,

 Matt

Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet mailto:pie...@joliv.et>>
Sent: Wednesday, March 17, 2021 4:31 AM
To: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] MatTransposeMatMult() bug

Hello,
While trying out Stefano’s PCApplyMat_MG() code (*), we stumbled upon weird 
numerical errors when reusing a Mat for both MatProduct_AB and MatProduct_AtB.
This reminded me that there has been a long-standing issue with 
MatTransposeMatMult(), see 
https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/pc/impls/hpddm/hpddm.cxx.html#line608,
 that I never looked into.
I’ve now been trying to figure this out, because this has side effects in 
multiple places (PCMG and PCHPDDM at least), and thus could impact user-code as 
well?
With this commit: 
https://gitlab.com/petsc/petsc/-/commit/03d8bd538039defc2fcc3e37d523735c4aaceba0
+
$ mpirun -n 4 src/ksp/ksp/tutorials/ex76 -ksp_converged_reason -pc_type hpddm 
-pc_hpddm_levels_1_eps_nev 20 -ksp_type preonly -mat_type aij -load_dir 
${DATAFILESPATH}/matrices/hpddm/GENEO -rhs 2 -pc_hpddm_coarse_correction 
balanced -C_input_mattransposematmult -D_output_mattransposematmult
I’m seeing that C is nonzero, but D is full of zeros.
Mat Object: 4 MPI processes
  type: mpidense
5.7098316584361917e-08 1.0159399260517841e-07
1.5812349976211856e-07 2.0688121715350138e-07
2.4887556933361981e-08 4.8111092300772958e-08
1.4606298643602107e-07 1.7213611729839211e-07
[…]
Mat Object: 4 MPI processes
  type: mpidense
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.0

Re: [petsc-dev] MatTransposeMatMult() bug

2021-03-18 Thread Zhang, Hong via petsc-dev
Pierre,
This is an external package to petsc. Shall it be listed at
https://www.mcs.anl.gov/petsc/miscellaneous/external.html?
PETSc: External Software - 
anl.gov<https://www.mcs.anl.gov/petsc/miscellaneous/external.html>
PETSc interfaces to the following optional external software (installing 
packages) (manual pages):. AMD - Approximate minimum degree orderings.; BLAS 
and LAPACK; Chaco - a graph partitioning package.; ESSL - IBM's math library 
for fast sparse direct LU factorization. FFTW - Fastest Fourier Transform in 
the West, developed at MIT by Matteo Frigo and Steven G. Johnson.
Hong

From: Pierre Jolivet 
Sent: Thursday, March 18, 2021 1:16 AM
To: Zhang, Hong 
Cc: For users of the development version of PETSc 
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

https://www.sciencedirect.com/science/article/abs/pii/S089812212155
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPHPDDM.html
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PC/PCHPDDM.html
I need to update the PETSc user manual though, specifically with respect to 
systems with multiple right-hand sides.
But don’t worry, Stefano has sorted the bug out, which was due to a faulty 
MatSetFromOptions() in MatMAIJ, used by MatTransposeMatMult().

Thanks,
Pierre

On 17 Mar 2021, at 11:21 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

What is hpddm? I do not see its document.
Hong


From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, March 17, 2021 2:49 PM
To: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Cc: Pierre Jolivet mailto:pie...@joliv.et>>; For users of the 
development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

On Wed, Mar 17, 2021 at 3:27 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Pierre,
Do you mean a possible bug in C=AtB MatTransposeMatMult()?
Can you provide a stand-alone test without hpddm that reproduces this error?

Hong, you should be able to just configure with --download-hpddm and then run 
that ex76 test.

  Thanks,

 Matt

Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet mailto:pie...@joliv.et>>
Sent: Wednesday, March 17, 2021 4:31 AM
To: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] MatTransposeMatMult() bug

Hello,
While trying out Stefano’s PCApplyMat_MG() code (*), we stumbled upon weird 
numerical errors when reusing a Mat for both MatProduct_AB and MatProduct_AtB.
This reminded me that there has been a long-standing issue with 
MatTransposeMatMult(), see 
https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/pc/impls/hpddm/hpddm.cxx.html#line608,
 that I never looked into.
I’ve now been trying to figure this out, because this has side effects in 
multiple places (PCMG and PCHPDDM at least), and thus could impact user-code as 
well?
With this commit: 
https://gitlab.com/petsc/petsc/-/commit/03d8bd538039defc2fcc3e37d523735c4aaceba0
+
$ mpirun -n 4 src/ksp/ksp/tutorials/ex76 -ksp_converged_reason -pc_type hpddm 
-pc_hpddm_levels_1_eps_nev 20 -ksp_type preonly -mat_type aij -load_dir 
${DATAFILESPATH}/matrices/hpddm/GENEO -rhs 2 -pc_hpddm_coarse_correction 
balanced -C_input_mattransposematmult -D_output_mattransposematmult
I’m seeing that C is nonzero, but D is full of zeros.
Mat Object: 4 MPI processes
  type: mpidense
5.7098316584361917e-08 1.0159399260517841e-07
1.5812349976211856e-07 2.0688121715350138e-07
2.4887556933361981e-08 4.8111092300772958e-08
1.4606298643602107e-07 1.7213611729839211e-07
[…]
Mat Object: 4 MPI processes
  type: mpidense
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
[…]

If one switches to a MatType which has no MatProduct_AtB implementation with B 
of type MPIDense (reminder: in that case, the product is computed 
column-by-column), e.g., -mat_type sbaij, one gets the expected result.
Mat Object: 4 MPI processes
  type: mpidense
7.2003197398135299e-01 9.5191869895699011e-01
6.1793966541680234e-02 9.3884397585488877e-01
1.0022337823233585e-02 2.4653068080134588e-01
1.4463931936094099e-01 8.6111517670701687e-01

Is there a bug somewhere with the MatAIJ implementation, or am I doing 
something which is not allowed by the MatProduct() machinery?

Thanks,
Pierre

(*) https://gitlab.com/petsc/petsc/-/merge_requests/3717


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>



Re: [petsc-dev] MatTransposeMatMult() bug

2021-03-17 Thread Zhang, Hong via petsc-dev
What is hpddm? I do not see its documentation.
Hong


From: Matthew Knepley 
Sent: Wednesday, March 17, 2021 2:49 PM
To: Zhang, Hong 
Cc: Pierre Jolivet ; For users of the development version of 
PETSc 
Subject: Re: [petsc-dev] MatTransposeMatMult() bug

On Wed, Mar 17, 2021 at 3:27 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Pierre,
Do you mean a possible bug in C=AtB MatTransposeMatMult()?
Can you provide a stand-alone test without hpddm that reproduces this error?

Hong, you should be able to just configure with --download-hpddm and then run 
that ex76 test.

  Thanks,

 Matt

Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet mailto:pie...@joliv.et>>
Sent: Wednesday, March 17, 2021 4:31 AM
To: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] MatTransposeMatMult() bug

Hello,
While trying out Stefano’s PCApplyMat_MG() code (*), we stumbled upon weird 
numerical errors when reusing a Mat for both MatProduct_AB and MatProduct_AtB.
This reminded me that there has been a long-standing issue with 
MatTransposeMatMult(), see 
https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/pc/impls/hpddm/hpddm.cxx.html#line608,
 that I never looked into.
I’ve now been trying to figure this out, because this has side effects in 
multiple places (PCMG and PCHPDDM at least), and thus could impact user-code as 
well?
With this commit: 
https://gitlab.com/petsc/petsc/-/commit/03d8bd538039defc2fcc3e37d523735c4aaceba0
+
$ mpirun -n 4 src/ksp/ksp/tutorials/ex76 -ksp_converged_reason -pc_type hpddm 
-pc_hpddm_levels_1_eps_nev 20 -ksp_type preonly -mat_type aij -load_dir 
${DATAFILESPATH}/matrices/hpddm/GENEO -rhs 2 -pc_hpddm_coarse_correction 
balanced -C_input_mattransposematmult -D_output_mattransposematmult
I’m seeing that C is nonzero, but D is full of zeros.
Mat Object: 4 MPI processes
  type: mpidense
5.7098316584361917e-08 1.0159399260517841e-07
1.5812349976211856e-07 2.0688121715350138e-07
2.4887556933361981e-08 4.8111092300772958e-08
1.4606298643602107e-07 1.7213611729839211e-07
[…]
Mat Object: 4 MPI processes
  type: mpidense
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
[…]

If one switches to a MatType which has no MatProduct_AtB implementation with B 
of type MPIDense (reminder: in that case, the product is computed 
column-by-column), e.g., -mat_type sbaij, one gets the expected result.
Mat Object: 4 MPI processes
  type: mpidense
7.2003197398135299e-01 9.5191869895699011e-01
6.1793966541680234e-02 9.3884397585488877e-01
1.0022337823233585e-02 2.4653068080134588e-01
1.4463931936094099e-01 8.6111517670701687e-01

Is there a bug somewhere with the MatAIJ implementation, or am I doing 
something which is not allowed by the MatProduct() machinery?

Thanks,
Pierre

(*) https://gitlab.com/petsc/petsc/-/merge_requests/3717


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


Re: [petsc-dev] MatTransposeMatMult() bug

2021-03-17 Thread Zhang, Hong via petsc-dev
Pierre,
Do you mean a possible bug in C=AtB MatTransposeMatMult()?
Can you provide a stand-alone test without hpddm that reproduces this error?
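
For reference, a minimal sketch of the two products such a stand-alone test would exercise, assuming A is an assembled MPIAIJ matrix, B is an assembled MPIDense matrix, and a recent PETSc (PetscCall()); the viewer option name is just a placeholder:

Mat C, D;
PetscCall(MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C));          /* C = A*B   */
PetscCall(MatTransposeMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &D)); /* D = A^T*B */
PetscCall(MatViewFromOptions(D, NULL, "-D_view"));                           /* D is the product that comes back all zeros in the failing case */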
Hong

From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Wednesday, March 17, 2021 4:31 AM
To: For users of the development version of PETSc 
Subject: [petsc-dev] MatTransposeMatMult() bug

Hello,
While trying out Stefano’s PCApplyMat_MG() code (*), we stumbled upon weird 
numerical errors when reusing a Mat for both MatProduct_AB and MatProduct_AtB.
This reminded me that there has been a long-standing issue with 
MatTransposeMatMult(), see 
https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/pc/impls/hpddm/hpddm.cxx.html#line608,
 that I never looked into.
I’ve now been trying to figure this out, because this has side effects in 
multiple places (PCMG and PCHPDDM at least), and thus could impact user-code as 
well?
With this commit: 
https://gitlab.com/petsc/petsc/-/commit/03d8bd538039defc2fcc3e37d523735c4aaceba0
+
$ mpirun -n 4 src/ksp/ksp/tutorials/ex76 -ksp_converged_reason -pc_type hpddm 
-pc_hpddm_levels_1_eps_nev 20 -ksp_type preonly -mat_type aij -load_dir 
${DATAFILESPATH}/matrices/hpddm/GENEO -rhs 2 -pc_hpddm_coarse_correction 
balanced -C_input_mattransposematmult -D_output_mattransposematmult
I’m seeing that C is nonzero, but D is full of zeros.
Mat Object: 4 MPI processes
  type: mpidense
5.7098316584361917e-08 1.0159399260517841e-07
1.5812349976211856e-07 2.0688121715350138e-07
2.4887556933361981e-08 4.8111092300772958e-08
1.4606298643602107e-07 1.7213611729839211e-07
[…]
Mat Object: 4 MPI processes
  type: mpidense
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
0.e+00 0.e+00
[…]

If one switches to a MatType which has no MatProduct_AtB implementation with B 
of type MPIDense (reminder: in that case, the product is computed 
column-by-column), e.g., -mat_type sbaij, one gets the expected result.
Mat Object: 4 MPI processes
  type: mpidense
7.2003197398135299e-01 9.5191869895699011e-01
6.1793966541680234e-02 9.3884397585488877e-01
1.0022337823233585e-02 2.4653068080134588e-01
1.4463931936094099e-01 8.6111517670701687e-01

Is there a bug somewhere with the MatAIJ implementation, or am I doing 
something which is not allowed by the MatProduct() machinery?

Thanks,
Pierre

(*) https://gitlab.com/petsc/petsc/-/merge_requests/3717


Re: [petsc-dev] Argonne GPU Virtual Hackathon - Accepted

2021-03-12 Thread Zhang, Hong via petsc-dev


On Mar 12, 2021, at 5:25 PM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  Jed,

 Thanks for the insight.

 Maybe Hong and his Ellpack format?   Or his independent set algorithm?

These two features are currently functional on NVIDIA GPUs. Neither needs 
extensive development or refactorization, thus not suitable for this Argonne 
GPU Hackathon. But we do need reviewers to look at the two MRs and move forward.

It is also worth mentioning that the CUDA-based maximal independent set 
algorithm was developed during last GPU Hackathon at ORNL. Richard, Junchao and 
I attended that one. It was a great experience. The commitment is extremely 
intensive for both the mentors and the attendees.

Thanks,
Hong


 Maybe Stefano and his COO matrix assembly on GPUs?

 Others?

Barry


On Mar 12, 2021, at 4:37 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

I helped with one of these a couple years ago. It's important to go in with a 
well-defined problem and mini-app. If "PETSc" is the topic, then you should 
start with a representative benchmark problem that runs in no more than a 
couple minutes. It could be two problems, one that we think is good and one 
that we think is weak. The format is good for incrementally profiling and 
refactoring existing code. Probably less so for doing careful implementations 
of genuinely novel algorithms. (And if you plan to do that, make sure there's a 
way that the one person focused on that lift doesn't block the rest of the 
team.)

Barry Smith mailto:bsm...@petsc.dev>> writes:

  Folks,

 Get was able to get us into the ANL GPU hackathon that will use the 
"theta" GPU system at ANL (that has NVIDIA GPUs).  ALCF accounts are needed for 
the hackathon (but can be created). Registration information is below. If we go 
with specific goals it could be a useful experience.

  Barry



Begin forwarded message:

From: Andi Moore mailto:amo...@nvidia.com>>
Subject: Argonne GPU Virtual Hackathon - Accepted
Date: March 11, 2021 at 5:09:02 PM CST
To: "d.get...@gmail.com" 
mailto:d.get...@gmail.com>>, 
"gbet...@anl.gov" 
mailto:gbet...@anl.gov>>, 
"bsm...@petsc.dev" 
mailto:bsm...@petsc.dev>>
Cc: "Ghadar, Yasaman" mailto:gha...@anl.gov>>, Julia Levites 
mailto:jlevi...@nvidia.com>>
Message-Id: 
mailto:mw3pr12mb4588bf988609b8f7a49f6a21b3...@mw3pr12mb4588.namprd12.prod.outlook.com>>

Dear Team Barry Smith,

Congratulations! Your team has been selected to participate in the Argonne GPU 
Virtual Hackathon.

The next steps:

Registration
Each team member is required to register for the event. Please click here 
 to register and 
select the Hackathon from the drop-down menu. Please register as soon as 
possible.

Mentors
Mentors are being selected for each team based on your research fields and 
coding needs. An introductory call for teams and mentors has been scheduled for 
Wednesday, April 7th, details below.

System Access
Request an ALCF Computer User Account  if you 
do not currently have one. If you have an ALCF account, but it is currently 
inactive,  reactivate  your 
account.
Specify the following in the account request/reactivation form:
Project Name:  gpu_hack
Principal Investigator (PI):  Yasaman Ghadar

Contact accou...@alcf.anl.gov 
 if you have questions about your ALCF account.

Additional information
Review the GPU Hackathon Virtual Attendee Guide 

Review the technical resources website 
 which will help you prepare 
for the hackathon. The page includes information about profiling tools, 
compilers, libraries, programming models and more.

Slack
All communication for the event will occur through the Slack workspace
Please join the Slack workspace

Agenda - All calendar invites are attached
Argonne GPU Hackathon Team and Mentor Meeting: 2:00PM-3:00PM CST – April 7, 
2021 Zoom Link 


Argonne GPU Hackathon Day 1: 9:00AM-5:00PM CST – April 20, 2021 Zoom Link 


Argonne GPU Hackathon Day 2: 9:00AM-5:00PM CST April 27, 2021 Zoom Link 


Argonne GPU Hackathon Day 3: 9:00AM-5:00PM CST April 28, 2021 Zoom Link 


Argonne GPU Hackathon Day 4: 9:00AM-5:30PM CST April 29, 2021 Zoom Link 



Please feel free to ask any 

Re: [petsc-dev] Commit squashing in MR

2021-03-03 Thread Zhang, Hong via petsc-dev
Patrick,
I need to update the PETSc manual on DMNetwork, but do not know how to proceed. I 
tried your suggested steps:
1) go to the docs page you want to edit on 
docs.petsc.org
2) select the version you want (usually "main") in the black ReadTheDocs box in 
the lower right
3) click "edit" in "on GitLab" and make your MR (name the branch with "docs-" 
to maybe get it to auto-build on ReadTheDocs, label with docs and docs-only)

I do not understand 3). Can you give a tutorial demo at the next PETSc meeting?
Hong

From: petsc-dev  on behalf of Patrick Sanan 

Sent: Wednesday, March 3, 2021 12:23 AM
To: Jed Brown 
Cc: Satish Balay via petsc-dev 
Subject: Re: [petsc-dev] Commit squashing in MR

The whole section on git in the dev manual needs some attention. (It was moved 
there in the consolidation of docs we had scattered in various places, but 
hasn't been expertly updated yet). Ideal, I think, would be to find some good, 
external instructions and link to them, under the idea that we should only 
maintain things in our own docs that aren't adequately documented somewhere 
else. This might not be possible (since we had to create these instructions in 
the first place).

There is a section on squashing but it's currently a bit buried, and the advice 
in this thread is probably more useful/current
https://docs.petsc.org/en/main/developers/integration/#squashing-excessive-commits

If anyone wants to go in there and quickly update those docs, remember that you 
can do so all from web interfaces! This workflow still has some wrinkles, but 
for small changes I still think it's appealing:

- go to the docs page you want to edit on docs.petsc.org
- select the version you want (usually "main") in the black ReadTheDocs box in 
the lower right
- click "edit" in "on GitLab" and make your MR (name the branch with "docs-" to 
maybe get it to auto-build on ReadTheDocs, label with docs and docs-only)
- if you get feedback on your MR and need to update, or notice a typo, I 
*think* this will work:
   - click on the last commit of your new branch
   - find the offending file
   - click on "edit at @deadbeef123"
- change the branch *back* to your branch in the pulldown
- click "edit"
- back in your MR, edit to "squash commits"

You can get a partial preview with the usual "preview" button, though not 
everything is interpreted correctly (but for things like links, it works fine).

If you want a full preview, you can

1. Build the Sphinx docs locally from your branch, either with
- "make sphinx-docs-all LOC=$PETSC_DIR"  (you may need to add PYTHON=python3, 
since this relies on Python 3.3+ for venv)
- install the required Python packages yourself (e.g. pip install -r 
src/docs/sphinx_docs/requirements.txt), go to src/docs/sphinx_docs, run "make 
html", and look in _build/html

2. Build the Sphinx docs for your branch as a version on ReadTheDocs. There is 
currently an automation rule there that if your branch name has "docs-" in it, 
it should build (though I must admit I'm still not completely sure I understand 
exactly when RTD updates its information from GitLab). Or, if you have access, 
you can activate a new version yourself.



Am 03.03.2021 um 05:32 schrieb Jed Brown 
mailto:j...@jedbrown.org>>:

Satish Balay via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> writes:

On Wed, 3 Mar 2021, Blaise A Bourdin wrote:

Hi,

This is not technically a petsc question.
It would be great to have a short section in the PETSc integration workflow 
document explaining how to squash commits in a MR for git-impaired developers 
like me.

Anybody wants to pitch in, or explain me how to do this?

To squash commits - I use the 'squash' action in 'git rebase -i HASH' and 
figure out the HASH to use from 'gitk main..branch'

[as git rebase requires the commit prior to the first commit of interest]

git provides many ways of modifying the branch (and the rebase topic is very 
generic) so I think its best to rely on proper git docs/tutorials
[and its not really specific to petsc workflow]

You can do it in one line, without changing the base:

 git rebase -i $(git merge-base main HEAD)


An alternative is

 git rebase -i main

which gives you interactive rebase to replay on top of current 'main'. This 
does two things at once and changing the base for your branch is not always 
desirable.



Re: [petsc-dev] Infinite loop in A*B

2021-03-01 Thread Zhang, Hong via petsc-dev
Pierre,
I pushed a fix in branch hzhang/fix-matmatmult_aij_dense/release. 
https://gitlab.com/petsc/petsc/-/merge_requests/3667
bugfix for MatMatMultSymbolic_MPIAIJ_MPIDense() when Bbn1 = 0. (!3667) · Merge 
Requests · PETSc / petsc<https://gitlab.com/petsc/petsc/-/merge_requests/3667>
Reported-by: Pierre Jolivet pie...@joliv.et Bb (column block size) cannot be 
zero; it leads to infinite loop in MatMatMultNumeric_MPIAIJ_MPIDense() with n=0
Give it a try. Let me know if the bug is not fixed.
Your code is very helpful in debugging.
Hong


From: Pierre Jolivet 
Sent: Monday, March 1, 2021 12:51 AM
To: Zhang, Hong 
Cc: For users of the development version of PETSc 
Subject: Re: [petsc-dev] Infinite loop in A*B


On 1 Mar 2021, at 6:29 AM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Pierre,
This is a bug in MatMatMultSymbolic_MPIAIJ_MPIDense() during optimization of 
block column size of B. Run your code with
'-matmatmult_Bbn 1', the infinite loop should not occur.

Thanks Hong, I can confirm this option makes the more complex use case run 
smoothly as well.

I'll try to figure out a fix tomorrow.

Great.

Thanks,
Pierre

Hong
________
From: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Sent: Sunday, February 28, 2021 11:05 PM
To: Pierre Jolivet mailto:pie...@joliv.et>>; For users of the 
development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>; Zhang, Hong 
mailto:hzh...@mcs.anl.gov>>
Subject: Re: [petsc-dev] Infinite loop in A*B

The infinite loop is in the for (i=0; i<...) loop in MatMatMultNumeric_MPIAIJ_MPIDense(); 
it never terminates because workB->cmap->n=0 (line 590 in mpimatmatmult.c).
Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>>
Sent: Sunday, February 28, 2021 10:33 PM
To: Pierre Jolivet mailto:pie...@joliv.et>>; For users of the 
development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] Infinite loop in A*B

I can reproduce the hang with
mpiexec -n 2 ./matmatmult

It seems in an infinite loop of calling MatDensePlaceArray() from

#0  MatDensePlaceArray (mat=0xda5c50, array=0xd15e60)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:2047
#1  0x7fa0d13bf4f7 in MatDenseGetSubMatrix_SeqDense (A=0xcfb2b0, cbegin=0,
cend=0, v=0xd90370)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:2997
#2  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xcfb2b0, cbegin=0, cend=0,
v=0xd90370) at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#3  0x7fa0d13db5ce in MatDenseGetSubMatrix_MPIDense (A=0xca5250, cbegin=0,
cend=0, v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:1835
#4  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xca5250, cbegin=0, cend=0,
v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#5  0x7fa0d179c2fa in MatMatMultNumeric_MPIAIJ_MPIDense (A=0xc55490,
B=0xca5250, C=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:593
#6  0x7fa0d1181331 in MatProductNumeric_AB (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:567
#7  0x7fa0d1182c14 in MatProductNumeric (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:679
#8  0x7fa0d115ef69 in MatProduct_Private (A=0xc55490, B=0xca5250,
scall=MAT_INITIAL_MATRIX, fill=-2, ptype=MATPRODUCT_AB, C=0x7ffe87d42018)
at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9405
---Type  to continue, or q  to quit---
#9  0x7fa0d115f274 in MatMatMult (A=0xc55490, B=0xca5250, 
scall=MAT_INITIAL_MATRIX, fill=-2,
C=0x7ffe87d42018) at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9445
#10 0x0040130a in main (argc=2, argv=0x7ffe87d42108) at ex1.c:20

I'll try to figure out what is going on. If anyone has a clue, please help. The 
above stack comes from 'release' branch.
Hong

From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet mailto:pie...@joliv.et>>
Sent: Sunday, February 28, 2021 4:17 PM
To: For users of the development version of PETSc 
mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] Infinite loop in A*B

Hello,
The following MWE loops indefinitely for MPI_Comm_size in {2; 3}.
Nothing fancy, just MatAIJ and MatDense.
The problem is either in MatMPIDenseScatter() or 
MatMatMultSymbolic_MPIAIJ_MPIDense(), I believe, so if someone familiar with 
those routines can figure out a hot fix, I’m all ears.
I could of course switch to a MatMult(), but the same infinite loop happens in 
another more complex code with
A = rows=8, cols=35212
B = rows=35212, cols=9
So I’ll need a fix eventually.

Thanks,
Pierre



Re: [petsc-dev] Infinite loop in A*B

2021-02-28 Thread Zhang, Hong via petsc-dev
Pierre,
This is a bug in MatMatMultSymbolic_MPIAIJ_MPIDense() during optimization of 
block column size of B. Run your code with
'-matmatmult_Bbn 1', the infinite loop should not occur. I'll try to figure out 
a fix tomorrow.
Hong
____
From: Zhang, Hong 
Sent: Sunday, February 28, 2021 11:05 PM
To: Pierre Jolivet ; For users of the development version of 
PETSc ; Zhang, Hong 
Subject: Re: [petsc-dev] Infinite loop in A*B

The infinite loop is in the for (i=0; i<...) loop in MatMatMultNumeric_MPIAIJ_MPIDense(); 
it never terminates because workB->cmap->n=0 (line 590 in mpimatmatmult.c).
Hong

From: petsc-dev  on behalf of Zhang, Hong via 
petsc-dev 
Sent: Sunday, February 28, 2021 10:33 PM
To: Pierre Jolivet ; For users of the development version of 
PETSc 
Subject: Re: [petsc-dev] Infinite loop in A*B

I can reproduce the hang with
mpiexec -n 2 ./matmatmult

It seems in an infinite loop of calling MatDensePlaceArray() from

#0  MatDensePlaceArray (mat=0xda5c50, array=0xd15e60)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:2047
#1  0x7fa0d13bf4f7 in MatDenseGetSubMatrix_SeqDense (A=0xcfb2b0, cbegin=0,
cend=0, v=0xd90370)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:2997
#2  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xcfb2b0, cbegin=0, cend=0,
v=0xd90370) at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#3  0x7fa0d13db5ce in MatDenseGetSubMatrix_MPIDense (A=0xca5250, cbegin=0,
cend=0, v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:1835
#4  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xca5250, cbegin=0, cend=0,
v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#5  0x7fa0d179c2fa in MatMatMultNumeric_MPIAIJ_MPIDense (A=0xc55490,
B=0xca5250, C=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:593
#6  0x7fa0d1181331 in MatProductNumeric_AB (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:567
#7  0x7fa0d1182c14 in MatProductNumeric (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:679
#8  0x7fa0d115ef69 in MatProduct_Private (A=0xc55490, B=0xca5250,
scall=MAT_INITIAL_MATRIX, fill=-2, ptype=MATPRODUCT_AB, C=0x7ffe87d42018)
at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9405
---Type  to continue, or q  to quit---
#9  0x7fa0d115f274 in MatMatMult (A=0xc55490, B=0xca5250, 
scall=MAT_INITIAL_MATRIX, fill=-2,
C=0x7ffe87d42018) at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9445
#10 0x0040130a in main (argc=2, argv=0x7ffe87d42108) at ex1.c:20

I'll try to figure out what is going on. If anyone has a clue, please help. The 
above stack comes from 'release' branch.
Hong

From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Sunday, February 28, 2021 4:17 PM
To: For users of the development version of PETSc 
Subject: [petsc-dev] Infinite loop in A*B

Hello,
The following MWE loops indefinitely for MPI_Comm_size in {2; 3}.
Nothing fancy, just MatAIJ and MatDense.
The problem is either in MatMPIDenseScatter() or 
MatMatMultSymbolic_MPIAIJ_MPIDense(), I believe, so if someone familiar with 
those routines can figure out a hot fix, I’m all ears.
I could of course switch to a MatMult(), but the same infinite loop happens in 
another more complex code with
A = rows=8, cols=35212
B = rows=35212, cols=9
So I’ll need a fix eventually.

Thanks,
Pierre



Re: [petsc-dev] Infinite loop in A*B

2021-02-28 Thread Zhang, Hong via petsc-dev
The infinite loop is in the for (i=0; i<...) loop in MatMatMultNumeric_MPIAIJ_MPIDense(); 
it never terminates because workB->cmap->n=0 (line 590 in mpimatmatmult.c).
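Roughly, the failure mode looks like the following (illustration only, not the exact PETSc source; cN stands for the number of columns being processed):

/* workB->cmap->n is the column block size used as the loop increment;
   when it is 0 the index never advances, so the loop never terminates. */
const PetscInt n = workB->cmap->n; /* == 0 in the failing case */
for (PetscInt i = 0; i < cN; i += n) {
  /* scatter the next block of B and multiply ... */
}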
Hong

From: petsc-dev  on behalf of Zhang, Hong via 
petsc-dev 
Sent: Sunday, February 28, 2021 10:33 PM
To: Pierre Jolivet ; For users of the development version of 
PETSc 
Subject: Re: [petsc-dev] Infinite loop in A*B

I can reproduce the hang with
mpiexec -n 2 ./matmatmult

It seems in an infinite loop of calling MatDensePlaceArray() from

#0  MatDensePlaceArray (mat=0xda5c50, array=0xd15e60)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:2047
#1  0x7fa0d13bf4f7 in MatDenseGetSubMatrix_SeqDense (A=0xcfb2b0, cbegin=0,
cend=0, v=0xd90370)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:2997
#2  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xcfb2b0, cbegin=0, cend=0,
v=0xd90370) at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#3  0x7fa0d13db5ce in MatDenseGetSubMatrix_MPIDense (A=0xca5250, cbegin=0,
cend=0, v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:1835
#4  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xca5250, cbegin=0, cend=0,
v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#5  0x7fa0d179c2fa in MatMatMultNumeric_MPIAIJ_MPIDense (A=0xc55490,
B=0xca5250, C=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:593
#6  0x7fa0d1181331 in MatProductNumeric_AB (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:567
#7  0x7fa0d1182c14 in MatProductNumeric (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:679
#8  0x7fa0d115ef69 in MatProduct_Private (A=0xc55490, B=0xca5250,
scall=MAT_INITIAL_MATRIX, fill=-2, ptype=MATPRODUCT_AB, C=0x7ffe87d42018)
at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9405
---Type  to continue, or q  to quit---
#9  0x7fa0d115f274 in MatMatMult (A=0xc55490, B=0xca5250, 
scall=MAT_INITIAL_MATRIX, fill=-2,
C=0x7ffe87d42018) at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9445
#10 0x0040130a in main (argc=2, argv=0x7ffe87d42108) at ex1.c:20

I'll try to figure out what is going on. If anyone has a clue, please help. The 
above stack comes from 'release' branch.
Hong

From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Sunday, February 28, 2021 4:17 PM
To: For users of the development version of PETSc 
Subject: [petsc-dev] Infinite loop in A*B

Hello,
The following MWE loops indefinitely for MPI_Comm_size in {2; 3}.
Nothing fancy, just MatAIJ and MatDense.
The problem is either in MatMPIDenseScatter() or 
MatMatMultSymbolic_MPIAIJ_MPIDense(), I believe, so if someone familiar with 
those routines can figure out a hot fix, I’m all ears.
I could of course switch to a MatMult(), but the same infinite loop happens in 
another more complex code with
A = rows=8, cols=35212
B = rows=35212, cols=9
So I’ll need a fix eventually.

Thanks,
Pierre



Re: [petsc-dev] Infinite loop in A*B

2021-02-28 Thread Zhang, Hong via petsc-dev
I can reproduce the hang with
mpiexec -n 2 ./matmatmult

It seems in an infinite loop of calling MatDensePlaceArray() from

#0  MatDensePlaceArray (mat=0xda5c50, array=0xd15e60)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:2047
#1  0x7fa0d13bf4f7 in MatDenseGetSubMatrix_SeqDense (A=0xcfb2b0, cbegin=0,
cend=0, v=0xd90370)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:2997
#2  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xcfb2b0, cbegin=0, cend=0,
v=0xd90370) at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#3  0x7fa0d13db5ce in MatDenseGetSubMatrix_MPIDense (A=0xca5250, cbegin=0,
cend=0, v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/mpi/mpidense.c:1835
#4  0x7fa0d13c574e in MatDenseGetSubMatrix (A=0xca5250, cbegin=0, cend=0,
v=0x7ffe87d41de0)
at /home/hongsu/soft/petsc/src/mat/impls/dense/seq/dense.c:3371
#5  0x7fa0d179c2fa in MatMatMultNumeric_MPIAIJ_MPIDense (A=0xc55490,
B=0xca5250, C=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:593
#6  0x7fa0d1181331 in MatProductNumeric_AB (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:567
#7  0x7fa0d1182c14 in MatProductNumeric (mat=0xd282b0)
at /home/hongsu/soft/petsc/src/mat/interface/matproduct.c:679
#8  0x7fa0d115ef69 in MatProduct_Private (A=0xc55490, B=0xca5250,
scall=MAT_INITIAL_MATRIX, fill=-2, ptype=MATPRODUCT_AB, C=0x7ffe87d42018)
at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9405
---Type  to continue, or q  to quit---
#9  0x7fa0d115f274 in MatMatMult (A=0xc55490, B=0xca5250, 
scall=MAT_INITIAL_MATRIX, fill=-2,
C=0x7ffe87d42018) at /home/hongsu/soft/petsc/src/mat/interface/matrix.c:9445
#10 0x0040130a in main (argc=2, argv=0x7ffe87d42108) at ex1.c:20

I'll try to figure out what is going on. If anyone has a clue, please help. The 
above stack comes from 'release' branch.
Hong

From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Sunday, February 28, 2021 4:17 PM
To: For users of the development version of PETSc 
Subject: [petsc-dev] Infinite loop in A*B

Hello,
The following MWE loops indefinitely for MPI_Comm_size in {2; 3}.
Nothing fancy, just MatAIJ and MatDense.
The problem is either in MatMPIDenseScatter() or 
MatMatMultSymbolic_MPIAIJ_MPIDense(), I believe, so if someone familiar with 
those routines can figure out a hot fix, I’m all ears.
I could of course switch to a MatMult(), but the same infinite loop happens in 
another more complex code with
A = rows=8, cols=35212
B = rows=35212, cols=9
So I’ll need a fix eventually.

Thanks,
Pierre



Re: [petsc-dev] error with flags PETSc uses for determining AVX

2021-02-14 Thread Zhang, Hong via petsc-dev
Oops, a typo in the command line. Should be AVX. SSE3 or above and AVX are not 
used for -O3.

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | 
grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | 
grep AVX
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$

> On Feb 14, 2021, at 1:25 PM, Zhang, Hong via petsc-dev 
>  wrote:
> 
> 
> 
>> On Feb 14, 2021, at 12:04 PM, Barry Smith  wrote:
>> 
>> 
>>  For our handcoded AVX functions this is fine, we can handle the dispatching 
>> ourselves. 
> 
> Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal 
> AVX code path at runtime. Theoretically we just need to query for the needed 
> features once and cache the results.
> 
>> 
>> But what about all the tons of regular code in PETSc, somehow we need to 
>> have the same function compiled twice and dispatched properly. Do we use 
>> what Hong suggested with fat binaries? So fat-binaries PLUS 
>> _may_i_use_cpu_feature together are the way to portable transportable 
>> libraries?
>> 
>> 
>> And we do this always --with-debugging=0 so everyone, packages and users get 
>> portable but also the best performance possible.
> 
> IMHO, only package managers should consider using -ax options. On our side, 
> if we want to satisfy the needs of different parties (developers, users, 
> package managers), better be conservative than aggressive. -march=native 
> brings huge performance improvement but it has never been the default for 
> many compilers with a good reason. Even -O3 does not enable the advanced 
> vector instructions. I just did a quick check on petsc-02: 
> 
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null 
> | grep SSE
> #define __SSE__ 1
> #define __SSE_MATH__ 1
> #define __SSE2__ 1
> #define __SSE2_MATH__ 1
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null 
> | grep avx
> hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ 
> 
> What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native’) can 
> be suggested to anyone who does not need to care about portability. If you do 
> not want users to specify the magic options, perhaps we can provide a 
> configure option like --with-portability. If it is set to false, we add 
> aggressive flags automatically.
> 
> Hong
> 
>> 
>> Barry
>> 
>> 
>>> On Feb 14, 2021, at 11:50 AM, Jed Brown  wrote:
>>> 
>>>> 
>>> 
>>> immintrin.h provides
>>> 
>>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>>> fancy_version_that_needs_fma_and_avx2();
>>> } else {
>>> fallback_version();
>>> }
>>> 
>>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>>> 
>>> I believe this function is slightly expensive because it probably calls the 
>>> CPUID instruction each time. BLIS has code to cache the result and query 
>>> features with simple bitwise math.
>>> 
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>>> 
>>> Of course this bit of dispatch should typically be done at object creation 
>>> time, not every iteration.
>> 
> 



Re: [petsc-dev] error with flags PETSc uses for determining AVX

2021-02-14 Thread Zhang, Hong via petsc-dev


> On Feb 14, 2021, at 12:04 PM, Barry Smith  wrote:
> 
> 
>   For our handcoded AVX functions this is fine, we can handle the dispatching 
> ourselves. 

Cool. _may_i_use_cpu_feature() would be very useful to determine the optimal 
AVX code path at runtime. Theoretically we just need to query for the needed 
features once and cache the results.
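
A sketch of that query-once-and-cache idea; it is Intel-compiler specific (since _may_i_use_cpu_feature() comes from icc's immintrin.h) and the function and variable names are made up:

#include <immintrin.h>

static int use_fma_avx2 = -1; /* -1: not queried yet */

/* Pay for the (possibly CPUID-backed) feature query once; afterwards the
   dispatch test is just a cached integer compare. */
static inline int UseFmaAvx2(void)
{
  if (use_fma_avx2 < 0) use_fma_avx2 = _may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2) ? 1 : 0;
  return use_fma_avx2;
}

The cached result would then be consulted at object creation time, e.g. when picking a MatMult kernel, rather than inside inner loops.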

> 
>  But what about all the tons of regular code in PETSc, somehow we need to 
> have the same function compiled twice and dispatched properly. Do we use what 
> Hong suggested with fat binaries? So fat-binaries PLUS _may_i_use_cpu_feature 
> together are the way to portable transportable libraries?
> 
> 
>  And we do this always --with-debugging=0 so everyone, packages and users get 
> portable but also the best performance possible.

IMHO, only package managers should consider using -ax options. On our side, if 
we want to satisfy the needs of different parties (developers, users, package 
managers), better be conservative than aggressive. -march=native brings huge 
performance improvement but it has never been the default for many compilers 
with a good reason. Even -O3 does not enable the advanced vector instructions. 
I just did a quick check on petsc-02: 

hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | 
grep SSE
#define __SSE__ 1
#define __SSE_MATH__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ icc -O3 -E -dM - < /dev/null | 
grep avx
hongzhang@petsc-02:/nfs/gce/projects/TSAdjoint$ 

What Jed usually does (--with-debugging=0 COPTFLAGS='-O2 -march=native’) can be 
suggested to anyone who does not need to care about portability. If you do not 
want users to specify the magic options, perhaps we can provide a configure 
option like --with-portability. If it is set to false, we add aggressive flags 
automatically.

Hong

> 
>  Barry
> 
> 
>> On Feb 14, 2021, at 11:50 AM, Jed Brown  wrote:
>> 
>>> 
>> 
>> immintrin.h provides
>> 
>> if (_may_i_use_cpu_feature(_FEATURE_FMA | _FEATURE_AVX2)) {
>> fancy_version_that_needs_fma_and_avx2();
>> } else {
>> fallback_version();
>> }
>> 
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_may_i_use&expand=3677,3677
>> 
>> I believe this function is slightly expensive because it probably calls the 
>> CPUID instruction each time. BLIS has code to cache the result and query 
>> features with simple bitwise math.
>> 
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.h
>> https://github.com/flame/blis/blob/master/frame/base/bli_cpuid.c
>> 
>> Of course this bit of dispatch should typically be done at object creation 
>> time, not every iteration.
> 



Re: [petsc-dev] error with flags PETSc uses for determining AVX

2021-02-14 Thread Zhang, Hong via petsc-dev


On Feb 14, 2021, at 10:09 AM, Pierre Jolivet 
mailto:pie...@joliv.et>> wrote:



On 14 Feb 2021, at 4:52 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:



On Feb 14, 2021, at 5:05 AM, Patrick Sanan 
mailto:patrick.sa...@gmail.com>> wrote:



Am 14.02.2021 um 07:22 schrieb Barry Smith 
mailto:bsm...@petsc.dev>>:



On Feb 13, 2021, at 11:58 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

I usually configure --with-debugging=0 COPTFLAGS='-O2 -march=native' or 
similar. There's a tension here between optimizing aggressively for the current 
machine and making binaries that work on other machines. Most configure systems 
default to making somewhat portable binaries, so that's a principle of least 
surprise. (Though you're no novice and seem to have been surprised anyway.)

I'd kinda prefer if we recommended making portable binaries that run-time 
detected when to use newer instructions where it matters.

   How do we do this? What can we put in configure to do this?

  Yes, I never paid attention to the AVX nonsense over the years and never 
realized that Intel and Gnu (and hence PETSc)  both compile by default for 
machines I used in my twenties.

  Expecting PETSc users to automatically add -march= is not realistic.  I will 
try to rig something up in configure where if the user does not provide march 
something reasonable is selected.
A softer (yet trivial to implement) option might also be to just alert the user 
that these flags exist in the usual message about using default optimization 
flags. Something like this would encourage users to do what Jed is doing:

  * WARNING: Using default optimization C flags -g -O3
You might consider manually setting optimal optimization flags for your system 
with
COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for examples.
In particular, you may want to supply specific flags (e.g. -march=native)
to take advantage of higher-performance instructions.

I think this is a reasonable thing to do.

This is a reasonable message to print on the screen, but I don’t think this is 
a reasonable flag to impose by default.
You are basically asking all package managers to add a new flag 
(-march=generic) which was previously not needed.

I’m crossing my fingers Jed has a clever way of "making portable binaries that 
run-time detected when to use newer instructions where it matters”, because 
-march=native by default is just not practical when deploying software.

This is doable using Intel compilers with -ax options at the cost of generating 
fat binaries. For example,

-axCORE-AVX512,MIC-AVX512
-axAVX,CORE-AVX2 (more generally, -ax<arch>)
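
For GCC (and recent Clang) a similar fat-binary effect is available through function multi-versioning rather than compiler flags; a sketch, where the function and the target list are purely illustrative:

/* The compiler emits one clone per listed target plus an ifunc resolver that
   picks the best clone at program load time, so a single binary runs on both
   old and new CPUs. */
__attribute__((target_clones("default", "avx2", "avx512f")))
void axpy(double a, const double *restrict x, double *restrict y, int n)
{
  for (int i = 0; i < n; i++) y[i] += a * x[i];
}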

Hong


Thanks,
Pierre

We should also inform users that tuning -march options may enable vectorization 
instructions such as SSE(3 and above) and AVX but generate nonportable binaries.

If we add -march=native to the configure test, we will need to run executables 
to make sure the specified instruction sets are supported by the CPU where the 
code is running. For PETSc, the executables should cover all the intrinsics we 
use in the code ideally; otherwise, users will get run-time errors when there 
is a mismatch in vectorization instructions between compiler support and CPU 
support.

Hong


None of the examples in config/examples actually use -march=native, and this is 
a very common thing to do that, as you point out, isn't obvious until you know 
you have to do it, so it seems to be worth the screen space.






 Barry



Barry Smith mailto:bsm...@petsc.dev>> writes:

Shouldn't configure be setting something appropriate for this automatically? 
This is nuts, it means when users do a ./configure make unless they pass weird 
arguments they sure as heck don't know about to the compiler they won't get any 
of the glory that they expect and that has been in almost all Intel systems 
forever.

Barry

I run ./configure --with-debugging=0 and I get none of the stuff added by Intel 
for 15+ years?


On Feb 13, 2021, at 11:26 PM, Jed Brown  wrote:

Use -march=native or similar. The default target is basic x86_64, which has 
only SSE2.

Barry Smith  writes:

PETSc source has code like defined(__AVX2__) in the source but it does not seem 
to be able to find any of these macros (icc or gcc) on the petsc-02 system

Are these macros supposed to be defined? How does one get them to be defined? 
Why are they not defined? What am I doing wrong?

Keep reading

$ lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   85
Model name:  Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping:7
CPU MHz: 1000.603
CPU max MHz: 2301

Re: [petsc-dev] error with flags PETSc uses for determining AVX

2021-02-14 Thread Zhang, Hong via petsc-dev


On Feb 14, 2021, at 5:05 AM, Patrick Sanan 
mailto:patrick.sa...@gmail.com>> wrote:



Am 14.02.2021 um 07:22 schrieb Barry Smith 
mailto:bsm...@petsc.dev>>:



On Feb 13, 2021, at 11:58 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

I usually configure --with-debugging=0 COPTFLAGS='-O2 -march=native' or 
similar. There's a tension here between optimizing aggressively for the current 
machine and making binaries that work on other machines. Most configure systems 
default to making somewhat portable binaries, so that's a principle of least 
surprise. (Though you're no novice and seem to have been surprised anyway.)

I'd kinda prefer if we recommended making portable binaries that run-time 
detected when to use newer instructions where it matters.

   How do we do this? What can we put in configure to do this?

  Yes, I never paid attention to the AVX nonsense over the years and never 
realized that Intel and Gnu (and hence PETSc)  both compile by default for 
machines I used in my twenties.

  Expecting PETSc users to automatically add -march= is not realistic.  I will 
try to rig something up in configure where if the user does not provide march 
something reasonable is selected.
A softer (yet trivial to implement) option might also be to just alert the user 
that these flags exist in the usual message about using default optimization 
flags. Something like this would encourage users to do what Jed is doing:

  * WARNING: Using default optimization C flags -g -O3
You might consider manually setting optimal optimization flags for your system 
with
COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for examples.
In particular, you may want to supply specific flags (e.g. -march=native)
to take advantage of higher-performance instructions.

I think this is a reasonable thing to do. We should also inform users that 
tuning -march options may enable vectorization instructions such as SSE(3 and 
above) and AVX but generate nonportable binaries.

If we add -march=native to the configure test, we will need to run executables 
to make sure the specified instruction sets are supported by the CPU where the 
code is running. For PETSc, the executables should cover all the intrinsics we 
use in the code ideally; otherwise, users will get run-time errors when there 
is a mismatch in vectorization instructions between compiler support and CPU 
support.

Hong


None of the examples in config/examples actually use -march=native, and this is 
a very common thing to do that, as you point out, isn't obvious until you know 
you have to do it, so it seems to be worth the screen space.






 Barry



Barry Smith mailto:bsm...@petsc.dev>> writes:

Shouldn't configure be setting something appropriate for this automatically? 
This is nuts, it means when users do a ./configure make unless they pass weird 
arguments they sure as heck don't know about to the compiler they won't get any 
of the glory that they expect and that has been in almost all Intel systems 
forever.

Barry

I run ./configure --with-debugging=0 and I get none of the stuff added by Intel 
for 15+ years?


On Feb 13, 2021, at 11:26 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

Use -march=native or similar. The default target is basic x86_64, which has 
only SSE2.

Barry Smith mailto:bsm...@petsc.dev>> writes:

PETSc source has code like defined(__AVX2__) in the source but it does not seem 
to be able to find any of these macros (icc or gcc) on the petsc-02 system

Are these macros supposed to be defined? How does one get them to be defined? 
Why are they not defined? What am I doing wrong?

Keep reading

$ lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   85
Model name:  Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping:7
CPU MHz: 1000.603
CPU max MHz: 2301.
CPU min MHz: 1000.
BogoMIPS:4600.00
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:1024K
L3 cache:22528K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx 
smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe 
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb 
stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 av

Re: [petsc-dev] error with flags PETSc uses for determining AVX

2021-02-13 Thread Zhang, Hong via petsc-dev
The CPU supports avx2, but the compiler or the OS may not. You can print out 
the macros that the compiler defines and grep for avx2. The commands can be 
found at
 
https://stackoverflow.com/questions/9349754/generate-list-of-preprocessor-macros-defined-by-the-compiler

Hong

On Feb 13, 2021, at 8:48 PM, Barry Smith  wrote:


PETSc source has code like defined(__AVX2__) in the source but it does not seem 
to be able to find any of these macros (icc or gcc) on the petsc-02 system

Are these macros supposed to be defined? How does one get them to be defined? 
Why are they not defined? What am I doing wrong?

Keep reading

$ lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   85
Model name:  Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping:7
CPU MHz: 1000.603
CPU max MHz: 2301.
CPU min MHz: 1000.
BogoMIPS:4600.00
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:1024K
L3 cache:22528K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx 
smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe 
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb 
stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap 
clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 
xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln 
pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Test program

#if defined(__FMA__)
#error FMA
#endif

#if defined(__AVX512F__)
#error AVX512F
#endif

#if defined(__AVX2__)
#error AVX2
#endif


icc mytest.c
/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o: In function 
`_start':
(.text+0x20): undefined reference to `main'
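
The link error above is only the missing main(); a self-contained variant of the same 
probe (compile with, e.g., gcc -march=native mytest.c or icc -xHost mytest.c) would be:

  /* probe the SIMD macros the compiler predefines for the chosen target */
  #if defined(__FMA__)
  #error FMA
  #endif
  #if defined(__AVX512F__)
  #error AVX512F
  #endif
  #if defined(__AVX2__)
  #error AVX2
  #endif
  int main(void) { return 0; }

Each macro that is defined triggers its corresponding #error, so the compiler 
diagnostics list the enabled extensions.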





Re: [petsc-dev] "Search" does not work in the testing system?

2021-01-27 Thread Zhang, Hong via petsc-dev

make PETSC_DIR=/Users/kongf/projects/moose4/petsc 
PETSC_ARCH=arch-darwin-c-debug -f gmakefile test search='snes_tutorials-ex1_*'

or

make PETSC_DIR=/Users/kongf/projects/moose4/petsc 
PETSC_ARCH=arch-darwin-c-debug -f gmakefile test 
globsearch='snes_tutorials-ex1_*’

Hong (Mr.)

> On Jan 27, 2021, at 5:21 PM, Fande Kong  wrote:
> 
> Hi All,
> 
> I want to run one particular SNES test using the following command-line:
> 
> "make PETSC_DIR=/Users/kongf/projects/moose4/petsc 
> PETSC_ARCH=arch-darwin-c-debug -f gmakefile test search='snes_tutorials-ex1'"
> 
> I got the following output:
> 
> "Using MAKEFLAGS: search=snes_tutorials-ex1% PETSC_ARCH=arch-darwin-c-debug 
> PETSC_DIR=/Users/kongf/projects/moose4/petsc"
> 
> But I did not see any useful test information.
> 
> Could you kindly let me know what I did wrong?
> 
> Thanks,
> 
> Fande



Re: [petsc-dev] obscure changes in TSGetStages_Theta

2021-01-24 Thread Zhang, Hong via petsc-dev
Some TS methods such as TSRK do have an array of vectors like this to store the 
stage values. But not all TS methods have it. I am fine adding the scratch for 
TSTheta and any other method missing it. A little drawback is that it is used 
only by TSGetStages and the TSStep implementation does not necessarily need it. 
So I like the idea to return a temporary array with GetStages and free it with 
RestoreStages when the TS method does not have an array for stages internally.
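
From the caller's side the pairing would look like this (a sketch; TSRestoreStages is 
the proposed addition discussed above, not yet in the library):

  Vec  *Y;
  PetscInt ns;
  ierr = TSGetStages(ts,&ns,&Y);CHKERRQ(ierr);
  /* use the ns stage vectors Y[0..ns-1] */
  ierr = TSRestoreStages(ts,&ns,&Y);CHKERRQ(ierr);  /* frees the temporary array when the method has no internal stage storage */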

Hong

On Jan 24, 2021, at 1:08 AM, Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:

You actually do not need a RestoreStages if you use a scratch

Vec stages[2];

in TS_theta, and pass it back with TSGetStages_Theta. I understand the PETSc 
philosophy is that every Get should have a matching Restore, but is this really 
necessary for TSGetStages?

Il giorno sab 23 gen 2021 alle ore 21:05 Zhang, Hong 
mailto:hongzh...@anl.gov>> ha scritto:
Done. Please check https://gitlab.com/petsc/petsc/-/merge_requests/3583

Sorry for any disturbance it caused. It was for the convenience of the adjoint 
implementation. The stages returned by TSGetStages_Theta currently do not 
reflect the true stages associated with these methods. The endpoint variant 
actually has two stages. This will be changed in a separate forthcoming MR, 
where TSRestoreStages() will be added and TSGetStages will return an array of 
vectors for the endpoint variant.

Hong

On Jan 21, 2021, at 4:16 AM, Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:

Hong,

I do not understand why you changed the behavior of TSGetStages_Theta 
https://gitlab.com/petsc/petsc/-/merge_requests/3500/diffs#a582bbaec75f4ae14bbf97d1d0404073ca89ff09_1194_1209
 with this MR https://gitlab.com/petsc/petsc/-/merge_requests/3500

Now, the non-endpoint variant does no longer use th->X as stage!

Please revert this change
Thanks
--
Stefano



--
Stefano



Re: [petsc-dev] obscure changes in TSGetStages_Theta

2021-01-23 Thread Zhang, Hong via petsc-dev
Done. Please check https://gitlab.com/petsc/petsc/-/merge_requests/3583

Sorry for any disturbance it caused. It was for the convenience of the adjoint 
implementation. The stages returned by TSGetStages_Theta currently do not 
reflect the true stages associated with these methods. The endpoint variant 
actually has two stages. This will be changed in a separate forthcoming MR, 
where TSRestoreStages() will be added and TSGetStages will return an array of 
vectors for the endpoint variant.

Hong

On Jan 21, 2021, at 4:16 AM, Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:

Hong,

I do not understand why you changed the behavior of TSGetStages_Theta 
https://gitlab.com/petsc/petsc/-/merge_requests/3500/diffs#a582bbaec75f4ae14bbf97d1d0404073ca89ff09_1194_1209
 with this MR https://gitlab.com/petsc/petsc/-/merge_requests/3500

Now, the non-endpoint variant does no longer use th->X as stage!

Please revert this change
Thanks
--
Stefano



Re: [petsc-dev] About parallel of ILU

2021-01-15 Thread Zhang, Hong via petsc-dev
Just in case you want to try the exact algorithm you attached: it can be used 
from PETSc with -pc_type hypre -pc_hypre_type euclid
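
Equivalently in code (a sketch; assumes PETSc was configured with hypre, e.g. --download-hypre):

  PC pc;
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCHYPRE);CHKERRQ(ierr);
  ierr = PCHYPRESetType(pc,"euclid");CHKERRQ(ierr);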

Hong (Mr.)

> On Jan 12, 2021, at 8:42 AM, Chen Gang <569615...@qq.com> wrote:
> 
> 
> Dear Professor,
> 
> I'm writing about the ILU algorithm in PETSc. I have noticed 
> that the ILU in PETSc cannot be parallelized, and that hypre provides a 
> parallel implementation of ILU.
> 
> My question is: why does PETSc not provide a parallel implementation of 
> ILU directly, instead of linking to hypre to obtain the parallel ILU 
> algorithm?
> 
> PS: the attachment is the parallel ILU algorithm in hypre.
> 
> I'm looking forward to hearing from you!
> 
> Best,
> 
> Gang Chen
> 
> Sichuan Univertsity,
> School of Mathematics,
> Chengdu, China
> 


Re: [petsc-dev] problem with MatSeqAIJCUSPARSEILUAnalysisAndCopyToGPU

2020-12-22 Thread Zhang, Hong via petsc-dev


On Dec 22, 2020, at 3:38 PM, Mark Adams 
mailto:mfad...@lbl.gov>> wrote:

I am MPI serial LU solving a smallish matrix (2D, Q3, 8K equations) on a Summit 
node (42 P9 cores, 6 V100 GPUs) using cuSparse and Kokkos kernels. The cuSparse 
performance is terrible.

I solve the same TS problem in MPI serial on each global process. I run with 
NP=1 or (all) 7 cores/MPI per GPU:
MatLUFactorNum time, using all 6 GPUs:
NP/GPU   cuSparse   Kokkos kernels
1        0.12       0.075
7        0.55       0.072    // some noise here
So cuSparse is about 2x slower on one process and 8x slower when using all the 
cores, from memory contention I assume.

I found that the problem is in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix. 
Most of this excess time is in:

  cerr = cudaMallocHost((void**) &AALo, 
nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);

and

  cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);

nzLower is about 140K. Here is my timer data, in a stage after a "warm up 
stage":

   Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix  12 1.0 2.3514e-01 1.1 
0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  23  0  0  0  0 0   
0 12 1.34e+010 0.00e+00  0
   MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost  12 1.0 
1.5448e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  15  0  0  0 
 0 0   0  0 0.00e+000 0.00e+00  0
 MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost  12 1.0 
8.3908e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   8  0  0  0 
 0 0   0  0 0.00e+000 0.00e+00  0

Allocation/free of pinned memory is slow, usually on the order of several 
milliseconds. So these numbers look normal. Is there any opportunity to reuse 
the pinned memory in these functions?
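
One possible shape for such reuse, just as a sketch (the cached pointer and size 
members are hypothetical, not existing fields of the cusparse structs):

  /* keep a pinned staging buffer on the factor matrix and grow it only when needed */
  if (fact->AALo_pinned_size < nzLower) {
    if (fact->AALo_pinned) { cerr = cudaFreeHost(fact->AALo_pinned);CHKERRCUDA(cerr); }
    cerr = cudaMallocHost((void**)&fact->AALo_pinned,nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);
    fact->AALo_pinned_size = nzLower;
  }
  AALo = fact->AALo_pinned;  /* freed once in the destroy routine instead of on every numeric factorization */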

Hong (Mr.)

This 0.23 sec happens in Upper also, for a total of ~0.46, which pretty much 
matches the difference with Kokkos.

Any ideas?

Thanks,
Mark



Re: [petsc-dev] Can I call PetscSectionAddDof(s, p, ndof) at a shared 'p' by more than one processor?

2020-11-19 Thread Zhang, Hong via petsc-dev
Matt,
At a local section point (including a ghost point), how do I find all the owners? 
For example, proc[0] and proc[2] share p. How does proc[0] find out that proc[2] is an 
owner? Is there a routine that provides this info?
Hong


From: Matthew Knepley 
Sent: Wednesday, November 18, 2020 1:56 PM
To: Lawrence Mitchell 
Cc: Zhang, Hong ; petsc-dev 
Subject: Re: [petsc-dev] Can I call PetscSectionAddDof(s, p, ndof) at a shared 
'p' by more than one processor?

On Wed, Nov 18, 2020 at 2:19 PM Lawrence Mitchell 
mailto:we...@gmx.li>> wrote:
> On 18 Nov 2020, at 15:26, Zhang, Hong via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Matt or Jed,
> Can I call PetscSectionAddDof(s,p,ndof) at a shared 'p' by more than one 
> processor? For example,
> if (rank == 0) {
> PetscSectionAddDof(s,p,1) ;
> } else if (rank == 1) {
> PetscSectionAddDof(s,p,2) ;
> }
> Then, at shared 'p', section 's' has dof=3?

Sections are "local" objects that are tied together by an SF that describes 
point ownership.

So I think that the only thing you need is that if two processes set a dof on 
what is globally the same point, they should agree on how many dofs live there.

I wonder if PetscSectionAddDof (etc...) should be marked as logically 
collective.

Right now, what you want is handled by creating a global Section from a local 
Section:

  
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PetscSection/PetscSectionCreateGlobalSection.html

The global Section contains global offsets for all local points, and the ghost 
points have negative offsets.

There is no facility for combining dofs. The idea is that you know the number 
of dofs on each local point. If you change that
interpretation, you could easily use SF to add them up and broadcast the sum.
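
A minimal sketch of that reduce-then-broadcast idea, assuming an ndofs[] array indexed 
by local points and the DM's point SF (the MPI_Op argument to the broadcast is the 
newer PetscSF API):

  /* each rank fills ndofs[] for the points it sets; shared points are summed at the owner, then copied back */
  ierr = PetscSFReduceBegin(pointSF,MPIU_INT,ndofs,ndofs,MPI_SUM);CHKERRQ(ierr);
  ierr = PetscSFReduceEnd(pointSF,MPIU_INT,ndofs,ndofs,MPI_SUM);CHKERRQ(ierr);
  ierr = PetscSFBcastBegin(pointSF,MPIU_INT,ndofs,ndofs,MPI_REPLACE);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(pointSF,MPIU_INT,ndofs,ndofs,MPI_REPLACE);CHKERRQ(ierr);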

  Thanks,

 Matt

Lawrence
--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


[petsc-dev] Can I call PetscSectionAddDof(s, p, ndof) at a shared 'p' by more than one processor?

2020-11-18 Thread Zhang, Hong via petsc-dev
Matt or Jed,
Can I call PetscSectionAddDof(s,p,ndof) at a shared 'p' by more than one 
processor? For example,
if (rank == 0) {
PetscSectionAddDof(s,p,1) ;
} else if (rank == 1) {
PetscSectionAddDof(s,p,2) ;
}
Then, at shared 'p', section 's' has dof=3?

I did a test, and got an error
petsc/src/ksp/ksp/tutorials/network (hzhang/dmnetwork-netcoupling *=)
mpiexec -n 2 ./ex4

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Invalid argument
[0]PETSC ERROR: Global dof 2 for point 3 is not the unconstrained 1
...
[0]PETSC ERROR: #1 PetscSectionCreateGlobalSection() line 1297 in 
/Users/hongzhang-sun/soft/petsc/src/vec/is/section/interface/section.c
[0]PETSC ERROR: #2 DMGetGlobalSection() line 4367 in 
/Users/hongzhang-sun/soft/petsc/src/dm/interface/dm.c
[0]PETSC ERROR: #3 DMSetUp_Network() line 2187 in 
/Users/hongzhang-sun/soft/petsc/src/dm/impls/network/network.c
[0]PETSC ERROR: #4 DMSetUp() line 788 in 
/Users/hongzhang-sun/soft/petsc/src/dm/interface/dm.c
[0]PETSC ERROR: #5 main() line 132 in ex4.c

Another question: can I find the processor ownership of a section point?

Hong



Re: [petsc-dev] sm_70

2020-09-27 Thread Zhang, Hong via petsc-dev


On Sep 25, 2020, at 8:09 PM, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:


  Configure by default should find out the available GPU and build for that 
sm_*; it should not require the user to set this (how the heck is the user 
going to know what to set?). If I remember correctly there is a utility 
available that gives this information.

  For generic builds like in package distributions I don't know how it should 
work, ideally all the possibilities would be available in the library and at 
run time the correct one will be utilized.


For package distribution we should add as many possibilities as possible. To 
have maximum compatibility on CUDA 11, we can add

-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75

The downside is longer compilation time and a fatter binary.
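
For a hand-built PETSc the same idea can be passed at configure time, e.g. 
(illustrative; list only the architectures you need):

  ./configure --with-cuda --CUDAFLAGS='-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75'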

Hong (Mr.)




  Barry


On Sep 25, 2020, at 5:49 PM, Mark Adams 
mailto:mfad...@lbl.gov>> wrote:

   '--CUDAFLAGS=-arch=sm_70',

seems to fix this.

On Fri, Sep 25, 2020 at 6:31 PM Mark Adams 
mailto:mfad...@lbl.gov>> wrote:
I see kokkos and hyper have a sm_70 flag, but I don't see one for PETSc.

It looks like you have to specify this to get modern atomics to work in Cuda. I 
get:

/ccs/home/adams/petsc/include/petscaijdevice.h(99): error: no instance of 
overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)

I tried using a Kokkos configuration, thinking I could get these sm_70 flags, 
but that did not work.

Any ideas?

Mark




Re: [petsc-dev] PDIPDM questions

2020-09-14 Thread Zhang, Hong via petsc-dev
Pierre,
ex1.c is a toy test inherited from the previous experimental pdipm. We simply send 
centralized data to all other processes to test pdipm. It is not intended for 
performance. We should add more tests.

The current pdipm is not fully developed yet; in particular, its linear solver may fail 
to handle the indefinite KKT matrix. We are working on it. We'll let you know after 
we get it updated.

For your request about a 'distributed Hessian with a Jacobian with a single 
row', either someone else on the PETSc/TAO team will address this issue, or I'll check 
the details and get back to you later.

Hong


From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Monday, September 14, 2020 1:51 PM
To: PETSc 
Subject: [petsc-dev] PDIPDM questions

Hello,
In my quest to help users migrate from Ipopt to Tao, I’ve a new question.
When looking at src/tao/constrained/tutorials/ex1.c, it seems that almost 
everything is centralized on rank 0 (local sizes are 0 but on rank 0).
I’d like to have my Hessian distributed more naturally, as in (almost?) all 
other SNES/TS examples, but still keep the Jacobian of my equality constraint, 
which is of dimension 1 x N (N >> 1), centralized on rank 0.
Is this possible?
If not, is it possible to supply the transpose of the Jacobian, of dimension N 
x 1, which could then be distributed row-wise like the Hessian?
Or maybe use some trick to distribute a MatAIJ/MatDense of dimension 1 x N 
column-wise? Use a MatNest with as many blocks as processes?

So, just to sum up, how can I have a distributed Hessian with a Jacobian with a 
single row?

Thanks in advance for your help,
Pierre


Re: [petsc-dev] Statistics on the popularity of PETSc

2020-09-10 Thread Zhang, Hong via petsc-dev
Thanks for the info., interesting. I'll pass them to the requester.
Hong

From: petsc-dev  on behalf of Jacob 
Faibussowitsch 
Sent: Thursday, September 10, 2020 4:43 PM
To: Barry Smith 
Cc: petsc-dev 
Subject: Re: [petsc-dev] Statistics on the popularity of PETSc

Actually, this reminds me: speaking of package managers, do any of the HPC 
machines that have petsc modules installed collect any similar analytics? I 
can’t imagine they don’t keep track of this stuff (at least 
internally) to keep their list of available modules relevant.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

On Sep 10, 2020, at 17:40, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

 So we could say roughly 3% of brew OpenMPI users use PETSc ?  Pretty low.

If you’re inferring the petsc user base from brew packages, I would argue that 
these aren’t really all that representative except for direct dependency 
downloads. So mumps etc. likely wouldn’t correlate well since it’s not listed as a 
direct dependency for the petsc package.

There’s also the question of whether the counters on the dependencies are even 
useful to begin with. I suspect it’s probably more common that users clone 
petsc and then use a combination of package-manager packages and --with-download 
configure options.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

On Sep 10, 2020, at 17:27, Barry Smith 
mailto:bsm...@petsc.dev>> wrote:

  Jacob,

Cool feature,

$ brew info openmpi
open-mpi: stable 4.0.5 (bottled), HEAD
High performance message passing library
https://www.open-mpi.org/
Conflicts with:
  mpich (because both install MPI compiler wrappers)
Not installed
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/open-mpi.rb
License: BSD-3-Clause
==> Dependencies
Required: gcc ✔, hwloc ✘, libevent ✔
==> Options
--HEAD
Install HEAD version
==> Analytics
install: 24,409 (30 days), 72,362 (90 days), 205,281 (365 days)
install-on-request: 6,116 (30 days), 17,393 (90 days), 49,467 (365 days)
build-error: 0 (30 days)

  So we could say roughly 3% of brew OpenMPI users use PETSc?  Pretty low. But 
I'm not surprised; my impression is that the huge bulk of MPI users don't 
really use MPI-based libraries.

  For hypre it is 381/49467 < 1%

  For scalapack 1,455/49467 remarkably close to PETSc's number.

 I could not find superlu_dist, trilinos or MUMPs.

 We should push harder on making PETSc available through packaging systems; we 
can discuss this once we have our Community engagement group going.

  Barry




On Sep 10, 2020, at 4:10 PM, Jacob Faibussowitsch 
mailto:jacob@gmail.com>> wrote:

I don’t know if gitlab tracks repository clones, but the brew package manager 
on macOS keeps track of how many people install a package. But I don’t know 
that this is even remotely representative of the user-base even for macOS…

$ brew info petsc
…
install: 142 (30 days), 436 (90 days), 1,554 (365 days)
install-on-request: 140 (30 days), 412 (90 days), 1,450 (365 days)

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)
Cell: (312) 694-3391

On Sep 10, 2020, at 16:29, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Someone asks about the number of PETSc users. Do we have relevant info?
Hong






[petsc-dev] Statistics on the popularity of PETSc

2020-09-10 Thread Zhang, Hong via petsc-dev
Someone asks about the number of PETSc users. Do we have relevant info?
Hong


Re: [petsc-dev] TAOPDIPM

2020-08-21 Thread Zhang, Hong via petsc-dev
Pierre,
We have fixed this bug in petsc-release (the maint branch). Thanks for your report.
Hong


From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Wednesday, August 5, 2020 2:10 AM
To: Abhyankar, Shrirang G 
Cc: PETSc 
Subject: Re: [petsc-dev] TAOPDIPM

Sorry for these questions, I’m trying to help the said user transitioning from 
IPOPT over to Tao, and since I’m no expert I run into a bunch of errors myself.
For example, if I run PDIPM without equality constraint, I now get:
[0]PETSC ERROR: VecSet() line 522 in src/vec/vec/interface/rvector.c Null 
Object: Parameter # 1
Indeed, I can see that lambdae is conditionally created line 815, but always 
used in VecSet() line 522.
I think the method is general enough to handle cases without such constraints, 
and should not thus fail here, right?
You should be able to reproduce this by commenting all 
DE/De/EqualityConstraints/JacobianEquality code from ex1.c.

Thanks,
Pierre

On 5 Aug 2020, at 5:27 AM, Abhyankar, Shrirang G 
mailto:shrirang.abhyan...@pnnl.gov>> wrote:

Pierre,
  Thanks for catching this issue. As Alp pointed out, the issue was because the 
tolerance was not set correctly.

@Alp: Let me know if I can help you with the patch.

Thanks,
Shri


From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Date: Tuesday, August 4, 2020 at 1:46 PM
To: "Dener, Alp" mailto:ade...@anl.gov>>
Cc: PETSc Development mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] TAOPDIPM

Thanks to you both, I’ll forward this to my user.
Pierre

On 4 Aug 2020, at 7:30 PM, Dener, Alp mailto:ade...@anl.gov>> 
wrote:

Hi Pierre,

This is indeed an issue with TAO default tolerances.

Specifically it has to do with constraint tolerances. The default value is an 
exact zero.

The problem should still work with line 89 commented out. Our default gradient 
tolerance is 1e-8. In your case, commenting out line 90 is causing the solver 
to try to converge constraints to exact zero and it cannot.

I will get a patch out for this within the week but in the meantime please 
ensure that the constraint tolerances are set for any constrained problem.
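
For example (a sketch; the actual tolerance values are up to the application):

  ierr = TaoSetConstraintTolerances(tao,1.e-8,1.e-8);CHKERRQ(ierr);  /* catol, crtol */
  /* or on the command line: -tao_catol 1e-8 -tao_crtol 1e-8 */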

Thank you!
—
Alp

On Aug 4, 2020, at 12:24 PM, Munson, Todd via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


Hi Pierre,

I would say the answer to that question is "no, it's not expected".

We will look into fixing it.  It seems like the default tolerances are being 
set to zero, and the result is an inability to satisfy the constraints or the 
gradient of the Lagrangian to that small a tolerance.

Thanks for pointing this out to us!  We will get it resolved.

Thanks, Todd.

On 8/4/20, 12:12 PM, "petsc-dev on behalf of Pierre Jolivet" 
mailto:petsc-dev-boun...@mcs.anl.gov> on behalf 
of pierre.joli...@enseeiht.fr> wrote:

   Hello,
   If I comment line 89 and 90 of src/tao/constrained/tutorials/ex1.c from 
master, the example deadlocks with a single process.
   Is this expected?

   Thanks,
   Pierre

   $  ./ex1 -tao_monitor
    Constrained Problem -
   Solution should be f(1,1)=-2
 0 TAO,  Function value: 8.,  Residual: 8.48528
 1 TAO,  Function value: 1.28532,  Residual: 10.7411
 2 TAO,  Function value: -2.56703,  Residual: 1.87847
 3 TAO,  Function value: -2.03161,  Residual: 0.12881
 4 TAO,  Function value: -1.99961,  Residual: 0.0450855
 5 TAO,  Function value: -1.3,  Residual: 0.000916939
 6 TAO,  Function value: -1.9,  Residual: 6.69184e-05
 7 TAO,  Function value: -2.,  Residual: 7.15427e-06
 8 TAO,  Function value: -2.,  Residual: 7.15779e-07
 9 TAO,  Function value: -2.,  Residual: 7.15777e-08
10 TAO,  Function value: -2.,  Residual: 7.15777e-09
11 TAO,  Function value: -2.,  Residual: 7.15779e-10
12 TAO,  Function value: -2.,  Residual: 7.15775e-11
13 TAO,  Function value: -2.,  Residual: 7.1599e-12
14 TAO,  Function value: -2.,  Residual: 7.1599e-13
15 TAO,  Function value: -2.,  Residual: 7.22085e-14
16 TAO,  Function value: -2.,  Residual: 6.44626e-15
17 TAO,  Function value: -2.,  Residual: 1.00751e-15
18 TAO,  Function value: -2.,  Residual: 1.70295e-17
19 TAO,  Function value: -2.,  Residual: 1.70295e-18
20 TAO,  Function value: -2.,  Residual: 1.70295e-19
21 TAO,  Function value: -2.,  Residual: 1.70295e-20
22 TAO,  Function value: -2.,  Residual: 1.70295e-21
23 TAO,  Function value: -2.,  Residual: 1.70295e-22
24 TAO,  Function value: -2.,  Residual: 1.70295e-23
25 TAO,  Function value: -2.,  Residual: 1.70295e-24
26 TAO,  Function value: -2.,  Residual: 1.70295e-25
27 TAO,  Function value: -2.,  Residual: 1.70295e-26
28 TAO,  Function value: -2.,  Residual: 1.70295e-27
29 TAO,  Function value: -2.,  Residual: 1.70295e-28
30 TAO,  Function value: -2.,  Residual: 1.70295e-29
31 TAO,  Function value: -2.,  Residual: 1.70295e-30
32 TAO,  Function va

Re: [petsc-dev] MATOP_MAT_MULT

2020-05-06 Thread Zhang, Hong via petsc-dev
Stefano,
How about you work on this issue?
Hong


From: Stefano Zampini 
Sent: Wednesday, May 6, 2020 2:09 AM
To: Zhang, Hong 
Cc: Pierre Jolivet ; Jose E. Roman 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

Hong

If the product is not supported, the type of C will never be set anyway, so you 
cannot call MatHasOperation after MatProductSetFromOptions.
The purpose of MatProductSetFromOptions is to populate the function pointers 
for symbolic and numeric phases. If not found, they should be set to null 
instead of erroring as it is now.
What I propose is to have MatProductHasOperation (not MatHasOperation): this 
function will be identical to MatHasOperation, with the only difference that it 
does not call PetscValidType on the input mat.

Meanwhile, I’m coding a basic MatMat (and MatTransposeMat) driver that loops over 
the dense columns and applies MatMult (or MatMultTranspose) without memory movement.
This will be valid for all B matrices of type dense (and its 
derivations), with C of type dense too. This in principle will fix Jose's and 
Pierre’s issues (they can correct me if I’m wrong).
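
In user code that fallback can be sketched as follows (assumes B and C are dense and 
preallocated, and uses MatDenseGetColumnVecRead/Write from recent PETSc):

  PetscInt j,N;
  ierr = MatGetSize(B,NULL,&N);CHKERRQ(ierr);
  for (j=0; j<N; j++) {
    Vec b,c;
    ierr = MatDenseGetColumnVecRead(B,j,&b);CHKERRQ(ierr);
    ierr = MatDenseGetColumnVecWrite(C,j,&c);CHKERRQ(ierr);
    ierr = MatMult(A,b,c);CHKERRQ(ierr);   /* C(:,j) = A*B(:,j) */
    ierr = MatDenseRestoreColumnVecRead(B,j,&b);CHKERRQ(ierr);
    ierr = MatDenseRestoreColumnVecWrite(C,j,&c);CHKERRQ(ierr);
  }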

However, we should definitely have a way for the user to enquire if a given 
operation is supported or not.

Thanks
Stefano

On May 6, 2020, at 12:03 AM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Stefano:
Now we need to address this bug report: enable 
MatHasOperation(C,MATOP_MAT_MULT,&flg) for matrix products, e.g., C=A*B, which 
is related to your issue https://gitlab.com/petsc/petsc/-/issues/608.

In petsc-3.13:
1) MATOP_MAT_MULT, ..., MATOP_MATMAT_MULT are removed from the MATOP table 
(they are still listed in petscmat.h -- an oversight, I'll remove them).
MATOP_MAT_MULT_SYMBOLIC/NUMERIC ... are still in the table.
2) MatHasOperation(C,...) must be called for the matrix product C, not matrix A 
or B (slepc needs to fix this after this reported bug is fixed).

Like MatSetOption(), MatHasOperation() must be called AFTER MatSetType(). You 
moved MatSetType() from MatProductSetFromOptions() back to MatProductSymbolic() 
in your latest patch, thus the user has to call MatHasOperation() after 
MatProductSymbolic():

MatProductCreate(A,B,NULL,&C);
MatProductSetType(C,...);
...
MatProductSetFromOptions();   //if the product is not supported for the given 
mat types, currently petsc crashes here, which we can replace with an error 
output

MatProductSymbolic(); -> calls MatSetType()
MatHasOperation(C,MATOP_MAT_MULT,&flg)

Question: how to call MatHasOperation(C,..) when MatProductSymbolic() is not 
supported?

My fix to this bug:
Resume MatSetType() in MatProductSetFromOptions(). Then user calls:

MatProductCreate(A,B,NULL,&C);
MatProductSetType(C,...);
...
MatProductSetFromOptions(C);  //if the product is not supported for the given 
mat types, C->ops->productsymbolic=NULL;
MatHasOperation(C,MATOP_PRODUCTSYMBOLIC,&flg);
if (flg) {
   MatProductSymbolic(C);
   ...
} else {
   MatDestroy(&C);
   ...
}

Either you take care of this bug report, or let me know your thoughts about how 
to fix this bug.
Hong
____
From: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Sent: Saturday, April 25, 2020 2:40 PM
To: Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Cc: Jose E. Roman mailto:jro...@dsic.upv.es>>; Stefano 
Zampini mailto:stefano.zamp...@gmail.com>>; 
petsc-dev mailto:petsc-dev@mcs.anl.gov>>; Smith, Barry 
F. mailto:bsm...@mcs.anl.gov>>
Subject: Re: [petsc-dev] MATOP_MAT_MULT

Pierre,
When we do
MatProductCreate: C = A*B; //C owns A and B, thus B->refct =2
MatProductCreateWithMats: B = A*C; //If I let B own A and C, then C->refct=2
Then
MatDestroy(&B) and MatDestroy(&C) only reduce their refct from 2 to 1, thus 
memory leak.
My solution is adding
{
   matreference;  /* do not add refct when using 
MatProductCreateWithMat() to avoid recursive references */
} Mat_Product
This flag prevents MatProductCreateWithMat() from increasing reference counts, 
i.e., B does not own A and C, to avoid reverse ownership. I am not sure this is 
a reasonable solution. Let me know if you have a better solution.
See ex109.c and ex195.c for tests.
Hong

From: Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Sent: Saturday, April 25, 2020 11:45 AM
To: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Cc: Jose E. Roman mailto:jro...@dsic.upv.es>>; Stefano 
Zampini mailto:stefano.zamp...@gmail.com>>; 
petsc-dev mailto:petsc-dev@mcs.anl.gov>>; Smith, Barry 
F. mailto:bsm...@mcs.anl.gov>>
Subject: Re: [petsc-dev] MATOP_MAT_MULT

Hong,
José didn’t report this, though he may have run into the same issue, I did.
I’ll try the branch and get back at you on GitLab MR.

Thanks,
Pierre

On 25 Apr 2020, at 6:17 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Jose,

>> I also now just tested some previousl

Re: [petsc-dev] MATOP_MAT_MULT

2020-05-05 Thread Zhang, Hong via petsc-dev
Stefano:
Now we need to address this bug report: enable 
MatHasOperation(C,MATOP_MAT_MULT,&flg) for matrix products, e.g., C=A*B, which 
is related to your issue https://gitlab.com/petsc/petsc/-/issues/608.

In petsc-3.13:
1) MATOP_MAT_MULT, ..., MATOP_MATMAT_MULT are removed from the MATOP table 
(they are still listed in petscmat.h -- an overlook, I'll remove them).
MATOP_MAT_MULT_SYMBOLIC/NUMERIC ... are still in the table.
2) MatHasOperation(C,...) must be called for the matrix product C, not matrix A 
or B (slepc needs to fix this after this reported bug is fixed).

Like MatSetOption(), MatHasOperation() must be called AFTER MatSetType(). You 
moved MatSetType() from MatProductSetFromOptions() back to MatProductSymbolic() 
in your latest patch, thus the user has to call MatHasOperation() after 
MatProductSymbolic():

MatProductCreate(A,B,NULL,&C);
MatProductSetType(C,...);
...
MatProductSetFromOptions();   //if the product is not supported for the given 
mat types, currently petsc crashes here, which we can replace with an error 
output

MatProductSymbolic(); -> calls MatSetType()
MatHasOperation(C,MATOP_MAT_MULT,&flg)

Question: how to call MatHasOperation(C,..) when MatProductSymbolic() is not 
supported?

My fix to this bug:
Resume MatSetType() in MatProductSetFromOptions(). Then user calls:

MatProductCreate(A,B,NULL,&C);
MatProductSetType(C,...);
...
MatProductSetFromOptions(C);  //if the product is not supported for the given 
mat types, C->ops->productsymbolic=NULL;
MatHasOperation(C,MATOP_PRODUCTSYMBOLIC,&flg);
if (flg) {
   MatProductSymbolic(C);
   ...
} else {
   MatDestroy(&C);
   ...
}

Either you take care of this bug report, or let me know your thoughts about how 
to fix this bug.
Hong
____
From: Zhang, Hong 
Sent: Saturday, April 25, 2020 2:40 PM
To: Pierre Jolivet 
Cc: Jose E. Roman ; Stefano Zampini 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

Pierre,
When we do
MatProductCreate: C = A*B; //C owns A and B, thus B->refct =2
MatProductCreateWithMats: B = A*C; //If I let B own A and C, then C->refct=2
Then
MatDestroy(&B) and MatDestroy(&C) only reduce their refct from 2 to 1, thus 
memory leak.
My solution is adding
{
   matreference;  /* do not add refct when using 
MatProductCreateWithMat() to avoid recursive references */
} Mat_Product
This flag prevents MatProductCreateWithMat() from increasing reference counts, 
i.e., B does not own A and C, to avoid reverse ownership. I am not sure this is 
a reasonable solution. Let me know if you have a better solution.
See ex109.c and ex195.c for tests.
Hong

From: Pierre Jolivet 
Sent: Saturday, April 25, 2020 11:45 AM
To: Zhang, Hong 
Cc: Jose E. Roman ; Stefano Zampini 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

Hong,
José didn’t report this, though he may have run into the same issue, I did.
I’ll try the branch and get back at you on GitLab MR.

Thanks,
Pierre

On 25 Apr 2020, at 6:17 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Jose,

>> I also now just tested some previously PETSC_VERSION_LT(3,13,0) running code 
>> with C=A*B, Dense=Nest*Dense, all previously allocated prior to a call to 
>> MatMatMult and scall = MAT_REUSE_MATRIX.
>> Sadly, it’s now broken. It is my fault for not having a test for this in 
>> https://gitlab.com/petsc/petsc/-/merge_requests/2069, sorry about that.
>> [0]PETSC ERROR: Call MatProductSymbolic() first
>> [0]PETSC ERROR: #1 MatProductNumeric() line 730 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matproduct.c
>> [0]PETSC ERROR: #2 MatMatMult() line 9335 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matrix.c
>>
>> Here is a reproducer (that will work OK with 3.12.4).
>> diff --git a/src/mat/tests/ex195.c b/src/mat/tests/ex195.c
>> index c72662bc3c..811de669c5 100644
>> --- a/src/mat/tests/ex195.c
>> +++ b/src/mat/tests/ex195.c
>> @@ -73,2 +73,3 @@ int main(int argc,char **args)
>>ierr = MatMatMult(nest,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
>> +  ierr = MatMatMult(nest,C,MAT_REUSE_MATRIX,PETSC_DEFAULT,&B);CHKERRQ(ierr);
>>ierr = MatMatMultEqual(nest,B,C,10,&equal);CHKERRQ(ierr);
>>
>> $ make -f gmakefile test searchin=mat_tests-ex195
>>
>> I believe this is very close to the topic at hand and issue #608, so maybe 
>> you could fix this as well in the same upcoming MR? Just let me know, I can 
>> have a crack it otherwise.

This is a bug. I fixed it in the branch hzhang/fix-matproduct-reuse/maint. Can 
you test it?
Hong



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-25 Thread Zhang, Hong via petsc-dev
Pierre,
When we do
MatProductCreate: C = A*B; //C owns A and B, thus B->refct =2
MatProductCreateWithMats: B = A*C; //If I let B own A and C, then C->refct=2
Then
MatDestroy(&B) and MatDestroy(&C) only reduce their refct from 2 to 1, thus 
memory leak.
My solution is adding
{
   matreference;  /* do not add refct when using 
MatProductCreateWithMat() to avoid recursive references */
} Mat_Product
This flag prevents MatProductCreateWithMat() from increasing reference counts, 
i.e., B does not own A and C, to avoid reverse ownership. I am not sure this is 
a reasonable solution. Let me know if you have a better solution.
See ex109.c and ex195.c for tests.
Hong

From: Pierre Jolivet 
Sent: Saturday, April 25, 2020 11:45 AM
To: Zhang, Hong 
Cc: Jose E. Roman ; Stefano Zampini 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

Hong,
José didn’t report this, though he may have run into the same issue, I did.
I’ll try the branch and get back at you on GitLab MR.

Thanks,
Pierre

On 25 Apr 2020, at 6:17 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Jose,

>> I also now just tested some previously PETSC_VERSION_LT(3,13,0) running code 
>> with C=A*B, Dense=Nest*Dense, all previously allocated prior to a call to 
>> MatMatMult and scall = MAT_REUSE_MATRIX.
>> Sadly, it’s now broken. It is my fault for not having a test for this in 
>> https://gitlab.com/petsc/petsc/-/merge_requests/2069, sorry about that.
>> [0]PETSC ERROR: Call MatProductSymbolic() first
>> [0]PETSC ERROR: #1 MatProductNumeric() line 730 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matproduct.c
>> [0]PETSC ERROR: #2 MatMatMult() line 9335 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matrix.c
>>
>> Here is a reproducer (that will work OK with 3.12.4).
>> diff --git a/src/mat/tests/ex195.c b/src/mat/tests/ex195.c
>> index c72662bc3c..811de669c5 100644
>> --- a/src/mat/tests/ex195.c
>> +++ b/src/mat/tests/ex195.c
>> @@ -73,2 +73,3 @@ int main(int argc,char **args)
>>ierr = MatMatMult(nest,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
>> +  ierr = MatMatMult(nest,C,MAT_REUSE_MATRIX,PETSC_DEFAULT,&B);CHKERRQ(ierr);
>>ierr = MatMatMultEqual(nest,B,C,10,&equal);CHKERRQ(ierr);
>>
>> $ make -f gmakefile test searchin=mat_tests-ex195
>>
>> I believe this is very close to the topic at hand and issue #608, so maybe 
>> you could fix this as well in the same upcoming MR? Just let me know, I can 
>> have a crack it otherwise.

This is a bug. I fixed it in the branch hzhang/fix-matproduct-reuse/maint. Can 
you test it?
Hong



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-25 Thread Zhang, Hong via petsc-dev
Jose,

>> I also now just tested some previously PETSC_VERSION_LT(3,13,0) running code 
>> with C=A*B, Dense=Nest*Dense, all previously allocated prior to a call to 
>> MatMatMult and scall = MAT_REUSE_MATRIX.
>> Sadly, it’s now broken. It is my fault for not having a test for this in 
>> https://gitlab.com/petsc/petsc/-/merge_requests/2069, sorry about that.
>> [0]PETSC ERROR: Call MatProductSymbolic() first
>> [0]PETSC ERROR: #1 MatProductNumeric() line 730 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matproduct.c
>> [0]PETSC ERROR: #2 MatMatMult() line 9335 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matrix.c
>>
>> Here is a reproducer (that will work OK with 3.12.4).
>> diff --git a/src/mat/tests/ex195.c b/src/mat/tests/ex195.c
>> index c72662bc3c..811de669c5 100644
>> --- a/src/mat/tests/ex195.c
>> +++ b/src/mat/tests/ex195.c
>> @@ -73,2 +73,3 @@ int main(int argc,char **args)
>>ierr = MatMatMult(nest,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
>> +  ierr = MatMatMult(nest,C,MAT_REUSE_MATRIX,PETSC_DEFAULT,&B);CHKERRQ(ierr);
>>ierr = MatMatMultEqual(nest,B,C,10,&equal);CHKERRQ(ierr);
>>
>> $ make -f gmakefile test searchin=mat_tests-ex195
>>
>> I believe this is very close to the topic at hand and issue #608, so maybe 
>> you could fix this as well in the same upcoming MR? Just let me know, I can 
>> have a crack it otherwise.

This is a bug. I fixed it in the branch hzhang/fix-matproduct-reuse/maint. Can 
you test it?
Hong



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-23 Thread Zhang, Hong via petsc-dev
I'll try to do it in maint. Hong

From: Jose E. Roman 
Sent: Thursday, April 23, 2020 2:36 AM
To: Pierre Jolivet 
Cc: Zhang, Hong ; Stefano Zampini 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

I agree with Pierre. However, if the fix involves an API change then I could 
understand it goes to master.


> On 23 Apr 2020, at 7:43, Pierre Jolivet  
> wrote:
>
> I don’t know if you really meant to ask for José's opinion here, but I 
> personally think that releasing all 3.13.X version with a broken MatMatMult 
> and no deprecation warning concerning MATOP_MAT_MULT is not the best.
> Thanks,
> Pierre
>
>> On 23 Apr 2020, at 4:03 AM, Zhang, Hong  wrote:
>>
>> Jose,
>> I'll check and fix them. I have to do it in master, is ok?
>> Hong
>>
>> From: Pierre Jolivet 
>> Sent: Wednesday, April 22, 2020 3:08 PM
>> To: Zhang, Hong 
>> Cc: Jose E. Roman ; Stefano Zampini 
>> ; petsc-dev ; Smith, Barry 
>> F. 
>> Subject: Re: [petsc-dev] MATOP_MAT_MULT
>>
>> Hong,
>> I also now just tested some previously PETSC_VERSION_LT(3,13,0) running code 
>> with C=A*B, Dense=Nest*Dense, all previously allocated prior to a call to 
>> MatMatMult and scall = MAT_REUSE_MATRIX.
>> Sadly, it’s now broken. It is my fault for not having a test for this in 
>> https://gitlab.com/petsc/petsc/-/merge_requests/2069, sorry about that.
>> [0]PETSC ERROR: Call MatProductSymbolic() first
>> [0]PETSC ERROR: #1 MatProductNumeric() line 730 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matproduct.c
>> [0]PETSC ERROR: #2 MatMatMult() line 9335 in 
>> /ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matrix.c
>>
>> Here is a reproducer (that will work OK with 3.12.4).
>> diff --git a/src/mat/tests/ex195.c b/src/mat/tests/ex195.c
>> index c72662bc3c..811de669c5 100644
>> --- a/src/mat/tests/ex195.c
>> +++ b/src/mat/tests/ex195.c
>> @@ -73,2 +73,3 @@ int main(int argc,char **args)
>>ierr = MatMatMult(nest,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
>> +  ierr = MatMatMult(nest,C,MAT_REUSE_MATRIX,PETSC_DEFAULT,&B);CHKERRQ(ierr);
>>ierr = MatMatMultEqual(nest,B,C,10,&equal);CHKERRQ(ierr);
>>
>> $ make -f gmakefile test searchin=mat_tests-ex195
>>
>> I believe this is very close to the topic at hand and issue #608, so maybe 
>> you could fix this as well in the same upcoming MR? Just let me know, I can 
>> have a crack it otherwise.
>> Thanks,
>> Pierre
>>
>>> On 22 Apr 2020, at 5:38 PM, Zhang, Hong  wrote:
>>>
>>> Jose, Pierre and Stefano,
>>> Now I understand the issue that Stefano raised. I plan to add
>>> MatProductIsSupported(Wmat,&supported,&matproductsetfromoptions)
>>> the flag 'supported' tells if the product is supported/implemented or not,
>>> and the function pointer 'matproductsetfromoptions' gives the name of 
>>> MatProductSetFromOptions_xxx, (including basic implementation) or NULL.
>>>
>>> Let me know your suggestions. I'll list all of you as reviewer.
>>> Hong
>>>
>>>
>>> From: Jose E. Roman 
>>> Sent: Wednesday, April 22, 2020 9:07 AM
>>> To: Stefano Zampini 
>>> Cc: Zhang, Hong ; Pierre Jolivet 
>>> ; petsc-dev 
>>> Subject: Re: [petsc-dev] MATOP_MAT_MULT
>>>
>>> I agree with Pierre and Stefano.
>>> Hong: your proposed solution would be fine, but MATOP_MATPRODUCT does not 
>>> exist yet, so I cannot try it.
>>> I would like a solution along the lines of what Stefano suggests. It is not 
>>> too much trouble if it goes to master instead of maint.
>>>
>>> Thanks.
>>> Jose
>>>
>>>
> On 22 Apr 2020, at 15:26, Stefano Zampini  
> wrote:
>>> >
>>> >
>>> >>
>>> >> MatProductCreateWithMat(A,Vmat,NULL,Wmat);
>>> >> MatProductSetType(Wmat,MATPRODUCT_AB);
>>> >> MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg); //new support, it calls 
>>> >> MatProductSetFromOptions(Wmat)
>>> >
>>> > Hong, this would go in the direction I was outlining here 
>>> > https://gitlab.com/petsc/petsc/-/issues/608
>>> > How about also adding something like
>>> >
>>> > MatProductIsImplemented(Wmat,&flg)
>>> >
>>> > That returns true if a specific implementation is available? This way
>>> >
>>> > This way, 

Re: [petsc-dev] MATOP_MAT_MULT

2020-04-22 Thread Zhang, Hong via petsc-dev
Jose,
I'll check and fix them. I have to do it in master, is ok?
Hong


From: Pierre Jolivet 
Sent: Wednesday, April 22, 2020 3:08 PM
To: Zhang, Hong 
Cc: Jose E. Roman ; Stefano Zampini 
; petsc-dev ; Smith, Barry F. 

Subject: Re: [petsc-dev] MATOP_MAT_MULT

Hong,
I also now just tested some previously PETSC_VERSION_LT(3,13,0) running code 
with C=A*B, Dense=Nest*Dense, all previously allocated prior to a call to 
MatMatMult and scall = MAT_REUSE_MATRIX.
Sadly, it’s now broken. It is my fault for not having a test for this in 
https://gitlab.com/petsc/petsc/-/merge_requests/2069, sorry about that.
[0]PETSC ERROR: Call MatProductSymbolic() first
[0]PETSC ERROR: #1 MatProductNumeric() line 730 in 
/ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matproduct.c
[0]PETSC ERROR: #2 MatMatMult() line 9335 in 
/ccc/work/cont003/rndm/rndm/petsc/src/mat/interface/matrix.c

Here is a reproducer (that will work OK with 3.12.4).
diff --git a/src/mat/tests/ex195.c b/src/mat/tests/ex195.c
index c72662bc3c..811de669c5 100644
--- a/src/mat/tests/ex195.c
+++ b/src/mat/tests/ex195.c
@@ -73,2 +73,3 @@ int main(int argc,char **args)
   ierr = MatMatMult(nest,B,MAT_REUSE_MATRIX,PETSC_DEFAULT,&C);CHKERRQ(ierr);
+  ierr = MatMatMult(nest,C,MAT_REUSE_MATRIX,PETSC_DEFAULT,&B);CHKERRQ(ierr);
   ierr = MatMatMultEqual(nest,B,C,10,&equal);CHKERRQ(ierr);

$ make -f gmakefile test searchin=mat_tests-ex195

I believe this is very close to the topic at hand and issue #608, so maybe you 
could fix this as well in the same upcoming MR? Just let me know, I can have a 
crack it otherwise.
Thanks,
Pierre

On 22 Apr 2020, at 5:38 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Jose, Pierre and Stefano,
Now I understand the issue that Stefano raised. I plan to add
MatProductIsSupported(Wmat,&supported,&matproductsetfromoptions)
the flag 'supported' tells if the product is supported/implemented or not,
and the function pointer 'matproductsetfromoptions' gives the name of 
MatProductSetFromOptions_xxx, (including basic implementation) or NULL.

Let me know your suggestions. I'll list all of you as reviewer.
Hong


From: Jose E. Roman mailto:jro...@dsic.upv.es>>
Sent: Wednesday, April 22, 2020 9:07 AM
To: Stefano Zampini 
mailto:stefano.zamp...@gmail.com>>
Cc: Zhang, Hong mailto:hzh...@mcs.anl.gov>>; Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>; petsc-dev 
mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MATOP_MAT_MULT

I agree with Pierre and Stefano.
Hong: your proposed solution would be fine, but MATOP_MATPRODUCT does not exist 
yet, so I cannot try it.
I would like a solution along the lines of what Stefano suggests. It is not too 
much trouble if it goes to master instead of maint.

Thanks.
Jose


> On 22 Apr 2020, at 15:26, Stefano Zampini 
> mailto:stefano.zamp...@gmail.com>> wrote:
>
>
>>
>> MatProductCreateWithMat(A,Vmat,NULL,Wmat);
>> MatProductSetType(Wmat,MATPRODUCT_AB);
>> MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg); //new support, it calls 
>> MatProductSetFromOptions(Wmat)
>
> Hong, this would go in the direction I was outlining here 
> https://gitlab.com/petsc/petsc/-/issues/608
> How about also adding something like
>
> MatProductIsImplemented(Wmat,&flg)
>
> That returns true if a specific implementation is available? This way
>
> This way, if we use both queries, we can assess the presence of the basic 
> fallbacks too, i.e.
>
> MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg1)
> MatProductIsImplemented(Wmat,&flg2)
>
> If flg1 is false, no support at all
> If flg1 is true and flg2 is false -> Basic implementation (i.e, MatShell with 
> products inside)
> If flg1 and flg2 are both true -> Specific implementation available.
>
>> if (V->vmm && flg) {
>>   MatProductSymbolic(Wmat);
>>   MatProductNumeric(Wmat);
>> } else {
>>   MatDestroy(Wmat);
>>   ...
>> }
>> Hong
>>
>>
>> From: Jose E. Roman mailto:jro...@dsic.upv.es>>
>> Sent: Tuesday, April 21, 2020 11:21 AM
>> To: Pierre Jolivet 
>> mailto:pierre.joli...@enseeiht.fr>>
>> Cc: Zhang, Hong mailto:hzh...@mcs.anl.gov>>; petsc-dev 
>> mailto:petsc-dev@mcs.anl.gov>>
>> Subject: Re: [petsc-dev] MATOP_MAT_MULT
>>
>>
>>
>> > El 21 abr 2020, a las 17:53, Pierre Jolivet 
>> > mailto:pierre.joli...@enseeiht.fr>> escribió:
>> >
>> >
>> >
>> >> On 21 Apr 2020, at 5:22 PM, Zhang, Hong 
>> >> mailto:hzh...@mcs.anl.gov>> wrote:
>> >>
>> >> Pierre,
>> >> MatMatMult_xxx() is removed from MatOps table.
>> >
>> > S

Re: [petsc-dev] MATOP_MAT_MULT

2020-04-22 Thread Zhang, Hong via petsc-dev
Jose, Pierre and Stefano,
Now I understand the issue that Stefano raised. I plan to add
MatProductIsSupported(Wmat,&supported,&matproductsetfromoptions)
the flag 'supported' tells if the product is supported/implemented or not,
and the function pointer 'matproductsetfromoptions' gives the name of 
MatProductSetFromOptions_xxx, (including basic implementation) or NULL.
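
Callers would then use it roughly like this (purely a sketch of the proposed, 
not-yet-existing routine):

  PetscBool supported;
  PetscErrorCode (*fptr)(Mat);
  ierr = MatProductIsSupported(Wmat,&supported,&fptr);CHKERRQ(ierr);
  if (supported) {
    ierr = MatProductSymbolic(Wmat);CHKERRQ(ierr);
    ierr = MatProductNumeric(Wmat);CHKERRQ(ierr);
  } else {
    /* fall back, e.g. apply MatMult column by column */
  }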

Let me know your suggestions. I'll list all of you as reviewer.
Hong


From: Jose E. Roman 
Sent: Wednesday, April 22, 2020 9:07 AM
To: Stefano Zampini 
Cc: Zhang, Hong ; Pierre Jolivet 
; petsc-dev 
Subject: Re: [petsc-dev] MATOP_MAT_MULT

I agree with Pierre and Stefano.
Hong: your proposed solution would be fine, but MATOP_MATPRODUCT does not exist 
yet, so I cannot try it.
I would like a solution along the lines of what Stefano suggests. It is not too 
much trouble if it goes to master instead of maint.

Thanks.
Jose


> On 22 Apr 2020, at 15:26, Stefano Zampini  
> wrote:
>
>
>>
>> MatProductCreateWithMat(A,Vmat,NULL,Wmat);
>> MatProductSetType(Wmat,MATPRODUCT_AB);
>> MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg); //new support, it calls 
>> MatProductSetFromOptions(Wmat)
>
> Hong, this would go in the direction I was outlining here 
> https://gitlab.com/petsc/petsc/-/issues/608
> How about also adding something like
>
> MatProductIsImplemented(Wmat,&flg)
>
> That returns true if a specific implementation is available? This way
>
> This way, if we use both queries, we can assess the presence of the basic 
> fallbacks too, i.e.
>
> MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg1)
> MatProductIsImplemented(Wmat,&flg2)
>
> If flg1 is false, no support at all
> If flg1 is true and flg2 is false -> Basic implementation (i.e, MatShell with 
> products inside)
> If flg1 and flg2 are both true -> Specific implementation available.
>
>> if (V->vmm && flg) {
>>   MatProductSymbolic(Wmat);
>>   MatProductNumeric(Wmat);
>> } else {
>>   MatDestroy(Wmat);
>>   ...
>> }
>> Hong
>>
>>
>> From: Jose E. Roman 
>> Sent: Tuesday, April 21, 2020 11:21 AM
>> To: Pierre Jolivet 
>> Cc: Zhang, Hong ; petsc-dev 
>> Subject: Re: [petsc-dev] MATOP_MAT_MULT
>>
>>
>>
> On 21 Apr 2020, at 17:53, Pierre Jolivet  
> wrote:
>> >
>> >
>> >
>> >> On 21 Apr 2020, at 5:22 PM, Zhang, Hong  wrote:
>> >>
>> >> Pierre,
>> >> MatMatMult_xxx() is removed from MatOps table.
>> >
>> > Shouldn’t there be a deprecation notice somewhere?
>> > There is nothing about MATOP_MAT_MULT in the 3.13 changelog 
>> > https://www.mcs.anl.gov/petsc/documentation/changes/313.html
>> > For example, I see that in SLEPc, José is currently making these checks, 
>> > which are in practice useless as they always return 
>> > PETSC_FALSE?https://gitlab.com/slepc/slepc/-/blob/master/src/sys/classes/bv/impls/contiguous/contig.c#L191
>> > (Maybe José is aware of this and this is just for testing)
>>
>> No, I was not aware of this. Thanks for bringing this up. Now in 3.13 we are 
>> always doing the slow version (column by column), so yes I am interested in 
>> a solution for this.
>>
>> >
>> >> MatMatMult() is replaced by
>> >> MatProductCreate()
>> >> MatProductSetType(,MATPRODUCT_AB)
>> >> MatProductSetFromOptions()
>> >> MatProductSymbolic()
>> >> MatProductNumeric()
>> >>
>> >> Where/when do you need query a single matrix for its product operation?
>> >
>> > I didn’t want to bother at first with the new API, because I’m only 
>> > interested in C = A*B with C and B being dense.
>> > Of course, I can update my code, but if I understand Stefano’s issue 
>> > correctly, and let’s say my A is of type SBAIJ, for which there is no 
>> > MatMatMult, the code will now error out in the MatProduct?
>> > There is no fallback mechanism? Meaning I could in fact _not_ use the new 
>> > API and will just have to loop on all columns of B, even for AIJ matrices.
>> >
>> > Thanks,
>> > Pierre
>> >
>> >> Hong
>> >>
>> >> From: petsc-dev  on behalf of Pierre 
>> >> Jolivet 
>> >> Sent: Tuesday, April 21, 2020 7:50 AM
>> >> To: petsc-dev 
>> >> Subject: [petsc-dev] MATOP_MAT_MULT
>> >>
>> >> Hello,
>> >> Am I seeing this correctly?
>> >> #include 
>> >>
>

Re: [petsc-dev] MATOP_MAT_MULT

2020-04-22 Thread Zhang, Hong via petsc-dev
Pierre,


Well, that’s just not an option. I don’t want the code to error, I want a 
fallback mechanism so that I can do the MatMatMult myself, column by column (or 
implement this as part of issue #608 in the case of dense B and C so that neither 
José nor I have to bother about this again).
That makes MatMatMult very hard to use now, because you can’t tell a priori if 
your code will run or not if the types of your A and B are variable (because of 
an unimplemented MatMatMult).

I'll add this support.

What you propose there looks good to me.
(Though I’d like to be able to skip MatProductSymbolic() in the case of a dense 
Wmat allocated in user code, but this is very minor)

Since your Wmat is already allocated, you can skip MatProductSymbolic() by calling
MatProductCreateWithMat( , , ...,Wmat);
..
MatProductSetFromOptions()
MatProductNumeric() // skip MatProductSymbolic()

The new API was introduced to unify and clean up the previous ad hoc design of the 
mat-mat operations, which had become inextensible and difficult to manage. The new 
API is more flexible, allowing options for various algorithmic implementations, 
but it will take a while to become bug-free and to ensure the needed features are 
maintained. Report problems to me -- I'll do my best to provide the needed support.

Hong

From: Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Sent: Tuesday, April 21, 2020 10:53 AM
To: Zhang, Hong mailto:hzh...@mcs.anl.gov>>
Cc: petsc-dev mailto:petsc-dev@mcs.anl.gov>>
Subject: Re: [petsc-dev] MATOP_MAT_MULT



On 21 Apr 2020, at 5:22 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Pierre,
MatMatMult_xxx() is removed from MatOps table.

Shouldn’t there be a deprecation notice somewhere?
There is nothing about MATOP_MAT_MULT in the 3.13 changelog 
https://www.mcs.anl.gov/petsc/documentation/changes/313.html
For example, I see that in SLEPc, José is currently making these checks, which 
are in practice useless as they always return PETSC_FALSE? 
https://gitlab.com/slepc/slepc/-/blob/master/src/sys/classes/bv/impls/contiguous/contig.c#L191
(Maybe José is aware of this and this is just for testing)

MatMatMult() is replaced by
MatProductCreate()
MatProductSetType(,MATPRODUCT_AB)
MatProductSetFromOptions()
MatProductSymbolic()
MatProductNumeric()

Where/when do you need query a single matrix for its product operation?

I didn’t want to bother at first with the new API, because I’m only interested 
in C = A*B with C and B being dense.
Of course, I can update my code, but if I understand Stefano’s issue correctly, 
and let’s say my A is of type SBAIJ, for which there is no MatMatMult, the code 
will now error out in the MatProduct?
There is no fallback mechanism? Meaning I could in fact _not_ use the new API 
and will just have to loop on all columns of B, even for AIJ matrices.

Thanks,
Pierre

Hong



From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Sent: Tuesday, April 21, 2020 7:50 AM
To: petsc-dev mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] MATOP_MAT_MULT

Hello,
Am I seeing this correctly?
#include 

int main(int argc,char **args)
{
  Mat   A;
  PetscBool hasMatMult;
  PetscErrorCodeierr;

  ierr = PetscInitialize(&argc,&args,NULL,NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
  ierr = MatHasOperation(A,MATOP_MAT_MULT,&hasMatMult);CHKERRQ(ierr);
  printf("%s\n", PetscBools[hasMatMult]);
  ierr = PetscFinalize();
  return ierr;
}

=> FALSE

I believe this is a regression (or at least an undocumented change) introduced 
here: https://gitlab.com/petsc/petsc/-/merge_requests/2524/
I also believe Stefano raised a similar point there: 
https://gitlab.com/petsc/petsc/-/issues/608
This is a performance killer in my case because I was previously using this 
check to know whether I could use MatMatMult or had to loop on all columns and 
call MatMult on all of them.
There is also a bunch of (previously functioning but now) broken code, e.g., 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/transpose/transm.c.html#line105
 or 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/nest/matnest.c.html#line2105
Is this being addressed/documented?

Thanks,
Pierre



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-21 Thread Zhang, Hong via petsc-dev
Jose,
We need both A and Vmat to determine if Wmat= A*Vmat is supported or not.

MatHasOperation(A,MATOP_MAT_MULT,&flg); //this call alone is not sufficient to determine 
whether Wmat can be formed.

How about replacing
  if (V->vmm && flg) {
ierr = BVGetMat(V,&Vmat);CHKERRQ(ierr);
ierr = BVGetMat(W,&Wmat);CHKERRQ(ierr);
ierr = MatProductCreateWithMat(A,Vmat,NULL,Wmat);CHKERRQ(ierr);
ierr = MatProductSetType(Wmat,MATPRODUCT_AB);CHKERRQ(ierr);
ierr = MatProductSetFromOptions(Wmat);CHKERRQ(ierr);
ierr = MatProductSymbolic(Wmat);CHKERRQ(ierr);
ierr = MatProductNumeric(Wmat);CHKERRQ(ierr);
ierr = BVRestoreMat(V,&Vmat);CHKERRQ(ierr);
ierr = BVRestoreMat(W,&Wmat);CHKERRQ(ierr);
  } else {
...
  }


with

MatProductCreateWithMat(A,Vmat,NULL,Wmat);
MatProductSetType(Wmat,MATPRODUCT_AB);

MatHasOperation(Wmat,MATOP_MATPRODUCT,&flg); //new support, it calls 
MatProductSetFromOptions(Wmat)

if (V->vmm && flg) {

  MatProductSymbolic(Wmat);

  MatProductNumeric(Wmat);

} else {

  MatDestroy(Wmat);

...

}

Hong



From: Jose E. Roman 
Sent: Tuesday, April 21, 2020 11:21 AM
To: Pierre Jolivet 
Cc: Zhang, Hong ; petsc-dev 
Subject: Re: [petsc-dev] MATOP_MAT_MULT



> On 21 Apr 2020, at 17:53, Pierre Jolivet  
> wrote:
>
>
>
>> On 21 Apr 2020, at 5:22 PM, Zhang, Hong  wrote:
>>
>> Pierre,
>> MatMatMult_xxx() is removed from MatOps table.
>
> Shouldn’t there be a deprecation notice somewhere?
> There is nothing about MATOP_MAT_MULT in the 3.13 changelog 
> https://www.mcs.anl.gov/petsc/documentation/changes/313.html
> For example, I see that in SLEPc, José is currently making these checks, 
> which are in practice useless as they always return PETSC_FALSE? 
> https://gitlab.com/slepc/slepc/-/blob/master/src/sys/classes/bv/impls/contiguous/contig.c#L191
> (Maybe José is aware of this and this is just for testing)

No, I was not aware of this. Thanks for bringing this up. Now in 3.13 we are 
always doing the slow version (column by column), so yes I am interested in a 
solution for this.

>
>> MatMatMult() is replaced by
>> MatProductCreate()
>> MatProductSetType(,MATPRODUCT_AB)
>> MatProductSetFromOptions()
>> MatProductSymbolic()
>> MatProductNumeric()
>>
>> Where/when do you need query a single matrix for its product operation?
>
> I didn’t want to bother at first with the new API, because I’m only 
> interested in C = A*B with C and B being dense.
> Of course, I can update my code, but if I understand Stefano’s issue 
> correctly, and let’s say my A is of type SBAIJ, for which there is no 
> MatMatMult, the code will now error out in the MatProduct?
> There is no fallback mechanism? Meaning I could in fact _not_ use the new API 
> and will just have to loop on all columns of B, even for AIJ matrices.
>
> Thanks,
> Pierre
>
>> Hong
>>
>> From: petsc-dev  on behalf of Pierre Jolivet 
>> 
>> Sent: Tuesday, April 21, 2020 7:50 AM
>> To: petsc-dev 
>> Subject: [petsc-dev] MATOP_MAT_MULT
>>
>> Hello,
>> Am I seeing this correctly?
>> #include 
>>
>> int main(int argc,char **args)
>> {
>>   Mat   A;
>>   PetscBool hasMatMult;
>>   PetscErrorCodeierr;
>>
>>   ierr = PetscInitialize(&argc,&args,NULL,NULL);if (ierr) return ierr;
>>   ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
>>   ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
>>   ierr = MatHasOperation(A,MATOP_MAT_MULT,&hasMatMult);CHKERRQ(ierr);
>>   printf("%s\n", PetscBools[hasMatMult]);
>>   ierr = PetscFinalize();
>>   return ierr;
>> }
>>
>> => FALSE
>>
>> I believe this is a regression (or at least an undocumented change) 
>> introduced here: https://gitlab.com/petsc/petsc/-/merge_requests/2524/
>> I also believe Stefano raised a similar point there: 
>> https://gitlab.com/petsc/petsc/-/issues/608
>> This is a performance killer in my case because I was previously using this 
>> check to know whether I could use MatMatMult or had to loop on all columns 
>> and call MatMult on all of them.
>> There is also a bunch of (previously functioning but now) broken code, e.g., 
>> https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/transpose/transm.c.html#line105
>>  or 
>> https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/nest/matnest.c.html#line2105
>> Is this being addressed/documented?
>>
>> Thanks,
>> Pierre
>



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-21 Thread Zhang, Hong via petsc-dev
Pierre,
The old API, MatMatMult(), MatPtAP() ... are still available as wrappers to the 
new API:
MatProductCreate()
MatProductSetType(,MATPRODUCT_AB/PtAP)
MatProductSetFromOptions()
MatProductSymbolic()
MatProductNumeric()
You do not need to change your code. When you call MatMatMult() with a seqsbaij 
and a dense matrix, a detailed error message will be produced by 
MatProductSetFromOptions() (or by MatMatMult() if you use the wrapper).
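For reference, here is a bare-bones sketch of the new sequence for C = A*B. The 
function name and comments are mine for illustration, error handling is trimmed, 
and A and B are assumed to be assembled already:

#include <petscmat.h>

/* Sketch: form C = A*B with the MatProduct API. */
static PetscErrorCode FormAB(Mat A,Mat B,Mat *C)
{
  PetscErrorCode ierr;

  ierr = MatProductCreate(A,B,NULL,C);CHKERRQ(ierr);        /* creates C and attaches the product A*B */
  ierr = MatProductSetType(*C,MATPRODUCT_AB);CHKERRQ(ierr); /* request C = A*B */
  ierr = MatProductSetFromOptions(*C);CHKERRQ(ierr);        /* pick an algorithm; errors out if unsupported */
  ierr = MatProductSymbolic(*C);CHKERRQ(ierr);              /* structural phase */
  ierr = MatProductNumeric(*C);CHKERRQ(ierr);               /* numerical phase; repeat when A or B values change */
  return 0;
}

MatProductCreateWithMat() is the variant for the case where the result matrix 
already exists (as with BVGetMat() in SLEPc).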

I'll discuss Jose's usage in the next email.
Hong

From: Pierre Jolivet 
Sent: Tuesday, April 21, 2020 10:53 AM
To: Zhang, Hong 
Cc: petsc-dev 
Subject: Re: [petsc-dev] MATOP_MAT_MULT



On 21 Apr 2020, at 5:22 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Pierre,
MatMatMult_xxx() is removed from MatOps table.

Shouldn’t there be a deprecation notice somewhere?
There is nothing about MATOP_MAT_MULT in the 3.13 changelog 
https://www.mcs.anl.gov/petsc/documentation/changes/313.html
For example, I see that in SLEPc, José is currently making these checks, which 
are in practice useless as they always return PETSC_FALSE? 
https://gitlab.com/slepc/slepc/-/blob/master/src/sys/classes/bv/impls/contiguous/contig.c#L191
(Maybe José is aware of this and this is just for testing)

MatMatMult() is replaced by
MatProductCreate()
MatProductSetType(,MATPRODUCT_AB)
MatProductSetFromOptions()
MatProductSymbolic()
MatProductNumeric()

Where/when do you need query a single matrix for its product operation?

I didn’t want to bother at first with the new API, because I’m only interested 
in C = A*B with C and B being dense.
Of course, I can update my code, but if I understand Stefano’s issue correctly, 
and let’s say my A is of type SBAIJ, for which there is no MatMatMult, the code 
will now error out in the MatProduct?
There is no fallback mechanism? Meaning I could in fact _not_ use the new API 
and will just have to loop on all columns of B, even for AIJ matrices.

Thanks,
Pierre

Hong


From: petsc-dev 
mailto:petsc-dev-boun...@mcs.anl.gov>> on behalf 
of Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>>
Sent: Tuesday, April 21, 2020 7:50 AM
To: petsc-dev mailto:petsc-dev@mcs.anl.gov>>
Subject: [petsc-dev] MATOP_MAT_MULT

Hello,
Am I seeing this correctly?
#include 

int main(int argc,char **args)
{
  Mat   A;
  PetscBool hasMatMult;
  PetscErrorCodeierr;

  ierr = PetscInitialize(&argc,&args,NULL,NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
  ierr = MatHasOperation(A,MATOP_MAT_MULT,&hasMatMult);CHKERRQ(ierr);
  printf("%s\n", PetscBools[hasMatMult]);
  ierr = PetscFinalize();
  return ierr;
}

=> FALSE

I believe this is a regression (or at least an undocumented change) introduced 
here: https://gitlab.com/petsc/petsc/-/merge_requests/2524/
I also believe Stefano raised a similar point there: 
https://gitlab.com/petsc/petsc/-/issues/608
This is a performance killer in my case because I was previously using this 
check to know whether I could use MatMatMult or had to loop on all columns and 
call MatMult on all of them.
There is also a bunch of (previously functioning but now) broken code, e.g., 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/transpose/transm.c.html#line105
 or 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/nest/matnest.c.html#line2105
Is this being addressed/documented?

Thanks,
Pierre



Re: [petsc-dev] MATOP_MAT_MULT

2020-04-21 Thread Zhang, Hong via petsc-dev
Pierre,
MatMatMult_xxx() is removed from MatOps table. MatMatMult() is replaced by
MatProductCreate()
MatProductSetType(,MATPRODUCT_AB)
MatProductSetFromOptions()
MatProductSymbolic()
MatProductNumeric()

Where/when do you need query a single matrix for its product operation?
Hong


From: petsc-dev  on behalf of Pierre Jolivet 

Sent: Tuesday, April 21, 2020 7:50 AM
To: petsc-dev 
Subject: [petsc-dev] MATOP_MAT_MULT

Hello,
Am I seeing this correctly?
#include 

int main(int argc,char **args)
{
  Mat   A;
  PetscBool hasMatMult;
  PetscErrorCodeierr;

  ierr = PetscInitialize(&argc,&args,NULL,NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
  ierr = MatHasOperation(A,MATOP_MAT_MULT,&hasMatMult);CHKERRQ(ierr);
  printf("%s\n", PetscBools[hasMatMult]);
  ierr = PetscFinalize();
  return ierr;
}

=> FALSE

I believe this is a regression (or at least an undocumented change) introduced 
here: https://gitlab.com/petsc/petsc/-/merge_requests/2524/
I also believe Stefano raised a similar point there: 
https://gitlab.com/petsc/petsc/-/issues/608
This is a performance killer in my case because I was previously using this 
check to know whether I could use MatMatMult or had to loop on all columns and 
call MatMult on all of them.
There is also a bunch of (previously functioning but now) broken code, e.g., 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/transpose/transm.c.html#line105
 or 
https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/nest/matnest.c.html#line2105
Is this being addressed/documented?

Thanks,
Pierre


Re: [petsc-dev] Question about Binary-IO in READ mode with POSIX APIs

2020-03-16 Thread Zhang, Hong via petsc-dev


On Mar 16, 2020, at 12:12 PM, Lisandro Dalcin 
mailto:dalc...@gmail.com>> wrote:



On Mon, 16 Mar 2020 at 16:35, Jed Brown 
mailto:j...@jedbrown.org>> wrote:
Lisandro Dalcin mailto:dalc...@gmail.com>> writes:

> Currently, binary viewers using POSIX file descriptors with READ mode open
> the file in ALL processes in the communicator. For WRITE mode, only process
> zero opens the file.
>
> The current PetscViewerBynaryXXX APIs make it really unnecessary to open
> the file in all processes for READ. I would like to get rid of that and
> always open on rank 0 for both READ or WROTE.

I think we should use MPI-IO by default, and advise that people use it
whenever they can.

OK, let me look again to the details, I think a few minor things should be done 
before using MPI-IO as a default (like a proper subviewer implementation)

  I'm not sure of this suggested change, in that a
"bad for MPI-IO" workload (like each rank randomly seeking around a big
file) might not be better with rank 0 acting as a service rank.

Please note my main question is unrelated to MPI-IO. It is about the original 
POSIX-based implementation of binary viewers. For mode READ, all processes open 
the file (with the open() system call),

I am a bit confused. Isn’t this required when one uses MPI-IO? But of course, 
when not using MPI-IO, only process zero should open the file.

Hong (Mr.)

but in the current implementation, only process zero ever reads the file 
(unless the user gets the file descriptor and start issuing low-level 
PetscBinaryRead() calls). So I do not see the point of opening the file on all 
processes (and then stress metadata servers on parallel filesystem), if we are 
not going to ever read from rank != 0. Let's just fix things to open the file 
at rank==0 only. If users ever need to read in rank != 0, then can very well 
create the viewer on COMM_SELF, or whatever.
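
As a concrete illustration of that last pattern (the function name and file name 
below are made up), it would look roughly like this:

#include <petscviewer.h>
#include <petscvec.h>

/* A rank that really needs its own read opens a viewer on PETSC_COMM_SELF. */
static PetscErrorCode ReadMyOwnCopy(Vec *x)
{
  PetscViewer    v;
  PetscErrorCode ierr;

  ierr = PetscViewerBinaryOpen(PETSC_COMM_SELF,"data.bin",FILE_MODE_READ,&v);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_SELF,x);CHKERRQ(ierr);
  ierr = VecLoad(*x,v);CHKERRQ(ierr);          /* sizes and type are taken from the file */
  ierr = PetscViewerDestroy(&v);CHKERRQ(ierr);
  return 0;
}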


--
Lisandro Dalcin

Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/



Re: [petsc-dev] [petsc-users] Matrix-free method in PETSc

2020-02-18 Thread Zhang, Hong via petsc-dev
DMDA and MatShell are among the least documented features in PETSc, but they are 
extremely useful, at least to me. I hope to get my TS+MatShell+DMDA example into 
master early next month.
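
In the meantime, here is a rough serial sketch of the MatShell pattern discussed 
below. The context, the 1D 3-point stencil and all names are made up for 
illustration, there is no DMDA or TS wiring, and it is not the example that will 
go into master:

#include <petscksp.h>

/* Hypothetical context carrying the variable coefficients of the stencil. */
typedef struct {
  PetscInt  n;      /* grid size (serial sketch) */
  PetscReal *coeff; /* coefficients, updated by the user every time step */
} ShellCtx;

/* y = A*x applied matrix-free with a 1D 3-point stencil. A parallel version
   would use DMDA local vectors to get the ghost values. */
static PetscErrorCode MyMatMult(Mat A,Vec x,Vec y)
{
  ShellCtx          *ctx;
  const PetscScalar *xa;
  PetscScalar       *ya;
  PetscInt          i;
  PetscErrorCode    ierr;

  ierr = MatShellGetContext(A,&ctx);CHKERRQ(ierr);
  ierr = VecGetArrayRead(x,&xa);CHKERRQ(ierr);
  ierr = VecGetArray(y,&ya);CHKERRQ(ierr);
  for (i=0; i<ctx->n; i++) {
    PetscScalar left  = (i > 0)        ? xa[i-1] : 0.0;
    PetscScalar right = (i < ctx->n-1) ? xa[i+1] : 0.0;
    ya[i] = ctx->coeff[i]*(left - 2.0*xa[i] + right);
  }
  ierr = VecRestoreArray(y,&ya);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(x,&xa);CHKERRQ(ierr);
  return 0;
}

/* In main(), roughly:
     MatCreateShell(PETSC_COMM_SELF,n,n,n,n,&ctx,&A);
     MatShellSetOperation(A,MATOP_MULT,(void (*)(void))MyMatMult);
     KSPSetOperators(ksp,A,A);
*/

The point is that the coefficients live in the context attached to the shell 
matrix, so updating them each time step does not require recreating the matrix.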

Hong

On Feb 18, 2020, at 9:10 PM, Smith, Barry F. via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


  In the past you needed a brain to get a Stanford email account


Begin forwarded message:

From: Yuyun Yang mailto:yyan...@stanford.edu>>
Subject: Re: [petsc-users] Matrix-free method in PETSc
Date: February 18, 2020 at 8:26:11 AM CST
To: Matthew Knepley mailto:knep...@gmail.com>>
Cc: "Smith, Barry F." mailto:bsm...@mcs.anl.gov>>, 
"petsc-us...@mcs.anl.gov" 
mailto:petsc-us...@mcs.anl.gov>>

Thanks. Also, when using KSP, would the syntax be KSPSetOperators(ksp,A,A)? 
Since you mentioned preconditioners are not generally used for matrix-free 
operators, I wasn’t sure whether I should still put “A” in the Pmat field.

Is it still possible to use TS in conjunction with the matrix-free operator? 
I’d like to create a simple test case that solves the 1d heat equation 
implicitly with variable coefficients, but didn’t know how the time stepping 
can be set up.

Thanks,
Yuyun

From: Matthew Knepley mailto:knep...@gmail.com>>
Date: Tuesday, February 18, 2020 at 9:23 PM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: "Smith, Barry F." mailto:bsm...@mcs.anl.gov>>, 
"petsc-us...@mcs.anl.gov" 
mailto:petsc-us...@mcs.anl.gov>>
Subject: Re: [petsc-users] Matrix-free method in PETSc

On Tue, Feb 18, 2020 at 8:20 AM Yuyun Yang 
mailto:yyan...@stanford.edu>> wrote:
Thanks for the clarification.

Got one more question: if I have variable coefficients, my stencil will be 
updated at every time step, so will the coefficients in myMatMult. In that 
case, is it necessary to destroy the shell matrix and create it all over again, 
or can I use it as it is, only calling the stencil update function, assuming 
the result will be passed into the matrix operation automatically?

You update the information in the context associated with the shell matrix. No 
need to destroy it.

  Thanks,

Matt

Thanks,
Yuyun

On 2/18/20, 7:34 AM, "Smith, Barry F." 
mailto:bsm...@mcs.anl.gov>> wrote:



> On Feb 17, 2020, at 7:56 AM, Yuyun Yang 
mailto:yyan...@stanford.edu>> wrote:
>
> Hello,
>
> I actually have a question about the usage of DMDA since I'm quite new to 
this. I wonder if the DMDA suite of functions can be directly called on vectors 
created from VecCreate?

   Yes, but you have to make sure the ones you create have the same sizes 
and parallel layouts. Generally best to get them from the DMDA or 
VecDuplicate() than the hassle of figuring out sizes.

> Or the vectors have to be formed by DMDACreateGlobalVector? I'm also not 
sure about what the dof and stencil width arguments do.
>
> I'm still unsure about the usage of MatCreateShell and 
MatShellSetOperation, since it seems that MyMatMult should still have 3 inputs 
just like MatMult (the matrix and two vectors). Since I'm not forming the 
matrix, does that mean the matrix input is meaningless but still needs to exist 
for the sake of this format?

Well the matrix input is your shell matrix so it likely has information 
you need to do your multiply routine. MatShellGetContext() (No you do not want 
to put your information about the matrix stencil inside global variables!)


>
> After I create such a shell matrix, can I use it like a regular matrix in 
KSP and utilize preconditioners?
>
> Thanks!
> Yuyun
> From: petsc-users 
mailto:petsc-users-boun...@mcs.anl.gov>> on 
behalf of Yuyun Yang mailto:yyan...@stanford.edu>>
> Sent: Sunday, February 16, 2020 3:12 AM
> To: Smith, Barry F. mailto:bsm...@mcs.anl.gov>>
> Cc: petsc-us...@mcs.anl.gov 
mailto:petsc-us...@mcs.anl.gov>>
> Subject: Re: [petsc-users] Matrix-free method in PETSc
>
> Thank you, that is very helpful information indeed! I will try it and 
send you my code when it works.
>
> Best regards,
> Yuyun
> From: Smith, Barry F. mailto:bsm...@mcs.anl.gov>>
> Sent: Saturday, February 15, 2020 10:02 PM
> To: Yuyun Yang mailto:yyan...@stanford.edu>>
> Cc: petsc-us...@mcs.anl.gov 
mailto:petsc-us...@mcs.anl.gov>>
> Subject: Re: [petsc-users] Matrix-free method in PETSc
>
>   Yuyun,
>
> If you are speaking about using a finite difference stencil on a 
structured grid where you provide the Jacobian vector products yourself by 
looping over the grid doing the stencil operation we unfortunately do not have 
exactly that kind of example.
>
> But it is actually not difficult. I suggest starting with 
src/ts/examples/tests/ex22.c It computes the sparse matrix explicitly with 
FormIJacobian()
>
> What you need to do is instead in main() use MatCreateShell()

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-13 Thread Zhang, Hong via petsc-dev


> On Feb 13, 2020, at 7:39 AM, Smith, Barry F.  wrote:
> 
> 
> How are the two being compiled and linked? The same way, one with the PETSc 
> library in the path and the other without? Or does the PETSc one have lots of 
> flags and stuff while the non-PETSc one is just simple by hand?

PETSc was built as a static library. Then both examples were built against that 
static library.

Hong


> 
> Barry
> 
> 
>> On Feb 12, 2020, at 7:29 PM, Zhang, Hong  wrote:
>> 
>> 
>> 
>>> On Feb 12, 2020, at 5:11 PM, Smith, Barry F.  wrote:
>>> 
>>> 
>>> ldd -o on the petsc program (static) and the non petsc program (static), 
>>> what are the differences?
>> 
>> There is no difference in the outputs.
>> 
>>> 
>>> nm -o both executables | grep cudaFree()
>> 
>> Non petsc program:
>> 
>> [hongzh@login3.summit tests]$ nm ex_simple | grep cudaFree
>> 1ae0 t 0017.plt_call.cudaFree@@libcudart.so.10.1
>>  U cudaFree@@libcudart.so.10.1
>> 
>> Petsc program:
>> 
>> [hongzh@login3.summit tests]$ nm ex_simple_petsc | grep cudaFree
>> 10016550 t 0017.plt_call.cudaFree@@libcudart.so.10.1
>> 10017010 t 0017.plt_call.cudaFreeHost@@libcudart.so.10.1
>> 124c3f48 V 
>> _ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resou
>> rceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_
>> 8cuda_cub7pointerIvPT_vE8resource
>> 124c3f50 V 
>> _ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda
>> _memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEPT_vE8r
>> esource
>> 10726788 W 
>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvE11do_allocateEmm
>> 107267e8 W 
>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvE13do_deallocateENS_10device_ptrIvEEmm
>> 10726878 W 
>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvED0Ev
>> 10726848 W 
>> _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvED1Ev
>> 10729f78 W 
>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
>> 1072a218 W 
>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
>> 1072a388 W 
>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
>> 1072a358 W 
>> _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
>> 12122300 V 
>> _ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
>> 12122370 V 
>> _ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
>> 12122410 V 
>> _ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
>> 121225f0 V 
>> _ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
>> 12120630 V 
>> _ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
>> 121205b0 V 
>> _ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
>> 124c3f30 V 
>> _ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvPT_vE8resource
>> 124c3f20 V 
>> _ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEPT_vE8resource
>>  U cudaFree@@libcudart.so.10.1
>>  U cudaFreeHost@@libcudart.so.10.1
>> 
>> Hong
>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Feb 12, 2020, at 1:51 PM, M

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-12 Thread Zhang, Hong via petsc-dev


> On Feb 12, 2020, at 5:11 PM, Smith, Barry F.  wrote:
> 
> 
>  ldd -o on the petsc program (static) and the non petsc program (static), 
> what are the differences?

There is no difference in the outputs.

> 
>  nm -o both executables | grep cudaFree()

Non petsc program:

[hongzh@login3.summit tests]$ nm ex_simple | grep cudaFree
1ae0 t 0017.plt_call.cudaFree@@libcudart.so.10.1
 U cudaFree@@libcudart.so.10.1

Petsc program:

[hongzh@login3.summit tests]$ nm ex_simple_petsc | grep cudaFree
10016550 t 0017.plt_call.cudaFree@@libcudart.so.10.1
10017010 t 0017.plt_call.cudaFreeHost@@libcudart.so.10.1
124c3f48 V 
_ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resou
rceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_
8cuda_cub7pointerIvPT_vE8resource
124c3f50 V 
_ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda
_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEPT_vE8r
esource
10726788 W 
_ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvE11do_allocateEmm
107267e8 W 
_ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvE13do_deallocateENS_10device_ptrIvEEmm
10726878 W 
_ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvED0Ev
10726848 W 
_ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvED1Ev
10729f78 W 
_ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
1072a218 W 
_ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
1072a388 W 
_ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
1072a358 W 
_ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
12122300 V 
_ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
12122370 V 
_ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
12122410 V 
_ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
121225f0 V 
_ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
12120630 V 
_ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEE
121205b0 V 
_ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIv
124c3f30 V 
_ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvPT_vE8resource
124c3f20 V 
_ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEPT_vE8resource
 U cudaFree@@libcudart.so.10.1
 U cudaFreeHost@@libcudart.so.10.1

Hong

> 
> 
> 
> 
> 
>> On Feb 12, 2020, at 1:51 PM, Munson, Todd via petsc-dev 
>>  wrote:
>> 
>> 
>> There are some side effects when loading shared libraries, such as 
>> initializations of
>> static variables, etc.  Is something like that happening?
>> 
>> Another place is the initial runtime library that gets linked (libcrt0 
>> maybe?).  I 
>> think some MPI compilers insert their own version.
>> 
>> Todd.
>> 
>>> On Feb 12, 2020, at 11:38 AM, Zhang, Hong via petsc-dev 
>>>  wrote:
>>> 
>>> 
>>> 
>>>> On Feb 12, 2020, at 11:09 AM, Matthew Knepley  wrote:
>>>> 
>>>> On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev 
>>>>  wrote:
>>>> Sorry for the long post. Here are replies I have got from OLCF so far. We 
>>>> still don’t know how to solve the problem.
>>>> 
>>>> One interesting thing that Tom noticed is PetscInitialize() may have 
>>>> called cudaFree(0) 32 times as NVPROF shows, and they all run very fast. 
>>>> These calls may be triggered by some other libraries like cublas.

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-12 Thread Zhang, Hong via petsc-dev


On Feb 12, 2020, at 11:09 AM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Sorry for the long post. Here are the replies I have received from OLCF so far. 
We still don’t know how to solve the problem.

One interesting thing that Tom noticed is that PetscInitialize() may have called 
cudaFree(0) 32 times, as NVPROF shows, and those calls all run very fast. They 
may be triggered by other libraries such as cublas. But when PETSc calls 
cudaFree() explicitly, it is always very slow.

It sounds really painful, but I would start removing lines from 
PetscInitialize() until it runs fast.

It may be more painful than it sounds. The problem is not really related to 
PetscInitialize(). In the following simple example, we do not call any PETSc 
function, but if we link it to the PETSc shared library, cudaFree(0) is very 
slow. CUDA is a black box; there is not much we can debug with this simple 
example.

bash-4.2$ cat ex_simple.c
#include 
#include 
#include 

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double   *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);
  return 0;
}



  Thanks,

 Matt

Hong


On Wed Feb 12 09:51:33 2020, tpapathe wrote:

 Something else I noticed from the nvprof output (see my previous post) is
 that the runs with PETSc initialized have 33 calls to cudaFree, whereas the
 non-PETSc versions only have the 1 call to cudaFree. I'm not sure what is
 happening in the PETSc initialize/finalize, but it appears to be doing a
 lot under the hood. You can also see there are many additional CUDA calls
 that are not shown in the profiler output from the non-PETSc runs (e.g.,
 additional cudaMalloc and cudaMemcpy calls, cudaDeviceSychronize, etc.).
 Which other systems have you tested this on? Which CUDA Toolkits and CUDA
 drivers were installed on those systems? Please let me know if there is any
 additional information you can share with me about this.

 -Tom
 On Wed Feb 12 09:25:23 2020, tpapathe wrote:

   Ok. Thanks for the additional info, Hong. I'll ask around to see if any
   local (PETSc or CUDA) experts have experienced this behavior. In the
   meantime, is this impacting your work or something you're just curious
   about? A 5-7 second initialization time is indeed unusual, but is it
   negligible relative to the overall walltime of your jobs, or is it
   somehow affecting your productivity?

   -Tom
   On Tue Feb 11 17:04:25 2020, hongzh...@anl.gov<mailto:hongzh...@anl.gov> 
wrote:

  We know it happens with PETSc. But note that the slowdown occurs on the 
first CUDA function call. In the example I sent to you, if we simply link it to 
the PETSc shared library and don’t call any PETSc function, the slowdown still 
happens on cudaFree(0). We have never seen this behavior on other GPU systems.

On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT 
mailto:h...@nccs.gov>> wrote:

Thanks for the update. I have now reproduced the behavior you described with
PETSc + CUDA using your example code:

[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

==16991== NVPROF is profiling process 16991, command:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

==16991== Profiling application:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc

free time =4.73 malloc time =0.00 copy time =0.00

==16991== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy
HtoD]

API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree

0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute

0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem

0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName

0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc

0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute

0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy

0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize

0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags

0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy

0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute

0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy

0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags

0.00% 11.955us 1 11.955us 11.955us 11

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-12 Thread Zhang, Hong via petsc-dev
00

==17248== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
HtoD]

API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree

0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute

0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem

0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc

0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName

0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy

0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId

0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount

0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet

0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid



I also get the expected behavior if I add an MPI_Init and MPI_Finalize to the
code instead of PETSc initialization:

[tpapathe@login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc
-L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi


[tpapathe@batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

==35166== NVPROF is profiling process 35166, command:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

==35166== Profiling application:
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi

free time =0.34 malloc time =0.00 copy time =0.00

==35166== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
HtoD]

API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree

0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute

0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem

0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc

0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName

0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy

0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId

0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount

0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet

0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid


So this appears to be something specific happening within PETSc itself - not
necessarily an OLCF issue. I would suggest asking this question within the
PETSc community to understand what's happening. Please let me know if you have
any additional questions.

-Tom

On Feb 10, 2020, at 11:14 AM, Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:


 gprof or some similar tool?


On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

-cuda_initialize 0 does not make any difference. Actually this issue has 
nothing to do with PetscInitialize(). I tried to call cudaFree(0) before 
PetscInitialize(), and it still took 7.5 seconds.

Hong

On Feb 10, 2020, at 10:44 AM, Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:

As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize 
contains
ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
Have you tried to comment out them and test again?
--Junchao Zhang


On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


On Feb 8, 2020, at 5:03 PM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
I did some further investigation. The overhead persists for both the PETSc 
shared library and the static library. In the previous example, it does not 
call any PETSc function, the first CUDA function becomes very slow when it is 
linked to the petsc so. This indicates that the slowdown occurs if the symbol 
(cudafree)is searched through the petsc so, but does not occur if the symbol is 
found directly in the cuda runtime lib.

So the issue has nothing to do with the dynamic linker. The following example 
can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 
seconds).

1) This should go to OLCF admin as Jeff suggests

I had sent this to OLCF admin before the discussion was started here. Thomas 
Papatheodore has followed up. I am trying to help him reproduce the problem on 
summit.


2) Just to make sure I understand, a static executable with this code is still 
slow on the cudaFree(), since CUDA is a shared library by default.

I prepared the code as a minimal example to reproduce the problem. It would be 
fair to say any code using PETSc (with CUDA enabled, built statically or 
dynamically) on summit suffers a 7.5-second overhead on the first CUDA function 
call (either in the user code or inside PETSc).

Thanks,
Hong


I think we should try:

 a) Forcing a full static link, if possible

 b) Asking OLCF about link resolution order

It sounds like a similar thing I have seen in the past where link resolution 
order can exponentially increase load time.

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-10 Thread Zhang, Hong via petsc-dev
-cuda_initialize 0 does not make any difference. Actually this issue has 
nothing to do with PetscInitialize(). I tried to call cudaFree(0) before 
PetscInitialize(), and it still took 7.5 seconds.

Hong

On Feb 10, 2020, at 10:44 AM, Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:

As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize 
contains
ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
Have you tried to comment out them and test again?
--Junchao Zhang


On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


On Feb 8, 2020, at 5:03 PM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
I did some further investigation. The overhead persists for both the PETSc 
shared library and the static library. In the previous example, it does not 
call any PETSc function, the first CUDA function becomes very slow when it is 
linked to the petsc so. This indicates that the slowdown occurs if the symbol 
(cudafree)is searched through the petsc so, but does not occur if the symbol is 
found directly in the cuda runtime lib.

So the issue has nothing to do with the dynamic linker. The following example 
can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 
seconds).

1) This should go to OLCF admin as Jeff suggests

I had sent this to OLCF admin before the discussion was started here. Thomas 
Papatheodore has followed up. I am trying to help him reproduce the problem on 
summit.


2) Just to make sure I understand, a static executable with this code is still 
slow on the cudaFree(), since CUDA is a shared library by default.

I prepared the code as a minimal example to reproduce the problem. It would be 
fair to say any code using PETSc (with CUDA enabled, built statically or 
dynamically) on summit suffers a 7.5-second overhead on the first CUDA function 
call (either in the user code or inside PETSc).

Thanks,
Hong


I think we should try:

  a) Forcing a full static link, if possible

  b) Asking OLCF about link resolution order

It sounds like a similar thing I have seen in the past where link resolution 
order can exponentially increase load time.

  Thanks,

 Matt

bash-4.2$ cat ex_simple_petsc.c
#include 
#include 
#include 
#include 

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};
  PetscErrorCode ierr=0;

  ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);
  ierr = PetscFinalize();
  return ierr;
}

Hong

On Feb 7, 2020, at 3:09 PM, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Note that the overhead was triggered by the first call to a CUDA function. So 
it seems that the first CUDA function triggered loading petsc so (if petsc so 
is linked), which is slow on the summit file system.

Hong

On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Linking any other shared library does not slow down the execution. The PETSc 
shared library is the only one causing trouble.

Here are the ldd output for two different versions. For the first version, I 
removed -lpetsc and it ran very fast. The second (slow) version was linked to 
petsc so.

bash-4.2$ ldd ex_simple
linux-vdso64.so.1 =>  (0x2005)
liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x2007)
libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x209b)
libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x20e8)
libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x20ed)
libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x20f5)
libhd

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-08 Thread Zhang, Hong via petsc-dev


On Feb 8, 2020, at 5:03 PM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
I did some further investigation. The overhead persists for both the PETSc 
shared library and the static library. In the previous example, it does not 
call any PETSc function, the first CUDA function becomes very slow when it is 
linked to the petsc so. This indicates that the slowdown occurs if the symbol 
(cudafree)is searched through the petsc so, but does not occur if the symbol is 
found directly in the cuda runtime lib.

So the issue has nothing to do with the dynamic linker. The following example 
can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 
seconds).

1) This should go to OLCF admin as Jeff suggests

I had sent this to OLCF admin before the discussion was started here. Thomas 
Papatheodore has followed up. I am trying to help him reproduce the problem on 
summit.


2) Just to make sure I understand, a static executable with this code is still 
slow on the cudaFree(), since CUDA is a shared library by default.

I prepared the code as a minimal example to reproduce the problem. It would be 
fair to say any code using PETSc (with CUDA enabled, built statically or 
dynamically) on summit suffers a 7.5-second overhead on the first CUDA function 
call (either in the user code or inside PETSc).

Thanks,
Hong


I think we should try:

  a) Forcing a full static link, if possible

  b) Asking OLCF about link resolution order

It sounds like a similar thing I have seen in the past where link resolution 
order can exponentially increase load time.

  Thanks,

 Matt

bash-4.2$ cat ex_simple_petsc.c
#include 
#include 
#include 
#include 

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};
  PetscErrorCode ierr=0;

  ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);
  ierr = PetscFinalize();
  return ierr;
}

Hong

On Feb 7, 2020, at 3:09 PM, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Note that the overhead was triggered by the first call to a CUDA function. So 
it seems that the first CUDA function triggered loading petsc so (if petsc so 
is linked), which is slow on the summit file system.

Hong

On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Linking any other shared library does not slow down the execution. The PETSc 
shared library is the only one causing trouble.

Here are the ldd output for two different versions. For the first version, I 
removed -lpetsc and it ran very fast. The second (slow) version was linked to 
petsc so.

bash-4.2$ ldd ex_simple
linux-vdso64.so.1 =>  (0x2005)
liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x2007)
libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x209b)
libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x20e8)
libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x20ed)
libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x20f5)
libhdf5.so.103 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
 (0x20fb)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x215e)
libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
(0x2177)
libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
(0x29b0)
libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
(0x2d95)
libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
(0x2d9f0

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-08 Thread Zhang, Hong via petsc-dev
I did some further investigation. The overhead persists for both the PETSc 
shared library and the static library. In the previous example, which does not 
call any PETSc function, the first CUDA function becomes very slow when the 
executable is linked to the PETSc shared library. This indicates that the 
slowdown occurs if the symbol (cudaFree) is resolved through the PETSc shared 
library, but does not occur if the symbol is found directly in the CUDA runtime 
library.

So the issue has nothing to do with the dynamic linker. The following example 
can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 
seconds).

bash-4.2$ cat ex_simple_petsc.c
#include 
#include 
#include 
#include 

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double  *init,tmp[100] = {0};
  PetscErrorCode ierr=0;

  ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);
  ierr = PetscFinalize();
  return ierr;
}

Hong

On Feb 7, 2020, at 3:09 PM, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:

Note that the overhead was triggered by the first call to a CUDA function. So 
it seems that the first CUDA function triggered loading petsc so (if petsc so 
is linked), which is slow on the summit file system.

Hong

On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Linking any other shared library does not slow down the execution. The PETSc 
shared library is the only one causing trouble.

Here are the ldd output for two different versions. For the first version, I 
removed -lpetsc and it ran very fast. The second (slow) version was linked to 
petsc so.

bash-4.2$ ldd ex_simple
linux-vdso64.so.1 =>  (0x2005)
liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x2007)
libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x209b)
libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x20e8)
libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x20ed)
libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x20f5)
libhdf5.so.103 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
 (0x20fb)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x215e)
libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
(0x2177)
libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
(0x29b0)
libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
(0x2d95)
libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
(0x2d9f)
libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
(0x200012f5)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x20001dc4)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x20001ddd)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x20001de0)
libmpiprofilesupport.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
 (0x20001de4)
libmpi_ibm_usempi.so => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
 (0x20001de7)
libmpi_ibm_mpifh.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
 (0x20001dea)
libmpi_ibm.so.3 => 
/autofs/nccs-svm1_s

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-07 Thread Zhang, Hong via petsc-dev
I tried installing the PETSc shared library in /gpfs/alpine/scratch, which should 
be faster than the home directory, but the same overhead persists.

Hong

> On Feb 7, 2020, at 4:32 PM, Smith, Barry F.  wrote:
> 
> 
>   Perhaps the intent is that you build or install (--prefix) your libraries 
> in a different place than /autofs/nccs-svm1_home1 
> 
> 
> 
>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong  wrote:
>> 
>> Note that the overhead was triggered by the first call to a CUDA function. 
>> So it seems that the first CUDA function triggered loading petsc so (if 
>> petsc so is linked), which is slow on the summit file system.
>> 
>> Hong
>> 
>>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
>>>  wrote:
>>> 
>>> Linking any other shared library does not slow down the execution. The 
>>> PETSc shared library is the only one causing trouble.
>>> 
>>> Here are the ldd output for two different versions. For the first version, 
>>> I removed -lpetsc and it ran very fast. The second (slow) version was 
>>> linked to petsc so. 
>>> 
>>> bash-4.2$ ldd ex_simple
>>>linux-vdso64.so.1 =>  (0x2005)
>>>liblapack.so.0 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
>>>  (0x2007)
>>>libblas.so.0 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
>>>  (0x209b)
>>>libhdf5hl_fortran.so.100 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
>>>  (0x20e8)
>>>libhdf5_fortran.so.100 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
>>>  (0x20ed)
>>>libhdf5_hl.so.100 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
>>>  (0x20f5)
>>>libhdf5.so.103 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
>>>  (0x20fb)
>>>libX11.so.6 => /usr/lib64/libX11.so.6 (0x215e)
>>>libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
>>> (0x2177)
>>>libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
>>> (0x29b0)
>>>libcudart.so.10.1 => 
>>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x2d95)
>>>libcusparse.so.10 => 
>>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x2d9f)
>>>libcusolver.so.10 => 
>>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x200012f5)
>>>libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x20001dc4)
>>>libdl.so.2 => /usr/lib64/libdl.so.2 (0x20001ddd)
>>>libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x20001de0)
>>>libmpiprofilesupport.so.3 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
>>>  (0x20001de4)
>>>libmpi_ibm_usempi.so => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
>>>  (0x20001de7)
>>>libmpi_ibm_mpifh.so.3 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
>>>  (0x20001dea)
>>>libmpi_ibm.so.3 => 
>>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
>>>  (0x20001df4)
>>>libpgf90rtl.so => 
>>> /autofs/ncc

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-07 Thread Zhang, Hong via petsc-dev
Note that the overhead is triggered by the first call to a CUDA function. So it 
seems that the first CUDA function call triggers loading of the PETSc shared 
library (if it is linked), which is slow on the summit file system.

Hong

On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Linking any other shared library does not slow down the execution. The PETSc 
shared library is the only one causing trouble.

Here are the ldd output for two different versions. For the first version, I 
removed -lpetsc and it ran very fast. The second (slow) version was linked to 
petsc so.

bash-4.2$ ldd ex_simple
linux-vdso64.so.1 =>  (0x2005)
liblapack.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
 (0x2007)
libblas.so.0 => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
 (0x209b)
libhdf5hl_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
 (0x20e8)
libhdf5_fortran.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
 (0x20ed)
libhdf5_hl.so.100 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
 (0x20f5)
libhdf5.so.103 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
 (0x20fb)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x215e)
libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 
(0x2177)
libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 
(0x29b0)
libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 
(0x2d95)
libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 
(0x2d9f)
libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 
(0x200012f5)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x20001dc4)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x20001ddd)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x20001de0)
libmpiprofilesupport.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
 (0x20001de4)
libmpi_ibm_usempi.so => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
 (0x20001de7)
libmpi_ibm_mpifh.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
 (0x20001dea)
libmpi_ibm.so.3 => 
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
 (0x20001df4)
libpgf90rtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
 (0x20001e0b)
libpgf90.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
 (0x20001e0f)
libpgf90_rpm1.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
 (0x20001e6a)
libpgf902.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
 (0x20001e6d)
libpgftnrtl.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
 (0x20001e70)
libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x20001e73)
libpgkomp.so => 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt

Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-07 Thread Zhang, Hong via petsc-dev
ci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
 (0x200023c2)
libXau.so.6 => /usr/lib64/libXau.so.6 (0x200023d1)


On Feb 7, 2020, at 2:31 PM, Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:


 ldd -o on the executable of both linkings of your code.

 My guess is that without PETSc it is linking the static version of the needed 
libraries and with PETSc the shared. And, in typical fashion, the shared 
libraries are off on some super slow file system so take a long time to be 
loaded and linked in on demand.

  Still a performance bug in Summit.

  Barry


On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Hi all,

Previously I have noticed that the first call to a CUDA function such as 
cudaMalloc and cudaFree in PETSc takes a long time (7.5 seconds) on summit. 
Then I prepared a simple example as attached to help OCLF reproduce the 
problem. It turned out that the problem was  caused by PETSc. The 7.5-second 
overhead can be observed only when the PETSc lib is linked. If I do not link 
PETSc, it runs normally. Does anyone have any idea why this happens and how to 
fix it?

Hong (Mr.)

bash-4.2$ cat ex_simple.c
#include 
#include 
#include 

int main(int argc,char **args)
{
clock_t start,s1,s2,s3;
double  cputime;
double   *init,tmp[100] = {0};

start = clock();
cudaFree(0);
s1 = clock();
cudaMalloc((void **)&init,100*sizeof(double));
s2 = clock();
cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
s3 = clock();
printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);

return 0;
}






Re: [petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-07 Thread Zhang, Hong via petsc-dev
Statically linked executable works fine. The dynamic linker is probably broken.

Hong

On Feb 7, 2020, at 12:53 PM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

On Fri, Feb 7, 2020 at 1:23 PM Zhang, Hong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Hi all,

Previously I have noticed that the first call to a CUDA function such as 
cudaMalloc and cudaFree in PETSc takes a long time (7.5 seconds) on summit. 
Then I prepared a simple example as attached to help OCLF reproduce the 
problem. It turned out that the problem was  caused by PETSc. The 7.5-second 
overhead can be observed only when the PETSc lib is linked. If I do not link 
PETSc, it runs normally. Does anyone have any idea why this happens and how to 
fix it?

Hong, this sounds like a screwed up dynamic linker. Can you try this with a 
statically linked executable?

  Thanks,

 Matt

Hong (Mr.)

bash-4.2$ cat ex_simple.c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double   *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);

  return 0;
}




--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>



[petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

2020-02-07 Thread Zhang, Hong via petsc-dev
Hi all,

Previously I have noticed that the first call to a CUDA function such as 
cudaMalloc and cudaFree in PETSc takes a long time (7.5 seconds) on summit. 
Then I prepared a simple example as attached to help OCLF reproduce the 
problem. It turned out that the problem was  caused by PETSc. The 7.5-second 
overhead can be observed only when the PETSc lib is linked. If I do not link 
PETSc, it runs normally. Does anyone have any idea why this happens and how to 
fix it?

Hong (Mr.)

bash-4.2$ cat ex_simple.c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double   *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - 
start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - 
s2)) / CLOCKS_PER_SEC);

  return 0;
}




Re: [petsc-dev] "participants" on gitlab

2019-10-30 Thread Zhang, Hong via petsc-dev
After some discussion with Satish, I realized that currently the approval rules 
default to Integration, Team and Code Owner and all these groups are carried 
over as participants for MRs. It is reasonable to keep integrators and 
codeowners in the list since they should definitely look at the MR. But I would 
suggest to remove Team from the default. Actually the codeowners/integrators 
can edit the MR page and add other people to the approval list if needed. If we 
need to encourage more people to review, adding them to codeowners would be 
much better than abusing Team. To solve the notification problem, I considered 
to be removed from Team, but that would cause me to lose the privileges. I 
think removing Team from default and relying more on codeowners would solve the 
notification problem perfectly.

Hong (Mr.)

> On Oct 21, 2019, at 10:34 AM, Jed Brown  wrote:
> 
> All "developers" are listed as able to grant (optional) approvals --
> approval from codeowners/integrators is still needed regardless of those
> optional approvals.  We should perhaps remove that because I don't know
> a way to have some able to approve without the notification problem you
> mention below.  Unfortunately, I think that reduces incentive to review,
> and we're always stressed for reviewing resources.
> 
> "Zhang, Hong via petsc-dev"  writes:
> 
>> How is the list of participants determined when a MR is created on gitlab? 
>> It seems to include everybody by default. Is there any way to shorten the 
>> list? Ideally only the participants involved in the particular MR should be 
>> picked. Note that currently there is a huge gap between the ''Participate'' 
>> and ''On mention'' levels in the notification settings. With the former, I 
>> get spammed with notifications whenever a new MR is created. With the later, 
>> I won’t receive any notification (even someone replied my comments) unless 
>> explicitly @ by someone.
>> 
>> Hong (Mr.)



Re: [petsc-dev] AVX kernels, old gcc, still broken

2019-10-24 Thread Zhang, Hong via petsc-dev
Hi Lisandro,

Can you please check if the following patch fixes the problem? I will create a 
MR.

diff --git a/src/mat/impls/aij/seq/aijperm/aijperm.c 
b/src/mat/impls/aij/seq/aijperm/aijperm.c
index 577dfc6713..568535117a 100644
--- a/src/mat/impls/aij/seq/aijperm/aijperm.c
+++ b/src/mat/impls/aij/seq/aijperm/aijperm.c
@@ -12,7 +12,7 @@

 #include <../src/mat/impls/aij/seq/aij.h>

-#if defined(PETSC_HAVE_IMMINTRIN_H) && defined(__AVX512F__) && 
defined(PETSC_USE_REAL_DOUBLE) && !defined(PETSC_USE_COMPLEX) && 
!defined(PETSC_USE_64BIT_INDICES)
+#if defined(PETSC_USE_AVX512_KERNELS) && defined(PETSC_HAVE_IMMINTRIN_H) && 
defined(__AVX512F__) && defined(PETSC_USE_REAL_DOUBLE) && 
!defined(PETSC_USE_COMPLEX) && !defined(PETSC_USE_64BIT_INDICES) && 
!defined(PETSC_SKIP_IMMINTRIN_H_CUDAWORKAROUND)
 #include <immintrin.h>

 #if !defined(_MM_SCALE_8)
@@ -301,7 +301,7 @@ PetscErrorCode MatMult_SeqAIJPERM(Mat A,Vec xx,Vec yy)
 #if !(defined(PETSC_USE_FORTRAN_KERNEL_MULTAIJPERM) && defined(notworking))
   PetscInt  i,j;
 #endif
-#if defined(PETSC_HAVE_IMMINTRIN_H) && defined(__AVX512F__) && 
defined(PETSC_USE_REAL_DOUBLE) && !defined(PETSC_USE_COMPLEX) && 
!defined(PETSC_USE_64BIT_INDICES)
+#if defined(PETSC_USE_AVX512_KERNELS) && defined(PETSC_HAVE_IMMINTRIN_H) && 
defined(__AVX512F__) && defined(PETSC_USE_REAL_DOUBLE) && 
!defined(PETSC_USE_COMPLEX) && !defined(PETSC_USE_64BIT_INDICES) && 
!defined(PETSC_SKIP_IMMINTRIN_H_CUDAWORKAROUND)
   __m512d   vec_x,vec_y,vec_vals;
   __m256i   vec_idx,vec_ipos,vec_j;
   __mmask8   mask;
@@ -401,7 +401,7 @@ PetscErrorCode MatMult_SeqAIJPERM(Mat A,Vec xx,Vec yy)
 #pragma _CRI prefervector
 #endif

-#if defined(PETSC_HAVE_IMMINTRIN_H) && defined(__AVX512F__) && 
defined(PETSC_USE_REAL_DOUBLE) && !defined(PETSC_USE_COMPLEX) && 
!defined(PETSC_USE_64BIT_INDICES)
+#if defined(PETSC_USE_AVX512_KERNELS) && defined(PETSC_HAVE_IMMINTRIN_H) && 
defined(__AVX512F__) && defined(PETSC_USE_REAL_DOUBLE) && 
!defined(PETSC_USE_COMPLEX) && !defined(PETSC_USE_64BIT_INDICES) && 
!defined(PETSC_SKIP_IMMINTRIN_H_CUDAWORKAROUND)
 vec_y = _mm512_setzero_pd();
 ipos = ip[i];
 for (j=0; j<(nz>>3); j++) {
@@ -436,7 +436,7 @@ PetscErrorCode MatMult_SeqAIJPERM(Mat A,Vec xx,Vec yy)
* worthwhile to vectorize across the rows, that is, to do the
* matvec by operating with "columns" of the chunk. */
   for (j=0; j>3)<<3); i+=8) {
   vec_y= _mm512_loadu_pd(&yp[i]);


Thanks,
Hong
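
For reference (this is not the fix in the MR, which simply honors --with-avx512-kernels=0): the warning quoted below comes from gcc 5.4 not providing the _mm512_reduce_add_pd pseudo-intrinsic. A hedged sketch of a manual horizontal sum that older compilers can digest, assuming AVX-512F is enabled at compile time (e.g. -mavx512f):

#include <immintrin.h>

/* horizontal sum of a __m512d, usable where _mm512_reduce_add_pd is unavailable */
static inline double reduce_add_pd_512(__m512d v)
{
  __m256d lo = _mm512_castpd512_pd256(v);        /* lanes 0-3 */
  __m256d hi = _mm512_extractf64x4_pd(v, 1);     /* lanes 4-7 */
  __m256d s4 = _mm256_add_pd(lo, hi);            /* four partial sums */
  __m128d s2 = _mm_add_pd(_mm256_castpd256_pd128(s4), _mm256_extractf128_pd(s4, 1));
  __m128d s1 = _mm_add_sd(s2, _mm_unpackhi_pd(s2, s2));
  return _mm_cvtsd_f64(s1);
}

int main(void)
{
  double  a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  __m512d v    = _mm512_loadu_pd(a);
  return reduce_add_pd_512(v) == 36.0 ? 0 : 1;   /* expect 1+2+...+8 = 36 */
}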

On Oct 24, 2019, at 2:47 PM, Lisandro Dalcin via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

This is with master, but I bet the issue is also in maint.

* Running on Ubuntu 16

$ uname -a
Linux flamingo 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux

* With system gcc 5.4

$ mpicc -show
/usr/bin/gcc-5 
-I/sw/workstations/apps/linux-ubuntu16.04-x86_64/mpich/3.3.1/gcc-5.4.0/nvejoe25snmak6a7fnjghabxjukjkuiu/include
 
-L/sw/workstations/apps/linux-ubuntu16.04-x86_64/mpich/3.3.1/gcc-5.4.0/nvejoe25snmak6a7fnjghabxjukjkuiu/lib
 -Wl,-rpath 
-Wl,/sw/workstations/apps/linux-ubuntu16.04-x86_64/mpich/3.3.1/gcc-5.4.0/nvejoe25snmak6a7fnjghabxjukjkuiu/lib
 -lmpi

$ mpicc --version
gcc-5 (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

* PETSc configured to NOT USE AVX512 kernels

$ grep avx arch-gnu-opt/lib/petsc/conf/reconfigure-arch-gnu-opt.py
'--with-avx512-kernels=0',

* Bang!

$ touch src/mat/impls/aij/seq/aijperm/aijperm.c
$ make -f gmakefile
Use "/usr/bin/make V=1" to see verbose compile lines, "/usr/bin/make V=0" to 
suppress.
  CC arch-gnu-opt/obj/mat/impls/aij/seq/aijperm/aijperm.o
/home/dalcin/Devel/petsc/src/mat/impls/aij/seq/aijperm/aijperm.c: In function 
‘MatMult_SeqAIJPERM’:
/home/dalcin/Devel/petsc/src/mat/impls/aij/seq/aijperm/aijperm.c:426:22: 
warning: implicit declaration of function ‘_mm512_reduce_add_pd’ 
[-Wimplicit-function-declaration]
 yp[i] += _mm512_reduce_add_pd(vec_y);


--
Lisandro Dalcin

Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/



[petsc-dev] "participants" on gitlab

2019-10-21 Thread Zhang, Hong via petsc-dev
How is the list of participants determined when a MR is created on gitlab? It 
seems to include everybody by default. Is there any way to shorten the list? 
Ideally only the participants involved in the particular MR should be picked. 
Note that currently there is a huge gap between the ''Participate'' and ''On 
mention'' levels in the notification settings. With the former, I get spammed 
with notifications whenever a new MR is created. With the later, I won’t 
receive any notification (even someone replied my comments) unless explicitly @ 
by someone.

Hong (Mr.)

Re: [petsc-dev] People spent tim doing this

2019-10-11 Thread Zhang, Hong via petsc-dev
It is hard to understand where the speedup comes from. What is the difference 
between "manner 1" and "manner 2"?

Btw, we don’t provide “ELL” format in PETSc. We provide “SELL”, which should be 
more SIMD-friendly than the column-ELL proposed in the paper.

Hong
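
For anyone not familiar with the format, a small self-contained sketch of the sliced-ELLPACK idea (slice height 2 here for readability; PETSc's MATSELL uses a slice height of 8 and a similar interleaved layout, if I recall the implementation correctly). Rows within a slice are padded to the same length and stored interleaved, so the inner loop of the SpMV has unit stride and a fixed trip count, which is what makes it SIMD-friendly:

#include <stdio.h>

#define C 2  /* slice height */

int main(void)
{
  /* 4x4 example matrix stored slice by slice, interleaved within each slice;
     padded entries carry value 0 and a harmless column index */
  double val[]    = {1,3, 2,4, 0,5,  6,7, 0,8};
  int    colidx[] = {0,0, 2,1, 0,3,  2,1, 0,3};
  int    sliidx[] = {0, 6, 10};        /* start of each slice in val/colidx */
  int    nslices  = 2;
  double x[4] = {1,1,1,1}, y[4];

  for (int s = 0; s < nslices; s++) {
    int width = (sliidx[s+1] - sliidx[s]) / C;
    for (int r = 0; r < C; r++) y[s*C + r] = 0.0;
    for (int j = 0; j < width; j++) {
      for (int r = 0; r < C; r++) {   /* contiguous, same trip count: vectorizes */
        int k = sliidx[s] + j*C + r;
        y[s*C + r] += val[k] * x[colidx[k]];
      }
    }
  }
  for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);  /* 3 12 6 15 */
  return 0;
}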

On Oct 10, 2019, at 8:16 PM, Matthew Knepley via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Thanks,

   Matt

--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
<08560093.pdf>



Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Done. See 
https://gitlab.com/petsc/petsc/commit/85ec510f49531057ebfe1fb641fe93a36371878e
Hong

On Mon, Sep 23, 2019 at 11:32 AM Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>> wrote:
Hong,
You should probably cherry pick 
https://gitlab.com/petsc/petsc/commit/93d7d1d6d29b0d66b5629a261178b832a925de80?merge_request_iid=2069
 (and remove the MatNest part).
This fixes a similar issue in MatTransposeMatMult with nontrivial LDAs.
Since this commit is part of a feature MR that is unlikely to be ready for 
tomorrow, this fix (as of now) is also unlikely to be in master for the release.

Thanks,
Pierre

On 23 Sep 2019, at 6:02 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Barry:
As a hack for this release could you have the Numeric portion of the 
multiply routines check if the symbolic data is there and if not just call the 
symbolic an attach the needed data? You might need to have a utility function 
that does all the symbolic part except the allocation of the matrix and then 
call this from the numeric part as well as the real symbolic part.

I'm working on this now. I was not aware of MatSeqDenseSetLDA(), which changes 
the pattern of data access in a seqdense matrix.
Pierre's patch:
"change Bm here 
https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
 to the LDA of B"
fix this bug. I'll further test it and submit a pull request.
Then, I'll check slepc's bug report.
Hong



Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry:
As a hack for this release could you have the Numeric portion of the 
multiply routines check if the symbolic data is there and if not just call the 
symbolic an attach the needed data? You might need to have a utility function 
that does all the symbolic part except the allocation of the matrix and then 
call this from the numeric part as well as the real symbolic part.

I'm working on this now. I was not aware of MatSeqDenseSetLDA(), which changes 
the pattern of data access in a seqdense matrix.
Pierre's patch:
"change Bm here 
https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
 to the LDA of B"
fix this bug. I'll further test it and submit a pull request.
Then, I'll check slepc's bug report.
Hong


Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry :

   We would like to avoid allocating a huge array for the matrix and then having 
the user place their array on top of it.

   In the new paradigm there could be options called on the resulting C of 
MatMatGetProduct() that would take effect before the C is fully formed to 
prevent the allocating and freeing for a huge array the same time as user array 
exists but with the current API we have for this release.
Allocation of C is done in the symbolic product, not GetProduct(). Petsc gets 
user's array before symbolic product, thus it will not allocate C array.
Hong




Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Yes, we should allow users to provide their own matrix array.

We use MatDensePlaceArray() to plug an array into matrix C before MatMatMult(). 
If we cannot do this, we will have to copy from the internal array of the 
result C to our array.

Would the following sequence work?
MatMatMultSymbolic()
MatDensePlaceArray()
MatMatMultNumeric()
This seems a reasonable API, but it is not obvious to users when and where 
MatDensePlaceArray() should be called.
Currently, most users call  MatMatMult(A,B, reuse,&C) instead of 
MatMatMultSymbolic/Numeric.

We plan to add
MatMatGetProduct(A,B,&C);
Then,
MatDensePlaceArray(C,array);

Hong
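
For readers of the archive, the sequence being weighed above, written out (a sketch only; MatMatGetProduct() is a proposal in this thread, not an existing routine, and whether the symbolic/numeric split behaves this way for every matrix-type combination is exactly what is being discussed):

/* C is created by the symbolic phase, the user's storage is placed into it,
   and the numeric phase then writes directly into that storage */
ierr = MatMatMultSymbolic(A, B, PETSC_DEFAULT, &C);CHKERRQ(ierr);
ierr = MatDensePlaceArray(C, user_array);CHKERRQ(ierr);
ierr = MatMatMultNumeric(A, B, C);CHKERRQ(ierr);
/* ... use C ... */
ierr = MatDenseResetArray(C);CHKERRQ(ierr);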


Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-22 Thread Zhang, Hong via petsc-dev
I'll check it tomorrow.
Hong

On Sun, Sep 22, 2019 at 1:04 AM Pierre Jolivet via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Jed,
I’m not sure how easy it is to put more than a few lines of code on GitLab, so 
I’ll just send the (tiny) source here, as a follow-up of our discussion 
https://gitlab.com/petsc/petsc/merge_requests/2069#note_220229648.
Please find attached a .cpp showing the brokenness of C=A*B with A of type 
MPIAIJ and B of type MPIDense when the LDA of B is not equal to its number of 
local rows.
It does [[1,1];[1,1]] * [[0,1,2,3];[0,1,2,3]]
C should be equal to 2*B, but it’s not, unless lda = m (= 1).
Mat Object: 2 MPI processes
  type: mpidense
0.e+00 1.e+00 2.e+00 
3.e+00
0.e+00 1.e+00 2.e+00 
3.e+00

If you change Bm here 
https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
 to the LDA of B, you’ll get the correct result.
Mat Object: 2 MPI processes
  type: mpidense
0.e+00 2.e+00 4.e+00 
6.e+00
0.e+00 2.e+00 4.e+00 
6.e+00

Unfortunately, w.r.t. MR 2069, I still don’t get the same results with a plain 
view LDA > m (KO) and a view + duplicate LDA = m (OK).
So there might be something else to fix (or this might not even be a correct 
fix), but the only reproducer I have right now is the full solver.

Thanks,
Pierre



Re: [petsc-dev] moving from BitBucket to GitLab

2019-06-16 Thread Zhang, Hong via petsc-dev
If it is mainly because of CI, why don't we host petsc on GitHub and use the 
GitLab CI?
https://about.gitlab.com/solutions/github/

GitHub has been the biggest social network for developers. Changing a utility 
is easy to me, but changing a social network isn't.

Thanks,
Hong (Mr.)

On Jun 15, 2019, at 5:42 PM, Smith, Barry F. via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


  Given the terrible performance of BitBucket recently and the far superior 
ability to do flexible CI on GitLab Satish and Jed are experimenting with using 
GitLap CI. In a couple of week if all goes well we are likely to move 
everything to GitLab.

   If you have major concerns about such a move please let us know.

  Thanks

   Barry




Re: [petsc-dev] Is bitbucket less responsive than it use to be?

2019-05-14 Thread Zhang, Hong via petsc-dev
Vote for GitHub +1.

We would have almost moved to GitHub early last year. But I was not sure what 
stopped the transition.

Hong

On May 14, 2019, at 10:51 AM, Fande Kong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Any difficulty to switch over to GitHub?  I like GitHub better than bitbucket.

Fande

On Tue, May 14, 2019 at 9:41 AM Dave May via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


On Tue, 14 May 2019 at 17:34, Smith, Barry F. via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

  Could be they're digging their own grave.

  I seem to be spending more time waiting after each click when previously I 
recall it was virtually instantaneous?

It's definitely not your imagination - the web interface is much much slower 
than it used to be 5-8 years ago.









Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-12 Thread Zhang, Hong via petsc-dev
I would suggest Fande add this new implementation into petsc. What is the 
algorithm?
I'll try to see if I can further reduce memory consumption of the current 
symbolic PtAP when I get time.
Hong

On Fri, Apr 12, 2019 at 8:27 AM Mark Adams via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:


> On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Interesting, nice work.
>
> It would be interesting to get the flop counters working.
>
> This looks like GMG, I assume 3D.
>
> The degree of parallelism is not very realistic. You should probably run a 
> 10x smaller problem, at least, or use 10x more processes.

   Why do you say that? He's got his machine with a certain amount of physical 
memory per node, are you saying he should ignore/not use 90% of that physical 
memory for his simulation?

In my experience 1.5M equations/process is about 50x more than applications run, 
but this is just anecdotal. Some apps are dominated by the linear solver in 
terms of memory but some apps use a lot of memory in the physics parts of the 
code.

The one app that I can think of where the memory usage is dominated by the 
solver does like 10 (pseudo) time steps with pretty hard nonlinear solves, so 
in the end they are not bound by turnaround time. But they are kind of a odd 
(academic) application and not very representative of what I see in the broader 
comp sci community. And these guys do have a scalable code so instead of 
waiting a week on the queue to run a 10 hour job that uses 10% of the machine, 
they wait a day to run a 2 hour job that takes 50% of the machine because 
centers scheduling policies work that way.

He should buy a machine 10x bigger just because it means having fewer degrees of 
freedom per node (who's footing the bill for this purchase?). At INL they run 
simulations for a purpose, not just for scalability studies, and there are no 
dang GPUs or barely used over-sized monstrosities sitting around to brag about 
twice a year at SC.

I guess the are the nuke guys. I've never worked with them or seen this kind of 
complexity analysis in their talks, but OK if they fill up memory with the 
solver then this is representative of a significant (DOE)app.


   Barry



> I guess it does not matter. This basically like a one node run because the 
> subdomains are so large.
>
> And are you sure the numerics are the same with and without hypre? Hypre is 
> 15x slower. Any ideas what is going on?
>
> It might be interesting to scale this test down to a node to see if this is 
> from communication.
>
> Again, nice work,
> Mark
>
>
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong 
> mailto:fdkong...@gmail.com>> wrote:
> Hi Developers,
>
> I just want to share a good news.  It is known PETSc-ptap-scalable is taking 
> too much memory for some applications because it needs to build intermediate 
> data structures.  According to Mark's suggestions, I implemented the  
> all-at-once algorithm that does not cache any intermediate data.
>
> I did some comparison,  the new implementation is actually scalable in terms 
> of the memory usage and the compute time even though it is still  slower than 
> "ptap-scalable".   There are some memory profiling results (see the 
> attachments). The new all-at-once implementation use the similar amount of 
> memory as hypre, but it way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000 
> processor cores,  There are timing results:
>
> Hypre algorithm:
>
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>
> PETSc scalable PtAP:
>
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>
>
> You can see here the all-at-once is a bit slower than ptap-scalable, but it 
> uses only much less memory.
>
>
> Fande
>



Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-27 Thread Zhang, Hong via petsc-dev
Myriam,
- PETSc 3.6.4 (reference)
- PETSc 3.10.4 without specific options
- PETSc 3.10.4 with the three scalability options you mentionned
What are the 'three scalability options' here?
What is "MaxMemRSS", the max memory used by a single core? How many cores do 
you start with?

Do you have 'execution time scalability' plot?
Hong


On Wed, Mar 27, 2019 at 8:47 AM Mark Adams via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
So are these the instructions that I should give him? This grad student is a 
quick study but he has no computing background. So we don't care what we use, 
we just want it to work (easily).

Thanks

Do not use "--download-fblaslapack=1". Set it to 0. Same for 
"--download-mpich=1".

Now do:

> module load mkl

> export BLAS_LAPACK_LOAD=--with-blas-lapack-dir=${MKLROOT}

>  export PETSC_MPICH_HOME="${MPICH_HOME}"

And use

--with-cc=${MPICH_HOME}/mpicc --with-cxx=${MPICH_HOME}/mpicxx 
--with-fc=${MPICH_HOME}/mpif90

instead of clang++

On Wed, Mar 27, 2019 at 9:30 AM Matthew Knepley 
mailto:knep...@gmail.com>> wrote:
On Wed, Mar 27, 2019 at 8:55 AM Victor Eijkhout via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
On Mar 27, 2019, at 7:29 AM, Mark Adams 
mailto:mfad...@lbl.gov>> wrote:

How should he configure to this? remove "--download-fblaslapack=1" and add 

1. If using gcc

module load mkl

with either compiler:

export BLAS_LAPACK_LOAD=--with-blas-lapack-dir=${MKLROOT}

2.  We define MPICH_HOME for you.

With Intel MPI:

  export PETSC_MPICH_HOME="${MPICH_HOME}/intel64"
  export mpi="--with-mpi-compilers=1 --with-mpi-include=${TACC_IMPI_INC} 
--with-mpi-lib=${TACC_IMPI_LIB}/release_mt/libmpi.so”

with mvapich:

  export PETSC_MPICH_HOME="${MPICH_HOME}"
  export mpi="--with-mpi-compilers=1 --with-mpi-dir=${PETSC_MPICH_HOME}”

(looks like a little redundancy in my script)

I think Satish now prefers

  --with-cc=${MPICH_HOME}/mpicc --with-cxx=${MPICH_HOME}/mpicxx 
--with-fc=${MPICH_HOME}/mpif90

  Thanks,

Matt

Victor.



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-22 Thread Zhang, Hong via petsc-dev
Fande,
The images are very interesting and helpful. How did you get these images?

Petsc PtAP uses 753MB for PtAPSymbolic and only 116MB for PtAPNumeric,
while hypre uses 215MB -- it seems hypre does not implement symbolic PtAP.

When I implement PtAP, my focus was on numeric part because it was used 
repeatedly. Now we should look at symbolic part and optimize it. It should have 
room for improvement or new algorithmic approach.
We also need understand the algorithm and implementation used hypre.

Hong

On Thu, Mar 21, 2019 at 6:37 PM Mark Adams via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Could you explain this more by adding some small examples?


Since you are considering implementing all-at-once (four nested loops, right?) 
I'll give you my old code.

This code is hardwired for two AMG and for a geometric-AMG, where the blocks of 
the R (and hence P) matrices are scaled identities and I only store the scale. 
So you ignore those branches. This code also does equivalent real form complex, 
so more branches to ignore.
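
For readers following along, a tiny dense-storage sketch of the all-at-once idea (my reading of it, not Mark's or Fande's actual code): every entry A(i,k) is scattered directly into C = P^T*A*P through rows i and k of P, so no intermediate product A*P or P^T*A is ever stored. In a sparse implementation the two inner loops would run only over the nonzeros of those rows of P.

#include <stdio.h>

#define N 3  /* rows/cols of A */
#define M 2  /* cols of P, so C is M x M */

int main(void)
{
  double A[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
  double P[N][M] = {{1, 0}, {0.5, 0.5}, {0, 1}};
  double C[M][M] = {{0}};

  for (int i = 0; i < N; i++)            /* row of A */
    for (int k = 0; k < N; k++)          /* column of A (nonzeros only, in sparse code) */
      for (int pi = 0; pi < M; pi++)     /* entries of row i of P */
        for (int pj = 0; pj < M; pj++)   /* entries of row k of P */
          C[pi][pj] += P[i][pi] * A[i][k] * P[k][pj];

  for (int pi = 0; pi < M; pi++) {
    for (int pj = 0; pj < M; pj++) printf("%8.3f ", C[pi][pj]);
    printf("\n");
  }
  return 0;
}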


Re: [petsc-dev] How long?

2019-03-11 Thread Zhang, Hong via petsc-dev
Is the Linux kernel maintainable and extendable? Does anyone want to reimplement 
Linux in Julia?

Hong (Mr.)

> On Mar 11, 2019, at 9:28 PM, Smith, Barry F. via petsc-dev 
>  wrote:
> 
> 
>   PETSc source code is becoming an unmaintainable, unextendable monstrosity. 
> How long until Julia is mature enough that we can (re)implement PETSc in it?
> 
>   Barry
> 



Re: [petsc-dev] Segmentation faults in MatMatMult & MatTransposeMatMult

2019-01-14 Thread Zhang, Hong via petsc-dev
Replace
ierr = MatSetType(A, MATMPIAIJ);CHKERRQ(ierr);
to
ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);

Replace
ierr = MatSetType(B, MATMPIDENSE);CHKERRQ(ierr);
to
ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);

Then add
MatSeqAIJSetPreallocation()
MatSeqDenseSetPreallocation()

Hong
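
A hedged sketch of that pattern (the function name and the nonzero counts below are placeholders of mine): create the matrices with the type-neutral MATAIJ/MATDENSE names and call both the Seq and MPI preallocation routines; the one that does not match the actual type is simply ignored, so the same code runs on one or many processes.

#include <petscmat.h>

PetscErrorCode CreateOperators(MPI_Comm comm, PetscInt M, PetscInt N, Mat *A, Mat *B)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  /* sparse operator: SEQAIJ on 1 process, MPIAIJ otherwise */
  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, M, M);CHKERRQ(ierr);
  ierr = MatSetType(*A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(*A, 5, NULL);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(*A, 5, NULL, 5, NULL);CHKERRQ(ierr);

  /* dense operator: SEQDENSE on 1 process, MPIDENSE otherwise */
  ierr = MatCreate(comm, B);CHKERRQ(ierr);
  ierr = MatSetSizes(*B, PETSC_DECIDE, PETSC_DECIDE, M, N);CHKERRQ(ierr);
  ierr = MatSetType(*B, MATDENSE);CHKERRQ(ierr);
  ierr = MatSeqDenseSetPreallocation(*B, NULL);CHKERRQ(ierr);
  ierr = MatMPIDenseSetPreallocation(*B, NULL);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}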

On Mon, Jan 14, 2019 at 2:51 PM Pierre Jolivet via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Hello,
Is there any chance to get MatMatMult_MPIAIJ_MPIDense  and 
MatTransposeMatMult_MPIAIJ_MPIDense fixed so that the attached program could 
run _with a single_ process? (I know, I could switch to SeqAIJ_SeqDense, but 
that is not an option I have right now)

Thanks in advance,
Pierre



Re: [petsc-dev] How to know MatSolve() was successful?

2018-10-12 Thread Zhang, Hong
Junchao :
I learned users should call KSPGetConvergedReason to check if a KSPSolve was 
successful. But if users directly call MatSolve() or MatMatSolve etc, how can 
they know it was successful?
I see in MatSolve_MUMPS, it first checks factor errors. If there is, it sets 
the solution to infinity;  But it does not do that in MatMatSolve.  Also, in 
MatSolve_MUMPS, if the factorization was fine but the solve failed, the code 
just aborts. Is it wrong in a KSP context?

If you suspect something is wrong, produce an example that validates your guess. 
We then work from there and improve PETSc.
Hong
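
A minimal example in that spirit (a sketch of mine, not the user's code): the 1x1 system below is singular on purpose, KSPSolve returns cleanly, and only KSPGetConvergedReason and MatFactorGetError reveal that the factorization failed. Run it on a single process, since the built-in PCLU is sequential.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat                A, F;
  Vec                b, x;
  KSP                ksp;
  PC                 pc;
  KSPConvergedReason reason;
  MatFactorError     ferr;
  PetscErrorCode     ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, 1, 1, 1, NULL, 0, NULL, &A);CHKERRQ(ierr);
  ierr = MatSetValue(A, 0, 0, 0.0, INSERT_VALUES);CHKERRQ(ierr);  /* singular on purpose */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);                        /* no error raised here by default */
  ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
  ierr = PCFactorGetMatrix(pc, &F);CHKERRQ(ierr);
  ierr = MatFactorGetError(F, &ferr);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "converged reason %d, factor error %d\n", (int)reason, (int)ferr);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}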


Re: [petsc-dev] MUMPS silent errors

2018-10-11 Thread Zhang, Hong
Junchao:

Hong,

 The user's example code reads a matrix, calls KSPSolve, and then exits. From his 
log_view file, I saw long MatLUFactorNum time and short MatSolve time.  Now I 
know that is because MatSolve was skipped. Thanks.

This is intended.
Hong

From: Zhang, Hong
Sent: Thursday, October 11, 2018 10:07:10 AM
To: Zhang, Junchao
Cc: Smith, Barry F.; For users of the development version of PETSc
Subject: Re: [petsc-dev] MUMPS silent errors

Junchao :
When matrix factorization fails, we deliver an error message back to the user and skip 
MatSolve. Can you reproduce this problem and I'll take a look at it?


What is embarrassing is the user sent me beautiful -log_view outputs and began 
doing performance comparison. The whole thing is meaningless only because he 
forgot to check the converged reason on a direct solver.


When linear solver fails, snes/ts also fails, which should display error output 
to user. User should check the accuracy of his final solution with 
'-snes_converged_reason' before looking at performance.


MUMPS manual has "A call to MUMPS with JOB=2 must be preceded by a call with 
JOB=1 on the same instance", and similar languages for other phases.  It 
implies we at least should not call MatSolve_MUMPS with failed factorization 
since it might crash the code.

Yes. I've never seen this happen before, thus want to check.
Hong

From: Smith, Barry F.
Sent: Wednesday, October 10, 2018 6:41:20 PM
To: Zhang, Junchao
Cc: petsc-dev
Subject: Re: [petsc-dev] MUMPS silent errors


  I looked at the code and it is handled in the PETSc way. The user should not 
expect KSP to error just because it was unable to solve a linear system; they 
should be calling KSPGetConvergedReason() after KSPSolve() to check that the 
solution was computed successfully.

   Barry


> On Oct 10, 2018, at 2:12 PM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I met a case where MUMPS numeric factorization returned an error code -9 in 
> mumps->id.INFOG(1) but A->erroriffailure was false in the following code in 
> mumps.c
> 1199: PetscErrorCode MatFactorNumeric_MUMPS(Mat F,Mat A,const MatFactorInfo 
> *info)
> 1200:
> {
> ...
>
> 1227:   PetscMUMPS_c(mumps);
> 1228:   if
>  (mumps->id.INFOG(1) < 0) {
>
> 1229: if
>  (A->erroriffailure) {
>
> 1230:   SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_LIB,"Error reported by MUMPS 
> in numerical factorization phase: INFOG(1)=%d, INFO(2)=%d\n"
> ,mumps->id.INFOG(1),mumps->id.INFO(2));
>
> 1231: } else
>  {
>
> 1232:   if (mumps->id.INFOG(1) == -10) { /* numerically singular matrix */
> 1233: PetscInfo2(F,"matrix is numerically singular, INFOG(1)=%d, 
> INFO(2)=%d\n"
> ,mumps->id.INFOG(1),mumps->id.INFO(2));
>
> 1234:
> F->factorerrortype = MAT_FACTOR_NUMERIC_ZEROPIVOT;
>
>
> The code continued to KSPSolve and finished successfully (with wrong answer). 
> The user did not call KSPGetConvergedReason() after KSPSolve. I found I had  
> to either add -ksp_error_if_not_converged or call 
> KSPSetErrorIfNotConverged(ksp,PETSC_TRUE) to make the code fail.
> Is it expected?  In my view, it is dangerous. If MUMPS fails in one stage, 
> PETSc should not proceed to the next stage because it may hang there.



Re: [petsc-dev] MUMPS silent errors

2018-10-11 Thread Zhang, Hong
Junchao :
When matrix factorization fails, we deliver an error message back to the user and skip 
MatSolve. Can you reproduce this problem and I'll take a look at it?


What is embarrassing is the user sent me beautiful -log_view outputs and began 
doing performance comparison. The whole thing is meaningless only because he 
forgot to check the converged reason on a direct solver.


When linear solver fails, snes/ts also fails, which should display error output 
to user. User should check the accuracy of his final solution with 
'-snes_converged_reason' before looking at performance.


MUMPS manual has "A call to MUMPS with JOB=2 must be preceded by a call with 
JOB=1 on the same instance", and similar languages for other phases.  It 
implies we at least should not call MatSolve_MUMPS with failed factorization 
since it might crash the code.

Yes. I've never seen this happen before, thus want to check.
Hong

From: Smith, Barry F.
Sent: Wednesday, October 10, 2018 6:41:20 PM
To: Zhang, Junchao
Cc: petsc-dev
Subject: Re: [petsc-dev] MUMPS silent errors


  I looked at the code and it is handled in the PETSc way. The user should not 
expect KSP to error just because it was unable to solve a linear system; they 
should be calling KSPGetConvergedReason() after KSPSolve() to check that the 
solution was computed successfully.

   Barry


> On Oct 10, 2018, at 2:12 PM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I met a case where MUMPS numeric factorization returned an error code -9 in 
> mumps->id.INFOG(1) but A->erroriffailure was false in the following code in 
> mumps.c
> 1199: PetscErrorCode MatFactorNumeric_MUMPS(Mat F,Mat A,const MatFactorInfo 
> *info)
> 1200:
> {
> ...
>
> 1227:   PetscMUMPS_c(mumps);
> 1228:   if
>  (mumps->id.INFOG(1) < 0) {
>
> 1229: if
>  (A->erroriffailure) {
>
> 1230:   SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_LIB,"Error reported by MUMPS 
> in numerical factorization phase: INFOG(1)=%d, INFO(2)=%d\n"
> ,mumps->id.INFOG(1),mumps->id.INFO(2));
>
> 1231: } else
>  {
>
> 1232:   if (mumps->id.INFOG(1) == -10) { /* numerically singular matrix */
> 1233: PetscInfo2(F,"matrix is numerically singular, INFOG(1)=%d, 
> INFO(2)=%d\n"
> ,mumps->id.INFOG(1),mumps->id.INFO(2));
>
> 1234:
> F->factorerrortype = MAT_FACTOR_NUMERIC_ZEROPIVOT;
>
>
> The code continued to KSPSolve and finished successfully (with wrong answer). 
> The user did not call KSPGetConvergedReason() after KSPSolve. I found I had  
> to either add -ksp_error_if_not_converged or call 
> KSPSetErrorIfNotConverged(ksp,PETSC_TRUE) to make the code fail.
> Is it expected?  In my view, it is dangerous. If MUMPS fails in one stage, 
> PETSc should not proceed to the next stage because it may hang there.



Re: [petsc-dev] TSBASICSYMPLECTIC

2018-09-11 Thread Zhang, Hong
A few related discussions can be found at 
https://bitbucket.org/petsc/petsc/pull-requests/1108/rename-bsi-to-symplectic/diff

In addition, what we have in PETSc now is "Basic Symplectic Integrators" as 
introduced in Ernst Hairer's article 
https://www.unige.ch/~hairer/poly_geoint/week2.pdf .

Other types of symplectic methods such as symplectic Runge-Kutta use different 
tableaus and cannot be implemented in the same framework as the basic one. So 
when naming this particular type of symplectic methods, we think it is better 
to be specific than general.

Thanks,
Hong
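
For concreteness, the first-order member of that family (symplectic Euler) for a separable Hamiltonian H(p,q) = T(p) + V(q) reads, in the notation of those notes,

  p_{n+1} = p_n - h\,\nabla_q V(q_n), \qquad q_{n+1} = q_n + h\,\nabla_p T(p_{n+1}),

which, if I read the implementation right, is what -ts_type basicsymplectic -ts_basicsymplectic_type 1 selects; the higher-order members reuse the same update structure over several sub-steps.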

On Sep 11, 2018, at 4:53 AM, Lisandro Dalcin 
mailto:dalc...@gmail.com>> wrote:

If the plan is to eventually have a family of sympletic integrators, then I 
think this is a really bad name.

We should follow the pattern elsewhere, and have a main TSSYMPLECTIC type, and 
subtypes TSSYMPLECTICBASIC etc, and in command line we ask for -ts_type 
sympletic -ts_sympletic_type basic.

Or, if there are no plans to have a family, then why to name it BASIC in the 
first place?

PS: Not an expert in the field, feel free to hammer me about my ignorance.

--
Lisandro Dalcin

Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 0109
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459



Re: [petsc-dev] any script to extract data from "-log_view"?

2018-08-09 Thread Zhang, Hong
Here is an updated (python3 and a few other new features) version of Jed's 
script.

https://github.com/caidao22/petscplot

Hong (Mr.)

On Aug 8, 2018, at 10:51 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

Fande Kong mailto:fdkong...@gmail.com>> writes:

Hi all,

If we are going to do a scaling study, can we automatically make a plot
based on a bunch of output files (from -log_view)? It is a big pain to
manually copy data if we have a lot of output files.

This may be a useful.

https://github.com/jedbrown/petscplot



Re: [petsc-dev] plans for preconditioners for SeqSELL

2018-06-25 Thread Zhang, Hong


On Jun 25, 2018, at 4:09 PM, Mills, Richard Tran 
mailto:rtmi...@anl.gov>> wrote:

Hi Hong,

Thanks for your reply. Yes, I was just looking at what goes on in 
MatConvert_Basic() and I see the problem. I note that my AIJSELL code seems to 
always work correctly when the matrix block size is one -- I guess that this is 
because MatSeqSELLSetPreallocation() is called in this case.

Are you saying that the

if (A->rmap->bs > 1) {
ierr = MatConvert_Basic(A,newtype,reuse,newmat);CHKERRQ(ierr);
PetscFunctionReturn(0);
}

lines in sell.c should be removed and then things should work?

If your conversion function for AIJSELL calls MatConvert_SeqAIJ_SeqSELL(), then 
yes. Otherwise you might want to copy some code from the link I attached 
(sell.c).

If so, I'll remove this, test on my machine, and then merge this change into 
'next'. Or is more code needed to make SELL work properly with a block size > 1?

I don't think we need more code.

Thanks,
Hong (Mr.)
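
A small standalone check along these lines (a sketch of mine; it uses a plain tridiagonal matrix with block size 1, whereas exercising the bs > 1 path would need something like the DMDA-built Jacobian from ex19) can tell quickly whether a converted SELL matrix reproduces the AIJ MatMult when the row count is not a multiple of 8:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, S;
  PetscInt       i, n = 13;          /* deliberately not a multiple of 8 */
  PetscBool      equal;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 2, NULL, &A);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
    if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    if (i < n-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatConvert(A, MATSELL, MAT_INITIAL_MATRIX, &S);CHKERRQ(ierr);
  ierr = MatMultEqual(A, S, 5, &equal);CHKERRQ(ierr);   /* compare several MatMults */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "AIJ and SELL MatMult results %s\n", equal ? "match" : "DIFFER");CHKERRQ(ierr);
  ierr = MatDestroy(&S);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}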


--Richard

On Mon, Jun 25, 2018 at 2:04 PM, Zhang, Hong 
mailto:hongzh...@anl.gov>> wrote:
Hi Richard,

MatConvert_Basic() does not work for most cases when converting AIJ to SELL 
because SELL may require padding and must be preallocated based on the number of 
nonzeros per row.
See the correct conversion in 
https://bitbucket.org/petsc/petsc/src/master/src/mat/impls/sell/seq/sell.c?fileviewer=file-view-default%20#sell.c-251

You were probably misled by the code

if (A->rmap->bs > 1) {
ierr = MatConvert_Basic(A,newtype,reuse,newmat);CHKERRQ(ierr);
PetscFunctionReturn(0);
}
which should be removed.

Thanks,
Hong (Mr.)


On Jun 25, 2018, at 3:06 PM, Richard Tran Mills 
mailto:rtmi...@anl.gov>> wrote:

Hi Everyone (especially Herr Doktor Hong Zhang),

It sure took me a while to get around to it, but I coded up (in branch 
rmills/feature-add-mataijsell) the skeleton of what I'm currently calling the 
'AIJSELL' matrix type, which is a "meta" matrix type that inherits from AIJ but 
maintains a "shadow" MATSEQSELL copy of the matrix for use when that is 
appropriate. This works similarly to how AIJMKL works, insofar as the 
PetscObjectState is tracked and used to determine when the SELL "shadow" copy 
needs to be updated before an operation like MatMult() is applied. (Note that 
what I have so far only does MatMult, but it should be easy to add other 
operations once I've verified that my implementation is working.) I decided on 
the approach of using a MATSEQSELL instance inside an AIJ-derived matrix type 
because I still want to be able to support standalone SELL matrices when there 
really isn't a need for any of the AIJ-only operations.

My AIJSELL implementation seems to work fine in many cases, but I've spent 
several hours chasing some bugs that crop up in certain cases. I now think that 
the bugs are not a problem with my AIJSELL implementation (though I'd welcome 
another pair of eyes to help me verify this), but are lurking somewhere in the 
SELL code and haven't been hit before because AIJSELL makes it easier to 
exercise the SELL class on more problems.

I've been using the well-loved SNES ex19 example (that is used for 'make 
check') for some of my debugging. I've found that everything works great when I 
size the problem grid such that the resulting matrices have a row count that is 
evenly divisible by 8 (when using double precision scalars). So

./ex19 -ksp_type gmres -pc_type none -snes_monitor -ksp_monitor -da_grid_x 4 
-da_grid_y 5 -mat_seqaij_type seqaijsell -snes_max_it 1 -ksp_monitor

which results in an 80x80 Jacobian (since there are 4 degrees of freedom per 
grid point) runs fine, but

./ex19 -ksp_type gmres -pc_type none -snes_monitor -ksp_monitor -da_grid_x 3 
-da_grid_y 5 -mat_seqaij_type seqaijsell -snes_max_it 1 -ksp_monitor

results in either a segfault or the GMRES(30) solve terminating at 10,000 
iterations because the residual norm just keeps bouncing up and down.

Sticking some MatView() calls in, it appears that the MatMult() results inside 
the KSP solve are incorrect because the SELL copy of the matrix generated by my 
MatConvert() call is incorrect when the row count isn't evenly divisible by 
eight. In the case of SNES ex19, MatConvert_Basic() is currently used because 
the block size is greater than unity. So I think something is going wrong 
either in the MatSetValues() or (my guess) MatAssemblyBegin()/End().

Poking around with a debugger, I see that the column index array ("colidx") 
that is part of the SEQSELLHEADER ends up with nonsense values, and the 
segfaults I see tend to occur when indexing into the x vector with this.

I'm not familiar enough with the internals of SELL to look further into 
debugging this without a lot of effort. Hong, can you help me look at this?

Best regards,
Richard

On Tue, Mar 6, 2018 at 3:59 AM, Karl Rupp 
mailto:r...@iue.

Re: [petsc-dev] plans for preconditioners for SeqSELL

2018-06-25 Thread Zhang, Hong
Hi Richard,

MatConvert_Basic() does not work for most cases when converting AIJ to SELL 
because SELL may require padding and must be preallocated based on the number of 
nonzeros per row.
See the correct conversion in 
https://bitbucket.org/petsc/petsc/src/master/src/mat/impls/sell/seq/sell.c?fileviewer=file-view-default%20#sell.c-251

You were probably misled by the code

if (A->rmap->bs > 1) {
ierr = MatConvert_Basic(A,newtype,reuse,newmat);CHKERRQ(ierr);
PetscFunctionReturn(0);
}
which should be removed.

Thanks,
Hong (Mr.)


On Jun 25, 2018, at 3:06 PM, Richard Tran Mills 
mailto:rtmi...@anl.gov>> wrote:

Hi Everyone (especially Herr Doktor Hong Zhang),

It sure took me a while to get around to it, but I coded up (in branch 
rmills/feature-add-mataijsell) the skeleton of what I'm currently calling the 
'AIJSELL' matrix type, which is a "meta" matrix type that inherits from AIJ but 
maintains a "shadow" MATSEQSELL copy of the matrix for use when that is 
appropriate. This works similarly to how AIJMKL works, insofar as the 
PetscObjectState is tracked and used to determine when the SELL "shadow" copy 
needs to be updated before an operation like MatMult() is applied. (Note that 
what I have so far only does MatMult, but it should be easy to add other 
operations once I've verified that my implementation is working.) I decided on 
the approach of using a MATSEQSELL instance inside an AIJ-derived matrix type 
because I still want to be able to support standalone SELL matrices when there 
really isn't a need for any of the AIJ-only operations.

My AIJSELL implementation seems to work fine in many cases, but I've spent 
several hours chasing some bugs that crop up in certain cases. I now think that 
the bugs are not a problem with my AIJSELL implementation (though I'd welcome 
another pair of eyes to help me verify this), but are lurking somewhere in the 
SELL code and haven't been hit before because AIJSELL makes it easier to 
exercise the SELL class on more problems.

I've been using the well-loved SNES ex19 example (that is used for 'make 
check') for some of my debugging. I've found that everything works great when I 
size the problem grid such that the resulting matrices have a row count that is 
evenly divisible by 8 (when using double precision scalars). So

./ex19 -ksp_type gmres -pc_type none -snes_monitor -ksp_monitor -da_grid_x 4 
-da_grid_y 5 -mat_seqaij_type seqaijsell -snes_max_it 1 -ksp_monitor

which results in an 80x80 Jacobian (since there are 4 degrees of freedom per 
grid point) runs fine, but

./ex19 -ksp_type gmres -pc_type none -snes_monitor -ksp_monitor -da_grid_x 3 
-da_grid_y 5 -mat_seqaij_type seqaijsell -snes_max_it 1 -ksp_monitor

results in either a segfault or the GMRES(30) solve terminating at 10,000 
iterations because the residual norm just keeps bouncing up and down.

Sticking some MatView() calls in, it appears that the MatMult() results inside 
the KSP solve are incorrect because the SELL copy of the matrix generated by my 
MatConvert() call is incorrect when the row count isn't evenly divisible by 
eight. In the case of SNES ex19, MatConvert_Basic() is currently used because 
the block size is greater than unity. So I think something is going wrong 
either in the MatSetValues() or (my guess) MatAssemblyBegin()/End().

Poking around with a debugger, I see that the column index array ("colidx") 
that is part of the SEQSELLHEADER ends up with nonsense values, and the 
segfaults I see tend to occur when indexing into the x vector with this.

I'm not familiar enough with the internals of SELL to look further into 
debugging this without a lot of effort. Hong, can you help me look at this?

Best regards,
Richard

On Tue, Mar 6, 2018 at 3:59 AM, Karl Rupp 
mailto:r...@iue.tuwien.ac.at>> wrote:

> Karl, are you thinking of a matrix subclass that has everything that an
> AIJ matrix does, but also keeps a SELL copy around for operations like
> MatMult()? Would it make sense to just keep another Mat inside (like how
> MPIAIJ keeps multiple Mat instances) that *is* of type MATSELL, that
> gets built/updated as needed? Would this involve carrying too much
> baggage around, associated with a complete Mat instance?

What I have in mind is to put the SELL datastructures into a A->spptr,
very much like you did for AIJMKL.


> I like the idea
> of having a MATSELL type available that is lean (insofar as not having
> storing an AIJ matrix) for those cases when a user really knows that the
> AIJ stuff will not be needed. But maybe it makes sense to be able to use
> that inside another matrix class. Perhaps we could have something,
> called, say, MATAIJMUTABLE that uses AIJ but might also create copies in
> SELL (or other formats, potentially) when appropriate -- perhaps based
> on a some performance model indicating which format is fastest for
> MatMult() or whatever.

The actual overhead of also storing a SELL datastructure in terms of
memory footprint is at most 2x. When you keep in mind that ext

Re: [petsc-dev] running test harness under batch system

2018-06-25 Thread Zhang, Hong
I recall that we had a discussion on this before. It would be great if we could 
separate the 'compile' stage and the 'run' stage so that we can compile the 
test on the login node and then run it on a compute node. Is this feature already in 
the test harness system, or is it something we are going to add?

Hong (Mr.)

> On Jun 25, 2018, at 12:43 PM, Balay, Satish  wrote:
> 
> yes - thats how I did the spack/xsdk build on theta.
> 
> However - it took a very long time - and this interfers with the
> queues and their time limits.
> 
> I was told [later] that most likely the 'make -j200' job was being
> scheduled on a single core of the Theta node - and I needed to change
> some setting to change this behavior. [something I need to figure out]
> 
> Satish
> 
> On Mon, 25 Jun 2018, Smith, Barry F. wrote:
> 
>> 
>>This assumes that the compilers are all available and working on the 
>> compute nodes, correct?
>> 
>> Thanks
>> 
>>      Barry
>> 
>>   Do the compilers work on the compute nodes of theta?
>> 
>> 
>> 
>>> On Jun 25, 2018, at 12:03 PM, Zhang, Hong  wrote:
>>> 
>>> Yes, it is possible. I have run the test harness on cori submitting the 
>>> following script
>>> 
>>> #!/bin/bash -l
>>> 
>>> #SBATCH -N 1  #Use 1 nodes
>>> #SBATCH -t 02:00:00   #Set time limit
>>> #SBATCH -p regular  #Submit to the regular 'partition'
>>> #SBATCH -C knl,quad,cache  #Use KNL nodes
>>> 
>>> make PETSC_ARCH=arch-cori-avx512-opt MPIEXEC='srun -c 4 --cpu_bind=cores' 
>>> -f gmakefile test
>>> 
>>> The most important thing is probably setting the MPIEXEC according to the 
>>> system. But note that there are often limitations on number of nodes on 
>>> large machines. For example, Theta requires a minimum of 128 nodes.
>>> 
>>> Hong (Mr.) 
>>> 
>>> 
>>>> On Jun 22, 2018, at 11:41 AM, Smith, Barry F.  wrote:
>>>> 
>>>> 
>>>> Has anyone run the entire test harness under a batch system? Is this 
>>>> possible, does it require specific commands that should be documented in 
>>>> the users manual?
>>>> 
>>>>  Thanks
>>>> 
>>>> Barry
>>>> 
>>> 
>> 
> 



Re: [petsc-dev] running test harness under batch system

2018-06-25 Thread Zhang, Hong
Yes, it is possible. I have run the test harness on Cori by submitting the 
following script

#!/bin/bash -l

#SBATCH -N 1  #Use 1 nodes
#SBATCH -t 02:00:00   #Set time limit
#SBATCH -p regular  #Submit to the regular 'partition'
#SBATCH -C knl,quad,cache  #Use KNL nodes

make PETSC_ARCH=arch-cori-avx512-opt MPIEXEC='srun -c 4 --cpu_bind=cores' -f 
gmakefile test

The most important thing is probably setting the MPIEXEC according to the 
system. But note that there are often limitations on number of nodes on large 
machines. For example, Theta requires a minimum of 128 nodes.

Hong (Mr.) 


> On Jun 22, 2018, at 11:41 AM, Smith, Barry F.  wrote:
> 
> 
>   Has anyone run the entire test harness under a batch system? Is this 
> possible, does it require specific commands that should be documented in the 
> users manual?
> 
>Thanks
> 
>   Barry
> 



Re: [petsc-dev] Proposed changes to TS API

2018-05-11 Thread Zhang, Hong


On May 11, 2018, at 1:01 PM, Lisandro Dalcin 
mailto:dalc...@gmail.com>> wrote:

On Fri, 11 May 2018 at 19:34, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

"Smith, Barry F." mailto:bsm...@mcs.anl.gov>> writes:


I assemble the combined system directly.

 How, two sets of calls to MatSetValues(), One for the scaled mass
matrix and one for the Jacobian entries? For a constant mass matrix does
this mean you are recomputing the mass matrix entries each call? Or are you
storing the mass matrix entries somewhere?  Or is your mass matrix diagonal
only?

 Or do you build element by element the M*shift + J element stiffness
and then insert it with a single MatSetValues() call?

It isn't even built separately at the element scale, just summed
contributions at quadrature points.

That's exactly the way I do it in PetIGA, and the way I would do it in any
other general-purpose FEM-like code. In high-order FEM, Jacobian assembly
may very well account from 50% to 80% of runtime (at least that's my
experience with PetIGA). IMHO, forcing users to have to do TWO global
matrix assemblies per time step is simply unacceptable, both in terms of
memory and runtime.

We are not forcing users to do two matrix assemblies per time step. For most 
cases, there is even no need to update dF/dUdot at all. For extreme cases where 
the application requires frequent updates of dF/dUdot and assembly is expensive, 
one can always assemble the single-matrix Jacobian and throw it to SNES 
directly.


TS used to be an unusable pile of crap until Jed proposed the marvelous
IJacobian interface. Suddenly COMPLEX fully-implicit DAE problems become
SIMPLE to express, and a single IFunction/IJacobian pair reusable for
different timestepper implementations a reality. And we got an added
bonus: this was efficient, it involved a SINGLE global matrix assembly. If
the issue is in supporting simpler problems, then we should go for an
alternative interface with broken Jacobians, just as Stefano propossed in a
previous email. I'm totally in favor of an additional broken Jacobians API,
and totally againt the removal of the single-matrix IJacobian interface.

The current IJacobian is essentially SNESJacobian. And the single-matrix 
SNESJacobian interface is always there. Power users could set up the 
SNESJacobian directly if we pass a read-only shift parameter to them. So we are 
by no means prohibiting power users from doing what they want.

IJacobian with shift mixes TS Jacobian and SNES Jacobian. This is the issue we 
need to fix.

Thanks,
Hong
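
To make the point concrete, here is a sketch of what the current interface asks of a user for the common case F(t,U,Udot) = M*Udot - G(t,U) with a constant mass matrix (AppCtx, FormDGDU and the use of MatAXPY are placeholders of mine, not code from anyone's application); the 'a' argument is the shift being debated:

#include <petscts.h>

typedef struct {
  Mat M;   /* constant mass matrix, assembled once at setup */
} AppCtx;

/* placeholder: fills its Mat argument with dG/dU and assembles it */
extern PetscErrorCode FormDGDU(TS, PetscReal, Vec, Mat, void *);

/* current-style IJacobian: the user must assemble
   J = a*dF/dUdot + dF/dU = a*M - dG/dU, with 'a' supplied by TS */
static PetscErrorCode IJacobian(TS ts, PetscReal t, Vec U, Vec Udot, PetscReal a, Mat J, Mat P, void *ctx)
{
  AppCtx        *user = (AppCtx*)ctx;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = FormDGDU(ts, t, U, P, ctx);CHKERRQ(ierr);                         /* P  = dG/dU  */
  ierr = MatScale(P, -1.0);CHKERRQ(ierr);                                  /* P  = -dG/dU */
  ierr = MatAXPY(P, a, user->M, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);  /* P += a*M    */
  if (J != P) {  /* e.g. a matrix-free J still needs its assembly calls */
    ierr = MatAssemblyBegin(J, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(J, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

The MatAXPY against a full copy of M is exactly the kind of extra assembly traffic under discussion; with a separate dF/dUdot callback, TS could perform or skip that combination itself.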

I don't buy the argument that IJacobian with shift is ugly, and that such
API drives users away from TS. At best, this is a documentation problem.
Come on, this is just the chain rule, which should be kindergarten-level stuff for
PETSc users. If we simplify things for the sake of making things simple for
newcomers and beginners and make them annoyingly slow for power users
solving complex problems, we will do very bad business. This is not
politically correct, but I'm much more worried about losing power users, you
know, those that can eventually make meaningful contributions to science
and software projects. Beginners and newcomers eventually learn and benefit
for common-sense software design, and will eventually appreciate it. I
really hope populism to not win this battle :-)
--
Lisandro Dalcin

Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 0109
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459



Re: [petsc-dev] Proposed changes to TS API

2018-05-11 Thread Zhang, Hong
Before we go down the rabbit hole, let me reiterate the primary point: an 
unfriendly API breaks the deal in the first place. Perhaps we should reflect on 
why so many other software packages use PETSc just as a nonlinear/linear solver and 
implement their own time steppers instead of using TS. FWIW I think the weird 
IJacobian with shift is not user-friendly.

On May 11, 2018, at 7:20 AM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

"Smith, Barry F." mailto:bsm...@mcs.anl.gov>> writes:

On May 10, 2018, at 4:12 PM, Jed Brown 
mailto:j...@jedbrown.org>> wrote:

"Zhang, Hong" mailto:hongzh...@anl.gov>> writes:

Dear PETSc folks,

Current TS APIs (IFunction/IJacobian+RHSFunction/RHSJacobian) were designed for 
the fully implicit formulation F(t,U,Udot) = G(t,U).
Shampine's paper 
(https://www.sciencedirect.com/science/article/pii/S0377042706004110?via%3Dihub<https://www.sciencedirect.com/science/article/pii/S0377042706004110?via=ihub>)
 explains some reasoning behind it.

Our formulation is general enough to cover all the following common cases

*   Udot = G(t,U) (classic ODE)
*   M Udot = G(t,U)  (ODEs/DAEs for mechanical and electronic systems)
*   M(t,U) Udot = G(t,U) (PDEs)

Yet the TS APIs provide the capability to solve both simple problems and 
complicated problems. However, we are not doing well at making TS easy to use and 
efficient, especially for simple problems. Over the years, we have clearly seen 
the main drawbacks, including:
1. The shift parameter exposed in IJacobian is terribly confusing, especially 
to new users. Also it is not conceptually straightforward when using AD or 
finite differences on IFunction to approximate IJacobian.

What isn't straightforward about AD or FD on the IFunction?  That one
bit of chain rule?
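
For readers joining the thread, the relation in question: within an implicit stage the method makes Udot an affine function of U, so the matrix handed to SNES is, by the chain rule,

  J_{\mathrm{SNES}} = \sigma\,\frac{\partial F}{\partial \dot U} + \frac{\partial F}{\partial U}, \qquad \dot U \approx \sigma U + w,

with, e.g., \sigma = 1/\Delta t for backward Euler; \sigma is the shift parameter passed to IJacobian, and AD or FD on IFunction gives the two partial derivatives that get combined with it.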

2. It is difficult to switch from fully implicit to fully explicit. Users 
cannot use explicit methods when they provide IFunction/IJacobian.

This is a real issue, but it's extremely common for PDEs to have boundary
conditions enforced as algebraic constraints, thus yielding a DAE.

3. The structure of mass matrix is completely invisible to TS. This means 
giving up all the opportunities to improve efficiency. For example, when M is 
constant or weakly dependent on U, we might not want to evaluate/update it 
every time IJacobian is called. If M is diagonal, the Jacobian can be shifted 
more efficiently than just using MatAXPY().

I don't understand

4. Reshifting the Jacobian is unnecessarily expensive and sometimes buggy.

Why is "reshifting" needed?  After a step is rejected and when the step
size changes for a linear constant operator?

Consider the scenario below.
shift = a;
TSComputeIJacobian()
shift = b;
TSComputeIJacobian() // with the same U and Udot as last call
Changing the shift parameter requires the Jacobian function to be evaluated 
again. If users provide only RHSJacobian, the Jacobian will not be 
updated/reshifted in the second call because TSComputeRHSJacobian() finds out 
that U has not been changed. This issue is fixable by adding more logic into 
the already convoluted implementation of TSComputeIJacobian(), but the 
intention here is to illustrate the cumbersomeness of current IJacobian and the 
growing complications in TSComputeIJacobian() that IJacobian causes.

So I propose that we have two separate matrices dF/dUdot and dF/dU, and remove 
the shift parameter from IJacobian. dF/dU will be calculated by IJacobian; 
dF/dUdot will be calculated by a new callback function and default to an 
identity matrix if it is not provided by users. Then the users do not need to 
assemble the shifted Jacobian since TS will handle the shifting whenever 
needed. And knowing the structure of dF/dUdot (the mass matrix), TS will become 
more flexible. In particular, we will have

*   easy access to the unshifted Jacobian dF/dU (this simplifies the adjoint 
implementation a lot),

How does this simplify the adjoint?

*   plenty of opportunities to optimize TS when the mass matrix is diagonal or 
constant or weakly dependent on U (which accounts for almost all use cases in 
practice),

But longer critical path,

  What do you mean by longer critical path?

Create Mass (dF/dUdot) matrix, call MatAssembly, create dF/dU, call
MatAssembly, call MatAXPY (involves another MatAssembly unless
SAME_NONZERO_PATTERN).  That's a long sequence for what could be one
MatAssembly.  Also note that geometric setup for elements is usually
done in each element loop.  For simple physics, this is way more
expensive than the physics (certainly the case for LibMesh and Deal.II).

So the benefit of asking users to provide the shifted Jacobian is that they 
could use fewer MatAssembly calls inside IJacobian. But what I proposed aims for the 
benefit of reducing the number of Jacobian or Mass matrix evaluations.
Consider how we handle the following simple cases in the new approach:
1. Udot = G(t,U) -- users do not need to provide dF/d
