[petsc-dev] How to add root values to leaves in PetscSF?

2019-01-22 Thread Zhang, Junchao via petsc-dev
I want to add root values to leaves while keeping the roots unchanged. PetscSFBcast came 
to my mind, but unfortunately it only broadcasts roots and does not have an 
MPI_Op argument like PetscSFReduce does, so I cannot choose between INSERT_VALUES, 
ADD_VALUES, etc.
Any tips? Thanks.

PS: I ran into this problem when I tried to implement VecScatter on top of SF.  In 
VecScatter, multiple entries of vec x can be scattered to the same entry of vec 
y, and one entry of x can also be scattered to multiple entries of y.  I want 
to use one SF for all combinations of SCATTER_FORWARD/REVERSE and 
INSERT_VALUES/ADD_VALUES. It seems impossible with the current SF interface.
--Junchao Zhang


Re: [petsc-dev] How to add root values to leaves in PetscSF?

2019-01-22 Thread Zhang, Junchao via petsc-dev


On Tue, Jan 22, 2019 at 1:35 PM Matthew Knepley <knep...@gmail.com> wrote:
On Tue, Jan 22, 2019 at 2:23 PM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
I want to add root values to leaves, and keep root unchanged. PetscSFBcast came 
to my mind, but unfortunately it only broadcasts roots and does not have an 
MPI_Op argument like PetscSFReduce for me to choose from INSERT_VALUES, 
ADD_VALUES, etc.
Any tips? Thanks.

PS: I met this problem when I tried to implement VecScatter in SF.  In 
VecScatter, multiple entries of vec x can be scattered to the same entry of vec 
y, and one entry of x can also be scattered to multiple entries of y.  I want 
to use one SF for all combinations of SCATTER_FORWARD/BACKWARD, 
INSERT/ADD_VALUES. It seems impossible with current SF interface.

I think you might want this: 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PetscSF/PetscSFFetchAndOpBegin.html

I read it as: Leaves are accumulated to root (root is changed), and leaves get 
a snapshot of the root before each atomic update. It is not what I want.
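For reference, here is roughly how I understand the fetch-and-op call would be
used (a sketch only; the declarations of sf, rootdata, leafdata, leafupdate and
ierr are assumed, and MPI_SUM is just an example op):

  /* Roots are updated with the leaf contributions, and each leaf receives the
     root value as it was just before that leaf's contribution was applied.
     The roots change, which is why this does not fit my bcast-and-add use case. */
  ierr = PetscSFFetchAndOpBegin(sf,MPIU_SCALAR,rootdata,leafdata,leafupdate,MPI_SUM);CHKERRQ(ierr);
  ierr = PetscSFFetchAndOpEnd(sf,MPIU_SCALAR,rootdata,leafdata,leafupdate,MPI_SUM);CHKERRQ(ierr);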

  Thanks,

Matt

--Junchao Zhang


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] How to add root values to leaves in PetscSF?

2019-01-22 Thread Zhang, Junchao via petsc-dev


On Tue, Jan 22, 2019 at 4:08 PM Jed Brown <j...@jedbrown.org> wrote:
It is not supported at this time.  What does your use case look like?
Do roots have degree greater than 1?
Yes. Imagine a vecscatter x[0]->y[0], x[1]->y[0], x[1]->y[1], x[2]->y[1]. I 
build an SF for SCATTER_FORWARD. Now I want to do SCATTER_REVERSE with 
ADD_VALUES.
To solve this problem without creating another SF and without breaking the current 
SF API, I propose to add PetscSFBcastAndOp(sf, unit, rootdata, leafdata, op).
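To make the proposal concrete, a rough sketch of how it might be used (the
Begin/End split and the exact argument order are my assumptions, mirroring the
existing PetscSFBcastBegin/End and PetscSFReduceBegin/End pairs; the variables
are assumed to be declared elsewhere):

  /* Proposed semantics: for each leaf, leafdata = leafdata op rootdata[its root],
     with rootdata left unchanged.  Not an existing PetscSF routine at the time
     of writing; the names are part of the proposal. */
  ierr = PetscSFBcastAndOpBegin(sf,MPIU_SCALAR,rootdata,leafdata,MPI_SUM);CHKERRQ(ierr);
  /* ... local work could overlap the communication here ... */
  ierr = PetscSFBcastAndOpEnd(sf,MPIU_SCALAR,rootdata,leafdata,MPI_SUM);CHKERRQ(ierr);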


"Zhang, Junchao via petsc-dev" 
mailto:petsc-dev@mcs.anl.gov>> writes:

> On Tue, Jan 22, 2019 at 1:35 PM Matthew Knepley 
> mailto:knep...@gmail.com><mailto:knep...@gmail.com<mailto:knep...@gmail.com>>>
>  wrote:
> On Tue, Jan 22, 2019 at 2:23 PM Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov><mailto:petsc-dev@mcs.anl.gov<mailto:petsc-dev@mcs.anl.gov>>>
>  wrote:
> I want to add root values to leaves, and keep root unchanged. PetscSFBcast 
> came to my mind, but unfortunately it only broadcasts roots and does not have 
> an MPI_Op argument like PetscSFReduce for me to choose from INSERT_VALUES, 
> ADD_VALUES, etc.
> Any tips? Thanks.
>
> PS: I met this problem when I tried to implement VecScatter in SF.  In 
> VecScatter, multiple entries of vec x can be scattered to the same entry of 
> vec y, and one entry of x can also be scattered to multiple entries of y.  I 
> want to use one SF for all combinations of SCATTER_FORWARD/BACKWARD, 
> INSERT/ADD_VALUES. It seems impossible with current SF interface.
>
> I think you might want this: 
> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PetscSF/PetscSFFetchAndOpBegin.html
>
> I read it as: Leaves are accumulated to root (root is changed), and leaves 
> get a snapshot of the root before each atomic update. It is not what I want.
>
>   Thanks,
>
> Matt
>
> --Junchao Zhang
>
>
> --
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] How to add root values to leaves in PetscSF?

2019-01-22 Thread Zhang, Junchao via petsc-dev


On Tue, Jan 22, 2019 at 5:07 PM Jed Brown <j...@jedbrown.org> wrote:
"Zhang, Junchao" <jczh...@mcs.anl.gov> writes:

> On Tue, Jan 22, 2019 at 4:08 PM Jed Brown 
> mailto:j...@jedbrown.org>>>
>  wrote:
>> It is not supported at this time.  What does your use case look like?
>> Do roots have degree greater than 1?
>
> Yes. Imagine a vecscatter x[0]->y[0], x[1]->y[0], x[1]->y[1],x[2]->y[1]. I 
> build an SF for SCATTER_FORWARD. Now I wan to do SCATTER_REVERSE with 
> ADD_VALUES.
> To solve this problem without creating another SF and without breaking 
> current SF API, I propose to add PetscSFBcastAndOp(sf, unit, rootdata, 
> leafdata, op)

Is this pattern needed for some algorithm or application or is this
about implementing the full VecScatter interface (modulo ill-defined
operations) in terms of SF?  If the former, I'd like to understand it.
If the latter, I'm okay with you extending the interface as proposed.
It is the latter.



[petsc-dev] No VecRestoreArrayRead_Nest()?

2019-02-11 Thread Zhang, Junchao via petsc-dev
I did not see a VecRestoreArrayRead_Nest. On VecNest, VecRestoreArrayRead 
defaults to VecRestoreArray(x,a), which copies a[] back to x and is expensive.  Is 
this an oversight?

--Junchao Zhang


[petsc-dev] Fwd: Arm DDT feature questions

2019-03-05 Thread Zhang, Junchao via petsc-dev
In an Arm training class, I requested a DDT feature from John Linford of Arm, 
which now owns Allinea DDT.  Basically, I want DDT to correctly display 
PETSc variable-length arrays and void* pointers.  From John's feedback, it 
looks like DDT could support this.
Does anyone already have custom pretty-printers in gdb for PETSc?  If not, I 
think we should make one and perhaps have Arm ship it with its product.

--Junchao Zhang

-- Forwarded message -
From: John Linford <john.linf...@arm.com>
Date: Tue, Mar 5, 2019 at 1:16 PM
Subject: Re: Arm DDT feature questions
To: Zhang, Junchao <jczh...@mcs.anl.gov>



Hi Junchao,


The Forge developers came back with some answers.  Please see below.  Thanks,


Sorry for the delay. Please find answers to your questions below:



Does Arm DDT support user-defined array length and user-defined pointer casting 
from a custom file, which sets the rules and is provided by users?



Yes, DDT does support this through the use of GDB pretty printers. The best 
example is the support of STL vectors in DDT. There is also a simple example in 
/path/to/forge/example/fruit* and explanations in the UG.

I have been trying to fiddle with pretty printers and PETSC (this could make a 
nice blog article) but I only managed to make it work in GDB, not in DDT for 
some reason. I’ll let you know if I manage to get anything interesting.

We could possibly make a feature request to support PETSC datatypes. If PETSC 
developers maintain GDB pretty printers between versions, I think this is 
something we could look at.







John C. Linford | Principal Applications Engineer | Development Solutions
john.linf...@arm.com | 
LinkedIn

Direct/Mobile: +1-737-218-3529
Arm HPC Ecosystem | 
www.arm.com







From: John Linford
Sent: Tuesday, February 19, 2019 12:31:56 PM
To: Zhang, Junchao
Subject: Re: Arm DDT feature questions


Hi Junchao,


Thanks for the follow-up and glad you could attend the workshop.  I've sent 
this over to the tools group and should have an answer for you soon.  Thanks,




John C. Linford | Principal Applications Engineer | Development Solutions
john.linf...@arm.com | 
LinkedIn

Direct/Mobile: +1-737-218-3529
Arm HPC Ecosystem | 
www.arm.com







From: Zhang, Junchao <jczh...@mcs.anl.gov>
Sent: Friday, February 15, 2019 5:28:59 PM
To: John Linford
Subject: Arm DDT feature questions

Hi, John,
  Does Arm DDT support user-defined array length and user-defined pointer 
casting from a custom file, which sets the rules and is provided by users?  
Suppose I have a struct
typedef struct {
  int type;
  int len;
  double *p;
  void *data;
} Vec;

Currently, to display a Vec variable, I have to manually tell DDT that the length of 
array p[] is given by len, and that the type behind void *data is Vec_MPI* if 
type==1, Vec_Seq* if type==2, and so on. I have to do this manual work every time 
I display a variable of type Vec, which is not convenient.
Structures with void pointers are used heavily by PETSc, which is a numerical 
library widely used in HPC. It would be nice if DDT supported displaying PETSc 
objects natively.
Thank you
--Junchao Zhang


[petsc-dev] Unwanted Fortran stub

2019-03-12 Thread Zhang, Junchao via petsc-dev
I declared PETSC_INTERN PetscErrorCode VecLockWriteSet_Private(Vec,PetscBool) 
in vecimpl.h and defined it in src/vec/vec/interface/rvector.c.  I used 
PETSC_INTERN and _Private since currently the function is only used in the Vec 
package and is not public.  I got this compilation warning:
src/vec/vec/interface/ftn-auto/rvectorf.c:333:1: warning: implicit declaration 
of function ‘VecLockWriteSet_Private’ [-Wimplicit-function-declaration]
 *__ierr = VecLockWriteSet_Private(
 ^
How do I fix that? I do not think I need a Fortran stub for it. I noticed that under 
ftn-auto, not all .c functions have a Fortran counterpart. What is the rule?
Thanks.
--Junchao Zhang


Re: [petsc-dev] Unwanted Fortran stub

2019-03-12 Thread Zhang, Junchao via petsc-dev
Got it.  Thanks.
--Junchao Zhang


On Tue, Mar 12, 2019 at 2:01 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

  It is because you have manual page documentation for it. The comment above 
it starts with /*@. If you don't want a manual page, remove the @; if you 
want a manual page but no stub, use /*@C.
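In other words, the marker on the comment above the function controls what gets
generated. A sketch of the three cases (SomeFunction is a placeholder name; see
the developer documentation for the authoritative rules):

  /* plain comment: no manual page and no automatically generated Fortran stub */

  /*@
     SomeFunction - gets a manual page AND an automatically generated Fortran stub
  @*/

  /*@C
     SomeFunction - gets a manual page but NO automatic Fortran stub
     (a custom Fortran binding would have to be written by hand if needed)
  @*/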

  Barry


> On Mar 12, 2019, at 1:46 PM, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> I declared PETSC_INTERN PetscErrorCode VecLockWriteSet_Private(Vec,PetscBool) 
> in vecimp.h and defined it in src/vec/vec/interface/rvector.c.  I used 
> PETSC_INTERN and _Private since currently the function is only used in the 
> Vec packaged and is not public.  I met this compilation warning
> src/vec/vec/interface/ftn-auto/rvectorf.c:333:1: warning: implicit 
> declaration of function ‘VecLockWriteSet_Private’ 
> [-Wimplicit-function-declaration]
>  *__ierr = VecLockWriteSet_Private(
>  ^
> How to fix that? I do not think I need a Fortran stub for that. I found under 
> ftn-auto, not all .c functions have a Fotran counterpart. What is the rule?
> Thanks.
> --Junchao Zhang



[petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev
I met some errors with cuda + mumps. It was tested with
make -f gmakefile test search='snes_tutorials-ex69_q2p1fetidp_deluxe 
snes_tutorials-ex62_fetidp_2d_quad 
snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive ksp_ksp_tutorials-ex52f_mumps'
I can reproduce it with petsc master.  The first line of petsc nightly 
(http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
shows another error. But I guess they have the same root: PETSc gives random 
wrong results in some cases. For example, I ran ksp_ksp_tutorials-ex52f_mumps 
twice and saw

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.5554E-06 iterations 1

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.6356E-06 iterations 1

The correct output has "Norm of error < 1.e-12,iterations 1".  Currently, I 
do not know the reason.

--Junchao Zhang


Re: [petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev
I do not know the reason.
--Junchao Zhang


On Wed, Mar 13, 2019 at 11:15 AM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
I met some errors with cuda + mumps. It was tested with
make -f gmakefile test search='snes_tutorials-ex69_q2p1fetidp_deluxe 
snes_tutorials-ex62_fetidp_2d_quad 
snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive ksp_ksp_tutorials-ex52f_mumps'
I can reproduce it with petsc master.  The first line of petsc nightly 
(http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
shows another error. But I guess they have the same root: PETSc gives random 
wrong results in some cases. For example, I ran ksp_ksp_tutorials-ex52f_mumps 
twice and saw

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.5554E-06 iterations 1

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.6356E-06 iterations 1

The correct output has "Norm of error < 1.e-12,iterations 1".  Currently, I 
do not know the reason.

--Junchao Zhang


Re: [petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev


On Wed, Mar 13, 2019 at 11:28 AM Matthew Knepley <knep...@gmail.com> wrote:
On Wed, Mar 13, 2019 at 12:16 PM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
I met some errors with cuda + mumps.

It does not look like CUDA is being used in these runs. Is it? If not, do you 
mean MUMPS goes crazy if we even compile something with CUDA?
No, the tests do not use the GPU.  I need to investigate further.


  Thanks,

 Matt

It was tested with
make -f gmakefile test search='snes_tutorials-ex69_q2p1fetidp_deluxe 
snes_tutorials-ex62_fetidp_2d_quad 
snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive ksp_ksp_tutorials-ex52f_mumps'
I can reproduce it with petsc master.  The first line of petsc nightly 
(http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
shows another error. But I guess they have the same root: PETSc gives random 
wrong results in some cases. For example, I ran ksp_ksp_tutorials-ex52f_mumps 
twice and saw

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.5554E-06 iterations 1

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.6356E-06 iterations 1

The correct output has "Norm of error < 1.e-12,iterations 1".  Currently, I 
do not know the reason.

--Junchao Zhang


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] MPI_UB is deprecated in MPI-2.0

2019-03-13 Thread Zhang, Junchao via petsc-dev




On Wed, Mar 13, 2019 at 12:48 PM Isaac, Tobin G via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

Pushed a fix that just uses MPI_Type_contiguous(MPI_BYTE, sizeof(),
...), which is not great but I'm only creating the type to work with
PetscSF, so it does the job.  Satish, do you want this as a pull
request, or can you just merge it into next
(`tisaac/feature-remove-mpi-ub`)?

If we support heterogeneous environments (e.g., big-endian to little-endian 
transfers), then we should use MPI_Type_create_resized(). MPI_BYTE is untyped, and 
MPI needs type info to do the conversion.
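For illustration, the usual replacement pattern looks roughly like this (a
sketch with made-up names such as MyStruct, nblocks, blockLens, blockOffsets
and blockTypes; not the actual pforest.c code):

  /* Build the struct type without an MPI_UB entry, then resize it so its extent
     equals sizeof(MyStruct) -- which is what the MPI_UB marker used to express. */
  MPI_Datatype tmptype,newtype;
  MPI_Type_create_struct(nblocks,blockLens,blockOffsets,blockTypes,&tmptype);
  MPI_Type_create_resized(tmptype,0,(MPI_Aint)sizeof(MyStruct),&newtype);
  MPI_Type_commit(&newtype);
  MPI_Type_free(&tmptype);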


Thanks,
  Toby

On Tue, Mar 12, 2019 at 10:21:42PM -0600, Jed Brown wrote:
> MPI_Type_create_resized (if needed).
>
> "Balay, Satish via petsc-dev" 
> mailto:petsc-dev@mcs.anl.gov>> writes:
>
> > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/01/make_master_arch-linux-pkgs-64idx_thrash.log
> > has the following [but for some reason - its filtered out from the warning 
> > count]
> >
> 
> > In file included from 
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/dmp4est.c:13:0:
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c: In 
> > function ‘DMPforestGetTransferSF_Point’:
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c:2518:7: 
> > warning: ‘ompi_mpi_ub’ is deprecated (declared at 
> > /sandbox/petsc/petsc.master-3/arch-linux-pkgs-64idx/include/mpi.h:928): 
> > MPI_UB is deprecated in MPI-2.0 [-Wdeprecated-declarations]
> >MPI_Datatype blockTypes[5] = 
> > {MPI_INT32_T,MPI_INT8_T,MPI_INT16_T,MPI_INT32_T,MPI_UB};
> > <<
> >
> > Any idea how to fix this?
> >
> > Thanks,
> > Satish


[petsc-dev] Fwd: MPI_UB is deprecated in MPI-2.0

2019-03-13 Thread Zhang, Junchao via petsc-dev
-- Forwarded message -
From: Gropp, William D <wgr...@illinois.edu>
Date: Wed, Mar 13, 2019 at 1:38 PM
Subject: Re: [petsc-dev] MPI_UB is deprecated in MPI-2.0
To: Zhang, Junchao <jczh...@mcs.anl.gov>


Type_create_resized is the preferred method of replacing use of MPI_UB.
Bill

On Mar 13, 2019 1:34 PM, "Zhang, Junchao via petsc-dev" <petsc-dev@mcs.anl.gov> wrote:




On Wed, Mar 13, 2019 at 12:48 PM Isaac, Tobin G via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

Pushed a fix that just uses MPI_Type_contiguous(MPI_BYTE, sizeof(),
...), which is not great but I'm only creating the type to work with
PetscSF, so it does the job.  Satish, do you want this as a pull
request, or can you just merge it into next
(`tisaac/feature-remove-mpi-ub`)?

If we support heterogeneous environments (e.g., big endian to little endian 
transfer), then we should use MPI_Type_create_resized(). MPI_BYTE is untyped. 
MPI needs type info to do conversion.


Thanks,
  Toby

On Tue, Mar 12, 2019 at 10:21:42PM -0600, Jed Brown wrote:
> MPI_Type_create_resized (if needed).
>
> "Balay, Satish via petsc-dev" 
> mailto:petsc-dev@mcs.anl.gov>> writes:
>
> > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/01/make_master_arch-linux-pkgs-64idx_thrash.log
> > has the following [but for some reason - its filtered out from the warning 
> > count]
> >
> >>>>>>>>
> > In file included from 
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/dmp4est.c:13:0:
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c: In 
> > function ‘DMPforestGetTransferSF_Point’:
> > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c:2518:7: 
> > warning: ‘ompi_mpi_ub’ is deprecated (declared at 
> > /sandbox/petsc/petsc.master-3/arch-linux-pkgs-64idx/include/mpi.h:928): 
> > MPI_UB is deprecated in MPI-2.0 [-Wdeprecated-declarations]
> >MPI_Datatype blockTypes[5] = 
> > {MPI_INT32_T,MPI_INT8_T,MPI_INT16_T,MPI_INT32_T,MPI_UB};
> > <<<<<<
> >
> > Any idea how to fix this?
> >
> > Thanks,
> > Satish


Re: [petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev
Satish,
  I found something strange. I configured with --with-cuda 
--with-precision=single, then with -log_view, I saw
Compiled with single precision PetscScalar and PetscReal
Compiled with full precision matrices (default)
  I found that confused mumps. When I added --with-precision=double, I got 
consistent MatScalar and PetscScalar precision, and the errors I reported 
disappeared. Do you know why?

--Junchao Zhang


On Wed, Mar 13, 2019 at 11:15 AM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
I met some errors with cuda + mumps. It was tested with
make -f gmakefile test search='snes_tutorials-ex69_q2p1fetidp_deluxe 
snes_tutorials-ex62_fetidp_2d_quad 
snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive ksp_ksp_tutorials-ex52f_mumps'
I can reproduce it with petsc master.  The first line of petsc nightly 
(http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
shows another error. But I guess they have the same root: PETSc gives random 
wrong results in some cases. For example, I ran ksp_ksp_tutorials-ex52f_mumps 
twice and saw

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.5554E-06 iterations 1

$ mpirun -n 3 ./ex52f
Mumps row pivot threshhold =1.00E-06
Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
Norm of error  1.6356E-06 iterations 1

The correct output has "Norm of error < 1.e-12,iterations 1".  Currently, I 
do not know the reason.

--Junchao Zhang


Re: [petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev



On Wed, Mar 13, 2019 at 3:49 PM Balay, Satish <ba...@mcs.anl.gov> wrote:
mailers that only format html and not the text form are annoying :(


On Wed, 13 Mar 2019, Zhang, Junchao wrote:

> Satish,   I found something strange. I configured with --with-cuda 
> --with-precision=single, then with -log_view, I saw
>   Compiled with single precision PetscScalar and PetscReal
> Compiled with full precision matrices (default)

This message is misleading [and should be fixed]. This feature was removed.

>
>   I found that confused mumps. When I added --with-precision=double, I got 
> consistent MatScalar and PetscScalar precision, and the errors I reported
> disappeared. Do you know why?

how is mumps confused? Perhaps this example doesn't work with single precision? 
[or mumps single precision doesn't work for this problem?]

 OK, I thought MUMPS was given a double-precision matrix and single-precision 
vectors.  I was wrong.  I tested on my laptop with --with-cuda 
--with-precision=single, and these errors did show up.  On the PETSc dashboard we do 
not have the "single + mumps" combination.  I will create a PR to mark these tests 
as requiring double precision.

Lots of examples fail with single precision..

Satish

>
> --Junchao Zhang
>
>
> On Wed, Mar 13, 2019 at 11:15 AM Junchao Zhang 
> mailto:jczh...@mcs.anl.gov>> wrote:
>   I met some errors with cuda + mumps. It was tested with
>   make -f gmakefile test search='snes_tutorials-ex69_q2p1fetidp_deluxe 
> snes_tutorials-ex62_fetidp_2d_quad
>   snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive 
> ksp_ksp_tutorials-ex52f_mumps'
>
> I can reproduce it with petsc master.  The first line of petsc nightly
> (http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
> shows another error. But I guess they have the same root: PETSc
> gives random wrong results in some cases. For example, I ran 
> ksp_ksp_tutorials-ex52f_mumps twice and saw
>
>   $ mpirun -n 3 ./ex52f
> Mumps row pivot threshhold =1.00E-06
> Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
> Norm of error  1.5554E-06 iterations 1
>
> $ mpirun -n 3 ./ex52f
> Mumps row pivot threshhold =1.00E-06
> Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
> Norm of error  1.6356E-06 iterations 1
>
>
> The correct output has "Norm of error < 1.e-12,iterations 1".  Currently, 
> I do know the reason.
>
> --Junchao Zhang
>
>
>


Re: [petsc-dev] errors with cuda + mumps

2019-03-13 Thread Zhang, Junchao via petsc-dev
search='snes_tutorials-ex69_q2p1fetidp_deluxe 
snes_tutorials-ex62_fetidp_2d_quad 
snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive ksp_ksp_tutorials-ex52f_mumps'

Already made a PR.

--Junchao Zhang


On Wed, Mar 13, 2019 at 4:55 PM Balay, Satish <ba...@mcs.anl.gov> wrote:
On Wed, 13 Mar 2019, Zhang, Junchao wrote:

>
>
>
> On Wed, Mar 13, 2019 at 3:49 PM Balay, Satish 
> mailto:ba...@mcs.anl.gov>> wrote:
>   mailers that only format html and not the text form are annoying :(
>
>
>   On Wed, 13 Mar 2019, Zhang, Junchao wrote:
>
>   > Satish,   I found something strange. I configured with --with-cuda 
> --with-precision=single, then
>   with -log_view, I saw
>   >   Compiled with single precision PetscScalar and PetscReal
>   > Compiled with full precision matrices (default)
>
>   This message is misleading [and should be fixed]. This feature was 
> removed.
>
>   >
>   >   I found that confused mumps. When I added --with-precision=double, 
> I got consistent MatScalar
>   and PetscScalar precision, and the errors I reported
>   > disappeared. Do you know why?
>
>   how is mumps confused? Perhaps this example doesn't work with single 
> precision? [or mumps single
>   precision doesn't work for this problem?]
>
>
>  OK, I thought mumps was inputed with a double precision matrix and single 
> precision vectors.  I was wrong.  I
> tested on my laptop --with-cuda --with-precision=single, these errors did 
> show up.  In petsc dashboard, we do
> not have "single + mumps" combination.  I will create a PR to mark these 
> tests require double.

Which examples are you referring to? Do they fail in master - or only in your 
feature branch?

Satish

>
>   Lots of examples fail with single precision..
>
>   Satish
>
>   >
>   > --Junchao Zhang
>   >
>   >
>   > On Wed, Mar 13, 2019 at 11:15 AM Junchao Zhang 
> mailto:jczh...@mcs.anl.gov>> wrote:
>   >   I met some errors with cuda + mumps. It was tested with
>   >   make -f gmakefile test 
> search='snes_tutorials-ex69_q2p1fetidp_deluxe
>   snes_tutorials-ex62_fetidp_2d_quad
>   >   snes_tutorials-ex69_q2p1fetidp_deluxe_adaptive 
> ksp_ksp_tutorials-ex52f_mumps'
>   >
>   > I can reproduce it with petsc master.  The first line of petsc nightly
>   > 
> (http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/12/master.html) 
> shows another
>   error. But I guess they have the same root: PETSc
>   > gives random wrong results in some cases. For example, I ran 
> ksp_ksp_tutorials-ex52f_mumps twice
>   and saw
>   >
>   >   $ mpirun -n 3 ./ex52f
>   > Mumps row pivot threshhold =1.00E-06
>   > Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
>   > Norm of error  1.5554E-06 iterations 1
>   >
>   > $ mpirun -n 3 ./ex52f
>   > Mumps row pivot threshhold =1.00E-06
>   > Mumps determinant=(   9.01E-01   0.00E+00)*2^ 99
>   > Norm of error  1.6356E-06 iterations 1
>   >
>   >
>   > The correct output has "Norm of error < 1.e-12,iterations 1".  
> Currently, I do know the
>   reason.
>   >
>   > --Junchao Zhang
>   >
>   >
>   >
>
>
>


Re: [petsc-dev] https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html

2019-03-18 Thread Zhang, Junchao via petsc-dev
Let's see what the author thinks about PETSc. The author likes Chapel -- a PGAS 
language. In https://www.dursi.ca/post/julia-vs-chapel.html he stated his 
concerns about Chapel:
 "the beginnings of a Chapel-native set of solvers from Scalapack or PETSc 
(both of which are notoriously hard to get started with, and in PETSc’s case, 
even install)"

His slides have more,
"
PETSc is a widely used library for large sparse iterative solves.
Excellent and comprehensive library of solvers
It is the basis of a significant number of home-made simulation codes
It is notoriously hard to start getting running with; nontrivial even 
for experts to install.
  Significant fraction of PETSc functionality is tied up in large CSR matrices 
of reasonable structure partitioned by row, vectors, and solvers built on top.
  What would a Chapel API to PETSc look like?
  What would a Chapel implementation of some core PETSc solvers look like?
"
In my view, the good and the evil of MPI grow from one root: MPI has a local 
name space. MPI does not try to define a global data structure. The evil is that 
users have to do their own global naming, which can be very easy (e.g., for a 
stencil) or very hard (e.g., refining an unstructured mesh). The good is that users 
have the freedom to design their own data structures (array, CSR, tree, hash table, 
mesh, ...).
PGAS languages tried to provide a set of global data structures, but that set is 
very limited and does not meet the requirements of many HPC codes. MPI challengers 
should start with AMR, not VecAXPY.

--Junchao Zhang


On Sun, Mar 17, 2019 at 3:12 PM Smith, Barry F. via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

  I stumbled on this today; I should have seen it years ago.

  Barry



Re: [petsc-dev] MPI_UB is deprecated in MPI-2.0

2019-03-21 Thread Zhang, Junchao via petsc-dev
I pushed an update to this branch, which adopts MPI_Type_create_resized.
--Junchao Zhang


On Tue, Mar 19, 2019 at 11:56 AM Balay, Satish via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
For now I'm merging this branch to next. If a better fix comes up later, we can 
merge it then.

thanks,
Satish

On Wed, 13 Mar 2019, Isaac, Tobin G wrote:

>
> Pushed a fix that just uses MPI_Type_contiguous(MPI_BYTE, sizeof(),
> ...), which is not great but I'm only creating the type to work with
> PetscSF, so it does the job.  Satish, do you want this as a pull
> request, or can you just merge it into next
> (`tisaac/feature-remove-mpi-ub`)?
>
> Thanks,
>   Toby
>
> On Tue, Mar 12, 2019 at 10:21:42PM -0600, Jed Brown wrote:
> > MPI_Type_create_resized (if needed).
> >
> > "Balay, Satish via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> > > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2019/03/01/make_master_arch-linux-pkgs-64idx_thrash.log
> > > has the following [but for some reason - its filtered out from the 
> > > warning count]
> > >
> > 
> > > In file included from 
> > > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/dmp4est.c:13:0:
> > > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c: In 
> > > function ‘DMPforestGetTransferSF_Point’:
> > > /sandbox/petsc/petsc.master-3/src/dm/impls/forest/p4est/pforest.c:2518:7: 
> > > warning: ‘ompi_mpi_ub’ is deprecated (declared at 
> > > /sandbox/petsc/petsc.master-3/arch-linux-pkgs-64idx/include/mpi.h:928): 
> > > MPI_UB is deprecated in MPI-2.0 [-Wdeprecated-declarations]
> > >MPI_Datatype blockTypes[5] = 
> > > {MPI_INT32_T,MPI_INT8_T,MPI_INT16_T,MPI_INT32_T,MPI_UB};
> > > <<
> > >
> > > Any idea how to fix this?
> > >
> > > Thanks,
> > > Satish
>


Re: [petsc-dev] Deprecation strategy for Enums

2019-04-09 Thread Zhang, Junchao via petsc-dev
We should have a mechanism to auto-detect API-breaking commits and then we can 
fix them before release.

--Junchao Zhang

On Tue, Apr 9, 2019 at 12:39 PM Matthew Knepley via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
This change:

  
https://bitbucket.org/petsc/petsc/commits/c0decd05c6848b80907752eef350b55c8c90e696#Linclude/petscksp.hF448

breaks the current LibMesh, and I am assuming other things. Do we have a 
deprecation strategy for this? How about we #define that old name to the same 
value for one release? We could do this
in 3.11.1.
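Something along these lines in the public header (petscksp.h), kept around for
one release; the names below are placeholders for illustration, not the actual
enum members touched by that commit:

  /* Placeholder names only: alias the removed enum name to the new one so that
     downstream code such as libMesh keeps compiling for one more release. */
  #define KSP_SOME_OLD_REASON KSP_SOME_NEW_REASON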

  Thanks,

Matt

--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-dev] Deprecation strategy for Enums

2019-04-09 Thread Zhang, Junchao via petsc-dev
I will try that to see if it works.
--Junchao Zhang


On Tue, Apr 9, 2019 at 5:14 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

   Junchao,

 Do you want to give this a go? Maybe it is a script in lib/petsc/bin/maint 
or a makefile rule that takes two git branch names and reports any changes 
between them (presumably by running ./configure && make twice to generate the 
libraries and making a copy of one branch's includes so they can be compared 
with the other).

   Barry


> On Apr 9, 2019, at 4:42 PM, Jed Brown via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> "Zhang, Junchao via petsc-dev" 
> mailto:petsc-dev@mcs.anl.gov>> writes:
>
>> We should have a mechanism to auto-detect API-breaking commits and then we 
>> can fix them before release.
>
> We should have our CI system flag PRs that fail this checker to confirm
> that it's documented in changes/dev.html and that the change is really
> necessary.
>
> https://lvc.github.io/abi-compliance-checker/



Re: [petsc-dev] Bad use of defined(MPI_XXX)

2019-05-24 Thread Zhang, Junchao via petsc-dev
How about stuff in MPI-2.2 (approved in 2009), the last of MPI-2.x, e.g., 
PETSC_HAVE_MPI_REDUCE_LOCAL?


On Fri, May 24, 2019 at 2:51 PM Jed Brown via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
Lisandro Dalcin via petsc-dev <petsc-dev@mcs.anl.gov> writes:

> These two are definitely wrong, we need PETSC_HAVE_MPI_XXX instead.

Thanks, we can delete both of these cpp guards.

> include/petscsf.h:#if defined(MPI_REPLACE)

MPI-2.0

> src/sys/objects/init.c:#if defined(PETSC_USE_64BIT_INDICES) ||
> !defined(MPI_2INT)

MPI-1.0


Re: [petsc-dev] Bad use of defined(MPI_XXX)

2019-05-24 Thread Zhang, Junchao via petsc-dev
PetscSF has many PETSC_HAVE_MPI_REDUCE_LOCAL guards. It is disturbing, but considering 
the time gap between MPI-2.0 (1998) and MPI-2.2 (2009), it is better to keep them.
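The guard pattern looks roughly like this (a sketch of the idea, not a verbatim
excerpt from PetscSF; rootbuf, leafbuf, count, unit and op are assumed names):

  #if defined(PETSC_HAVE_MPI_REDUCE_LOCAL)
    /* MPI-2.2 and later: combine incoming leaf data into the root buffer directly */
    ierr = MPI_Reduce_local(leafbuf,rootbuf,count,unit,op);CHKERRQ(ierr);
  #else
    /* pre-2.2 fallback: hand-rolled combine, shown here for MPI_SUM on PetscScalar */
    { PetscInt i; for (i=0; i<count; i++) rootbuf[i] += leafbuf[i]; }
  #endif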


On Fri, May 24, 2019 at 3:53 PM Jed Brown <j...@jedbrown.org> wrote:
"Zhang, Junchao" <jczh...@mcs.anl.gov> writes:

> How about stuff in MPI-2.2 (approved in 2009), the last of MPI-2.x, e.g., 
> PETSC_HAVE_MPI_REDUCE_LOCAL?

Currently we only require MPI-2.0, but I would not object to increasing
to MPI-2.1 or 2.2 if such systems are sufficiently rare (almost
nonexistent) in the wild.  I'm not sure how great the benefits are.

> On Fri, May 24, 2019 at 2:51 PM Jed Brown via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>>>
>  wrote:
> Lisandro Dalcin via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>>>
>  writes:
>
>> These two are definitely wrong, we need PETSC_HAVE_MPI_XXX instead.
>
> Thanks, we can delete both of these cpp guards.
>
>> include/petscsf.h:#if defined(MPI_REPLACE)
>
> MPI-2.0
>
>> src/sys/objects/init.c:#if defined(PETSC_USE_64BIT_INDICES) ||
>> !defined(MPI_2INT)
>
> MPI-1.0


[petsc-dev] Is DMGetDMKSPWrite() really not collective?

2019-06-14 Thread Zhang, Junchao via petsc-dev
Hello,
   I am investigating PETSc issue 306. One can reproduce the problem with 
src/snes/examples/tutorials/ex9.c and mpirun -n 3 ./ex9 -snes_grid_sequence 3 
-snes_converged_reason -pc_type mg
  The program can run to completion, or crash, or hang.  The error appears either 
in PetscGatherMessageLengths or PetscCommBuildTwoSided().  From my 
debugging, the following routine is suspicious.  It claims to be not collective, but 
it might call DMKSPCreate, which can indirectly call a collective 
MPI_Comm_dup().
  Can someone familiar with the code explain it?  Thanks.

  /*@C
   DMGetDMKSPWrite - get write access to private DMKSP context from a DM

   Not Collective

   Input Argument:
.  dm - DM to be used with KSP

   Output Argument:
.  kspdm - private DMKSP context

   Level: developer

.seealso: DMGetDMKSP()
@*/
PetscErrorCode DMGetDMKSPWrite(DM dm,DMKSP *kspdm)
{
  PetscErrorCode ierr;
  DMKSP  kdm;

  PetscFunctionBegin;
  PetscValidHeaderSpecific(dm,DM_CLASSID,1);
  ierr = DMGetDMKSP(dm,&kdm);CHKERRQ(ierr);
  if (!kdm->originaldm) kdm->originaldm = dm;
  if (kdm->originaldm != dm) {  /* Copy on write */
DMKSP oldkdm = kdm;
ierr  = PetscInfo(dm,"Copying DMKSP due to write\n");CHKERRQ(ierr);
ierr  = 
DMKSPCreate(PetscObjectComm((PetscObject)dm),&kdm);CHKERRQ(ierr);
ierr  = DMKSPCopy(oldkdm,kdm);CHKERRQ(ierr);
ierr  = DMKSPDestroy((DMKSP*)&dm->dmksp);CHKERRQ(ierr);
dm->dmksp = (PetscObject)kdm;
  }
  *kspdm = kdm;
  PetscFunctionReturn(0);
}

The calling stack is
...
#20 0x7fcf3e3d23e2 in PetscCommDuplicate 
(comm_in=comm_in@entry=-2080374782, comm_out=comm_out@entry=0x557a18304db0, 
first_tag=first_tag@entry=0x557a18304de4) at 
/home/jczhang/petsc/src/sys/objects/tagm.c:162
#21 0x7fcf3e3d7730 in PetscHeaderCreate_Private (h=0x557a18304d70, 
classid=, class_name=class_name@entry=0x7fcf3f7f762a "DMKSP", 
descr=descr@entry=0x7fcf3f7f762a "DMKSP",  mansec=mansec@entry=0x7fcf3f7f762a 
"DMKSP", comm=comm@entry=-2080374782, destroy=0x7fcf3f350570 , 
view=0x0)at /home/jczhang/petsc/src/sys/objects/inherit.c:64
#22 0x7fcf3f3504c9 in DMKSPCreate (comm=-2080374782, 
kdm=kdm@entry=0x7ffc1d4d00f8) at 
/home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:24
#23 0x7fcf3f35150f in DMGetDMKSPWrite (dm=0x557a18541a10, 
kspdm=kspdm@entry=0x7ffc1d4d01a8) at 
/home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:155
#24 0x7fcf3f1bb20e in PCSetUp_MG (pc=) at 
/home/jczhang/petsc/src/ksp/pc/impls/mg/mg.c:682
#25 0x7fcf3f204bea in PCSetUp (pc=0x557a17dc1860) at 
/home/jczhang/petsc/src/ksp/pc/interface/precon.c:894
#26 0x7fcf3f32ba4b in KSPSetUp (ksp=0x557a17d73500) at 
/home/jczhang/petsc/src/ksp/ksp/interface/itfunc.c:377
#27 0x7fcf3f41e43e in SNESSolve_VINEWTONRSLS (snes=0x557a17bff210) at 
/home/jczhang/petsc/src/snes/impls/vi/rs/virs.c:502
#28 0x7fcf3f3fa191 in SNESSolve (snes=0x557a17bff210, b=0x0, x=) at /home/jczhang/petsc/src/snes/interface/snes.c:4433
#29 0x557a16432095 in main (argc=, argv=) 
at /home/jczhang/petsc/src/snes/examples/tutorials/ex9.c:105

--Junchao Zhang


Re: [petsc-dev] Is DMGetDMKSPWrite() really not collective?

2019-06-14 Thread Zhang, Junchao via petsc-dev
In PCSetUp_MG, processes can diverge at the DMGetDMKSPWrite call (last line below):
  for (i=n-2; i>-1; i--) {
    DMKSP kdm;
    PetscBool dmhasrestrict, dmhasinject;
    ierr = KSPSetDM(mglevels[i]->smoothd,dms[i]);CHKERRQ(ierr);
    if (!needRestricts) {ierr = KSPSetDMActive(mglevels[i]->smoothd,PETSC_FALSE);CHKERRQ(ierr);}
    if (mglevels[i]->smoothd != mglevels[i]->smoothu) {
      ierr = KSPSetDM(mglevels[i]->smoothu,dms[i]);CHKERRQ(ierr);
      if (!needRestricts) {ierr = KSPSetDMActive(mglevels[i]->smoothu,PETSC_FALSE);CHKERRQ(ierr);}
    }
    ierr = DMGetDMKSPWrite(dms[i],&kdm);CHKERRQ(ierr);
--Junchao Zhang


On Fri, Jun 14, 2019 at 12:43 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
Hello,
   I am investigating petsc issue 
306.
 One can produce the problem with src/snes/examples/tutorials/ex9.c and mpirun 
-n 3 ./ex9 -snes_grid_sequence 3 -snes_converged_reason -pc_type mg
  The program can run to finish,  or crash, or hang.  The error appears either 
in PetscGatherMessageLengths or PetscCommBuildTwoSided().  From from my 
debugging, the following routine is suspicious.  It claims not collective, but 
it might call DMKSPCreate, which can indirectly call a collective 
MPI_Comm_dup().
  Can someone familiar with the code explain it?  Thanks.

  /*@C
   DMGetDMKSPWrite - get write access to private DMKSP context from a DM

   Not Collective

   Input Argument:
.  dm - DM to be used with KSP

   Output Argument:
.  kspdm - private DMKSP context

   Level: developer

.seealso: DMGetDMKSP()
@*/
PetscErrorCode DMGetDMKSPWrite(DM dm,DMKSP *kspdm)
{
  PetscErrorCode ierr;
  DMKSP  kdm;

  PetscFunctionBegin;
  PetscValidHeaderSpecific(dm,DM_CLASSID,1);
  ierr = DMGetDMKSP(dm,&kdm);CHKERRQ(ierr);
  if (!kdm->originaldm) kdm->originaldm = dm;
  if (kdm->originaldm != dm) {  /* Copy on write */
DMKSP oldkdm = kdm;
ierr  = PetscInfo(dm,"Copying DMKSP due to write\n");CHKERRQ(ierr);
ierr  = 
DMKSPCreate(PetscObjectComm((PetscObject)dm),&kdm);CHKERRQ(ierr);
ierr  = DMKSPCopy(oldkdm,kdm);CHKERRQ(ierr);
ierr  = DMKSPDestroy((DMKSP*)&dm->dmksp);CHKERRQ(ierr);
dm->dmksp = (PetscObject)kdm;
  }
  *kspdm = kdm;
  PetscFunctionReturn(0);
}

The calling stack is
...
#20 0x7fcf3e3d23e2 in PetscCommDuplicate 
(comm_in=comm_in@entry=-2080374782, comm_out=comm_out@entry=0x557a18304db0, 
first_tag=first_tag@entry=0x557a18304de4) at 
/home/jczhang/petsc/src/sys/objects/tagm.c:162
#21 0x7fcf3e3d7730 in PetscHeaderCreate_Private (h=0x557a18304d70, 
classid=, class_name=class_name@entry=0x7fcf3f7f762a "DMKSP", 
descr=descr@entry=0x7fcf3f7f762a "DMKSP",  mansec=mansec@entry=0x7fcf3f7f762a 
"DMKSP", comm=comm@entry=-2080374782, destroy=0x7fcf3f350570 , 
view=0x0)at /home/jczhang/petsc/src/sys/objects/inherit.c:64
#22 0x7fcf3f3504c9 in DMKSPCreate (comm=-2080374782, 
kdm=kdm@entry=0x7ffc1d4d00f8) at 
/home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:24
#23 0x7fcf3f35150f in DMGetDMKSPWrite (dm=0x557a18541a10, 
kspdm=kspdm@entry=0x7ffc1d4d01a8) at 
/home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:155
#24 0x7fcf3f1bb20e in PCSetUp_MG (pc=) at 
/home/jczhang/petsc/src/ksp/pc/impls/mg/mg.c:682
#25 0x7fcf3f204bea in PCSetUp (pc=0x557a17dc1860) at 
/home/jczhang/petsc/src/ksp/pc/interface/precon.c:894
#26 0x7fcf3f32ba4b in KSPSetUp (ksp=0x557a17d73500) at 
/home/jczhang/petsc/src/ksp/ksp/interface/itfunc.c:377
#27 0x7fcf3f41e43e in SNESSolve_VINEWTONRSLS (snes=0x557a17bff210) at 
/home/jczhang/petsc/src/snes/impls/vi/rs/virs.c:502
#28 0x7fcf3f3fa191 in SNESSolve (snes=0x557a17bff210, b=0x0, x=) at /home/jczhang/petsc/src/snes/interface/snes.c:4433
#29 0x557a16432095 in main (argc=, argv=) 
at /home/jczhang/petsc/src/snes/examples/tutorials/ex9.c:105

--Junchao Zhang


Re: [petsc-dev] Is DMGetDMKSPWrite() really not collective?

2019-06-14 Thread Zhang, Junchao via petsc-dev

On Fri, Jun 14, 2019 at 1:01 PM Lawrence Mitchell <we...@gmx.li> wrote:


> On 14 Jun 2019, at 18:44, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Hello,
>I am investigating petsc issue 306. One can produce the problem with 
> src/snes/examples/tutorials/ex9.c and mpirun -n 3 ./ex9 -snes_grid_sequence 3 
> -snes_converged_reason -pc_type mg
>   The program can run to finish,  or crash, or hang.  The error appears 
> either in PetscGatherMessageLengths or PetscCommBuildTwoSided().  From from 
> my debugging, the following routine is suspicious.  It claims not collective, 
> but it might call DMKSPCreate, which can indirectly call a collective 
> MPI_Comm_dup().

I would have thought that DMKSPCreate could only call MPI_Comm_dup (via 
PetscCommDuplicate) if the incoming dm has a communicator which is /not/ a 
PETSc communicator. Given that all PETSc objects must (?) return a PETSc 
communicator when calling PetscObjectComm, this function is presumably 
incidentally not collective (although it is logically collective I would have 
thought), oh, and with PETSC_USE_DEBUG, there's a barrier even in the "return 
immediately with a PETSc comm" case.

Yes, PetscCommDuplicate does not always call MPI_Comm_dup, but it does decrement the 
MPI tag, resulting in an MPI tag mismatch that crashes the code.  If I turn on 
PETSC_USE_DEBUG, the code sometimes (but not always) hangs in 
PetscCommDuplicate.


FWIW, the call site looks to be collective (PCSetUp_MG).

Lawrence

>   Can someone familiar with the code explain it?  Thanks.
>
>   /*@C
>DMGetDMKSPWrite - get write access to private DMKSP context from a DM
>
>Not Collective
>
>Input Argument:
> .  dm - DM to be used with KSP
>
>Output Argument:
> .  kspdm - private DMKSP context
>
>Level: developer
>
> .seealso: DMGetDMKSP()
> @*/
> PetscErrorCode DMGetDMKSPWrite(DM dm,DMKSP *kspdm)
> {
>   PetscErrorCode ierr;
>   DMKSP  kdm;
>
>   PetscFunctionBegin;
>   PetscValidHeaderSpecific(dm,DM_CLASSID,1);
>   ierr = DMGetDMKSP(dm,&kdm);CHKERRQ(ierr);
>   if (!kdm->originaldm) kdm->originaldm = dm;
>   if (kdm->originaldm != dm) {  /* Copy on write */
> DMKSP oldkdm = kdm;
> ierr  = PetscInfo(dm,"Copying DMKSP due to write\n");CHKERRQ(ierr);
> ierr  = 
> DMKSPCreate(PetscObjectComm((PetscObject)dm),&kdm);CHKERRQ(ierr);
> ierr  = DMKSPCopy(oldkdm,kdm);CHKERRQ(ierr);
> ierr  = DMKSPDestroy((DMKSP*)&dm->dmksp);CHKERRQ(ierr);
> dm->dmksp = (PetscObject)kdm;
>   }
>   *kspdm = kdm;
>   PetscFunctionReturn(0);
> }
>
> The calling stack is
> ...
> #20 0x7fcf3e3d23e2 in PetscCommDuplicate 
> (comm_in=comm_in@entry=-2080374782, comm_out=comm_out@entry=0x557a18304db0, 
> first_tag=first_tag@entry=0x557a18304de4) at 
> /home/jczhang/petsc/src/sys/objects/tagm.c:162
> #21 0x7fcf3e3d7730 in PetscHeaderCreate_Private (h=0x557a18304d70, 
> classid=, class_name=class_name@entry=0x7fcf3f7f762a "DMKSP", 
> descr=descr@entry=0x7fcf3f7f762a "DMKSP",  mansec=mansec@entry=0x7fcf3f7f762a 
> "DMKSP", comm=comm@entry=-2080374782, destroy=0x7fcf3f350570 , 
> view=0x0)at /home/jczhang/petsc/src/sys/objects/inherit.c:64
> #22 0x7fcf3f3504c9 in DMKSPCreate (comm=-2080374782, 
> kdm=kdm@entry=0x7ffc1d4d00f8) at 
> /home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:24
> #23 0x7fcf3f35150f in DMGetDMKSPWrite (dm=0x557a18541a10, 
> kspdm=kspdm@entry=0x7ffc1d4d01a8) at 
> /home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:155
> #24 0x7fcf3f1bb20e in PCSetUp_MG (pc=) at 
> /home/jczhang/petsc/src/ksp/pc/impls/mg/mg.c:682
> #25 0x7fcf3f204bea in PCSetUp (pc=0x557a17dc1860) at 
> /home/jczhang/petsc/src/ksp/pc/interface/precon.c:894
> #26 0x7fcf3f32ba4b in KSPSetUp (ksp=0x557a17d73500) at 
> /home/jczhang/petsc/src/ksp/ksp/interface/itfunc.c:377
> #27 0x7fcf3f41e43e in SNESSolve_VINEWTONRSLS (snes=0x557a17bff210) at 
> /home/jczhang/petsc/src/snes/impls/vi/rs/virs.c:502
> #28 0x7fcf3f3fa191 in SNESSolve (snes=0x557a17bff210, b=0x0, x= out>) at /home/jczhang/petsc/src/snes/interface/snes.c:4433
> #29 0x557a16432095 in main (argc=, argv=)   
>   at /home/jczhang/petsc/src/snes/examples/tutorials/ex9.c:105
>
> --Junchao Zhang



Re: [petsc-dev] Is DMGetDMKSPWrite() really not collective?

2019-06-14 Thread Zhang, Junchao via petsc-dev
Discussed with Mr. Hong. The following two new lines fix the problem. Either 
line alone also works, but I think we should have both. Similar things happen 
to dmsnes and dmts, but I need DM experts' feedback before changing them. 
Thanks.

diff --git a/src/ksp/ksp/interface/dmksp.c b/src/ksp/ksp/interface/dmksp.c
index 9ce75090..0ab69574 100644
--- a/src/ksp/ksp/interface/dmksp.c
+++ b/src/ksp/ksp/interface/dmksp.c
@@ -80,6 +80,7 @@ PetscErrorCode DMKSPCopy(DMKSP kdm,DMKSP nkdm)
   nkdm->rhsctx  = kdm->rhsctx;
   nkdm->initialguessctx = kdm->initialguessctx;
   nkdm->data= kdm->data;
+  nkdm->originaldm  = kdm->originaldm;

   nkdm->fortran_func_pointers[0] = kdm->fortran_func_pointers[0];
   nkdm->fortran_func_pointers[1] = kdm->fortran_func_pointers[1];
@@ -156,6 +157,7 @@ PetscErrorCode DMGetDMKSPWrite(DM dm,DMKSP *kspdm)
 ierr  = DMKSPCopy(oldkdm,kdm);CHKERRQ(ierr);
 ierr  = DMKSPDestroy((DMKSP*)&dm->dmksp);CHKERRQ(ierr);
 dm->dmksp = (PetscObject)kdm;
+kdm->originaldm = dm;
   }
   *kspdm = kdm;
   PetscFunctionReturn(0);

--Junchao Zhang


On Fri, Jun 14, 2019 at 1:07 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:

On Fri, Jun 14, 2019 at 1:01 PM Lawrence Mitchell <we...@gmx.li> wrote:


> On 14 Jun 2019, at 18:44, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Hello,
>I am investigating petsc issue 306. One can produce the problem with 
> src/snes/examples/tutorials/ex9.c and mpirun -n 3 ./ex9 -snes_grid_sequence 3 
> -snes_converged_reason -pc_type mg
>   The program can run to finish,  or crash, or hang.  The error appears 
> either in PetscGatherMessageLengths or PetscCommBuildTwoSided().  From from 
> my debugging, the following routine is suspicious.  It claims not collective, 
> but it might call DMKSPCreate, which can indirectly call a collective 
> MPI_Comm_dup().

I would have thought that DMKSPCreate could only call MPI_Comm_dup (via 
PetscCommDuplicate) if the incoming dm has a communicator which is /not/ a 
PETSc communicator. Given that all PETSc objects must (?) return a PETSc 
communicator when calling PetscObjectComm, this function is presumably 
incidentally not collective (although it is logically collective I would have 
thought), oh, and with PETSC_USE_DEBUG, there's a barrier even in the "return 
immediately with a PETSc comm" case.

Yes, PetscCommDuplicate not always calls MPI_Comm_dup, but it decreases MPI 
tag, resulting in MPI tag mismatch and crashing the code.  If I turn on 
PETSC_USE_DEBUG, the code sometime (but not all times) hang in 
PetscCommDuplicate.


FWIW, the call site looks to be collective (PCSetUp_MG).

Lawrence

>   Can someone familiar with the code explain it?  Thanks.
>
>   /*@C
>DMGetDMKSPWrite - get write access to private DMKSP context from a DM
>
>Not Collective
>
>Input Argument:
> .  dm - DM to be used with KSP
>
>Output Argument:
> .  kspdm - private DMKSP context
>
>Level: developer
>
> .seealso: DMGetDMKSP()
> @*/
> PetscErrorCode DMGetDMKSPWrite(DM dm,DMKSP *kspdm)
> {
>   PetscErrorCode ierr;
>   DMKSP  kdm;
>
>   PetscFunctionBegin;
>   PetscValidHeaderSpecific(dm,DM_CLASSID,1);
>   ierr = DMGetDMKSP(dm,&kdm);CHKERRQ(ierr);
>   if (!kdm->originaldm) kdm->originaldm = dm;
>   if (kdm->originaldm != dm) {  /* Copy on write */
> DMKSP oldkdm = kdm;
> ierr  = PetscInfo(dm,"Copying DMKSP due to write\n");CHKERRQ(ierr);
> ierr  = 
> DMKSPCreate(PetscObjectComm((PetscObject)dm),&kdm);CHKERRQ(ierr);
> ierr  = DMKSPCopy(oldkdm,kdm);CHKERRQ(ierr);
> ierr  = DMKSPDestroy((DMKSP*)&dm->dmksp);CHKERRQ(ierr);
> dm->dmksp = (PetscObject)kdm;
>   }
>   *kspdm = kdm;
>   PetscFunctionReturn(0);
> }
>
> The calling stack is
> ...
> #20 0x7fcf3e3d23e2 in PetscCommDuplicate 
> (comm_in=comm_in@entry=-2080374782, comm_out=comm_out@entry=0x557a18304db0, 
> first_tag=first_tag@entry=0x557a18304de4) at 
> /home/jczhang/petsc/src/sys/objects/tagm.c:162
> #21 0x7fcf3e3d7730 in PetscHeaderCreate_Private (h=0x557a18304d70, 
> classid=, class_name=class_name@entry=0x7fcf3f7f762a "DMKSP", 
> descr=descr@entry=0x7fcf3f7f762a "DMKSP",  mansec=mansec@entry=0x7fcf3f7f762a 
> "DMKSP", comm=comm@entry=-2080374782, destroy=0x7fcf3f350570 , 
> view=0x0)at /home/jczhang/petsc/src/sys/objects/inherit.c:64
> #22 0x7fcf3f3504c9 in DMKSPCreate (comm=-2080374782, 
> kdm=kdm@entry=0x7ffc1d4d00f8) at 
> /home/jczhang/petsc/src/ksp/ksp/interface/dmksp.c:24
> #23 0x7fcf3f35150f in DMGetDMKSPWrite (dm=0x557a18541a10, 
> kspdm=kspdm@entry=0x7ffc1d4d01a8)   

Re: [petsc-dev] PETSc blame digest (next-tmp) 2019-06-27

2019-06-27 Thread Zhang, Junchao via petsc-dev
Fixed.
--Junchao Zhang


On Thu, Jun 27, 2019 at 3:28 PM PETSc checkBuilds <petsc-checkbui...@mcs.anl.gov> wrote:


Dear PETSc developer,

This email contains listings of contributions attributed to you by
`git blame` that caused compiler errors or warnings in PETSc automated
testing.  Follow the links to see the full log files. Please attempt to fix
the issues promptly or let us know at 
petsc-dev@mcs.anl.gov if you are unable
to resolve the issues.

Thanks,
  The PETSc development team



warnings attributed to commit https://bitbucket.org/petsc/petsc/commits/63b8946
Add patterned SF graphs and use x as roots and y as leaves in x to y vecscatter

  src/vec/is/sf/interface/sf.c:433

[http://ftp.mcs.anl.gov/pub/petsc/nightlylogs//archive/2019/06/27/build_next-tmp_arch-linux-cxx-cmplx-pkgs-64idx_churn.log]
  /sandbox/petsc/petsc.next-tmp-2/src/vec/is/sf/interface/sf.c:433:91: 
warning: variable 'comm' is uninitialized when used here [-Wuninitialized]

[http://ftp.mcs.anl.gov/pub/petsc/nightlylogs//archive/2019/06/27/build_next-tmp_arch-osx-10.6-cxx-pkgs-opt_ipro.log]
  /Users/petsc/petsc.next-tmp-2/src/vec/is/sf/interface/sf.c:433:91: 
warning: variable 'comm' is uninitialized when used here [-Wuninitialized]

[http://ftp.mcs.anl.gov/pub/petsc/nightlylogs//archive/2019/06/27/build_next-tmp_arch-osx-10.6-cxx-cmplx-pkgs-dbg_ipro.log]
  /Users/petsc/petsc.next-tmp-3/src/vec/is/sf/interface/sf.c:433:91: 
warning: variable 'comm' is uninitialized when used here [-Wuninitialized]


To opt-out from receiving these messages - send a request to 
petsc-dev@mcs.anl.gov.


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-02 Thread Zhang, Junchao via petsc-dev
Is it because the array is already sorted?
--Junchao Zhang


On Tue, Jul 2, 2019 at 12:13 PM Fande Kong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when there is a sufficient 
amount of off-diagonal entries. It was not an MPI issue, since I was able to 
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned on profiling, and 99.99% of the time was spent in 
PetscSortIntWithArrayPair (calling itself recursively).  It took THREE MINUTES to 
get the assembly done. I then switched to the option "-matstash_legacy" to restore 
the old assembly routine, and the same code took ONE SECOND to get 
the matrix assembly done.

Should we write a better sorting algorithm?


Fande,


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-02 Thread Zhang, Junchao via petsc-dev
Try this to see if it helps:

diff --git a/src/sys/utils/sorti.c b/src/sys/utils/sorti.c
index 1b07205a..90779891 100644
--- a/src/sys/utils/sorti.c
+++ b/src/sys/utils/sorti.c
@@ -294,7 +294,8 @@ static PetscErrorCode 
PetscSortIntWithArrayPair_Private(PetscInt *L,PetscInt *J,
 }
 PetscFunctionReturn(0);
   }
-  SWAP3(L[0],L[right/2],J[0],J[right/2],K[0],K[right/2],tmp);
+  i = MEDIAN(L,right);
+  SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   vl   = L[0];
   last = 0;
   for (i=1; i<=right; i++) {
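For context, MEDIAN here is meant to pick a median-of-three pivot (first, middle,
last element) instead of blindly taking the middle element, which degrades badly
on already-sorted input. A rough sketch of such a helper (my illustration; the
macro in the branch may differ):

  /* Return the index of the median of L[0], L[right/2], L[right]. */
  static inline PetscInt MedianOf3(const PetscInt *L,PetscInt right)
  {
    PetscInt a = 0, b = right/2, c = right, t;
    if (L[a] > L[b]) {t = a; a = b; b = t;}         /* now L[a] <= L[b] */
    if (L[b] > L[c]) {b = (L[a] > L[c]) ? a : c;}   /* median is the larger of L[a], L[c] */
    return b;
  }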


On Tue, Jul 2, 2019 at 12:14 PM Fande Kong via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
BTW,

PetscSortIntWithArrayPair is used in MatStashSortCompress_Private.

Any way to avoid using PetscSortIntWithArrayPair in 
MatStashSortCompress_Private?

Fande,

On Tue, Jul 2, 2019 at 11:09 AM Fande Kong <fdkong...@gmail.com> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when having sufficient 
amount of off-diagonal entries. It was not a MPI issue since I was  able to 
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned  on a profiling, and 99.99% of the time was spent on 
PetscSortIntWithArrayPair (recursively calling).   It took THREE MINUTES  to 
get the assembly done. And then changed to use the option "-matstash_legacy" to 
restore
the code to the old assembly routine, and the same code took ONE SECOND to get 
the matrix assembly done.

Should write any better sorting algorithms?


Fande,


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-03 Thread Zhang, Junchao via petsc-dev
Fande and John,
  Could you try jczhang/feature-better-quicksort-pivot? It passed the Jenkins tests 
and I cannot imagine why it fails for you.
  A hash table has its own cost. We'd better get quicksort right and see how it 
performs before rewriting the code.
--Junchao Zhang


On Tue, Jul 2, 2019 at 2:37 PM Fande Kong <fdkong...@gmail.com> wrote:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 
(signal 11)

Segmentation fault :-)


As Jed said, it might be a good idea to rewrite the code using a hash table.


Fande,


On Tue, Jul 2, 2019 at 1:27 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Try this to see if it helps:

diff --git a/src/sys/utils/sorti.c b/src/sys/utils/sorti.c
index 1b07205a..90779891 100644
--- a/src/sys/utils/sorti.c
+++ b/src/sys/utils/sorti.c
@@ -294,7 +294,8 @@ static PetscErrorCode 
PetscSortIntWithArrayPair_Private(PetscInt *L,PetscInt *J,
 }
 PetscFunctionReturn(0);
   }
-  SWAP3(L[0],L[right/2],J[0],J[right/2],K[0],K[right/2],tmp);
+  i = MEDIAN(L,right);
+  SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   vl   = L[0];
   last = 0;
   for (i=1; i<=right; i++) {


On Tue, Jul 2, 2019 at 12:14 PM Fande Kong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
BTW,

PetscSortIntWithArrayPair is used in MatStashSortCompress_Private.

Any way to avoid to use PetscSortIntWithArrayPair in 
MatStashSortCompress_Private?

Fande,

On Tue, Jul 2, 2019 at 11:09 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when having sufficient 
amount of off-diagonal entries. It was not a MPI issue since I was  able to 
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned  on a profiling, and 99.99% of the time was spent on 
PetscSortIntWithArrayPair (recursively calling).   It took THREE MINUTES  to 
get the assembly done. And then changed to use the option "-matstash_legacy" to 
restore
the code to the old assembly routine, and the same code took ONE SECOND to get 
the matrix assembly done.

Should write any better sorting algorithms?


Fande,


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-03 Thread Zhang, Junchao via petsc-dev
Could you debug it or paste the stack trace? Since it is a segfault, it should 
be easy.
--Junchao Zhang


On Wed, Jul 3, 2019 at 5:16 PM Fande Kong <fdkong...@gmail.com> wrote:
Thanks Junchao,

But there is still a segmentation fault. I guess you could test your changes with 
some arrays of consecutive integers.


Fande

On Wed, Jul 3, 2019 at 12:57 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Fande and John,
  Could you try jczhang/feature-better-quicksort-pivot? It passed Jenkins tests 
and I could not imagine why it failed on yours.
  Hash table has its own cost. We'd better get quicksort right and see how it 
performs before rewriting code.
--Junchao Zhang


On Tue, Jul 2, 2019 at 2:37 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 
(signal 11)

Segmentation fault :-)


As Jed said, it might be a good idea to rewrite the code using the hashing 
table.


Fande,


On Tue, Jul 2, 2019 at 1:27 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Try this to see if it helps:

diff --git a/src/sys/utils/sorti.c b/src/sys/utils/sorti.c
index 1b07205a..90779891 100644
--- a/src/sys/utils/sorti.c
+++ b/src/sys/utils/sorti.c
@@ -294,7 +294,8 @@ static PetscErrorCode 
PetscSortIntWithArrayPair_Private(PetscInt *L,PetscInt *J,
 }
 PetscFunctionReturn(0);
   }
-  SWAP3(L[0],L[right/2],J[0],J[right/2],K[0],K[right/2],tmp);
+  i = MEDIAN(L,right);
+  SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   vl   = L[0];
   last = 0;
   for (i=1; i<=right; i++) {


On Tue, Jul 2, 2019 at 12:14 PM Fande Kong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
BTW,

PetscSortIntWithArrayPair is used in MatStashSortCompress_Private.

Any way to avoid using PetscSortIntWithArrayPair in 
MatStashSortCompress_Private?

Fande,

On Tue, Jul 2, 2019 at 11:09 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when there is a sufficient
amount of off-diagonal entries. It was not an MPI issue, since I was able to
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned on profiling, and 99.99% of the time was spent in
PetscSortIntWithArrayPair (called recursively). It took THREE MINUTES to
get the assembly done. I then used the option "-matstash_legacy" to restore
the old assembly routine, and the same code took ONE SECOND to get
the matrix assembly done.

Should we write a better sorting algorithm?


Fande,


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-08 Thread Zhang, Junchao via petsc-dev
Is the code public for me to test?
--Junchao Zhang


On Mon, Jul 8, 2019 at 3:06 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Thanks Junchao,

I tried your code. I did not hit the segfault this time, but the assembly was
still slow:


time mpirun -n 2 ./matrix_sparsity-opt   -matstash_legacy
Close matrix for np = 2 ...
Matrix successfully closed

real 0m2.009s
user 0m3.324s
sys 0m0.575s




 time mpirun -n 2 ./matrix_sparsity-opt
Close matrix for np = 2 ...
Matrix successfully closed

real 3m39.235s
user 6m42.184s
sys 0m35.084s




Fande,




On Mon, Jul 8, 2019 at 8:47 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Will let you know soon.

Thanks,

Fande,

On Mon, Jul 8, 2019 at 8:41 AM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Fande or John,
  Could any of you have a try? Thanks
--Junchao Zhang


-- Forwarded message -
From: Junchao Zhang mailto:jczh...@mcs.anl.gov>>
Date: Thu, Jul 4, 2019 at 8:21 AM
Subject: Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly
To: Fande Kong mailto:fdkong...@gmail.com>>


Fande,
  I wrote tests but could not reproduce the error. I pushed a commit that 
changed the MEDIAN macro to a function to make it easier to debug.  Could you 
run and debug it again? It should be easy to see what is wrong in gdb.
  Thanks.
--Junchao Zhang


On Wed, Jul 3, 2019 at 6:48 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Process 3915 resuming
Process 3915 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=2, address=0x7ffee9b91fc8)
frame #0: 0x00010cbaa031 
libpetsc.3.011.dylib`PetscSortIntWithArrayPair_Private(L=0x000119fc5480, 
J=0x00011bfaa480, K=0x00011ff74480, right=13291) at sorti.c:298
   295  }
   296  PetscFunctionReturn(0);
   297}
-> 298i= MEDIAN(L,right);
   299SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   300vl   = L[0];
   301last = 0;
(lldb)


On Wed, Jul 3, 2019 at 4:32 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Could you debug it or paste the stack trace? Since it is a segfault, it should 
be easy.
--Junchao Zhang


On Wed, Jul 3, 2019 at 5:16 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Thanks Junchao,

But there is still a segmentation fault. I guess you could test your changes
with some arrays of consecutive integers.


Fande

On Wed, Jul 3, 2019 at 12:57 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Fande and John,
  Could you try jczhang/feature-better-quicksort-pivot? It passed Jenkins tests 
and I could not imagine why it failed on yours.
  Hash table has its own cost. We'd better get quicksort right and see how it 
performs before rewriting code.
--Junchao Zhang


On Tue, Jul 2, 2019 at 2:37 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 
(signal 11)

Segmentation fault :-)


As Jed said, it might be a good idea to rewrite the code using the hashing 
table.


Fande,


On Tue, Jul 2, 2019 at 1:27 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Try this to see if it helps:

diff --git a/src/sys/utils/sorti.c b/src/sys/utils/sorti.c
index 1b07205a..90779891 100644
--- a/src/sys/utils/sorti.c
+++ b/src/sys/utils/sorti.c
@@ -294,7 +294,8 @@ static PetscErrorCode 
PetscSortIntWithArrayPair_Private(PetscInt *L,PetscInt *J,
 }
 PetscFunctionReturn(0);
   }
-  SWAP3(L[0],L[right/2],J[0],J[right/2],K[0],K[right/2],tmp);
+  i = MEDIAN(L,right);
+  SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   vl   = L[0];
   last = 0;
   for (i=1; i<=right; i++) {


On Tue, Jul 2, 2019 at 12:14 PM Fande Kong via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
BTW,

PetscSortIntWithArrayPair is used in MatStashSortCompress_Private.

Any way to avoid using PetscSortIntWithArrayPair in 
MatStashSortCompress_Private?

Fande,

On Tue, Jul 2, 2019 at 11:09 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Hi Developers,

John just noticed that the matrix assembly was slow when there is a sufficient
amount of off-diagonal entries. It was not an MPI issue, since I was able to
reproduce the issue using two cores on my desktop, that is, "mpirun -n 2".

I turned on profiling, and 99.99% of the time was spent in
PetscSortIntWithArrayPair (called recursively). It took THREE MINUTES to
get the assembly done. I then used the option "-matstash_legacy" to restore
the old assembly routine, and the same code took ONE SECOND to get
the matrix assembly done.

Should we write a better sorting algorithm?


Fande,


Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly

2019-07-10 Thread Zhang, Junchao via petsc-dev
Fande,
  I ran your code with two processes and found the poor performance of
PetscSortIntWithArrayPair() was due to duplicates. In particular, rank 0 has
array length = 0 and rank 1 has array length = 4,180,070. On rank 1, each
unique array value has ~95 duplicates; the duplicates are already clustered
together before sorting.
  The old quicksort algorithm has O(n^2) complexity on these duplicates. I
added a three-way-partition algorithm that splits the array into <, =, > pivot
parts, and got these numbers:

Master:
$ time mpirun -n 2 ./ex_petsc_only
real 0m55.359s
user 1m7.807s
sys 0m42.651s

Master:
$ time mpirun -n 2 ./ex_petsc_only -matstash_legacy
real 0m0.987s
user 0m1.565s
sys 0m0.285s

Three way partition
$ time mpirun -n 2 ./ex_petsc_only
real 0m1.015s
user 0m1.535s
sys 0m0.392s

We can see the new sort algorithm gives a 55x speedup and almost catches up
with -matstash_legacy. So I think it is worth having regardless of whether we
eventually switch to hashing in the matrix assembly.
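
[For illustration: a minimal sketch of the three-way (Dutch national flag)
partition on a plain int array; the actual PETSc routine also permutes the two
companion arrays and uses PetscInt:]

/* sort a[lo..hi]; keys equal to the pivot end up in the middle band and are
   never recursed into, which avoids the O(n^2) behavior on many duplicates */
static void Sort3Way(int *a, int lo, int hi)
{
  int lt, gt, i, pivot, t;
  if (hi <= lo) return;
  pivot = a[lo + (hi - lo)/2];
  lt = lo; gt = hi; i = lo;
  while (i <= gt) {
    if      (a[i] < pivot) { t = a[lt]; a[lt] = a[i]; a[i] = t; lt++; i++; }
    else if (a[i] > pivot) { t = a[gt]; a[gt] = a[i]; a[i] = t; gt--; }
    else                   { i++; }
  }
  Sort3Way(a, lo, lt-1);  /* elements strictly less than the pivot    */
  Sort3Way(a, gt+1, hi);  /* elements strictly greater than the pivot */
}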
Please try the branch jczhang/feature-better-quicksort-pivot on your side to see
whether it helps and whether the segfault persists.
Thanks.

--Junchao Zhang


On Tue, Jul 9, 2019 at 10:18 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Hi Junchao,

If you are struggling to build the right package environment, here is a 
native PETSc example.


Fande,

On Mon, Jul 8, 2019 at 3:00 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
I guess John has a pure PETSc example,

John, could you share the pure PETSc example with Junchao?


Fande,

On Mon, Jul 8, 2019 at 2:58 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Yes, here it is https://github.com/fdkong/matrixsparsity


You need to follow instructions here to install MOOSE 
https://www.mooseframework.org/getting_started/installation/mac_os.html


Thanks for your help.


Fande



On Mon, Jul 8, 2019 at 2:28 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Is the code public for me to test?
--Junchao Zhang


On Mon, Jul 8, 2019 at 3:06 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Thanks Junchao,

I tried your code. I did not hit the segfault this time, but the assembly was
still slow:


time mpirun -n 2 ./matrix_sparsity-opt   -matstash_legacy
Close matrix for np = 2 ...
Matrix successfully closed

real 0m2.009s
user 0m3.324s
sys 0m0.575s




 time mpirun -n 2 ./matrix_sparsity-opt
Close matrix for np = 2 ...
Matrix successfully closed

real 3m39.235s
user 6m42.184s
sys 0m35.084s




Fande,




On Mon, Jul 8, 2019 at 8:47 AM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Will let you know soon.

Thanks,

Fande,

On Mon, Jul 8, 2019 at 8:41 AM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Fande or John,
  Could any of you have a try? Thanks
--Junchao Zhang


-- Forwarded message -
From: Junchao Zhang mailto:jczh...@mcs.anl.gov>>
Date: Thu, Jul 4, 2019 at 8:21 AM
Subject: Re: [petsc-dev] Slowness of PetscSortIntWithArrayPair in MatAssembly
To: Fande Kong mailto:fdkong...@gmail.com>>


Fande,
  I wrote tests but could not reproduce the error. I pushed a commit that 
changed the MEDIAN macro to a function to make it easier to debug.  Could you 
run and debug it again? It should be easy to see what is wrong in gdb.
  Thanks.
--Junchao Zhang


On Wed, Jul 3, 2019 at 6:48 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Process 3915 resuming
Process 3915 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=2, address=0x7ffee9b91fc8)
frame #0: 0x00010cbaa031 
libpetsc.3.011.dylib`PetscSortIntWithArrayPair_Private(L=0x000119fc5480, 
J=0x00011bfaa480, K=0x00011ff74480, right=13291) at sorti.c:298
   295  }
   296  PetscFunctionReturn(0);
   297}
-> 298i= MEDIAN(L,right);
   299SWAP3(L[0],L[i],J[0],J[i],K[0],K[i],tmp);
   300vl   = L[0];
   301last = 0;
(lldb)


On Wed, Jul 3, 2019 at 4:32 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Could you debug it or paste the stack trace? Since it is a segfault, it should 
be easy.
--Junchao Zhang


On Wed, Jul 3, 2019 at 5:16 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Thanks Junchao,

But there is still a segmentation fault. I guess you could test your changes
with some arrays of consecutive integers.


Fande

On Wed, Jul 3, 2019 at 12:57 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Fande and John,
  Could you try jczhang/feature-better-quicksort-pivot? It passed Jenkins tests 
and I could not imagine why it failed on yours.
  Hash table has its own cost. We'd better get quicksort right and see how it 
performs before rewriting code.
--Junchao Zhang


On Tue, Jul 2, 2019 at 2:37 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 
(signal 11)

Segmentation fault :-)


As Jed said, it might be a good idea to rewrite the code using the hashing 
table.


Fande,


On Tue, Jul 2, 2019 at 1:27 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Try th

Re: [petsc-dev] [Radev, Martin] Re: Adding a new encoding for FP data

2019-07-11 Thread Zhang, Junchao via petsc-dev
A side question: Do lossy compressors have value for PETSc?

--Junchao Zhang

On Thu, Jul 11, 2019 at 9:06 AM Jed Brown via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Zstd is a remarkably good compressor.  I've experimented with it for
compressing column indices for sparse matrices on structured grids and
(after a simple transform: subtracting the row number) gotten
decompression speed in the neighborhood of 10 GB/s (i.e., faster per
core than DRAM).  I've been meaning to follow up.  The transformation
described below (splitting the bytes) is yielding decompression speed
around 1GB/s (in this link below), which isn't competitive for things
like MatMult, but could be useful for things like trajectory
checkpointing.

https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
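
[For illustration: a sketch of the transform described above for a CSR matrix,
assuming standard rowptr/colidx arrays; the row number is subtracted from each
column index before compression and added back after decompression:]

/* make column indices relative to their row; on a structured grid every row
   then produces the same small set of offsets, which compresses very well */
static void DeltaEncodeColumns(int nrows, const int *rowptr, int *colidx)
{
  int i, j;
  for (i = 0; i < nrows; i++) {
    for (j = rowptr[i]; j < rowptr[i+1]; j++) colidx[j] -= i;
  }
}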




-- Forwarded message --
From: "Radev, Martin" mailto:martin.ra...@tum.de>>
To: "d...@arrow.apache.org" 
mailto:d...@arrow.apache.org>>
Cc: "Raoofy, Amir" mailto:amir.rao...@tum.de>>, 
"Karlstetter, Roman" mailto:roman.karlstet...@tum.de>>
Bcc:
Date: Thu, 11 Jul 2019 09:55:03 +
Subject: Re: Adding a new encoding for FP data
Hello Liya Fan,


this explains the technique but for a more complex case:

https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/

For FP data, the approach which seemed to be the best is the following.

Say we have a buffer of two 32-bit floating point values:

buf = [af, bf]

We interpret each FP value as a 32-bit uint and look at each individual byte. 
We have 8 bytes in total for this small input.

buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]

Then we apply stream splitting and the new buffer becomes:

newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]

We compress newbuf.

Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we 
might have a lot more repetitions in the data. For scientific data, the 2nd and 3rd 
byte for 32-bit data is probably largely noise. Thus in the original 
representation we would always have a few bytes of data which could appear 
somewhere else in the buffer and then a couple bytes of possible noise. In the 
new representation we have a long stream of data which could compress well and 
then a sequence of noise towards the end.
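
[For illustration: a sketch of the stream-splitting transform for an array of
32-bit floats; not the actual Parquet/Arrow patch:]

#include <stdint.h>
#include <string.h>

/* write byte k of every float into the k-th contiguous block of the output,
   so the similar sign/exponent bytes sit together and the noisy low mantissa
   bytes are pushed to the end of the buffer */
static void StreamSplitFloats(const float *in, uint8_t *out, size_t n)
{
  size_t i, k;
  for (i = 0; i < n; i++) {
    uint8_t bytes[4];
    memcpy(bytes, &in[i], 4);  /* reinterpret the float as 4 raw bytes */
    for (k = 0; k < 4; k++) out[k*n + i] = bytes[k];
  }
}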

This transformation improved compression ratio as can be seen in the report.

It also improved speed for ZSTD. This could be because ZSTD makes a decision of 
how to compress the data - RLE, new huffman tree, huffman tree of the previous 
frame, raw representation. Each can potentially achieve a different compression 
ratio and compression/decompression speed. It turned out that when the 
transformation is applied, zstd would attempt to compress fewer frames and copy 
the other. This could lead to less attempts to build a huffman tree. It's hard 
to pin-point the exact reason.

I did not try other lossless text compressors but I expect similar results.

For code, I can polish my patches, create a Jira task and submit the patches 
for review.


Regards,

Martin



From: Fan Liya mailto:liya.fa...@gmail.com>>
Sent: Thursday, July 11, 2019 11:32:53 AM
To: d...@arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Radev,

Thanks for the information. It seems interesting.
IMO, Arrow has much to do for data compression. However, it seems there are
some differences between in-memory data compression and external-storage data
compression.

Could you please provide some reference for stream splitting?

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin 
mailto:martin.ra...@tum.de>> wrote:

> Hello people,
>
>
> there has been discussion in the Apache Parquet mailing list on adding a
> new encoder for FP data.
> The reason for this is that the supported compressors by Apache Parquet
> (zstd, gzip, etc) do not compress well raw FP data.
>
>
> In my investigation it turns out that a very simple technique,
> named stream splitting, can improve the compression ratio and even speed
> for some of the compressors.
>
> You can read about the results here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
> I went through the developer guide for Apache Arrow and wrote a patch to
> add the new encoding and test coverage for it.
>
> I will polish my patch and work in parallel to extend the Apache Parquet
> format for the new encoding.
>
>
> If you have any concerns, please let me know.
>
>
> Regards,
>
> Martin
>
>


Re: [petsc-dev] Fwd: PETSc blame digest (next-tmp) 2019-07-17

2019-07-17 Thread Zhang, Junchao via petsc-dev
Fande did a fantastic job using C++11 in the test code. I modified it so it 
does not use C++11 anymore. But I think Satish's code is still useful and we 
should have it in a new PR.
Another serious issue using Sun C++ with this C++ test code:

"/export/home/jczhang/petsc/include/petscmat.h", line 153: Warning 
(Anachronism): Formal argument 4 of type extern "C" 
int(*)(_p_Mat*,MatFactorType,_p_Mat**) in call to MatSolverTypeRegister(const 
char*, const char*, MatFactorType, extern "C" 
int(*)(_p_Mat*,MatFactorType,_p_Mat**)) is being passed 
int(*)(_p_Mat*,MatFactorType,_p_Mat**).
"/export/home/jczhang/petsc/include/petscmat.h", line 155: Error: Formal 
argument 6 of type extern "C" int(*)(_p_Mat*,MatFactorType,_p_Mat**)* in call 
to MatSolverTypeGet(const char*, const char*, MatFactorType, PetscBool*, 
PetscBool*, extern "C" int(*)(_p_Mat*,MatFactorType,_p_Mat**)*) is being passed 
int(*)(_p_Mat*,MatFactorType,_p_Mat**)*.
1 Error(s) and 1 Warning(s) detected.

I googled and found it only happens with Sun compilers. The logic is: we have
PETSC_EXTERN PetscErrorCode 
MatSolverTypeRegister(MatSolverType,MatType,MatFactorType,PetscErrorCode(*)(Mat,MatFactorType,Mat*));
Sun C++ thinks the function pointer argument is in C linkage.

We also have
PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageRegister(MatSolverType 
stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
{ return MatSolverTypeRegister(stype,mtype,ftype,f); }

Without PETSC_EXTERN, Sun C++ thinks the function pointer argument is in C++ 
linkage. This mismatch caused the warning and errors. I have to add 
PETSC_EXTERN in petscmat.h to get rid of them.  Other header files may also 
have this problem, but thanks to Fande for only including petscmat.h :)

diff --git a/include/petscmat.h b/include/petscmat.h
index 95c2b825..08300bc2 100644
--- a/include/petscmat.h
+++ b/include/petscmat.h
@@ -149,9 +149,9 @@ PETSC_EXTERN PetscErrorCode 
MatSetFactorType(Mat,MatFactorType);
 PETSC_EXTERN PetscErrorCode 
MatSolverTypeRegister(MatSolverType,MatType,MatFactorType,PetscErrorCode(*)(Mat,MatFactorType,Mat*));
 PETSC_EXTERN PetscErrorCode 
MatSolverTypeGet(MatSolverType,MatType,MatFactorType,PetscBool*,PetscBool*,PetscErrorCode
 (**)(Mat,MatFactorType,Mat*));
 typedef MatSolverType MatSolverPackage PETSC_DEPRECATED_TYPEDEF("Use 
MatSolverType (since version 3.9)");
-PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageRegister(MatSolverType 
stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
+PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_EXTERN PETSC_STATIC_INLINE PetscErrorCode 
MatSolverPackageRegister(MatSolverType stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
 { return MatSolverTypeRegister(stype,mtype,ftype,f); }
-PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeGet() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageGet(MatSolverType 
stype,MatType mtype,MatFactorType ftype,PetscBool *foundmtype,PetscBool 
*foundstype,PetscErrorCode(**f)(Mat,MatFactorType,Mat*))
+PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeGet() (since version 3.9)") 
PETSC_EXTERN PETSC_STATIC_INLINE PetscErrorCode 
MatSolverPackageGet(MatSolverType stype,MatType mtype,MatFactorType 
ftype,PetscBool *foundmtype,PetscBool 
*foundstype,PetscErrorCode(**f)(Mat,MatFactorType,Mat*))
 { return MatSolverTypeGet(stype,mtype,ftype,foundmtype,foundstype,f); }

--Junchao Zhang


On Wed, Jul 17, 2019 at 5:06 PM Balay, Satish 
mailto:ba...@mcs.anl.gov>> wrote:
Perhaps you can use the following in your PR? But then - you have to check for 
both PETSC_CXX_DIALECT_CXX14 and PETSC_CXX_DIALECT_CXX11 in your code.

#define PETSC_CXX_DIALECT_CXX14 1

>>
diff --git a/config/BuildSystem/config/compilers.py 
b/config/BuildSystem/config/compilers.py
index c29cc2a67a..4775318a7d 100644
--- a/config/BuildSystem/config/compilers.py
+++ b/config/BuildSystem/config/compilers.py
@@ -525,6 +525,8 @@ class Configure(config.base.Configure):

 self.setCompilers.popLanguage()
 self.logWrite(self.setCompilers.restoreLog())
+if self.cxxdialect:
+  self.addDefine('CXX_DIALECT_'+self.cxxdialect.upper().replace('+','X'),1)
 return

   def checkCxxLibraries(self):
<<
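
[For illustration: with the first option, the test could then guard its
C++11-dependent parts on the proposed define, roughly like this sketch:]

/* sketch of a test translation unit using the proposed configure-time define */
#include <petscsys.h>
int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
#if defined(PETSC_CXX_DIALECT_CXX14) || defined(PETSC_CXX_DIALECT_CXX11)
  /* C++11-only checks would go here */
#endif
  ierr = PetscFinalize();
  return ierr;
}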

Or does the following make more sense? [but I suspect this is useful
for CPP which supports '>' etc.. comparison operations - but perhaps
not the test suite]

#define PETSC_CXX_DIALECT 14


>>>
diff --git a/config/BuildSystem/config/compilers.py 
b/config/BuildSystem/config/compilers.py
index c29cc2a67a..b391e4a3b0 100644
--- a/config/BuildSystem/config/compilers.py
+++ b/config/BuildSystem/config/compilers.py
@@ -525,6 +525,8 @@ class Configure(config.base.Configure):

 self.setCompilers.popLanguage()
 self.logWrite

Re: [petsc-dev] Fwd: PETSc blame digest (next-tmp) 2019-07-17

2019-07-18 Thread Zhang, Junchao via petsc-dev
I changed the include file from petscmat.h to petsc.h, and one more similar warning 
showed up, in petsctao.h. So we can conclude Sun C++ works for all headers except 
petscmat.h and petsctao.h.
My previous fix does not work on Linux because PETSC_EXTERN conflicts with 
PETSC_STATIC_INLINE.
I have to give up using Sun C++ on this test. I added PETSC_HAVE_SUN_CXX for 
that. This PR should be clear now.
--Junchao Zhang


On Wed, Jul 17, 2019 at 5:34 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Fande did a fantastic job using C++11 in the test code. I modified it so it 
does not use C++11 anymore. But I think Satish's code is still useful and we 
should have it in a new PR.
Another serious issue using Sun C++ with this C++ test code:

"/export/home/jczhang/petsc/include/petscmat.h", line 153: Warning 
(Anachronism): Formal argument 4 of type extern "C" 
int(*)(_p_Mat*,MatFactorType,_p_Mat**) in call to MatSolverTypeRegister(const 
char*, const char*, MatFactorType, extern "C" 
int(*)(_p_Mat*,MatFactorType,_p_Mat**)) is being passed 
int(*)(_p_Mat*,MatFactorType,_p_Mat**).
"/export/home/jczhang/petsc/include/petscmat.h", line 155: Error: Formal 
argument 6 of type extern "C" int(*)(_p_Mat*,MatFactorType,_p_Mat**)* in call 
to MatSolverTypeGet(const char*, const char*, MatFactorType, PetscBool*, 
PetscBool*, extern "C" int(*)(_p_Mat*,MatFactorType,_p_Mat**)*) is being passed 
int(*)(_p_Mat*,MatFactorType,_p_Mat**)*.
1 Error(s) and 1 Warning(s) detected.

I googled and found it only happens with Sun compilers. The logic is: we have
PETSC_EXTERN PetscErrorCode 
MatSolverTypeRegister(MatSolverType,MatType,MatFactorType,PetscErrorCode(*)(Mat,MatFactorType,Mat*));
Sun C++ thinks the function pointer argument is in C linkage.

We also have
PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageRegister(MatSolverType 
stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
{ return MatSolverTypeRegister(stype,mtype,ftype,f); }

Without PETSC_EXTERN, Sun C++ thinks the function pointer argument is in C++ 
linkage. This mismatch caused the warning and errors. I have to add 
PETSC_EXTERN in petscmat.h to get rid of them.  Other header files may also 
have this problem, but thanks to Fande for only including petscmat.h :)

diff --git a/include/petscmat.h b/include/petscmat.h
index 95c2b825..08300bc2 100644
--- a/include/petscmat.h
+++ b/include/petscmat.h
@@ -149,9 +149,9 @@ PETSC_EXTERN PetscErrorCode 
MatSetFactorType(Mat,MatFactorType);
 PETSC_EXTERN PetscErrorCode 
MatSolverTypeRegister(MatSolverType,MatType,MatFactorType,PetscErrorCode(*)(Mat,MatFactorType,Mat*));
 PETSC_EXTERN PetscErrorCode 
MatSolverTypeGet(MatSolverType,MatType,MatFactorType,PetscBool*,PetscBool*,PetscErrorCode
 (**)(Mat,MatFactorType,Mat*));
 typedef MatSolverType MatSolverPackage PETSC_DEPRECATED_TYPEDEF("Use 
MatSolverType (since version 3.9)");
-PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageRegister(MatSolverType 
stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
+PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeRegister() (since version 3.9)") 
PETSC_EXTERN PETSC_STATIC_INLINE PetscErrorCode 
MatSolverPackageRegister(MatSolverType stype,MatType mtype,MatFactorType 
ftype,PetscErrorCode(*f)(Mat,MatFactorType,Mat*))
 { return MatSolverTypeRegister(stype,mtype,ftype,f); }
-PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeGet() (since version 3.9)") 
PETSC_STATIC_INLINE PetscErrorCode MatSolverPackageGet(MatSolverType 
stype,MatType mtype,MatFactorType ftype,PetscBool *foundmtype,PetscBool 
*foundstype,PetscErrorCode(**f)(Mat,MatFactorType,Mat*))
+PETSC_DEPRECATED_FUNCTION("Use MatSolverTypeGet() (since version 3.9)") 
PETSC_EXTERN PETSC_STATIC_INLINE PetscErrorCode 
MatSolverPackageGet(MatSolverType stype,MatType mtype,MatFactorType 
ftype,PetscBool *foundmtype,PetscBool 
*foundstype,PetscErrorCode(**f)(Mat,MatFactorType,Mat*))
 { return MatSolverTypeGet(stype,mtype,ftype,foundmtype,foundstype,f); }

--Junchao Zhang


On Wed, Jul 17, 2019 at 5:06 PM Balay, Satish 
mailto:ba...@mcs.anl.gov>> wrote:
Perhaps you can use the following in your PR? But then - you have to check for 
both PETSC_CXX_DIALECT_CXX14 and PETSC_CXX_DIALECT_CXX11 in your code.

#define PETSC_CXX_DIALECT_CXX14 1

>>
diff --git a/config/BuildSystem/config/compilers.py 
b/config/BuildSystem/config/compilers.py
index c29cc2a67a..4775318a7d 100644
--- a/config/BuildSystem/config/compilers.py
+++ b/config/BuildSystem/config/compilers.py
@@ -525,6 +525,8 @@ class Configure(config.base.Configure):

 self.setCompilers.popLanguage()
 self.logWrite(self.setCompilers.restoreLog())
+if self.cxxdialect:
+  self.addDefine('CXX_DIALECT_'+self.cxxdialect.upper().replace('+','X'),1)
 return

   def checkCxxLibraries(self):
<<

Or does the following make more se

Re: [petsc-dev] (no subject)

2019-07-22 Thread Zhang, Junchao via petsc-dev
We should be able to overlap PetscSFReduce. I will have a look. Thanks.
--Junchao Zhang


On Mon, Jul 22, 2019 at 5:23 AM Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:
Junchao,

I found an issue with PetscSFReduceBegin/End. It seems that we can no longer do 
the following (it always worked before):

ierr = PetscSFReduceBegin(sf,MPIU_INT,leaf,root,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceBegin(sf,MPIU_INT,leaf2,root2,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceEnd(sf,MPIU_INT,leaf,root,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceEnd(sf,MPIU_INT,leaf2,root2,MPI_MAX);CHKERRQ(ierr);

You can reproduce with the currrent master
$ cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
$ make ex71
$ mpiexec -n 4  ./ex71 -pde_type Elasticity -cells 7,9 -dim 2 -pc_bddc_levels 1 
-pc_bddc_coarsening_ratio 2 -ksp_error_if_not_converged -pc_bddc_monolithic 
-pc_bddc_use_faces -pc_bddc_coarse_pc_bddc_corner_selection 
-pc_bddc_coarse_l1_pc_bddc_corner_selection -mat_partitioning_type average  
-pc_bddc_coarse_pc_bddc_use_deluxe_scaling 
-pc_bddc_coarse_sub_schurs_mat_solver_type petsc

The attached patch shows that if we instead do

ierr = PetscSFReduceBegin(sf,MPIU_INT,leaf,root,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceEnd(sf,MPIU_INT,leaf,root,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceBegin(sf,MPIU_INT,leaf2,root2,MPI_MAX);CHKERRQ(ierr);
ierr = PetscSFReduceEnd(sf,MPIU_INT,leaf2,root2,MPI_MAX);CHKERRQ(ierr);

everything works properly.
I think the issue is with the SF GetPack mechanism, but I couldn't figure out 
where.

--
Stefano


Re: [petsc-dev] Issues with Fortran Interfaces for PetscSort routines

2019-07-29 Thread Zhang, Junchao via petsc-dev
I will have a look. I simply changed PetscSortInt(PetscInt n,PetscInt i[]) to 
PetscSortInt(PetscInt n,PetscInt *X). Don't know why it caused that.
--Junchao Zhang


On Mon, Jul 29, 2019 at 10:14 AM Fabian.Jakub via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Dear Petsc,

Commit 051fd8986cf23c0556f4229193defe128fafa1f7 changed the C signature
of the sorting routines and as a result I cannot compile against them
anymore from Fortran.
I tried to rebuild Petsc from scratch and did a make allfortranstubs but
still to no avail.

I attach a simple fortran program that calls PetscSortInt and gives the
following error at compile time.




petsc_fortran_sort.F90:15:27:

   call PetscSortInt(N, x, ierr)
   1
Error: Rank mismatch in argument ‘b’ at (1) (scalar and rank-1)

Same applies for other routines such as PetscSortIntWithArrayPair...

I am not sure where to find the FortranInterfaces and currently had no
time to dig deeper.

Please let me know if I have missed something stupid.

Many thanks,

Fabian


P.S. Petsc was compiled with
--with-fortran
--with-fortran-interfaces
--with-shared-libraries=1


Re: [petsc-dev] Issues with Fortran Interfaces for PetscSort routines

2019-07-29 Thread Zhang, Junchao via petsc-dev
Fixed in jczhang/fix-sort-fortran-binding and will be in master later. Thanks.
--Junchao Zhang


On Mon, Jul 29, 2019 at 10:14 AM Fabian.Jakub via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Dear Petsc,

Commit 051fd8986cf23c0556f4229193defe128fafa1f7 changed the C signature
of the sorting routines and as a result I cannot compile against them
anymore from Fortran.
I tried to rebuild Petsc from scratch and did a make allfortranstubs but
still to no avail.

I attach a simple fortran program that calls PetscSortInt and gives the
following error at compile time.




petsc_fortran_sort.F90:15:27:

   call PetscSortInt(N, x, ierr)
   1
Error: Rank mismatch in argument ‘b’ at (1) (scalar and rank-1)

Same applies for other routines such as PetscSortIntWithArrayPair...

I am not sure where to find the FortranInterfaces and currently had no
time to dig deeper.

Please let me know if I have missed something stupid.

Many thanks,

Fabian


P.S. Petsc was compiled with
--with-fortran
--with-fortran-interfaces
--with-shared-libraries=1


Re: [petsc-dev] DMDAGlobalToNatural errors with Ubuntu:latest; gcc 7 & Open MPI 2.1.1

2019-07-30 Thread Zhang, Junchao via petsc-dev
Fabian,
  I happen to have an Ubuntu virtual machine and I could reproduce the error with 
your mini-test, even with two processes. It is horrible to see wrong results in 
such a simple test.
  We'd better figure out whether it is a PETSc bug or an OpenMPI bug, and if it is 
the latter, which MPI call is at fault.

--Junchao Zhang


On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Dear Petsc Team,
Our cluster recently switched to Ubuntu 18.04 which has gcc 7.4 and
(Open MPI) 2.1.1 - with this I ended up with segfault and valgrind
errors in DMDAGlobalToNatural.

This is evident in a minimal fortran example such as the attached
example petsc_ex.F90

with the following error:

==22616== Conditional jump or move depends on uninitialised value(s)
==22616==at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)
==22616==by 0x4FA4DAC: PetscMallocA (mal.c:413)
==22616==by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)
==22616==by 0x50A1104: VecScatterSetUp (vscatfce.c:209)
==22616==by 0x509EE3B: VecScatterCreate (vscreate.c:280)
==22616==by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)
==22616==by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)
==22616==by 0x5798446: VecView_MPI_DA (gr2.c:720)
==22616==by 0x51BC7D8: VecView (vector.c:574)
==22616==by 0x4F4ECA1: PetscObjectView (destroy.c:90)
==22616==by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)

and consequently wrong results in the natural vec


I was looking at the Fortran example to see if I forgot something, but I can
also see the same error, i.e. not being valgrind clean, in pure C PETSc:

cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun
--allow-run-as-root -np 2 valgrind ./ex14

I then tried various docker/podman linux distributions to make sure that
my setup is clean and to me it seems that this error is confined to the
particular gcc version 7.4 and (Open MPI) 2.1.1 from the ubuntu:latest repo.

I tried other images from dockerhub including

gcc:7.4.0 :: where I could neither install openmpi nor mpich through
apt, however works with --download-openmpi and --download-mpich

ubuntu:rolling(19.04) <-- work

debian:latest & :stable <-- works

ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich
or with petsc-configure --download-openmpi or --download-mpich


Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I
guess I'll go with a custom mpi install but given that ubuntu:latest is
widely spread, do you think there is an easy solution to the error?

I guess you are not eager to delve into this issue with old mpi versions
but in case you find some spare time, maybe you find the root cause
and/or a workaround.

Many thanks,
Fabian


Re: [petsc-dev] DMDAGlobalToNatural errors with Ubuntu:latest; gcc 7 & Open MPI 2.1.1

2019-07-31 Thread Zhang, Junchao via petsc-dev
Hi, Fabian,
I found it is an OpenMPI bug w.r.t. self-to-self MPI_Send/Recv using 
MPI_ANY_SOURCE for message matching: OpenMPI does not put the correct value in 
the receive buffer.
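
[For illustration: a minimal sketch of the self-to-self MPI_ANY_SOURCE pattern
that triggers the bug; this is not the exact test that was used:]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int         rank, sendval, recvval = -1;
  MPI_Request req;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  sendval = 42 + rank;
  /* post a wildcard receive, then send the message to ourselves */
  MPI_Irecv(&recvval, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
  MPI_Send(&sendval, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  /* on the affected OpenMPI builds recvval can be wrong here */
  if (recvval != sendval) printf("[%d] wrong value received: %d\n", rank, recvval);
  MPI_Finalize();
  return 0;
}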
I have a workaround in the branch jczhang/fix-ubuntu-openmpi-anysource.
I tested it with your petsc_ex.F90 and $PETSC_DIR/src/dm/examples/tests/ex14.
The majority of the valgrind errors disappeared; the few that are left are in 
ompi_mpi_init and we can ignore them.
I filed a bug report to OpenMPI 
https://www.mail-archive.com/users@lists.open-mpi.org//msg33383.html and hope 
they can fix it in Ubuntu.
Thanks.

--Junchao Zhang


On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Dear Petsc Team,
Our cluster recently switched to Ubuntu 18.04 which has gcc 7.4 and
(Open MPI) 2.1.1 - with this I ended up with segfault and valgrind
errors in DMDAGlobalToNatural.

This is evident in a minimal fortran example such as the attached
example petsc_ex.F90

with the following error:

==22616== Conditional jump or move depends on uninitialised value(s)
==22616==at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)
==22616==by 0x4FA4DAC: PetscMallocA (mal.c:413)
==22616==by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)
==22616==by 0x50A1104: VecScatterSetUp (vscatfce.c:209)
==22616==by 0x509EE3B: VecScatterCreate (vscreate.c:280)
==22616==by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)
==22616==by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)
==22616==by 0x5798446: VecView_MPI_DA (gr2.c:720)
==22616==by 0x51BC7D8: VecView (vector.c:574)
==22616==by 0x4F4ECA1: PetscObjectView (destroy.c:90)
==22616==by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)

and consequently wrong results in the natural vec


I was looking at the Fortran example to see if I forgot something, but I can
also see the same error, i.e. not being valgrind clean, in pure C PETSc:

cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun
--allow-run-as-root -np 2 valgrind ./ex14

I then tried various docker/podman linux distributions to make sure that
my setup is clean and to me it seems that this error is confined to the
particular gcc version 7.4 and (Open MPI) 2.1.1 from the ubuntu:latest repo.

I tried other images from dockerhub including

gcc:7.4.0 :: where I could neither install openmpi nor mpich through
apt, however works with --download-openmpi and --download-mpich

ubuntu:rolling(19.04) <-- work

debian:latest & :stable <-- works

ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich
or with petsc-configure --download-openmpi or --download-mpich


Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I
guess I'll go with a custom mpi install but given that ubuntu:latest is
widely spread, do you think there is an easy solution to the error?

I guess you are not eager to delve into this issue with old mpi versions
but in case you find some spare time, maybe you find the root cause
and/or a workaround.

Many thanks,
Fabian


Re: [petsc-dev] DMDAGlobalToNatural errors with Ubuntu:latest; gcc 7 & Open MPI 2.1.1

2019-08-02 Thread Zhang, Junchao via petsc-dev
Some updates for this OpenMPI bug:
 1) It appears in OpenMPI 2.1.x when configured with --enable-heterogeneous, 
which is not a default option and is not commonly used, but Ubuntu somehow used 
it.
 2) OpenMPI fixed it in 3.x.
 3) It was reported to Ubuntu two years ago but is still unassigned: 
https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1731938. A user's 
comment from last year: "We have just spent today hunting down a user bug 
report for Xyce (which uses Trilinos, and its Zoltan library) that turn out to 
be exactly this issue."

--Junchao Zhang


On Wed, Jul 31, 2019 at 2:17 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Hi, Fabian,
I found it is an OpenMPI bug w.r.t. self-to-self MPI_Send/Recv using 
MPI_ANY_SOURCE for message matching: OpenMPI does not put the correct value in 
the receive buffer.
I have a workaround in the branch jczhang/fix-ubuntu-openmpi-anysource.
I tested it with your petsc_ex.F90 and $PETSC_DIR/src/dm/examples/tests/ex14.
The majority of the valgrind errors disappeared; the few that are left are in 
ompi_mpi_init and we can ignore them.
I filed a bug report to OpenMPI 
https://www.mail-archive.com/users@lists.open-mpi.org//msg33383.html and hope 
they can fix it in Ubuntu.
Thanks.

--Junchao Zhang


On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Dear Petsc Team,
Our cluster recently switched to Ubuntu 18.04 which has gcc 7.4 and
(Open MPI) 2.1.1 - with this I ended up with segfault and valgrind
errors in DMDAGlobalToNatural.

This is evident in a minimal fortran example such as the attached
example petsc_ex.F90

with the following error:

==22616== Conditional jump or move depends on uninitialised value(s)
==22616==at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)
==22616==by 0x4FA4DAC: PetscMallocA (mal.c:413)
==22616==by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)
==22616==by 0x50A1104: VecScatterSetUp (vscatfce.c:209)
==22616==by 0x509EE3B: VecScatterCreate (vscreate.c:280)
==22616==by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)
==22616==by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)
==22616==by 0x5798446: VecView_MPI_DA (gr2.c:720)
==22616==by 0x51BC7D8: VecView (vector.c:574)
==22616==by 0x4F4ECA1: PetscObjectView (destroy.c:90)
==22616==by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)

and consequently wrong results in the natural vec


I was looking at the Fortran example to see if I forgot something, but I can
also see the same error, i.e. not being valgrind clean, in pure C PETSc:

cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun
--allow-run-as-root -np 2 valgrind ./ex14

I then tried various docker/podman linux distributions to make sure that
my setup is clean and to me it seems that this error is confined to the
particular gcc version 7.4 and (Open MPI) 2.1.1 from the ubuntu:latest repo.

I tried other images from dockerhub including

gcc:7.4.0 :: where I could neither install openmpi nor mpich through
apt, however works with --download-openmpi and --download-mpich

ubuntu:rolling(19.04) <-- work

debian:latest & :stable <-- works

ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich
or with petsc-configure --download-openmpi or --download-mpich


Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I
guess I'll go with a custom mpi install but given that ubuntu:latest is
widely spread, do you think there is an easy solution to the error?

I guess you are not eager to delve into this issue with old mpi versions
but in case you find some spare time, maybe you find the root cause
and/or a workaround.

Many thanks,
Fabian


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Zhang, Junchao via petsc-dev



On Sat, Aug 31, 2019 at 8:04 PM Mark Adams 
mailto:mfad...@lbl.gov>> wrote:


On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Any explanation for why the scaling is much better for CPUs than for GPUs? Is 
it the "extra" time needed for communication from the GPUs?

The GPU work is well load balanced so it weak scales perfectly. When you put 
that work in the CPU you get more perfectly scalable work added so it looks 
better. For instance, the 98K dof/proc data goes up by about 1/2 sec. from the 
1 node to 512 node case for both GPU and CPU, because this non-scaling is from 
communication that is the same for both cases


  Perhaps you could try the GPU version with Junchao's new CUDA-aware MPI 
branch (in the GitLab merge requests) that can speed up the communication from 
GPUs?

Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?

Use jsrun --smpiargs="-gpu" to enable IBM MPI's CUDA-aware support, then add the 
-use_gpu_aware_mpi option to let PETSc use that feature.



   Barry


> On Aug 30, 2019, at 11:56 AM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
>
> Here is some more weak scaling data with a fixed number of iterations (I have 
> given a test with the numerical problems to ORNL and they said they would 
> give it to Nvidia).
>
> I implemented an option to "spread" the reduced coarse grids across the whole 
> machine as opposed to a "compact" layout where active processes are laid out 
> in simple lexicographical order. This spread approach looks a little better.
>
> Mark
>
> On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Ahh, PGI compiler, that explains it :-)
>
>   Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. 
> The code is just
>
>   *a = (PetscReal)strtod(name,endptr);
>
>   could be a compiler bug.
>
>
>
>
> > On Aug 14, 2019, at 9:23 PM, Mark Adams 
> > mailto:mfad...@lbl.gov>> wrote:
> >
> > I am getting this error with single:
> >
> > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 
> > ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type 
> > aijcusparse -fp_trap
> > [0] 81 global equations, 27 vertices
> > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > [0]PETSC ERROR: The specific exception can be determined by running in a 
> > debugger.  When the
> > [0]PETSC ERROR: debugger traps the signal, the exception can be found with 
> > fetestexcept(0x3e00)
> > [0]PETSC ERROR: where the result is a bitwise OR of the following flags:
> > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400 
> > FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > [0]PETSC ERROR: Try option -start_in_debugger
> > [0]PETSC ERROR: likely location of problem given in stack below
> > [0]PETSC ERROR: -  Stack Frames 
> > 
> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> > [0]PETSC ERROR:   INSTEAD the line number of the start of the function
> > [0]PETSC ERROR:   is given.
> > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > [0]PETSC ERROR: [0] PetscStrtod line 1964 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015 
> > /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > [0]PETSC ERROR: [0] KSPSetFromOptions line 329 
> > /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > [0]PETSC ERROR: [0] SNESSetFromOptions line 869 
> > /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > [0]PETSC ERROR: - Error Message 
> > --
> > [0]PETSC ERROR: Floating point exception
> > [0]PETSC ERROR: trapped floating point error
> > [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> > for trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT 
> > Date: 2019-08-13 06:33:29 -0400
> > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named 
> > h36n11 by adams Wed Aug 14 22:21:56 2019
> > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC 
> > --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon" 
> > FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0 
> > --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc 
> > CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis 
> > --download-fblaslapack --with-x=0 --with-64-bit-indices=0 
> > --with-debugging=1 PETSC_ARCH=arch-summit-dbg-

[petsc-dev] PetscCUDAInitialize

2019-09-18 Thread Zhang, Junchao via petsc-dev
Barry,

I saw you added these in init.c
+  -cuda_initialize - do the initialization in PetscInitialize()

Notes:
   Initializing cuBLAS takes about 1/2 second there it is done by default in 
PetscInitialize() before logging begins

But I did not get the other case: with -cuda_initialize 0, when will CUDA be 
initialized?
--Junchao Zhang


Re: [petsc-dev] PetscCUDAInitialize

2019-09-19 Thread Zhang, Junchao via petsc-dev
I saw your update. In PetscCUDAInitialize we have

  /* First get the device count */
  err   = cudaGetDeviceCount(&devCount);

  /* next determine the rank and then set the device via a mod */
  ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
  device = rank % devCount;
}
err = cudaSetDevice(device);

If we rely on the first CUDA call to do the initialization, how could CUDA know 
about this MPI information?
--Junchao Zhang


On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Fixed the docs. Thanks for pointing out the lack of clarity


> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Barry,
>
> I saw you added these in init.c
>
>
> +  -cuda_initialize - do the initialization in PetscInitialize()
>
>
>
>
>
>
>
>
> Notes:
>
>Initializing cuBLAS takes about 1/2 second there it is done by default in 
> PetscInitialize() before logging begins
>
>
>
> But I did not get otherwise with -cuda_initialize 0, when will cuda be 
> initialized?
> --Junchao Zhang



Re: [petsc-dev] PetscCUDAInitialize

2019-09-19 Thread Zhang, Junchao via petsc-dev
On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:


> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I saw your update. In PetscCUDAInitialize we have
>
>
>
>
>
>   /* First get the device count */
>
>   err   = cudaGetDeviceCount(&devCount);
>
>
>
>
>   /* next determine the rank and then set the device via a mod */
>
>   ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
>
>   device = rank % devCount;
>
> }
>
> err = cudaSetDevice(device);
>
>
>
>
>
> If we rely on the first CUDA call to do initialization, how could CUDA know 
> these MPI stuff.

  It doesn't, so it does whatever it does (which may be dumb).

  Are you proposing something?

No. My test failed in CI with -cuda_initialize 0 on frog, but I could not 
reproduce it. I'm investigating.

  Barry

>
> --Junchao Zhang
>
>
>
> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Fixed the docs. Thanks for pointing out the lack of clarity
>
>
> > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >
> > Barry,
> >
> > I saw you added these in init.c
> >
> >
> > +  -cuda_initialize - do the initialization in PetscInitialize()
> >
> >
> >
> >
> >
> >
> >
> >
> > Notes:
> >
> >Initializing cuBLAS takes about 1/2 second there it is done by default 
> > in PetscInitialize() before logging begins
> >
> >
> >
> > But I did not get otherwise with -cuda_initialize 0, when will cuda be 
> > initialized?
> > --Junchao Zhang
>



Re: [petsc-dev] Master broken after changes to PetscSection headers

2019-09-19 Thread Zhang, Junchao via petsc-dev
Could this problem be fixed right now? It shows up in stage-1 tests and blocks 
further tests.

--Junchao Zhang


On Thu, Sep 19, 2019 at 4:15 AM Lisandro Dalcin via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
The warnings below are from a C build. A C++ build fails right away.

We need to `#include ` somewhere in the public headers, at 
least such that  `#include ` in user code works.

$ git grep petscsection include/
include/petsc/private/sectionimpl.h:#include 
include/petscis.h:#include 
include/petscsection.h:#include 


/home/devel/petsc/dev/src/snes/utils/convest.c: In function 
‘PetscConvEstGetConvRate’:
/home/devel/petsc/dev/src/snes/utils/convest.c:293:14: warning: implicit 
declaration of function ‘PetscSectionGetField’; did you mean 
‘PetscSectionVecView’? [-Wimplicit-function-declaration]
  293 |   ierr = PetscSectionGetField(s, f, &fs);CHKERRQ(ierr);
  |  ^~~~
  |  PetscSectionVecView
/home/devel/petsc/dev/src/snes/utils/convest.c:294:14: warning: implicit 
declaration of function ‘PetscSectionGetConstrainedStorageSize’ 
[-Wimplicit-function-declaration]
  294 |   ierr = PetscSectionGetConstrainedStorageSize(fs, 
&lsize);CHKERRQ(ierr);
  |  ^

--
Lisandro Dalcin

Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/


Re: [petsc-dev] PetscCUDAInitialize

2019-09-19 Thread Zhang, Junchao via petsc-dev
All failed tests just said "application called MPI_Abort" and had no stack 
trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
initialization when it is not needed. Let's see the new test results.

not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
#   application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

--Junchao Zhang


On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

 Failed?  Means nothing, send link or cut and paste error

 It could be that since we have multiple separate tests running at the same 
time they overload the GPU or cause some inconsistent behavior that doesn't 
appear every time the tests are run.

   Barry

Maybe we need to serialize all the tests that use the GPUs; we just trust 
gnumake for the parallelism. Maybe you could somehow add dependencies to get 
gnu make to achieve this?




> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>
> > On Sep 19, 2019, at 2:50 PM, Zhang, Junchao 
> > mailto:jczh...@mcs.anl.gov>> wrote:
> >
> > I saw your update. In PetscCUDAInitialize we have
> >
> >
> >
> >
> >
> >   /* First get the device count */
> >
> >   err   = cudaGetDeviceCount(&devCount);
> >
> >
> >
> >
> >   /* next determine the rank and then set the device via a mod */
> >
> >   ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> >
> >   device = rank % devCount;
> >
> > }
> >
> > err = cudaSetDevice(device);
> >
> >
> >
> >
> >
> > If we rely on the first CUDA call to do initialization, how could CUDA know 
> > these MPI stuff.
>
>   It doesn't, so it does whatever it does (which may be dumb).
>
>   Are you proposing something?
>
> No. My test failed in CI with -cuda_initialize 0 on frog but I could not 
> reproduce it. I'm doing investigation.
>
>   Barry
>
> >
> > --Junchao Zhang
> >
> >
> >
> > On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. 
> > mailto:bsm...@mcs.anl.gov>> wrote:
> >
> >   Fixed the docs. Thanks for pointing out the lack of clarity
> >
> >
> > > On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
> > > mailto:petsc-dev@mcs.anl.gov>> wrote:
> > >
> > > Barry,
> > >
> > > I saw you added these in init.c
> > >
> > >
> > > +  -cuda_initialize - do the initialization in PetscInitialize()
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Notes:
> > >
> > >Initializing cuBLAS takes about 1/2 second there it is done by default 
> > > in PetscInitialize() before logging begins
> > >
> > >
> > >
> > > But I did not get otherwise with -cuda_initialize 0, when will cuda be 
> > > initialized?
> > > --Junchao Zhang
> >



[petsc-dev] Configure hangs on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
My configure hangs on Summit at
  TESTING: configureMPIEXEC from 
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:170)

On this machine one has to use a batch script to submit jobs. So why do we need 
configureMPIEXEC? Do I need to use --with-batch? I remember we removed that.

--Junchao Zhang


Re: [petsc-dev] Configure hangs on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Satish's trick --with-mpiexec=/bin/true solved the problem.  Thanks.
--Junchao Zhang


On Fri, Sep 20, 2019 at 3:50 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
My configure hangs on Summit at
  TESTING: configureMPIEXEC from 
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:170)

On this machine one has to use a batch script to submit jobs. So why do we need 
configureMPIEXEC? Do I need to use --with-batch? I remember we removed that.

--Junchao Zhang


Re: [petsc-dev] Configure hangs on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Richard,
  I almost copied arch-olcf-summit-opt.py. The hanging is random. I met it a few 
weeks ago; I retried and it passed. It happened today when I did a fresh 
configure.
  On Summit login nodes, mpiexec is actually in everyone's PATH. I did "ps ux" 
and found the script was executing "mpiexec ... "
--Junchao Zhang


On Fri, Sep 20, 2019 at 8:59 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Hi Junchao,

Glad you've found a workaround, but I don't know why you are hitting this 
problem. The last time I built PETSc on Summit (just a couple days ago), I 
didn't have this problem. I'm working from the example template that's in the 
PETSc repo at config/examples/arch-olcf-summit-opt.py.

Can you point me to your configure script on Summit so I can try to reproduce 
your problem?

--Richard

On 9/20/19 4:25 PM, Zhang, Junchao via petsc-dev wrote:
Satish's trick --with-mpiexec=/bin/true solved the problem.  Thanks.
--Junchao Zhang


On Fri, Sep 20, 2019 at 3:50 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
My configure hangs on Summit at
  TESTING: configureMPIEXEC from 
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:170)

On this machine one has to use a batch script to submit jobs. So why do we need 
configureMPIEXEC? Do I need to use --with-batch? I remember we removed that.

--Junchao Zhang



[petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
I downloaded a sparse matrix (HV15R) 
from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran 
the same MatMult 100 times on one node of Summit with -mat_type aijcusparse 
-vec_type cuda. I found MatMult was almost dominated by VecScatter in this 
simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve 
performance. But if I enabled the Multi-Process Service on Summit and used 24 ranks 
+ 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and will have to 
profile it. I will also collect data with multiple nodes. Are the matrix and the 
tests appropriate?
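
[For illustration: the ex900 driver above is not in the PETSc repository; a
minimal equivalent sketch of the benchmark, assuming a PETSc binary matrix
file, is roughly:]

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    viewer;
  char           file[PETSC_MAX_PATH_LEN];
  PetscInt       i, n = 100;
  PetscBool      flg;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);   /* picks up -mat_type aijcusparse */
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  for (i = 0; i < n; i++) { ierr = MatMult(A, x, y);CHKERRQ(ierr); }  /* timed via -log_view */
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}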


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks (CPU version)
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00  100 
2.69e+02  0

6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 
6.72e+01 100
VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0  34  0100100  0 0   0  0 0.00e+00  100 
6.72e+01  0
VecScatterEnd100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  42  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  0  0  0  0 0   0100 4.61e+010 
0.00e+00  0
VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  29  0  0  0  0 0   0  0 0.00e+00  100 
6.72e+01  0

24 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 387864   9733910 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  1  0 97 25  0  35  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  48  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0


--Junchao Zhang


Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Click the links to visualize it.

6 ranks
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution 
packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse 
-vec_type cuda -n 100 -log_view

24 ranks
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution 
packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse 
-vec_type cuda -n 100 -log_view

--Junchao Zhang


On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Junchao,

Can you share your 'jsrun' command so that we can see how you are mapping 
things to resource sets?

--Richard

On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
I downloaded a sparse matrix (HV15R<https://sparse.tamu.edu/Fluorem/HV15R>) 
from Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran 
the same MatMult 100 times on one node of Summit with -mat_type aijcusparse 
-vec_type cuda. I found MatMult was almost dominated by VecScatter in this 
simple test. Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
performance. But if I enabled Multi-Process Service on Summit and used 24 ranks 
+ 6 GPUs, I found CUDA aware SF hurt performance. I don't know why and have to 
profile it. I will also collect  data with multiple nodes. Are the matrix and 
tests proper?


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks (CPU version)
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00  100 
2.69e+02  0

6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 
6.72e+01 100
VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0  34  0100100  0 0   0  0 0.00e+00  100 
6.72e+01  0
VecScatterEnd100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  42  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  0  0  0  0 0   0100 4.61e+010 
0.00e+00  0
VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  29  0  0  0  0 0   0  0 0.00e+00  100 
6.72e+01  0

24 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 1001

Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Zhang, Junchao via petsc-dev
Here are CPU version results on one node with 24 cores, 42 cores. Click the 
links for core layout.

24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

--Junchao Zhang


On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

   Very interesting. For completeness please also run 24 and 42 CPUs without 
the GPUs. Note that the default layout for CPU cores is not good. You will want 
3 cores on each socket (for the 6-rank case), then 12 on each (for the 24-rank case).

  Thanks

   Barry

  Since Tim is one of our reviewers next week this is a very good test matrix 
:-)


> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Click the links to visualize it.
>
> 6 ranks
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>
> 24 ranks
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>
> --Junchao Zhang
>
>
> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
> Junchao,
>
> Can you share your 'jsrun' command so that we can see how you are mapping 
> things to resource sets?
>
> --Richard
>
> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix Collection. 
>> Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node 
>> of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was 
>> almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 
>> GPUs,  I found CUDA aware SF could improve performance. But if I enabled 
>> Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found CUDA 
>> aware SF hurt performance. I don't know why and have to profile it. I will 
>> also collect  data with multiple nodes. Are the matrix and tests proper?
>>
>> 
>> EventCount  Time (sec) Flop  
>> --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu 
>> - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---
>> 6 MPI ranks (CPU version)
>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
>> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
>> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
>> 0.00e+00  0
>>
>> 6 MPI ranks + 6 GPUs + regular SF
>> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
>> 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 
>> 2.69e+02 100
>>

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
We log gpu time before/after cusparse calls. 
https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
But according to 
https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, 
cusparse is asynchronous. Does that mean the gpu time is meaningless?
--Junchao Zhang


On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   Hannah, Junchao and Richard,

The on-GPU flop rates for 24 MPI ranks and 24 MPS GPUs look totally funky. 
951558 and 973391 are so much lower than the unvirtualized 3084009 
  and 3133521, and yet the total time to solution is similar for the runs.

Is it possible these are being counted or calculated wrong? If not what 
does this mean? Please check the code that computes them (I can't imagine it is 
wrong but ...)

It means the GPUs are taking 3.x times longer to do the multiplies in the MPS 
case, but where is that time coming from in the other numbers? The communication 
time doesn't drop that much?

I can't present these numbers with this huge inconsistency

Thanks,

   Barry




> On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix Collection. 
> Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node 
> of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was 
> almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 
> GPUs,  I found CUDA aware SF could improve performance. But if I enabled 
> Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found CUDA 
> aware SF hurt performance. I don't know why and have to profile it. I will 
> also collect  data with multiple nodes. Are the matrix and tests proper?
>
> 
> EventCount  Time (sec) Flop   
>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - 
> GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   
> Size  %F
> ---
> 6 MPI ranks (CPU version)
> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 6 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 
> 2.69e+02 100
> VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00  100 
> 2.69e+02  0
> VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+020 
> 0.00e+00  0
> VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00  100 
> 2.69e+02  0
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
> 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+000 
> 0.00e+00 100
> VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 
> 0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 24 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 
> 6.72e+01 100
> VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 
> 0.0e+00  0  0 97 25  0  34  0100100  0 0   0  0 0.00e+00  

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.
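(For reference, a minimal sketch of such a manual timing loop is below. It is only an 
illustration, not the actual test code; the helper name and the assumption that A, x, y 
are already assembled are mine.)

#include <petscmat.h>
#include <petsctime.h>

/* Hypothetical helper: time n repetitions of MatMult with a host timer,
   independent of -log_view. */
static PetscErrorCode TimeMatMult(Mat A,Vec x,Vec y,PetscInt n)
{
  PetscErrorCode ierr;
  PetscLogDouble t0,t1;
  PetscInt       i;

  PetscFunctionBegin;
  ierr = MatMult(A,x,y);CHKERRQ(ierr);                /* warm-up: triggers CPU->GPU copies and cusparse setup */
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* line up all ranks before starting the clock */
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  for (i=0; i<n; i++) {
    ierr = MatMult(A,x,y);CHKERRQ(ierr);
  }
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* wait for the slowest rank */
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"MatMult: %D calls in %g s\n",n,t1-t0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}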

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  20  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   4  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  14  0  0  0  0 0   0  0 0.00e+00  100 
2.69e+02  0

6 MPI ranks + 6 GPUs + regular SF  + No log_view
MatMult: 100 1.0 1.4180e-01 
399268

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult  100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  1 99 97 18  0 100100100100  0 512224   6420750 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   6  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  16  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult: 100 1.0 9.8344e-02 
575717

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  0 99 97 25  0 100100100100  0 489223   708601  100 4.61e+01  100 
6.7

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Zhang, Junchao via petsc-dev
42 cores have better performance.

36 MPI ranks
MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

--Junchao Zhang


On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

Mark has a good point; could you also try for completeness the CPU with 36 
cores and see if it is any better than the 42 core case?

  Barry

  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the 
GPUs for the multiply for this problem size.

> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
>
> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> saturated at that point.
>
> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
> Here are CPU version results on one node with 24 cores, 42 cores. Click the 
> links for core layout.
>
> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> 0.00e+00  0
> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> 0.00e+00  0
>
> --Junchao Zhang
>
>
> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao,
>
>Very interesting. For completeness please run also 24 and 42 CPUs without 
> the GPUs. Note that the default layout for CPU cores is not good. You will 
> want 3 cores on each socket then 12 on each.
>
>   Thanks
>
>Barry
>
>   Since Tim is one of our reviewers next week this is a very good test matrix 
> :-)
>
>
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >
> > Click the links to visualize it.
> >
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> > --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> > HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > --Junchao Zhang
> >
> >
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev 
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> > Junchao,
> >
> > Can you share your 'jsrun' command so that we can see how you are mapping 
> > things to resource sets?
> >
> > --Richard
> >
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix 
> >> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> >> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
> >> found MatMult was almost dominated by VecScatter in this simple test. 
> >> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
> >> performance. But if I enabled Multi-Process Service on Summit and used 24 
> >> ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know why 
>

Re: [petsc-dev] MatMult on Summit

2019-09-22 Thread Zhang, Junchao via petsc-dev



On Sat, Sep 21, 2019 at 11:08 PM Karl Rupp via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Hi Junchao,

thanks, these numbers are interesting.

Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs.
a non-CUDA-aware MPI that still keeps the benefits of your
packing/unpacking routines?

I'd like to get a feeling of where the performance gains come from. Is
it due to the reduced PCI-Express transfer for the scatters (i.e.
packing/unpacking and transferring only the relevant entries) on each
rank, or is it some low-level optimization that makes the MPI-part of
the communication faster? Your current MR includes both; it would be
helpful to know whether we can extract similar benefits for other GPU
backends without having to require "CUDA-awareness" of MPI. If the
benefits are mostly due to the packing/unpacking, we could carry over
the benefits to other GPU backends (e.g. upcoming Intel GPUs) without
having to wait for an "Intel-GPU-aware MPI".

Your argument is fair. I will add this support later. Besides the performance 
benefit, GPU-awareness can simplify users' code; that is why I think all vendors 
will converge on it.
This post https://devblogs.nvidia.com/introduction-cuda-aware-mpi/ has a detailed 
explanation of CUDA-aware MPI. In short, it avoids CPU involvement and 
redundant memory copies.
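As a rough, standalone illustration (not PETSc's SF code; the buffer names are made up), 
the difference comes down to whether the packed device buffer is handed to MPI directly 
or staged through the host first:

#include <mpi.h>
#include <cuda_runtime.h>

/* dbuf: packed send buffer in GPU memory; hbuf: host staging buffer of the same size. */
static void send_packed(double *dbuf,double *hbuf,int n,int dest,int cuda_aware,MPI_Request *req)
{
  if (cuda_aware) {
    /* CUDA-aware MPI: pass the device pointer straight to MPI; the library moves
       the data (possibly via GPUDirect) without an extra host copy. */
    MPI_Isend(dbuf,n,MPI_DOUBLE,dest,0,MPI_COMM_WORLD,req);
  } else {
    /* Regular MPI: stage through the host, then send from the host buffer. */
    cudaMemcpy(hbuf,dbuf,n*sizeof(double),cudaMemcpyDeviceToHost);
    MPI_Isend(hbuf,n,MPI_DOUBLE,dest,0,MPI_COMM_WORLD,req);
  }
}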

Best regards,
Karli


On 9/21/19 6:22 AM, Zhang, Junchao via petsc-dev wrote:
> I downloaded a sparse matrix (HV15R
> <https://sparse.tamu.edu/Fluorem/HV15R>) from Florida Sparse Matrix
> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
> found MatMult was almost dominated by VecScatter in this simple test.
> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve
> performance. But if I enabled Multi-Process Service on Summit and used
> 24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know
> why and have to profile it. I will also collect  data with multiple
> nodes. Are the matrix and tests proper?
>
> 
> EventCount  Time (sec) Flop
>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -
> - GpuToCpu - GPU
> Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>   Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
> Count   Size  %F
> ---
> 6 MPI ranks (CPU version)
> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
>
> 6 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02
>   100 2.69e+02 100
> VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00
>   100 2.69e+02  0
> VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+02
>   0 0.00e+00  0
> VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00
>   100 2.69e+02  0
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05
> 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00
>   0 0.00e+00 100
> VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05
> 0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+00
>   0 0.00e+00  0
> VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+00
>   0 0.00e+00  0
>
> 24 MPI ranks + 6 GPUs + regular SF
> MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
> 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
I also did an OpenMP stream test and then found a mismatch between the OpenMP and 
MPI results. That reminded me of a subtle issue on Summit: each pair of cores shares 
an L2 cache, so one has to place MPI ranks on different pairs to get the best 
bandwidth. See the different bindings at
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores; I assume that means 11 pairs. The new results are below. 
They match what I got from OpenMP. The bandwidth almost doubles 
from 1 to 2 cores per socket. The IBM documentation also says each socket has two memory 
controllers. I could not find the core-to-memory-controller affinity info; I tried 
different bindings but did not find a huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4112260.7  1.9
6159852.8  2.7
8194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0
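(For reference, the rates above are for a Triad-style kernel; a minimal sketch of what 
each rank runs is below, with the array length N and the timing/scaling details omitted.)

#include <stddef.h>

/* Triad: a[j] = b[j] + scalar*c[j]. Each iteration moves 3 doubles (24 bytes),
   so the per-rank rate is roughly 24*N/time; the table sums over ranks. */
static void triad(double *a,const double *b,const double *c,double scalar,size_t N)
{
  size_t j;
  for (j=0; j<N; j++) a[j] = b[j] + scalar*c[j];
}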


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2  59012.28341.00
> 4  70959.14751.20
> 6 106639.98371.81
> 8 138638.69292.35
> 10171125.08732.90
> 12196162.51973.32
> 14215272.78103.65
> 16229562.40403.89
> 18242587.49134.11
> 20251057.17314.25
> 22258569.77944.38
> 24265443.29244.50
> 26266562.78724.52
> 28267043.63674.53
> 30266833.72124.52
> 32267183.84744.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov><mailto:bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>>>
> >>  wrote:
> >>
> >>  Junch

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
The figure does not clearly say that all cores share the L3. Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with we what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4112260.7  1.9
6159852.8  2.7
8194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2  59012.28341.00
> 4  70959.14751.20
> 6 106639.98371.81
> 8 138638.69292.35
> 10171125.08732.90
> 12196162.51973.32
> 14215272.78103.65
> 16229562.40403.89
> 18242587.49134.11
> 20251057.17314.25
> 22258569.77944.38
> 24265443.29244.50
> 26266562.78724.52
> 28267043.63674.53
> 30266833.72124.52
> 32267183.84744.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
It looks like cusparsestruct->stream is always created (not NULL), so I don't understand 
the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries the about 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu<http://aijcusparse.cu>, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.
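(As a concrete, hedged illustration in plain CUDA, not the PETSc code itself: a host timer 
around an asynchronous call measures only the launch overhead unless something synchronizes 
before the timer is read.)

#include <stddef.h>
#include <cuda_runtime.h>
#include <mpi.h>

/* Time an asynchronous device-to-device copy with a host timer. Without the
   sync, the returned time is just the launch cost, not the transfer time. */
static double time_async_copy(double *d_dst,const double *d_src,size_t n,int sync)
{
  double t0 = MPI_Wtime();
  cudaMemcpyAsync(d_dst,d_src,n*sizeof(double),cudaMemcpyDeviceToDevice,0);
  if (sync) cudaDeviceSynchronize();  /* plays the role of WaitForGPU() in this discussion */
  return MPI_Wtime() - t0;
}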

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0   

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
No objection. Thanks.
--Junchao Zhang


On Mon, Sep 23, 2019 at 10:09 PM Karl Rupp 
mailto:r...@iue.tuwien.ac.at>> wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters
create their own streams. This will almost inevitably create races
(there is no synchronization mechanism implemented), unless one calls
WaitForGPU() after each operation. Some of the non-deterministic tests
can likely be explained by this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU()
> to ensure that the calls actually complete. I will make a pass through
> the aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>> It looks cusparsestruct->stream is always created (not NULL).  I don't
>> know logic of the "if (!cusparsestruct->stream)".
>> --Junchao Zhang
>>
>>
>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev
>> mailto:petsc-dev@mcs.anl.gov> 
>> <mailto:petsc-dev@mcs.anl.gov<mailto:petsc-dev@mcs.anl.gov>>> wrote:
>>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
>> the end of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult()
>> with no add case and should just do this all the time, for the
>> purposes of timing if no other reason. Is there some reason to NOT
>> do this because of worries the about effects that these
>> WaitForGPU() invocations might have on performance?
>>
>> I notice other problems in aijcusparse.cu<http://aijcusparse.cu> 
>> <http://aijcusparse.cu>,
>> now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
>> see that we have GPU timing calls around the cusparse_csr_spmv()
>> (but no WaitForGPU() inside the timed region). I believe this is
>> another area in which we get a meaningless timing. It looks like
>> we need a WaitForGPU() there, and then maybe inside the timed
>> region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two
>> WaitForGPU() calls in one function, just to help with getting
>> timings? I don't have a good idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>> I made the following changes:
>>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   PetscFunctionReturn(0);
>>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
>>> The old code swapped the first two lines. Since with
>>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
>>> order to have better overlap.
>>>   ierr =
>>> 
>>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> 
>>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>> 3) Log time directly in the test code so we can also know
>>> execution time without -log_view (hence cuda synchronization). I
>>> manually calculated the Total Mflop/s for these cases for easy
>>> comparison.
>>>
>>> <>
>>>
>>> 
>>> 
>>> EventCount

[petsc-dev] Should v->valid_GPU_array be a bitmask?

2019-10-01 Thread Zhang, Junchao via petsc-dev
Stafano recently modified the following code,

PetscErrorCode VecCreate_SeqCUDA(Vec V)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
  ierr = 
VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
  ierr = VecSet(V,0.0);CHKERRQ(ierr);
  ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
  V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
  PetscFunctionReturn(0);
}

That means if one creates an SEQCUDA vector V and then immediately tests if 
(V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That is 
counterintuitive.  I think we should have
enum 
{PETSC_OFFLOAD_UNALLOCATED=0x0,PETSC_OFFLOAD_GPU=0x1,PETSC_OFFLOAD_CPU=0x2,PETSC_OFFLOAD_BOTH=0x3}
and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?
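A standalone sketch of the proposed semantics (illustrative only; the type name is made up 
and the values mirror the enum above):

typedef enum {
  PETSC_OFFLOAD_UNALLOCATED = 0x0,
  PETSC_OFFLOAD_GPU         = 0x1,
  PETSC_OFFLOAD_CPU         = 0x2,
  PETSC_OFFLOAD_BOTH        = 0x3   /* == PETSC_OFFLOAD_GPU | PETSC_OFFLOAD_CPU */
} PetscOffloadMask;

/* With a bitmask, "is the GPU copy valid?" is one AND test, true for both
   PETSC_OFFLOAD_GPU and PETSC_OFFLOAD_BOTH. */
static int HasValidGPUCopy(PetscOffloadMask m) { return (m & PETSC_OFFLOAD_GPU) != 0; }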

--Junchao Zhang


Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?

2019-10-02 Thread Zhang, Junchao via petsc-dev
Yes, the name valid_GPU_array is very confusing. I read it as valid_places.
--Junchao Zhang


On Wed, Oct 2, 2019 at 1:12 AM Karl Rupp 
mailto:r...@iue.tuwien.ac.at>> wrote:
Hi Junchao,

I recall that Jed already suggested to make this a bitmask ~7 years ago ;-)

On the other hand: If we touch valid_GPU_array, then we should also use
a better name or refactor completely. Code like

  (V->valid_GPU_array & PETSC_OFFLOAD_GPU)

simply isn't intuitive (nor does it make sense) when read aloud.

Best regards,
Karli


On 10/2/19 5:24 AM, Zhang, Junchao via petsc-dev wrote:
> Stafano recently modified the following code,
>
> PetscErrorCode VecCreate_SeqCUDA(Vec V)
> {
>PetscErrorCode ierr;
>
>PetscFunctionBegin;
>ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
>ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
>ierr =
> VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
>ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
>ierr = VecSet(V,0.0);CHKERRQ(ierr);
>ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
> V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
> PetscFunctionReturn(0);
> }
>
> That means if one creates an SEQCUDA vector V and then immediately tests
> if (V->valid_GPU_array == PETSC_OFFLOAD_GPU), the test will fail. That
> is counterintuitive.  I think we should have
>
> enum 
> {PETSC_OFFLOAD_UNALLOCATED=0x0,PETSC_OFFLOAD_GPU=0x1,PETSC_OFFLOAD_CPU=0x2,PETSC_OFFLOAD_BOTH=0x3}
>
>
> and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?
>
> --Junchao Zhang


Re: [petsc-dev] Feed back on report on performance of vector operations on Summit requested

2019-10-10 Thread Zhang, Junchao via petsc-dev
*Better to have an abstract so that readers know your intent and conclusions.

*p.5  "We also launch all jobs using the --launch_distribution cyclic option so 
that MPI ranks are assigned to resource sets in a circular fashion, which we 
deem appropriate for most high performance computing (HPC) algorithms."
Cyclic distribution is fine for these simple Vec ops since there is almost no 
communication, but it cannot be deemed appropriate for most HPC algorithms. I 
assume a packed distribution is better for locality.

*Fig. 1 Left. I would use the diagram at p.11 of 
https://press3.mcs.anl.gov/atpesc/files/2018/08/ATPESC_2018_Track-1_6_7-30_130pm_Hill-Summit_at_ORNL.pdf,
 which is more informative and contains a lot of numbers we can compare with 
 your results, e.g., the peak bandwidth, which you mentioned but did not list.

*2.1 cudaMemcopy ?
 For the two bullets VecAXPY and VecDot, you should clearly list how you counted 
their flops and memory traffic, since those are what you used to calculate bandwidth 
and performance in the report.

*p.12 VecACPY ?
*p.12 I don't understand the difference between the two GPU launch times.

*When appropriate, can you draw a line for the hardware peak bandwidth or flop rate 
in the figures?

*p.13, some bullets are not important and you can mention them earlier in your 
experimental setup.
bullet 4: I think the reason is that to get peak CPU->GPU bandwidth, the CPU buffer 
has to be pinned (i.e., non-pageable).

--Junchao Zhang


On Wed, Oct 9, 2019 at 5:34 PM Smith, Barry F. via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

   We've prepared a short report on the performance of vector operations on 
Summit and would appreciate any feed back including: inconsistencies, lack of 
clarity, incorrect notation or terminology, etc.

   Thanks

Barry, Hannah, and Richard







Re: [petsc-dev] Should v->valid_GPU_array be a bitmask?

2019-10-13 Thread Zhang, Junchao via petsc-dev
I had an MR (already merged to master) that changed the name to v->offloadmask, 
but the behavior did not change: VecCreate_SeqCUDA still allocates on both the CPU 
and the GPU. I believe we should allocate the CPU buffer on demand for VecCUDA.
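A rough sketch of what the on-demand variant could look like (illustration only, reusing 
the helpers from the VecCreate_SeqCUDA code quoted in this thread; it is not what is in 
master):

PetscErrorCode VecCreate_SeqCUDA_GPUOnly(Vec V)  /* hypothetical name */
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
  ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);   /* GPU buffer only; no VecCUDAAllocateCheckHost here */
  ierr = VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
  ierr = VecSet(V,0.0);CHKERRQ(ierr);             /* zero the vector on the GPU */
  V->offloadmask = PETSC_OFFLOAD_GPU;             /* host copy would be allocated lazily on first host access */
  PetscFunctionReturn(0);
}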

--Junchao Zhang


On Sun, Oct 13, 2019 at 12:27 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Yikes, forget about bit flags and names.

  Does this behavior make sense? EVERY CUDA vector allocates memory on both GPU 
and CPU ? Or do I misunderstand the code?

   This seems fundamentally wrong and is different than before. What about the 
dozens of work vectors on the GPU (for example for Krylov methods)? There is no 
reason for them to have memory allocated on the CPU.  In the long run pretty 
much all the matrices and vectors will only reside on the GPU so this seems 
like a step backwards. Does libaxb do this?


   Barry





> On Oct 1, 2019, at 10:24 PM, Zhang, Junchao via petsc-dev 
> mailto:petsc-dev@mcs.anl.gov>> wrote:
>
> Stafano recently modified the following code,
>
>
> PetscErrorCode VecCreate_SeqCUDA(Vec V)
>
> {
>
>   PetscErrorCode ierr;
>
>
>
>   PetscFunctionBegin;
>
>   ierr = PetscLayoutSetUp(V->map);CHKERRQ(ierr);
>
>   ierr = VecCUDAAllocateCheck(V);CHKERRQ(ierr);
>
>   ierr = 
> VecCreate_SeqCUDA_Private(V,((Vec_CUDA*)V->spptr)->GPUarray_allocated);CHKERRQ(ierr);
>
>   ierr = VecCUDAAllocateCheckHost(V);CHKERRQ(ierr);
>
>   ierr = VecSet(V,0.0);CHKERRQ(ierr);
>
>   ierr = VecSet_Seq(V,0.0);CHKERRQ(ierr);
>
>   V->valid_GPU_array = PETSC_OFFLOAD_BOTH;
>
>   PetscFunctionReturn(0);
>
> }
>
>
>
>
> That means if one creates an SEQCUDA vector V and then immediately tests if 
> (V->valid_GPU_array
>  == PETSC_OFFLOAD_GPU), the test will fail. That is
>
> counterintuitive.  I think we should have
>
>
>
>
> enum 
> {PETSC_OFFLOAD_UNALLOCATED=0x0,PETSC_OFFLOAD_GPU=0x1,PETSC_OFFLOAD_CPU=0x2,PETSC_OFFLOAD_BOTH=0x3}
>
>
>
>
>
> and then use if (V->valid_GPU_array & PETSC_OFFLOAD_GPU). What do you think?
>
>
>
> --Junchao Zhang



Re: [petsc-dev] PetscLayoutFindOwner and PetscLayoutFindOwnerIndex

2019-10-16 Thread Zhang, Junchao via petsc-dev
The value of "owner" should fit in PetscMPIInt. But if you change prototype of 
the two functions, you have to change all their uses.
In petsc, values representing MPI ranks are not always of type PetscMPIInt. 
Only those closely tied to MPI routines are in PetscMPIInt.
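For illustration, a small sketch of the usual pattern (assuming the interfaces keep 
PetscInt and the caller casts where an actual MPI rank is needed; the helper name is 
made up):

#include <petscis.h>

/* Hypothetical helper: look up the owning rank and return it as a PetscMPIInt.
   PetscMPIIntCast also checks that the value fits. */
static PetscErrorCode FindOwnerRank(PetscLayout map,PetscInt idx,PetscMPIInt *rank)
{
  PetscErrorCode ierr;
  PetscInt       owner;

  PetscFunctionBegin;
  ierr = PetscLayoutFindOwner(map,idx,&owner);CHKERRQ(ierr);
  ierr = PetscMPIIntCast(owner,rank);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}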

--Junchao Zhang


On Wed, Oct 16, 2019 at 5:19 AM Pierre Jolivet via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
Hello,
These two functions use a parameter “owner” of type PetscInt*.
Shouldn’t this be PetscMPIInt*?
This implies changes left and right, so I want to check I’m not pushing an 
incorrect MR.

Thanks,
Pierre


Re: [petsc-dev] GPU counters

2019-11-06 Thread Zhang, Junchao via petsc-dev
No. For each vector/matrix operation, PETSc computes the flop count analytically, e.g., 
from the number of nonzeros; it does not use hardware counters.
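For example, a minimal sketch of that convention (the helper is hypothetical; the 2*nnz 
count is the same one used in the SpMV code earlier in this thread):

#include <petscmat.h>

/* Log the analytic flop count of one SpMV with matrix A; no hardware counters involved. */
static PetscErrorCode LogSpMVFlops(Mat A)
{
  PetscErrorCode ierr;
  MatInfo        info;

  PetscFunctionBegin;
  ierr = MatGetInfo(A,MAT_LOCAL,&info);CHKERRQ(ierr);    /* info.nz_used = locally stored nonzeros */
  ierr = PetscLogFlops(2.0*info.nz_used);CHKERRQ(ierr);  /* one multiply + one add per nonzero */
  PetscFunctionReturn(0);
}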

--Junchao Zhang


On Wed, Nov 6, 2019 at 8:44 AM Mark Adams via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
I am puzzled.

I am running AMGx now, and I am getting flop counts/rates. How does that 
happen? Does PETSc use hardware counters to get flops?


[petsc-dev] Fw: Foundations Forum

2024-08-19 Thread Zhang, Junchao via petsc-dev

From: Katz, Daniel S. 
Sent: Monday, August 19, 2024 2:46 PM
Cc: Katz, Daniel S. 
Subject: Foundations Forum

Hi.

A member of CORSA has probably previously reached out to you to ask about your 
project's interest in software foundations and sustainability metrics.

If you are not interested in open source software foundations, you can stop 
reading now, and I apologize for bothering you.

At this time, I want to let you know that we are starting a Foundations Forum 
as part of 
CASS.

We'll have the first of our foundations forum meetings at noon CT on Wednesday 
Aug 28.  Connect via
https://urldefense.us/v3/__https://illinois.zoom.us/j/84019968610?pwd=yB9nwobAPnv7eAolvIrn2E1BZoMrIH.1__;!!G_uCfscf7eWS!fSHjn0gFNP9hGFf6apc0yv3xnDKiFcVKaUhw3m0bPNVIxzzzmfvP87AXMK-9HMK2l85fZCLPqY5J35zX3wvK9ofI$
 

Meeting ID: 840 1996 8610, Password: 076304
One tap mobile: +13126266799,,84019968610# US (Chicago)

As mentioned previously, we'll generally try to have in these meetings: a) a 
short talk (e.g., update from a foundation, experience from an SSO with a 
foundation), b) discussion on a predetermined topic, and c) general 
announcements and discussion (related to open source foundations)

Anyone can suggest agenda items, or topics, etc. - so please let me know if you 
have ideas. This first meeting could include brainstorming about future 
organization, or I could talk about Parsl's desire to and process of joining 
NumFOCUS.

Meetings will be open to anyone who wants to join - so please feel free to 
share this

Meetings will be recorded and posted on youtube, with links to them from a part 
of the CORSA website.

Meetings will be announced via the CORSA email list and the CASS #general and 
#wg-foundations slack channels

If you are interested in knowing about future meetings, please join the CORSA 
mailing list (see the bottom of 
https://corsa.center/)
 and/or join the #wg-foundations channel in the CASS Slack (I can invite you 
from inside slack if needed)

And finally, if you would like a meeting between CORSA and your project about 
foundations, we can also schedule one - just let me know.

Thanks,
Dan

--
Daniel S. Katz
Chief Scientist, NCSA
Research Associate Professor, Siebel School of Computing and Data Science
Research Associate Professor, School of Information Sciences (iSchool)
University of Illinois Urbana-Champaign
(217) 244-8000
d.k...@ieee.org or 
dsk...@illinois.edu
https://danielskatz.org