Re: [petsc-users] Communication during MatAssemblyEnd

2019-07-03 Thread Ale Foggia via petsc-users
Thank you Richard for your explanation.

I first changed the way I was running the code. On the machine there's
SLURM and I was using "srun -n  ./my_program.x". I've
seen more than a 60% improvement in execution time just by running with "srun
-n  --ntasks-per-core=1 --cpu-bind=cores
./my_program.x" instead. All the results I'm sharing now were obtained by
running with the second command.

1) I've tested the configuration you suggested and I've found mixed
results. For one system (matrix) size and distribution of memory per node,
there's a small difference in execution time (say, it goes from 208 seconds
to 200 seconds) between the two cases, I_MPI_PIN_DOMAIN=socket and
I_MPI_PIN_DOMAIN=auto:compact. In another case, I found that
I_MPI_PIN_DOMAIN=auto:compact actually gives worse results: the execution
time goes from 5 seconds with "socket" to 11 seconds with "auto:compact".
In that case, I see that the log events that differ in execution time are
MatMult, VecNorm and VecScatterEnd. I also see that the imbalance for those
log events is smaller when using "socket" (hence the time difference).

2) I also tested using I_MPI_PIN_PROCESSOR_LIST=0-63 and I get the same
numbers as with I_MPI_PIN_DOMAIN=socket (I've checked the BIOS numbering and
those logical processors correspond to the first hardware thread of each core).
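
For anyone who wants to double-check that mapping at run time, a small standalone MPI helper like the following (hypothetical, not part of LSQuantumED) prints the logical CPU each rank is currently running on:

/* check_pinning.c -- hypothetical helper, not part of LSQuantumED:
   print the logical CPU (hardware thread) each MPI rank is running on. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int  rank;
  char host[256];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(host, sizeof(host));
  printf("rank %d on %s: logical CPU %d\n", rank, host, sched_getcpu());
  MPI_Finalize();
  return 0;
}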

On Tue, Jun 25, 2019 at 2:32, Mills, Richard Tran () wrote:

> Hi Ale,
>
> I don't know if this has anything to do with the strange performance you
> are seeing, but I notice that some of your Intel MPI settings are
> inconsistent and I'm not sure what you are intending. You have specified a
> value for I_MPI_PIN_DOMAIN and also a value for I_MPI_PIN_PROCESSOR_LIST.
> You can specify the domains that each MPI rank is pinned to by specifying
> one of these, but not both. According to what I found by a Google search
> for the Intel MPI documentation, if you specify both, it is
> I_MPI_PIN_DOMAIN that gets used.
>
> A KNL node has only one socket per node (though it can be made to appear
> as if it has four virtual sockets if booted into SNC-4 mode, which doesn't
> seem to be a popular usage model). If you run with
> "I_MPI_PIN_DOMAIN=socket", then each of your MPI ranks on a node will be
> "pinned" to a domain that consists of all of the CPU cores available in a
> socket, so the MPI processes aren't really pinned at all, and can be
> migrated around by the system. You may see better performance by having
> your MPI implementation pin processes to more restrictive domains. I
> suggest first trying
>
>   I_MPI_PIN_DOMAIN=auto:compact
>
> which I believe is the default Intel MPI behavior. This will create
> domains by dividing the number of logical CPUs (# of hardware threads) by
> the number of MPI processes being started on the node, and they will be
> "compact" in the sense that domain members will be as "close" to each other
> as possible in terms of sharing cores, caches, etc.
>
> I think that Intel MPI is ignoring your I_MPI_PIN_PROCESSOR_LIST setting,
> but I'll go ahead and point out that, if you are running with 64-core KNL
> nodes, there are 4*64 = 256 logical processors available, because each core
> supports four hardware threads. Asking for logical processors 0-63 might
> not actually be using all of the cores, as, depending on the BIOS numbering
> (which can be arbitrary), some of these logical processors may actually be
> hardware threads that share the same core.
>
> Best regards,
> Richard
>
> On 6/24/19 4:19 AM, Ale Foggia via petsc-users wrote:
>
> Yes, I used KNL nodes. If you can perform the test that would be great. Could it
> be that I'm not using the correct configuration of the KNL nodes? These are
> the environment variables I set:
> MKL_NUM_THREADS=1
> OMP_NUM_THREADS=1
> KMP_HW_SUBSET=1t
> KMP_AFFINITY=compact
> I_MPI_PIN_DOMAIN=socket
> I_MPI_PIN_PROCESSOR_LIST=0-63
> MKL_DYNAMIC=0
>
> The code is in https://github.com/amfoggia/LSQuantumED and it has a
> readme to compile it and run it. When I run the test I used only 32
> processors per node, and I used 1024 nodes in total, and it's for nspins=38.
> Thank you
>
> El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao ()
> escribió:
>
>> Ale,
>>   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL
>> nodes  one year ago. He used 32768 processors and called MatAssemblyEnd 118
>> times and it used only 1.5 seconds in total.  So I guess something was
>> wrong with your test. If you can share your code, I can have a test on our
>> machine to see how it goes.
>>  Thanks.
>> --Junchao Zhang
>>
>>
>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
>> wrote:
>>
>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>>> time.  Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
>>> which does global synchronization. The high max/min ratio means load
>>> imbalance. What I do not understand is MatAssemblyEnd. The ratio is 1.0. It
>>> means processors are 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-07-01 Thread Zhang, Junchao via petsc-users
Jose & Ale,
   -ds_method 2 fixed the problem. I used PETSc master (f1480a5c) and SLEPc
master (675b89d7) through --download-slepc. I used MKL
/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/
   I got the following results with 2048 processors.  MatAssemblyEnd looks 
expensive to me. I am looking into it.

--- Event Stage 5: Offdiag

BuildTwoSidedF       1 1.0 1.5201e+00 7345.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  13  0  0  0  0     0
MatAssemblyBegin     1 1.0 1.5201e+00 5371.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  13  0  0  0  0     0
MatAssemblyEnd       1 1.0 1.7720e+00 1.0    0.00e+00 0.0 5.7e+04 1.3e+05 8.0e+00  2  0  0  0  1  39  0 100 100 100  0
VecSet               1 1.0 2.4695e-02 7.9    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--Junchao Zhang


On Mon, Jul 1, 2019 at 4:28 AM Jose E. Roman 
mailto:jro...@dsic.upv.es>> wrote:
You can try the following:
- Try with a different DS method: -ds_method 1  or  -ds_method 2  (see 
DSSetMethod)
- Run with -ds_parallel synchronized (see DSSetParallel)
If it does not help, send a reproducible code to slepc-maint

Jose


> El 1 jul 2019, a las 11:10, Ale Foggia via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> escribió:
>
> Oh, I also got the same error when I switched to the newest version of SLEPc 
> (using OpenBlas), and I don't know where it is coming from.
> Can you tell me which version of SLEPc and PETSc are you using? And, are you 
> using MKL?
> Thanks for trying :)
>
> El vie., 28 jun. 2019 a las 16:57, Zhang, Junchao 
> (mailto:jczh...@mcs.anl.gov>>) escribió:
> Ran with 64 nodes and 32 ranks/node, met  slepc errors and did not know how 
> to proceed :(
>
> [363]PETSC ERROR: - Error Message 
> --
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
> Date: 2019-06-22 21:39:54 +
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
> jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
>
> --Junchao Zhang
>
>
> On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia 
> mailto:amfog...@gmail.com>> wrote:
> Junchao,
> I'm sorry for the late response.
>
> El mié., 26 jun. 2019 a las 16:39, Zhang, Junchao 
> (mailto:jczh...@mcs.anl.gov>>) escribió:
> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your 
> processes may have been killed by the cgroup out-of-memory handler."
>
> I mentioned that I used 1024 nodes and 32 processes on each node because the 
> application needs a lot of memory. I think that for a system of size 38, one 
> needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
> with 512 if it's possible.
>
> I also tried with 128 core with ./main.x 2 ... and got a weird error message  
> "The size of the basis has to be at least equal to the number 
>  of MPI processes used."
>
> The error comes from the fact that you put a system size of only 2 which is 
> too small.
> I can also see the problem in the assembly with system sizes smaller than 38, 
> so you can try with like 30 (for which I also have a log). In that case I run 
> with 64 nodes and 32 processes per node. I think the problem may also fit in 
> 32 nodes.
>
> --Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
> mailto:jczh...@mcs.anl.gov>> wrote:
> Ale,
>   I successfully built your code and submitted a job to the NERSC Cori 
> machine requiring 32768 KNL cores and one and a half hours. It is estimated 
> to run in 3 days. If you also observed the same problem with less cores, what 
> is your input arguments?  Currently, I use what in your log file, ./main.x 38 
> -nn -j1 1.0 -d1 1.0 -eps_type krylovschur

Re: [petsc-users] Communication during MatAssemblyEnd

2019-07-01 Thread Jose E. Roman via petsc-users
You can try the following:
- Try with a different DS method: -ds_method 1  or  -ds_method 2  (see 
DSSetMethod)
- Run with -ds_parallel synchronized (see DSSetParallel)
If it does not help, send a reproducible code to slepc-maint
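
For reference, the same settings can also be applied in code; a minimal sketch (the function name is made up, and it assumes an already created EPS object), to be called before EPSSolve():

#include <slepceps.h>

PetscErrorCode SetDSOptions(EPS eps)
{
  DS             ds;
  PetscErrorCode ierr;

  ierr = EPSGetDS(eps,&ds);CHKERRQ(ierr);                          /* dense (projected) solver used inside EPS */
  ierr = DSSetMethod(ds,2);CHKERRQ(ierr);                          /* equivalent to -ds_method 2 */
  ierr = DSSetParallel(ds,DS_PARALLEL_SYNCHRONIZED);CHKERRQ(ierr); /* equivalent to -ds_parallel synchronized */
  return 0;
}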

Jose


> El 1 jul 2019, a las 11:10, Ale Foggia via petsc-users 
>  escribió:
> 
> Oh, I also got the same error when I switched to the newest version of SLEPc 
> (using OpenBlas), and I don't know where it is coming from.
> Can you tell me which version of SLEPc and PETSc are you using? And, are you 
> using MKL?
> Thanks for trying :)
> 
> El vie., 28 jun. 2019 a las 16:57, Zhang, Junchao () 
> escribió:
> Ran with 64 nodes and 32 ranks/node, met  slepc errors and did not know how 
> to proceed :(
> 
> [363]PETSC ERROR: - Error Message 
> --
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
> Date: 2019-06-22 21:39:54 +
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
> jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> 
> --Junchao Zhang
> 
> 
> On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia  wrote:
> Junchao,
> I'm sorry for the late response.
> 
> El mié., 26 jun. 2019 a las 16:39, Zhang, Junchao () 
> escribió:
> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your 
> processes may have been killed by the cgroup out-of-memory handler."
> 
> I mentioned that I used 1024 nodes and 32 processes on each node because the 
> application needs a lot of memory. I think that for a system of size 38, one 
> needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
> with 512 if it's possible.
> 
> I also tried with 128 core with ./main.x 2 ... and got a weird error message  
> "The size of the basis has to be at least equal to the number 
>  of MPI processes used."
> 
> The error comes from the fact that you put a system size of only 2 which is 
> too small.
> I can also see the problem in the assembly with system sizes smaller than 38, 
> so you can try with like 30 (for which I also have a log). In that case I run 
> with 64 nodes and 32 processes per node. I think the problem may also fit in 
> 32 nodes.
> 
> --Junchao Zhang
> 
> 
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang  wrote:
> Ale,
>   I successfully built your code and submitted a job to the NERSC Cori 
> machine requiring 32768 KNL cores and one and a half hours. It is estimated 
> to run in 3 days. If you also observed the same problem with less cores, what 
> is your input arguments?  Currently, I use what in your log file, ./main.x 38 
> -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
>   The smaller the better. Thanks.
> --Junchao Zhang
> 
> 
> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia  wrote:
> Yes, I used KNL nodes. If you can perform the test that would be great. Could it be 
> that I'm not using the correct configuration of the KNL nodes? These are the 
> environment variables I set:
> MKL_NUM_THREADS=1
> OMP_NUM_THREADS=1
> KMP_HW_SUBSET=1t
> KMP_AFFINITY=compact
> I_MPI_PIN_DOMAIN=socket
> I_MPI_PIN_PROCESSOR_LIST=0-63
> MKL_DYNAMIC=0
> 
> The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
> compile it and run it. When I run the test I used only 32 processors per 
> node, and I used 1024 nodes in total, and it's for nspins=38.
> Thank you
> 
> El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao () 
> escribió:
> Ale,
>   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes 
>  one year ago. He used 32768 processors and called MatAssemblyEnd 118 times 
> and it used only 1.5 seconds in total.  So I guess something was wrong with 
> your test. If you can 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-28 Thread Zhang, Junchao via petsc-users
Ran with 64 nodes and 32 ranks/node, hit SLEPc errors and did not know how to 
proceed :(

[363]PETSC ERROR: - Error Message 
--
[363]PETSC ERROR: Error in external library
[363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
[363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
Date: 2019-06-22 21:39:54 +
[363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
jczhang Fri Jun 28 07:26:59 2019
[1225]PETSC ERROR: #2 DSSolve() line 586 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
[1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
[1225]PETSC ERROR: #4 EPSSolve() line 149 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
[240]PETSC ERROR: #2 DSSolve() line 586 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
[240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
[240]PETSC ERROR: #4 EPSSolve() line 149 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c

--Junchao Zhang


On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Junchao,
I'm sorry for the late response.

On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
The job got a chance to run but failed with out-of-memory, "Some of your 
processes may have been killed by the cgroup out-of-memory handler."

I mentioned that I used 1024 nodes and 32 processes on each node because the 
application needs a lot of memory. I think that for a system of size 38, one 
needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
with 512 if it's possible.

I also tried with 128 core with ./main.x 2 ... and got a weird error message  
"The size of the basis has to be at least equal to the number   
   of MPI processes used."

The error comes from the fact that you put a system size of only 2 which is too 
small.
I can also see the problem in the assembly with system sizes smaller than 38, 
so you can try with like 30 (for which I also have a log). In that case I run 
with 64 nodes and 32 processes per node. I think the problem may also fit in 32 
nodes.

--Junchao Zhang


On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with less cores, what is your input 
arguments?  Currently, I use what in your log file, ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Yes, I used KNL nodes. If you can perform the test that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I run the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Anot

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-28 Thread Ale Foggia via petsc-users
Junchao,
I'm sorry for the late response.

On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao () wrote:

> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your
> processes may have been killed by the cgroup out-of-memory handler."
>

I mentioned that I used 1024 nodes and 32 processes on each node because
the application needs a lot of memory. I think that for a system of size
38, one needs more than 256 nodes for sure (assuming only 32 procs per node). I
would try with 512 if it's possible.

I also tried with 128 core with ./main.x 2 ... and got a weird error
> message  "The size of the basis has to be at least equal to the number
>  of MPI processes used."
>

The error comes from the fact that you put a system size of only 2 which is
too small.
I can also see the problem in the assembly with system sizes smaller than
38, so you can try with, say, 30 (for which I also have a log). In that case
I ran with 64 nodes and 32 processes per node. I think the problem may also
fit in 32 nodes.
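
To size this more precisely, one could report the per-rank memory right after the matrices are assembled; a rough sketch (the helper below is illustrative, not part of the code):

#include <petscsys.h>

PetscErrorCode ReportMemory(MPI_Comm comm)
{
  PetscLogDouble local, max;
  PetscErrorCode ierr;

  ierr = PetscMemoryGetCurrentUsage(&local);CHKERRQ(ierr);              /* current memory usage of this rank */
  ierr = MPI_Allreduce(&local,&max,1,MPI_DOUBLE,MPI_MAX,comm);CHKERRQ(ierr);
  ierr = PetscPrintf(comm,"max memory per rank: %g MB\n",max/1048576.0);CHKERRQ(ierr);
  return 0;
}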

--Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
> wrote:
>
>> Ale,
>>   I successfully built your code and submitted a job to the NERSC Cori
>> machine requiring 32768 KNL cores and one and a half hours. It is estimated
>> to run in 3 days. If you also observed the same problem with less cores,
>> what is your input arguments?  Currently, I use what in your log file,
>> ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9
>> -log_view
>>   The smaller the better. Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia  wrote:
>>
>>> Yes, I used KNL nodes. If you can perform the test that would be great. Could
>>> it be that I'm not using the correct configuration of the KNL nodes? These
>>> are the environment variables I set:
>>> MKL_NUM_THREADS=1
>>> OMP_NUM_THREADS=1
>>> KMP_HW_SUBSET=1t
>>> KMP_AFFINITY=compact
>>> I_MPI_PIN_DOMAIN=socket
>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>> MKL_DYNAMIC=0
>>>
>>> The code is in https://github.com/amfoggia/LSQuantumED and it has a
>>> readme to compile it and run it. When I run the test I used only 32
>>> processors per node, and I used 1024 nodes in total, and it's for nspins=38.
>>> Thank you
>>>
>>> El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao ()
>>> escribió:
>>>
 Ale,
   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL
 nodes  one year ago. He used 32768 processors and called MatAssemblyEnd 118
 times and it used only 1.5 seconds in total.  So I guess something was
 wrong with your test. If you can share your code, I can have a test on our
 machine to see how it goes.
  Thanks.
 --Junchao Zhang


 On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
 wrote:

> MatAssembly was called once (in stage 5) and cost 2.5% of the total
> time.  Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
> which does global synchronization. The high max/min ratio means load
> imbalance. What I do not understand is MatAssemblyEnd. The ratio is 1.0. 
> It
> means processors are already synchronized. With 32768 processors, there 
> are
> 1.2e+06 messages with average length 1.9e+06 bytes. So each processor 
> sends
> 36 (1.2e+06/32768) ~2MB messages and it takes 54 seconds. Another chance 
> is
> the reduction at  MatAssemblyEnd. I don't know why it needs 8 reductions.
> In my mind, one is enough. I need to look at the code.
>
> Summary of Stages:   - Time --  - Flop --  ---
> Messages ---  -- Message Lengths --  -- Reductions --
> Avg %Total Avg %TotalCount
> %Total Avg %TotalCount   %Total
>  0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07
>  13.1%  7.768e+06   13.1%  2.530e+02  13.0%
>  1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00
> 0.0%  0.000e+000.0%  0.000e+00   0.0%
>  2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00
> 0.0%  0.000e+000.0%  0.000e+00   0.0%
>  3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00
> 0.0%  0.000e+000.0%  2.000e+00   0.1%
>  5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06
> 0.2%  1.942e+060.0%  8.000e+00   0.4%
>  6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08
>  86.7%  7.768e+06   86.8%  1.674e+03  86.1%
>
> --- Event Stage 5: Offdiag
> BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0 0
> MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0 0
> MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06
> 1.9e+06 8.0e+00  1  0  0  0  0  33  0100100100  

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-26 Thread Zhang, Junchao via petsc-users
Ale,
The job got a chance to run but failed with out-of-memory, "Some of your 
processes may have been killed by the cgroup out-of-memory handler."
I also tried with 128 cores with ./main.x 2 ... and got a weird error message
"The size of the basis has to be at least equal to the number of MPI processes used."
--Junchao Zhang


On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with less cores, what is your input 
arguments?  Currently, I use what in your log file, ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Yes, I used KNL nodes. If you can perform the test that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I run the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+0

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-25 Thread Zhang, Junchao via petsc-users
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with fewer cores, what are your input 
arguments? Currently, I use what is in your log file: ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Yes, I used KNL nodes. If you can perform the test that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I run the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-24 Thread Mills, Richard Tran via petsc-users
Hi Ale,

I don't know if this has anything to do with the strange performance you are 
seeing, but I notice that some of your Intel MPI settings are inconsistent and 
I'm not sure what you are intending. You have specified a value for 
I_MPI_PIN_DOMAIN and also a value for I_MPI_PIN_PROCESSOR_LIST. You can specify 
the domains that each MPI rank is pinned to by specifying one of these, but not 
both. According to what I found by a Google search for the Intel MPI 
documentation, if you specify both, it is I_MPI_PIN_DOMAIN that gets used.

A KNL node has only one socket per node (though it can be made to appear as if 
it has four virtual sockets if booted into SNC-4 mode, which doesn't seem to be 
a popular usage model). If you run with "I_MPI_PIN_DOMAIN=socket", then each of 
your MPI ranks on a node will be "pinned" to a domain that consists of all of 
the CPU cores available in a socket, so the MPI processes aren't really pinned 
at all, and can be migrated around by the system. You may see better 
performance by having your MPI implementation pin processes to more restrictive 
domains. I suggest first trying

  I_MPI_PIN_DOMAIN=auto:compact

which I believe is the default Intel MPI behavior. This will create domains by 
dividing the number of logical CPUs (# of hardware threads) by the number of 
MPI processes being started on the node, and they will be "compact" in the 
sense that domain members will be as "close" to each other as possible in terms 
of sharing cores, caches, etc.

I think that Intel MPI is ignoring your I_MPI_PIN_PROCESSOR_LIST setting, but 
I'll go ahead and point out that, if you are running with 64-core KNL nodes, 
there are 4*64 = 256 logical processors available, because each core supports 
four hardware threads. Asking for logical processors 0-63 might not actually be 
using all of the cores, as, depending on the BIOS numbering (which can be 
arbitrary), some of these logical processors may actually be hardware threads 
that share the same core.

Best regards,
Richard

On 6/24/19 4:19 AM, Ale Foggia via petsc-users wrote:
Yes, I used KNL nodes. If you can perform the test that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I run the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0

Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
balance ratio above 2 which means some processes have more than twice the work 
of others. Meanwhile the ratio for anything with communication is extremely
imbalanced: some processes get to the synchronization point well before other 
processes.

The first thing I would do is worry about the load imbalance, what is its 
cause? is it one process with much less work than others (not great but not 
terrible) or is it one process with much more work than the others (terrible) 
or something in between. I think once you get a handle on the load balance the 
rest may fall into place, otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a well 
balanced job. After you understand the load balancing you may need to use one 
of the parallel performance visualization tools to see why the synchronization 
is out of whack.

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia 
> mailto:amfog...@gmail.com>> wrote:
>
> I'm send

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Smith, Barry F. via petsc-users

   You could access the VecScatter inside the matrix-multiply and call 
VecScatterView() with an ASCII viewer with the format PETSC_VIEWER_ASCII_INFO 
(make sure you use this format) and it provides information about how much 
communication is being done and how many neighbors are being communicated with
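
A sketch of that viewing pattern, assuming one already holds a VecScatter handle sc (as far as I know the scatter used by MatMult inside a MATMPIAIJ is not reachable through a public getter, so getting at that particular one may require PETSc's private headers):

#include <petscvec.h>

PetscErrorCode ViewScatterInfo(VecScatter sc)
{
  PetscViewer    viewer = PETSC_VIEWER_STDOUT_WORLD;
  PetscErrorCode ierr;

  ierr = PetscViewerPushFormat(viewer,PETSC_VIEWER_ASCII_INFO);CHKERRQ(ierr);
  ierr = VecScatterView(sc,viewer);CHKERRQ(ierr);   /* reports message counts/lengths and number of neighbors */
  ierr = PetscViewerPopFormat(viewer);CHKERRQ(ierr);
  return 0;
}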

  Barry


> On Jun 21, 2019, at 10:56 AM, Jed Brown  wrote:
> 
> What is the partition like?  Suppose you randomly assigned nodes to
> processes; then in the typical case, all neighbors would be on different
> processors.  Then the "diagonal block" would be nearly diagonal and the
> off-diagonal block would be huge, requiring communication with many
> other processes.
> 
> "Smith, Barry F. via petsc-users"  writes:
> 
>>   The load balance is definitely out of whack. 
>> 
>> 
>> 
>> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
>> 0.0e+00  2  4 13 13  0  15 25100100  0 2935476
>> MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
>> 2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
>> VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
>> VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
>> 0.0e+00  1  0 13 13  0   4  0100100  0 0
>> VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   4  0  0  0  0 0
>> VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> 
>> Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have 
>> a balance ratio above 2 which means some processes have more than twice the 
>> work of others. Meanwhile the ratio for anything with communication is 
>> extremely imbalanced: some processes get to the synchronization point well 
>> before other processes. 
>> 
>> The first thing I would do is worry about the load imbalance, what is its 
>> cause? is it one process with much less work than others (not great but not 
>> terrible) or is it one process with much more work than the others 
>> (terrible) or something in between. I think once you get a handle on the 
>> load balance the rest may fall into place, otherwise we still have some 
>> exploring to do. This is not expected behavior for a good machine with a 
>> good network and a well balanced job. After you understand the load 
>> balancing you may need to use one of the parallel performance visualization 
>> tools to see why the synchronization is out of whack.
>> 
>>   Good luck
>> 
>>  Barry
>> 
>> 
>>> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
>>> 
>>> I'm sending one with a bit less time.
>>> I'm timing the functions also with std::chronos and for the case of 180 
>>> seconds the program runs out of memory (and crashes) before the PETSc log 
>>> gets to be printed, so I know the time only from my function. Anyway, in 
>>> every case, the times between std::chronos and the PETSc log match.
>>> 
>>> (The large times are in part "4b- Building offdiagonal part" or "Event 
>>> Stage 5: Offdiag").
>>> 
>>> El vie., 21 jun. 2019 a las 16:09, Zhang, Junchao () 
>>> escribió:
>>> 
>>> 
>>> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
>>> Thanks both of you for your answers,
>>> 
>>> El jue., 20 jun. 2019 a las 22:20, Smith, Barry F. () 
>>> escribió:
>>> 
>>>  Note that this is a one time cost if the nonzero structure of the matrix 
>>> stays the same. It will not happen in future MatAssemblies.
>>> 
 On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
  wrote:
 
 Those messages were used to build MatMult communication pattern for the 
 matrix. They were not part of the matrix entries-passing you imagined, but 
 indeed happened in MatAssemblyEnd. If you want to make sure processors do 
 not set remote entries, you can use 
 MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
 error when an off-proc entry is set.
>>> 
>>> I started being concerned about this when I saw that the assembly was 
>>> taking a few hundreds of seconds in my code, like 180 seconds, which for me 
>>> it's a considerable time. Do you think (or maybe you need more information 
>>> to answer this) that this time is "reasonable" for communicating the 
>>> pattern for the matrix? I already checked that I'm not setting any remote 
>>> entries. 
>>> It is not reasonable. Could you send log view of that test with 18

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users
MatAssembly was called once (in stage 5) and cost 2.5% of the total time. Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. Its ratio is 1.0, which means the processors are 
already synchronized there. With 32768 processors, there are 1.2e+06 messages 
with an average length of 1.9e+06 bytes. So each processor sends 36 
(1.2e+06/32768) ~2MB messages and it takes 54 seconds. Another possible culprit 
is the reduction at MatAssemblyEnd. I don't know why it needs 8 reductions; in 
my mind, one is enough. I need to look at the code.
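
As an aside, the MatSetOption call suggested elsewhere in the thread is the easiest way to rule out off-process entries on the application side; a minimal sketch (illustrative wrapper, assuming the application only sets rows it owns):

#include <petscmat.h>

PetscErrorCode AssembleLocalOnly(Mat A)
{
  PetscErrorCode ierr;

  ierr = MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE);CHKERRQ(ierr); /* error out if a remote entry is ever set */
  /* ... MatSetValues() calls restricted to this rank's own rows ... */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  return 0;
}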

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06   13.1%  2.530e+02  13.0%
 1:    Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00    0.0%  2.000e+00   0.1%
 5:         Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  1.942e+06    0.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF       1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
MatAssemblyBegin     1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0  0  0  0     0
MatAssemblyEnd       1 1.0 5.3762e+01 1.0      0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100  0
VecSet               1 1.0 7.5533e-02 9.0      0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
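
(The named stages in the table, such as "Offdiag", come from user-registered log stages; a minimal sketch of how such a stage is typically set up, with an illustrative function name:)

#include <petscsys.h>

PetscErrorCode BuildOffdiagWithStage(void)
{
  PetscLogStage  stage;
  PetscErrorCode ierr;

  ierr = PetscLogStageRegister("Offdiag",&stage);CHKERRQ(ierr); /* name that shows up in -log_view */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  /* ... build and assemble the off-diagonal part of the Hamiltonian ... */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  return 0;
}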


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0

Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
balance ratio above 2 which means some processes have more than twice the work 
of others. Meanwhile the ratio for anything with communication is extremely
imbalanced: some processes get to the synchronization point well before other 
processes.

The first thing I would do is worry about the load imbalance, what is its 
cause? is it one process with much less work than others (not great but not 
terrible) or is it one process with much more work than the others (terrible) 
or something in between. I think once you get a handle on the load balance the 
rest may fall into place, otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a well 
balanced job. After you understand the load balancing you may need to use one 
of the parallel performance visualization tools to see why the synchronization 
is out of whack.

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia 
> mailto:amfog...@gmail.com>> wrote:
>
> I'm sending one with a bit less time.
> I'm timing the functions also with std::chronos and for the case of 180 
> seconds the program runs out of memory (and crashes) before the PETSc log 
> gets to be printed, so I know the time only from my function. Anyway, in 
> every case, the times between std::chronos and the PETSc log match.
>
> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
> 5: Offdiag").
>
> El vie., 21 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Jed Brown via petsc-users
What is the partition like?  Suppose you randomly assigned nodes to
processes; then in the typical case, all neighbors would be on different
processors.  Then the "diagonal block" would be nearly diagonal and the
off-diagonal block would be huge, requiring communication with many
other processes.
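
One way to quantify that split on an actual run is to compare, per rank, the nonzeros of the local ("diagonal") block with those of the off-diagonal block; a sketch for a MATMPIAIJ matrix (the helper name is made up):

#include <petscmat.h>

PetscErrorCode ReportOffDiagFraction(Mat A)
{
  Mat            Ad, Ao;
  const PetscInt *colmap;
  MatInfo        id, io;
  PetscMPIInt    rank;
  MPI_Comm       comm;
  PetscErrorCode ierr;

  comm = PetscObjectComm((PetscObject)A);
  ierr = MatMPIAIJGetSeqAIJ(A,&Ad,&Ao,&colmap);CHKERRQ(ierr);   /* local (diagonal) and off-diagonal blocks */
  ierr = MatGetInfo(Ad,MAT_LOCAL,&id);CHKERRQ(ierr);
  ierr = MatGetInfo(Ao,MAT_LOCAL,&io);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
  ierr = PetscSynchronizedPrintf(comm,"[%d] diag nnz %g  offdiag nnz %g\n",
                                 rank,(double)id.nz_used,(double)io.nz_used);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(comm,PETSC_STDOUT);CHKERRQ(ierr);
  return 0;
}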

"Smith, Barry F. via petsc-users"  writes:

>The load balance is definitely out of whack. 
>
>
>
> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
> 0.0e+00  2  4 13 13  0  15 25100100  0 2935476
> MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
> 2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
> VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
> VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
> 0.0e+00  1  0 13 13  0   4  0100100  0 0
> VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   4  0  0  0  0 0
> VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>
> Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
> balance ratio above 2 which means some processes have more than twice the 
> work of others. Meanwhile the ratio for anything with communication is 
> extremely imbalanced: some processes get to the synchronization point well 
> before other processes. 
>
> The first thing I would do is worry about the load imbalance, what is its 
> cause? is it one process with much less work than others (not great but not 
> terrible) or is it one process with much more work than the others (terrible) 
> or something in between. I think once you get a handle on the load balance 
> the rest may fall into place, otherwise we still have some exploring to do. 
> This is not expected behavior for a good machine with a good network and a 
> well balanced job. After you understand the load balancing you may need to 
> use one of the parallel performance visualization tools to see why the 
> synchronization is out of whack.
>
>Good luck
>
>   Barry
>
>
>> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
>> 
>> I'm sending one with a bit less time.
>> I'm timing the functions also with std::chronos and for the case of 180 
>> seconds the program runs out of memory (and crashes) before the PETSc log 
>> gets to be printed, so I know the time only from my function. Anyway, in 
>> every case, the times between std::chronos and the PETSc log match.
>> 
>> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
>> 5: Offdiag").
>> 
>> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao () wrote:
>> 
>> 
>> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
>> Thanks both of you for your answers,
>> 
>> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. () wrote:
>> 
>>   Note that this is a one time cost if the nonzero structure of the matrix 
>> stays the same. It will not happen in future MatAssemblies.
>> 
>> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
>> >  wrote:
>> > 
>> > Those messages were used to build the MatMult communication pattern for 
>> > the matrix. They were not part of the matrix-entry passing you imagined, 
>> > but they did happen in MatAssemblyEnd. If you want to make sure processors 
>> > do not set remote entries, you can use 
>> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
>> > error when an off-proc entry is set.
>> 
>> I started being concerned about this when I saw that the assembly was taking 
>> a few hundred seconds in my code, like 180 seconds, which for me is a 
>> considerable time. Do you think (or maybe you need more information to 
>> answer this) that this time is "reasonable" for communicating the pattern 
>> for the matrix? I already checked that I'm not setting any remote entries. 
>> It is not reasonable. Could you send the log view of that test with the 
>> 180-second MatAssembly?
>>  
>> Also I see (in my code) that even though there are no messages being passed 
>> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
>> 
>> > 
>> > 
>> > --Junchao Zhang
>> > 
>> > 
>> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
>> >  wrote:
>> > Hello all!
>> > 
>> > During the conference I showed you a problem happening during 
>> > MatAssemblyEnd in a particular code that I have. Now, I tried the sam

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Smith, Barry F. via petsc-users

   The load balance is definitely out of whack. 



BuildTwoSidedF       1 1.0 1.6722e-02 41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
MatMult            138 1.0 2.6604e+02  7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0  2935476
MatAssemblyBegin     1 1.0 1.6807e-02 36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
MatAssemblyEnd       1 1.0 3.5680e-01  3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
VecNorm              2 1.0 4.4252e+01 74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0   0   0  1    12780
VecCopy              6 1.0 6.5655e-02  2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
VecAXPY              2 1.0 1.3793e-02  2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0 41000838
VecScatterBegin    138 1.0 1.1653e+02 85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0        0
VecScatterEnd      138 1.0 1.3653e+02 22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0   0   0  0        0
VecSetRandom         1 1.0 9.6668e-01  2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0

Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel, have a 
balance ratio above 2, which means some processes have more than twice the work 
of others. Meanwhile, the ratio for anything involving communication is 
extremely imbalanced: some processes reach the synchronization point well 
before other processes.

The first thing I would do is worry about the load imbalance: what is its 
cause? Is it one process with much less work than the others (not great, but 
not terrible), one process with much more work than the others (terrible), or 
something in between? I think once you get a handle on the load balance, the 
rest may fall into place; otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a 
well-balanced job. After you understand the load balancing, you may need to use 
one of the parallel performance visualization tools to see why the 
synchronization is out of whack.
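A rough sketch of a first such check along these lines, assuming the matrix A
carries the work distribution in question (helper name illustrative, error
checking omitted): compare each rank's share of stored nonzeros and report the
max/min ratio.

#include <petscmat.h>

/* Report the max/min ratio of locally stored nonzeros across ranks; a ratio
   well above 1 means some ranks hold much more of the matrix, and hence do
   much more MatMult work, than others. */
static PetscErrorCode ReportMatrixBalance(Mat A)
{
  MatInfo  info;
  double   mynz, maxnz, minnz;
  MPI_Comm comm;

  PetscObjectGetComm((PetscObject)A, &comm);
  MatGetInfo(A, MAT_LOCAL, &info);       /* nonzeros stored on this rank */
  mynz = (double)info.nz_used;
  MPI_Allreduce(&mynz, &maxnz, 1, MPI_DOUBLE, MPI_MAX, comm);
  MPI_Allreduce(&mynz, &minnz, 1, MPI_DOUBLE, MPI_MIN, comm);
  PetscPrintf(comm, "nonzero max/min ratio across ranks: %g\n",
              minnz > 0.0 ? maxnz / minnz : -1.0);
  return 0;
}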

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
> 
> I'm sending one with a bit less time.
> I'm also timing the functions with std::chrono, and for the case of 180 
> seconds the program runs out of memory (and crashes) before the PETSc log 
> gets printed, so I only know the time from my own timer. Anyway, in 
> every case, the times from std::chrono and the PETSc log match.
> 
> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
> 5: Offdiag").
> 
> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao () wrote:
> 
> 
> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
> Thanks both of you for your answers,
> 
> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. () wrote:
> 
>   Note that this is a one time cost if the nonzero structure of the matrix 
> stays the same. It will not happen in future MatAssemblies.
> 
> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> >  wrote:
> > 
> > Those messages were used to build the MatMult communication pattern for 
> > the matrix. They were not part of the matrix-entry passing you imagined, 
> > but they did happen in MatAssemblyEnd. If you want to make sure processors 
> > do not set remote entries, you can use 
> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> > error when an off-proc entry is set.
> 
> I started being concerned about this when I saw that the assembly was taking 
> a few hundred seconds in my code, like 180 seconds, which for me is a 
> considerable time. Do you think (or maybe you need more information to answer 
> this) that this time is "reasonable" for communicating the pattern for the 
> matrix? I already checked that I'm not setting any remote entries. 
> It is not reasonable. Could you send the log view of that test with the 
> 180-second MatAssembly?
>  
> Also I see (in my code) that even though there are no messages being passed 
> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
> 
> > 
> > 
> > --Junchao Zhang
> > 
> > 
> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> >  wrote:
> > Hello all!
> > 
> > During the conference I showed you a problem happening during 
> > MatAssemblyEnd in a particular code that I have. Now, I tried the same with 
> > a simple code (a symmetric problem corresponding to the Laplacian operator 
> > in 1D, from the SLEPc Hands-On exercises). As I understand (and please, 
> > correct me if I'm wrong), in this case the elements of the matrix are 
> > computed locally by each process so there should not be any communication 
> > during the assembly. However, in the log I get that there are messages 
> > being passed. Also, the number of messages changes with the number of 
> > process

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users


On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfog...@gmail.com> wrote:
Thanks both of you for your answers,

On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<bsm...@mcs.anl.gov>) wrote:

  Note that this is a one time cost if the nonzero structure of the matrix 
stays the same. It will not happen in future MatAssemblies.

> On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
>
> Those messages were used to build the MatMult communication pattern for the 
> matrix. They were not part of the matrix-entry passing you imagined, but they 
> did happen in MatAssemblyEnd. If you want to make sure processors do not 
> set remote entries, you can use 
> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> error when an off-proc entry is set.

I started being concerned about this when I saw that the assembly was taking a 
few hundred seconds in my code, like 180 seconds, which for me is a 
considerable time. Do you think (or maybe you need more information to answer 
this) that this time is "reasonable" for communicating the pattern for the 
matrix? I already checked that I'm not setting any remote entries.
It is not reasonable. Could you send the log view of that test with the 
180-second MatAssembly?

Also I see (in my code) that even though there are no messages being passed 
during MatAssemblyBegin, it is taking time and the "ratio" is very big.

>
>
> --Junchao Zhang
>
>
> On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
> Hello all!
>
> During the conference I showed you a problem happening during MatAssemblyEnd 
> in a particular code that I have. Now, I tried the same with a simple code (a 
> symmetric problem corresponding to the Laplacian operator in 1D, from the 
> SLEPc Hands-On exercises). As I understand (and please, correct me if I'm 
> wrong), in this case the elements of the matrix are computed locally by each 
> process so there should not be any communication during the assembly. 
> However, in the log I get that there are messages being passed. Also, the 
> number of messages changes with the number of processes used and the size of 
> the matrix. Could you please help me understand this?
>
> I attach the code I used and the log I get for a small problem.
>
> Cheers,
> Ale
>



Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-20 Thread Smith, Barry F. via petsc-users


  Note that this is a one time cost if the nonzero structure of the matrix 
stays the same. It will not happen in future MatAssemblies.

> On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
>  wrote:
> 
> Those messages were used to build the MatMult communication pattern for the 
> matrix. They were not part of the matrix-entry passing you imagined, but they 
> did happen in MatAssemblyEnd. If you want to make sure processors do not set 
> remote entries, you can use 
> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> error when an off-proc entry is set.
> 
> 
> --Junchao Zhang
> 
> 
> On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
>  wrote:
> Hello all!
> 
> During the conference I showed you a problem happening during MatAssemblyEnd 
> in a particular code that I have. Now, I tried the same with a simple code (a 
> symmetric problem corresponding to the Laplacian operator in 1D, from the 
> SLEPc Hands-On exercises). As I understand (and please, correct me if I'm 
> wrong), in this case the elements of the matrix are computed locally by each 
> process so there should not be any communication during the assembly. 
> However, in the log I get that there are messages being passed. Also, the 
> number of messages changes with the number of processes used and the size of 
> the matrix. Could you please help me understand this?
> 
> I attach the code I used and the log I get for a small problem.
> 
> Cheers,
> Ale
> 



Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-20 Thread Zhang, Junchao via petsc-users
Those messages were used to build the MatMult communication pattern for the 
matrix. They were not part of the matrix-entry passing you imagined, but they 
did happen in MatAssemblyEnd. If you want to make sure processors do not set 
remote entries, you can use MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), 
which will generate an error when an off-proc entry is set.
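A minimal sketch of where that option goes, assuming a matrix A that has
already been created and is filled only in locally owned rows (a fragment, not
a complete program):

/* Set before the MatSetValues() calls: PETSc will then raise an error if any
   rank tries to insert into a row it does not own, instead of silently
   stashing the entry and sending it during assembly. */
MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);

/* ... MatSetValues() restricted to this rank's own rows ... */

MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);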


--Junchao Zhang


On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hello all!

During the conference I showed you a problem happening during MatAssemblyEnd in 
a particular code that I have. Now, I tried the same with a simple code (a 
symmetric problem corresponding to the Laplacian operator in 1D, from the SLEPc 
Hands-On exercises). As I understand (and please, correct me if I'm wrong), in 
this case the elements of the matrix are computed locally by each process so 
there should not be any communication during the assembly. However, in the log 
I get that there are messages being passed. Also, the number of messages 
changes with the number of processes used and the size of the matrix. Could you 
please help me understand this?

I attach the code I used and the log I get for a small problem.

Cheers,
Ale
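
For reference, a minimal sketch of the assembly pattern described in the
question (a 1-D Laplacian where each rank inserts only the rows it owns; the
matrix size, option handling, and names are illustrative, and error checking
is omitted):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt n = 100, Istart, Iend, i;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* Each rank fills only the rows it owns, so no entries are stashed for
     other processes. */
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }

  /* Even with purely local insertions, the first MatAssemblyEnd still
     exchanges a few small messages to set up the MatMult scatter; that is
     the communication discussed in this thread. */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}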