[petsc-users] createASCII cleans up the content of a gmsh file.

2019-06-21 Thread Dongyu Liu - CITG via petsc-users
Hi,


We are using the Viewer class in petsc4py to read a gmsh file, but after we 
call createASCII with the mode "READ", the gmsh file is emptied. Do you have 
any clue why this happens?


Best,

Dongyu


Re: [petsc-users] createASCII cleans up the content of a gmsh file.

2019-06-21 Thread Matthew Knepley via petsc-users
On Fri, Jun 21, 2019 at 4:56 AM Dongyu Liu - CITG via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Hi,
>
> We are using the Viewer class in petsc4py to read a gmsh file, but after we
> call createASCII with the mode "READ", the gmsh file is
> emptied. Do you have any clue why this happens?
>
You need to set the mode to FILE_MODE_READ before opening the file.
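
For reference, a minimal petsc4py sketch (not taken from the original message; 
the filename is only an example) that opens an existing file read-only so the 
viewer does not truncate it:

    from petsc4py import PETSc

    # Open the existing file read-only so it is not truncated on open.
    viewer = PETSc.Viewer().createASCII('mesh.msh',
                                        mode=PETSc.Viewer.Mode.READ,
                                        comm=PETSc.COMM_WORLD)
    # ... read through the viewer ...
    viewer.destroy()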

  Thanks,

Matt

> Best,
>
> Dongyu
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users


On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfog...@gmail.com> wrote:
Thanks both of you for your answers,

On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<bsm...@mcs.anl.gov>) wrote:

  Note that this is a one time cost if the nonzero structure of the matrix 
stays the same. It will not happen in future MatAssemblies.

> On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
>
> Those messages were used to build the MatMult communication pattern for the 
> matrix. They were not part of the matrix entry passing you imagined, but 
> they did indeed happen in MatAssemblyEnd. If you want to make sure processors 
> do not set remote entries, you can use 
> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> error when an off-proc entry is set.
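
In petsc4py, the option mentioned above reads roughly as follows (a sketch, not 
code from this thread; the matrix here is just a placeholder):

    from petsc4py import PETSc

    A = PETSc.Mat().createAIJ([100, 100], comm=PETSc.COMM_WORLD)
    A.setUp()
    # Generate an error if any rank tries to set an entry owned by another rank.
    A.setOption(PETSc.Mat.Option.NO_OFF_PROC_ENTRIES, True)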

I started being concerned about this when I saw that the assembly was taking a 
few hundred seconds in my code, around 180 seconds, which for me is a 
considerable time. Do you think (or maybe you need more information to answer 
this) that this time is "reasonable" for communicating the pattern of the 
matrix? I already checked that I'm not setting any remote entries.
It is not reasonable. Could you send the log view of the test with the 
180-second MatAssembly?

Also, I see (in my code) that even if there are no messages being passed during 
MatAssemblyBegin, it still takes time and the "ratio" is very big.

>
>
> --Junchao Zhang
>
>
> On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <petsc-users@mcs.anl.gov> wrote:
> Hello all!
>
> During the conference I showed you a problem happening during MatAssemblyEnd 
> in a particular code that I have. Now, I tried the same with a simple code (a 
> symmetric problem corresponding to the Laplacian operator in 1D, from the 
> SLEPc Hands-On exercises). As I understand (and please, correct me if I'm 
> wrong), in this case the elements of the matrix are computed locally by each 
> process so there should not be any communication during the assembly. 
> However, in the log I get that there are messages being passed. Also, the 
> number of messages changes with the number of processes used and the size of 
> the matrix. Could you please help me understand this?
>
> I attach the code I used and the log I get for a small problem.
>
> Cheers,
> Ale
>
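
A petsc4py sketch of the setup described above (an illustration of the 1D 
Laplacian case, not the attached code; the size is arbitrary):

    from petsc4py import PETSc

    n = 1000                          # global size (example value)
    A = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
    A.setSizes([n, n])
    A.setFromOptions()
    A.setUp()

    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):     # each rank touches only its own rows
        if i > 0:
            A[i, i - 1] = -1.0
        A[i, i] = 2.0
        if i < n - 1:
            A[i, i + 1] = -1.0

    A.assemblyBegin()   # no off-process values were set, so no matrix entries
    A.assemblyEnd()     # travel here, but MatAssemblyEnd still sets up the
                        # MatMult communication pattern (the messages in the log)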



Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Smith, Barry F. via petsc-users

   The load balance is definitely out of whack. 



BuildTwoSidedF       1 1.0 1.6722e-02 41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
MatMult            138 1.0 2.6604e+02  7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0  2935476
MatAssemblyBegin     1 1.0 1.6807e-02 36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
MatAssemblyEnd       1 1.0 3.5680e-01  3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
VecNorm              2 1.0 4.4252e+01 74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0   0   0  1    12780
VecCopy              6 1.0 6.5655e-02  2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0
VecAXPY              2 1.0 1.3793e-02  2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0 41000838
VecScatterBegin    138 1.0 1.1653e+02 85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0        0
VecScatterEnd      138 1.0 1.3653e+02 22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0   0   0  0        0
VecSetRandom         1 1.0 9.6668e-01  2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0        0

Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel, have a 
balance ratio above 2, which means some processes have more than twice the work 
of others. Meanwhile the ratio for anything with communication is extremely 
imbalanced: some processes reach the synchronization point well before other 
processes. 

The first thing I would do is worry about the load imbalance: what is its 
cause? Is it one process with much less work than the others (not great, but 
not terrible), one process with much more work than the others (terrible), or 
something in between? I think once you get a handle on the load balance the 
rest may fall into place; otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a well 
balanced job. After you understand the load balancing you may need to use one 
of the parallel performance visualization tools to see why the synchronization 
is out of whack.
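
One simple way to start checking this is to compare each rank's share of the 
matrix (a petsc4py sketch, not from this thread; the imbalance could of course 
also come from other parts of the application):

    from petsc4py import PETSc

    def report_local_work(A):
        # Print each rank's number of owned rows and stored nonzeros;
        # a large spread across ranks indicates load imbalance.
        rstart, rend = A.getOwnershipRange()
        info = A.getInfo(PETSc.Mat.InfoType.LOCAL)
        PETSc.Sys.syncPrint("rank %d: %d rows, %d nonzeros"
                            % (A.comm.rank, rend - rstart, int(info["nz_used"])),
                            comm=A.comm)
        PETSc.Sys.syncFlush(comm=A.comm)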

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
> 
> I'm sending one with a bit less time.
> I'm timing the functions also with std::chrono, and for the case of 180 
> seconds the program runs out of memory (and crashes) before the PETSc log 
> gets to be printed, so I know the time only from my function. Anyway, in 
> every case, the times between std::chrono and the PETSc log match.
> 
> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
> 5: Offdiag").
> 
> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao () wrote:
> 
> 
> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
> Thanks both of you for your answers,
> 
> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. () wrote:
> 
>   Note that this is a one time cost if the nonzero structure of the matrix 
> stays the same. It will not happen in future MatAssemblies.
> 
> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> >  wrote:
> > 
> > Those messages were used to build MatMult communication pattern for the 
> > matrix. They were not part of the matrix entries-passing you imagined, but 
> > indeed happened in MatAssemblyEnd. If you want to make sure processors do 
> > not set remote entries, you can use 
> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> > error when an off-proc entry is set.
> 
> I started being concerned about this when I saw that the assembly was taking 
> a few hundreds of seconds in my code, like 180 seconds, which for me it's a 
> considerable time. Do you think (or maybe you need more information to answer 
> this) that this time is "reasonable" for communicating the pattern for the 
> matrix? I already checked that I'm not setting any remote entries. 
> It is not reasonable. Could you send log view of that test with 180 seconds 
> MatAssembly?
>  
> Also I see (in my code) that even if there are no messages being passed 
> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
> 
> > 
> > 
> > --Junchao Zhang
> > 
> > 
> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> >  wrote:
> > Hello all!
> > 
> > During the conference I showed you a problem happening during 
> > MatAssemblyEnd in a particular code that I have. Now, I tried the same with 
> > a simple code (a symmetric problem corresponding to the Laplacian operator 
> > in 1D, from the SLEPc Hands-On exercises). As I understand (and please, 
> > correct me if I'm wrong), in this case the elements of the matrix are 
> > computed locally by each process so there should not be any communication 
> > during the assembly. However, in the log I get that there are messages 
> > being passed. Also, the number of messages changes with the number of 
> > process

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Jed Brown via petsc-users
What is the partition like?  Suppose you randomly assigned nodes to
processes; then in the typical case, all neighbors would be on different
processors.  Then the "diagonal block" would be nearly diagonal and the
off-diagonal block would be huge, requiring communication with many
other processes.

"Smith, Barry F. via petsc-users"  writes:

>The load balance is definitely out of whack. 
>
>
>
> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
> 0.0e+00  2  4 13 13  0  15 25100100  0 2935476
> MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
> 2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
> VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
> VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
> 0.0e+00  1  0 13 13  0   4  0100100  0 0
> VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   4  0  0  0  0 0
> VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>
> Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
> balance ratio above 2 which means some processes have more than twice the 
> work of others. Meanwhile the ratio for anything with communication is 
> extremely in balanced, some processes get to the synchronization point well 
> before other processes. 
>
> The first thing I would do is worry about the load imbalance, what is its 
> cause? is it one process with much less work than others (not great but not 
> terrible) or is it one process with much more work then the others (terrible) 
> or something in between. I think once you get a handle on the load balance 
> the rest may fall into place, otherwise we still have some exploring to do. 
> This is not expected behavior for a good machine with a good network and a 
> well balanced job. After you understand the load balancing you may need to 
> use one of the parallel performance visualization tools to see why the 
> synchronization is out of whack.
>
>Good luck
>
>   Barry
>
>
>> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
>> 
>> I'm sending one with a bit less time.
>> I'm timing the functions also with std::chronos and for the case of 180 
>> seconds the program runs out of memory (and crushes) before the PETSc log 
>> gets to be printed, so I know the time only from my function. Anyway, in 
>> every case, the times between std::chronos and the PETSc log match.
>> 
>> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
>> 5: Offdiag").
>> 
>> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao () wrote:
>> 
>> 
>> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
>> Thanks both of you for your answers,
>> 
>> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. () wrote:
>> 
>>   Note that this is a one time cost if the nonzero structure of the matrix 
>> stays the same. It will not happen in future MatAssemblies.
>> 
>> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
>> >  wrote:
>> > 
>> > Those messages were used to build MatMult communication pattern for the 
>> > matrix. They were not part of the matrix entries-passing you imagined, but 
>> > indeed happened in MatAssemblyEnd. If you want to make sure processors do 
>> > not set remote entries, you can use 
>> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
>> > error when an off-proc entry is set.
>> 
>> I started being concerned about this when I saw that the assembly was taking 
>> a few hundreds of seconds in my code, like 180 seconds, which for me it's a 
>> considerable time. Do you think (or maybe you need more information to 
>> answer this) that this time is "reasonable" for communicating the pattern 
>> for the matrix? I already checked that I'm not setting any remote entries. 
>> It is not reasonable. Could you send log view of that test with 180 seconds 
>> MatAssembly?
>>  
>> Also I see (in my code) that even if there are no messages being passed 
>> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
>> 
>> > 
>> > 
>> > --Junchao Zhang
>> > 
>> > 
>> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
>> >  wrote:
>> > Hello all!
>> > 
>> > During the conference I showed you a problem happening during 
>> > MatAssemblyEnd in a particular code that I have. Now, I tried the sam

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users
MatAssembly was called once (in stage 5) and cost 2.5% of the total time. Look 
at stage 5. It shows MatAssemblyBegin calls BuildTwoSidedF, which does a global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd: its ratio is 1.0, which means the processors are 
already synchronized. With 32768 processors, there are 1.2e+06 messages with an 
average length of 1.9e+06 bytes. So each processor sends about 36 
(1.2e+06/32768) ~2 MB messages and it takes 54 seconds. Another possibility is 
the reduction at MatAssemblyEnd. I don't know why it needs 8 reductions; in my 
mind, one is enough. I need to look at the code.

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
 0:      Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06      13.1%  2.530e+02  13.0%
 1:    Create Basis: 7.9234e-02   0.0%  0.e+00       0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00       0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00       0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.1%
 5:         Offdiag: 1.6525e+02   2.5%  0.e+00       0.0%  1.188e+06   0.2%  1.942e+06       0.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06      86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF       1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0        0
MatAssemblyBegin     1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0        0
MatAssemblyEnd       1 1.0 5.3762e+01      1.0 0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100        0
VecSet               1 1.0 7.5533e-02      9.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0        0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0

Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
balance ratio above 2 which means some processes have more than twice the work 
of others. Meanwhile the ratio for anything with communication is extremely in 
balanced, some processes get to the synchronization point well before other 
processes.

The first thing I would do is worry about the load imbalance, what is its 
cause? is it one process with much less work than others (not great but not 
terrible) or is it one process with much more work then the others (terrible) 
or something in between. I think once you get a handle on the load balance the 
rest may fall into place, otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a well 
balanced job. After you understand the load balancing you may need to use one 
of the parallel performance visualization tools to see why the synchronization 
is out of whack.

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfog...@gmail.com> wrote:
>
> I'm sending one with a bit less time.
> I'm timing the functions also with std::chronos and for the case of 180 
> seconds the program runs out of memory (and crushes) before the PETSc log 
> gets to be printed, so I know the time only from my function. Anyway, in 
> every case, the times between std::chronos and the PETSc log match.
>
> (The large times are in part "4b- Building offdiagonal part" or "Event Stage 
> 5: Offdiag").
>
> El vie., 21 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Smith, Barry F. via petsc-users

   You could access the VecScatter inside the matrix-vector multiply and call 
VecScatterView() with an ASCII viewer using the format PETSC_VIEWER_ASCII_INFO 
(make sure you use this format); it provides information about how much 
communication is being done and how many neighbors each process communicates with.
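
A petsc4py sketch of that viewer usage (illustrative only; how you obtain the 
VecScatter from the matrix is an implementation detail not shown here):

    from petsc4py import PETSc

    def view_scatter_info(scatter):
        # ASCII_INFO prints a summary of the communication (message counts and
        # neighbors) instead of the full index sets.
        viewer = PETSc.Viewer.STDOUT(scatter.comm)
        viewer.pushFormat(PETSc.Viewer.Format.ASCII_INFO)
        scatter.view(viewer)
        viewer.popFormat()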

  Barry


> On Jun 21, 2019, at 10:56 AM, Jed Brown  wrote:
> 
> What is the partition like?  Suppose you randomly assigned nodes to
> processes; then in the typical case, all neighbors would be on different
> processors.  Then the "diagonal block" would be nearly diagonal and the
> off-diagonal block would be huge, requiring communication with many
> other processes.
> 
> "Smith, Barry F. via petsc-users"  writes:
> 
>>   The load balance is definitely out of whack. 
>> 
>> 
>> 
>> BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
>> 0.0e+00  2  4 13 13  0  15 25100100  0 2935476
>> MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
>> 2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
>> VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
>> VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
>> 0.0e+00  1  0 13 13  0   4  0100100  0 0
>> VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   4  0  0  0  0 0
>> VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>> 
>> Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have 
>> a balance ratio above 2 which means some processes have more than twice the 
>> work of others. Meanwhile the ratio for anything with communication is 
>> extremely in balanced, some processes get to the synchronization point well 
>> before other processes. 
>> 
>> The first thing I would do is worry about the load imbalance, what is its 
>> cause? is it one process with much less work than others (not great but not 
>> terrible) or is it one process with much more work then the others 
>> (terrible) or something in between. I think once you get a handle on the 
>> load balance the rest may fall into place, otherwise we still have some 
>> exploring to do. This is not expected behavior for a good machine with a 
>> good network and a well balanced job. After you understand the load 
>> balancing you may need to use one of the parallel performance visualization 
>> tools to see why the synchronization is out of whack.
>> 
>>   Good luck
>> 
>>  Barry
>> 
>> 
>>> On Jun 21, 2019, at 9:27 AM, Ale Foggia  wrote:
>>> 
>>> I'm sending one with a bit less time.
>>> I'm timing the functions also with std::chronos and for the case of 180 
>>> seconds the program runs out of memory (and crushes) before the PETSc log 
>>> gets to be printed, so I know the time only from my function. Anyway, in 
>>> every case, the times between std::chronos and the PETSc log match.
>>> 
>>> (The large times are in part "4b- Building offdiagonal part" or "Event 
>>> Stage 5: Offdiag").
>>> 
>>> On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao () wrote:
>>> 
>>> 
>>> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia  wrote:
>>> Thanks both of you for your answers,
>>> 
>>> On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. () wrote:
>>> 
>>>  Note that this is a one time cost if the nonzero structure of the matrix 
>>> stays the same. It will not happen in future MatAssemblies.
>>> 
 On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
  wrote:
 
 Those messages were used to build MatMult communication pattern for the 
 matrix. They were not part of the matrix entries-passing you imagined, but 
 indeed happened in MatAssemblyEnd. If you want to make sure processors do 
 not set remote entries, you can use 
 MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
 error when an off-proc entry is set.
>>> 
>>> I started being concerned about this when I saw that the assembly was 
>>> taking a few hundreds of seconds in my code, like 180 seconds, which for me 
>>> it's a considerable time. Do you think (or maybe you need more information 
>>> to answer this) that this time is "reasonable" for communicating the 
>>> pattern for the matrix? I already checked that I'm not setting any remote 
>>> entries. 
>>> It is not reasonable. Could you send log view of that test with 18

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes 
one year ago. He used 32768 processors, called MatAssemblyEnd 118 times, and it 
took only 1.5 seconds in total. So I guess something was wrong with your test. 
If you can share your code, I can run a test on our machine to see how it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0

Note that VecCopy/AXPY/SetRandom which are all embarrassingly parallel have a 
balance ratio above 2 which means some processes have more than twice the work 
of others. Meanwhile the ratio for anything with communication is extremely in 
balanced, some processes get to the synchronization point well before other 
processes.

The first thing I would do is worry about the load imbalance, what is its 
cause? is it one process with much less work than others (not great but not 
terrible) or is it one process with much more work then the others (terrible) 
or something in between. I think once you get a handle on the load balance the 
rest may fall into place, otherwise we still have some exploring to do. This is 
not expected behavior for a good machine with a good network and a well 
balanced job. After you understand the load balancing you may need to use one 
of the parallel performance visualization tools to see why the synchronization 
is out of whack.

   Good luck

  Barry


> On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfog...@gmail.com> wrote:
>
> I'm send