Junchao, I'm sorry for the late response.

On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
> Ale,
>   The job got a chance to run but failed with out-of-memory: "Some of your processes may have been killed by the cgroup out-of-memory handler."

I mentioned that I used 1024 nodes and 32 processes on each node because the application needs a lot of memory. I think that for a system of size 38 one needs more than 256 nodes for sure (assuming only 32 processes per node). I would try with 512 if that's possible.

>   I also tried with 128 cores with ./main.x 2 ... and got a weird error message "The size of the basis has to be at least equal to the number of MPI processes used."

The error comes from the fact that you put a system size of only 2, which is too small. I can also see the problem in the assembly with system sizes smaller than 38, so you could try with, say, 30 (for which I also have a log). In that case I ran with 64 nodes and 32 processes per node. I think the problem may also fit in 32 nodes.

> --Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>
>> Ale,
>>   I successfully built your code and submitted a job to the NERSC Cori machine requesting 32768 KNL cores and one and a half hours. It is estimated to start in 3 days. If you also observed the same problem with fewer cores, what are your input arguments? Currently I use what is in your log file:
>>   ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
>>   The smaller the better. Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <amfog...@gmail.com> wrote:
>>
>>> Yes, I used KNL nodes. If you can perform the test, that would be great. Could it be that I'm not using the correct configuration of the KNL nodes? These are the environment variables I set:
>>> MKL_NUM_THREADS=1
>>> OMP_NUM_THREADS=1
>>> KMP_HW_SUBSET=1t
>>> KMP_AFFINITY=compact
>>> I_MPI_PIN_DOMAIN=socket
>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>> MKL_DYNAMIC=0
>>>
>>> The code is at https://github.com/amfoggia/LSQuantumED and it has a readme explaining how to compile and run it. When I ran the test I used only 32 processes per node and 1024 nodes in total, and it's for nspins=38.
>>> Thank you
>>>
>>> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
>>>
>>>> Ale,
>>>>   Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on KNL nodes one year ago. He used 32768 processors and called MatAssemblyEnd 118 times, and it took only 1.5 seconds in total. So I guess something was wrong with your test. If you can share your code, I can run a test on our machine to see how it goes.
>>>>   Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>>>>
>>>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global synchronization. The high max/min ratio means load imbalance. What I do not understand is MatAssemblyEnd. Its ratio is 1.0, which means the processors are already synchronized. With 32768 processors, there are 1.2e+06 messages with average length 1.9e+06 bytes, so each processor sends 36 (1.2e+06/32768) ~2MB messages, and that takes 54 seconds. Another possibility is the reductions at MatAssemblyEnd. I don't know why it needs 8 reductions; in my mind, one is enough. I need to look at the code.
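[Editor's note] The "Offdiag" stage discussed above, and shown in the breakdown quoted below, is a user-registered PETSc log stage. As a point of reference only, here is a minimal sketch of how such a stage is typically set up around the assembly so that its cost is reported separately by -log_view; the matrix, sizes, and values are illustrative and this is not code taken from LSQuantumED.

/* Editor's sketch: register a log stage around assembly so that the
 * assembly events are charged to their own stage in -log_view. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscLogStage  offdiag;
  PetscInt       n = 1000, Istart, Iend, i;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 1, NULL, 0, NULL, &A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);

  ierr = PetscLogStageRegister("Offdiag", &offdiag);CHKERRQ(ierr);
  ierr = PetscLogStagePush(offdiag);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {                        /* owner-only entries */
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);                 /* everything above is charged to "Offdiag" */

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with -log_view and the assembly events appear under the pushed stage, as in the summary that follows.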
>>>>>
>>>>> Summary of Stages:    ----- Time ------   ----- Flop ------   --- Messages ---   -- Message Lengths --   -- Reductions --
>>>>>                          Avg     %Total      Avg     %Total     Count   %Total      Avg        %Total      Count   %Total
>>>>>  0:       Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  7.768e+06      13.1%  2.530e+02  13.0%
>>>>>  1:     Create Basis: 7.9234e-02   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>>>>  2:   Create Lattice: 8.3944e-05   0.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>>>>  3:    Create Hamilt: 1.0694e+02   1.6%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.1%
>>>>>  5:          Offdiag: 1.6525e+02   2.5%  0.0000e+00   0.0%  1.188e+06   0.2%  1.942e+06       0.0%  8.000e+00   0.4%
>>>>>  6: Phys quantities:  5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  7.768e+06      86.8%  1.674e+03  86.1%
>>>>>
>>>>> --- Event Stage 5: Offdiag
>>>>>
>>>>> BuildTwoSidedF       1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0       0
>>>>> MatAssemblyBegin     1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  28  0   0   0   0       0
>>>>> MatAssemblyEnd       1 1.0 5.3762e+01      1.0 0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00  1  0  0  0  0  33  0 100 100 100       0
>>>>> VecSet               1 1.0 7.5533e-02      9.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>>
>>>>>>   The load balance is definitely out of whack.
>>>>>>
>>>>>> BuildTwoSidedF       1 1.0 1.6722e-02  41.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0       0
>>>>>> MatMult            138 1.0 2.6604e+02   7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00  2  4 13 13  0  15 25 100 100  0 2935476
>>>>>> MatAssemblyBegin     1 1.0 1.6807e-02  36.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0       0
>>>>>> MatAssemblyEnd       1 1.0 3.5680e-01   3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0       0
>>>>>> VecNorm              2 1.0 4.4252e+01  74.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00  1  0  0  0  0   5  0   0   0  1   12780
>>>>>> VecCopy              6 1.0 6.5655e-02   2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0       0
>>>>>> VecAXPY              2 1.0 1.3793e-02   2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0 41000838
>>>>>> VecScatterBegin    138 1.0 1.1653e+02  85.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00  1  0 13 13  0   4  0 100 100  0       0
>>>>>> VecScatterEnd      138 1.0 1.3653e+02  22.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   4  0   0   0  0       0
>>>>>> VecSetRandom         1 1.0 9.6668e-01   2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0       0
>>>>>>
>>>>>>   Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel, have a balance ratio above 2, which means some processes have more than twice the work of others. Meanwhile the ratio for anything with communication is extremely imbalanced: some processes get to the synchronization point well before other processes.
>>>>>>
>>>>>>   The first thing I would do is worry about the load imbalance. What is its cause? Is it one process with much less work than the others (not great but not terrible), or one process with much more work than the others (terrible), or something in between? I think once you get a handle on the load balance the rest may fall into place; otherwise we still have some exploring to do. This is not expected behavior for a good machine with a good network and a well balanced job.
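[Editor's note] A quick way to act on Barry's suggestion is to have every rank report its share of the assembled matrix. The helper below is only a sketch of that idea (the function name is made up, and it assumes the matrix lives on PETSC_COMM_WORLD); call it right after MatAssemblyEnd().

/* Editor's sketch: print each rank's local rows and nonzeros to see
 * whether one process carries much more (or much less) work than the rest. */
#include <petscmat.h>

PetscErrorCode ReportLoadBalance(Mat A)
{
  PetscErrorCode ierr;
  PetscMPIInt    rank;
  PetscInt       rstart, rend;
  MatInfo        info;

  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);    /* per-process counts */
  ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] local rows %D, local nonzeros %.0f\n",
                                 rank, rend - rstart, info.nz_used);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);
  return 0;
}

If the row counts are even but the nonzero counts are not, the imbalance comes from the structure of the matrix rather than from the row partitioning.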
>>>>>>   After you understand the load balancing, you may need to use one of the parallel performance visualization tools to see why the synchronization is out of whack.
>>>>>>
>>>>>>   Good luck
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>
>>>>>> > On Jun 21, 2019, at 9:27 AM, Ale Foggia <amfog...@gmail.com> wrote:
>>>>>> >
>>>>>> > I'm sending one with a bit less time.
>>>>>> > I'm also timing the functions with std::chrono, and for the 180-second case the program runs out of memory (and crashes) before the PETSc log gets printed, so I know that time only from my own timer. Anyway, in every case the times from std::chrono and the PETSc log match.
>>>>>> >
>>>>>> > (The large times are in the part "4b- Building offdiagonal part", or "Event Stage 5: Offdiag".)
>>>>>> >
>>>>>> > On Fri, Jun 21, 2019 at 16:09, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <amfog...@gmail.com> wrote:
>>>>>> > Thanks both of you for your answers,
>>>>>> >
>>>>>> > On Thu, Jun 20, 2019 at 22:20, Smith, Barry F. (<bsm...@mcs.anl.gov>) wrote:
>>>>>> >
>>>>>> >   Note that this is a one-time cost if the nonzero structure of the matrix stays the same. It will not happen in future MatAssemblies.
>>>>>> >
>>>>>> > > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>> > >
>>>>>> > > Those messages were used to build the MatMult communication pattern for the matrix. They were not part of the matrix entry passing you imagined, but they did indeed happen in MatAssemblyEnd. If you want to make sure processors do not set remote entries, you can use MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an error when an off-proc entry is set.
>>>>>> >
>>>>>> > I started being concerned about this when I saw that the assembly was taking a few hundred seconds in my code, like 180 seconds, which for me is a considerable time. Do you think (or maybe you need more information to answer this) that this time is "reasonable" for communicating the pattern for the matrix? I already checked that I'm not setting any remote entries.
>>>>>> > It is not reasonable. Could you send the log view of that test with the 180-second MatAssembly?
>>>>>> >
>>>>>> > Also, I see (in my code) that even if there are no messages being passed during MatAssemblyBegin, it still takes time and the "ratio" is very big.
>>>>>> >
>>>>>> > >
>>>>>> > >
>>>>>> > > --Junchao Zhang
>>>>>> > >
>>>>>> > >
>>>>>> > > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>> > > Hello all!
>>>>>> > >
>>>>>> > > During the conference I showed you a problem happening during MatAssemblyEnd in a particular code that I have. Now I have tried the same with a simple code (a symmetric problem corresponding to the Laplacian operator in 1D, from the SLEPc Hands-On exercises). As I understand it (and please correct me if I'm wrong), in this case the elements of the matrix are computed locally by each process, so there should not be any communication during the assembly. However, in the log I see that there are messages being passed. Also, the number of messages changes with the number of processes used and with the size of the matrix.
>>>>>> > > Could you please help me understand this?
>>>>>> > >
>>>>>> > > I attach the code I used and the log I get for a small problem.
>>>>>> > >
>>>>>> > > Cheers,
>>>>>> > > Ale
>>>>>> > >
>>>>>> >
>>>>>> > <log.txt>
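[Editor's note] For reference, a minimal sketch of the kind of owner-only assembly Ale describes for the 1D Laplacian, combined with the MatSetOption() check Junchao suggested. Sizes and values are illustrative; this is not the exact SLEPc Hands-On code or the attached example.

/* Editor's sketch: owner-only assembly of the 1D Laplacian, with
 * MAT_NO_OFF_PROC_ENTRIES set so that any accidental remote entry
 * raises an error instead of being communicated silently. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       n = 100, Istart, Iend, i;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);

  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {                 /* each rank sets only its own rows */
    if (i > 0)     { ierr = MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n - 1) { ierr = MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Even with owner-only insertion, MatAssemblyEnd still exchanges messages, because that is where the MatMult communication pattern for the off-process columns is built; this is the behaviour Junchao describes above, and it is a one-time cost as long as the nonzero structure does not change.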