[petsc-users] PETSc/SLEPc: Memory consumption, particularly during solver initialization/solve

2018-10-04 Thread Ale Foggia
Hello all,

I'm using SLEPc 3.9.2 (and PETSc 3.9.3) to get the EPS_SMALLEST_REAL of a
matrix with the following characteristics:

* type: real, Hermitian, sparse
* linear size: 2333606220
* distributed in 2048 processes (64 nodes, 32 procs per node)

My code first preallocates the necessary memory with
*MatMPIAIJSetPreallocation*, then fills it with the values and finally it
calls the following functions to create the solver and diagonalize the
matrix:

EPSCreate(PETSC_COMM_WORLD, &solver);
EPSSetOperators(solver,matrix,NULL);
EPSSetProblemType(solver, EPS_HEP);
EPSSetType(solver, EPSLANCZOS);
EPSSetWhichEigenpairs(solver, EPS_SMALLEST_REAL);
EPSSetFromOptions(solver);
EPSSolve(solver);
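
(For completeness, the preallocation step mentioned above looks roughly like
this; the nonzero counts d_nz/o_nz are placeholders, not the actual values I
use:)

MatCreate(PETSC_COMM_WORLD, &matrix);
MatSetSizes(matrix, PETSC_DECIDE, PETSC_DECIDE, N, N);     /* N = 2333606220 (needs a 64-bit PetscInt build) */
MatSetType(matrix, MATMPIAIJ);
MatMPIAIJSetPreallocation(matrix, d_nz, NULL, o_nz, NULL); /* estimated nonzeros per row (diag/off-diag blocks) */
/* ... MatSetValues() in a loop ... */
MatAssemblyBegin(matrix, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(matrix, MAT_FINAL_ASSEMBLY);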

I want to make an estimation for larger size problems of the memory used by
the program (at every step) because I would like to keep it under 16 GB per
node. I've used the "memory usage" functions provided by PETSc, but
something happens during the solver stage that I can't explain. This brings
up two questions.

1) In each step I put a call to four memory functions and between them I
print the value of mem:

PetscLogDouble mem = 0;
PetscMallocGetCurrentUsage(&mem);
PetscMallocGetMaximumUsage(&mem);
PetscMemoryGetCurrentUsage(&mem);
PetscMemoryGetMaximumUsage(&mem);
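
(These are per-process numbers; to get a per-node or total figure I guess one
can just sum them over ranks, something like this sketch, assuming
PetscLogDouble maps to a double:)

PetscLogDouble mem, mem_tot;
PetscMemoryGetCurrentUsage(&mem);
MPI_Reduce(&mem, &mem_tot, 1, MPI_DOUBLE, MPI_SUM, 0, PETSC_COMM_WORLD);
PetscPrintf(PETSC_COMM_WORLD, "total MemoryGetCurrent: %g B\n", mem_tot);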

I've read some other questions in the mailing list regarding the same issue,
but I still can't fully understand this. What is the difference between all of
them? What information are they actually giving me? (I know this is only a
"per process" output.) I copy the output of two steps of the program as an
example:

==== step N ====
MallocGetCurrent: 314513664.0 B
MallocGetMaximum: 332723328.0 B
MemoryGetCurrent: 539996160.0 B
MemoryGetMaximum: 0.0 B
==== step N+1 ====
MallocGetCurrent: 395902912.0 B
MallocGetMaximum: 415178624.0 B
MemoryGetCurrent: 623783936.0 B
MemoryGetMaximum: 623775744.0 B

2) I was using this information to calculate the memory required per node to
run my problem. Also, I'm able to log in to the computing node while the job
is running and check the memory consumption (with *top*). The memory usage I
see with top is more or less the same as the one reported by the PETSc
functions at the beginning. But during the initialization of the solver and
during the solve, *top* reports a consumption two times bigger than the one
the functions report. Is it possible to know where this extra memory
consumption comes from? What does SLEPc allocate that needs that much memory?
I've been trying to do the math but I think there are things I'm missing. I
thought that part of it comes from the "BV" that the option -eps_view reports:

BV Object: 2048 MPI processes
  type: svec
  17 columns of global length 2333606220
  vector orthogonalization method: modified Gram-Schmidt
  orthogonalization refinement: if needed (eta: 0.7071)
  block orthogonalization method: GS
  doing matmult as a single matrix-matrix product

But "17 * 2333606220 * 8 Bytes / #nodes" only explains one third or less of
the "extra" memory.
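
(For reference, the estimate with these numbers is 17 * 2333606220 * 8 B ~ 317
GB in total, i.e. roughly 5 GB per node over the 64 nodes, or about 155 MB per
process.)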

Ale


Re: [petsc-users] PETSc/SLEPc: Memory consumption, particularly during solver initialization/solve

2018-10-04 Thread Ale Foggia
Thank you both for your answers :)

Matt:
- Yes, sorry, I forgot to tell you that, but I've also called
PetscMemorySetGetMaximumUsage() right after initializing SLEPc. Also, I've
seen a strange behaviour: if I run the same code on my computer and on the
cluster *without* the command line option -malloc_dump, on the cluster the
output of PetscMallocGetCurrentUsage and PetscMallocGetMaximumUsage is
always zero, but that doesn't happen on my computer.

- This is the output of the code for the solving part (after EPSCreate and
after EPSSolve), and I've compared it with the output of *top* during those
moments of peak memory consumption. *top* reports, in one of its columns,
the resident set size (RES), and the numbers are around 1 GB per process,
while, among the numbers reported by the PETSc functions, the closest one is
given by MemoryGetCurrentUsage and is only 800 MB in the solving stage.
Maybe we can consider those numbers to be the same plus/minus something?
Is it safe to say that MemoryGetCurrentUsage is measuring the "ru_maxrss"
member of "rusage" (or something similar)? If that's the case, what do the
other functions report?
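
(If it helps, this is the kind of per-process cross-check I had in mind,
assuming Linux, where ru_maxrss is reported in kilobytes:)

struct rusage ru;   /* needs <sys/resource.h> */
getrusage(RUSAGE_SELF, &ru);
PetscPrintf(PETSC_COMM_SELF, "peak RSS (ru_maxrss): %ld kB\n", (long)ru.ru_maxrss);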

==== SOLVER INIT ====
MallocGetCurrent (init): 396096192.0 B
MallocGetMaximum (init): 415178624.0 B
MemoryGetCurrent (init): 624050176.0 B
MemoryGetMaximum (init): 623775744.0 B
==== SOLVER ====
MallocGetCurrent (solver): 560320256.0 B
MallocGetMaximum (solver): 560333440.0 B
MemoryGetCurrent (solver): 820961280.0 B
MemoryGetMaximum (solver): 623775744.0 B

Jose:
- By each step I mean each of the steps of the program needed to diagonalize
the matrix. For me, those are: creation of the basis, preallocation of the
matrix, setting the values of the matrix, initializing the solver,
solving/diagonalizing, and cleaning up. I'm only diagonalizing once.

- Regarding the information provided by -log_view, it's confusing to me: for
example, it reports the creation of Vecs scattered across the various stages
that I've set up (with PetscLogStageRegister and PetscLogStagePush/Pop), but
almost all the deletions are presented in the "Main Stage". What does that
"Main Stage" cover? Why are there more deletions than creations in it? It's
not completely clear to me how things are presented there.
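
(This is roughly how I define the stages, in case the setup itself is the
problem; the stage name is just an example. My understanding is that anything
created or destroyed outside a Push/Pop pair gets logged in the "Main Stage".)

PetscLogStage stage_fill;
PetscLogStageRegister("Fill matrix", &stage_fill);
PetscLogStagePush(stage_fill);
/* ... MatSetValues / MatAssemblyBegin / MatAssemblyEnd ... */
PetscLogStagePop();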

- Thanks for the suggestion about the solver. Does "faster convergence" for
Krylov-Schur mean less memory and less computation, or just less
computation?

Ale


On Thu, Oct 4, 2018 at 13:12, Jose E. Roman ()
wrote:

> Regarding the SLEPc part:
> - What do you mean by "each step"? Are you calling EPSSolve() several
> times?
> - Yes, the BV object is generally what takes most of the memory. It is
> allocated at the beginning of EPSSolve(). Depending on the solver/options,
> other memory may be allocated as well.
> - You can also see the memory reported at the end of -log_view
> - I would suggest using the default solver Krylov-Schur - it will do
> Lanczos with implicit restart, which will give faster convergence than the
> EPSLANCZOS solver.
>
> Jose
>
>
> > On 4 Oct 2018, at 12:49, Matthew Knepley 
> wrote:
> >
> > On Thu, Oct 4, 2018 at 4:43 AM Ale Foggia  wrote:
> > Hello all,
> >
> > I'm using SLEPc 3.9.2 (and PETSc 3.9.3) to get the EPS_SMALLEST_REAL of
> a matrix with the following characteristics:
> >
> > * type: real, Hermitian, sparse
> > * linear size: 2333606220
> > * distributed in 2048 processes (64 nodes, 32 procs per node)
> >
> > My code first preallocates the necessary memory with
> *MatMPIAIJSetPreallocation*, then fills it with the values and finally it
> calls the following functions to create the solver and diagonalize the
> matrix:
> >
> > EPSCreate(PETSC_COMM_WORLD, &solver);
> > EPSSetOperators(solver,matrix,NULL);
> > EPSSetProblemType(solver, EPS_HEP);
> > EPSSetType(solver, EPSLANCZOS);
> > EPSSetWhichEigenpairs(solver, EPS_SMALLEST_REAL);
> > EPSSetFromOptions(solver);
> > EPSSolve(solver);
> >
> > I want to make an estimation for larger size problems of the memory used
> by the program (at every step) because I would like to keep it under 16 GB
> per node. I've used the "memory usage" functions provided by PETSc, but
> something happens during the solver stage that I can't explain. This brings
> up two questions.
> >
> > 1) In each step I put a call to four memory functions and between them I
> print the value of mem:
> >
> > Did you call PetscMemorySetGetMaximumUsage() first?
> >
> > We are computing https://en.wikipedia.org/wiki/Resident_set_size
> however we can. Usually with getrusage().
> > From this (https://www

Re: [petsc-users] PETSc/SLEPc: Memory consumption, particularly during solver initialization/solve

2018-10-10 Thread Ale Foggia
Jed, Jose and Matthew,
I've finally managed to make massif (it gives pretty detailed information, I
like it) work correctly on the cluster and I'm able to track down the memory
consumption. What's more important (for me), I think I'm now able to make a
more accurate prediction of the memory I need for a particular problem size.
Thank you very much for all your answers and suggestions.

On Fri, Oct 5, 2018 at 9:38, Jose E. Roman ()
wrote:

>
>
> > On 4 Oct 2018, at 19:54, Ale Foggia  wrote:
> >
> > Jose:
> > - By each step I mean each of the step of the the program in order to
> diagonalize the matrix. For me, those are: creation of basis, preallocation
> of matrix, setting values of matrix, initializing solver,
> solving/diagonalizing and cleaning. I'm only diagonalizing once.
> >
> > - Regarding the information provided by -log_view, it's confusing for
> me: for example, it reports the creation of Vecs scattered across the
> various stages that I've set up (with PetscLogStageRegister and
> PetscLogStagePush/Pop), but almost all the deletions are presented in the
> "Main Stage". What does that "Main Stage" consider? Why are more deletions
> in there that creations? It's nor completely for me clear how things are
> presented there.
>
> I guess deletions should match creations. Seems to be related to using
> stages. Maybe someone from PETSc can give an explanation, but looking at a
> PETSc example that uses stages (e.g. dm/impls/plex/examples/tests/ex1.c) it
> seems that some destructions are counted in the main stage while the
> creation is counted in another stage - I guess it depends on the points
> where the stages are defined. The sum of creations matches the sum of
> destroys.
>
> >
> > - Thanks for the suggestion about the solver. Does "faster convergence"
> for Krylov-Schur mean less memory and less computation, or just less
> computation?
> >
>
> It should be about the same memory with less iterations.
>
> Jose
>
>


[petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-23 Thread Ale Foggia
Hello,

I'm currently using the Lanczos solver (EPSLANCZOS) to get the smallest real
eigenvalue (EPS_SMALLEST_REAL) of a Hermitian problem (EPS_HEP). Those are
the only options I set for the solver. My aim is to be able to
predict/estimate the time-to-solution. To do so, I was doing a scaling of
the code for different sizes of matrices and for different numbers of MPI
processes. As I was not observing good scaling, I checked the number of
iterations of the solver (given by EPSGetIterationNumber). I've encountered
that for the **same size** of matrix (that is, the same problem), when
I change the number of MPI processes, the number of iterations changes, and
the behaviour is not monotonic. These are the numbers I've got:

# procs   # iters
960       157
992       189
1024      338
1056      190
1120      174
2048      136
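
(For completeness, the count is simply what EPSGetIterationNumber returns
after EPSSolve:)

PetscInt its;
EPSGetIterationNumber(solver, &its);
PetscPrintf(PETSC_COMM_WORLD, "iterations: %D\n", its);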

I've checked the mailing list for a similar situation and I've found
another person with the same problem but with another solver ("[SLEPc] GD is
not deterministic when using different number of cores", Nov 19 2015), but
I think the solution this person found does not apply to my problem
(removing the "-eps_harmonic" option).

Can you give me any hint on what is the reason for this behaviour? Is there
a way to prevent this? It's not possible to estimate/predict any time
consumption for bigger problems if the number of iterations varies this
much.

Ale


Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-23 Thread Ale Foggia
Hello Jose, thanks for your answer.

On Tue, Oct 23, 2018 at 12:59, Jose E. Roman ()
wrote:

> There is an undocumented option:
>
>   -bv_reproducible_random
>
> It will force the initial vector of the Krylov subspace to be the same
> irrespective of the number of MPI processes. This should be used for
> scaling analyses as the one you are trying to do.
>

What about when I'm not doing the scaling? Now I would like to request
computing time for bigger problems; should I also use this option in that
case? Because, what happens if I have a "bad" configuration? Meaning, I
request an amount of time that is enough if I take the "correct" scaling
into account, but when I run, it takes double the time/iterations, like it
happened before when changing from 960 to 1024 processes?
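
(If I end up keeping it, I suppose the option can also be hard-coded with
something like the line below before EPSSetFromOptions/EPSSolve -- please
correct me if that's not the right way:)

PetscOptionsSetValue(NULL, "-bv_reproducible_random", NULL);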

>
> An additional comment is that we strongly recommend to use the default
> solver (Krylov-Schur), which will do Lanczos with implicit restart. It is
> generally faster and more stable.
>

I will be doing Dynamical Lanczos, which means that I'll need the "matrix
whose rows are the eigenvectors of the tridiagonal matrix" (so, according
to the Lanczos Technical Report notation, I need the "matrix whose rows are
the eigenvectors of T_m", which should be the same as the vectors y_i). I
checked the Technical Report for Krylov-Schur as well and I think I can get
the same information from that solver too, but I'm not sure. Can you
confirm this, please?
Also, as the vectors I want are given by V_m^(-1)*x_i = y_i (following the
notation of the Report), my idea to get them was to retrieve the invariant
subspace V_m (with EPSGetInvariantSubspace), invert it, and then multiply
it by the eigenvectors that I get with EPSGetEigenvector. Is there an
easier (or less computationally expensive) way to get this?


> Jose
>
>
>
> > > On 23 Oct 2018, at 12:13, Ale Foggia  wrote:
> >
> > Hello,
> >
> > I'm currently using Lanczos solver (EPSLANCZOS) to get the smallest real
> eigenvalue (EPS_SMALLEST_REAL) of a Hermitian problem (EPS_HEP). Those are
> the only options I set for the solver. My aim is to be able to
> predict/estimate the time-to-solution. To do so, I was doing a scaling of
> the code for different sizes of matrices and for different number of MPI
> processes. As I was not observing a good scaling I checked the number of
> iterations of the solver (given by EPSGetIterationNumber). I've encounter
> that for the **same size** of matrix (that meaning, the same problem), when
> I change the number of MPI processes, the amount of iterations changes, and
> the behaviour is not monotonic. This are the numbers I've got:
> >
> > # procs   # iters
> > 960       157
> > 992       189
> > 1024      338
> > 1056      190
> > 1120      174
> > 2048      136
> >
> > I've checked the mailing list for a similar situation and I've found
> another person with the same problem but in another solver ("[SLEPc] GD is
> not deterministic when using different number of cores", Nov 19 2015), but
> I think the solution this person finds does not apply to my problem
> (removing "-eps_harmonic" option).
> >
> > Can you give me any hint on what is the reason for this behaviour? Is
> there a way to prevent this? It's not possible to estimate/predict any time
> consumption for bigger problems if the number of iterations varies this
> much.
> >
> > Ale
>
>


Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-23 Thread Ale Foggia
On Tue, Oct 23, 2018 at 15:33, Jose E. Roman ()
wrote:

>
>
> > On 23 Oct 2018, at 15:17, Ale Foggia  wrote:
> >
> > Hello Jose, thanks for your answer.
> >
> > On Tue, Oct 23, 2018 at 12:59, Jose E. Roman ()
> wrote:
> > There is an undocumented option:
> >
> >   -bv_reproducible_random
> >
> > It will force the initial vector of the Krylov subspace to be the same
> irrespective of the number of MPI processes. This should be used for
> scaling analyses as the one you are trying to do.
> >
> > What about when I'm not doing the scaling? Now I would like to ask for
> computing time for bigger size problems, should I also use this option in
> that case? Because, what happens if I have a "bad" configuration? Meaning,
> I ask for some time, enough if I take into account the "correct" scaling,
> but when I run it takes double the time/iterations, like it happened before
> when changing from 960 to 1024 processes?
>
> When you increase the matrix size the spectrum of the matrix changes and
> probably also the convergence, so the computation time is not easy to
> predict in advance.
>

Okay, I'll keep that in mind. I thought that, even if the spectrum changes,
if I had a behaviour/tendency for 6 or 7 smaller cases I could predict more
or less the time. It worked this way until I ran into this "iterations
problem", which doubled the execution time for the same problem size. To be
completely sure: do you suggest using this run-time option in production or
not? Can you elaborate a bit on the effect of this option? Is the (huge)
difference I got in the number of iterations something expected?


> >
> > An additional comment is that we strongly recommend to use the default
> solver (Krylov-Schur), which will do Lanczos with implicit restart. It is
> generally faster and more stable.
> >
> > I will be doing Dynamical Lanczos, that means that I'll need the "matrix
> whose rows are the eigenvectors of the tridiagonal matrix" (so, according
> to the Lanczos Technical Report notation, I need the "matrix whose rows are
> the eigenvectors of T_m", which should be the same as the vectors y_i). I
> checked the Technical Report for Krylov-Schur also and I think I can get
> the same information also from that solver, but I'm not sure. Can you
> confirm this please?
> > Also, as the vectors I want are given by V_m^(-1)*x_i=y_i (following the
> notation on the Report), my idea to get them was to retrieve the invariant
> subspace V_m (with EPSGetInvariantSubspace), invert it, and then multiply
> it with the eigenvectors that I get with EPSGetEigenvector. Is there
> another easier (or with less computations) way to get this?
>
> In Krylov-Schur the tridiagonal matrix T_m becomes
> arrowhead-plus-tridiagonal. Apart from this, it should be equivalent. The
> relevant information can be obtained with EPSGetBV() and EPSGetDS(). But
> this is a "developer level" interface. We could help you get this running.
> Send a small problem matrix to slepc-maint together with a more detailed
> description of what you need to compute.
>

Thanks! When I get to that part I'll write to slepc-maint for help.


> Jose
>
> >
> >
> > Jose
> >
> >
> >
> > > On 23 Oct 2018, at 12:13, Ale Foggia  wrote:
> > >
> > > Hello,
> > >
> > > I'm currently using Lanczos solver (EPSLANCZOS) to get the smallest
> real eigenvalue (EPS_SMALLEST_REAL) of a Hermitian problem (EPS_HEP). Those
> are the only options I set for the solver. My aim is to be able to
> predict/estimate the time-to-solution. To do so, I was doing a scaling of
> the code for different sizes of matrices and for different number of MPI
> processes. As I was not observing a good scaling I checked the number of
> iterations of the solver (given by EPSGetIterationNumber). I've encounter
> that for the **same size** of matrix (that meaning, the same problem), when
> I change the number of MPI processes, the amount of iterations changes, and
> the behaviour is not monotonic. This are the numbers I've got:
> > >
> > > # procs   # iters
> > > 960       157
> > > 992       189
> > > 1024      338
> > > 1056      190
> > > 1120      174
> > > 2048      136
> > >
> > > I've checked the mailing list for a similar situation and I've found
> another person with the same problem but in another solver ("[SLEPc] GD is
> not deterministic when using different number of cores", Nov 19 2015), but
> I think the solution this person finds does not apply to my problem
> (removing "-eps_harmonic" option).
> > >
> > > Can you give me any hint on what is the reason for this behaviour? Is
> there a way to prevent this? It's not possible to estimate/predict any time
> consumption for bigger problems if the number of iterations varies this
> much.
> > >
> > > Ale
> >
>
>


Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-24 Thread Ale Foggia
I've tried the option that you gave me but I still get a different number of
iterations when changing the number of MPI processes: I ran with 960 procs
and 1024 procs and I got 435 and 176 iterations, respectively.

On Tue, Oct 23, 2018 at 16:48, Jose E. Roman ()
wrote:

>
>
> > On 23 Oct 2018, at 15:46, Ale Foggia  wrote:
> >
> >
> >
> > On Tue, Oct 23, 2018 at 15:33, Jose E. Roman ()
> wrote:
> >
> >
> > > On 23 Oct 2018, at 15:17, Ale Foggia  wrote:
> > >
> > > Hello Jose, thanks for your answer.
> > >
> > > On Tue, Oct 23, 2018 at 12:59, Jose E. Roman ()
> wrote:
> > > There is an undocumented option:
> > >
> > >   -bv_reproducible_random
> > >
> > > It will force the initial vector of the Krylov subspace to be the same
> irrespective of the number of MPI processes. This should be used for
> scaling analyses as the one you are trying to do.
> > >
> > > What about when I'm not doing the scaling? Now I would like to ask for
> computing time for bigger size problems, should I also use this option in
> that case? Because, what happens if I have a "bad" configuration? Meaning,
> I ask for some time, enough if I take into account the "correct" scaling,
> but when I run it takes double the time/iterations, like it happened before
> when changing from 960 to 1024 processes?
> >
> > When you increase the matrix size the spectrum of the matrix changes and
> probably also the convergence, so the computation time is not easy to
> predict in advance.
> >
> > Okey, I'll keep that in mine. I thought that, even if the spectrum
> changes, if I had a behaviour/tendency for 6 or 7 smaller cases I could
> predict more or less the time. It was working this way until I found this
> "iterations problem" which doubled the time of execution for the same size
> problem. To be completely sure, do you suggest me or not to use this
> run-time option when going in production? Can you elaborate a bit in the
> effect this option? Is the (huge) difference I got in the number of
> iterations something expected?
>
> Ideally if you have a rough approximation of the eigenvector, you set it
> as the initial vector with EPSSetInitialSpace(). Otherwise, SLEPc generates
> a random initial vector, that is, it starts the search blindly. The difference
> between using one random vector or another may be large, depending on the
> problem. Krylov-Schur is usually less sensitive to the initial vector.
>
> Jose
>
> >
> >
> > >
> > > An additional comment is that we strongly recommend to use the default
> solver (Krylov-Schur), which will do Lanczos with implicit restart. It is
> generally faster and more stable.
> > >
> > > I will be doing Dynamical Lanczos, that means that I'll need the
> "matrix whose rows are the eigenvectors of the tridiagonal matrix" (so,
> according to the Lanczos Technical Report notation, I need the "matrix
> whose rows are the eigenvectors of T_m", which should be the same as the
> vectors y_i). I checked the Technical Report for Krylov-Schur also and I
> think I can get the same information also from that solver, but I'm not
> sure. Can you confirm this please?
> > > Also, as the vectors I want are given by V_m^(-1)*x_i=y_i (following
> the notation on the Report), my idea to get them was to retrieve the
> invariant subspace V_m (with EPSGetInvariantSubspace), invert it, and then
> multiply it with the eigenvectors that I get with EPSGetEigenvector. Is
> there another easier (or with less computations) way to get this?
> >
> > In Krylov-Schur the tridiagonal matrix T_m becomes
> arrowhead-plus-tridiagonal. Apart from this, it should be equivalent. The
> relevant information can be obtained with EPSGetBV() and EPSGetDS(). But
> this is a "developer level" interface. We could help you get this running.
> Send a small problem matrix to slepc-maint together with a more detailed
> description of what you need to compute.
> >
> > Thanks! When I get to that part I'll write to slepc-maint for help.
> >
> >
> > Jose
> >
> > >
> > >
> > > Jose
> > >
> > >
> > >
> > > > On 23 Oct 2018, at 12:13, Ale Foggia 
> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm currently using Lanczos solver (EPSLANCZOS) to get the smallest
> real eigenvalue (EPS_SMALLEST_REAL) of a Hermitian problem (EPS_HEP). Those
> are the only options I set for the solver. My aim is to be able to
> pre

Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-24 Thread Ale Foggia
The functions called to set the solver are (in this order): EPSCreate();
EPSSetOperators(); EPSSetProblemType(EPS_HEP); EPSSetType(EPSLANCZOS);
EPSSetWhichEigenpairs(EPS_SMALLEST_REAL); EPSSetFromOptions();
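
(For the full-reorthogonalization runs mentioned below, the only change on top
of these calls is, if I understand correctly, the command-line option
-eps_lanczos_reorthog full, or equivalently:)

EPSLanczosSetReorthog(solver, EPS_LANCZOS_REORTHOG_FULL);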

The output of -eps_view for each run is:
=
EPS Object: 960 MPI processes
  type: lanczos
LOCAL reorthogonalization
  problem type: symmetric eigenvalue problem
  selected portion of the spectrum: smallest real parts
  number of eigenvalues (nev): 1
  number of column vectors (ncv): 16
  maximum dimension of projected problem (mpd): 16
  maximum number of iterations: 291700777
  tolerance: 1e-08
  convergence test: relative to the eigenvalue
BV Object: 960 MPI processes
  type: svec
  17 columns of global length 2333606220
  vector orthogonalization method: modified Gram-Schmidt
  orthogonalization refinement: if needed (eta: 0.7071)
  block orthogonalization method: GS
  doing matmult as a single matrix-matrix product
  generating random vectors independent of the number of processes
DS Object: 960 MPI processes
  type: hep
  parallel operation mode: REDUNDANT
  solving the problem with: Implicit QR method (_steqr)
ST Object: 960 MPI processes
  type: shift
  shift: 0.
  number of matrices: 1
=
EPS Object: 1024 MPI processes
  type: lanczos
LOCAL reorthogonalization
  problem type: symmetric eigenvalue problem
  selected portion of the spectrum: smallest real parts
  number of eigenvalues (nev): 1
  number of column vectors (ncv): 16
  maximum dimension of projected problem (mpd): 16
  maximum number of iterations: 291700777
  tolerance: 1e-08
  convergence test: relative to the eigenvalue
BV Object: 1024 MPI processes
  type: svec
  17 columns of global length 2333606220
  vector orthogonalization method: modified Gram-Schmidt
  orthogonalization refinement: if needed (eta: 0.7071)
  block orthogonalization method: GS
  doing matmult as a single matrix-matrix product
  generating random vectors independent of the number of processes
DS Object: 1024 MPI processes
  type: hep
  parallel operation mode: REDUNDANT
  solving the problem with: Implicit QR method (_steqr)
ST Object: 1024 MPI processes
  type: shift
  shift: 0.
  number of matrices: 1
=

I ran the same configurations again and I got the same result in terms of
the number of iterations.

I also tried full reorthogonalization (always with the
-bv_reproducible_random option) but I still get a different number of
iterations: for 960 procs I get 172 iters, and for 1024 I get 362 iters.
The -eps_view output for this case (only for 960 procs; the other one has
the same information, except for the number of processes) is:
=
EPS Object: 960 MPI processes
  type: lanczos
FULL reorthogonalization
  problem type: symmetric eigenvalue problem
  selected portion of the spectrum: smallest real parts
  number of eigenvalues (nev): 1
  number of column vectors (ncv): 16
  maximum dimension of projected problem (mpd): 16
  maximum number of iterations: 291700777
  tolerance: 1e-08
  convergence test: relative to the eigenvalue
BV Object: 960 MPI processes
  type: svec
  17 columns of global length 2333606220
  vector orthogonalization method: classical Gram-Schmidt
  orthogonalization refinement: if needed (eta: 0.7071)
  block orthogonalization method: GS
  doing matmult as a single matrix-matrix product
  generating random vectors independent of the number of processes
DS Object: 960 MPI processes
  type: hep
  parallel operation mode: REDUNDANT
  solving the problem with: Implicit QR method (_steqr)
ST Object: 960 MPI processes
  type: shift
  shift: 0.
  number of matrices: 1
=

On Wed, Oct 24, 2018 at 10:52, Jose E. Roman ()
wrote:

> This is very strange. Make sure you call EPSSetFromOptions in the code. Do
> iteration counts change also for two different runs with the same number of
> processes?
> Maybe Lanczos with default options is too sensitive (by default it does
> not reorthogonalize). Suggest using Krylov-Schur or Lanczos with full
> reorthogonalization (EPSLanczosSetReorthog).
> Also, send the output of -eps_view to see if there is anything abnormal.
>
> Jose
>
>
> > On 24 Oct 2018, at 9:09, Ale Foggia  wrote:
> >
> > I've tried the option that you gave me but I still get different number
> of iterations when changing the number of MPI processes: I did 960 procs
> and 1024 procs and I got 435 and 176 iterations, respectively.
> >
> > On Tue, Oct 23, 2018 at 16:48, Jose E. Roman ()
> wrote:
> >
> >
> > > On 23 Oct 2018, at 15:46, Ale Foggia  wrote:
> > >
> > >
> > >
>

Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-25 Thread Ale Foggia
On Tue, Oct 23, 2018 at 13:53, Matthew Knepley ()
wrote:

> On Tue, Oct 23, 2018 at 6:24 AM Ale Foggia  wrote:
>
>> Hello,
>>
>> I'm currently using Lanczos solver (EPSLANCZOS) to get the smallest real
>> eigenvalue (EPS_SMALLEST_REAL) of a Hermitian problem (EPS_HEP). Those are
>> the only options I set for the solver. My aim is to be able to
>> predict/estimate the time-to-solution. To do so, I was doing a scaling of
>> the code for different sizes of matrices and for different number of MPI
>> processes. As I was not observing a good scaling I checked the number of
>> iterations of the solver (given by EPSGetIterationNumber). I've encounter
>> that for the **same size** of matrix (that meaning, the same problem), when
>> I change the number of MPI processes, the amount of iterations changes, and
>> the behaviour is not monotonic. This are the numbers I've got:
>>
>
> I am sure you know this, but this test is strong scaling and will top out
> when the individual problem sizes become too small (we see this at several
> thousand unknowns).
>

Thanks for pointing this out, we are aware of that and I've been "playing"
around to try to see by myself this behaviour. Now, I think I'll go with
the Krylov-Schur method because is the only solution to the problem of the
number of iterations. With this I think I'll be able to see the individual
problem size effect in the scaling.
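
(So the only change in the setup, if I'm not mistaken, is replacing the
EPSSetType line -- or simply removing it, since Krylov-Schur is the default:)

EPSSetType(solver, EPSKRYLOVSCHUR);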


>   Thanks,
>
> Matt
>
>
>>
>> # procs   # iters
>> 960       157
>> 992       189
>> 1024      338
>> 1056      190
>> 1120      174
>> 2048      136
>>
>> I've checked the mailing list for a similar situation and I've found
>> another person with the same problem but in another solver ("[SLEPc] GD is
>> not deterministic when using different number of cores", Nov 19 2015), but
>> I think the solution this person finds does not apply to my problem
>> (removing "-eps_harmonic" option).
>>
>> Can you give me any hint on what is the reason for this behaviour? Is
>> there a way to prevent this? It's not possible to estimate/predict any time
>> consumption for bigger problems if the number of iterations varies this
>> much.
>>
>> Ale
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>


Re: [petsc-users] [SLEPc] Number of iterations changes with MPI processes in Lanczos

2018-10-25 Thread Ale Foggia
No, the eigenvalue is around -15. I've tried KS and the number of
iterations differs by one when I change the number of MPI processes, which
seems fine to me. So I'll see whether this method is suitable for my specific
goal or not, and I'll try to use it. Thanks for the help.

On Wed, Oct 24, 2018 at 15:48, Jose E. Roman ()
wrote:

> Everything seems correct. I don't know, maybe your problem is very
> sensitive? Is the eigenvalue tiny?
> I would still try with Krylov-Schur.
> Jose
>
>
> > On 24 Oct 2018, at 14:59, Ale Foggia  wrote:
> >
> > The functions called to set the solver are (in this order): EPSCreate();
> EPSSetOperators(); EPSSetProblemType(EPS_HEP); EPSSetType(EPSLANCZOS);
> EPSSetWhichEigenpairs(EPS_SMALLEST_REAL); EPSSetFromOptions();
> >
> > The output of -eps_view for each run is:
> > =
> > EPS Object: 960 MPI processes
> >   type: lanczos
> > LOCAL reorthogonalization
> >   problem type: symmetric eigenvalue problem
> >   selected portion of the spectrum: smallest real parts
> >   number of eigenvalues (nev): 1
> >   number of column vectors (ncv): 16
> >   maximum dimension of projected problem (mpd): 16
> >   maximum number of iterations: 291700777
> >   tolerance: 1e-08
> >   convergence test: relative to the eigenvalue
> > BV Object: 960 MPI processes
> >   type: svec
> >   17 columns of global length 2333606220
> >   vector orthogonalization method: modified Gram-Schmidt
> >   orthogonalization refinement: if needed (eta: 0.7071)
> >   block orthogonalization method: GS
> >   doing matmult as a single matrix-matrix product
> >   generating random vectors independent of the number of processes
> > DS Object: 960 MPI processes
> >   type: hep
> >   parallel operation mode: REDUNDANT
> >   solving the problem with: Implicit QR method (_steqr)
> > ST Object: 960 MPI processes
> >   type: shift
> >   shift: 0.
> >   number of matrices: 1
> > =
> > EPS Object: 1024 MPI processes
> >   type: lanczos
> > LOCAL reorthogonalization
> >   problem type: symmetric eigenvalue problem
> >   selected portion of the spectrum: smallest real parts
> >   number of eigenvalues (nev): 1
> >   number of column vectors (ncv): 16
> >   maximum dimension of projected problem (mpd): 16
> >   maximum number of iterations: 291700777
> >   tolerance: 1e-08
> >   convergence test: relative to the eigenvalue
> > BV Object: 1024 MPI processes
> >   type: svec
> >   17 columns of global length 2333606220
> >   vector orthogonalization method: modified Gram-Schmidt
> >   orthogonalization refinement: if needed (eta: 0.7071)
> >   block orthogonalization method: GS
> >   doing matmult as a single matrix-matrix product
> >   generating random vectors independent of the number of processes
> > DS Object: 1024 MPI processes
> >   type: hep
> >   parallel operation mode: REDUNDANT
> >   solving the problem with: Implicit QR method (_steqr)
> > ST Object: 1024 MPI processes
> >   type: shift
> >   shift: 0.
> >   number of matrices: 1
> > =
> >
> > I run again the same configurations and I got the same result in term of
> the number of iterations.
> >
> > I also tried the full reorthogonalization (always with the
> -bv_reproducible_random option) but I still get different number of
> iterations: for 960 procs I get 172 iters, and for 1024 I get 362 iters.
> The -esp_view output for this case (only for 960 procs, the other one has
> the same information -except the number of processes-) is:
> > =
> > EPS Object: 960 MPI processes
> >   type: lanczos
> > FULL reorthogonalization
> >   problem type: symmetric eigenvalue problem
> >   selected portion of the spectrum: smallest real parts
> >   number of eigenvalues (nev): 1
> >   number of column vectors (ncv): 16
> >   maximum dimension of projected problem (mpd): 16
> >   maximum number of iterations: 291700777
> >   tolerance: 1e-08
> >   convergence test: relative to the eigenvalue
> > BV Object: 960 MPI processes
> >   type: svec
> >   17 columns of global length 2333606220
> >   vector orthogonalization method: classical Gram-Schmidt
> >   orthogonalization refinement: if needed (eta: 0.7071)
> >   block orthogonalization method: GS
> >   doing 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-28 Thread Ale Foggia via petsc-users
Junchao,
I'm sorry for the late response.

On Wed, Jun 26, 2019 at 16:39, Zhang, Junchao ()
wrote:

> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your
> processes may have been killed by the cgroup out-of-memory handler."
>

I mentioned that I used 1024 nodes and 32 processes on each node because
the application needs a lot of memory. I think that for a system of size
38 one needs more than 256 nodes for sure (assuming only 32 procs per node).
I would try with 512 if possible.

> I also tried with 128 cores with ./main.x 2 ... and got a weird error
> message: "The size of the basis has to be at least equal to the number
> of MPI processes used."
>

The error comes from the fact that you put a system size of only 2, which is
too small.
I can also see the problem in the assembly with system sizes smaller than
38, so you can try with, say, 30 (for which I also have a log). In that case
I ran with 64 nodes and 32 processes per node. I think the problem may also
fit in 32 nodes.

--Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
> wrote:
>
>> Ale,
>>   I successfully built your code and submitted a job to the NERSC Cori
>> machine requiring 32768 KNL cores and one and a half hours. It is estimated
>> to run in 3 days. If you also observed the same problem with fewer cores,
>> what are your input arguments?  Currently, I use what is in your log file:
>> ./main.x 38 -nn -j1 1.0 -d1 1.0 -eps_type krylovschur -eps_tol 1e-9
>> -log_view
>>   The smaller the better. Thanks.
>> --Junchao Zhang
>>
>>
>> On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia  wrote:
>>
>>> Yes, I used KNL nodes. If you can perform the test, that would be great.
>>> Could it be that I'm not using the correct configuration of the KNL nodes? These
>>> are the environment variables I set:
>>> MKL_NUM_THREADS=1
>>> OMP_NUM_THREADS=1
>>> KMP_HW_SUBSET=1t
>>> KMP_AFFINITY=compact
>>> I_MPI_PIN_DOMAIN=socket
>>> I_MPI_PIN_PROCESSOR_LIST=0-63
>>> MKL_DYNAMIC=0
>>>
>>> The code is in https://github.com/amfoggia/LSQuantumED and it has a
>>> readme to compile it and run it. When I run the test I used only 32
>>> processors per node, and I used 1024 nodes in total, and it's for nspins=38.
>>> Thank you
>>>
>>> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao ()
>>> wrote:
>>>
>>>> Ale,
>>>>   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL
>>>> nodes  one year ago. He used 32768 processors and called MatAssemblyEnd 118
>>>> times and it used only 1.5 seconds in total.  So I guess something was
>>>> wrong with your test. If you can share your code, I can have a test on our
>>>> machine to see how it goes.
>>>>  Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
>>>> wrote:
>>>>
>>>>> MatAssembly was called once (in stage 5) and cost 2.5% of the total
>>>>> time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF,
>>>>> which does global synchronization. The high max/min ratio means load
>>>>> imbalance. What I do not understand is MatAssemblyEnd. The ratio is 1.0.
>>>>> It means processors are already synchronized. With 32768 processors,
>>>>> there are 1.2e+06 messages with average length 1.9e+06 bytes. So each
>>>>> processor sends 36 (1.2e+06/32768) ~2MB messages and it takes 54 seconds.
>>>>> Another chance is the reduction at MatAssemblyEnd. I don't know why it
>>>>> needs 8 reductions. In my mind, one is enough. I need to look at the code.
>>>>>
>>>>> Summary of Stages:    --- Time ---      --- Flop ---      --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                       Avg     %Total   Avg       %Total   Count     %Total  Avg       %Total      Count     %Total
>>>>>  0:      Main Stage:  8.5045e+02 13.0%  3.0633e+15 14.0%   8.196e+07 13.1%   7.768e+06 13.1%       2.530e+02 13.0%
>>>>>  1:    Create Basis:  7.9234e-02  0.0%  0.e+00      0.0%   0.000e+00  0.0%   0.000e+00  0.0%       0.000e+00  0.0%
>>>>>  2:  Create Lattice:  8.3944e-05  0.0%  0.e+00      0.0%   0.000e+00  0.0%   0.000e+00

Re: [petsc-users] Communication during MatAssemblyEnd

2019-07-03 Thread Ale Foggia via petsc-users
Thank you Richard for your explanation.

I first changed the way I was running the code. On the machine there's
SLURM and I was using "srun -n  ./my_program.x". I've seen more than a 60%
improvement in execution time just by running with
"srun -n  --ntasks-per-core=1 --cpu-bind=cores ./my_program.x" instead. All
the results I'm sharing now were obtained by running with the second command.

1) I've tested the configuration you suggested and found mixed results. For
one system (matrix) size and distribution of memory per node, there's a
small difference in execution time (say, it goes from 208 seconds to 200
seconds) between the two cases: I_MPI_PIN_DOMAIN=socket and
I_MPI_PIN_DOMAIN=auto:compact. In another case, I found that using
I_MPI_PIN_DOMAIN=auto:compact actually gives worse results: I go from an
execution time of 5 seconds to 11 seconds when using the "socket" and
"auto:compact" configurations, respectively. In this case, I see that the
log events that differ in execution time are MatMult, VecNorm and
VecScatterEnd. I also see that the imbalance for those log events is
smaller when using "socket" (hence the time difference).

2) I also tested using I_MPI_PIN_PROCESSOR_LIST=0-63 and I get the same
numbers as when using I_MPI_PIN_DOMAIN=socket (I've checked the BIOS
numbering and those correspond to the first CPU in each core).

On Tue, Jun 25, 2019 at 2:32, Mills, Richard Tran ()
wrote:

> Hi Ale,
>
> I don't know if this has anything to do with the strange performance you
> are seeing, but I notice that some of your Intel MPI settings are
> inconsistent and I'm not sure what you are intending. You have specified a
> value for I_MPI_PIN_DOMAIN and also a value for I_MPI_PIN_PROCESSOR_LIST.
> You can specify the domains that each MPI rank is pinned to by specifying
> one of these, but not both. According to what I found by a Google search
> for the Intel MPI documentation, if you specify both, it is
> I_MPI_PIN_DOMAIN that gets used.
>
> A KNL node has only one socket per node (though it can be made to appear
> as if it has four virtual sockets if booted into SNC-4 mode, which doesn't
> seem to be a popular usage model). If you run with
> "I_MPI_PIN_DOMAIN=socket", then each of your MPI ranks on a node will be
> "pinned" to a domain that consists of all of the CPU cores available in a
> socket, so the MPI processes aren't really pinned at all, and can be
> migrated around by the system. You may see better performance by having
> your MPI implementation pin processes to more restrictive domains. I
> suggest first trying
>
>   I_MPI_PIN_DOMAIN=auto:compact
>
> which I believe is the default Intel MPI behavior. This will create
> domains by dividing the number of logical CPUs (# of hardware threads) by
> the number of MPI processes being started on the node, and they will be
> "compact" in the sense that domain members will be as "close" to each other
> as possible in terms of sharing cores, caches, etc.
>
> I think that Intel MPI is ignoring your I_MPI_PIN_PROCESSOR_LIST setting,
> but I'll go ahead and point out that, if you are running with 64-core KNL
> nodes, there are 4*64 = 256 logical processors available, because each core
> supports four hardware threads. Asking for logical processors 0-63 might
> not actually be using all of the cores, as, depending on the BIOS numbering
> (which can be arbitrary), some of these logical processors may actually be
> hardware threads that share the same core.
>
> Best regards,
> Richard
>
> On 6/24/19 4:19 AM, Ale Foggia via petsc-users wrote:
>
> Yes, I used KNL nodes. If you can perform the test, that would be great. Could
> it be that I'm not using the correct configuration of the KNL nodes? These are
> the environment variables I set:
> MKL_NUM_THREADS=1
> OMP_NUM_THREADS=1
> KMP_HW_SUBSET=1t
> KMP_AFFINITY=compact
> I_MPI_PIN_DOMAIN=socket
> I_MPI_PIN_PROCESSOR_LIST=0-63
> MKL_DYNAMIC=0
>
> The code is in https://github.com/amfoggia/LSQuantumED and it has a
> readme to compile it and run it. When I run the test I used only 32
> processors per node, and I used 1024 nodes in total, and it's for nspins=38.
> Thank you
>
> On Fri, Jun 21, 2019 at 20:03, Zhang, Junchao ()
> wrote:
>
>> Ale,
>>   Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL
>> nodes  one year ago. He used 32768 processors and called MatAssemblyEnd 118
>> times and it used only 1.5 seconds in total.  So I guess something was
>> wrong with your test. If you can share your code, I can have a test on our
>> machine to see how it goes.
>>  Thanks.
>> --Junchao Z