[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-03-06 Thread Barry Smith

   I don't see any options for turning on the threads here?

  #PETSc Option Table entries:
-ksp_type bcgs
-log_summary
-pc_type lu
#End of PETSc Option Table entries

  From http://www.mcs.anl.gov/petsc/features/threads.html

   The three important run-time options for using threads are:
     -threadcomm_nthreads : Sets the number of threads
     -threadcomm_affinities : Sets the core affinities of threads
     -threadcomm_type : Threading model (OpenMP, pthread, nothread)
   Run with -help to see the available options with threads.
   A few tutorial examples are located at
   $PETSC_DIR/src/sys/threadcomm/examples/tutorials
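
   For example (just a sketch; the executable name is the one from your log and
the thread count is made up), the same run with the OpenMP threading model
turned on would look something like

     ./run -ksp_type bcgs -pc_type lu -log_summary -threadcomm_type openmp -threadcomm_nthreads 4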

  Also, LU is a direct solver that is not threaded, so using threads for this 
exact run will not help (much) at all. Threads will only show a useful speedup 
for iterative methods.
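
   (Just as a rough sketch: an iterative setup along the lines of

     ./run -ksp_type bcgs -pc_type jacobi -threadcomm_type openmp -threadcomm_nthreads 4

spends its time in matrix-vector products and vector operations, which are the
kernels that can run threaded, so that is the kind of run where the thread
options can actually make a visible difference in -log_summary. The jacobi
preconditioner here is only a placeholder for "something that is not a serial
direct factorization".)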

   Barry

  As time goes by we hope to have more extensive thread support in more 
routines, but things like factorization and solve are difficult, so outside 
help would be very useful.

On Mar 6, 2013, at 3:39 AM, Åsmund Ervik wrote:

> Hi again,
> 
> On 1 March 2013 at 20:06, Jed Brown wrote:
>> 
>> Matrix and vector operations are probably running in parallel, but probably
>> not the operations that are taking time. Always send -log_summary if you
>> have a performance question.
>> 
> 
> I don't think they are running in parallel. When I analyze my code in
> Intel VTune Amplifier, the only routines running in parallel are my own
> OpenMP ones. Indeed, if I comment out my OpenMP pragmas and recompile my
> code, it never uses more than one thread.
> 
> -log_summary is shown below; this is using -pc_type lu -ksp_type bcgs.
> The fastest PC for my cases is usually BoomerAMG from HYPRE, so I used
> LU instead here in order to limit the test to PETSc only. The summary
> agrees with VTune that MatLUFactorNumeric is the most time-consuming
> routine; in general it seems that the PC is always the most time-consuming.
> 
> Any advice on how to get OpenMP working?
> 
> Regards,
> Åsmund
> 
> 
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./run on a arch-linux2-c-opt named vsl161 with 1 processor, by asmunder Wed Mar  6 10:14:55 2013
> Using Petsc Development HG revision: 58cc6199509f1642f637843f1ca468283bf5ced9  HG Date: Wed Jan 30 00:39:35 2013 -0600
> 
>                          Max       Max/Min        Avg      Total
> Time (sec):           4.446e+02      1.0       4.446e+02
> Objects:              2.017e+03      1.0       2.017e+03
> Flops:                3.919e+11      1.0       3.919e+11  3.919e+11
> Flops/sec:            8.815e+08      1.0       8.815e+08  8.815e+08
> MPI Messages:         0.000e+00      0.0       0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.0       0.000e+00  0.000e+00
> MPI Reductions:       2.818e+03      1.0
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.4460e+02 100.0%  3.9191e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.817e+03 100.0%
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %f - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> 
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------

[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-03-06 Thread Åsmund Ervik
Hi again,

On 1 March 2013 at 20:06, Jed Brown wrote:
> 
> Matrix and vector operations are probably running in parallel, but probably
> not the operations that are taking time. Always send -log_summary if you
> have a performance question.
> 

I don't think they are running in parallel. When I analyze my code in
Intel VTune Amplifier, the only routines running in parallel are my own
OpenMP ones. Indeed, if I comment out my OpenMP pragmas and recompile my
code, it never uses more than one thread.

-log_summary is shown below; this is using -pc_type lu -ksp_type bcgs.
The fastest PC for my cases is usually BoomerAMG from HYPRE, so I used
LU instead here in order to limit the test to PETSc only. The summary
agrees with VTune that MatLUFactorNumeric is the most time-consuming
routine; in general it seems that the PC is always the most time-consuming.

Any advice on how to get OpenMP working?

Regards,
Åsmund



---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./run on a arch-linux2-c-opt named vsl161 with 1 processor, by asmunder Wed Mar  6 10:14:55 2013
Using Petsc Development HG revision: 58cc6199509f1642f637843f1ca468283bf5ced9  HG Date: Wed Jan 30 00:39:35 2013 -0600

                         Max       Max/Min        Avg      Total
Time (sec):           4.446e+02      1.0       4.446e+02
Objects:              2.017e+03      1.0       2.017e+03
Flops:                3.919e+11      1.0       3.919e+11  3.919e+11
Flops/sec:            8.815e+08      1.0       8.815e+08  8.815e+08
MPI Messages:         0.000e+00      0.0       0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.0       0.000e+00  0.000e+00
MPI Reductions:       2.818e+03      1.0

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.4460e+02 100.0%  3.9191e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.817e+03 100.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %f - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------

Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecDot               802 1.0 9.2811e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2117
VecDotNorm2          401 1.0 7.1333e-02 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 4.0e+02  0  0  0  0 14   0  0  0  0 14  2755
VecNorm             1203 1.0 7.8265e-02 1.0 2.95e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3766
VecCopy              802 1.0 1.1754e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              1211 1.0 9.9961e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              401 1.0 4.5847e-02 1.0 9.82e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2143
VecAXPBYCZ           802 1.0 1.3489e-01 1.0 3.93e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2913
VecWAXPY             802 1.0 1.2292e-01 1.0 1.96e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1599
VecAssemblyBegin     802 1.0 2.4509e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd       802 1.0 6.7234e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMu

[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-03-01 Thread Jed Brown
On Fri, Mar 1, 2013 at 3:26 AM, Åsmund Ervik wrote:

> Thanks for clarifying this. I am already using OpenMP pragmas in non-PETSc
> routines in my code, and using petsc-dev. Are you saying that I should also
> somehow use OpenMP pragmas around the calls to KSPSolve etc.?
>
> Suppose that my program is usually run like this:
> "./run -pc_type gamg -ksp_type bcgs"
> with other values left to their defaults, and I want to make it run in
> parallel:
> "./run -pc type gamg -ksp_type bcgs -threadcomm_type openmp
> -threadcomm_nthreads 8"
>
> When I do this, the PC and KSP still run in serial as far as I can tell,
> and the program does not execute faster. What am I missing here?
>

Matrix and vector operations are probably running in parallel, but probably
not the operations that are taking time. Always send -log_summary if you
have a performance question.


> In case it is of interest, the matrix from my Poisson equation has in the
> range of 0.4 - 1 million nonzero elements, on average 5 per row.
>



[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-03-01 Thread Åsmund Ervik
Hi Barry,

On 28 February 2013 at 17:38, Barry Smith wrote:
>
>  2) You should not need petscthreadcomm.h in Fortran. Simply use OpenMP 
> pragmas in your own portion of the code.
>

Thanks for clarifying this. I am already using OpenMP pragmas in 
non-PETSc routines in my code, and using petsc-dev. Are you saying that 
I should also somehow use OpenMP pragmas around the calls to KSPSolve etc.?

Suppose that my program is usually run like this:
"./run -pc_type gamg -ksp_type bcgs"
with other values left to their defaults, and I want to make it run in 
parallel:
"./run -pc type gamg -ksp_type bcgs -threadcomm_type openmp 
-threadcomm_nthreads 8"

When I do this, the PC and KSP still run in serial as far as I can tell, 
and the program does not execute faster. What am I missing here?

In case it is of interest, the matrix from my Poisson equation has in 
the range of 0.4 - 1 million nonzero elements, on average 5 per row.

Regards,
Åsmund


[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-02-28 Thread Åsmund Ervik
Hi,

Having read that petsc-dev has the possibility of using OpenMP to run PC 
and KSP routines in parallel, and that people are getting nice speedup 
with this, I wanted to make use of this feature in our code. It is an 
incompressible two-phase flow solver programmed in Fortran, which uses 
PETSc for solving the pressure Poisson equation.

I am, however, unable to use PETSc with OpenMP from Fortran. In 
particular, I can't seem to find a Fortran header file corresponding to 
petscthreadcomm.h (which is for C/C++). Is this not implemented yet? If 
so, is only the header file missing, or is more work required?

Regards,
Åsmund Ervik


[petsc-dev] OpenMP in PETSc when calling from Fortran?

2013-02-28 Thread Barry Smith

   1) Work with petsc-dev (http://www.mcs.anl.gov/petsc/developers/index.html); 
the threaded stuff is more mature there.

2) You should not need petscthreadcomm.h in Fortran. Simply use OpenMP 
pragmas in your own portion of the code (see the sketch after point 3 below).

3) See http://www.mcs.anl.gov/petsc/features/threads.html for a little bit 
of detail and feel free to post additional questions on this mailing list.
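
   As a rough sketch of point 2 (the routine and variable names below are made
up for illustration): the OpenMP directives go in your own Fortran routines,
compiled with your compiler's OpenMP flag, while the PETSc calls themselves are
left untouched.

      ! Hypothetical user routine: the !$omp directives thread your own loops;
      ! nothing PETSc-specific is needed here.
      subroutine update_field(n, u, unew)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: u(n)
        real(8), intent(out) :: unew(n)
        integer :: i
        !$omp parallel do
        do i = 2, n-1
           unew(i) = 0.5d0*u(i) + 0.25d0*(u(i-1) + u(i+1))
        end do
        !$omp end parallel do
        unew(1) = u(1)
        unew(n) = u(n)
      end subroutine update_field

   Whether the solvers inside PETSc also run threaded is controlled separately
by the run-time options on the threads page from point 3 (-threadcomm_type,
-threadcomm_nthreads), not by these pragmas.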

   Barry


On Feb 28, 2013, at 4:35 AM, Åsmund Ervik wrote:

> Hi,
> 
> Having read that petsc-dev has the possibility of using OpenMP to run PC and 
> KSP routines in parallel, and that people are getting nice speedup with this, 
> I wanted to make use of this feature in our code. It is an incompressible 
> two-phase flow solver programmed in Fortran, which uses PETSc for solving the 
> pressure Poisson equation.
> 
> I am, however, unable to use PETSc with OpenMP from Fortran. In particular, I 
> can't seem to find a Fortran header file corresponding to petscthreadcomm.h 
> (which is for C/C++). Is this not implemented yet? If so, is only the header 
> file missing, or is more work required?
> 
> Regards,
> Åsmund Ervik