Begin forwarded message:

> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 18, 2011 4:40:07 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at 
> mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl 
> at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier 
> <ethier at pppl.gov>, John Shalf <jshalf at lbl.gov>
> Subject: RE: Poisson step in GTS
> 
> Barry,
> 
> No, Cray does not provide any threaded BLAS 1.  Generally speaking it is not 
> worth threading a single loop unless the trip count is very high, and that 
> does not happen often enough to warrant a special BLAS.  In fact, I am not 
> even sure we put OpenMP into BLAS 2; I don't think so.
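> 
> If it helps to picture it, about all a threaded daxpy could do is something 
> like the sketch below (illustrative only, not anything from libsci; the 
> 10000 cutoff is a made-up placeholder, not a tuned value):
> 
>     /* Illustrative threaded daxpy: y = alpha*x + y.  The if() clause only
>        forks threads when the trip count is large enough to amortize the
>        parallel region overhead. */
>     void my_daxpy(int n, double alpha, const double *x, double *y)
>     {
>       int i;
>     #pragma omp parallel for if(n > 10000) schedule(static)
>       for (i = 0; i < n; i++) y[i] += alpha*x[i];
>     }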
> 
> Nathan
> 
> Nathan Wichmann                Cray Inc.
> wichmann at cray.com              380 Jackson St
> Applications Engineer          St Paul, MN 55101
> office:  1-800-284-2729  x605-9079          
>  cell:  651-428-1131
> 
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
> Sent: Saturday, June 18, 2011 12:13 PM
> To: Nathan Wichmann
> Cc: Lois Curfman McInnes; Satish Balay; Alice Koniges; Robert Preissl; Erich 
> Strohmaier; Stephane Ethier; John Shalf
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
> 
>> Hi Robert, Barry and all,
>> 
>> Is it our assumption that the Poisson version of GTS will normally be run 
>> with 1 MPI rank per die and 6 OpenMP threads (on AMD Magny-Cours)?
> 
>   Our new vector and matrix classes will allow the flexibility of any number 
> of MPI processes and any number of threads under that. So 1 MPI rank and 6 
> threads is supportable.
> 
>> In that case there should be sufficient bandwidth for decent scaling; I 
>> would say something like Barry's Intel experience.  Barry is certainly correct 
>> that as one uses more cores one will be more bandwidth limited.
> 
>   I would be interested in seeing the OpenMP STREAM numbers for this system.
>> 
>> I also like John's comment: "we have little faith that the compiler will do 
>> anything intelligent."  Which compiler are you using?  If you are using CCE 
>> then you should get a .lst listing file to see what it is doing.  Probably 
>> the only thing that can and should be done is to unroll the inner loop.
> 
>  Do you folks provide thread-based BLAS 1 operations? For example ddot, 
> dscal, daxpy? If so, we can piggy-back on those to get the best possible 
> performance on the vector operations.
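> 
>  To be concrete, by "thread based BLAS 1" I just mean reductions and 
> axpy-style loops done with threads, e.g. a ddot along these lines (a rough 
> sketch, not a reference to any particular library routine):
> 
>     /* Rough sketch of a threaded dot product; a vendor BLAS would also
>        block, align, and vectorize, but the structure is a plain reduction. */
>     double my_ddot(int n, const double *x, const double *y)
>     {
>       double sum = 0.0;
>       int i;
>     #pragma omp parallel for reduction(+:sum) schedule(static)
>       for (i = 0; i < n; i++) sum += x[i]*y[i];
>       return sum;
>     }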
>> 
>> Another consideration is the typical size of "n".  Normally, the denser the 
>> matrix, the larger n is, no?  But still, it would be interesting to know.
> 
>  In this application the matrix is extremely sparse, likely between 7 and 27 
> nonzeros per row. Matrices, of course, can get as big as you like.
> 
>   Barry
> 
>> 
>> Ciao!
>> Nathan
>> 
>> 
>> Nathan Wichmann                Cray Inc.
>> wichmann at cray.com              380 Jackson St
>> Applications Engineer          St Paul, MN 55101
>> office:  1-800-284-2729  x605-9079          
>> cell:  651-428-1131
>> 
>> -----Original Message-----
>> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
>> Sent: Friday, June 17, 2011 1:42 PM
>> To: Lois Curfman McInnes; Satish Balay
>> Cc: Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier; John 
>> Shalf; Nathan Wichmann
>> Subject: Re: Poisson step in GTS
>> 
>> 
>> With the point Jacobi solver yes MatMult_SeqAIJ() will be the most time 
>> consuming code in the simulation.  With hypre's BoomerAMG it will be a far more 
>> complicated set of routines because that is a much more sophisticated solver 
>> (for hypre you will need to work directly with Rob Falgout on threaded 
>> versions of his software).
>> 
>> Based on the request of the GTS folks (and others) we have begun the 
>> implementation of a thread based vector and matrix class in PETSc that will 
>> allow all three of the routines
>> 
>>>>>   2399.257095    2399.257095  38.384429  MatMult_SeqAIJ (GTS-usertime)
>>>>>    565.571417     565.571417   9.048274  VecMAXPY_Seq (GTS-usertime)
>>>>>    481.171419     481.171419   7.698004  VecMDot_Seq (GTS-usertime)
>> 
>> to run using multiple threads and improve the performance a bit. Give us a 
>> few more weeks and this will be ready for basic use in petsc-dev.
>> 
>> ___BUT___ it is important to understand that MatMult and the Vec operations 
>> are all MEMORY BANDWIDTH LIMITED operations and running on even a single 
>> core uses much of the physical memory bandwidth (depending on the system and 
>> how its memory is organized).  For example on a traditional desktop you will 
>> see almost no improvement in going from one thread to two threads in these 
>> operations. On a very good Intel system that has memory bandwidth that 
>> (somewhat) scales with the number of cores you will see some speedup. Based 
>> on (just a little) of our experience you might see a speed up of 1.7 on 2 
>> cores and 2.5 on 4 cores (this is on decent server class system, not a usual 
>> desktop). I have no experience on the AMD side so cannot say what will 
>> happen there. You can get a good handle on how they will do by running the 
>> OpenMP version of the STREAM benchmark on the system: if that scales well, 
>> the PETSc threaded routines will scale well; if it scales poorly, sparse 
>> linear algebra will scale poorly.
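>> 
>>  A rough back-of-the-envelope count shows why (assuming the default AIJ 
>> layout of 8 byte double precision values and 4 byte column indices, and 
>> ignoring the vector traffic, which is smaller): each nonzero visited in 
>> MatMult moves about 12 bytes from memory and performs 2 flops (a multiply 
>> and an add), i.e. roughly 0.17 flops per byte, so the cores are waiting on 
>> memory long before they run out of arithmetic.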
>> 
>>  Feel free to continue this conversation, it is important,
>> 
>>  Barry
>> 
>> 
>> 
>> 
>> 
>> On Jun 17, 2011, at 6:07 AM, Lois Curfman McInnes wrote:
>> 
>>> 
>>> Thanks for the update.  I'm cc-ing this response to Barry, as he has been 
>>> thinking about issues with MPI/OpenMP and may have suggestions.
>>> 
>>> My understanding from prior conversations with Stephane et al. is that GTS 
>>> typically solves this equation using Conjugate Gradient with either a 
>>> simple Jacobi preconditioner or otherwise hypre.  The matrix-vector product 
>>> is a key kernel in CG (or any Krylov method), so improvements in this are 
>>> indeed important for scalable linear solvers in GTS and in general.   I 
>>> believe a matrix-vector product is also used for a poloidal smoother in 
>>> GTS.  MatMult_SeqAIJ() is the local part of a parallel matrix-vector 
>>> product for the default sparse matrix storage format (problems with 
>>> multiple unknowns per mesh point typically use a block variant).  You can 
>>> see the parallel layout of a sparse matrix on slide 51 of this tutorial:
>>>         
>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/TACCTutorial2009.pdf
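>>> 
>>> Roughly, the parallel product is organized like this (paraphrased 
>>> pseudocode, with descriptive names rather than the exact PETSc calls): each 
>>> process owns a block of rows, stored as a "diagonal" block (locally owned 
>>> columns) and an "off-diagonal" block (columns owned by other processes), so 
>>> the communication of remote x entries can be overlapped with the local 
>>> MatMult_SeqAIJ():
>>> 
>>>     begin_scatter(x_offproc);              /* start fetching remote x entries */
>>>     MatMult_SeqAIJ(A_diag, x, y);          /* local rows times local columns  */
>>>     end_scatter(x_offproc);                /* wait for the remote entries     */
>>>     MatMultAdd_SeqAIJ(A_offdiag, x_offproc, y, y);  /* add off-process part   */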
>>> 
>>> Barry:  Any comments?
>>> 
>>> Lois
>>> 
>>> On Jun 16, 2011, at 5:27 PM, Alice Koniges wrote:
>>> 
>>>> Erich Strohmaier has (or had?) an LDRD project to look at the seven dwarfs 
>>>> (modified to "parallel motifs") so maybe he has more information on other 
>>>> implementations?
>>>> 
>>>> I also cc Lois because she is a good PETSc contact.
>>>> 
>>>> On Jun 16, 2011, at 2:21 PM, Robert Preissl wrote:
>>>> 
>>>>> Hello John & Stephane,
>>>>> 
>>>>> I did a bit more debugging / profiling on the GTS poisson step (Kamesh 
>>>>> still needs time this week on the charger code) to find out:
>>>>> 1. what is the most time consuming kernel in PETSc? (is there even such a 
>>>>> kernel??)
>>>>> 2. can we add OpenMP to it?
>>>>> 
>>>>> The most time consuming part of the poisson solver is a sparse matrix 
>>>>> vector multiplication (sMxV), which accesses the main memory using index 
>>>>> lists.
>>>>> (as verified at: 
>>>>> http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/beschreibung_der_hpc_systeme/ultrasparc_t2/~rdt/ultrasparc_t2_smxv/?lang=de)
>>>>> 
>>>>> It seems that sMxV is one of the so-called "seven dwarfs", so there is 
>>>>> lots of literature about this, and I am starting to read to find out 
>>>>> whether there are elegant OpenMP versions available (a first attempt is 
>>>>> sketched after the kernel below).  Maybe John knows more about this?
>>>>> 
>>>>> a bit more details:
>>>>> 
>>>>> The kernel of interest is called "MatMult_SeqAIJ", where most of the 
>>>>> solve time is spent; this is also confirmed by performance analysis using 
>>>>> the OpenSpeedShop toolkit as verified under:
>>>>> http://www.openspeedshop.org/wp/wp-content/uploads/2011/01/OSS_TARGET_USAGE2.html
>>>>> 
>>>>> Exclusive CPU      Inclusive CPU      % of Total        Function
>>>>> time (seconds)     time (seconds)     Exclusive Time    (defining location)
>>>>>   2399.257095        2399.257095        38.384429       MatMult_SeqAIJ (GTS-usertime)
>>>>>    565.571417         565.571417         9.048274       VecMAXPY_Seq (GTS-usertime)
>>>>>    481.171419         481.171419         7.698004       VecMDot_Seq (GTS-usertime)
>>>>>    362.599993         362.599993         5.801043       sched_yield (GTS-usertime)
>>>>>    191.371425         191.371425         3.061649       MatSetValues_MPIAIJ (GTS-usertime: mpiaij.c,305)
>>>>>    145.371426         145.371426         2.325720       pow (GTS-usertime)
>>>>> 
>>>>> 
>>>>> Anyway, the kernel looks like this: (copied from 
>>>>> ./petsc300_CRAY-XE6_real64/src/mat/impls/aij/seq/aij.c  --  Line 963)
>>>>> 
>>>>>   for (i=0; i<m; i++) {          /* loop over the m local rows            */
>>>>>     jrow = ii[i];                /* ii[] = CSR row pointers               */
>>>>>     n    = ii[i+1] - jrow;       /* number of nonzeros in row i           */
>>>>>     sum  = 0.0;
>>>>>     nonzerorow += (n>0);         /* count rows that have any nonzeros     */
>>>>>     for (j=0; j<n; j++) {        /* aa[] = values, aj[] = column indices  */
>>>>>       sum += aa[jrow]*x[aj[jrow]]; jrow++;   /* indirect access into x    */
>>>>>     }
>>>>>     y[i] = sum;                  /* y = A*x                               */
>>>>>   }
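>>>>> 
>>>>> A naive OpenMP version of this loop (my own sketch, not something taken 
>>>>> from PETSc or from the literature yet) would just parallelize the outer 
>>>>> row loop; the per-row temporaries must be private and nonzerorow needs a 
>>>>> reduction:
>>>>> 
>>>>>   #pragma omp parallel for private(j,jrow,n,sum) reduction(+:nonzerorow) schedule(static)
>>>>>   for (i=0; i<m; i++) {
>>>>>     jrow = ii[i];
>>>>>     n    = ii[i+1] - jrow;
>>>>>     sum  = 0.0;
>>>>>     nonzerorow += (n>0);
>>>>>     for (j=0; j<n; j++) {
>>>>>       sum += aa[jrow]*x[aj[jrow]]; jrow++;
>>>>>     }
>>>>>     y[i] = sum;
>>>>>   }
>>>>> 
>>>>> Whether that actually scales will of course come down to memory bandwidth 
>>>>> rather than to the threading itself.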
>>>>> 
>>>>> 
>>>>> I am continuing my literature study, but just wanted to give you this 
>>>>> update. Maybe we can talk a bit about this tomorrow or next week?
>>>>> 
>>>>> Robert
>>>>> 
>>>> 
>>> 
>> 
> 

