Begin forwarded message:
> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 18, 2011 4:40:07 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>, John Shalf <jshalf at lbl.gov>
> Subject: RE: Poisson step in GTS
>
> Barry,
>
> No, Cray does not provide any threaded BLAS 1. Generally speaking it is not worth threading a singly nested loop unless the trip count is very high, and that does not happen often enough to warrant a special BLAS. In fact, I am not even sure we OpenMP the BLAS 2; I don't think so.
>
> Nathan
>
> Nathan Wichmann          Cray Inc.
> wichmann at cray.com      380 Jackson St
> Applications Engineer    St Paul, MN 55101
> office: 1-800-284-2729 x605-9079
> cell: 651-428-1131
>
>
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov]
> Sent: Saturday, June 18, 2011 12:13 PM
> To: Nathan Wichmann
> Cc: Lois Curfman McInnes; Satish Balay; Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier; John Shalf
> Subject: Re: Poisson step in GTS
>
>
> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>
>> Hi Robert, Barry and all,
>>
>> Is it our assumption that the Poisson version of GTS will normally be run with 1 MPI rank per die and 6 (on AMD Magny-Cours) OpenMP threads?
>
>    Our new vector and matrix classes will allow the flexibility of any number of MPI processes and any number of threads under that, so 1 MPI rank and 6 threads is supportable.
>
>> In that case there should be sufficient bandwidth for decent scaling; I would expect something like Barry's Intel experience. Barry is certainly correct that as one uses more cores one will be more bandwidth limited.
>
>    I would be interested in seeing the OpenMP streams numbers for this system.
>
>> I also like John's comment: "we have little faith that the compiler will do anything intelligent." Which compiler are you using? If you are using CCE then you should get a .lst file to see what it is doing. Probably the only thing that can and should be done is to unroll the inner loop.
>
>    Do you folks provide thread-based BLAS 1 operations, for example ddot, dscale, daxpy? If so, we can piggy-back on those to get the best possible performance on the vector operations.
>
>> Another consideration is the typical size of "n". Normally the denser the matrix, the larger n is, no? But still, it would be interesting to know.
>
>    In this application the matrix is extremely sparse, likely between 7 and 27 nonzeros per row. Matrices, of course, can get as big as you like.
>
>    Barry
>
>> Ciao!
>> Nathan
>>
>> Nathan Wichmann          Cray Inc.
>> wichmann at cray.com      380 Jackson St
>> Applications Engineer    St Paul, MN 55101
>> office: 1-800-284-2729 x605-9079
>> cell: 651-428-1131
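For context on what such thread-based BLAS 1 operations amount to, here is a minimal OpenMP sketch of daxpy and ddot. The function names are illustrative only; this is neither Cray's nor PETSc's code. Nathan's point still applies: these loops are memory-bandwidth bound, so threading pays off only for large vector lengths.

    /* y <- alpha*x + y: iterations are independent, so a parallel-for suffices. */
    void daxpy_omp(long n, double alpha, const double *x, double *y)
    {
      long i;
      #pragma omp parallel for schedule(static)
      for (i = 0; i < n; i++) y[i] += alpha * x[i];
    }

    /* dot <- x.y: each thread accumulates a partial sum, combined by the reduction. */
    double ddot_omp(long n, const double *x, const double *y)
    {
      double sum = 0.0;
      long i;
      #pragma omp parallel for reduction(+:sum) schedule(static)
      for (i = 0; i < n; i++) sum += x[i] * y[i];
      return sum;
    }

Built with the compiler's OpenMP flag (e.g. -fopenmp for GCC), routines like these are what Barry's "piggy-back" suggestion would piggy-back on.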
>> -----Original Message-----
>> From: Barry Smith [mailto:bsmith at mcs.anl.gov]
>> Sent: Friday, June 17, 2011 1:42 PM
>> To: Lois Curfman McInnes; Satish Balay
>> Cc: Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier; John Shalf; Nathan Wichmann
>> Subject: Re: Poisson step in GTS
>>
>>
>>    With the point Jacobi solver, yes, MatMult_SeqAIJ() will be the most time consuming code in the simulation. With hypre's BoomerAMG it will be a far more complicated set of routines, because that is a much more sophisticated solver (for hypre you will need to work directly with Rob Falgout on threaded versions of his software).
>>
>>    Based on the request of the GTS folks (and others) we have begun the implementation of thread-based vector and matrix classes in PETSc that will allow all three of the routines
>>
>>>>>   2399.257095   2399.257095   38.384429   MatMult_SeqAIJ (GTS-usertime)
>>>>>    565.571417    565.571417    9.048274   VecMAXPY_Seq   (GTS-usertime)
>>>>>    481.171419    481.171419    7.698004   VecMDot_Seq    (GTS-usertime)
>>
>> to run using multiple threads and improve the performance a bit. Give us a few more weeks and this will be ready for basic use in petsc-dev.
>>
>>    ___BUT___ it is important to understand that MatMult and the Vec operations are all MEMORY BANDWIDTH LIMITED operations, and running on even a single core uses much of the physical memory bandwidth (depending on the system and how its memory is organized). For example, on a traditional desktop you will see almost no improvement in going from one thread to two threads in these operations. On a very good Intel system whose memory bandwidth (somewhat) scales with the number of cores you will see some speedup. Based on (just a little of) our experience you might see a speedup of 1.7x on 2 cores and 2.5x on 4 cores (this is on a decent server-class system, not a usual desktop). I have no experience on the AMD side so cannot say what will happen there. You can get a good handle on how they will do by running the OpenMP version of the streams benchmark on the system: if that scales well then the PETSc threaded routines will scale well; if that scales poorly then sparse linear algebra will scale poorly.
>>
>>    Feel free to continue this conversation, it is important,
>>
>>    Barry
>>
>>
>> On Jun 17, 2011, at 6:07 AM, Lois Curfman McInnes wrote:
>>
>>> Thanks for the update. I'm cc-ing this response to Barry, as he has been thinking about issues with MPI/OpenMP and may have suggestions.
>>>
>>> My understanding from prior conversations with Stephane et al. is that GTS typically solves this equation using Conjugate Gradient with either a simple Jacobi preconditioner or hypre. The matrix-vector product is a key kernel in CG (or any Krylov method), so improvements in it are indeed important for scalable linear solvers in GTS and in general. I believe a matrix-vector product is also used for a poloidal smoother in GTS. MatMult_SeqAIJ() is the local part of a parallel matrix-vector product for the default (AIJ) sparse matrix storage format (problems with multiple unknowns per mesh point typically use a block variant). You can see the parallel layout of a sparse matrix on slide 51 of this tutorial:
>>>
>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/TACCTutorial2009.pdf
>>>
>>> Barry: Any comments?
>>>
>>> Lois
>>>
>>> On Jun 16, 2011, at 5:27 PM, Alice Koniges wrote:
>>>
>>>> Erich Strohmaier has (or had?) an LDRD project to look at the seven dwarfs (modified to "parallel motifs"), so maybe he has more information on other implementations?
>>>>
>>>> I also cc Lois because she is a good PETSc contact.
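As a concrete picture of the AIJ (compressed sparse row) layout Lois describes, and of the ii/aj/aa arrays that appear in the aij.c kernel quoted below, here is a small illustrative example; the toy matrix and the standalone program are mine, not taken from PETSc.

    /* A 3x4 sparse matrix stored in CSR/AIJ form:
     *
     *     [ 10  0  2  0 ]
     * A = [  0  3  0  4 ]
     *     [  0  0  0  0 ]   <- an empty row (what nonzerorow counts past in the kernel)
     *
     * aa[] holds the nonzero values row by row, aj[] the column index of each
     * value, and ii[] the offset into aa/aj where each row starts; ii[m] is the
     * total number of nonzeros.
     */
    #include <stdio.h>

    int main(void)
    {
      const int    m    = 3;
      const int    ii[] = {0, 2, 4, 4};          /* row i occupies aa[ii[i]..ii[i+1]-1] */
      const int    aj[] = {0, 2, 1, 3};          /* column indices of the nonzeros      */
      const double aa[] = {10.0, 2.0, 3.0, 4.0};
      const double x[]  = {1.0, 1.0, 1.0, 1.0};
      double y[3];

      /* y = A*x, the same loop structure as MatMult_SeqAIJ */
      for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = ii[i]; k < ii[i+1]; k++) sum += aa[k] * x[aj[k]];
        y[i] = sum;
      }
      printf("y = %g %g %g\n", y[0], y[1], y[2]);  /* prints y = 12 7 0 */
      return 0;
    }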
>>>>
>>>> On Jun 16, 2011, at 2:21 PM, Robert Preissl wrote:
>>>>
>>>>> Hello John & Stephane,
>>>>>
>>>>> I did a bit more debugging / profiling on the GTS Poisson step (Kamesh still needs time this week on the charger code) to find out:
>>>>> 1. what is the most time consuming kernel in PETSc? (is there even such a kernel??)
>>>>> 2. can we add OpenMP to it?
>>>>>
>>>>> The most time consuming part of the Poisson solver is a sparse matrix-vector multiplication (sMxV), which accesses main memory through index lists.
>>>>> (as described at:
>>>>> http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/beschreibung_der_hpc_systeme/ultrasparc_t2/~rdt/ultrasparc_t2_smxv/?lang=de)
>>>>>
>>>>> It seems that sMxV is one of the so-called "seven dwarfs", so there is lots of literature about this, and I am starting to read and try to find out if there are elegant OpenMP versions available. Maybe John knows about this???
>>>>>
>>>>> A bit more detail:
>>>>>
>>>>> The kernel of interest is called "MatMult_SeqAIJ", where most of the solve time is spent; this is also confirmed by performance analysis using the OpenSpeedShop toolkit, as described under:
>>>>> http://www.openspeedshop.org/wp/wp-content/uploads/2011/01/OSS_TARGET_USAGE2.html
>>>>>
>>>>> Exclusive CPU time   Inclusive CPU time   % of Total Exclusive   Function (defining location)
>>>>> in seconds           in seconds           CPU Time
>>>>> 2399.257095          2399.257095          38.384429              MatMult_SeqAIJ (GTS-usertime)
>>>>>  565.571417           565.571417           9.048274              VecMAXPY_Seq (GTS-usertime)
>>>>>  481.171419           481.171419           7.698004              VecMDot_Seq (GTS-usertime)
>>>>>  362.599993           362.599993           5.801043              sched_yield (GTS-usertime)
>>>>>  191.371425           191.371425           3.061649              MatSetValues_MPIAIJ (GTS-usertime: mpiaij.c,305)
>>>>>  145.371426           145.371426           2.325720              pow (GTS-usertime)
>>>>>
>>>>> Anyway, the kernel looks like this (copied from ./petsc300_CRAY-XE6_real64/src/mat/impls/aij/seq/aij.c -- line 963):
>>>>>
>>>>>   for (i=0; i<m; i++) {
>>>>>     jrow = ii[i];
>>>>>     n    = ii[i+1] - jrow;
>>>>>     sum  = 0.0;
>>>>>     nonzerorow += (n>0);
>>>>>     for (j=0; j<n; j++) {
>>>>>       sum += aa[jrow]*x[aj[jrow]]; jrow++;
>>>>>     }
>>>>>     y[i] = sum;
>>>>>   }
>>>>>
>>>>> I am continuing my literature study, but just wanted to give you this update. Maybe we can talk a bit about this tomorrow or next week?
>>>>>
>>>>> Robert
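On Robert's question 2, whether OpenMP can be added to this kernel: each row's dot product in the CSR loop is independent, so in principle a single pragma over the outer row loop is enough. A minimal sketch follows; the function name and signature are illustrative and not PETSc's, and the original's nonzerorow counter is omitted here (it would need a reduction(+:nonzerorow) clause).

    /* y = A*x for an m-row CSR (AIJ) matrix: ii[] holds the m+1 row offsets,
     * aj[] the column indices, aa[] the nonzero values.  Rows are independent,
     * so the row loop is split across threads.  With only 7-27 nonzeros per row
     * a static schedule is a reasonable first choice; the achievable speedup is
     * bounded by memory bandwidth, as Barry notes above, not by arithmetic. */
    static void csr_matmult_omp(int m, const int *ii, const int *aj,
                                const double *aa, const double *x, double *y)
    {
      int i;
      #pragma omp parallel for schedule(static)
      for (i = 0; i < m; i++) {
        double sum  = 0.0;
        int    jrow = ii[i];
        int    n    = ii[i+1] - jrow;
        for (int j = 0; j < n; j++) {
          sum += aa[jrow] * x[aj[jrow]];
          jrow++;
        }
        y[i] = sum;
      }
    }

Whether a pragma like this pays off on a given node is exactly what the OpenMP streams scaling Barry describes would indicate; the thread-based vector and matrix classes being added to petsc-dev are the longer-term route.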