[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
Aron, 1. It's all NUMA. 2. You don't get to repartition the matrix because that is unnatural and not a local optimization. 3. Because of 2, the algorithms are different, so direct comparison is not meaningful, but I do not buy that you can get the same throughput on the kernel that is natural and

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
On Fri, Nov 12, 2010 at 15:31, Rodrigo R. Paz wrote: > Of course, we have linear scaling in computing RHS and Matrix contrib. > Also, the problem size was fixed (2.5M) while varying the number of nodes and cores. Thank you. In a chat with Lisandro, I heard that this is a 2D problem. So the interesting issue i

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
On Fri, Nov 12, 2010 at 15:04, Jed Brown wrote: > The comparison we are more interested in is with an equivalent number of MPI > processes per node. Whoops, missed your second attachment. Indeed, that is what I expect, especially at relatively low node counts. I think it has been fairly well dem

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
On Fri, Nov 12, 2010 at 14:51, Rodrigo R. Paz wrote: > IMHO, the main problem here is the low memory bandwidth (FSB) in Xeon E5420 > nodes. This is well-known and fundamental (not dependent on the programming model). It appears that you are comparing OpenMP threading within a node to a single t

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Aron Ahmadia
> A partial counter-point is that MatSolve with OpenMP is unlikely to be near > the throughput of MPI-based MatSolve because the high-concurrency paths are > not going to provide very good memory locality, and may cause horrible > degradation due to cache-coherency issues. I am just going to th

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Matthew Knepley
On Fri, Nov 12, 2010 at 11:54 AM, Barry Smith wrote: > > On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote: > > > On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote: > > > > What should we use for a programming model for PETSc on multi-core > systems? Currently for conventional multicore we h

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Rodrigo R. Paz
I am resending the first figure I sent because the y-label on the right plot was wrong (the right plot is the "transpose" of the left one). Of course, we have linear scaling in computing RHS and Matrix contrib. Also, the problem size was fixed (2.5M) while varying the number of nodes and cores. Rodrigo On Fri, Nov 12

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Lisandro Dalcin
On 12 November 2010 11:09, Jed Brown wrote: > On Fri, Nov 12, 2010 at 15:04, Jed Brown wrote: >> >> The comparison we are more interested in is with an equivalent number of MPI >> processes per node. > > Whoops, missed your second attachment. Indeed, that is what I expect, > especially at relativel

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Matthew Knepley
On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote: > > What should we use for a programming model for PETSc on multi-core > systems? Currently for conventional multicore we only have one MPI > process per core and for GPU we have subclasses of Vec and Mat with custom > CUDA code. > > Sh

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Rodrigo R. Paz
Hi, Aron, IMHO, the main problem here is the low memory bandwidth (FSB) in Xeon E5420 nodes. In sparse matrix-vector product, the ratio between floating point operations and memory accesses is low. Thus, the overall performance in this stage is mainly controlled by memory bandwidth. We have also te
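To make the flop-to-byte ratio concrete, here is a minimal sketch of a plain CSR matrix-vector product (the textbook kernel, not PETSc's MatMult_SeqAIJ; the ai/aj/aa names simply follow the usual CSR convention): each nonzero contributes one multiply-add (2 flops) but streams at least 12 bytes of matrix data (an 8-byte value plus a 4-byte column index), which is why this stage is limited by memory bandwidth rather than by the cores' floating-point peak.

/* Hedged sketch: plain CSR y = A*x, the standard textbook kernel.
 * Array names ai/aj/aa follow the common CSR convention, not any
 * particular PETSc internals. Each nonzero does 2 flops but moves
 * >= 12 bytes (value + column index), so bandwidth dominates. */
void csr_matvec(int m, const int *ai, const int *aj, const double *aa,
                const double *x, double *y)
{
  for (int i = 0; i < m; i++) {
    double sum = 0.0;
    for (int k = ai[i]; k < ai[i+1]; k++) sum += aa[k] * x[aj[k]];
    y[i] = sum;
  }
}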

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Rodrigo R. Paz
Hi all, find attached a plot with some results (speedup) that we obtained some time ago with some hacks we introduced to PETSc in order to use it on hybrid archs with OpenMP. The tests were done on a set of 6 Xeon nodes with 8 cores each. Results are for the MatMult op in KSP in the context

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Aron Ahmadia
Rodrigo, Can you also clarify what your base case is (i.e. 1 process mapped to a single SMP node, or the same number of processes mapped virtually to each core?). In other words, did you try running MPI jobs with the same number of processes as cores, and how did this compare to your MPI+OpenMP r

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Aron Ahmadia
Hi Rodrigo, These are interesting results. It looks like you were bound by a speedup of about 2, which suggests you might have been seeing cache capacity/conflict problems. Did you do any further analysis on why you weren't able to get better performance? A On Fri, Nov 12, 2010 at 8:26 AM, Rod

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
On Fri, Nov 12, 2010 at 02:18, Barry Smith wrote: > How do you get adaptive load balancing (across the cores inside a process) > if you have the OpenMP compiler decide the partitioning/parallelism? This was > Bill's point about why not to use OpenMP. For example, if you give each core the > same amount o

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Jed Brown
On Fri, Nov 12, 2010 at 02:03, Barry Smith wrote: > > I mean it's easy to tell a thread to do something, but I was not aware > that pthreads had nice support > > for telling all threads to do something at the same time. On a multicore, > you want vector instructions, > > Why do you want vector
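For context, "vector instructions" here refers to the per-core SIMD units (SSE on these Xeons). Below is a minimal sketch, purely for illustration and not a PETSc kernel, of what an explicitly vectorized y += alpha*x looks like with SSE2 intrinsics, assuming x86, double precision, and an even n for brevity.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hedged sketch: y += alpha*x using 128-bit SSE2 (two doubles per op).
 * Assumes n is even; unaligned loads/stores are used so no alignment
 * is required. Illustrative only, not a PETSc kernel. */
void axpy_sse2(int n, double alpha, const double *x, double *y)
{
  __m128d va = _mm_set1_pd(alpha);
  for (int i = 0; i < n; i += 2) {
    __m128d vx = _mm_loadu_pd(&x[i]);
    __m128d vy = _mm_loadu_pd(&y[i]);
    vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
    _mm_storeu_pd(&y[i], vy);
  }
}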

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Aron Ahmadia
Jed, I think you're right. There are several approaches within OpenMP for doing what Barry is asking. Fundamentally, when distributing a for loop, you can use a dynamic scheduler. See the section on scheduling clauses in the Wikipedia article for a short overview: http://en.wikipedia.org/wiki/Op
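As a concrete sketch of the scheduling clause mentioned above (illustrative only; row_work and the chunk size of 64 are placeholders, not PETSc code), a dynamically scheduled loop lets a core that finishes its chunk early grab more work instead of idling:

#include <omp.h>

/* Hedged sketch of an OpenMP dynamic scheduling clause: rows are
 * handed out in chunks of 64 at run time, so a fast core simply
 * grabs more work. row_work() is a placeholder per-row kernel. */
void process_rows(int m, void (*row_work)(int))
{
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < m; i++) {
    row_work(i);
  }
}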

[petsc-dev] PETSc programming model for multi-core systems

2010-11-12 Thread Aron Ahmadia
I would support either MPI+OpenMP or MPI+MPI. I've seen reasonable performance achieved for things like SpMV on both, but OpenMP gives you a lot of flexibility for reduction operations. Matt, in pthreads, all you have to do is fork, synchronize your threads, then run whatever piece of code you wa
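A minimal sketch of the fork/synchronize pattern described here, using raw pthreads (the kernel_arg struct and run_kernel name are invented for illustration, the sketch assumes at most 64 threads, and none of this is PETSc API):

#include <pthread.h>

/* Hedged sketch: create one worker per core, each runs the same
 * kernel on its own slice, and pthread_join acts as the barrier.
 * Assumes nthreads <= 64. */
typedef struct { int tid, nthreads; void (*kernel)(int, int); } kernel_arg;

static void *worker(void *p)
{
  kernel_arg *a = (kernel_arg *)p;
  a->kernel(a->tid, a->nthreads);  /* each thread works on its slice */
  return NULL;
}

void run_kernel(int nthreads, void (*kernel)(int, int))
{
  pthread_t  th[64];
  kernel_arg arg[64];
  for (int t = 0; t < nthreads; t++) {
    arg[t] = (kernel_arg){t, nthreads, kernel};
    pthread_create(&th[t], NULL, worker, &arg[t]);
  }
  for (int t = 0; t < nthreads; t++) pthread_join(th[t], NULL);
}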

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Richard Tran Mills
I've got to agree with Mark's last statement on OpenMP: it's not particularly good, but it appears to be used a lot. It seems like a lot of the major codes running on Jaguar at NCCS are moving towards OpenMP. I've always disliked OpenMP because of the various issues already mentioned in this t

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Keita Teranishi
[mailto:petsc-dev-boun...@mcs.anl.gov] On Behalf Of Barry Smith Sent: Thursday, November 11, 2010 4:53 PM To: For users of the development version of PETSc Subject: [petsc-dev] PETSc programming model for multi-core systems What should we use for a programming model for PETSc on multi-core systems

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Mark F. Adams
This is a great technical discussion of the very vexing question of future programming models. In addition to these issues there are facts-on-the-ground. My limited view of this elephant, if you will, is that OpenMP seems to be getting a certain critical mass, for better or worse. We may not

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 8:24 PM, Mark F. Adams wrote: > This is a great technical discussion of the very vexing question of future > programming models. > > In addition to these issues there are facts-on-the-ground. My limited view > of this elephant, if you will, is that OpenMP seems to be getting

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 7:22 PM, Jed Brown wrote: > On Fri, Nov 12, 2010 at 02:18, Barry Smith wrote: > How do you get adaptive load balancing (across the cores inside a process) if > you have the OpenMP compiler decide the partitioning/parallelism? This was Bill's > point about why not to use OpenMP. Fo

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 7:15 PM, Jed Brown wrote: > On Fri, Nov 12, 2010 at 02:03, Barry Smith wrote: > > I mean it's easy to tell a thread to do something, but I was not aware that > > pthreads had nice support > > for telling all threads to do something at the same time. On a multicore, > > you w

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 7:09 PM, Aron Ahmadia wrote: > I would support either MPI+OpenMP or MPI+MPI. I've seen reasonable > performance achieved for things like SpMV on both, but OpenMP gives > you a lot of flexibility for reduction operations. How do you get adaptive load balancing (across the

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 6:58 PM, Matthew Knepley wrote: > On Fri, Nov 12, 2010 at 11:54 AM, Barry Smith wrote: > > On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote: > > > On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote: > > > > What should we use for a programming model for PETSc on multi-

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote: > On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote: > > What should we use for a programming model for PETSc on multi-core systems? > Currently for conventional multicore we only have one MPI process per > core and for GPU we have s

[petsc-dev] PETSc programming model for multi-core systems

2010-11-11 Thread Barry Smith
What should we use for a programming model for PETSc on multi-core systems? Currently for conventional multicore we only have one MPI process per core and for GPU we have subclasses of Vec and Mat with custom CUDA code. Should we introduce subclasses of Vec and Mat built on pthreads
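For the sake of discussion, here is a minimal sketch of the per-thread work such a hypothetical pthreads-based Vec subclass might perform for y += alpha*x, with each thread owning a contiguous slice of the local vector; the AXPYRange and vec_axpy_range names are invented for illustration and do not correspond to any existing PETSc class:

/* Hedged sketch of the per-thread work a hypothetical pthreads-based
 * Vec subclass might do for y += alpha*x: each thread owns a
 * contiguous range [start, end). Names are invented for illustration
 * and are not PETSc API. */
typedef struct {
  int           start, end;   /* this thread's contiguous slice */
  double        alpha;
  const double *x;
  double       *y;
} AXPYRange;

static void vec_axpy_range(const AXPYRange *r)
{
  for (int i = r->start; i < r->end; i++) r->y[i] += r->alpha * r->x[i];
}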