Aron,
1. It's all NUMA
2. You don't get to repartition the matrix because that is unnatural and not
a local optimization.
3. Because of 2, the algorithms are different, so direct comparison is not
meaningful, but I do not buy that you can get the same throughput on the
kernel that is natural and
On Fri, Nov 12, 2010 at 15:31, Rodrigo R. Paz wrote:
> Of course, we have linear scaling in computing RHS and Matrix contrib.
> Also, the problem size when ranging nodes and cores was fixed (2.5M).
Thank you. In a chat with Lisandro, I hear that this is a 2D problem. So
the interesting issue i
On Fri, Nov 12, 2010 at 15:04, Jed Brown wrote:
> The comparison we are more interested in is with an equivalent number of MPI
> processes per node.
Whoops, missed your second attachment. Indeed, that is what I expect,
especially at relatively low node counts. I think it has been fairly well
dem
On Fri, Nov 12, 2010 at 14:51, Rodrigo R. Paz wrote:
> IMHO, the main problem here is the low memory bandwidth (FSB) in Xeon E5420
> nodes.
This is well-known and fundamental (not dependent on the programming model).
It appears that you are comparing OpenMP threading within a node to a
single t
> A partial counter-point is that MatSolve with OpenMP is unlikely to be near
> the throughput of MPI-based MatSolve because the high-concurrency paths are
> not going to provide very good memory locality, and may cause horrible
> degradation due to cache-coherency issues.
I am just going to th
On Fri, Nov 12, 2010 at 11:54 AM, Barry Smith wrote:
>
> On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote:
>
> > On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote:
> >
> > What should we use for a programming model for PETSc on multi-core
> systems? Currently for conventional multicore we h
I resend the first figure I sent because the y-label on the right plot was
wrong (the right plot is the "transpose" of the left one). Of course, we
have linear scaling in computing RHS and Matrix contrib. Also, the problem
size when ranging nodes and cores was fixed (2.5M).
Rodrigo
On Fri, Nov 12
On 12 November 2010 11:09, Jed Brown wrote:
> On Fri, Nov 12, 2010 at 15:04, Jed Brown wrote:
>>
>> The comparison we are more interested in is with an equivalent number of MPI
>> processes per node.
>
> Whoops, missed your second attachment. Indeed, that is what I expect,
> especially at relativel
On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote:
>
> What should we use for a programming model for PETSc on multi-core
> systems? Currently for conventional multicore we only have one MPI
> process per core and for GPU we have subclasses of Vec and Mat with custom
> CUDA code.
>
> Sh
Hi, Aron,
IMHO, the main problem here is the low memory bandwidth (FSB) in Xeon E5420
nodes.
In the sparse matrix-vector product, the ratio of floating-point operations
to memory accesses is low. Thus, the overall performance of this stage is
controlled mainly by memory bandwidth.
We have also te
Hi all,
Find attached a plot with some results (speedup) that we obtained some
time ago with some hacks we introduced into PETSc so that it can be used on
hybrid architectures using OpenMP.
The tests were done in a set of 6 Xeon nodes with 8 cores each. Results are
for the MatMult op in KSP in the context
Rodrigo,
Can you also clarify what your base case is (i.e. 1 process mapped to
a single SMP node, or the same number of processes mapped virtually to
each core?). In other words, did you try running MPI jobs with the
same number of processes as cores, and how did this compare to your
MPI+OpenMP r
Hi Rodrigo,
These are interesting results. It looks like you were bound by a
speedup of about 2, which suggests you might have been seeing cache
capacity/conflict problems. Did you do any further analysis on why
you weren't able to get better performance?
A
On Fri, Nov 12, 2010 at 8:26 AM, Rod
On Fri, Nov 12, 2010 at 02:18, Barry Smith wrote:
> How do you get adaptive load balancing (across the cores inside a process)
> if you have the OpenMP compiler decide the partitioning/parallelism? This was
> Bill's point in why not to use OpenMP. For example if you give each core the
> same amount o
On Fri, Nov 12, 2010 at 02:03, Barry Smith wrote:
> > I mean it's easy to tell a thread to do something, but I was not aware
> that pthreads had nice support
> > for telling all threads to do something at the same time. On a multicore,
> you want vector instructions,
>
> Why do you want vector
Jed, I think you're right. There are several approaches within OpenMP
for doing what Barry is asking. Fundamentally, when distributing a
for loop, you can use a dynamic scheduler:
See the section on scheduling clauses in the Wikipedia article for a
short overview: http://en.wikipedia.org/wiki/Op
I would support either MPI+OpenMP or MPI+MPI. I've seen reasonable
performance achieved for things like SpMV on both, but OpenMP gives
you a lot of flexibility for reduction operations.
Matt, in pthreads, all you have to do is fork, synchronize your
threads, then run whatever piece of code you wa
I've got to agree with Mark's last statement on OpenMP: it's not particularly
good, but it appears to be used a lot. It seems like a lot of the major codes
running on Jaguar at NCCS are moving towards OpenMP.
I've always disliked OpenMP because of the various issues already mentioned in
this t
[mailto:petsc-dev-boun...@mcs.anl.gov]
On Behalf Of Barry Smith
Sent: Thursday, November 11, 2010 4:53 PM
To: For users of the development version of PETSc
Subject: [petsc-dev] PETSc programming model for multi-core systems
What should we use for a programming model for PETSc on multi-core systems
This is a great technical discussion of the very vexing question of
future programming models.
In addition to these issues there are facts-on-the-ground. My limited
view of this elephant, if you will, is that OpenMP seems to be getting
a certain critical mass, for better or worse. We may not
On Nov 11, 2010, at 8:24 PM, Mark F. Adams wrote:
> This is a great technical discussion of the very vexing question of future
> programming models.
>
> In addition to these issues there are facts-on-the-ground. My limited view
> of this elephant, if you will, is that OpenMP seems to be getting
On Nov 11, 2010, at 7:22 PM, Jed Brown wrote:
> On Fri, Nov 12, 2010 at 02:18, Barry Smith wrote:
> How do you get adaptive load balancing (across the cores inside a process) if
> you have the OpenMP compiler decide the partitioning/parallelism? This was Bill's
> point in why not to use OpenMP. Fo
On Nov 11, 2010, at 7:15 PM, Jed Brown wrote:
> On Fri, Nov 12, 2010 at 02:03, Barry Smith wrote:
> > I mean it's easy to tell a thread to do something, but I was not aware that
> > pthreads had nice support
> > for telling all threads to do something at the same time. On a multicore,
> > you w
On Nov 11, 2010, at 7:09 PM, Aron Ahmadia wrote:
> I would support either MPI+OpenMP or MPI+MPI. I've seen reasonable
> performance achieved for things like SpMV on both, but OpenMP gives
> you a lot of flexibility for reduction operations.
How do you get adaptive load balancing (across the
On Nov 11, 2010, at 6:58 PM, Matthew Knepley wrote:
> On Fri, Nov 12, 2010 at 11:54 AM, Barry Smith wrote:
>
> On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote:
>
> > On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote:
> >
> > What should we use for a programming model for PETSc on multi-
On Nov 11, 2010, at 6:15 PM, Matthew Knepley wrote:
> On Fri, Nov 12, 2010 at 9:52 AM, Barry Smith wrote:
>
> What should we use for a programming model for PETSc on multi-core systems?
> Currently for conventional multicore we only have one MPI process per
> core and for GPU we have s
What should we use for a programming model for PETSc on multi-core systems?
Currently for conventional multicore we only have one MPI process per core
and for GPU we have subclasses of Vec and Mat with custom CUDA code.
Should we introduce subclasses of Vec and Mat built on pthreads