Hi Filippo,

Did you compile PETSc with the same level of optimization as your template code? In particular, did you turn debugging off for the timings?

Either way, let me share some of the mental stages I've gone through with ViennaCL. It started out from the same premises you describe: There is nice C++, compilers are 'now' mature enough, and everything can be made a lot 'easier' and 'simpler' for the user. This was in 2010, back when all this GPU computing took off.

Fast-forward to today: I consider the extensive use of C++ with expression templates and other fanciness to be the biggest mistake of ViennaCL. Here is an incomplete list of reasons why:

a) The 'generic' interface mimics that of Boost.uBLAS, so you can mix and match the two libraries. Problems: First, one replicates the really poor design choices of uBLAS. Second, most users didn't even *think* about using these generic routines for anything other than ViennaCL types. There are simply many more computational scientists familiar with Fortran or C than with "modern C++".

b) Keeping the type system consistent is hard. You would probably introduce a matrix, a vector, and operator overloads. For example, take y = A * x for a matrix A and vectors y, x. Sooner or later, users will write y = R * A * P * x, and all of a sudden left-to-right associativity results in an evaluation as
  y = ((R * A) * P) * x.
(How to 'solve' this particular case: introduce even more expression templates! See Blaze for an example. Soon you have hundreds of lines of code just to fix problems you would never have had with plain function calls.)
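
To make this concrete, here is a minimal sketch with hypothetical dense Matrix/Vector types. Each '*' overload returns a full temporary, so the compiler has no choice but to evaluate two expensive matrix-matrix products before the one matrix-vector product the user actually wanted:

  #include <cstddef>
  #include <vector>

  struct Vector { std::vector<double> v; };

  struct Matrix {              // square and dense, for brevity
      std::size_t n;
      std::vector<double> a;   // row-major storage
  };

  // Matrix-matrix product: allocates an n-by-n temporary on every call.
  Matrix operator*(const Matrix &L, const Matrix &R) {
      Matrix C{L.n, std::vector<double>(L.n * L.n, 0.0)};
      for (std::size_t i = 0; i < L.n; ++i)
          for (std::size_t k = 0; k < L.n; ++k)
              for (std::size_t j = 0; j < L.n; ++j)
                  C.a[i * L.n + j] += L.a[i * L.n + k] * R.a[k * L.n + j];
      return C;
  }

  // Matrix-vector product: the only cheap operation in the chain.
  Vector operator*(const Matrix &M, const Vector &x) {
      Vector y{std::vector<double>(M.n, 0.0)};
      for (std::size_t i = 0; i < M.n; ++i)
          for (std::size_t j = 0; j < M.n; ++j)
              y.v[i] += M.a[i * M.n + j] * x.v[j];
      return y;
  }

  // y = R * A * P * x;  // evaluates ((R * A) * P) * x: two O(n^3)
  //                     // products instead of three O(n^2) matvecs.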

c) Dealing with errors: Continuing with the above example, what if R * A runs out of memory? Sure, you can throw exceptions, but the resulting stack trace will contain all the expression-template frenzy, and you will have a hard time finding out whether the operation R * A or the subsequent multiplication with P caused the error.
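
As a sketch (reusing the hypothetical types above, plus hypothetical mat_mat_mult/mat_vec_mult helpers), compare how anonymous the failure is with overloaded operators versus plain calls:

  try {
      y = R * A * P * x;  // bad_alloc from R*A? From (R*A)*P? Hard to tell.
  } catch (const std::bad_alloc &) {
      // Stack trace is buried in operator/expression-template machinery.
  }

  // With plain function calls, the failing step identifies itself:
  mat_mat_mult(R, A, T1);   // out of memory? The error code says: here.
  mat_mat_mult(T1, P, T2);
  mat_vec_mult(T2, x, y);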

d) Expression templates are limited to a single statement. Every now and then, algorithms require repeated operations on different data. For example, you may have vector operations of the form
  x_i <- x_{i-1} + alpha * p_{i-1}
  r_i <- r_{i-1} - alpha * y_i
  p_i <- r_i + beta * p_{i-1}
(with alpha and beta being scalars, everything else being vectors)
in a pipelined conjugate gradient formulation. The "C" way of doing this is to grab the pointers to the data and compute all three vector operations in a single for-loop, thus saving on repeated data access to main memory. With expression templates, there is no way to make this as efficient, because the lifetime of an expression template is exactly one statement.
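
A minimal sketch of that fused loop (a hypothetical fused_update; k runs over vector entries, whereas the subscripts above count CG iterations):

  void fused_update(double *x, double *r, double *p, const double *y,
                    double alpha, double beta, long n)
  {
      for (long k = 0; k < n; ++k) {
          const double pk = p[k];    // load the old p entry once
          x[k] += alpha * pk;        // x <- x + alpha * p
          r[k] -= alpha * y[k];      // r <- r - alpha * y
          p[k]  = r[k] + beta * pk;  // p <- r + beta * p_old
      }
  }

Each entry of x, r, p, and y moves through the memory hierarchy once; three separate expression-template assignments would stream the vectors three times.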

e) Lack of a stable ABI. There is a good reason why many vendor libraries come with a C interface rather than a C++ interface. Linking C++ object files generated by different C++ compilers (different versions of the same compiler are enough, see Microsoft) is undefined behavior. In the best case, it fails with a linker error. In the worst case, it produces random crashes right before the end of the one-month simulation you needed for the paper due in a few days. If I remember correctly, one group got ambitious and wrote an MPI library in C++, and ran into exactly these kinds of problems with the C++ standard library.
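
The usual way out is to export a plain C interface and keep the C++ behind it; a sketch with hypothetical names:

  extern "C" {
      typedef struct Solver Solver;  /* opaque handle, stable across compilers */
      Solver *solver_create(int n);
      int     solver_solve(Solver *s, const double *b, double *x);
      void    solver_destroy(Solver *s);
  }

The C ABI on a given platform is fixed, so object files built by different compilers link and interoperate; all the templates, exceptions, and std:: containers stay behind this boundary (exceptions must be converted to error codes there, since they cannot safely cross it).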

f) The barrier to entry gets way too high. PETSc's current code base allows several dozen users to contribute to each release. If some problem shows up in a particular function, you just go there, hook up the debugger, and figure out what is going on. Now assume there are C++ expression templates involved. Your stack trace may spread over several screen pages. Once you navigate to the offending line, you find that a certain overload of a traits class coupled with two functors produces wrong results. The reverse lookup of how exactly the traits class was instantiated, which working set was passed in, and how those functors interact can easily take many minutes to digest, and that is if you are the person who wrote that piece of code. People new to PETSc are likely to just give up. Note that this is a problem for new users across the spectrum of C++ libraries; you can find evidence e.g. on the Eigen mailing list. In the end, C++ templates are leaky abstractions, and sooner or later the leaks hit the user.

Long story short: keep in mind that "modern C++" has a cost in exchange for possible performance benefits. For PETSc applications, the bottleneck all too often is not the performance of an expression that could be 'expression templated', but lies in areas where the use of C++ wouldn't make a difference: algorithm selection, network performance, or poor process placement, just to name a few.

Nonetheless, don't feel discouraged; you may have better ideas and know better ways of dealing with C++ than we do. A lightweight wrapper on top of PETSc, similar to what petsc4py does for Python, is something that some users may find handy. Whether it's worth the extra coding and maintenance effort? I don't know.
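
For the sake of illustration, a minimal sketch of what such a wrapper could look like: a hypothetical PetscVector class doing plain RAII around PETSc's C API, one PETSc call per method, no expression templates (error-code checking omitted for brevity):

  #include <petscvec.h>

  class PetscVector {
  public:
      explicit PetscVector(PetscInt n) {
          VecCreate(PETSC_COMM_WORLD, &v_);
          VecSetSizes(v_, PETSC_DECIDE, n);
          VecSetFromOptions(v_);
      }
      ~PetscVector() { VecDestroy(&v_); }         // lifetime handled by C++
      PetscVector(const PetscVector &) = delete;  // no accidental deep copies
      PetscVector &operator=(const PetscVector &) = delete;

      void axpy(PetscScalar alpha, const PetscVector &x) {
          VecAXPY(v_, alpha, x.v_);               // y <- alpha*x + y
      }
      Vec raw() const { return v_; }              // escape hatch to the C API
  private:
      Vec v_;
  };

Nothing clever happens at compile time here, so stack traces and debugging stay as boring as with plain PETSc.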

Best regards,
Karli


On 04/05/2017 07:42 AM, Filippo Leonardi wrote:
@jed: Your assembly is what I would've expected. Let me simplify my code
and see if I can provide a useful test example. (Also: I assume your
assembly is for Xeon, so I should definitely use AVX-512.)

Let me get back to you in a few days (work permitting) with something
you can use.

From your example I wouldn't expect any benefit from my code compared to
just calling PETSc (for those simple kernels).

A big plus I hadn't thought of would be that the compiler is really
forced to vectorise (as in my case, where I might have messed up some
config parameter).

@barry: I'm definitely too young to comment here (i.e. it's me that
changed, not the world). This is definitely not new stuff, and, for
instance, Armadillo/Boost/Eigen have been successfully production-ready
for many years now. I somehow have the impression that now that C++11 is
more mainstream, it is much easier to write readable/maintainable
code (still ugly as hell, though). I think we can now take for granted a
C++11 compiler on any "supercomputer", and even C++14 and soon C++17...
and this makes development and interfaces much nicer.

What I would like to see is something like PETSc (where I get nice,
hidden MPI calls, for instance) combined with the niceness of those
libraries (where many operations can be written in a, if I may say so,
more natural way). (My plan is: you did all the hard work; C++ can put a
ribbon on it, and we'll see what comes out.)

On 5 Apr 2017 5:39 am, "Jed Brown" <[email protected]> wrote:

    Matthew Knepley <[email protected] <mailto:[email protected]>> writes:

    > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <[email protected]
    <mailto:[email protected]>> wrote:
    >
    >> Matthew Knepley <[email protected] <mailto:[email protected]>>
    writes:
    >>
    >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi
    <[email protected] <mailto:[email protected]>
    >> >
    >> > wrote:
    >> >
    >> >> I had weird issues where gcc (that I am using for my tests right
    >> >> now) wasn't vectorising properly (even enabling all flags, from
    >> >> tree-vectorize to mavx). According to my tests, I know the Intel
    >> >> compiler was a bit better at that.
    >> >>
    >> >
    >> > We are definitely at the mercy of the compiler for this. Maybe Jed
    >> > has an idea why it's not vectorizing.
    >>
    >> Is this so bad?
    >>
    >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov    rax,QWORD PTR [rbp-0xb0]
    >> 0000000000240815 <VecMAXPY_Seq+0x305> add    ebx,0x1
    >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR [rax+r9*1]
    >> 000000000024081e <VecMAXPY_Seq+0x30e> mov    rax,QWORD PTR [rbp-0xa8]
    >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd ymm0,ymm8,YMMWORD PTR [rax+r9*1]
    >> 000000000024082b <VecMAXPY_Seq+0x31b> mov    rax,QWORD PTR [rbp-0xb8]
    >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd ymm0,ymm6,YMMWORD PTR [rax+r9*1]
    >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd ymm0,ymm5,YMMWORD PTR [r10+r9*1]
    >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR [r11+r9*1]
    >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR [r11+r9*1],ymm0
    >> 000000000024084a <VecMAXPY_Seq+0x33a> add    r9,0x20
    >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp    DWORD PTR [rbp-0xa0],ebx
    >> 0000000000240854 <VecMAXPY_Seq+0x344> ja     000000000024080e <VecMAXPY_Seq+0x2fe>
    >>
    >
    > I agree that is what we should see. It cannot be what Filippo has if
    > he is getting ~4x with the template stuff.

    I'm using gcc.  Filippo, can you make an easy-to-run test that we can
    evaluate on Xeon and KNL?
