Hi Filippo,

Did you compile PETSc with the same level of optimization as your template code? In particular, did you turn debugging off for the timings?

Either way, let me share some of the mental stages I've gone through with ViennaCL. It started out from the same premises you describe: There is nice C++, compilers are 'now' mature enough, and everything can be made a lot 'easier' and 'simpler' for the user. This was in 2010, back when all this GPU computing took off.

Fast-forward to today: I consider the extensive use of C++ with expression templates and other fanciness to be the biggest mistake of ViennaCL. Here is an incomplete list of reasons why:

a) The 'generic' interface mimics that of Boost.uBLAS, so you can mix and match the two libraries. Problems: First, one replicates the really poor design choices of uBLAS. Second, most users didn't even *think* about using these generic routines for anything other than ViennaCL types. There are simply many more computational scientists familiar with Fortran or C than with "modern C++".

b) Keeping the type system consistent is hard. You would probably introduce a matrix, a vector, and operator overloads. For example, take y = A * x for a matrix A and vectors y, x. Sooner or later, users will write y = R * A * P * x, and all of a sudden left-to-right associativity results in an evaluation as
  y = ((R * A) * P) * x.
(How to 'solve' this particular case: introduce even more expression templates! See Blaze for an example. Soon you have hundreds of lines of code just to fix problems you would never have had with plain function calls.)
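
To make this concrete, here is a minimal sketch with hypothetical dense Matrix/Vector types. Each '*' overload returns a full temporary, so the compiler has no choice but to evaluate two expensive matrix-matrix products before the one matrix-vector product the user actually wanted:

  #include <cstddef>
  #include <vector>

  struct Vector { std::vector<double> v; };

  struct Matrix {              // square and dense, for brevity
      std::size_t n;
      std::vector<double> a;   // row-major storage
  };

  // Matrix-matrix product: allocates an n-by-n temporary on every call.
  Matrix operator*(const Matrix &L, const Matrix &R) {
      Matrix C{L.n, std::vector<double>(L.n * L.n, 0.0)};
      for (std::size_t i = 0; i < L.n; ++i)
          for (std::size_t k = 0; k < L.n; ++k)
              for (std::size_t j = 0; j < L.n; ++j)
                  C.a[i * L.n + j] += L.a[i * L.n + k] * R.a[k * L.n + j];
      return C;
  }

  // Matrix-vector product: the only cheap operation in the chain.
  Vector operator*(const Matrix &M, const Vector &x) {
      Vector y{std::vector<double>(M.n, 0.0)};
      for (std::size_t i = 0; i < M.n; ++i)
          for (std::size_t j = 0; j < M.n; ++j)
              y.v[i] += M.a[i * M.n + j] * x.v[j];
      return y;
  }

  // y = R * A * P * x;  // evaluates ((R * A) * P) * x: two O(n^3)
  //                     // products instead of three O(n^2) matvecs.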

c) Dealing with errors: Continuing with the above example, what if R * A runs out of memory? Sure, you can throw exceptions, but the resulting stack trace will contain all the expression-template frenzy, and you will have a hard time finding out whether the operation R * A or the subsequent multiplication with P caused the error.
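
As a sketch (reusing the hypothetical types above, plus hypothetical mat_mat_mult/mat_vec_mult helpers), compare how anonymous the failure is with overloaded operators versus plain calls:

  try {
      y = R * A * P * x;  // bad_alloc from R*A? From (R*A)*P? Hard to tell.
  } catch (const std::bad_alloc &) {
      // Stack trace is buried in operator/expression-template machinery.
  }

  // With plain function calls, the failing step identifies itself:
  mat_mat_mult(R, A, T1);   // out of memory? The error code says: here.
  mat_mat_mult(T1, P, T2);
  mat_vec_mult(T2, x, y);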

d) Expression templates are limited to a single statement. Every now and then, algorithms require repeated operations on different data. For example, you may have vector operations of the form
  x_i <- x_{i-1} + alpha * p_{i-1}
  r_i <- r_{i-1} - alpha * y_i
  p_i <- r_i + beta * p_{i-1}
(with alpha and beta being scalars, everything else being vectors)
in a pipelined conjugate gradient formulation. The "C" way of doing this is to grab the pointers to the data and compute all three vector operations in a single for-loop, thus saving on repeated data access to main memory. With expression templates, there is no way to make this as efficient, because the lifetime of an expression template is exactly one statement.
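
A minimal sketch of that fused loop (a hypothetical fused_update; k runs over vector entries, whereas the subscripts above count CG iterations):

  void fused_update(double *x, double *r, double *p, const double *y,
                    double alpha, double beta, long n)
  {
      for (long k = 0; k < n; ++k) {
          const double pk = p[k];    // load the old p entry once
          x[k] += alpha * pk;        // x <- x + alpha * p
          r[k] -= alpha * y[k];      // r <- r - alpha * y
          p[k]  = r[k] + beta * pk;  // p <- r + beta * p_old
      }
  }

Each entry of x, r, p, and y moves through the memory hierarchy once; three separate expression-template assignments would stream the vectors three times.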

e) Lack of a stable ABI. There is a good reason why many vendor libraries come with a C interface rather than a C++ interface. Linking C++ object files generated by different C++ compilers (different versions of the same compiler are enough, see Microsoft) is undefined behavior. In the best case, it fails with a linker error. In the worst case, it produces random crashes right before the end of the one-month simulation you needed for the paper due in a few days. If I remember correctly, one group got ambitious and wrote an MPI library in C++, and ran into exactly these kinds of problems with the C++ standard library.
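
The usual way out is to export a plain C interface and keep the C++ behind it; a sketch with hypothetical names:

  extern "C" {
      typedef struct Solver Solver;  /* opaque handle, stable across compilers */
      Solver *solver_create(int n);
      int     solver_solve(Solver *s, const double *b, double *x);
      void    solver_destroy(Solver *s);
  }

The C ABI on a given platform is fixed, so object files built by different compilers link and interoperate; all the templates, exceptions, and std:: containers stay behind this boundary (exceptions must be converted to error codes there, since they cannot safely cross it).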

f) The barrier to entry gets way too high. PETSc's current code base allows several dozen users to contribute to each release. If some problem shows up in a particular function, you just go there, hook up the debugger, and figure out what is going on. Now assume there are C++ expression templates involved. Your stack trace may spread over several screen pages. Once you navigate to the offending line, you find that a certain overload of a traits class coupled with two functors produces wrong results. The reverse lookup of how exactly the traits class was instantiated, which working set was passed in, and how those functors interact can easily take many minutes to digest, and that is if you are the person who wrote that piece of code. People new to PETSc are likely to just give up. Note that this is a problem for new users across the spectrum of C++ libraries; you can find evidence e.g. on the Eigen mailing list. In the end, C++ templates are leaky abstractions, and sooner or later the leaks hit the user.

Long story short: keep in mind that "modern C++" has a cost in exchange for possible performance benefits. For PETSc applications, the bottleneck all too often is not the performance of an expression that could be 'expression templated', but lies in areas where the use of C++ wouldn't make a difference: algorithm selection, network performance, or poor process placement, just to name a few.

Nonetheless, don't feel discouraged; you may have better ideas and know better ways of dealing with C++ than we do. A lightweight wrapper on top of PETSc, similar to what petsc4py does for Python, is something that some users may find handy. Whether it's worth the extra coding and maintenance effort? I don't know.
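
For the sake of illustration, a minimal sketch of what such a wrapper could look like: a hypothetical PetscVector class doing plain RAII around PETSc's C API, one PETSc call per method, no expression templates (error-code checking omitted for brevity):

  #include <petscvec.h>

  class PetscVector {
  public:
      explicit PetscVector(PetscInt n) {
          VecCreate(PETSC_COMM_WORLD, &v_);
          VecSetSizes(v_, PETSC_DECIDE, n);
          VecSetFromOptions(v_);
      }
      ~PetscVector() { VecDestroy(&v_); }         // lifetime handled by C++
      PetscVector(const PetscVector &) = delete;  // no accidental deep copies
      PetscVector &operator=(const PetscVector &) = delete;

      void axpy(PetscScalar alpha, const PetscVector &x) {
          VecAXPY(v_, alpha, x.v_);               // y <- alpha*x + y
      }
      Vec raw() const { return v_; }              // escape hatch to the C API
  private:
      Vec v_;
  };

Nothing clever happens at compile time here, so stack traces and debugging stay as boring as with plain PETSc.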

Best regards,
Karli


On 04/05/2017 07:42 AM, Filippo Leonardi wrote:
@jed: Your assembly is what I would've expected. Let me simplify my code
and see if I can provide a useful test example. (Also: I assume your
assembly is for Xeon, so I should definitely use AVX-512.)

Let me get back to you in a few days (work permitting) with something
you can use.

From your example I wouldn't expect any benefit from my code compared to
just calling PETSc (for those simple kernels).

A big plus I hadn't thought of would be that the compiler is really
forced to vectorise (as in my case, where I might have messed up some
config parameter).

@barry: I'm definitely too young to comment here (i.e. it's me that
changed, not the world). This is definitely not new stuff, and, for
instance, Armadillo/Boost/Eigen have been successfully production-ready
for many years now. I somehow have the impression that now that C++11 is
more mainstream, it is much easier to write readable/maintainable
code (still ugly as hell, though). I think we can now take for granted a
C++11 compiler on any "supercomputer", and even C++14 and soon C++17...
and this makes development and interfaces much nicer.

What I would like to see is something like PETSc (where I get nice,
hidden MPI calls, for instance) combined with the niceness of those
libraries (where many operations can be written in a, if I may say so,
more natural way). (My plan is: you did all the hard work; C++ can put a
ribbon on it, and we'll see what comes out.)

On 5 Apr 2017 5:39 am, "Jed Brown" <[email protected]> wrote:

    Matthew Knepley <[email protected] <mailto:[email protected]>> writes:

    > On Tue, Apr 4, 2017 at 10:02 PM, Jed Brown <[email protected]
    <mailto:[email protected]>> wrote:
    >
    >> Matthew Knepley <[email protected] <mailto:[email protected]>>
    writes:
    >>
    >> > On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi
    <[email protected] <mailto:[email protected]>
    >> >
    >> > wrote:
    >> >
    >> >> I had weird issues where gcc (that I am using for my tests right
    >> >> now) wasn't vectorising properly (even enabling all flags, from
    >> >> tree-vectorize to mavx). According to my tests, I know the Intel
    >> >> compiler was a bit better at that.
    >> >>
    >> >
    >> > We are definitely at the mercy of the compiler for this. Maybe Jed
    >> > has an idea why it's not vectorizing.
    >>
    >> Is this so bad?
    >>
    >> 000000000024080e <VecMAXPY_Seq+0x2fe> mov    rax,QWORD PTR [rbp-0xb0]
    >> 0000000000240815 <VecMAXPY_Seq+0x305> add    ebx,0x1
    >> 0000000000240818 <VecMAXPY_Seq+0x308> vmulpd ymm0,ymm7,YMMWORD PTR [rax+r9*1]
    >> 000000000024081e <VecMAXPY_Seq+0x30e> mov    rax,QWORD PTR [rbp-0xa8]
    >> 0000000000240825 <VecMAXPY_Seq+0x315> vfmadd231pd ymm0,ymm8,YMMWORD PTR [rax+r9*1]
    >> 000000000024082b <VecMAXPY_Seq+0x31b> mov    rax,QWORD PTR [rbp-0xb8]
    >> 0000000000240832 <VecMAXPY_Seq+0x322> vfmadd231pd ymm0,ymm6,YMMWORD PTR [rax+r9*1]
    >> 0000000000240838 <VecMAXPY_Seq+0x328> vfmadd231pd ymm0,ymm5,YMMWORD PTR [r10+r9*1]
    >> 000000000024083e <VecMAXPY_Seq+0x32e> vaddpd ymm0,ymm0,YMMWORD PTR [r11+r9*1]
    >> 0000000000240844 <VecMAXPY_Seq+0x334> vmovapd YMMWORD PTR [r11+r9*1],ymm0
    >> 000000000024084a <VecMAXPY_Seq+0x33a> add    r9,0x20
    >> 000000000024084e <VecMAXPY_Seq+0x33e> cmp    DWORD PTR [rbp-0xa0],ebx
    >> 0000000000240854 <VecMAXPY_Seq+0x344> ja     000000000024080e <VecMAXPY_Seq+0x2fe>
    >>
    >
    > I agree that is what we should see. It cannot be what Filippo has if
    > he is getting ~4x with the template stuff.

    I'm using gcc.  Filippo, can you make an easy-to-run test that we can
    evaluate on Xeon and KNL?
