On Friday, 7 October 2016 at 17:02:02 UTC, Andrei Alexandrescu wrote:
On 10/07/2016 03:42 AM, Ilya Yaroshenko wrote:
For example, SUM_i of sqrt(fabs(a[i])) can be vectorised using mir.ndslice.algorithm: the vxorps instruction can be used for fabs, the vsqrtps instruction for sqrt, and LDC's @fastmath allows the summands to be re-associated.

Depending on the data cache level, this allows iteration to be sped up 8 times for single-precision floating point with AVX (16 times for AVX-512?).
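
A minimal sketch of the kind of kernel meant here, written as a plain loop rather than through the actual mir.ndslice.algorithm API (the function name is illustrative), assuming LDC:

```d
import ldc.attributes : fastmath;
import ldc.intrinsics : llvm_fabs, llvm_sqrt;

// @fastmath lets LLVM re-associate the sum, so several partial sums
// can live in vector registers; fabs lowers to a single sign-bit
// mask instruction and sqrt to vsqrtps.
@fastmath float sumSqrtFabs(scope const(float)[] a) nothrow @nogc
{
    float s = 0;
    foreach (x; a)
        s += llvm_sqrt(llvm_fabs(x));
    return s;
}
```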

Yah, 8 times is large enough to justify an important change.

The current std.math has the following problems:

1. Math functions are not templates -> Phobos must be linked.

This is also the case for C++ - most math functions are linked from the C standard library. How do typical linear algebra libraries similar in functionality to Mir (such as Eigen) deal with this situation?


1) A BLAS-like API requires only sqrt and fabs. The solutions used in Eigen depend on the compiler. For example, the following code can be found:

```c++
template<> EIGEN_DEVICE_FUNC inline float4 pabs<float4>(const float4& a) {
  return make_float4(fabsf(a.x), fabsf(a.y), fabsf(a.z), fabsf(a.w));
}
template<> EIGEN_DEVICE_FUNC inline double2 pabs<double2>(const double2& a) {
  return make_double2(fabs(a.x), fabs(a.y));
}
```

2) Eigen, uBLAS, and others use Expression Templates [1], which compose a few multiplications, additions/subtractions, and perhaps some per-element operations on matrices and vectors. At the same time, I have never seen a lambda being passed. C/C++ high-performance libraries use macros/templates for type specification, but lambdas are not used.

This makes the upcoming ndslice.algorithm a unique solution: more flexible, fast, and universal compared with C++ Expression Templates; a sketch of the alias-parameter style follows below. It still requires some rework, and an LDC based on the DMD 2.072 front end for further optimization.
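
For contrast, a hedged sketch of the alias-parameter style (the helper name and signature are hypothetical, not the real ndslice.algorithm API): any lambda is passed as a compile-time alias and inlined into the loop body.

```d
// Hypothetical helper, not the actual mir.ndslice.algorithm API:
// `fun` is a compile-time alias, so the lambda below is inlined
// into the loop rather than called through a function pointer.
T mapSum(alias fun, T)(scope const(T)[] a)
{
    T s = 0;
    foreach (x; a)
        s += fun(x);
    return s;
}

unittest
{
    // A lambda is passed directly; no expression-template machinery.
    assert(mapSum!(x => x * x)([1.0, 2.0, 3.0]) == 14.0);
}
```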

Also, one question is how does the existence of unused functions impede the working of faster functions provided separately? Is it a sticking point that std.math is the exact module used?

Of course, a separate module or dub package can be provided instead. In addition, std.math should be split into a package and reworked. So, instead of modifying std.math, we can start a new math package.

Trying to get a good grip on the matter. Generally you'd have a very easy time convincing me that templates are a better way to go :o). But we need to have a good motivation. Do you have a brief example illustrating one proposed template and how it is better than the old ways?

Yes, the example can be found at [2].

First, templates are better for BetterC mode. The example contains a C program. The last paragraph of this post contains the second part of this example. The first part:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

float mir_alg_bar(float, float, float);

int main(int argc, char const *argv[])
{
        if(argc < 4)
        {
                puts("Usage: app number_a number_b number_с");
                return 1;
        }

        float a = atof(argv[1]);
        float b = atof(argv[2]);
        float c = atof(argv[3]);

        float d = mir_alg_bar(a, b, c);
        printf("%f\n", d);
        return 0;
}
```

This program should be linked with the BetterC library:

```sh
clang app.c alg/libmir-alg.a
```

`mir-alg` is a small betterC library that uses a generic `mir` dummy (not the real Mir, to keep the example simple). It can be linked like a common C library and has an extern(C) nothrow @nogc interface.

```d
module alg_bar;

pragma(LDC_no_moduleinfo);

import ldc.attributes : fastmath;
import mir.alg;

extern(C) nothrow @nogc @fastmath:

float mir_alg_bar(float a, float b, float c) { return alg1!bar(a, b, c); }
```

The `mir` dummy contains three implementations: `alg1`, `alg2`, and `alg3`.

```d
module mir.alg;

import ldc.intrinsics : llvm_fabs;
import ldc.attributes : fastmath;

pragma(LDC_no_moduleinfo);

@fastmath
{
        // Calls the LLVM intrinsic directly: no link-time dependency.
        auto alg1(alias f)(float a, float b, float c)
        {
                return f(a, llvm_fabs(b), c);
        }

        // Calls the non-template `fabs` below: requires linking libmir.
        auto alg2(alias f)(float a, float b, float c)
        {
                return f(a, fabs(b), c);
        }

        // Calls std.math.fabs: requires linking libphobos2.
        auto alg3(alias f)(float a, float b, float c)
        {
                import std.math;
                return f(a, std.math.fabs(b), c);
        }
}

@fastmath
auto bar()(float a, float b, float c)
{
        return a * b + c;
}

float fabs(float x) @safe pure nothrow @nogc { return llvm_fabs(x); }
```
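
For completeness, the two modules could be compiled and archived along these lines (file names and flags are illustrative):

```sh
ldc2 -c -O2 alg_bar.d alg.d
ar rcs alg/libmir-alg.a alg_bar.o alg.o
```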

The `fabs` function declaration is the same as in LDC's Phobos fork.

`alg1` can be linked with C code in any optimization mode. `alg2` and `alg3` use function declarations and require linking the `libmir` dummy or `libphobos2` respectively. Making `fabs` a template solves this problem, as shown below. LDC can inline `fabs` for `alg2` and `alg3`, but the `-O2` flag is required.
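
A minimal sketch of that fix, assuming LDC: as a template, the body is instantiated in the caller's object file, so no symbol from libmir or libphobos2 is needed at link time.

```d
// Templated fabs: instantiated at each call site, so code like `alg2`
// no longer needs libmir at link time.
T fabs(T)(T x) @safe pure nothrow @nogc
{
    import ldc.intrinsics : llvm_fabs;
    return llvm_fabs(x);
}
```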

1.a I have firmly decided to move forward without DRuntime. Phobos as a source library is partially OK, but there should be no linking dependencies. BetterC mode is what Mir requires to replace OpenBLAS and Eigen. A new cpuid, threads, and mutexes should be provided too. The new cpuid is already implemented (I just need to replace the module constructor with an explicit initialization function, sketched below).
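
A sketch of that change (the initializer name and the cached field are hypothetical):

```d
module cpuid_sketch;

pragma(LDC_no_moduleinfo); // no ModuleInfo, hence no module constructors

__gshared uint _l3Cache; // state a `shared static this()` used to fill in

// Hypothetical explicit initializer replacing the module constructor;
// a betterC or plain C client calls it once at startup.
extern(C) void mir_cpuid_init() nothrow @nogc
{
    _l3Cache = 0; // real code would query CPUID here
}
```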

Do you think you can integrate the new cpuid implementation with the existing interface (most likely greatly enhancing it) without breaking the existing clients?

The new cpuid has a low-level and a high-level API. The high-level API will be reworked into an intermediate-level API without the module constructor; this is required for BetterC mode. The current DRuntime cpuid API can be implemented on top of the new cpuid's low-level interface. However, the current DRuntime API cannot be used for Mir. The reasons are:
  1. It is not compatible with betterC mode.
  2. It performs additional weird computations for cache level sizes, which makes it maddening to predict what a returned value means. If an engineer asks for the Level 3 cache size, the Level 3 cache size should be returned instead of the current mess. See also Issue 16028 [3].
  3. It cannot represent complex CPU topologies, which is required for ARM (especially server ARM CPUs). CPU information is protected on ARM CPUs, but it can be predefined by the user or fetched from the OS.

Same question for threads.
Same question for mutexes.

The current DRuntime mutexes and threads can be implemented on top of nothrow @nogc successors, for example along the lines sketched below.
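
A hedged sketch of such a successor for mutexes on POSIX (a thin nothrow @nogc wrapper; the struct name is illustrative), on top of which the current class-based Mutex could be rebuilt:

```d
import core.sys.posix.pthread;

// Minimal nothrow @nogc mutex: no GC, no Object base class,
// usable from betterC code.
struct RawMutex
{
    private pthread_mutex_t handle;

    void initialize() nothrow @nogc { pthread_mutex_init(&handle, null); }
    void lock()       nothrow @nogc { pthread_mutex_lock(&handle); }
    void unlock()     nothrow @nogc { pthread_mutex_unlock(&handle); }
    void destroy()    nothrow @nogc { pthread_mutex_destroy(&handle); }
}
```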

My strong opinion is that a D library only for D is the wrong direction. A numeric D library should be a product for other languages too, as many C libraries are. One of my clients is considering investing in nothrow @nogc async I/O for production, so that may help move things in the betterC direction too.

Sure. A different way to frame this is to make D friendlier toward linking with other languages. The way I see it, if we get alternatives for cpuid, threads, and mutexes in Mir, that would benefit clients interested in linear algebra. If we get them in DRuntime, that would benefit clients interested in linear algebra and everything else. Clearly the impact would be much larger.

We need DRuntime not for the future but for existing users. A fat runtime (generic algorithms excepted) is a red flag for software developers who need to create something like Eigen or a high-performance web server. The number of such libraries is always small. At the same time, these libraries set the tone, and a lot of packages will be built on top of them after a while. Dub allows dependency versions to be overridden in dub.selections.json. This is what is required for continuous development.

Assume you manage a set of integrated infrastructure projects that use a set of third-party DUB projects, which in turn depend on DRuntime. Part of this infrastructure is open source, and consultancy for clients is the main income. Now you want to add support for a modern CPU, or a new system API for yet another Apple OS. A release has time constraints, so you cannot wait for a new compiler release. Clients also want new features and backward compatibility with older compilers at the same time. Plus, testing complex infrastructure with a compiler fork requires additional effort and time, and clients would not be happy to deploy your compiler fork into their infrastructure and on to their own clients. In addition, you would need to update the forked DRuntime API usage in the third-party projects. This is a stalemate and a red flag for business.

Cpuid, threads, mutexes, an event loop, async I/O, and numeric software as low-level DUB packages with good community support and short release cycles are what we really need. I am not against a high-level API. Furthermore, bindings to other languages are an option to provide a simple and familiar API for users. But a low-level API is required.

Users do not care about a `std`/`core` or other prefix; they want good support. Business requires reliability and flexibility; bugs are not a huge problem if the architecture allows finding and fixing them. The really huge problem is a high-level, object-oriented, GC-oriented, x86-oriented DRuntime, which is a dependency almost everywhere. I would like to see `std.glas` instead of `mir.glas`, but it should be provided as a common dub project.

1.b In the context of 1.a, linking multiple binaries compiled with different DRuntime/Phobos versions may cause significant problems. DRuntime is not as stable as the C standard library. One may say that I am doing something wrong if I need to link libraries compiled with different DRuntimes, but this is what will happen often with D in the real world if D starts to replace C libraries (1.a). So, betterC without DRuntime/Phobos linking dependencies is the direction to move forward. nothrow @nogc generic Phobos code seems to be OK.

Hmmm... well I seem to recall the C std lib in gcc has large interoperability issues with its own previous versions, even across minor releases. This has caused numerous headaches at Facebook because the breakages always come without warning and manifest themselves in obscure ways. On the Microsoft side things are even worse, because they virtually guarantee that a version of VS is not binary compatible with the previous ones (I'm not kidding; it's deliberate).

That sets a rather low baseline for us :o). Clearly we'd want to do better, and we probably can. But I think it would be an exaggeration to worry too much about such scenarios.

2. Math functions are not templates -> they are not inlined -> no vectorization, plus function calls in the loop body. One day this may be fixed, but see 1.a and 1.b.

How do the likes of Eigen do it? Do they provide their own templated implementation of <math.h>?

It seems the recent LDC fixes this problem. Many thanks to our LDC team! Eigen's code is very weird: it uses templates and macros at the same time, with specializations for different compilers and C libraries, including Intel MKL.

Have you investigated the much hailed link-time inlining?

This probably would not work for loop vectorization.

3. Math functions are not aliases for LDC -> LDC's @fastmath does not work for them. To enable @fastmath for these functions, they would have to be annotated with @fastmath themselves, which is not acceptable. If a function is an alias for an LLVM intrinsic, then the @fastmath flag can be applied to the function that calls it.

Not sure I understand this, but it seems to me making the math functions templates would solve it?

Yes. For `version(LDC)`, the templates can be replaced with aliases to the intrinsics. In the example above, with all optimizations turned on, only `alg1`, which calls `llvm_fabs` directly, gets fused operations. The reason is that fma composition happens at the end of the LLVM optimization pipeline. If one inlined function (`bar`) and the root function (`alg2`) have `@fastmath` but another inlined function (`fabs`) does not, the code for the root will have fma, but the inlined code for _both_ functions will not. To perform other optimizations like vectorization, LLVM needs to decompose fma and recompose it later.
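
A minimal sketch of that alias approach (the module name and the non-LDC fallback are illustrative):

```d
module math_aliases;

version(LDC)
{
    import ldc.intrinsics : llvm_fabs;

    // An alias, not a wrapper: the caller's @fastmath applies directly
    // to the intrinsic call, so fma fusion is not blocked by a
    // non-@fastmath body sitting in between.
    alias fabs = llvm_fabs;
}
else
{
    // Illustrative fallback for other compilers
    // (ignores -0.0 sign handling for brevity).
    T fabs(T)(T x) @safe pure nothrow @nogc { return x < 0 ? -x : x; }
}
```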

Best regards,
Ilya

[1] https://en.wikipedia.org/wiki/Expression_templates
[2] https://github.com/libmir/temporary_experiments/tree/master/alias_vs_fun
[3] https://issues.dlang.org/show_bug.cgi?id=16028
