Re: Volk sqrt ARM performance

Marcus Müller Sun, 08 Oct 2023 10:31:53 -0700

Hi Jeff,

you'll want to compile with optimization, otherwise you'd be intentionally making the native `sqrt` slower than it would be in a real application; you need to add `-O2` or `-O3` to your compilation. Also, you're using floats, not doubles, so use `sqrtf` in your C code, not `sqrt`! (your code is C, not necessarily how you'd write the same program in C++).
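To make the float/double point concrete, here's a small sketch (the names `root_single` and `root_via_double` are made up for illustration): in C++, `std::sqrt` from `<cmath>` has a `float` overload, whereas going through `double` costs two conversions plus a double-precision square root per element.

```cpp
#include <cmath>
#include <type_traits>

// std::sqrt is overloaded in C++: a float argument gives a float result,
// a double argument gives a double result.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "float in, float out");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "double in, double out");

// Hypothetical helper: stays in single precision the whole way.
float root_single(float x)
{
    return std::sqrt(x);
}

// Hypothetical helper: what a float -> double -> float round trip looks like
// when the double-precision routine is used on float data.
float root_via_double(float x)
{
    return static_cast<float>(std::sqrt(static_cast<double>(x)));
}
```

Both helpers return the same value for exactly representable squares; the difference is only in how much work is done per element.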

Also, compared to the time for the math you're doing, both in the VOLK and in the libm sqrt case, your time measurement's uncertainty is large. (taking the square root of only 16k values – that's nearly nothing.) You need to run that in a loop of many iterations, preferably with some warm-up to get the branch predictors trained. (assuming the CPU *has* branch prediction – the ARM1176JZ-S doesn't, as far as I know).
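A rough sketch of what I mean by warm-up plus many iterations; this is not how VOLK measures things internally, and `seconds_per_call` and `sqrt_libm` are names I made up for the example:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <math.h>

// Hypothetical helper (not part of VOLK): average wall-clock seconds per
// call of `kernel`, measured over `iterations` calls after `warmup` calls.
template <typename Kernel>
double seconds_per_call(Kernel kernel, int warmup, int iterations)
{
    assert(iterations > 0);
    for (int i = 0; i < warmup; ++i) {
        kernel(); // untimed: lets caches and branch predictors settle
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        kernel();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / iterations;
}

// libm baseline using the single-precision sqrtf.
void sqrt_libm(float* out, const float* in, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = sqrtf(in[i]);
    }
}
```

You'd call it as `seconds_per_call([&] { volk_32f_sqrt_32f(out, in, N); }, 10, 1000)` and the same way with `sqrt_libm`, built with `-O2`. The iteration counts are arbitrary; pick them so that the timed region runs far longer than the resolution of your clock.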

Hey, luckily your VOLK already ships with such a loop-running benchmark mockup: `volk_profile -R sqrt` will do exactly that. The `generic` implementation literally just calls `sqrtf`. Could you share the output of `volk_profile -R sqrt` with us?

Furthermore, I'm **highly** confused by your results: ARM1176JZ-S is a 32 bit processor, developed somewhere in the early 2000s; so, it's –by modern standards– a painfully slow 32 bit armv6 CPU. It predates both aarch64 and NEON! So, I'm pretty sure cpu_features must be wrong, or this is not the CPU you're using. In this rare case, I think you must be wrong and not the software, because you're also using a /usr/local/lib64 library path, which would quite unambiguously point to a 64 bit OS, which couldn't run on an ARM11.
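One way to cross-check is to look up the implementer and part numbers from the quoted output. The sketch below assumes those two fields are the usual ARM MIDR values; the table is a tiny hand-picked excerpt of part numbers as they appear, for instance, in the Linux kernel's CPU type headers, and `core_name` is a made-up helper:

```cpp
#include <cstdint>
#include <string>

// One known (implementer, part) pair and a human-readable name for it.
struct KnownCore {
    std::uint32_t implementer; // 0x41 is Arm Ltd.
    std::uint32_t part;
    const char* name;
};

// Hypothetical lookup over a deliberately incomplete table.
std::string core_name(std::uint32_t implementer, std::uint32_t part)
{
    static const KnownCore table[] = {
        {0x41, 0xB76, "ARM1176 (ARMv6, 32 bit)"},
        {0x41, 0xD03, "Cortex-A53 (ARMv8-A)"},
        {0x41, 0xD08, "Cortex-A72 (ARMv8-A)"},
    };
    for (const KnownCore& core : table) {
        if (core.implementer == implementer && core.part == part) {
            return core.name;
        }
    }
    return "unknown";
}
```

With the values from the quoted output, `core_name(0x41, 0xD08)` gives a Cortex-A72 rather than an ARM1176, which would fit the aarch64 and asimd entries.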

Could you double-check and *confirm* you're using an ARM1176JZ-S processor? If you are, are you perhaps running this with qemu-aarch64 on your armv6 (32 bit!) machine? Can you send us the `volk_sqrt` you're getting, or at least share what `file volk_sqrt` says about that binary? We then would need to help you file a bug upstream against cpu_features, because it'd be impossible for us to build a working VOLK if cpu_features goes and miscategorizes an ancient 32 bit machine as aarch64.
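If `file` isn't at hand, the compiler can also tell you what it is targeting. This is only a sketch with made-up function names; it relies on the predefined macros `__aarch64__`, `__arm__` and `__ARM_NEON`, which GCC and Clang set on ARM targets:

```cpp
#include <string>

// Which architecture this translation unit was compiled for.
std::string build_target()
{
#if defined(__aarch64__)
    return "aarch64 (64 bit ARM)";
#elif defined(__arm__)
    return "arm (32 bit ARM)";
#else
    return "not an ARM target";
#endif
}

// Whether the compiler was allowed to emit NEON / Advanced SIMD code.
bool built_with_neon()
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return true;
#else
    return false;
#endif
}

// Pointer width of the build: 32 on armv6, 64 on aarch64.
int pointer_bits()
{
    return static_cast<int>(sizeof(void*) * 8);
}
```

Printing these three values from your test program would show right away whether your toolchain produces 32 bit armv6 or 64 bit aarch64 binaries. Note that this describes the build target, not the CPU the binary ends up running on.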


Best regards,
Marcus

On 08.10.23 00:22, Jeff R wrote:

I modified a simple Volk sqrt program for an ARM1176JZ-S processor to test performance, and the results are puzzling. The following program prints:


dur_VolkSqrt=(0.000000)0.001721 dur_CRTLSqrt=(0.000000)0.000318

The following processor information is displayed. It appears as though NEON is supported.


~/volk-3.0.0/build# cpu_features/list_cpu_features
arch : aarch64
implementer :  65 (0x41)
variant :   0 (0x00)
part : 3336 (0xD08)
revision :   3 (0x03)
flags : asimd,cpuid,crc32,fp

Why are the numbers so slow for Volk versus the CRTL? I may be missing something obvious. Thank you in advance.


Here’s the test program:

// g++ -I /usr/local/include/volk volk_sqrt.cpp -o volk_sqrt -L /usr/local/lib64/ -lvolk
// export LD_LIBRARY_PATH=/usr/local/lib64; ./volk_sqrt

#include <stdio.h>
#include <math.h>
#include <volk.h>
#include <limits.h>
#include <time.h>
#include <sys/time.h>

double get_wall_time()
{
    struct timeval time;

    if (gettimeofday(&time,NULL))
    {
        //  Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

int main(int argc, char* args[])
{
    double walStop;
    double walStart;
    double dur_VolkSqrt;
    double dur_CRTLSqrt;
    int N = 1024*16;

    unsigned int alignment = volk_get_alignment();
    float* in = (float*)volk_malloc(sizeof(float)*N, alignment);
    float* out = (float*)volk_malloc(sizeof(float)*N, alignment);

    for(unsigned int ii = 0; ii < N; ++ii)
    {
        in[ii] = (float)(ii*ii);
    }

    walStart = get_wall_time();
    volk_32f_sqrt_32f_a(out, in, N);
    //volk_32f_sqrt_32f(out, in, N);
    walStop = get_wall_time();
    dur_VolkSqrt = walStop - walStart;

    walStart = get_wall_time();
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        out[ii] = sqrt(in[ii]);
    }
    walStop = get_wall_time();
    dur_CRTLSqrt = walStop - walStart;

    printf("dur_VolkSqrt=(%f)%f dur_CRTLSqrt=(%f)%f\n", dur_VolkSqrt/N, dur_VolkSqrt, dur_CRTLSqrt/N, dur_CRTLSqrt);

    volk_free(in);
    volk_free(out);
    return 0;
}
