Hi Jeff,
You'll want to compile with optimization; otherwise you're artificially making the native
`sqrt` loop slower than it would be in a real application. Add `-O2` or `-O3` to your
compiler invocation. Also, you're using floats, not doubles, so use `sqrtf` in your
C code, not `sqrt`! (Your code is really C, not necessarily how you'd write the same program in C++.)
Also, compared to the time the math itself takes, in both the VOLK and the libm sqrt
case, your time measurement's uncertainty is large: taking the square root of only
16k values is next to nothing. You need to run that in a loop of many iterations,
preferably with some warm-up to get the branch predictors trained (to the extent the CPU
*has* branch prediction; I'm not sure how much the ARM1176JZ-S offers there).
Luckily, your VOLK already ships with a benchmark tool that runs exactly such a loop:
`volk_profile -R sqrt` will do exactly that. The `generic` implementation literally just
calls `sqrtf`. Could you share the output of `volk_profile -R sqrt` with us?
Furthermore, I'm **highly** confused by your results: the ARM1176JZ-S is a 32 bit processor,
developed in the early 2000s; so, it's, by modern standards, a painfully slow 32
bit ARMv6 CPU. It predates both aarch64 and NEON! So, I'm pretty sure either cpu_features
is wrong, or this is not the CPU you're using. In this rare case, I think it's you who's
mistaken and not the software, because you're also using a /usr/local/lib64 library path,
which would quite unambiguously point to a 64 bit OS, which couldn't run on an ARM11.
Could you double-check and *confirm* you're using an ARM1176JZ-S processor? If you are,
are you perhaps running this with qemu-aarch64 on your armv6 (32 bit!) machine? Can you
send us the `volk_sqrt` you're getting, or at least share what `file volk_sqrt` says about
that binary?
We then would need to help you file a bug upstream against cpu_features, because it'd be
impossible for us to build a working VOLK if cpu_features goes and miscategorizes an
ancient 32 bit machine as aarch64.
Best regards,
Marcus
On 08.10.23 00:22, Jeff R wrote:
I modified a simple Volk sqrt program for an ARM1176JZ-S processor to test performance,
and the results are puzzling. The following program prints:
dur_VolkSqrt=(0.000000)0.001721 dur_CRTLSqrt=(0.000000)0.000318
The following processor information is displayed. It appears as though NEON is
supported.
~/volk-3.0.0/build# cpu_features/list_cpu_features
arch : aarch64
implementer : 65 (0x41)
variant : 0 (0x00)
part : 3336 (0xD08)
revision : 3 (0x03)
flags : asimd,cpuid,crc32,fp
Why are the numbers so slow for Volk versus the CRTL? I may be missing something
obvious. Thank you in advance.
Here’s the test program:
// g++ -I /usr/local/include/volk volk_sqrt.cpp -o volk_sqrt -L /usr/local/lib64/ -lvolk
// export LD_LIBRARY_PATH=/usr/local/lib64; ./volk_sqrt
#include <stdio.h>
#include <math.h>
#include <volk.h>
#include <limits.h>
#include <time.h>
#include <sys/time.h>
double get_wall_time()
{
    struct timeval time;
    if (gettimeofday(&time,NULL))
    {
        // Handle error
        return 0;
    }
    return (double)time.tv_sec + (double)time.tv_usec * .000001;
}
int main(int argc, char* args[])
{
    double walStop;
    double walStart;
    double dur_VolkSqrt;
    double dur_CRTLSqrt;
    int N = 1024*16;
    unsigned int alignment = volk_get_alignment();
    float* in = (float*)volk_malloc(sizeof(float)*N, alignment);
    float* out = (float*)volk_malloc(sizeof(float)*N, alignment);
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        in[ii] = (float)(ii*ii);
    }
    walStart = get_wall_time();
    volk_32f_sqrt_32f_a(out, in, N);
    //volk_32f_sqrt_32f(out, in, N);
    walStop = get_wall_time();
    dur_VolkSqrt = walStop - walStart;
    walStart = get_wall_time();
    for(unsigned int ii = 0; ii < N; ++ii)
    {
        out[ii] = sqrt(in[ii]);
    }
    walStop = get_wall_time();
    dur_CRTLSqrt = walStop - walStart;
    printf("dur_VolkSqrt=(%f)%f dur_CRTLSqrt=(%f)%f\n", dur_VolkSqrt/N, dur_VolkSqrt,
           dur_CRTLSqrt/N, dur_CRTLSqrt);
    volk_free(in);
    volk_free(out);
    return 0;
}