Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-04 Thread Pekka Paalanen
On Wed, 03 Jun 2015 17:47:47 +0100
Ben Avison bavi...@riscosopen.org wrote:

 On Wed, 03 Jun 2015 08:51:25 +0100, Pekka Paalanen ppaala...@gmail.com
 wrote:
  If we fixed gettime() for clock() wraparounds rather than ignoring them,
  I suppose there would be no reason to have gettimei().
 
  Ben?
 
 Well, that would help, but I still don't like the idea of using floating
 point in the middle of benchmarking loops when an integer version works
 just as well and floating point doesn't really gain you anything. Even the
 division by 1000000 is nearly always undone by the time we reach the
 printf because megapixels per second or megabytes per second are practical
 units - and those are the same things as pixels or bytes per microsecond.
 Nobody is currently doing more to the times than adding or subtracting
 them.
 
 I know they're increasingly rare these days, but a machine with no
 hardware FPU might take an appreciable time to do the integer-floating
 point conversion and floating point maths. Even if you have an FPU, it
 might be powered down on each context switch and only have its state
 restored lazily on the first floating point instruction encountered,
 resulting in a random timing element. In both cases this can be avoided
 by sticking to integers.

That is all the reason to rewrite gettime() in terms of integers then.
If gettime() internally stored its own epoch and returned times
starting from the first call, it'd be less likely to cause integer
overflows in callers. With integers though, unless using 64-bit, we'd
have to pick the resolution and wraparound time when designing the API.
Printing 64-bit values is a bit of a hassle with PRId64 etc.

I think having different timing functions for different tests is
unexpected in any case, and IMHO slightly worse than using 'double'.

I agree with Siarhei in that 'double' is very convenient for
calculations and printing. And you both have proved that, for our uses,
there are no precision issues.

The same arguments that invalidate my proposal to use more accurate
timing functions can also be used to argue that the floating point
usage does not matter.

At some point I might propose a patch to fix gettime() internally by
returning times starting from zero, and remove gettimei().
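
As a rough illustration of that idea, gettime() could remember the time
of its first call and hand out offsets from it. This is only a sketch,
not an actual patch; it assumes HAVE_GETTIMEOFDAY and ignores the
clock() fallback:

static double
gettime (void)
{
    static int            initialized = 0;
    static struct timeval base;
    struct timeval        tv;

    gettimeofday (&tv, NULL);
    if (!initialized)
    {
        base = tv;
        initialized = 1;
    }

    /* Seconds since the first call; the values stay small, so double
     * precision is never a concern for the callers. */
    return (double)(tv.tv_sec - base.tv_sec) +
           (double)(tv.tv_usec - base.tv_usec) / 1000000.;
}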


Thanks,
pq


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Pekka Paalanen
On Tue, 2 Jun 2015 15:03:01 -0700
Bill Spitzak spit...@gmail.com wrote:

 I would have the first call return 0.0 and all the others return the
 difference between current time and when that first call was done. Then
 there is no worry about accuracy of floating point. I do not think any
 callers are interested in the absolute time, only in subtracting two
 results to get an elapsed time.

OpenMP threading must be taken into account, but that'd be doable.

It would indeed allow removing the elapsed() function from my proposal.

There would be a problem with clock() wrapping around too often, but
that's an orthogonal issue.

 Not sure if cpu time is what the benchmarks want. This does not include
 blocking waiting for the X server or for the GPU or for reading files.
 Elapsed real time is probably more useful.

This is Pixman, not Cairo. There are no connections to X, no GPUs, and
the benchmarks are not reading any files during the benchmark either.


Thanks,
pq


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Pekka Paalanen
On Wed, 3 Jun 2015 08:38:25 +0300
Siarhei Siamashka siarhei.siamas...@gmail.com wrote:

 On Tue, 2 Jun 2015 10:32:35 +0300
 Pekka Paalanen ppaala...@gmail.com wrote:
 
  Hi,
  
  most pixman performance benchmarks currently rely on gettime() from
  test/util.[ch]:
  - lowlevel-blt-bench
  - prng-test
  - radial-perf-test
  - scaling-bench
  
  Furthermore, affine-bench has its own gettimei() which is essentially
  gettime() but with uint32_t instead of double.
  
  double
  gettime (void)
  {
  #ifdef HAVE_GETTIMEOFDAY
  struct timeval tv;
  
  gettimeofday (&tv, NULL);
  return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
  #else
  return (double)clock() / (double)CLOCKS_PER_SEC;
  #endif
  }
  
  This definition of gettime() has several potential drawbacks:
  
  1. clock() will wrap around often, the manual page warns that in some
 cases it wraps around every 72 minutes. As the code in Pixman never
 expects a wraparound, this is subtly broken. This is a fallback path
 for systems that do not provide gettimeofday(), so it is rarely used
 if at all.
 
 Yes, the clock() fallback is there just to fill the void and make this
 code compile on systems which don't have gettimeofday().
 
 It also measures the CPU time instead of wall-clock time, but this
 can't be helped.
 
  2. gettimeofday() measures wall-clock time, which might not be the best
 to measure code performance on a CPU, because all other load in the
 system will affect the result. It's probably not a significant
 problem on fast systems where you know to run your benchmarks
 uncontended.
 
 In my opinion, wall-clock time is exactly what we need to measure.

Hi Siarhei,

any other reason than the below?

Btw. you don't need to spend time answering this, I already got the
high level feeling I was looking for: nothing to fix here. The rest is
just details for the enthusiastic. :-)


 If you have something else running concurrently in the system, the
 benchmark results are already screwed. Relying on the OS scheduler
 to correctly account the CPU time still does not compensate for
 effects such as cache thrashing.

Further down, you give the same arguments for why measuring on the RPi
is inaccurate as I would use here to justify using process time (the
kernel drivers and other processes, even if the system is seemingly
idle). :-)

 Basically, with the wall-clock time measurements it is easier to
 notice that something is wrong if the benchmark is not run uncontended.
 A few times I accidentally ran benchmarks with a compilation running
 in the background. Fortunately, the results were very much off and
 this was sufficient to raise suspicion :-)
 
 Maybe trying to measure both wall-clock and CPU time at the same time
 and comparing the results in the end would be an even better safety
 guard?

Would even that detect a contended system if there were multiple CPU
cores?

  3. gettimeofday() is not only subject to NTP adjustments but is also
 affected by setting the system clock. IOW, this is not a guaranteed
 monotonic clock. Again, unlikely to be a problem in most cases, as
 benchmarks run long enough to even out NTP skews, but short enough to
 avoid accidentally hitting clock changes. (Or so I would assume.)
 
 Yes, this is unfortunate. But this is hardly the worst problem.

Yeah, none of these are really bad problems.

We could fix just this one by using CLOCK_MONOTONIC when it's
available. I'm not completely sure if that requires a runtime check for
the existence of CLOCK_MONOTONIC particularly. I think
CLOCK_MONOTONIC_RAW would need it.
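
For illustration, a combined build-time and runtime probe could look
roughly like this (a sketch only; on older glibc clock_gettime() also
needs -lrt):

#include <time.h>
#include <sys/time.h>

/* Prefer CLOCK_MONOTONIC when both the build and the running kernel
 * support it, otherwise fall back to gettimeofday(). */
static double
gettime_monotonic (void)
{
#ifdef CLOCK_MONOTONIC
    struct timespec ts;

    if (clock_gettime (CLOCK_MONOTONIC, &ts) == 0)
        return ts.tv_sec + ts.tv_nsec / 1e9;
#endif
    {
        struct timeval tv;

        gettimeofday (&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }
}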

  4. Using double to store an absolute timestamp is suspicious to me. In
 the end, you always compute the difference between two timestamps,
 and using a floating point type may cause catastrophic cancellation
 [1] depending on absolute values. However, [1] also explains that a
 double is enough. But, given that we read an arbitrary system clock
 whose starting point is unknown (ok, Epoch for the moment), we might
 already be getting values too large to maintain the expected
 accuracy (for floats, sure; for doubles, who knows?)
 
 The double type for storing the time in seconds (since Epoch) was
 selected on purpose.
 
 First, there is no accuracy problem. The time in seconds since
 Epoch is going to fit in a 32-bit value at least until 2038. The
 mantissa in double has 53 bits. Which leaves us with 53 - 32 = 21 bits
 for the fractional part. As such, the granularity of the time
 representation can be estimated as 1 / 2^21 = ~0.12 nanoseconds.
 Which is roughly equivalent to one clock cycle of a hypothetical
 8 GHz processor. The calculations may be off by one here or there,
 but we are interested in the order of magnitude. This is more than
 enough.

Yes, I totally expected you to have that justification.

However, if we change to CLOCK_MONOTONIC or something, the Epoch is no

Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Pekka Paalanen
On Wed, 3 Jun 2015 10:51:25 +0300
Pekka Paalanen ppaala...@gmail.com wrote:

 On Wed, 3 Jun 2015 08:38:25 +0300
 Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
 
  On Tue, 2 Jun 2015 10:32:35 +0300
  Pekka Paalanen ppaala...@gmail.com wrote:

   I would propose the following:
   
   - runtime clock selection with this priority order:
 1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
  2. getrusage(RUSAGE_SELF) -> rusage.ru_utime (user time)
 3. gettimeofday()
 4. clock()
  Naturally with build time checks, too. For 3 and 4 it would print a
 warning about inaccurate measurements. clock_gettime(CLOCK_MONOTONIC)
 is not in the list because I would assume getrusage() is more widely
 available and I'd like to use process time before wall-clock delta.
   
   - A separate void init_gettime(void) for choosing the clock.
   
   - void gettime(struct timespec *ts) for reading the clock.
   
   - double elapsed(const struct timespec *begin, const struct timespec
 *end) for getting the elapsed time in seconds.
  
  This API looks excessively complicated to me.
 
 Only if you keep the assumption that using a double for absolute
 timestamps is always fine. You always compute time intervals anyway.
 
 This would even give a nice common place to detect clock wraparounds or
 going backwards so you'd be guaranteed to never get a negative interval
 length.
 
 If we need runtime detection of available clocks, then having
 init_gettime() would be useful to detect once instead of per OpenMP
 thread.

Sorry, never mind this, there is no threading in the benchmarks. So the
explicit init call is not necessary, and, as Bill pointed out, the first
call could not only initialize but also take the base timestamp, so that
later calls always return the elapsed time from that.

I could keep the API exactly like it is right now.


Thanks,
pq


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Bill Spitzak

On 06/03/2015 12:36 AM, Pekka Paalanen wrote:

On Tue, 2 Jun 2015 15:03:01 -0700
Bill Spitzak spit...@gmail.com wrote:


I would have the first call return 0.0 and all the others return the
difference between current time and when that first call was done. Then
there is no worry about accuracy of floating point. I do not think any
callers are interested in the absolute time, only in subtracting two
results to get an elapsed time.


OpenMP threading must be taken into account, but that'd be doable.

It would indeed allow removing the elapsed() function from my proposal.

There would be a problem with clock() wrapping around too often, but
that's an orthogonal issue.


But it would only wrap 72 days after the test started, rather than at 
some random point during the test.




Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Ben Avison

On Wed, 03 Jun 2015 08:51:25 +0100, Pekka Paalanen ppaala...@gmail.com
wrote:

If we fixed gettime() for clock() wraparounds rather than ignoring them,
I suppose there would be no reason to have gettimei().

Ben?


Well, that would help, but I still don't like the idea of using floating
point in the middle of benchmarking loops when an integer version works
just as well and floating point doesn't really gain you anything. Even the
division by 1000000 is nearly always undone by the time we reach the
printf because megapixels per second or megabytes per second are practical
units - and those are the same things as pixels or bytes per microsecond.
Nobody is currently doing more to the times than adding or subtracting
them.

I know they're increasingly rare these days, but a machine with no
hardware FPU might take an appreciable time to do the integer-floating
point conversion and floating point maths. Even if you have an FPU, it
might be powered down on each context switch and only have its state
restored lazily on the first floating point instruction encountered,
resulting in a random timing element. In both cases this can be avoided
by sticking to integers.

Ben


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-03 Thread Ben Avison

On Wed, 03 Jun 2015 08:40:37 +0100, Pekka Paalanen ppaala...@gmail.com wrote:


On Tue, 02 Jun 2015 17:03:30 +0100
Ben Avison bavi...@riscosopen.org wrote:


If I were to make one change to gettimei() now, looking back, it would be
to make it return int32_t instead. This is because most often you'd be
subtracting two sample outputs of the function, and it's more often
useful to consider time intervals as signed (say if you're comparing the
current time against a timeout time which may be in the past or the
future). If gettimei() returns a signed integer, then C's type promotion
rules make the result of the subtraction signed automatically.


Wasn't overflow well-defined only for unsigned integers?


You're right, I'd forgotten that aspect of C. To be fully portable, a
modulo-2^32-arithmetic comparison needs to be written

uint32_t time1, time2;
if (((time1 - time2) & (1u << 31)) != 0)

rather than

int32_t time1, time2;
if ((time1 - time2) < 0)

or

uint32_t time1, time2;
if ((int32_t) (time1 - time2) < 0)

even though the latter two are more readable and all three are actually
equivalent if you're using two's complement integers, which (nearly?)
everybody does nowadays.

Sometimes I wish C had an inbuilt modulo integer type. At the assembly
level for a lot of CPUs it's a third type of comparison with similar ease
of use as signed and unsigned comparisons, but it's a pain to express in
C.
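
Purely as an illustration (not a proposal for Pixman), the portable form
can at least be hidden behind a small helper, assuming the real interval
never exceeds +/- 2^31 microseconds:

#include <stdint.h>

/* Signed difference of two wrapping uint32_t microsecond timestamps. */
static int32_t
time_diff (uint32_t later, uint32_t earlier)
{
    uint32_t d = later - earlier;    /* well-defined modulo 2^32 */

    if (d & (1u << 31))              /* interval is negative */
        return -(int32_t)(0u - d);   /* magnitude < 2^31, so this fits */

    return (int32_t)d;
}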

Anyway, this doesn't change the fact that struct timeval currently uses
signed long ints, and will enter undefined behaviour territory in 2038. I
think it's reasonable to assume that in practice, tv_sec will actually
contain an unsigned long int value (albeit one that has been cast to
signed) after that, as that would break the least software.

Ben


[Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-02 Thread Pekka Paalanen
Hi,

most pixman performance benchmarks currently rely on gettime() from
test/util.[ch]:
- lowlevel-blt-bench
- prng-test
- radial-perf-test
- scaling-bench

Furthermore, affine-bench has its own gettimei() which is essentially
gettime() but with uint32_t instead of double.

double
gettime (void)
{
#ifdef HAVE_GETTIMEOFDAY
struct timeval tv;

gettimeofday (&tv, NULL);
return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
#else
return (double)clock() / (double)CLOCKS_PER_SEC;
#endif
}

This definition of gettime() has several potential drawbacks:

1. clock() will wrap around often, the manual page warns that in some
   cases it wraps around every 72 minutes. As the code in Pixman never
   expects a wraparound, this is subtly broken. This is a fallback path
   for systems that do not provide gettimeofday(), so it is rarely used
   if at all.

2. gettimeofday() measures wall-clock time, which might not be the best
   to measure code performance on a CPU, because all other load in the
   system will affect the result. It's probably not a significant
   problem on fast systems where you know to run your benchmarks
   uncontended.

3. gettimeofday() is not only subject to NTP adjustments but is also
   affected by setting the system clock. IOW, this is not a guaranteed
   monotonic clock. Again, unlikely to be a problem in most cases, as
   benchmarks run long enough to even out NTP skews, but short enough to
   avoid accidentally hitting clock changes. (Or so I would assume.)

4. Using double to store an absolute timestamp is suspicious to me. In
   the end, you always compute the difference between two timestamps,
   and using a floating point type may cause catastrophic cancellation
   [1] depending on absolute values. However, [1] also explains that a
   double is enough. But, given that we read an arbitrary system clock
   whose starting point is unknown (ok, Epoch for the moment), we might
   already be getting values too large to maintain the expected
   accuracy (for floats, sure; for doubles, who knows?)


I would propose the following:

- runtime clock selection with this priority order:
1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
2. getrusage(RUSAGE_SELF) -> rusage.ru_utime (user time)
3. gettimeofday()
4. clock()
  Naturally with build time checks, too. For 3 and 4 it would print a
  warning about inaccurate measurements. clock_gettime(CLOCK_MONOTONIC)
  is not in the list because I would assume getrusage() is more widely
  available and I'd like to use process time before wall-clock delta.

- A separate void init_gettime(void) for choosing the clock.

- void gettime(struct timespec *ts) for reading the clock.

- double elapsed(const struct timespec *begin, const struct timespec
  *end) for getting the elapsed time in seconds.
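
To make the shape of this API concrete, here is a rough sketch of the
three functions (assuming a hypothetical HAVE_CLOCK_GETTIME build-time
check; the getrusage() and clock() fallbacks from the priority list are
omitted for brevity):

#include <time.h>
#include <sys/time.h>

static clockid_t chosen_clock;
static int       have_posix_clock;

void
init_gettime (void)
{
#ifdef HAVE_CLOCK_GETTIME
    struct timespec ts;

    if (clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &ts) == 0)
    {
        chosen_clock = CLOCK_PROCESS_CPUTIME_ID;
        have_posix_clock = 1;
        return;
    }
#endif
    /* getrusage() / clock() fallbacks would go here */
    have_posix_clock = 0;
}

void
gettime (struct timespec *ts)
{
#ifdef HAVE_CLOCK_GETTIME
    if (have_posix_clock)
    {
        clock_gettime (chosen_clock, ts);
        return;
    }
#endif
    {
        struct timeval tv;

        gettimeofday (&tv, NULL);
        ts->tv_sec  = tv.tv_sec;
        ts->tv_nsec = tv.tv_usec * 1000;
    }
}

double
elapsed (const struct timespec *begin, const struct timespec *end)
{
    return (end->tv_sec - begin->tv_sec) +
           (end->tv_nsec - begin->tv_nsec) / 1e9;
}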

In my experiments on the Raspberry Pi 1 [2], it seems to me that
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is the most accurate CPU time
measurement, getrusage() being possibly rounded to HZ ticks. If neither
is available, I'll just forget about accuracy and get whatever we can
get, essentially what the current gettime() does but without the
floating point difference and with the wraparound accounted for (at
least avoiding negative elapsed time, which might break algorithms).

What would you think about this scheme?

I am not making a promise to implement this any time soon, I would just
like to hear a few opinions whether it is worth even considering.

This is also like a public note for myself, in case I need to think
about these again. My difficulties with benchmarking Pixman on the Rpi1
are what prompted this, but this wouldn't solve most of the problems I
have there.


Thanks,
pq

[1] https://randomascii.wordpress.com/2012/02/13/dont-store-that-in-a-float/

[2] the test program:
https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/tree/src/timings.c?id=795db042f44a12147de7449f47da901670733f71
was running over a weekend, generating 78k samples (a big web page!):
https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/commit/?id=795db042f44a12147de7449f47da901670733f71


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-02 Thread Bill Spitzak
I would have the first call return 0.0 and all the others return the
difference between current time and when that first call was done. Then
there is no worry about accuracy of floating point. I do not think any
callers are interested in the absolute time, only in subtracting two
results to get an elapsed time.

Not sure if cpu time is what the benchmarks want. This does not include
blocking waiting for the X server or for the GPU or for reading files.
Elapsed real time is probably more useful.


On Tue, Jun 2, 2015 at 9:03 AM, Ben Avison bavi...@riscosopen.org wrote:

 On Tue, 02 Jun 2015 08:32:35 +0100, Pekka Paalanen ppaala...@gmail.com
 wrote:

  most pixman performance benchmarks currently rely on gettime() from
 test/util.[ch]:
 - lowlevel-blt-bench
 - prng-test
 - radial-perf-test
 - scaling-bench

 Furthermore, affine-bench has its own gettimei() which is essentially
 gettime() but with uint32_t instead of double.


 For what it's worth, here's my opinion. I'll sidestep the issue of
 *which* underlying system clock is read for now, and look at data types.

 It certainly makes more sense to use doubles than floats for holding
 absolute times. As of 2005-09-05 05:58:26 UTC, the number of microseconds
 elapsed since 1970-01-01 00:00:00 UTC has been expressible as a 51-bit
 integer. The next time that changes will be 2041-05-10 11:56:53 UTC, when
 that goes up to a 52-bit integer.

 IEEE double-precision floating point numbers use a 52-bit mantissa, so
 they are capable of expressing all 51- and 52-bit integers without any
 loss of precision. In fact, we don't lose precision until we reach 54-bit
 integers (because the mantissa excludes the implicit leading '1' bit):
 after 2255-06-05 23:47:34 UTC the times would start being rounded to an
 even number of microseconds.

 With only 23 mantissa bits in single-precision, times would currently
 be rounded with a granularity of over 2 minutes - unworkable for most
 purposes.

 Even dividing by 1000000, as gettime() does, is fairly harmless with
 double-precision floating point - all you're really doing is subtracting
 20 from the exponent and adding a few multiples of the upper bits of the
 mantissa into the lower bits.
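
A standalone snippet to convince yourself of the above (illustrative
only, not part of Pixman; the timestamp is just a round number near
mid-2015):

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
    /* About 1.43e15 microseconds since the Epoch: a 51-bit integer,
     * comfortably within double's 53-bit mantissa. */
    int64_t us = INT64_C (1433300000000000);
    double  t  = (double) us;

    printf ("round trip is %s\n", (int64_t) t == us ? "exact" : "rounded");
    printf ("as seconds: %.6f\n", t / 1000000.);
    return 0;
}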

 But this is ignoring the fact that underneath we're calling
 gettimeofday(), which suffers from a perennial problem with clock APIs,
 the use of an absolute time expressed as an integer which is liable to
 overflow. There are a limited number of transformations you can safely
 perform on these - subtracting one from another is notable as a useful
 and safe operation (assuming the time interval is less than the maximum
 integer expressible, which will normally be the case).

 Assigning the time to a variable of wider type (such as assigning the
 long int tv_sec to a uint64_t) is *not* safe, unless you have a reference
 example of a nearby time that's already in the wider type, from which you
 can infer the most significant bits. There is no provision in the API as
 defined to pass in any such reference value, and when gettime() assigns
 the time to a double, that's effectively a very wide type indeed because
 it can hold the equivalent of an integer over 1000 bits long.

 Assuming 'long int' continues to be considered to be a signed 32-bit
 number, as it usually is for today's compilers, tv_sec will suffer signed
 overflow on 2038-01-19 03:14:08 UTC, which will hit long before we start
 losing precision for doubles. That's only 23 years away now, still within
 the careers of many of today's engineers.

 Dividing an integer absolute time is also no good, because differing
 values of the overflowed upper bits would completely scramble all the
 lower bits. gettimei() gets away with it in the #ifndef HAVE_GETTIMEOFDAY
 clause because CLOCKS_PER_SEC is normally 1000000 so the multiplication
 and division cancel each other out. Multiplication and addition, on the
 other hand, are OK so long as you don't widen the type because the
 missing upper bits only affect other missing upper bits in the result -
 hence why gettimei() multiplies tv_sec by 1000000 and adds tv_usec. The
 output of the function is safe to use to calculate time intervals so long
 as the interval doesn't exceed +/- 2^31 microseconds (about 35 minutes).
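
For reference, such an integer timer looks roughly like this (a
paraphrased sketch, not a verbatim copy of affine-bench's gettimei();
the cast keeps the deliberate modulo-2^32 wrap well defined):

static uint32_t
gettimei (void)
{
#ifdef HAVE_GETTIMEOFDAY
    struct timeval tv;

    gettimeofday (&tv, NULL);
    /* Wraps modulo 2^32; only intervals of up to about 35 minutes
     * are ever computed from this value. */
    return (uint32_t)tv.tv_sec * 1000000u + tv.tv_usec;
#else
    return (uint64_t)clock () * 1000000 / CLOCKS_PER_SEC;
#endif
}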

 If I were to make one change to gettimei() now, looking back, it would be
 to make it return int32_t instead. This is because most often you'd be
 subtracting two sample outputs of the function, and it's more often
 useful to consider time intervals as signed (say if you're comparing the
 current time against a timeout time which may be in the past or the
 future). If gettimei() returns a signed integer, then C's type promotion
 rules make the result of the subtraction signed automatically.

 Ben


Re: [Pixman] RFC: Pixman benchmark CPU time measurement

2015-06-02 Thread Siarhei Siamashka
On Tue, 2 Jun 2015 10:32:35 +0300
Pekka Paalanen ppaala...@gmail.com wrote:

 Hi,
 
 most pixman performance benchmarks currently rely on gettime() from
 test/util.[ch]:
 - lowlevel-blt-bench
 - prng-test
 - radial-perf-test
 - scaling-bench
 
 Furthermore, affine-bench has its own gettimei() which is essentially
 gettime() but with uint32_t instead of double.
 
 double
 gettime (void)
 {
 #ifdef HAVE_GETTIMEOFDAY
 struct timeval tv;
 
 gettimeofday (&tv, NULL);
 return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
 #else
 return (double)clock() / (double)CLOCKS_PER_SEC;
 #endif
 }
 
 This definition of gettime() has several potential drawbacks:
 
 1. clock() will wrap around often, the manual page warns that in some
cases it wraps around every 72 minutes. As the code in Pixman never
expects a wraparound, this is subtly broken. This is a fallback path
for systems that do not provide gettimeofday(), so it is rarely used
if at all.

Yes, the clock() fallback is there just to fill the void and make this
code compile on systems which don't have gettimeofday().

It also measures the CPU time instead of wall-clock time, but this
can't be helped.

 2. gettimeofday() measures wall-clock time, which might not be the best
to measure code performance on a CPU, because all other load in the
system will affect the result. It's probably not a significant
problem on fast systems where you know to run your benchmarks
uncontended.

In my opinion, wall-clock time is exactly what we need to measure.

If you have something else running concurrently in the system, the
benchmark results are already screwed. Relying on the OS scheduler
to correctly account the CPU time still does not compensate for
effects such as cache thrashing.

Basically, with the wall-clock time measurements it is easier to
notice that something is wrong if the benchmark is not run uncontended.
A few times I accidentally ran benchmarks with a compilation running
in the background. Fortunately, the results were very much off and
this was sufficient to raise suspicion :-)

Maybe trying to measure both wall-clock and CPU time at the same time
and comparing the results in the end would be an even better safety
guard?
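
A sketch of that safety guard (illustrative only; it assumes the
wall-clock and CPU timestamps were taken with, say, CLOCK_MONOTONIC and
CLOCK_PROCESS_CPUTIME_ID, and the 10% threshold is an arbitrary choice):

#include <stdio.h>
#include <time.h>

/* Warn when wall-clock and process CPU time disagree noticeably, which
 * usually means the benchmark did not run uncontended. */
static void
check_contention (const struct timespec *wall_begin,
                  const struct timespec *wall_end,
                  const struct timespec *cpu_begin,
                  const struct timespec *cpu_end)
{
    double wall = (wall_end->tv_sec - wall_begin->tv_sec) +
                  (wall_end->tv_nsec - wall_begin->tv_nsec) / 1e9;
    double cpu  = (cpu_end->tv_sec - cpu_begin->tv_sec) +
                  (cpu_end->tv_nsec - cpu_begin->tv_nsec) / 1e9;

    if (wall > cpu * 1.1)
        fprintf (stderr,
                 "warning: wall-clock time %.3f s vs CPU time %.3f s, "
                 "results may be unreliable\n", wall, cpu);
}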

 3. gettimeofday() is not only subject to NTP adjustments but is also
affected by setting the system clock. IOW, this is not a guaranteed
monotonic clock. Again, unlikely to be a problem in most cases, as
benchmarks run long enough to even out NTP skews, but short enough to
avoid accidentally hitting clock changes. (Or so I would assume.)

Yes, this is unfortunate. But this is hardly the worst problem.

 4. Using double to store an absolute timestamp is suspicious to me. In
the end, you always compute the difference between two timestamps,
and using a floating point type may cause catastrophic cancellation
[1] depending on absolute values. However, [1] also explains that a
double is enough. But, given that we read an arbitrary system clock
whose starting point is unknown (ok, Epoch for the moment), we might
already be getting values too large to maintain the expected
accuracy (for floats, sure; for doubles, who knows?)

The double type for storing the time in seconds (since Epoch) was
selected on purpose.

First, there is no accuracy problem. The time in seconds since
Epoch is going to fit in a 32-bit value at least until 2038. The
mantissa in double has 53 bits. Which leaves us with 53 - 32 = 21 bits
for the fractional part. As such, the granularity of the time
representation can be estimated as 1 / 2^21 = ~0.12 nanoseconds.
Which is roughly equivalent to one clock cycle of a hypothetical
8 GHz processor. The calculations may be off by one here or there,
but we are interested in the order of magnitude. This is more than
enough.

Second, the floating point representation is much better for handling
timestamps. One reason is that we avoid overflow problems (the
intermediate calculations may potentially have unsafe multiplications
by the CPU clock frequency, etc.). And if we want to calculate things
like the standard deviation, then the floating point representation is
also much more convenient. Another reason is that the printf style
functions work nicely with doubles, but have some portability
inconveniences with 64-bit integers.

This was the reason why obtaining the time (in a potentially system
dependent way) was abstracted in the gettime() function and the rest
of the code only works with convenient doubles.

Thanks for mentioning the new gettimei() function. Was there a
justified reason to introduce it instead of using gettime()? 

 I would propose the following:
 
 - runtime clock selection with this priority order:
   1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
   2. getrusage(RUSAGE_SELF) -> rusage.ru_utime (user time)
   3. gettimeofday()
   4. clock()
   Naturally with build time