[Pixman] RFC: Pixman benchmark CPU time measurement
Hi,

most pixman performance benchmarks currently rely on gettime() from
test/util.[ch]:
- lowlevel-blt-bench
- prng-test
- radial-perf-test
- scaling-bench

Furthermore, affine-bench has its own gettimei() which is essentially
gettime() but with uint32_t instead of double.

double
gettime (void)
{
#ifdef HAVE_GETTIMEOFDAY
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
#else
    return (double)clock () / (double)CLOCKS_PER_SEC;
#endif
}

This definition of gettime() has several potential drawbacks:

1. clock() will wrap around often; the manual page warns that in some
   cases it wraps around every 72 minutes. As the code in Pixman never
   expects a wraparound, this is subtly broken. However, this is a
   fallback path for systems that do not provide gettimeofday(), so it
   is rarely used, if at all.

2. gettimeofday() measures wall-clock time, which might not be the best
   way to measure code performance on a CPU, because all other load in
   the system will affect the result. This is probably not a
   significant problem on fast systems where you know to run your
   benchmarks uncontended.

3. gettimeofday() is not only subject to NTP adjustments but is also
   affected by setting the system clock. IOW, this is not a guaranteed
   monotonic clock. Again, this is unlikely to be a problem in most
   cases, as benchmarks run long enough to even out NTP skews, but
   short enough to avoid accidentally hitting clock changes. (Or so I
   would assume.)

4. Using double to store an absolute timestamp is suspicious to me. In
   the end, you always compute the difference between two timestamps,
   and using a floating point type may cause catastrophic cancellation
   [1] depending on the absolute values. However, [1] also explains
   that a double is enough. But, given that we read an arbitrary system
   clock whose starting point is unknown (ok, the Epoch for the
   moment), we might already be getting values too large to maintain
   the expected accuracy (for floats, certainly; for doubles, who
   knows?).
I would propose the following:

- Runtime clock selection with this priority order:
  1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
  2. getrusage(RUSAGE_SELF) - rusage.ru_utime (user time)
  3. gettimeofday()
  4. clock()
  Naturally with build time checks, too. For 3 and 4 it would print a
  warning about inaccurate measurements. clock_gettime(CLOCK_MONOTONIC)
  is not in the list because I would assume getrusage() is more widely
  available, and I'd like to use process time in preference to a
  wall-clock delta.
- A separate void init_gettime(void) for choosing the clock.
- void gettime(struct timespec *ts) for reading the clock.
- double elapsed(const struct timespec *begin, const struct timespec *end)
  for getting the elapsed time in seconds.

In my experiments on the Raspberry Pi 1 [2], it seems to me that
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is the most accurate CPU time
measurement, with getrusage() possibly being rounded to HZ ticks. If
neither is available, I'll just forget about accuracy and take whatever
we can get - essentially what the current gettime() does, but without
the floating point difference and with the wraparound accounted for (at
least avoiding negative elapsed times, which might break algorithms).

What would you think about this scheme?

I am not making a promise to implement this any time soon; I would just
like to hear a few opinions on whether it is worth considering at all.
This is also a public note for myself, in case I need to think about
these issues again. My difficulties with benchmarking Pixman on the
RPi1 are what prompted this, but this wouldn't solve most of the
problems I have there.
Thanks,
pq

[1] https://randomascii.wordpress.com/2012/02/13/dont-store-that-in-a-float/
[2] the test program:
    https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/tree/src/timings.c?id=795db042f44a12147de7449f47da901670733f71
    was running over a weekend, generating 78k samples (a big web page!):
    https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/commit/?id=795db042f44a12147de7449f47da901670733f71

_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] RFC: Pixman benchmark CPU time measurement
I would have the first call return 0.0 and all the others return the
difference between the current time and the time when that first call
was made. Then there is no worry about the accuracy of floating point.
I do not think any callers are interested in the absolute time, only in
subtracting two results to get an elapsed time.

I am not sure CPU time is what the benchmarks want. It does not include
blocking while waiting for the X server, or for the GPU, or for reading
files. Elapsed real time is probably more useful.

On Tue, Jun 2, 2015 at 9:03 AM, Ben Avison bavi...@riscosopen.org wrote:
> On Tue, 02 Jun 2015 08:32:35 +0100, Pekka Paalanen ppaala...@gmail.com wrote:
>> most pixman performance benchmarks currently rely on gettime() from
>> test/util.[ch]:
>> - lowlevel-blt-bench
>> - prng-test
>> - radial-perf-test
>> - scaling-bench
>> Furthermore, affine-bench has its own gettimei() which is essentially
>> gettime() but with uint32_t instead of double.
>
> For what it's worth, here's my opinion. I'll sidestep the issue of
> *which* underlying system clock is read for now, and look at data
> types.
>
> It certainly makes more sense to use doubles than floats for holding
> absolute times. As of 2005-09-05 05:58:26 UTC, the number of
> microseconds elapsed since 1970-01-01 00:00:00 UTC has been
> expressible as a 51-bit integer. The next time that changes will be
> 2041-05-10 11:56:53 UTC, when it goes up to a 52-bit integer. IEEE
> double-precision floating point numbers use a 52-bit mantissa, so they
> are capable of expressing all 51- and 52-bit integers without any loss
> of precision. In fact, we don't lose precision until we reach 54-bit
> integers (because the 52-bit mantissa excludes the implicit leading
> '1' bit): after 2255-06-05 23:47:34 UTC the times would start being
> rounded to an even number of microseconds. With only 23 mantissa bits
> in single precision, times would currently be rounded with a
> granularity of over 2 minutes - unworkable for most purposes.
> Even dividing by 1000000, as gettime() does, is fairly harmless with
> double-precision floating point - all you're really doing is
> subtracting 20 from the exponent and adding a few multiples of the
> upper bits of the mantissa into the lower bits.
>
> But this is ignoring the fact that underneath we're calling
> gettimeofday(), which suffers from a perennial problem with clock
> APIs: the use of an absolute time expressed as an integer which is
> liable to overflow. There are a limited number of transformations you
> can safely perform on such times - subtracting one from another is
> notable as a useful and safe operation (assuming the time interval is
> less than the maximum integer expressible, which will normally be the
> case). Assigning the time to a variable of a wider type (such as
> assigning the long int tv_sec to a uint64_t) is *not* safe, unless
> you have a reference example of a nearby time that's already in the
> wider type, from which you can infer the most significant bits. There
> is no provision in the API as defined to pass in any such reference
> value, and when gettime() assigns the time to a double, that's
> effectively a very wide type indeed, because it can hold the
> equivalent of an integer over 1000 bits long.
>
> Assuming 'long int' continues to be considered a signed 32-bit
> number, as it usually is for today's compilers, tv_sec will suffer
> signed overflow on 2038-01-19 03:14:08 UTC, which will hit long
> before we start losing precision in doubles. That's only 23 years
> away now, still within the careers of many of today's engineers.
>
> Dividing an integer absolute time is also no good, because differing
> values of the overflowed upper bits would completely scramble all the
> lower bits. gettimei() gets away with it in the #ifndef
> HAVE_GETTIMEOFDAY clause because CLOCKS_PER_SEC is normally 1000000,
> so the multiplication and division cancel each other out.
> Multiplication and addition, on the other hand, are OK so long as you
> don't widen the type, because the missing upper bits only affect
> other missing upper bits in the result - hence why gettimei()
> multiplies tv_sec by 1000000 and adds tv_usec. The output of the
> function is safe to use for calculating time intervals so long as the
> interval doesn't exceed +/- 2^31 microseconds (about 35 minutes).
>
> If I were to make one change to gettimei() now, looking back, it
> would be to make it return int32_t instead. This is because most
> often you'd be subtracting two sample outputs of the function, and
> it's more often useful to consider time intervals as signed (say, if
> you're comparing the current time against a timeout time which may be
> in the past or the future). If gettimei() returns a signed integer,
> then C's type promotion rules make the result of the subtraction
> signed automatically.
>
> Ben
Re: [Pixman] [PATCH 1/1 v2] vmx: workarounds to fix powerpc little endian particularities
On Mon, 1 Jun 2015 16:46:00 -0400
Fernando Seiti Furusato ferse...@linux.vnet.ibm.com wrote:

> I have made some changes to the file pixman-vmx.c, which uses vmx (aka
> altivec) to optimize pixman. Basically, what I did:
> - Changed vec_perm to perform an xor operation over the positions, so
>   it will work regardless of endianness.
> - Replaced the usage of vec_mergeh, vec_mergel and vec_mladd by
>   vec_mulo and vec_mule plus vec_add and vec_perm. The result is the
>   same and not affected by endianness differences.
> - Replaced the usage of vec_lvsl with a direct unaligned assignment
>   operation (=). That is because, according to the Power ABI
>   Specification, the usage of lvsl is deprecated on ppc64le.
> - Changed the COMPUTE_SHIFT_{MASK,MASKS,MASKC} macro usage to a no-op
>   for powerpc little endian, since unaligned access is supported on
>   ppc64le.
> After those changes, all tests passed on ppc64, ppc64le and powerpc
> vms.
>
> Signed-off-by: Fernando Seiti Furusato ferse...@linux.vnet.ibm.com
> ---

Hello,

First, here is one trick related to patch handling with git. If you are
sending just a single patch, it is not necessary to accompany it with
an extra cover letter e-mail. The point of the cover letter is to
provide a general overview of the whole patch set and of how the
multiple patches fit together. If you want to provide an extra comment
not intended to be part of the commit message, it can be added
somewhere between the '---' separator and the "diff --git ..." line in
the patch. This part of the patch is ignored when applying it.

> Hello. In the v2 of the patch, I tried to address some concerns raised
> by Siarhei Siamashka. One that I was not sure how to treat was the
> vec_perm within the splat_alpha function. It does not look all pretty,
> but it will work regardless of the endianness.

Yes, indeed.

> Although it was suggested that I create a function that might have
> performed the unpacking by itself, I thought that performing the
> operation while unpacking could save some lines of code and function
> calls.
> Thus I just replaced that part as you will see in the patch. Thanks.

Adjusting the algorithm is also a good solution, as long as it works at
least as fast as the old code.

> I also ran lowlevel-blt-bench with over_8888_8888 and the results
> (comparing unpatched and patched) were quite similar.

How similar are they? Is the new code faster than the old one? Is it 1%
slower? Is it 5% slower? The 'over_8888_8_8888' operation can also be
tried, because it exercises the 'pix_multiply' code more. In general,
it would be best to see the benchmark results included as part of the
commit message.

> Please let me know if you have any comments.

As there are more significant changes than just endian-dependent
ifdefs, I tried to have a look at the disassembly dumps of the
generated code. Basically, I added the following code to pixman-vmx.c:

vector unsigned int
vmx_splat_alpha (vector unsigned int pix)
{
    return splat_alpha (pix);
}

vector unsigned int
vmx_pix_multiply (vector unsigned int p, vector unsigned int a)
{
    return pix_multiply (p, a);
}

Then configured and built pixman for powerpc64 (with GCC 4.8.4):

    ./configure --host=powerpc64-unknown-linux-gnu
    make

And finally got the disassembly dump:

    powerpc64-unknown-linux-gnu-objdump -d pixman/.libs/libpixman-vmx.a

>  pixman/pixman-vmx.c | 106 ++++++++++++++++++++----------
>  1 file changed, 67 insertions(+), 39 deletions(-)
>
> diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> index c33631c..57918c1 100644
> --- a/pixman/pixman-vmx.c
> +++ b/pixman/pixman-vmx.c
> @@ -37,46 +37,49 @@ static force_inline vector unsigned int
>  splat_alpha (vector unsigned int pix)
>  {
> -    return vec_perm (pix, pix,
> -                     (vector unsigned char)AVV (
> -                         0x00, 0x00, 0x00, 0x00, 0x04, 0x04, 0x04, 0x04,
> -                         0x08, 0x08, 0x08, 0x08, 0x0C, 0x0C, 0x0C, 0x0C));
> +    union {
> +        unsigned short s;
> +        unsigned char c[2];
> +    } endian_xor = {0x0300};
> +
> +    /* endian_xor.c[1] will be 3 if little endian and 0 if big endian */
> +    vector unsigned char perm = vec_splat((vector unsigned char)

A coding style issue here - a space is needed between 'vec_splat' and
'('.
There are also a few other occurrences of this in your patch.

> +                                          AVV (endian_xor.c[1]),0);
> +    perm = vec_xor (perm,(vector unsigned char) AVV (
> +                    0x00, 0x00, 0x00, 0x00, 0x04, 0x04, 0x04, 0x04,
> +                    0x08, 0x08, 0x08, 0x08, 0x0C, 0x0C, 0x0C, 0x0C));
> +    return vec_perm (pix, pix, perm);
>  }

For this part, both the original and the patched code resulted in
identical instruction sequences:

.vmx_splat_alpha:
   0:   3d 22 00 00     addis   r9,r2,0
   4:   39 29 00 00     addi    r9,r9,0
   8:   7c 00 48 ce     lvx     v0,0,r9
   c:   10 42 10 2b     vperm   v2,v2,v2,v0
  10:   4e 80 00 20     blr

This is actually good. I
Re: [Pixman] RFC: Pixman benchmark CPU time measurement
On Tue, 2 Jun 2015 10:32:35 +0300
Pekka Paalanen ppaala...@gmail.com wrote:

> Hi,
>
> most pixman performance benchmarks currently rely on gettime() from
> test/util.[ch]:
> - lowlevel-blt-bench
> - prng-test
> - radial-perf-test
> - scaling-bench
>
> Furthermore, affine-bench has its own gettimei() which is essentially
> gettime() but with uint32_t instead of double.
>
> double
> gettime (void)
> {
> #ifdef HAVE_GETTIMEOFDAY
>     struct timeval tv;
>     gettimeofday (&tv, NULL);
>     return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
> #else
>     return (double)clock () / (double)CLOCKS_PER_SEC;
> #endif
> }
>
> This definition of gettime() has several potential drawbacks:
>
> 1. clock() will wrap around often; the manual page warns that in some
>    cases it wraps around every 72 minutes. As the code in Pixman never
>    expects a wraparound, this is subtly broken. This is a fallback
>    path for systems that do not provide gettimeofday(), so it is
>    rarely used, if at all.

Yes, the clock() fallback is there just to fill the void and make this
code compile on the systems which don't have gettimeofday(). It also
measures the CPU time instead of wall-clock time, but this can't be
helped.

> 2. gettimeofday() measures wall-clock time, which might not be the
>    best to measure code performance on a CPU, because all other load
>    in the system will affect the result. It's probably not a
>    significant problem on fast systems where you know to run your
>    benchmarks uncontended.

In my opinion, wall-clock time is exactly what we need to measure. If
you have something else running concurrently in the system, the
benchmark results are already screwed. Relying on the OS scheduler to
correctly account for the CPU time still does not compensate for, say,
the cache thrashing effects. Basically, with wall-clock time
measurements it is easier to notice that something is wrong when the
benchmark is not run uncontended. It has happened to me a few times to
accidentally run benchmarks with a compilation running in the
background.
Fortunately, the results were very much off, and this was sufficient to
raise suspicion :-) Maybe measuring both wall-clock and CPU time at the
same time and comparing the results in the end would be an even better
safety guard?

> 3. gettimeofday() is not only subject to NTP adjustments but is also
>    affected by setting the system clock. IOW, this is not a guaranteed
>    monotonic clock. Again, unlikely to be a problem in most cases, as
>    benchmarks run long enough to even out NTP skews, but short enough
>    to avoid accidentally hitting clock changes. (Or so I would
>    assume.)

Yes, this is unfortunate. But this is hardly the worst problem.

> 4. Using double to store an absolute timestamp is suspicious to me. In
>    the end, you always compute the difference between two timestamps,
>    and using a floating point type may cause catastrophic cancellation
>    [1] depending on absolute values. However, [1] also explains that a
>    double is enough. But, given that we read an arbitrary system clock
>    whose starting point is unknown (ok, Epoch for the moment), we
>    might already be getting values too large to maintain the expected
>    accuracy (for floats, sure; for doubles, who knows?)

The double type for storing the time in seconds (since the Epoch) was
selected on purpose.

First, there is no accuracy problem. The time in seconds since the
Epoch is going to fit in a 32-bit value at least until 2038. The
mantissa of a double provides 53 bits of precision, which leaves us
with 53 - 32 = 21 bits for the fractional part. As such, the
granularity of the time representation can be estimated as 1 / 2^21
seconds, roughly half a microsecond - which is finer than the one
microsecond resolution that gettimeofday() can report in the first
place. The calculations may be off by one bit here or there, but we are
interested in the order of magnitude. This is more than enough.

Second, the floating point representation is much better for handling
timestamps.
One reason is that we avoid overflow problems (the intermediate
calculations may potentially contain unsafe multiplications by the CPU
clock frequency and the like). And if we want to calculate things like
the standard deviation, then the floating point representation is also
much more convenient. Another reason is that the printf-style functions
work nicely with doubles, but have some portability inconveniences with
64-bit integers. This was the reason why obtaining the time (in a
potentially system-dependent way) was abstracted into the gettime()
function, and the rest of the code only works with convenient doubles.

Thanks for mentioning the new gettimei() function. Was there a
justified reason to introduce it instead of using gettime()?

> I would propose the following:
> - runtime clock selection with this priority order:
>   1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
>   2. getrusage(RUSAGE_SELF) - rusage.ru_utime (user time)
>   3. gettimeofday()
>   4. clock()
>   Naturally with build time