On Wednesday, 17 December 2014 at 09:11:22 UTC, Don wrote:
So am I. The halffloat is much faster than any other implementation I've seen. The fast path for the conversion functions involves only a few machine instructions.

I had an extra speedup for it that made it optimal, but it requires a language primitive to dump excess hidden precision. We still need this; it is a fundamental operation (C tries to do it implicitly using "sequence points", but they don't actually work properly).

The intrinsics _mm_cvtph_ps and _mm_cvtps_ph convert 4 halffloats to floats and 4 floats to halffloats, respectively, with a latency of 4 clock cycles and a throughput of 1 per cycle on Haswell.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/
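
For reference, here is a rough scalar sketch in C of what the half-to-float direction computes for a single value. It is not the halffloat code discussed above and not how the hardware does it; the names half_to_float and bits_to_float are made up for illustration, and it assumes IEEE 754 binary16 input and a 32-bit IEEE float. Unlike the hardware instruction, it does not quiet signaling NaNs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical name: convert one IEEE 754 binary16 bit pattern to float. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* sign bit moved to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit biased exponent */
    uint32_t man  = h & 0x3FFu;                    /* 10-bit mantissa */

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, mantissa payload kept as-is. */
        return bits_to_float(sign | 0x7F800000u | (man << 13));
    }
    if (exp != 0) {
        /* Normal number, the fast path: rebias the exponent
           (127 - 15 = 112) and widen the mantissa from 10 to 23 bits. */
        return bits_to_float(sign | ((exp + 112u) << 23) | (man << 13));
    }
    if (man == 0) {
        /* Signed zero. */
        return bits_to_float(sign);
    }
    /* Subnormal half: the value is man * 2^-24, exactly representable
       as a normal float, so plain arithmetic is enough here. */
    float v = (float)man * (1.0f / 16777216.0f);
    return sign ? -v : v;
}
```

The normal-number branch is only a mask, an add and two shifts, which gives a feel for why a fast path can be a handful of instructions. For example, half_to_float(0x3C00) gives 1.0f and half_to_float(0xC000) gives -2.0f.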
