On Wednesday, 17 December 2014 at 09:11:22 UTC, Don wrote:
So am I. The halffloat is much faster than any other
implementation I've seen. The fast path for the conversion
functions involves only a few machine instructions.
I had an extra speedup for it that made it optimal, but it
requires a language primitive to dump excess hidden precision.
We still need this; it is a fundamental operation (C tries to
do it implicitly using "sequence points", but they don't
actually work properly).
The intrinsics _mm_cvtph_ps and _mm_cvtps_ph each convert 4
values at a time (halffloats to floats and floats to halffloats,
respectively), with a latency of 4 clock cycles and a
throughput of 1 per cycle on Haswell.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
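
For illustration, here is a rough C sketch of how _mm_cvtph_ps could be used to convert a buffer of halffloats four at a time. It is not the halffloat implementation discussed above. The function names half_to_float_scalar and halfs_to_floats are made up for this example, and the SIMD path is only compiled when the compiler defines __F16C__ (e.g. with -mf16c on GCC/Clang); otherwise everything goes through the portable scalar fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__F16C__)
#include <immintrin.h>
#endif

/* Hypothetical helper: decode one IEEE 754 binary16 value in software. */
static float half_to_float_scalar(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                          /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit bit appears. */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);  /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }

    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: convert n halffloats to floats. */
static void halfs_to_floats(const uint16_t *src, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__F16C__)
    /* Hardware path: 4 halffloats per _mm_cvtph_ps. */
    for (; i + 4 <= n; i += 4) {
        __m128i h = _mm_loadl_epi64((const __m128i *)(src + i));
        _mm_storeu_ps(dst + i, _mm_cvtph_ps(h));
    }
#endif
    /* Scalar path: the tail, or everything when F16C is unavailable. */
    for (; i < n; i++)
        dst[i] = half_to_float_scalar(src[i]);
}
```

The halffloat-to-float direction is exact, so both paths should agree on every finite input. The opposite direction (_mm_cvtps_ph) takes an extra rounding-mode argument and is left out here.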