On Wednesday, 16 September 2015 at 08:17:59 UTC, Don wrote:
On Tuesday, 15 September 2015 at 11:13:59 UTC, Ola Fosheim Grøstad wrote:
On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf

That's a pretty convincing case. Who does it :)?

I'm not convinced. I think they are downplaying the hardware difficulties. Slide 34:

Disadvantages of the Unum Format
* Non-power-of-two alignment. Needs packing and unpacking, garbage collection.

I think that disadvantage is so enormous that it negates most of the advantages. Note that in the x86 world, unaligned memory loads of SSE values still take longer than aligned loads. And that's a trivial case!

The energy savings are achieved by using a primitive form of compression. Sure, you can reduce the memory bandwidth required by compressing the data. You could do that for *any* form of data, not just floating point. But I don't think anyone thinks that's worthwhile.


GPUs do it a lot, especially, but not exclusively, on mobile. Not to reduce misses (a miss is pretty much guaranteed: you run 32 threads at once in a shader core, and each of them needs at least 8 pixels for a bilinear texture fetch with mipmapping, which is the bare minimum. That means 256 memory accesses at once. One of those pixels WILL miss, and it is going to stall all 32 threads). It is not a latency issue, but a bandwidth and energy one.

But yeah, in the general case random access is preferable, and memory alignment and the fact that you don't need to do as much bookkeeping are very significant.
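To make the unpacking cost concrete, here is a rough D sketch (the packed layout is made up for illustration, it is not the actual unum encoding): with fixed-size floats a load is one indexed access, while a non-power-of-two field has to be shifted and masked out of a wider window, and it may straddle byte or cache-line boundaries.

import std.stdio;

// Fixed-size case: one aligned, indexed load; the address is just index * 4.
float loadFixed(const float[] data, size_t i)
{
    return data[i];
}

// Packed case (made-up layout): the field can start at any bit, so we read a
// wider window and shift/mask the bits out. Works for widths up to 56 bits
// in this sketch.
ulong loadPacked(const ubyte[] buf, size_t bitOffset, uint width)
{
    ulong acc = 0;
    size_t byteIdx = bitOffset / 8;
    uint shift = cast(uint)(bitOffset % 8);
    foreach (k; 0 .. (width + shift + 7) / 8)   // may span an extra byte
        acc |= cast(ulong)buf[byteIdx + k] << (8 * k);
    return (acc >> shift) & ((1UL << width) - 1);
}

void main()
{
    auto floats = [1.0f, 2.0f, 3.0f];
    writeln(loadFixed(floats, 2));            // address known up front

    ubyte[] packed = [0xAB, 0xCD];
    writeln(loadPacked(packed, 4, 12));       // 12-bit field starting at bit 4
}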

Also, a predictable size means you can split your dataset and process it in parallel, which is impossible when sizes are random.
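Rough illustration of that point (the record format below is invented): with fixed-size elements a worker's chunk boundaries are pure index arithmetic, while with variable-size records you have to walk the stream just to find out where record k starts, so you cannot hand out chunks up front.

import std.stdio;

// Fixed size: worker w out of n gets [w*len/n, (w+1)*len/n) without touching
// the data at all.
size_t[2] fixedChunk(size_t len, size_t w, size_t n)
{
    return [w * len / n, (w + 1) * len / n];
}

// Variable size (invented format: first byte of each record is its payload
// length): finding where record k starts requires a sequential scan.
size_t offsetOfRecord(const ubyte[] stream, size_t k)
{
    size_t off = 0;
    foreach (_; 0 .. k)
        off += 1 + stream[off];   // header byte + payload
    return off;
}

void main()
{
    writeln(fixedChunk(1000, 2, 4));       // [500, 750] -- no data read

    ubyte[] stream = [2, 10, 20,  1, 30,  3, 40, 50, 60];
    writeln(offsetOfRecord(stream, 2));    // 5 -- had to read headers 0 and 1
}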

The energy comparisons are plain dishonest. The power required for accessing from DRAM is the energy consumption of a *cache miss*!! What's the energy consumption of a load from cache? That would show you what the real gains are, and my guess is they are tiny.


The energy comparison is bullshit. As long as you haven't loaded the data, you don't know how wide it is, meaning you either need to go pessimistic and load for the worst-case scenario, or do two round trips to memory.
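Sketched in code (the tag-plus-payload layout here is invented, just to show the dependency): either you always fetch the worst-case width, or the second access can't even be issued until the first one has come back.

// Kept tiny so the example stays self-contained.
enum maxPayloadBytes = 4;

// Option 1: pessimistic -- always pull the worst case, wasting bandwidth
// whenever the value is actually narrow.
const(ubyte)[] loadPessimistic(const ubyte[] mem, size_t off)
{
    return mem[off .. off + 1 + maxPayloadBytes];
}

// Option 2: two dependent accesses -- the second one depends on the result
// of the first, which is the extra round trip.
const(ubyte)[] loadTwoTrips(const ubyte[] mem, size_t off)
{
    ubyte w = mem[off];                    // trip 1: read the width tag
    return mem[off + 1 .. off + 1 + w];    // trip 2: now we know how much
}

void main()
{
    import std.stdio : writeln;
    ubyte[] mem = [3, 7, 8, 9, 0];         // a 3-byte value at offset 0
    writeln(loadTwoTrips(mem, 0));         // [7, 8, 9]
    writeln(loadPessimistic(mem, 0));      // [3, 7, 8, 9, 0] -- over-fetched
}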

The author also leans a lot on the wire vs. transistor cost and how it has evolved. He is right about that, except that you won't cram more wires into the CPU at runtime. The CPU needs the wiring for the worst-case scenario, always.

The hardware is likely to be slower, as you'll need way more wiring than for regular floats, and wires are not only a cost, but also time.

That being said, even a hit in L1 is very energy hungry. Think about it: you need to do an 8-way fetch (so you'll end up loading 4k of data from the cache) in parallel with address translation (usually 16 ways), in parallel with snooping into the load and store buffers.

If the load is not aligned and crosses a cache-line boundary, you pretty much have to multiply all of this by two.

I'm not sure what his numbers represent, but hitting L1 is quite power hungry. He is right on that one.

So:
* I don't believe the energy savings are real.
* There is no guarantee that it would be possible to implement it in hardware without a speed penalty, regardless of how many transistors you throw at it (hardware analogue of Amdahl's Law).
* but the error bound stuff is cool.

Yup, that's pretty much what I get out of it as well.
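For what it's worth, the error-bound part can be played with entirely in software. This is not the unum machinery itself, just a toy interval type in D to show the appeal: every result carries the range it is guaranteed to lie in, instead of a silently rounded point value.

import std.stdio;

struct Interval
{
    double lo, hi;

    Interval opBinary(string op : "+")(Interval rhs) const
    {
        // A real implementation would round lo down and hi up; this sketch
        // ignores rounding direction for brevity.
        return Interval(lo + rhs.lo, hi + rhs.hi);
    }

    Interval opBinary(string op : "*")(Interval rhs) const
    {
        import std.algorithm : max, min;
        double a = lo * rhs.lo, b = lo * rhs.hi, c = hi * rhs.lo, d = hi * rhs.hi;
        return Interval(min(a, b, c, d), max(a, b, c, d));
    }
}

void main()
{
    auto x = Interval(0.1, 0.1000001);   // an input known only to 7 digits
    auto y = x * Interval(3.0, 3.0) + Interval(1.0, 1.0);
    writeln(y.lo, " .. ", y.hi);         // the bound travels with the value
}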
