On Wednesday, 16 September 2015 at 08:17:59 UTC, Don wrote:
On Tuesday, 15 September 2015 at 11:13:59 UTC, Ola Fosheim Grøstad wrote:
On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf

That's a pretty convincing case. Who does it :)?

I'm not convinced. I think they are downplaying the hardware difficulties. Slide 34:

Disadvantages of the Unum Format
* Non-power-of-two alignment. Needs packing and unpacking, garbage collection.

I think that disadvantage is so enormous that it negates most of the advantages. Note that in the x86 world, unaligned memory loads of SSE values still take longer than aligned loads. And that's a trivial case!

The energy savings are achieved by using a primitive form of compression. Sure, you can reduce the memory bandwidth required by compressing the data. You could do that for *any* form of data, not just floating point. But I don't think anyone thinks that's worthwhile.


GPUs do it a lot, especially, but not exclusively, on mobile. Not to reduce misses (a miss is pretty much guaranteed: you run 32 threads at once in a shader core, and each of them needs at least 8 pixels for a bilinear texture fetch with mipmapping, which is the bare minimum. That means 256 memory accesses at once. One of those pixels WILL miss, and it is going to stall all 32 threads). It is not a latency issue, but a bandwidth and energy one.

But yeah, in the general case random access is preferable, and memory alignment and the fact that you don't need to do as much bookkeeping are very significant.
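To make the unpacking cost concrete, here is a rough D sketch (the packed layout is made up for illustration, it is not the actual unum encoding): with fixed-size floats a load is one indexed access, while a non-power-of-two field has to be shifted and masked out of a wider window, and it may straddle byte or cache-line boundaries.

import std.stdio;

// Fixed-size case: one aligned, indexed load; the address is just index * 4.
float loadFixed(const float[] data, size_t i)
{
    return data[i];
}

// Packed case (made-up layout): the field can start at any bit, so we read a
// wider window and shift/mask the bits out. Works for widths up to 56 bits
// in this sketch.
ulong loadPacked(const ubyte[] buf, size_t bitOffset, uint width)
{
    ulong acc = 0;
    size_t byteIdx = bitOffset / 8;
    uint shift = cast(uint)(bitOffset % 8);
    foreach (k; 0 .. (width + shift + 7) / 8)   // may span an extra byte
        acc |= cast(ulong)buf[byteIdx + k] << (8 * k);
    return (acc >> shift) & ((1UL << width) - 1);
}

void main()
{
    auto floats = [1.0f, 2.0f, 3.0f];
    writeln(loadFixed(floats, 2));            // address known up front

    ubyte[] packed = [0xAB, 0xCD];
    writeln(loadPacked(packed, 4, 12));       // 12-bit field starting at bit 4
}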

Also, a predictable size means you can split your dataset and process it in parallel, which is impossible when sizes are random.
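Rough illustration of that point (the record format below is invented): with fixed-size elements a worker's chunk boundaries are pure index arithmetic, while with variable-size records you have to walk the stream just to find out where record k starts, so you cannot hand out chunks up front.

import std.stdio;

// Fixed size: worker w out of n gets [w*len/n, (w+1)*len/n) without touching
// the data at all.
size_t[2] fixedChunk(size_t len, size_t w, size_t n)
{
    return [w * len / n, (w + 1) * len / n];
}

// Variable size (invented format: first byte of each record is its payload
// length): finding where record k starts requires a sequential scan.
size_t offsetOfRecord(const ubyte[] stream, size_t k)
{
    size_t off = 0;
    foreach (_; 0 .. k)
        off += 1 + stream[off];   // header byte + payload
    return off;
}

void main()
{
    writeln(fixedChunk(1000, 2, 4));       // [500, 750] -- no data read

    ubyte[] stream = [2, 10, 20,  1, 30,  3, 40, 50, 60];
    writeln(offsetOfRecord(stream, 2));    // 5 -- had to read headers 0 and 1
}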

The energy comparisons are plain dishonest. The power required for accessing from DRAM is the energy consumption of a *cache miss*!! What's the energy consumption of a load from cache? That would show you what the real gains are, and my guess is they are tiny.


The energy comparison is bullshit. As long as you haven't loaded the data, you don't know how wide it is, meaning you either need to go pessimistic and load for the worst-case scenario, or do two round trips to memory.
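Sketched in code (the tag-plus-payload layout here is invented, just to show the dependency): either you always fetch the worst-case width, or the second access can't even be issued until the first one has come back.

// Kept tiny so the example stays self-contained.
enum maxPayloadBytes = 4;

// Option 1: pessimistic -- always pull the worst case, wasting bandwidth
// whenever the value is actually narrow.
const(ubyte)[] loadPessimistic(const ubyte[] mem, size_t off)
{
    return mem[off .. off + 1 + maxPayloadBytes];
}

// Option 2: two dependent accesses -- the second one depends on the result
// of the first, which is the extra round trip.
const(ubyte)[] loadTwoTrips(const ubyte[] mem, size_t off)
{
    ubyte w = mem[off];                    // trip 1: read the width tag
    return mem[off + 1 .. off + 1 + w];    // trip 2: now we know how much
}

void main()
{
    import std.stdio : writeln;
    ubyte[] mem = [3, 7, 8, 9, 0];         // a 3-byte value at offset 0
    writeln(loadTwoTrips(mem, 0));         // [7, 8, 9]
    writeln(loadPessimistic(mem, 0));      // [3, 7, 8, 9, 0] -- over-fetched
}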

The author also leans a lot on the wire vs. transistor cost and how it has evolved. He is right about that, except that you won't cram more wires into the CPU at runtime. The CPU needs the wiring for the worst-case scenario, always.

The hardware is likely to be slower, as you'll need way more wiring than for regular floats, and wires are not only a cost, but also time.

That being said, even a hit in L1 is very energy hungry. Think about it: you need to do an 8-way fetch (so you'll end up loading 4k of data from the cache) in parallel with address translation (usually 16 ways), in parallel with snooping into the load and store buffers.

If the load is not aligned and crosses a cache-line boundary, you pretty much have to multiply all of this by two.

I'm not sure what his numbers represent, but hitting L1 is quite power hungry. He is right on that one.

So:
* I don't believe the energy savings are real.
* There is no guarantee that it would be possible to implement it in hardware without a speed penalty, regardless of how many transistors you throw at it (hardware analogue of Amdahl's Law).
* but the error bound stuff is cool.

Yup, that's pretty much what I get out of it as well.
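For what it's worth, the error-bound part can be played with entirely in software. This is not the unum machinery itself, just a toy interval type in D to show the appeal: every result carries the range it is guaranteed to lie in, instead of a silently rounded point value.

import std.stdio;

struct Interval
{
    double lo, hi;

    Interval opBinary(string op : "+")(Interval rhs) const
    {
        // A real implementation would round lo down and hi up; this sketch
        // ignores rounding direction for brevity.
        return Interval(lo + rhs.lo, hi + rhs.hi);
    }

    Interval opBinary(string op : "*")(Interval rhs) const
    {
        import std.algorithm : max, min;
        double a = lo * rhs.lo, b = lo * rhs.hi, c = hi * rhs.lo, d = hi * rhs.hi;
        return Interval(min(a, b, c, d), max(a, b, c, d));
    }
}

void main()
{
    auto x = Interval(0.1, 0.1000001);   // an input known only to 7 digits
    auto y = x * Interval(3.0, 3.0) + Interval(1.0, 1.0);
    writeln(y.lo, " .. ", y.hi);         // the bound travels with the value
}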
