Re: Implement the "unum" representation in D ?

deadalnix via Digitalmars-d Wed, 16 Sep 2015 13:11:06 -0700

On Wednesday, 16 September 2015 at 19:40:49 UTC, Ola Fosheim Grøstad wrote:

You can load continuously 64 bytes in a stream, decode to your internal format and push them into the scratchpad of other cores. You could even do this in hardware.

1/ If you load the worst case scenario, then your power advantage is gone.
2/ If you load these one by one, how do you expect to feed 256+ cores?

Obviously you can make this in hardware. And obviously this is not going to be able to feed 256+ cores. Even with a chip at low frequency, let's say 800MHz or so, you have about 80 cycles to access memory. That means you need to have 20 000+ cycles of work to do per core per unum.

That's a simple back-of-the-envelope calculation. Your proposal is simply ludicrous. It's a complete non-starter.
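The arithmetic behind that figure can be sketched as follows. This assumes a single serial decoder that hands out one unum per memory access, so each core is served once every `cores` accesses; the function name and that model are illustrative assumptions, not something stated in the thread.

```python
def work_cycles_needed(memory_access_cycles, cores):
    """Cycles of work each core needs per unum to stay busy.

    Assumed model: one decoder feeds the cores one unum at a time, and
    each delivery costs one memory access. A core therefore waits
    `cores` accesses between two consecutive unums.
    """
    return memory_access_cycles * cores


# Numbers from the post: about 80 cycles per memory access, 256 cores.
print(work_cycles_needed(80, 256))
```

With the post's numbers this gives 80 * 256 = 20480, which is where the "20 000+" comes from.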

You can make this in hardware. Sure you can, no problem. But you won't because it is a stupid idea.

To give you a similar example, x86 decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 over the past decade decreased rather than increased, because you simply can't decode fast enough to feed them. Yet, x86 CPUs have a 64-way speculative decoding as a first stage.

That's because we use a dumb compiler that does not prefetch intelligently.

You know, when you have no idea what you are talking about, you can just move on to something you understand.

Prefetching would not change anything here. The problem comes from variable size encoding, and the challenge it causes for hardware. You can have 100% L1 hit and still have the same problem.


No sufficiently smart compiler can fix that.

If you are writing for a tile based VLIW CPU you preload. These calculations are highly iterative so I'd rather think of it as a co-processor solving a single equation repeatedly than running the whole program. You can run the larger program on a regular CPU or a few cores.

That's irrelevant. The problem is not the kind of CPU, it is how do you feed it at a fast enough rate.

The problem is not transistors, it is wire. Because the damn thing is variadic in every way, pretty much every bit as input can end up anywhere in the functional unit. That is a LOT of wire.
I haven't seen a design, so I cannot comment. But keep in mind that the CPU does not have to work with the format, it can use a different format internally.
We'll probably see FPGA implementations that can be run on FPGA cards for PCs within a few years. I read somewhere that a group in Singapore was working on it.


That's hardware 101.

When you have a floating point unit, you get your 32 bits: 23 bits go into the mantissa FU and 8 into the exponent FU. For instance, if you multiply floats, you send the 2 exponents into an adder, you send the 2 mantissas into a 24-bit multiplier (you add a leading 1), and you xor the sign bits.

You get the carry from the adder, and emit a multiply, or you count the leading 0s of the 48-bit multiply result, shift by that amount and add the shift to the exponent.

If you get a carry in the exponent adder, you saturate and emit an infinity.
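As a rough software sketch of the data path just described, the function below multiplies two 32-bit floats given as raw bit patterns. It is a simplification under stated assumptions: normal inputs only (zero and subnormal inputs are flushed to zero, infinities and NaNs are not handled), the result is truncated rather than rounded, and underflow is flushed to zero. All function names are made up for illustration. The point to notice is that every field is extracted with a constant shift and mask, which is the software analogue of each bit having one fixed wire.

```python
import struct


def f32_bits(x):
    """Raw 32-bit pattern of a Python float rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def f32_mul(a, b):
    # Fixed slicing: each bit always lands in the same field.
    sign_a, exp_a, man_a = a >> 31, (a >> 23) & 0xFF, a & 0x7FFFFF
    sign_b, exp_b, man_b = b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

    sign = sign_a ^ sign_b                  # xor the sign bits
    if exp_a == 0 or exp_b == 0:            # simplification: zero/subnormal -> 0
        return sign << 31

    exp = exp_a + exp_b - 127               # exponent adder (remove one bias)
    sig_a = man_a | (1 << 23)               # add the implicit leading 1
    sig_b = man_b | (1 << 23)
    product = sig_a * sig_b                 # 24x24 -> up to 48 bits

    # Normalize: the top set bit is bit 47 or bit 46.
    if product >> 47:
        product >>= 24
        exp += 1
    else:
        product >>= 23

    if exp >= 255:                          # exponent overflow: emit infinity
        return (sign << 31) | (0xFF << 23)
    if exp <= 0:                            # simplification: underflow -> 0
        return sign << 31
    return (sign << 31) | (exp << 23) | (product & 0x7FFFFF)


def mul(x, y):
    """Convenience wrapper working on Python floats."""
    return bits_f32(f32_mul(f32_bits(x), f32_bits(y)))


print(mul(1.5, 2.5), mul(3.0, -0.5), mul(1e38, 1e38))
```

The example inputs were chosen so the products are exactly representable, which keeps the missing rounding step from affecting them.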

Each bit goes into a given functional unit. That means you need one wire from each input bit to the functional unit it goes to. Same for the result.

Now, if the format is variadic, you need to wire all bits to all functional units, because they can potentially end up there. That's a lot of wire, in fact the number of wires grows quadratically with that joke.
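To put rough numbers on the quadratic claim, here is a deliberately crude counting sketch. It assumes the worst case described above, a full crossbar in which every input bit can reach every input position of the functional units, and it ignores anything a real design might do to cut that down. Both function names are hypothetical.

```python
def fixed_format_wires(bits):
    """Fixed layout: each input bit has exactly one destination."""
    return bits


def variadic_format_wires(bits):
    """Assumed worst case: every input bit may land on any of the
    `bits` destination positions, so each needs a wire to all of them."""
    return bits * bits


for bits in (16, 32, 64):
    print(bits, fixed_format_wires(bits), variadic_format_wires(bits))
```

Under this model, doubling the width doubles the wiring for a fixed format and quadruples it for a variadic one.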

The author keeps repeating that wire became the expensive thing and he is right. Meaning a solution with quadratic wiring is not going to cut it.
