On Wednesday, 16 September 2015 at 19:40:49 UTC, Ola Fosheim Grøstad wrote:
You can continuously load 64 bytes in a stream, decode them to your internal format, and push them into the scratchpads of other cores. You could even do this in hardware.


1/ If you load for the worst-case scenario, then your power advantage is gone. 2/ If you load these one by one, how do you expect to feed 256+ cores?

Obviously you can make this in hardware. And obviously this is not going to be able to feed 256+ cores. Even on a chip at a low frequency, say 800MHz or so, you have about 80 cycles per memory access. That means each core needs 20,000+ cycles of work to do per unum (256 cores sharing one stream, times 80 cycles per access) just to stay busy.
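To make the arithmetic explicit, here is a minimal sketch in C using those figures (800MHz chip, ~80 cycles per memory access, 256 cores); the numbers are the assumptions from this post, not measurements:

#include <stdio.h>

int main(void)
{
    const long cores       = 256; /* cores sharing one operand stream   */
    const long mem_latency = 80;  /* cycles per memory access at 800MHz */

    /* If operands are fetched one by one, each core receives a new
       unum only every cores * mem_latency cycles, so it needs at
       least that many cycles of work per unum to avoid stalling. */
    printf("cycles of work per core per unum: %ld\n",
           cores * mem_latency); /* 20480, i.e. 20,000+ */
    return 0;
}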

That's a simple back-of-the-envelope calculation. Your proposal is simply ludicrous. It's a complete non-starter.

You can make this in hardware. Sure you can, no problem. But you won't because it is a stupid idea.

To give you a similar example, instruction decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 designs has decreased rather than increased over the past decade, because you simply can't decode fast enough to feed them. Yet x86 CPUs have 64-way speculative decoding as a first stage.

That's because we use a dumb compiler that does not prefetch intelligently.

You know, when you have no idea what you are talking about, you can just move on to something you understand.

Prefetching would not change anything here. The problem comes from the variable-size encoding and the challenge it poses for hardware. You can have a 100% L1 hit rate and still have the same problem.

No sufficiently smart compiler can fix that.

If you are writing for a tile-based VLIW CPU, you preload. These calculations are highly iterative, so I'd rather think of it as a co-processor solving a single equation repeatedly than as running the whole program. You can run the larger program on a regular CPU or a few cores.


That's irrelevant. The problem is not the kind of CPU, it is how you feed it at a fast enough rate.

The problem is not transistors, it is wires. Because the damn thing is variable in every way, pretty much every input bit can end up anywhere in the functional unit. That is a LOT of wire.

I haven't seen a design, so I cannot comment. But keep in mind that the CPU does not have to work with the format, it can use a different format internally.

We'll probably see FPGA implementations that can be run on FPGA cards for PCs within a few years. I read somewhere that a group in Singapore was working on it.

That's hardware 101.

When you have a floating-point unit, you take your 32 bits: 23 bits go into the mantissa FU, 8 into the exponent FU, and 1 is the sign bit. For instance, if you multiply floats, you send the two exponents into an adder, you send the two mantissas into a 24-bit multiplier (you prepend the implicit leading 1), and you XOR the sign bits.

You take the carry out of the 48-bit multiply result and bump the exponent, or you count the leading zeros of the result, shift by that amount, and add the shift to the exponent.

If you get a carry out of the exponent adder, you saturate and emit an infinity.
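Here is a sketch in C of that multiply, done in software on the raw bits of two IEEE-754 floats so the fixed datapath is visible (the name fmul_sketch is mine; rounding, subnormals, zero and NaN are deliberately ignored):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fixed bit positions mean each field routes straight to its own
   functional unit: sign -> XOR, exponent -> adder, mantissa -> 24-bit
   multiplier. */
static float fmul_sketch(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);

    uint32_t sign = (ia ^ ib) & 0x80000000u;      /* XOR the sign bits  */
    int32_t  exp  = (int32_t)((ia >> 23) & 0xff)  /* add the exponents, */
                  + (int32_t)((ib >> 23) & 0xff)  /* re-biasing once    */
                  - 127;
    uint64_t ma = (ia & 0x7fffffu) | 0x800000u;   /* implicit leading 1 */
    uint64_t mb = (ib & 0x7fffffu) | 0x800000u;

    uint64_t prod = ma * mb;                      /* 24x24 -> 48 bits   */
    if (prod & (1ull << 47)) {                    /* carry out: shift   */
        prod >>= 24;                              /* and bump exponent  */
        exp  += 1;
    } else {
        prod >>= 23;
    }

    if (exp >= 255)                               /* exponent overflow: */
        return sign ? -INFINITY : INFINITY;       /* saturate to inf    */

    uint32_t bits = sign | ((uint32_t)exp << 23)
                  | ((uint32_t)prod & 0x7fffffu);
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}

int main(void)
{
    printf("%f\n", fmul_sketch(3.0f, 3.0f)); /* prints 9.000000 */
    return 0;
}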

Each bit goes into a given functional unit. That means you need one wire from the input to the functional unit it goes to. Same for the result.

Now, if the format is variable-width, you need to wire all bits to all functional units, because any of them can potentially end up anywhere. That's a lot of wire; in fact, the number of wires grows quadratically with the width.

The author keeps repeating that wires have become the expensive thing, and he is right. Which means a solution with quadratic wiring is not going to cut it.
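To put a rough number on the quadratic claim: with a fixed format, each of the n input bits has exactly one destination, so routing costs on the order of n wires; if any bit can land on any of the n bit positions of a functional unit, you need a full n-to-n crossbar, on the order of n^2 wires. A toy illustration, assuming full any-to-any routing:

#include <stdio.h>

int main(void)
{
    /* Fixed format: each bit has one destination -> n wires.
       Variable-width format: any bit may land anywhere -> a full
       n-to-n crossbar, i.e. n*n wires. Illustrative counts only. */
    for (int n = 8; n <= 64; n *= 2)
        printf("%2d bits: fixed ~ %2d wires, crossbar ~ %4d wires\n",
               n, n, n * n);
    return 0;
}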
