On Wednesday, 16 September 2015 at 19:40:49 UTC, Ola Fosheim
Grøstad wrote:
You can load continuously 64 bytes in a stream, decode to your
internal format and push them into the scratchpad of other
cores. You could even do this in hardware.
1/ If you load the worst case scenario, then your power advantage
is gone.
2/ If you load these one by one, how do you expect to feed 256+
cores?
Obviously you can make this in hardware. And obviously this is
not going to be able to feed 256+ cores. Even with a chip at low
frequency, let's say 800MHz or so, you have about 80 cycles to
access memory. That means you need 20 000+ cycles of work
(80 cycles x 256 cores) per core per unum.
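
To spell out the arithmetic, here is a rough sketch. It assumes a
single decoder that hands out one unum per memory access, one core
after the other; the function name is made up for illustration.

```python
# Back-of-the-envelope: one decoder feeding many cores, one unum at a time.
# Assumption: every unum costs one serialized memory access of ~80 cycles
# (at 800 MHz that is about 100 ns).
MEMORY_ACCESS_CYCLES = 80
CORES = 256


def work_cycles_needed(access_cycles: int, cores: int) -> int:
    """Cycles of work each core needs per unum so that it is still busy
    when the decoder gets back to it after serving every other core."""
    return access_cycles * cores


print(work_cycles_needed(MEMORY_ACCESS_CYCLES, CORES))  # 20480
```

Anything less than that per unum and the cores sit idle waiting for
the decoder.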
That's a simple back-of-the-envelope calculation. Your proposal
is simply ludicrous. It's a complete non-starter.
You can make this in hardware. Sure you can, no problem. But you
won't because it is a stupid idea.
To give you a similar example, x86 decoding is often the
bottleneck on an x86 CPU. The number of ALUs in x86 over the
past decade decreased rather than increased, because you
simply can't decode fast enough to feed them. Yet, x86 CPUs
have a 64-way speculative decoder as a first stage.
That's because we use a dumb compiler that does not prefetch
intelligently.
You know, when you have no idea what you are talking about, you
can just move on to something you understand.
Prefetching would not change anything here. The problem comes from
variable-size encoding, and the challenge it causes for hardware.
You can have 100% L1 hit and still have the same problem.
No sufficiently smart compiler can fix that.
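
A small sketch of why this is a decoding problem rather than a cache
problem. The length-prefixed byte format below is invented for
illustration and is not the unum encoding. With fixed-size items every
start offset is known up front; with variable-size items the start of
item k is only known once items 0..k-1 have been decoded.

```python
def fixed_offsets(count: int, item_size: int) -> list:
    # Each offset depends only on its index, never on the data,
    # so all of them can be computed independently (in parallel).
    return [i * item_size for i in range(count)]


def variable_offsets(stream: bytes) -> list:
    # Hypothetical format: the first byte of each item is its payload length.
    # Each offset depends on the previous item's length: a serial chain,
    # even if every byte is already sitting in L1.
    offsets = []
    pos = 0
    while pos < len(stream):
        offsets.append(pos)
        pos += 1 + stream[pos]  # skip the length byte plus the payload
    return offsets


print(fixed_offsets(4, 4))
print(variable_offsets(bytes([2, 0xAA, 0xBB, 0, 3, 1, 2, 3])))
```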
If you are writing for a tile based VLIW CPU you preload. These
calculations are highly iterative so I'd rather think of it as
a co-processor solving a single equation repeatedly than
running the whole program. You can run the larger program on a
regular CPU or a few cores.
That's irrelevant. The problem is not the kind of CPU, it is how
you feed it at a fast enough rate.
The problem is not transistors, it is wires. Because the damn
thing is variadic in every way, pretty much every input bit
can end up anywhere in the functional unit. That is a
LOT of wire.
I haven't seen a design, so I cannot comment. But keep in mind
that the CPU does not have to work with the format, it can use
a different format internally.
We'll probably see FPGA implementations that can be run on FPGA
cards for PCs within a few years. I read somewhere that a group
in Singapore was working on it.
That's hardware 101.
When you have a floating point unit, you take your 32 bits: 23
bits go into the mantissa FU and 8 into the exponent FU.
For instance, if you multiply floats, you send the 2 exponents
into an adder, you send the 2 mantissas into a 24-bit multiplier
(you add a leading 1), and you xor the sign bits.
You get the carry from the adder and emit a multiply, or you
count the leading 0s of the 48-bit multiply result, shift by that
amount and add the shift to the exponent.
If you get a carry in the exponent adder, you saturate and emit
an infinity.
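
The steps above can be sketched in software. This is a simplified
model, not a complete IEEE 754 multiplier: it assumes normal inputs
only (no zeros, subnormals or NaNs), truncates instead of rounding,
and does not model underflow. Because both mantissas carry a leading
1, the leading-zero count of the 48-bit product is either 0 or 1, so
the normalization is a single bit test. The helper names are made up.

```python
import struct


def f32_bits(x: float) -> int:
    """Raw 32-bit pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def f32_mul(a_bits: int, b_bits: int) -> int:
    # Fixed field positions: every input bit has exactly one destination.
    sign_a, sign_b = a_bits >> 31, b_bits >> 31
    exp_a, exp_b = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    man_a = (a_bits & 0x7FFFFF) | 0x800000  # add the leading 1 -> 24 bits
    man_b = (b_bits & 0x7FFFFF) | 0x800000

    sign = sign_a ^ sign_b        # xor the sign bits
    exp = exp_a + exp_b - 127     # exponent adder (drop one bias)
    prod = man_a * man_b          # 24x24 -> 48-bit product

    # Normalize: the product of two values in [1, 2) lies in [1, 4).
    if prod & (1 << 47):
        prod >>= 1
        exp += 1

    if exp >= 255:                # exponent overflow: saturate to infinity
        return (sign << 31) | (0xFF << 23)
    if exp <= 0:
        raise ValueError("underflow is not modelled in this sketch")

    man = (prod >> 23) & 0x7FFFFF  # drop the leading 1, truncate low bits
    return (sign << 31) | (exp << 23) | man


print(bits_f32(f32_mul(f32_bits(1.5), f32_bits(2.0))))
```

Every shift and mask constant here is fixed at design time, which is
what lets each bit be wired straight to its functional unit.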
Each bit goes into a given functional unit. That means you need
one wire from the input to the functional unit it goes to. Same
for the result.
Now, if the format is variadic, you need to wire all bits to all
functional units, because they can potentially end up there.
That's a lot of wire; in fact, the number of wires grows
quadratically with that joke of a format.
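
A crude way to see the growth, under a deliberately simple model that
is an assumption and not a real floorplan: count one wire per (input
bit, possible destination) pair. With fixed fields each input bit has
one destination; if any bit can land at any position, each of the n
input bits needs a path to each of the n positions.

```python
def wires_fixed(n_bits: int) -> int:
    # Fixed format: each input bit goes to exactly one place.
    return n_bits


def wires_variable(n_bits: int) -> int:
    # Variable format (toy model): any input bit may end up at any position.
    return n_bits * n_bits


for n in (32, 64, 128):
    print(n, wires_fixed(n), wires_variable(n))
```

Doubling the width doubles the fixed-format count but quadruples the
variable-format one.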
The author keeps repeating that wire has become the expensive
thing, and he is right. Meaning a solution with quadratic wiring
is not going to cut it.