On Wednesday, 16 September 2015 at 14:11:04 UTC, Ola Fosheim
Grøstad wrote:
> On Wednesday, 16 September 2015 at 08:38:25 UTC, deadalnix
> wrote:
>> The energy comparison is bullshit. As long as you haven't
>> loaded the data, you don't know how wide the values are, which
>> means you either have to be pessimistic and load for the
>> worst-case scenario, or make two round trips to memory.
>
> That really depends on the memory layout and the algorithm. A
> likely implementation would be a co-processor that takes a unum
> stream and pipes it through a network of cores (a tile-based
> co-processor). The internal buses between cores are very, very
> fast, and with 256+ cores you get tremendous throughput. But you
> need a good compiler, libraries and software support.

No, you don't, because the streamer still needs to load the unums
one by one, or maybe two by two with a fair amount of hardware
speculation (which means you are already trading energy for
performance, so the energy argument is weak). There is no way you
can feed 256+ cores that way.
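To make the one-by-one loading point concrete, here is a minimal sketch in Python. It assumes a hypothetical, simplified variable-width encoding (a 4-bit header holding the payload width, followed by that many payload bits); this is NOT the real unum bit layout, and `HEADER_BITS`, `encode` and `decode` are illustrative names only. What it shows is that the start of element i+1 is unknown until element i's header has been read.

```python
# Hypothetical layout (NOT the real unum bit layout): each element is a
# 4-bit header holding the payload width w (0..15), followed by w payload bits.
HEADER_BITS = 4


def encode(values_and_widths):
    """Pack (value, width) pairs into one string of '0'/'1' characters."""
    out = []
    for value, width in values_and_widths:
        out.append(format(width, "04b"))          # header: payload width
        if width:
            out.append(format(value, f"0{width}b"))  # payload bits
    return "".join(out)


def decode(bits):
    """Decode sequentially and return (values, start offsets).

    The start of element i+1 is unknown until element i's header has been
    read, so this loop cannot be split across parallel workers without
    speculating on where elements begin.
    """
    values, offsets = [], []
    pos = 0
    while pos + HEADER_BITS <= len(bits):
        offsets.append(pos)
        width = int(bits[pos:pos + HEADER_BITS], 2)  # first access: the header
        pos += HEADER_BITS
        payload = bits[pos:pos + width]              # second access depends on width
        values.append(int(payload, 2) if width else 0)
        pos += width
    return values, offsets
```

For example, `decode(encode([(5, 3), (1, 1), (200, 8)]))` returns `([5, 1, 200], [0, 7, 12])`: each offset only becomes known after the previous header has been read.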

To give you a similar example, x86 decoding is often the
bottleneck on an x86 CPU. The number of ALUs in x86 decreased
rather than increased over the past decade, because you simply
can't decode fast enough to feed them. And that is despite x86
CPUs doing 64-way speculative decoding as a first stage.
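The speculative-decoding trade-off can be sketched the same way. This assumes a hypothetical, simplified variable-width encoding (a 4-bit header holding the payload width, then that many payload bits), which is neither the real unum nor the real x86 encoding, and `speculative_offsets` is an illustrative name. Stage one decodes a header at every bit position. That work is independent and could run in parallel, but most of it is thrown away. Stage two then walks the chain of real boundaries from offset 0.

```python
# Hypothetical layout (NOT the real unum or x86 encoding): each element is a
# 4-bit header holding its payload width w, followed by w payload bits.
HEADER_BITS = 4


def speculative_offsets(bits):
    """Return (true element offsets, number of speculative header decodes)."""
    # Stage 1: assume an element could start at *every* bit position and
    # compute where the following element would then begin. Each iteration
    # is independent, so hardware could do them all at once, but most of
    # the results are discarded.
    next_start = {}
    for p in range(len(bits) - HEADER_BITS + 1):
        width = int(bits[p:p + HEADER_BITS], 2)
        next_start[p] = p + HEADER_BITS + width
    # Stage 2: cheap sequential walk from offset 0 that keeps only the
    # candidates lying on the real chain of element boundaries.
    offsets, p = [], 0
    while p in next_start:
        offsets.append(p)
        p = next_start[p]
    return offsets, len(next_start)


# Three elements with widths 3, 1 and 8 (values 5, 1 and 200).
stream = "0011" "101" "0001" "1" "1000" "11001000"
offsets, work = speculative_offsets(stream)
print(offsets, work)  # [0, 7, 12] 21
```

Here 21 speculative header decodes are spent to find 3 useful boundaries. That ratio is the energy being traded for throughput.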

>> The hardware is likely to be slower, as you'll need way more
>> wiring than for regular floats, and wire is not only a cost but
>> also a delay.
>
> You need more transistors per ALU, but being slower does not
> matter if the algorithm needs bounded accuracy or converges more
> quickly with unums. The key challenge for him (Gustafson) is to
> create a market, meaning getting the semantics into scientific
> software and getting initial workable implementations out to
> scientists. If there is market demand, there will be products,
> but you need to create the market first. Hence he wrote an
> easy-to-read book on the topic and supports people who want to
> implement it.

The problem is not transistors, it is wire. Because the damn thing
is variable in every way, pretty much every input bit can end up
anywhere in the functional unit. That is a LOT of wire.