Jim Lux wrote:
At 07:03 AM 2/13/2007, Richard Walsh wrote:
Yes, but how much does it really abandon von Neumann? It is just a lot
of little von Neumann machines unless the mesh is fully programmable
and the DRAM stacks can source data for any operation on any CPU as
the application's data flows through the application kernel(s), however it is laid out across the chip. And in that case it is a multi-core ASIC emulating an FPGA ... why not just use an FPGA ... ;-) ... and avoid wasting all those hard-wired functional units that won't be needed for this or that particular kernel.
In fact, modern high-density FPGAs (viz. the Xilinx Virtex II 6000 series) have partitioned their innards into little cells, some with ALU and combinatorial logic and a little memory, some with lots of memory and not so much logic.
   Hey Jim,

Yes, I do understand this, although attention for double-precision ops on FPGAs is focused on the Xilinx Virtex-5 at 65 nm. You can already get a PCIe card version, I think. My comments about the new 80-core/ASIC Intel chip were to suggest two things. The first was that having the ability to program your own core (a la VHDL, Verilog, Mitrion-C, Handel-C, etc.) that is specific to your kernel is more circuit-efficient in theory, so if you are going to have multiple cores, consider having them be programmable. It's like the plumber who brings only, and all, the tools he needs into the house to do the job at hand.

The second point I was trying to make was that all cyclic re-referencing of the same store (local or remote) is a reflection of the von Neumann model (even to the stacked DRAM in the new Intel chip). When the processor cannot "swallow the kernel whole" it has to consume it in von Neumann-like bites, which imply register, cache, and memory writes. Part of the programmable-core approach is making the connections between upstream and downstream hardware in a data-flow fashion that replaces some number of cyclic stores with in-line passes to the next collection of functional units required by the application's specific kernel.
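
To make that concrete, here is a rough software analogy (plain C, not FPGA code; the three stage functions are made up just for illustration). The first version consumes the kernel in von Neumann-like bites, bouncing every intermediate result off the same store; the second fuses the stages so each value passes in-line to the next one, which is roughly what a data-flow layout does with wires instead of writes.

#include <stddef.h>

static double stage_a(double x) { return x * 2.0; }
static double stage_b(double x) { return x + 1.0; }
static double stage_c(double x) { return x * x; }

/* von Neumann style: every stage re-references the same store */
void kernel_cyclic(const double *in, double *tmp, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) tmp[i] = stage_a(in[i]);   /* write intermediate back */
    for (size_t i = 0; i < n; i++) tmp[i] = stage_b(tmp[i]);  /* read it again, write again */
    for (size_t i = 0; i < n; i++) out[i] = stage_c(tmp[i]);  /* read it a third time */
}

/* data-flow style: each value passes in-line to the next "functional unit" */
void kernel_fused(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_c(stage_b(stage_a(in[i])));
}

On an FPGA the tmp store in the first version would not exist at all; it is just the wiring between stages.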

In this way, the "diameter" of the re-reference cycle is enlarged and the latency penalty is therefore reduced. So while the ASIC cores in the new Intel chip are not programmable in the FPGA sense, there is the hope/expectation that the interconnect on the chip will give the data-flow benefits described. These are the features of the multi-core TRIPS and Raw processors that allow them to emulate ILP-, TLP-, and DLP-oriented architectures and applications. The extent to which FPGAs are more flexible in this regard gives them an advantage over less "wire-exposed" multi-core ASIC architectures.

There are obvious drawbacks to FPGAs ... they are not commodity enough; programmability is poor and foreign; and the improvements (Mitrion-C) generally consume 2x the circuits and run at 1/2 the clock the FPGA in use is capable of. Joe Landman pointed out the large chunk of the device that the interface architecture can consume, and for HPC-size data sets you still need to stream data in and out to external memory (algorithms must be pipelined; see the sketch after the paper list below). Still, it seems like over the long haul some of the FPGA advantages mentioned will creep into the HPC space -- either on the chip or via accelerators. Underwood at Sandia has a nice paper showing that peak flop performance on FPGAs exceeded that of commodity CPUs in the summer of 2004 (the same time Intel dropped the race to the 4.0 GHz clock) ... although the data needs to be updated with the Virtex-5 and the new multi-core processors. Here are some papers (which I think you can Google) that I have found useful/interesting.

1. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. Taylor et al.

2. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture.

3. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. Keith Underwood.

4. Architectures and APIs: Assessing Requirements for Delivering FPGA Performance to Applications. Underwood and Hemmert.

5. 64-bit Floating-Point FPGA Matrix Multiplication. Yong Dou et al.

6. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. Ling Zhuo and Viktor Prasanna.

7. Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware.
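
And to be clear about the streaming point above, here is a minimal host-side sketch (plain C; fpga_run_block, BLOCK, and the "pretend kernel" are all made up for illustration, standing in for whatever vendor call actually moves the data and fires the pipeline):

#include <stddef.h>

#define BLOCK 4096   /* elements per transfer; illustrative only */

static void fpga_run_block(const double *in, double *out, size_t n)
{
    /* Placeholder for the real offload: in hardware this would DMA the
       block in, run the pipelined kernel on it, and DMA the result back. */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];   /* pretend kernel */
}

void stream_through_fpga(const double *in, double *out, size_t total)
{
    /* HPC-size data sets don't fit in on-chip block RAM, so the host
       streams fixed-size blocks through the device and back out. */
    for (size_t off = 0; off < total; off += BLOCK) {
        size_t n = (total - off < BLOCK) ? (total - off) : BLOCK;
        fpga_run_block(in + off, out + off, n);
    }
}

Double buffering, so the block transfers overlap the compute, is left out to keep the sketch short.
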
I think that as a general rule, the special purpose cores (ASICs) are going to be smaller, lower power, and faster (for a given technology) than the programmable cores (FPGAs).
Here you are arguing for an ASIC for each typical HPC kernel ... a la the GRAPE processor. I will buy that ... but a commodity multi-core CPU is not HPC-special-purpose or low power compared to an FPGA.
Back in the late 90s, I was doing tradeoffs between general purpose CPUs (PowerPCs), DSPs (ADSP21020), and FPGAs for some signal processing applications. At that time, the DSP could do the FFTs, etc., for the least joules and least time. Since then, however, the FPGAs have pulled ahead, at least for spaceflight applications. But that's not because of architectural superiority in a given process ... it's that the FPGAs are benefiting from improvements in process (higher density) and nobody is designing space-qualified DSPs using those processes (so they are stuck with the old processes).
Better process is good, but I think I hear you arguing for HPC-specific ASICs again, like the GRAPE ... if they can be made cheaply, then you are right ... take the bit stream from the FPGA CFD code I have written and tuned, and produce 1000 ASICs for my special-purpose CFD-only cluster. I can run it at higher clock rates, but I may need a new chip every time I change my code.
Heck, the latest SPARC V8 core from ESA (LEON 3) is often implemented in an FPGA, although there are a couple of space qualified ASIC implementations (from Atmel and Aeroflex).

In a high volume consumer application, where cost is everything, the ASIC is always going to win over the FPGA. For more specialized scientific computing, the trade is a bit more even ... But even so, the beowulf concept of combining large numbers of commodity computers leverages the consumer volume for the specialized application, giving up some theoretical performance in exchange for dollars.
Right, otherwise we would all be using our own version of GRAPE, but we are all looking for the "New, New Thing" ... a new price-performance regime to take us up to the next level. Is it going to be FPGAs, GPGPUs, commodity multi-core, PIM, or novel 80-processor Intel chips? I think we are in for a period of extended HPC market fragmentation, but in any case I think two features of FPGA processing, the programmable core and the data-flow programming model, have intrinsic/theoretical appeal. These forces may be completely overwhelmed by other forces in the marketplace, of course ...

    Regards,

    rbw


--

Richard B. Walsh

"The world is given to me only once, not one existing and one
perceived. The subject and object are but one."

Erwin Schroedinger

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
[EMAIL PROTECTED]  |  612.337.3467
