Jim Lux wrote:
At 07:03 AM 2/13/2007, Richard Walsh wrote:
Yes, but how much does it really abandon von Neumann? It is just a lot
of little von Neumann machines unless the mesh is fully programmable
and the DRAM stacks can source data for any operation on any CPU as
the application's data flows through the application kernel(s), however it is laid out across the chip. And in that case it is a multi-core ASIC emulating an FPGA ... why not just use an FPGA ... ;-) ... and avoid wasting all those hard-wired functional units that won't be needed for this or that particular kernel.
In fact, modern high-density FPGAs (viz. the Xilinx Virtex II 6000 series) have partitioned their innards into little cells, some with ALU and combinatorial logic and a little memory, some with lots of memory and not so much logic.
   Hey Jim,

Yes, I do understand this, although attention for double-precision ops on FPGAs is focused on the Xilinx Virtex-5 at 65 nm. You can already get a PCIe card version, I think. My comments about the new 80-core/ASIC Intel chip were to suggest two things. The first was that having the ability to program your own core (a la VHDL, Verilog, Mitrion-C, Handel-C, etc.) that is specific to your kernel is more circuit-efficient in theory, so if you are going to have multiple cores, consider having them be programmable. It's like the plumber who brings only, and all, the tools he needs into the house to do the job at hand.

The second point I was trying to make was that all cyclic re-referencing of the same store (local or remote) is a reflection of the von Neumann model (even to the stacked DRAM in the new Intel chip). When the processor cannot "swallow the kernel whole" it has to consume it in von Neumann-like bites, which imply register, cache, and memory writes. Part of the programmable-core approach is making the connections between upstream and downstream hardware in a data-flow fashion that replaces some number of cyclic stores with in-line passes to the next collection of functional units required by the application's specific kernel.
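
To make that concrete, here is a rough software analogy (plain C, not FPGA code; the three stage functions are made up just for illustration). The first version consumes the kernel in von Neumann-like bites, bouncing every intermediate result off the same store; the second fuses the stages so each value passes in-line to the next one, which is roughly what a data-flow layout does with wires instead of writes.

#include <stddef.h>

static double stage_a(double x) { return x * 2.0; }
static double stage_b(double x) { return x + 1.0; }
static double stage_c(double x) { return x * x; }

/* von Neumann style: every stage re-references the same store */
void kernel_cyclic(const double *in, double *tmp, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) tmp[i] = stage_a(in[i]);   /* write intermediate back */
    for (size_t i = 0; i < n; i++) tmp[i] = stage_b(tmp[i]);  /* read it again, write again */
    for (size_t i = 0; i < n; i++) out[i] = stage_c(tmp[i]);  /* read it a third time */
}

/* data-flow style: each value passes in-line to the next "functional unit" */
void kernel_fused(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_c(stage_b(stage_a(in[i])));
}

On an FPGA the tmp store in the first version would not exist at all; it is just the wiring between stages.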

In this way, the "diameter" of the re-reference cycle is enlarged and the latency penalty is therefore reduced. So while the ASIC cores in the new Intel chip are not programmable in the FPGA sense, there is the hope/expectation that the interconnect on the chip will give the data-flow benefits described. These are the features of the multi-core TRIPS and Raw processors that allow them to emulate ILP-, TLP-, and DLP-oriented architectures and applications. The extent to which FPGAs are more flexible in this regard gives them an advantage over less "wire-exposed" multi-core ASIC architectures.

There are obvious drawbacks to FPGAs ... they are not commodity enough; programmability is poor and foreign; and the improvements (Mitrion-C) generally consume 2x the circuits and run at 1/2 the clock the FPGA in use is capable of. Joe Landman pointed out the large chunk of the device that the interface architecture can consume, and for HPC-size data sets you still need to stream data in and out to external memory (algorithms must be pipelined; see the sketch after the paper list below). Still, it seems like over the long haul some of the FPGA advantages mentioned will creep into the HPC space -- either on the chip or via accelerators. Underwood at Sandia has a nice paper showing that peak flop performance on FPGAs exceeded that of commodity CPUs in the summer of 2004 (the same time Intel dropped the race to the 4.0 GHz clock) ... although the data needs to be updated with the Virtex-5 and the new multi-core processors. Here are some papers (which I think you can Google) that I have found useful/interesting.

1. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. Taylor et al.

2. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture.

3. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. Keith Underwood.

4. Architectures and APIs: Assessing Requirements for Delivering FPGA Performance to Applications. Underwood and Hemmert.

5. 64-bit Floating-Point FPGA Matrix Multiplication. Yong Dou et al.

6. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. Ling Zhuo and Viktor Prasanna.

7. Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware.
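
And to be clear about the streaming point above, here is a minimal host-side sketch (plain C; fpga_run_block, BLOCK, and the "pretend kernel" are all made up for illustration, standing in for whatever vendor call actually moves the data and fires the pipeline):

#include <stddef.h>

#define BLOCK 4096   /* elements per transfer; illustrative only */

static void fpga_run_block(const double *in, double *out, size_t n)
{
    /* Placeholder for the real offload: in hardware this would DMA the
       block in, run the pipelined kernel on it, and DMA the result back. */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];   /* pretend kernel */
}

void stream_through_fpga(const double *in, double *out, size_t total)
{
    /* HPC-size data sets don't fit in on-chip block RAM, so the host
       streams fixed-size blocks through the device and back out. */
    for (size_t off = 0; off < total; off += BLOCK) {
        size_t n = (total - off < BLOCK) ? (total - off) : BLOCK;
        fpga_run_block(in + off, out + off, n);
    }
}

Double buffering, so the block transfers overlap the compute, is left out to keep the sketch short.
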
I think that as a general rule, the special purpose cores (ASICs) are going to be smaller, lower power, and faster (for a given technology) than the programmable cores (FPGAs).
Here you are arguing for an ASIC for each typical HPC kernel ... a la the GRAPE processor. I will buy that ... but a commodity multi-core CPU is not HPC-special-purpose or low power compared to an FPGA.
Back in the late 90s, I was doing tradeoffs between general purpose CPUs (PowerPCs), DSPs (ADSP21020), and FPGAs for some signal processing applications. At that time, the DSP could do the FFTs, etc., for the least joules and least time. Since then, however, the FPGAs have pulled ahead, at least for spaceflight applications. But that's not because of architectural superiority in a given process ... it's that the FPGAs are benefiting from improvements in process (higher density) and nobody is designing space-qualified DSPs using those processes (so they are stuck with the old processes).
Better process is good, but I think I hear you arguing for HPC-specific ASICs again, like the GRAPE ... if they can be made cheaply, then you are right ... take the bit stream from the FPGA CFD code I have written and tuned, and produce 1000 ASICs for my special-purpose CFD-only cluster. I can run it at higher clock rates, but I may need a new chip every time I change my code.
Heck, the latest SPARC V8 core from ESA (LEON 3) is often implemented in an FPGA, although there are a couple of space qualified ASIC implementations (from Atmel and Aeroflex).

In a high volume consumer application, where cost is everything, the ASIC is always going to win over the FPGA. For more specialized scientific computing, the trade is a bit more even ... But even so, the beowulf concept of combining large numbers of commodity computers leverages the consumer volume for the specialized application, giving up some theoretical performance in exchange for dollars.
Right, otherwise we would all be using our own version of GRAPE, but we are all looking for the "New, New Thing" ... a new price-performance regime to take us up to the next level. Is it going to be FPGAs, GPGPUs, commodity multi-core, PIM, or novel 80-processor Intel chips? I think we are in for a period of extended HPC market fragmentation, but in any case I think two features of FPGA processing, the programmable core and the data-flow programming model, have intrinsic/theoretical appeal. These forces may be completely overwhelmed by other forces in the marketplace, of course ...

    Regards,

    rbw


--

Richard B. Walsh

"The world is given to me only once, not one existing and one
perceived. The subject and object are but one."

Erwin Schroedinger

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
[EMAIL PROTECTED]  |  612.337.3467
