I have recently written some "numerical code", some floating point scientific 
code I don't usually need. I have seen that this kind of code often contains 
kernels, that is small chunks of code inside the inner loops, that good 
compilers translate in even just 8-15 asm instructions.

There are situations where even just two or three asm instructions added or 
removed from such kernels can determine the difference from a slow code and an 
acceptably fast program. In other programs the kernel is a little longer but 
good register spills become essential (FFT). So the programmer, the language, 
and the optimization stages of the compiler must do all they can to pull away 
as much as possible from such inner loops. Even a moderate form of 
supercompilation can be appealing. Then, what's inside the other parts of the 
(smallish) program is often not important. Some times it can even be written in 
Python with only a moderate difference in the total running time. (This is why 
they have created a Python package named CorePy, that is a kind of dynamic 
assembler that gives a good help in the creation of computational kernels 
written in good modern asm).

[Such kernels are often inside other outer loops. Usually such code is at least 
O(n ln n) and often O(n^2). Today I see more and more algorithms almost O(n), 
used online and in real time on huge stream of real time data, used for example 
often by Google. Such algorithms can have a kernel, but on them plain I/O 
effects play an important role.]

Beside the contents of such inner loops, there are two more things the 
programmer & compiler have to take into account (and they will need to take 
them even more in account in the close future), that is the effects of the 
secondary CPU caches and the localization of data in the memory of the single 
CPUs (Both such problems are taken in account quite well by the Chapel 
language. Their designers have predicted well the future hardware).

1) The cache problems are not hard to explain: the kernels of the inner loops 
must use CPU registers very well (seeing them like yet another tiny level of 
cache) and the L1 CPU cache. (One way to do this is to use some form of memory 
tiling, that is the algorithm designer has to find a way to process a limited 
chunk of the input data, and then step to the next chunk. Often this designing 
needs to be done even when there are cache-oblivious algorithms). But around 
them there are one or two or more outer loops, so if the programmer wants an 
efficient program then such outer loops must take into account the L2 and L3 
caches too. So the outer loops too need some form of "higher level" tiling. In 
practice often numerical programs need to be organized in a pyramidal way 
(recently invented image processing algorithms are often designed this way). 
Creating such code that has two or three levels of tiling is possible, but it 
requires time and it's error-prone (this is why cache-obliv!
 ious algorithms are more appreciated, but they often aren't enough or are not 
available). So here you want some help from the programming language, to help 
in the partially automatic definition of such hierarchically tiled code. Chapel 
language helps in this.

2) I have read that when CPUs have more than about 16 cores (and AMD has 
announced a 12 core CPU), the current structure of a single shared RAM doesn't 
seem to work well any more. What's needed seems to be Core-local memory, plus 
communications across cores (and no shared RAM). In this situation the code 
must be quite local. The thread-local default of D is a step in the right 
direction, but more abstractions can become necessary. Chapel already gives 
means to write this kind of code with its Distributions:
http://chapel.cray.com/tutorials/DC2010/DC08-DISTRIBUTIONS.pdf


I have found an update of the Chapel tutorial:
http://chapel.cray.com/tutorials/DC2010/DC-tutorial.tar.gz

(In a future post I'd like to describe better the Chapel Domains, because they 
seem an abstraction quite easy to understand and useful for more than just 
numerical code.)

Bye,
bearophile

Reply via email to