On Sep 29, 2008, at 4:10 PM, Prentice Bisbal wrote:

In the previous thread I instigated about running services on cluster
nodes, there was some mention of precisely synchronizing the system
clocks, and this issue is also mentioned in this paper:

"The Case of the Missing Supercomputer Performance: Achieving Optimal
Performance on the 8,192 Processors of ASCI Q" (Petrini, Kerbyson, and Pakin)
http://hpc.pnl.gov/people/fabrizio/papers/sc03_noise.pdf

I've also read a few other papers on the topic, and it seems you need
to sync the system clocks to ~1 us. On top of that, I imagine you also need
to sync the activities of each system so they all stop to do the same
system-level tasks at the same time.

The papers I read all mentioned different OSes, or at least specialized
hardware. Can this level of synchronization be achieved in Linux on
commodity hardware?  I imagine NTP doesn't have the resolution needed
for this, and Don Becker has some strong feelings against NTP.

The SiCortex systems I work on are not commodity, but they do run Linux. All the node chips in the machine are frequency locked to the same oscillator, so the core cycle counters (MIPS standard) advance at the same rate, but because the cores are released from reset at different times, they are not initially synchronized. We recently added a global clock synchronization step to booting the system by timestamping messages sent over an out-of-band channel of the interconnect. After some futzing around, we're able to synchronize all the cycle counters to within about 50 nanoseconds. The timer interrupts then happen at the same counter values system wide, which naturally synchronizes most of the daemons that wake up. I don't think we've gone to the trouble of gang scheduling them as well, which would also be a good idea.

We tried reducing the standard 1000 Hz timer interrupt rate to 100 Hz, but a bunch of stuff in the IP network stack reacted badly, slowing down IP communications. We haven't tracked it all down yet.

As one would expect from the papers you cite, the clock synchronization has had a very dramatic effect on large-scale collectives: a 5800-rank, 8-byte allreduce is now down to 36 microseconds, where it was something like 170 microseconds before the clock project.

Since clusters built from commodity servers run on independent oscillators, it is much harder to synchronize them. NTP will do a very good job of estimating the relative frequencies, but all those oscillators will drift independently with temperature and aging, so you have to run NTP continually.

However, the problem to solve here, synchronizing local clocks with each other, is different from the one NTP is intended to solve. You don't really care what the wall clock time is; you only care that all the systems have the same time.

I've seen some other papers on the subject of using LAN timestamps to provide much more accurate local synchronization. Here's one that cites 10 microsecond results:

Guo-Song Tian, Yu-Chu Tian, and C. Fidge,
"High-Precision Relative Clock Synchronization Using Time Stamp Counters,"
13th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2008),
March 31 to April 3, 2008, pp. 69-78.
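
To make the "relative synchronization from timestamps" idea concrete, here is a minimal sketch of the standard two-way timestamp exchange (the same arithmetic NTP uses for a single sample). It is not taken from the paper above, and the function names are made up for illustration. It assumes the forward and return paths have equal delay; asymmetric samples give a biased offset, which is why the sketch keeps the sample with the smallest round-trip delay.

```python
def estimate_offset(t1, t2, t3, t4):
    """Estimate (offset, round_trip_delay) from one request/reply exchange.

    t1: request sent      (local clock)
    t2: request received  (remote clock)
    t3: reply sent        (remote clock)
    t4: reply received    (local clock)

    Units are whatever the timestamps use (cycles, ns, ...), as long as
    both clocks tick at the same rate. offset is remote minus local.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def best_offset(samples):
    """Pick the estimate from the exchange with the smallest round-trip delay.

    Short round trips leave the least room for path asymmetry and queueing,
    so they are assumed to give the most trustworthy offset.
    """
    return min((estimate_offset(*s) for s in samples), key=lambda od: od[1])


if __name__ == "__main__":
    # Hypothetical numbers: remote clock is 100 units ahead of local.
    samples = [
        (0, 110, 115, 25),  # symmetric path, 10 units each way
        (0, 130, 131, 41),  # asymmetric path, 30 out and 10 back
    ]
    print(best_offset(samples))
```

With these made-up samples the symmetric exchange recovers the true offset of 100, while the asymmetric one is off by 10, which is the kind of error that hardware or driver-level timestamping is meant to shrink.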


Incidentally, a good way to measure the effects of OS noise locally is to write a program that reads the core cycle counter in a tight loop and keeps statistics on the intervals between successive samples. You can find out how often and for how long your OS is going out to lunch.
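
A rough sketch of that measurement loop, with some loud assumptions: it uses Python's time.perf_counter_ns() as a stand-in for the hardware cycle counter, the 50 microsecond "long gap" threshold is an arbitrary choice, and the interpreter adds noise of its own, so a C version reading the cycle counter directly would be much sharper. The function name and the returned fields are just for illustration.

```python
import time


def sample_gaps(n_samples=200_000, threshold_ns=50_000):
    """Read a monotonic clock in a tight loop and summarize the gaps.

    Gaps much longer than the typical loop time suggest the process was
    interrupted or descheduled (interrupts, daemons, the scheduler, ...).
    """
    min_gap = None
    max_gap = 0
    total = 0
    long_gaps = []  # every gap above threshold_ns, in nanoseconds

    prev = time.perf_counter_ns()
    for _ in range(n_samples):
        now = time.perf_counter_ns()
        gap = now - prev
        prev = now

        total += gap
        if min_gap is None or gap < min_gap:
            min_gap = gap
        if gap > max_gap:
            max_gap = gap
        if gap > threshold_ns:
            long_gaps.append(gap)

    return {
        "samples": n_samples,
        "min_ns": min_gap,
        "mean_ns": total / n_samples,
        "max_ns": max_gap,
        "long_gaps": long_gaps,
    }


if __name__ == "__main__":
    stats = sample_gaps()
    print("min/mean/max gap (ns):",
          stats["min_ns"], round(stats["mean_ns"], 1), stats["max_ns"])
    print("gaps over threshold:", len(stats["long_gaps"]))
```

The count of long gaps tells you how often the OS took the core away, and their sizes tell you for how long. Pinning the process to one core before running it keeps migrations from muddying the picture.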

_larry

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
