Hi,

In light of some of the questions about a suitable computer to run
siesta on, I have put together some benchmarks.

Serial siesta:

Different computers handle each routine within siesta with a different
efficiency. It is therefore very difficult to say how fast a computer
will be without first deciding on the type of atomic system one
intends to run. To illustrate this, I have run a series of
different-sized GaAs cells on Itanium2 (1.4GHz, 3MB L3, Altix 350) and
Xeon (2.8GHz, 512KB cache) processors:

n atoms t(Altix) t(xeon)    ratio  %diagon
        seconds  seconds           (% of time in diagonalisation)
4        45.52    45.28     0.99    2.06
8        52.24    58.68     1.12    6.11
16       79.88    139.4     1.75   18.58
24      138.02    314.85    2.28   31.48
36      342.99    911.43    2.66   58.88
54     1112.14   3378.53    3.04   79.66
72     2613.03   7943.96    3.04   88.32
96     4188.65  13227.00    3.16   89.79
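To put a number on how the cost grows with cell size, one can fit the
effective scaling exponent as the slope of log(time) against log(atoms).
A minimal sketch in plain Python, using the Altix column from the table
above; the `loglog_slope` helper is my own, not part of siesta:

```python
import math

# Serial GaAs timings from the table above (number of atoms, Altix seconds).
n_atoms = [4, 8, 16, 24, 36, 54, 72, 96]
t_altix = [45.52, 52.24, 79.88, 138.02, 342.99, 1112.14, 2613.03, 4188.65]

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) vs log(x): the apparent scaling exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

print(round(loglog_slope(n_atoms, t_altix), 2))
```

Over this range the fitted exponent is well below the asymptotic cubic
cost of dense diagonalisation, because the smallest cells are dominated
by parts of the code that scale more gently (compare the %diagon column).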

For further serial benchmarks, I have chosen a GaMnAs cell with 32
atoms, an 8*8*8 k-grid, a 350 Ry cutoff, a DZ basis for the s and p
orbitals of each atom, and a TZ basis for the d orbitals of Mn. This
cell was run for 40 SCF iterations on a number of different platforms.

Computer                                       job time (hours)
1) Pentium III, 1.0GHz, 256KB L2 cache         94.5
2) Xeon, 3.2GHz, 512KB cache                   18.0
3) SGI Altix 350, 1.4GHz Itanium2, 1.5MB L3     7.12
4) SGI Altix 350, 1.4GHz Itanium2, 3.0MB L3     4.12
5) IBM Power4                                  16.8
6) Sun Fire, AMD Opteron 2.2GHz, 1MB L2        21.13

I can't guarantee that I have fully optimised the executables,
particularly for the AMD Opteron, for which I was only able to build a
32-bit binary. This benchmark flatters the Itanium2 because of its
large cache; for smaller atomic systems there is less of a difference
between the Itanium2 and Xeon processors.

Parallel siesta:

The speed of inter-node communication needed to run in parallel again
depends largely on the atomic system. For the GaMnAs system to run
efficiently using the "parallel over orbitals" method requires very
fast communications, i.e. Myrinet or faster.

Generally the CPU time falls off as a power law in the number of
processors (a straight line on a log-log plot). I therefore give the
gradient of the log-log plot as a measure of communication performance:
[d log(time)] / [d log(number of processors)].

computer                                    gradient of log-log plot
Gigabit & Xeon 3.2GHz                       +0.58 (note: positive)
Myrinet & Pentium III                       -0.44
IBM Power3 & shared memory                  -0.66
NUMA & Itanium2                             -0.43
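For anyone wanting to reproduce these figures on their own cluster, the
gradient is just the least-squares slope of log(time) against
log(processors). A minimal sketch in plain Python; the wall times passed
in below are made-up numbers for illustration, not measurements:

```python
import math

def scaling_gradient(procs, times):
    """Least-squares gradient of log(time) vs log(number of processors).
    Ideal linear scaling gives -1; a positive value means that adding
    processors actually slows the job down."""
    lp = [math.log(p) for p in procs]
    lt = [math.log(t) for t in times]
    mp = sum(lp) / len(lp)
    mt = sum(lt) / len(lt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(lp, lt))
    var = sum((a - mp) ** 2 for a in lp)
    return cov / var

# Hypothetical wall times (seconds) on 1, 2, 4, 8 processors -- not measured data.
print(round(scaling_gradient([1, 2, 4, 8], [1000, 740, 560, 430]), 2))  # -> -0.41
```

A gradient near -1 means nearly ideal scaling; the +0.58 for Gigabit
above means the interconnect was so slow that extra processors made the
GaMnAs job take longer.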

The GaMnAs system does not perform well when parallelised over
orbitals; for the majority of other atomic systems, better scaling
with the number of processors is observed. The GaMnAs cell is (I
think) a worst-case scenario for parallelisation, and would do much
better parallelised over k-points.

I hope this helps.

Tom

==============================================================
*                                                            *
*  Thomas Archer                                             *
*                                                            *
*  Computational Spintronics Group                           *
*  Physics Department                                        *
*  Trinity College                                           *
*  Dublin 2                    Phone:  +353 1  6083262       *
*  Ireland                     Mobile: +353 85 7157772       *
*                              Fax:    +353 1  6711759       *
*                              Email:  [EMAIL PROTECTED]        *
*                                                            *
==============================================================
