I ran a set of HPC Challenge benchmarks on ONE dual-socket, quad-core Opteron 2350 (Rev. B3) server (8 logical CPUs) with 16 GB of RAM. The tests were performed under SuSE 10.3/x86-64, with LAM MPI 7.1.4 and MPICH 1.2.7 from the SuSE distribution, using ATLAS 3.9. Unfortunately there is only one such cluster node, so I can't reproduce the run on another node :-(

For N (matrix size) up to 10000 everything looks OK. But for larger N (15000, 20000, ...), running hpcc (mpirun -np 8 hpcc) causes Linux to hang.
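As a back-of-the-envelope sanity check (my own estimate, not something from the hpcc output): the HPL working set for these N values should still fit comfortably in 16 GB, so the hangs are unlikely to be simple memory exhaustion.

```python
# Rough estimate of the HPL matrix footprint for the N values tested.
# HPL factors one N x N matrix of doubles (8 bytes each), distributed
# across the MPI processes; the totals below are for the whole matrix.
RAM_GB = 16

for n in (10000, 15000, 20000):
    matrix_gb = n * n * 8 / 2**30
    print(f"N={n}: {matrix_gb:.2f} GB of {RAM_GB} GB RAM")
```

Even at N=20000 the matrix is about 3 GB, far below both the 14 GB main-memory limit and physical RAM, which is consistent with top showing no swap usage.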

In the "top" output I see eight hpcc instances, each consuming about 100% of a CPU, with reasonable amounts of virtual and RSS memory per hpcc process and no swap usage. Usually there are no PTRANS results in the hpccoutf.txt results file, but in a few cases (when I "actively watched" the hpcc run by issuing ps/top) I saw reasonable PTRANS results but no HPLinpack results. Once I obtained PTRANS, HPL, and DGEMM results for N=20000, but the hang came later, during the STREAM tests. That may simply be because the final output buffer was never flushed to the output file on disk at the moment of the hang.

One possible reason for the hang-ups is a memory hardware problem, but what about possible software causes? The hpcc executable is 64-bit and dynamically linked. /etc/security/limits.conf is empty. The stack-size limit (for the user issuing mpirun) is "unlimited", the main-memory limit is about 14 GB, and the virtual-memory limit is about 30 GB. ATLAS was compiled for 32-bit integers, but that is sufficient for such N values. Even /proc/sys/kernel/shmmax is 2^63-1.
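To back up the claim that 32-bit ATLAS integers suffice, here is a quick check, assuming the largest index an ATLAS routine needs internally is on the order of N*N (a full leading-dimension element offset, which is an upper bound):

```python
# Verify that element offsets up to N*N fit in a signed 32-bit integer,
# which is what a 32-bit-integer ATLAS build uses for indexing.
INT32_MAX = 2**31 - 1  # 2147483647

for n in (10000, 15000, 20000):
    offset = n * n
    print(f"N={n}: N*N = {offset:,}, fits in int32: {offset <= INT32_MAX}")
```

N=20000 gives N*N = 4*10^8, well below 2^31-1, so integer overflow in ATLAS should not be the culprit at these sizes.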

What else could be the reason for the hang?

Mikhail Kuzminskiy
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow
_______________________________________________
Beowulf mailing list, [email protected]