I have a very dim recollection of some kernel TCP issues back in some older 
kernel versions -- such issues affected all TCP communications, not just MPI.  
Can you try a newer kernel, perchance?
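
To rule MPI out first, it might also be worth running a raw TCP benchmark 
between two of the nodes on both kernels -- e.g., with iperf, assuming it 
is available on your node images:

    iperf -s                 # on one node, e.g. cluster-18
    iperf -c cluster-18      # on another, e.g. cluster-17

If raw TCP throughput also drops going from 2.6.23 to 2.6.24, that would 
confirm the problem is below Open MPI.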


On Mar 30, 2010, at 1:26 PM, <open...@docawk.org> wrote:

> Hello List,
> 
> I hope you can help us out with this one; we have been trying to
> figure it out for weeks.
> 
> The situation: we have a program that can split into several processes
> distributed across the nodes of a cluster network using Open MPI.
> We were running that system on "older" cluster hardware (Intel Core2 Duo
> based, 2 GB RAM) with an "older" kernel (2.6.18.6). All nodes boot
> diskless over the network. Recently we upgraded the hardware (Intel i5,
> 8 GB RAM), which also required an upgrade to a recent kernel version
> (2.6.26+).
> 
> Here is the problem: we see an overall performance loss on the new
> hardware, and we think we can narrow it down to a communication issue
> between the processes.
> 
> We also found that the issue arises in the transition from kernel
> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
> 
> Here is output from our program:
> 
> 2.6.23.17 (64-bit), MPI 1.2.7
> 5 iterations (Core2 Duo), 6 CPUs:
>     93.33 seconds per iteration.
>  Node   0 communication/computation time:      6.83 /    647.64 seconds.
>  Node   1 communication/computation time:     10.09 /    644.36 seconds.
>  Node   2 communication/computation time:      7.27 /    645.03 seconds.
>  Node   3 communication/computation time:    165.02 /    485.52 seconds.
>  Node   4 communication/computation time:      6.50 /    643.82 seconds.
>  Node   5 communication/computation time:      7.80 /    627.63 seconds.
>  Computation time:    897.00 seconds.
> 
> 2.6.24.7 (64-bit), re-evaluated, MPI 1.2.7
> 5 iterations (Core2 Duo), 6 CPUs:
>    131.33 seconds per iteration.
>  Node   0 communication/computation time:    364.15 /    645.24 seconds.
>  Node   1 communication/computation time:    362.83 /    645.26 seconds.
>  Node   2 communication/computation time:    349.39 /    645.07 seconds.
>  Node   3 communication/computation time:    508.34 /    485.53 seconds.
>  Node   4 communication/computation time:    349.94 /    643.81 seconds.
>  Node   5 communication/computation time:    349.07 /    627.47 seconds.
>  Computation time:   1251.00 seconds.
> 
> The program is 32-bit software, but it makes no difference whether the
> kernel is 64- or 32-bit. We also tested Open MPI 1.4.1; it cut
> communication times in half (still far too high), but the improvement
> shrank with increasing kernel version.
> 
> The communication time is meant to be the time the master process
> spends distributing the data portions for calculation and collecting
> the results from the slave processes. The value also includes the time
> a slave has to wait to communicate with the master while the master is
> busy. This explains the longer communication time of node #3, whose
> computation time is shorter (due to the nature of its data).
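>
> Roughly like this (a minimal, hypothetical sketch with made-up buffer
> sizes and computation -- not the actual invert- source):
>
>   program timing_sketch
>     implicit none
>     include 'mpif.h'
>     integer :: ierr, rank, i
>     double precision :: t0, t_comm, t_comp
>     double precision :: work(1000), total(1000)
>
>     call mpi_init(ierr)
>     call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>     t_comm = 0.0d0
>     t_comp = 0.0d0
>     work   = 1.0d0
>
>     ! master distributes a data portion to every process
>     t0 = mpi_wtime()
>     call mpi_bcast(work, 1000, MPI_DOUBLE_PRECISION, 0, &
>                    MPI_COMM_WORLD, ierr)
>     t_comm = t_comm + (mpi_wtime() - t0)
>
>     ! local computation (stand-in for the real calculation)
>     t0 = mpi_wtime()
>     do i = 1, 1000
>        work(i) = work(i) * 2.0d0
>     end do
>     t_comp = t_comp + (mpi_wtime() - t0)
>
>     ! master collects the results; waiting for a busy master is
>     ! booked as communication time, hence node #3's larger value
>     t0 = mpi_wtime()
>     call mpi_reduce(work, total, 1000, MPI_DOUBLE_PRECISION, &
>                     MPI_SUM, 0, MPI_COMM_WORLD, ierr)
>     t_comm = t_comm + (mpi_wtime() - t0)
>
>     print '(a,i2,a,f8.2,a,f8.2,a)', ' Node ', rank, &
>        ' communication/computation time: ', t_comm, ' /', t_comp, &
>        ' seconds.'
>     call mpi_finalize(ierr)
>   end program timing_sketch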
> 
> The command to start the calculation:
> mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np
> 4 -host cluster-18,cluster-19
> 
> Using top (with 'f' and 'j' to show the P column) we could track which
> process runs on which core. Processes stayed on their initial cores
> under kernel 2.6.23 but started to hop between cores under 2.6.24.
> The --bind-to-core option of Open MPI 1.4.1 kept the processes on
> their cores again, but that did not change the overall outcome; it did
> not fix the issue.
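>
> With 1.4.1, the binding run looked roughly like this (same host layout
> as the mpirun command above):
>
> mpirun --bind-to-core -np 2 -host cluster-17 invert-master -b -s -p
> inv_grav.inp : -np 4 -host cluster-18,cluster-19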
> 
> We found top showing ~25% CPU wait time, with processes in state 'D'
> (uninterruptible sleep), even on slave-only nodes. According to our
> programmer, communication happens only between the master process and
> its slaves, never among the slaves. On kernel 2.6.23 and lower, CPU
> usage is 100% user time, with no wait or system percentage.
> 
> Example from top:
> 
> Cpu(s): 75.3%us,  0.6%sy,  0.0%ni,  0.0%id, 23.1%wa,  0.7%hi,  0.3%si,
> 0.0%st
> Mem:   8181236k total,   131224k used,  8050012k free,        0k buffers
> Swap:        0k total,        0k used,        0k free,    49868k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
>  3386 oli       20   0 90512  20m 3988 R   74  0.3  12:31.80 0 invert-
>  3387 oli       20   0 85072  15m 3780 D   67  0.2  11:59.30 1 invert-
>  3388 oli       20   0 85064  14m 3588 D   77  0.2  12:56.90 2 invert-
>  3389 oli       20   0 84936  14m 3436 R   85  0.2  13:28.30 3 invert-
> 
> 
> Some system information that might be helpful:
> 
> Node hardware:
> 1. "older": Intel Core2 Duo, (2x1) GB RAM
> 2. "newer": Intel(R) Core(TM) i5 CPU, ASUS RS100-E6 mainboard, (4x2) GB RAM
> 
> Debian stable (lenny) distribution with
> ii  libc6                             2.7-18lenny2
> ii  libopenmpi1                       1.2.7~rc2-2
> ii  openmpi-bin                       1.2.7~rc2-2
> ii  openmpi-common                    1.2.7~rc2-2
> 
> Nodes boot diskless with an NFS root and a kernel that has all needed
> drivers compiled in.
> 
> Information on the program using Open MPI and on the tools used to
> compile it:
> 
> mpirun --version:
> mpirun (Open MPI) 1.2.7rc2
> 
> libopenmpi-dev 1.2.7~rc2-2
> depends on:
>  libc6 (2.7-18lenny2)
>  libopenmpi1 (1.2.7~rc2-2)
>  openmpi-common (1.2.7~rc2-2)
> 
> 
> Compilation command:
> mpif90
> 
> 
> FORTRAN compiler (FC):
> gfortran --version:
> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
> 
> 
> Open MPI functions called (Fortran bindings; see the sketch after this
> list):
> mpi_comm_rank
> mpi_comm_size
> 
> mpi_bcast
> mpi_reduce
> 
> mpi_isend
> mpi_wait
> 
> mpi_send
> mpi_probe
> mpi_recv
> 
> MPI_Wtime
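>
> The point-to-point calls follow the usual master/slave pattern; a
> hypothetical sketch (not the actual invert- source; sizes and tags
> are made up):
>
>   program pattern_sketch
>     implicit none
>     include 'mpif.h'
>     integer :: ierr, rank, nprocs, islave
>     integer :: status(MPI_STATUS_SIZE)
>     double precision :: portion(100), result(100)
>
>     call mpi_init(ierr)
>     call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>     call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
>     portion = 1.0d0
>
>     if (rank == 0) then
>        ! master: hand one portion to each slave, then collect
>        do islave = 1, nprocs - 1
>           call mpi_send(portion, 100, MPI_DOUBLE_PRECISION, islave, &
>                         1, MPI_COMM_WORLD, ierr)
>        end do
>        do islave = 1, nprocs - 1
>           call mpi_recv(result, 100, MPI_DOUBLE_PRECISION, &
>                         MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, status, ierr)
>        end do
>     else
>        ! slave: wait for work, compute, send the result back
>        call mpi_probe(0, 1, MPI_COMM_WORLD, status, ierr)
>        call mpi_recv(portion, 100, MPI_DOUBLE_PRECISION, 0, 1, &
>                      MPI_COMM_WORLD, status, ierr)
>        result = portion * 2.0d0      ! stand-in computation
>        call mpi_send(result, 100, MPI_DOUBLE_PRECISION, 0, 2, &
>                      MPI_COMM_WORLD, ierr)
>     end if
>
>     call mpi_finalize(ierr)
>   end program pattern_sketch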
> 
> 
> Additionally linked: the ncurses library,
> libncurses5-dev (5.7+20081213-1).
> On remote nodes no calls are ever made to this library. On the local
> node such calls (coded in C) are only optional, and usually they are
> skipped too (i.e., not even initscr() is called).
> 
> 
> A signal handler (coded in C) is integrated that reacts specifically
> to SIGTERM and SIGUSR1.
> 
> 
> If you need more information (e.g. the kernel config), please ask.
> I hope you can provide some ideas for testing and resolving the issue.
> Thanks in any case.
> 
> Oli
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

