Hi,

I was using Linux 2.1.79 on my cluster. The machines are Dell PowerEdge
6100s: quad Pentium Pro, a gigabyte of memory, onboard Adaptec 2940,
and Intel EtherExpress Pro NICs, communicating over an Intel
510T switch.

I am running Message Passing Interface (MPI) programs on these machines,
mostly jobs with two processes on one node and one on the other. When I
terminate a job prematurely with kill or Ctrl-C, sockets get left in
FIN_WAIT1. Sometimes the corresponding socket is in LAST_ACK, suggesting
that one end went into LAST_ACK without doing whatever is needed to move
its peer from FIN_WAIT1 to FIN_WAIT2, or something like that.
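For what it's worth, stuck FIN_WAIT1 sockets with a large Send-Q are consistent with close() queueing a FIN behind unacknowledged data that the dead peer will never ACK. One crude workaround (if the MPI library's signal-handler cleanup can be changed; this is a sketch, not the library's actual code) is to abort the connection with SO_LINGER set to zero, so the kernel sends an RST and discards the queue instead of retransmitting a FIN forever. A minimal sketch on loopback:

```python
import socket
import struct

# Two loopback sockets stand in for the pair of MPI processes.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))
srv.listen(1)
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
peer, _ = srv.accept()

cli.sendall(b'unfinished message')

# l_onoff=1, l_linger=0: close() aborts the connection with RST and
# discards any queued data, so no socket is left behind in FIN_WAIT1.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
               struct.pack('ii', 1, 0))
cli.close()
```

The price is that the other end sees a connection reset rather than an orderly close, which is usually acceptable in an abnormal-termination path.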

The FIN_WAIT1 sockets never go away (from netstat):

Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  23025 eip11.cluster01.en:4242 eip11.cluster01.en:4244 FIN_WAIT1
tcp        0      1 eip11.cluster01.en:4234 eip11.cluster01.en:4232 FIN_WAIT1

This behaviour goes away when I run 2.1.79 as a uniprocessor kernel;
there, the two processes on the same machine run on the same processor.

Upgrading to 2.1.124 took away the FIN_WAIT1 problem, but performance is
much worse: jobs with 40,000 broadcasts and reduces that take 8 seconds
on 2.1.79 take 80 seconds on 2.1.124. I benchmarked the network
performance for TCP and MPI, and the TCP performance of 2.1.124 is almost
105 lower. The results, using the NetPIPE benchmark, are at
http://reno.cis.upenn.edu/~rahul/perf/, with graphs of throughput against
block transfer time and block size, in both PostScript and GIF formats.

Is there some way to fix the FIN_WAIT1 problem? Or, why is the throughput
lower on 2.1.124, and is there some way to fix that instead? Does the
IO-APIC have anything to do with it? (2.1.79 doesn't seem to have it.)

Thanks a lot,
Rahul
