I sniffed the network on the storage nodes' interface, and I saw lots
of TCP lost segment, previous segment lost, ACKed lost segment, and
TCP window full messages. GPFS is now under heavy use.
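
For reference, a capture along these lines should show the same thing
(eth0 is just my assumed name for the GPFS interface here):

    tcpdump -i eth0 -s 0 -w gpfs.pcap   # full frames, write to a file
    # then, in ethereal/wireshark, a display filter like
    #   tcp.analysis.lost_segment || tcp.analysis.window_full
    # pulls out exactly the lost-segment and window-full events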

so this indicates that you have a serious ethernet problem, no?
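
before blaming the switch alone, maybe read the per-NIC error counters
on a few nodes; a rough sketch, assuming eth0 and a driver that
exports stats through ethtool:

    ethtool -S eth0 | egrep -i 'err|drop|crc'   # counter names are driver-dependent
    ifconfig eth0                   # watch the errors/dropped/overruns fields
    netstat -s | egrep -i retrans   # kernel-wide TCP retransmit totals

a single port racking up crc errors usually means a bad cable or nic
rather than the switch itself.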

The Myrinet connection was working fine, but sometimes a user program
would just get stuck: one of the processes was sleeping while all the
others were running, and then the program hung. Investigating further,
this happened even with the simple mpich examples like cpi, cpilog,
etc. We are using the mx driver version 1.1.6 and mpich-mx 1.2.7..5.
mx_info shows all nodes connected when this happens, and the switch
did not overheat. mpirun.ch_mx -v shows that all the processes are
issued OK to the nodes, but somehow one (or more) of them goes to
sleep or never starts, and all the other processes just hang. The mx
diagnostic tools have not shown any problem so far, but we still have
not done a
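
Next time it hangs, we plan to attach to the sleeping rank and see
where it is blocked; something along these lines, where <pid> is
whatever ps reports for the stuck process:

    ps auxw | grep cpi        # find the sleeping rank's pid
    cat /proc/<pid>/wchan     # kernel function it is blocked in
    strace -p <pid>           # is it parked in a syscall?
    gdb -p <pid>              # then 'bt' for a user-level backtrace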

but spawning myrinet jobs normally involves some use of ethernet,
which in your case has known problems. as I recall, the startup
protocol involves a rendezvous ethernet socket managed by the rank-0
node. couldn't the myrinet-starting problem simply be due to the eth
problem, rather than anything specific to myrinet?
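
one quick way to test that theory: while a job is wedged in startup,
look at the rank-0 node's sockets, e.g.:

    # on the rank-0 node while the job is stuck starting up
    netstat -tnp | grep mpirun
    # a connection sitting in SYN_SENT, or one with a fat Send-Q,
    # points at the eth path rather than at myrinet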

here's an idea: configure ip-over-myrinet, and use it exclusively
to start the jobs. if that works, then you know for sure that the problem is solely on the eth side (switch, perhaps, or maybe a nic
that's jabbering or otherwise misbehaving?)
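
a rough sketch of what I mean; the interface name (myri0 here) is an
assumption that depends on how your MX ethernet emulation is set up,
so check ifconfig -a first:

    ifconfig myri0 192.168.2.1 netmask 255.255.255.0 up
    # give each node's myri0 address its own hostname in /etc/hosts
    # (node01-myri, node02-myri, ...) and list those in the machine
    # file; assuming mpirun.ch_mx takes the usual mpich flags:
    mpirun.ch_mx -v -machinefile machines.myri -np 4 ./cpi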