I sniffed the network on the storage nodes' interface, and I saw lots
of TCP lost segment, previous segment lost, ACKed lost segment, and
TCP window full messages. GPFS is now under heavy use.
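
For reference, a capture along these lines should show the same thing
(eth0 is just my assumed name for the GPFS interface here):

    tcpdump -i eth0 -s 0 -w gpfs.pcap   # full frames, write to a file
    # then, in ethereal/wireshark, a display filter like
    #   tcp.analysis.lost_segment || tcp.analysis.window_full
    # pulls out exactly the lost-segment and window-full events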

so this indicates that you have a serious ethernet problem, no?
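
before blaming the switch alone, maybe read the per-NIC error counters
on a few nodes; a rough sketch, assuming eth0 and a driver that
exports stats through ethtool:

    ethtool -S eth0 | egrep -i 'err|drop|crc'   # counter names are driver-dependent
    ifconfig eth0                   # watch the errors/dropped/overruns fields
    netstat -s | egrep -i retrans   # kernel-wide TCP retransmit totals

a single port racking up crc errors usually means a bad cable or nic
rather than the switch itself.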

The Myrinet connection was working fine, but sometimes a user program
would just get stuck: one of the processes was sleeping while all the
others were running, and then the program hung. Investigating further,
this happened even with the simple mpich examples like cpi, cpilog,
etc. We are using the mx driver version 1.1.6 and mpich-mx 1.2.7..5.
mx_info shows all nodes connected when this happens, and the switch
did not overheat. mpirun.ch_mx -v shows that all the processes are
issued OK to the nodes, but somehow one (or more) of them goes to
sleep or never starts, and all the other processes just hang. The mx
diagnostic tools have not shown any problem so far, but we still have
not done a
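
Next time it hangs, we plan to attach to the sleeping rank and see
where it is blocked; something along these lines, where <pid> is
whatever ps reports for the stuck process:

    ps auxw | grep cpi        # find the sleeping rank's pid
    cat /proc/<pid>/wchan     # kernel function it is blocked in
    strace -p <pid>           # is it parked in a syscall?
    gdb -p <pid>              # then 'bt' for a user-level backtrace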

but spawning myrinet jobs normally involves some use of ethernet,
which in your case has known problems. as I recall, the startup
protocol involves a rendezvous ethernet socket managed by the rank-0
node. couldn't the myrinet-starting problem simply be due to the eth
problem, rather than anything specific to myrinet?
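
one quick way to test that theory: while a job is wedged in startup,
look at the rank-0 node's sockets, e.g.:

    # on the rank-0 node while the job is stuck starting up
    netstat -tnp | grep mpirun
    # a connection sitting in SYN_SENT, or one with a fat Send-Q,
    # points at the eth path rather than at myrinet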

here's an idea: configure ip-over-myrinet, and use it exclusively
to start the jobs. if that works, then you know for sure that the problem is solely on the eth side (switch, perhaps, or maybe a nic
that's jabbering or otherwise misbehaving?)
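
a rough sketch of what I mean; the interface name (myri0 here) is an
assumption that depends on how your MX ethernet emulation is set up,
so check ifconfig -a first:

    ifconfig myri0 192.168.2.1 netmask 255.255.255.0 up
    # give each node's myri0 address its own hostname in /etc/hosts
    # (node01-myri, node02-myri, ...) and list those in the machine
    # file; assuming mpirun.ch_mx takes the usual mpich flags:
    mpirun.ch_mx -v -machinefile machines.myri -np 4 ./cpi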