Hello everybody, I am beginning to take care of a cluster of IBM JS21 blades. The cluster consists of 112 nodes (8 BladeCenters), plus three Power5 management nodes (one head node and two storage nodes) and a DS4200 storage array. We are now using GPFS as the cluster file system over a dedicated gigabit service network (through a Force10 S-series switch), and Myrinet 2000 for MPI. And now comes the story...
The system was originally configured with NFS exported to the nodes and GPFS running between the two Power5 storage nodes (underneath there is the DS4200 array using RAID 5). NFS died badly, leaving lots of badcalls and badclnt entries. Access and copy times were terrible, and sometimes the connection just died. The GPFS daemon, mmfsd, on the primary NSD server was stuck at 100% CPU. ssh did not show any problems at the time, so it looked like some sort of problem with NFS or the network. We then decided to switch from NFS to GPFS across the entire cluster, also restarting mmfsd, and that worked: all the nodes had their file systems accessible again.

But sometimes, completely at random, a node gets removed from the GPFS cluster with an "expired lease" error message. This is still happening. The failures occur every 5 days on average, with or without load, and at random places in the cluster. The node rejoins GPFS after a few seconds. I wrote a script that checks whether a node has been disconnected from GPFS and then pings the disconnected node; the nodes still had network connectivity when GPFS expelled them. Sniffing the network on the storage nodes' interface, I saw lots of "TCP lost fragment", "previous lost fragments", "ACK lost fragments" and "TCP window full" messages. GPFS is now heavily used.

In the meantime... the Myrinet connection was working fine, but sometimes a user program would just get stuck: one of the processes was sleeping while all the others were running, and then the program hung. Investigating further, this also happens with the simple MPICH examples like cpi, cpilog, etc. We are using MX driver version 1.1.6 and MPICH-MX 1.2.7..5. mx_info shows all nodes connected when this happens, and the switch did not overheat. mpirun.ch_mx -v shows that all the processes are launched on the nodes without error, but somehow one (or more) of the processes goes to sleep or never starts, and all the others just hang.
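For reference, the lease-check script I mentioned is essentially a loop like the following sketch. The `mmgetstate -a` output format assumed here (node number, node name, state in three columns) is what our GPFS release prints; the parsing may need adjusting on other releases, and the 30-second poll interval is just what I picked:

```python
#!/usr/bin/env python
# Sketch of the lease-expiry watcher: poll `mmgetstate -a` and ping any node
# that GPFS no longer reports as active. The output layout of mmgetstate is
# an assumption from our GPFS release; adjust parse_down_nodes() if yours
# prints different columns.
import subprocess
import time

def parse_down_nodes(mmgetstate_output):
    """Return the node names whose GPFS state is not 'active'."""
    down = []
    for line in mmgetstate_output.splitlines():
        fields = line.split()
        # Expected data rows look like: <node number> <node name> <state>
        if len(fields) == 3 and fields[0].isdigit() and fields[2] != "active":
            down.append(fields[1])
    return down

def ping_ok(host):
    """Send one ICMP echo with a 2 s deadline; True if the node answers."""
    return subprocess.call(["ping", "-c", "1", "-w", "2", host]) == 0

if __name__ == "__main__":
    while True:
        out = subprocess.check_output(["mmgetstate", "-a"]).decode()
        for node in parse_down_nodes(out):
            state = "reachable" if ping_ok(node) else "unreachable"
            print("%s dropped from GPFS but is %s via ICMP" % (node, state))
        time.sleep(30)
```

Every time a node was expelled, this reported it as still reachable via ICMP, which is why I suspect something above basic IP connectivity (fragmentation, window exhaustion) rather than a dead link.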
The MX diagnostic tools have not shown any problem so far, but we have not yet been able to run mx_pingpong, for example, because some users are still using the cluster. There is no error whatsoever in the Myrinet logs or the system logs. The operating system is SUSE Linux Enterprise Server 9, kernel version 2.6.5-7.244. We have other problems (like some BA060021 errors in the BladeCenter logs, and PIO drv_stats x51 messages filling dmesg on the head node), but these connection issues are the main problems right now. Any suggestions? I can provide any log necessary. Thank you!
--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
