On 07/05/2010 11:19 PM, Peter Kitchener wrote:
> Hi all,
>
> I have been troubleshooting a strange problem that is occurring with our
> Lustre setup. Under high loads our developers are complaining that various
> processes they run will error out with I/O error.
>
> Our setup is small 1 MDS and 2 OSS(10OSTs 5/OSS), and 13 Clients (152 Cores)
> the storage is all local 60TB (30TB/OSS) usable in a RAID6 Software raid
> setup. All of the machines are connected via 10Gig Ethernet. The clients run
> Rocks 5.3 (CentOS 5.4) and the Servers run CentOS 5.4 with kernel
> 2.6.18-164.11.1.el5_lustre.1.8.2. The Clients run an un-patched vanilla
> kernel from CentOS and Lustre 1.8.3
>
> So far I've not been able to pin point where i should begin to look. I have
> been trawling through log files that quite frankly don't make much sense to
> me.
>
> Here is the messages output from the OSS
>
> ##############################
>
> Jul 6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client
> ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16....@tcp) in 227 seconds. I
> think it's dead, and I am evicting it.

There is a bug in Lustre 1.8.2 and 1.8.3 that makes ptlrpcd get stuck for
long periods of time (around 10 minutes was the longest that I saw) on Lustre
clients under certain workloads. While ptlrpcd is stuck, the client may stop
sending all RPCs to the servers, and the servers evict the client because
they haven't heard from it in a while.

See bug 22897 for a description of the bug. The fix is a simple one-liner in
bug 22786, attachment 29866. It will first appear in Lustre 1.8.4. I would
highly recommend that anyone using 1.8.2 or 1.8.3 apply that patch.

I don't know if that is the cause of your particular evictions, because
there can be many causes of evictions. But the "haven't heard from client ...
in 227 seconds" message was one of the symptoms, and the client failing with
-107 (ENOTCONN) against multiple OSTs (and/or MDS, MGS...) at the same time
was another symptom.

Chris

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
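
One rough way to check whether the server logs match the symptoms described above is to tally the "haven't heard from client" messages per client and see whether several targets evicted the same client. The following is only a minimal sketch: the regular expression is modelled on the single OSS log line quoted in this thread, and the function names are made up for illustration, so other Lustre versions or message variants may need a different pattern.

```python
import re
from collections import defaultdict

# Pattern modelled on the one OSS message quoted in this thread (an
# assumption; other Lustre versions may word the message differently).
EVICTION_RE = re.compile(
    r"Lustre: (?P<target>\S+): haven't heard from client "
    r"(?P<client>\S+) \(at (?P<nid>[^)]*)\) in (?P<seconds>\d+) seconds"
)


def evictions_by_client(lines):
    """Map each client UUID to a list of (target, silent_seconds) tuples."""
    by_client = defaultdict(list)
    for line in lines:
        match = EVICTION_RE.search(line)
        if match:
            by_client[match.group("client")].append(
                (match.group("target"), int(match.group("seconds")))
            )
    return dict(by_client)


def clients_evicted_by_multiple_targets(lines):
    """Return {client UUID: sorted target names} for clients that more
    than one target reported as silent."""
    result = {}
    for client, events in evictions_by_client(lines).items():
        targets = sorted({target for target, _ in events})
        if len(targets) > 1:
            result[client] = targets
    return result
```

Both functions accept any iterable of log lines, for example an open file object for the syslog file on each OSS (on CentOS that is normally /var/log/messages). A client that shows up under several targets is consistent with the stuck-ptlrpcd behaviour, but it does not prove it, since evictions have many other causes.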