Re: [Lustre-discuss] I/O error on clients

Gabriele Paciucci Wed, 07 Jul 2010 01:04:51 -0700

Hi,
The ptlrpc bug is a real problem, but I can't find anything in Peter's 
logs that points to an eviction caused by ptlrpc. What I see instead is a 
timeout in the communication between an OST and the client. In any case, 
Peter could downgrade to 1.8.1.1, which does not suffer from that bug.
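
To see which targets are timing out on which clients, the eviction messages can be pulled out of the OSS syslog. Below is a minimal sketch in Python. The pattern is modelled only on the single message quoted further down in this thread, so other Lustre versions may word it differently, and the function name find_evictions is just something I made up for illustration.

```python
import re

# Pattern modelled on the one eviction message quoted in this thread.
# Assumption: other messages of this kind follow the same wording.
EVICTION_RE = re.compile(
    r"Lustre: (?P<target>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<seconds>\d+) seconds"
)


def find_evictions(lines):
    """Return (target, client_uuid, nid, silent_seconds) for each matching line."""
    found = []
    for line in lines:
        match = EVICTION_RE.search(line)
        if match:
            found.append((
                match.group("target"),
                match.group("uuid"),
                match.group("nid"),
                int(match.group("seconds")),
            ))
    return found
```

Feeding it the lines of the OSS messages file (for example via open() on wherever syslog writes on your servers) gives one tuple per eviction, which makes it easy to see whether one client or one OST dominates.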



My suggested action plan:

1. First, Peter should run the same Lustre version on clients and servers (1.8.3).

2. Second, check /proc/sys/lustre/ldlm_timeout: 6 seconds for the MDS, 20 
seconds for the OSS!

3. Third, do you have enough memory on the servers for all the client 
locks? Please refer to: 
http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50417791_pgfId-1290875
 

Normally a server starts to struggle once it holds more than 500k locks.
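
Points 2 and 3 can be checked with a few lines of Python run on each server. This is only a sketch under assumptions: the ldlm_timeout path is the one given above, but the per-namespace lock_count files under /proc/fs/lustre/ldlm/namespaces/ are my assumption about the 1.8 /proc layout, and the helper names are made up. The 6/20 second values and the 500k threshold are the ones mentioned in this mail.

```python
import glob
import os

# Values quoted in the action plan (seconds) and the lock threshold from this mail.
EXPECTED_LDLM_TIMEOUT = {"mds": 6, "oss": 20}
LOCK_WARN_THRESHOLD = 500000


def read_int(path):
    """Read a /proc-style file that holds a single integer."""
    with open(path) as handle:
        return int(handle.read().strip())


def ldlm_timeout_ok(role, proc_root="/proc"):
    """Return (matches_expected, current_value) for this node's ldlm_timeout."""
    current = read_int(os.path.join(proc_root, "sys/lustre/ldlm_timeout"))
    return current == EXPECTED_LDLM_TIMEOUT[role], current


def lock_pressure(proc_root="/proc"):
    """Return (total_locks, over_threshold) summed over all LDLM namespaces.

    Assumption: each namespace directory exposes a lock_count file.
    """
    pattern = os.path.join(proc_root, "fs/lustre/ldlm/namespaces/*/lock_count")
    total = sum(read_int(path) for path in glob.glob(pattern))
    return total, total > LOCK_WARN_THRESHOLD
```

On an OSS you would call ldlm_timeout_ok("oss") and lock_pressure(). The proc_root argument is there only so the functions can be tried against a copy of the tree.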

bye

On 07/07/2010 02:12 AM, Christopher J. Morrone wrote:
> On 07/05/2010 11:19 PM, Peter Kitchener wrote:
>    
>> Hi all,
>>
>> I have been troubleshooting a strange problem that is occurring with our 
>> Lustre setup. Under high loads our developers are complaining that various 
>> processes they run will error out with I/O error.
>>
>> Our setup is small: 1 MDS and 2 OSS (10 OSTs, 5 per OSS), and 13 clients 
>> (152 cores). The storage is all local, 60TB usable (30TB/OSS), in a 
>> software RAID6 setup.  All of the machines are connected via 10Gig Ethernet. The clients 
>> run Rocks 5.3 (CentOS 5.4) and the Servers run CentOS 5.4 with kernel 
>> 2.6.18-164.11.1.el5_lustre.1.8.2.  The Clients run an un-patched vanilla 
>> kernel from CentOS and Lustre 1.8.3
>>
>> So far I've not been able to pinpoint where I should begin to look. I have 
>> been trawling through log files that quite frankly don't make much sense to 
>> me.
>>
>> Here is the messages output from the OSS
>>
>> ##############################
>>
>> Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from 
>> client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16....@tcp) in 227 
>> seconds. I think it's dead, and I am evicting it.
>>      
> There is a bug in lustre 1.8.2 and 1.8.3 that makes the ptlrpcd get
> stuck for long periods of time (around 10 minutes was the longest that I
> saw) on lustre clients under certain workloads.  If the ptlrpcd is
> dead, the client may stop sending all RPCs to the servers, and the
> servers evict the client because they haven't heard from it in a while.
>
> See bug 22897 for a description of the bug.  But the fix is a simple
> one-liner in bug 22786, attachment 29866.  The fix will first appear in
> lustre 1.8.4.  I would highly recommend to anyone using 1.8.2 or 1.8.3
> that they add that patch.
>
> I don't know if that is the cause of your particular evictions, because
> there can be many causes of evictions.  But the "haven't heard from
> client ... in 227 seconds" was one of the symptoms, and the client
> failing with -107 (ENOTCONN) with multiple OSTs (and/or MDS, MGS...) at
> the same time was another symptom.
>
> Chris
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>    


-- 
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci

