[Lustre-discuss] Too many client evictions

2011-05-03 Thread DEGREMONT Aurelien
Hello

We often see some of our Lustre clients being evicted spuriously (the clients 
seem healthy).
The pattern is always the same:

All of this is on Lustre 2.0, with adaptive timeouts enabled.

1 - A server complains about a client:
### lock callback timer expired... after 25315s...
(nothing in the client logs)

(a few seconds later)

2 - The client receives -107 (ENOTCONN) in reply to an obd_ping for this target
(the server logs @@@processing error 107)

3 - The client realizes its connection was lost.
It notices it was evicted.
It reconnects.

(Just to be sure:) when a client is evicted, all in-flight I/O is lost and no 
recovery will be done for it?

We are thinking of increasing a timeout to give clients more time to 
answer the LDLM lock revocation (maybe they are just too loaded); see the 
sketch below for what we have in mind.
- Is raising ldlm_timeout enough to do so?
- Do we also need to change obd_timeout accordingly? Is there a risk of 
triggering new timeouts (cascading timeouts) if we change only ldlm_timeout?
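
As a sketch (the values are purely illustrative, and the /proc paths are the 
ones we see on our 2.0 servers):

    # current values, on the servers
    cat /proc/sys/lustre/ldlm_timeout
    cat /proc/sys/lustre/timeout          # this is obd_timeout

    # tentative increase (numbers purely illustrative)
    echo 40  > /proc/sys/lustre/ldlm_timeout
    echo 200 > /proc/sys/lustre/timeout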

Any feedback in this area is welcome.

Thank you

Aurélien Degrémont


Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Andreas Dilger
I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. 
I believe that LLNL has some adjusted tunables for AT that might help you 
(an increased at_min, etc.).
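
As a rough illustration only (these are not LLNL's actual values, just the kind 
of knobs involved), the AT tunables show up under /proc/sys/lustre on the 
servers and can also be set as ptlrpc module options:

    cat /proc/sys/lustre/at_max           # ceiling for adaptive timeouts; 0 disables AT
    cat /proc/sys/lustre/at_history       # window (seconds) used for the service estimates
    echo 40 > /proc/sys/lustre/at_min     # raise the floor (value illustrative)

    # to make it persistent, e.g. in modprobe.conf:
    # options ptlrpc at_min=40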

Hopefully Chris or someone at LLNL can comment. I think they were also 
documented in bugzilla, though I don't know the bug number. 

Cheers, Andreas



Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread DEGREMONT Aurelien
Correct me if I'm wrong, but when I look at the Lustre manual, it says 
that the client adapts its timeouts, but not the server. I understood 
that server-to-client RPCs still use the old mechanism, especially in our 
case where the server seems to be revoking a client lock (is ldlm_timeout 
used for that?) and the client did not respond.

I forgot to say that we also have LNET routers involved in some cases.

Thank you

Aurélien



Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Nathan Rutman

On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

 Correct me if I'm wrong, but when I look at the Lustre manual, it says 
 that the client adapts its timeouts, but not the server. I understood 
 that server-to-client RPCs still use the old mechanism, especially in our 
 case where the server seems to be revoking a client lock (is ldlm_timeout 
 used for that?) and the client did not respond.

The server and client cooperate on adaptive timeouts.  I don't 
remember which bug the ORNL settings were in, maybe 14071; bugzilla's not 
responding at the moment.  But a big question here is why 25315 seconds for a 
callback - that's well beyond anything at_max should allow...
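
One quick sanity check (a sketch; proc paths and service names vary a bit 
between versions) is to compare that 25315s with what AT is actually allowed 
to grow to on the servers:

    cat /proc/sys/lustre/at_max       # hard ceiling for adaptive timeouts
    cat /proc/sys/lustre/at_history   # estimation window, in seconds
    # per-service AT estimates also appear in "timeouts" files under
    # /proc/fs/lustre, e.g. /proc/fs/lustre/ost/OSS/ost_io/timeouts
    # (exact path may differ on your release)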


Re: [Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]

2011-05-03 Thread Andreas Dilger
Just to follow up on this issue: we landed a patch for 2.1 that will reduce the 
default OST cache to objects 8MB or smaller.  This can still be tuned via 
/proc, but it is likely to provide better all-around performance by avoiding 
cache flushes for streaming read and write operations.
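
Until that lands, something similar can be approximated on existing servers 
with the obdfilter cache tunables (a sketch; double-check the exact parameter 
names on your release):

    # cache only objects up to 8 MiB in the OSS read cache (value illustrative)
    lctl set_param obdfilter.*.readcache_max_filesize=8M
    # related switches
    lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable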

Robin, it would be great to know if tuning this would also solve your cache 
pressure woes without having to resort to disabling the VM cache pressure 
(which isn't something we can do by default for all users).

Cheers, Andreas

On 2011-02-09, at 8:11 AM, Robin Humble robin.humble+lus...@anu.edu.au wrote:

 rejoining this topic after a couple of weeks of experimentation
 
 Re: trying to improve metadata performance -
 
 we've been running with vfs_cache_pressure=0 on OSS's in production for
 over a week now and it's improved our metadata performance by a large factor.
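 for the record this is just the standard VM knob, set on each OSS with
 something like:
     sysctl -w vm.vfs_cache_pressure=0
     # equivalently: echo 0 > /proc/sys/vm/vfs_cache_pressure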
 
 - filesystem scans that didn't finish in ~30hrs now complete in a little
   over 3 hours. so ~10x speedup.
 
 - a recursive ls -altrR of my home dir (on a random uncached client) now
   runs at 2000 to 4000 files/s whereas before it could be 100 files/s.
   so 20 to 40x speedup.
 
 of course vfs_cache_pressure=0 can be a DANGEROUS setting because
 inodes/dentries will never be reclaimed, so OSS's could OOM.
 
 however slabtop shows inodes are 0.89K and dentries 0.21K, i.e. small, so
 I expect many sites can (like us) easily cache everything. for a given
 number of inodes per OST it's easily calculable whether there's enough
 OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab.
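 as a back-of-the-envelope example (the object counts are made up; only the
 0.89K and 0.21K slab sizes above are real):
 
     # e.g. 4 OSTs per OSS and ~2M objects per OST
     OSTS=4; OBJS_PER_OST=2000000
     echo "$OSTS $OBJS_PER_OST" | awk '{printf "~%.1f GiB of slab\n", $1*$2*(0.89+0.21)/1048576}'
     # -> ~8.4 GiB, which fits comfortably in a 48g OSS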
 
 continued monitoring of the fs inode growth (== OSS slab size) over
 time is very important as fs's will inevitably accrue more files...
 
 sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful
 at keeping stat rates high. sustained OSS cache memory pressure through
 the day dropped enough inodes that nightly scans weren't fast any more.
 
 our current residual issue with vfs_cache_pressure=0 is unexpected.
 the number of OSS dentries appears to slowly grow over time :-/
 it appears that some/many dentries for deleted files are not reclaimed
 without some memory pressure.
 any idea why that might be?
 
 anyway, I've now added a few lines of code to create a different
 (non-zero) vfs_cache_pressure knob for dentries. we'll see how that
 goes...
 an alternate (simpler) workaround would be to occasionally drop OSS
 inode/dentry caches, or to set vfs_cache_pressure=100 once in a while,
 and to just live with a day of slow stat's while the inode caches
 repopulate.
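 the crude version of that would be something like a nightly
     sync; echo 2 > /proc/sys/vm/drop_caches   # drops dentries+inodes (only these, not page cache)
 or briefly flipping vfs_cache_pressure to 100 and back to 0.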
 
 hopefully vfs_cache_pressure=0 also has a net small positive impact on
 regular i/o due to reduced iops to OSTs, but I haven't tried to measure
 that.
 slab didn't steal much ram from our read and write_through caches (we
 have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the
 additional cached inodes/dentries) so OSS file caching should be
 almost unaffected.
 
 On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:
 On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:
 limiting the total amount of OSS cache used in order to leave room for
 inodes/dentries might be more useful. the data cache will always fill
 up and push out inodes otherwise.
 
 I disagree with myself now. I think mm/vmscan.c would probably still
 call shrink_slab, so shrinkers would get called and some cached inodes
 would get dropped.
 
 The inode and dentry objects in the slab cache aren't so much of an issue as 
 having the disk blocks that each are generated from available in the buffer 
 cache. Constructing the in-memory inode and dentry objects is cheap as long 
 as the corresponding disk blocks are available. Doing the disk reads, 
 depending on your hardware and some other factors, is not.
 
 on a test cluster (with read and write_through caches still active and
 synthetic i/o load) I didn't see a big change in stat rate from
 dropping OSS page/buffer cache - at most a slowdown for a client
 'ls -lR' of ~2x, and usually no slowdown at all. I suspect this is
 because there is almost zero persistent buffer cache due to the OSS
 buffer and page caches being punished by file i/o.
 in the same testing, dropping OSS inode/dentry caches was a much larger
 effect (up to 60x slowdown with synthetic i/o) - which is why the
 vfs_cache_pressure setting works.
 the synthetic i/o wasn't crazily intensive, but did have a working
 set larger than OSS mem, which is likely true of our production machine.
 
 however for your setup with OSS caches off, and from doing tests on our
 MDS, I agree that buffer caches can be a big effect.
 
 dropping our MDS buffer cache slows down a client 'lfs find' by ~4x,
 but dropping inode/dentry caches doesn't slow it down at all, so
 buffers are definitely important there.
 happily we're not under any memory pressure on our MDS's at the
 moment.
 
 We went to the extreme and disabled the OSS read cache (+ writethrough cache). 
 In addition, on the OSSes we pre-read all of the inode blocks that 

Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Andreas Dilger
On May 3, 2011, at 13:41, Nathan Rutman wrote:
 On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:
 Correct me if I'm wrong, but when I look at the Lustre manual, it says 
 that the client adapts its timeouts, but not the server. I understood 
 that server-to-client RPCs still use the old mechanism, especially in our 
 case where the server seems to be revoking a client lock (is ldlm_timeout 
 used for that?) and the client did not respond.
 
 The server and client cooperate on adaptive timeouts.  I don't 
 remember which bug the ORNL settings were in, maybe 14071; bugzilla's not 
 responding at the moment.  But a big question here is why 25315 seconds for a 
 callback - that's well beyond anything at_max should allow...

I assume that the 25315s is from a bug (fixed in 1.8.5 I think, not sure if it 
was ported to 2.x) that calculated the wrong time when printing this error 
message for LDLM lock timeouts.

 I forgot to say that we also have LNET routers involved in some cases.

If there are routers, they can cause dropped RPCs from the server to the client, 
and the client will then be evicted for unresponsiveness even though it is not at 
fault.  At one time Johann was working on (or at least investigating) a patch to 
give servers the ability to resend RPCs before evicting clients.  The tricky 
part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, 
since that may reduce stability instead of increasing it.
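
A quick way to check whether a router is implicated (a sketch; the NID below is 
just a placeholder) is to look at the route table and ping through the router 
from both ends:

    cat /proc/sys/lnet/routes      # route entries and whether the routers are marked up
    lctl ping 10.10.0.1@o2ib       # placeholder NID of a suspect router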

I think the bugzilla bug was called "limited server-side resend" or something 
similar; I filed it several years ago.



Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss