[Lustre-discuss] Too many client evictions
Hello,

We often see some of our Lustre clients being evicted abusively (the clients seem healthy). The pattern is always the same. All of this is on Lustre 2.0, with adaptive timeouts enabled:

1 - A server complains about a client: "### lock callback timer expired... after 25315s..." (nothing on the client). A few seconds later:
2 - The client receives -107 in reply to an obd_ping for this target (the server says "@@@processing error 107").
3 - The client realizes its connection was lost, notices it was evicted, and reconnects.

(Just to be sure:) when a client is evicted, all in-flight I/O is lost and no recovery will be done for it?

We are thinking of increasing timeouts to give clients more time to answer the LDLM lock revocation (maybe they are just too loaded).
- Is ldlm_timeout enough to do so?
- Do we also need to change obd_timeout accordingly? Is there a risk of triggering new timeouts (cascading timeouts) if we change only ldlm_timeout?

Any feedback in this area is welcome.

Thank you
Aurélien Degrémont

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
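For readers following along, the timeouts being discussed can be inspected on a running system. A sketch, assuming the 2.x lctl parameter names; exact parameter locations vary between Lustre versions, so treat the paths as illustrative:

```shell
# Current static RPC timeout (obd_timeout) and lock-callback timeout (ldlm_timeout)
lctl get_param timeout ldlm_timeout

# Adaptive-timeout state: the floor and ceiling the estimator will use
lctl get_param at_min at_max
```

With adaptive timeouts enabled, the static values are mostly fallbacks, which is the point Andreas makes in his reply.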
Re: [Lustre-discuss] Too many client evictions
I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc.). Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number.

Cheers, Andreas

On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien aurelien.degrem...@cea.fr wrote:
[...]

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
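If tuning AT as Andreas suggests, the knobs are module parameters of ptlrpc on 2.x. A sketch of what such a change could look like; the values here are illustrative only, not the actual LLNL settings referenced above, and parameter paths may differ by version:

```shell
# Inspect the current adaptive-timeout settings
lctl get_param at_min at_max

# Raise the floor so the estimator never produces a too-short timeout
# under light load, and allow a higher ceiling for heavily loaded servers
lctl set_param at_min=40
lctl set_param at_max=600

# To make the change persistent across reboots, set module options instead,
# e.g. in /etc/modprobe.d/lustre.conf:
#   options ptlrpc at_min=40 at_max=600
```

Note that at_min must be set consistently on clients and servers, since both sides participate in the timeout estimate.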
Re: [Lustre-discuss] Too many client evictions
Correct me if I'm wrong, but when I look at the Lustre manual, it says that the client adapts its timeout, but not the server. I understood that server-to-client RPCs still use the old mechanism, especially in our case where it seems the server is revoking a client lock (ldlm_timeout is used for that?) and the client did not respond.

I forgot to say that we also have LNET routers involved in some cases.

Thank you
Aurélien

Andreas Dilger wrote:

I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc.). Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number.

Cheers, Andreas

On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien aurelien.degrem...@cea.fr wrote:
[...]
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Too many client evictions
On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

Correct me if I'm wrong, but when I look at the Lustre manual, it says that the client adapts its timeout, but not the server. I understood that server-to-client RPCs still use the old mechanism, especially in our case where it seems the server is revoking a client lock (ldlm_timeout is used for that?) and the client did not respond.

Server and client cooperate together for the adaptive timeouts. I don't remember which bug the ORNL settings were in, maybe 14071 - bugzilla's not responding at the moment. But a big question here is why 25315 seconds for a callback - that's well beyond anything at_max should allow...

I forgot to say that we also have LNET routers involved in some cases.

Thank you
Aurélien

Andreas Dilger wrote:
[...]
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] improving metadata performance [was Re: question about size on MDS (MDT) for lustre-1.8]
Just to follow up on this issue: we landed a patch for 2.1 that will reduce the default OST cache to objects 8MB or smaller. This can still be tuned via /proc, but is likely to provide better all-around performance by avoiding cache flushes for streaming read and write operations.

Robin, it would be great to know if tuning this would also solve your cache pressure woes without having to resort to disabling the VM cache pressure (which isn't something we can do by default for all users).

Cheers, Andreas

On 2011-02-09, at 8:11 AM, Robin Humble robin.humble+lus...@anu.edu.au wrote:

rejoining this topic after a couple of weeks of experimentation.

Re: trying to improve metadata performance - we've been running with vfs_cache_pressure=0 on OSS's in production for over a week now and it's improved our metadata performance by a large factor.
- filesystem scans that didn't finish in ~30hrs now complete in a little over 3 hours, so a ~10x speedup.
- a recursive 'ls -altrR' of my home dir (on a random uncached client) now runs at 2000 to 4000 files/s, whereas before it could be 100 files/s, so a 20 to 40x speedup.

of course vfs_cache_pressure=0 can be a DANGEROUS setting because inodes/dentries will never be reclaimed, so OSS's could OOM. however slabtop shows inodes are 0.89K and dentries 0.21K, i.e. small, so I expect many sites can (like us) easily cache everything. for a given number of inodes per OST it's easily calculable whether there's enough OSS ram to safely set vfs_cache_pressure=0 and cache them all in slab. continued monitoring of the fs inode growth (== OSS slab size) over time is very important, as fs's will inevitably accrue more files...

sadly a slightly less extreme vfs_cache_pressure=1 wasn't as successful at keeping stat rates high. sustained OSS cache memory pressure through the day dropped enough inodes that nightly scans weren't fast any more.

our current residual issue with vfs_cache_pressure=0 is unexpected.
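Robin's "easily calculable" sizing check can be made concrete. A minimal sketch using the per-object sizes he reports from slabtop; the 3-million-object count is a made-up example, not a number from the thread:

```python
# Estimate the slab memory needed to pin every OST inode and dentry in
# cache before setting vfs_cache_pressure=0, using Robin's slabtop sizes.
INODE_KIB = 0.89   # size of one cached inode object (from slabtop)
DENTRY_KIB = 0.21  # size of one cached dentry object (from slabtop)

def slab_estimate_gib(n_objects: int) -> float:
    """GiB of slab needed to cache n_objects inodes plus their dentries."""
    return n_objects * (INODE_KIB + DENTRY_KIB) / (1024 * 1024)

# A hypothetical OSS holding 3 million objects needs ~3.15 GiB of slab,
# consistent with the 1.6-3.3 GiB growth Robin observed on a 48 GiB OSS.
print(round(slab_estimate_gib(3_000_000), 2))  # → 3.15
```

Comparing that estimate (plus headroom for projected file-count growth) against OSS RAM is the go/no-go test Robin describes.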
the number of OSS dentries appears to slowly grow over time :-/ it appears that some/many dentries for deleted files are not reclaimed without some memory pressure. any idea why that might be?

anyway, I've now added a few lines of code to create a different (non-zero) vfs_cache_pressure knob for dentries. we'll see how that goes... an alternate (simpler) workaround would be to occasionally drop OSS inode/dentry caches, or to set vfs_cache_pressure=100 once in a while, and to just live with a day of slow stat's while the inode caches repopulate.

hopefully vfs_cache_pressure=0 also has a net small positive impact on regular i/o due to reduced iops to OSTs, but I haven't tried to measure that. slab didn't steal much ram from our read and write_through caches (we have 48g ram on OSS's and slab went up about 1.6g to 3.3g with the additional cached inodes/dentries) so OSS file caching should be almost unaffected.

On Fri, Jan 28, 2011 at 09:45:10AM -0800, Jason Rappleye wrote:

On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:

limiting the total amount of OSS cache used in order to leave room for inodes/dentries might be more useful. the data cache will always fill up and push out inodes otherwise.

I disagree with myself now. I think mm/vmscan.c would probably still call shrink_slab, so shrinkers would get called and some cached inodes would get dropped.

The inode and dentry objects in the slab cache aren't so much of an issue as having the disk blocks that each is generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.

on a test cluster (with read and write_through caches still active and a synthetic i/o load) I didn't see a big change in stat rate from dropping the OSS page/buffer cache - at most a slowdown for a client 'ls -lR' of ~2x, and usually no slowdown at all.
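The "occasionally drop OSS inode/dentry caches" workaround mentioned above maps directly onto the kernel's standard drop_caches interface. A sketch; running this periodically (the cadence would be site-specific) must be done as root on the OSS:

```shell
# Flush dirty data first so clean pages/objects are actually reclaimable
sync

# Drop reclaimable dentries and inodes only.
# (echo 1 would drop the page cache; echo 3 drops both.)
echo 2 > /proc/sys/vm/drop_caches
```

As the thread notes, the cost is a window of slow stats while the caches repopulate from disk.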
I suspect this is because there is almost zero persistent buffer cache, due to the OSS buffer and page caches being punished by file i/o. in the same testing, dropping OSS inode/dentry caches had a much larger effect (up to 60x slowdown with synthetic i/o) - which is why the vfs_cache_pressure setting works. the synthetic i/o wasn't crazily intensive, but did have a working set larger than OSS mem, which is likely true of our production machine.

however for your setup with OSS caches off, and from doing tests on our MDS, I agree that buffer caches can be a big effect. dropping our MDS buffer cache slows down a client 'lfs find' by ~4x, but dropping inode/dentry caches doesn't slow it down at all, so buffers are definitely important there. happily we're not under any memory pressure on our MDS's at the moment.

We went to the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that
Re: [Lustre-discuss] Too many client evictions
On May 3, 2011, at 13:41, Nathan Rutman wrote:

On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

Correct me if I'm wrong, but when I look at the Lustre manual, it says that the client adapts its timeout, but not the server. I understood that server-to-client RPCs still use the old mechanism, especially in our case where it seems the server is revoking a client lock (ldlm_timeout is used for that?) and the client did not respond.

Server and client cooperate together for the adaptive timeouts. I don't remember which bug the ORNL settings were in, maybe 14071 - bugzilla's not responding at the moment. But a big question here is why 25315 seconds for a callback - that's well beyond anything at_max should allow...

I assume that the 25315s is from a bug (fixed in 1.8.5, I think; not sure if it was ported to 2.x) that calculated the wrong time when printing this error message for LDLM lock timeouts.

I forgot to say that we have LNET routers also involved for some cases.

If there are routers, they can cause dropped RPCs from the server to the client, and the client will be evicted for unresponsiveness even though it is not at fault. At one time Johann was working on a patch for (or at least investigating) the ability to have servers resend RPCs before evicting clients. The tricky part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, since that may reduce stability instead of increasing it. I think the bugzilla bug was called "limited server-side resend" or similar, filed by me several years ago.

Andreas Dilger wrote:
[...]
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss