Re: [Gluster-devel] md-cache improvements
On Mon, Aug 15, 2016 at 10:39:40PM -0400, Vijay Bellur wrote: > Hi Poornima, Dan - > > Let us have a hangout/bluejeans session this week to discuss the planned > md-cache improvements, proposed timelines and sort out open questions if > any. > > Would 11:00 UTC on Wednesday work for everyone in the To: list? I'd appreciate it if someone could send the meeting minutes. It'll make it easier to follow up and we can provide better status details on the progress. In any case, one of the points that Poornima mentioned was that upcall events (when enabled) get cached in gfapi until the application handles them. NFS-Ganesha is the only application that (currently) is interested in these events. Other use-cases (like md-cache invalidation) would enable upcalls too, and then cause event caching even when not needed. This change should address that, and I'm waiting for feedback on it. There should be a bug report about these unneeded and uncleared caches, but I could not find one... gfapi: do not cache upcalls if the application is not interested http://review.gluster.org/15191 Thanks, Niels > > Thanks, > Vijay > > > > On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote: > > > > My comments inline. > > > > Regards, > > Poornima > > > > - Original Message - > > > From: "Dan Lambright" <dlamb...@redhat.com> > > > To: "Gluster Devel" <gluster-devel@gluster.org> > > > Sent: Wednesday, August 10, 2016 10:35:58 PM > > > Subject: [Gluster-devel] md-cache improvements > > > > > > > > > There have been recurring discussions within the gluster community to > > > build > > > on existing support for md-cache and upcalls to help performance for small > > > file workloads. In certain cases, "lookup amplification" dominates data > > > transfers, i.e. the cumulative round trip times of multiple LOOKUPs from > > > the > > > client mitigates benefits from faster backend storage. 
> > > > > > To tackle this problem, one suggestion is to more aggressively utilize > > > md-cache to cache inodes on the client than is currently done. The inodes > > > would be cached until they are invalidated by the server. > > > > > > Several gluster development engineers within the DHT, NFS, and Samba teams > > > have been involved with related efforts, which have been underway for some > > > time now. At this juncture, comments are requested from gluster > > > developers. > > > > > > (1) .. help call out where additional upcalls would be needed to > > > invalidate > > > stale client cache entries (in particular, need feedback from DHT/AFR > > > areas), > > > > > > (2) .. identify failure cases, when we cannot trust the contents of > > > md-cache, > > > e.g. when an upcall may have been dropped by the network > > > > Yes, this needs to be handled. > > It can happen only when there is a one way disconnect, where the server > > cannot > > reach client and notify fails. We can have a retry for the same until the > > cache > > expiry time. > > > > > > > > (3) .. point out additional improvements which md-cache needs. For > > > example, > > > it cannot be allowed to grow unbounded. > > > > This is being worked on, and will be targetted for 3.9 > > > > > > > > Dan > > > > > > - Original Message - > > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > > > > > > List of areas where we need invalidation notification: > > > > 1. Any changes to xattrs used by xlators to store metadata (like dht > > > > layout > > > > xattr, afr xattrs etc). > > > > Currently, md-cache will negotiate(using ipc) with the brick, a list of > > xattrs > > that it needs invalidation for. Other xlators can add the xattrs they are > > interested > > in to the ipc. But then these xlators need to manage their own caching and > > processing > > the invalidation request, as md-cache will be above all cluater xlators. > > reference: http://review.gluster.org/#/c/15002/ > > > > > > 2. 
Scenarios where an individual xlator feels it needs a lookup: for example, a failed directory creation on the non-hashed subvol in dht during mkdir. Though dht succeeds the mkdir overall, it would be better not to cache this inode, as a subsequent lookup will heal the directory and make things better.
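Niels's point above (gfapi should only queue upcall events when the application has actually registered interest, instead of caching them forever) can be sketched roughly as follows. This is an illustrative Python model, not the gfapi code; all names here are made up, and the real change is the one under review at http://review.gluster.org/15191.

```python
class UpcallReceiver:
    """Toy model of gfapi upcall handling: events are queued for the
    application only if it registered interest; otherwise they are
    dropped after any internal consumers (e.g. md-cache invalidation)
    have seen them, so nothing accumulates unbounded."""

    def __init__(self):
        self.app_registered = False   # set when the application asks for upcalls
        self.pending = []             # events waiting for the application
        self.internal_handlers = []   # e.g. md-cache invalidation hooks

    def register_application(self):
        self.app_registered = True

    def on_upcall(self, event):
        for handler in self.internal_handlers:
            handler(event)                # internal users always see the event
        if self.app_registered:
            self.pending.append(event)    # an NFS-Ganesha-style consumer exists
        # else: drop the event -- nothing will ever poll for it

    def poll(self):
        return self.pending.pop(0) if self.pending else None
```

Without `register_application()`, `pending` stays empty even while upcalls keep arriving, which mirrors the behaviour the proposed fix is after.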
Re: [Gluster-devel] md-cache improvements
- Original Message - > From: "Niels de Vos" <nde...@redhat.com> > To: "Raghavendra G" <raghaven...@gluster.com> > Cc: "Dan Lambright" <dlamb...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org>, "Csaba Henk" > <csaba.h...@gmail.com> > Sent: Wednesday, August 17, 2016 4:49:41 AM > Subject: Re: [Gluster-devel] md-cache improvements > > On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote: > > On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G <raghaven...@gluster.com> > > wrote: > > > > > > > > > > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G <raghaven...@gluster.com> > > > wrote: > > > > > >> Couple of more areas to explore: > > >> 1. purging kernel dentry and/or page-cache too. Because of patch [1], > > >> upcall notification can result in a call to inode_invalidate, which > > >> results > > >> in an "invalidate" notification to fuse kernel module. While I am sure > > >> that, this notification will purge page-cache from kernel, I am not sure > > >> about dentries. I assume if an inode is invalidated, it should result in > > >> a > > >> lookup (from kernel to glusterfs). But neverthless, we should look into > > >> differences between entry_invalidation and inode_invalidation and > > >> harness > > >> them appropriately. > > I do not think fuse handles upcall yet. I think there is a patch for > that somewhere. It's been a while since I looked into that, but I think > invalidating the affected dentries was straight forwards. Can the patch # be tracked down ? I'd like to run some experiments with it + tiering.. > > > >> 2. Granularity of invalidation. For eg., We shouldn't be purging > > >> page-cache in kernel, because of a change in xattr used by an xlator > > >> (eg., > > >> dht layout xattr). We have to make sure that [1] is handling this. 
We > > >> need > > >> to add more granularity into invaldation (like internal xattr > > >> invalidation, > > >> user xattr invalidation, entry invalidation in kernel, page-cache > > >> invalidation in kernel, attribute/stat invalidation in kernel etc) and > > >> use > > >> them judiciously, while making sure other cached data remains to be > > >> present. > > >> > > > > > > To stress the importance of this point, it should be noted that with tier > > > there can be constant migration of files, which can result in spurious > > > (from perspective of application) invalidations, even though application > > > is > > > not doing any writes on files [2][3][4]. Also, even if application is > > > writing to file, there is no point in invalidating dentry cache. We > > > should > > > explore more ways to solve [2][3][4]. > > Actually upcall tracks the client/inode combination, and only sends > upcall events to clients that (recently/timeout?) accessed the inode. > There should not be any upcalls for inodes that the client did not > access. So, when promotion/demotion happens, only the process doing this > should receive the event, not any of the other clients that did not > access the inode. > > > > 3. We've a long standing issue of spurious termination of fuse > > > invalidation thread. Since after termination, the thread is not > > > re-spawned, > > > we would not be able to purge kernel entry/attribute/page-cache. This > > > issue > > > was touched upon during a discussion [5], though we didn't solve the > > > problem then for lack of bandwidth. Csaba has agreed to work on this > > > issue. > > > > > > > 4. Flooding of network with upcall notifications. Is it a problem? If yes, > > does upcall infra already solves it? Would NFS/SMB leases help here? > > I guess some form of flooding is possible when two or more clients do > many directory operations in the same directory. Hmm, now I wonder if a > client gets an upcall event for something it did itself. 
I guess that > would (most often?) not be needed. > > Niels > > > > > > > > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7 > > > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8 > > > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9 > > > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/ > > > fuse/src/fuse-bridge.c > > > > > > >
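Niels's description above (upcall tracks client/inode combinations, only recently-active clients are notified, and the originator of a change arguably should not be) can be modelled as a small sketch. This is speculative: the structure, the timeout value, and the originator-skipping behaviour are assumptions for illustration, not the actual upcall xlator logic.

```python
import time

ACCESS_TIMEOUT = 60.0  # seconds; illustrative, not the real upcall timeout


class UpcallTracker:
    """Track which clients recently accessed which inode, and compute the
    set of clients to notify on a change -- excluding stale entries and
    the client that caused the change itself."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.access = {}  # gfid -> {client_id: last_access_time}

    def record_access(self, gfid, client_id):
        self.access.setdefault(gfid, {})[client_id] = self.now()

    def clients_to_notify(self, gfid, originator):
        cutoff = self.now() - ACCESS_TIMEOUT
        entries = self.access.get(gfid, {})
        # skip clients whose access is too old, and skip the originator
        return sorted(c for c, t in entries.items()
                      if t >= cutoff and c != originator)
```

Under this model, a promotion/demotion performed by the tiering process would only notify other clients that touched the inode recently, which matches the behaviour Niels describes.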
Re: [Gluster-devel] md-cache improvements
On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote: > On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G wrote: > > > > > > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G > > wrote: > > > >> Couple of more areas to explore: > >> 1. purging kernel dentry and/or page-cache too. Because of patch [1], > >> upcall notification can result in a call to inode_invalidate, which results > >> in an "invalidate" notification to the fuse kernel module. While I am sure > >> that this notification will purge the page-cache from the kernel, I am not sure > >> about dentries. I assume that if an inode is invalidated, it should result in a > >> lookup (from the kernel to glusterfs). But nevertheless, we should look into the > >> differences between entry_invalidation and inode_invalidation and harness > >> them appropriately. I do not think fuse handles upcall yet. I think there is a patch for that somewhere. It's been a while since I looked into that, but I think invalidating the affected dentries was straightforward. > >> 2. Granularity of invalidation. For example, we shouldn't be purging > >> page-cache in the kernel because of a change in an xattr used by an xlator (e.g., > >> the dht layout xattr). We have to make sure that [1] is handling this. We need > >> to add more granularity into invalidation (like internal xattr invalidation, > >> user xattr invalidation, entry invalidation in the kernel, page-cache > >> invalidation in the kernel, attribute/stat invalidation in the kernel, etc.) and use > >> them judiciously, while making sure other cached data remains present. > >> > > > > To stress the importance of this point, it should be noted that with tier > > there can be constant migration of files, which can result in spurious > > (from the perspective of the application) invalidations, even though the application is > > not doing any writes on the files [2][3][4]. Also, even if the application is > > writing to a file, there is no point in invalidating the dentry cache. We should > > explore more ways to solve [2][3][4].
Actually upcall tracks the client/inode combination, and only sends upcall events to clients that (recently/timeout?) accessed the inode. There should not be any upcalls for inodes that the client did not access. So, when promotion/demotion happens, only the process doing this should receive the event, not any of the other clients that did not access the inode. > > 3. We've a long standing issue of spurious termination of fuse > > invalidation thread. Since after termination, the thread is not re-spawned, > > we would not be able to purge kernel entry/attribute/page-cache. This issue > > was touched upon during a discussion [5], though we didn't solve the > > problem then for lack of bandwidth. Csaba has agreed to work on this issue. > > > > 4. Flooding of network with upcall notifications. Is it a problem? If yes, > does upcall infra already solves it? Would NFS/SMB leases help here? I guess some form of flooding is possible when two or more clients do many directory operations in the same directory. Hmm, now I wonder if a client gets an upcall event for something it did itself. I guess that would (most often?) not be needed. Niels > > > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7 > > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8 > > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9 > > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/ > > fuse/src/fuse-bridge.c > > > > > >> > >> [1] http://review.gluster.org/12951 > >> > >> > >> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright > >> wrote: > >> > >>> > >>> There have been recurring discussions within the gluster community to > >>> build on existing support for md-cache and upcalls to help performance for > >>> small file workloads. In certain cases, "lookup amplification" dominates > >>> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs > >>> from the client mitigates benefits from faster backend storage. 
> >>> > >>> To tackle this problem, one suggestion is to more aggressively utilize > >>> md-cache to cache inodes on the client than is currently done. The inodes > >>> would be cached until they are invalidated by the server. > >>> > >>> Several gluster development engineers within the DHT, NFS, and Samba > >>> teams have been involved with related efforts, which have been underway > >>> for > >>> some time now. At this juncture, comments are requested from gluster > >>> developers. > >>> > >>> (1) .. help call out where additional upcalls would be needed to > >>> invalidate stale client cache entries (in particular, need feedback from > >>> DHT/AFR areas), > >>> > >>> (2) .. identify failure cases, when we cannot trust the contents of > >>> md-cache, e.g. when an upcall may have been dropped by the network > >>> > >>> (3) .. point out additional improvements which md-cache needs. For > >>> example, it cannot be allowed to grow unbounded. > >>> >
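The entry_invalidation vs inode_invalidation distinction discussed above (dentries vs attributes/page-cache, mirroring libfuse's fuse_lowlevel_notify_inval_entry and fuse_lowlevel_notify_inval_inode) can be sketched as a dispatch decision. The event schema below is invented for illustration; only the entry/inode split is taken from the thread.

```python
# Kinds of cached state the kernel keeps for a fuse mount.
INVAL_ENTRY = "entry"   # a dentry: identified by parent gfid + name
INVAL_INODE = "inode"   # attributes and page cache for one inode


def plan_kernel_invalidation(event):
    """Given a simplified upcall event, decide which kernel-side
    invalidations to issue.  Name-mapping changes need an entry
    invalidation; data/attribute changes need an inode invalidation.
    The event keys here are made up for this sketch."""
    actions = []
    if event.get("renamed") or event.get("unlinked"):
        # the name -> inode mapping changed: the dentry must go
        actions.append((INVAL_ENTRY, event["parent"], event["name"]))
    if event.get("data_changed") or event.get("attrs_changed"):
        # contents or metadata changed: purge attrs and page cache
        actions.append((INVAL_INODE, event["gfid"]))
    return actions
```

The point of keeping the two paths separate is exactly the one made in the thread: an unlink should not have to flush another file's page cache, and a data write should not have to drop a still-valid dentry.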
Re: [Gluster-devel] md-cache improvements
On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G wrote: > > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G > wrote: > >> Couple of more areas to explore: >> 1. purging kernel dentry and/or page-cache too. Because of patch [1], >> upcall notification can result in a call to inode_invalidate, which results >> in an "invalidate" notification to the fuse kernel module. While I am sure >> that this notification will purge the page-cache from the kernel, I am not sure >> about dentries. I assume that if an inode is invalidated, it should result in a >> lookup (from the kernel to glusterfs). But nevertheless, we should look into the >> differences between entry_invalidation and inode_invalidation and harness >> them appropriately. >> >> 2. Granularity of invalidation. For example, we shouldn't be purging >> page-cache in the kernel because of a change in an xattr used by an xlator (e.g., >> the dht layout xattr). We have to make sure that [1] is handling this. We need >> to add more granularity into invalidation (like internal xattr invalidation, >> user xattr invalidation, entry invalidation in the kernel, page-cache >> invalidation in the kernel, attribute/stat invalidation in the kernel, etc.) and use >> them judiciously, while making sure other cached data remains present. >> > > To stress the importance of this point, it should be noted that with tier > there can be constant migration of files, which can result in spurious > (from the perspective of the application) invalidations, even though the application is > not doing any writes on the files [2][3][4]. Also, even if the application is > writing to a file, there is no point in invalidating the dentry cache. We should > explore more ways to solve [2][3][4]. > > 3. We have a long-standing issue of spurious termination of the fuse > invalidation thread. Since the thread is not re-spawned after termination, > we would not be able to purge the kernel entry/attribute/page-cache. This issue > was touched upon during a discussion [5], though we didn't solve the > problem then for lack of bandwidth.
Csaba has agreed to work on this issue. > 4. Flooding of the network with upcall notifications. Is it a problem? If yes, does the upcall infra already solve it? Would NFS/SMB leases help here? > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7 > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8 > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9 > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/ > fuse/src/fuse-bridge.c > > >> >> [1] http://review.gluster.org/12951 >> >> >> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright >> wrote: >> >>> >>> There have been recurring discussions within the gluster community to >>> build on existing support for md-cache and upcalls to help performance for >>> small file workloads. In certain cases, "lookup amplification" dominates >>> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs >>> from the client mitigate the benefits from faster backend storage. >>> >>> To tackle this problem, one suggestion is to utilize md-cache to cache >>> inodes on the client more aggressively than is currently done. The inodes >>> would be cached until they are invalidated by the server. >>> >>> Several gluster development engineers within the DHT, NFS, and Samba >>> teams have been involved with related efforts, which have been underway for >>> some time now. At this juncture, comments are requested from gluster >>> developers. >>> >>> (1) .. help call out where additional upcalls would be needed to >>> invalidate stale client cache entries (in particular, need feedback from >>> DHT/AFR areas), >>> >>> (2) .. identify failure cases, when we cannot trust the contents of >>> md-cache, e.g. when an upcall may have been dropped by the network >>> >>> (3) .. point out additional improvements which md-cache needs. For >>> example, it cannot be allowed to grow unbounded.
>>> >>> Dan >>> >>> - Original Message - >>> > From: "Raghavendra Gowdappa" >>> > >>> > List of areas where we need invalidation notification: >>> > 1. Any changes to xattrs used by xlators to store metadata (like dht >>> layout >>> > xattr, afr xattrs etc). >>> > 2. Scenarios where individual xlator feels like it needs a lookup. For >>> > example failed directory creation on non-hashed subvol in dht during >>> mkdir. >>> > Though dht succeeds mkdir, it would be better to not cache this inode >>> as a >>> > subsequent lookup will heal the directory and make things better. >>> > 3. removing of files >>> > 4. writev on brick (to invalidate read cache on client) >>> > >>> > Other questions: >>> > 5. Does md-cache has cache management? like lru or an upper limit for >>> cache. >>> > 6. Network disconnects and invalidating cache. When a network >>> disconnect >>> > happens we need to invalidate cache for inodes present on that brick >>> as we >>> > might be missing some notifications. Current approach of purging cache >>> of
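Raghavendra's point 2 above (after a partially failed dht mkdir, the inode should not be cached so a later lookup can heal the directory) is answered elsewhere in the thread with a proposal: the xlator sets an indicator in the fop cbk dict and md-cache skips caching. A sketch of that handshake, with a hypothetical dict key and function names:

```python
NO_CACHE_KEY = "md-cache.no-cache"  # illustrative key, not the real dict key


def dht_mkdir_cbk(xdata, mkdir_failed_on_some_subvol):
    """Sketch of a cluster xlator marking a reply as uncacheable when it
    knows a later lookup is needed (e.g. mkdir failed on the non-hashed
    subvol even though the overall mkdir succeeded)."""
    if mkdir_failed_on_some_subvol:
        xdata[NO_CACHE_KEY] = True
    return xdata


def md_cache_store(cache, gfid, stat, xdata):
    """md-cache honours the indicator and skips caching the reply."""
    if xdata.get(NO_CACHE_KEY):
        return False
    cache[gfid] = stat
    return True
```

The design point is that only the xlator that detected the inconsistency knows caching is unsafe, so it has to communicate that upward rather than md-cache trying to guess.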
Re: [Gluster-devel] md-cache improvements
On 08/16/2016 07:22 PM, Michael Adam wrote: Hi all, On 2016-08-15 at 22:39 -0400, Vijay Bellur wrote: Hi Poornima, Dan - Let us have a hangout/bluejeans session this week to discuss the planned md-cache improvements, proposed timelines and sort out open questions if any. Because the initial mail creates the impression that this is a topic that people are merely discussing, let me point out that it has actually moved way beyond that stage already: Poornima has been working hard on these cache improvements since late 2015 at least. (And desperately looking for review and support since at least springtime..) See all her patches that have now finally already gone into master recently (e.g. http://review.gluster.org/#/c/12951/ for an old one that has just been merged) and all the patches that she has still up for review (e.g. http://review.gluster.org/#/c/15002/ for a big one). I perhaps could have provided more context in my email. I have followed some of this work closely and it is in line with how I would like to see md-cache evolve. My intention behind scheduling this meeting is to: a> Get a better understanding of the current state of affairs b> Determine what workload profiles can benefit from this improvement c> Facilitate reviews and address pending open issues, if any, for 3.9 and beyond. These changes were mainly motivated by samba-workloads, since the chatty, md-heavy smb protocol is suffering most notably from the lack of proper caching of this metadata. The good news is that it recently started getting more attention and we are seeing very, very promising performance test results! Full functional and regression testings are also underway. Good to know! Look forward to understand more about the nature of performance improvements in the call or over here :). I think metadata intensive workloads from fuse/gfapi can also benefit from this improvement. We can hopefully start doing more focussed tests to validate this hypothesis post the call. 
Would 11:00 UTC on Wednesday work for everyone in the To: list? Not on the To: list myself, but would work for me.. :-) Although I have to admit it may really be very short notice for some... I agree that this is a very short notice. 3.9 being 6 weeks away is driving the urgency largely. And since Poornima drove the project thus far, and was mainly supported by Rajesh J and R.Talur from the gluster side for long stretches of time, afaict, I think these three should be present bare minimum. Thank you for letting me know whom I missed. Rajesh, R. Talur - look forward to see you folks in the meeting! Regards, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] md-cache improvements
Hi all, On 2016-08-15 at 22:39 -0400, Vijay Bellur wrote: > Hi Poornima, Dan - > > Let us have a hangout/bluejeans session this week to discuss the planned > md-cache improvements, proposed timelines and sort out open questions if > any. Because the initial mail creates the impression that this is a topic that people are merely discussing, let me point out that it has actually moved way beyond that stage already: Poornima has been working hard on these cache improvements since late 2015 at least. (And desperately looking for review and support since at least springtime..) See all her patches that have now gone into master recently (e.g. http://review.gluster.org/#/c/12951/ for an old one that has just been merged) and all the patches that she still has up for review (e.g. http://review.gluster.org/#/c/15002/ for a big one). These changes were mainly motivated by Samba workloads, since the chatty, metadata-heavy SMB protocol suffers most notably from the lack of proper caching of this metadata. The good news is that the work recently started getting more attention and we are seeing very, very promising performance test results! Full functional and regression testing is also underway. Discussing the state of affairs in a real call could be very useful indeed. Sometimes this can be less awkward than using the list.. > Would 11:00 UTC on Wednesday work for everyone in the To: list? Not on the To: list myself, but it would work for me.. :-) Although I have to admit it may really be very short notice for some... And since Poornima drove the project thus far, and was mainly supported by Rajesh J and R.Talur from the gluster side for long stretches of time, afaict, I think these three should be present at a bare minimum. Thanks - Michael > On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote: > > > > My comments inline.
> > > > Regards, > > Poornima > > > > - Original Message - > > > From: "Dan Lambright" <dlamb...@redhat.com> > > > To: "Gluster Devel" <gluster-devel@gluster.org> > > > Sent: Wednesday, August 10, 2016 10:35:58 PM > > > Subject: [Gluster-devel] md-cache improvements > > > > > > > > > There have been recurring discussions within the gluster community to > > > build > > > on existing support for md-cache and upcalls to help performance for small > > > file workloads. In certain cases, "lookup amplification" dominates data > > > transfers, i.e. the cumulative round trip times of multiple LOOKUPs from > > > the > > > client mitigates benefits from faster backend storage. > > > > > > To tackle this problem, one suggestion is to more aggressively utilize > > > md-cache to cache inodes on the client than is currently done. The inodes > > > would be cached until they are invalidated by the server. > > > > > > Several gluster development engineers within the DHT, NFS, and Samba teams > > > have been involved with related efforts, which have been underway for some > > > time now. At this juncture, comments are requested from gluster > > > developers. > > > > > > (1) .. help call out where additional upcalls would be needed to > > > invalidate > > > stale client cache entries (in particular, need feedback from DHT/AFR > > > areas), > > > > > > (2) .. identify failure cases, when we cannot trust the contents of > > > md-cache, > > > e.g. when an upcall may have been dropped by the network > > > > Yes, this needs to be handled. > > It can happen only when there is a one way disconnect, where the server > > cannot > > reach client and notify fails. We can have a retry for the same until the > > cache > > expiry time. > > > > > > > > (3) .. point out additional improvements which md-cache needs. For > > > example, > > > it cannot be allowed to grow unbounded. 
> > > > This is being worked on, and will be targetted for 3.9 > > > > > > > > Dan > > > > > > - Original Message - > > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > > > > > > List of areas where we need invalidation notification: > > > > 1. Any changes to xattrs used by xlators to store metadata (like dht > > > > layout > > > > xattr, afr xattrs etc). > > > > Currently, md-cache will negotiate(using ipc) with the brick, a list of > > xattrs > > that it needs invalidation for. Other xlators can add the xattrs they are > > interested > > in to
Re: [Gluster-devel] md-cache improvements
Hi Poornima, Dan - Let us have a hangout/bluejeans session this week to discuss the planned md-cache improvements, proposed timelines and sort out open questions if any. Would 11:00 UTC on Wednesday work for everyone in the To: list? Thanks, Vijay On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote: My comments inline. Regards, Poornima - Original Message - From: "Dan Lambright" <dlamb...@redhat.com> To: "Gluster Devel" <gluster-devel@gluster.org> Sent: Wednesday, August 10, 2016 10:35:58 PM Subject: [Gluster-devel] md-cache improvements There have been recurring discussions within the gluster community to build on existing support for md-cache and upcalls to help performance for small file workloads. In certain cases, "lookup amplification" dominates data transfers, i.e. the cumulative round trip times of multiple LOOKUPs from the client mitigates benefits from faster backend storage. To tackle this problem, one suggestion is to more aggressively utilize md-cache to cache inodes on the client than is currently done. The inodes would be cached until they are invalidated by the server. Several gluster development engineers within the DHT, NFS, and Samba teams have been involved with related efforts, which have been underway for some time now. At this juncture, comments are requested from gluster developers. (1) .. help call out where additional upcalls would be needed to invalidate stale client cache entries (in particular, need feedback from DHT/AFR areas), (2) .. identify failure cases, when we cannot trust the contents of md-cache, e.g. when an upcall may have been dropped by the network Yes, this needs to be handled. It can happen only when there is a one way disconnect, where the server cannot reach client and notify fails. We can have a retry for the same until the cache expiry time. (3) .. point out additional improvements which md-cache needs. For example, it cannot be allowed to grow unbounded. 
This is being worked on, and will be targeted for 3.9 Dan - Original Message - From: "Raghavendra Gowdappa" <rgowd...@redhat.com> List of areas where we need invalidation notification: 1. Any changes to xattrs used by xlators to store metadata (like the dht layout xattr, afr xattrs etc). Currently, md-cache will negotiate (using IPC) with the brick a list of xattrs that it needs invalidation for. Other xlators can add the xattrs they are interested in to the IPC. But then these xlators need to manage their own caching and process the invalidation request, as md-cache will be above all cluster xlators. reference: http://review.gluster.org/#/c/15002/ 2. Scenarios where an individual xlator feels it needs a lookup. For example, a failed directory creation on the non-hashed subvol in dht during mkdir. Though dht succeeds the mkdir, it would be better not to cache this inode, as a subsequent lookup will heal the directory and make things better. For this, these xlators can specify an indicator in the dict of the fop cbk to not cache. This should be fairly simple to implement. 3. removing of files When an unlink is issued from the mount point, the cache is invalidated. 4. writev on brick (to invalidate read cache on client) A writev on the brick from any other client will invalidate the metadata cache on all the other clients. Other questions: 5. Does md-cache have cache management, like LRU or an upper limit for the cache? Currently md-cache doesn't have any cache management; we will be targeting this for 3.9 6. Network disconnects and invalidating cache. When a network disconnect happens we need to invalidate the cache for inodes present on that brick, as we might be missing some notifications. The current approach of purging the cache of all inodes might not be optimal, as it might roll back the benefits of caching. Also, please note that network disconnects are not rare events. Network disconnects are handled to a minimal extent, where any brick down will cause the whole of the cache to be invalidated.
Invalidating only the list of inodes that belong to that particular brick will need support from the underlying cluster xlators. regards, Raghavendra
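For point 5 above (md-cache currently has no cache management), the planned upper limit would presumably take some LRU-like shape. A minimal sketch, assuming an LRU eviction policy and an arbitrary bound, neither of which the thread commits to:

```python
from collections import OrderedDict


class BoundedMdCache:
    """Minimal LRU-bounded metadata cache: at most max_entries inodes are
    kept, and the least-recently-used entry is evicted first.  Both the
    bound and the policy are illustrative -- the thread only says an
    upper limit is planned, not how it will be implemented."""

    def __init__(self, max_entries=4096):
        self.max_entries = max_entries
        self._entries = OrderedDict()   # gfid -> cached stat/xattrs

    def put(self, gfid, md):
        self._entries[gfid] = md
        self._entries.move_to_end(gfid)           # newest at the tail
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)     # evict the LRU entry

    def get(self, gfid):
        md = self._entries.get(gfid)
        if md is not None:
            self._entries.move_to_end(gfid)       # refresh recency on hit
        return md

    def invalidate(self, gfid):
        """Server-driven upcall invalidation for one inode."""
        self._entries.pop(gfid, None)
```

With a bound in place, the "grow unbounded" concern from Dan's original mail becomes a tuning question (how large a bound) rather than a correctness one.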
Re: [Gluster-devel] md-cache improvements
My comments inline. Regards, Poornima - Original Message - > From: "Dan Lambright" <dlamb...@redhat.com> > To: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, August 10, 2016 10:35:58 PM > Subject: [Gluster-devel] md-cache improvements > > > There have been recurring discussions within the gluster community to build > on existing support for md-cache and upcalls to help performance for small > file workloads. In certain cases, "lookup amplification" dominates data > transfers, i.e. the cumulative round trip times of multiple LOOKUPs from the > client mitigate the benefits from faster backend storage. > > To tackle this problem, one suggestion is to utilize md-cache to cache > inodes on the client more aggressively than is currently done. The inodes > would be cached until they are invalidated by the server. > > Several gluster development engineers within the DHT, NFS, and Samba teams > have been involved with related efforts, which have been underway for some > time now. At this juncture, comments are requested from gluster developers. > > (1) .. help call out where additional upcalls would be needed to invalidate > stale client cache entries (in particular, need feedback from DHT/AFR > areas), > > (2) .. identify failure cases, when we cannot trust the contents of md-cache, > e.g. when an upcall may have been dropped by the network Yes, this needs to be handled. It can happen only when there is a one-way disconnect, where the server cannot reach the client and the notification fails. We can retry the notification until the cache expiry time. > > (3) .. point out additional improvements which md-cache needs. For example, > it cannot be allowed to grow unbounded. This is being worked on, and will be targeted for 3.9 > > Dan > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > > List of areas where we need invalidation notification: > > 1.
Any changes to xattrs used by xlators to store metadata (like dht layout > > xattr, afr xattrs etc). Currently, md-cache will negotiate (using IPC) with the brick a list of xattrs that it needs invalidation for. Other xlators can add the xattrs they are interested in to the IPC. But then these xlators need to manage their own caching and process the invalidation request, as md-cache will be above all cluster xlators. reference: http://review.gluster.org/#/c/15002/ > > 2. Scenarios where an individual xlator feels it needs a lookup. For > > example, a failed directory creation on the non-hashed subvol in dht during mkdir. > > Though dht succeeds the mkdir, it would be better not to cache this inode, as a > > subsequent lookup will heal the directory and make things better. For this, these xlators can specify an indicator in the dict of the fop cbk to not cache. This should be fairly simple to implement. > > 3. removing of files When an unlink is issued from the mount point, the cache is invalidated. > > 4. writev on brick (to invalidate read cache on client) A writev on the brick from any other client will invalidate the metadata cache on all the other clients. > > > > Other questions: > > 5. Does md-cache have cache management, like LRU or an upper limit for > > the cache? Currently md-cache doesn't have any cache management; we will be targeting this for 3.9 > > 6. Network disconnects and invalidating cache. When a network disconnect > > happens we need to invalidate the cache for inodes present on that brick, as we > > might be missing some notifications. The current approach of purging the cache of > > all inodes might not be optimal, as it might roll back the benefits of caching. > > Also, please note that network disconnects are not rare events. Network disconnects are handled to a minimal extent, where any brick down will cause the whole of the cache to be invalidated.
Invalidating only the list of inodes that belong to that perticular brick will need the support from the underlying cluster xlators. > > > > regards, > > Raghavendra > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] md-cache improvements
Couple of more areas to explore:

1. Purging the kernel dentry and/or page-cache too. Because of patch [1],
an upcall notification can result in a call to inode_invalidate, which
results in an "invalidate" notification to the fuse kernel module. While I
am sure that this notification will purge the page-cache from the kernel, I
am not sure about dentries. I assume that if an inode is invalidated, it
should result in a lookup (from the kernel to glusterfs). But nevertheless,
we should look into the differences between entry_invalidation and
inode_invalidation and harness them appropriately.

2. Granularity of invalidation. For example, we shouldn't be purging the
kernel page-cache because of a change in an xattr used by an xlator (e.g.,
the dht layout xattr). We have to make sure that [1] is handling this. We
need to add more granularity to invalidation (like internal xattr
invalidation, user xattr invalidation, entry invalidation in the kernel,
page-cache invalidation in the kernel, attribute/stat invalidation in the
kernel, etc.) and use them judiciously, while making sure other cached data
remains present.

[1] http://review.gluster.org/12951

On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright wrote:
> [original message quoted in full above]

--
Raghavendra G