Re: [Gluster-devel] md-cache improvements

2016-08-18 Thread Niels de Vos
On Mon, Aug 15, 2016 at 10:39:40PM -0400, Vijay Bellur wrote:
> Hi Poornima, Dan -
> 
> Let us have a hangout/bluejeans session this week to discuss the planned
> md-cache improvements, proposed timelines and sort out open questions if
> any.
> 
> Would 11:00 UTC on Wednesday work for everyone in the To: list?

I'd appreciate it if someone could send the meeting minutes. It'll make
it easier to follow up and we can provide better status details on the
progress.

In any case, one of the points that Poornima mentioned was that upcall
events (when enabled) get cached in gfapi until the application handles
them. NFS-Ganesha is the only application that (currently) is interested
in these events. Other use-cases (like md-cache invalidation) would
enable upcalls too, and then cause event caching even when not needed.

This change should address that, and I'm waiting for feedback on it.
There should be a bug report about these unneeded and uncleared caches,
but I could not find one...

  gfapi: do not cache upcalls if the application is not interested
  http://review.gluster.org/15191
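
(Purely as an illustration of the idea in that change, not the actual patch:
the guard boils down to something like the sketch below. The cache_upcalls
flag and the function name are hypothetical.)

    /* Hypothetical sketch: only queue an upcall event when the application
     * has registered interest in upcalls; otherwise drop it so the list in
     * gfapi cannot grow unbounded. Names are illustrative, not the real
     * gfapi internals. */
    static void
    glfs_upcall_cbk_sketch (struct glfs *fs, struct gf_upcall *upcall)
    {
            if (!fs->cache_upcalls) {
                    /* e.g. a consumer that only wants md-cache invalidation
                     * and never polls for upcall events */
                    return;
            }

            /* ... otherwise copy the event and append it to the per-fs
             * upcall list for the application to poll later ... */
    }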

Thanks,
Niels


> 
> Thanks,
> Vijay
> 
> 
> 
> On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote:
> > 
> > My comments inline.
> > 
> > Regards,
> > Poornima
> > 
> > - Original Message -
> > > From: "Dan Lambright" <dlamb...@redhat.com>
> > > To: "Gluster Devel" <gluster-devel@gluster.org>
> > > Sent: Wednesday, August 10, 2016 10:35:58 PM
> > > Subject: [Gluster-devel] md-cache improvements
> > > 
> > > 
> > > There have been recurring discussions within the gluster community to 
> > > build
> > > on existing support for md-cache and upcalls to help performance for small
> > > file workloads. In certain cases, "lookup amplification" dominates data
> > > transfers, i.e. the cumulative round trip times of multiple LOOKUPs from 
> > > the
> > > client mitigate benefits from faster backend storage.
> > > 
> > > To tackle this problem, one suggestion is to more aggressively utilize
> > > md-cache to cache inodes on the client than is currently done. The inodes
> > > would be cached until they are invalidated by the server.
> > > 
> > > Several gluster development engineers within the DHT, NFS, and Samba teams
> > > have been involved with related efforts, which have been underway for some
> > > time now. At this juncture, comments are requested from gluster 
> > > developers.
> > > 
> > > (1) .. help call out where additional upcalls would be needed to 
> > > invalidate
> > > stale client cache entries (in particular, need feedback from DHT/AFR
> > > areas),
> > > 
> > > (2) .. identify failure cases, when we cannot trust the contents of 
> > > md-cache,
> > > e.g. when an upcall may have been dropped by the network
> > 
> > Yes, this needs to be handled.
> > It can happen only when there is a one-way disconnect, where the server
> > cannot reach the client and the notification fails. We can retry it until
> > the cache expiry time.
> > 
> > > 
> > > (3) .. point out additional improvements which md-cache needs. For 
> > > example,
> > > it cannot be allowed to grow unbounded.
> > 
> > This is being worked on, and will be targeted for 3.9
> > 
> > > 
> > > Dan
> > > 
> > > - Original Message -
> > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > 
> > > > List of areas where we need invalidation notification:
> > > > 1. Any changes to xattrs used by xlators to store metadata (like dht 
> > > > layout
> > > > xattr, afr xattrs etc).
> > 
> > Currently, md-cache will negotiate (using ipc) with the brick a list of
> > xattrs that it needs invalidation for. Other xlators can add the xattrs
> > they are interested in to the ipc. But then these xlators need to manage
> > their own caching and process the invalidation requests, as md-cache will
> > be above all cluster xlators.
> > reference: http://review.gluster.org/#/c/15002/
> > 
> > > > 2. Scenarios where individual xlator feels like it needs a lookup. For
> > > > example failed directory creation on non-hashed subvol in dht during 
> > > > mkdir.
> > > > Though dht succeeds mkdir, it would be better to not cache this inode 
> > > > as a
> > > > subsequent lookup will heal the directory and make things better.

Re: [Gluster-devel] md-cache improvements

2016-08-17 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" <nde...@redhat.com>
> To: "Raghavendra G" <raghaven...@gluster.com>
> Cc: "Dan Lambright" <dlamb...@redhat.com>, "Gluster Devel" 
> <gluster-devel@gluster.org>, "Csaba Henk"
> <csaba.h...@gmail.com>
> Sent: Wednesday, August 17, 2016 4:49:41 AM
> Subject: Re: [Gluster-devel] md-cache improvements
> 
> On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote:
> > On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G <raghaven...@gluster.com>
> > wrote:
> > 
> > >
> > >
> > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G <raghaven...@gluster.com>
> > > wrote:
> > >
> > >> Couple of more areas to explore:
> > >> 1. purging kernel dentry and/or page-cache too. Because of patch [1],
> > >> upcall notification can result in a call to inode_invalidate, which
> > >> results
> > >> in an "invalidate" notification to fuse kernel module. While I am sure
> > >> that, this notification will purge page-cache from kernel, I am not sure
> > >> about dentries. I assume if an inode is invalidated, it should result in
> > >> a
> > >> lookup (from kernel to glusterfs). But nevertheless, we should look into
> > >> differences between entry_invalidation and inode_invalidation and
> > >> harness
> > >> them appropriately.
> 
> I do not think fuse handles upcall yet. I think there is a patch for
> that somewhere. It's been a while since I looked into that, but I think
> invalidating the affected dentries was straightforward.

Can the patch # be tracked down? I'd like to run some experiments with it +
tiering.


> 
> > >> 2. Granularity of invalidation. For eg., We shouldn't be purging
> > >> page-cache in kernel, because of a change in xattr used by an xlator
> > >> (eg.,
> > >> dht layout xattr). We have to make sure that [1] is handling this. We
> > >> need
> > >> to add more granularity into invalidation (like internal xattr
> > >> invalidation,
> > >> user xattr invalidation, entry invalidation in kernel, page-cache
> > >> invalidation in kernel, attribute/stat invalidation in kernel etc) and
> > >> use
> > >> them judiciously, while making sure other cached data remains to be
> > >> present.
> > >>
> > >
> > > To stress the importance of this point, it should be noted that with tier
> > > there can be constant migration of files, which can result in spurious
> > > (from perspective of application) invalidations, even though application
> > > is
> > > not doing any writes on files [2][3][4]. Also, even if application is
> > > writing to file, there is no point in invalidating dentry cache. We
> > > should
> > > explore more ways to solve [2][3][4].
> 
> Actually upcall tracks the client/inode combination, and only sends
> upcall events to clients that (recently/timeout?) accessed the inode.
> There should not be any upcalls for inodes that the client did not
> access. So, when promotion/demotion happens, only the process doing this
> should receive the event, not any of the other clients that did not
> access the inode.
> 
> > > 3. We've a long standing issue of spurious termination of fuse
> > > invalidation thread. Since after termination, the thread is not
> > > re-spawned,
> > > we would not be able to purge kernel entry/attribute/page-cache. This
> > > issue
> > > was touched upon during a discussion [5], though we didn't solve the
> > > problem then for lack of bandwidth. Csaba has agreed to work on this
> > > issue.
> > >
> > 
> > 4. Flooding of network with upcall notifications. Is it a problem? If yes,
> > does the upcall infra already solve it? Would NFS/SMB leases help here?
> 
> I guess some form of flooding is possible when two or more clients do
> many directory operations in the same directory. Hmm, now I wonder if a
> client gets an upcall event for something it did itself. I guess that
> would (most often?) not be needed.
> 
> Niels
> 
> 
> > 
> > 
> > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> > > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> > > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> > > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/
> > > fuse/src/fuse-bridge.c
> > >
> > >
>

Re: [Gluster-devel] md-cache improvements

2016-08-17 Thread Niels de Vos
On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote:
> On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G 
> wrote:
> 
> >
> >
> > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G 
> > wrote:
> >
> >> Couple of more areas to explore:
> >> 1. purging kernel dentry and/or page-cache too. Because of patch [1],
> >> upcall notification can result in a call to inode_invalidate, which results
> >> in an "invalidate" notification to fuse kernel module. While I am sure
> >> that, this notification will purge page-cache from kernel, I am not sure
> >> about dentries. I assume if an inode is invalidated, it should result in a
> >> lookup (from kernel to glusterfs). But nevertheless, we should look into
> >> differences between entry_invalidation and inode_invalidation and harness
> >> them appropriately.

I do not think fuse handles upcall yet. I think there is a patch for
that somewhere. It's been a while since I looked into that, but I think
invalidating the affected dentries was straightforward.

> >> 2. Granularity of invalidation. For eg., We shouldn't be purging
> >> page-cache in kernel, because of a change in xattr used by an xlator (eg.,
> >> dht layout xattr). We have to make sure that [1] is handling this. We need
> >> to add more granularity into invalidation (like internal xattr invalidation,
> >> user xattr invalidation, entry invalidation in kernel, page-cache
> >> invalidation in kernel, attribute/stat invalidation in kernel etc) and use
> >> them judiciously, while making sure other cached data remains to be 
> >> present.
> >>
> >
> > To stress the importance of this point, it should be noted that with tier
> > there can be constant migration of files, which can result in spurious
> > (from perspective of application) invalidations, even though application is
> > not doing any writes on files [2][3][4]. Also, even if application is
> > writing to file, there is no point in invalidating dentry cache. We should
> > explore more ways to solve [2][3][4].

Actually upcall tracks the client/inode combination, and only sends
upcall events to clients that (recently/timeout?) accessed the inode.
There should not be any upcalls for inodes that the client did not
access. So, when promotion/demotion happens, only the process doing this
should receive the event, not any of the other clients that did not
access the inode.

> > 3. We've a long standing issue of spurious termination of fuse
> > invalidation thread. Since after termination, the thread is not re-spawned,
> > we would not be able to purge kernel entry/attribute/page-cache. This issue
> > was touched upon during a discussion [5], though we didn't solve the
> > problem then for lack of bandwidth. Csaba has agreed to work on this issue.
> >
> 
> 4. Flooding of network with upcall notifications. Is it a problem? If yes,
> does the upcall infra already solve it? Would NFS/SMB leases help here?

I guess some form of flooding is possible when two or more clients do
many directory operations in the same directory. Hmm, now I wonder if a
client gets an upcall event for something it did itself. I guess that
would (most often?) not be needed.

Niels


> 
> 
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/
> > fuse/src/fuse-bridge.c
> >
> >
> >>
> >> [1] http://review.gluster.org/12951
> >>
> >>
> >> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright 
> >> wrote:
> >>
> >>>
> >>> There have been recurring discussions within the gluster community to
> >>> build on existing support for md-cache and upcalls to help performance for
> >>> small file workloads. In certain cases, "lookup amplification" dominates
> >>> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs
> >>> from the client mitigate benefits from faster backend storage.
> >>>
> >>> To tackle this problem, one suggestion is to more aggressively utilize
> >>> md-cache to cache inodes on the client than is currently done. The inodes
> >>> would be cached until they are invalidated by the server.
> >>>
> >>> Several gluster development engineers within the DHT, NFS, and Samba
> >>> teams have been involved with related efforts, which have been underway 
> >>> for
> >>> some time now. At this juncture, comments are requested from gluster
> >>> developers.
> >>>
> >>> (1) .. help call out where additional upcalls would be needed to
> >>> invalidate stale client cache entries (in particular, need feedback from
> >>> DHT/AFR areas),
> >>>
> >>> (2) .. identify failure cases, when we cannot trust the contents of
> >>> md-cache, e.g. when an upcall may have been dropped by the network
> >>>
> >>> (3) .. point out additional improvements which md-cache needs. For
> >>> example, it cannot be allowed to grow unbounded.
> >>>
> 

Re: [Gluster-devel] md-cache improvements

2016-08-17 Thread Raghavendra G
On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G 
wrote:

>
>
> On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G 
> wrote:
>
>> Couple of more areas to explore:
>> 1. purging kernel dentry and/or page-cache too. Because of patch [1],
>> upcall notification can result in a call to inode_invalidate, which results
>> in an "invalidate" notification to fuse kernel module. While I am sure
>> that, this notification will purge page-cache from kernel, I am not sure
>> about dentries. I assume if an inode is invalidated, it should result in a
>> lookup (from kernel to glusterfs). But nevertheless, we should look into
>> differences between entry_invalidation and inode_invalidation and harness
>> them appropriately.
>>
>> 2. Granularity of invalidation. For eg., We shouldn't be purging
>> page-cache in kernel, because of a change in xattr used by an xlator (eg.,
>> dht layout xattr). We have to make sure that [1] is handling this. We need
>> to add more granularity into invalidation (like internal xattr invalidation,
>> user xattr invalidation, entry invalidation in kernel, page-cache
>> invalidation in kernel, attribute/stat invalidation in kernel etc) and use
>> them judiciously, while making sure other cached data remains to be present.
>>
>
> To stress the importance of this point, it should be noted that with tier
> there can be constant migration of files, which can result in spurious
> (from perspective of application) invalidations, even though application is
> not doing any writes on files [2][3][4]. Also, even if application is
> writing to file, there is no point in invalidating dentry cache. We should
> explore more ways to solve [2][3][4].
>
> 3. We've a long standing issue of spurious termination of fuse
> invalidation thread. Since after termination, the thread is not re-spawned,
> we would not be able to purge kernel entry/attribute/page-cache. This issue
> was touched upon during a discussion [5], though we didn't solve the
> problem then for lack of bandwidth. Csaba has agreed to work on this issue.
>

4. Flooding of network with upcall notifications. Is it a problem? If yes,
does the upcall infra already solve it? Would NFS/SMB leases help here?


> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> [5] http://review.gluster.org/#/c/13274/1/xlators/mount/
> fuse/src/fuse-bridge.c
>
>
>>
>> [1] http://review.gluster.org/12951
>>
>>
>> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright 
>> wrote:
>>
>>>
>>> There have been recurring discussions within the gluster community to
>>> build on existing support for md-cache and upcalls to help performance for
>>> small file workloads. In certain cases, "lookup amplification" dominates
>>> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs
> >>> from the client mitigate benefits from faster backend storage.
>>>
>>> To tackle this problem, one suggestion is to more aggressively utilize
>>> md-cache to cache inodes on the client than is currently done. The inodes
>>> would be cached until they are invalidated by the server.
>>>
>>> Several gluster development engineers within the DHT, NFS, and Samba
>>> teams have been involved with related efforts, which have been underway for
>>> some time now. At this juncture, comments are requested from gluster
>>> developers.
>>>
>>> (1) .. help call out where additional upcalls would be needed to
>>> invalidate stale client cache entries (in particular, need feedback from
>>> DHT/AFR areas),
>>>
>>> (2) .. identify failure cases, when we cannot trust the contents of
>>> md-cache, e.g. when an upcall may have been dropped by the network
>>>
>>> (3) .. point out additional improvements which md-cache needs. For
>>> example, it cannot be allowed to grow unbounded.
>>>
>>> Dan
>>>
>>> - Original Message -
>>> > From: "Raghavendra Gowdappa" 
>>> >
>>> > List of areas where we need invalidation notification:
>>> > 1. Any changes to xattrs used by xlators to store metadata (like dht
>>> layout
>>> > xattr, afr xattrs etc).
>>> > 2. Scenarios where individual xlator feels like it needs a lookup. For
>>> > example failed directory creation on non-hashed subvol in dht during
>>> mkdir.
>>> > Though dht succeeds mkdir, it would be better to not cache this inode
>>> as a
>>> > subsequent lookup will heal the directory and make things better.
>>> > 3. removing of files
>>> > 4. writev on brick (to invalidate read cache on client)
>>> >
>>> > Other questions:
> >>> > 5. Does md-cache have cache management, like an lru or an upper limit
> >>> > for the cache?
>>> > 6. Network disconnects and invalidating cache. When a network
>>> disconnect
>>> > happens we need to invalidate cache for inodes present on that brick
>>> as we
> >>> > might be missing some notifications. Current approach of purging cache of
> >>> > all inodes might not be optimal as it might rollback benefits of caching.
> >>> > Also, please note that network disconnects are not rare events.

Re: [Gluster-devel] md-cache improvements

2016-08-16 Thread Vijay Bellur

On 08/16/2016 07:22 PM, Michael Adam wrote:

Hi all,

On 2016-08-15 at 22:39 -0400, Vijay Bellur wrote:

Hi Poornima, Dan -

Let us have a hangout/bluejeans session this week to discuss the planned
md-cache improvements, proposed timelines and sort out open questions if
any.


Because the initial mail creates the impression that this is
a topic that people are merely discussing, let me point out
that it has actually moved way beyond that stage already:

Poornima has been working hard on these cache improvements
since late 2015 at least. (And desperately looking for review
and support since at least springtime..) See all her patches
that have now finally already gone into master recently
(e.g. http://review.gluster.org/#/c/12951/ for an old one
that has just been merged)
and all the patches that she has still up for review
(e.g. http://review.gluster.org/#/c/15002/ for a big one).


I perhaps could have provided more context in my email. I have followed 
some of this work closely and it is in line with how I would like to see 
md-cache evolve. My intention behind scheduling this meeting is to:


a> Get a better understanding of the current state of affairs

b> Determine what workload profiles can benefit from this improvement

c> Facilitate reviews and address pending open issues, if any, for 3.9 
and beyond.




These changes were mainly motivated by samba-workloads,
since the chatty, md-heavy smb protocol is suffering most
notably from the lack of proper caching of this metadata.
The good news is that it recently started getting more
attention and we are seeing very, very promising performance
test results!
Full functional and regression testing is also underway.


Good to know! Look forward to understanding more about the nature of the
performance improvements in the call or over here :). I think metadata-intensive
workloads from fuse/gfapi can also benefit from this improvement. We can
hopefully start doing more focused tests to validate this hypothesis after the
call.





Would 11:00 UTC on Wednesday work for everyone in the To: list?


Not on the To: list myself, but would work for me.. :-)
Although I have to admit it may really be very short notice for
some...



I agree that this is very short notice. 3.9 being six weeks away is largely
driving the urgency.



And since Poornima drove the project thus far, and was mainly
supported by Rajesh J and R.Talur from the gluster side for long
stretches of time, afaict, I think these three should be present at a bare
minimum.



Thank you for letting me know whom I missed. Rajesh, R. Talur - look forward
to seeing you folks in the meeting!


Regards,
Vijay


Re: [Gluster-devel] md-cache improvements

2016-08-16 Thread Michael Adam
Hi all,

On 2016-08-15 at 22:39 -0400, Vijay Bellur wrote:
> Hi Poornima, Dan -
> 
> Let us have a hangout/bluejeans session this week to discuss the planned
> md-cache improvements, proposed timelines and sort out open questions if
> any.

Because the initial mail creates the impression that this is
a topic that people are merely discussing, let me point out
that it has actually moved way beyond that stage already:

Poornima has been working hard on these cache improvements
since late 2015 at least. (And desperately looking for review
and support since at least springtime..) See all her patches
that have now finally already gone into master recently
(e.g. http://review.gluster.org/#/c/12951/ for an old one
that has just been merged)
and all the patches that she has still up for review
(e.g. http://review.gluster.org/#/c/15002/ for a big one).

These changes were mainly motivated by samba-workloads,
since the chatty, md-heavy smb protocol is suffering most
notably from the lack of proper caching of this metadata.
The good news is that it recently started getting more
attention and we are seeing very, very promising performance
test results!
Full functional and regression testing is also underway.

Discussing the state of affairs in a real call could be very useful indeed.
Sometimes this can be less awkward than using the list.

> Would 11:00 UTC on Wednesday work for everyone in the To: list?

Not on the To: list myself, but would work for me.. :-)
Although I have to admit it may really be very short notice for
some...

And since Poornima drove the project thus far, and was mainly
supported by Rajesh J and R.Talur from the gluster side for long
stretches of time, afaict, I think these three should be present at a bare
minimum.

Thanks - Michael


> On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote:
> > 
> > My comments inline.
> > 
> > Regards,
> > Poornima
> > 
> > - Original Message -
> > > From: "Dan Lambright" <dlamb...@redhat.com>
> > > To: "Gluster Devel" <gluster-devel@gluster.org>
> > > Sent: Wednesday, August 10, 2016 10:35:58 PM
> > > Subject: [Gluster-devel] md-cache improvements
> > > 
> > > 
> > > There have been recurring discussions within the gluster community to 
> > > build
> > > on existing support for md-cache and upcalls to help performance for small
> > > file workloads. In certain cases, "lookup amplification" dominates data
> > > transfers, i.e. the cumulative round trip times of multiple LOOKUPs from 
> > > the
> > > client mitigate benefits from faster backend storage.
> > > 
> > > To tackle this problem, one suggestion is to more aggressively utilize
> > > md-cache to cache inodes on the client than is currently done. The inodes
> > > would be cached until they are invalidated by the server.
> > > 
> > > Several gluster development engineers within the DHT, NFS, and Samba teams
> > > have been involved with related efforts, which have been underway for some
> > > time now. At this juncture, comments are requested from gluster 
> > > developers.
> > > 
> > > (1) .. help call out where additional upcalls would be needed to 
> > > invalidate
> > > stale client cache entries (in particular, need feedback from DHT/AFR
> > > areas),
> > > 
> > > (2) .. identify failure cases, when we cannot trust the contents of 
> > > md-cache,
> > > e.g. when an upcall may have been dropped by the network
> > 
> > Yes, this needs to be handled.
> > It can happen only when there is a one-way disconnect, where the server
> > cannot reach the client and the notification fails. We can retry it until
> > the cache expiry time.
> > 
> > > 
> > > (3) .. point out additional improvements which md-cache needs. For 
> > > example,
> > > it cannot be allowed to grow unbounded.
> > 
> > This is being worked on, and will be targeted for 3.9
> > 
> > > 
> > > Dan
> > > 
> > > - Original Message -
> > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > 
> > > > List of areas where we need invalidation notification:
> > > > 1. Any changes to xattrs used by xlators to store metadata (like dht 
> > > > layout
> > > > xattr, afr xattrs etc).
> > 
> > Currently, md-cache will negotiate (using ipc) with the brick a list of
> > xattrs that it needs invalidation for. Other xlators can add the xattrs
> > they are interested in to the ipc. But then these xlators need to manage
> > their own caching and process the invalidation requests, as md-cache will
> > be above all cluster xlators.
> > reference: http://review.gluster.org/#/c/15002/

Re: [Gluster-devel] md-cache improvements

2016-08-15 Thread Vijay Bellur

Hi Poornima, Dan -

Let us have a hangout/bluejeans session this week to discuss the planned 
md-cache improvements, proposed timelines and sort out open questions if 
any.


Would 11:00 UTC on Wednesday work for everyone in the To: list?

Thanks,
Vijay



On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote:


My comments inline.

Regards,
Poornima

- Original Message -

From: "Dan Lambright" <dlamb...@redhat.com>
To: "Gluster Devel" <gluster-devel@gluster.org>
Sent: Wednesday, August 10, 2016 10:35:58 PM
Subject: [Gluster-devel] md-cache improvements


There have been recurring discussions within the gluster community to build
on existing support for md-cache and upcalls to help performance for small
file workloads. In certain cases, "lookup amplification" dominates data
transfers, i.e. the cumulative round trip times of multiple LOOKUPs from the
client mitigate benefits from faster backend storage.

To tackle this problem, one suggestion is to more aggressively utilize
md-cache to cache inodes on the client than is currently done. The inodes
would be cached until they are invalidated by the server.

Several gluster development engineers within the DHT, NFS, and Samba teams
have been involved with related efforts, which have been underway for some
time now. At this juncture, comments are requested from gluster developers.

(1) .. help call out where additional upcalls would be needed to invalidate
stale client cache entries (in particular, need feedback from DHT/AFR
areas),

(2) .. identify failure cases, when we cannot trust the contents of md-cache,
e.g. when an upcall may have been dropped by the network


Yes, this needs to be handled.
It can happen only when there is a one-way disconnect, where the server cannot
reach the client and the notification fails. We can retry it until the cache
expiry time.



(3) .. point out additional improvements which md-cache needs. For example,
it cannot be allowed to grow unbounded.


This is being worked on, and will be targeted for 3.9



Dan

- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>

List of areas where we need invalidation notification:
1. Any changes to xattrs used by xlators to store metadata (like dht layout
xattr, afr xattrs etc).


Currently, md-cache will negotiate (using ipc) with the brick a list of xattrs
that it needs invalidation for. Other xlators can add the xattrs they are
interested in to the ipc. But then these xlators need to manage their own
caching and process the invalidation requests, as md-cache will be above all
cluster xlators.
reference: http://review.gluster.org/#/c/15002/


2. Scenarios where individual xlator feels like it needs a lookup. For
example failed directory creation on non-hashed subvol in dht during mkdir.
Though dht succeeds mkdir, it would be better to not cache this inode as a
subsequent lookup will heal the directory and make things better.


For this, these xlators can specify an indicator in the dict of
the fop cbk, to not cache. This should be fairly simple to implement.


3. removing of files


When an unlink is issued from the mount point, the cache is invalidated.


4. writev on brick (to invalidate read cache on client)


writev on brick from any other client will invalidate the metadata cache on all
the other clients.



Other questions:
5. Does md-cache have cache management, like an lru or an upper limit for the
cache?


Currently md-cache doesn't have any cache management; we will be targeting this
for 3.9.


6. Network disconnects and invalidating cache. When a network disconnect
happens we need to invalidate cache for inodes present on that brick as we
might be missing some notifications. Current approach of purging cache of
all inodes might not be optimal as it might rollback benefits of caching.
Also, please note that network disconnects are not rare events.


Network disconnects are handled to a minimal extent, where any brick down will
cause the whole of the cache to be invalidated. Invalidating only the list of
inodes that belong to that particular brick will need support from the
underlying cluster xlators.



regards,
Raghavendra



Re: [Gluster-devel] md-cache improvements

2016-08-10 Thread Poornima Gurusiddaiah

My comments inline.

Regards,
Poornima

- Original Message -
> From: "Dan Lambright" <dlamb...@redhat.com>
> To: "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Wednesday, August 10, 2016 10:35:58 PM
> Subject: [Gluster-devel] md-cache improvements
> 
> 
> There have been recurring discussions within the gluster community to build
> on existing support for md-cache and upcalls to help performance for small
> file workloads. In certain cases, "lookup amplification" dominates data
> transfers, i.e. the cumulative round trip times of multiple LOOKUPs from the
> client mitigate benefits from faster backend storage.
> 
> To tackle this problem, one suggestion is to more aggressively utilize
> md-cache to cache inodes on the client than is currently done. The inodes
> would be cached until they are invalidated by the server.
> 
> Several gluster development engineers within the DHT, NFS, and Samba teams
> have been involved with related efforts, which have been underway for some
> time now. At this juncture, comments are requested from gluster developers.
> 
> (1) .. help call out where additional upcalls would be needed to invalidate
> stale client cache entries (in particular, need feedback from DHT/AFR
> areas),
> 
> (2) .. identify failure cases, when we cannot trust the contents of md-cache,
> e.g. when an upcall may have been dropped by the network

Yes, this needs to be handled.
It can happen only when there is a one-way disconnect, where the server cannot
reach the client and the notification fails. We can retry it until the cache
expiry time.
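
A rough sketch of what such a retry could look like on the server side; the
names (upcall_send, CACHE_EXPIRY_SECS, client_t usage) and the fixed one-second
back-off are placeholders for illustration only:

    #include <time.h>
    #include <unistd.h>

    /* Sketch only: retry a failed upcall notification until the client-side
     * md-cache would have expired the entry anyway, at which point the lost
     * notification no longer matters. */
    static int
    upcall_notify_with_retry (client_t *client, struct gf_upcall *event)
    {
            time_t deadline = time (NULL) + CACHE_EXPIRY_SECS;
            int    ret      = -1;

            while (time (NULL) < deadline) {
                    ret = upcall_send (client, event);
                    if (ret == 0)
                            break;      /* delivered */
                    sleep (1);          /* one-way disconnect: try again */
            }

            return ret;
    }
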

> 
> (3) .. point out additional improvements which md-cache needs. For example,
> it cannot be allowed to grow unbounded.

This is being worked on, and will be targeted for 3.9

> 
> Dan
> 
> - Original Message -
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > 
> > List of areas where we need invalidation notification:
> > 1. Any changes to xattrs used by xlators to store metadata (like dht layout
> > xattr, afr xattrs etc).

Currently, md-cache will negotiate (using ipc) with the brick a list of xattrs
that it needs invalidation for. Other xlators can add the xattrs they are
interested in to the ipc. But then these xlators need to manage their own
caching and process the invalidation requests, as md-cache will be above all
cluster xlators.
reference: http://review.gluster.org/#/c/15002/
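
As a rough sketch of that negotiation (the real plumbing is in the patch above;
the value format used here is made up), an interested xlator would add its
xattr to the dict that md-cache sends down with the ipc fop:

    /* Illustrative only; see http://review.gluster.org/#/c/15002/ for the
     * real mechanism. An xlator that wants invalidation for its own xattr
     * appends it to the dict travelling down with the ipc fop. */
    static int
    register_interested_xattr (xlator_t *this, dict_t *xattr_dict)
    {
            /* "trusted.glusterfs.dht" is the dht layout xattr mentioned in
             * the thread; the value format is hypothetical. */
            int ret = dict_set_int32 (xattr_dict, "trusted.glusterfs.dht", 1);

            if (ret)
                    gf_log (this->name, GF_LOG_WARNING,
                            "failed to register xattr for invalidation");

            return ret;
    }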

> > 2. Scenarios where individual xlator feels like it needs a lookup. For
> > example failed directory creation on non-hashed subvol in dht during mkdir.
> > Though dht succeeds mkdir, it would be better to not cache this inode as a
> > subsequent lookup will heal the directory and make things better.

For this, these xlators can specify an indicator in the dict of
the fop cbk, to not cache. This should be fairly simple to implement.
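
A minimal sketch of that indicator, with a made-up dict key: dht would tag the
reply in its mkdir callback, and md-cache would skip caching when it sees the
tag.

    /* Sketch only; the key name is hypothetical. In dht's mkdir callback,
     * when the directory could not be created on the non-hashed subvol: */
    static void
    dht_tag_no_cache (dict_t *xdata)
    {
            if (dict_set_int32 (xdata, "glusterfs.md-cache.no-cache", 1))
                    gf_log ("dht", GF_LOG_WARNING,
                            "failed to set no-cache hint");
    }

    /* ... and in md-cache's callback, before populating the cache: */
    static gf_boolean_t
    mdc_should_cache (dict_t *xdata)
    {
            int32_t no_cache = 0;

            if (xdata && !dict_get_int32 (xdata, "glusterfs.md-cache.no-cache",
                                          &no_cache) && no_cache)
                    return _gf_false;   /* a later lookup will heal the dir */

            return _gf_true;
    }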

> > 3. removing of files

When an unlink is issued from the mount point, the cache is invalidated.

> > 4. writev on brick (to invalidate read cache on client)

writev on brick from any other client will invalidate the metadata cache on all
the other clients.

> > 
> > Other questions:
> > 5. Does md-cache have cache management, like an lru or an upper limit for
> > the cache?

Currently md-cache doesn't have any cache management; we will be targeting this
for 3.9.
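
For illustration, a bounded cache with LRU eviction could look roughly like the
sketch below (kernel-style list macros assumed, locking omitted; the actual 3.9
work may take a different shape):

    /* Sketch only; struct layout and names are made up. */
    struct mdc_entry {
            struct list_head lru_link;
            /* cached iatt, xattrs, timestamps, ... */
    };

    struct mdc_lru {
            struct list_head head;          /* most recently used in front */
            int              count;
            int              max_entries;   /* e.g. a new volume option */
    };

    /* Called on every cache insert/hit (caller holds the table lock). */
    static void
    mdc_lru_touch (struct mdc_lru *lru, struct mdc_entry *e)
    {
            list_move (&e->lru_link, &lru->head);

            while (lru->count > lru->max_entries) {
                    struct mdc_entry *victim;

                    victim = list_entry (lru->head.prev, struct mdc_entry,
                                         lru_link);
                    list_del_init (&victim->lru_link);
                    lru->count--;
                    /* free/invalidate victim's cached metadata here */
            }
    }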

> > 6. Network disconnects and invalidating cache. When a network disconnect
> > happens we need to invalidate cache for inodes present on that brick as we
> > might be missing some notifications. Current approach of purging cache of
> > all inodes might not be optimal as it might rollback benefits of caching.
> > Also, please note that network disconnects are not rare events.

Network disconnects are handled to a minimal extent, where any brick down will
cause the whole of the cache to be invalidated. Invalidating only the list of
inodes that belong to that particular brick will need support from the
underlying cluster xlators.
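
To sketch the kind of support that would be needed: if each cached entry
(extending the mdc_entry sketch above) also carried the index of the
child/brick its inode maps to, which only the cluster xlators can supply, a
child-down event could purge selectively. All names remain hypothetical.

    /* Sketch only, continuing the hypothetical structs above: purge just the
     * entries whose inode lives on the disconnected child, instead of wiping
     * the whole cache. */
    static void
    mdc_purge_for_child (struct list_head *entries, int down_child)
    {
            struct mdc_entry *e = NULL, *tmp = NULL;

            list_for_each_entry_safe (e, tmp, entries, lru_link) {
                    if (e->child_index == down_child)
                            mdc_entry_invalidate (e);  /* placeholder helper */
            }
    }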

> > 
> > regards,
> > Raghavendra


Re: [Gluster-devel] md-cache improvements

2016-08-10 Thread Raghavendra G
Couple of more areas to explore:
1. purging kernel dentry and/or page-cache too. Because of patch [1],
upcall notification can result in a call to inode_invalidate, which results
in an "invalidate" notification to fuse kernel module. While I am sure
that, this notification will purge page-cache from kernel, I am not sure
about dentries. I assume if an inode is invalidated, it should result in a
lookup (from kernel to glusterfs). But nevertheless, we should look into
differences between entry_invalidation and inode_invalidation and harness
them appropriately.
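
For reference, the two kernel-side invalidations differ roughly as below; the
sketch uses the libfuse 2.x low-level API for clarity (glusterfs's own
fuse-bridge constructs the equivalent FUSE_NOTIFY_* messages itself):

    #define FUSE_USE_VERSION 26
    #include <fuse_lowlevel.h>
    #include <string.h>

    /* Sketch of the two notification types being compared above. */
    static void
    invalidate_inode_and_entry (struct fuse_chan *ch, fuse_ino_t parent,
                                fuse_ino_t ino, const char *name)
    {
            /* Inode invalidation: drops cached attributes and, for the given
             * range, the page-cache of this inode (off = 0, len = 0 covers
             * the whole file; a negative off would drop attributes only). */
            fuse_lowlevel_notify_inval_inode (ch, ino, 0, 0);

            /* Entry invalidation: drops the dentry "name" under the parent
             * directory, forcing a fresh LOOKUP on the next access. */
            fuse_lowlevel_notify_inval_entry (ch, parent, name, strlen (name));
    }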

2. Granularity of invalidation. For eg., We shouldn't be purging page-cache
in kernel, because of a change in xattr used by an xlator (eg., dht layout
xattr). We have to make sure that [1] is handling this. We need to add more
granularity into invalidation (like internal xattr invalidation, user xattr
invalidation, entry invalidation in kernel, page-cache invalidation in
kernel, attribute/stat invalidation in kernel etc) and use them
judiciously, while making sure other cached data remains to be present.

[1] http://review.gluster.org/12951
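
To make the granularity idea in point 2 concrete, the upcall payload could
carry a set of flags like the hypothetical ones below, so each receiver purges
only the matching cache (the actual flag names in [1] and the upcall framework
may differ):

    /* Hypothetical invalidation flags; the server ORs together only the
     * bits that apply to a given change. */
    enum mdc_inval_flags {
            MDC_INVAL_INTERNAL_XATTR = 1 << 0,  /* e.g. dht layout xattr  */
            MDC_INVAL_USER_XATTR     = 1 << 1,
            MDC_INVAL_STAT           = 1 << 2,  /* kernel attribute cache */
            MDC_INVAL_ENTRY          = 1 << 3,  /* kernel dentry          */
            MDC_INVAL_PAGE_CACHE     = 1 << 4,  /* kernel data pages      */
    };

    /* Example: a tier/rebalance migration that only rewrites the layout
     * xattr would send MDC_INVAL_INTERNAL_XATTR alone, leaving the kernel
     * page-cache and dentries intact. */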

On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright  wrote:

>
> There have been recurring discussions within the gluster community to
> build on existing support for md-cache and upcalls to help performance for
> small file workloads. In certain cases, "lookup amplification" dominates
> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs
> from the client mitigate benefits from faster backend storage.
>
> To tackle this problem, one suggestion is to more aggressively utilize
> md-cache to cache inodes on the client than is currently done. The inodes
> would be cached until they are invalidated by the server.
>
> Several gluster development engineers within the DHT, NFS, and Samba teams
> have been involved with related efforts, which have been underway for some
> time now. At this juncture, comments are requested from gluster developers.
>
> (1) .. help call out where additional upcalls would be needed to
> invalidate stale client cache entries (in particular, need feedback from
> DHT/AFR areas),
>
> (2) .. identify failure cases, when we cannot trust the contents of
> md-cache, e.g. when an upcall may have been dropped by the network
>
> (3) .. point out additional improvements which md-cache needs. For
> example, it cannot be allowed to grow unbounded.
>
> Dan
>
> - Original Message -
> > From: "Raghavendra Gowdappa" 
> >
> > List of areas where we need invalidation notification:
> > 1. Any changes to xattrs used by xlators to store metadata (like dht
> layout
> > xattr, afr xattrs etc).
> > 2. Scenarios where individual xlator feels like it needs a lookup. For
> > example failed directory creation on non-hashed subvol in dht during
> mkdir.
> > Though dht succeeds mkdir, it would be better to not cache this inode as
> a
> > subsequent lookup will heal the directory and make things better.
> > 3. removing of files
> > 4. writev on brick (to invalidate read cache on client)
> >
> > Other questions:
> > 5. Does md-cache have cache management, like an lru or an upper limit for
> > the cache?
> > 6. Network disconnects and invalidating cache. When a network disconnect
> > happens we need to invalidate cache for inodes present on that brick as
> we
> > might be missing some notifications. Current approach of purging cache of
> > all inodes might not be optimal as it might rollback benefits of caching.
> > Also, please note that network disconnects are not rare events.
> >
> > regards,
> > Raghavendra



-- 
Raghavendra G

[Gluster-devel] md-cache improvements

2016-08-10 Thread Dan Lambright

There have been recurring discussions within the gluster community to build on 
existing support for md-cache and upcalls to help performance for small file 
workloads. In certain cases, "lookup amplification" dominates data transfers, 
i.e. the cumulative round trip times of multiple LOOKUPs from the client 
mitigate benefits from faster backend storage.

To tackle this problem, one suggestion is to more aggressively utilize md-cache 
to cache inodes on the client than is currently done. The inodes would be 
cached until they are invalidated by the server. 

Several gluster development engineers within the DHT, NFS, and Samba teams have 
been involved with related efforts, which have been underway for some time now. 
At this juncture, comments are requested from gluster developers. 

(1) .. help call out where additional upcalls would be needed to invalidate 
stale client cache entries (in particular, need feedback from DHT/AFR areas), 

(2) .. identify failure cases, when we cannot trust the contents of md-cache, 
e.g. when an upcall may have been dropped by the network

(3) .. point out additional improvements which md-cache needs. For example, it 
cannot be allowed to grow unbounded.

Dan

- Original Message -
> From: "Raghavendra Gowdappa" 
> 
> List of areas where we need invalidation notification:
> 1. Any changes to xattrs used by xlators to store metadata (like dht layout
> xattr, afr xattrs etc).
> 2. Scenarios where individual xlator feels like it needs a lookup. For
> example failed directory creation on non-hashed subvol in dht during mkdir.
> Though dht succeeds mkdir, it would be better to not cache this inode as a
> subsequent lookup will heal the directory and make things better.
> 3. removing of files
> 4. writev on brick (to invalidate read cache on client)
> 
> Other questions:
> 5. Does md-cache have cache management, like an lru or an upper limit for the cache?
> 6. Network disconnects and invalidating cache. When a network disconnect
> happens we need to invalidate cache for inodes present on that brick as we
> might be missing some notifications. Current approach of purging cache of
> all inodes might not be optimal as it might rollback benefits of caching.
> Also, please note that network disconnects are not rare events.
> 
> regards,
> Raghavendra