Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Xavier Hernandez Thu, 06 Feb 2014 00:52:23 -0800

Hi Avati,

El 06/02/14 00:24, Anand Avati ha escrit:

Xavi,
Getting such a caching mechanism has several aspects. First of all weneed the framework pieces implemented (particularly server originatedmessages to the client for invalidation and revokes) in a welldesigned way. Particularly how we address a specific translator in amessage originating from the server. Some of the recent changes toclient_t allows for server-side translators to get a handle (theclient_t object) on which messages can be submitted back to the client.
Such a framework (of having server originated messages) is alsonecessary for implementing oplocks (and possibly leases) -particularly interesting for the Samba integration.

Yes, that is a basic requirement for many features. I saw the client_tchanges but haven't had time to see if they could be used to implementthe kind of mechanism I proposed. This will need a look.

When I started implementing the DFC translator(https://forge.gluster.org/disperse/dfc) I needed something very similarbut at that time there wasn't any suitable client_t implementation Icould use. I solved it by using a pool of special getxattr requests thatthe translator on the bricks stores until it needs to send some messageback to the client. It's not a great solution but it works with theavailable resources at the moment.

As Jeff already mentioned, this is an area where gluster has notfocussed on, given the targeted use case. However the benefits ofextending this to internal use cases (to avoid per-operation inodelkscan benefit many modules - encryption/crypt, afr, etc.) It seemspossible to have a common framework for delegating locks to clients,and build caching coherency protocols / oplocks / inodelk avoidence ontop of it.
Feel free to share a more detailed proposal if you have have/plan -I'm sure the Samba folks (Ira copied) would be interested too.

I have some ideas on how to implement it and some special cases, but Ineed to work more on it before it can be considered a valid model. Ijust wanted to propose the idea to see if it could be valid or notbefore spending too much of my scarce time working on it. I'll try toget a more detailed picture to discuss it.


Best regards,

Xavi


Thanks!
Avati

On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez<xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:


    On 04.02.2014 17:18, Jeff Darcy wrote:

            The only synchronization point needed is to make sure that
            all bricks
            agree on the inode state and which client owns it. This
            can be achieved
            without locking using a method similar to what I
            implemented in the DFC
            translator. Besides the lock-less architecture, the main
            advantage is
            that much more aggressive caching strategies can be
            implemented very
            near to the final user, increasing considerably the
            throughput of the
            file system. Special care has to be taken with things than
            can fail on
            background writes (basically brick space and user access
            rights). Those
            should be handled appropiately on the client side to
            guarantee future
            success of writes. Of course this is only a high level
            overview. A
            deeper analysis should be done to see what to do on each
            special case.
            What do you think ?


        I think this is a great idea for where we can go - and need to
        go - in the
        long term. However, it's important to recognize that it *is*
        the long
        term. We had to solve almost exactly the same problems in MPFS
        long ago.
        Whether the synchronization uses locks or not *locally* is
        meaningless,
        because all of the difficult problems have to do with
        recovering the
        *distributed* state. What happens when a brick fails while
        holding an
        inode in any state but I? How do we recognize it, what do we
        do about it,
        how do we handle the case where it comes back and needs to
        re-acquire its
        previous state? How do we make sure that a brick can
        successfully flush
        everything it needs to before it yields a lock/lease/whatever?
        That's
        going to require some kind of flow control, which is itself a
        pretty big
        project. It's not impossible, but it took multiple people some
        years for
        MPFS, and ditto for every other project (e.g. Ceph or
        XtreemFS) which
        adopted similar approaches. GlusterFS's historical avoidance
        of this
        complexity certainly has some drawbacks, but it has also been
        key to us
        making far more progress in other areas.

    Well, it's true that there will be a lot of tricky cases that will
    need
    to be handled to be sure that data integrity and system
    responsiveness is
    guaranteed, however I think that they are not more difficult than what
    can happen currently if a client dies or loses communication while it
    holds a lock on a file.

    Anyway I think there is a great potential with this mechanism
    because it
    can allow the implementation of powefull caches, even based on SSD
    that
    could improve the performance a lot.

    Of course there is a lot of work solving all potential failures and
    designing the right thing. An important consideration is that all
    these methods try to solve a problem that is seldom found (i.e. having
    more than one client modifying the same file at the same time). So a
    solution that has almost 0 overhead for the normal case and allows the
    implementation of aggressive caching mechanisms seems a big win.


        To move forward on this, I think we need a *much* more
        detailed idea of
        how we're going to handle the nasty cases. Would some sort of
        online
        collaboration - e.g. Hangouts - make more sense than
        continuing via
        email?

    Of course, we can talk on irc or another place if you prefer

    Xavi


    _______________________________________________
    Gluster-devel mailing list
    Gluster-devel@nongnu.org <mailto:Gluster-devel@nongnu.org>
    https://lists.nongnu.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Reply via email to