Hi all,

I put together the attached one-pager on the ZFS Automatic Snapshots
service which I've been maintaining on my blog to date.

I would like to see if this could be integrated into ON and believe that
a first step towards this is a project one-pager: so I've attached a
draft version.

I'm happy to defer judgement to the ZFS team as to whether this would be
a suitable addition to OpenSolaris - if the consensus is that it's
better for the service to remain in its current un-integrated state and
be discovered through BigAdmin or web searches, that's okay by me.
[ just thought I'd ask ]

        cheers,
                        tim

-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf
Template Version: @(#)onepager.txt 1.31 07/08/08 SMI

[ timf note: this is still a Draft, last updated 02/04/2008
  using the template at
  http://www.opensolaris.org/os/community/arc/handbook/onepager/
  ]

This information is Copyright 2008 Sun Microsystems

1. Introduction
   1.1. Project/Component Working Name: 
        ZFS Automatic Snapshots

   1.2. Name of Document Author/Supplier:
        Tim Foster

   1.3. Date of This Document: 
        02/04/2008

   1.4. Name of Major Document Customer(s)/Consumer(s):
        1.4.1. The Community you expect to review your project: 
                ZFS OpenSolaris Community
                [editor's note - I'm not sure what was expected for 1.4.1 above]
        1.4.2. The ARC(s) you expect to review your project: PSARC

   1.5. Email Aliases:
        1.5.2. Responsible Engineer: [EMAIL PROTECTED]
        1.5.4. Interest List: zfs-discuss@opensolaris.org

2. Project Summary
   2.1. Project Description:

        This project delivers an SMF service which allows the admin to perform
        regular, periodic snapshots of user/administrator-specified ZFS
        filesystems. It is loosely coupled with the ZFS codebase, using only
        the ZFS CLI, cron and SMF to perform its functionality.
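        As an illustration of what administering the service might look like
        (the instance FMRIs shown here are assumptions - the actual names
        would be set by the delivered manifest):

```
# Illustrative only: enable the "daily" schedule and check its state.
svcadm enable svc:/system/filesystem/zfs/auto-snapshot:daily
svcs -H svc:/system/filesystem/zfs/auto-snapshot:daily
```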

   2.2. Risks and Assumptions:

        The current prototype has been implemented entirely in Korn shell -
        performance/scalability testing has not yet been carried out to
        determine whether this implementation is fast enough. If much tighter
        integration into the ZFS codebase is required, then this project will
        need additional resources.

        This project is not officially Sun funded - the engineer is doing this
        in his spare time. This could be mitigated by additional resources if
        a significant amount of additional engineering is recommended by the
        ARC and those resources become available.

3. Business Summary
   3.1. Problem Area:

        This adds one more feature to the capabilities ZFS brings to Solaris,
        integrating ZFS more tightly with the operating system and providing
        a feature that some expect ZFS to have already.

   3.2. Market/Requester:

        No specific person has asked for this feature, but it appears to be a
        general feature of many NAS boxes. The idea for such a system in ZFS
        came from a discussion on the zfs-discuss@opensolaris.org mailing list:

        http://www.opensolaris.org/jive/thread.jspa?messageID=37190

   3.3. Business Justification:

        Not providing scheduled periodic ZFS snapshots on Solaris out of the
        box means there is one more set of scripts that a system administrator
        needs to write and debug before putting a Solaris system into
        production that best exercises the features ZFS can provide.

        Having a common facility in Solaris that does this would prevent
        duplication of effort at user sites, increase the
        speed to deploy a Solaris system, and make life easier for support staff
        when users either request this feature, or try to troubleshoot a
        user's homemade solution.

   3.4. Competitive Analysis:

        Many other NAS products and operating systems that support snapshots
        already do this. These include:

        http://www.emc.com/products/software/snapview2.jsp
        http://www.microsoft.com/windows/products/windowsvista/features/details/shadowcopy.mspx
        http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=relnotes&fname=/usr/relnotes/nasmgr
        http://www.netapp.com/ftp/snapshot-brochure.pdf
        http://www.real-storage.com/nas-snapshots.html
        http://people.freebsd.org/~rse/snapshot/

   3.5. Opportunity Window/Exposure:

        We're playing catchup.

   3.6. How will you know when you are done?:

        The major features have already been implemented in Korn shell, but we
        need to perform more testing, and get additional code reviews. 

        Community feedback can be used to determine if we've implemented enough
        of the functionality for this to be useful.

        [ editor's note: yes, that's pretty vague. I don't know
          what specific metrics I could use here - any suggestions ? ]

4. Technical Description:
    4.1. Details:

        The service works by having separate service instances, each denoting
        a separate schedule of periodic snapshots for a group of filesystems.
        The SMF method script is responsible for adding and removing the
        snapshot cron job, which corresponds to enabling and disabling the
        service.
 
        The method script is also called directly from cron according to the
        crontab entries - in which case it is responsible for taking the
        snapshot.
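        As a sketch (the method-script path and FMRI are illustrative
        assumptions, not the actual delivered names), the crontab entry
        installed for an hourly instance might look like:

```
# minute hour day-of-month month weekday  command
0 * * * * /lib/svc/method/zfs-auto-snapshot svc:/system/filesystem/zfs/auto-snapshot:hourly
```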

        Filesystems are grouped together either by listing their names,
        space-separated, in an SMF instance property, or dynamically: the
        method script searches all ZFS filesystems for an instance-specific
        ZFS user property. With ZFS Delegated Administration (PSARC 2006/465),
        users can set this property on their own filesystems without needing
        to reconfigure the SMF service.
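        A minimal sketch of the dynamic-grouping logic: the dataset names and
        property values below are made-up sample data standing in for what a
        real run would obtain from `zfs get -H -o name,value <property>`.

```shell
#!/bin/sh
# Collect filesystems whose (illustrative) per-instance user property
# is set to "true"; only those join this instance's snapshot schedule.
selected=""
while IFS='	' read -r name value; do
    [ "$value" = "true" ] && selected="${selected:+$selected }$name"
done <<EOF
tank/home	true
tank/scratch	false
tank/docs	true
EOF
printf '%s\n' "$selected"
```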

        The service can also be responsible for destroying older snapshots taken
        by the service, allowing the administrator to keep a given number of
        snapshots into the past. The service can perform a backup command at
        each invocation of the cron job - the admin specifies what command to
        run at the end of a pipe that starts with
        "zfs send <filesystem>@<snapshot>", with the option of sending an
        incremental stream from the previous periodic snapshot.
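        The shape of that backup pipeline can be sketched as follows - the
        snapshot name, remote host and paths are invented for illustration,
        and the admin-supplied command is whatever sits after the pipe:

```shell
#!/bin/sh
# Build the pipeline string the cron job would run for a backup.
fs="tank/home"
snap="${fs}@auto-snap.daily-2008-04-02"
backup_cmd="ssh backuphost 'cat > /backup/home.zsnap'"
pipeline="zfs send ${snap} | ${backup_cmd}"
printf '%s\n' "$pipeline"
```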

        What does this offer that a simple "zfs snapshot <filesystem>@snap"
        entry in crontab doesn't?  Using SMF allows the administrator to
        easily see when snapshots fail for some reason, allows them to easily
        enable/disable snapshots for groupings of filesystems and adds
        additional features, like performing backups of their filesystems.
        In the default configuration, we have hourly, daily, weekly, monthly
        and yearly snapshots - each managed under a different SMF instance.

        The administrator could add instances to take more frequent snapshots
        for some filesystems, less frequent snapshots for other filesystems - 
        and have the service manage the complexity of dealing with cron for
        them.

        This has been a personal project up till now, with code (licensed under
        CDDL) and implementation posted on the engineer's blog. The README
        documentation for the project is at:

        http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt

        The "SEE ALSO" section of the README has a list of links showing
        the various stages of the project to date. To summarize, the project
        has evolved over 10 versions since May 2006 to the present date. Users
        have been running the code, and providing feedback, which has been
        integrated into each subsequent version.


        Two known bugs are worth calling out here:

        One is to do with our reliance on cron. To correctly allow the
        administrator to take snapshots every 3 days, we would need to
        re-write the crontab entry whenever the number of days in the month
        is not evenly divisible by 3. At the moment, the crontab day field
        looks like:

        1,4,7,10,13,16,19,22,25,28,31

        After taking the snapshot on the 31st, the next snapshot should be
        taken on the 3rd of the following month - but as implemented, it
        will be taken on the 1st instead. Other time periods are similarly
        affected.
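        The buggy behaviour is easy to reproduce: a day field generated by
        simple fixed-step arithmetic (a sketch of the approach, not the
        service's actual code) always restarts at day 1, regardless of when
        the previous month's last snapshot fired.

```shell
#!/bin/sh
# Generate the day-of-month crontab field for an every-N-days schedule.
# The field restarts at day 1 every month, which causes the wrap-around
# bug described above.
period=3
days=""
d=1
while [ "$d" -le 31 ]; do
    days="${days:+$days,}$d"
    d=$((d + period))
done
printf '%s\n' "$days"
```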

        The other bug is 

        6474294 Need to be able to better control who can read files in a
                snapshot.

        This service doesn't change the implications of that bug, but having
        automatic snapshots could result in more people running into the
        situation.
        

    4.2. Bug/RFE Number(s):

        TBD
        

    4.3. In Scope:
        Everything discussed in this one-pager is in scope.

    4.4. Out of Scope:

        While this service does provide a means for a snapshot stream to be
        stored remotely (the "backup" option allows for a ZFS send-stream
        to get piped to an administrator-specified command) it doesn't provide
        the equivalent "restore" command. This is not a general purpose backup
        tool (i.e. it does not fix 5004379). This is also not a general purpose
        remote replication facility (5036182) although some users have already
        started using it as a "poor man's cluster".

        [ editor's note - with that in mind, could this service ultimately end
         up confusing people who are expecting the above? Should we postpone
         work on this part of the big picture till the above facilities 
         are available? ]

    4.5. Interfaces:
        
        The interface will be the SMF service, allowing users to create
        instances of the service to perform work.

        We suspect the stability level will be Evolving, but would like advice.
        Over the course of the prototype development we have added, but never
        removed, service properties - a 0.1 manifest will work correctly with
        a 0.10 version of the service.

    4.6. Doc Impact:
        
        The ZFS Administration Guide could be modified to reference this
        service.

    4.7. Admin/Config Impact:

        Adding this SMF service will introduce no change to the way Solaris
        is currently installed or administered. Out of the box, the included
        service instances can be installed as "disabled". The administrator
        would need to enable each service instance they wanted to use, then
        mark filesystems for inclusion under the snapshot schedule set by the
        now-enabled instances.
        

    4.8. HA Impact:

        // What new requirements does this proposal place on the High
        // Availability or Clustering aspects of the component?

        [ editor's note: I'm not sure of the answer here - I assume HA
         clusters already have some form of SMF synchronisation to
         ensure that failover-nodes have the same SMF configuration
         applied automatically, should the running node change its
         SMF configuration ? ]

    4.9. I18N/L10N Impact:

        Additional translation of the ZFS Administration Guide could be
        required.

    4.10. Packaging & Delivery:

        One additional package, which delivers the default instance,
        the included instances and the method script. No impact during
        Install/Upgrade.

    4.11. Security Impact:

        If periodic snapshots are taken of sensitive data, then 6474294
        may be worth visiting prior to integration, however this service
        only highlights that problem - it exists without the service as
        well.

    4.12. Dependencies:

        Cron, SMF and ZFS. The service works with ZFS from s10u2 onwards -
        later ZFS versions include faster recursive snapshots, which the
        method script detects and uses when available.

5. Reference Documents:
        
        The following bugids have been mentioned in this one-pager, under
        sections 4.4 and 4.11.

        5004379 want comprehensive backup strategy
        5036182 want remote replication (intent-log based)
        6474294 Need to be able to better control who can read files in
                a snapshot.


6. Resources and Schedule:
   6.1. Projected Availability:
        
        TBD

   6.2. Cost of Effort:
        // Order of magnitude people and time for the *whole* project, not
        // just the development engineering part.
        // You may wish to split the estimate between feature
        // implementation, implementing administrative interfaces, unit
        // tests, documentation, support training material, i18n, etc.

        [editor's note - any ideas ? The prototype is done - there's additional
        work to integrate it in ON and Install, properly use RBAC, perhaps a
        few weeks work in my spare time ? ]

   6.4. Product Approval Committee requested information:
        6.4.1. Consolidation or Component Name: ON
        6.4.7. Target RTI Date/Release:
        
                TBD

                // List target release & build and/or date.
                // RTI = Request to Integrate - when does *this* project
                // expect to be ready to integrate its changes back into
                // the master source tree?  We are not asking when the
                // component wants to ship, but instead, when the
                // gatekeeper/PM needs to expect your changes to show up.
                // examples: S8u7_1, S9_45, Aug 2002...

        6.4.8. Target Code Design Review Date: TBD

   6.5. ARC review type: Standard
   6.6. ARC Exposure: open
       6.6.1. Rationale: Part of OpenSolaris

7. Prototype Availability:
   7.1. Prototype Availability:

        An evolving prototype has been available since May 2006. More work is
        needed to add RBAC SMF authorisations to manage the service instances.

   7.2. Prototype Cost:
        
        $0
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss