Matt Benjamin alluded to this in another email on the info list; given the state of our world, it's a good idea to get the idea out to others. "The state of our world" doesn't mean it's coming apart; it just means that we probably aren't going to be working on this for the foreseeable future.
Dan Hyde and I were building a system at Michigan intended to allow rapid disaster recovery for AFS, using a scheme analogous to snapmirroring on NetApp filers and similar devices. This note is a quick overview of where we were going, how far we got, and why.

Problem: at Michigan, losing an AFS file server can make some or all of the cell unusable (handwave, handwave on why). As the number of servers increases, the likelihood of this gets higher and higher. We were looking for a way to minimize those losses, and glommed onto an unfinished project called 'shadow volumes' to do it. Shadow volumes have lots of theoretical capabilities; we were pushing for one specific set of features in our implementation. Don't take our work as representative of either what the initial developers intended or the only possible use for it. In classic open source fashion, our development reflected scratching our own particular itches.

Credit where credit is due: Dan did all the heavy lifting on the code and a lot of the test and operational deployment. The original work was done by someone whose name escapes me right at this second; if time and energy permit, I'll look that up and give that credit.

A shadow volume is a read-only remote clone of a primary volume. We had to create some terminology here: 'primary' is what we called the real-time, in-use, r/w production volume. A remote clone closely resembles a read-only replica of a volume, but differs in several important respects. First and foremost, it does not appear in the vldb, so there is no possibility of the read-only copy coming into production. If it were public like an r/o replica, it would generate all kinds of problems for the day-to-day use of the volume. Our solution follows the original developer's: the only way to prevent use of the r/o was to not have it appear in the vldb. Longer term there are better ways, but this did the least violence to existing cells.

A shadow volume should retain a timestamp and name-or-id relationship with the primary. This enables something much like a release of a replicated volume: incremental changes are quickly and easily propagated to the shadow. We call that refreshing the shadow. As the shadow is not in the vldb, the refresh has to be initiated by something external to the vldb/primary. That code is complete and works. It was running on a nightly basis in our cell with an acceptably small amount of added run time - not much more than the nightly backup snapshots. Big kudos to Dan on this.

Shadow volumes could be detected only on the server on which they reside. Modifications were made to vos listvol for that purpose. A bit in the volume header was selected to distinguish a shadow from a primary volume; I believe that was the only modification made to the volume header file. This work is also done.

A mechanism needs to be established such that a shadow volume can be promoted (our term) to a primary. This would involve at least two steps: flipping the shadow bit in the header file to indicate the volume is a primary, and updating the vldb to indicate the new location of the primary. This work is incomplete; I don't have a feel for how much, if any, is done.

With these features, we could meet the minimum bar for our usage. We could, in theory, disastrously lose an AFS server, promote the shadows, and be back online in minutes.
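To make that concrete, here is a minimal sketch (Python driving vos from a script) of what promoting a single shadow might look like. The volume and host names are made up, and the 'vos promote' subcommand is purely hypothetical - it stands in for the unfinished piece described above - while 'vos syncvldb' is the stock command for rebuilding vldb entries from what a fileserver actually holds.

    #!/usr/bin/env python
    # Sketch only: promote one shadow volume to primary after losing its server.
    # Stock vos has no "promote"; that subcommand is an assumption about how the
    # unfinished piece could look.

    import subprocess

    VOLUME = "user.example"                       # hypothetical volume name
    SHADOW_SERVER = "afs-shadow-1.example.edu"    # server holding the shadow
    PARTITION = "vicepa"

    # Step 1: flip the shadow bit in the volume header so the fileserver treats
    # the volume as a primary (this is the part left unfinished).
    subprocess.check_call(
        ["vos", "promote", "-id", VOLUME,
         "-server", SHADOW_SERVER, "-partition", PARTITION])

    # Step 2: make the vldb point at the new location. 'vos syncvldb' rebuilds
    # vldb entries from what the fileserver holds, so once the volume no longer
    # looks like a shadow, this records the new site.
    subprocess.check_call(
        ["vos", "syncvldb", "-server", SHADOW_SERVER, "-partition", PARTITION])

In practice you'd run something like this across every volume on the shadow server, which is why the paired-server layout described below made 'promote everything' so simple.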
There would be data loss for any changes that occur between the last refresh and the promotion, but this was judged preferable to having the cell down or non-functional for hours or even days.

In our initial implementation, we were building AFS servers in pairs with shadow servers. Each server in a pair was intended for only one purpose - either all primary volumes, or all shadow volumes. This isn't the only way to do it, but we selected this method for a couple of reasons:

* It eased the tracking of where shadow volumes were, and enabled us to easily find shadow volumes that might no longer be needed on a given shadow server.

* It very much reflects the problem we're trying to solve: disastrous loss of either (a) a file server or (b) an entire data center. A quick ability to tell a server 'promote everything' made for quick and accurate response in the face of not having the shadow data in the vldb.

To support this process, every night (or at whatever interval you choose) the shadow server would examine the primary volumes on its paired server and create or refresh the shadows as needed (a rough sketch of that loop appears below, after the open questions). We intended to update our provisioning process so that shadows would automatically be created when a primary was created or moved, but since the shadow servers caught any missing volumes automatically, that was kind of low on the list.

Other things one could do with shadows: one could use shadows and their clones as part of a file restore system. That's nice, but rather a pain in many ways; it's also largely a way to work around the limitation of only having 7 clone slots available. Having a significantly larger number of clones is a much better solution, but that's outside the scope of this project.

Things envisioned but not yet followed through to an actual design:

* a vldb-like solution such that shadow(s) of a given primary could be identified easily and moved/updated appropriately. In the best of all worlds, this would be part of the vldb, but that's a lot to wish for

* volume-sensitive and shadow-sensitive decisions on refresh frequency. One might refresh critical data volumes quite often, less critical ones rarely or not at all. One might refresh on-site shadows frequently, off-site ones daily

* remote shadows become your long-term backup system. This would require several features, most critically:

** the ability to have clones of shadows, one clone per, say, each daily backup. Note this requires that refreshing a shadow also manage those clones in some flexible way

** the ability to promote a shadow to a different name. This enables the shadow and its clones to be made visible without taking the production volume off-line.

* clones (in particular, .backup) of a primary should be refreshable to a shadow, i.e., specified clones of the primary could be refreshed to the shadow

* some way of mediating between incompatible operations, e.g., having refresh operations either queue or abort cleanly if they would interfere with other activities like volume moves, backupsys, etc.

Some open questions: it was clear we were talking about volume families - a primary, its clones, its shadows, their clones, etc. Should you be able to have shadows of shadows? We think so; refreshing multiple shadows of a given volume shouldn't require hitting the primary multiple times, nor doing all those refreshes in lockstep. We need to establish a sort of taxonomy of volumes with well-defined relationships. Dan and I came up with a lot of ideas, but are very aware that we were reasoning in the dark. Other sites might well have other needs that would affect this.
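As promised above, here is a rough sketch of the nightly create-or-refresh pass a shadow server might run against its paired primary. It is illustrative only: the host names are invented, the 'vos shadow' invocation and flags are assumptions about the patched vos rather than a description of what stock OpenAFS ships, and the parsing of 'vos listvol' output is simplified.

    #!/usr/bin/env python
    # Sketch only: nightly create-or-refresh pass run on a shadow server.
    # "vos shadow" here stands in for the patched vos operation that copies a
    # primary to a local shadow; its name and flags are assumptions.

    import subprocess

    PRIMARY_SERVER = "afs-primary-1.example.edu"   # hypothetical paired primary
    SHADOW_SERVER = "afs-shadow-1.example.edu"     # this host
    PARTITION = "vicepa"

    def rw_volumes(server, part):
        """Names of read/write volumes on one partition, via 'vos listvol'."""
        out = subprocess.check_output(
            ["vos", "listvol", "-server", server, "-partition", part],
            universal_newlines=True)
        vols = []
        for line in out.splitlines():
            fields = line.split()
            # Data lines look like: <name> <numeric id> RW <size> K On-line
            if len(fields) >= 3 and fields[1].isdigit() and fields[2] == "RW":
                vols.append(fields[0])
        return vols

    for vol in rw_volumes(PRIMARY_SERVER, PARTITION):
        # Create the shadow if it does not exist yet, otherwise send an
        # incremental refresh, based on the timestamp/name relationship
        # between primary and shadow.
        subprocess.check_call(
            ["vos", "shadow", "-id", vol,
             "-fromserver", PRIMARY_SERVER, "-frompartition", PARTITION,
             "-toserver", SHADOW_SERVER, "-topartition", PARTITION,
             "-incremental"])

A real version would presumably also walk every partition and clean up shadows whose primaries have gone away - the 'no longer needed' case mentioned above.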
I think we were sliding towards a transparent, upward-compatible replacement of the vldb as well. Based purely on how I imagine the vldb to work :-), it should be possible to add shadow data to it and define some additional rpcs. Users of the old rpcs would only get the data that was in the 'legacy' vldb; users of the new rpcs would get shadow data as well. That's a door folks may not want opened yet, but it seems a better choice than bolting a separate shadow-oriented vldb onto the side.

So that's where we are. I believe our latest shadow software is built against 1.4.11, but I could be wrong. If folks are interested, I'd be happy to chat with Dan and we'll release the patches to interested parties. If folks think this is worth writing up in the afslore wiki as a partial project, I'd be glad to take this note and shovel it in with appropriate formatting.

Steve