Re: Versioning file system

Kyle Moffett Tue, 19 Jun 2007 19:44:51 -0700

On Jun 19, 2007, at 03:58:57, Bron Gondwana wrote:

On Mon, Jun 18, 2007 at 11:10:42PM -0400, Kyle Moffett wrote:
On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
The question remains is where to implement versioning: directly in individual filesystems or in the vfs code so all filesystems can use it?
Or not in the kernel at all. I've been doing versioning of the types I described for years with user space code and I don't remember feeling that I compromised in order not to involve the kernel.
What I think would be particularly interesting in this domain is something similar in concept to GIT, except in a file-system:
[...snip...]

It can work, but there's one big pain at the file level: no mmap.

IMHO it's actually not that bad. The "gitfs" would divide larger files up into manageable chunks (say 4MB) which could be quickly SHA-1ed. When a file is mmapped and partially modified, the SHA-1 would be marked as locally invalid, but since mmap() loses most consistency guarantees that's OK. A time- or writeout-based "commit" scheme might still freeze, SHA-1, and write out the page at regular intervals without the program's knowledge, but since you only have to SHA-1 the relatively small 4MB chunk (which is about to hit disk anyways), it's not a significant time penalty. Even if under memory pressure and swapping data out to disk you don't have to update the SHA-1 and create a new commit as long as you keep a reference to the object stored in the volume header somewhere and maintain the "SHA-1 out-of-date" bit.
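To make the chunking idea concrete, here is a minimal userspace sketch in Python. It is only an illustration of the bookkeeping, not gitfs code: the ChunkedFile class, its method names, and the in-memory storage are all assumptions made for the example. Each chunk carries its own SHA-1 and a "SHA-1 out-of-date" bit, an in-place write only sets the bit, and a commit re-hashes just the chunks that were touched.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # the 4MB example size from the text


class ChunkedFile:
    """Hypothetical in-memory stand-in for a file split into hashed chunks."""

    def __init__(self, data=b"", chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = [bytearray(data[i:i + chunk_size])
                       for i in range(0, len(data), chunk_size)]
        self.digests = [hashlib.sha1(c).hexdigest() for c in self.chunks]
        # One "SHA-1 out-of-date" bit per chunk.
        self.stale = [False] * len(self.chunks)

    def write(self, offset, payload):
        """Overwrite bytes in place, as a store through an mmap would."""
        size = sum(len(c) for c in self.chunks)
        if offset < 0 or offset + len(payload) > size:
            raise ValueError("this sketch does not grow the file")
        pos = 0
        while pos < len(payload):
            index, within = divmod(offset + pos, self.chunk_size)
            n = min(len(payload) - pos, self.chunk_size - within)
            self.chunks[index][within:within + n] = payload[pos:pos + n]
            self.stale[index] = True  # digest no longer matches the data
            pos += n

    def commit(self):
        """Re-hash only the stale chunks; return how many were re-hashed."""
        rehashed = 0
        for index, is_stale in enumerate(self.stale):
            if is_stale:
                self.digests[index] = hashlib.sha1(self.chunks[index]).hexdigest()
                self.stale[index] = False
                rehashed += 1
        return rehashed
```

A write that straddles a chunk boundary marks two chunks stale, and the next commit() hashes only those two, which is why the cost stays bounded by the chunk size rather than the file size.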

A program which carefully uses msync() would be fine, of course (with proper configuration) as that would create a new commit as appropriate.

Since mmap() is poorly defined on network filesystems in the absence of msync(), I don't see that such behaviour would be a problem. And it certainly would be fine on local filesystems as there you can just stuff the "SHA-1 out-of-date" bit and a reference to the parent commit and path in the object itself. Then you just need to keep a useful reference to that object in a table somewhere in the volume and you're set.

If you don't want to support mmap it can work reasonably happily, though you may want to keep your sha1 (or other digest) state as well as the final digest so you can cheaply calculate the digest for a small append without walking the entire file. You may also want to keep state checkpoints every so often along a big file so that truncates don't cost too much to recalculate.
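A rough Python sketch of that idea follows, relying on the fact that hashlib digest objects can be copied mid-stream with copy(). The AppendDigest class, the checkpoint interval, and the read_range callback are hypothetical names chosen for the example. The running state makes an append cost only the appended bytes, and the saved checkpoints mean a truncate re-hashes at most one interval of data.

```python
import hashlib


class AppendDigest:
    """Hypothetical running SHA-1 with periodic state checkpoints."""

    def __init__(self, checkpoint_every=1 << 20):
        self._every = checkpoint_every
        self._state = hashlib.sha1()
        self._length = 0
        # (offset, digest state after hashing exactly `offset` bytes)
        self._checkpoints = [(0, hashlib.sha1())]

    def append(self, data):
        """Hash only the new bytes, saving a state copy at each boundary."""
        view = memoryview(data)
        while len(view):
            room = self._every - (self._length % self._every)
            piece = view[:room]
            self._state.update(piece)
            self._length += len(piece)
            view = view[len(piece):]
            if self._length % self._every == 0:
                self._checkpoints.append((self._length, self._state.copy()))

    def hexdigest(self):
        # hexdigest() does not finalize the object; appends may continue.
        return self._state.hexdigest()

    def truncate(self, new_length, read_range):
        """Shrink to new_length; read_range(start, end) returns file bytes."""
        if not 0 <= new_length <= self._length:
            raise ValueError("truncate can only shrink the file")
        # Drop checkpoints that lie beyond the new end of file.
        while self._checkpoints[-1][0] > new_length:
            self._checkpoints.pop()
        offset, state = self._checkpoints[-1]
        self._state = state.copy()
        # Re-hash only the tail between the checkpoint and the new end.
        self._state.update(read_range(offset, new_length))
        self._length = new_length
```

Whether the extra stored state pays for itself depends on how often files are appended to or truncated, so the interval here is a tuning knob rather than a recommendation.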

That may be worth it even if the file is divided into 4MB chunks (or other configurable value), but it would need benchmarking.

Luckily in a userspace VFS that's only accessed via FTP and DAV we can support a limited set of operations (basically create, append, read, delete). You don't get that luxury for a general purpose filesystem, and that's the problem. There will always be particular usage patterns (especially something that mmaps or seeks and touches all over the place like a loopback mounted filesystem or a database file) that just don't work for file-level sha1s.


I'd think that loopback-mounted filesystems wouldn't be that difficult:

  1)  Set the SHA-1 block size appropriately to divide the big file into a bunch of little manageable files. Could conceivably be multi-layered like directories, depending on the size of the file.
  2)  Mark the file as exempt from normal commits (IE: without special syscalls or fsync/msync() on the file itself, it is never updated in the tree objects).
  3)  Set up the loopback device to call the gitfs commit code when it receives barriers or flushes from the parent filesystem.
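The "multi-layered like directories" part of step 1 can be sketched roughly as below. This is an assumption-laden illustration, not a gitfs on-disk format: build_tree, the "chunk"/"index" prefixes, and the tiny chunk size and fanout are all invented for the example. Chunks are hashed, then groups of chunk IDs are hashed into index objects, and so on until a single root remains, so changing one chunk creates new objects only along the path to the root.

```python
import hashlib


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def build_tree(data, chunk_size=4, fanout=2):
    """Return (root_id, objects) for a layered tree of chunk hashes.

    objects maps object ID to the stored body. Chunk and index objects
    are hashed with different prefixes so they cannot collide by accident.
    """
    if chunk_size < 1 or fanout < 2:
        raise ValueError("need chunk_size >= 1 and fanout >= 2")
    objects = {}
    level = []
    # Always emit at least one chunk so an empty file still has a root.
    for start in range(0, max(len(data), 1), chunk_size):
        chunk = bytes(data[start:start + chunk_size])
        oid = sha1_hex(b"chunk\0" + chunk)
        objects[oid] = chunk
        level.append(oid)
    # Group IDs into index objects until a single root ID remains.
    while len(level) > 1:
        parents = []
        for start in range(0, len(level), fanout):
            body = "\n".join(level[start:start + fanout]).encode("ascii")
            oid = sha1_hex(b"index\0" + body)
            objects[oid] = body
            parents.append(oid)
        level = parents
    return level[0], objects
```

With four chunks and a fanout of two, rewriting the last chunk yields three new objects (the chunk, its parent index, and the root), while everything else is shared with the previous version.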

And database files aren't a big issue. I have yet to see a networked filesystem on which you could stick a MySQL database from one node and expect to get useful/recent read results from other nodes. If you really wanted something like that for such a "gitfs", you could just add code to MySQL to create a gitfs commit every N transactions and not otherwise. The best part is: that would make online MySQL backups from another node trivial! Just pick any arbitrary appropriate commit object and mount that object, then "cp -a mysql_db_dir mysql_backup_dir". That's not to say it wouldn't have a performance penalty, but for some people the performance penalty might be worth it.

Oh, and for those programs which want multi-master replication, this makes it ten times easier:

  1)  Put each master-server on a different gitfs branch
  2)  Write your program as gitfs-aware. Make it create gitfs commits at appropriate times (so the data is accessible from other nodes).
  3)  Come up with a useful non-interactive database-file merge algorithm. Useful examples of different kinds of merge engines may be found in the git project. This should take $BASE_VERSION, $NEWVERSION1, $NEWVERSION2, and produce a $MERGEDVERSION. A good algorithm should probably pick a safe default and save a "conflict" entry in the face of conflicting changes.
  4)  Hook your merge algorithm into the gitfs mechanics using some to-be-defined API.
  5)  Whenever your software does a database-file commit it sends out a little notification to the other nodes (maybe using a gitfs API?)
  6)  Run a periodic (as defined by the admin yet again) thread on each node which does branch merging. When two or more branches have different SHA-1 sums the servers will rotate the merging task between them. The thus-selected server will merge changes from the other server(s) into its current working copy. With 2 servers this means that the maximum delay between one server making a change and the other server seeing it will be 2 times the merge interval.
  7)  For small pools of servers a simple rotated-merge-master algorithm would work. For larger pools you would need to come up with some logarithmic rotating-merge-node algorithm to evenly divide the work of propagating changes across all nodes.
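As a rough illustration of steps 3 and 6, here is a small Python sketch. It assumes, purely for the example, that a database file can be viewed as a key/value mapping; merge3, merge_master, and the "keep the base value on conflict" policy are hypothetical choices, not anything gitfs or git defines. The merge takes a base and two new versions, applies changes made on only one side, and records a conflict entry when both sides changed the same key differently. The second function is the simple rotated-merge-master selection for small pools.

```python
_MISSING = object()  # marks "key absent in this version"


def _visible(value):
    """Show an absent key as None in conflict records."""
    return None if value is _MISSING else value


def merge3(base, ours, theirs):
    """Three-way merge of dicts; returns (merged, conflicts)."""
    merged, conflicts = {}, {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t:        # same change on both sides, or no change at all
            result = o
        elif o == b:      # only "theirs" changed this key
            result = t
        elif t == b:      # only "ours" changed this key
            result = o
        else:             # both changed it differently
            conflicts[key] = (_visible(o), _visible(t))
            result = b    # assumed safe default: keep the base value
        if result is not _MISSING:
            merged[key] = result
    return merged, conflicts


def merge_master(round_number, nodes):
    """Pick which node does the merging in a given merge interval."""
    ordered = sorted(nodes)  # every node must agree on the ordering
    return ordered[round_number % len(ordered)]
```

With two nodes the merge duty alternates every interval, so a change made just after a node's own turn waits for the other node's turn and then its own again, which matches the "2 times the merge interval" bound in step 6.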

It does have some lovely properties though. I'd enjoy working in an environment that didn't look much like POSIX but had the strong guarantees and auditability that addressing by sha1 buys you.

I'd like to think we can have our cake and eat it too :-D. POSIX requirements should be doable on the local system and can be mimicked well enough on networked filesystems (albeit with update latency) that most programs won't care. If you're the only person modifying files on gitfs, regardless of what node they are stored on, it should have the same behavior as local files (since with gitfs caching they would *become* local files too :-D). The few programs that do care about POSIX atomicity across networked filesystems (which is already mostly implementation defined) could probably be updated to map gitfs commits and merges into their own internal transactions and do just fine.


Cheers,
Kyle Moffett
