(resent after subscribing to the [EMAIL PROTECTED])

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> : Pedro.
>
> Those are interesting comments.  I think I found Daniel's email address
> so I am adding him to the To:  Dan, feel free to post this on your Tux
> groups if you want.
How about a cross post?

> I did consider multiple-parentage... that is the ability to have a
> writable snapshot that 'forks' the filesystem history.  It would be
> an ultra cool feature to have but I couldn't fit it into the B-Tree
> model I was using.  Explicit snapshotting would be needed to make it
> work, and the snapshot id would have to be made part of the B-Tree
> key, which is fine.  HAMMER is based around implicit snapshots (being
> able to do an as-of historical lookup without having explicitly
> snapshotted bits of the history).

Yes, that is the main difference indeed, essentially "log everything"
vs "commit" style versioning.  The main similarity is the
lifespan-oriented version control at the btree leaves.

> I would caution against using B-Tree iterations in historical
> lookups.  B-Trees are only fast if you can pretty much zero in on the
> exact element you want as part of the primary search mechanic.  Once
> you have to iterate or chain, performance goes out the window.  I
> know this because there are still two places where HAMMER sometimes
> has to iterate due to the inexact nature of an as-of lookup.
> Multiple-parentage almost certainly requires two inequality searches,
> or one inequality search and an iteration.  A single timeline only
> requires one inequality search.

Once I get down to the leaf level I binary search on logical address
in the case of a file index btree, or on version in the case of an
inode table block, so this cost is still O(log N) with a small
constant.  For a heavily versioned inode or file region this might
sometimes result in an overflow block or two that has to be linearly
searched, which is not a big problem so long as it is a rare case, and
it really ought to be rare for the kinds of filesystem loads I have
seen.
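As a purely illustrative sketch of that leaf probe (the entry layout and names are invented here, and a linear version history is assumed; a real version tree needs the versioned pointer rules discussed below):

```python
import bisect

def leaf_lookup(entries, addr, version):
    """Find the data at `addr` as of `version` within one leaf block.

    `entries` is a list of ((addr, version), data) pairs kept sorted
    by key, so the as-of probe is a single binary search -- O(log n)
    with a small constant, no iteration through the tree required.
    """
    keys = [key for key, _ in entries]
    # Rightmost entry with key <= (addr, version).
    i = bisect.bisect_right(keys, (addr, version))
    if i and keys[i - 1][0] == addr:
        return entries[i - 1][1]
    return None  # no version of this address existed yet
```

For example, with entries for address 5 born in versions 1 and 3, a lookup as of version 2 finds the version-1 data, and a lookup as of version 3 or later finds the version-3 data.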
A common example of a worst case is /var/log/messages, where the mtime
and size are going to change in pretty much every version, so if you
have hourly snapshots and hold them for three months it adds up to
about 2200 16-byte inode table attributes to record, or about 8 inode
table leaf blocks.  I really do not see that as a problem.  If it went
up to 100 leaf blocks to search, that could be a problem.

Usually, I will only be interested in the first and last of those
blocks: the first holds the rarely changing attributes, including the
root of the file index btree, and the last holds the most recent
incarnation of the mtime/size attribute.  The penultimate inode table
index block tells me how many blocks a given inode lives in, because
several blocks will have the same inum key.  So the lookup algorithm
for a massively versioned file becomes: 1) read the first inode table
block holding that inum; 2) read the last block with the same inum.
The latter operation only needs to consult the immediate parent index
block, which is locked in the page cache at that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort.  For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a
problem.

> I couldn't get away from having a delete_tid (the 'death version
> numbers').  I really tried :-)  There are many cases where one is
> only deleting, rather than overwriting.

By far the most common case, I would think.  But check out the
versioned pointer algorithms.
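A toy model of that versioned pointer lookup (my reading of the scheme, with invented names and a plain dict standing in for the on-disk structures, not Tux3 code):

```python
def visible_data(pointers, version, parent):
    """Toy versioned-pointer lookup for one logical address.

    `pointers` maps birth version -> data written at that version;
    `parent` maps each version to its parent in the version tree
    (None for the root).  The data visible at `version` is whatever
    the nearest ancestor wrote -- no delete_tid is needed, because a
    newer birth version simply shadows any older one on the same
    ancestry path.
    """
    v = version
    while v is not None:
        if v in pointers:
            return pointers[v]
        v = parent[v]
    return None  # hole: no ancestor of `version` ever wrote here
```

So with a version tree 1 -> {2, 3} and data written at versions 1 and 2, version 3 still sees the version-1 data while version 2 (and its children) see the overwrite.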
Surprisingly, that just works, which is not exactly obvious:

   Versioned pointers: a new method of representing snapshots
   http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is
dumb, because there will never be any file data shared by a successor
version in that case.  So the thing to do is just create an entirely
new versioned file data attribute for each rewrite, bulking up the
inode table entry a little but greatly constraining the search for
versions to delete, and reducing cache pressure by not loading
unrelated version data when traversing a file.

> Both numbers are then needed to be able to properly present a
> historical view of the filesystem.  For example, when one is
> deleting a directory entry a snapshot after that deletion must show
> the directory entry gone.

...which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone.  Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table).  I rely on two things to
make this nice:

  1) Most dirent changes will be logically logged and only rolled up
     into the versioned file blocks when there are enough to be
     reasonably sure that each changed directory block will be hit
     numerous times in each rollup episode.  (Storing the directory
     blocks in dirent-create order as in PHTree makes this very
     likely for mass deletes.)

  2) The time we care about this is usually during a mass delete,
     where most or all dirents in each directory file block are
     removed before moving on to the next block.

> Both numbers are also needed to be able to properly prune out
> unwanted historical data from the filesystem.
> HAMMER's pruning algorithm (cleaning out old historical data which
> is no longer desired) creates holes in the sequence, so once you
> start pruning out unwanted historical data the delete_tid of a prior
> record will not match the create_tid of the following one
> (historically speaking).

Again, check out the versioned pointer algorithms.  You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address.  Maybe the hang-up
is that you do not organize the btrees by logical address (or inum in
the case of the inode table tree).  I thought you did, but have not
read closely enough to be sure.

> At one point I did collapse the holes, rematching the delete_tid
> with the create_tid of the following historical record, but I had to
> remove that code because it seriously complicated the mirroring
> implementation.  I wanted each mirror to be able to have its own
> historical retention policy independent of the master.  e.g. so you
> could mirror to a backup system which would retain a longer and more
> coarse-grained history than the production system.

Fair enough.  I have an entirely different approach to what you call
mirroring and what I call delta replication.  (I reserve the term
mirroring to mean mindless duplication of physical media writes.)
This method proved out well in ddsnap:

   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257

(Sorry about the massive URL, you can blame Linus for that ;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical.
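Sketched in miniature (block level only; this ignores ddsnap's actual wire format, chunking, and snapshot store, and all names are invented for illustration):

```python
def compute_delta(read_block, nblocks, v_from, v_to):
    """Collect every block that differs between two versions.

    `read_block(version, n)` returns block n of the volume as seen by
    that version.  The result is the minimal set of (blockno, data)
    pairs needed to turn a `v_from` replica into `v_to`.
    """
    return [(n, read_block(v_to, n))
            for n in range(nblocks)
            if read_block(v_from, n) != read_block(v_to, n)]

def apply_delta(replica, delta):
    """Apply a delta to a replica already holding `v_from`, yielding
    a logically (but not physically) identical copy of `v_to`."""
    for blockno, data in delta:
        replica[blockno] = data
```

The remote side never needs the full volume stream, only the changed blocks, which is what makes per-version retention policies on each end workable.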
The same idea works for a versioned filesystem: compute all the leaf
data that differs between two versions, per inum, and apply the
resulting delta to the corresponding inums in the remote replica.
The main difference vs a ddsnap volume delta is that not all of the
changes are physical blocks; there are also changed inode attributes,
so the delta stream format has to be elaborated accordingly.

> I also considered increasing the B-Tree fan-out to 256 but decided
> against it because insertions and deletions really bloated up the
> UNDO FIFO.  Frankly I'm still undecided as to whether that was a
> good idea; I would have preferred 256.  I can actually compile in
> 256 by changing a #define, but I released with 64 because I hit up
> against a number of performance issues: bcopy() overheads for
> insertions and deletions in certain tests became very apparent.
> Locking conflicts became a much more serious issue because I am
> using whole-node locks rather than element locks.  And, finally, the
> UNDO records got really bloated.  I would need to adjust the node
> locking and UNDO generation code significantly to remove the
> bottlenecks before I could go to a 256-element B-Tree node.

I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to
perform updating of the btree nodes in bulk.  I am pretty sure this
would work for you as well, and you might want to check out that
forward logging trick.

That reminds me: I was concerned about the idea of UNDO records vs
REDO.  I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond
the associated UNDO records.  Otherwise, if you acknowledge, crash,
and prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and being
wrong about that.  I have no doubt you did the right thing there, but
it is not obvious from your design documentation.
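Going back to the logical logging idea above, a toy model of insert/delete records batched into a bulk rollup might look like this (a dict stands in for the on-disk btree, and the threshold policy is invented for illustration):

```python
class ForwardLog:
    """Toy model of logical forward logging with batched rollup.

    Mutations are appended to the log as small records and only
    folded into the btree when enough accumulate that each dirty
    node is likely to be touched many times per rollup episode.
    """
    def __init__(self, rollup_threshold=4):
        self.log = []    # small (op, key, value) records, a few bytes each
        self.btree = {}  # stand-in for the on-disk btree nodes
        self.threshold = rollup_threshold

    def insert(self, key, value):
        self.log.append(("insert", key, value))
        self._maybe_rollup()

    def delete(self, key):
        self.log.append(("delete", key, None))
        self._maybe_rollup()

    def _maybe_rollup(self):
        if len(self.log) < self.threshold:
            return
        # One bulk pass updates the btree nodes; the log then retires.
        for op, key, value in self.log:
            if op == "insert":
                self.btree[key] = value
            else:
                self.btree.pop(key, None)
        self.log.clear()
```

The point is that between rollups the btree nodes are never rewritten at all, which is where the UNDO-record bloat on small-fanout updates would otherwise come from.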
> HAMMER's B-Tree elements are probably huge compared to Tux3, and
> that's another limiting factor for the fan-out I can use.  My
> elements are 64 bytes each.

Yes, I mostly have 16-byte elements and am working on getting most of
them down to 12 or 8.

> 64x64 = 4K per B-Tree node.  I decided to go with fully expanded
> keys: 64 bit object id, 64 bit file-offset/db-key, 64 bit
> create_tid, 64 bit delete_tid, plus a 64-bit storage offset and
> other auxiliary info.  That's instead of using a radix-compressed
> key.  Radix compressed keys would have doubled the complexity of the
> B-Tree code, particularly with the middle-of-tree pointer caching
> that HAMMER does.

I use two-stage lookup, or four-stage if you count searches within
btree blocks.  This makes the search depth smaller in the case of
small files, and in the case of really huge files it adds depth
exactly as appropriate.  The index blocks end up cached in the page
cache (the buffer cache is just a flavor of page cache in recent
Linux) so there is little repeated descending through the btree
indices.  Instead, most of the probing is through the scarily fast
radix tree code, which has been lovingly optimized over the years to
avoid cache line misses and SMP lock contention.

I also received a proposal for a "fat" btree index from a
collaborator (Maciej) that included the file offset, but I really did
not like the... fattening.  A two-level btree yields less metadata
overall, which in my estimation is more important than saving some
btree probes.  The way things work these days, falling down from
level 1 cache to level 2, or from level 2 to level 3, costs much more
than executing some extra CPU instructions.  So optimization strategy
inexorably shifts away from minimizing instructions towards
minimizing cache misses.

> The historical access mechanic alone added major complexity to the
> B-Tree algorithms.
> I will note here that B-Tree searches have a very high cpu overhead
> no matter how you twist it, and being able to cache cursors within
> the B-Tree is absolutely required if you want the filesystem to
> perform well.  If you always start searches at the root your cpu
> overhead will be horrendous... so plan on caching cursors from the
> get-go.

Fortunately, we get that for free on Linux, courtesy of the page
cache radix trees :-)  I might eventually add some explicit cursor
caching, but various artists over the years have noticed that it does
not make as much difference as you might think.

> If I were to do radix compression I would also want to go with a
> fully dynamic element size and fully dynamic fan-out in order to
> best-compress each B-Tree node.  Definitely food for thought.

Indeed.  But Linux is braindamaged about large block size, so there
is very strong motivation to stay within physical page size for the
immediate future.  Perhaps if I get around to a certain hack that has
been perennially delayed, that situation will improve:

   "Variable sized page objects"
   http://lwn.net/Articles/37795/

> I'd love to do something like that.  I think radix compression
> would remove much of the topological bloat the B-Tree creates
> versus using blockmaps, generally speaking.

Topological bloat?

> Space management is currently HAMMER's greatest weakness, but it
> only applies to small storage systems.  Several things in HAMMER's
> design are simply not conducive to small storage.  The storage
> model is not fine-grained and (unless you are deleting a lot of
> stuff) needs reblocking to actually recover the freed space.
> The flushing algorithms need around 100MB of UNDO FIFO space
> on-media to handle worst-case dependencies (mainly directory/file
> visibility issues on crash recovery), and the front-end caching of
> modifying operations, since they don't know exactly how much actual
> storage will be needed, needs ~16MB of wiggle room to be able to
> estimate the on-media storage required to back the operations
> cached by the front-end.  Plus, on top of that, any sort of
> reblocking also needs some wiggle room to be able to clean out
> partially empty big-blocks and fill in new ones.

Ah, that is a very nice benefit of Tux3-style forward logging I
forgot to mention in the original post: transaction size is limited
only by available free space on the volume, because log commit blocks
can be anywhere.  Last night I dreamed up a log commit block variant
where the associated transaction data blocks can be physically
discontiguous, to make this assertion come true, shortly after
reading Stephen Tweedie's horror stories about what you have to do if
you must work around not being able to put an entire, consistent
transaction onto stable media in one go:

   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if:

  a) File data is always committed atomically

  b) All commit transactions are full transactions

He did remind me (via time travel from year 2000) of some details I
should write into the design explicitly, for example, logging orphan
inodes that are unlinked while open, so they can be deleted on replay
after a crash.  That is another nice application of forward logging,
which avoids the seek-happy linked list through the inode table that
Ext3 uses.

> On the flip side, the reblocker doesn't just de-fragment the
> filesystem, it is also intended to be used for expanding and
> contracting partitions, and adding/removing volumes in the next
> release.  Definitely a multi-purpose utility.

Good point.
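Returning to that commit block variant for a moment: one possible shape for it (an entirely hypothetical layout, sketched only to make the replay rule concrete) is a commit block that records the physical address and a checksum of each member block, so the members of one transaction can land wherever free space happens to be:

```python
import hashlib

def make_commit_block(members):
    """Build a hypothetical commit block for a transaction whose
    member blocks may be physically discontiguous.  `members` is a
    list of (physical_address, data) pairs; the checksum covers all
    of them, so a torn transaction is detectable on replay.
    """
    h = hashlib.sha1()
    for addr, data in members:
        h.update(addr.to_bytes(8, "little"))
        h.update(data)
    return {"members": [addr for addr, _ in members],
            "checksum": h.hexdigest()}

def replay_ok(commit, read_block):
    """On log replay, honor the transaction only if every member
    block made it to disk intact."""
    h = hashlib.sha1()
    for addr in commit["members"]:
        h.update(addr.to_bytes(8, "little"))
        h.update(read_block(addr))
    return h.hexdigest() == commit["checksum"]
```

Because the commit block carries the member addresses itself, nothing forces the transaction's blocks to be contiguous, which is what lets transaction size be bounded only by free space.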
I expect Tux3 will eventually have a reblocker (aka defragger).
There are some really nice things you can do, like:

  1) Set a new version so logical changes cease for the parent
     version.

  2) We want to bud off a given directory tree into its own volume,
     so start by deleting the subtree in the current version.  If any
     link counts in the directory tree remain nonzero in the current
     version then there are hard links into the subtree, so fail now
     and drop the new version.

  3) Reblock a given region of the inode table tree and all the files
     in it into one physically contiguous region of blocks.

  4) Add a free tree and other once-per-volume structures to the new
     home region.

  5) The new region is now entirely self-contained and even keeps its
     version history.  At the volume manager level, map it to a new,
     sparse volume that just has a superblock in the zeroth extent
     and the new, mini filesystem at some higher logical address.
     Remap the region in the original volume to empty space and add
     the empty space to the free tree.

  6) Optionally reblock the newly budded filesystem to the base of
     the new volume so utilities that do not play well with sparse
     volumes do not do silly things.

> So I'm not actually being cavalier about it, it's just that I had
> to make some major decisions on the algorithm design and I decided
> to weight the design more towards performance and large storage,
> and small storage suffered a bit.

Cavalier was a poor choice of words; the post was full of typos as
well, so I am not proud of it.  You are solving a somewhat different
problem and you have code out now, which is a huge achievement.
Still, I think you can iteratively improve your design using some of
the techniques I have stumbled upon.  There are probably some nice
tricks I can get from your code base too, once I delve into it.

> In any case, it sounds like Tux3 is using many similar ideas.  I
> think you are on the right track.

Thank you very much.
I think you are on the right track too, which you have a rather
concrete way of proving.

> I will add one big note of caution, drawing from my experience
> implementing HAMMER, because I think you are going to hit a lot of
> the same issues.
>
> I spent 9 months designing HAMMER and 9 months implementing it.
> During the course of implementing it I wound up throwing away
> probably 80% of the original design outright.  Among the major
> components I had to rewrite were the LOG mechanic (which ultimately
> became the meta-data UNDO FIFO), and the fine-grained storage
> mechanic (which ultimately became coarse-grained).  The UNDO FIFO
> was actually the saving grace; once that was implemented most of
> the writer-ordering dependencies went away (devolved into just
> flushing meta-data buffers after syncing the UNDO FIFO)... I
> suddenly had much, much more freedom in designing the other on-disk
> structures and algorithms.
>
> What I found while implementing HAMMER was that the on-disk
> topological design essentially dictated the extent of HAMMER's
> feature set AND most of its deficiencies (such as having to reblock
> to recover space), and the algorithms I chose often dictated other
> restrictions.  But the major restrictions came from the on-disk
> structures.
>
> Because of the necessarily tight integration between subsystems I
> found myself having to do major redesigns during the implementation
> phase.  Fixing one subsystem created a cascade effect that required
> tweaking other subsystems.  Even adding new features, such as the
> mirroring, required significant changes in the B-Tree deadlock
> recovery code.  I couldn't get away from it.  Ultimately I chose to
> simplify some of the algorithms rather than have to go through
> another 4 months of rewriting.  All major designs are an exercise
> in making trade-offs in topology, feature set, algorithmic
> complexity, debuggability, robustness, etc.  The list goes on
> forever.

The big ahas!
that eliminated much of the complexity in the Tux3 design were:

  * Forward logging - just say no to incomplete transactions

  * Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation
of Tux3 is that I can now background the LVM3 design process and see
what Tux3 really wants from it.  I am determined to match every
checkbox volume management feature of ZFS as efficiently or more
efficiently, without violating the traditional layering between
filesystem and block device, and without making LVM3 a Tux3-private
invention.

> Laters!

Hopefully not too much later.  I find this dialog very fruitful.  I
just wish such dialog would occur more often at the design/development
stage in Linux and other open source work, instead of each group
obsessively ignoring all "competing" designs and putting huge energy
into chatting away about the numerous bugs that arise from rushing
their design or ignoring the teachings of history.

Regards,

Daniel
