Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
On 20 May 2012 04:13, Bakul Shah <ba...@bitblocks.com> wrote:
> This last point is more or less independent of the FS (as long as an
> io buffer is page aligned and the io count is a multiple of the page
> size).

those restrictions are not necessary
Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
On 20 May 2012 04:13, Bakul Shah <ba...@bitblocks.com> wrote:
> When you suddenly need lots of memory for some memory intensive
> computation, it may be too late to evacuate the memory of your FS
> data.

we don't compute on file servers
Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
On May 20, 2012, at 5:13 AM, Bakul Shah <ba...@bitblocks.com> wrote:
> How often would you flush to disk? You still need to worry about the
> order of writing metadata.

that's the nice thing. it's so simple I don't have to worry about
order. you write new blocks and, once all of them have reached the
disk without errors, you write a new super with the address of the new
root. if you fail before the super is written, you have the old tree;
otherwise, you get the new one. the super has two copies of its data,
in case you fail in the middle of writing it; if they don't match, you
use the oldest one. The tmr proc writes new versions of the tree on a
periodic basis, and also when the number of dirty (memory addressed,
actually) blocks exceeds some value.

> You do have to keep track of free disk blocks. On disk. So a linked
> list would require you visit every freed block.

There's no linked list either :) There's a mark per block, an epoch
number; you have one block holding the marks for each block group. all
blocks given addresses on disk use the current value of the mark, or
epoch. when you want to collect more free blocks, you traverse the
tree and update the mark for its blocks. After that, you increment the
value of the mark that is used to recognize free blocks. Of course you
could fail at any time and, again, if you do that before writing the
new super, the only thing that happens is that you'll have to mark the
blocks again (the mark process is pruned when a visited block already
has the correct mark value). This was actually the code I had in place
to double check that the previous implementation for free block
management was correct, but I noticed that it was both doing the job
and not disturbing normal operation in the file system, so I removed
the management code and declared this debug check the actual
algorithm.

> I think an incore FS is the easy case, but you will have to face the
> issues of corner/boundary cases, various error conditions, and
> efficiency when dealing with real disks. These things are what
> introduce complexity and bugs. Soft updates in FFS took quite a while
> to shake out bugs. zfs took a long time. Hammer fs of DragonFly took
> a while. Pretty much every FS design has taken a while to be rock
> solid. Far longer than the original estimates of the designers, I
> think.

Yes, that's why I improved it mostly by removing features and
simplifying algorithms instead of using clever ones. It was not easy
to do that; it had like three or four rewrites. Funny, it was a lot
easier to write the first, much more complex, version of it. There are
no soft updates now, because the log makes them unneeded. And the
complex part of log management, which is reclaiming unused blocks, is
also gone because of the naive, yet effective, algorithm used now.

But you are right. I spent most of the time fixing problems with disks
that were full or had faults injected at the worst times. The
operation in non-boundary cases was always fine, even with the fancier
implementation I used before.

Regarding memory and processors, the code is aggressive and tries to
use everything, because the machine will be devoted just to serving
files.
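[For concreteness, here is a small C sketch of the commit and
free-block scheme as described in the message above. Every name in it
(Super, writeblocks, writedsuper, the in-memory mark array) is made up
for the example; creepy's real structures and on-disk layout differ.]

	#include <stdint.h>

	enum { Nblk = 1024 };	/* toy disk size, in blocks */

	typedef struct Super Super;
	struct Super {
		uint64_t root;	/* address of the root of the last frozen tree */
		uint64_t epoch;	/* mark value identifying live blocks */
	};

	/* two copies of the super data: a crash that interrupts the
	 * super write leaves them mismatched, and the older one is used */
	typedef struct Dsuper Dsuper;
	struct Dsuper {
		Super c[2];	/* written in order: c[0], then c[1] */
	};

	uint64_t mark[Nblk];	/* per-block marks (kept in mark blocks on disk) */

	/* stubs standing in for real disk i/o (hypothetical) */
	int writeblocks(void) { return 0; }		/* flush all dirty tree blocks */
	int writedsuper(Dsuper *s) { (void)s; return 0; }	/* write c[0], then c[1] */

	/*
	 * Commit: first put every new block on disk; only when all of
	 * them are written without errors, write a super naming the new
	 * root. A crash before the super write leaves the old tree.
	 */
	int
	commit(uint64_t newroot, uint64_t epoch)
	{
		Dsuper s;

		if(writeblocks() < 0)
			return -1;
		s.c[0].root = s.c[1].root = newroot;
		s.c[0].epoch = s.c[1].epoch = epoch;
		return writedsuper(&s);
	}

	/*
	 * Recovery: with the copies written in order, a mismatch means
	 * c[0] holds the interrupted new version while c[1] still holds
	 * the old one, so the old copy wins.
	 */
	Super
	recoversuper(Dsuper *s)
	{
		if(s->c[0].root != s->c[1].root || s->c[0].epoch != s->c[1].epoch)
			return s->c[1];
		return s->c[0];
	}

	/*
	 * Free-block check: a block is free iff its mark differs from
	 * the current epoch. To collect, walk the tree setting reachable
	 * marks to epoch+1 (pruning subtrees already marked), then commit
	 * with the incremented epoch; everything not visited becomes
	 * free at once, with no free list and no reference counts.
	 */
	int
	isfree(uint64_t addr, uint64_t epoch)
	{
		return mark[addr] != epoch;
	}

[Note how crash safety falls out of ordering alone: the marks and the
epoch only take effect when the super naming them reaches the disk.]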
Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
On Sat, 19 May 2012 00:45:58 +0200 Francisco J Ballesteros <n...@lsub.org> wrote:
>> Just curious. If the tree doesn't fit in memory, how do you decide
>> who to kick out? LRU? Sounds much like a cache fs. What does it buy
>> you over existing cache filesystems? Speaking more generally, not
>> just in the plan9 context.
>
> lru for clean blocks. but you really have the tree you use in memory,
> all of it if it fits. what it buys is simplicity, thus reliability,
> and speed. instead of a single program doing everything, you have
> several trying to use their memory and to avoid copying blocks in the
> main server. plus, it's going to be modified to exploit the upcoming
> nix zero copy framework.

This last point is more or less independent of the FS (as long as an
io buffer is page aligned and the io count is a multiple of the page
size).

> it's not cow. you reuse the memory of a frozen block instead of
> copying. you just melt it and reuse it. all this is in memory. cow
> happens only on the disk, but you don't wait for that. that's the
> main difference wrt others.

How often would you flush to disk? You still need to worry about the
order of writing metadata.

>>> When the disk gets full, all reachable blocks are marked and all
>>> other blocks are considered available for growing the log (this is
>>> a description of semantics, not of the implementation). Thus, the
>>> log is circular but jumps to the next available block each time it
>>> grows. If, after the mark process, the disk is still full, the
>>> file system becomes read only but for removing files.
>>
>> Why does circularity matter? It would make more sense to allocate
>> new blocks for a given file near its existing blocks regardless of
>> writing order.
>
> for simplicity, I removed most of the fanciest things I had in place
> in previous versions that could be a source of bugs. there are no
> ref. counters, for example. it's designed to operate on main memory,
> and it seems it does well even though the disk algorithms are naive.

You do have to keep track of free disk blocks. On disk. So a linked
list would require you visit every freed block.

>> Why not just use venti or some existing FS underneath rather than
>> come up with a new disk format?
>
> to avoid complexity, latency, and bugs.

I think an incore FS is the easy case, but you will have to face the
issues of corner/boundary cases, various error conditions, and
efficiency when dealing with real disks. These things are what
introduce complexity and bugs. Soft updates in FFS took quite a while
to shake out bugs. zfs took a long time. Hammer fs of DragonFly took a
while. Pretty much every FS design has taken a while to be rock solid.
Far longer than the original estimates of the designers, I think.

> that was the motivation, exploiting large main memories and keeping
> things simple and reliable. Time will tell if we managed to achieve
> that or not :)

Ah. I have been looking at SBCs with memories in the 128MB to 512MB
range! Can't afford an incore FS! But even if there are gigabytes of
memory, why would I want to dedicate a lot of it to a filesystem? Most
of the FS data is going to be cold most of the time. When you suddenly
need lots of memory for some memory intensive computation, it may be
too late to evacuate the memory of your FS data. But memory is just a
file cache; this data can be thrown away any time if you need more
space. And by making sure the cache holds a bounded amount of dirty
data, you lose no more than that amount of data in case of a crash.

> sorry I wrote in Sioux this time. its been a long day here :)

Thanks for taking the time.
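[A tiny C sketch of the melt operation described above, to contrast it
with copy-on-write; the Block structure and flag names are
hypothetical, not creepy's actual ones.]

	#include <stdint.h>

	enum {
		Bsize = 8192,	/* toy block size */
		Frozen = 1,	/* block also lives, immutably, on disk */
	};

	typedef struct Block Block;
	struct Block {
		uint64_t daddr;	/* disk address, meaningful only while frozen */
		int	 flags;
		unsigned char data[Bsize];
	};

	/*
	 * Melt: make a frozen in-memory block mutable again. Unlike
	 * copy-on-write, the same memory is reused in place; the block
	 * merely forgets its disk address and is handed a fresh log
	 * address the next time the tree is frozen and written. The
	 * copy-on-write effect exists only on disk, and nobody waits
	 * for it.
	 */
	Block*
	melt(Block *b)
	{
		if(b->flags & Frozen){
			b->flags &= ~Frozen;
			b->daddr = 0;	/* new address assigned at the next freeze */
		}
		return b;
	}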
Always nice to see yet another attempt at getting this right :-)
[9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
Because I noticed Ken's worm fs was being discussed in this thread, I
thought I might just drop here the man page for a new alternate file
server that we wrote for nix. It's not yet ready for use (I'm using
it, but it's still under testing, and the version in the main nix tree
is now out of date, btw; will send the new one soon). But I thought it
might be of interest to 9fans, if only to get comments.

CREEPY(4)                                                     CREEPY(4)

NAME
     9pix, fmt, rip, arch - creepy file server and WORM archive

SYNOPSIS
     creepy/9pix [ -DFLAGS ] [ -ra ] [ -A addr ] [ -S srv ] disk
     creepy/fmt [ -DFLAGS ] [ -wy ] disk
     creepy/rip [ -DFLAGS ] [ -a ] [ -c n ] [ -A addr ] [ -S srv ] disk
     creepy/arch [ -DFLAGS ] [ dir ]

DESCRIPTION
     Creepy is a prototype file server for Nix. It maintains a mutable
     file tree with unix semantics, kept in main memory, served
     through 9P, see intro(5), and through IX.

     Creepy/9pix is the main file server program. It serves a file
     tree through 9P and IX. The tree kept in memory is mutable, but
     frozen versions of the tree are written to disk, both upon
     request and on a periodic basis, to survive power outages and
     other problems, and to be able to operate on trees that do not
     fit in main memory. The tree(s) stored on disk are frozen and
     cannot be changed once written.

     By default the program listens for 9P on the standard TCP port
     and posts a connection that can be mounted at /srv/9pix. Flags -A
     and -S may be used to specify an alternate network address and/or
     srv(3) file to post. Using these flags makes the program neither
     listen nor post to srv(3) unless a flag indicates so. Flag -r
     makes the program operate in read-only mode, and flag -a starts
     the program without requiring authentication to mount the file
     tree.

     The disk is organized as a log of blocks. When a new version of
     the tree must be written to disk, all blocks that changed are
     given disk addresses and are appended to the log. Once written,
     they are frozen. If new changes are made to the tree, blocks are
     melted and forget their previous addresses: each time they are
     written again, they are assigned new ones.

     When the disk gets full, all reachable blocks are marked and all
     other blocks are considered available for growing the log (this
     is a description of semantics, not of the implementation). Thus,
     the log is circular but jumps to the next available block each
     time it grows. If, after the mark process, the disk is still
     full, the file system becomes read only but for removing files.

     Before using a disk it must be formatted using creepy/fmt. This
     initializes the blocks used to store marks for the mark and sweep
     process and also initializes a super block and an empty root
     directory. Flag -y forces the format even if the disk seems to
     contain a fossil(4) or creepy(4) partition. Flag -w is used to
     format the partition for a WORM (described later) and not for a
     main file tree.

     Creepy/rip is the same file server program, but operates in WORM
     mode. In this case, no mark and sweep for free blocks will ever
     happen. Blocks are consumed from the disk until it becomes full.
     The file tree served is always read-only. Operating the WORM in
     administrative mode to add more trees or new versions of the
     archived trees does not require any protocol: it can be done
     using the standard file interface used to operate on any other
     tree, by mounting and changing it. An alternate mount specifier,
     wormwr, can be used to mount the tree in read-write mode, to
     update the archive.
     Updating the archive is performed by creating new trees with the
     conventional /treename/year/mmdd paths on the WORM. But note that
     such paths are not enforced by the program at all. Before
     updating a tree in the archive for a new day, a control request
     described later can be used to link the directory for the
     previous version of the archive to the new one. After that,
     changes made to files would in effect copy on write all blocks
     affected, and refer to old ones when they did not change.

     Creepy/arch is a program started by creepy/9pix to archive
     snapshots of the tree into a directory provided by creepy/rip (or
     by any other file server). The program is not expected to be run
     by users, and
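[As an illustration of the allocation policy the DESCRIPTION sketches
(a circular log that jumps to the next available block, with per-block
epoch marks), here is a hedged C fragment; the names and the in-memory
mark array are assumptions for the example, not the real
implementation.]

	#include <stdint.h>

	enum { Nblk = 1024 };	/* toy disk size, in blocks */

	uint64_t mark[Nblk];	/* per-block epoch marks (cf. fmt's mark blocks) */
	uint64_t epoch;		/* current live mark */
	uint64_t hand;		/* where the log last grew */

	/*
	 * Grow the log: the log is circular, but each growth step jumps
	 * to the next available block, i.e. the next block whose mark is
	 * not the current epoch. Newly allocated blocks take the current
	 * epoch so the next mark pass sees them as live. Returns -1 when
	 * the disk is full; at that point a mark of all reachable blocks
	 * may free space, and if not, the file system becomes read-only
	 * but for removing files.
	 */
	int64_t
	logalloc(void)
	{
		uint64_t i, a;

		for(i = 0; i < Nblk; i++){
			a = (hand + 1 + i) % Nblk;
			if(mark[a] != epoch){	/* free: not marked this epoch */
				mark[a] = epoch;
				hand = a;
				return a;
			}
		}
		return -1;	/* full: time to mark, or go read-only */
	}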
Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)
On Fri, 18 May 2012 23:13:54 +0200 Nemo <n...@lsub.org> wrote:
> Creepy is a prototype file server for Nix. It maintains a mutable
> file tree with unix semantics, kept in main memory, served through
> 9P, see intro(5), and through IX.

Creepy? It has become a creepy word now!

> Creepy/9pix is the main file server program. It serves a file tree
> through 9P and IX. The tree kept in memory is mutable, but frozen
> versions of the tree are written to disk, both upon request and on a
> periodic basis, to survive power outages and other problems, and to
> be able to operate on trees that do not fit in main memory. The
> tree(s) stored on disk are frozen and cannot be changed once written.

Just curious. If the tree doesn't fit in memory, how do you decide who
to kick out? LRU? Sounds much like a cache fs. What does it buy you
over existing cache filesystems? Speaking more generally, not just in
the plan9 context.

> The disk is organized as a log of blocks. When a new version of the
> tree must be written to disk, all blocks that changed are given disk
> addresses and are appended to the log. Once written, they are frozen.
> If new changes are made to the tree, blocks are melted and forget
> their previous addresses: each time they are written again, they are
> assigned new ones.

I don't understand the use of the words frozen and melted here. How is
this different from how things work now? It sounds like something
worse than what venti or zfs do, which is to leave the old blocks
alone and allocate new space for new blocks.

> When the disk gets full, all reachable blocks are marked and all
> other blocks are considered available for growing the log (this is a
> description of semantics, not of the implementation). Thus, the log
> is circular but jumps to the next available block each time it grows.
> If, after the mark process, the disk is still full, the file system
> becomes read only but for removing files.

Why does circularity matter? It would make more sense to allocate new
blocks for a given file near its existing blocks regardless of writing
order.

Why not just use venti or some existing FS underneath rather than come
up with a new disk format?

Sounds like a fun project but it would be nice to see the rationale
for it. Thanks!
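[Nemo answers the eviction question earlier in this digest: lru for
clean blocks, with dirty (melted) blocks kept in memory until the next
freeze writes them to the log. A minimal sketch of such a policy, with
hypothetical structures rather than creepy's actual code:]

	#include <stddef.h>

	enum { Dirty = 2 };

	typedef struct Block Block;
	struct Block {
		Block	*prev;	/* LRU list, most recent at head */
		Block	*next;
		int	flags;
	};

	typedef struct Lru Lru;
	struct Lru {
		Block	*head;
		Block	*tail;
	};

	/* unlink b from the list */
	static void
	del(Lru *l, Block *b)
	{
		if(b->prev) b->prev->next = b->next; else l->head = b->next;
		if(b->next) b->next->prev = b->prev; else l->tail = b->prev;
		b->prev = b->next = NULL;
	}

	/* on every access, move the block to the front */
	void
	touch(Lru *l, Block *b)
	{
		del(l, b);
		b->next = l->head;
		if(l->head)
			l->head->prev = b;
		l->head = b;
		if(l->tail == NULL)
			l->tail = b;
	}

	/* kick out the least recently used clean block; dirty (melted)
	 * blocks stay pinned until a freeze writes them to the log */
	Block*
	evict(Lru *l)
	{
		Block *b;

		for(b = l->tail; b != NULL; b = b->prev)
			if((b->flags & Dirty) == 0){
				del(l, b);
				return b;
			}
		return NULL;	/* everything dirty: nothing evictable yet */
	}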