Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-20 Thread Charles Forsyth
those restrictions are not necessary

On 20 May 2012 04:13, Bakul Shah ba...@bitblocks.com wrote:

 This last point is more or less independent of the FS (as long
 as an io buffer is page aligned and io count is a multiple of
 page size).



Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-20 Thread Charles Forsyth
we don't compute on file servers

On 20 May 2012 04:13, Bakul Shah ba...@bitblocks.com wrote:

 When you suddenly need
 lots of memory for some memory intensive computation, it may
 be too late to evacuate the memory of your FS data



Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-20 Thread Francisco J Ballesteros



On May 20, 2012, at 5:13 AM, Bakul Shah ba...@bitblocks.com wrote:

 
 How often would you flush to disk? You still need to worry about the order
 of writing metadata.
 

that's the nice thing. it's so simple I don't have to worry about order. you write the new
blocks and, once all of them have reached the disk without errors, you write a new super
with the address of the new root. if you fail before the super is written, you have the old
tree; otherwise, you get the new one. the super has two copies of its data, in case you fail
in the middle of writing it; if they don't match, you use the oldest one.

The tmr proc writes new versions of the tree on a periodic basis and also when the
number of dirty (or memory-addressed, actually) blocks exceeds some value.
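
As a rough C illustration of that write order only (this is not the creepy source; Super,
Superblock, writedirty, writesuper and commit are made-up names, and error handling is
reduced to the bare minimum):

#include <stdint.h>
#include <string.h>

/* the super keeps two copies of its data; a torn write leaves them different */
typedef struct Super Super;
struct Super {
	uint64_t root;		/* disk address of the root of the last frozen tree */
	uint64_t epoch;		/* current mark value (see the free-block discussion) */
};

typedef struct Superblock Superblock;
struct Superblock {
	Super c[2];
};

/* pretend helpers: append every dirty block to the log, flush the super */
extern int writedirty(void);		/* returns -1 on any write error */
extern int writesuper(Superblock*);	/* returns -1 on any write error */

/*
 * Commit a new frozen tree.  The order is the whole story: data blocks
 * first, super last.  A crash anywhere before writesuper() leaves the
 * old super on disk and therefore the old tree.
 */
int
commit(Superblock *sb, uint64_t newroot, uint64_t epoch)
{
	if(writedirty() < 0)
		return -1;	/* the old tree on disk is untouched */
	sb->c[0].root = sb->c[1].root = newroot;
	sb->c[0].epoch = sb->c[1].epoch = epoch;
	return writesuper(sb);
}

/*
 * At start up: if the two copies differ, the last super write was torn
 * and (per the description above) the older copy is the one to trust.
 */
int
superok(Superblock *sb)
{
	return memcmp(&sb->c[0], &sb->c[1], sizeof(Super)) == 0;
}

In this reading, the tmr proc amounts to calling something like commit() from a timer, or
whenever the count of dirty blocks crosses the configured threshold.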


 You do have to keep track of free disk blocks. On disk. So a linked list
 would require you visit every freed block.
 


There's no linked list either :)
There's a mark per block, an epoch number, so you have one block with marks for each
block group. all blocks given addresses on disk use the current value for the mark or
epoch. when you want to collect more free blocks, you traverse the tree and update the
mark for the blocks you reach. After that, you increment the value for the mark that is
used to recognize free blocks.

Of course you could fail at any time and, again, if you do that before writing the new
super the only thing that happens is that you'll have to mark the blocks again (you prune
the mark process when you find that the visited block already has the correct mark value).
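
Purely as a hedged sketch of that epoch trick, not the creepy code (Dblock, mark[],
stamp, markblock, collect and isfree are invented names, and the real marks live in
per-group mark blocks on disk rather than in an array):

#include <stdint.h>

enum { Nblk = 4096, Nchild = 16 };

typedef struct Dblock Dblock;
struct Dblock {
	uint64_t addr;			/* disk address; used here to index the marks */
	int	nchild;
	Dblock	*child[Nchild];
};

uint64_t mark[Nblk];			/* one mark (epoch) per block */
uint64_t epoch = 1;			/* value stamped on blocks in use */

/* the allocator stamps a block with the current epoch when it gets a disk address */
void
stamp(uint64_t addr)
{
	mark[addr] = epoch;
}

/* mark everything reachable with the new value; prune already-marked subtrees */
void
markblock(Dblock *b, uint64_t new)
{
	int i;

	if(mark[b->addr] == new)
		return;		/* already visited, e.g. when re-marking after a crash */
	mark[b->addr] = new;
	for(i = 0; i < b->nchild; i++)
		markblock(b->child[i], new);
}

/* collect: traverse the frozen tree, then advance the value used to
 * recognize free blocks, so anything still carrying an older mark is free */
void
collect(Dblock *root)
{
	markblock(root, epoch+1);
	epoch++;
}

int
isfree(uint64_t addr)
{
	return mark[addr] < epoch;
}

The pruning test in markblock() is also what makes re-running the mark after a crash
cheap: subtrees already stamped with the new value are skipped wholesale.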

This was actually the code I had in place to double-check that the previous implementation
for free block management was correct, but I noticed that it was both doing the job and
not disturbing normal operation in the file system, so I removed the management code and
declared this debug check the actual algorithm.


 I think an incore FS is the easy case, but you will have to face
 the issues of corner/boundary cases, various error conditions
 and efficiency when dealing with real disks. These things are
 what introduce complexity and bugs. Soft updates in FFS took
 quite a while to shake out bugs.  zfs took a long time.  Hammer
 fs of DragonFly took a while. Pretty much every FS design has
 taken a while to be rock solid.  Far longer than the original
 estimates of the designers, I think.
 


Yes, that's why I improved it mostly by removing features and simplifying algorithms
instead of using clever ones. It was not easy to do that; it took like three or four
rewrites. Funny that it was a lot easier to write the first, much more complex, version of it.

There are no soft updates now, because the log makes them unnecessary. And the complex
part of log management, which is reclaiming unused blocks, is also gone because of the
naive, yet effective, algorithm used now.

But you are right. I spent most of the time fixing problems with disks that were full or
had faults injected at the worst times. The operation in non-boundary cases was always
fine, even with the fancier implementation I used before.

Regarding memory and processors, the code is aggressive and tries to use everything,
because the machine will be devoted just to serving files.


Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-19 Thread Bakul Shah
On Sat, 19 May 2012 00:45:58 +0200 Francisco J Ballesteros n...@lsub.org  
wrote:

  Just curious.
  If the tree doesn't fit in memory, how do you decide who to
  kick out? LRU? Sounds much like a cache fs. What does it buy
  you over existing cache filesystems? Speaking more generally,
  not just in the plan9 context.
 
 lru for clean blocks. but you really have the tree you use in memory, all if
 it fits.  what it buys is simplicity, thus reliability, and speed.
 instead of a single program doing everything, you have several trying to use
 their memory and to avoid copying blocks in the main server.
 plus, it's going to be modified to exploit the upcoming nix zero copy 
 framework.

This last point is more or less independent of the FS (as long
as an io buffer is page aligned and io count is a multiple of
page size).

 it's not cow. you reuse the memory of a frozen block instead of copying.
 you just melt it and reuse it.

 all this is in memory. cow happens only on the disk, but you don't wait for that.
 that's the main difference wrt others.
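
(Purely as an editorial aside, and not taken from the creepy source: a minimal sketch of
what that melt step could look like in C. Memblk, frozen, daddr and melt are names
invented for the illustration.)

#include <stdint.h>

enum { Blksz = 16*1024 };

typedef struct Memblk Memblk;
struct Memblk {
	uint64_t daddr;		/* disk address of the frozen copy; 0 if none yet */
	int	frozen;		/* set once the block belongs to a frozen tree */
	unsigned char data[Blksz];
};

/*
 * Called before a block is modified.  Nothing is copied in memory: the
 * same Memblk is reused.  Only the disk sees copy-on-write, because the
 * melted block forgets its old disk address and is given a new one the
 * next time a frozen version of the tree is written out.
 */
Memblk*
melt(Memblk *b)
{
	if(b->frozen){
		b->frozen = 0;
		b->daddr = 0;	/* the old disk block remains until the mark reclaims it */
	}
	return b;
}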

How often would you flush to disk? You still need to worry about the order
of writing metadata.

   When the disk gets full, all reachable blocks are marked and
   all other blocks are considered available for growing the
   log (this is a description of semantics, not of the imple-
   mentation). Thus, the log is circular but jumps to the next
   available block each time it grows.  If, after the mark pro-
   cess, the disk is still full, the file system becomes read
   only but for removing files.
 
  Why does circularity matter? It would make more sense to allocate
  new blocks for a given file near its existing blocks regardless of
  writing order.
 
 for simplicity, I removed most of the fanciest things I had before in place in
 previous versions that could be a source of bugs. there are no ref. counters,
 for example. it's designed to operate on
 main memory, and it seems it does well even though the disk algorithms are
 naive.

You do have to keep track of free disk blocks. On disk. So a linked list
would require you visit every freed block.

  Why not just use venti or some existing FS underneath than
  come up with a new disk format?
 
 to avoid complexity, latency, and bugs.

I think an incore FS is the easy case, but you will have to face
the issues of corner/boundary cases, various error conditions
and efficiency when dealing with real disks. These things are
what introduce complexity and bugs. Soft updates in FFS took
quite a while to shake out bugs.  zfs took a long time.  Hammer
fs of DragonFly took a while. Pretty much every FS design has
taken a while to be rock solid.  Far longer than the original
estimates of the designers, I think.

 that was the motivation, exploiting large main memories and keeping things
 simple and reliable. Time will tell if we managed to achieve that or not :)

Ah. I have been looking at SBCs with memories in the 128MB to
512MB range! Can't afford an incore FS!

But even if there are gigabytes of memory, why would I want to
dedicate a lot of it to a filesystem? Most of the FS data is
going to be cold most of the time. When you suddenly need
lots of memory for some memory-intensive computation, it may
be too late to evacuate your FS data from memory. But if
memory is just a file cache, this data can be thrown away any
time you need more space.  And by making sure the cache
holds a bounded amount of dirty data, you lose no more than
that amount of data in case of a crash.

 sorry I wrote in Sioux this time. it's been a long day here :)

Thanks for taking the time. Always nice to see yet another
attempt at getting this right :-)



[9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-18 Thread Nemo
Because I noticed Ken's worm fs was being discussed in this thread, I thought I might
just drop the man page here for a new alternate file server that we wrote for nix.

It's not yet ready for use (I'm using it, but it's still under testing, and the version
in the main nix tree is now out of date, btw; will send the new one soon).

But I thought it might be of interest to 9fans, if only to get comments.


 CREEPY(4)   CREEPY(4)

 NAME
  9pix, fmt, rip, arch - creepy file server and WORM archive

 SYNOPSIS
  creepy/9pix [ -DFLAGS ] [ -ra ] [ -A addr ] [ -S srv ] disk

  creepy/fmt [ -DFLAGS ] [ -wy ] disk

  creepy/rip [ -DFLAGS ] [ -a ] [ -c n ] [ -A addr ] [ -S srv
  ] disk

  creepy/arch [ -DFLAGS ] [ dir ]

 DESCRIPTION
  Creepy is a prototype file server for Nix. It maintains a
  mutable file tree with unix semantics, kept in main memory,
  served through 9P, see intro(5), and through IX.

  Creepy/9pix is the main file server program. It serves a
  file tree through 9P and IX.  The tree kept in memory is
  mutable. But frozen versions of the tree are written to
  disk, both upon request and also on a periodic basis, to
  survive power outages and other problems, and to be able to
  operate on trees that do not fit on main memory.  The
  tree(s) stored on disk are frozen and cannot be changed once
  written.

  By default the program listens for 9P on the standard TCP
  port and posts a connection that can be mounted at
  /srv/9pix.  Flags -A and -S may be used to specify an
  alternate network address and/or srv(3) file to post.
  Using these flags makes the program neither listen nor
  post to srv(3) unless a flag indicates otherwise.  Flag
  -r makes the program operate in read-only mode, and flag
  -a starts the program without requiring authentication
  to mount the file tree.

  The disk is organized as a log of blocks. When a new version
  of the tree must be written to disk, all blocks that changed
  are given disk addresses and are appended to the log. Once
  written, they are frozen.  If new changes are made to the
  tree, blocks are melted and forget their previous addresses:
  each time they are written again, they are assigned new
  ones.

  When the disk gets full, all reachable blocks are marked and
  all other blocks are considered available for growing the
  log (this is a description of semantics, not of the imple-
  mentation). Thus, the log is circular but jumps to the next
  available block each time it grows.  If, after the mark pro-
  cess, the disk is still full, the file system becomes read
  only but for removing files.

  Before using a disk it must be formatted using creepy/fmt.
  This initializes blocks used to store marks for the mark and
  sweep process and also initializes a super block and an
  empty root directory. Flag -y forces the format even if the
  disk seems to contain a fossil (4) or creepy (4) partition.
  Flag -w is used to format the partition for a WORM
  (described later) and not for a main file tree.

  Creepy/rip is the same file server program, but operates in
  WORM mode. In this case, no mark and sweep for free blocks
  will ever happen. Blocks are consumed from the disk until it
  becomes full.  The file tree served is always read-only.

  Operating the WORM in administrative mode to add more
  trees or new versions of the archived trees does not require
  any protocol: it can be done using the standard file inter-
  face used to operate on any other tree, by mounting and
  changing it.

  An alternate mount specifier, wormwr, can be used to mount
  the tree in read-write mode, to update the archive.  Updat-
  ing the archive is performed by creating new trees with the
  conventional /treename/year/mmdd paths on the WORM. But note
  that such paths are not enforced by the program at all.
  Before updating a tree in the archive, for a new day, a con-
  trol request described later can be used to link the direc-
  tory for the previous version of the archive to the new one.
  After that, changes made to files would in effect copy on
  write all blocks affected, and refer to old ones when they
  did not change.

  Creepy/arch is a program started by creepy/9pix to archive
  snapshots of the tree into a directory provided by
  creepy/rip (or by any other file server).  The program is
  not expected to be run by users, and 

Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

2012-05-18 Thread Bakul Shah
On Fri, 18 May 2012 23:13:54 +0200 Nemo n...@lsub.org  wrote:
   Creepy is a prototype file server for Nix. It maintains a
   mutable file tree with unix semantics, kept in main memory,
   served through 9P, see intro(5), and through IX.

Creepy? It has become a creepy word now!

   Creepy/9pix is the main file server program. It serves a
   file tree through 9P and IX.  The tree kept in memory is
   mutable. But frozen versions of the tree are written to
   disk, both upon request and also on a periodic basis, to
   survive power outages and other problems, and to be able to
   operate on trees that do not fit on main memory.  The
   tree(s) stored on disk are frozen and cannot be changed once
   written.

Just curious.
If the tree doesn't fit in memory, how do you decide who to
kick out? LRU? Sounds much like a cache fs. What does it buy
you over existing cache filesystems? Speaking more generally,
not just in the plan9 context.
 

   The disk is organized as a log of blocks. When a new version
   of the tree must be written to disk, all blocks that changed
   are given disk addresses and are appended to the log. Once
   written, they are frozen.  If new changes are made to the
   tree, blocks are melted and forget their previous addresses:
   each time they are written again, they are assigned new
   ones.

I don't understand the use of the words frozen and melted here.  How
is this different from how things work now? Something worse
than what venti or zfs do, which is to leave the old blocks
alone and allocate new space for new blocks?

   When the disk gets full, all reachable blocks are marked and
   all other blocks are considered available for growing the
   log (this is a description of semantics, not of the imple-
   mentation). Thus, the log is circular but jumps to the next
   available block each time it grows.  If, after the mark pro-
   cess, the disk is still full, the file system becomes read
   only but for removing files.

Why does circularity matter? It would make more sense to allocate
new blocks for a given file near its existing blocks regardless of
writing order.

Why not just use venti or some existing FS underneath than
come up with a new disk format?

Sounds like a fun project but it would be nice to see the
rationale for it.

Thanks!