Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-11 Thread Matt Mackall
On Fri, May 11, 2007 at 03:46:41AM -0600, Valerie Henson wrote:
> On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote:
> >
> > We will, unfortunately, need to be able to check an entire directory
> > at once. There's no other efficient way to assure that there are no
> > duplicate names in a directory, for instance.
> 
> I don't see that being a major problem for the vast majority of
> workloads.
> 
> > In summary, checking a tile requires trivial checks on all the inodes
> > and directories that point into a tile. Inodes, directories, and data
> > that are inside a tile get checked more thoroughly but still don't
> > need to do much pointer chasing.
> 
> Okay, I'm totally convinced - checking a tile-at-a-time works!  I'm
> going to steal as many of your ideas as possible and write ChileFS. :)
> Per-block inode rmap in particular has so many advantages that I'm
> ranking it up with checksums as a must-have feature.
> 
> Now for the hard part: repair.  If you find an indirect block or
> extent with a bad checksum, how much of the file system are you going
> to have to read to fix the dangling blocks?  I can see a speed-up by
> reading just the rmaps and looking for the associated inode number.
> What about an inode that has been corrupted?  You could at least get
> the inode number out of the rmap, but your pointers to your first
> level indirect blocks are gone.  A directory block?  No way to get
> useful information out of the rmap there.  Ad nauseam.  This is where
> I really like having the encapsulation of chunkfs, despite all the
> nasty continuation inode bits.

There are a few interesting possibilities here.

First, we've got several new sources of integrity checking. If a block
points to an inode and the inode doesn't point back to it, we follow
the inode's corresponding pointer forward and back. If that pair turns
out to be consistent, we know that the original block is in the wrong.
If the pointers disagree and neither side checks out, we can also fall
back on the tile header and inode CRCs to decide which to trust.

Failing that, stray blocks can get attached to an inode in lost+found,
and we can reattach them later during a full filesystem sweep.

Another possibility is that we can compare forward and backward
pointers at runtime. This is probably a good idea anyway: we've got to
read tile headers for runtime CRC checks, so we might as well check the
pointers too. If we discover a pointer mismatch, we can either orphan
data blocks or recover orphans on the fly. So doing a
filesystem-wide backup will also do most of an fsck.
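
Putting those pieces together, the arbitration might look roughly like
this (untested sketch; all structures and helper names are made up, not
actual TileFS code):

#include <stdbool.h>
#include <stdint.h>

struct rmap_entry {          /* back pointer kept in the tile header */
    uint32_t ino;            /* owning inode */
    uint64_t fpos;           /* offset of this block within that file */
    uint32_t crc;            /* checksum of the block contents */
};

enum verdict { BLOCK_IS_WRONG, INODE_IS_WRONG, UNDECIDED };

/* hypothetical helpers */
extern uint64_t inode_block_at(uint32_t ino, uint64_t fpos); /* fwd ptr */
extern const struct rmap_entry *rmap_of(uint64_t blkno);
extern bool tile_header_crc_ok(uint64_t blkno);
extern bool inode_crc_ok(uint32_t ino);

static enum verdict arbitrate(uint64_t blkno, const struct rmap_entry *back)
{
    uint64_t fwd = inode_block_at(back->ino, back->fpos);

    if (fwd == blkno)
        return UNDECIDED;               /* no mismatch after all */

    /* If the inode's forward pointer and that block's back pointer agree
     * with each other, the pair is consistent and the stray block we
     * started from is in the wrong.  (0 here stands for a hole.) */
    if (fwd && rmap_of(fwd)->ino == back->ino &&
        rmap_of(fwd)->fpos == back->fpos)
        return BLOCK_IS_WRONG;

    /* Neither pointer checks out: let the CRCs break the tie. */
    if (tile_header_crc_ok(blkno) && !inode_crc_ok(back->ino))
        return INODE_IS_WRONG;
    if (!tile_header_crc_ok(blkno) && inode_crc_ok(back->ino))
        return BLOCK_IS_WRONG;

    return UNDECIDED;        /* stray block goes to lost+found for now */
}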

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-11 Thread Valerie Henson
On Wed, May 09, 2007 at 02:51:41PM -0500, Matt Mackall wrote:
>
> We will, unfortunately, need to be able to check an entire directory
> at once. There's no other efficient way to assure that there are no
> duplicate names in a directory, for instance.

I don't see that being a major problem for the vast majority of
workloads.

> In summary, checking a tile requires trivial checks on all the inodes
> and directories that point into a tile. Inodes, directories, and data
> that are inside a tile get checked more thoroughly but still don't
> need to do much pointer chasing.

Okay, I'm totally convinced - checking a tile-at-a-time works!  I'm
going to steal as many of your ideas as possible and write ChileFS. :)
Per-block inode rmap in particular has so many advantages that I'm
ranking it up with checksums as a must-have feature.

Now for the hard part: repair.  If you find an indirect block or
extent with a bad checksum, how much of the file system are you going
to have to read to fix the dangling blocks?  I can see a speed-up by
reading just the rmaps and looking for the associated inode number.
What about an inode that has been corrupted?  You could at least get
the inode number out of the rmap, but your pointers to your first
level indirect blocks are gone.  A directory block?  No way to get
useful information out of the rmap there.  Ad nauseam.  This is where
I really like having the encapsulation of chunkfs, despite all the
nasty continuation inode bits.

-VAL

P.S. Note the email address change - I'm leaving Intel as of Tuesday
and my [EMAIL PROTECTED] address will probably either bounce
or black hole - not sure which.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Jörn Engel
On Wed, 9 May 2007 14:51:41 -0500, Matt Mackall wrote:
> On Wed, May 09, 2007 at 11:59:23AM -0700, Valerie Henson wrote:
> > 
> > Hrm.  Can you help me understand how you would check i_size then?
> 
> That's pretty straightforward, I think. When we check an inode, we
> have to check whether it has a block that corresponds with i_size, and
> none beyond that.

i_size is indeed simple, but that got me thinking.  i_blocks is a much
harder problem.  Luckily it is almost the same problem as the free/used
block count for the filesystem.  And again the solution would be to have
a tree structure and have a sub-total for each node in the tree.

Now, inodes already have a tree structure, the indirect blocks.  So
indirect blocks would need to get an extra field somewhere to store how
many used blocks are below them.  The only problem is: indirect blocks
have a nice power-of-two size and no spare space around.
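
Ignoring the space problem for a second, the check itself would be
cheap.  Something like this, where the subtotal field is exactly the
hypothetical extra field that doesn't fit today (not real ext2 code):

#include <stdint.h>

#define PTRS_PER_BLOCK 1024          /* 4K blocks, 32-bit block pointers */

struct ind_block {
    uint32_t ptr[PTRS_PER_BLOCK];
    uint32_t subtotal;               /* hypothetical: used blocks (data
                                        plus lower indirect blocks) below */
};

/* Verify i_blocks from the top-level subtotals alone; each deeper level
 * is checked the same way, one node at a time, so no full tree walk. */
static int i_blocks_ok(uint32_t i_blocks, uint32_t direct_blocks,
                       const struct ind_block *ind,
                       const struct ind_block *dind,
                       const struct ind_block *tind)
{
    uint64_t sum = direct_blocks;

    if (ind)
        sum += ind->subtotal + 1;    /* +1 for the indirect block itself */
    if (dind)
        sum += dind->subtotal + 1;
    if (tind)
        sum += tind->subtotal + 1;

    return sum == i_blocks;
}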

> That begs the question of when we check various pieces of data. It
> seems the best time to check the various elements of an inode is when
> we're checking the tile it lives on. This is when we'd check i_size,
> that link counts made sense and that the ring of hardlinks was
> correct, etc. 

Yup.  Checking i_size costs O(log(n)); checking i_blocks with the above
method is O(log(n)) as well.  The hardlink ring is O(number of links).
For most people, who don't have a forest of hard-linked kernel trees
around, that should be fairly small as well.

I believe for large files it is important not to check the complete
file.  Just as we can divide and conquer the physical device, we can do
the same with files.  Although I wonder if that would require a dirty
bit for inodes as well.

> We will, unfortunately, need to be able to check an entire directory
> at once. There's no other efficient way to assure that there are no
> duplicate names in a directory, for instance.

There is, as long as directories are in an htree or similar format.
The problem is the same as fast lookup.
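
Roughly: with entries kept in name-hash order, duplicates can only hide
inside a run of equal hashes, so one streaming pass with a tiny buffer
finds them.  A sketch with invented helper names:

#include <stdbool.h>
#include <string.h>

struct hashed_dirent {
    unsigned int hash;
    char name[256];
};

/* hypothetical iterator returning directory entries in hash order */
extern bool next_entry(void *iter, struct hashed_dirent *out);

static bool has_duplicate_names(void *iter)
{
    struct hashed_dirent run[64];    /* current run of equal hashes;
                                        collision runs are tiny */
    int run_len = 0;
    struct hashed_dirent cur;

    while (next_entry(iter, &cur)) {
        if (run_len && cur.hash != run[0].hash)
            run_len = 0;             /* new hash value starts a new run */
        for (int i = 0; i < run_len; i++)
            if (!strcmp(cur.name, run[i].name))
                return true;         /* duplicate name found */
        if (run_len < 64)
            run[run_len++] = cur;
    }
    return false;
}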

Jörn

-- 
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Matt Mackall
On Wed, May 09, 2007 at 12:01:13PM -0700, Valerie Henson wrote:
> On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> > > There are a number of filesystem corruptions this algorithm won't
> > > catch.  The most obvious is one where the directory tree isn't really
> > > a tree, but a cyclic graph.   What if you have something like this:
> > > 
> > >   A <+
> > >  / \ |
> > > B   C^
> > >/ |
> > >   D->+
> > > 
> > > That is, what if D's parent is B, and B's parent is A, and A's parent
> > > is... D?  Assume for the sake of argument that each inode, A, B, C, D,
> > > are in separate tiles.
> > 
> > From the original message:
> > 
> >  Inodes have a backpointer to a directory that links them. Hardlinked
> >  files have two extra inode pointers in the directory structure, to
> >  the previous and next directories containing the link. Hardlinked
> >  inodes have a checksum of the members of that list.
> > 
> > When we check directory D, D's inode has a backpointer (which had
> > better match ".." in the directory itself if we keep that
> > redundancy!). If we can follow this back to root (using a standard
> > two-pointer cycle detection algorithm), we have no cycle. As we also
> > check that every inode pointed to by directory D also points back to
> > D, any deviation from a valid tree can be detected.
> > 
> > And again, a small cache of inodes known to be properly rooted will
> > save a lot of checks.
> 
> I really, really like this idea.  I wonder how hard it would be to
> prototype on something like ext3.  Any bored grad students listening?

It should be pretty straightforward as mods to filesystems go. Most of
the work is in shimming the tile read/write layer under the rest of
the FS with helper functions and teaching the file I/O code to pass
down the reverse pointers to the tile layer.

Truncate/rm will also have to null out reverse references on tiles,
but that can be done at the same time bits are fixed up in the block
bitmaps.

Fixing up directories to record the hardlink ring should be easier.
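
For reference, the two-pointer walk mentioned above is tiny.  Something
along these lines, where parent_of() just stands in for reading the
inode's directory backpointer (sketch only):

#include <stdbool.h>
#include <stdint.h>

#define ROOT_INO 2                   /* assuming an ext2-style root inode */

extern uint32_t parent_of(uint32_t ino);   /* read the ".." backpointer */

/* Returns true if the parent chain reaches the root, false if it loops. */
static bool properly_rooted(uint32_t ino)
{
    uint32_t slow = ino, fast = ino;

    while (fast != ROOT_INO) {
        fast = parent_of(fast);
        if (fast == ROOT_INO)
            break;
        fast = parent_of(fast);
        slow = parent_of(slow);
        if (fast == slow)
            return false;            /* cycle: we never reach the root */
    }
    return true;
}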

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Matt Mackall
On Wed, May 09, 2007 at 11:59:23AM -0700, Valerie Henson wrote:
> On Wed, May 09, 2007 at 12:06:52PM -0500, Matt Mackall wrote:
> > On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> > > On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > > > 
> > > > This does mean that our time to make progress on a check is bounded at
> > > > the top by the size of our largest file. If we have a degenerate
> > > > filesystem filled with a single file, this will in fact take as long
> > > > as a conventional fsck. If your filesystem has, say, 100 roughly
> > > > equally-sized files, you're back in Chunkfs territory.
> > > 
> > > Hm, I'm not sure that everyone understands a particular subtlety of
> > > how the fsck algorithm works in chunkfs.  A lot of people seem to
> > > think that you need to check *all* cross-chunk links, every time an
> > > individual chunk is checked.  That's not the case; you only need to
> > > check the links that go into and out of the dirty chunk.  You also
> > > don't need to check the other parts of the file outside the chunk,
> > > except for perhaps reading the byte range info for each continuation
> > > node and making sure no two continuation inodes think they both have
> > > the same range, but you don't check the indirect blocks, block
> > > bitmaps, etc.
> > 
> > My reference to chunkfs here is simply that the worst-case is checking ~1
> > chunk, which is about 1/100th of a volume.
> 
> I understand that being the case if each file is only in one tile.
> Does the fpos make this irrelevant as well?

Fpos does make it irrelevant.
 
> > > > So we should have no trouble checking an exabyte-sized filesystem on a
> > > > 4MB box. Even if it has one exabyte-sized file! We check the first
> > > > tile, see that it points to our file, then iterate through that file,
> > > > checking that the forward and reverse pointers for each block match
> > > > and all CRCs match, etc. We cache the file's inode as clean, finish
> > > > checking anything else in the first tile, then mark it clean. When we 
> > > > get
> > > > to the next tile (and the next billion after that!), we notice that
> > > > each block points back to our cached inode and skip rechecking it.
> > > 
> > > If I understand correctly then, if you do have a one exabyte sized
> > > file, and any part of it is in a dirty tile, you will need to check
> > > the whole file?  Or will Joern's fpos proposal fix this?
> > 
> > Yes, the original idea is you have to check every file that "covers" a
> > tile in its entirety. With Joern's fpos piece, I think we can restrict
> > our checks to just the section of the file that covers the tile.
> 
> Hrm.  Can you help me understand how you would check i_size then?

That's pretty straightforward, I think. When we check an inode, we
have to check whether it has a block that corresponds with i_size, and
none beyond that.

That begs the question of when we check various pieces of data. It
seems the best time to check the various elements of an inode is when
we're checking the tile it lives on. This is when we'd check i_size,
that link counts made sense and that the ring of hardlinks was
correct, etc. 

We would also check that direct and indirect pointers were sensible
(ie pointing to data blocks on the disk). If so, we know we'll
eventually verify those pointers when we check the corresponding back
pointers from those blocks.

Directory checks are a bit more problematic. But I think we can
trigger a directory check each time we hit a tile data block that's
part of a directory. Keeping a small cache of checked directories will
keep this from being expensive.

We will, unfortunately, need to be able to check an entire directory
at once. There's no other efficient way to assure that there are no
duplicate names in a directory, for instance.

In summary, checking a tile requires trivial checks on all the inodes
and directories that point into a tile. Inodes, directories, and data
that are inside a tile get checked more thoroughly but still don't
need to do much pointer chasing.
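
A very rough skeleton of that per-tile pass, with every helper a
placeholder for whatever the real implementation would provide:

#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_TILE 512              /* 2MB tiles of 4K blocks */
#define INO_FREE ((uint32_t)-1)

extern uint32_t tile_block_owner(uint64_t tileno, int idx);   /* rmap */
extern bool tile_block_crc_ok(uint64_t tileno, int idx);
extern bool inode_trivial_checks(uint32_t ino);  /* i_size, links, ring */
extern bool is_directory(uint32_t ino);
extern bool check_directory(uint32_t ino);       /* cached full dir check */
extern void mark_tile_clean(uint64_t tileno);

static bool check_tile(uint64_t tileno)
{
    for (int i = 0; i < BLOCKS_PER_TILE; i++) {
        uint32_t ino = tile_block_owner(tileno, i);

        if (ino == INO_FREE)
            continue;
        if (!tile_block_crc_ok(tileno, i))
            return false;
        if (!inode_trivial_checks(ino))          /* cheap, for any inode
                                                    pointing into the tile */
            return false;
        if (is_directory(ino) && !check_directory(ino))
            return false;                        /* thorough check for
                                                    directory data in here */
    }
    mark_tile_clean(tileno);
    return true;
}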

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Nikita Danilov
Valerie Henson writes:

[...]

 > 
 > You're right about needing to read the equivalent data-structure - for
 > other reasons, each continuation inode will need an easily accessible
 > list of byte ranges covered by that inode. (Sounds like, hey,
 > extents!) The important part is that you don't have to go walk all the

I see. I was under the impression that the idea was to use indirect blocks
themselves as that data-structure, e.g., block number 0 to mark holes,
block number 1 to mark "block not in this continuation", and all other
block numbers for real blocks.

 > indirect blocks or check your bitmap.
 > 
 > -VAL

Nikita.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Valerie Henson
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> > There are a number of filesystem corruptions this algorithm won't
> > catch.  The most obvious is one where the directory tree isn't really
> > a tree, but a cyclic graph.   What if you have something like this:
> > 
> > A <+
> >/ \ |
> >   B   C^
> >  / |
> > D->+
> > 
> > That is, what if D's parent is B, and B's parent is A, and A's parent
> > is... D?  Assume for the sake of argument that each inode, A, B, C, D,
> > are in separate tiles.
> 
> From the original message:
> 
>  Inodes have a backpointer to a directory that links them. Hardlinked
>  files have two extra inode pointers in the directory structure, to
>  the previous and next directories containing the link. Hardlinked
>  inodes have a checksum of the members of that list.
> 
> When we check directory D, D's inode has a backpointer (which had
> better match ".." in the directory itself if we keep that
> redundancy!). If we can follow this back to root (using a standard
> two-pointer cycle detection algorithm), we have no cycle. As we also
> check that every inode pointed to by directory D also points back to
> D, any deviation from a valid tree can be detected.
> 
> And again, a small cache of inodes known to be properly rooted will
> save a lot of checks.

I really, really like this idea.  I wonder how hard it would be to
prototype on something like ext3.  Any bored grad students listening?

-VAL


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Valerie Henson
On Wed, May 09, 2007 at 12:06:52PM -0500, Matt Mackall wrote:
> On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> > On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > > 
> > > This does mean that our time to make progress on a check is bounded at
> > > the top by the size of our largest file. If we have a degenerate
> > > filesystem filled with a single file, this will in fact take as long
> > > as a conventional fsck. If your filesystem has, say, 100 roughly
> > > equally-sized files, you're back in Chunkfs territory.
> > 
> > Hm, I'm not sure that everyone understands a particular subtlety of
> > how the fsck algorithm works in chunkfs.  A lot of people seem to
> > think that you need to check *all* cross-chunk links, every time an
> > individual chunk is checked.  That's not the case; you only need to
> > check the links that go into and out of the dirty chunk.  You also
> > don't need to check the other parts of the file outside the chunk,
> > except for perhaps reading the byte range info for each continuation
> > node and making sure no two continuation inodes think they both have
> > the same range, but you don't check the indirect blocks, block
> > bitmaps, etc.
> 
> My reference to chunkfs here is simply that the worst-case is checking ~1
> chunk, which is about 1/100th of a volume.

I understand that being the case if each file is only in one tile.
Does the fpos make this irrelevant as well?

> > > So we should have no trouble checking an exabyte-sized filesystem on a
> > > 4MB box. Even if it has one exabyte-sized file! We check the first
> > > tile, see that it points to our file, then iterate through that file,
> > > checking that the forward and reverse pointers for each block match
> > > and all CRCs match, etc. We cache the file's inode as clean, finish
> > > checking anything else in the first tile, then mark it clean. When we get
> > > to the next tile (and the next billion after that!), we notice that
> > > each block points back to our cached inode and skip rechecking it.
> > 
> > If I understand correctly then, if you do have a one exabyte sized
> > file, and any part of it is in a dirty tile, you will need to check
> > the whole file?  Or will Joern's fpos proposal fix this?
> 
> Yes, the original idea is you have to check every file that "covers" a
> tile in its entirety. With Joern's fpos piece, I think we can restrict
> our checks to just the section of the file that covers the tile.

Hrm.  Can you help me understand how you would check i_size then?

-VAL


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Valerie Henson
On Wed, May 09, 2007 at 03:16:41PM +0400, Nikita Danilov wrote:
> 
> I guess I'm missing something. If chunkfs maintains the "at most one
> continuation per chunk" invariant, then a continuation inode might end
> up with multiple
> byte ranges, and to check that they do not overlap one has to read
> indirect blocks (or some equivalent data-structure).

You're right about needing to read the equivalent data-structure - for
other reasons, each continuation inode will need an easily accessible
list of byte ranges covered by that inode. (Sounds like, hey,
extents!) The important part is that you don't have to go walk all the
indirect blocks or check your bitmap.
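
The overlap check itself stays cheap once the ranges are easy to get at.
A toy version (invented names, nothing chunkfs-specific):

#include <stdint.h>
#include <stdlib.h>

struct byte_range {
    uint64_t start, end;             /* [start, end), in bytes */
};

static int cmp_start(const void *a, const void *b)
{
    const struct byte_range *ra = a, *rb = b;
    return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Sort the ranges claimed by all continuation inodes of one file and
 * look for overlap between neighbours; returns 1 if any two collide. */
static int ranges_overlap(struct byte_range *r, int n)
{
    qsort(r, n, sizeof(*r), cmp_start);
    for (int i = 1; i < n; i++)
        if (r[i].start < r[i - 1].end)
            return 1;
    return 0;
}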

-VAL


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Matt Mackall
On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > 
> > This does mean that our time to make progress on a check is bounded at
> > the top by the size of our largest file. If we have a degenerate
> > filesystem filled with a single file, this will in fact take as long
> > as a conventional fsck. If your filesystem has, say, 100 roughly
> > equally-sized files, you're back in Chunkfs territory.
> 
> Hm, I'm not sure that everyone understands a particular subtlety of
> how the fsck algorithm works in chunkfs.  A lot of people seem to
> think that you need to check *all* cross-chunk links, every time an
> individual chunk is checked.  That's not the case; you only need to
> check the links that go into and out of the dirty chunk.  You also
> don't need to check the other parts of the file outside the chunk,
> except for perhaps reading the byte range info for each continuation
> node and making sure no two continuation inodes think they both have
> the same range, but you don't check the indirect blocks, block
> bitmaps, etc.

My reference to chunkfs here is simply that the worst-case is checking ~1
chunk, which is about 1/100th of a volume.

> > So we should have no trouble checking an exabyte-sized filesystem on a
> > 4MB box. Even if it has one exabyte-sized file! We check the first
> > tile, see that it points to our file, then iterate through that file,
> > checking that the forward and reverse pointers for each block match
> > and all CRCs match, etc. We cache the file's inode as clean, finish
> > checking anything else in the first tile, then mark it clean. When we get
> > to the next tile (and the next billion after that!), we notice that
> > each block points back to our cached inode and skip rechecking it.
> 
> If I understand correctly then, if you do have a one exabyte sized
> file, and any part of it is in a dirty tile, you will need to check
> the whole file?  Or will Joern's fpos proposal fix this?

Yes, the original idea is you have to check every file that "covers" a
tile in its entirety. With Joern's fpos piece, I think we can restrict
our checks to just the section of the file that covers the tile.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Nikita Danilov
Valerie Henson writes:

[...]

 > 
 > Hm, I'm not sure that everyone understands a particular subtlety of
 > how the fsck algorithm works in chunkfs.  A lot of people seem to
 > think that you need to check *all* cross-chunk links, every time an
 > individual chunk is checked.  That's not the case; you only need to
 > check the links that go into and out of the dirty chunk.  You also
 > don't need to check the other parts of the file outside the chunk,
 > except for perhaps reading the byte range info for each continuation
 > node and making sure no two continuation inodes think they both have
 > the same range, but you don't check the indirect blocks, block
 > bitmaps, etc.

I guess I'm missing something. If chunkfs maintains the "at most one
continuation per chunk" invariant, then a continuation inode might end
up with multiple byte ranges, and to check that they do not overlap one
has to read
indirect blocks (or some equivalent data-structure).

Nikita.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Jörn Engel
On Tue, 8 May 2007 22:56:09 -0700, Valerie Henson wrote:
> 
> I like it too, especially the rmap stuff, but I don't think it solves
> some of the problems chunkfs solves.  The really nice thing about
> chunkfs is that it tries hard to isolate each chunk from all the other
> chunks.  You can think of regular file systems as an OS with one big
> shared address space - any process can potentially modify any other
> process's address space, including the kernel's - and chunkfs as the
> modern UNIX private address space model.  Except in rare worst cases
> (the equivalent of a kernel bug or writing /dev/mem), the only way one
> chunk can affect another chunk is through the narrow little interface
> of the continuation inode.  This severely limits the ability of one
> chunk to corrupt another - the worst you can do is end up with the
> wrong link count on an inode pointed to from another chunk.

This leaves me a bit confused.  Imo the filesystem equivalent of a
process's address space would be permissions and quotas.  Indeed there
is no guarantee where any address space's pages may physically reside.
They can be in any zone or node, or even in swap or regular files.

Otoh, each physical page does have an rmap of sorts - enough to figure
out who currently owns this page.  Does your own analogy work against
you?

Back to chunkfs, the really smart idea behind it imo is to take just a
small part of the filesystem, assume that everything else is flawless,
and check the small part under that assumption.  The assumption may be
wrong.  If that wrongness would affect the minimal fsck, it should get
detected as well.  Otherwise it doesn't matter right now.

What I never liked about chunkfs were two things.  First it splits the
filesystem into an array of chunks.  With sufficiently large devices,
either the number or the size of chunks will come close to problematic
again.  Some sort of tree arrangement intuitively makes more sense.

Secondly, the cnodes are... weird, complicated, not well understood, a
hack.  Pick a term.  Avoiding cnodes is harder than avoiding regular
fragmentation and the recent defragment patches seem to imply we're
doing a bad job at that already.  Linked lists of cnodes - yuck.

Not directly a chunkfs problem, but still unfortunate: it cannot detect
medium errors.  There are no checksums.  Checksums cost
performance, so they obviously have to be optional at user's choice.
But not even having the option is quite 80's.

Matt's proposal is an alternative solution that can address all of my
concerns.  Instead of cnodes it has the rmap.  That is a very simple
structure I can explain to my nephews.  It allows for checksums, which
is nice as well.  And it does allow for a tree structure of tiles.

Tree structure means that each tile can have free space counters.  A
supertile (or whatever one may call it) can have a free space counter
that is the sum of all member free space counters.  And so forth
upwards.  Same for dirty bits and anything else I've forgotten.

So individual tiles can be significantly smaller than chunks in chunkfs.
Checking them is significantly faster than checking a chunk.  There will
be more dirty tiles at any given time, but a better way to look at it is
to say that for any dirty chunk in chunkfs, tilefs has some dirty and
some clean tiles.  So the overall ratio of dirty space is never higher
and almost always lower.

Overall I almost envy Matt for having this idea.  In hindsight it should
have been obvious to me.  But then again, in hindsight the fsck problem
and using divide and conquer should have been obvious to everyone and
iirc you were the only one who seriously pursued the idea and got all
this frenzy started. :)

Jörn

-- 
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-09 Thread Valerie Henson
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> 
> This does mean that our time to make progress on a check is bounded at
> the top by the size of our largest file. If we have a degenerate
> filesystem filled with a single file, this will in fact take as long
> as a conventional fsck. If your filesystem has, say, 100 roughly
> equally-sized files, you're back in Chunkfs territory.

Hm, I'm not sure that everyone understands a particular subtlety of
how the fsck algorithm works in chunkfs.  A lot of people seem to
think that you need to check *all* cross-chunk links, every time an
individual chunk is checked.  That's not the case; you only need to
check the links that go into and out of the dirty chunk.  You also
don't need to check the other parts of the file outside the chunk,
except for perhaps reading the byte range info for each continuation
node and making sure no two continuation inodes think they both have
the same range, but you don't check the indirect blocks, block
bitmaps, etc.

There is one case where you do need to do a full check of all links
that cross chunks - if a continuation inode's pointers have been
corrupted, you might end up with orphan continuation inodes or
dangling links in other chunks.  I expect that to be relatively rare.

> So we should have no trouble checking an exabyte-sized filesystem on a
> 4MB box. Even if it has one exabyte-sized file! We check the first
> tile, see that it points to our file, then iterate through that file,
> checking that the forward and reverse pointers for each block match
> and all CRCs match, etc. We cache the file's inode as clean, finish
> checking anything else in the first tile, then mark it clean. When we get
> to the next tile (and the next billion after that!), we notice that
> each block points back to our cached inode and skip rechecking it.

If I understand correctly then, if you do have a one exabyte sized
file, and any part of it is in a dirty tile, you will need to check
the whole file?  Or will Joern's fpos proposal fix this?

This is interesting stuff, thanks!

-VAL


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-08 Thread Valerie Henson
On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> > 
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> I like it a lot.  Until now it appears to solve more problems and cause
> fewer new problems than ChunkFS.

I like it too, especially the rmap stuff, but I don't think it solves
some of the problems chunkfs solves.  The really nice thing about
chunkfs is that it tries hard to isolate each chunk from all the other
chunks.  You can think of regular file systems as an OS with one big
shared address space - any process can potentially modify any other
process's address space, including the kernel's - and chunkfs as the
modern UNIX private address space model.  Except in rare worst cases
(the equivalent of a kernel bug or writing /dev/mem), the only way one
chunk can affect another chunk is through the narrow little interface
of the continuation inode.  This severely limits the ability of one
chunk to corrupt another - the worst you can do is end up with the
wrong link count on an inode pointed to from another chunk.

I'm a big fan of the inode backpointers with directory-entry linked
list idea - so much so that I came up with it too. :) With regard to
rmap for blocks, I've been assuming that a chunk-level rmap is good
enough for what I want to do ("Who owns this block? Anyone?" during
fsck).  If not, I'm all over the per-block rmap idea.

Overall, great ideas!

-VAL


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-02 Thread Jörn Engel
On Wed, 2 May 2007 10:37:38 -0500, Matt Mackall wrote:
> 
> fpos does allow us to check just a subset of the file efficiently,
> yes. And that things are more strictly 1:1, because it unambiguously
> matches a single forward pointer in the file. Ok, I'm warming to the
> idea.
> 
> But indirect blocks don't have an fpos, per se. They'd need a special
> encoding. As the fpos entries will all be block aligned, we'll have 12
> extra bits to play with, so that may be easy enough.

No special encoding is needed.  LogFS uses the fpos of some block behind
that indirect block.  It doesn't even matter which one.  When checking
the block, the indirect block has to get touched while trying to get to
the alleged fpos.

As already explained, LogFS also encodes the level (data, indirect,
doubly indirect, etc.).  If you want to play games with extra bits, you
could do the same.  Two bits should be sufficient for ext2.

The only thing the level buys you for TileFS is to detect writes to the
wrong address, where wrong address and correct address match in (ino,
fpos), but not in level.

> It's a bit frustrating to have 96-bit (inode+fpos) pointers in one
> direction and 32-bit (blockno) pointers in the other though. This
> doubles the overhead to .4%. Still not fatal - regular ext2 overhead
> is somewhere between 1% and 3% depending on inode usage.

My ext3 root partition currently wastes 60 bytes per block on unused
inodes.  If space were an issue, adding an inode file to TileFS would
help more than trying to squeeze bits from the rmap.

Jörn

-- 
/* Keep these two variables together */
int bar;


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-02 Thread Matt Mackall
On Wed, May 02, 2007 at 03:32:05PM +0200, Jörn Engel wrote:
> On Sun, 29 April 2007 20:40:42 -0500, Matt Mackall wrote:
> > 
> > So we should have no trouble checking an exabyte-sized filesystem on a
> > 4MB box. Even if it has one exabyte-sized file! We check the first
> > tile, see that it points to our file, then iterate through that file,
> > checking that the forward and reverse pointers for each block match
> > and all CRCs match, etc. We cache the file's inode as clean, finish
> > checking anything else in the first tile, then mark it clean. When we get
> > to the next tile (and the next billion after that!), we notice that
> > each block points back to our cached inode and skip rechecking it.
> 
> How would you catch the case where some block in tile 2 claims to belong
> to your just-checked inode but the inode has no reference to it?

You're right, that is a problem. Without the known-clean inode cache,
we would revisit the file in its entirety when checking tile 2, thus
ensuring that both forward and reverse pointers were intact..

> How would you catch the inode referencing the same block twice with just
> 4MB of memory?

..which would also let us catch instances of the above, but would be
very slow for files that span many tiles.

> I believe you need the fpos field in your rmap for both problems.

fpos does allow us to check just a subset of the file efficiently,
yes. And it means things are more strictly 1:1, because each rmap entry
unambiguously matches a single forward pointer in the file. Ok, I'm
warming to the idea.

But indirect blocks don't have an fpos, per se. They'd need a special
encoding. As the fpos entries will all be block aligned, we'll have 12
extra bits to play with, so that may be easy enough.

It's a bit frustrating to have 96-bit (inode+fpos) pointers in one
direction and 32-bit (blockno) pointers in the other though. This
doubles the overhead to .4%. Still not fatal - regular ext2 overhead
is somewhere between 1% and 3% depending on inode usage.
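
Back-of-the-envelope version of those numbers; the packing of the 12
spare bits below is just one possibility, not a settled format:

#include <stdint.h>
#include <stdio.h>

/* fpos is block aligned with 4K blocks, so its low 12 bits are free for
 * things like the indirect level */
#define FPOS_PACK(pos, level)  (((uint64_t)(pos) & ~0xfffULL) | (level))
#define FPOS_LEVEL(v)          ((unsigned)((v) & 0xfff))
#define FPOS_POS(v)            ((v) & ~0xfffULL)

int main(void)
{
    /* per-block rmap entry: 32-bit ino + 64-bit fpos + 32-bit CRC */
    unsigned bytes = 4 + 8 + 4;

    printf("rmap overhead: %u bytes per 4K block = %.2f%%\n",
           bytes, 100.0 * bytes / 4096);   /* ~0.39%, the ".4%" above */
    return 0;
}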

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-02 Thread Jörn Engel
On Sun, 29 April 2007 20:40:42 -0500, Matt Mackall wrote:
> 
> So we should have no trouble checking an exabyte-sized filesystem on a
> 4MB box. Even if it has one exabyte-sized file! We check the first
> tile, see that it points to our file, then iterate through that file,
> checking that the forward and reverse pointers for each block match
> and all CRCs match, etc. We cache the file's inode as clean, finish
> checking anything else in the first tile, then mark it clean. When we get
> to the next tile (and the next billion after that!), we notice that
> each block points back to our cached inode and skip rechecking it.

How would you catch the case where some block in tile 2 claims to belong
to your just-checked inode but the inode has no reference to it?

How would you catch the inode referencing the same block twice with just
4MB of memory?

I believe you need the fpos field in your rmap for both problems.

Jörn

-- 
Anything that can go wrong, will.
-- Finagle's Law


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-05-02 Thread Jörn Engel
On Mon, 30 April 2007 12:59:26 -0500, Matt Mackall wrote:
> 
> We could eliminate the block bitmap, but I don't think there's much
> reason to. It improves allocator performance with negligible footprint
> and improves redundancy.

LogFS uses that scheme and it costs dearly.  Walking the rmap to figure
out which blocks are used and which aren't can approach fsck time when
the filesystem gets full.

Jörn

-- 
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Matt Mackall
On Mon, Apr 30, 2007 at 01:26:24PM -0400, Theodore Tso wrote:
> On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > chunkfs. The other is reverse maps (aka back pointers) for blocks ->
> > inodes and inodes -> directories that obviate the need to have large
> > amounts of memory to check for collisions.
> 
> Yes, I missed the fact that you had back pointers for blocks as well
> as inodes.  So the block table in the tile header gets used for
> determining if a block is free, much like is done with FAT, right?  

We could eliminate the block bitmap, but I don't think there's much
reason to. It improves allocator performance with negligible footprint
and improves redundancy.
 
> That's a clever system; I like it.  It does mean that there are a lot
> more metadata updates, but since you're not journaling, that should
> counter that effect to some extent.

I had actually envisioned this as working with or without a journal.
I suspect there are ways to keep the performance downside here low.

> IMHO, it's definitely worth a try to see how well it works!

I'm not much of an FS hacker and I've got a lot of other projects in
the air, but I may give it a shot. Any help on this front would be
appreciated.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Theodore Tso
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> chunkfs. The other is reverse maps (aka back pointers) for blocks ->
> inodes and inodes -> directories that obviate the need to have large
> amounts of memory to check for collisions.

Yes, I missed the fact that you had back pointers for blocks as well
as inodes.  So the block table in the tile header gets used for
determining if a block is free, much like is done with FAT, right?  

That's a clever system; I like it.  It does mean that there are a lot
more metadata updates, but since you're not journaling, that should
counter that effect to some extent.

IMHO, it's definitely worth a try to see how well it works!

- Ted


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Matt Mackall
On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> What's your goal here?  Is it to speed up fsck's after an unclean
> shutdown to the point that you don't need to use a journal
> or some kind of soft updates scheme?   
> 
> Is it to speed up fsck's after the kernel has detected some kind of
> internal consistency error?  
> 
> Is it to speed up fsck's after you no longer have confidence in the
> filesystem because random blocks may have gotten corrupted, due to a
> suspend/resume bug, hard drive failure, reported CRC/parity errors
> when writing to the device, or reports of massive ECC failures in your
> memory that could have caused random blocks to have been written
> with multiple bit flips?

Mostly this last.
 
> The first is relatively easy, but as you move down the list, things
> get progressively harder, since it's no longer possible to use a
> per-tile "clean bit" to assume that you get to skip checking that
> particular tile or chunk.

Tile headers contain a timestamp that lets you say "let's revisit
every clean block that hasn't been checked in the last 6 months". And
the blocks containing the clean bits are themselves covered by CRC!

> >  Divide disk into a bunch of tiles. For each tile, allocate a one
> >  block tile header that contains (inode, checksum) pairs for each
> >  block in the tile. Unused blocks get marked inode -1, filesystem
> >  metadata blocks -2. The first element contains a last-clean
> >  timestamp, a clean flag and a checksum for the block itself. For 4K
> >  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> >  with ~.2% overhead.
> 
> So what happens for files that are bigger than 2MB?  Presumably they
> consist of blocks that must come from more than one tile, right?  So
> is an inode allowed to reference blocks outside of its tile?  Or does
> an inode that needs to span multiple tiles have a local sub-inode in
> each tile, with back pointers to parent inode?   

We don't need any form of sub-inode or continuation inode, which is
why it's not mentioned. This is one of two main differences from
chunkfs. The other is reverse maps (aka back pointers) for blocks ->
inodes and inodes -> directories that obviate the need to have large
amounts of memory to check for collisions.
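
For concreteness, the tile header from the quoted proposal renders to
something like this (field names and exact packing are invented, only
the sizes follow the proposal):

#include <stdint.h>

#define BLOCKS_PER_TILE 512              /* one 4K header per 2MB tile */
#define INO_FREE       ((uint32_t)-1)    /* unused block */
#define INO_METADATA   ((uint32_t)-2)    /* filesystem metadata block */

struct tile_entry {
    uint32_t ino;                        /* owning inode, or a code above */
    uint32_t crc;                        /* checksum of that block */
};

struct tile_header {
    struct {                             /* first element: the header itself */
        uint32_t last_clean;             /* timestamp of the last check */
        uint32_t clean_and_crc;          /* clean flag plus header checksum */
    } self;
    struct tile_entry map[BLOCKS_PER_TILE - 1];
};

/* 512 * 8 bytes = exactly one 4K block, i.e. ~0.2% overhead */
_Static_assert(sizeof(struct tile_header) == 4096, "header fits one block");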

When you check a tile, you check all the inodes that "cover" that tile
(ie have overlapping data or metadata). So if a large file spans many
tiles, you end up reading many tiles to check one. So a completely
naive checker would check the same file many times in the course of
checking the whole filesystem. Total work would then be multiplied by
the average
tile-crossing rate for files.

However most of this work can be trivially avoided, of course. You can
keep a small in-memory LRU cache of recently-checked inodes to avoid
having to repeatedly check files in neighboring blocks as your scan
progresses. This LRU doesn't need to be very big at all to be quite
effective.
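
The cache really can be tiny.  A ring buffer standing in for a proper
LRU is already enough to sketch the idea; nothing TileFS-specific here:

#include <stdbool.h>
#include <stdint.h>

#define CLEAN_CACHE_SIZE 64

/* recently verified inodes; inode 0 is never valid, so the zero-filled
 * initial state is harmless */
static uint32_t clean_cache[CLEAN_CACHE_SIZE];
static unsigned clean_head;

static bool recently_checked(uint32_t ino)
{
    for (int i = 0; i < CLEAN_CACHE_SIZE; i++)
        if (clean_cache[i] == ino)
            return true;
    return false;
}

static void mark_checked(uint32_t ino)
{
    clean_cache[clean_head] = ino;       /* overwrite the oldest entry */
    clean_head = (clean_head + 1) % CLEAN_CACHE_SIZE;
}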

This does mean that our time to make progress on a check is bounded at
the top by the size of our largest file. If we have a degenerate
filesystem filled with a single file, this will in fact take as long
as a conventional fsck. If your filesystem has, say, 100 roughly
equally-sized files, you're back in Chunkfs territory.

Note that even though our time usage scales with max file size, our
memory usage is bound by a small constant. We only need one tile
header in memory plus what it takes to walk an inode's data. The
block->inode reverse map means that we don't have keep a filesystem-
or chunk-sized map to spot block collisions. Each block knows where it
belongs and spotting broken links is easy.

So we should have no trouble checking an exabyte-sized filesystem on a
4MB box. Even if it has one exabyte-sized file! We check the first
tile, see that it points to our file, then iterate through that file,
checking that the forward and reverse pointers for each block match
and all CRCs match, etc. We cache the file's inode as clean, finish
checking anything else in the first tile, then mark it clean. When we get
to the next tile (and the next billion after that!), we notice that
each block points back to our cached inode and skip rechecking it.

> >  Every time we write to a tile, we must mark the tile dirty. To cut
> >  down time to find dirty tiles, the clean bits can be collected into a
> >  smaller set of blocks, one clean bitmap block per 64GB of data.
> 
> Hopefully the clean bitmap block is protected by a checksum.  After
> all, the smaller set of clean bitmap blocks is going to be constantly
> updated as tiles get dirtied, and then cleaned.  What if they get
> corrupted?  How 

Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Theodore Tso
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

What's your goal here?  Is it to speed up fsck's after an unclean
shutdown to the point that you don't need to use a journal
or some kind of soft updates scheme?   

Is it to speed up fsck's after the kernel has detected some kind of
internal consistency error?  

Is it to speed up fsck's after you no longer have confidence in the
filesystem because random blocks may have gotten corrupted, due to a
suspend/resume bug, hard drive failure, reported CRC/parity errors
when writing to the device, or reports of massive ECC failures in your
memory that could have caused random blocks to have been written
with multiple bit flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to use a
per-tile "clean bit" to assume that you get to skip checking that
particular tile or chunk.

>  Divide disk into a bunch of tiles. For each tile, allocate a one
>  block tile header that contains (inode, checksum) pairs for each
>  block in the tile. Unused blocks get marked inode -1, filesystem
>  metadata blocks -2. The first element contains a last-clean
>  timestamp, a clean flag and a checksum for the block itself. For 4K
>  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
>  with ~.2% overhead.

So what happens for files that are bigger than 2MB?  Presumably they
consist of blocks that must come from more than one tile, right?  So
is an inode allowed to reference blocks outside of its tile?  Or does
an inode that needs to span multiple tiles have a local sub-inode in
each tile, with back pointers to parent inode?   

Note that both design paths have some serious tradeoffs.  If you allow
an inode to span multiple tiles, now you can't check the block
allocation data structures without scanning all of the tiles involved.
If you have sub-inodes, now you have to have bidirectional pointers
and you have to validate those pointers after validating all of the
individual tiles.  

This is one of the toughest aspects of either the chunkfs or tilefs
design, and when we discussed chunkfs at past filesystem workshops, we
didn't come to any firm conclusions about the best way to solve it,
except to acknowledge that it is a hard problem.  My personal
inclination is to use substantially bigger chunks than the 2MB that
you've proposed, making each chunk more like 2 or 4 GB, and to enforce
a rule which says an inode in a chunk can only reference blocks in
that local chunk, and to try like mad to keep directories referencing
inodes in their own chunk, and inodes referencing blocks within that
chunk.  When a file is bigger than a chunk, you will be forced to use
indirection pointers that basically say, "for offsets 2GB-4GB,
reference inode <n1> in chunk <c1>, and for 4GB-8GB, check out inode
<n2> in chunk <c2>", etc.

I won't say that this is definitely the best way to do things, but I
note that you haven't really addressed this design point, and there are
no obvious best ways of handling this.

>  [Note that CRCs are optional so we can cut the overhead in half. I
>  choose CRCs here because they're capable of catching the vast
>  majority of accidental corruptions at a small cost and mostly serve
>  to protect against errors not caught by on-disk ECC (eg cable noise,
>  kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
>  SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRC's are just fine.  If
you want better protection against accidental corruption, then the
answer is to use a bigger CRC.  Using a cryptographic hash like SHA-n
is pure overkill unless you're trying to design protection against a
malicious attacker, in which case you've got a much bigger set of
problems that you have to address first --- you don't get a
cryptographically secure filesystem by replacing a CRC with a SHA-n hash
function.

>  Every time we write to a tile, we must mark the tile dirty. To cut
>  down time to find dirty tiles, the clean bits can be collected into a
>  smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum.  After
all, the smaller set of clean bitmap blocks is going to be constantly
updated as tiles get dirtied, and then cleaned.  What if they get
corrupted?  How does the checker notice?  And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

> Checking a tile:
> 
>  Read the tile
>  If clean and current, we're done.
>  Check the tile header checksum
>  Check the checksum on each block in the tile
>  Check that metadata blocks are metadata
>  Check that in

Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Matt Mackall
On Sun, Apr 29, 2007 at 05:58:48PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> > 
> > Some things we need to check during fsck:
> > 
> >  all directories point to in-use inodes
> >  all in-use inodes are referred to by directories
> >  all inodes in use are marked in use
> >  all free inodes are marked free
> >  all inodes point to in-use blocks
> >  all inode refcounts are correct
> >  all inode block counts are correct
> >  free inode count is correct
> > 
> >  no blocks are used twice
> >  all used blocks are marked as used
> >  all free blocks are marked as free
> >  optional: all block contents are correct
> 
>    statfs information matches filesystem content
> 
> This one may still require a full fsck in your current approach.  One of
> the good aspects of ChunkFS (assuming my understanding matches reality)
> is to have per-chunk counters for free blocks, free inodes, etc.  For a
> fast fsck you would need to have these counters per-unit as well.  It
> doesn't matter whether your unit is a tile, chunk, blockgroup or
> karboozel.

Yes. This is probably not a very big problem. Once we've checked all
the tiles in a block group and they're clean, we know that block and
inode bitmaps are correct, so we can doublecheck the per-blockgroup
counters.

One obvious approach is simply batching a blockgroup's worth of tiles
at a time (if tile header entries are 64 bits, then there are 64 tiles
in a block group). Then you keep running totals and check at the end.
Memory usage is still O(single tile).

Summing blockgroup totals across a million block groups.. we'll
probably want something like:

> Having some tree structure for these counters would also help.  Statfs
> requires to add all counters for all units.  Smaller units speed up fsck
> but slow down statfs.  With a tree statfs can be O(log(n)).
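
In code, the tree of counters could be as dumb as an array of partial
sums: updating one block group touches O(log n) nodes, statfs reads a
single root value.  Sketch only, with a power-of-two group count assumed
for simplicity:

#include <stdint.h>

#define GROUPS 1024                      /* power of two for simplicity */

static int64_t tree[2 * GROUPS];         /* tree[1] = grand total,
                                            tree[GROUPS + i] = group i */

static void adjust_free_blocks(unsigned group, int64_t delta)
{
    for (unsigned i = GROUPS + group; i > 0; i /= 2)
        tree[i] += delta;                /* leaf plus all its ancestors */
}

static int64_t total_free_blocks(void)
{
    return tree[1];                      /* what statfs would report */
}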

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Jörn Engel
On Sun, 29 April 2007 18:34:59 +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
> 
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> Problem I see is that your scheme doesn't support metadata checksums only.

Why?  It is fairly simple to run many different schemes with this:
- Generate checksums for everything, compare for each access.
- Generate checksums for everything, only compare metadata checksums.
- Generate checksums for everything, only compare at fsck time.
- Generate metadata checksums, use 0x_ as data "checksums".
- Not generate any checksums.

Users without checksums would still pay the .1% space overhead, sure.
I'd bet they already pay more for unused inodes.

Jörn

-- 
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Matt Mackall
On Sun, Apr 29, 2007 at 06:34:59PM +0200, Andi Kleen wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
> 
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> Problem I see is that your scheme doesn't support metadata checksums only.
> IMHO those are the most interesting because they have the potential
> to be basically zero cost, unlike full data checksumming.
> And doing metadata checksums is enough to handle the fsck
> problem.

It does. Checksums are not in any way an essential part of this
approach. We can checksum nothing, just metadata, or everything.

Further, we might find ourselves wanting to change the setting, so
we'd probably just want to leave slots in the tile headers for CRCs
always. So for now, let's imagine three mount options: nocrc, crc, and
metacrc.

> I'm sure there are many cases where full checksumming makes sense too,
> but those shouldn't be forced on everybody because it will slow down
> some important workloads (like O_DIRECT)

Right.

> Metadata checksums would be best just put into the file systems
> data structures. Essentially every object (inode, extent, directory entry,
> super block) should have a checksum that can be incrementially updated.

Putting CRCs on extents is trouble. You can't validate them without
reading the whole extent, and writing into the middle of an extent is
also hard.

But this is kind of a rat-hole. The more immediate question is: do the
reverse mapping pointers make sense for fsck purposes?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Jörn Engel
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> 
> Some things we need to check during fsck:
> 
>  all directories point to in-use inodes
>  all in-use inodes are referred to by directories
>  all inodes in use are marked in use
>  all free inodes are marked free
>  all inodes point to in-use blocks
>  all inode refcounts are correct
>  all inode block counts are correct
>  free inode count is correct
> 
>  no blocks are used twice
>  all used blocks are marked as used
>  all free blocks are marked as free
>  optional: all block contents are correct

   statfs information matches filesystem content

This one may still require a full fsck in your current approach.  One of
the good aspects of ChunkFS (assuming my understanding matches reality)
is to have per-chunk counters for free blocks, free inodes, etc.  For a
fast fsck you would need to have these counters per-unit as well.  It
doesn't matter whether your unit is a tile, chunk, blockgroup or
karboozel.
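
To make that concrete, the per-unit summary could be as small as this
(an invented layout, not from ChunkFS or TileFS):

#include <stdint.h>
#include <stdbool.h>

/* What each tile/chunk/blockgroup would record about itself. */
struct unit_summary {
        uint32_t free_blocks;
        uint32_t free_inodes;
};

/* Per-unit fsck: compare what the unit claims with what a scan found. */
bool unit_summary_ok(const struct unit_summary *on_disk,
                     const struct unit_summary *counted)
{
        return on_disk->free_blocks == counted->free_blocks &&
               on_disk->free_inodes == counted->free_inodes;
}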

Having some tree structure for these counters would also help.  Statfs
requires adding all counters for all units.  Smaller units speed up fsck
but slow down statfs.  With a tree statfs can be O(log(n)).

Jörn

-- 
"Translations are and will always be problematic. They inflict violence
upon two languages." (translation from German)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Jörn Engel
On Sun, 29 April 2007 07:57:18 -0500, Matt Mackall wrote:
> On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> 
> Thanks. I think this is a bit more direct solution than ChunkFS, but
> a) I haven't followed ChunkFS closely and b) I haven't been thinking
> about fsck very long, so this is still just presented as fodder for
> discussion.

After thinking about it for a while, I believe you have a problem as
well.  Will cover that in a later mail.

> > You should add a 64bit fpos field.  That allows you to easily check for
> > addressing errors.
> 
> Describe the scenario where this manifests, please.

Block with checksum is written somewhere.  Block matches checksum, but a
bit flipped on the address bus.  To catch this you have to match 1)
block contents, 2) checksum and 3) inode tree.  ZFS does it by having
the checksum next to the block pointers in indirect blocks.  LogFS does
it by having a block header with checksum and (ino, pos, level) tuple.

Level is 0 for data blocks, 1 for indirect blocks, etc.  In the very
rare case that a data block gets written to the offset belonging to one
of its indirect blocks or vice versa this is necessary to catch the
error.  LogFS needs it for different reasons and I wouldn't mind if you
just ignore that detail.
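
Either way, the shape of such a tagged block is simple. A sketch (field
names invented; zlib's crc32() stands in for whatever checksum the
filesystem actually uses):

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

/*
 * The block records not just a checksum but *where* it believes it
 * belongs, so a block whose contents and checksum match, but which was
 * written to the wrong address, is still caught.
 */
struct block_tag {
        uint64_t ino;           /* owning inode */
        uint64_t pos;           /* file offset this block belongs at */
        uint32_t level;         /* 0 = data, 1 = indirect, ... */
        uint32_t crc;           /* CRC of the block payload */
};

/* 0 if the block is intact and really is the one we meant to read. */
int check_block_tag(const struct block_tag *tag,
                    const unsigned char *data, size_t len,
                    uint64_t want_ino, uint64_t want_pos, uint32_t want_level)
{
        if (tag->crc != (uint32_t)crc32(0, data, (uInt)len))
                return -1;      /* payload corrupted */
        if (tag->ino != want_ino || tag->pos != want_pos ||
            tag->level != want_level)
                return -2;      /* intact block, written to the wrong place */
        return 0;
}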

> It just occurred to me that my approach is analogous to object-based
> rmap on the filesystem. The fpos proposal I think makes it more like
> the original per-pte rmap. This is not to say I think the same lessons
> apply, as I'm not clear what you're proposing yet.
> 
> Ooh.. I also just realized the tile approach allows much easier
> defragging/shrinking of large filesystems because finding the
> associated inode for blocks you want to move is fast.

It also allows for a background scan of the filesystem.  If data is
rotting on the medium, the only chance to detect this is by reading and
checking it.  Lots of data is never read from userspace, so it can
accumulate lots of errors.

Without rmap the filesystem cannot verify random blocks.
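
And with it the lookup is cheap: given any block number, the tile header
tells you which inode is supposed to own it. A sketch (this assumes each
tile starts with its own header block, which the RFC doesn't actually
spell out):

#include <stdint.h>

#define BLOCKS_PER_TILE 512     /* header block plus data blocks, as in the RFC */

struct block_owner {
        uint64_t header_block;  /* tile header to read */
        unsigned int slot;      /* entry for this block; slot 0 is the header itself */
};

/* Map a random block number to the header entry naming its owner. */
struct block_owner locate_owner(uint64_t blocknr)
{
        struct block_owner o;
        uint64_t tile = blocknr / BLOCKS_PER_TILE;

        o.header_block = tile * BLOCKS_PER_TILE;
        o.slot = (unsigned int)(blocknr % BLOCKS_PER_TILE);
        return o;
}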

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Andi Kleen
Matt Mackall <[EMAIL PROTECTED]> writes:

> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

The problem I see is that your scheme doesn't support metadata-only checksums.
IMHO those are the most interesting because they have the potential
to be basically zero cost, unlike full data checksumming.
And doing metadata checksums is enough to handle the fsck
problem.

I'm sure there are many cases where full checksumming makes sense too,
but those shouldn't be forced on everybody because it will slow down
some important workloads (like O_DIRECT)

Metadata checksums would be best just put into the file systems
data structures. Essentially every object (inode, extent, directory entry,
super block) should have a checksum that can be incrementally updated.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Matt Mackall
On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> > 
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> I like it a lot.  So far it appears to solve more problems and cause
> fewer new problems than ChunkFS.

Thanks. I think this is a bit more direct solution than ChunkFS, but
a) I haven't followed ChunkFS closely and b) I haven't been thinking
about fsck very long, so this is still just presented as fodder for
discussion.

> > [...]
> > 
> >  Divide disk into a bunch of tiles. For each tile, allocate a one
> >  block tile header that contains (inode, checksum) pairs for each
> >  block in the tile. Unused blocks get marked inode -1, filesystem
> >  metadata blocks -2. The first element contains a last-clean
> >  timestamp, a clean flag and a checksum for the block itself. For 4K
> >  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> >  with ~.2% overhead.
> 
> You should add a 64bit fpos field.  That allows you to easily check for
> addressing errors.

Describe the scenario where this manifests, please.

It just occurred to me that my approach is analogous to object-based
rmap on the filesystem. The fpos proposal I think makes it more like
the original per-pte rmap. This is not to say I think the same lessons
apply, as I'm not clear what you're proposing yet.

Ooh.. I also just realized the tile approach allows much easier
defragging/shrinking of large filesystems because finding the
associated inode for blocks you want to move is fast.

> >  [Note that CRCs are optional so we can cut the overhead in half. I
> >  choose CRCs here because they're capable of catching the vast
> >  majority of accidental corruptions at a small cost and mostly serve
> >  to protect against errors not caught by on-disk ECC (eg cable noise,
> >  kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> >  SHA-n is perfectly doable.]
> 
> The checksum cannot protect against a maliciously prepared medium
> anyway, so crypto makes little sense.

In a past life, I wrote a device mapper layer that kept a
cryptographic hash per cluster of the underlying device, with a
top-level digital signature of said hashes. That gets you pretty
close to tamper-proof, in theory. Practice of course is a different
matter, so don't try this at home.

As it happens, this earlier system was the inspiration for the tile
idea, the integrity parts of which have been kicking around in my head
since I heard ZFS was tracking checksums.

> Crc32 can provably (if you trust those who did the proof) detect all
> 1, 2 and 3-bit errors and has only a 1:2^32 chance of missing any
> remaining errors. That is fairly hard to improve on.

Indeed.
 
> >  Every time we write to a tile, we must mark the tile dirty. To cut
> >  down time to find dirty tiles, the clean bits can be collected into a
> >  smaller set of blocks, one clean bitmap block per 64GB of data.
> 
> You can and possibly should organize this as a tree, similar to a file.
> One bit at the lowest level marks a tile as dirty.  One bit for each
> indirect block pointer marks some tiles behind the pointer as dirty.
> That scales logarithmically to any filesystem size.

Right. 3 levels takes us to 512TB, etc..

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Jörn Engel
On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> 
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

I like it a lot.  So far it appears to solve more problems and cause
fewer new problems than ChunkFS.

> [...]
> 
>  Divide disk into a bunch of tiles. For each tile, allocate a one
>  block tile header that contains (inode, checksum) pairs for each
>  block in the tile. Unused blocks get marked inode -1, filesystem
>  metadata blocks -2. The first element contains a last-clean
>  timestamp, a clean flag and a checksum for the block itself. For 4K
>  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
>  with ~.2% overhead.

You should add a 64bit fpos field.  That allows you to easily check for
addressing errors.

>  [Note that CRCs are optional so we can cut the overhead in half. I
>  choose CRCs here because they're capable of catching the vast
>  majority of accidental corruptions at a small cost and mostly serve
>  to protect against errors not caught by on-disk ECC (eg cable noise,
>  kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
>  SHA-n is perfectly doable.]

The checksum cannot protect against a maliciously prepared medium
anyway, so crypto makes little sense.  Crc32 can provably (if you trust
those who did the proof) detect all 1, 2 and 3-bit errors and has only
a 1:2^32 chance of missing any remaining errors.  That is fairly hard to
improve on.

>  Every time we write to a tile, we must mark the tile dirty. To cut
>  down time to find dirty tiles, the clean bits can be collected into a
>  smaller set of blocks, one clean bitmap block per 64GB of data.

You can and possibly should organize this as a tree, similar to a file.
One bit at the lowest level marks a tile as dirty.  One bit for each
indirect block pointer marks some tiles behind the pointer as dirty.
That scales logarithmically to any filesystem size.
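
A toy version of that tree (sizes and names invented; clearing bits again
once a tile has been checked is left out):

#include <stdint.h>

#define BITS_PER_MAP_BLOCK      (4096 * 8)      /* one 4K bitmap block */
#define LEVELS                  3               /* enough that the top fits in one block */

struct dirty_tree {
        uint8_t *map[LEVELS];           /* level 0: one bit per tile; higher: summaries */
        unsigned long bits[LEVELS];     /* valid bits at each level */
};

static void set_bit_(uint8_t *map, unsigned long nr)
{
        map[nr / 8] |= 1u << (nr % 8);
}

static int test_bit_(const uint8_t *map, unsigned long nr)
{
        return (map[nr / 8] >> (nr % 8)) & 1;
}

/* Mark one tile dirty and propagate summary bits up the tree. */
void mark_tile_dirty(struct dirty_tree *t, unsigned long tile)
{
        unsigned long nr = tile;

        for (int level = 0; level < LEVELS; level++) {
                set_bit_(t->map[level], nr);
                nr /= BITS_PER_MAP_BLOCK;       /* summary bit covering us */
        }
}

/* Walk down from the top summary to some dirty tile, O(levels) blocks. */
long find_dirty_tile(const struct dirty_tree *t)
{
        unsigned long nr = 0;

        for (int level = LEVELS - 1; level >= 0; level--) {
                unsigned long base = nr * BITS_PER_MAP_BLOCK;
                unsigned long i;

                for (i = 0; i < BITS_PER_MAP_BLOCK; i++) {
                        if (base + i >= t->bits[level])
                                return -1;      /* ran off the end: all clean */
                        if (test_bit_(t->map[level], base + i))
                                break;
                }
                if (i == BITS_PER_MAP_BLOCK)
                        return -1;              /* nothing dirty under this summary */
                nr = base + i;
        }
        return (long)nr;                        /* a dirty tile number */
}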

Jörn

-- 
I don't understand it. Nobody does.
-- Richard P. Feynman
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] TileFS - a proposal for scalable integrity checking

2007-04-28 Thread Matt Mackall
This is a relatively simple scheme for making a filesystem with
incremental online consistency checks of both data and metadata.
Overhead can be well under 1% disk space and CPU overhead may also be
very small, while greatly improving filesystem integrity.

Some things we need to check during fsck:

 all directories point to in-use inodes
 all in-use inodes are referred to by directories
 all inodes in use are marked in use
 all free inodes are marked free
 all inodes point to in-use blocks
 all inode refcounts are correct
 all inode block counts are correct
 free inode count is correct

 no blocks are used twice
 all used blocks are marked as used
 all free blocks are marked as free
 optional: all block contents are correct

Layout:

 Start with a conventional ext2/3-like filesystem.

 Divide disk into a bunch of tiles. For each tile, allocate a one
 block tile header that contains (inode, checksum) pairs for each
 block in the tile. Unused blocks get marked inode -1, filesystem
 metadata blocks -2. The first element contains a last-clean
 timestamp, a clean flag and a checksum for the block itself. For 4K
 blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
 with ~.2% overhead.
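
 In C, one rendering of that header block (only the sizes and the special
 inode values come from the description above; the packing of the first
 entry is a guess):

#include <stdint.h>

#define TILEFS_BLOCK_SIZE       4096
#define TILEFS_INO_FREE         ((uint32_t)-1)  /* unused block */
#define TILEFS_INO_META         ((uint32_t)-2)  /* filesystem metadata block */

/* One 8-byte entry per block in the tile: owner and checksum. */
struct tile_entry {
        uint32_t ino;
        uint32_t crc;
};

struct tile_header {
        /*
         * The first entry is special: last-clean timestamp, clean flag
         * and a CRC of this header block.  Squeezing the flag into the
         * timestamp word keeps it to one slot.
         */
        struct {
                uint32_t last_clean;    /* seconds; top bit used as the clean flag */
                uint32_t crc;           /* CRC of this header block */
        } self;

        /* (inode, crc) for each remaining block covered by the tile */
        struct tile_entry entry[TILEFS_BLOCK_SIZE / sizeof(struct tile_entry) - 1];
};

/*
 * One header block covers 511 data blocks here, close enough to the
 * "512 blocks per tile (2MB)" above; either way the space overhead is
 * roughly 0.2%.
 */
_Static_assert(sizeof(struct tile_header) == TILEFS_BLOCK_SIZE,
               "tile header must be exactly one block");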

 [Note that CRCs are optional so we can cut the overhead in half. I
 choose CRCs here because they're capable of catching the vast
 majority of accidental corruptions at a small cost and mostly serve
 to protect against errors not caught by on-disk ECC (eg cable noise,
 kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
 SHA-n is perfectly doable.]

 Every time we write to a tile, we must mark the tile dirty. To cut
 down time to find dirty tiles, the clean bits can be collected into a
 smaller set of blocks, one clean bitmap block per 64GB of data.

 Inodes have a backpointer to a directory that links them. Hardlinked
 files have two extra inode pointers in the directory structure, to
 the previous and next directories containing the link. Hardlinked
 inodes have a checksum of the members of that list.
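
 As a sketch of what those pointers might look like on disk (field names
 invented; the text above only says what they refer to):

#include <stdint.h>

/*
 * Directory entry: besides the usual fields, a hardlinked entry also
 * names the previous and next directory holding a link to the same
 * inode.  (The RFC only adds these for hardlinked files; they are
 * unconditional here to keep the sketch simple.)
 */
struct tile_dirent {
        uint32_t ino;
        uint32_t prev_dir;      /* previous directory linking this inode, 0 if none */
        uint32_t next_dir;      /* next directory linking this inode, 0 if none */
        uint16_t rec_len;
        uint8_t  name_len;
        char     name[];
};

/*
 * Inode side: fsck can walk the prev/next list to recount links and
 * compare against nlink, and link_list_crc catches a damaged list
 * without a full tree sweep.
 */
struct tile_inode {
        uint32_t backref_dir;   /* one directory containing a link to us */
        uint32_t nlink;
        uint32_t link_list_crc; /* checksum over the directories in the link list */
        /* ... plus the usual ext2/3-style inode fields ... */
};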

Checking a tile:

 Read the tile
 If clean and current, we're done.
 Check the tile header checksum
 Check the checksum on each block in the tile
 Check that metadata blocks are metadata
 Check that inodes in tile agree with inode bitmap
 Check that data blocks in tile agree with block bitmap
 For each inode covering the tile (according to the header), check:
  Marked in-use
  Blocks are owned by the inode
   Blocks incorrectly owned put on list for tree sweep or orphaned
  Block count is right
  Traverse backpointer linked list, check link count and checksum
   Inodes with incorrect refcount put on tree sweep list or orphaned
  If directory:
   Points to in-use inodes
 Mark tile clean

 If all tiles are simultaneously clean, fs is clean.
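
 The core of the per-tile scan above, reduced to something self-contained
 (helpers are passed in as callbacks; bitmap checks, the backpointer walk
 and the header's own self-check are left out):

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

#define BLOCK_SIZE      4096
#define INO_FREE        ((uint32_t)-1)
#define INO_META        ((uint32_t)-2)

struct tile_entry {
        uint32_t ino;
        uint32_t crc;
};

/*
 * Verify each block in one tile against its header entry and tell the
 * caller which inodes claim blocks here, so it can run the cheap
 * per-inode checks (marked in-use, block counts, backpointers).
 */
int scan_tile(const struct tile_entry *entry, size_t nr_blocks,
              const unsigned char *blocks,
              void (*bad_block)(size_t blocknr),
              void (*inode_seen)(uint32_t ino, size_t blocknr))
{
        int errors = 0;

        for (size_t i = 0; i < nr_blocks; i++) {
                const unsigned char *data = blocks + i * BLOCK_SIZE;
                const struct tile_entry *e = &entry[i];

                if (e->ino == INO_FREE)
                        continue;       /* should also be free in the block bitmap */

                if (e->crc != (uint32_t)crc32(0, data, BLOCK_SIZE)) {
                        bad_block(i);   /* candidate for the tree sweep / orphan list */
                        errors++;
                        continue;
                }

                if (e->ino != INO_META)
                        inode_seen(e->ino, i);  /* caller checks the owning inode */
        }
        return errors;
}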

Tree sweeps and orphans:

 For damaged backpointers, we may want to walk the directory tree to
 repair the links. Memory usage here is bounded by the size of the
 orphan list and we may be able to trim the walk when examining files
 or directories inside of clean tiles.

 Alternately, we can move orphaned blocks and inodes to lost+found/
 and forgo the more expensive tree walk.

Performance implications:

 Amount of RAM needed to check an FS is tiny
 Tile headers are very near their blocks, so usually don't require a seek
 Caching overhead is relatively small
 Most actual tile header writes can be folded in the journal
 Hardlinks are a bit of a pain, but somewhat rare
 We can cache checks of inodes covering multiple tiles
 Huge file deletes have to hit even more blocks than ext2/3
 It's antithetical to using extents, as is any system with block CRCs
 Keeping the average read-mostly filesystem fully-checked should be easy

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html