Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-05-01 Thread Valerie Henson
On Fri, Apr 27, 2007 at 11:06:47AM -0400, Jeff Dike wrote:
> On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> > Here's an example, spelled out:
> > 
> > Allocate file 1 in chunk A.
> > Grow file 1.
> > Chunk A fills up.
> > Allocate continuation inode for file 1 in chunk B.
> > Chunk A gets some free space.
> > Chunk B fills up.
> > Pick chunk A for allocating next block of file 1.
> > Try to look up a continuation inode for file 1 in chunk A.
> > Continuation inode for file 1 found in chunk A!
> > Attach newly allocated block to existing inode for file 1 in chunk A.
> 
> So far, so good (and the slides are helpful, tx!).  What happens when
> file 1 keeps growing and chunk A fills up (and chunk B is still full)?
> Can the same continuation inode also point at chunk C, where the file
> is going to grow to?

You allocate a new continuation inode in chunk C.  The rule is that
only inodes inside a chunk can point to blocks inside the chunk, so
you need an inode in C if you want to allocate blocks from C.

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-28 Thread Jörn Engel
On Fri, 27 April 2007 23:50:06 -0700, Valerie Henson wrote:
> 
> Any mapping structure will have to be pre-allocated.

How much space would you allocate for it then?  The required size surely
depends on the file size and fragmentation.

> So in my secret heart of hearts, I do indeed hope that cnodes are rare
> enough that we don't actually have to do anything smart to make them
> go fast.  Either having no fast lookup structure or creating it in
> memory as needed would be the nicest solution.  However, since I can't
> guarantee this will be the case, it's nice to have some idea of what
> we'll do if this does become important.

You go slow and slower.  Or you can defragment/clean the filesystem.
Fragmented filesystems being slow is hardly news to anyone.  ChunkFS
will be slower than others, so the reasons to fight fragmentation weigh
heavier.

What you can do:
- If two files 1 and 2 contain blocks in chunks A and B, move file 1
  blocks to chunk A and file 2 blocks to chunk B until one of the files
  only spans a single chunk.
- Same as above with longer chains like file 1 in chunks (A,B), file 2
  in chunks (B,C), file 3 in chunks (A,C).
- When writing to file 1 in chunk A but running out of space in chunk A,
  move some cross-chunk file over instead of writing file 1 blocks to
  chunk B.

The first two can hope to use some idle times (if they exist), the third
will make writes go slower in order to speed up future reads.

There ain't no such thing as a free lunch. :)

Jörn

-- 
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Valerie Henson
On Fri, Apr 27, 2007 at 12:53:34PM +0200, J??rn Engel wrote:
> 
> All this would get easier if continuation inodes were known to be rare.
> You can ditch the doubly-linked list in favor of a pointer to the main
> inode then - traversing the list again is cheap, after all.  And you can
> just try to read the same block once for every continuation inode.
> 
> If those lists can get long and you need a mapping from offset to
> continuation inode on the medium, you are basically fscked.  Storing the
> mapping requires space.  You need the mapping only when space (in some
> chunk) gets tight and you allocate continuation inodes.  So either you
> don't need the mapping or you don't have a good place to put it.

Any mapping structure will have to be pre-allocated.

> Having a mapping in memory is also questionable.  Either you scan the
> whole file on first access and spend a long time for large files.  Or
> you create the mapping on the fly.  In that case the page cache will
> already give you a 90% solution for free.

So in my secret heart of hearts, I do indeed hope that cnodes are rare
enough that we don't actually have to do anything smart to make them
go fast.  Either having no fast lookup structure or creating it in
memory as needed would be the nicest solution.  However, since I can't
guarantee this will be the case, it's nice to have some idea of what
we'll do if this does become important.

> You should spend a lot of effort trying to minimize cnodes. ;)

Yep.  It's much better to optimize away most cnodes instead of trying
to make the go fast.

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Jeff Dike
On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> Here's an example, spelled out:
> 
> Allocate file 1 in chunk A.
> Grow file 1.
> Chunk A fills up.
> Allocate continuation inode for file 1 in chunk B.
> Chunk A gets some free space.
> Chunk B fills up.
> Pick chunk A for allocating next block of file 1.
> Try to look up a continuation inode for file 1 in chunk A.
> Continuation inode for file 1 found in chunk A!
> Attach newly allocated block to existing inode for file 1 in chunk A.

So far, so good (and the slides are helpful, tx!).  What happens when
file 1 keeps growing and chunk A fills up (and chunk B is still full)?
Can the same continuation inode also point at chunk C, where the file
is going to grow to?

Jeff

-- 
Work email - jdike at linux dot intel dot com
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Jörn Engel
On Thu, 26 April 2007 22:07:43 -0700, Valerie Henson wrote:
>
> What's important is that each continuation inode have a
> back pointer to the parent and that there is some structure for
> quickly looking up the continuation inode for a given file offset.
> Suggestions for data structures that work well in this situation are
> welcome. :)

All this would get easier if continuation inodes were known to be rare.
You can ditch the doubly-linked list in favor of a pointer to the main
inode then - traversing the list again is cheap, after all.  And you can
just try to read the same block once for every continuation inode.

If those lists can get long and you need a mapping from offset to
continuation inode on the medium, you are basically fscked.  Storing the
mapping requires space.  You need the mapping only when space (in some
chunk) gets tight and you allocate continuation inodes.  So either you
don't need the mapping or you don't have a good place to put it.

Having a mapping in memory is also questionable.  Either you scan the
whole file on first access and spend a long time for large files.  Or
you create the mapping on the fly.  In that case the page cache will
already give you a 90% solution for free.

You should spend a lot of effort trying to minimize cnodes. ;)

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Valerie Henson
On Thu, Apr 26, 2007 at 10:47:38AM +0200, Jan Kara wrote:
>   Do I get it right that you just have in each cnode a pointer to the
> previous & next cnode? But then if two consecutive cnodes get corrupted,
> you have no way to connect the chain, do you? If each cnode contained
> some unique identifier of the file and a number identifying position of
> cnode,  then there would be at least some way (through expensive) to
> link them together correctly...

You're right, it's easy to add a little more redundancy that would
make it possible to recover from two consecutive nodes being
corrupted.  Keeping a parent inode id in each continuation inode is
definitely a smart thing to do.

Some minor side notes: Continuation inodes aren't really in any
defined order - if you look at Jeff's ping-pong chunk allocation
example, you'll see that the data in each continuation inode won't be
in linearly increasing order.  Also, while the current implementation
is a simple doubly-linked list, this may not be the best solution
long-term.  What's important is that each continuation inode have a
back pointer to the parent and that there is some structure for
quickly looking up the continuation inode for a given file offset.
Suggestions for data structures that work well in this situation are
welcome. :)

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Valerie Henson
On Thu, Apr 26, 2007 at 12:05:04PM -0400, Jeff Dike wrote:
> 
> No, I'm referring to a different file.  The scenario is that you have
> a growing file in a nearly full disk with files being deleted (and
> thus space being freed) such that allocations for the growing file
> bounce back and forth between chunks.

This is an excellent question.  I call this the ping-pong problem.
The solution is as Amit describes: You have a maximum of one
continuation inode per file per chunk, and you require sparse files.
Here's an example, spelled out:

Allocate file 1 in chunk A.
Grow file 1.
Chunk A fills up.
Allocate continuation inode for file 1 in chunk B.
Chunk A gets some free space.
Chunk B fills up.
Pick chunk A for allocating next block of file 1.
Try to look up a continuation inode for file 1 in chunk A.
Continuation inode for file 1 found in chunk A!
Attach newly allocated block to existing inode for file 1 in chunk A.

This is why the file format inside each chunk needs to support sparse
files.

I have a presentation that has a series of slides on problems and
potential resolutions that might help:

http://infohost.nmt.edu/~val/review/chunkfs_presentation.pdf

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jörn Engel
On Thu, 26 April 2007 10:47:40 +1000, David Chinner wrote:
> 
> This assumes that you know a chunk has been corrupted, though.
> How do you find that out?

Option 1: you notice something odd while serving userspace.
Option 2: a checking/scrubbing daemon of some sorts.

The first will obviously miss any corruption in data that is not touched
for a long time (ever?).

> > What you need to make this go fast is (1) a pre-made list of which
> > chunks have links with which other chunks,
> 
> So you add a new on-disk structure that needs to be kept up to
> date? How do you trust that structure to be correct if you are
> not journalling it?

Only chance I see is to treat this list as hints.  It should contain all
chunks that possibly have links.  It may also contain chunks that don't
have links.  By keeping strict FFS-style ordering of all relevant
writes, any mismatch should only cost fsck time.

Managing this list appears to be less than trivial.  Might actually be
easier to have LogFS-style rmap for each object in the filesystem.

> What happens if fsfuzzer trashes part
> of this table as well and you can't trust it?

If you have 5000 redundant copies of data and all get corrupted, you are
doomed.  I don't expect my filesystem to recover after having written
0x00 over the whole device.

Being able to recover a single corruption happening anywhere on the
device is already a huge step forward.  Of course most current
filesystems wouldn't even be able to detect all possible corruptions.
That alone would be a step forward.

One of the smart things of ZFS is to checksum everything.  Among the
Linux filesystems only JFFS2 seems to do it, but it cannot distinguish
between corrupted data and incomplete writes before a crash.  It
definitely costs performance, but that is the price one has to pay if
errors are to be detected.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 
"Delete file in chunk A". If you are referring to same file thats 
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.


No, I'm referring to a different file.  The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.



In such scenario either lot of continuation inodes < number of chunks 
would be created or lot of sparse pieces would be created. But we can 
certainly enforce the constraint of one continuation inode per file per 
chunk, excluding the file's primary chunk in which it started.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Alan Cox wrote:
Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 


How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?


Hello Alan,

You re-use the same inode with multiple sparse pieces.

This way you avoid hopping around continuation inodes and coming back to 
same chunk with which you started but this time on a different 
continuation inode. This may not be I/O intensive for successive 
traversals if the continuation inodes are pinned in the memory, but it 
certainly is a waste of resource - inodes. Not allowing this would make 
worst case of every file having a continuation inode in every chunk, 
even worse; may be like only single file exist in the file system and 
rest all inodes in all chunks (including file's own chunk) are 
continuation inodes.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Alan Cox
> Preventive measures are taken to limit only one continuation inode per 
> file per chunk. This can be done easily in the chunk allocation 
> algorithm for disk space. Although I'm not quite sure what you mean by 

How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jeff Dike
On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
> Jeff Dike wrote:
> >How about this case:
> >
> > Growing file starts in chunk A.
> > Overflows into chunk B.
> > Delete file in chunk A.
> > Growing file overflows chunk B and spots new free space in
> >chunk A (and nothing anywhere else)
> > Overflows into chunk A
> > Delete file in chunk B.
> > Overflow into chunk B again.
> >
> >Maybe this is not realistic, but in the absence of a mechanism to pull
> >data back from an overflow chunk, it seems at least a theoretical
> >possibility that there could be > 1 continuation inodes per file per
> >chunk.
> >
> 
> Preventive measures are taken to limit only one continuation inode per 
> file per chunk. This can be done easily in the chunk allocation 
> algorithm for disk space. Although I'm not quite sure what you mean by 
> "Delete file in chunk A". If you are referring to same file thats 
> growing, then deletion is not possible, because individual parts of any 
> file in any chunk cannot be deleted.

No, I'm referring to a different file.  The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.

Jeff

-- 
Work email - jdike at linux dot intel dot com
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.



Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 
"Delete file in chunk A". If you are referring to same file thats 
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jeff Dike
On Wed, Apr 25, 2007 at 03:47:10PM -0700, Valerie Henson wrote:
> Actually, there is an upper limit on the number of continuation
> inodes.  Each file can have a maximum of one continuation inode per
> chunk. (This is why we need to support sparse files.)

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Jeff

-- 
Work email - jdike at linux dot intel dot com
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jan Kara
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > >  --   --
> > > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> > >  --   --
> > > | cnode 1  |-  | cnode 1  |-
> > >  --   |   --  |
> > > | cnode 2  |-- |  | cnode 2  |--   |
> > >  --  | |  --  |   |
> > > | cnode 3  | | |  | cnode 3  | |   |
> > >  --  | |  --  |   |
> > > |  |  ||  |   |
> > > 
> > >  inodes   inodes or NULL
> > 
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't?
> 
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.
> 
> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.
> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks, (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.
  Do I get it right that you just have in each cnode a pointer to the
previous & next cnode? But then if two consecutive cnodes get corrupted,
you have no way to connect the chain, do you? If each cnode contained
some unique identifier of the file and a number identifying position of
cnode,  then there would be at least some way (through expensive) to
link them together correctly...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Chinner
On Wed, Apr 25, 2007 at 04:03:44PM -0700, Valerie Henson wrote:
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > > 
> > > The structure looks like this:
> > > 
> > >  --   --
> > > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> > >  --   --
> > > | cnode 1  |-  | cnode 1  |-
> > >  --   |   --  |
> > > | cnode 2  |-- |  | cnode 2  |--   |
> > >  --  | |  --  |   |
> > > | cnode 3  | | |  | cnode 3  | |   |
> > >  --  | |  --  |   |
> > > |  |  ||  |   |
> > > 
> > >  inodes   inodes or NULL
> > 
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't?
> 
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.

*nod*

> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.

This assumes that you know a chunk has been corrupted, though.
How do you find that out?

> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks,

So you add a new on-disk structure that needs to be kept up to
date? How do you trust that structure to be correct if you are
not journalling it? What happens if fsfuzzer trashes part
of this table as well and you can't trust it?

> (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.

Assuming you can trust this list. if not, finding cnodes is going
to be rather slow.

> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.
> 
> I worry about this but so far haven't found something which couldn't
> be cut down significantly with just a little extra work.  It might be
> helpful to look at an extreme case.
> 
> Let's say we're incredibly paranoid.  We could be justified in running
> a full fsck on the entire file system in between every single I/O.
> After all, something *might* have been silently corrupted.  But this
> would be ridiculously slow.  We could instead never check the file
> system.  But then we would end up panicking and corrupting the file
> system a lot.  So what's a good compromise?
> 
> In the chunkfs case, here's my rules of thumb so far:
> 
> 1. Detection: All metadata has magic numbers and checksums.
> 2. Scrubbing: Random check of chunks when possible.
> 3. Repair: When we detect corruption, either by checksum error, file
>system code assertion failure, or hardware tells us we have a bug,
>check the chunk containing the error and any outside-chunk
>information that could be affected by it.

So if you end up with a corruption in a "clean" part of the
filesystem, you may not find out about the corruption on reboot and
fsck?  You need to trip over the corruption first before fsck can be
told it needs to check/repair a given chunk? Or do you need to force
a "check everything" fsck in this case?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson
On Wed, Apr 25, 2007 at 05:38:34AM -0600, Andreas Dilger wrote:
> 
> The case where only a fsck of the corrupt chunk is done would not find the
> cnode references.  Maybe there needs to be per-chunk info which contains
> a list/bitmap of other chunks that have cnodes shared with each chunk?

Yes, exactly.  One might almost think you had solved this problem
before. :):):)

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson
On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > 
> > The structure looks like this:
> > 
> >  -- --
> > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> >  -- --
> > | cnode 1  |-  | cnode 1  |-
> >  -- |   --  |
> > | cnode 2  |-- |  | cnode 2  |--   |
> >  --  | |--  |   |
> > | cnode 3  | | |  | cnode 3  | |   |
> >  --  | |--  |   |
> >   |  |  ||  |   |
> > 
> >inodes   inodes or NULL
> 
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't?

Excellent question.  This is one of the trickier aspects of chunkfs -
the orphan inode problem (tricky, but solvable).  The problem is what
if you smash/lose/corrupt an inode in one chunk that has a
continuation inode in another chunk?  A back pointer does you no good
if the back pointer is corrupted.

What you do is keep tabs on whether you see damage that looks like
this has occurred - e.g., inode use/free counts wrong, you had to zero
a corrupted inode - and when this happens, you do a scan of all
continuation inodes in chunks that have links to the corrupted chunk.
What you need to make this go fast is (1) a pre-made list of which
chunks have links with which other chunks, (2) a fast way to read all
of the continuation inodes in a chunk (ignoring chunk-local inodes).
This stage is O(fs size) approximately, but it should be quite swift.

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

I worry about this but so far haven't found something which couldn't
be cut down significantly with just a little extra work.  It might be
helpful to look at an extreme case.

Let's say we're incredibly paranoid.  We could be justified in running
a full fsck on the entire file system in between every single I/O.
After all, something *might* have been silently corrupted.  But this
would be ridiculously slow.  We could instead never check the file
system.  But then we would end up panicking and corrupting the file
system a lot.  So what's a good compromise?

In the chunkfs case, here's my rules of thumb so far:

1. Detection: All metadata has magic numbers and checksums.
2. Scrubbing: Random check of chunks when possible.
3. Repair: When we detect corruption, either by checksum error, file
   system code assertion failure, or hardware tells us we have a bug,
   check the chunk containing the error and any outside-chunk
   information that could be affected by it.

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson
On Wed, Apr 25, 2007 at 03:34:03PM +0400, Nikita Danilov wrote:
> 
> What is more important, design puts (as far as I can see) no upper limit
> on the number of continuation inodes, and hence, even if _average_ fsck
> time is greatly reduced, occasionally it can take more time than ext2 of
> the same size. This is clearly unacceptable in many situations (HA,
> etc.).

Actually, there is an upper limit on the number of continuation
inodes.  Each file can have a maximum of one continuation inode per
chunk. (This is why we need to support sparse files.)

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Valerie Henson
On Tue, Apr 24, 2007 at 11:34:48PM +0400, Nikita Danilov wrote:
> 
> Maybe I failed to describe the problem presicely.
> 
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
> 
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.

Fsck in chunkfs is still going to have an element that is proportional
to the file system size for certain cases.  However, that element will
be a great deal smaller than in a regular file system, except in the
most pathological cases.  If those pathological cases happen often,
then it's back to the drawing board.  My hunch is that they won't be
common.

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Amit Gud

Andreas Dilger wrote:

How do you recover if fsfuzzer takes out a cnode in the chain? The
chunk is marked clean, but clearly corrupted and needs fixing and
you don't know what it was pointing at.  Hence you have a pointer to
a trashed cnode *somewhere* that you need to find and fix, and a
bunch of orphaned cnodes that nobody points to *somewhere else* in
the filesystem that you have to find. That's a full scan fsck case,
isn't?


Presumably, the cnodes in the other chunks contain forward and back
references.  Those need to contain at minimum inode + generation + chunk
to avoid problem of pointing to a _different_ inode after such corruption
caused the old inode to be deleted and a new one allocated in its place.

If the cnode in each chunk is more than just a singly-linked list, the
file as a whole could survive multiple chunk corruptions, though there
would now be holes in the file.


It seems that any sort of damage to the underlying storage (e.g.
media error, I/O error or user brain explosion) results in the need
to do a full fsck and hence chunkfs gives you no benefit in this
case.




Yes, what originated from discussions on #linuxfs is that redundancy is 
required for cnodes, in order to avoid checking the entire file system 
in search of a dangling cnode reference or for "parent" of a cnode.


If corruption, due to any reason, occurs in any other part of the file 
system, it would be localized for that chunk. Even if entire fsck is 
needed, chances of which are rare, full fsck of chunked file system is 
no worse than fsck of non-chunked file system. Passes 3, 4, and 5 of 
fsck take only 10-15% of total fsck run time and almost no-I/O Pass 6 
for chunkfs wouldn't add whole lot.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Lang

On Wed, 25 Apr 2007, Nikita Danilov wrote:


David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > David Lang writes:
> > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > >
> > > > Amit Gud writes:
> > > >
> > > > Hello,
> > > >
> > > > >
> > > > > This is an initial implementation of ChunkFS technique, briefly 
discussed
> > > > > at: http://lwn.net/Articles/190222 and
> > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > > >
> > > > I have a couple of questions about chunkfs repair process.
> > > >
> > > > First, as I understand it, each continuation inode is a sparse file,
> > > > mapping some subset of logical file blocks into block numbers. Then it
> > > > seems, that during "final phase" fsck has to check that these partial
> > > > mappings are consistent, for example, that no two different continuation
> > > > inodes for a given file contain a block number for the same offset. This
> > > > check requires scan of all chunks (rather than of only "active during
> > > > crash"), which seems to return us back to the scalability problem
> > > > chunkfs tries to address.
> > >
> > > not quite.
> > >
> > > this checking is a O(n^2) or worse problem, and it can eat a lot of 
memory in
> > > the process. with chunkfs you divide the problem by a large constant (100 
or
> > > more) for the checks of individual chunks. after those are done then the 
final
> > > pass checking the cross-chunk links doesn't have to keep track of 
everything, it
> > > only needs to check those links and what they point to
> >
> > Maybe I failed to describe the problem presicely.
> >
> > Suppose that all chunks have been checked. After that, for every inode
> > I0 having continuations I1, I2, ... In, one has to check that every
> > logical block is presented in at most one of these inodes. For this one
> > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk) this
> > obviously is as bad as un-chunked fsck. But even in the average case,
> > total amount of io necessary for this operation is proportional to the
> > _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation nodes. The
> expectation (and design) is that they are rare.

Indeed, but total size of meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).


correct, but remember that in the real world O(total_file_system_size) does not 
mean that it can't work well. it just means that larger filesystems will take 
longer to check.


they aren't out to eliminate the need for fsck, just to be able to divide the 
time it currently takes by a large value so that as the filesystems continue to 
get larger it is still reasonable to check them



What is more important, design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if _average_ fsck
time is greatly reduced, occasionally it can take more time than ext2 of
the same size. This is clearly unacceptable in many situations (HA,
etc.).


in a pathalogical situation you are correct, it would take longer. however 
before declaring that this is completely unacceptable why don't you wait and see 
if the pathalogical situation is at all likely?


when you are doing ha with shared storage you tend to be doing things like 
databases, every database that I know about splits it's data files into many 
pieces of a fixed size. Postgres for example does 1M files. if you do a chunk 
size of 1G it's very unlikly that more then a couple files out of every thousand 
will end up with continuation nodes.


remember that the current thinking on chunk size is to make the chunks be ~1% of 
your filesystem, so on a 1TB filesystem your chunk size would be 10G (which, in 
the example above would mean just a couple files out of every ten thousand would 
have continuation nodes).


with the current filesystems it's _possible_ for a file to be spread out across 
the disk such that it's first block is at the beginning of the disk, the second 
at the end of the disk, the third back at the beginning, the fourth at the end, 
etc. but users don't worry about this when useing the filesystems becouse the 
odds of this happening under normal use are vanishingly small (and the 
filesystem designers work to make the odds this small). similarly the chunkfs 
designers are working to make the odds of every file having a continuation nodes 
vanishingly small as well.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Andreas Dilger
On Apr 25, 2007  20:54 +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > Right now, there is no distinction between an inode and continuation 
> > inode (also referred to as 'cnode' below), except for the 
> > EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
> > inodes, currently limited to 4.
> > 
> > The structure looks like this:
> > 
> >  -- --
> > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> >  -- --
> > | cnode 1  |-  | cnode 1  |-
> >  -- |   --  |
> > | cnode 2  |-- |  | cnode 2  |--   |
> >  --  | |--  |   |
> > | cnode 3  | | |  | cnode 3  | |   |
> >  --  | |--  |   |
> >   |  |  ||  |   |
> > 
> >inodes   inodes or NULL
> 
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't?

Presumably, the cnodes in the other chunks contain forward and back
references.  Those need to contain at minimum inode + generation + chunk
to avoid problem of pointing to a _different_ inode after such corruption
caused the old inode to be deleted and a new one allocated in its place.

If the cnode in each chunk is more than just a singly-linked list, the
file as a whole could survive multiple chunk corruptions, though there
would now be holes in the file.

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

There are several cases where such corruption could be found:
- file access from the "parent" cnode will be missing corrupted cnode,
  probably causing a fsck of both the source and target chunks
- a fsck of the source chunk would find the dangling cnode reference
  and cause a fsck of the corrupt chunk
- a fsck of the later cnode chunks would find the dangling cnode reference
  and cause a fsck of the corrupt chunk
- a fsck of the corrupt chunk would find the original corruption

The case where only a fsck of the corrupt chunk is done would not find the
cnode references.  Maybe there needs to be per-chunk info which contains
a list/bitmap of other chunks that have cnodes shared with each chunk?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > David Lang writes:
 > > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > > >
 > > > > Amit Gud writes:
 > > > >
 > > > > Hello,
 > > > >
 > > > > >
 > > > > > This is an initial implementation of ChunkFS technique, briefly 
 > > > > > discussed
 > > > > > at: http://lwn.net/Articles/190222 and
 > > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > > > >
 > > > > I have a couple of questions about chunkfs repair process.
 > > > >
 > > > > First, as I understand it, each continuation inode is a sparse file,
 > > > > mapping some subset of logical file blocks into block numbers. Then it
 > > > > seems, that during "final phase" fsck has to check that these partial
 > > > > mappings are consistent, for example, that no two different 
 > > > > continuation
 > > > > inodes for a given file contain a block number for the same offset. 
 > > > > This
 > > > > check requires scan of all chunks (rather than of only "active during
 > > > > crash"), which seems to return us back to the scalability problem
 > > > > chunkfs tries to address.
 > > >
 > > > not quite.
 > > >
 > > > this checking is a O(n^2) or worse problem, and it can eat a lot of 
 > > > memory in
 > > > the process. with chunkfs you divide the problem by a large constant 
 > > > (100 or
 > > > more) for the checks of individual chunks. after those are done then the 
 > > > final
 > > > pass checking the cross-chunk links doesn't have to keep track of 
 > > > everything, it
 > > > only needs to check those links and what they point to
 > >
 > > Maybe I failed to describe the problem presicely.
 > >
 > > Suppose that all chunks have been checked. After that, for every inode
 > > I0 having continuations I1, I2, ... In, one has to check that every
 > > logical block is presented in at most one of these inodes. For this one
 > > has to read I0, with all its indirect (double-indirect, triple-indirect)
 > > blocks, then read I1 with all its indirect blocks, etc. And to repeat
 > > this for every inode with continuations.
 > >
 > > In the worst case (every inode has a continuation in every chunk) this
 > > obviously is as bad as un-chunked fsck. But even in the average case,
 > > total amount of io necessary for this operation is proportional to the
 > > _total_ file system size, rather than to the chunk size.
 > 
 > actually, it should be proportional to the number of continuation nodes. The 
 > expectation (and design) is that they are rare.

Indeed, but total size of meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).

What is more important, design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if _average_ fsck
time is greatly reduced, occasionally it can take more time than ext2 of
the same size. This is clearly unacceptable in many situations (HA,
etc.).

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread David Chinner
On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> Nikita Danilov wrote:
> >Maybe I failed to describe the problem presicely.
> >
> >Suppose that all chunks have been checked. After that, for every inode
> >I0 having continuations I1, I2, ... In, one has to check that every
> >logical block is presented in at most one of these inodes. For this one
> >has to read I0, with all its indirect (double-indirect, triple-indirect)
> >blocks, then read I1 with all its indirect blocks, etc. And to repeat
> >this for every inode with continuations.
> >
> >In the worst case (every inode has a continuation in every chunk) this
> >obviously is as bad as un-chunked fsck. But even in the average case,
> >total amount of io necessary for this operation is proportional to the
> >_total_ file system size, rather than to the chunk size.
> >
> 
> Perhaps, I should talk about how continuation inodes are managed / 
> located on disk. (This is how it is in my current implementation)
> 
> Right now, there is no distinction between an inode and continuation 
> inode (also referred to as 'cnode' below), except for the 
> EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
> inodes, currently limited to 4.
> 
> The structure looks like this:
> 
>  --   --
> | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
>  --   --
> | cnode 1  |-  | cnode 1  |-
>  --   |   --  |
> | cnode 2  |-- |  | cnode 2  |--   |
>  --  | |  --  |   |
> | cnode 3  | | |  | cnode 3  | |   |
>  --  | |  --  |   |
> |  |  ||  |   |
> 
>  inodes   inodes or NULL

How do you recover if fsfuzzer takes out a cnode in the chain? The
chunk is marked clean, but clearly corrupted and needs fixing and
you don't know what it was pointing at.  Hence you have a pointer to
a trashed cnode *somewhere* that you need to find and fix, and a
bunch of orphaned cnodes that nobody points to *somewhere else* in
the filesystem that you have to find. That's a full scan fsck case,
isn't?

It seems that any sort of damage to the underlying storage (e.g.
media error, I/O error or user brain explosion) results in the need
to do a full fsck and hence chunkfs gives you no benefit in this
case.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Amit Gud

Nikita Danilov wrote:

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.



Perhaps, I should talk about how continuation inodes are managed / 
located on disk. (This is how it is in my current implementation)


Right now, there is no distinction between an inode and continuation 
inode (also referred to as 'cnode' below), except for the 
EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
inodes, currently limited to 4.


The structure looks like this:

 -- --
| cnode 0  |-->| cnode 0  |--> to another cnode or NULL
 -- --
| cnode 1  |-  | cnode 1  |-
 -- |   --  |
| cnode 2  |--  |  | cnode 2  |--   |
 --  |  |   --  |   |
| cnode 3  | |  |  | cnode 3  | |   |
 --  |  |   --  |   |
  |  |  ||  |   |

   inodes   inodes or NULL

I.e. only first cnode in the list carries forward the chain if all the 
slots are occupied.


Every cnode# field contains
{
ino_t cnode;
__u32 start;/* starting logical block number */
__u32 end;  /* ending logical block number */
}

(current implementation has just one field: cnode)

I thought of this structure to avoid recursion and / or use of any data 
structure while traversing the continuation inodes.


Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode 
has any sparse portions. 'start' and 'end' fields are used to speed-up 
finding a cnode given a logical block number without the need of 
actually reading the inode - this can be done away with, perhaps more 
conveniently by, pinning the cnodes in memory as and when read.


Now going back to the Nikita's question, all the cnodes for an inode 
need to be scanned iff 'start' field or number of blocks or flag 
EXT2_SPARSE_CONT_FL in any of its cnodes is altered.


And yes, the whole attempt is to reduce the number of continuation inodes.

Comments, suggestions welcome.


AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > Hello,
> >
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. Then it
> > seems, that during "final phase" fsck has to check that these partial
> > mappings are consistent, for example, that no two different continuation
> > inodes for a given file contain a block number for the same offset. This
> > check requires scan of all chunks (rather than of only "active during
> > crash"), which seems to return us back to the scalability problem
> > chunkfs tries to address.
>
> not quite.
>
> this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done then the final
> pass checking the cross-chunk links doesn't have to keep track of everything, 
it
> only needs to check those links and what they point to

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.


actually, it should be proportional to the number of continuation nodes. The 
expectation (and design) is that they are rare.


If you get into the worst-case situation of all of them being continuation 
nodes, then you are actually worse off then you were to start with (as you are 
saying), but numbers from people's real filesystems (assuming a chunk size equal 
to a block cluster size) indicates that we are more on the order of a fraction 
of a percent of the nodes. and the expectation is that since the chunk sizes 
will be substantially larger then the block cluster sizes this should get 
reduced even more.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > Amit Gud writes:
 > >
 > > Hello,
 > >
 > > >
 > > > This is an initial implementation of ChunkFS technique, briefly discussed
 > > > at: http://lwn.net/Articles/190222 and
 > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > >
 > > I have a couple of questions about chunkfs repair process.
 > >
 > > First, as I understand it, each continuation inode is a sparse file,
 > > mapping some subset of logical file blocks into block numbers. Then it
 > > seems, that during "final phase" fsck has to check that these partial
 > > mappings are consistent, for example, that no two different continuation
 > > inodes for a given file contain a block number for the same offset. This
 > > check requires scan of all chunks (rather than of only "active during
 > > crash"), which seems to return us back to the scalability problem
 > > chunkfs tries to address.
 > 
 > not quite.
 > 
 > this checking is a O(n^2) or worse problem, and it can eat a lot of memory 
 > in 
 > the process. with chunkfs you divide the problem by a large constant (100 or 
 > more) for the checks of individual chunks. after those are done then the 
 > final 
 > pass checking the cross-chunk links doesn't have to keep track of 
 > everything, it 
 > only needs to check those links and what they point to

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.

 > 
 > any ability to mark a filesystem as 'clean' and then not have to check it on 
 > reboot is a bonus on top of this.
 > 
 > David Lang

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


Amit Gud writes:

Hello,

>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.


not quite.

this checking is a O(n^2) or worse problem, and it can eat a lot of memory in 
the process. with chunkfs you divide the problem by a large constant (100 or 
more) for the checks of individual chunks. after those are done then the final 
pass checking the cross-chunk links doesn't have to keep track of everything, it 
only needs to check those links and what they point to


any ability to mark a filesystem as 'clean' and then not have to check it on 
reboot is a bonus on top of this.


David Lang


Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

>
> Best,
> AG

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
Amit Gud writes:

Hello,

 > 
 > This is an initial implementation of ChunkFS technique, briefly discussed
 > at: http://lwn.net/Articles/190222 and 
 > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process. 

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.

Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

 > 
 > Best,
 > AG

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud

Suparna Bhattacharya wrote:

Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review. 


What kind of results are you planning to gather to evaluate/optimize this ?



Mainly I'm trying to gather following:

- Graph of continuation inodes vs. the file system fragmentation (or 
aging) factor with varying configurations of chunk sizes


- Graph of wall clock time vs. disk size + data on the disk with both 
chunkfs and native ext2, and/or other file systems



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote:
> On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> > 
> > This is an initial implementation of ChunkFS technique, briefly discussed
> > at: http://lwn.net/Articles/190222 and 
> > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > 
> > This implementation is done within ext2 driver. Every chunk is an 
> > independent ext2 file system. The knowledge about chunks is kept within 
> > ext2 and 'continuation inodes', which are used to allow files and 
> > directories span across multiple chunks, are managed within ext2.
> > 
> > At mount time, super blocks for all the chunks are created and linked with 
> > the global super_blocks list maintained by VFS. This allows independent 
> > behavior or individual chunks and also helps writebacks to happen 
> > seamlessly.
> > 
> > Apart from this, chunkfs code in ext2 effectively only provides knowledge 
> > of:
> > 
> > - what inode's which block number to look for, for a given file's logical 
> > block number
> > - in which chunk to allocate next inode / block
> > - number of inodes to scan when a directory is being read
> > 
> > To maintain the ext2's inode number uniqueness property, 8 msb bits of 
> > inode number are used to indicate the chunk number in which it resides.
> > 
> > As said, this is a preliminary implementation and lots of changes are 
> > expected before this code is even sanely usable. Some known issues and 
> > obvious optimizations are listed in the TODO file in the chunkfs patch.
> > 
> > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> > - one big patch
> > - applies to 2.6.18
> 
> 
> Could you send this out as a patch to ext2 codebase, so we can just look
> at the changes for chunkfs ? That might also make it small enough
> to inline your patch in email for review. 

Sorry, I missed the part about ext2-chunkfs-diff below.

Regards
suparna

> 
> What kind of results are you planning to gather to evaluate/optimize this ?
> 
> Regards
> Suparna
> 
> > 
> > Attached - ext2-chunkfs-diff.patch.gz
> > - since the code is a spin-off of ext2, this patch explains better what
> >   has changed from the ext2.
> > 
> > git://cislinux.cis.ksu.edu/chunkfs-tools
> > - mkfs, and fsck for chunkfs.
> > 
> > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> > - config file used; tested mostly on UML with loopback file systems.
> > 
> > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
> > should be "no" for clean compile.
> > 
> > 
> > Please comment, suggest, criticize. Patches most welcome.
> > 
> > 
> > Best,
> > AG
> > --
> > May the source be with you.
> > http://www.cis.ksu.edu/~gud
> 
> 
> 
> -- 
> Suparna Bhattacharya ([EMAIL PROTECTED])
> Linux Technology Center
> IBM Software Lab, India
> 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> 
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and 
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> 
> This implementation is done within ext2 driver. Every chunk is an 
> independent ext2 file system. The knowledge about chunks is kept within 
> ext2 and 'continuation inodes', which are used to allow files and 
> directories span across multiple chunks, are managed within ext2.
> 
> At mount time, super blocks for all the chunks are created and linked with 
> the global super_blocks list maintained by VFS. This allows independent 
> behavior or individual chunks and also helps writebacks to happen 
> seamlessly.
> 
> Apart from this, chunkfs code in ext2 effectively only provides knowledge 
> of:
> 
> - what inode's which block number to look for, for a given file's logical 
> block number
> - in which chunk to allocate next inode / block
> - number of inodes to scan when a directory is being read
> 
> To maintain the ext2's inode number uniqueness property, 8 msb bits of 
> inode number are used to indicate the chunk number in which it resides.
> 
> As said, this is a preliminary implementation and lots of changes are 
> expected before this code is even sanely usable. Some known issues and 
> obvious optimizations are listed in the TODO file in the chunkfs patch.
> 
> http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> - one big patch
> - applies to 2.6.18


Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review. 

What kind of results are you planning to gather to evaluate/optimize this ?

Regards
Suparna

> 
> Attached - ext2-chunkfs-diff.patch.gz
> - since the code is a spin-off of ext2, this patch explains better what
>   has changed from the ext2.
> 
> git://cislinux.cis.ksu.edu/chunkfs-tools
> - mkfs, and fsck for chunkfs.
> 
> http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> - config file used; tested mostly on UML with loopback file systems.
> 
> NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
> should be "no" for clean compile.
> 
> 
> Please comment, suggest, criticize. Patches most welcome.
> 
> 
> Best,
> AG
> --
> May the source be with you.
> http://www.cis.ksu.edu/~gud



-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud


This is an initial implementation of ChunkFS technique, briefly discussed
at: http://lwn.net/Articles/190222 and 
http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf


This implementation is done within ext2 driver. Every chunk is an 
independent ext2 file system. The knowledge about chunks is kept within 
ext2 and 'continuation inodes', which are used to allow files and 
directories span across multiple chunks, are managed within ext2.


At mount time, super blocks for all the chunks are created and linked with 
the global super_blocks list maintained by VFS. This allows independent 
behavior or individual chunks and also helps writebacks to happen 
seamlessly.


Apart from this, chunkfs code in ext2 effectively only provides knowledge of:

- what inode's which block number to look for, for a given file's logical 
block number

- in which chunk to allocate next inode / block
- number of inodes to scan when a directory is being read

To maintain the ext2's inode number uniqueness property, 8 msb bits of 
inode number are used to indicate the chunk number in which it resides.


As said, this is a preliminary implementation and lots of changes are 
expected before this code is even sanely usable. Some known issues and 
obvious optimizations are listed in the TODO file in the chunkfs patch.


http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
- one big patch
- applies to 2.6.18

Attached - ext2-chunkfs-diff.patch.gz
- since the code is a spin-off of ext2, this patch explains better what
  has changed from the ext2.

git://cislinux.cis.ksu.edu/chunkfs-tools
- mkfs, and fsck for chunkfs.

http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
- config file used; tested mostly on UML with loopback file systems.

NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
should be "no" for clean compile.



Please comment, suggest, criticize. Patches most welcome.


Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

ext2-chunkfs-diff.patch.gz
Description: Binary data