Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Fri, Apr 27, 2007 at 11:06:47AM -0400, Jeff Dike wrote:
> On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> > Here's an example, spelled out:
> >
> > Allocate file 1 in chunk A.
> > Grow file 1.
> > Chunk A fills up.
> > Allocate continuation inode for file 1 in chunk B.
> > Chunk A gets some free space.
> > Chunk B fills up.
> > Pick chunk A for allocating next block of file 1.
> > Try to look up a continuation inode for file 1 in chunk A.
> > Continuation inode for file 1 found in chunk A!
> > Attach newly allocated block to existing inode for file 1 in chunk A.
>
> So far, so good (and the slides are helpful, tx!).  What happens when
> file 1 keeps growing and chunk A fills up (and chunk B is still full)?
> Can the same continuation inode also point at chunk C, where the file
> is going to grow to?

You allocate a new continuation inode in chunk C.  The rule is that only
inodes inside a chunk can point to blocks inside the chunk, so you need
an inode in C if you want to allocate blocks from C.

-VAL
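A minimal sketch of that allocation rule as a toy in-memory model; the
types, the inode_for_chunk name, and the alloc_cnode callback are
illustrative assumptions, not the actual chunkfs interfaces:

/* Sketch only -- toy model, not the real chunkfs code. */
#include <stddef.h>

struct toy_inode {
	unsigned int chunk;		/* chunk this inode lives in */
	struct toy_inode *next;		/* chain of continuation inodes */
};

/*
 * Return the inode allowed to own new blocks in 'chunk', creating a
 * continuation inode there if the file has none yet.  The invariant:
 * only an inode inside a chunk may point at blocks of that chunk, and
 * a file has at most one (continuation) inode per chunk.
 */
struct toy_inode *inode_for_chunk(struct toy_inode *file, unsigned int chunk,
				  struct toy_inode *(*alloc_cnode)(unsigned int))
{
	struct toy_inode *i;

	for (i = file; i != NULL; i = i->next)
		if (i->chunk == chunk)
			return i;		/* reuse the existing one */

	i = alloc_cnode(chunk);			/* new continuation inode */
	if (i != NULL) {
		i->next = file->next;		/* link it into the chain */
		file->next = i;
	}
	return i;
}

In the chunk C case above, the loop finds nothing and a fresh
continuation inode is allocated in C.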
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Fri, 27 April 2007 23:50:06 -0700, Valerie Henson wrote:
>
> Any mapping structure will have to be pre-allocated.

How much space would you allocate for it then?  The required size surely
depends on the file size and fragmentation.

> So in my secret heart of hearts, I do indeed hope that cnodes are rare
> enough that we don't actually have to do anything smart to make them
> go fast.  Either having no fast lookup structure or creating it in
> memory as needed would be the nicest solution.  However, since I can't
> guarantee this will be the case, it's nice to have some idea of what
> we'll do if this does become important.

You go slow and slower.  Or you can defragment/clean the filesystem.
Fragmented filesystems being slow is hardly news to anyone.  ChunkFS will
be slower than others, so the reasons to fight fragmentation weigh
heavier.  What you can do:

- If two files 1 and 2 contain blocks in chunks A and B, move file 1
  blocks to chunk A and file 2 blocks to chunk B until one of the files
  only spans a single chunk.
- Same as above with longer chains like file 1 in chunks (A,B), file 2 in
  chunks (B,C), file 3 in chunks (A,C).
- When writing to file 1 in chunk A but running out of space in chunk A,
  move some cross-chunk file over instead of writing file 1 blocks to
  chunk B.

The first two can hope to use some idle times (if they exist), the third
will make writes go slower in order to speed up future reads.  There
ain't no such thing as a free lunch. :)

Jörn

-- 
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000
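A rough sketch of the third option (evict a cross-chunk file instead of
spilling the growing file); the toy types and the "fewest blocks here"
heuristic are assumptions for illustration, not anything taken from the
chunkfs patch:

#include <stddef.h>

struct toy_file {
	int crosses_chunks;		/* already has continuation inodes? */
	unsigned long blocks_here;	/* blocks it holds in the full chunk */
};

/*
 * Pick the cheapest cross-chunk file to migrate out of a chunk that is
 * running short of space, or NULL if no such file exists (in which
 * case the growing file has to spill over after all).
 */
struct toy_file *pick_victim(struct toy_file *files, size_t n)
{
	struct toy_file *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!files[i].crosses_chunks)
			continue;	/* moving it would create a new cnode */
		if (best == NULL || files[i].blocks_here < best->blocks_here)
			best = &files[i];
	}
	return best;
}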
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Fri, Apr 27, 2007 at 12:53:34PM +0200, Jörn Engel wrote:
>
> All this would get easier if continuation inodes were known to be rare.
> You can ditch the doubly-linked list in favor of a pointer to the main
> inode then - traversing the list again is cheap, after all.  And you
> can just try to read the same block once for every continuation inode.
>
> If those lists can get long and you need a mapping from offset to
> continuation inode on the medium, you are basically fscked.  Storing
> the mapping requires space.  You need the mapping only when space (in
> some chunk) gets tight and you allocate continuation inodes.  So either
> you don't need the mapping or you don't have a good place to put it.

Any mapping structure will have to be pre-allocated.

> Having a mapping in memory is also questionable.  Either you scan the
> whole file on first access and spend a long time for large files.  Or
> you create the mapping on the fly.  In that case the page cache will
> already give you a 90% solution for free.

So in my secret heart of hearts, I do indeed hope that cnodes are rare
enough that we don't actually have to do anything smart to make them go
fast.  Either having no fast lookup structure or creating it in memory as
needed would be the nicest solution.  However, since I can't guarantee
this will be the case, it's nice to have some idea of what we'll do if
this does become important.

> You should spend a lot of effort trying to minimize cnodes. ;)

Yep.  It's much better to optimize away most cnodes than to try to make
them go fast.

-VAL
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> Here's an example, spelled out:
>
> Allocate file 1 in chunk A.
> Grow file 1.
> Chunk A fills up.
> Allocate continuation inode for file 1 in chunk B.
> Chunk A gets some free space.
> Chunk B fills up.
> Pick chunk A for allocating next block of file 1.
> Try to look up a continuation inode for file 1 in chunk A.
> Continuation inode for file 1 found in chunk A!
> Attach newly allocated block to existing inode for file 1 in chunk A.

So far, so good (and the slides are helpful, tx!).  What happens when
file 1 keeps growing and chunk A fills up (and chunk B is still full)?
Can the same continuation inode also point at chunk C, where the file is
going to grow to?

Jeff

-- 
Work email - jdike at linux dot intel dot com
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, 26 April 2007 22:07:43 -0700, Valerie Henson wrote:
>
> What's important is that each continuation inode have a back pointer to
> the parent and that there is some structure for quickly looking up the
> continuation inode for a given file offset.  Suggestions for data
> structures that work well in this situation are welcome. :)

All this would get easier if continuation inodes were known to be rare.
You can ditch the doubly-linked list in favor of a pointer to the main
inode then - traversing the list again is cheap, after all.  And you can
just try to read the same block once for every continuation inode.

If those lists can get long and you need a mapping from offset to
continuation inode on the medium, you are basically fscked.  Storing the
mapping requires space.  You need the mapping only when space (in some
chunk) gets tight and you allocate continuation inodes.  So either you
don't need the mapping or you don't have a good place to put it.

Having a mapping in memory is also questionable.  Either you scan the
whole file on first access and spend a long time for large files.  Or you
create the mapping on the fly.  In that case the page cache will already
give you a 90% solution for free.

You should spend a lot of effort trying to minimize cnodes. ;)

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, Apr 26, 2007 at 10:47:38AM +0200, Jan Kara wrote:
> Do I get it right that you just have in each cnode a pointer to the
> previous & next cnode?  But then if two consecutive cnodes get
> corrupted, you have no way to connect the chain, do you?  If each cnode
> contained some unique identifier of the file and a number identifying
> the position of the cnode, then there would be at least some way
> (though expensive) to link them together correctly...

You're right, it's easy to add a little more redundancy that would make
it possible to recover from two consecutive cnodes being corrupted.
Keeping a parent inode id in each continuation inode is definitely a
smart thing to do.

Some minor side notes: Continuation inodes aren't really in any defined
order - if you look at Jeff's ping-pong chunk allocation example, you'll
see that the data in each continuation inode won't be in linearly
increasing order.  Also, while the current implementation is a simple
doubly-linked list, this may not be the best solution long-term.  What's
important is that each continuation inode have a back pointer to the
parent and that there is some structure for quickly looking up the
continuation inode for a given file offset.  Suggestions for data
structures that work well in this situation are welcome. :)

-VAL
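One possible answer to the data-structure question, sketched as a sorted
range table with binary search; the types and names are illustrative, not
the chunkfs on-disk format:

#include <stddef.h>

struct cnode_range {
	unsigned int start;	/* first logical block covered */
	unsigned int end;	/* last logical block covered  */
	unsigned long ino;	/* continuation inode number   */
};

/* 'map' must be sorted by 'start' and the ranges must not overlap. */
long lookup_cnode(const struct cnode_range *map, size_t n, unsigned int block)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (block < map[mid].start)
			hi = mid;
		else if (block > map[mid].end)
			lo = mid + 1;
		else
			return (long)map[mid].ino;	/* covering range */
	}
	return -1;	/* hole: no continuation inode covers this block */
}

If cnodes really are rare, the table stays tiny and could simply be built
in memory when the inode is first read; if they are not, this is exactly
the kind of extra metadata Jörn warns about elsewhere in the thread.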
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, Apr 26, 2007 at 12:05:04PM -0400, Jeff Dike wrote:
>
> No, I'm referring to a different file.  The scenario is that you have a
> growing file in a nearly full disk with files being deleted (and thus
> space being freed) such that allocations for the growing file bounce
> back and forth between chunks.

This is an excellent question.  I call this the ping-pong problem.  The
solution is as Amit describes: You have a maximum of one continuation
inode per file per chunk, and you require sparse files.  Here's an
example, spelled out:

Allocate file 1 in chunk A.
Grow file 1.
Chunk A fills up.
Allocate continuation inode for file 1 in chunk B.
Chunk A gets some free space.
Chunk B fills up.
Pick chunk A for allocating next block of file 1.
Try to look up a continuation inode for file 1 in chunk A.
Continuation inode for file 1 found in chunk A!
Attach newly allocated block to existing inode for file 1 in chunk A.

This is why the file format inside each chunk needs to support sparse
files.  I have a presentation that has a series of slides on problems and
potential resolutions that might help:

http://infohost.nmt.edu/~val/review/chunkfs_presentation.pdf

-VAL
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, 26 April 2007 10:47:40 +1000, David Chinner wrote:
>
> This assumes that you know a chunk has been corrupted, though.
> How do you find that out?

Option 1: you notice something odd while serving userspace.
Option 2: a checking/scrubbing daemon of some sort.

The first will obviously miss any corruption in data that is not touched
for a long time (ever?).

> > What you need to make this go fast is (1) a pre-made list of which
> > chunks have links with which other chunks,
>
> So you add a new on-disk structure that needs to be kept up to date?
> How do you trust that structure to be correct if you are not
> journalling it?

The only chance I see is to treat this list as hints.  It should contain
all chunks that possibly have links.  It may also contain chunks that
don't have links.  By keeping strict FFS-style ordering of all relevant
writes, any mismatch should only cost fsck time.

Managing this list appears to be less than trivial.  It might actually be
easier to have a LogFS-style rmap for each object in the filesystem.

> What happens if fsfuzzer trashes part of this table as well and you
> can't trust it?

If you have 5000 redundant copies of data and all get corrupted, you are
doomed.  I don't expect my filesystem to recover after having written
0x00 over the whole device.  Being able to recover from a single
corruption happening anywhere on the device is already a huge step
forward.

Of course most current filesystems wouldn't even be able to detect all
possible corruptions.  That alone would be a step forward.  One of the
smart things about ZFS is that it checksums everything.  Among the Linux
filesystems only JFFS2 seems to do it, but it cannot distinguish between
corrupted data and incomplete writes before a crash.  It definitely costs
performance, but that is the price one has to pay if errors are to be
detected.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu
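A toy illustration of the detect-on-read idea (magic number plus a
checksum over the metadata block); the block layout and the Fletcher-style
sum below are stand-ins for illustration, not what chunkfs or any existing
filesystem mandates:

#include <stdint.h>
#include <string.h>

#define TOY_MAGIC 0x43686e6bu	/* "Chnk" */

struct toy_meta_block {
	uint32_t magic;
	uint32_t csum;
	unsigned char payload[4088];	/* rest of a 4 KB block */
};

/* Placeholder checksum; any strong-enough function would do. */
static uint32_t toy_csum(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a += p[i];
		b += a;
	}
	return (b << 16) ^ a;
}

/* Writer side: fill in the checksum before the block goes to disk. */
void toy_meta_seal(struct toy_meta_block *blk)
{
	blk->magic = TOY_MAGIC;
	blk->csum = 0;
	blk->csum = toy_csum(blk, sizeof(*blk));
}

/* Reader side: 1 if the block looks intact, 0 if the chunk needs checking. */
int toy_meta_verify(const struct toy_meta_block *blk)
{
	struct toy_meta_block tmp;

	if (blk->magic != TOY_MAGIC)
		return 0;
	memcpy(&tmp, blk, sizeof(tmp));
	tmp.csum = 0;
	return toy_csum(&tmp, sizeof(tmp)) == blk->csum;
}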
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Jeff Dike wrote:
> On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
> > Jeff Dike wrote:
> > > How about this case:
> > >
> > >   Growing file starts in chunk A.
> > >   Overflows into chunk B.
> > >   Delete file in chunk A.
> > >   Growing file overflows chunk B and spots new free space in
> > > chunk A (and nothing anywhere else)
> > >   Overflows into chunk A
> > >   Delete file in chunk B.
> > >   Overflow into chunk B again.
> > >
> > > Maybe this is not realistic, but in the absence of a mechanism to
> > > pull data back from an overflow chunk, it seems at least a
> > > theoretical possibility that there could be > 1 continuation inodes
> > > per file per chunk.
> >
> > Preventive measures are taken to limit only one continuation inode
> > per file per chunk.  This can be done easily in the chunk allocation
> > algorithm for disk space.  Although I'm not quite sure what you mean
> > by "Delete file in chunk A".  If you are referring to the same file
> > that's growing, then deletion is not possible, because individual
> > parts of any file in any chunk cannot be deleted.
>
> No, I'm referring to a different file.  The scenario is that you have a
> growing file in a nearly full disk with files being deleted (and thus
> space being freed) such that allocations for the growing file bounce
> back and forth between chunks.

In such a scenario, either a lot of continuation inodes (fewer than the
number of chunks) would be created, or a lot of sparse pieces would be
created.  But we can certainly enforce the constraint of one continuation
inode per file per chunk, excluding the file's primary chunk in which it
started.

AG

-- 
May the source be with you.
http://www.cis.ksu.edu/~gud
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Alan Cox wrote:
> > Preventive measures are taken to limit only one continuation inode
> > per file per chunk.  This can be done easily in the chunk allocation
> > algorithm for disk space.  Although I'm not quite sure what you mean by
>
> How are you handling the allocation in this situation?  Are you
> assuming that a chunk is "out of bounds" because part of a file already
> lives on it, or simply keeping a single inode per chunk which has
> multiple sparse pieces of the file on it?
>
> ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
> 16-24MB to chunk A producing a single inode of 0-8 16-24, or does it
> have to find another chunk to use?

Hello Alan,

You re-use the same inode, with multiple sparse pieces.  This way you
avoid hopping around continuation inodes and coming back to the same
chunk you started with, but on a different continuation inode.  That
might not be I/O intensive for successive traversals if the continuation
inodes are pinned in memory, but it certainly is a waste of a resource -
inodes.  Not allowing this re-use would make the worst case of every file
having a continuation inode in every chunk even worse; it could end up
with only a single file existing in the file system while all the
remaining inodes in all chunks (including the file's own chunk) are
continuation inodes.

AG

-- 
May the source be with you.
http://www.cis.ksu.edu/~gud
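For Alan's 0-8MB / 8-16MB / 16-24MB case, the resulting layout could be
pictured as the table below; the 4 KB block size and the field names are
assumptions made only for the illustration:

/* Logical block ranges (4 KB blocks, so 8 MB == 2048 blocks). */
struct piece {
	unsigned int start;	/* first logical block of the piece */
	unsigned int end;	/* last logical block of the piece  */
	char chunk;		/* chunk holding these blocks       */
};

static const struct piece file1_layout[] = {
	{    0, 2047, 'A' },	/* 0-8 MB:   the file's inode in chunk A   */
	{ 2048, 4095, 'B' },	/* 8-16 MB:  continuation inode in chunk B */
	{ 4096, 6143, 'A' },	/* 16-24 MB: chunk A's inode again, which
				 * now has a hole from 8 to 16 MB          */
};

Chunk A's inode maps blocks 0-2047 and 4096-6143 with a hole in between,
which is why the per-chunk file format has to support sparse files.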
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
> Preventive measures are taken to limit only one continuation inode per
> file per chunk.  This can be done easily in the chunk allocation
> algorithm for disk space.  Although I'm not quite sure what you mean by

How are you handling the allocation in this situation?  Are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it, or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use?
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
> Jeff Dike wrote:
> > How about this case:
> >
> >   Growing file starts in chunk A.
> >   Overflows into chunk B.
> >   Delete file in chunk A.
> >   Growing file overflows chunk B and spots new free space in
> > chunk A (and nothing anywhere else)
> >   Overflows into chunk A
> >   Delete file in chunk B.
> >   Overflow into chunk B again.
> >
> > Maybe this is not realistic, but in the absence of a mechanism to pull
> > data back from an overflow chunk, it seems at least a theoretical
> > possibility that there could be > 1 continuation inodes per file per
> > chunk.
>
> Preventive measures are taken to limit only one continuation inode per
> file per chunk.  This can be done easily in the chunk allocation
> algorithm for disk space.  Although I'm not quite sure what you mean by
> "Delete file in chunk A".  If you are referring to the same file that's
> growing, then deletion is not possible, because individual parts of any
> file in any chunk cannot be deleted.

No, I'm referring to a different file.  The scenario is that you have a
growing file in a nearly full disk with files being deleted (and thus
space being freed) such that allocations for the growing file bounce back
and forth between chunks.

Jeff

-- 
Work email - jdike at linux dot intel dot com
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Jeff Dike wrote:
> How about this case:
>
>   Growing file starts in chunk A.
>   Overflows into chunk B.
>   Delete file in chunk A.
>   Growing file overflows chunk B and spots new free space in
> chunk A (and nothing anywhere else)
>   Overflows into chunk A
>   Delete file in chunk B.
>   Overflow into chunk B again.
>
> Maybe this is not realistic, but in the absence of a mechanism to pull
> data back from an overflow chunk, it seems at least a theoretical
> possibility that there could be > 1 continuation inodes per file per
> chunk.

Preventive measures are taken to limit only one continuation inode per
file per chunk.  This can be done easily in the chunk allocation
algorithm for disk space.  Although I'm not quite sure what you mean by
"Delete file in chunk A".  If you are referring to the same file that's
growing, then deletion is not possible, because individual parts of any
file in any chunk cannot be deleted.

AG

-- 
May the source be with you.
http://www.cis.ksu.edu/~gud
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, Apr 25, 2007 at 03:47:10PM -0700, Valerie Henson wrote:
> Actually, there is an upper limit on the number of continuation inodes.
> Each file can have a maximum of one continuation inode per chunk.
> (This is why we need to support sparse files.)

How about this case:

  Growing file starts in chunk A.
  Overflows into chunk B.
  Delete file in chunk A.
  Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
  Overflows into chunk A
  Delete file in chunk B.
  Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Jeff

-- 
Work email - jdike at linux dot intel dot com
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > > [cnode slot diagram snipped]
> >
> > How do you recover if fsfuzzer takes out a cnode in the chain?  The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find.  That's a full scan fsck case,
> > isn't it?
>
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.
>
> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.
> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks, (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.

Do I get it right that you just have in each cnode a pointer to the
previous & next cnode?  But then if two consecutive cnodes get corrupted,
you have no way to connect the chain, do you?  If each cnode contained
some unique identifier of the file and a number identifying the position
of the cnode, then there would be at least some way (though expensive) to
link them together correctly...

Honza

-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, Apr 25, 2007 at 04:03:44PM -0700, Valerie Henson wrote:
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > >
> > > The structure looks like this:
> > >
> > > [cnode slot diagram snipped]
> >
> > How do you recover if fsfuzzer takes out a cnode in the chain?  The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find.  That's a full scan fsck case,
> > isn't it?
>
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.

*nod*

> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.

This assumes that you know a chunk has been corrupted, though.
How do you find that out?

> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks,

So you add a new on-disk structure that needs to be kept up to date?
How do you trust that structure to be correct if you are not journalling
it?  What happens if fsfuzzer trashes part of this table as well and you
can't trust it?

> (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.

Assuming you can trust this list.  If not, finding cnodes is going to be
rather slow.

> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.
>
> I worry about this but so far haven't found something which couldn't
> be cut down significantly with just a little extra work.  It might be
> helpful to look at an extreme case.
>
> Let's say we're incredibly paranoid.  We could be justified in running
> a full fsck on the entire file system in between every single I/O.
> After all, something *might* have been silently corrupted.  But this
> would be ridiculously slow.  We could instead never check the file
> system.  But then we would end up panicking and corrupting the file
> system a lot.  So what's a good compromise?
>
> In the chunkfs case, here are my rules of thumb so far:
>
> 1. Detection: All metadata has magic numbers and checksums.
> 2. Scrubbing: Random check of chunks when possible.
> 3. Repair: When we detect corruption, either by checksum error, file
>    system code assertion failure, or the hardware telling us we have a
>    bug, check the chunk containing the error and any outside-chunk
>    information that could be affected by it.

So if you end up with a corruption in a "clean" part of the filesystem,
you may not find out about the corruption on reboot and fsck?  You need
to trip over the corruption first before fsck can be told it needs to
check/repair a given chunk?  Or do you need to force a "check everything"
fsck in this case?

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, Apr 25, 2007 at 05:38:34AM -0600, Andreas Dilger wrote:
>
> The case where only a fsck of the corrupt chunk is done would not find
> the cnode references.  Maybe there needs to be per-chunk info which
> contains a list/bitmap of other chunks that have cnodes shared with
> each chunk?

Yes, exactly.  One might almost think you had solved this problem
before. :):):)

-VAL
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> >
> > The structure looks like this:
> >
> > [cnode slot diagram snipped]
>
> How do you recover if fsfuzzer takes out a cnode in the chain?  The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find.  That's a full scan fsck case,
> isn't it?

Excellent question.  This is one of the trickier aspects of chunkfs - the
orphan inode problem (tricky, but solvable).  The problem is what if you
smash/lose/corrupt an inode in one chunk that has a continuation inode in
another chunk?  A back pointer does you no good if the back pointer is
corrupted.

What you do is keep tabs on whether you see damage that looks like this
has occurred - e.g., inode use/free counts wrong, you had to zero a
corrupted inode - and when this happens, you do a scan of all
continuation inodes in chunks that have links to the corrupted chunk.
What you need to make this go fast is (1) a pre-made list of which chunks
have links with which other chunks, (2) a fast way to read all of the
continuation inodes in a chunk (ignoring chunk-local inodes).  This stage
is O(fs size) approximately, but it should be quite swift.

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

I worry about this but so far haven't found something which couldn't be
cut down significantly with just a little extra work.  It might be
helpful to look at an extreme case.

Let's say we're incredibly paranoid.  We could be justified in running a
full fsck on the entire file system in between every single I/O.  After
all, something *might* have been silently corrupted.  But this would be
ridiculously slow.  We could instead never check the file system.  But
then we would end up panicking and corrupting the file system a lot.  So
what's a good compromise?

In the chunkfs case, here are my rules of thumb so far:

1. Detection: All metadata has magic numbers and checksums.
2. Scrubbing: Random check of chunks when possible.
3. Repair: When we detect corruption, either by checksum error, file
   system code assertion failure, or the hardware telling us we have a
   bug, check the chunk containing the error and any outside-chunk
   information that could be affected by it.

-VAL
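A sketch of rule 3, assuming the "pre-made list" is kept as a per-chunk
bitmap of cross-chunk links; the function and the check_chunk callback
are hypothetical names used only to show the shape of the logic:

#include <stddef.h>

/*
 * When chunk 'bad' is found corrupted, queue a check of that chunk plus
 * every chunk recorded as sharing cross-chunk links with it.  The other
 * chunks only need their continuation inodes scanned.
 */
void schedule_repair(size_t bad,
		     const unsigned char *link_bitmap,	/* one bit per chunk */
		     size_t nr_chunks,
		     void (*check_chunk)(size_t chunk, int cnodes_only))
{
	size_t i;

	check_chunk(bad, 0);	/* full check of the damaged chunk */

	for (i = 0; i < nr_chunks; i++) {
		if (i == bad)
			continue;
		if (link_bitmap[i / 8] & (1u << (i % 8)))
			check_chunk(i, 1);	/* continuation inodes only */
	}
}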
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, Apr 25, 2007 at 03:34:03PM +0400, Nikita Danilov wrote:
>
> What is more important, design puts (as far as I can see) no upper
> limit on the number of continuation inodes, and hence, even if
> _average_ fsck time is greatly reduced, occasionally it can take more
> time than ext2 of the same size.  This is clearly unacceptable in many
> situations (HA, etc.).

Actually, there is an upper limit on the number of continuation inodes.
Each file can have a maximum of one continuation inode per chunk.  (This
is why we need to support sparse files.)

-VAL
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, Apr 24, 2007 at 11:34:48PM +0400, Nikita Danilov wrote:
>
> Maybe I failed to describe the problem precisely.
>
> Suppose that all chunks have been checked.  After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes.  For this
> one has to read I0, with all its indirect (double-indirect,
> triple-indirect) blocks, then read I1 with all its indirect blocks,
> etc.  And to repeat this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck.  But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.

Fsck in chunkfs is still going to have an element that is proportional to
the file system size for certain cases.  However, that element will be a
great deal smaller than in a regular file system, except in the most
pathological cases.  If those pathological cases happen often, then it's
back to the drawing board.  My hunch is that they won't be common.

-VAL
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Andreas Dilger wrote:
> > How do you recover if fsfuzzer takes out a cnode in the chain?  The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find.  That's a full scan fsck case,
> > isn't it?
>
> Presumably, the cnodes in the other chunks contain forward and back
> references.  Those need to contain at minimum inode + generation +
> chunk to avoid the problem of pointing to a _different_ inode after
> such corruption caused the old inode to be deleted and a new one
> allocated in its place.  If the cnode in each chunk is more than just a
> singly-linked list, the file as a whole could survive multiple chunk
> corruptions, though there would now be holes in the file.
>
> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.

Yes, what came out of the discussions on #linuxfs is that redundancy is
required for cnodes, in order to avoid checking the entire file system in
search of a dangling cnode reference or of the "parent" of a cnode.

If corruption, due to any reason, occurs in any other part of the file
system, it would be localized to that chunk.  Even if an entire fsck is
needed - the chances of which are rare - a full fsck of a chunked file
system is no worse than a fsck of a non-chunked file system.  Passes 3,
4, and 5 of fsck take only 10-15% of the total fsck run time, and an
almost no-I/O pass 6 for chunkfs wouldn't add a whole lot.

AG

-- 
May the source be with you.
http://www.cis.ksu.edu/~gud
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Wed, 25 Apr 2007, Nikita Danilov wrote:
> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > [...]
> > > Maybe I failed to describe the problem precisely.
> > >
> > > Suppose that all chunks have been checked.  After that, for every
> > > inode I0 having continuations I1, I2, ... In, one has to check that
> > > every logical block is presented in at most one of these inodes.
> > > For this one has to read I0, with all its indirect
> > > (double-indirect, triple-indirect) blocks, then read I1 with all
> > > its indirect blocks, etc.  And to repeat this for every inode with
> > > continuations.
> > >
> > > In the worst case (every inode has a continuation in every chunk)
> > > this obviously is as bad as un-chunked fsck.  But even in the
> > > average case, total amount of io necessary for this operation is
> > > proportional to the _total_ file system size, rather than to the
> > > chunk size.
> >
> > actually, it should be proportional to the number of continuation
> > nodes.  The expectation (and design) is that they are rare.
>
> Indeed, but total size of meta-data pertaining to all continuation
> inodes is still proportional to the total file system size, and so is
> fsck time: O(total_file_system_size).

correct, but remember that in the real world O(total_file_system_size)
does not mean that it can't work well.  it just means that larger
filesystems will take longer to check.  they aren't out to eliminate the
need for fsck, just to be able to divide the time it currently takes by a
large value so that as the filesystems continue to get larger it is still
reasonable to check them.

> What is more important, design puts (as far as I can see) no upper
> limit on the number of continuation inodes, and hence, even if
> _average_ fsck time is greatly reduced, occasionally it can take more
> time than ext2 of the same size.  This is clearly unacceptable in many
> situations (HA, etc.).

in a pathological situation you are correct, it would take longer.
however, before declaring that this is completely unacceptable, why don't
you wait and see if the pathological situation is at all likely?

when you are doing HA with shared storage you tend to be doing things
like databases, and every database that I know about splits its data
files into many pieces of a fixed size.  Postgres, for example, does 1M
files.  if you do a chunk size of 1G it's very unlikely that more than a
couple of files out of every thousand will end up with continuation
nodes.  remember that the current thinking on chunk size is to make the
chunks be ~1% of your filesystem, so on a 1TB filesystem your chunk size
would be 10G (which, in the example above, would mean just a couple of
files out of every ten thousand would have continuation nodes).

with the current filesystems it's _possible_ for a file to be spread out
across the disk such that its first block is at the beginning of the
disk, the second at the end of the disk, the third back at the beginning,
the fourth at the end, etc.  but users don't worry about this when using
the filesystems because the odds of this happening under normal use are
vanishingly small (and the filesystem designers work to keep the odds
this small).  similarly the chunkfs designers are working to make the
odds of every file having continuation nodes vanishingly small as well.

David Lang
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Apr 25, 2007 20:54 +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > Right now, there is no distinction between an inode and continuation
> > inode (also referred to as 'cnode' below), except for the
> > EXT2_IS_CONT_FL flag.  Every inode holds a list of a static number of
> > inodes, currently limited to 4.
> >
> > The structure looks like this:
> >
> > [cnode slot diagram snipped]
>
> How do you recover if fsfuzzer takes out a cnode in the chain?  The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at.  Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find.  That's a full scan fsck case,
> isn't it?

Presumably, the cnodes in the other chunks contain forward and back
references.  Those need to contain at minimum inode + generation + chunk
to avoid the problem of pointing to a _different_ inode after such
corruption caused the old inode to be deleted and a new one allocated in
its place.  If the cnode in each chunk is more than just a singly-linked
list, the file as a whole could survive multiple chunk corruptions,
though there would now be holes in the file.

> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.

There are several cases where such corruption could be found:
- file access from the "parent" cnode will be missing the corrupted
  cnode, probably causing a fsck of both the source and target chunks
- a fsck of the source chunk would find the dangling cnode reference and
  cause a fsck of the corrupt chunk
- a fsck of the later cnode chunks would find the dangling cnode
  reference and cause a fsck of the corrupt chunk
- a fsck of the corrupt chunk would find the original corruption

The case where only a fsck of the corrupt chunk is done would not find
the cnode references.  Maybe there needs to be per-chunk info which
contains a list/bitmap of other chunks that have cnodes shared with each
chunk?

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
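A sketch of what such a cross-chunk reference could carry, following the
inode + generation + chunk suggestion; the struct and field names are
invented for illustration and are not the chunkfs on-disk layout:

#include <stdint.h>

struct cnode_ref {
	uint32_t chunk;		/* chunk the referenced inode lives in */
	uint32_t ino;		/* inode number within that chunk      */
	uint32_t generation;	/* bumped whenever the inode is reused */
};

/*
 * A reference is stale if the generation no longer matches the inode
 * actually found on disk, e.g. after a delete plus reallocation.
 */
static inline int cnode_ref_valid(const struct cnode_ref *ref,
				  uint32_t found_generation)
{
	return ref->generation == found_generation;
}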
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
> [...]
> > Maybe I failed to describe the problem precisely.
> >
> > Suppose that all chunks have been checked.  After that, for every
> > inode I0 having continuations I1, I2, ... In, one has to check that
> > every logical block is presented in at most one of these inodes.  For
> > this one has to read I0, with all its indirect (double-indirect,
> > triple-indirect) blocks, then read I1 with all its indirect blocks,
> > etc.  And to repeat this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk)
> > this obviously is as bad as un-chunked fsck.  But even in the average
> > case, total amount of io necessary for this operation is proportional
> > to the _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation
> nodes.  The expectation (and design) is that they are rare.

Indeed, but total size of meta-data pertaining to all continuation inodes
is still proportional to the total file system size, and so is fsck time:
O(total_file_system_size).

What is more important, design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if _average_ fsck
time is greatly reduced, occasionally it can take more time than ext2 of
the same size.  This is clearly unacceptable in many situations (HA,
etc.).

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> Nikita Danilov wrote:
> > Maybe I failed to describe the problem precisely.
> >
> > Suppose that all chunks have been checked.  After that, for every
> > inode I0 having continuations I1, I2, ... In, one has to check that
> > every logical block is presented in at most one of these inodes.  For
> > this one has to read I0, with all its indirect (double-indirect,
> > triple-indirect) blocks, then read I1 with all its indirect blocks,
> > etc.  And to repeat this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk)
> > this obviously is as bad as un-chunked fsck.  But even in the average
> > case, total amount of io necessary for this operation is proportional
> > to the _total_ file system size, rather than to the chunk size.
>
> Perhaps I should talk about how continuation inodes are managed /
> located on disk.  (This is how it is in my current implementation.)
>
> Right now, there is no distinction between an inode and continuation
> inode (also referred to as 'cnode' below), except for the
> EXT2_IS_CONT_FL flag.  Every inode holds a list of a static number of
> inodes, currently limited to 4.
>
> The structure looks like this:
>
> [cnode slot diagram snipped]

How do you recover if fsfuzzer takes out a cnode in the chain?  The chunk
is marked clean, but clearly corrupted and needs fixing and you don't
know what it was pointing at.  Hence you have a pointer to a trashed
cnode *somewhere* that you need to find and fix, and a bunch of orphaned
cnodes that nobody points to *somewhere else* in the filesystem that you
have to find.  That's a full scan fsck case, isn't it?

It seems that any sort of damage to the underlying storage (e.g. media
error, I/O error or user brain explosion) results in the need to do a
full fsck and hence chunkfs gives you no benefit in this case.

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Nikita Danilov wrote:
> Maybe I failed to describe the problem precisely.
>
> Suppose that all chunks have been checked.  After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes.  For this
> one has to read I0, with all its indirect (double-indirect,
> triple-indirect) blocks, then read I1 with all its indirect blocks,
> etc.  And to repeat this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck.  But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.

Perhaps I should talk about how continuation inodes are managed / located
on disk.  (This is how it is in my current implementation.)

Right now, there is no distinction between an inode and continuation
inode (also referred to as 'cnode' below), except for the
EXT2_IS_CONT_FL flag.  Every inode holds a list of a static number of
inodes, currently limited to 4.

The structure looks like this:

  ----------          ----------
 | cnode 0  |------->| cnode 0  |-------> to another cnode or NULL
  ----------          ----------
 | cnode 1  |---\    | cnode 1  |---\
  ----------     |    ----------     |
 | cnode 2  |---\|   | cnode 2  |---\|
  ----------    ||    ----------    ||
 | cnode 3  |--\||   | cnode 3  |--\||
  ----------   |||    ----------   |||
               vvv                 vvv
             inodes           inodes or NULL

I.e. only the first cnode in the list carries the chain forward if all
the slots are occupied.  Every cnode# field contains:

	{
		ino_t cnode;
		__u32 start;	/* starting logical block number */
		__u32 end;	/* ending logical block number */
	}

(The current implementation has just one field: cnode.)

I thought of this structure to avoid recursion and / or the use of any
data structure while traversing the continuation inodes.  An additional
flag, EXT2_SPARSE_CONT_FL, would indicate whether the inode has any
sparse portions.  The 'start' and 'end' fields are used to speed up
finding a cnode given a logical block number without the need to actually
read the inode - this can be done away with, perhaps more conveniently,
by pinning the cnodes in memory as and when they are read.

Now going back to Nikita's question: all the cnodes for an inode need to
be scanned iff the 'start' field, the number of blocks, or the
EXT2_SPARSE_CONT_FL flag in any of its cnodes is altered.  And yes, the
whole attempt is to reduce the number of continuation inodes.

Comments, suggestions welcome.

AG

-- 
May the source be with you.
http://www.cis.ksu.edu/~gud
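Putting the description above in one place, a sketch of the per-inode
continuation table: four slots, slot 0 carrying the chain forward when
the table is full, and each slot recording the logical range its cnode
covers.  Type and macro names here are illustrative, not lifted from the
patch:

#include <stdint.h>

#define TOY_NR_CNODE_SLOTS 4

struct toy_cnode_slot {
	uint64_t cnode;		/* continuation inode number, 0 if unused */
	uint32_t start;		/* first logical block mapped by it       */
	uint32_t end;		/* last logical block mapped by it        */
};

struct toy_cnode_table {
	/*
	 * When all slots are occupied, slot 0 of this table points at
	 * the cnode that owns the next table rather than at file data.
	 */
	struct toy_cnode_slot slot[TOY_NR_CNODE_SLOTS];
};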
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote:
> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > [...]
> > not quite.
> >
> > this checking is an O(n^2) or worse problem, and it can eat a lot of
> > memory in the process.  with chunkfs you divide the problem by a
> > large constant (100 or more) for the checks of individual chunks.
> > after those are done then the final pass checking the cross-chunk
> > links doesn't have to keep track of everything, it only needs to
> > check those links and what they point to
>
> Maybe I failed to describe the problem precisely.
>
> Suppose that all chunks have been checked.  After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes.  For this
> one has to read I0, with all its indirect (double-indirect,
> triple-indirect) blocks, then read I1 with all its indirect blocks,
> etc.  And to repeat this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck.  But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.

actually, it should be proportional to the number of continuation nodes.
The expectation (and design) is that they are rare.

If you get into the worst-case situation of all of them being
continuation nodes, then you are actually worse off than you were to
start with (as you are saying), but numbers from people's real
filesystems (assuming a chunk size equal to a block cluster size)
indicate that we are more on the order of a fraction of a percent of the
nodes.  and the expectation is that since the chunk sizes will be
substantially larger than the block cluster sizes this should get reduced
even more.

David Lang
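The cross-chunk part of that final pass - making sure no logical block is
claimed by two continuation inodes - can be sketched as a sort plus
adjacent-overlap test over the (start, end) ranges gathered from one
file's cnodes; toy types only:

#include <stddef.h>
#include <stdlib.h>

struct range {
	unsigned int start;	/* first logical block claimed */
	unsigned int end;	/* last logical block claimed  */
};

static int cmp_range(const void *a, const void *b)
{
	const struct range *ra = a, *rb = b;

	return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Returns 1 if the ranges are disjoint, 0 if some block is claimed twice. */
int ranges_disjoint(struct range *r, size_t n)
{
	size_t i;

	qsort(r, n, sizeof(*r), cmp_range);
	for (i = 1; i < n; i++)
		if (r[i].start <= r[i - 1].end)
			return 0;
	return 1;
}

The work here is proportional to the number of continuation inodes a file
actually has, rather than to the total metadata, which is the property
David is arguing for.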
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > Amit Gud writes:
> > [...]
> > I have a couple of questions about chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers.  Then
> > it seems that during the "final phase" fsck has to check that these
> > partial mappings are consistent, for example, that no two different
> > continuation inodes for a given file contain a block number for the
> > same offset.  This check requires a scan of all chunks (rather than
> > of only those "active during the crash"), which seems to return us
> > back to the scalability problem chunkfs tries to address.
>
> not quite.
>
> this checking is an O(n^2) or worse problem, and it can eat a lot of
> memory in the process.  with chunkfs you divide the problem by a large
> constant (100 or more) for the checks of individual chunks.  after
> those are done then the final pass checking the cross-chunk links
> doesn't have to keep track of everything, it only needs to check those
> links and what they point to

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked.  After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes.  For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc.  And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck.  But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.

> any ability to mark a filesystem as 'clean' and then not have to check
> it on reboot is a bonus on top of this.
>
> David Lang

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote:
> Amit Gud writes:
>  > Hello,
>  >
>  > This is an initial implementation of ChunkFS technique, briefly
>  > discussed at: http://lwn.net/Articles/190222 and
>  > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
>
> I have a couple of questions about chunkfs repair process.
>
> First, as I understand it, each continuation inode is a sparse file,
> mapping some subset of logical file blocks into block numbers.  Then it
> seems that during the "final phase" fsck has to check that these
> partial mappings are consistent, for example, that no two different
> continuation inodes for a given file contain a block number for the
> same offset.  This check requires a scan of all chunks (rather than of
> only those "active during the crash"), which seems to return us back to
> the scalability problem chunkfs tries to address.

not quite.

this checking is an O(n^2) or worse problem, and it can eat a lot of
memory in the process.  with chunkfs you divide the problem by a large
constant (100 or more) for the checks of individual chunks.  after those
are done then the final pass checking the cross-chunk links doesn't have
to keep track of everything, it only needs to check those links and what
they point to.

any ability to mark a filesystem as 'clean' and then not have to check it
on reboot is a bonus on top of this.

David Lang

> Second, it is not clear how, under the assumption of bugs in the file
> system code (which the paper makes at the very beginning), fsck can
> limit itself only to the chunks that were active at the moment of
> crash.
>
> [...]
>
>  > Best,
>  > AG
>
> Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Amit Gud writes: Hello, > > This is an initial implementation of the ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf I have a couple of questions about the chunkfs repair process. First, as I understand it, each continuation inode is a sparse file, mapping some subset of logical file blocks into block numbers. Then it seems that during the "final phase" fsck has to check that these partial mappings are consistent, for example, that no two different continuation inodes for a given file contain a block number for the same offset. This check requires a scan of all chunks (rather than of only those "active during crash"), which seems to return us back to the scalability problem chunkfs tries to address. Second, it is not clear how, under the assumption of bugs in the file system code (which the paper makes at the very beginning), fsck can limit itself only to the chunks that were active at the moment of crash. [...] > > Best, > AG Nikita.
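As an illustration of the mapping being described (all names below are hypothetical, not taken from the patch): because each continuation inode covers only a sparse subset of the file, resolving one logical block means asking each inode in the chain, main inode first, whether it maps that offset.

#include <stdint.h>

/* Hypothetical types and helpers, for illustration only. */
struct cnode;                                        /* main or continuation inode */
extern struct cnode *next_cont(const struct cnode *c);              /* next inode in the chain, NULL at end */
extern uint64_t map_block(const struct cnode *c, uint64_t logical); /* physical block, 0 if hole */

/* Return the physical block backing 'logical', or 0 if no inode in the chain maps it. */
static uint64_t resolve_block(struct cnode *main_inode, uint64_t logical)
{
    for (struct cnode *c = main_inode; c; c = next_cont(c)) {
        uint64_t phys = map_block(c, logical);
        if (phys)
            return phys;    /* consistency requires that at most one inode maps it */
    }
    return 0;               /* hole in every member of the chain */
}

The consistency question raised here is exactly whether two members of that chain can ever return a block for the same offset, and what it costs fsck to rule that out.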
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Suparna Bhattacharya wrote: > Could you send this out as a patch to the ext2 codebase, so we can just look at the changes for chunkfs? That might also make it small enough to inline your patch in email for review. > What kind of results are you planning to gather to evaluate/optimize this? Mainly I'm trying to gather the following: - Graph of the number of continuation inodes vs. the file system fragmentation (or aging) factor, with varying configurations of chunk sizes - Graph of wall clock time vs. disk size + data on the disk, with both chunkfs and native ext2, and/or other file systems AG -- May the source be with you. http://www.cis.ksu.edu/~gud
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote: > On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > > > This is an initial implementation of the ChunkFS technique, briefly discussed > > at: http://lwn.net/Articles/190222 and > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > > > This implementation is done within the ext2 driver. Every chunk is an > > independent ext2 file system. The knowledge about chunks is kept within > > ext2, and 'continuation inodes', which are used to allow files and > > directories to span across multiple chunks, are managed within ext2. > > > > At mount time, super blocks for all the chunks are created and linked with > > the global super_blocks list maintained by VFS. This allows independent > > behavior of individual chunks and also helps writebacks to happen > > seamlessly. > > > > Apart from this, the chunkfs code in ext2 effectively only provides knowledge > > of: > > > > - which inode and which block number to look up for a given file's logical > > block number > > - in which chunk to allocate the next inode / block > > - the number of inodes to scan when a directory is being read > > > > To maintain ext2's inode number uniqueness property, the 8 most significant bits of > > the inode number are used to indicate the chunk number in which it resides. > > > > As said, this is a preliminary implementation and lots of changes are > > expected before this code is even sanely usable. Some known issues and > > obvious optimizations are listed in the TODO file in the chunkfs patch. > > > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > > - one big patch > > - applies to 2.6.18 > > > Could you send this out as a patch to the ext2 codebase, so we can just look > at the changes for chunkfs? That might also make it small enough > to inline your patch in email for review. Sorry, I missed the part about ext2-chunkfs-diff below. Regards suparna > > What kind of results are you planning to gather to evaluate/optimize this? > > Regards > Suparna > > > > > Attached - ext2-chunkfs-diff.patch.gz > > - since the code is a spin-off of ext2, this patch explains better what > > has changed from ext2. > > > > git://cislinux.cis.ksu.edu/chunkfs-tools > > - mkfs, and fsck for chunkfs. > > > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > > - config file used; tested mostly on UML with loopback file systems. > > > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > > should be "no" for clean compile. > > > > > > Please comment, suggest, criticize. Patches most welcome. > > > > > > Best, > > AG > > -- > > May the source be with you. > > http://www.cis.ksu.edu/~gud > > > > -- > Suparna Bhattacharya ([EMAIL PROTECTED]) > Linux Technology Center > IBM Software Lab, India > -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote: > > This is an initial implementation of the ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > This implementation is done within the ext2 driver. Every chunk is an > independent ext2 file system. The knowledge about chunks is kept within > ext2, and 'continuation inodes', which are used to allow files and > directories to span across multiple chunks, are managed within ext2. > > At mount time, super blocks for all the chunks are created and linked with > the global super_blocks list maintained by VFS. This allows independent > behavior of individual chunks and also helps writebacks to happen > seamlessly. > > Apart from this, the chunkfs code in ext2 effectively only provides knowledge > of: > > - which inode and which block number to look up for a given file's logical > block number > - in which chunk to allocate the next inode / block > - the number of inodes to scan when a directory is being read > > To maintain ext2's inode number uniqueness property, the 8 most significant bits of > the inode number are used to indicate the chunk number in which it resides. > > As said, this is a preliminary implementation and lots of changes are > expected before this code is even sanely usable. Some known issues and > obvious optimizations are listed in the TODO file in the chunkfs patch. > > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch > - one big patch > - applies to 2.6.18 Could you send this out as a patch to the ext2 codebase, so we can just look at the changes for chunkfs? That might also make it small enough to inline your patch in email for review. What kind of results are you planning to gather to evaluate/optimize this? Regards Suparna > > Attached - ext2-chunkfs-diff.patch.gz > - since the code is a spin-off of ext2, this patch explains better what > has changed from ext2. > > git://cislinux.cis.ksu.edu/chunkfs-tools > - mkfs, and fsck for chunkfs. > > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml > - config file used; tested mostly on UML with loopback file systems. > > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP > should be "no" for clean compile. > > > Please comment, suggest, criticize. Patches most welcome. > > > Best, > AG > -- > May the source be with you. > http://www.cis.ksu.edu/~gud -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India
[RFC][PATCH] ChunkFS: fs fission for faster fsck
This is an initial implementation of the ChunkFS technique, briefly discussed at: http://lwn.net/Articles/190222 and http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf This implementation is done within the ext2 driver. Every chunk is an independent ext2 file system. The knowledge about chunks is kept within ext2, and 'continuation inodes', which are used to allow files and directories to span across multiple chunks, are managed within ext2. At mount time, super blocks for all the chunks are created and linked with the global super_blocks list maintained by VFS. This allows independent behavior of individual chunks and also helps writebacks to happen seamlessly. Apart from this, the chunkfs code in ext2 effectively only provides knowledge of: - which inode and which block number to look up for a given file's logical block number - in which chunk to allocate the next inode / block - the number of inodes to scan when a directory is being read To maintain ext2's inode number uniqueness property, the 8 most significant bits of the inode number are used to indicate the chunk number in which it resides. As said, this is a preliminary implementation and lots of changes are expected before this code is even sanely usable. Some known issues and obvious optimizations are listed in the TODO file in the chunkfs patch. http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch - one big patch - applies to 2.6.18 Attached - ext2-chunkfs-diff.patch.gz - since the code is a spin-off of ext2, this patch explains better what has changed from ext2. git://cislinux.cis.ksu.edu/chunkfs-tools - mkfs, and fsck for chunkfs. http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml - config file used; tested mostly on UML with loopback file systems. NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP should be "no" for clean compile. Please comment, suggest, criticize. Patches most welcome. Best, AG -- May the source be with you. http://www.cis.ksu.edu/~gud
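To make the inode-number scheme above concrete, here is a minimal sketch of packing the chunk number into the top 8 bits, assuming 32-bit ext2 inode numbers; the macro and function names are illustrative and are not taken from chunkfs-v0.0.8.patch.

#include <stdint.h>

/*
 * Illustration only. Reserving the top 8 bits of a 32-bit inode number
 * for the chunk leaves 24 bits (about 16M inodes) per chunk.
 */
#define CHUNK_SHIFT     24
#define CHUNK_MASK      0xff000000u

static inline uint32_t make_ino(uint32_t chunk, uint32_t local_ino)
{
    return (chunk << CHUNK_SHIFT) | (local_ino & ~CHUNK_MASK);
}

static inline uint32_t ino_to_chunk(uint32_t ino)
{
    return (ino & CHUNK_MASK) >> CHUNK_SHIFT;   /* which chunk owns this inode */
}

static inline uint32_t ino_to_local(uint32_t ino)
{
    return ino & ~CHUNK_MASK;                   /* inode index inside that chunk */
}

With such a layout, an inode number like 0x05000001 would denote local inode 1 of chunk 5, so routing a lookup to the right chunk's inode table is just a matter of applying ino_to_chunk().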