Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Jeff Dike wrote: On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote: Jeff Dike wrote: How about this case: Growing file starts in chunk A. Overflows into chunk B. Delete file in chunk A. Growing file overflows chunk B and spots new free space in chunk A (and nothing anywhere else) Overflows into chunk A Delete file in chunk B. Overflow into chunk B again. Maybe this is not realistic, but in the absence of a mechanism to pull data back from an overflow chunk, it seems at least a theoretical possibility that there could be > 1 continuation inodes per file per chunk. Preventive measures are taken to limit only one continuation inode per file per chunk. This can be done easily in the chunk allocation algorithm for disk space. Although I'm not quite sure what you mean by "Delete file in chunk A". If you are referring to same file thats growing, then deletion is not possible, because individual parts of any file in any chunk cannot be deleted. No, I'm referring to a different file. The scenario is that you have a growing file in a nearly full disk with files being deleted (and thus space being freed) such that allocations for the growing file bounce back and forth between chunks. In such scenario either lot of continuation inodes < number of chunks would be created or lot of sparse pieces would be created. But we can certainly enforce the constraint of one continuation inode per file per chunk, excluding the file's primary chunk in which it started. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Alan Cox wrote: Preventive measures are taken to limit only one continuation inode per file per chunk. This can be done easily in the chunk allocation algorithm for disk space. Although I'm not quite sure what you mean by How are you handling the allocation in this situation, are you assuming that a chunk is "out of bounds" because part of a file already lives on it or simply keeping a single inode per chunk which has multiple sparse pieces of the file on it ? ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write 16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have to find another chunk to use ? Hello Alan, You re-use the same inode with multiple sparse pieces. This way you avoid hopping around continuation inodes and coming back to same chunk with which you started but this time on a different continuation inode. This may not be I/O intensive for successive traversals if the continuation inodes are pinned in the memory, but it certainly is a waste of resource - inodes. Not allowing this would make worst case of every file having a continuation inode in every chunk, even worse; may be like only single file exist in the file system and rest all inodes in all chunks (including file's own chunk) are continuation inodes. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Jeff Dike wrote: How about this case: Growing file starts in chunk A. Overflows into chunk B. Delete file in chunk A. Growing file overflows chunk B and spots new free space in chunk A (and nothing anywhere else) Overflows into chunk A Delete file in chunk B. Overflow into chunk B again. Maybe this is not realistic, but in the absence of a mechanism to pull data back from an overflow chunk, it seems at least a theoretical possibility that there could be > 1 continuation inodes per file per chunk. Preventive measures are taken to limit only one continuation inode per file per chunk. This can be done easily in the chunk allocation algorithm for disk space. Although I'm not quite sure what you mean by "Delete file in chunk A". If you are referring to same file thats growing, then deletion is not possible, because individual parts of any file in any chunk cannot be deleted. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Andreas Dilger wrote: How do you recover if fsfuzzer takes out a cnode in the chain? The chunk is marked clean, but clearly corrupted and needs fixing and you don't know what it was pointing at. Hence you have a pointer to a trashed cnode *somewhere* that you need to find and fix, and a bunch of orphaned cnodes that nobody points to *somewhere else* in the filesystem that you have to find. That's a full scan fsck case, isn't? Presumably, the cnodes in the other chunks contain forward and back references. Those need to contain at minimum inode + generation + chunk to avoid problem of pointing to a _different_ inode after such corruption caused the old inode to be deleted and a new one allocated in its place. If the cnode in each chunk is more than just a singly-linked list, the file as a whole could survive multiple chunk corruptions, though there would now be holes in the file. It seems that any sort of damage to the underlying storage (e.g. media error, I/O error or user brain explosion) results in the need to do a full fsck and hence chunkfs gives you no benefit in this case. Yes, what originated from discussions on #linuxfs is that redundancy is required for cnodes, in order to avoid checking the entire file system in search of a dangling cnode reference or for "parent" of a cnode. If corruption, due to any reason, occurs in any other part of the file system, it would be localized for that chunk. Even if entire fsck is needed, chances of which are rare, full fsck of chunked file system is no worse than fsck of non-chunked file system. Passes 3, 4, and 5 of fsck take only 10-15% of total fsck run time and almost no-I/O Pass 6 for chunkfs wouldn't add whole lot. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Nikita Danilov wrote: Maybe I failed to describe the problem presicely. Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations. In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, total amount of io necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size. Perhaps, I should talk about how continuation inodes are managed / located on disk. (This is how it is in my current implementation) Right now, there is no distinction between an inode and continuation inode (also referred to as 'cnode' below), except for the EXT2_IS_CONT_FL flag. Every inode holds a list of static number of inodes, currently limited to 4. The structure looks like this: -- -- | cnode 0 |-->| cnode 0 |--> to another cnode or NULL -- -- | cnode 1 |- | cnode 1 |- -- | -- | | cnode 2 |-- | | cnode 2 |-- | -- | | -- | | | cnode 3 | | | | cnode 3 | | | -- | | -- | | | | || | | inodes inodes or NULL I.e. only first cnode in the list carries forward the chain if all the slots are occupied. Every cnode# field contains { ino_t cnode; __u32 start;/* starting logical block number */ __u32 end; /* ending logical block number */ } (current implementation has just one field: cnode) I thought of this structure to avoid recursion and / or use of any data structure while traversing the continuation inodes. Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode has any sparse portions. 'start' and 'end' fields are used to speed-up finding a cnode given a logical block number without the need of actually reading the inode - this can be done away with, perhaps more conveniently by, pinning the cnodes in memory as and when read. Now going back to the Nikita's question, all the cnodes for an inode need to be scanned iff 'start' field or number of blocks or flag EXT2_SPARSE_CONT_FL in any of its cnodes is altered. And yes, the whole attempt is to reduce the number of continuation inodes. Comments, suggestions welcome. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ChunkFS - measuring cross-chunk references
On Mon, 23 Apr 2007, Amit Gud wrote: On Mon, 23 Apr 2007, Arjan van de Ven wrote: > The other thing which we should consider is that chunkfs really > requires a 64-bit inode number space, which means either we only allow does it? I'd think it needs a "chunk space" number and a 32 bit local inode number ;) (same for blocks) For inodes, yes, either 64-bit inode or some field for the chunk id in which the inode is. But for block numbers, you don't. Because individual chunks manage part of the whole file system in an independent way. They have their block bitmaps starting at an offset. Inode bitmaps, however, remains same. In that sense, we also can do away without having chunk identifier encoded into inode number and chunkfs would still be fine with it. But we will then loose inode uniqueness property, which could well be OK as it is with other file systems in which inode number is not sufficient for unique identification of an inode. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ChunkFS - measuring cross-chunk references
On Mon, 23 Apr 2007, Arjan van de Ven wrote: The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow does it? I'd think it needs a "chunk space" number and a 32 bit local inode number ;) (same for blocks) For inodes, yes, either 64-bit inode or some field for the chunk id in which the inode is. But for block numbers, you don't. Because individual chunks manage part of the whole file system in an independent way. They have their block bitmaps starting at an offset. Inode bitmaps, however, remains same. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Suparna Bhattacharya wrote: Could you send this out as a patch to ext2 codebase, so we can just look at the changes for chunkfs ? That might also make it small enough to inline your patch in email for review. What kind of results are you planning to gather to evaluate/optimize this ? Mainly I'm trying to gather following: - Graph of continuation inodes vs. the file system fragmentation (or aging) factor with varying configurations of chunk sizes - Graph of wall clock time vs. disk size + data on the disk with both chunkfs and native ext2, and/or other file systems AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH] ChunkFS: fs fission for faster fsck
This is an initial implementation of ChunkFS technique, briefly discussed at: http://lwn.net/Articles/190222 and http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf This implementation is done within ext2 driver. Every chunk is an independent ext2 file system. The knowledge about chunks is kept within ext2 and 'continuation inodes', which are used to allow files and directories span across multiple chunks, are managed within ext2. At mount time, super blocks for all the chunks are created and linked with the global super_blocks list maintained by VFS. This allows independent behavior or individual chunks and also helps writebacks to happen seamlessly. Apart from this, chunkfs code in ext2 effectively only provides knowledge of: - what inode's which block number to look for, for a given file's logical block number - in which chunk to allocate next inode / block - number of inodes to scan when a directory is being read To maintain the ext2's inode number uniqueness property, 8 msb bits of inode number are used to indicate the chunk number in which it resides. As said, this is a preliminary implementation and lots of changes are expected before this code is even sanely usable. Some known issues and obvious optimizations are listed in the TODO file in the chunkfs patch. http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch - one big patch - applies to 2.6.18 Attached - ext2-chunkfs-diff.patch.gz - since the code is a spin-off of ext2, this patch explains better what has changed from the ext2. git://cislinux.cis.ksu.edu/chunkfs-tools - mkfs, and fsck for chunkfs. http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml - config file used; tested mostly on UML with loopback file systems. NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP should be "no" for clean compile. Please comment, suggest, criticize. Patches most welcome. Best, AG -- May the source be with you. http://www.cis.ksu.edu/~gud ext2-chunkfs-diff.patch.gz Description: Binary data
Re: ChunkFS - measuring cross-chunk references
Karuna sagar K wrote: Hi, The attached code contains program to estimate the cross-chunk references for ChunkFS file system (idea from Valh). Below are the results: Nice to see some numbers! But would be really nice to know: - what the chunk size is - how the files were created or, more vaguely, how 'aged' the fs is - what is the chunk allocation algorithm Best, AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Remove dead ext2 code
AG -- May the source be with you. http://www.cis.ksu.edu/~gud Remove dead code. Unless used for padding (which doesn't seem so), no users of s_dir_count. Signed-off-by: Amit Gud <[EMAIL PROTECTED]> --- ext2_fs_sb.h.copy 2007-03-25 20:36:45.0 -0500 +++ ext2_fs_sb.h2007-03-25 20:37:55.0 -0500 @@ -47,7 +47,6 @@ struct ext2_sb_info { int s_first_ino; spinlock_t s_next_gen_lock; u32 s_next_generation; - unsigned long s_dir_count; u8 *s_debts; struct percpu_counter s_freeblocks_counter; struct percpu_counter s_freeinodes_counter;