Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote: I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. Could we please stop this stupid ext4-centrism? XFS is ready, so we can put in the syscalls backed by XFS. We already did this with the xattr syscalls in 2.4, btw. Then again, I don't think we should put it in quite yet, because this thread has degraded into creeping featurism; please give me some more time to prepare a semi-coherent rant about this. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote: Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. No, mingming and I both work off of the patch queue (which is also stored in git). So what mingming was asking for exactly was just posting the incremental patches and tagging them appropriately to avoid confusion. I tried building the patch queue earlier in the week and there were multiple oopses/panics as I ran things through various regression tests, but that may have been fixed since (the tree was broken over the weekend and I may have grabbed a broken patch series) or it may have been a screw-up on my part feeding them into our testing grid. I haven't had time to try again this week, but I'll try to put together a new tested ext4 patchset over the weekend. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. I don't think we have a problem here. What we have now is fine, and it was just people kvetching that Amit reposted patches that were already in -mm and ext4. In any case, the plan is to push all of the core bits into Linus' tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. - Ted
Re: [PATCH 0/6][TAKE5] fallocate system call
Theodore Tso wrote: I don't think we have a problem here. What we have now is fine, and It's fine for ext4, but not the wider world. This is a common problem created by parallel development when code dependencies exist. In any case, the plan is to push all of the core bits into Linus' tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. Presumably you mean 2.6.23. Jeff
Re: [PATCH 0/6][TAKE5] fallocate system call
Theodore Tso wrote: On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote: Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. No, mingming and I both work off of the patch queue (which is also stored in git). So what mingming was asking for exactly was just posting the incremental patches and tagging them appropriately to avoid confusion. I tried building the patch queue earlier in the week and there were multiple oopses/panics as I ran things through various regression tests, but that may have been fixed since (the tree was broken over the weekend and I may have grabbed a broken patch series) or it may have been a screw-up on my part feeding them into our testing grid. I haven't had time to try again this week, but I'll try to put together a new tested ext4 patchset over the weekend. I think the ext4 patch queue is in good shape now. Shaggy has tested it on dbench, fsx, and tiobench, and the tests run fine, and the BULL team has benchmarked the latest ext4 patch queue with iozone and FFSB. Regards, Mingming I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. I don't think we have a problem here. What we have now is fine, and it was just people kvetching that Amit reposted patches that were already in -mm and ext4. In any case, the plan is to push all of the core bits into Linus' tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. 
- Ted
Re: [PATCH 0/6][TAKE5] fallocate system call
On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote: In any case, the plan is to push all of the core bits into Linus' tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. Presumably you mean 2.6.23. Yes, sorry. I meant once Linus releases 2.6.22, and we would be aiming to merge before the 2.6.23-rc1 window. - Ted
fallocate support for bitmap-based files
Guys, Mike and Sreenivasa at Google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. I believe that Sreenivasa will mainly be doing the implementation work. The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the reserved inode numbers. That file will be sized to have one bit for each block in the partition. Let's call this the unwritten blocks file. The unwritten blocks file will be initialised with all-zeroes.
- At fallocate() time, allocate the blocks to the user's file (in some yet-to-be-determined fashion) and, for each one which is uninitialised, set its bit in the unwritten blocks file. A set bit means "this block is uninitialised and needs to be zeroed out on read".
- truncate() would need to clear out set-bits in the unwritten blocks file.
- When the fs comes to read a block from disk, it will need to consult the unwritten blocks file to see if that block should be zeroed by the CPU.
- When an unwritten block is written to, its bit in the unwritten blocks file gets zeroed.
- An obvious efficiency concern: if a user file has no unwritten blocks in it, we don't need to consult the unwritten blocks file. Need to work out how to do this. An obvious solution would be to have a number-of-unwritten-blocks counter in the inode. But do we have space for that? (I expect Google and others would prefer that the on-disk format be compatible with legacy ext2!)
- One concern is the following scenario:
  - Mount the fs with a new kernel, fallocate() some blocks to a file.
  - Now, mount the fs under an old kernel (which doesn't understand the unwritten blocks file). This kernel will be able to read uninitialised data from that fallocated-to file, which is a security concern.
  - Now, the old kernel writes some data to a fallocated block. But this kernel doesn't know that it needs to clear that block's flag in the unwritten blocks file!
  - Now mount that fs under the new kernel and try to read that file. The flag for the block is set, so this kernel will still zero out the data on a read, thus corrupting the user's data.

  So how to fix this? Perhaps with a per-inode flag indicating "this inode has unwritten blocks". But to fix this problem, we'd require that the old kernel clear out that flag. Can anyone propose a solution to this? Ah, I can! Use the compatibility flags in such a way as to prevent the old kernel from mounting this filesystem at all. To mount this fs under an old kernel the user will need to run some tool which will:
  - read the unwritten blocks file
  - for each set-bit in the unwritten blocks file, zero out the corresponding block
  - zero out the unwritten blocks file
  - rewrite the superblock to indicate that this fs may now be mounted by an old kernel.

  Sound sane?
- I'm assuming that there are more reserved inodes available, and that the changes to tune2fs and mke2fs will be basically a copy-n-paste job from the `tune2fs -j' code. Correct?
- I haven't thought about what fsck changes would be needed. Presumably quite a few. For example, fsck should check that set-bits in the unwritten blocks file do not correspond to freed blocks. If they do, that should be fixed up. And fsck can check each inode's number-of-unwritten-blocks counter against the unwritten blocks file (if we implement the per-inode number-of-unwritten-blocks counter). What else should fsck do?
- I haven't thought about the implications of porting this into ext3/4. Probably the commit to the unwritten blocks file will need to be atomic with the commit to the user's file's metadata, so the unwritten-blocks file will effectively need to be in journalled-data mode. Or, more likely, we access the unwritten blocks file via the blockdev pagecache (ie: use bmap, like the journal file) and then we're just talking direct to the disk's blocks and it becomes just more fs metadata.
- I guess resize2fs will need to be taught about the unwritten blocks file: to shrink and grow it appropriately. That's all I can think of for now - I probably missed something. Suggestions and thoughts are sought, please.
Re: fallocate support for bitmap-based files
On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. I believe that Sreenivasa will mainly be doing the implementation work. The basic plan is as follows: - Create (with tune2fs and mke2fs) a hidden file using one of the reserved inode numbers. That file will be sized to have one bit for each block in the partition. Let's call this the unwritten block file. The unwritten block file will be initialised with all-zeroes - at fallocate()-time, allocate the blocks to the user's file (in some yet-to-be-determined fashion) and, for each one which is uninitialised, set its bit in the unwritten block file. The set bit means this block is uninitialised and needs to be zeroed out on read. - truncate() would need to clear out set-bits in the unwritten blocks file. By truncating the blocks file at the correct byte offset, only needing to zero some bits of the last byte of the file. - When the fs comes to read a block from disk, it will need to consult the unwritten blocks file to see if that block should be zeroed by the CPU. - When the unwritten-block is written to, its bit in the unwritten blocks file gets zeroed. - An obvious efficiency concern: if a user file has no unwritten blocks in it, we don't need to consult the unwritten blocks file. Need to work out how to do this. An obvious solution would be to have a number-of-unwritten-blocks counter in the inode. But do we have space for that? Would it be too expensive to test the blocks-file page each time a bit is cleared to see if it is all-zero, and then free the page, making it a hole? This test would stop if it finds any non-zero word, so it may not be too bad. (This could further be done on a block basis if the block size is less than a page.) 
(I expect google and others would prefer that the on-disk format be compatible with legacy ext2!) - One concern is the following scenario: - Mount fs with new kernel, fallocate() some blocks to a file. - Now, mount the fs under old kernel (which doesn't understand the unwritten blocks file). - This kernel will be able to read uninitialised data from that fallocated-to file, which is a security concern. - Now, the old kernel writes some data to a fallocated block. But this kernel doesn't know that it needs to clear that block's flag in the unwritten blocks file! - Now mount that fs under the new kernel and try to read that file. The flag for the block is set, so this kernel will still zero out the data on a read, thus corrupting the user's data So how to fix this? Perhaps with a per-inode flag indicating this inode has unwritten blocks. But to fix this problem, we'd require that the old kernel clear out that flag. Can anyone propose a solution to this? Ah, I can! Use the compatibility flags in such a way as to prevent the old kernel from mounting this filesystem at all. To mount this fs under an old kernel the user will need to run some tool which will - read the unwritten blocks file - for each set-bit in the unwritten blocks file, zero out the corresponding block - zero out the unwritten blocks file - rewrite the superblock to indicate that this fs may now be mounted by an old kernel. Sound sane? Yeah. I think it would have to be done under a compatibility flag. Is going back to an older kernel really that important? I think it's more important to make sure it can't be mounted by an older kernel if bad things can happen, and they can. Shaggy -- David Kleikamp IBM Linux Technology Center
Re: fallocate support for bitmap-based files
Dave Kleikamp wrote: On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. I believe that Sreenivasa will mainly be doing the implementation work. The basic plan is as follows: - Create (with tune2fs and mke2fs) a hidden file using one of the reserved inode numbers. That file will be sized to have one bit for each block in the partition. Let's call this the unwritten block file. The unwritten block file will be initialised with all-zeroes - at fallocate()-time, allocate the blocks to the user's file (in some yet-to-be-determined fashion) and, for each one which is uninitialised, set its bit in the unwritten block file. The set bit means this block is uninitialised and needs to be zeroed out on read. - truncate() would need to clear out set-bits in the unwritten blocks file. By truncating the blocks file at the correct byte offset, only needing to zero some bits of the last byte of the file. We were thinking the unwritten blocks file would be indexed by physical block number of the block device. There wouldn't be a logical to physical relationship for the blocks, so we wouldn't be able to get away with truncating the blocks file itself. - When the fs comes to read a block from disk, it will need to consult the unwritten blocks file to see if that block should be zeroed by the CPU. - When the unwritten-block is written to, its bit in the unwritten blocks file gets zeroed. - An obvious efficiency concern: if a user file has no unwritten blocks in it, we don't need to consult the unwritten blocks file. Need to work out how to do this. An obvious solution would be to have a number-of-unwritten-blocks counter in the inode. But do we have space for that? 
Would it be too expensive to test the blocks-file page each time a bit is cleared to see if it is all-zero, and then free the page, making it a hole? This test would stop if it finds any non-zero word, so it may not be too bad. (This could further be done on a block basis if the block size is less than a page.) When clearing the bits, we'd likely see a large stream of writes to the unwritten blocks, which could result in an O(n^2) pass of rescanning the page over and over. Maybe a per-block header in the unwritten-blocks file with a count that could be cheaply tested? Ie: the unwritten block file is composed of blocks that each have a small header that contains a count -- when the count hits zero, we could punch a hole in the file. (I expect google and others would prefer that the on-disk format be compatible with legacy ext2!) - One concern is the following scenario: - Mount fs with new kernel, fallocate() some blocks to a file. - Now, mount the fs under old kernel (which doesn't understand the unwritten blocks file). - This kernel will be able to read uninitialised data from that fallocated-to file, which is a security concern. - Now, the old kernel writes some data to a fallocated block. But this kernel doesn't know that it needs to clear that block's flag in the unwritten blocks file! - Now mount that fs under the new kernel and try to read that file. The flag for the block is set, so this kernel will still zero out the data on a read, thus corrupting the user's data So how to fix this? Perhaps with a per-inode flag indicating this inode has unwritten blocks. But to fix this problem, we'd require that the old kernel clear out that flag. Can anyone propose a solution to this? Ah, I can! Use the compatibility flags in such a way as to prevent the old kernel from mounting this filesystem at all. 
To mount this fs under an old kernel the user will need to run some tool which will - read the unwritten blocks file - for each set-bit in the unwritten blocks file, zero out the corresponding block - zero out the unwritten blocks file - rewrite the superblock to indicate that this fs may now be mounted by an old kernel. Sound sane? Yeah. I think it would have to be done under a compatibility flag. Is going back to an older kernel really that important? I think it's more important to make sure it can't be mounted by an older kernel if bad things can happen, and they can. Ya, I too was originally thinking of a compat flag to keep the old kernel from mounting the filesystem. We'd arrange our bootup scripts to check for compatibility and call out to tune2fs (or some other tool) to down convert (by simply writing out zero blocks for each bit set and clearing the bitmap). Mike Waychison
Re: fallocate support for bitmap-based files
On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? I'm not particularly enthused about supporting two ways of doing fallocate(); one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit really worth it? What I would suggest, which would make this much easier, is to make this an incompatible extension (which, as you point out, is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized. That way you don't end up having to seek to a potentially distant part of the disk to check out the bitmap. Also, you don't have to worry about how to recover if the block-initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But, there are at least two patch series floating about that promise to allow filesystem block sizes larger than PAGE_SIZE, which would allow you to recover the maximum size supported by the filesystem. Furthermore, I suspect (especially after listening to a very fascinating Usenix Invited Talk by Jeffrey Dean, a fellow from Google, two weeks ago) that for many of Google's workloads, using a filesystem blocksize of 16K or 32K might not be a bad thing in any case. It would be a lot simpler. - Ted
Re: fallocate support for bitmap-based files
On Fri, 2007-06-29 at 16:52 -0400, Mike Waychison wrote: Dave Kleikamp wrote: By truncating the blocks file at the correct byte offset, only needing to zero some bits of the last byte of the file. We were thinking the unwritten blocks file would be indexed by physical block number of the block device. There wouldn't be a logical to physical relationship for the blocks, so we wouldn't be able to get away with truncating the blocks file itself. I misunderstood. I was thinking about a block-file per regular file (that had preallocated blocks). Ignore that comment. - When the fs comes to read a block from disk, it will need to consult the unwritten blocks file to see if that block should be zeroed by the CPU. - When the unwritten-block is written to, its bit in the unwritten blocks file gets zeroed. - An obvious efficiency concern: if a user file has no unwritten blocks in it, we don't need to consult the unwritten blocks file. Need to work out how to do this. An obvious solution would be to have a number-of-unwritten-blocks counter in the inode. But do we have space for that? Would it be too expensive to test the blocks-file page each time a bit is cleared to see if it is all-zero, and then free the page, making it a hole? This test would stop if it finds any non-zero word, so it may not be too bad. (This could further be done on a block basis if the block size is less than a page.) When clearing the bits, we'd likely see a large stream of writes to the unwritten blocks, which could result in an O(n^2) pass of rescanning the page over and over. If you start checking for zero at the bit that was just zeroed, you'd likely find a non-zero bit right away, so you wouldn't be looking at too much of the page in the typical case. Maybe a per-block header in the unwritten-blocks file with a count that could be cheaply tested? 
Ie: the unwritten block file is composed of blocks that each have a small header that contains a count -- when the count hits zero, we could punch a hole in the file. Having the data be just a bitmap seems more elegant to me. It would be nice to avoid keeping a count in the bitmap page if possible. -- David Kleikamp IBM Linux Technology Center
Re: fallocate support for bitmap-based files
On Fri, 29 Jun 2007 16:55:25 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? Mainline, preferably. I'm not particularly enthused about supporting two ways of doing fallocate(); one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit really worth it? umm, it's worth it if you don't want to wear the overhead of journalling, and/or if you don't want to wait on the, err, rather slow progress of ext4. What I would suggest, which would make this much easier, is to make this an incompatible extension (which, as you point out, is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized. That way you don't end up having to seek to a potentially distant part of the disk to check out the bitmap. Also, you don't have to worry about how to recover if the block initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But, there are at least two patch series floating about that promise to allow filesystem block sizes larger than PAGE_SIZE which would allow you to recover the maximum size supported by the filesystem. Furthermore, I suspect (especially after listening to a very fascinating Usenix Invited Talk by Jeffrey Dean, a fellow from Google two weeks ago) that for many of Google's workloads, using a filesystem blocksize of 16K or 32K might not be a bad thing in any case. It would be a lot simpler. Hadn't thought of that. 
Also, it's unclear to me why google is going this way rather than using (perhaps suitably-tweaked) ext2 reservations code. Because the stock ext2 block allocator sucks big-time.
Re: fallocate support for bitmap-based files
On Jun 29, 2007 16:55 -0400, Theodore Tso wrote: What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? I'm not particularly enthused about supporting two ways of doing fallocate(); one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit really worth it? What I would suggest, which would make this much easier, is to make this an incompatible extension (which, as you point out, is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized. That way you don't end up having to seek to a potentially distant part of the disk to check out the bitmap. Also, you don't have to worry about how to recover if the block initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But, there are at least two patch series floating about that promise to allow filesystem block sizes larger than PAGE_SIZE which would allow you to recover the maximum size supported by the filesystem. I don't think ext2 is safe for 8TB filesystems anyways, so this isn't a huge loss. The other possibility, assuming Google likes ext2 because they don't care about e2fsck, is to patch ext4 to not use any journaling (i.e. make all of the ext4_journal*() wrappers be no-ops). That way they would get extents, mballoc and other speedups. That said, what is the reason for not using ext3? Presumably performance (which is greatly improved in ext4), or is there something else? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: E2fsprogs 1.40 release imminent!
On Monday, 25 June 2007, Theodore Ts'o wrote: ... contains what I hope to be the e2fsprogs 1.40 release. If folks could test it and let me know if they find any embarrassing bugs, I would greatly appreciate it. It can also be found at: There are a bunch of patches applied in the Gentoo package that might be worth looking at: http://dev.gentoo.org/~hanno/e2fsprogs/ All of them still seem to apply to the current mercurial source. I'm not really familiar with them, just noted that there are a lot of patches and nobody seemed to care about sending them upstream in the past. e2fsprogs-1.32-mk_cmds-cosmetic.patch e2fsprogs-1.38-tests-locale.patch e2fsprogs-1.39-makefile.patch e2fsprogs-1.39-parse-types.patch e2fsprogs-1.39-util-strptime.patch e2fsprogs-1.40-libintl.patch -- Hanno Böck Blog: http://www.hboeck.de/ GPG: 3DBD3B20 Jabber: [EMAIL PROTECTED]
Re: fallocate support for bitmap-based files
Andrew Morton wrote: On Fri, 29 Jun 2007 16:55:25 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote: Guys, Mike and Sreenivasa at google are looking into implementing fallocate() on ext2. Of course, any such implementation could and should also be portable to ext3 and ext4 bitmapped files. What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? Mainline, preferably. I'm not particularly enthused about supporting two ways of doing fallocate(); one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit really worth it? umm, it's worth it if you don't want to wear the overhead of journalling, and/or if you don't want to wait on the, err, rather slow progress of ext4. What I would suggest, which would make this much easier, is to make this an incompatible extension (which, as you point out, is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized. That way you don't end up having to seek to a potentially distant part of the disk to check out the bitmap. Also, you don't have to worry about how to recover if the block initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But, there are at least two patch series floating about that promise to allow filesystem block sizes larger than PAGE_SIZE which would allow you to recover the maximum size supported by the filesystem. Furthermore, I suspect (especially after listening to a very fascinating Usenix Invited Talk by Jeffrey Dean, a fellow from Google two weeks ago) that for many of Google's workloads, using a filesystem blocksize of 16K or 32K might not be a bad thing in any case. It would be a lot simpler. Hadn't thought of that. 
Also, it's unclear to me why google is going this way rather than using (perhaps suitably-tweaked) ext2 reservations code. Because the stock ext2 block allocator sucks big-time. The primary reason this is a problem is that our writers into these files aren't necessarily coming from the same hosts in the cluster, so their arrival times aren't sequential. It ends up looking to the kernel like a random write workload, which in turn ends up causing odd fragmentation patterns that aren't very deterministic. That data is often eventually streamed off the disk though, which is when the fragmentation hurts. Currently, our clustered filesystem supports pre-allocation of the target chunks of files, but this is implemented by effectively writing zeroes to files, which in turn causes pagecache churn and a double write-out of the blocks. Recently, we've changed the code to minimize this pagecache churn and double write-out by performing an ftruncate to extend files, but then we'll be back to square one in terms of fragmentation for the random writes. Relying on (a tweaked) reservations code is also somewhat limiting at this stage given that reservations are lost on close(fd). Unless we change the lifetime of the reservations (maybe for the lifetime of the in-core inode?), crank up the reservation sizes and deal with the overcommit issues, I can't think of any better way at this time to deal with the problem. Mike Waychison
[RFC] BIG_BG vs extended META_BG in ext4
Hi folks,

I've been looking at getting around some of the limitations imposed by the block groups and was wondering what peoples' thoughts are about implementing this using either bigger block groups or storing the bitmaps and inode tables outside of the block groups.

I think the BIG_BG feature is better suited to the design philosophy of ext2/3. Since all the important metadata is easily accessible thanks to the static filesystem layout, I would expect easier fsck recovery. This should also provide some performance improvements for both extents (allowing each extent to be larger than 128M) as well as fsck, since bitmaps would be placed closer together.

An extended version of metadata block groups could provide better performance improvements during fsck time since we could pack all of the filesystem bitmaps together. Having the inode tables separated from the block groups could mean that we could implement dynamic inodes in the future as well. This feature seems like it would be more invasive for e2fsprogs at first glance (at least for fsck). Also, with no metadata in the block groups, there is essentially no need to have a concept of block groups anymore, which would mean that this is a completely different filesystem layout compared to ext2/3.

Since I don't have much experience with ext4 development, I was wondering if anybody had any opinion as to which of these two methods would better serve the needs of the intended users, and which one would be worth prototyping first.

Comments?

-JRS
Re: fallocate support for bitmap-based files
Andreas Dilger wrote: On Jun 29, 2007 16:55 -0400, Theodore Tso wrote:

What's the eventual goal of this work? Would it be for mainline use, or just something that would be used internally at Google? I'm not particularly enthused about supporting two ways of doing fallocate(): one for ext4 and one for bitmap-based files in ext2/3/4. Is the benefit really worth it? What I would suggest, which would make this much easier, is to make this an incompatible extension (which, as you point out, is needed for security reasons anyway) and then steal the high bit from the block number field to indicate whether or not the block has been initialized. That way you don't end up having to seek to a potentially distant part of the disk to check the bitmap. Also, you don't have to worry about how to recover if the block-initialized bitmap inode gets smashed. The downside is that it reduces the maximum size of the filesystem supported by ext2 by a factor of two. But there are at least two patch series floating about that promise to allow filesystem block sizes larger than PAGE_SIZE, which would allow you to recover the maximum size supported by the filesystem.

I don't think ext2 is safe for 8TB filesystems anyways, so this isn't a huge loss.

This is in reference to the idea of overloading the high bit and not related to the PAGE_SIZE blocks, correct?

The other possibility, assuming Google likes ext2 because they don't care about e2fsck, is to patch ext4 to not use any journaling (i.e. make all of the ext4_journal*() wrappers be no-ops). That way they would get extents, mballoc and other speedups.

We do care about the e2fsck problem, though the cost/benefit of e2fsck time/memory problems vs the overhead of journalling doesn't weigh in journalling's favour for a lot of our per-spindle-latency-bound applications. These apps manage to get pretty good disk locality guarantees, and the journal overheads can induce undesired head movement.
ext4 does look very promising, though I'm not certain it's ready for our consumption. What are people's thoughts on providing an ext3 non-journal mode? We could benefit from several of the additions to ext3 that aren't available in ext2, and disabling journalling there sounds much more feasible for us than trying to backport each ext3 component to ext2.

Mike Waychison

That said, what is the reason for not using ext3? Presumably performance (which is greatly improved in ext4) or is there something else?

Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: E2fsprogs 1.40 release imminent!
On Fri, Jun 29, 2007 at 11:51:29PM +0200, Hanno Böck wrote: On Monday, 25 June 2007, Theodore Ts'o wrote: ... contains what I hope to be the e2fsprogs 1.40 release. If folks could test it and let me know if they find any embarrassing bugs, I would greatly appreciate it. It can also be found at:

There are a bunch of patches applied in the gentoo package that might be worth looking at: http://dev.gentoo.org/~hanno/e2fsprogs/ All of them still seem to apply to the current mercurial source. I'm not really familiar with them, just noted that there are a lot of patches and nobody seemed to care about sending them upstream in the past.

e2fsprogs-1.32-mk_cmds-cosmetic.patc
e2fsprogs-1.38-tests-locale.patch
e2fsprogs-1.39-makefile.patch
e2fsprogs-1.39-parse-types.patch
e2fsprogs-1.39-util-strptime.patch
e2fsprogs-1.40-libintl.patch

Thanks, none of these are critical. I'll go through them and apply them after the 1.40 release. e2fsprogs-1.39-util-strptime.patch was fixed in another way in my sources.

- Ted
Re: fallocate support for bitmap-based files
On Jun 29, 2007 18:26 -0400, Mike Waychison wrote: Andreas Dilger wrote:

I don't think ext2 is safe for 8TB filesystems anyways, so this isn't a huge loss.

This is in reference to the idea of overloading the high bit and not related to the PAGE_SIZE blocks, correct?

Correct - just that the high-bit use wouldn't unduly impact the already-existing 8TB limit of ext2. The other thing to note is that Val Henson already ported the ext3 reservation code to ext2, so this is a pretty straightforward option for you and also doesn't affect the on-disk format.

The other possibility, assuming Google likes ext2 because they don't care about e2fsck, is to patch ext4 to not use any journaling (i.e. make all of the ext4_journal*() wrappers be no-ops). That way they would get extents, mballoc and other speedups.

We do care about the e2fsck problem, though the cost/benefit of e2fsck time/memory problems vs the overhead of journalling doesn't weigh in journalling's favour for a lot of our per-spindle-latency-bound applications. These apps manage to get pretty good disk locality guarantees, and the journal overheads can induce undesired head movement.

You could push the journal to a separate spindle, but that may not be practical.

ext4 does look very promising, though I'm not certain it's ready for our consumption.

FYI, the extents code (the most complex part of ext4) has been running for a couple of years on many PB of storage at CFS, so it is by no means new and untried code. There are definitely less-well-tested changes in ext4, but they are mostly straightforward. I'm not saying you should jump right into ext4, but it isn't as far away as you might think.

What are people's thoughts on providing an ext3 non-journal mode? We could benefit from several of the additions to ext3 that aren't available in ext2, and disabling journalling there sounds much more feasible for us than trying to backport each ext3 component to ext2.
This is something we've talked about for a long time, and I'd be happy to have this possibility. This would also allow you to take similar advantage of extents, the improved allocator and other features.

Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
Re: [RFC] BIG_BG vs extended META_BG in ext4
On Jun 29, 2007 17:09 -0500, Jose R. Santos wrote:

I think the BIG_BG feature is better suited to the design philosophy of ext2/3. Since all the important metadata is easily accessible thanks to the static filesystem layout, I would expect easier fsck recovery. This should also provide some performance improvements for both extents (allowing each extent to be larger than 128M) as well as fsck, since bitmaps would be placed closer together. An extended version of metadata block groups could provide better performance improvements during fsck time since we could pack all of the filesystem bitmaps together. Having the inode tables separated from the block groups could mean that we could implement dynamic inodes in the future as well. This feature seems like it would be more invasive for e2fsprogs at first glance (at least for fsck). Also, with no metadata in the block groups, there is essentially no need to have a concept of block groups anymore, which would mean that this is a completely different filesystem layout compared to ext2/3. Since I don't have much experience with ext4 development, I was wondering if anybody had any opinion as to which of these two methods would better serve the needs of the intended users, and which one would be worth prototyping first.

I don't think there is actually any fundamental difference between these proposals. The reality is that we cannot change the semantics of the META_BG flag at this point, since both e2fsprogs and ext3/ext4 in the kernel understand META_BG to mean only that group descriptor backups are in groups {0, 1, last} of the metagroup and nothing else. If we want to allow the bitmaps and inode table outside the group they represent, then this needs to be a separate feature flag, and we may as well include the additional improvements of the BIG_BG feature at the same time. I don't think this is really any reason to claim there is no need to have a concept of block groups.
Also note that e2fsprogs already reserves the bg_free_*_hi fields for BIG_BG in the expanded group descriptors, though there is no official definition for BIG_BG:

	struct ext4_group_desc {
		[ ext3_group_desc ]
		__u32	bg_block_bitmap_hi;	/* Blocks bitmap block MSB */
		__u32	bg_inode_bitmap_hi;	/* Inodes bitmap block MSB */
		__u32	bg_inode_table_hi;	/* Inodes table block MSB */
		__u16	bg_free_blocks_count_hi;	/* Free blocks count MSB */
		__u16	bg_free_inodes_count_hi;	/* Free inodes count MSB */
		__u16	bg_used_dirs_count_hi;	/* Directories count MSB */
		__u16	bg_pad;
		__u32	bg_reserved2[3];
	};

Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.