Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Andrew, Thanks for the review comments! On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch implements the fallocate() system call and adds support for i386, x86_64 and powerpc. ... +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be said? Ok. I will add a detailed comment here. +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (len == 0 || offset 0) + goto out; The posix spec implies that negative `len' is permitted - presumably allocate ahead of `offset'. How peculiar. I think we should go ahead with current glibc implementation (which Jakub poited at) of not allowing a negative 'len', since posix also doesn't explicitly say anything about allowing negative 'len'. + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file-f_mode FMODE_WRITE)) + goto out_fput; + + inode = file-f_path.dentry-d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode-i_mode)) + goto out_fput; + + ret = -ENODEV; + if (!S_ISREG(inode-i_mode)) + goto out_fput; So we return ENODEV against an S_ISBLK fd, as per the posix spec. That seems a bit silly of them. True. + ret = -EFBIG; + if (offset + len inode-i_sb-s_maxbytes) + goto out_fput; This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment here would settle the reader's mind. Ok. I will add a check here for wrap though zero. + if (inode-i_op inode-i_op-fallocate) + ret = inode-i_op-fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. If we're not going to handle negative `len' then we should check for it. Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of the function). +out_fput: + fput(file); +out: + return ret; +} +EXPORT_SYMBOL(sys_fallocate); I don't believe this needs to be exported to modules? Ok. Will remove it. +/* + * fallocate() modes + */ +#define FA_ALLOCATE0x1 +#define FA_DEALLOCATE 0x2 Now those aren't in posix. They should be documented, along with their expected semantics. Will add a comment describing the role of these modes. #ifdef __KERNEL__ #include linux/linkage.h @@ -1125,6 +1131,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *, int, loff_t, loff_t); I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? Ok. Will add the variable names here. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] ext4: Extent overlap bugfix
On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:41:01 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: +unsigned int ext4_ext_check_overlap(struct inode *inode, + struct ext4_extent *newext, + struct ext4_ext_path *path) +{ + unsigned long b1, b2; + unsigned int depth, len1; + + b1 = le32_to_cpu(newext-ee_block); + len1 = le16_to_cpu(newext-ee_len); + depth = ext_depth(inode); + if (!path[depth].p_ext) + goto out; + b2 = le32_to_cpu(path[depth].p_ext-ee_block); + + /* get the next allocated block if the extent in the path +* is before the requested block(s) */ + if (b2 b1) { + b2 = ext4_ext_next_allocated_block(path); + if (b2 == EXT_MAX_BLOCK) + goto out; + } + + if (b1 + len1 b2) { Are we sure that b1+len cannot wrap through zero here? No. Will add a check here for this. Thanks! + newext-ee_len = cpu_to_le16(b2 - b1); + return 1; + } -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: This patch has the ext4 implemtation of fallocate system call. ... + /* ext4_can_extents_be_merged should have checked that either +* both extents are uninitialized, or both aren't. Thus we +* need to check only one of them here. +*/ Please always format multiline comments like this: /* * ext4_can_extents_be_merged should have checked that either * both extents are uninitialized, or both aren't. Thus we * need to check only one of them here. */ Ok. ... +/* + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do something. If the file is using bitmaps then we will do something else. But what? Here is where it should be described. Ok. Will expand the description. +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + int ret, ret2, nblocks = 0, retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode-i_blkbits; + + /* Currently supporting (pre)allocate mode _only_ */ + if (mode != FA_ALLOCATE) + return -EOPNOTSUPP; + + if (!(EXT4_I(inode)-i_flags EXT4_EXTENTS_FL)) + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) +- block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? You are right to say that the credits can not be fixed here. But, 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the segment range already being allocated case) the allocation request might touch. One solution I have thought is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the credits are falling short. Incase journal_extend fails, we call journal_restart. This will automatically take care of how much journal space we might need for any value of len. + handle=ext4_journal_start(inode, credits + Please always put spaces around =A Ok. + EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1); And around + Ok. + if (IS_ERR(handle)) + return PTR_ERR(handle); +retry: + ret = 0; + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and ext4_error() would be safer and more useful here. Ok. Will do that. + if (ret 0 test_bit(BH_New, map_bh.b_state) Use buffer_new() here. A separate patch which fixes the three existing instances of open-coded BH_foo usage would be appreciated. Ok. +((block + ret) (i_size_read(inode) blkbits))) Check for wrap though the sign bit and through zero please. Ok. + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + + /* Time to update the file size. +* Update only when preallocation was requested beyond the file size. +*/ Fix comment layout. Ok. + if ((offset + len) i_size_read(inode)) { Both the lhs and the rhs here are signed. Please review for possible overflows through
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:46:23 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ + struct ext4_extent_header *eh; + unsigned int depth, len; + int merge_done=0, uninitialized = 0; space around =, please. Many people prefer not to do the multiple-definitions-per-line, btw: int merge_done = 0; int uninitialized = 0; Ok. Will make the change. reasons: - If gives you some space for a nice comment - It makes patches much more readable, and it makes rejects easier to fix - standardisation. + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + eh = path[depth].p_hdr; + + while (ex EXT_LAST_EXTENT(eh)) { + if (!ext4_can_extents_be_merged(inode, ex, ex + 1)) + break; + /* merge with next extent! */ + if (ext4_ext_is_uninitialized(ex)) + uninitialized = 1; + ex-ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex) + + ext4_ext_get_actual_len(ex + 1)); + if (uninitialized) + ext4_ext_mark_uninitialized(ex); + + if (ex + 1 EXT_LAST_EXTENT(eh)) { + len = (EXT_LAST_EXTENT(eh) - ex - 1) + * sizeof(struct ext4_extent); + memmove(ex + 1, ex + 2, len); + } + eh-eh_entries = cpu_to_le16(le16_to_cpu(eh-eh_entries)-1); Kenrel convention is to put spaces around - Will fix this. + merge_done = 1; + BUG_ON(eh-eh_entries == 0); eek, scary BUG_ON. Do we really need to be that severe? Would it be better to warn and run ext4_error() here? Ok. + } + + return merge_done; +} + + ... +/* + * ext4_ext_convert_to_initialized: + * this function is called by ext4_ext_get_blocks() if someone tries to write + * to an uninitialized extent. It may result in splitting the uninitialized + * extent into multiple extents (upto three). Atleast one initialized extent + * and atmost two uninitialized extents can result. There are some typos here + * There are three possibilities: + * a No split required: Entire extent should be initialized. + * b Split into two extents: Only one end of the extent is being written to. + * c Split into three extents: Somone is writing in middle of the extent. and here Ok. Will fix them. + */ +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, + struct ext4_ext_path *path, + ext4_fsblk_t iblock, + unsigned long max_blocks) +{ + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex; + struct ext4_extent_header *eh; + unsigned int allocated, ee_block, ee_len, depth; + ext4_fsblk_t newblock; + int err = 0, ret = 0; + + depth = ext_depth(inode); + eh = path[depth].p_hdr; + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex-ee_block); + ee_len = ext4_ext_get_actual_len(ex); + allocated = ee_len - (iblock - ee_block); + newblock = iblock - ee_block + ext_pblock(ex); + ex2 = ex; + + /* ex1: ee_block to iblock - 1 : uninitialized */ + if (iblock ee_block) { + ex1 = ex; + ex1-ee_len = cpu_to_le16(iblock - ee_block); + ext4_ext_mark_uninitialized(ex1); + ex2 = newex; + } + /* for sanity, update the length of the ex2 extent before +* we insert ex3, if ex1 is NULL. This is to avoid temporary +* overlap of blocks. +*/ + if (!ex1 allocated max_blocks) + ex2-ee_len = cpu_to_le16(max_blocks); + /* ex3: to ee_block + ee_len : uninitialised */ + if (allocated max_blocks) { + unsigned int newdepth; + ex3 = newex; + ex3-ee_block = cpu_to_le32(iblock + max_blocks); + ext4_ext_store_pblock(ex3, newblock + max_blocks); + ex3-ee_len = cpu_to_le16(allocated - max_blocks); + ext4_ext_mark_uninitialized(ex3); + err = ext4_ext_insert_extent(handle, inode, path, ex3); + if (err) + goto out; + /* The depth, and hence eh ex might change +* as part of the insert above. +*/ + newdepth = ext_depth(inode); + if (newdepth != depth) + { Use if (newdepth != depth) { Ok. + depth=newdepth; spaces Ok. + path = ext4_ext_find_extent(inode, iblock, NULL); + if (IS_ERR(path)) { +
Re: [PATCH 1/4] Add unix COW io manager
On Fri, May 04, 2007 at 02:43:21PM +0530, Aneesh Kumar K.V wrote: + * unix_cow_io.c --- This is the Unix (well, really POSIX) implementation + * of the COW I/O manager. I really wish you had based this not on unix_io.c, but rather on test_io.c, for two reasons (a) this reduces code duplication, (b) it means that we can chain/stack I/O managers. i.e., test_io - cow_manager - unix_io. The one change I would recommend is that instead of assigning to a globally visible external variable (as I did in test_io, and which I now regret as a not necessarily being portable --- different shared library implementations have different restrictions on whether you can access global variables defined in shared libraries from an application or vice versa --- with AIX having some of the most peverse shared library restritions, at least 7-8 years ago when I had the misfortune of trying to make Kerberos v5 shared libraries work on AIX :-). Instead, what I would recommend is creating a function which an application could call to set the backing I/O manager for the cow_manager, as well as the filenames to use for the tdb journal. The second change I would recommend is to NOT call it cow_manager but rather an undo_manager, since that makes it clear that what is being backed up is not the work that we are about to do, but rather the original contents of the disk. I would also rename the ext4_replay tool something like e2undofs, since replay has the connotation that you are applying changes that were done in a log, when in fact what you are really doing is backing out changes. We want to make sure we avoid this confusion, since Xen and UML and most other virtualization engines, for good reasons, do the exact opposite --- the COW file contains the changes, and the base file/device remains unchanged. + retval = tdb_store(data-tdb, tdb_key, tdb_data, TDB_INSERT); + if (retval == -1) { + /* FIXME!! Not sure what should be the error */ + retval = EXT2_ET_SHORT_WRITE; + goto error_out; + } You need to write the tdb_store() around a tdb_transaction_start() and tdb_transaction_commit() pair, so that we can be sure the original data is actually synced to disk. Otherwise, on a crash, we could have a situation where the original data on disk has been overwritten, but the backup of the original block was never forced to disk! - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote: On 4/26/07, Amit K. Arora [EMAIL PROTECTED] wrote: /* + * ext4_ext_try_to_merge: + * tries to merge the ex extent to the next extent in the tree. + * It always tries to merge towards right. If you want to merge towards + * left, pass ex - 1 as argument instead of ex. + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns + * 1 if they got merged. + */ +int ext4_ext_try_to_merge(struct inode *inode, + struct ext4_ext_path *path, + struct ext4_extent *ex) +{ Please either use proper kerneldoc format or drop ext4_ext_try_to_merge from the comment. Ok, Thanks. -- Regards, Amit Arora - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Jakub Jelinek wrote: is what glibc does ATM. Seems we violate the case where len == 0, as EINVAL in that case is shall fail. But reading the standard to imply negative len is ok is too much guessing, there is no word what it means when len is negative and required storage for regular file data starting at offset and continuing for len bytes doesn't make sense for negative size. This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote: Andreas Dilger wrote: If this code could also (or optionally just) increase the size of inodes it would be very useful. AFAIK right now it will only change the inodes from block-mapped to extent-mapped? Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). One generic concern I have about simply migrating existing files to use extents is that unless we actually also combine the tool with defragmentation support, the file ends up being fairly badly fragmented simply because of the holes left in the file from the indirect blocks. Therefore the tree ends up being far larger than it needs to be, and future attempts allocate blocks in that block group will fill in the holes, further degrading the performance of the filesystem and whatever file is being written at the time. The other reason why increasing the inode size would actually be more valuable is that right now, if we have the disk space, the user can do on-line migration of a file simply by copying it and then moving it over the original file --- and if we have delayed allocation support and sufficient memory, we can avoid the fragmentation problems currently with the off-line ext4migration approach. So we may need to think a bit about what's the best way to handle ext4migrate. It may be that correct approach is to combine it with a defragmentation tool, either on-line or off-line. That is correct. My next step is to enhance the tool to support migration to large inodes. I would actually suggest that the place to add that functionality would be via tune2fs -I inode_size. Since mke2fs -I inode_size is used to set the inode size, it makes sense that tune2fs -I would change the inode size. Obviously this would have to be done when the filesystem is not mounted, and tune2fs should abort gracefully if determines that the filesystem is mounted. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
Hi, Theodore Tso wrote: On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote: Andreas Dilger wrote: If this code could also (or optionally just) increase the size of inodes it would be very useful. AFAIK right now it will only change the inodes from block-mapped to extent-mapped? Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). One of the option i was thinking was to use this tool to migrate to extent map and then towards the end use the online defrag ioctl to defrag the resulting ext4 inode. That way we don't need to add kernel code that convert ext3 inode to ext4 inode while defragmentation. Last time i looked at the online defrag code, i was not able to figure out a easy way to take indirect map inode as input, then defrag the same and write the result in extent map. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
Theodore Tso wrote: That is correct. My next step is to enhance the tool to support migration to large inodes. I would actually suggest that the place to add that functionality would be via tune2fs -I inode_size. Since mke2fs -I inode_size is used to set the inode size, it makes sense that tune2fs -I would change the inode size. Obviously this would have to be done when the filesystem is not mounted, and tune2fs should abort gracefully if determines that the filesystem is mounted. That is nice !!. I will add the large inode migration support to tune2fs. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On Sun, May 06, 2007 at 09:40:14PM -0700, Andrew Morton wrote: On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] wrote: Steps to reproduce: Create a 3G partition, say /dev/vol1/project mke2fs -j -b 4096 /dev/vol1/project 22812 mount it ext2online /dev/vol1/project said: | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b | ext2online: ext2_ioctl: No space left on device | | ext2online: unable to resize /dev/mapper/vol1-project kernel said: | JBD: ext2online wants too many credits (721 256) There's a threshold for the problem depending on the initial size. This one fails: mke2fs -j -b 4096 /dev/3GB-blockdev 32768 (mount + ext2online or resize2fs) kernel: JBD: resize2fs wants too many credits (1034 1024) Add one block to the initial mke2fs (32768+1 == 32769) and the problem is gone. Without the -b 4096 there's another resize problem mke2fs -j /dev/loop1 2048 mount /dev/loop1 /1 resize2fs /dev/loop1 says: resize2fs 1.40-WIP (14-Nov-2006) Filesystem at /dev/loop1 is mounted on /1; on-line resizing required old desc_blocks = 1, new_desc_blocks = 12 Performing an on-line resize of /dev/loop1 to 3072000 (1k) blocks. resize2fs: Invalid argument While trying to add group #256 and the kernel says: May 7 15:36:08 lokka EXT3-fs warning (device loop1): verify_reserved_gdb: May 7 15:36:08 lokka reserved GDT 10 missing grp 1 (8202) May 7 15:36:08 lokka After that, the filesystem has been resized to 2GB. I recall a 2G (?) limit for ext3 resizing with 1k blocksize but trying the above with 4096 1k blocks seems to work. fsck says it's ok all the time. -- Frank - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On Mon, May 07, 2007 at 12:26:26AM +0200, Frank van Maarseveen wrote: 2.6.20.6, FC4: |JBD: ext2online wants too many credits (744 256) You would be better off using resize2fs from e2fsprogs 1.39, BTW About the only reason to keep using the ext2resize packages is ext2prepare, which is the off-line tool to add a resizing inode after the fact, and the main reason that functionality hasn't been absorbed into e2fsprogs is that it needs to be completely rewritten before I'm willing to trust and maintain it... What is the limitation I should be aware of? Has it something to do with the journal log size? Yes, the problem is the journal log size. Since the original filesystem was so small, the journal was kept small so as not to use a huge percentage of the filesystem size. Of course, with a larger filesystem you probably want a larger journal anyway to get better performance. So that brings up a related problem, which is we don't have a way of dynamically increasing the size of the journal, which we probably would want to do as part of resizing the filesystem size. So the short-term workaround for now, if you know that you're going to take a filesystem which is 22M and increase it to 3G, is to create it with largish journal in the first place: mke2fs -j -b 4096 -J size=16M /dev/vol1/project 22812 Note that if you were creating a 3G filesystem from scratch, by default it would be using a 128M journal --- but it's a little hard to fit a 128M journal in a 22M filesystem. :-/ We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On Mon, 7 May 2007, Theodore Tso wrote: We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. why not just allocate a new journal (something along the lines of convert the ext3 to ext2 then back to ext3 would be the ugly, scripted way of doing it) or is the problem that you are trying to resize things without remounting them (and therefor without flushing the journal) David Lang - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote: On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + int ret, ret2, nblocks = 0, retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode-i_blkbits; + + /* Currently supporting (pre)allocate mode _only_ */ + if (mode != FA_ALLOCATE) + return -EOPNOTSUPP; + + if (!(EXT4_I(inode)-i_flags EXT4_EXTENTS_FL)) + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Ok. Will add this in the function description as well. Also, posix says nothing about fallocate() returning ENOTTY. Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support non-extent files. Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote: On Mon, 7 May 2007, Theodore Tso wrote: We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. why not just allocate a new journal (something along the lines of convert the ext3 to ext2 then back to ext3 would be the ugly, scripted way of doing it) or is the problem that you are trying to resize things without remounting them (and therefor without flushing the journal) Yes, I'm referring to the difficulties of getting this right when doing online (mounted) resizing, as that was the context of the original bug report. Clearly the right thing for resize2fs to do when it's doing an off-line resize is to just adjust the journal inode and make it bigger. I'll add that to my todo list, but of course, patches to do that would be gratefully accepted... - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On Mon, May 07, 2007 at 07:16:31PM +0530, Aneesh Kumar K.V wrote: One of the option i was thinking was to use this tool to migrate to extent map and then towards the end use the online defrag ioctl to defrag the resulting ext4 inode. As we discussed on the phone, my recommendation would be to take your existing code and move it into the kernel so that triggered off of an ioctl, your code could migrate an inode from using indirect blocks to extents while a filesystem is mounted. The main things you will need to watch for is to make sure the inode is locked so that another process doesn't try to extend or truncate it, and to use the JBD layer to provide appropriate journalling support. Given that ext4migrated imported the kernel extent functions, it should be relatively straightforward to simply make them use the kernel extent functions while in kernel space. Once the the inode has been converted on-line then it can be defragmented on-line. That would be much more convenient than having to unmount the filesystem to do the off-line migration, followed by mounting it to do the defragmentation. Regards, - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
One of our testers filed a bug that said mkfs.ext3 is much slower when mke2fs.conf is missing... This is because the shipped defaults in mke2fs.conf do not match the shipped defaults in the mkfs code itself; he wound up making a 1k block filesystem on a very large block device, for example. So - How about this patch, to bring them back into line? Which makes me wonder; having defaults in 2 different places is bound to get out of sync; should we instead generate both code config file defaults (and maybe man page defaults) from a common source? Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line for current defaults... Thanks, -Eric Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: e2fsprogs-hg/misc/mke2fs.c === --- e2fsprogs-hg.orig/misc/mke2fs.c +++ e2fsprogs-hg/misc/mke2fs.c @@ -1288,7 +1288,8 @@ static void PRS(int argc, char *argv[]) tmp = tmp2 = NULL; if (fs_param.s_rev_level != EXT2_GOOD_OLD_REV) { profile_get_string(profile, defaults, base_features, 0, - filetype,sparse_super, tmp); + sparse_super,filetype,resize_inode,dir_index, + tmp); profile_get_string(profile, fs_types, fs_type, base_features, tmp, tmp2); edit_feature(tmp2, fs_param.s_feature_compat); @@ -1365,7 +1366,7 @@ static void PRS(int argc, char *argv[]) if (blocksize = 0) { profile_get_integer(profile, defaults, blocksize, 0, - 1024, use_bsize); + 4096, use_bsize); profile_get_integer(profile, fs_types, fs_type, blocksize, use_bsize, use_bsize); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Accomodate 32-bit uid/guid values in e2fsprogs
e2fsprogs doesn't handle large ( 16 bit) UID/GID... Create an fs by user with large UID: bash-3.1$ id uid=501666(newuser) gid=501666(newuser) groups=501666(newuser) bash-3.1$ /sbin/mke2fs -q fsfile fsfile is not a block special device. Proceed anyway? (y,n) y Now mount it: [EMAIL PROTECTED] tmp]# mount -o loop fsfile /mnt/tmp/ [EMAIL PROTECTED] tmp]# ls -la /mnt/tmp/ total 13 drwxr-xr-x 3 42914 42914 1024 May 7 11:55 . drwxr-xr-x 4 root root 28 Jan 23 10:34 .. drwx-- 2 root root 12288 May 7 11:55 lost+found uid/gid is wrong (only bottom 16 bits)! (it's wrong on disk...) And even if I get it right on disk: [EMAIL PROTECTED] tmp]# ls -lan /mnt/tmp/ total 13 drwxr-xr-x 3 501666 501666 1024 May 7 11:57 . drwxr-xr-x 4 0 028 Jan 23 10:34 .. drwx-- 2 0 0 12288 May 7 11:57 lost+found debugfs has further problems: [EMAIL PROTECTED] tmp]# /sbin/debugfs /tmp/fsfile debugfs 1.39 (29-May-2006) debugfs: stat . Inode: 2 Type: directoryMode: 0755 Flags: 0x0 Generation: 0 User: 42914 Group: 42914 Size: 1024 File ACL: 0Directory ACL: 0 Links: 3 Blockcount: 2 Fragment: Address: 0Number: 0Size: 0 ctime: 0x463f5a8a -- Mon May 7 11:57:46 2007 atime: 0x463f5a90 -- Mon May 7 11:57:52 2007 mtime: 0x463f5a8a -- Mon May 7 11:57:46 2007 BLOCKS: (0):508 TOTAL: 1 debugfs: ls -l 2 40755 (2) 42914 429141024 7-May-2007 11:57 . 2 40755 (2) 42914 429141024 7-May-2007 11:57 .. 11 40700 (2) 0 0 12288 7-May-2007 11:57 lost+found The attached patch creates inode_uid() and inode_gid() macros to accommodate the high bits when reading uid/gid, on platforms that support it. It also accommodates large UID/GID at mkfs time. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Index: e2fsprogs-hg/debugfs/debugfs.c === --- e2fsprogs-hg.orig/debugfs/debugfs.c +++ e2fsprogs-hg/debugfs/debugfs.c @@ -534,7 +534,7 @@ void internal_dump_inode(FILE *out, cons prefix, inode-i_mode 0777, inode-i_flags, inode-i_generation); fprintf(out, %sUser: %5d Group: %5d Size: , - prefix, inode-i_uid, inode-i_gid); + prefix, inode_uid(*inode), inode_gid(*inode)); if (LINUX_S_ISREG(inode-i_mode)) { __u64 i_size = (inode-i_size | ((unsigned long long)inode-i_size_high 32)); Index: e2fsprogs-hg/debugfs/ls.c === --- e2fsprogs-hg.orig/debugfs/ls.c +++ e2fsprogs-hg/debugfs/ls.c @@ -88,7 +88,7 @@ static int list_dir_proc(ext2_ino_t dir } fprintf(ls-f, %c%6u%c %6o (%d) %5d %5d , lbr, ino, rbr, inode.i_mode, dirent-name_len 8, - inode.i_uid, inode.i_gid); + inode_uid(inode), inode_gid(inode)); if (LINUX_S_ISDIR(inode.i_mode)) fprintf(ls-f, %5d, inode.i_size); else Index: e2fsprogs-hg/debugfs/lsdel.c === --- e2fsprogs-hg.orig/debugfs/lsdel.c +++ e2fsprogs-hg/debugfs/lsdel.c @@ -23,7 +23,7 @@ struct deleted_info { ext2_ino_t ino; unsigned short mode; - unsigned short uid; + __u32 uid; __u64 size; time_t dtime; int num_blocks; @@ -160,7 +160,7 @@ void do_lsdel(int argc, char **argv) delarray[num_delarray].ino = ino; delarray[num_delarray].mode = inode.i_mode; - delarray[num_delarray].uid = inode.i_uid; + delarray[num_delarray].uid = inode_uid(inode); delarray[num_delarray].size = inode.i_size; if (!LINUX_S_ISDIR(inode.i_mode)) delarray[num_delarray].size |= Index: e2fsprogs-hg/e2fsck/message.c === --- e2fsprogs-hg.orig/e2fsck/message.c +++ e2fsprogs-hg/e2fsck/message.c @@ -304,12 +304,10 @@ static _INLINE_ void expand_inode_expres inode-i_dir_acl : 0)); break; case 'u': - printf(%d, (inode-i_uid | - (inode-osd2.linux2.l_i_uid_high 16))); + printf(%d, inode_uid(*inode)); break; case 'g': - printf(%d, (inode-i_gid | - (inode-osd2.linux2.l_i_gid_high 16))); + printf(%d, inode_gid(*inode)); break; case 't': if (LINUX_S_ISREG(inode-i_mode)) Index: e2fsprogs-hg/lib/ext2fs/ext2_fs.h === ---
Re: Creating a 32bit blocks filesystem.
On Mon, 7 May 2007 11:19:52 -0500 Jose R. Santos [EMAIL PROTECTED] wrote: I tried the patches and while the tools such as mkfs and debugfs seem to work fine, I am still unable to mount a filesystem with block numbers exceeding 32 bits. Correction, I tested debugfs on a 32bit blocks filesystem. Both debugfs and fsck failed when attempting to use a 64bit filesystem. -JRS - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of our testers filed a bug that said mkfs.ext3 is much slower when mke2fs.conf is missing... This is because the shipped defaults in mke2fs.conf do not match the shipped defaults in the mkfs code itself; he wound up making a 1k block filesystem on a very large block device, for example. So - How about this patch, to bring them back into line? It doesn't actually bring them completely back into line, since mke2fs will use different block sizes depending on the size of the filesystem. So your patch makes the default probably a bit more reasonable, and so I'll probably end up applying it, but it definitely isn't a complete replacement for /etc/mke2fs.conf. How likely do you think the case will be that mke2fs.conf would be missing? I'm trying to figure out how high priority of an item this really is. Which makes me wonder; having defaults in 2 different places is bound to get out of sync; should we instead generate both code config file defaults (and maybe man page defaults) from a common source? I had been working on the assumption that the defaults if mke2fs.conf were not present were more in the nature of emergency defaults as opposed something that could be used a fully functional set of configuration parameters. So the assumption was that when you installed the RPM, mke2fs.conf would also be there. We could enhance the profile code so that it could read in the profile from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, but that adds bloat --- the question is how necessary do we think that really is? Regards, - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On May 06, 2007 21:40 -0700, Andrew Morton wrote: On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] wrote: 2.6.20.6, FC4: I created a 91248k ext3 fs with 4k blocksize: Next, I tried to resize it to about 3G using ext2online while mounted: At that time the kernel said: |JBD: ext2online wants too many credits (744 256) What is the limitation I should be aware of? Has it something to do with the journal log size? Yes, for very small filesystems the default journal size is only 4MB, and the maximum transaction size is 1MB (256 blocks). If the filesystem was 128MB or larger in the initial mke2fs then the journal would be 16MB and the resize would get up to 1024 blocks for the transaction. I'm not sure what the right solution is for this. If you know you will be resizing the fs you could increase the initial journal size at mke2fs time (-J size=16). Alternately, it is possible resize to an intermediate size and then delete and recreate the journal via tune2fs (which would be the larger size by default) but that can only be done offline. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 03, 2007 21:31 -0700, Andrew Morton wrote: On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote: + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. + */ This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do something. If the file is using bitmaps then we will do something else. But what? Here is where it should be described. My understanding is that glibc will handle zero-filling of files for filesystems that do not support fallocate(). +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + handle_t *handle; + ext4_fsblk_t block, max_blocks; + int ret, ret2, nblocks = 0, retries = 0; + struct buffer_head map_bh; + unsigned int credits, blkbits = inode-i_blkbits; + + /* Currently supporting (pre)allocate mode _only_ */ + if (mode != FA_ALLOCATE) + return -EOPNOTSUPP; + + if (!(EXT4_I(inode)-i_flags EXT4_EXTENTS_FL)) + return -ENOTTY; So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Also, posix says nothing about fallocate() returning ENOTTY. I _think_ this is to convince glibc to do the zero-filling in userspace, but I'm not up on the API specifics. + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) +- block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Also, since @path is not being given then truncate_mutex is not needed. + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and ext4_error() would be safer and more useful here. Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
Theodore Tso wrote: On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of our testers filed a bug that said mkfs.ext3 is much slower when mke2fs.conf is missing... This is because the shipped defaults in mke2fs.conf do not match the shipped defaults in the mkfs code itself; he wound up making a 1k block filesystem on a very large block device, for example. So - How about this patch, to bring them back into line? It doesn't actually bring them completely back into line, since mke2fs will use different block sizes depending on the size of the filesystem. So your patch makes the default probably a bit more reasonable, and so I'll probably end up applying it, but it definitely isn't a complete replacement for /etc/mke2fs.conf. Well, sure, I don't mean for it to *replace* mke2fs.conf... Hm, does having [defaults] blocksize = 4096 in mke2fs.conf turn off the blocksize heuristics and force 4k? is what's in mke2fs.conf a starting point or an absolute? I guess I need to read up on the code... How likely do you think the case will be that mke2fs.conf would be missing? I'm trying to figure out how high priority of an item this really is. Well, not too likely, although for some reason I guess it happened in the installer root in FC6 or so. That's what raised the issue. We could enhance the profile code so that it could read in the profile from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, but that adds bloat --- the question is how necessary do we think that really is? I guess it doesn't really sound *necessary* - it's just that if we have 2 different defaults and they drift, it can be confusing... -Eric - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-ext4-1
On Mon, 2007-04-30 at 11:14 -0400, Theodore Ts'o wrote: I've respun the ext4 development patchset, with Amit's updated fallocate patches. I've added Dave's patch to add ia64 support to the fallocate system call, but *not* the XFS fallocate support patches. (Probably better for them to live in an xfs tree, where they can more easily tested and updated.) Yes, we haven't reached complete closure on the fallocate system call calling convention, but it's enough for us to get more testing in -mm. Also added Johann's jbd2-stats-through-procfs patches; it provides useful help in turning the size of the journal, which will be useful in benchmarking efforts. In addition, Alex Tomas's patch to free just-allocated patches when there is an error inserting the extent into the extent tree has also been included. The patches have been compile-tested on x86, and compile/run-tested on x86/UML. Would appreciate reports about testing on other platforms. I have tested this patch series on ppc64, x86_64 with dbench/tiobench/fsx, all runs fine. I am not sure what level of testing Amit has done about the fallocate() and preallocation code on various archs. I couldn't find a available s390 and ia64 machines with free partition yet. In any case, it would be useful to add a new set of testsuites for the new fallocate() syscall and fsstress in LTP testsuites to automatically the preallocation code in ext4/XFS. thanks, Mingming Thanks, - Ted P.S. One bug which I've noted --- if there is a failure due to disk filling up, running e2fsck on the filesystem will show that the i_blocks fields on the inodes where there was a failure to allocate disk blocks are left incorrect. I'm guessing this is a bug in the delayed allocation patches. Alex, when you have a moment, could you take a look? Thanks!! - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) + - block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() succeeded and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. Does the proposed implementation handle quotas correctly, btw? Has that been tested? Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][take 4] e2fsprogs: Add ext4migrate
On May 07, 2007 19:16 +0530, Aneesh Kumar K.V wrote: Theodore Tso wrote: Actually, changing the inode size is actually going to be more useful to a larger number of people, since they can use it today to support in-inode EA's. In addition, changing the inode size must be done off-line, whereas it's not so clear that off-line conversion to extents is the best way to go (more on that below). That way we don't need to add kernel code that convert ext3 inode to ext4 inode while defragmentation. Last time i looked at the online defrag code, i was not able to figure out a easy way to take indirect map inode as input, then defrag the same and write the result in extent map. I think the online defrag code _should_ be fairly easily made to handle defrag of block-mapped files, so long as it is always creating extent- mapped files in the end. All that should be necessary is to have the kernel read data from the block-mapped file and write it into the newly mapped blocks. That functionality needs to exist for a long time anyways to support upgrading the filesystem so it shouldn't be a huge burden. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: JBD: ext2online wants too many credits (744 256)
On May 07, 2007 11:50 -0400, Theodore Tso wrote: On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote: On Mon, 7 May 2007, Theodore Tso wrote: We're going to have to rethink the defaults of the journal size as it relates to on-line resizing, but there might not be a lot of great answers here. We'll also have to look at the on-line resizing code and see if there's a way to break up the resize operation into smaller transactions as a way of avoiding this problem --- but that would still leave the user stuck with a pathetically small 4M journal on a 3G filesystem. Clearly the right thing for resize2fs to do when it's doing an off-line resize is to just adjust the journal inode and make it bigger. I'll add that to my todo list, but of course, patches to do that would be gratefully accepted... For that matter, there was a paper from U. Wisconsin showing that having the journal in the middle of the filesystem gave noticably better performance due to lower average seek times. If anyone is looking at messing with the journal that is probably also a good and easy place to start (e.g. may be as easy as just giving a goal block of s_blocks_count / 2 to the journal create code). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults
On May 07, 2007 14:52 -0500, Eric Sandeen wrote: Theodore Tso wrote: How likely do you think the case will be that mke2fs.conf would be missing? I'm trying to figure out how high priority of an item this really is. Well, not too likely, although for some reason I guess it happened in the installer root in FC6 or so. That's what raised the issue. Ah, good point - there are probably lots of installers and rescue disks that grab mke2fs but don't know anything about /etc/mke2fs.conf. We could enhance the profile code so that it could read in the profile from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs, but that adds bloat --- the question is how necessary do we think that really is? I guess it doesn't really sound *necessary* - it's just that if we have 2 different defaults and they drift, it can be confusing... Since the shipped mke2fs.conf is only on the order of a few hundred bytes I don't think it is a huge issue to include it. I like the idea of it being compiled into the code and yet consistent with the default install. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 7 May 2007 15:21:04 -0700 Andreas Dilger [EMAIL PROTECTED] wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. This wlil result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think applications programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced what do folks think? - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: Andreas Dilger wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. When reading from a block where the extent is marked as unitialized, ext4 returns zero's, to avoid returning the uninitalized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitalized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by two (i.e., 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote: On Mon, 7 May 2007 05:37:54 -0600 Andreas Dilger [EMAIL PROTECTED] wrote: + block = offset blkbits; + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) blkbits) +- block; + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing without at least looking at `len'? Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). I think the use of ext4_journal_extend() (as Amit has proposed) will help here, but it is not sufficient. Because under some circumstances, a journal_extend() failure could mean that we fail to allocate all the required disk space. If it is infrequent enough, that is acceptable when the caller is using fallocate() for performance reasons. But it is very much not acceptable if the caller is using fallocate() for space-reservation reasons. If you used fallocate to reserve 1GB of disk and fallocate() succeeded and you later get ENOSPC then you'd have a right to get a bit upset. So I think the ext3/4 fallocate() implementation will need to be implemented as a loop: while (len) { journal_start(); len -= do_fallocate(len, ...); journal_stop(); } I agree. There is already a loop in Amit's current's patch to call ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to ask for in each journal_start()? +/* + * ext4_fallocate: + * preallocate space for a file + * mode is for future use, e.g. for unallocating preallocated blocks etc. + */ +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + mutex_lock(EXT4_I(inode)-truncate_mutex); + credits = ext4_ext_calc_credits_for_insert(inode, NULL); + mutex_unlock(EXT4_I(inode)-truncate_mutex); I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunk of blocks(extents) for this single preallocation request when fs is fragmented (or part of preallocation request is already fulfilled) I think we should move this calculation inside the loop as well,and we really do not need to grab the lock to calculate the credit if the @path is always NULL, all the function does is mathmatics. I can't think of any good way to estimate the total credits needed for this whole preallocation request. Looked at ext4_get_block(), which is used for DIO code to deal with large amount of block allocation. The credit reservation is quite weak there too. The DIO_CREDIT is only (EXT4_RESERVE_TRANS_BLOCKS + 32) + handle=ext4_journal_start(inode, credits + + EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1); + if (IS_ERR(handle)) + return PTR_ERR(handle); +retry: + ret = 0; + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. Does the proposed implementation handle quotas
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao [EMAIL PROTECTED] wrote: + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote: On Mon, 7 May 2007 19:14:42 -0400 Theodore Tso [EMAIL PROTECTED] wrote: On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. No! Reservations code is *critical* here. Without reservations, we get disastrously-bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even-better in-kernel. In this case, since the number of blocks to preallocate (eg. N=10GB) is clear, we could improve the current reservation code, to allow callers explicitly ask for a new window that have the minimum N free blocks for the blocks-to-preallocated(rather than just have at least 1 free blocks). Before the ext4_fallocate() is called, the right reservation window size is set with the flag to indicating please spend time if needed to find a window covers at least N free blocks. So for ex4 block mapped files, later when glibc is doing allocation and zeroing, the ext4 block-mapped allocator will knows to reserve the right amount of free blocks before allocating and zeroing 10GB space. I am not sure whether this worth the effort though. But then, a smart application would bypass the glibc() fallocate() implementation and would tune the reservation window size and would use direct-IO or sync_file_range()+fadvise(FADV_DONTNEED). This wlil result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. A PVR implementor would take all this over and would do it themselves, for sure. If we think applications programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced what do folks think? An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote: On Mon, 07 May 2007 17:00:24 -0700 Mingming Cao [EMAIL PROTECTED] wrote: + while (ret = 0 ret max_blocks) { + block = block + ret; + max_blocks = max_blocks - ret; + ret = ext4_ext_get_blocks(handle, inode, block, + max_blocks, map_bh, + EXT4_CREATE_UNINITIALIZED_EXT, 0); + BUG_ON(!ret); + if (ret 0 test_bit(BH_New, map_bh.b_state) +((block + ret) (i_size_read(inode) blkbits))) + nblocks = nblocks + ret; + } + + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) + goto retry; + Now the interesting question is: what do we do if we get halfway through this loop and then run out of space? We could leave the disk all filled up and then return failure to the caller, but that's pretty poor behaviour, IMO. The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop so we could get a lower chance to get more fragmented. yes, but my point is that the proposed behaviour is really quite bad. I agree your point, that's why I mention it only helped the fragmentation issue but not the ENOSPC case. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Not totally useless I think. If only half of the space is preallocated because run out of space, the application can decide whether it's good enough to start to use this preallocated space or wait for the fs to have more free space. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. In fact Amit and I had raised this issue before, whether it's okay to do allow partial preallocation. At that moment the feedback is it's no much different than the current zero-out-preallocation behavior: people might preallocating half-way then later deal with ENOSPC. We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that have at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. The fs free blocks account is just a estimate though. Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ext4: fallocate support in ext4
On May 07, 2007 19:02 -0400, Jeff Garzik wrote: Andreas Dilger wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs
On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote: e2fsprogs doesn't handle large ( 16 bit) UID/GID... Applied, with one correction: --- e2fsprogs-hg.orig/misc/mke2fs.c +++ e2fsprogs-hg/misc/mke2fs.c @@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys _(while reading root inode)); exit(1); } - inode.i_uid = getuid(); - if (inode.i_uid) - inode.i_gid = getgid(); + uid = getuid(); + inode.i_uid = uid; + inode.i_uid_high = uid 16; + if (inode.i_uid) { ^^^ This should be uid instead. Otherwise, the gid won't be set if the uid is a multiple of 65536. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted
On Thu, Apr 19, 2007 at 05:35:36PM +0530, Kalpak Shah wrote: This patch removes a code snippet from check_ea_in_inode() in pass1 which checks if the EA values in the inode are sorted or not. The comments in fs/ext*/xattr.c state that the EA values in the external EA block are sorted but those in the inode need not be sorted. I have also attached a test image which has unsorted EAs in the inodes. The current e2fsck wrongly clears the EAs in the inode. Applied. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e2fsprogs: Check journal inode sanity and recreate journal
On Thu, Mar 15, 2007 at 02:43:47AM +0530, Kalpak Shah wrote: Index: e2fsprogs-1.39/lib/ext2fs/mkjournal.c === --- e2fsprogs-1.39.orig/lib/ext2fs/mkjournal.c +++ e2fsprogs-1.39/lib/ext2fs/mkjournal.c + if (fs-super-s_blocks_count 2048) { + fputs((\nFilesystem too small for a journal\n), stderr); + return 0; + } Code in lib/ext2fs isn't allowed to do any output to stdio (except for debugging purposes). It causes internationalization problems, and it's just in general a bad idea for ext2fs library code to try to do any UI. - Ted - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html