Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Wed 07-11-07 19:20:38, Theodore Tso wrote:
> On Wed, Nov 07, 2007 at 03:36:05PM +0100, Jan Kara wrote:
> > > What if more than one application wants to use this facility?
> >
> > That should be fine - let's see: Each application keeps somewhere the time when it started a scan of a subtree (or it can actually remember the time when it set the flag for each directory); during the scan, it sets the flag on each directory. When it wakes up to recheck the subtree it just compares the rtime against the stored time - if rtime is greater, the subtree has been modified since the last scan, so we recurse into it and when we are finished with it we set the flag. Now notice that we don't care about the flag when we check for changes - we care only about rtime - so if there are several applications interested in the same subtree, the flag just gets set more often and thus the update of rtime happens more often, but the same scheme still works fine.
>
> OK, so in this case you don't need to set rtime on every single file inode, but only on the directory inode, right? Because you're only

Yes, that's actually what I'm doing - sorry if I didn't make it clear earlier.

> checking the rtime at the directory level, and not the flag. And it's just as easy for you to check the rtime flag for the file's containing directory (modulo magic vis-a-vis hard links) as the file's inode.

Exactly.

> I'm just really wishing that rtime and the rtime flag didn't have to live on disk, but could rather be in memory. If you only needed to save the directory flags and rtimes, that might actually be doable.

I already gave some thought to this but there seemed to be some drawbacks. The query I want to support is: given a directory, tell me which of its subdirectories (arbitrarily deep below) have been modified since time T. That is what you need to support faster rsync, updatedb and similar loads.
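For readers following the thread, here is a minimal in-memory sketch of the scan/rescan scheme described above. It is a toy model and not the ext3 patch: the names Dir, modify and scan are made up for illustration, timestamps are plain integers, and it assumes that upward propagation stops at the first directory whose flag is clear, which the thread implies but does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Dir:
    """Toy directory inode carrying only the two fields the scheme needs."""
    name: str
    parent: Optional["Dir"] = None
    rtime: int = 0        # time of the last recorded change at or below this dir
    flag: bool = False    # set by a scanner: "record the next change"
    children: Dict[str, "Dir"] = field(default_factory=dict)

    def mkdir(self, name: str) -> "Dir":
        child = Dir(name, parent=self)
        self.children[name] = child
        return child

    def path(self) -> str:
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"


def modify(d: Optional[Dir], now: int) -> None:
    """Modelled kernel side: record a change in d and walk up while flags are set.

    Assumption: a clear flag means this directory (and its ancestors) already
    recorded a change since the last scan, so the walk can stop there.
    """
    while d is not None and d.flag:
        d.rtime = now
        d.flag = False
        d = d.parent


def scan(d: Dir, last_scan: Optional[int], changed: List[str]) -> None:
    """Scanner side: last_scan=None is the initial full scan; otherwise only
    directories with rtime newer than last_scan are entered and reported."""
    if last_scan is not None:
        if d.rtime <= last_scan:
            return                    # nothing below changed: skip the subtree
        changed.append(d.path())
    for child in d.children.values():
        scan(child, last_scan, changed)
    d.flag = True                     # re-arm once the subtree has been handled


if __name__ == "__main__":
    root = Dir("src")
    a = root.mkdir("a")
    b = a.mkdir("b")
    c = root.mkdir("c")
    scan(root, None, [])              # initial scan (say at t=1) arms every directory
    modify(b, now=5)                  # something changes inside src/a/b
    changed: List[str] = []
    scan(root, 1, changed)            # rescan: src/c is never entered
    print(changed)                    # ['src', 'src/a', 'src/a/b']
```

Note that scan never reads the flag when deciding whether to recurse, only rtime, which is why several independent scanners with different stored times can share the same flags.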
Also I want to allow a reboot to happen in between the modification and a query (handling a crash correctly would be nice too, but honestly my current implementation is not completely reliable in this regard either), so some permanent storage is needed in any case.

What I can imagine we could do is to report all modifications to userspace - the problem with that is that there are *many* possible modifications but we are interested only in whether some happened since time T. We could improve this with an in-memory inode flag "I'm not interested in modifications any further" and report the change only if the parent directory does not have this flag set (note that this flag gets lost when we evict the inode from memory). But I would say that in the end all this message passing, climbing the tree from userspace and maintaining data structures in memory and on disk would cost us more than the current implementation... It also has the disadvantage that we miss the modifications which happen before we start the userspace daemon catching the events.

Doing this in kernel memory raises the problem of how to get persistence across reboots (dumping mods to userspace on request?) and also on my system you'd have roughly a few MB of pinned memory for these purposes... Plausible but I don't really like it...

> Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing it in kernel memory is that you might be able to support watching a tree that is not owned by the watcher.

Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being the owner of the inode. In fact, I don't currently see a use case where you won't be either root (rsync, updatedb) or the owner of the files (watching config file trees) but I guess people would find some :).
> > I don't get it here - you need to scan the whole subtree and set the flag only during the initial scan. Later, you need to scan and set the flag only for directories in whose subtree something changed. Similarly, rtime needs to be updated for each inode at most once after the scan.
>
> OK, so in the worst case every single file in a kernel source tree might change after doing an extreme git checkout. That means around 36k files get updated. So if you have to set/clear the rtime flag during the checkout process, 36k file inodes would have to have their rtime flag cleared, plus 2k worth of directory inodes; but those would probably be folded into other changes made to the inodes anyway. But

Yes, here the impact is hardly measurable as I've written in the previous email.

> then when trackerd goes back and scans the subtree, if you are actually setting rtime flags for every single file inode, then that's 38k inodes that need updating. If you only need to set the rtime flags for directories, that's only 2k worth of extra gratuitous inode updates.

As I wrote above, the flag
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
Ah, OK, so the two things that I didn't get from your patch description are:

1) the rtime flag and rtime field are only set on directories

2) the intended use is not trackerd and its ilk, but rsync and updatedb, so it is desirable that scans/queries be persistent across reboots

But then the major hole in this scheme is still the issue of hard links. The rsync program is still going to have to scan the entire subtree looking for hard links, since an inode with multiple links into the directory tree can't guarantee that all of its parent directories will have their rtime field updated. A program like updatedb, which only cares about filenames, will be OK, since it really only cares about knowing when directories have changed, and you can't have hard links to directories.

The other problem, of course, is that this feature would become ext2/3/4 specific, and I could see future filesystems possibly wanting this. So this raises the question of whether the interface should be at the VFS layer or not --- and if so, how to handle querying whether a particular filesystem supports it, and what happens if you have a subtree which is covered by a filesystem that doesn't support rtime? So a program like rsync would need to scan /proc/self/mounts to see whether or not it would be safe to use this feature in the first place.

And, of course, rsync would need to know whether it has write access to the tree in order to set flags in the directory, and what to do if some portion of the subtree isn't writeable by rsync.

On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote:
> > Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing it in kernel memory is that you might be able to support watching a tree that is not owned by the watcher.
>
> Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being an owner of the inode. In fact, I don't currently see use case where you won't be either root (rsync, updatedb) or an owner of the files (watching config file trees) but I guess people would find some :).

Sometimes people like to use rsync to copy a subtree to which they have read access but not write access. (And here note that it's not enough to have write access, you actually need to *own* all of the directories in the subtree.)

Yes, it's safe to let any user *set* the rtime flag, but we couldn't let them clear the rtime flag, since then they would be able to hide a file modification from some other (potentially privileged) process.

Speaking of security, I assume your patch will never allow rtime to go backwards (for example if the user attempts to backdate a file's mtime field using the utime() or utimes() system call)?

I guess I'm convinced that updatedb could use this facility, but there are enough asterisks around it that I'm not sure that rsync could safely use this feature in production. I don't doubt that in a cold cache case, it would speed up rsync, but because it doesn't handle hard links, it's not reliable. Since rsync often gets used for backups, this is a big deal. There are also questions about what to do if rsync doesn't have write access to the filesystem, or if there is a non-rtime capable filesystem mounted in the subtree, etc., that can be worked around, but would add a lot of complexity and grottiness to the rsync source tree. Is the rsync maintainer really willing to add all of the necessary hair to support this rtime facility into their program?

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
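To make the /proc/self/mounts concern concrete, here is a rough sketch of the check a tool would need before trusting rtime for a subtree. It is only an illustration: the RTIME_CAPABLE set and the function names are assumptions (no such list exists anywhere), and a real tool would also have to deal with bind mounts and permissions.

```python
import os
import re
from typing import List, Optional, Set, Tuple

# Assumption for illustration: only the patched ext3 would support rtime.
RTIME_CAPABLE: Set[str] = {"ext3"}


def parse_mounts(text: str) -> List[Tuple[str, str]]:
    """Return (mount_point, fstype) pairs from /proc/self/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # Mount points escape whitespace as octal sequences such as \040.
        mount_point = re.sub(r"\\([0-7]{3})",
                             lambda m: chr(int(m.group(1), 8)), fields[1])
        mounts.append((mount_point, fields[2]))
    return mounts


def is_under(path: str, root: str) -> bool:
    """True if path equals root or lies below it."""
    root = root.rstrip("/")
    return path == (root or "/") or path.startswith(root + "/")


def unsupported_mounts(text: str, subtree: str,
                       capable: Set[str] = RTIME_CAPABLE) -> List[str]:
    """List mount points that make rtime unusable for subtree (empty if none)."""
    subtree = os.path.normpath(subtree)
    mounts = parse_mounts(text)

    # The filesystem holding the subtree root: longest matching mount point,
    # with later entries shadowing earlier ones of the same length.
    covering: Optional[Tuple[str, str]] = None
    for mp, fstype in mounts:
        if is_under(subtree, mp) and (covering is None
                                      or len(mp) >= len(covering[0])):
            covering = (mp, fstype)

    bad = []
    if covering is None:
        bad.append(subtree)
    elif covering[1] not in capable:
        bad.append(covering[0])

    # Anything mounted strictly below the subtree must be capable as well.
    for mp, fstype in mounts:
        if mp != subtree and is_under(mp, subtree) and fstype not in capable:
            bad.append(mp)
    return bad


if __name__ == "__main__":
    sample = (
        "/dev/sda1 / ext3 rw 0 0\n"
        "/dev/sda2 /home ext3 rw 0 0\n"
        "server:/export /home/nfs nfs rw 0 0\n"
    )
    print(unsupported_mounts(sample, "/home"))   # ['/home/nfs']
```

On a live system the text argument would come from reading /proc/self/mounts; the sample string just keeps the sketch runnable anywhere.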
Re: [RFC] [PATCH 3/3] Recursive mtime for ext3
On Thu 08-11-07 09:37:59, Theodore Tso wrote:
> Ah, OK, so the two things that I didn't get from your patch description are:
>
> 1) the rtime flag and rtime field are only set on directories
>
> 2) the intended use is not trackerd and its ilk, but rsync and updatedb, so it is desirable that scans/queries be persistent across reboots
>
> But then the major hole in this scheme is still the issue of hard links. The rsync program is still going to have to scan the entire subtree looking for hard links, since an inode with multiple links into the directory tree can't guarantee that all of its parent directories will have their rtime field updated.

Not really - initially rsync can scan a tree for hardlinks and remember where they are. If a hardlink to a file is created, an rtime update is sent up the tree via the path used to create the link. So during the next scan, rsync will see that the file is modified, find out that its nlink is > 1, and add it to the list of hardlinked files. So for things like regular backups hardlinks can be dealt with efficiently.

> A program like updatedb, which only cares about filenames, will be OK, since it really only cares about knowing when directories have changed, and you can't have hard links to directories.
>
> The other problem, of course, is that this feature would become ext2/3/4 specific, and I could see future filesystems possibly wanting this. So this raises the question of whether the interface should be at the VFS layer or not --- and if so, how to handle querying whether a particular filesystem supports it, and what happens if you have a subtree which is covered by a filesystem that doesn't support rtime? So a program like rsync would need to scan /proc/self/mounts to see whether or not it would be safe to use this feature in the first place.

Yes, being filesystem specific and thus requiring special handling of mount points is a disadvantage of this approach.
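A sketch of the bookkeeping that the hardlink workaround would need on the userspace side, under the assumption that the scanner keeps a table of multiply-linked inodes and rechecks those on every run regardless of directory rtime. The helper names note_hardlinks and files_to_recheck are hypothetical; nothing like this exists in rsync as far as this thread shows.

```python
import os
import tempfile
from typing import Dict, List, Set, Tuple

InodeKey = Tuple[int, int]   # (st_dev, st_ino) identifies an inode


def note_hardlinks(directory: str, known: Dict[InodeKey, Set[str]]) -> None:
    """Record every regular file directly in directory whose nlink is > 1.

    Meant to run for each directory during the initial full scan, and later
    only for directories that the rtime check reports as changed.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                known.setdefault(key, set()).add(entry.path)


def files_to_recheck(known: Dict[InodeKey, Set[str]]) -> List[str]:
    """One path per hard-linked inode; these get checked on every run because
    a change made through another link may not touch this directory's rtime."""
    return sorted(min(paths) for paths in known.values())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        sub = os.path.join(top, "sub")
        os.mkdir(sub)
        original = os.path.join(top, "data")
        with open(original, "w") as f:
            f.write("payload\n")
        os.link(original, os.path.join(sub, "alias"))   # second name, same inode

        known: Dict[InodeKey, Set[str]] = {}
        note_hardlinks(top, known)
        note_hardlinks(sub, known)
        print(len(known), "hard-linked inode(s) to recheck on every run")
```

The cost is proportional to the number of hard-linked files rather than to the size of the tree, which is the point of Jan's argument for the regular-backup case.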
> And, of course, rsync would need to know whether it has write access to the tree in order to set flags in the directory, and what to do if some portion of the subtree isn't writeable by rsync.

Yes, the cases where we cannot modify the flag in a tree would have to be handled (similarly to the cases where the filesystem simply does not support the feature). I don't think it would be too complicated, but I have not written the modification for rsync yet, so I may be underestimating...

> On Thu, Nov 08, 2007 at 11:56:42AM +0100, Jan Kara wrote:
> > > Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing it in kernel memory is that you might be able to support watching a tree that is not owned by the watcher.
> >
> > Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being an owner of the inode. In fact, I don't currently see use case where you won't be either root (rsync, updatedb) or an owner of the files (watching config file trees) but I guess people would find some :).
>
> Sometimes people like to use rsync to copy a subtree to which they have read access but not write access. (And here note that it's not enough to have write access, you actually need to *own* all of the directories in the subtree.)

Yes, so in such cases my feature won't be able to help. But I think there are still enough cases where it would help.

> Yes, it's safe to let any user *set* the rtime flag, but we couldn't let them clear the rtime flag, since then they would be able to hide a file modification from some other (potentially privileged) process.

Good point.

> Speaking of security, I assume your patch will never allow rtime to go backwards (for example if the user attempts to backdate a file's mtime field using the utime() or utimes() system call)?

No, the patch does not allow this.
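The two security points in this exchange boil down to two small rules. A minimal sketch, with hypothetical helper names (may_change_flag and next_rtime are not taken from the patch, and relaxing the owner check for setting the flag is only the proposal discussed here, not existing behaviour):

```python
def may_change_flag(is_owner_or_root: bool, new_value: bool) -> bool:
    """Setting the flag only asks for changes to be recorded, so anyone may do
    it; clearing it could hide a modification, so it stays privileged."""
    return new_value or is_owner_or_root


def next_rtime(current: int, proposed: int) -> int:
    """rtime never moves backwards, even if mtime is backdated with utime()."""
    return max(current, proposed)


if __name__ == "__main__":
    print(may_change_flag(is_owner_or_root=False, new_value=True))    # True
    print(may_change_flag(is_owner_or_root=False, new_value=False))   # False
    print(next_rtime(current=100, proposed=50))                       # 100
```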
But anyway, if the user has enough rights to change the file's mtime, would it really be a security concern?

> I guess I'm convinced that updatedb could use this facility, but there are enough asterisks around it that I'm not sure that rsync could safely use this feature in production. I don't doubt that in a cold cache case, it would speed up rsync, but because it doesn't handle hard links, it's not reliable. Since rsync often gets used for backups, this is a big deal. There are also questions about what to do if rsync doesn't have write access to the filesystem, or if there is a non-rtime capable filesystem mounted in the subtree, etc., that can be worked around, but would add a lot of complexity and grottiness to the rsync source tree. Is the rsync maintainer really willing to add all of the necessary hair to support this rtime facility into their program?

Hardlinks can be worked around as I wrote above and there would have to be a fallback in case we cannot set the
Re: delalloc space accounting problem.
Because the new delalloc patch doesn't have reservation integrated. I'm going to implement this after data=ordered support.

thanks, Alex

Eric Sandeen wrote:
> It appears that delalloc lets me copy 50M of data onto a 30M filesystem; at least I never get ENOSPC back, although I wind up with several files that have 1M length but 0 blocks.
>
> I've filed a bug in the kernel bug tracker, I think we could use a central place to track issues:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9329
>
> I'll try to find time to look into this unless someone knows offhand where the problem is...
>
> Thanks,
> -Eric
>
> (p.s. should I get ext[234] bug mail routed to this list, or would that be annoying?)
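To illustrate what "reservation" buys here, a toy model of delayed allocation with space reserved at write time. This is a sketch of the general idea only, not the ext4 implementation: DelallocModel and its methods are invented names, and blocks are abstract units.

```python
import errno
import os


class DelallocModel:
    """Toy filesystem: blocks are reserved at write() and allocated at flush()."""

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks   # blocks not yet allocated on disk
        self.reserved = 0                # blocks promised to dirty, unflushed data

    def write(self, nblocks: int) -> None:
        """Accept dirty data only if it can be guaranteed a home later."""
        if self.reserved + nblocks > self.free_blocks:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.reserved += nblocks

    def flush(self) -> None:
        """Writeback: turn reservations into real allocations."""
        self.free_blocks -= self.reserved
        self.reserved = 0


if __name__ == "__main__":
    fs = DelallocModel(free_blocks=24)   # arbitrary number for the demo
    written = 0
    try:
        for _ in range(50):
            fs.write(1)
            written += 1
    except OSError as exc:
        print(f"write {written + 1} failed: {exc.strerror}")
    fs.flush()
    print(written, "writes accepted,", fs.free_blocks, "blocks left")
```

Without the check in write(), every write would be accepted and the shortfall would only show up at flush time, when there is no caller left to hand ENOSPC to, which matches the silent data loss in the report.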
delalloc space accounting problem.
It appears that delalloc lets me copy 50M of data onto a 30M filesystem; at least I never get ENOSPC back, although I wind up with several files that have 1M length but 0 blocks.

I've filed a bug in the kernel bug tracker, I think we could use a central place to track issues:

http://bugzilla.kernel.org/show_bug.cgi?id=9329

I'll try to find time to look into this unless someone knows offhand where the problem is...

Thanks,
-Eric

(p.s. should I get ext[234] bug mail routed to this list, or would that be annoying?)
Re: [Bugme-new] [Bug 9329] New: ext4: delalloc space accounting problem drops data
On Thu, 8 Nov 2007 09:42:10 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9329
>
>            Summary: ext4: delalloc space accounting problem drops data
>            Product: File System
>            Version: 2.5
>      KernelVersion: 2.6.24-rc1
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: ext4
>         AssignedTo: [EMAIL PROTECTED]
>         ReportedBy: [EMAIL PROTECTED]
>
> 2.6.24-rc1 + ext4 git patch queue from last week or so.
>
> It appears that delalloc does not track used space properly, and fails to return ENOSPC as appropriate:
>
> [EMAIL PROTECTED] ~]# mkfs.ext3 -I 256 /dev/sdb7 32768
> [EMAIL PROTECTED] ~]# mount -t ext4dev -o data=writeback,delalloc,extents,mballoc /dev/sdb7 /mnt/test
> [EMAIL PROTECTED] ~]# df -h /mnt/test
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sdb7              30M  4.5M   24M  16% /mnt/test
> [EMAIL PROTECTED] ~]# du -h /tmp/1Mfile
> 1.1M    /tmp/1Mfile
> [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile /mnt/test/1Mfile-$I; done
> [EMAIL PROTECTED] ~]# df -h /mnt/test
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sdb7              30M   30M     0 100% /mnt/test
>
> all resulting files are 1M in length:
>
> [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep -v 1048576
> [EMAIL PROTECTED] ~]# ls -l /mnt/test/1M* | grep 1048576 | wc -l
> 50
>
> but many of them have silently dropped data on the floor:
>
> [EMAIL PROTECTED] ~]# du -hc /mnt/test/1Mfile-* | grep -v 1.0M
> 596K    /mnt/test/1Mfile-26
> 0       /mnt/test/1Mfile-27
> 0       /mnt/test/1Mfile-28
> 0       /mnt/test/1Mfile-29
> 0       /mnt/test/1Mfile-30
> [snip]
>
> When mounted with nodelalloc, I get proper behavior:
>
> [EMAIL PROTECTED] ~]# for I in `seq 1 50`; do cp /tmp/1Mfile /mnt/test/1Mfile-$I; done
> cp: writing `/mnt/test/1Mfile-26': No space left on device
> cp: writing `/mnt/test/1Mfile-27': No space left on device
> cp: writing `/mnt/test/1Mfile-28': No space left on device
> cp: writing `/mnt/test/1Mfile-29': No space left on device
> [snip]
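The ls/du comparison in the report can be automated. A small sketch, assuming hypothetical helper names (looks_dropped, find_suspects), that flags files whose length is non-zero while no blocks are allocated. Treat the result as a hint only: sparse files, and small files stored inline on some filesystems, legitimately report zero blocks too.

```python
import os
import sys
from typing import Callable, Iterable, List


def looks_dropped(size: int, blocks: int) -> bool:
    """Non-empty file with no allocated blocks, like the 0-block files above."""
    return size > 0 and blocks == 0


def find_suspects(paths: Iterable[str],
                  stat: Callable = os.stat) -> List[str]:
    """Return the paths whose stat result shows a length but no blocks.

    The stat parameter exists only so the check can be exercised without a
    real filesystem; by default it is os.stat.
    """
    suspects = []
    for path in paths:
        st = stat(path)
        if looks_dropped(st.st_size, st.st_blocks):
            suspects.append(path)
    return suspects


if __name__ == "__main__":
    # Usage: python check_blocks.py /mnt/test/1Mfile-*
    for path in find_suspects(sys.argv[1:]):
        print("possibly lost data:", path)
```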
[PATCH] fix check_mntent_file() to pass mode for open(O_CREAT)
On my FC8 install, ismounted.c fails to build because open(O_CREAT) is used without passing a mode. The following trivial patch fixes it.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

Index: e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
===================================================================
--- e2fsprogs-1.40.2.orig/lib/ext2fs/ismounted.c
+++ e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
@@ -147,7 +147,7 @@ static errcode_t check_mntent_file(const
 is_root:
 #define TEST_FILE "/.ismount-test-file"
 	*mount_flags |= EXT2_MF_ISROOT;
-	fd = open(TEST_FILE, O_RDWR|O_CREAT);
+	fd = open(TEST_FILE, O_RDWR|O_CREAT, 0600);
 	if (fd < 0) {
 		if (errno == EROFS)
 			*mount_flags |= EXT2_MF_READONLY;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [PATCH] fix check_mntent_file() to pass mode for open(O_CREAT)
Andreas Dilger wrote:
> On my FC8 install, ismounted.c fails to build because open(O_CREAT) is used without passing a mode. The following trivial patch fixes it.

You can add:

Acked-by: Eric Sandeen [EMAIL PROTECTED]

'cause it's an awful lot like the patch I sent for the same issue back on 8/16 ;-) Guess I should have followed that up with a ping. (though your 0600 mode is probably better than my 0644 was)

Andreas, did you also run into trouble with struct_io_manager's ->open calls triggering this test? I sent a patch for that, "[PATCH] rename ->open and ->close ops in struct_io_manager" too... maybe the glibc #define tricks got smarter and don't trigger that now?

-Eric

> Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
>
> Index: e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
> ===================================================================
> --- e2fsprogs-1.40.2.orig/lib/ext2fs/ismounted.c
> +++ e2fsprogs-1.40.2/lib/ext2fs/ismounted.c
> @@ -147,7 +147,7 @@ static errcode_t check_mntent_file(const
>  is_root:
>  #define TEST_FILE "/.ismount-test-file"
>  	*mount_flags |= EXT2_MF_ISROOT;
> -	fd = open(TEST_FILE, O_RDWR|O_CREAT);
> +	fd = open(TEST_FILE, O_RDWR|O_CREAT, 0600);
>  	if (fd < 0) {
>  		if (errno == EROFS)
>  			*mount_flags |= EXT2_MF_READONLY;
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
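On the 0600 versus 0644 remark: the third argument only matters when the file is actually created, and the process umask can remove bits from it but never add any. A small Python sketch of the effect follows. Note that Python's os.open always supplies a mode (it defaults to 0o777), so this cannot reproduce the C build failure itself; create_private is a made-up helper name and the scratch path is not the one used by ismounted.c.

```python
import os
import stat
import tempfile


def create_private(path: str) -> int:
    """Create (or open) path read-write, asking for mode 0600, in the same
    spirit as open(TEST_FILE, O_RDWR|O_CREAT, 0600) in the patch."""
    return os.open(path, os.O_RDWR | os.O_CREAT, 0o600)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "ismount-test-file")
        fd = create_private(path)
        os.close(fd)
        mode = stat.S_IMODE(os.stat(path).st_mode)
        # With 0600 requested, group and other never get any access;
        # with 0644 they could end up with read permission.
        print(oct(mode))
```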