Re: [RFD] Incremental fsck

2008-01-12 Thread Daniel Phillips
On Wednesday 09 January 2008 01:16, Andreas Dilger wrote:
> While an _incremental_ fsck isn't so easy for existing filesystem
> types, what is pretty easy to automate is making a read-only snapshot
> of a filesystem via LVM/DM and then running e2fsck against that.  The
> kernel and filesystem have hooks to flush the changes from cache and
> make the on-disk state consistent.
>
> You can then set the ext[234] superblock mount count and last
> check time via tune2fs if all is well, or schedule an outage if there
> are inconsistencies found.
>
> There is a copy of this script at:
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html
>
> Note that it might need some tweaks to run with DM/LVM2
> commands/output, but is mostly what is needed.

You can do this now with ddsnap (an out-of-tree device mapper target) 
either by checking a local snapshot or a replicated snapshot on a 
different machine, see:

http://zumastor.org/

Doing the check on a remote machine seems attractive because the fsck 
does not create a load on the server.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-12 Thread Theodore Tso
On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> 
> Ok, but let's look at this a bit more opportunistic / optimistic.
> 
> Even after a black-out shutdown, the corruption is pretty minimal, using 
> ext3fs at least.
>

After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are off.

> So let's take advantage of this fact and do an optimistic fsck, to
> assure integrity per-dir, and assume no external corruption.  Then
> we release this checked dir to the wild (optionally ro), and check
> the next.  Once we find external inconsistencies we either fix it
> unconditionally, based on some preconfigured actions, or present the
> user with options.

So what can you check?  The *only* things you can check are whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.

What is very hard to check is whether or not the link count on the
inode is correct.  Suppose the link count is 1, but there are actually
two directory entries pointing at it.  Now when someone unlinks the
file through one of those directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname.  Oops.  Data Loss.
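
To make the link count problem concrete, here is a minimal sketch in
userspace C.  It is a toy model, not ext3's on-disk format: the struct
and function names (toy_dirent, check_one_dir, check_link_counts) are
hypothetical and exist only for illustration.  A check confined to one
directory can only notice that a directory holds more entries for an
inode than its link count allows; a wrong count caused by an entry in
another directory only shows up in a pass over every directory.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not ext3 structures. */
#define NUM_INODES 8

struct toy_dirent {
	unsigned dir;	/* directory that holds this entry */
	unsigned ino;	/* inode the entry points at */
};

/*
 * Per-directory check: look only at the entries of one directory.
 * The most it can flag is an inode referenced more often from this
 * directory alone than its stored link count allows.
 */
static size_t check_one_dir(const struct toy_dirent *ents, size_t n,
			    unsigned dir, const unsigned stored[NUM_INODES])
{
	unsigned seen[NUM_INODES] = { 0 };
	size_t bad = 0;

	for (size_t i = 0; i < n; i++) {
		if (ents[i].dir != dir)
			continue;
		assert(ents[i].ino < NUM_INODES);
		seen[ents[i].ino]++;
	}
	for (unsigned ino = 0; ino < NUM_INODES; ino++)
		if (seen[ino] > stored[ino])
			bad++;
	return bad;
}

/*
 * Global check: count references from every directory, then compare
 * with the stored link counts.  Returns the number of mismatches.
 */
static size_t check_link_counts(const struct toy_dirent *ents, size_t n,
				const unsigned stored[NUM_INODES])
{
	unsigned seen[NUM_INODES] = { 0 };
	size_t bad = 0;

	for (size_t i = 0; i < n; i++) {
		assert(ents[i].ino < NUM_INODES);
		seen[ents[i].ino]++;
	}
	for (unsigned ino = 0; ino < NUM_INODES; ino++)
		if (seen[ino] != stored[ino])
			bad++;
	return bad;
}
```

With inode 5 stored as link count 1 but referenced from directories 1
and 2, check_one_dir() reports nothing for either directory, while
check_link_counts() reports one mismatch.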

This is why doing incremental, on-line fsck'ing is *hard*.  You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse.  And
it's not just the hard link count.  There is a similar issue with the
block allocation bitmap.  Detecting the case where two files are
simultaneously claiming the same block can't be done if you are doing
it incrementally, and if
the filesystem is changing out from under you, it's impossible, unless
you also have the filesystem telling you every single change while it
is happening, and you keep an insane amount of bookkeeping.
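
The block case can be sketched the same way.  Again this is a toy
userspace model under assumptions of my own: toy_file and
count_shared_blocks are hypothetical names, and a real checker would
walk inode block maps rather than flat arrays.  The point is only that
the owner table has to cover every file before a doubly-claimed block
can be seen.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not a real filesystem layout. */
#define NUM_BLOCKS 16

struct toy_file {
	const unsigned *blocks;	/* block numbers this file claims */
	size_t nblocks;
};

/*
 * Walk every file and record how many times each block is claimed.
 * Returns the number of distinct blocks claimed more than once
 * (a block listed twice by the same file also counts).
 */
static size_t count_shared_blocks(const struct toy_file *files, size_t nfiles)
{
	unsigned char owners[NUM_BLOCKS] = { 0 };
	size_t shared = 0;

	for (size_t f = 0; f < nfiles; f++) {
		for (size_t i = 0; i < files[f].nblocks; i++) {
			unsigned b = files[f].blocks[i];

			assert(b < NUM_BLOCKS);
			if (owners[b] == 1)
				shared++;	/* second claimant: count once */
			if (owners[b] < 2)
				owners[b]++;
		}
	}
	return shared;
}
```

Checking either file on its own finds nothing wrong; the conflict only
appears once both files have been entered into the same owner table.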

One thing that you *might* be able to do is to mount a filesystem
read-only and check it in the background while you allow users to
access it read-only.  There are a few caveats, however: (1) some
filesystem errors may cause the data to be corrupt, or in the worst
case, could cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should.)  (2) if there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write.  You can't just do a "mount -o remount" if the filesystem
was modified under the OS's nose.

> All this could be per-dir or using some form of on-the-fly file-block-zoning.
> 
> And there probably is a lot more to it, but it should conceptually be 
> possible, with more thoughts though...

Many things are possible, in the NASA sense of "with enough thrust,
anything will fly".  Whether or not it is *useful* and *worthwhile*
are of course different questions!  :-)

- Ted


[PATCH][RFC] security: call security_file_permission from rw_verify_area

2008-01-12 Thread James Morris
Please review.  

Tested with SELinux in enforcing mode.

---

All instances of rw_verify_area() are followed by a call to
security_file_permission(), so just call the latter from the former.

Signed-off-by: James Morris <[EMAIL PROTECTED]>
---
 fs/compat.c |4 ---
 fs/read_write.c |   63 +--
 fs/splice.c |8 ---
 3 files changed, 24 insertions(+), 51 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 15078ce..5216c3f 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1104,10 +1104,6 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 	if (ret < 0)
 		goto out;
 
-	ret = security_file_permission(file, type == READ ? MAY_READ:MAY_WRITE);
-	if (ret)
-		goto out;
-
 	fnv = NULL;
 	if (type == READ) {
 		fn = file->f_op->read;
diff --git a/fs/read_write.c b/fs/read_write.c
index ea1f94c..c4d3d17 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -197,25 +197,27 @@ int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count
 {
 	struct inode *inode;
 	loff_t pos;
+	int retval = -EINVAL;
 
 	inode = file->f_path.dentry->d_inode;
 	if (unlikely((ssize_t) count < 0))
-		goto Einval;
+		return retval;
 	pos = *ppos;
 	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
-		goto Einval;
+		return retval;
 
 	if (unlikely(inode->i_flock && mandatory_lock(inode))) {
-		int retval = locks_mandatory_area(
+		retval = locks_mandatory_area(
 			read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
 			inode, file, pos, count);
 		if (retval < 0)
 			return retval;
 	}
+	retval = security_file_permission(file,
+				read_write == READ ? MAY_READ : MAY_WRITE);
+	if (retval)
+		return retval;
 	return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
-
-Einval:
-	return -EINVAL;
 }
 
 static void wait_on_retry_sync_kiocb(struct kiocb *iocb)
@@ -267,18 +269,15 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
 	ret = rw_verify_area(READ, file, pos, count);
 	if (ret >= 0) {
 		count = ret;
-		ret = security_file_permission (file, MAY_READ);
-		if (!ret) {
-			if (file->f_op->read)
-				ret = file->f_op->read(file, buf, count, pos);
-			else
-				ret = do_sync_read(file, buf, count, pos);
-			if (ret > 0) {
-				fsnotify_access(file->f_path.dentry);
-				add_rchar(current, ret);
-			}
-			inc_syscr(current);
+		if (file->f_op->read)
+			ret = file->f_op->read(file, buf, count, pos);
+		else
+			ret = do_sync_read(file, buf, count, pos);
+		if (ret > 0) {
+			fsnotify_access(file->f_path.dentry);
+			add_rchar(current, ret);
 		}
+		inc_syscr(current);
 	}
 
 	return ret;
@@ -325,18 +324,15 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	ret = rw_verify_area(WRITE, file, pos, count);
 	if (ret >= 0) {
 		count = ret;
-		ret = security_file_permission (file, MAY_WRITE);
-		if (!ret) {
-			if (file->f_op->write)
-				ret = file->f_op->write(file, buf, count, pos);
-			else
-				ret = do_sync_write(file, buf, count, pos);
-			if (ret > 0) {
-				fsnotify_modify(file->f_path.dentry);
-				add_wchar(current, ret);
-			}
-			inc_syscw(current);
+		if (file->f_op->write)
+			ret = file->f_op->write(file, buf, count, pos);
+		else
+			ret = do_sync_write(file, buf, count, pos);
+		if (ret > 0) {
+			fsnotify_modify(file->f_path.dentry);
+			add_wchar(current, ret);
 		}
+		inc_syscw(current);
 	}
 
 	return ret;
@@ -603,9 +599,6 @@ static ssize_t do_readv_writev(int type, struct file *file,
 	ret = rw_verify_area(type, file, pos, tot_len);
 	if (ret < 0)
 		goto out;
-	ret = security_file_permission(file, type == READ ? MAY_READ : MAY_WRITE);
-	if (ret)
-		goto out;
 
 	fnv = NULL;
 	if (type == READ) {
@@ -737,10 +730,6 @@ static ss
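
The return convention the patch relies on (a negative errno on failure,
otherwise the possibly clamped byte count) can be modelled in userspace.
This is only a sketch of the pattern, not kernel code: every toy_* name
and the TOY_MAX_RW_COUNT value are assumptions made up for illustration,
and the mandatory-lock check is left out.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Userspace model only: all toy_* names and values are hypothetical. */
#define TOY_MAX_RW_COUNT 4096

enum { TOY_READ, TOY_WRITE };

struct toy_file {
	int may_read;	/* stand-in for the security policy decision */
	int may_write;
};

/* Stand-in for the security hook: 0 if allowed, negative errno if not. */
static int toy_security_file_permission(const struct toy_file *f, int rw)
{
	int allowed = (rw == TOY_READ) ? f->may_read : f->may_write;

	return allowed ? 0 : -EACCES;
}

/*
 * Range checks and the permission check in one place, so callers need
 * a single test.  Returns a negative errno or the clamped count.
 */
static long toy_rw_verify_area(int rw, const struct toy_file *f,
			       long long pos, size_t count)
{
	int retval;

	if (count > (size_t)LONG_MAX)
		return -EINVAL;
	if (pos < 0 || pos > LLONG_MAX - (long long)count)
		return -EINVAL;

	retval = toy_security_file_permission(f, rw);
	if (retval)
		return retval;
	return count > TOY_MAX_RW_COUNT ? TOY_MAX_RW_COUNT : (long)count;
}

/* A caller no longer needs its own permission check. */
static long toy_vfs_read(const struct toy_file *f, long long pos, size_t count)
{
	long ret = toy_rw_verify_area(TOY_READ, f, pos, count);

	if (ret >= 0) {
		/* A real implementation would transfer up to ret bytes here. */
	}
	return ret;
}
```

Each caller tests one return value and cannot forget the security hook,
which is the simplification the patch description argues for.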

Re: [RFD] Incremental fsck

2008-01-12 Thread Al Boldi
Bodo Eggert wrote:
> Al Boldi <[EMAIL PROTECTED]> wrote:
> > Even after a black-out shutdown, the corruption is pretty minimal, using
> > ext3fs at least.  So let's take advantage of this fact and do an
> > optimistic fsck, to assure integrity per-dir, and assume no external
> > corruption.  Then we release this checked dir to the wild (optionally
> > ro), and check the next. Once we find external inconsistencies we either
> > fix it unconditionally, based on some preconfigured actions, or present
> > the user with options.
>
> Maybe we can know the changes that need to be done in order to fix the
> filesystem. Let's record this information in - eh - let's call it a
> journal!

Don't mistake data=journal for an fsck replacement.


Thanks!

--
Al
