[PATCH] reiserfs: shrink superblock if no xattrs

2007-02-23 Thread Alexey Dobriyan
This makes the in-core superblock (struct reiserfs_sb_info) fit into one cacheline here when xattr support is compiled out.

Before:
        struct dentry *      xattr_root;       /*   124     4 */
        /* --- cacheline 1 boundary (128 bytes) --- */
        struct rw_semaphore  xattr_dir_sem;    /*   128    12 */
        int                  j_errno;          /*   140     4 */
}; /* size: 144, cachelines: 2 */
   /* sum members: 142, holes: 1, sum holes: 2 */
   /* last cacheline: 16 bytes */

After:
        int                  j_errno;          /*   124     4 */
        /* --- cacheline 1 boundary (128 bytes) --- */
}; /* size: 128, cachelines: 1 */
   /* sum members: 126, holes: 1, sum holes: 2 */
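As an aside, not part of the patch: the layout dumps above come from pahole. If one wanted to guard the 128-byte result against regressions, a build-time assertion roughly like the sketch below could do it; the helper name and its placement are hypothetical, and the 128-byte bound is specific to this configuration (32-bit pointers, 128-byte cachelines, xattrs compiled out).

#include <linux/kernel.h>
#include <linux/reiserfs_fs_sb.h>

static inline void reiserfs_sb_info_layout_check(void)
{
#ifndef CONFIG_REISERFS_FS_XATTR
        /* with xattrs compiled out, the in-core superblock should stay at 128 bytes here */
        BUILD_BUG_ON(sizeof(struct reiserfs_sb_info) > 128);
#endif
}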

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 fs/reiserfs/super.c            |    6 ++++--
 include/linux/reiserfs_fs_sb.h |    3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -433,12 +433,13 @@ int remove_save_link(struct inode *inode
 static void reiserfs_kill_sb(struct super_block *s)
 {
         if (REISERFS_SB(s)) {
+#ifdef CONFIG_REISERFS_FS_XATTR
                 if (REISERFS_SB(s)->xattr_root) {
                         d_invalidate(REISERFS_SB(s)->xattr_root);
                         dput(REISERFS_SB(s)->xattr_root);
                         REISERFS_SB(s)->xattr_root = NULL;
                 }
-
+#endif
                 if (REISERFS_SB(s)->priv_root) {
                         d_invalidate(REISERFS_SB(s)->priv_root);
                         dput(REISERFS_SB(s)->priv_root);
@@ -1563,9 +1564,10 @@ static int reiserfs_fill_super(struct su
         REISERFS_SB(s)->s_alloc_options.preallocmin = 0;
         /* Preallocate by 16 blocks (17-1) at once */
         REISERFS_SB(s)->s_alloc_options.preallocsize = 17;
+#ifdef CONFIG_REISERFS_FS_XATTR
         /* Initialize the rwsem for xattr dir */
         init_rwsem(&REISERFS_SB(s)->xattr_dir_sem);
-
+#endif
         /* setup default block allocator options */
         reiserfs_init_alloc_options(s);

--- a/include/linux/reiserfs_fs_sb.h
+++ b/include/linux/reiserfs_fs_sb.h
@@ -401,9 +401,10 @@ struct reiserfs_sb_info {
         int reserved_blocks;    /* amount of blocks reserved for further allocations */
         spinlock_t bitmap_lock; /* this lock on now only used to protect reserved_blocks variable */
         struct dentry *priv_root;       /* root of /.reiserfs_priv */
+#ifdef CONFIG_REISERFS_FS_XATTR
         struct dentry *xattr_root;      /* root of /.reiserfs_priv/.xa */
         struct rw_semaphore xattr_dir_sem;
-
+#endif
         int j_errno;
 #ifdef CONFIG_QUOTA
         char *s_qf_names[MAXQUOTAS];



Re: mismatch between 2.6.19 and nfs-utils-1.0.10 nfsctl_arg structure???

2007-02-23 Thread Wouter Batelaan

To all who took an interest, and Neil Brown who gave help:

I've got the nfs server working now.

At the end there were several issues
(note that we're on a dual-MIPS embedded system, starting off
with a very minimal system because of space constraints):

- /etc/services was not correct
- no network loopback interface
- trying to export part of an NFSROOT (imported) file system
- /etc/exports netmask problem
- having to run portmap on client machine as well
- missing fsid option in exports file (see the example below)

:-( (I'm not 100% sure that ALL of the above were/are critical)
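
For the fsid item, an exports entry along these lines is what was needed (the path and network below are placeholders, not our actual config):

/export   192.168.1.0/255.255.255.0(rw,sync,no_subtree_check,fsid=0)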

I temporarily also used the unfs3 server with success, but
eventually got the knfsd working.

Thanks for the help.

Kind Regards,
Wouter Batelaan
NXP Semiconductors.



i_mutex and deadlock

2007-02-23 Thread Steve French (smfltc)
The field that i_size_write() updates (i_size_seqcount) must be protected 
against simultaneous updates, otherwise we risk looping in i_size_read().


The suggestion in fs.h is to use i_mutex which seems too dangerous due 
to the possibility of deadlock.


There are 65 places in the fs directory which lock an i_mutex, and seven 
more in the mm directory.   The vfs does clearly lock file inodes in 
some paths before calling into a particular filesystem (nfs, ext3, cifs 
etc.) - in particular for fsync but probably for others that are harder 
to trace.  This seems to introduce the possibility of deadlock if a 
filesystem also uses i_mutex to protect file size updates.


Documentation/filesystems/Locking describes the use of i_mutex (was 
i_sem previously) and indicates that it is held by the vfs on three 
additional calls on file inodes which concern me (as deadlock 
possibilities): setattr, truncate and unlink.


nfs seems to limit its use of i_mutex to llseek and invalidate_mapping, 
and does not appear to grab the i_mutex (or any sem for that matter) to 
protect i_size_write (nfs calls i_size_write in nfs_grow_file); in the 
case of nfs_fhget (which bypasses i_size_write and sets i_size directly) 
it does not seem to grab i_mutex either.


ext3 also does not use i_mutex for this purpose (protecting 
i_size_write) - only to protect a journalling ioctl.


I am concerned about using i_mutex to protect the cifs calls to 
i_size_write (although it seems to fix a problem reported in i_size_read 
under stress) because of the following:


1) no one else calls i_size_write AFAIK (on our file inodes)
2) we don't block inside i_size_write do we ... (so why in the world do 
they take a slow mutex instead of a fast spinlock)
3) we don't really know what happens inside fsync (the paths through the 
page cache code seem complex and we don't want to reenter writepage in 
low memory conditions and deadlock updating the file size), and there is 
some concern that the vfs takes the i_mutex in other paths on file 
inodes before entering our code and could deadlock.


Any reason why an fs shouldn't simply use something else (a spinlock) 
other than i_mutex to protect the i_size_write call?
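
For concreteness, what I have in mind is roughly the sketch below: a filesystem-private spinlock taken only around the size update. This is just an illustration; the lock and helper names are made up and are not existing cifs code.

#include <linux/fs.h>
#include <linux/spinlock.h>

/* One global lock purely for illustration; a real fs would more likely
 * embed a spinlock in its per-inode private structure. */
static DEFINE_SPINLOCK(example_size_lock);

static void example_update_size(struct inode *inode, loff_t newsize)
{
        /* Serialize writers so i_size_seqcount is never bumped concurrently
         * and readers in i_size_read() cannot loop forever. */
        spin_lock(&example_size_lock);
        if (newsize > i_size_read(inode))
                i_size_write(inode, newsize);
        spin_unlock(&example_size_lock);
}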



Re: [linux-cifs-client] i_mutex and deadlock

2007-02-23 Thread Dave Kleikamp
On Fri, 2007-02-23 at 10:02 -0600, Steve French (smfltc) wrote:
> The field that i_size_write() updates (i_size_seqcount) must be protected 
> against simultaneous updates, otherwise we risk looping in i_size_read().
> 
> The suggestion in fs.h is to use i_mutex which seems too dangerous due 
> to the possibility of deadlock.

I'm not sure if it's as much a suggestion as a way of documenting the
locking  that exists (or existed when the comment was written).

... i_size_write() does need locking around it  (normally i_mutex) ...

> There are 65 places in the fs directory which lock an i_mutex, and seven 
> more in the mm directory.   The vfs does clearly lock file inodes in 
> some paths before calling into a particular filesystem (nfs, ext3, cifs 
> etc.) - in particular for fsync but probably for others that are harder 
> to trace.  This seems to introduce the possibility of deadlock if a 
> filesystem also uses i_mutex to protect file size updates.
> 
> Documentation/filesystems/Locking describes the use of i_mutex (was 
> i_sem previously) and indicates that it is held by the vfs on three 
> additional calls on file inodes which concern me (as deadlock 
> possibilities): setattr, truncate and unlink.
> 
> nfs seems to limit its use of i_mutex to llseek and invalidate_mapping, 
> and does not appear to grab the i_mutex (or any sem for that matter) to 
> protect i_size_write (nfs calls i_size_write in nfs_grow_file); in the 
> case of nfs_fhget (which bypasses i_size_write and sets i_size directly) 
> it does not seem to grab i_mutex either.
> 
> ext3 also does not use i_mutex for this purpose (protecting 
> i_size_write) - only to protect a journalling ioctl.
> 
> I am concerned about using i_mutex to protect the cifs calls to 
> i_size_write (although it seems to fix a problem reported in i_size_read 
> under stress) because of the following:
> 
> 1) no one else calls i_size_write AFAIK (on our file inodes)

I think you're right.

> 2) we don't block inside i_size_write do we ... (so why in the world do 
> they take a slow mutex instead of a fast spinlock)

My guess is that in existing cases it was already being held, so there
is no need to do something different.  I'm not sure if the comment is
still accurate.  What locking protects it in generic_commit_write() and
nobh_commit_write()?

> 3) we don't really know what happens inside fsync (the paths through the 
> page cache code seem complex and we don't want to reenter writepage in 
> low memory conditions and deadlock updating the file size), and there is 
> some concern that the vfs takes the i_mutex in other paths on file 
> inodes before entering our code and could deadlock.
> 
> Any reason why an fs shouldn't simply use something else (a spinlock) 
> other than i_mutex to protect the i_size_write call?

i_mutex doesn't make sense in your case.  Use whatever makes sense in
cifs.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



end to end error recovery musings

2007-02-23 Thread Ric Wheeler
In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.


My group has been working to stabilize a relatively up-to-date libata + 
MD based box, so I can try to lay out at least one appliance-like 
typical configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy (or build) similar home 
appliances that use Linux to provide a NAS-in-a-box for end users.


The use case that we have is an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a back port of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.


Each box has a watchdog timer that can be set to fire after at most 2 
minutes.


(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).


Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.


We still have the following challenges:

   (1) read-ahead often means that we will retry every bad sector at 
least twice from the file system level. The first time, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) just before the real 
application read comes in and does the same thing.  Not sure what the 
answer is here since read-ahead is obviously a huge win in the normal case.


   (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again and communicate up 
the stack to file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.


   (3) The timeout values on the failed IO's need to be tuned well (as 
was discussed in an earlier linux-ide thread). We cannot afford to hang 
for 30 seconds, especially in the MD case, since you might need to fail 
more than one device for a single IO.  Prompt error propagation (say 
that 4 times quickly!) can allow MD to mask the underlying errors as you 
would hope; hanging on too long will almost certainly cause a watchdog 
reboot...


   (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.


We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with the Mark 
patches so we can share the pain ;-)


ric





Re: end to end error recovery musings

2007-02-23 Thread Andreas Dilger
On Feb 23, 2007  16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
> >    (1) read-ahead often means that we will retry every bad sector at 
> > least twice from the file system level. The first time, the fs read-ahead 
> > request triggers a speculative read that includes the bad sector 
> > (triggering the error handling mechanisms) just before the real 
> > application read comes in and does the same thing.  Not sure what the 
> > answer is here since read-ahead is obviously a huge win in the normal case.
> 
> Probably the only sane thing to do is to remember the bad sectors and 
> avoid attempting reading them; that would mean marking automatic 
> versus explicitly requested requests to determine whether or not to 
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.  For that matter, a huge win
would be to have the MD RAID layer rewrite only the bad sector (in hopes
of the disk relocating it) instead of failing the whole disk.  Otherwise,
a few read errors on different disks in a RAID set can take the whole
system offline.  Apologies if this is already done in recent kernels...
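
In rough terms the bookkeeping could be as simple as the sketch below; this is purely illustrative, and none of these names exist in the block layer or MD today.

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct bad_sector {
        struct list_head node;
        sector_t sector;
};

static LIST_HEAD(bad_sectors);
static DEFINE_SPINLOCK(bad_sectors_lock);

/* Remember a sector that failed a read so readahead can avoid it. */
static void remember_bad_sector(sector_t sector)
{
        struct bad_sector *bs = kmalloc(sizeof(*bs), GFP_NOIO);

        if (!bs)
                return;
        bs->sector = sector;
        spin_lock(&bad_sectors_lock);
        list_add(&bs->node, &bad_sectors);
        spin_unlock(&bad_sectors_lock);
}

/* Called after a successful overwrite: the drive has probably remapped
 * the sector, so stop filtering reads against it. */
static void forget_bad_sector(sector_t sector)
{
        struct bad_sector *bs, *tmp;

        spin_lock(&bad_sectors_lock);
        list_for_each_entry_safe(bs, tmp, &bad_sectors, node) {
                if (bs->sector == sector) {
                        list_del(&bs->node);
                        kfree(bs);
                }
        }
        spin_unlock(&bad_sectors_lock);
}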

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Andreas Dilger wrote:

> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.


Certainly if the overwrite is successful.

-hpa


Re: end to end error recovery musings

2007-02-23 Thread Theodore Tso
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > Probably the only sane thing to do is to remember the bad sectors and 
> > avoid attempting reading them; that would mean marking automatic 
> > versus explicitly requested requests to determine whether or not to 
> > filter them against a list of discovered bad blocks.
> 
> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.  For that matter, a huge win
> would be to have the MD RAID layer rewrite only the bad sector (in hopes
> of the disk relocating it) instead of failing the whole disk.  Otherwise,
> a few read errors on different disks in a RAID set can take the whole
> system offline.  Apologies if this is already done in recent kernels...

It would also help to have a way of making this list available to both
the filesystem and to a userspace utility, so they can more easily deal
with doing a forced rewrite of the bad sector after determining which
file is involved, and perhaps do something intelligent (up to and
including automatically requesting a backup system to fetch a backup
version of the file and, if it can be determined that the file shouldn't
have been changed since the last backup, automatically fixing up the
corrupted data block :-).
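
(The forced-rewrite step itself is already easy enough to do from userspace once the sector number is known; the hedged sketch below zero-fills a single 512-byte sector via O_DIRECT. The device path, the 512-byte sector size and the assumption that the real data gets restored from backup afterwards are all placeholders for illustration.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Overwrite one 512-byte sector with zeroes; a write to a bad LBA
 * normally makes the drive remap it. */
static int rewrite_sector(const char *dev, off_t sector)
{
        void *buf;
        int fd = open(dev, O_WRONLY | O_DIRECT);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (posix_memalign(&buf, 512, 512)) {   /* O_DIRECT wants an aligned buffer */
                close(fd);
                return -1;
        }
        memset(buf, 0, 512);
        if (pwrite(fd, buf, 512, sector * 512) != 512) {
                perror("pwrite");
                free(buf);
                close(fd);
                return -1;
        }
        free(buf);
        close(fd);
        return 0;
}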

- Ted