Re: [NFS] [PATCH] Make UDF exportable

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
   + dotdot.d_name.name = "..";
   + dotdot.d_name.len = 2;
   +
   + lock_kernel();
   + if (!udf_find_entry(child->d_inode, &dotdot, &fibh, &cfi))
   + goto out_unlock;
Have you ever tried this? I think this could never work. UDF doesn't have
  an entry named ".." in a directory. You have to search for an entry that
  has the FID_FILE_CHAR_PARENT bit set in fileCharacteristics. Maybe you
  could hack around udf_find_entry() to recognize the ".." dentry and do
  the search accordingly.
 Probably not. I just tested that I could read files and navigate the
 directory structure. However looking into UDF I think you are right - it
 will fail.
 I have extended udf_find_entry() to do an explicit check based on
 fileCharacteristics as you propose.
 How do I actually test this case?

 - Mount the filesystem from the server.
 - 'cd' a few directories down into the filesystem.
 - reboot the server(1)
 - on the client 'ls -l'.

(1) A full reboot isn't needed.  Just unexport, unmount, remount,
re-export on the server.

Alternately, use a non-Linux client, cd down into the filesystem,
and run
ls -l ..
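
For reference, the fileCharacteristics-based check being discussed would
look roughly like this inside the directory-scan loop of udf_find_entry()
(a sketch only - 'fi' and 'cfi' are the descriptor pointers that loop
already has, and the real code will certainly differ in detail):

	/* A request for ".." is matched on the parent bit rather than on
	 * the stored name, because UDF directories have no literal ".."
	 * entry. */
	if (dentry->d_name.len == 2 &&
	    dentry->d_name.name[0] == '.' && dentry->d_name.name[1] == '.') {
		if (cfi->fileCharacteristics & FID_FILE_CHAR_PARENT)
			return fi;	/* this is the parent entry */
		continue;		/* not the parent; keep scanning */
	}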

NeilBrown



Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
 
 On Jan 17 2008 00:43, Karel Zak wrote:
  
  Seems like a plain bad idea to me.  There will be any number of home-made
  /proc/mounts parsers and we don't know what they do.
 
  So, let's use /proc/mounts_v2  ;-)
 
 Was it not like "don't use /proc for new things"?

I thought it was "don't use /proc for new things that aren't
process related".

And as the mount table is per process..

A host has a bunch of mounted filesystems (struct super_block), and
each process has some subset of these stitched together into a mount
tree (struct vfsmount / struct namespace).

There needs to be something in /proc that exposes the vfsmount tree.

Arguably there should be something else - maybe in sysfs - that
provides access to the struct superblock object.

And there needs to be a clear way to relate information from one with
information from the other.
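
To make the per-process point concrete, a small user-space illustration
(nothing kernel-side, just reading the existing files; in different mount
namespaces the two files can differ):

	#include <stdio.h>

	static void dump(const char *path)
	{
		char line[512];
		FILE *f = fopen(path, "r");

		if (!f)
			return;
		printf("==> %s\n", path);
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
	}

	int main(void)
	{
		dump("/proc/1/mounts");	/* init's view of the mount tree */
		dump("/proc/self/mounts");	/* this process's view */
		return 0;
	}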

In the tradition of stat, statm, status, maybe the former should be
 /proc/$PID/mountm
:-)

Hey, I just found /proc/X/mountstats.  How does this fit in to the big
picture?

NeilBrown


Re: Race between generic_forget_inode() and sync_sb_inodes()?

2007-11-29 Thread Neil Brown

Hi David,

On Friday November 30, [EMAIL PROTECTED] wrote:
 
 
 I came across this because I've been making changes to XFS to avoid the
 inode hash, and I've found that I need to remove the inode from the
 dirty list when setting I_WILL_FREE to avoid this race. I can't see
 how this race is avoided when inodes are hashed, so I'm wondering
 if we've just been lucky or there's something that I'm missing that
 means the above does not occur.

Looking at inode.c in 2.6.23-mm1, in generic_forget_inode, I see code:

	if (!hlist_unhashed(&inode->i_hash)) {
		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
			list_move(&inode->i_list, &inode_unused);

so it looks to me like:
   If the inode is hashed and dirty, then move it (off the s_dirty
   list) to inode_unused.

So it seems to me that generic_forget_inode also finds it needs to
remove the inode from the dirty list when setting I_WILL_FREE.

Maybe we are looking at different kernel versions?  Maybe I
misunderstood your problem?

NeilBrown


Re: Race between generic_forget_inode() and sync_sb_inodes()?

2007-11-29 Thread Neil Brown
On Friday November 30, [EMAIL PROTECTED] wrote:
 On Fri, Nov 30, 2007 at 09:07:06AM +1100, Neil Brown wrote:
  
  Hi David,
  
  On Friday November 30, [EMAIL PROTECTED] wrote:
   
   
   I came across this because I've been making changes to XFS to avoid the
   inode hash, and I've found that I need to remove the inode from the
   dirty list when setting I_WILL_FREE to avoid this race. I can't see
   how this race is avoided when inodes are hashed, so I'm wondering
   if we've just been lucky or there's something that I'm missing that
   means the above does not occur.
  
  Looking at inode.c in 2.6.23-mm1, in generic_forget_inode, I see code:
  
  if (!hlist_unhashed(&inode->i_hash)) {
      if (!(inode->i_state & (I_DIRTY|I_SYNC)))
          list_move(&inode->i_list, &inode_unused);
  
  so it looks to me like:
 If the inode is hashed and dirty, then move it (off the s_dirty
 list) to inode_unused.
 
 That check is for if the inode is _not_ dirty or being sync, right?
 Or have I just not had enough coffee this morning?

:-)  And I cannot even blame the lack of coffee as I don't drink it.

My second guess is that we have been lucky, which is hard to believe.

I wonder if iput (and even iget) should BUG on I_WILL_FREE as well...

Perplexed.

NeilBrown


Re: Massive slowdown when re-querying large nfs dir - CORRECTION

2007-11-07 Thread Neil Brown
On Thursday November 8, [EMAIL PROTECTED] wrote:
 
 Not really a credible difference as the reported difference is between
 two *clients* and the speed of getattr vs lookup would depend on the
 *server*. 

Sorry, my bad.  I misread your original problem description.  It would
appear to be a server difference.

Maybe an strace -tt of the nfs server might show some significant
difference.

NeilBrown


Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Neil Brown
On Tuesday November 6, [EMAIL PROTECTED] wrote:
  On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi [EMAIL PROTECTED] wrote:
  Al Boldi wrote:
   There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
   entries) using a simple ls -l.
  
   On 2.6.23 client and server running userland rpc.nfs.V2:
   first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
   more tries: time -p ls -l 2k+ entry dir  in ~8sec
  
   first  try: time -p ls -l 5k+ entry dir  in ~9sec
   more tries: time -p ls -l 5k+ entry dir  in ~180sec
  
   On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
   first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
   more tries: time -p ls -l 2k+ entry dir  in ~7sec
  
   first  try: time -p ls -l 5k+ entry dir  in ~8sec
   more tries: time -p ls -l 5k+ entry dir  in ~43sec
  
   Remounting the nfs-dir on the client resets the problem.
  
   Any ideas?
  
  Ok, I played some more with this, and it turns out that nfsV3 is a lot 
  faster.  But, this does not explain why the 2.4.31 kernel is still over 
  4-times faster than 2.6.23.
  
  Can anybody explain what's going on?
  
 
 Sure, Neil can! ;)

Nuh.
He said "userland rpc.nfs.Vx".  I only do kernel-land NFS.  In these
days of high specialisation, each line of code is owned by a different
person, and finding the right person is hard...

I would suggest getting a 'tcpdump -s0' trace and seeing (with
wireshark) what is different between the various cases.

NeilBrown


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 
 Of course snapshot cow elements may be part of more generic element
 trees.  In general there may be more than one consumer of block usage
 hints in a given filesystem's element tree, and their locations in that
 tree are not predictable.  This means the block extents mentioned in
 the usage hints need to be subject to the block mapping algorithms
 provided by the element tree.  As those algorithms are currently
 implemented using bio mapping and splitting, the easiest and simplest
 way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable
elements, and that some of these might be combined by e.g. RAID0 into
a larger device, with a filesystem then created on top of that?

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to, and as you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?", and thus be presenting from a
position of more strength (the old 'code talks' mantra).

 
 First we need a mechanism to indicate that a bio is a hint rather
 than a real IO.  Perhaps the easiest way is to add a new flag to
 the bi_rw field:
 
 #define BIO_RW_HINT   5   /* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn which is just to have
a zero-length barrier bio (is that implemented yet? I lost track).
But different as a zero length barrier has zero length, and your hints
have a very meaningful length.
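
For what it's worth, I imagine submitting such a hint would look a bit
like the zero-length barrier case - a sketch only, since BIO_RW_HINT and
the hint types are from your proposal and not in mainline, and bdev,
start, sectors and hint_end_io are whatever the caller supplies:

	/* Sketch: a page-less bio describing an extent, flagged as a hint. */
	static void send_block_hint(struct block_device *bdev, sector_t start,
				    unsigned int sectors,
				    bio_end_io_t *hint_end_io)
	{
		struct bio *bio = bio_alloc(GFP_NOFS, 0);	/* no data pages */

		bio->bi_bdev    = bdev;
		bio->bi_sector  = start;
		bio->bi_size    = sectors << 9;	/* extent length in bytes */
		bio->bi_rw     |= (1 << BIO_RW_HINT);
		bio->bi_end_io  = hint_end_io;
		/* where the hint type (ALLOCATE/RELEASE/DONTCOW) lives is
		 * still open - bi_rw bits or a new field; the proposal
		 * doesn't say */
		submit_bio(WRITE, bio);
	}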

 
 Next we'll need three bio hints types with the following semantics.
 
 BIO_HINT_ALLOCATE
 The bio's block extent will soon be written by the filesystem
 and any COW that may be necessary to achieve that should begin
 now.  If the COW is going to fail, the bio should fail.  Note
 that this provides a way for the filesystem to manage when and
 how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured?
i.e. "Please allocate 128 blocks, maybe 'here'."
 "OK, 128 blocks allocated, but they are actually over 'there'."

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation.  Maybe even have a variant:
"Allocate 128 blocks, I don't care where."

Is this bio supposed to block until the copy has happened?  Or only
until the space of the copy has been allocated and possibly committed?
Or must it return without doing any IO at all?

 
 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's usage
records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems?
Is my imagined disparity between block sizes valid?
Would it be just as easy for the storage device to track small
allocation/deallocations?

 
 BIO_HINT_DONTCOW
 (the Bart Simpson BIO).  The bio's block extent is not needed
 in mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two, which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the correct thing to do with the log is to put it on a separate
device which itself isn't snapshotted.

If you have a storage manager that is smart enough to handle these
sorts of things, maybe the functionality you want is "Give me a
subordinate device which is not snapshotted, size X", then journal to
that virtual device.
I guess that is equally domain specific, but the difference is that if
you try to read from the DONTCOW part of the snapshot, you get bad
old data, whereas if you try to access the subordinate device of a
snapshot, you get an IO error - which is probably safer.

 
 Comments?

On the whole it seems reasonably sane, providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown



Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-25 Thread Neil Brown
On Thursday October 25, [EMAIL PROTECTED] wrote:
 On Mon, 22 Oct 2007, Pekka Enberg wrote:
  On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote:
   Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
   Both of those set BDI_CAP_NO_WRITEBACK.  ramdisk never returned it
   if !wbc->for_reclaim.  I contend that shmem shouldn't either: it's
   a special code to get the LRU rotation right, not useful elsewhere.
   Though Documentation/filesystems/vfs.txt does imply wider use.
  
   I think this is where people use the phrase "go figure" ;)
  
  Heh. As far as I can tell, the implication of wider use was added by
  Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 ("Update some
  VFS documentation"), so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that
understanding into English, and put that in vfs.txt.
It is very possible that what I wrote didn't match the intention of
the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me.
Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
without unlocking the page, and this has precisely the effect of:
   ClearPageReclaim();  (if the call path was through pageout)
   SetPageActive();  (if the call was through shrink_page_list)
   unlock_page();

With the patch, the ->writepage method does the SetPageActive and the
unlock_page, which on the whole seems cleaner.

We seem to have lost a call to ClearPageReclaim - I don't know if that
is significant.
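
In other words, a ->writepage that declines to write would now look
something like this - a sketch of the convention being described, not
code from the patch, and cannot_write_now() is a made-up stand-in for
whatever condition the filesystem actually tests:

	static int example_writepage(struct page *page,
				     struct writeback_control *wbc)
	{
		if (wbc->for_reclaim && cannot_write_now(page)) {
			/* keep the page active and unlock it ourselves,
			 * instead of returning AOP_WRITEPAGE_ACTIVATE */
			SetPageActive(page);
			unlock_page(page);
			return 0;
		}
		/* ... normal writeout ... */
		return 0;
	}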

 
 Special, hidden, undocumented, secret hack!  Then in 2.6.7 Andrew
 stole his own secret and used it when concocting ramdisk_writepage.
 Oh, and NFS made some kind of use of it in 2.6.6 only.  Then Neil
 revealed the secret to the uninitiated in 2.6.17: now, what's the
 appropriate punishment for that?

Surely the punishment should be for writing hidden undocumented hacks
in the first place!  I vote we just make him maintainer for the whole
kernel - that will keep him so busy that he will never have a chance
to do it again :-)

 --- 2.6.24-rc1/Documentation/filesystems/vfs.txt  2007-10-24 07:15:11.0 +0100
 +++ linux/Documentation/filesystems/vfs.txt   2007-10-24 08:42:07.0 +0100
 @@ -567,9 +567,7 @@ struct address_space_operations {
    If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
    try too hard if there are problems, and may choose to write out
    other pages from the mapping if that is easier (e.g. due to
 -  internal dependencies).  If it chooses not to start writeout, it
 -  should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
 -  calling ->writepage on that page.
 +  internal dependencies).
  

It seems that the new requirement is that if the address_space
chooses not to write out the page, it should now call SetPageActive().
If that is the case, I think it should be explicit in the
documentation - please?

NeilBrown


Re: [PATCH 00/19] export operations rewrite

2007-09-27 Thread Neil Brown
On Friday September 14, [EMAIL PROTECTED] wrote:
 This patchset is a medium scale rewrite of the export operations
 interface.  The goal is to make the interface less complex, and
 easier to understand from the filesystem side, as well as preparing
 generic support for exporting of 64bit inode numbers.

These 19 patches
Acked-by: NeilBrown [EMAIL PROTECTED]

I didn't do a very thorough review this time, but I did the previous
time, and the one problem I found (module dependencies) has been
addressed - Thanks.

NeilBrown

 
 This touches all nfs exporting filesystems, and I've done testing
 on all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs,
 jfs)
 
 Compared to the last version I've fixed the white space issues that
 checkpatch.pl complained about.
 
 Note that this patch series is against mainline.  There will be some
 xfs changes landing in -mm soon that revamp lots of the code touched
 here.  They should hopefully include the first patch in the series so
 it can be simply dropped, but the xfs conversion will need some smaller
 updates.  I will send this update as soon as the xfs tree updates get
 pulled into -mm.
 


Re: [RFC] block_device_operations prototype changes

2007-08-27 Thread Neil Brown
On Monday August 27, [EMAIL PROTECTED] wrote:
   * a bug (AFAICT) in md.c - we open raid components r/w and if it
 fails, it fails.  Good for our purposes, but... how about raid0 on read-only
 devices?  In any case, we have a ready place to store mode_t, so it's not a
 problem for getting the right value to blkdev_put().  FWIW, I think that
 allowing fallback to r/o (and making the entire array r/o, of course) would
 be a good thing to have.  Any comments from drivers/md folks?

I've never heard any suggestion of anybody wanting to include a
readonly device in an md array -  I guess one could imagine an array
of CDROMs, but I doubt anyone would actually create one.
However I agree that falling back to read-only (particular if the
error value indicated that this was the only problem) would be a
sensible thing to do in some circumstances.

More interestingly:  one could argue that when the md array is
switched to read-only, each component device should be reopened as
read-only.  So I would really like the interface to have a way to
switch a device between read-only and read-write without closing and
re-opening.

Though I guess opening and then closing as you suggest below would be OK.

   * open_bdev_excl()/close_bdev_excl().  Needs an extra argument for
 the latter.  Two interesting callers:
   * kill_block_super() - need to store relevant mode_t in superblock,
 along with reference to bdev.  Note that just looking at sb->s_flags is *not*
 enough - some filesystems go read-only on errors and that changes ->s_flags.
 Another side of that is explicit remount from r/o to r/w and there we have
 a bug - we do _not_ tell the driver that something had happened.  Proper
 fix is simple enough - bdget() + blkdev_get() for write + blkdev_put() with
 old mode_t (i.e. FMODE_READ) once we are done remounting (and similar for
 explicit remount to r/o).


I would *really* like to see this.  When the root filesystem gets
mounted read-only, I want md to know so that it can mark the array as
'clean' straight away.  I really don't like having a reboot-notifier
to mark all arrays as 'clean' on shutdown.

So each device would be responsible for keeping a count of 'readonly'
and 'read-write' accesses?  Or would that be in common code?

I look forward to this all getting cleaned up!

NeilBrown


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Neil Brown

It just occurred to me:

 If i_version is 64bit, then knfsd would need to be careful when
 reading it on a 32bit host.  What are the locking rules?

 Presumably it is only updated under i_mutex protection, but having to
 get i_mutex to read it would seem a little heavy handed.

 Should it use a seqlock like i_size?
 Could we use the same seqlock that i_size uses, or would we need a
 separate one?
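
 For concreteness, I imagine the read side looking like the i_size_read()
 pattern - just a sketch, with a hypothetical i_version_seqcount field
 alongside the proposed 64bit i_version:

	static inline u64 inode_get_i_version(const struct inode *inode)
	{
	#if BITS_PER_LONG == 32
		u64 v;
		unsigned seq;

		do {
			seq = read_seqcount_begin(&inode->i_version_seqcount);
			v = inode->i_version;
		} while (read_seqcount_retry(&inode->i_version_seqcount, seq));
		return v;
	#else
		/* a 64bit load is atomic on 64bit hosts */
		return inode->i_version;
	#endif
	}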

NeilBrown


Re: vm/fs meetup details

2007-07-06 Thread Neil Brown
On Friday July 6, [EMAIL PROTECTED] wrote:
 Hi,
 
 On Fri, Jul 06, 2007 at 05:57:49PM +0200, Jörn Engel wrote:
...
  
  Interesting idea.  Is it possible to attach several address spaces to an
  inode?  That would cure some headaches.
 
 GFS2 already uses something like this, in fact by having a second inode
 to contain the second address space. Thats a bit of a hack but we can't
...
 
 So that would certainly be an issue that I'd like to discuss to see
 what can be worked out in that area,
 
 Steve.

Maybe the question here is:

  What support should common code provide for caching indexing
  metadata?

Common code already provides the page cache that is very nice for
caching file data.
Some filesystems use the blockdev page cache to cache index metadata
by physical address.  But I think that increasingly filesystems want
to cache index metadata by some sort of virtual address.
A second page cache address space would be suitable if the addresses
were dense, and would be acceptable if the blocks were page-sized (or
larger).  But for non-dense, non-page-sized blocks, a radix tree of
pages is less than ideal (I think).

My filesystem (LaFS, which is actually beginning to work thanks to
Novell's HackWeek) uses non-dense, non-page-sized blocks both for file
indexing and for directories and while I have a working solution for
each case, there is room for improvements that might fit well with
other filesystems too.

NeilBrown


Re: how do versioning filesystems take snapshot of opened files?

2007-07-03 Thread Neil Brown
On Tuesday July 3, [EMAIL PROTECTED] wrote:
 
 Getting a snapshot that is useful with respect to application data
 requires help from the application.

Certainly.

  The app needs to be shutdown or
 paused prior to the snapshot and then started up again after the
 snapshot is taken.

Alternately, the app needs to be able to cope with unexpected system
shutdown (aka crash) and the same ability will allow it to cope with
an atomic snapshot.  It may be able to recover more efficiently from
an expected shutdown, so being able to tell the app about an impending
snapshot is probably a good idea, but it should be advisory only.

NeilBrown


Re: Patent or not patent a new idea

2007-06-25 Thread Neil Brown
On Tuesday June 26, [EMAIL PROTECTED] wrote:
 Posting it here seems the best thing to do.
 
 To the inventor goes naming privilege and I'm calling this one softer raid.
 It is a form of storage raid implemented in software, as contrasted to
 software and hardware raid which are dependent on using required hardware.
 
 To create a loop filesystem is straight forward. The commands are dd,
 mkfs.*, mount -o loop. Basically what I propose is that the image file is
 copied to another harddisk (in the case of ide not on the same cable) and
 it too is mounted in conjunction of the original with cooperation. When a
 read request for a block of say 100k is made, the kernel pulls 50k from
 each disk - maybe a simple doubling of throughput.

Sounds a lot like RAIF - ask google for details.

NeilBrown


Re: Patent or not patent a new idea

2007-06-25 Thread Neil Brown
On Tuesday June 26, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  Sounds a lot like RAIF - ask google for details.
 
 I did not know about RAIF. RAIF merges separate filesystems? That is a
 good idea in itself.
 
 My idea is for driver that provides a filesystem from image files it
 controls. Because it knows these resources it uses access in tandem to
 attain performance.

???
(reads original description in more detail).

So... the filesystem images are identical in both copies, and the
interesting bit is that the image is just a file on some filesystem.

So could I implement your idea by:

  dd if=/dev/zero of=/1/bigfile count=lotsandlots
  dd if=/dev/zero of=/2/bigfile count=lotsandlots
  losetup /dev/loop1 /1/bigfile
  losetup /dev/loop2 /2/bigfile
  mdadm -C /dev/md0 --level=raid1 --raid-disks=2 /dev/loop1 /dev/loop2
  mkfs /dev/md0
  mount /dev/md0 /space

??

Why would you bother?  Well, I do it a lot for code testing, but I
would be hard pressed to make a case for that sort of config in
production.

You said:
  1) the disks need not be the same size or from the same manufacturer;
  2) the supporting code would be cross-platform.

md/raid already works happily with different sized drives from
different manufacturers (for raid1, it only uses as much space as the smaller
drive provides; for raid0 it uses it all).
I don't know what you mean by '2'.

So I still cannot see anything particularly new.  What am I missing?

NeilBrown


Re: Patent or not patent a new idea

2007-06-25 Thread Neil Brown
On Tuesday June 26, [EMAIL PROTECTED] wrote:
 
 Thanks for the brief howto there. I'll install the mdadm suite and
 experiment. It seems like a userspace driver?

mdadm is a userspace tool for managing the 'md' driver which is in the
linux kernel.

  I don't know what you mean by '2'.
 
 2 meant in the way that a driver for a PC raid card is not much use on
 a Sparc.

Ahh... That is one of the great things about Linux.  You code in C,
and the driver works on every architecture (providing you are
reasonably careful, and the arch actually supports the hardware). :-)

NeilBrown


Re: [patch 1/3] add the fsblock layer

2007-06-24 Thread Neil Brown
On Sunday June 24, [EMAIL PROTECTED] wrote:
  
 +#define PG_blocks   20  /* Page has block mappings */
 +

I've only had a very quick look, but this line looks *very* wrong.
You should be using PG_private.

There should never be any confusion about whether ->private has
buffers or blocks attached as the only routines that ever look in
->private are address_space operations  (or should be.  I think 'NULL'
is sometimes special cased, as in try_to_release_page.  It would be
good to do some preliminary work and tidy all that up).

Why do you think you need PG_blocks?

NeilBrown


Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching

2007-06-22 Thread Neil Brown
On Friday June 22, [EMAIL PROTECTED] wrote:
  
  Yes. Your use case is different than mine.
 
 My use case is being able to protect data reliably.  Yours?

Saying "protect data" is nearly meaningless without a threat model.
I bet you don't try to protect data from a direct nuclear hit, or a
court order.


AA has a fairly clear threat model.  It involves a flaw in a
program being used by an external agent to cause it to use
privileges it would not normally exercise to subvert privacy or
integrity.
I think this model matches a lot of real threats that real sysadmins
have real concerns about.  I believe that the design of AA addresses
this model quite well. 

What is your threat model?  Maybe it is just different.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Neil Brown
On Friday June 1, [EMAIL PROTECTED] wrote:
 On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
  David Chinner wrote:
  That sounds like a good idea - we can leave the existing
  WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
  behaviour that only guarantees ordering. The filesystem can then
  choose which to use where appropriate
  
  So what if you want a synchronous write, but DON'T care about the order? 
 
 submit_bio(WRITE_SYNC, bio);
 
 Already there, already used by XFS, JFS and direct I/O.

Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns.  That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.
It essentially says "Submit this request now, don't wait for more
requests to bundle with it for better bandwidth utilisation".

NeilBrown


Re: [patch 0/2] i_version update

2007-05-30 Thread Neil Brown
On Wednesday May 30, [EMAIL PROTECTED] wrote:
 On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
 
  The aim is to fulfill a NFSv4 requirement for rfc3530:
  5.5.  Mandatory Attributes - Definitions
  Name    #   DataType   Access   Description
  ___
  change  3   uint64   READ A value created by the
  server that the client can use to determine if file
  data, directory contents or attributes of the object
 
 
 File data writes are included in this list of things that need to
 increment the version field. Hence to fulfill the crash requirement,
 that implies server data writes either need to be synchronous or
 journalled...

I think that would be excessive.

The important property of the 'change' number is:
 If the 'change' number is still the same, then the data and 
 metadata etc is still the same.

The converse of this (if the data+metadata is still the same then the
'change' is still the same) is valuable but less critical.   Having
the 'change' change when it doesn't need to will defeat client-side
caching and so will reduce performance but not correctness.

So after a crash, I think it is perfectly acceptable to assign a
change number that is known to be different to anything previously
supplied if there is any doubt about recent change history.

e.g. suppose we had a filesystem with 1-second resolution mtime, and
an in-memory 'change' counter that was incremented on every change.
When we load an inode from storage, we initialise the counter to
   -1: if the mtime is earlier than current_seconds
   current_nanoseconds:  if the mtime is equal to current_seconds.

We arrange that when the ctime changes, the change number is reset to 0.

Then when the 'change' number of an inode is required, we use the
bottom 32bits of the 'change' counter and the 32bits of the mtime.

This will provide a change number that normally changes only when the
file changes and doesn't require any extra storage on disk.
The change number will change inappropriately only when the inode has
fallen out of cache and is being reloaded, which is either after a crash
(hopefully rare) or when a file hasn't been used for a while, implying
that it is unlikely that any client has it in cache.
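
i.e. something like (purely illustrative - the function name is made up,
it just combines the two 32bit values as described above):

	/* 64bit change attribute from the 32bit mtime and the bottom
	 * 32 bits of the in-memory change counter. */
	static u64 make_change_attr(u32 mtime_sec, u32 change_counter)
	{
		return ((u64)mtime_sec << 32) | change_counter;
	}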

So in summary: I think it is impossible to have a change number that
changes *only* when content changes (as your 'crash' example suggests)
and it is quite achievable to have a change number that changes rarely
when the content doesn't change.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Tuesday May 29, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
   md/dm modules could keep count of requests as has been suggested
   (though that would be a fairly big change for raid0 as it currently
   doesn't know when a request completes - bi_endio goes directly to the
   filesystem). 
 
 Are you sure?  I believe that dm handles bi_endio because it waits for 
 all in progress bio to complete before switching tables.

I was talking about md/raid0, not dm-stripe.
md/raid0 (and md/linear) currently never know that a request has
completed.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 There are two things I'm not sure you covered.
 
 First, disks which don't support flush but do have a cache dirty 
 status bit you can poll at times like shutdown. If there are no drivers 
 which support these, it can be ignored.

There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
That wouldn't be very good for performance  maybe you just
wouldn't bother with barriers on that sort of device?

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have -o nobarriers or -o barriers=0,
or the inverse.
md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

 
 Second, NAS (including nbd?). Is there enough information to handle this 
 really right?

NAS means lots of things, including NFS and CIFS where this doesn't
apply.
For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
For 'iscsi', I guess it works just the same as SCSI...

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote:
  What exactly do you want to know, and why do you care?
 
 If someone explicitly mounts -o barrier and the underlying device
 cannot do it, then we want to issue a warning or reject the
 mount.

I guess that makes sense.
But apparently you cannot tell what a device supports until you write
to it.
So maybe you need to write some metadata as a barrier, then ask
the device what its barrier status is.  The options might be:
  YES - barriers are fully handled
  NO  - best effort, but due to missing device features, it might not
work
  DISABLED - admin has requested that barriers be ignored.

??
 
 
  The idea is that every struct block_device supports barriers.  If the
  underlying hardware doesn't support them directly, then they get
  simulated by draining the queue and issuing a flush.
 
 Ok. But you also seem to be implying that there will be devices that
 cannot support barriers.

It seems there will always be hardware that doesn't meet specs.  If a
device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing
barriers all the way to the media would be hard..

 
 Even if all devices do eventually support barriers, it may take some
 time before we reach that goal.  Why not start by making it easy to
 determine what the capabilities of each device are. This can then be
 removed once we reach the holy grail

I'd rather not add something that we plan to remove.  We currently
have -EOPNOTSUP.  I don't think there is much point having more than
that.

I would really like to get to the stage where -EOPNOTSUP is never
returned.  If a filesystem cares, it could 'ask' as suggested above.
What would be a good interface for asking?
What if the truth changes (as can happen with md or dm)?

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 Neil Brown writes:
   
 
 [...]
 
   Thus the general sequence might be:
   
 a/ issue all preceding writes.
 b/ issue the commit write with BIO_RW_BARRIER
 c/ wait for the commit to complete.
If it was successful - done.
If it failed other than with EOPNOTSUPP, abort
else continue
 d/ wait for all 'preceding writes' to complete
 e/ call blkdev_issue_flush
 f/ issue commit write without BIO_RW_BARRIER
 g/ wait for commit write to complete
  if it failed, abort
  h/ call blkdev_issue_flush
 DONE
   
   steps b and c can be left out if it is known that the device does not
    support barriers.  The only way to discover this is to try and see if it
   fails.
   
   I don't think any filesystem follows all these steps.
 
 It seems that steps b/ -- h/ are quite generic, and can be implemented
 once in a generic code (with some synchronization mechanism like
 wait-queue at d/).

Yes and no.
It depends on what you mean by "preceding write".

If you implement this in the filesystem, the filesystem can wait only
for those writes where it has an ordering dependency.   If you
implement it in common code, then you have to wait for all writes
that were previously issued.

e.g.
  If you have two different filesystems on two different partitions on
  the one device, why should writes in one filesystem wait for a
  barrier issued in the other filesystem?
  If you have a single filesystem with one thread doing lots of
  over-writes (no metadata changes) and another doing lots of
  metadata changes (requiring journalling and barriers), why should the
  data write be held up by the metadata updates?

So I'm not actually convinced that doing this in common code is the
best approach.  But it is the easiest.  The common code should provide
the barrier and flushing primitives, and the filesystem gets to use
them however it likes.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Neil Brown

Thanks everyone for your input.  There were some very valuable
observations in the various emails.
I will try to pull most of it together and bring out what seem to be
the important points.


1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.

 This is certainly a very attractive position - it makes the interface
 cleaner and makes life easier for filesystems and other clients of
 the block interface.
 Currently filesystems handle -EOPNOTSUP by
  a/ resubmitting the request without the BARRIER (after waiting for
earlier requests to complete) and
  b/ possibly printing an error message to the kernel logs.

 The block layer can do both of these just as easily and it does make
 sense to do it there.

 md/dm modules could keep count of requests as has been suggested
 (though that would be a fairly big change for raid0 as it currently
 doesn't know when a request completes - bi_endio goes directly to the
 filesystem). 
 However I think the idea of a zero-length BIO_RW_BARRIER would be a
 good option.  raid0 could send one of these down each device, and
 when they all return, the barrier request can be sent to its target
 device(s).

 I think this is a worthy goal that we should work towards.

2/ Maybe barriers provide stronger semantics than are required.

 All write requests are synchronised around a barrier write.  This is
 often more than is required and apparently can cause a measurable
 slowdown.

 Also the FUA for the actual commit write might not be needed.  It is
 important for consistency that the preceding writes are in safe
 storage before the commit write, but it is not so important that the
 commit write is immediately safe on storage.  That isn't needed until
 a 'sync' or 'fsync' or similar.

 One possible alternative is:
   - writes can overtake barriers, but barriers cannot overtake writes.
   - flush before the barrier, not after.

 This is considerably weaker, and hence cheaper. But I think it is
 enough for all filesystems (providing it is still an option to call
 blkdev_issue_flush on 'fsync').

 Another alternative would be to tag each bio as being in a
 particular barrier-group.  Then bios in different groups could
 overtake each other in either direction, but a BARRIER request must
 be totally ordered w.r.t. other requests in the barrier group.
 This would require an extra bio field, and would give the filesystem
 more appearance of control.  I'm not yet sure how much it would
 really help...
 It would allow us to set FUA on all bios with a non-zero
 barrier-group.  That would mean we don't have to flush the entire
 cache, just those blocks that are critical... but I'm still not sure
 it's a good idea.

 Of course, these weaker rules would only apply inside the elevator.
 Once the request goes to the device we need to work with what the
 device provides, which probably means total-ordering around the
 barrier. 

 I think this requires more discussion before a way forward is clear.

3/ Do we need explicit control of the 'ordered' mode?

  Consider a SCSI device that has NV RAM cache.  mode_sense reports
  that write-back is enabled, so _FUA or _FLUSH will be used.
  But as it is *NV* ram, QUEUE_ORDERED_DRAIN is really the best mode.
  But it seems there is no way to query this information.
  Using _FLUSH causes the NVRAM to be flushed to media which is a
  terrible performance problem.
  Setting SYNC_NV doesn't work on the particular device in question.
  We currently tell customers to mount with -o nobarriers, but that
  really feels like the wrong solution.  We should be telling the scsi
  device "don't flush".
  An advantage of 'nobarriers' is it can go in /etc/fstab.  Where
  would you record that a SCSI drive should be set to
  QUEUE_ORDERED_DRAIN ??


I think the implementation priorities here are:

1/ implement a zero-length BIO_RW_BARRIER option.
2/ Use it (or otherwise) to make all dm and md modules handle
   barriers (and loop?).
3/ Devise and implement appropriate fall-backs with-in the block layer
   so that  -EOPNOTSUP is never returned.
4/ Remove unneeded cruft from filesystems (and elsewhere).

Comments?

Thanks,
NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Neil Brown
On Friday May 25, [EMAIL PROTECTED] wrote:
 2007/5/25, Neil Brown [EMAIL PROTECTED]:
   - Are there other bit that we could handle better?
  BIO_RW_FAILFAST?  BIO_RW_SYNC?  What exactly do they mean?
 
 BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no)
 error recovery. Mainly used by multipath targets to avoid long SCSI
 recovery. This should just be propagated when passing requests on.

Is it "much" or "no"?
Would it be reasonable to use this for reads from a non-degraded
raid1?  What about writes?

What I would really like is some clarification on what sort of errors
get retried, how often, and how much timeout there is..

And does the 'error' code returned in ->bi_end_io allow us to
differentiate media errors from other errors yet?

 
 BIO_RW_SYNC: means this is a bio of a synchronous request. I don't
 know whether there are more uses to it but this at least causes queues
 to be flushed immediately instead of waiting for more requests for a
 short time. Should also just be passed on. Otherwise performance gets
 poor since something above will rather wait for the current
 request/bio to complete instead of sending more.

Yes, this one is pretty straightforward... I mentioned it more as a
reminder to myself that I really should support it in raid5 :-(

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
  
   Thanks everyone for your input.  There were some very valuable
  observations in the various emails.
  I will try to pull most of it together and bring out what seem to be
  the important points.
  
  
  1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.
 
 Sounds good to me, but how do we test to see if the underlying
 device supports barriers? Do we just assume that they do and
 only change behaviour if -o nobarrier is specified in the mount
 options?
 

What exactly do you want to know, and why do you care?

The idea is that every struct block_device supports barriers.  If the
underlying hardware doesn't support them directly, then they get
simulated by draining the queue and issuing a flush.
Theoretically there could be devices which have a write-back cache
that cannot be flushed, and you couldn't implement barriers on such a
device.  So throw it out and buy another?

As far as I can tell, the only thing XFS does differently with devices
that don't support barriers is that it prints a warning message to the
kernel logs.  If the underlying device printed the message when it
detected that barriers couldn't be supported, XFS wouldn't need to
care at all.

NeilBrown


[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Neil Brown

This mail is about an issue that has been of concern to me for quite a
while and I think it is (well past) time to air it more widely and try
to come to a resolution.

This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.

The following is my understanding, which could well be wrong in
various specifics.  Corrections and other comments are more than
welcome.



What are barriers?
==
Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests
are visible on the device.

This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.

Alongside BIO_RW_BARRIER is blkdev_issue_flush which calls
q->issue_flush_fn.   This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.

Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).

We can think of there being three types of devices:
 
1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
  there is it is non-volatile.  Once a write completes it is 
  completely safe.  Such a device does not require barriers
  or -issue_flush_fn, and can respond to them either by a
  no-op or with -EOPNOTSUPP (the former is preferred).

2/ FLUSHABLE.
  A FLUSHABLE device may have a volatile write-behind cache.
  This cache can be flushed with a call to blkdev_issue_flush.
  It may not support barrier requests.

3/ BARRIER.
  A BARRIER device supports both blkdev_issue_flush and
  BIO_RW_BARRIER.  Either may be used to synchronise any
  write-behind cache to non-volatile storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device.  The BARRIER device has the option of more
efficient handling.

How does a filesystem use this?
===

A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the dirty bit clear.  This write commits
all other writes to the filesystem that precede it.

More subtle/useful is the commit block in a journal as with ext3 and
others.  This write commits some number of preceding writes in the
journal or elsewhere.

The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block.  There are two ways to achieve this.

1/  Issue all 'preceding writes', wait for them to complete (bi_endio
   called), call blkdev_issue_flush, issue the commit write, wait
   for it to complete, call blkdev_issue_flush a second time.
   (This is needed for FLUSHABLE)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit
block.
   (This is more efficient on BARRIER).

The second, while much easier, can fail.  So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:

  a/ issue all preceding writes.
  b/ issue the commit write with BIO_RW_BARRIER
  c/ wait for the commit to complete.
 If it was successful - done.
 If it failed other than with EOPNOTSUPP, abort
 else continue
  d/ wait for all 'preceding writes' to complete
  e/ call blkdev_issue_flush
  f/ issue commit write without BIO_RW_BARRIER
  g/ wait for commit write to complete
   if it failed, abort
  h/ call blkdev_issue_flush
  DONE

steps b and c can be left out if it is known that the device does not
support barriers.  The only way to discover this is to try and see if it
fails.

I don't think any filesystem follows all these steps.
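
In rough C, the fallback shape is something like this - blkdev_issue_flush
is the real interface, while journal_submit_commit() and
wait_for_preceding_writes() are made-up stand-ins for whatever the
filesystem's journalling layer actually provides:

	static int commit_with_fallback(struct block_device *bdev)
	{
		int err;

		/* a: preceding writes have already been issued */

		/* b,c: try the commit write as a barrier */
		err = journal_submit_commit(1 << BIO_RW_BARRIER);
		if (err != -EOPNOTSUPP)
			return err;		/* done, or a real failure */

		/* d: barrier not supported - wait for preceding writes */
		wait_for_preceding_writes();

		/* e: flush any write-behind cache */
		blkdev_issue_flush(bdev, NULL);

		/* f,g: re-issue the commit as a plain write */
		err = journal_submit_commit(0);
		if (err)
			return err;

		/* h: flush again so the commit itself reaches the media */
		return blkdev_issue_flush(bdev, NULL);
	}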

 ext3 has the right structure, but it doesn't include steps e and h.
 reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
  that is only on the fsync path, so it isn't really protecting
  general journal commits.
 XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
    depending on whether it thinks the device handles barriers,
   and finally 'g'.

 I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK, but for devices that need flushing, but
don't support BIO_RW_BARRIER, none of them work.  This should be easy
to fix.



RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
 XOR it (0^0=1), and hence fills up the host disk.

Uhmm... you need to check your maths.

$ perl -e 'printf "%d\n", 0^0;'
0

:-)
NeilBrown


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-17 Thread Neil Brown
On Friday May 18, [EMAIL PROTECTED] wrote:
  Fix confirmed, filled the whole 11T hard disk, without crashing.
 I presume this would go into 2.6.22

Yes, and probably 2.6.21.y, though the patch will be slightly
different, see below.
 
 Thanks again.

And thank-you for pursuing this with me.

NeilBrown


---
Avoid overflow in raid0 calculation with large components.

If a raid0 has a component device larger than 4TB, and is accessed on
a 32bit machine, then as 'chunk' is unsigned long,
   chunk << chunksize_bits
can overflow (this can be as high as the size of the device in KB).
chunk itself will not overflow (without triggering a BUG).

So change 'chunk' to be sector_t, and get rid of the BUG_ON as it becomes
impossible to hit.

Cc: Jeff Zheng [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 16:14:12.0 +1000
@@ -415,7 +415,7 @@ static int raid0_make_request (request_q
raid0_conf_t *conf = mddev_to_conf(mddev);
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;
-   unsigned long chunk;
+   sector_t chunk;
sector_t block, rsect;
const int rw = bio_data_dir(bio);
 
@@ -470,7 +470,6 @@ static int raid0_make_request (request_q
 
	sector_div(x, zone->nb_dev);
	chunk = x;
-	BUG_ON(x != (sector_t)chunk);
 
	x = block >> chunksize_bits;
	tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
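
(For anyone wondering why the unsigned long version overflowed on 32bit,
a small user-space illustration - not part of the patch, and the numbers
are illustrative only, not the driver's exact arithmetic:)

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint32_t chunk32 = 90000000;	/* a large chunk index */
		uint64_t chunk64 = chunk32;
		int chunksize_bits = 7;

		/* the 32bit shift wraps; the 64bit shift does not */
		printf("32bit: %u\n",
		       (unsigned)(chunk32 << chunksize_bits));
		printf("64bit: %llu\n",
		       (unsigned long long)(chunk64 << chunksize_bits));
		return 0;
	}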


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
 Any ideas what are the minimum export operation(s) that cifs would need to 
 add to export under nfsd?  It was not clear to me after reading the 
 Exporting document in Documentation directory.

You need to be able to map a dentry to a filehandle (you get about 20
bytes) and back again.
If CIFS provides some fixed-length identifier for files, then you might
be able to do it.  If not, you cannot really do it at all.  And I suspect
the latter.  (There are other requirements, like get_parent, but we
could probably work around those if we really needed to).

Theoretically, you could make it work with NFSv4 and volatile file
handles, but I doubt it would really work in practice.  I don't think
the volatile concept quite stretches as far as you would need.

Probably the best way to nfs-export a CIFS filesystem is to use the
user-space nfs server.  It caches recently used filenames and uses a
filehandle which is a hash of the name.  It works on a best-effort
basis, and if, for example, the server restarts, you will lose
connections to open files.  While it is not perfect, it can be very
useful in some situations.
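
The "hash of the name" idea is nothing deeper than, say, the following -
a toy illustration only, not the real algorithm that server uses:

	#include <stdint.h>
	#include <stdio.h>

	/* djb2 string hash: a fixed-size handle derived from the path */
	static uint32_t fh_hash(const char *path)
	{
		uint32_t h = 5381;

		while (*path)
			h = h * 33 + (unsigned char)*path++;
		return h;
	}

	int main(void)
	{
		printf("%08x\n", fh_hash("/export/data/file"));
		return 0;
	}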

NeilBrown


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
 Here is the information of the created raid0. Hope it is enough.

Thanks.
Everything looks fine here.

The only difference of any significance between the working and
non-working configurations is that in the non-working, the component
devices are larger than 2Gig, and hence have sector offsets greater
than 32 bits.

This does cause a slightly different code path in one place, but I
cannot see it making a difference.  But maybe it does.

What architecture is this running on?
What C compiler are you using?

Can you try with this patch?  It is the only thing that I can find
that could conceivably go wrong.

Thanks,
NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c |1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 10:34:02.0 +1000
@@ -461,6 +461,7 @@ static int raid0_make_request (request_q
  
         while (block >= (zone->zone_offset + zone->size)) 
                 zone++;
+        BUG_ON(block < zone->zone_offset);
 
         sect_in_chunk = bio->bi_sector & ((chunk_size<<1) -1);
 


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
 
  The only difference of any significance between the working 
  and non-working configurations is that in the non-working, 
  the component devices are larger than 2Gig, and hence have 
  sector offsets greater than 32 bits.
 
 Do u mean 2T here?, but in both configurations, the component devices are
 larger than 2T (2.25T & 5.5T).

Yes, I meant 2T, and yes, the components are always over 2T.  So I'm
at a complete loss.  The raid0 code follows the same paths and does
the same things and uses 64bit arithmetic where needed.

So I have no idea how there could be a difference between these two
cases.  

I'm at a loss...

NeilBrown


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
  If CIFS provides some fix-length identifier for files, then 
  you might be able to do it
 
 Most CIFS servers (Windows on NTFS, Samba etc.) can return a unique 
 identifier (a 64 bit inode number), in conjunction with the volume id, 
 that is probably good enough ... right?  This can be returned on various 
 calls (level 0x03EE file_internal_info - returns only this number).  If 
 reverse lookup is required - ie given a unique identifier what is its 
 path name - there are probably a few different ways to handle this but 
 presumably local filesystems run into the same issue.

Yes, that unique identifier sounds like it would be suitable to put
in the filehandle.
But reverse lookup is definitely required.
Providing you can turn this into a 'struct inode *' in the filesystem,
the code in exportfs/ can help turn that into a fully connected
dentry.
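
As a rough sketch only (cifs has no such code; cifs_iget_by_id is invented
here, and the function just follows the fid-based fh_to_dentry shape from
the export operations rewrite), the decode side could look something like:

/* Hypothetical: rebuild the 64 bit server file id from two 32 bit words
 * of the filehandle and let the filesystem look the inode up again. */
static struct dentry *cifs_fh_to_dentry(struct super_block *sb,
                                        struct fid *fid, int fh_len,
                                        int fh_type)
{
        struct inode *inode;
        struct dentry *dentry;
        u64 id;

        if (fh_len < 2)
                return NULL;
        id = ((u64)fid->raw[0] << 32) | fid->raw[1];

        inode = cifs_iget_by_id(sb, id);        /* made-up helper */
        if (IS_ERR(inode))
                return ERR_PTR(PTR_ERR(inode));

        dentry = d_alloc_anon(inode);           /* exportfs connects it up */
        if (!dentry) {
                iput(inode);
                return ERR_PTR(-ENOMEM);
        }
        return dentry;
}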

NeilBrown


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
 I tried the patch, same problem show up, but no bug_on report
 
 Is there any other things I can do?
 

What is the nature of the corruption?  Is it data in a file that is
wrong when you read it back, or does the filesystem metadata get
corrupted?

Can you try the configuration that works, and sha1sum the files after
you have written them to make sure that they really are correct?
My thought here is maybe there is a bad block on one device, and the
block is used for data in the 'working' config, and for metadata in
the 'broken' config.

Can you try a degraded raid10 configuration?  e.g.

   mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \
   /dev/second missing

That will lay out the data in exactly the same place as with raid0,
but will use totally different code paths to access it.  If you still
get a problem, then it isn't in the raid0 code.

Maybe try version 1 metadata (mdadm --metadata=1).  I doubt that would
make a difference, but as I am grasping at straws already, it may be a
straw worth trying.

NeilBrown


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
 On Thu, 17 May 2007, Neil Brown wrote:
 
  On Thursday May 17, [EMAIL PROTECTED] wrote:
 
  The only difference of any significance between the working
  and non-working configurations is that in the non-working,
  the component devices are larger than 2Gig, and hence have
  sector offsets greater than 32 bits.
 
  Do u mean 2T here?, but in both configurations, the component devices are
  larger than 2T (2.25T & 5.5T).
 
  Yes, I meant 2T, and yes, the components are always over 2T.
 
 2T decimal or 2T binary?
 

Either.  The smallest was actually 2.75T (typo above).
Precisely it was
  2929641472  kilobytes
or 
  5859282944 sectors
or 
  0x15D3D9000 sectors.

So it is over 32bits already...

Uhm, I just noticed something.
'chunk' is unsigned long, and when it gets shifted up, we might lose
bits.  That could still happen with the 4*2.75T arrangement, but is
much more likely in the 2*5.5T arrangement.

Jeff, can you try this patch?

Thanks.
NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 15:02:15.0 +1000
@@ -475,7 +475,7 @@ static int raid0_make_request (request_q
                 x = block >> chunksize_bits;
                 tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
         }
-        rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
+        rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
                 + sect_in_chunk;
  
         bio->bi_bdev = tmp_dev->bdev;


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
 
 Uhm, I just noticed something.
 'chunk' is unsigned long, and when it gets shifted up, we might lose
 bits.  That could still happen with the 4*2.75T arrangement, but is
 much more likely in the 2*5.5T arrangement.

Actually, it cannot be a problem with the 4*2.75T arrangement.
  chunk << chunksize_bits

will not exceed the size of the underlying device *in kilobytes*.
In that case that is 0xAE9EC800 which will fit in a 32bit long.
We don't double it to make sectors until after we add
zone->dev_offset, which is sector_t and so 64bit arithmetic is used.

So I'm quite certain this bug will cause exactly the problems
experienced!!
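
As a quick sanity check on the sizes (user-space arithmetic only; the 5.5T
component is taken as twice the 2.75T one):

#include <stdio.h>

int main(void)
{
        unsigned long long dev_275t = 2929641472ULL;    /* 2.75T component, in KB */
        unsigned long long dev_55t  = 2 * dev_275t;     /* ~5.5T component, in KB */

        /* chunk << chunksize_bits is bounded by the component size in KB,
         * so only the larger component can overflow a 32bit unsigned long. */
        printf("2.75T: %#llx  fits in 32 bits: %s\n",
               dev_275t, dev_275t <= 0xffffffffULL ? "yes" : "no");
        printf("5.5T : %#llx  fits in 32 bits: %s\n",
               dev_55t, dev_55t <= 0xffffffffULL ? "yes" : "no");
        return 0;
}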

 
 Jeff, can you try this patch?

Don't bother about the other tests I mentioned, just try this one.
Thanks.

NeilBrown

 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 ### Diffstat output
  ./drivers/md/raid0.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
 --- .prev/drivers/md/raid0.c  2007-05-17 10:33:30.0 +1000
 +++ ./drivers/md/raid0.c  2007-05-17 15:02:15.0 +1000
 @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
                  x = block >> chunksize_bits;
                  tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
          }
 -        rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
 +        rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
                  + sect_in_chunk;
   
          bio->bi_bdev = tmp_dev->bdev;


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-15 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
 
  Anybody have a clue?
 

No...
When a raid0 array is assembled, quite a lot of messages get printed
about the number of zones and hash_spacing etc.  Can you collect and post
those?  Both for the failing case (2*5.5T) and the working case
(4*2.55T) if possible.

NeilBrown


Re: [NFS] [PATCH 0/18] export operations rewrite

2007-05-13 Thread Neil Brown
On Sunday May 13, [EMAIL PROTECTED] wrote:
 On Mon, May 07, 2007 at 08:51:00AM +0100, Christoph Hellwig wrote:
  On Mon, May 07, 2007 at 10:06:10AM +1000, Neil Brown wrote:
   On Monday May 7, [EMAIL PROTECTED] wrote:

Would you be able to respin that second patch series with one of those
changes?
   
   Of course it is actually the first series of patches that introduces
   this problem.  So maybe just a full respin??
  
  I'll take a look at how to solve this issues.
 
 I'll take some more time until I can look at this.  Can you send of the
 first patch series to Linus in the meantime?

Oh.. the problem is in the second series, isn't it?  I had somehow
convinced myself it was in the first.
And the merge window has now closed, but I'll submit it to -mm
shortly.
Sorry 'bout that.

NeilBrown


Re: [NFS] [PATCH 0/18] export operations rewrite

2007-05-06 Thread Neil Brown
On Friday May 4, [EMAIL PROTECTED] wrote:
 On Tuesday May 1, [EMAIL PROTECTED] wrote:
   - the calling convention on the decode side where we first call
 ->decode_fh to split the filehandle into two blobs only to
 recurse back into exportfs and then recurse back into the filesystem
 seems rather odd.  By having two methods to get the dentry and
 parent directly from the on the wire file handle this big callstack
 collapses to a very simple one.
 
 This is the bit I was particularly missing.  I see now how this aspect
 was awkward before, and how your changes make the flow clearer.
 
 Getting rid of s_export_op->find_exported_dentry is something I'm very
 happy about.  There was actually a bug lurking there that I never got
 around to fixing.  The code:
 -        /* Ok, we can export it */;
 -        if (!inode->i_sb->s_export_op->find_exported_dentry)
 -                inode->i_sb->s_export_op->find_exported_dentry =
 -                        find_exported_dentry;
 
 assumes that the address of find_exported_dentry will never change.
 But if you unload nfsd.ko, then re-load it, it could be different, yet a
 filesystem could still have the old value in its s_export_op.  That
 would be bad.
 I'm glad it is gone.

...but now I remember why it was there...

I didn't want filesystems to depend on exportfs.ko.
nfsd depends on exportfs, but currently filesystems don't.  They
can be compiled into the kernel without including exportfs.

With your patches in place, I got a compile error with a config in
which nfsd is a module (and so exportfs is) but ext2 and ext3 are
compiled in.
The error was that generic_fh_to_dentry and generic_fh_to_parent were
not defined.

Now those two functions together with exportfs_d_alloc are fairly
small and could go directly in fs/something.c.  They could almost be
inlined in linux/exportfs.h.

Would you be able to respin that second patch series with one of those
changes?

Thanks,
NeilBrown


Re: [NFS] [PATCH 0/18] export operations rewrite

2007-05-06 Thread Neil Brown
On Monday May 7, [EMAIL PROTECTED] wrote:
 
 Would you be able to respin that second patch series with one of those
 changes?

Of course it is actually the first series of patches that introduces
this problem.  So maybe just a full respin??

Thanks,
NeilBrown


Re: [PATCH 1/1] nfsctl: Use vfs_path_lookup

2007-05-06 Thread Neil Brown
On Sunday May 6, [EMAIL PROTECTED] wrote:
 use vfs_path_lookup instead of open-coding the necessary functionality.
 
 Signed-off-by: Josef 'Jeff' Sipek [EMAIL PROTECTED]

Acked-by: NeilBrown [EMAIL PROTECTED]

Thanks,
NeilBrown

 ---
  fs/nfsctl.c |   16 ++--
  1 files changed, 6 insertions(+), 10 deletions(-)
 
 diff --git a/fs/nfsctl.c b/fs/nfsctl.c
 index c043136..51f1b31 100644
 --- a/fs/nfsctl.c
 +++ b/fs/nfsctl.c
 @@ -23,19 +23,15 @@
  static struct file *do_open(char *name, int flags)
  {
   struct nameidata nd;
 + struct vfsmount *mnt;
   int error;
  
 -        nd.mnt = do_kern_mount("nfsd", 0, "nfsd", NULL);
 +        mnt = do_kern_mount("nfsd", 0, "nfsd", NULL);
 +        if (IS_ERR(mnt))
 +                return (struct file *)mnt;
  
 -        if (IS_ERR(nd.mnt))
 -                return (struct file *)nd.mnt;
 -
 -        nd.dentry = dget(nd.mnt->mnt_root);
 -        nd.last_type = LAST_ROOT;
 -        nd.flags = 0;
 -        nd.depth = 0;
 -
 -        error = path_walk(name, &nd);
 +        error = vfs_path_lookup(mnt->mnt_root, mnt, name, 0, &nd);
 +        mntput(mnt);        /* drop do_kern_mount reference */
   if (error)
   return ERR_PTR(error);
  
 -- 
 1.5.2.rc1.20.g86b9


Re: [NFS] [PATCH 0/18] export operations rewrite

2007-05-04 Thread Neil Brown
On Tuesday May 1, [EMAIL PROTECTED] wrote:
 On Fri, Mar 30, 2007 at 01:34:53PM +1000, Neil Brown wrote:
  On Saturday March 17, [EMAIL PROTECTED] wrote:
  
 less than 2 weeks later
 
 more than one month later :)
 

Thanks for your explanations.

  - the calling convention on the decode side where we first call
->decode_fh to split the filehandle into two blobs only to
recurse back into exportfs and then recurse back into the filesystem
seems rather odd.  By having two methods to get the dentry and
parent directly from the on the wire file handle this big callstack
collapses to a very simple one.

This is the bit I was particularly missing.  I see now how this aspect
was awkward before, and how your changes make the flow clearer.

Getting rid of s_export_op->find_exported_dentry is something I'm very
happy about.  There was actually a bug lurking there that I never got
around to fixing.  The code:
-        /* Ok, we can export it */;
-        if (!inode->i_sb->s_export_op->find_exported_dentry)
-                inode->i_sb->s_export_op->find_exported_dentry =
-                        find_exported_dentry;

assumes that the address of find_exported_dentry will never change.
But if you unload nfsd.ko, then re-load it, it could be different, yet a
filesystem could still have the old value in its s_export_op.  That
would be bad.
I'm glad it is gone.

Thanks,
NeilBrown


Re: [patch 06/44] mm: trim more holes

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 
 If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
 we may have failed the write operation despite prepare_write having
 instantiated blocks past i_size. Fix this, and consolidate the trimming into
 one place.
 
..
 @@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
   cur_iov, iov_offset, bytes);
   flush_dcache_page(page);
   status = a_ops->commit_write(file, page, offset, offset+bytes);
 - if (status == AOP_TRUNCATED_PAGE) {
 - page_cache_release(page);
 - continue;
  + if (unlikely(status < 0))
 + goto fs_write_aop_error;
 + if (unlikely(copied != bytes)) {
 + status = -EFAULT;
 + goto fs_write_aop_error;
   }

It isn't clear to me that you are handling the case
   status == AOP_TRUNCATED_PAGE
here.  AOP_TRUNCATED_PAGE is > 0 (0x80001 to be precise)

Maybe ->commit_write cannot return AOP_TRUNCATED_PAGE.  If that is
true, then a comment to that effect (i.e. that the old code was wrong)
in the change log might ease review.

Or did I miss something?
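
If it can, then something along these lines would be needed to keep the
old retry behaviour (just a sketch, mirroring the code quoted above):

                status = a_ops->commit_write(file, page, offset, offset+bytes);
                if (status == AOP_TRUNCATED_PAGE) {     /* positive: 0x80001 */
                        page_cache_release(page);
                        continue;                       /* retry, as the old code did */
                }
                if (unlikely(status < 0))               /* real errors only */
                        goto fs_write_aop_error;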

Thanks,
NeilBrown


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 +  write_begin: This is intended as a replacement for prepare_write. Called
 +by the generic buffered write code to ask the filesystem to prepare
 +to write len bytes at the given offset in the file. flags is a field
 +for AOP_FLAG_xxx flags, described in include/linux/mm.h.

Putting "This is intended as a replacement..." there seems a bit
dangerous.  It could well accidentally remain when the documentation
for prepare_write gets removed.  I would make it a separate paragraph
and flesh it out.  And include text from prepare_write before that
gets removed.

   write_begin:
 This is intended as a replacement for prepare_write.  The key 
 differences being that:
- it returns a locked page (in *pagep) rather than
  being given a pre-locked page;
- it can pass arbitrary state to write_end rather than
  having to hide stuff in some filesystem-internal
  data structure 
- The (largely undocumented) flags option.

 Called by the generic buffered write code to ask an
 address_space to prepare to write len bytes at the given
 offset in the file.

 The address_space should check that the write will be able to
 complete, by allocating space if necessary and doing any other
 internal housekeeping.  If the write will update parts of any
 basic-blocks on storage, then those blocks should be pre-read
 (if they haven't been read already) so that the updated blocks
 can be written out properly.
 The possible flags are listed in include/linux/fs.h (not
 mm.h) and include
AOP_FLAG_UNINTERRUPTIBLE:
It is unclear how this should be used.  No
current code handles it.

(together with the rest...)
 +
 +The filesystem must return the locked pagecache page for the caller
 +to write into.
 +
 +A void * may be returned in fsdata, which then gets passed into
 +write_end.
 +
 +Returns < 0 on failure, in which case all cleanup must be done and
 +write_end not called. 0 on success, in which case write_end must
 +be called.


As you are not including perform_write in the current patchset, maybe
it is best not to include the documentation yet either?

 +  perform_write: This is a single-call, bulk version of write_begin/write_end
 +operations. It is only used in the buffered write path (write_begin
 +must still be implemented), and not for in-kernel writes to 
 pagecache.
 +It takes an iov_iter structure, which provides a descriptor for the
 +source data (and has associated iov_iter_xxx helpers to operate on
 +that data). There are also file, mapping, and pos arguments, which
 +specify the destination of the data.
 +
 +Returns < 0 on failure if nothing was written out, otherwise returns
 +the number of bytes copied into pagecache.
 +
 +fs/libfs.c provides a reasonable template to start with, 
 demonstrating
 +iov_iter routines, and iteration over the destination pagecache.
 +

NeilBrown


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 
 BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
 an initial read or other sequence they might be using to handle the
 case of a short write. ecryptfs uses it, others can too.
 
 For buffered writes, this doesn't get passed in (unless they are
 coming from kernel space), so I was debating whether to have it at
 all.  However, in the previous API, _nobody_ had to worry about
 short writes, so this flag means I avoid making an API that is
 fundamentally less performant in some situations.

Ahhh I think I get it now.

  In general, the address_space must cope with the possibility that
  fewer than the expected number of bytes is copied.  This may leave
  parts of the page with invalid data.  This can be handled by
  pre-loading the page with valid data, however this may cause a
  significant performance cost.
  The write_begin/write_end interface provides two mechanisms by which
  this case can be handled more efficiently.
  1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
If that is set, inefficient preparation can be avoided.  However the
most common write paths will never set this flag.
  2/ The return from write_end can declare that fewer bytes have been
accepted. e.g. part of the page may have been loaded from backing
store, overwriting some of the newly written bytes.  If this
return value is reduced, a new write_begin/write_end cycle
may be called to attempt to write the bytes again.
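
For illustration, the generic caller is then expected to look roughly like
this (a sketch based on the description above, not the final merged code;
'i' is a struct iov_iter over the source data):

        do {
                struct page *page;
                void *fsdata;
                unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
                size_t bytes = min((size_t)(PAGE_CACHE_SIZE - offset),
                                   iov_iter_count(i));
                size_t copied;

                /* ordinary user writes: no AOP_FLAG_UNINTERRUPTIBLE */
                status = a_ops->write_begin(file, mapping, pos, bytes, 0,
                                            &page, &fsdata);
                if (status)
                        break;

                copied = iov_iter_copy_from_user(page, i, offset, bytes);
                flush_dcache_page(page);

                /* write_end says how many bytes it accepted; may be short */
                status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                          page, fsdata);
                if (status < 0)
                        break;

                iov_iter_advance(i, status);
                pos += status;
        } while (iov_iter_count(i));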

Also
+  write_end: After a successful write_begin, and data copy, write_end must
+be called. len is the original len passed to write_begin, and copied
+is the amount that was able to be copied (they must be equal if
+write_begin was called with intr == 0).
+

That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being
set".
And "that was able to be copied" is misleading, as the copy is not done in
write_end.  Maybe "that was accepted".

It seems to make sense now.  I might try re-reviewing the patches based
on this improved understanding, only a public holiday looms :-)

NeilBrown


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
  On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
   On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
 + page = __grab_cache_page(mapping, index);

btw, __grab_cache_page should probably get a more descriptive and
non-__-prefixed name now that it's used all over the place.
   
   Agreed. Suggestions? ;)
  
  find_or_create_cache_page, given that it's like find_or_create_page +
  add_to_page_cache?
 
 find_or_create_page adds to page cache as well, though :P

I would really like to see the word 'lock' in there, as it does
return a locked page, and when I first saw __grab_cache_page recently
there was an unlock_page afterwards and I couldn't see where the lock
happened, and I was confused for a little while.

NeilBrown


Re: end to end error recovery musings

2007-02-25 Thread Neil Brown
On Friday February 23, [EMAIL PROTECTED] wrote:
 On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
   Probably the only sane thing to do is to remember the bad sectors and 
   avoid attempting reading them; that would mean marking automatic 
   versus explicitly requested requests to determine whether or not to 
   filter them against a list of discovered bad blocks.
  
  And clearing this list when the sector is overwritten, as it will almost
  certainly be relocated at the disk level.  For that matter, a huge win
  would be to have the MD RAID layer rewrite only the bad sector (in hopes
  of the disk relocating it) instead of failing the whole disk.  Otherwise,
  a few read errors on different disks in a RAID set can take the whole
  system offline.  Apologies if this is already done in recent kernels...

Yes, current md does this.

 
 And having a way of making this list available to both the filesystem
 and to a userspace utility, so they can more easily deal with doing a
 forced rewrite of the bad sector, after determining which file is
 involved and perhaps doing something intelligent (up to and including
 automatically requesting a backup system to fetch a backup version of
 the file, and if it can be determined that the file shouldn't have
 been changed since the last backup, automatically fixing up the
 corrupted data block :-).
 
   - Ted

So we want a clear path for media read errors from the device up to
user-space.  Stacked devices (like md) would do appropriate mappings
maybe (for raid0/linear at least.  Other levels wouldn't tolerate
errors).
There would need to be a limit on the number of 'bad blocks' that is
recorded.  Maybe a mechanism to clear old bad  blocks from the list is
needed.

Maybe if generic_make_request gets a request for a block which
overlaps a 'bad-block' it returns an error immediately.
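Something like this, perhaps (entirely hypothetical; no such list exists
today, and a sorted per-device array of bad sectors is assumed):

/* Fail a request early if it touches a known-bad sector. */
static int bio_hits_badblock(struct bio *bio, const sector_t *bad, int nbad)
{
        sector_t start = bio->bi_sector;
        sector_t end = start + (bio->bi_size >> 9);
        int i;

        for (i = 0; i < nbad; i++) {
                if (bad[i] >= end)
                        break;          /* sorted: nothing later can overlap */
                if (bad[i] >= start)
                        return 1;       /* overlap: complete the bio with -EIO */
        }
        return 0;
}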

Do we want a path in the other direction to handle write errors?  The
file system could say "Don't worry too much if this block cannot be
written, just return an error and I will write it somewhere else"?
This might allow md not to fail a whole drive if there is a single
write error.
Or is that completely un-necessary as all modern devices do bad-block
relocation for us?
Is there any need for a bad-block-relocating layer in md or dm?

What about corrected-error counts?  Drives provide them with SMART.
The SCSI layer could provide some as well.  Md can do a similar thing
to some extent.  Whether these are actually useful predictors of pending
failure is unclear, but there could be some value.
e.g. after a certain number of recovered errors raid5 could trigger a
background consistency check, or a filesystem could trigger a
background fsck should it support that.


Lots of interesting questions... not so many answers.

NeilBrown


Re: mismatch between 2.6.19 and nfs-utils-1.0.10 nfsctl_arg structure???

2007-02-21 Thread Neil Brown
On Wednesday February 21, [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote on 2007-02-21 00:04:40:
 
   You will need patch f988443a84528bd30c2f474efa5e2c511959f19b [1]
   or run
  mount -t nfsd nfsd /proc/fs/nfs/nfsd
   before starting mountd.
 
 I applied the patch, and attempted the mount cmd above.
 I assume you mistyped the directory path, because
 /proc/fs/nfs/nfsd does not exist, so I used /proc/fs/nfsd

Sorry, yes.  You chose the right path.

 
 Unfortunately it has not made any difference.
 
   The differences are not significant.
 
 Really? Surely if userspace uses this order
 
  struct nfsctl_uidmap   u_umap;
  struct nfsctl_fhparm   u_getfh;
  struct nfsctl_fdparm   u_getfd;
  struct nfsctl_fsparm   u_getfs;
 
 but kernelspace expects this
  struct nfsctl_fdparm   u_getfd;
  struct nfsctl_fsparm   u_getfs;

They are inside a 'union'.  Order is unimportant.

 
 then we have significant differences?
 But if you're sure, then what else can be wrong?
 My /etc/exports file contains (adding all 'unsafe' options I can find):
 
 /p 
 (rw,sync,no_root_squash,no_subtree_check,crossmnt,insecure,nohide,insecure_locks,no_acl)
 
 I tried with and without (empty) hosts.allow and hosts.deny files.
 
 I'm running out of ideas :-(

Run
  strace -o /tmp/strace -s 1000 -p `pidof mountd`
on the server, then attempt the mount on the client.
Then interrupt the 'strace' and send me /tmp/strace.

NeilBrown



Re: mismatch between 2.6.19 and nfs-utils-1.0.10 nfsctl_arg structure???

2007-02-20 Thread Neil Brown
On Tuesday February 20, [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm trying to get an nfs server setup on our embedded 2.6.19 kernel 
 system. I've chosen to use the kernel nfsd module,
 and am using the latest nfs-utils.
 It all starts running, but I get this:
  mount -t nfs 127.0.0.1:/etc /tmp/try
 mount: 127.0.0.1:/etc failed, reason given by server: Permission denied
 mount: nfsmount failed: Bad file descriptor
 mount: Mounting 127.0.0.1:/etc on /tmp/try failed: Bad file descriptor
 
 Having added a bit of debugging /var/log/messages shows this:
 daemon.warn mountd[545]: getfh.c:73: getfh_size (/etc)
 daemon.warn mountd[545]: nfsctl.c:26: syscall nfsctl
 daemon.warn mountd[545]: nfsctl.c:28: syscall nfsctl = -1
 daemon.warn mountd[545]: getfh failed: Operation not permitted

You will need patch f988443a84528bd30c2f474efa5e2c511959f19b [1]
or run
   mount -t nfsd nfsd /proc/fs/nfs/nfsd
before starting mountd.


 
 Looking in detail at the syscall made, I see that the structure
 definition for nfsctl_arg is different between kernel and nfs-utils.
 
 Huh? What am I missing here? Any pointers appreciated.
 Wouter.

The differences are not significant.

NeilBrown

[1] 
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f988443a84528bd30c2f474efa5e2c511959f19b


 
 Kernel 2.6.19.1: include/linux/nfsd/syscall.h:
 struct nfsctl_arg {
   int ca_version; /* safeguard */
   union {
   struct nfsctl_svc   u_svc;
   struct nfsctl_client  u_client;
   struct nfsctl_export  u_export;
   struct nfsctl_fdparm  u_getfd;
   struct nfsctl_fsparm  u_getfs;
  <snip>
   void *u_ptr;
   } u;
 
 
 nfs-utils 1.0.10: support/include/nfs/nfs.h:
 struct nfsctl_arg {
   int ca_version; /* safeguard */
   union {
   struct nfsctl_svc   u_svc;
   struct nfsctl_client  u_client;
   struct nfsctl_export  u_export;
   struct nfsctl_uidmap  u_umap;
   struct nfsctl_fhparm  u_getfh;
   struct nfsctl_fdparm  u_getfd;
   struct nfsctl_fsparm  u_getfs;
   } u;
 
 
 


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Neil Brown
On Tuesday February 6, [EMAIL PROTECTED] wrote:
 On Mon, Feb 05, 2007 at 07:20:35PM -0800, Andreas Gruenbacher wrote:
  It's actually not hard to fix, and nfsd would look a little less weird. 
  But 
  what would this add, what do pathnames mean in the context of nfsd, and 
  would 
  nfsd actually become less weird?
 
 It's not actually a pathname we care about, but a vfsmount + dentry
 combo.  That one means as much in nfsd as elsewhere.  We want nfsd
 to obey r/o or noatime mount flags if /export/foo is exported with them
 but /foo not.  Even better would be to change nfsd so it creates its
 own non-visible vfsmount for the filesystems it exports..

What would be the benefit of having private non-visible vfsmounts?
Sounds like a recipe for confusion?

It is possible that mountd might start doing bind-mounts to create the
'pseudo filesystem' thing for NFSv4, but they would be very visible
(under /var/lib/nfs/v4root or something).  So having its own vfsmount
might make sense, but I don't get 'non-visible'.

NeilBrown


Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks

2007-02-06 Thread Neil Brown
On Tuesday February 6, [EMAIL PROTECTED] wrote:
 On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote:
  Who cares? There is no way to export a partial directory, and in any
  case the subtree_check crap is borken beyond repair (see cross-directory
  renames which lead to actual changes to the filehandle - broken, broken,
  broken).
 
 It's getting a little oftopic for this thread, but is there a chance we
  could just kill that crap after a reasonable deprecation period?

Probably.  A couple of years?
nfs-utils 1.1.0 is due to stop subtree_check from being the default.
After that is released, the next kernel can start printing warnings
that it might stop working in a couple of years.

NeilBrown


Re: [PATCH] Fix d_path for lazy unmounts

2007-02-04 Thread Neil Brown
On Friday February 2, [EMAIL PROTECTED] wrote:
 Hello,
 
 here is a bugfix to d_path. Please apply (after 2.6.20).

Looks good!  Just a couple of little comments (to prove that I have
really read it and thought about it :-)

 
 Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

Reviewed-by: NeilBrown [EMAIL PROTECTED]

 
 Index: linux-2.6/fs/dcache.c
 ===
 --- linux-2.6.orig/fs/dcache.c
 +++ linux-2.6/fs/dcache.c
 @@ -1739,45 +1739,43 @@ shouldnt_be_hashed:
   * @rootmnt: vfsmnt to which the root dentry belongs
   * @buffer: buffer to return value in
   * @buflen: buffer length
 + * @fail_deleted: what to return for deleted files
   *
 - * Convert a dentry into an ASCII path name. If the entry has been deleted
 - * the string " (deleted)" is appended. Note that this is ambiguous.
 + * Convert a dentry into an ASCII path name. If the entry has been deleted,
 + * then if @fail_deleted is true, ERR_PTR(-ENOENT) is returned. Otherwise,
 + * the string " (deleted)" is appended. Note that this is ambiguous.

The behaviour in the face of a lazy unmount should be clarified in
this comment.

And I cannot help wondering if that behaviour should - in some cases -
be 'fail'.

i.e. if sys_getcwd is called on a directory that is no longer
connected to the root, it isn't clear to me that it should return
without an error.
Without your patch it can return garbage which is clearly wrong.
With your patch it will return a relative path name, which is also
wrong (it isn't a valid path that leads to the current working directory).
I would suggest that 'fail_deleted' be (e.g.) changed to
'fail_condition' where two conditions are defined
#define DPATH_FAIL_DELETED 1
#define DPATH_FAIL_DISCONNECTED 2

and both these are passed in by sys_getcwd.
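i.e. sys_getcwd would then end up with something like this (a sketch only,
reusing the names suggested above; the extra argument is the one your patch
adds):

        cwd = __d_path(pwd, pwdmnt, root, rootmnt, page, PAGE_SIZE,
                       DPATH_FAIL_DELETED | DPATH_FAIL_DISCONNECTED);
        error = PTR_ERR(cwd);
        if (IS_ERR(cwd))
                goto out;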


 @@ -1791,33 +1789,49 @@ static char * __d_path( struct dentry *d
                 parent = dentry->d_parent;
                 prefetch(parent);
                 namelen = dentry->d_name.len;
 -                buflen -= namelen + 1;
 -                if (buflen < 0)
 +                if (buflen <= namelen)
                         goto Elong;

This bothers me.  You appear to be comparing buflen with namelen, but
are about to change buflen by 'namelen + 1'.  It looks like a bug.

In reality, you are comparing buflen < namelen+1 but spelling it as
buflen <= namelen.  I would prefer the full spelling with least room
for confusion.


 -                end -= namelen;
 -                memcpy(end, dentry->d_name.name, namelen);
 -                *--end = '/';
 -                retval = end;
 +                buflen -= namelen + 1;
 +                buffer -= namelen;
 +                memcpy(buffer, dentry->d_name.name, namelen);
 +                *--buffer = '/';

This wouldn't be my preference either.  It is important that 'buflen'
and 'buffer' move in step, but when I see
buflen -= namelen + 1;
buffer -= namelen;
it looks like they aren't.  Maybe:
 + buflen -= namelen + 1;
 + buffer -= namelen + 1;
 + memcpy(buffer+1, dentry->d_name.name, namelen);
 + *buffer = '/';

or am I being too picky?

NeilBrown


Re: File system enhancement handled above the file system level.

2000-11-29 Thread Neil Brown

On Wednesday November 29, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  "hlfsd" stands for Home Link File System Daemon, and a google search
  on hlfsd finds it quite nicely.
 ---
   So we would configure /tmp to be type of NFS server served
 by hlfsd which would point to some real tmp space.  In our case 
 we want to segregate by 'group' or a process's 'label'.  So people
 in the same group/label would share a common 'tmp'.

For an NFS server, the only information that is available is the uid
and a list of groupids.
If you wanted it to be group based you would need to create some rule
for choosing which group.
Maybe groups with gid between 1 and 11000 could be
"name-space-groups" and you could administratively assure that each
user is only ever in one of these.  This might require changing the
newgroup command if people use that.
Given that, it should be trivial to modify hlfsd to pick the gid in
the right range and stick it in the returned symlink.
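Something along these lines inside hlfsd would do (purely illustrative; the
range and the helper are made up, and hlfsd's real internals will differ):

#include <sys/types.h>

/* Pick the single "name-space" gid out of the caller's group list. */
#define NS_GID_MIN      10000
#define NS_GID_MAX      11000

static int pick_namespace_gid(const gid_t *groups, int ngroups, gid_t *out)
{
        int i;

        for (i = 0; i < ngroups; i++) {
                if (groups[i] >= NS_GID_MIN && groups[i] < NS_GID_MAX) {
                        *out = groups[i];  /* becomes part of the symlink target */
                        return 0;
                }
        }
        return -1;      /* caller is not in any name-space group */
}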

If you want to base the selection on something else, you would have to
write your own filesystem, but that shouldn't be too hard (about an
hour's work I suspect).  You would want the be able to determine the
criterea directly from the task structure though.  I don't think that
digging around in the environment, which is somewhere in userspace, is
a good idea.
But you will presumably be using group ids to control access to these
tmp directories, so using groups to control the location of them is probably
acceptable too??

NeilBrown