Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:

 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.


I'd like to second the proposal, but it would be more useful to bring the hint
down to the physical devices.

There is an ongoing discussion about adding 'Trim' ATA command for notifying
the drive about the deleted blocks.

http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

This is especially useful for the storage device like Solid State Drive (SSD).

Dongjun
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Christoph Hellwig
On Sun, Oct 28, 2007 at 02:05:16PM -0400, Erez Zadok wrote:
 
 Sure.  I assume you mean an internal function to encapsulate the entire case
 statement's code, one for each of the FIO* cases.

Yes.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Christoph Hellwig
On Sun, Oct 28, 2007 at 07:57:47PM -0700, Daniel Phillips wrote:
 On 10/28/07, Christoph Hellwig [EMAIL PROTECTED] wrote:
  While you're at it, it's probably worth splitting this out into
  a small helper function.
 
 Why? Is the same pattern called from more than one place?

Becauase it's a lot more readable.  ioctl subcommands are invidividual
functionality, and separating them out into small self-contained functions
makes the code a lot more readable.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Christoph Hellwig
On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote:
 +/**
 + * vfs_ioctl - call filesystem specific ioctl methods
 + *
 + * @filp: [in] open file to invoke ioctl method on
 + * @cmd:  [in] ioctl command to execute
 + * @arg:  [in/out] command-specific argument for ioctl

I've never seen these in/out annotations and doubt they're valid in
kerneldoc.  Randy?

 + * Invokes filesystem specific -unlock_ioctl, if one exists; otherwise

unlocked_ioctl

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Christoph Hellwig
 +static int __ioctl_fibmap(struct file *filp, int __user *p)

I'd say kill the __ prefix for all the functions you're adding.

 +static int __ioctl_fionbio(struct file *filp, unsigned long arg)

 +static int __ioctl_fioasync(unsigned int fd, struct file *filp,
 + unsigned long arg)

For these two I'd add a

void __user *argp = (void __user *)arg;

in the caller and then just pass argp down.  That way we have the cast in
one place.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann
On Tuesday 30 October 2007, Dongjun Shin wrote:
 There is an ongoing discussion about adding 'Trim' ATA command for notifying
 the drive about the deleted blocks.
 
 http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
 
 This is especially useful for the storage device like Solid State Drive (SSD).
 
This make me curious, why would t13 want to invent a new command when
there is already the erase command from CFA? 

It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

Arnd 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 This make me curious, why would t13 want to invent a new command when
 there is already the erase command from CFA?

 It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
 should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
 http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf


I'm not sure about the background.
However, it's definitely a sign that passing the deleted block info
to the flash-based storage is useful.

Anyway, BIO_HINT_RELEASE could destroy the content of the blocks
after being passed to the device. I think that other bio should not be
reordered
accross that hint (just like barrier).

Dongjun
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:

 Not sure. Why shouldn't you be able to reorder the hints provided that
 they don't overlap with read/write bios for the same block?


You're right. The bios can be reordered if they don't overlap with hint.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel
On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote:
 On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:
 
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 I'd like to second the proposal, but it would be more useful to bring the hint
 down to the physical devices.

Absolutely.  Logfs would love to have an erase operation for block
devices as well.  However the above doesn't quite match my needs,
because the blocks _will_ be read in the future.

There are two reasons for reading things back later.  The good one is to
determine whether the segment was erased or not.  Reads should return
either valid data or one of (all-0xff, all-0x00, -ESOMETHING).  Having
a dedicated error code would be best.

And getting the device erasesize would be useful as well, for obvious
reasons.

Jörn

-- 
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger  Dragon
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection

2007-10-30 Thread J. Bruce Fields
From: J. Bruce Fields [EMAIL PROTECTED]

It's currently possible to send posix_locks_deadlock() into an infinite
loop (under the BKL).

For now, fix this just by bailing out after a few iterations.  We may
want to fix this in a way that better clarifies the semantics of
deadlock detection.  But that will take more time, and this minimal fix
is probably adequate for any realistic scenario, and is simple enough to
be appropriate for applying to stable kernels now.

Thanks to George Davis for reporting the problem.

Cc: George G. Davis [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 fs/locks.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

I didn't see objections to this quick fix (just to the followup that
attempts to rip out posix deadlock detection entirely), so I'm
resending with just comment modifications.

I haven't given up on a more comprehensive solution, but I think we
really need to apply some fix now.

--b.

diff --git a/fs/locks.c b/fs/locks.c
index 0127a28..8b8388e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -696,17 +696,28 @@ EXPORT_SYMBOL(posix_test_lock);
  * Note: the above assumption may not be true when handling lock requests
  * from a broken NFS client. But broken NFS clients have a lot more to
  * worry about than proper deadlock detection anyway... --okir
+ *
+ * However, the failure of this assumption (also possible in the case of
+ * multiple tasks sharing the same open file table) also means there's no
+ * guarantee that the loop below will terminate.  As a hack, we give up
+ * after a few iterations.
  */
+
+#define MAX_DEADLK_ITERATIONS 10
+
 static int posix_locks_deadlock(struct file_lock *caller_fl,
struct file_lock *block_fl)
 {
struct file_lock *fl;
+   int i = 0;
 
 next_task:
if (posix_same_owner(caller_fl, block_fl))
return 1;
list_for_each_entry(fl, blocked_list, fl_link) {
if (posix_same_owner(fl, block_fl)) {
+   if (i++  MAX_DEADLK_ITERATIONS)
+   return 0;
fl = fl-fl_next;
block_fl = fl;
goto next_task;
-- 
1.5.3.4.208.gc990

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Randy Dunlap
On Tue, 30 Oct 2007 09:56:53 + Christoph Hellwig wrote:

 On Sun, Oct 28, 2007 at 08:40:56PM -0400, Erez Zadok wrote:
  +/**
  + * vfs_ioctl - call filesystem specific ioctl methods
  + *
  + * @filp: [in] open file to invoke ioctl method on
  + * @cmd:  [in] ioctl command to execute
  + * @arg:  [in/out] command-specific argument for ioctl
 
 I've never seen these in/out annotations and doubt they're valid in
 kerneldoc.  Randy?

They are just treated as part of the parameter explanation text.
I don't see any problem with them.

  + * Invokes filesystem specific -unlock_ioctl, if one exists; otherwise
 
 unlocked_ioctl


---
~Randy
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection

2007-10-30 Thread Alan Cox
On Tue, 30 Oct 2007 11:20:02 -0400
J. Bruce Fields [EMAIL PROTECTED] wrote:

 From: J. Bruce Fields [EMAIL PROTECTED]
 
 It's currently possible to send posix_locks_deadlock() into an infinite
 loop (under the BKL).
 
 For now, fix this just by bailing out after a few iterations.  We may
 want to fix this in a way that better clarifies the semantics of
 deadlock detection.  But that will take more time, and this minimal fix
 is probably adequate for any realistic scenario, and is simple enough to
 be appropriate for applying to stable kernels now.
 
 Thanks to George Davis for reporting the problem.
 
 Cc: George G. Davis [EMAIL PROTECTED]
 Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]

Acked-by: Alan Cox [EMAIL PROTECTED]

Its a good fix for now and I doubt any real world user has that complex a
locking pattern to break.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel
On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
 On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 
  Not sure. Why shouldn't you be able to reorder the hints provided that
  they don't overlap with read/write bios for the same block?
 
 You're right. The bios can be reordered if they don't overlap with hint.

I would keep things simpler.  Bios can be reordered, full stop.  If an
erase and a write overlap, the caller (filesystem?) has to add a
barrier.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-30 Thread J. Bruce Fields
On Mon, Oct 29, 2007 at 08:06:04AM +, Alan Cox wrote:
 On Sun, 28 Oct 2007 13:43:21 -0400
 J. Bruce Fields [EMAIL PROTECTED] wrote:
 
  From: J. Bruce Fields [EMAIL PROTECTED]
  
  We currently attempt to return -EDEALK to blocking fcntl() file locking
  requests that would create a cycle in the graph of tasks waiting on
  locks.
  
  This is inefficient: in the general case it requires us determining
  whether we're adding a cycle to an arbitrary directed acyclic graph.
  And this calculation has to be performed while holding a lock (currently
  the BKL) that prevents that graph from changing.
  
  It has historically been a source of bugs; most recently it was noticed
  that it could loop indefinitely while holding the BKL.
  
  It seems unlikely to be useful to applications:
  - The difficulty of implementation has kept standards from
requiring it.  (E.g. SUSv3 : Since implementation of full
deadlock detection is not always feasible, the [EDEADLK] error
was made optional.)  So portable applications may not be able to
depend on it.
  - It only detects deadlocks that involve nothing but local posix
file locks; deadlocks involving network filesystems or other kinds
of locks or resources are missed.
  
  It therefore seems best to remove deadlock detection.
  
  Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
 
 
 NAK. This is an ABI change

OK, fair enough.

 and one that was rejected before when this was last discussed in
 detail.

That previous discussion (http://marc.info/?t=10865285743r=1w=2)
read the spec wrong and--now that I reread it--failed to address the
original bug report.  In fact it looks to me like the actual bug
reported:

http://bugzilla.kernel.org/show_bug.cgi?id=2829

was probably a minor variant of the one which George Davis stumbled on
again here.  That's a little depressing.

 Moving it out of BKL makes a ton of sense,

That would at least reduce the impact of bugs here, yeah.  It would be
great to figure out how to start getting locks.c and lockd out from
under the BKL, but I don't personally have the time now.  (And I believe
there's been a failed attempt or two so it's not completely easy.)

 even adding a don't check flag makes a lot of sense.

That's an idea I'll keep in mind, though I'm not convinced it'll help
much (or that applications would actually use it).

 The failure case for removing this feature is obscure and hard to debug
 application hangs for the afflicted programs - not nice for users at all.

Yeah.  I'd still be curious for any data about applications that
actually rely on posix deadlock detection, though.

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann
On Tuesday 30 October 2007, Jörn Engel wrote:
 On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
  On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
  
   Not sure. Why shouldn't you be able to reorder the hints provided that
   they don't overlap with read/write bios for the same block?
  
  You're right. The bios can be reordered if they don't overlap with hint.
 
 I would keep things simpler.  Bios can be reordered, full stop.  If an
 erase and a write overlap, the caller (filesystem?) has to add a
 barrier.

I thought bios were already ordered if they affect the same blocks.
Either way, I agree that an erase should not be treated special on
the bio layer, its ordering should be handled the same way we do it
for writes.

Arnd 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Randy Dunlap
On Tue, 30 Oct 2007 17:14:36 + Christoph Hellwig wrote:

 On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote:
  They are just treated as part of the parameter explanation text.
  I don't see any problem with them.
 
 Well, it's completely inconsistant with any other kerneldoc..

True.  and it has an advantage.

---
~Randy
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty
Hi,

While testing hotplug memory remove, I ran into this issue. Given a
range of pages hotplug memory remove tries to migrate those pages.

migrate_pages() keeps failing to migrate pages containing pagecache
pages for reiserfs files. I noticed that reiserfs doesn't have 
-migratepage() ops. So, fallback_migrate_page() code tries to
do try_to_release_page(). try_to_release_page() fails to drop_buffers()
since b_count == 1. Here is what my debug shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to zero ? 

Thanks,
Badari 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Erez Zadok
In message [EMAIL PROTECTED], Christoph Hellwig writes:
 On Tue, Oct 30, 2007 at 08:22:40AM -0700, Randy Dunlap wrote:
  They are just treated as part of the parameter explanation text.
  I don't see any problem with them.
 
 Well, it's completely inconsistant with any other kerneldoc..

If it doesn't hurt, and kerneldoc will present it as such, then I'd leave
the [in/out] in place.  Ioctls are the few places where you can have
variables used as both input and output; so _somehow_ we need to capture
that behavior.

Erez.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Erez Zadok
BTW, what's the origin of this oddity in fs/ioctl.c:

#ifdef __sparc__
/* SunOS compatibility item. */
if (O_NONBLOCK != O_NDELAY)
flag |= O_NDELAY;
#endif

It seems rather odd to have architecture-specific code in the VFS, no?

Erez.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-30 Thread Chris Mason
On Tue, 30 Oct 2007 10:27:04 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

 Hi,
 
 While testing hotplug memory remove, I ran into this issue. Given a
 range of pages hotplug memory remove tries to migrate those pages.
 
 migrate_pages() keeps failing to migrate pages containing pagecache
 pages for reiserfs files. I noticed that reiserfs doesn't have 
 -migratepage() ops. So, fallback_migrate_page() code tries to
 do try_to_release_page(). try_to_release_page() fails to
 drop_buffers() since b_count == 1. Here is what my debug shows:
 
   migrate pages failed pfn 258111/flags 3f801
   bh cb53f6e0 flags 110029 count 1
   
 Any one know why the b_count == 1 and not getting dropped to zero ? 

If these are file data pages, the count is probably elevated as part of
the data=ordered tracking.  You can verify this via b_private, or just
mount data=writeback to double check.

-chris
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Christoph Hellwig
On Tue, Oct 30, 2007 at 01:49:48PM -0400, Erez Zadok wrote:
 BTW, what's the origin of this oddity in fs/ioctl.c:
 
 #ifdef __sparc__
   /* SunOS compatibility item. */
   if (O_NONBLOCK != O_NDELAY)
   flag |= O_NDELAY;
 #endif
 
 It seems rather odd to have architecture-specific code in the VFS, no?

When Dave did the sparc port he followed sunos ABIs for lots of things.
When these ABIs are hidden behind ioctl arguments they can't easily
be handled in arch code and we need warts like that.  It's definitively
not recommended for new ports..

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring

2007-10-30 Thread Erez Zadok

This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as
follows.  This series is against v2.6.24-rc1-423-g97855b4.

Patch 1: just applies coding standards to fs/ioctl.c (while I'm at it, I
figured it's worth cleaning VFS files one at a time).

Patch 2: does two things:

(a) Renames the old vfs_ioctl to do_ioctl, because the comment above it
clearly indicates that it is an internal function not to be exported to
modules; therefore it should have a more traditional do_XXX internal
function name.  The new do_ioctl is exported in fs.h but not to
modules.

(b) Renames the old (static) do_ioctl to vfs_ioctl because the names vfs_XXX
should preferably be reserved to callable VFS functions which modules
may call, as other vfs_XXX functions already do.  Export the new
vfs_ioctl to (GPL) modules so others can use it (including Unionfs and
eCryptfs).

Patch 3: factors out the switch statements' cases for
FIBMAP/FIONBIO/FIOASYNC, into three small static helper functions.

Patch 4: demonstrates how Unionfs can use the new vfs_ioctl.  I successfully
tested unionfs with this new exported vfs_ioctl.  (eCryptfs could do the
same.)

I'd like to propose that the first 3 patches be merged in -mm and even
mainline, pending review.

Erez Zadok (4):
  VFS: apply coding standards to fs/ioctl.c
  VFS: swap do_ioctl and vfs_ioctl names
  VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls
  Unionfs: use vfs_ioctl

 fs/compat_ioctl.c   |2 
 fs/ioctl.c  |  224 
 fs/unionfs/commonfops.c |   36 +--
 include/linux/fs.h  |4 
 4 files changed, 141 insertions(+), 125 deletions(-)

Cheers,
Erez.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] Unionfs: use vfs_ioctl

2007-10-30 Thread Erez Zadok
Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/unionfs/commonfops.c |   36 ++--
 1 files changed, 6 insertions(+), 30 deletions(-)

diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c
index 7654bcb..c99b519 100644
--- a/fs/unionfs/commonfops.c
+++ b/fs/unionfs/commonfops.c
@@ -661,35 +661,6 @@ out:
return err;
 }
 
-/* pass the ioctl to the lower fs */
-static long do_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-   struct file *lower_file;
-   int err;
-
-   lower_file = unionfs_lower_file(file);
-
-   err = security_file_ioctl(lower_file, cmd, arg);
-   if (err)
-   goto out;
-
-   err = -ENOTTY;
-   if (!lower_file || !lower_file-f_op)
-   goto out;
-   if (lower_file-f_op-unlocked_ioctl) {
-   err = lower_file-f_op-unlocked_ioctl(lower_file, cmd, arg);
-   } else if (lower_file-f_op-ioctl) {
-   lock_kernel();
-   err = lower_file-f_op-ioctl(
-   lower_file-f_path.dentry-d_inode,
-   lower_file, cmd, arg);
-   unlock_kernel();
-   }
-
-out:
-   return err;
-}
-
 /*
  * return to user-space the branch indices containing the file in question
  *
@@ -756,6 +727,7 @@ out:
 long unionfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
long err;
+   struct file *lower_file;
 
unionfs_read_lock(file-f_path.dentry-d_sb);
 
@@ -779,7 +751,11 @@ long unionfs_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 
default:
/* pass the ioctl down */
-   err = do_ioctl(file, cmd, arg);
+   lower_file = unionfs_lower_file(file);
+   if (lower_file)
+   err = vfs_ioctl(lower_file, cmd, arg);
+   else
+   err = -ENOTTY;
break;
}
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] VFS: apply coding standards to fs/ioctl.c

2007-10-30 Thread Erez Zadok
Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/ioctl.c |  164 +++-
 1 files changed, 84 insertions(+), 80 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index c2a773e..652cacf 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -12,8 +12,8 @@
 #include linux/fs.h
 #include linux/security.h
 #include linux/module.h
+#include linux/uaccess.h
 
-#include asm/uaccess.h
 #include asm/ioctls.h
 
 static long do_ioctl(struct file *filp, unsigned int cmd,
@@ -45,31 +45,31 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
 {
int error;
int block;
-   struct inode * inode = filp-f_path.dentry-d_inode;
+   struct inode *inode = filp-f_path.dentry-d_inode;
int __user *p = (int __user *)arg;
 
switch (cmd) {
-   case FIBMAP:
-   {
-   struct address_space *mapping = filp-f_mapping;
-   int res;
-   /* do we support this mess? */
-   if (!mapping-a_ops-bmap)
-   return -EINVAL;
-   if (!capable(CAP_SYS_RAWIO))
-   return -EPERM;
-   if ((error = get_user(block, p)) != 0)
-   return error;
-
-   lock_kernel();
-   res = mapping-a_ops-bmap(mapping, block);
-   unlock_kernel();
-   return put_user(res, p);
-   }
-   case FIGETBSZ:
-   return put_user(inode-i_sb-s_blocksize, p);
-   case FIONREAD:
-   return put_user(i_size_read(inode) - filp-f_pos, p);
+   case FIBMAP:
+   {
+   struct address_space *mapping = filp-f_mapping;
+   int res;
+   /* do we support this mess? */
+   if (!mapping-a_ops-bmap)
+   return -EINVAL;
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   error = get_user(block, p);
+   if (error)
+   return error;
+   lock_kernel();
+   res = mapping-a_ops-bmap(mapping, block);
+   unlock_kernel();
+   return put_user(res, p);
+   }
+   case FIGETBSZ:
+   return put_user(inode-i_sb-s_blocksize, p);
+   case FIONREAD:
+   return put_user(i_size_read(inode) - filp-f_pos, p);
}
 
return do_ioctl(filp, cmd, arg);
@@ -82,81 +82,85 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
  * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
  */
-int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned 
long arg)
+int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+ unsigned long arg)
 {
unsigned int flag;
int on, error = 0;
 
switch (cmd) {
-   case FIOCLEX:
-   set_close_on_exec(fd, 1);
-   break;
+   case FIOCLEX:
+   set_close_on_exec(fd, 1);
+   break;
 
-   case FIONCLEX:
-   set_close_on_exec(fd, 0);
-   break;
+   case FIONCLEX:
+   set_close_on_exec(fd, 0);
+   break;
 
-   case FIONBIO:
-   if ((error = get_user(on, (int __user *)arg)) != 0)
-   break;
-   flag = O_NONBLOCK;
+   case FIONBIO:
+   error = get_user(on, (int __user *)arg);
+   if (error)
+   break;
+   flag = O_NONBLOCK;
 #ifdef __sparc__
-   /* SunOS compatibility item. */
-   if(O_NONBLOCK != O_NDELAY)
-   flag |= O_NDELAY;
+   /* SunOS compatibility item. */
+   if (O_NONBLOCK != O_NDELAY)
+   flag |= O_NDELAY;
 #endif
-   if (on)
-   filp-f_flags |= flag;
-   else
-   filp-f_flags = ~flag;
+   if (on)
+   filp-f_flags |= flag;
+   else
+   filp-f_flags = ~flag;
+   break;
+
+   case FIOASYNC:
+   error = get_user(on, (int __user *)arg);
+   if (error)
break;
-
-   case FIOASYNC:
-   if ((error = get_user(on, (int __user *)arg)) != 0)
-   break;
-   flag = on ? FASYNC : 0;
-
-   /* Did FASYNC state change ? */
-   if ((flag ^ filp-f_flags)  FASYNC) {
-   if 

[PATCH 2/4] VFS: swap do_ioctl and vfs_ioctl names

2007-10-30 Thread Erez Zadok
Rename old vfs_ioctl to do_ioctl, because the comment above it clearly
indicates that it is an internal function not to be exported to modules;
therefore it should have a more traditional do_XXX name.  The new do_ioctl
is exported in fs.h but not to modules.

Rename the old do_ioctl to vfs_ioctl because the names vfs_XXX should
preferably be reserved to callable VFS functions which modules may call, as
many other vfs_XXX functions already do.  Export the new vfs_ioctl to GPL
modules so others can use it (including Unionfs and eCryptfs).  Add DocBook
for new vfs_ioctl.

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/compat_ioctl.c  |2 +-
 fs/ioctl.c |   29 +
 include/linux/fs.h |4 +++-
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index a4284cc..a1604ce 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2972,7 +2972,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, 
unsigned int cmd,
}
 
  do_ioctl:
-   error = vfs_ioctl(filp, fd, cmd, arg);
+   error = do_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 652cacf..1ab7b7d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -16,8 +16,20 @@
 
 #include asm/ioctls.h
 
-static long do_ioctl(struct file *filp, unsigned int cmd,
-   unsigned long arg)
+/**
+ * vfs_ioctl - call filesystem specific ioctl methods
+ * @filp: [in] open file to invoke ioctl method on
+ * @cmd:  [in] ioctl command to execute
+ * @arg:  [in/out] command-specific argument for ioctl
+ *
+ * Invokes filesystem specific -unlocked_ioctl, if one exists; otherwise
+ * invokes * filesystem specific -ioctl method.  If neither method exists,
+ * returns -ENOTTY.
+ *
+ * Returns 0 on success, -errno on error.
+ */
+long vfs_ioctl(struct file *filp, unsigned int cmd,
+  unsigned long arg)
 {
int error = -ENOTTY;
 
@@ -39,6 +51,7 @@ static long do_ioctl(struct file *filp, unsigned int cmd,
  out:
return error;
 }
+EXPORT_SYMBOL_GPL(vfs_ioctl);
 
 static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
@@ -72,18 +85,18 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
return put_user(i_size_read(inode) - filp-f_pos, p);
}
 
-   return do_ioctl(filp, cmd, arg);
+   return vfs_ioctl(filp, cmd, arg);
 }
 
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
- * vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
+ * do_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
  */
-int vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
- unsigned long arg)
+int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+unsigned long arg)
 {
unsigned int flag;
int on, error = 0;
@@ -152,7 +165,7 @@ int vfs_ioctl(struct file *filp, unsigned int fd, unsigned 
int cmd,
if (S_ISREG(filp-f_path.dentry-d_inode-i_mode))
error = file_ioctl(filp, cmd, arg);
else
-   error = do_ioctl(filp, cmd, arg);
+   error = vfs_ioctl(filp, cmd, arg);
break;
}
return error;
@@ -172,7 +185,7 @@ asmlinkage long sys_ioctl(unsigned int fd, unsigned int 
cmd, unsigned long arg)
if (error)
goto out_fput;
 
-   error = vfs_ioctl(filp, fd, cmd, arg);
+   error = do_ioctl(filp, fd, cmd, arg);
  out_fput:
fput_light(filp, fput_needed);
  out:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..d513f16 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1924,7 +1924,9 @@ extern int vfs_stat_fd(int dfd, char __user *, struct 
kstat *);
 extern int vfs_lstat_fd(int dfd, char __user *, struct kstat *);
 extern int vfs_fstat(unsigned int, struct kstat *);
 
-extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
+extern long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
+extern int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
+   unsigned long arg);
 
 extern void get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] VFS: factor out three helpers for FIBMAP/FIONBIO/FIOASYNC file ioctls

2007-10-30 Thread Erez Zadok
Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/ioctl.c |  129 +++-
 1 files changed, 75 insertions(+), 54 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1ab7b7d..cd8c1a3 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -53,32 +53,34 @@ long vfs_ioctl(struct file *filp, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfs_ioctl);
 
+static int ioctl_fibmap(struct file *filp, int __user *p)
+{
+   struct address_space *mapping = filp-f_mapping;
+   int res, block;
+
+   /* do we support this mess? */
+   if (!mapping-a_ops-bmap)
+   return -EINVAL;
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   res = get_user(block, p);
+   if (res)
+   return res;
+   lock_kernel();
+   res = mapping-a_ops-bmap(mapping, block);
+   unlock_kernel();
+   return put_user(res, p);
+}
+
 static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
 {
-   int error;
-   int block;
struct inode *inode = filp-f_path.dentry-d_inode;
int __user *p = (int __user *)arg;
 
switch (cmd) {
case FIBMAP:
-   {
-   struct address_space *mapping = filp-f_mapping;
-   int res;
-   /* do we support this mess? */
-   if (!mapping-a_ops-bmap)
-   return -EINVAL;
-   if (!capable(CAP_SYS_RAWIO))
-   return -EPERM;
-   error = get_user(block, p);
-   if (error)
-   return error;
-   lock_kernel();
-   res = mapping-a_ops-bmap(mapping, block);
-   unlock_kernel();
-   return put_user(res, p);
-   }
+   return ioctl_fibmap(filp, p);
case FIGETBSZ:
return put_user(inode-i_sb-s_blocksize, p);
case FIONREAD:
@@ -88,6 +90,57 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
return vfs_ioctl(filp, cmd, arg);
 }
 
+static int ioctl_fionbio(struct file *filp, int __user *argp)
+{
+   unsigned int flag;
+   int on, error;
+
+   error = get_user(on, argp);
+   if (error)
+   return error;
+   flag = O_NONBLOCK;
+#ifdef __sparc__
+   /* SunOS compatibility item. */
+   if (O_NONBLOCK != O_NDELAY)
+   flag |= O_NDELAY;
+#endif
+   if (on)
+   filp-f_flags |= flag;
+   else
+   filp-f_flags = ~flag;
+   return error;
+}
+
+static int ioctl_fioasync(unsigned int fd, struct file *filp,
+ int __user *argp)
+{
+   unsigned int flag;
+   int on, error;
+
+   error = get_user(on, argp);
+   if (error)
+   return error;
+   flag = on ? FASYNC : 0;
+
+   /* Did FASYNC state change ? */
+   if ((flag ^ filp-f_flags)  FASYNC) {
+   if (filp-f_op  filp-f_op-fasync) {
+   lock_kernel();
+   error = filp-f_op-fasync(fd, filp, on);
+   unlock_kernel();
+   } else
+   error = -ENOTTY;
+   }
+   if (error)
+   return error;
+
+   if (on)
+   filp-f_flags |= FASYNC;
+   else
+   filp-f_flags = ~FASYNC;
+   return error;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -98,8 +151,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
 int do_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 unsigned long arg)
 {
-   unsigned int flag;
-   int on, error = 0;
+   int error = 0;
+   int __user *argp = (int __user *)arg;
 
switch (cmd) {
case FIOCLEX:
@@ -111,43 +164,11 @@ int do_ioctl(struct file *filp, unsigned int fd, unsigned 
int cmd,
break;
 
case FIONBIO:
-   error = get_user(on, (int __user *)arg);
-   if (error)
-   break;
-   flag = O_NONBLOCK;
-#ifdef __sparc__
-   /* SunOS compatibility item. */
-   if (O_NONBLOCK != O_NDELAY)
-   flag |= O_NDELAY;
-#endif
-   if (on)
-   filp-f_flags |= flag;
-   else
-   filp-f_flags = ~flag;
+   error = ioctl_fionbio(filp, argp);
break;
 
case FIOASYNC:
-   error = get_user(on, (int __user *)arg);
-   if (error)
-   break;
-   flag = on ? FASYNC : 0;
-
-   /* Did FASYNC state change ? */
-   if ((flag ^ filp-f_flags)  FASYNC) {
-   if (filp-f_op  filp-f_op-fasync) {
-   lock_kernel();
-   error = filp-f_op-fasync(fd, filp, on);
- 

Re: migratepage failures on reiserfs

2007-10-30 Thread Badari Pulavarty
On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
 On Tue, 30 Oct 2007 10:27:04 -0800
 Badari Pulavarty [EMAIL PROTECTED] wrote:
 
  Hi,
  
  While testing hotplug memory remove, I ran into this issue. Given a
  range of pages hotplug memory remove tries to migrate those pages.
  
  migrate_pages() keeps failing to migrate pages containing pagecache
  pages for reiserfs files. I noticed that reiserfs doesn't have 
  -migratepage() ops. So, fallback_migrate_page() code tries to
  do try_to_release_page(). try_to_release_page() fails to
  drop_buffers() since b_count == 1. Here is what my debug shows:
  
  migrate pages failed pfn 258111/flags 3f801
  bh cb53f6e0 flags 110029 count 1
  
  Any one know why the b_count == 1 and not getting dropped to zero ? 
 
 If these are file data pages, the count is probably elevated as part of
 the data=ordered tracking.  You can verify this via b_private, or just
 mount data=writeback to double check.


Chris,

That was my first assumption. But after looking at reiserfs_releasepage
(), realized that it would do reiserfs_free_jh() and clears the
b_private. I couldn't easily find out who has the ref. against this
bh.

bh cbdaaf00 flags 110029 count 1 private 0

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-30 Thread Chris Mason
On Tue, 30 Oct 2007 13:54:05 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

 On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:
  On Tue, 30 Oct 2007 10:27:04 -0800
  Badari Pulavarty [EMAIL PROTECTED] wrote:
  
   Hi,
   
   While testing hotplug memory remove, I ran into this issue. Given
   a range of pages hotplug memory remove tries to migrate those
   pages.
   
   migrate_pages() keeps failing to migrate pages containing
   pagecache pages for reiserfs files. I noticed that reiserfs
   doesn't have -migratepage() ops. So, fallback_migrate_page()
   code tries to do try_to_release_page(). try_to_release_page()
   fails to drop_buffers() since b_count == 1. Here is what my debug
   shows:
   
 migrate pages failed pfn 258111/flags 3f801
 bh cb53f6e0 flags 110029 count 1
 
   Any one know why the b_count == 1 and not getting dropped to
   zero ? 
  
  If these are file data pages, the count is probably elevated as
  part of the data=ordered tracking.  You can verify this via
  b_private, or just mount data=writeback to double check.
 
 
 Chris,
 
 That was my first assumption. But after looking at
 reiserfs_releasepage (), realized that it would do reiserfs_free_jh()
 and clears the b_private. I couldn't easily find out who has the ref.
 against this bh.
 
 bh cbdaaf00 flags 110029 count 1 private 0
 

If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps
it is currently under IO?

The page isn't locked, but data=ordered does IO directly on the buffer
heads, without taking the page lock.

The easy way to narrow our search is to try without data=ordered, it is
certainly complicating things.

-chris
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring

2007-10-30 Thread Andrew Morton
On Tue, 30 Oct 2007 15:39:55 -0400
Erez Zadok [EMAIL PROTECTED] wrote:

 This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as
 follows.

The problem is of course that you need these in your tree for ongoing
development, but 2.6.25 is months away.  They look OK to me so I suggest
that you go ahead and commit them to your git tree and I'll drop them
again.  Please resend them for merging in the 2.6.25-rc1 merge window.


unionfs has been hanging around for a long time now and we should work
towards getting it into 2.6.25 or dropped from -mm (sorry).  Right now
would be a great time to get this process underway.  

I have only a partial memory of what the sticking points were, and I have
basically zero knowledge of what was done to address them.  So could you
please, over the next few weeks:

- Send out a description of what the issues were, and how they were addressed

- If issues remain, describe their impact and possible workarounds, all
  that stuff.

- If it mostly-survives all that design-level review and consideration
  then let's go for it: get all the patches cleaned up and consolidated and
  get them emailed out for review no later than 2.6.24-rc5.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-30 Thread Badari

Chris Mason wrote:

On Tue, 30 Oct 2007 13:54:05 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

  

On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote:


On Tue, 30 Oct 2007 10:27:04 -0800
Badari Pulavarty [EMAIL PROTECTED] wrote:

  

Hi,

While testing hotplug memory remove, I ran into this issue. Given
a range of pages hotplug memory remove tries to migrate those
pages.

migrate_pages() keeps failing to migrate pages containing
pagecache pages for reiserfs files. I noticed that reiserfs
doesn't have -migratepage() ops. So, fallback_migrate_page()
code tries to do try_to_release_page(). try_to_release_page()
fails to drop_buffers() since b_count == 1. Here is what my debug
shows:

migrate pages failed pfn 258111/flags 3f801
bh cb53f6e0 flags 110029 count 1

Any one know why the b_count == 1 and not getting dropped to
zero ? 


If these are file data pages, the count is probably elevated as
part of the data=ordered tracking.  You can verify this via
b_private, or just mount data=writeback to double check.
  

Chris,

That was my first assumption. But after looking at
reiserfs_releasepage (), realized that it would do reiserfs_free_jh()
and clears the b_private. I couldn't easily find out who has the ref.
against this bh.

bh cbdaaf00 flags 110029 count 1 private 0




If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps
it is currently under IO?
  

Its BH_Req | BH_Uptodate. Its not under IO.

The page isn't locked, but data=ordered does IO directly on the buffer
heads, without taking the page lock.

The easy way to narrow our search is to try without data=ordered, it is
certainly complicating things.
  


I can try that, its my root filesystem :(

Thanks,
Badari
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park
On 10/31/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 On Tuesday 30 October 2007, Jörn Engel wrote:
  On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
   On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
   
Not sure. Why shouldn't you be able to reorder the hints provided that
they don't overlap with read/write bios for the same block?
  
   You're right. The bios can be reordered if they don't overlap with hint.
 
  I would keep things simpler. Bios can be reordered, full stop. If an
  erase and a write overlap, the caller (filesystem?) has to add a
  barrier.

 I thought bios were already ordered if they affect the same blocks.
 Either way, I agree that an erase should not be treated special on
 the bio layer, its ordering should be handled the same way we do it
 for writes.


To support the new ATA command (trim, or dataset), the suggested hint
is not enough.
We have to send the bio with data (at least one sector or more) since
the new ATA command requests the dataset information.

And also we have to strictly follow the order using barrier or other
methods at filesystem level
For example, the delete operation in ext3.
1. delete some file
2. ext3_delete_inode() called
3. ... - ext3_free_blocks_sb() releases the free blocks
4. If it sends the hints here, it breaks the ext3 power off recovery
scheme since it trims the data from given information after reboot
5. after transaction, all dirty pages are flushed. after this work, we
can trim the free blocks safely.

Another approach is modifying the block framework.
At  I/O scheduler, it don't merge the hint bio (in my terminology, bio
control info) with general bio. In this case we also consider the
reordering problem.
I'm not sure it is possible at this time.

Thank you,
Kyungmin Park
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread David Chinner
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
 On Tuesday October 30, [EMAIL PROTECTED] wrote:
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 If the allocation unit of the storage device (e.g. a few MB) does not
 match the allocation unit of the filesystem (e.g. a few KB) then for
 this to be useful either the storage device must start recording tiny
 allocations, or the filesystem should re-release areas as they grow.
 i.e. when releasing a range of a device, look in the filesystem's usage
 records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space
extents, not the amoutn actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k in free space at block 14.

If we free file A, we report that we've released an extent of 4k @ block 10.
if we then free file B, we report we've released an extent of 12k @ block 10.
If we then free file C, we report a release of 1024k @ block 10.

Then the underlying device knows what the aggregated free space regions
are and can easily release large regions without needing to track tiny
allocations and frees done by the filesystem.

 I guess that is equally domain specific, but the difference is that if
 you try to read from the DONTCOW part of the snapshot, you get bad
 old data, where as if you try to access the subordinate device of a
 snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's
a hole in the snapshot.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migratepage failures on reiserfs

2007-10-30 Thread Zan Lynx

On Tue, 2007-10-30 at 16:08 -0700, Badari wrote:
 Chris Mason wrote:
  On Tue, 30 Oct 2007 13:54:05 -0800
[cut]
  The easy way to narrow our search is to try without data=ordered, it is
  certainly complicating things.

 
 I can try that, its my root filesystem :(

You meant to write can't?

Download BusyBox and build it into an initramfs.  It's pretty easy, you
can do it yourself.  Or you could download the Debian mkinitramfs (I
think) package and look at it.  Or the Fedora equivalent (I think
mkinitrd).

Then you can boot into that and mount your / with whatever options you
like.

Here's what I use for my own custom BusyBox initramfs /init script:
#!/bin/sh

get_arg() {
local arg=$1
local x=`cat /proc/cmdline`
for i in $x; do
if [ ${i%=*} = $arg ]; then
echo ${i#*=}
break
fi
done
}

do_switch() {
mount -t proc none /proc
local root=`get_arg root`
local flags=`get_arg rootflags`
mount $root /new ${flags:+-o $flags}
umount /proc
cd /new
exec switch_root . /sbin/init
}

do_shell() {
exec /sbin/init
}

# The following will wait 2 seconds for Enter before booting.
trap do_switch ALRM

target=$$
( sleep 2; kill -ALRM $target ) 
alarm=$!

echo -n Press Enter for a shell: 
while read action; do
kill $alarm
break
done
do_shell

-- 
Zan Lynx [EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks
On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote:
 On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:
 
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 
 I'd like to second the proposal, but it would be more useful to bring the hint
 down to the physical devices.
 
 There is an ongoing discussion about adding 'Trim' ATA command for notifying
 the drive about the deleted blocks.
 
 http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document.  Am I reading the change markup correctly,
did it get *simpler* in the last revision?  Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed
Trim command.  But I don't think we'll ever be issuing Trims with
more than a single LBA Range Entry, that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which
is already deallocated is deallocated again, presumably this is
supposed to be harmless?

Greg.
-- 
Greg Banks, RD Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks
On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote:
 On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
  On Tuesday October 30, [EMAIL PROTECTED] wrote:
   BIO_HINT_RELEASE
   The bio's block extent is no longer in use by the filesystem
   and will not be read in the future.  Any storage used to back
   the extent may be released without any threat to filesystem
   or data integrity.
  
  If the allocation unit of the storage device (e.g. a few MB) does not
  match the allocation unit of the filesystem (e.g. a few KB) then for
  this to be useful either the storage device must start recording tiny
  allocations, or the filesystem should re-release areas as they grow.
  i.e. when releasing a range of a device, look in the filesystem's usage
  records for the largest surrounding free space, and release all of that.
 
 I figured that the easiest way around this is reporting free space
 extents, not the amoutn actually freed. e.g.
 
   4k in file A @ block 10
   4k in file B @ block 11
   4k free space @ block 12
   4k in file C @ block 13
   1008k in free space at block 14.
 
 If we free file A, we report that we've released an extent of 4k @ block 10.
 if we then free file B, we report we've released an extent of 12k @ block 10.
 If we then free file C, we report a release of 1024k @ block 10.
 
 Then the underlying device knows what the aggregated free space regions
 are and can easily release large regions without needing to track tiny
 allocations and frees done by the filesystem.

If you could do that in the filesystem, it certainly solve the problem.
In which case I'll explicitly allow for the hint's extent to overlap
extents previous extents thus hinted, and define the semantics
for overlaps.  I think I'll rename the hint to BIO_HINT_RELEASED,
I think that will make the semantics a little clearer.

Greg.
-- 
Greg Banks, RD Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html