Re: [Cluster-devel] Re: [GFS2] Don't flush everything on fdatasync [70/70]

2006-12-08 Thread Steven Whitehouse
Hi,

On Thu, 2006-12-07 at 14:05 -0500, Wendy Cheng wrote:
> Steven Whitehouse wrote:
> > Hi,
> >
> > On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
> >
> >>> I was taking my cue here from ext3 which does something similar. The
> >>> filemap_fdatawrite() is done by the VFS before this is called, with a
> >>> filemap_fdatawait() afterwards. This was intended to flush the metadata
> >>> via (eventually) ->write_inode(), although I guess I should be calling
> >>> write_inode_now() instead?
> >>>
> >> oh I see, you're just trying to write the inode itself, not the pages.
> >>
> >> write_inode_now() will write the pages, which you seem to not want to do.
> >> Whatever.  The APIs here are a bit awkward.
> >> 
> >
> > I've added a comment to explain what's going on here, and also the
> > following patch. I know it could be better, but it's still an improvement
> > on what was there before,
> >
> >
> >   
> Steve,
> 
> I'm in the middle of something else and don't have upstream kernel
> source handy at this moment. But I read akpm's comment as saying that
> "write_inode_now" would do writepage, and that is *not* what you want (?)
> (since the VFS has done that before this call is invoked). I vaguely
> recall that I tried write_inode_now() on GFS1 once but had to replace it
> with "sync_inode" on RHEL4 (for a reason I can't remember at this
> moment). I suggest you keep "sync_inode" (at least for a while, until we
> can prove another call does better). This "sync_inode" has been well
> tested (with GFS1's fsync call).
> 
OK. It's gone upstream now, but I'm happy to revert that change if it
turns out to be a problem. My tests show identical performance between
the two calls, but maybe there is a corner case I missed?
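
For anyone comparing the two calls: as far as I recall, write_inode_now()
amounts to roughly the sketch below (simplified; the real function also
takes inode_lock, and the _sketch name is mine, not the kernel's):

#include <linux/fs.h>
#include <linux/writeback.h>

/*
 * write_inode_now() is more or less sync_inode() with a canned
 * writeback_control. The key difference from the wbc that gfs2_fsync()
 * used to build is the unrestricted .nr_to_write; the old code passed
 * .nr_to_write = 0 to keep data writeback to a minimum.
 */
static int write_inode_now_sketch(struct inode *inode, int sync)
{
	struct writeback_control wbc = {
		.nr_to_write = LONG_MAX,	/* the old gfs2 wbc used 0 */
		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
	};

	return sync_inode(inode, &wbc);
}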

Both calls do writepage(), but since the VFS has already done that for us,
the chances of there being any more dirty pages to write are fairly
small, so it's unlikely to be much of a problem, I think, but I'm willing
to be proved wrong if there is a good reason.
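
To spell out the ordering: the VFS-level fsync path looks roughly like
this (a simplified sketch of do_fsync() from memory, with the i_mutex
handling and some error paths trimmed, so treat it as an illustration
rather than the exact source):

#include <linux/fs.h>

/* Data pages are written before ->fsync() is called and waited for
 * afterwards, so ->fsync() itself only needs to cover the inode and
 * its metadata. */
static long do_fsync_sketch(struct file *file, int datasync)
{
	struct address_space *mapping = file->f_mapping;
	long ret, err;

	if (!file->f_op || !file->f_op->fsync)
		return -EINVAL;

	ret = filemap_fdatawrite(mapping);	/* start data writeback */

	err = file->f_op->fsync(file, file->f_dentry, datasync);
	if (!ret)
		ret = err;

	err = filemap_fdatawait(mapping);	/* wait for the data */
	if (!ret)
		ret = err;

	return ret;
}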

> There is another issue. It is a gray area. Note that you don't grab any
> glock here ... so if someone *has* written something on other nodes,
> this sync could miss it (?). This depends on how people expect
> fsync/fdatasync to behave in a cluster filesystem. GFS1 asks for a
> shared lock here, so it will force other nodes to flush their data (I
> personally think this is a more correct behavior). Your call though.
> 
> -- Wendy
> 
It's a tricky one to deal with. I would expect that an application
relying on an fsync on one node to cause a cross-cluster flush is
relatively unlikely. It would mean that there would have to be
another communication channel between the processes on the different
nodes, through which the node that was writing data could request a flush
and then receive notification that it has finished; otherwise it would
not seem to make any sense. It would seem an odd way to write an
application, but maybe one does exist somewhere which does this.
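
If we did want the GFS1-style behaviour, I'd expect it to look something
like the sketch below: take a shared glock across the sync, so that a
remote node holding the glock exclusively is forced to sync and demote
first. This is only an illustration against the existing holder API
(gfs2_glock_nq_init()/gfs2_glock_dq_uninit()); the function name is made
up and the code is untested:

/* Hypothetical GFS1-style fsync: hold a shared glock over the sync.
 * A shared request forces a remote exclusive holder to demote, which
 * makes that node flush its dirty data and metadata first. */
static int gfs2_fsync_clusterwide(struct file *file,
				  struct dentry *dentry, int datasync)
{
	struct inode *inode = dentry->d_inode;
	struct gfs2_inode *ip = GFS2_I(inode);
	struct gfs2_holder gh;
	int ret;

	ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	if (ret)
		return ret;

	ret = write_inode_now(inode, 0);

	gfs2_glock_dq_uninit(&gh);
	return ret;
}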

Delving back into the history, it looks like this is a change (with
respect to gfs1) made by Ken rather than myself. I don't mind adding
this feature, but even so, what we have now is still a marked
improvement on what was there previously, I think,

Steve.


Re: [Cluster-devel] Re: [GFS2] Don't flush everything on fdatasync [70/70]

2006-12-07 Thread Wendy Cheng

Steven Whitehouse wrote:
> Hi,
>
> On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
>
>>> I was taking my cue here from ext3 which does something similar. The
>>> filemap_fdatawrite() is done by the VFS before this is called, with a
>>> filemap_fdatawait() afterwards. This was intended to flush the metadata
>>> via (eventually) ->write_inode(), although I guess I should be calling
>>> write_inode_now() instead?
>>>
>> oh I see, you're just trying to write the inode itself, not the pages.
>>
>> write_inode_now() will write the pages, which you seem to not want to do.
>> Whatever.  The APIs here are a bit awkward.
>>
>
> I've added a comment to explain what's going on here, and also the
> following patch. I know it could be better, but it's still an improvement
> on what was there before,
>

Steve,

I'm in the middle of something else and don't have upstream kernel
source handy at this moment. But I read akpm's comment as saying that
"write_inode_now" would do writepage, and that is *not* what you want (?)
(since the VFS has done that before this call is invoked). I vaguely
recall that I tried write_inode_now() on GFS1 once but had to replace it
with "sync_inode" on RHEL4 (for a reason I can't remember at this
moment). I suggest you keep "sync_inode" (at least for a while, until we
can prove another call does better). This "sync_inode" has been well
tested (with GFS1's fsync call).

There is another issue. It is a gray area. Note that you don't grab any
glock here ... so if someone *has* written something on other nodes,
this sync could miss it (?). This depends on how people expect
fsync/fdatasync to behave in a cluster filesystem. GFS1 asks for a
shared lock here, so it will force other nodes to flush their data (I
personally think this is a more correct behavior). Your call though.

-- Wendy




From 34126f9f41901ca9d7d0031c2b11fc0d6c07b72d Mon Sep 17 00:00:00 2001
From: Steven Whitehouse <[EMAIL PROTECTED]>
Date: Thu, 7 Dec 2006 09:13:14 -0500
Subject: [PATCH] [GFS2] Change gfs2_fsync() to use write_inode_now()

This is a bit better than the previous version of gfs2_fsync(),
although it would be better still if we were able to call a
function which only wrote the inode & metadata. It's no big deal,
though, that this will potentially write the data as well, since
the VFS has already done that before calling gfs2_fsync(). I've
also added a comment to explain what's going on here.

Signed-off-by: Steven Whitehouse <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
---
 fs/gfs2/ops_file.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index 7bd971b..b3f1e03 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -510,6 +510,11 @@ static int gfs2_close(struct inode *inod
  * is set. For stuffed inodes we must flush the log in order to
  * ensure that all data is on disk.
  *
+ * The call to write_inode_now() is there to write back metadata and
+ * the inode itself. It does also try and write the data, but that's
+ * (hopefully) a no-op due to the VFS having already called
+ * filemap_fdatawrite() for us.
+ *
  * Returns: errno
  */
 
@@ -518,10 +523,6 @@ static int gfs2_fsync(struct file *file,
 	struct inode *inode = dentry->d_inode;
 	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
 	int ret = 0;
-	struct writeback_control wbc = {
-		.sync_mode = WB_SYNC_ALL,
-		.nr_to_write = 0,
-	};
 
 	if (gfs2_is_jdata(GFS2_I(inode))) {
 		gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
@@ -530,7 +531,7 @@ static int gfs2_fsync(struct file *file,
 
 	if (sync_state != 0) {
 		if (!datasync)
-			ret = sync_inode(inode, &wbc);
+			ret = write_inode_now(inode, 0);
 
 		if (gfs2_is_stuffed(GFS2_I(inode)))
 			gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);



