RE: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

2013-02-12 Thread Peter Staubach
The "+" thing seems a little odd.  Why not use "||" instead?  The sum of the 
two returns isn't really the important thing, is it?  It is that either call to 
svc_close_list() returns non-zero.

Thanx...

ps


-Original Message-
From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] 
On Behalf Of J. Bruce Fields
Sent: Tuesday, February 12, 2013 3:46 PM
To: Stanislav Kinsbursky
Cc: a...@linux-foundation.org; linux-...@vger.kernel.org; 
trond.mykleb...@netapp.com; linux-kernel@vger.kernel.org; de...@openvz.org
Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Tue, Feb 12, 2013 at 01:52:32PM +0400, Stanislav Kinsbursky wrote:
> 12.02.2013 00:58, J. Bruce Fields wrote:
> 
> >  void svc_close_net(struct svc_serv *serv, struct net *net)
> >  {
> >-svc_close_list(serv, &serv->sv_tempsocks, net);
> >-svc_close_list(serv, &serv->sv_permsocks, net);
> >-
> >-svc_clear_pools(serv, net);
> >-/*
> >- * At this point the sp_sockets lists will stay empty, since
> >- * svc_xprt_enqueue will not add new entries without taking the
> >- * sp_lock and checking XPT_BUSY.
> >- */
> >-svc_clear_list(serv, &serv->sv_tempsocks, net);
> >-svc_clear_list(serv, &serv->sv_permsocks, net);
> >+int closed;
> >+int delay = 0;
> >+
> >+again:
> >+closed = svc_close_list(serv, &serv->sv_permsocks, net);
> >+closed += svc_close_list(serv, &serv->sv_tempsocks, net);
> >+if (closed) {
> >+svc_clean_up_xprts(serv, net);
> >+msleep(delay++);
> >+goto again;
> >+}
> 
> Frankly, this hunk above makes me feel sick... :( But I have no better 
> idea right now...
> Maybe make this hunk a bit less weird (this is from my POV only, of course), 
> like this:
> 
> > +   while (svc_close_list(serv, &serv->sv_permsocks, net) +
> > +  svc_close_list(serv, &serv->sv_tempsocks, net)) {
> > +   svc_clean_up_xprts(serv, net);
> > +   msleep(delay++);
> > +   }
> 
> ?

OK, that's a little more compact at least.

--b.

> 
> Anyway, thanks!
> 
> Acked-by: Stanislav Kinsbursky 
> 
> --
> Best regards,
> Stanislav Kinsbursky
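
For reference, a sketch of the loop form suggested above (illustrative only; svc_close_list()
is assumed here to return the number of transports it marked for closing).  One practical
difference from the "||" form is that || short-circuits, so the temp-socket list would not
even be scanned on a pass where the perm-socket call had already returned non-zero.

        /* Sketch only: keep sweeping both lists until neither finds work. */
        static void svc_close_net_sketch(struct svc_serv *serv, struct net *net)
        {
                int delay = 0;

                while (svc_close_list(serv, &serv->sv_permsocks, net) +
                       svc_close_list(serv, &serv->sv_tempsocks, net)) {
                        svc_clean_up_xprts(serv, net);
                        msleep(delay++);
                }
        }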


Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-04 Thread Peter Staubach

Miklos Szeredi wrote:

In FUSE interrupts are sent to userspace, and the filesystem decides
what to do with them.  So it is entirely possible and valid for a
filesystem to ignore an interrupt.  If an operation was non-blocking
(such as one returning an error), then there would in fact be no
purpose in checking interrupts.

  
  

Why do you think that it is valid to ignore pending signals?
You seem to be asserting that it is okay for processes to hang,
uninterruptibly, when accessing files on fuse mounted file
systems?

Perhaps the right error to return when there is a signal
pending is EINTR and not ESTALE or some other error?  There
has to be some way for the application to detect that its
system call was interrupted due to a signal pending.



Traditionally a lot of filesystem-related system calls are not
interruptible, and for good reason.  For example, what happens if an
app receives a signal while the filesystem is performing a rename()
request?  It would be very confusing if the call returned EINTR, but
the rename would successfully complete regardless.

We had a related problem with the open(O_CREAT) call in fuse, which
was interruptible between the creation and the actual open because of
a design mistake.  So it could return EINTR, after the file was
created, and this broke a real world application (don't have details
at hand, but could dig them out if you are interested).

I don't know what NFS does, but returning EINTR without actually
canceling an operation in the server is generally not a good idea.

  


This is what NFS has been doing, for several decades, and no one
has complained yet.  It is just generally accepted.  I do agree
that it isn't the best of semantics, but it does seem to work and
does solve a real problem which exists if you don't allow an
operation to be interrupted.  The alternative, for NFS clients,
was potentially to block an application until a server, which
might never come back up, comes back up.  It was a serious
problem and worse than this resolution.

Yes, I'd like to hear the details and find out why it was a
problem.  If you allow the fuse file system to block waiting
on things which may never occur, then you are going to have a
problem.  I would suggest considering this now instead of waiting
until it is too late.  We can learn from the NFS experience instead
of just dismissing it.



So while sending a signal might reliably work in NFS to break out of
the loop, it does not necessarily work for other filesystems, and fuse
may not be the only one affected.

  
  

Have you noticed another one?  I would be happy to chat with the
developers for that file system to see if this support would
negatively impact them.



Oh, I have no idea.  And I wouldn't want to do a full audit of all the
filesystems to find out.  But if you do, please go ahead.

  


Well, you brought it up.  I thought that perhaps you had something
other than FUD.


A few solutions come to mind, perhaps the best is to introduce a
kernel internal errno value (ERETRYSTALE), that forces the relevant
system calls to be retried.

NFS could transform ESTALE errors to ERETRYSTALE and get the desired
behavior, while other filesystems would not be affected.
  

We don't need more error numbers, we've got plenty already.  :-)



That's a rather poor excuse against a simple solution which would
spare us some backward compatibility problems.

  


Potential backwards compatibility problems, and none are even known,
let alone considered.

The solution here isn't to create more hacks, and a new error number
for this purpose is just a hack.


Do you have anything more specific about any real problems?
I see lots of "mays" and "coulds", but I don't see anything
that I can do to make this support better.



Implement the above suggestion?  Or something else.

Otherwise I have to NAK this patch due to the possibility of it
breaking existing fuse installations.


Please describe this real and existing fuse installation so that I can
better understand the situation and the real requirements here.

Instead of attempting to block this proposal, what about considering
how to architect fuse to handle the situation instead of pretending
that fuse won't have the same problem to solve if it isn't solved
here?  I have a real problem to solve and I need to get it resolved.
I have real customers, with real problems, and not just theoretical
and vague ones.

 ps
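
For reference, the ERETRYSTALE idea quoted above would amount to roughly the following
sketch.  This is hypothetical: no such errno exists in the kernel, the numeric value is
arbitrary, and lookup_and_operate() stands in for whatever pathname syscall body would be
retried.

        #define ERETRYSTALE     530     /* kernel-internal only; value arbitrary for this sketch */

        /* The NFS client would translate its stale errors... */
        static inline int nfs_map_stale(int error)
        {
                return error == -ESTALE ? -ERETRYSTALE : error;
        }

        /* ...and a pathname syscall would retry only on the internal value,
         * leaving other filesystems' ESTALE returns visible to userspace. */
        static long do_pathname_syscall(int dfd, const char __user *pathname)
        {
                long error;

                do {
                        error = lookup_and_operate(dfd, pathname);      /* hypothetical helper */
                } while (error == -ERETRYSTALE);

                return error;
        }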


Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-04 Thread Peter Staubach

Miklos Szeredi wrote:
  
  

Would you describe the situation that would cause the kernel to
go into an infinite loop, please?



The patch basically does:

do {
...
error = inode->i_op->foo()
...
} while (error == ESTALE);

What is the guarantee that ->foo() will not always return ESTALE?
  

You skimmed over some stuff, like the pathname lookup component
contained in the first set of dots...

I can't guarantee that ->foo() won't always return ESTALE.

That said, the loop is not unbreakable.  At least for NFS, a signal
to the process will interrupt the loop because the error returned
will change from ESTALE to EINTR.



In FUSE interrupts are sent to userspace, and the filesystem decides
what to do with them.  So it is entirely possible and valid for a
filesystem to ignore an interrupt.  If an operation was non-blocking
(such as one returning an error), then there would in fact be no
purpose in checking interrupts.

  


Why do you think that it is valid to ignore pending signals?
You seem to be asserting that it is okay for processes to hang,
uninterruptibly, when accessing files on fuse mounted file
systems?

Perhaps the right error to return when there is a signal
pending is EINTR and not ESTALE or some other error?  There
has to be some way for the application to detect that its
system call was interrupted due to a signal pending.


So while sending a signal might reliably work in NFS to break out of
the loop, it does not necessarily work for other filesystems, and fuse
may not be the only one affected.

  


Have you noticed another one?  I would be happy to chat with the
developers for that file system to see if this support would
negatively impact them.


Also up till now, returning ESTALE in a fuse filesystem was a
perfectly valid thing to do.  This patch changes the behavior of that
rather drastically.  There might be installed systems that rely on
current behavior, and we want to avoid breaking those on a kernel
upgrade.

  


Perhaps the explanation for what ESTALE means was not clear?
If there are fuse file systems which really do support the
notion of ESTALE, then it seems to me that they would also
benefit from this support, i.e., the ability to do some recovery
from the situation.


A few solutions come to mind, perhaps the best is to introduce a
kernel internal errno value (ERETRYSTALE), that forces the relevant
system calls to be retried.

NFS could transform ESTALE errors to ERETRYSTALE and get the desired
behavior, while other filesystems would not be affected.


We don't need more error numbers, we've got plenty already.  :-)

Do you have anything more specific about any real problems?
I see lots of "mays" and "coulds", but I don't see anything
that I can do to make this support better.

   Thanx...

  ps
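
A sketch of the behavior being described here, for reference (illustrative only; the real
NFS client does this deep in its RPC layer, and rpc_call_operation() below is a hypothetical
stand-in): each retried operation goes back to the server, and once a signal is pending the
error stops being -ESTALE, so a "retry while -ESTALE" loop falls through.

        static int nfs_do_operation(struct inode *inode)
        {
                /* A pending signal turns the retryable ESTALE into EINTR,
                 * which breaks the caller's retry loop. */
                if (signal_pending(current))
                        return -EINTR;

                return rpc_call_operation(inode);       /* may legitimately return -ESTALE */
        }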


Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Miklos Szeredi wrote:

This doesn't apply to -mm, because the ro-mounts stuff touches a lot
of the same places as this patch.  You probably need to rebase this on
top of those changes.

  
  

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process, to unwind
appropriately, and then to perform the lookup process again.
Eventually, the lookup process will return either an error
or a valid dentry/inode combination, and then the operation can
succeed or fail based on its own merits.



If a broken NFS server or FUSE filesystem keeps returning ESTALE, this
goes into an infinite loop.  How are we planning to deal with that?

  
  

Would you describe the situation that would cause the kernel to
go into an infinite loop, please?



The patch basically does:

do {
...
error = inode->i_op->foo()
...
} while (error == ESTALE);

What is the guarantee that ->foo() will not always return ESTALE?


You skimmed over some stuff, like the pathname lookup component
contained in the first set of dots...

I can't guarantee that ->foo() won't always return ESTALE.

That said, the loop is not unbreakable.  At least for NFS, a signal
to the process will interrupt the loop because the error returned
will change from ESTALE to EINTR.

These changes include the base assumption that the components of
the underlying file system are basically reliable, that there is
a way to deal with bugs and/or malicious entities in the short
term, and that these things will be dealt with appropriately
in the longer term.

The short term resolution is a signal.  The longer term fix is
to hunt down the bug or the malicious entity and either make it
go away or fence it off via some security measure or another to
prevent it from causing another problem.

If the underlying file system is the type that could potentially
return ESTALE, then it needs to be aware of the system architecture
and handle things appropriately.

   Thanx...

  ps


Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Miklos Szeredi wrote:

This doesn't apply to -mm, because the ro-mounts stuff touches a lot
of the same places as this patch.  You probably need to rebase this on
top of those changes.

  

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process, to unwind
appropriately, and then to perform the lookup process again.
Eventually, the lookup process will return either an error
or a valid dentry/inode combination, and then the operation can
succeed or fail based on its own merits.



If a broken NFS server or FUSE filesystem keeps returning ESTALE, this
goes into an infinite loop.  How are we planning to deal with that?

  


Would you describe the situation that would cause the kernel to
go into an infinite loop, please?

Please note that, at least for NFS, this looping is interruptible
by the user, so the system can't hang with nothing that can
be done about it.


And it has to be dealt with either in the VFS, or in the kernel parts
of the relevant filesystems.  We can't just say "fix the broken
servers", especially not with FUSE, where the server is totally
untrusted.


Nope, certainly can't depend upon fixing servers.  The client
should not depend upon the server to avoid things like looping.

   Thanx...

  ps


Re: [PATCH 3/3] enhanced NFS ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Trond Myklebust wrote:

On Fri, 2008-02-01 at 15:58 -0500, Peter Staubach wrote:
  

Hi.

The patch enhances the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check whether the
directory has become valid again.  This sounds odd, but can occur
when a systems administrator, accidentally or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years, and this
support can help to alleviate their situations.



As far as I can see, this patch can be applied separately from the VFS
fixes. If so, would it make sense for me to take charge of this patch in
the NFS tree, while Andrew queues up the other two VFS changes in -mm?


Yes, I think that this would make good sense.

   Thanx...

  ps



[PATCH 3/3] enhanced NFS ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

The patch enhances the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check whether the
directory has become valid again.  This sounds odd, but can occur
when a systems administrator, accidentally or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years, and this
support can help to alleviate their situations.

  Thanx...

 ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.24.i686/fs/nfs/inode.c.org
+++ linux-2.6.24.i686/fs/nfs/inode.c
@@ -192,7 +192,7 @@ void nfs_invalidate_atime(struct inode *
  */
 static void nfs_invalidate_inode(struct inode *inode)
 {
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
+   nfs_handle_estale(-ESTALE, inode);
nfs_zap_caches_locked(inode);
 }
 
@@ -385,6 +385,8 @@ nfs_setattr(struct dentry *dentry, struc
error = NFS_PROTO(inode)->setattr(dentry, &fattr, attr);
if (error == 0)
nfs_refresh_inode(inode, &fattr);
+   else
+   nfs_handle_estale(error, inode);
unlock_kernel();
return error;
 }
@@ -629,7 +631,7 @@ int nfs_release(struct inode *inode, str
 int
 __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
-   int  status = -ESTALE;
+   int status = -ESTALE;
struct nfs_fattr fattr;
struct nfs_inode *nfsi = NFS_I(inode);
 
@@ -640,15 +642,25 @@ __nfs_revalidate_inode(struct nfs_server
lock_kernel();
if (is_bad_inode(inode))
goto out_nowait;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) && !S_ISDIR(inode->i_mode))
goto out_nowait;
 
status = nfs_wait_on_inode(inode);
if (status < 0)
goto out;
 
+   /*
+* Do we believe that the file handle is still stale?
+* For non-directories, once stale, always stale.
+* For directories, believe the stale status for the
+* attribute cache timeout period, and then try again.
+* This will help to address the problem of the server
+* admin "accidently" unexporting a file system without
+* stopping the NFS server first.
+*/
status = -ESTALE;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) &&
+   (!S_ISDIR(inode->i_mode) || !nfs_attribute_timeout(inode)))
goto out;
 
status = NFS_PROTO(inode)->getattr(server, NFS_FH(inode), &fattr);
@@ -656,11 +668,9 @@ __nfs_revalidate_inode(struct nfs_server
dfprintk(PAGECACHE, "nfs_revalidate_inode: (%s/%Ld) getattr 
failed, error=%d\n",
 inode->i_sb->s_id,
 (long long)NFS_FILEID(inode), status);
-   if (status == -ESTALE) {
+   nfs_handle_estale(status, inode);
+   if (status == -ESTALE)
nfs_zap_caches(inode);
-   if (!S_ISDIR(inode->i_mode))
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
-   }
goto out;
}
 
@@ -986,14 +996,28 @@ static int nfs_update_inode(struct inode
__FUNCTION__, inode->i_sb->s_id, inode->i_ino,
atomic_read(&inode->i_count), fattr->valid);
 
-   if (nfsi->fileid != fattr->fileid)
-   goto out_fileid;
+   if (nfsi->fileid != fattr->fileid) {
+   printk(KERN_ERR "NFS: server %s error: fileid changed\n"
+   "fsid %s: expected fileid 0x%Lx, got 0x%Lx\n",
+   NFS_SERVER(inode)->nfs_client->cl_hostname,
+   inode->i_sb->s_id,
+   (long long)nfsi->fileid, (long long)fattr->fileid);
+   goto out_err;
+   }
 
/*
 * Make sure the inode's type hasn't changed.
 */
-   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
-   goto out_changed;
+   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT)) {
+   /*
+* Big trouble! The inode has become a different object.
+*/
+   printk(KERN_DEBUG "%s: inode %ld mode changed, %07o to %07o\n",
+   __FUNCTION__, inode->i_ino, inode->i_mode, fattr->mode);
+   goto out_err;
+   }
+
+   nfs_clear_estale(inode);
 
server = NFS_SERVER(inode);
/* Update the fsid? */
@@ -1099,12 +1123,7 @@ static int nfs_update_inode(struct inode
nfsi->cache_validity &= ~NFS_INO_REVAL_FORCED;
 
return 0;
- out_changed:
-   /*
-* Big trouble! The inode has become a different object
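
The hunks above call nfs_handle_estale() and nfs_clear_estale(), whose definitions are not
part of the quoted excerpt.  Judging only from the code they replace, they presumably amount
to something like the following sketch (an assumption, not the author's actual
implementation):

        /* Hypothetical sketch.  Mark the inode stale on -ESTALE; directories
         * still get re-probed because __nfs_revalidate_inode() above retries
         * them once the attribute cache times out. */
        static void nfs_handle_estale(int error, struct inode *inode)
        {
                if (error == -ESTALE)
                        set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
        }

        /* Undo the marking once the server again reports a matching inode,
         * as in nfs_update_inode() above. */
        static void nfs_clear_estale(struct inode *inode)
        {
                clear_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
        }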

[PATCH 1/3] enhanced lookup ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

This is a patch to enhance ESTALE error handling during the
lookup process.  The error, ESTALE, can occur when out-of-date
dentries, stored in the dcache, are used to translate a pathname
component to a dentry.  When this occurs, the dentry which
contains the pointer to the inode which refers to the non-existent
file is dropped from the dcache and then the lookup process is
started again.  Care is taken to ensure that forward progress is
always being made.  If forward progress is not detected, then the
lookup process is terminated and the error, ENOENT, is returned
to the caller.

  Thanx...

 ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.24.i686/fs/namei.c.org
+++ linux-2.6.24.i686/fs/namei.c
@@ -741,7 +741,7 @@ static __always_inline void follow_dotdo
 {
struct fs_struct *fs = current->fs;
 
-   while(1) {
+   while (1) {
struct vfsmount *parent;
struct dentry *old = nd->dentry;
 
@@ -840,7 +840,7 @@ static fastcall int __link_path_walk(con
lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
/* At this point we know we have a real path component. */
-   for(;;) {
+   for (;;) {
unsigned long hash;
struct qstr this;
unsigned int c;
@@ -992,7 +992,7 @@ return_reval:
 */
if (nd->dentry && nd->dentry->d_sb &&
(nd->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
-   err = -ESTALE;
+   err = -ENOENT;
/* Note: we do not d_invalidate() */
if (!nd->dentry->d_op->d_revalidate(nd->dentry, nd))
break;
@@ -1003,6 +1003,8 @@ out_dput:
dput_path(&next, nd);
break;
}
+   if (err == -ESTALE)
+   d_drop(nd->dentry);
path_release(nd);
 return_err:
return err;
@@ -1019,13 +1021,24 @@ static int fastcall link_path_walk(const
 {
struct nameidata save = *nd;
int result;
+   struct dentry *svd;
 
/* make sure the stuff we saved doesn't go away */
dget(save.dentry);
mntget(save.mnt);
 
+   svd = nd->dentry;
result = __link_path_walk(name, nd);
-   if (result == -ESTALE) {
+   while (result == -ESTALE) {
+   /*
+* If no progress was made looking up the pathname,
+* then stop and return ENOENT instead of ESTALE.
+*/
+   if (nd->dentry == svd) {
+   result = -ENOENT;
+   break;
+   }
+   svd = nd->dentry;
*nd = save;
dget(nd->dentry);
mntget(nd->mnt);
@@ -1712,7 +1725,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
 
acc_mode = ACC_MODE(flag);
 
@@ -1739,7 +1755,8 @@ int open_namei(int dfd, const char *path
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 
@@ -1812,10 +1829,17 @@ ok:
return 0;
 
 exit_dput:
+   if (error == -ESTALE)
+   d_drop(path.dentry);
   dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }
path_release(nd);
return error;
 
@@ -1825,7 +1849,7 @@ do_link:
goto exit_dput;
/*
 * This is subtle. Instead of calling do_follow_link() we do the
-* thing by hands. The reason is that this way we have zero link_count
+* thing by hand. The reason is that this way we have zero link_count
 * and path_walk() (called from ->follow_link) honoring LOOKUP_PARENT.
 * After that we have the parent and last component, i.e.
 * we are in the same situation as after the first path_walk().
@@ -1844,6 +1868,8 @@ do_link:
 * with "intent.open".
 */
release_open_intent(nd);
+   if (error == ESTALE)
+   goto top;
return error;
}
nd->flags &= ~LOOKUP_PARENT;
@@ -1857,7 +1883,7 @@ do_link:
goto exit;
}
error = -ELOOP;
-   if (count++==32) {
+   if (count++ == 32) {
__putname(nd->last.name);
goto exit;
}


[PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process, to unwind
appropriately, and then to perform the lookup process again.
Eventually, the lookup process will return either an error
or a valid dentry/inode combination, and then the operation can
succeed or fail based on its own merits.

A partial list of the updated system calls includes stat, stat64,
lstat, lstat64, mkdir, link, open, access, chmod, chown,
readlink, utime, utimes, chdir, chroot, rename, exec, mknod,
statfs, inotify, setxattr, getxattr, and listxattr.  Due to
common code factoring, other system calls may have been
included too, but were not explicitly tested.

  Thanx...

 ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.24.i686/fs/namei.c.org
+++ linux-2.6.24.i686/fs/namei.c
@@ -1956,6 +1982,7 @@ asmlinkage long sys_mknodat(int dfd, con
if (IS_ERR(tmp))
return PTR_ERR(tmp);
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -1986,6 +2013,8 @@ asmlinkage long sys_mknodat(int dfd, con
}
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 
@@ -2021,8 +2050,8 @@ int vfs_mkdir(struct inode *dir, struct 
 
 asmlinkage long sys_mkdirat(int dfd, const char __user *pathname, int mode)
 {
-   int error = 0;
-   char * tmp;
+   int error;
+   char *tmp;
struct dentry *dentry;
struct nameidata nd;
 
@@ -2031,6 +2060,7 @@ asmlinkage long sys_mkdirat(int dfd, con
if (IS_ERR(tmp))
goto out_err;
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -2046,6 +2076,8 @@ asmlinkage long sys_mkdirat(int dfd, con
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 out_err:
@@ -2125,23 +2157,24 @@ static long do_rmdir(int dfd, const char
struct nameidata nd;
 
name = getname(pathname);
-   if(IS_ERR(name))
+   if (IS_ERR(name))
return PTR_ERR(name);
 
+top:
error = do_path_lookup(dfd, name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
 
-   switch(nd.last_type) {
-   case LAST_DOTDOT:
-   error = -ENOTEMPTY;
-   goto exit1;
-   case LAST_DOT:
-   error = -EINVAL;
-   goto exit1;
-   case LAST_ROOT:
-   error = -EBUSY;
-   goto exit1;
+   switch (nd.last_type) {
+   case LAST_DOTDOT:
+   error = -ENOTEMPTY;
+   goto exit1;
+   case LAST_DOT:
+   error = -EINVAL;
+   goto exit1;
+   case LAST_ROOT:
+   error = -EBUSY;
+   goto exit1;
}
mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
@@ -2154,6 +2187,8 @@ exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
 exit1:
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2206,12 +2241,14 @@ static long do_unlinkat(int dfd, const c
char * name;
struct dentry *dentry;
struct nameidata nd;
-   struct inode *inode = NULL;
+   struct inode *inode;
 
name = getname(pathname);
if(IS_ERR(name))
return PTR_ERR(name);
 
+top:
+   inode = NULL;
error = do_path_lookup(dfd, name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
@@ -2237,6 +2274,8 @@ static long do_unlinkat(int dfd, const c
iput(inode);/* truncate the inode here */
 exit1:
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2301,6 +2340,7 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(to))
goto out_putname;
 
+top:
error = do_path_lookup(newdfd, to, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -2314,6 +2354,8 @@ asmlinkage long sys_symlinkat(const char
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(to);
 out_putname:
@@ -2389,6 +2431,7 @@ asmlinkage long sys_linkat(int olddfd, c
if (IS_ERR(to))
return PTR_ERR(to);
 
+top:
error = __user_walk_fd(olddfd, oldname,
   flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
   &old_nd);
@@ -2408,6 +2

[PATCH 0/3] enhanced ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

Here is version 2 of a patch set which modifies the system to
enhance the ESTALE error handling for system calls which take
pathnames as arguments.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server.  This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client.  The NFS server also
returns this error when the file resides upon a file system
which is no longer exported.  Additionally, some NFS servers
even change the file handle when a file is renamed, although
this practice is discouraged.

This error occurs even if a file or directory, with the same
name, is recreated on the server without the client being
aware of it.  The file handle refers to a specific instance
of a file and deleting the file and then recreating it creates
a new instance of the file.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server.  This can easily
happen in system calls such as stat(2) when the pathname is
converted to a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.

This error can also occur when a change is made on the server
in between looking up different components of the pathname or
between a successful lookup and a subsequent operation.

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this.  These system calls
should either fail with an ENOENT error if the pathname can not
be successfully translated to a dentry/inode pair or succeed
or fail based on their own semantics.  In the above example,
stat(2), restarting at the pathname lookup will either cause the
system call to succeed or fail, depending upon whether the
file really exists or not.

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care is taken to ensure that forward progress is always being
made in order to avoid infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again.  Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ.  In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
application environments continue to work.

The loose consistency model of file systems such as NFS is
exacerbated by the large granularity of timestamps available
for files on file systems such as ext3.  The NFS client may not
be able to detect changes in directories due to multiple
changes occurring in the same second, for example.

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these system calls like
there is with system calls which take pathnames as arguments.

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks.  When two or more copies of this program
are running, many ESTALE errors can be seen over the network.
Without these patches, the test program errors out almost
immediately.  With these patches, the test program runs
for as long as one desires.

Comments?

  Thanx...

 ps
#
#define _XOPEN_SOURCE 500
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/statfs.h>
#include <sys/inotify.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

void mkdir_test(void);
void link_test(void);
void open_test(void);
void access_test(void);
void chmod_test(void);
void chown_test(void);
void readlink_test(void);
void utimes_test(void);
void chdir_test(void);
void chroot_test(void);
void rename_test(void);
void exec_test(void);
void mknod_test(void);
void statfs_test(void);
void truncate_test(void);
void xattr_test(void);
void inotify_test(void);

struct tests {
	void (*test)(void);
};

struct tests tests[] = {
	mkdir_test,
	link_test,
	open_test,
	access_test,
	chmod_test,
	chown_test,
	readlink_test,
	utimes_test,
	chdir_test,
	chroot_test,
	rename_test,
	exec_test,
	mknod_test,
	statfs_test,
	truncate_test,
	xattr_test,
	inotify_test

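The attached test program is truncated by the archive.  As a rough, self-contained stand-in
for the kind of stress described above (several copies run against the same export through
mounts that do not share a superblock, one of them constantly recreating the target file),
something like the following would serve; it is a hypothetical illustration, not the
author's attachment:

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>

        int main(int argc, char *argv[])
        {
                const char *path = argc > 1 ? argv[1] : "testfile";
                int churn = argc > 2 && strcmp(argv[2], "churn") == 0;
                struct stat st;

                for (;;) {
                        if (churn) {
                                /* One instance keeps recreating the file so that
                                 * the other instances' cached handles go stale. */
                                unlink(path);
                                int fd = open(path, O_CREAT | O_WRONLY, 0644);
                                if (fd >= 0)
                                        close(fd);
                        } else if (stat(path, &st) == -1 && errno == ESTALE) {
                                /* With the patches applied, a pathname syscall
                                 * should never let ESTALE reach userspace. */
                                fprintf(stderr, "stat: ESTALE leaked to userspace\n");
                                exit(1);
                        }
                }
        }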
[PATCH 0/3] enhanced ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

Here is version 2 of a patch set which modifies the system to
enhance the ESTALE error handling for system calls which take
pathnames as arguments.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server.  This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client.  The NFS server also
returns this error when the file resides upon a file system
which is no longer exported.  Additionally, some NFS servers
even change the file handle when a file is renamed, although
this practice is discouraged.

This error occurs even if a file or directory, with the same
name, is recreated on the server without the client being
aware of it.  The file handle refers to a specific instance
of a file and deleting the file and then recreating it creates
a new instance of the file.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server.  This can easily
happen in system calls such as stat(2) when the pathname is
converted a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.

This error can also occur when a change is made on the server
in between looking up different components of the pathname to
be looked up or between a successful lookup and a subsequent
operation.

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this.  These system calls
should either fail with an ENOENT error if the pathname can not
be successfully be translated to a dentry/inode pair or succeed
or fail based on their own semantics.  In the above example,
stat(2), restarting at the pathname lookup will either cause the
system call to succeed or fail, depending upon whether the
file really exists or not.

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care is taken to ensure that forward progress is always being
made in order to avoiding infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again.  Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ.  In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
applications environments continue to work.

The loose consistency model of file systems such as NFS is
exacerbated by the large granularity of timestamps available
for files on file systems such ext3.  The NFS client may not
be able to detect changes in directories due to multiple
changes occurring in the same second, for example.

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these systems calls like
there is with system calls which take pathnames as arguments.

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks.  When two or more copies of this program
are running, many ESTALE errors can be seen over the network.
Without these patches, the test program errors out almost
immediately.  With these patches, the test program runs
for as long one desires.

Comments?

  Thanx...

 ps
#
#define _XOPEN_SOURCE 500
#define _LARGEFILE64_SOURCE
#include sys/types.h
#include sys/stat.h
#include sys/statfs.h
#include sys/inotify.h
#include errno.h
#include fcntl.h
#include stdio.h
#include stdlib.h
#include unistd.h
#include signal.h

void mkdir_test(void);
void link_test(void);
void open_test(void);
void access_test(void);
void chmod_test(void);
void chown_test(void);
void readlink_test(void);
void utimes_test(void);
void chdir_test(void);
void chroot_test(void);
void rename_test(void);
void exec_test(void);
void mknod_test(void);
void statfs_test(void);
void truncate_test(void);
void xattr_test(void);
void inotify_test(void);

struct tests {
	void (*test)(void);
};

struct tests tests[] = {
	mkdir_test,
	link_test,
	open_test,
	access_test,
	chmod_test,
	chown_test,
	readlink_test,
	utimes_test,
	chdir_test,
	chroot_test,
	

[PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process and then to unwind
appropriately and then to perform the lookup process again.
Eventually, either the lookup process will return an error
or a valid dentry/inode combination and then operation can
succeed or fail based on its own merits.

A partial list of the updated system calls are stat, stat64,
lstat, lstat64, mkdir, link, open, access, chmod, chown,
readlink, utime, utimes, chdir, chroot, rename, exec, mknod,
statfs, inotify, setxattr, getxattr, and listxattr.  Due to
common code factoring, other system calls may have been
included too, but were not explicitly tested.

  Thanx...

 ps

Signed-off-by: Peter Staubach [EMAIL PROTECTED]
--- linux-2.6.24.i686/fs/namei.c.org
+++ linux-2.6.24.i686/fs/namei.c
@@ -1956,6 +1982,7 @@ asmlinkage long sys_mknodat(int dfd, con
if (IS_ERR(tmp))
return PTR_ERR(tmp);
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, nd);
if (error)
goto out;
@@ -1986,6 +2013,8 @@ asmlinkage long sys_mknodat(int dfd, con
}
mutex_unlock(nd.dentry-d_inode-i_mutex);
path_release(nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 
@@ -2021,8 +2050,8 @@ int vfs_mkdir(struct inode *dir, struct 
 
 asmlinkage long sys_mkdirat(int dfd, const char __user *pathname, int mode)
 {
-   int error = 0;
-   char * tmp;
+   int error;
+   char *tmp;
struct dentry *dentry;
struct nameidata nd;
 
@@ -2031,6 +2060,7 @@ asmlinkage long sys_mkdirat(int dfd, con
if (IS_ERR(tmp))
goto out_err;
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, nd);
if (error)
goto out;
@@ -2046,6 +2076,8 @@ asmlinkage long sys_mkdirat(int dfd, con
 out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
path_release(nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 out_err:
@@ -2125,23 +2157,24 @@ static long do_rmdir(int dfd, const char
struct nameidata nd;
 
name = getname(pathname);
-   if(IS_ERR(name))
+   if (IS_ERR(name))
return PTR_ERR(name);
 
+top:
error = do_path_lookup(dfd, name, LOOKUP_PARENT, nd);
if (error)
goto exit;
 
-   switch(nd.last_type) {
-   case LAST_DOTDOT:
-   error = -ENOTEMPTY;
-   goto exit1;
-   case LAST_DOT:
-   error = -EINVAL;
-   goto exit1;
-   case LAST_ROOT:
-   error = -EBUSY;
-   goto exit1;
+   switch (nd.last_type) {
+   case LAST_DOTDOT:
+   error = -ENOTEMPTY;
+   goto exit1;
+   case LAST_DOT:
+   error = -EINVAL;
+   goto exit1;
+   case LAST_ROOT:
+   error = -EBUSY;
+   goto exit1;
}
mutex_lock_nested(nd.dentry-d_inode-i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(nd);
@@ -2154,6 +2187,8 @@ exit2:
mutex_unlock(nd.dentry-d_inode-i_mutex);
 exit1:
path_release(nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2206,12 +2241,14 @@ static long do_unlinkat(int dfd, const c
char * name;
struct dentry *dentry;
struct nameidata nd;
-   struct inode *inode = NULL;
+   struct inode *inode;
 
name = getname(pathname);
if(IS_ERR(name))
return PTR_ERR(name);
 
+top:
+   inode = NULL;
error = do_path_lookup(dfd, name, LOOKUP_PARENT, nd);
if (error)
goto exit;
@@ -2237,6 +2274,8 @@ static long do_unlinkat(int dfd, const c
iput(inode);/* truncate the inode here */
 exit1:
path_release(nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2301,6 +2340,7 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(to))
goto out_putname;
 
+top:
error = do_path_lookup(newdfd, to, LOOKUP_PARENT, nd);
if (error)
goto out;
@@ -2314,6 +2354,8 @@ asmlinkage long sys_symlinkat(const char
 out_unlock:
mutex_unlock(nd.dentry-d_inode-i_mutex);
path_release(nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(to);
 out_putname:
@@ -2389,6 +2431,7 @@ asmlinkage long sys_linkat(int olddfd, c
if (IS_ERR(to))
return PTR_ERR(to);
 
+top:
error = __user_walk_fd(olddfd, oldname,
   flags  AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0

[PATCH 3/3] enhanced NFS ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Hi.

The patch enhanced the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check to see whether the
directory becomes valid again.  This sounds odd, but can occur
when a systems administrator, accidently or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can be become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years and this
support can help to alleviate their situations.

  Thanx...

 ps

Signed-off-by: Peter Staubach [EMAIL PROTECTED]
--- linux-2.6.24.i686/fs/nfs/inode.c.org
+++ linux-2.6.24.i686/fs/nfs/inode.c
@@ -192,7 +192,7 @@ void nfs_invalidate_atime(struct inode *
  */
 static void nfs_invalidate_inode(struct inode *inode)
 {
-   set_bit(NFS_INO_STALE, NFS_FLAGS(inode));
+   nfs_handle_estale(-ESTALE, inode);
nfs_zap_caches_locked(inode);
 }
 
@@ -385,6 +385,8 @@ nfs_setattr(struct dentry *dentry, struc
error = NFS_PROTO(inode)-setattr(dentry, fattr, attr);
if (error == 0)
nfs_refresh_inode(inode, fattr);
+   else
+   nfs_handle_estale(error, inode);
unlock_kernel();
return error;
 }
@@ -629,7 +631,7 @@ int nfs_release(struct inode *inode, str
 int
 __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
-   int  status = -ESTALE;
+   int status = -ESTALE;
struct nfs_fattr fattr;
struct nfs_inode *nfsi = NFS_I(inode);
 
@@ -640,15 +642,25 @@ __nfs_revalidate_inode(struct nfs_server
lock_kernel();
if (is_bad_inode(inode))
goto out_nowait;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode)  !S_ISDIR(inode-i_mode))
goto out_nowait;
 
status = nfs_wait_on_inode(inode);
if (status  0)
goto out;
 
+   /*
+* Do we believe that the file handle is still stale?
+* For non-directories, once stale, always stale.
+* For directories, believe the stale status for the
+* attribute cache timeout period, and then try again.
+* This will help to address the problem of the server
+* admin accidently unexporting a file system without
+* stopping the NFS server first.
+*/
status = -ESTALE;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) 
+   (!S_ISDIR(inode-i_mode) || !nfs_attribute_timeout(inode)))
goto out;
 
status = NFS_PROTO(inode)-getattr(server, NFS_FH(inode), fattr);
@@ -656,11 +668,9 @@ __nfs_revalidate_inode(struct nfs_server
dfprintk(PAGECACHE, nfs_revalidate_inode: (%s/%Ld) getattr 
failed, error=%d\n,
 inode-i_sb-s_id,
 (long long)NFS_FILEID(inode), status);
-   if (status == -ESTALE) {
+   nfs_handle_estale(status, inode);
+   if (status == -ESTALE)
nfs_zap_caches(inode);
-   if (!S_ISDIR(inode-i_mode))
-   set_bit(NFS_INO_STALE, NFS_FLAGS(inode));
-   }
goto out;
}
 
@@ -986,14 +996,28 @@ static int nfs_update_inode(struct inode
__FUNCTION__, inode-i_sb-s_id, inode-i_ino,
atomic_read(inode-i_count), fattr-valid);
 
-   if (nfsi-fileid != fattr-fileid)
-   goto out_fileid;
+   if (nfsi-fileid != fattr-fileid) {
+   printk(KERN_ERR NFS: server %s error: fileid changed\n
+   fsid %s: expected fileid 0x%Lx, got 0x%Lx\n,
+   NFS_SERVER(inode)-nfs_client-cl_hostname,
+   inode-i_sb-s_id,
+   (long long)nfsi-fileid, (long long)fattr-fileid);
+   goto out_err;
+   }
 
/*
 * Make sure the inode's type hasn't changed.
 */
-   if ((inode-i_mode  S_IFMT) != (fattr-mode  S_IFMT))
-   goto out_changed;
+   if ((inode-i_mode  S_IFMT) != (fattr-mode  S_IFMT)) {
+   /*
+* Big trouble! The inode has become a different object.
+*/
+   printk(KERN_DEBUG %s: inode %ld mode changed, %07o to %07o\n,
+   __FUNCTION__, inode-i_ino, inode-i_mode, fattr-mode);
+   goto out_err;
+   }
+
+   nfs_clear_estale(inode);
 
server = NFS_SERVER(inode);
/* Update the fsid? */
@@ -1099,12 +1123,7 @@ static int nfs_update_inode(struct inode
nfsi-cache_validity = ~NFS_INO_REVAL_FORCED;
 
return 0;
- out_changed:
-   /*
-* Big trouble! The inode has become a different object

Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Miklos Szeredi wrote:

This doesn't apply to -mm, because the ro-mounts stuff touches a lot
of the same places as this patch.  You probably need to rebase this on
top of those changes.

  

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process and then to unwind
appropriately and then to perform the lookup process again.
Eventually, either the lookup process will return an error
or a valid dentry/inode combination and then the operation can
succeed or fail based on its own merits.



If a broken NFS server or FUSE filesystem keeps returning ESTALE, this
goes into an infinite loop.  How are we planning to deal with that?

  


Would you describe the situation that would cause the kernel to
go into an infinite loop, please?

Please note that, at least for NFS, this looping is interruptible
by the user, so the system can't hang without anything that can
be done.


And it has to be dealt with either in the VFS, or in the kernel parts
of the relevant filesystems.  We can't just say, fix the broken
servers, especially not with FUSE, where the server is totally
untrusted.


Nope, certainly can't depend upon fixing servers.  The client
should not depend upon the server to avoid things like looping.

   Thanx...

  ps


Re: [PATCH 3/3] enhanced NFS ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Trond Myklebust wrote:

On Fri, 2008-02-01 at 15:58 -0500, Peter Staubach wrote:
  

Hi.

The patch enhanced the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check to see whether the
directory becomes valid again.  This sounds odd, but can occur
when a systems administrator, accidentally or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years and this
support can help to alleviate their situations.



As far as I can see, this patch can be applied separately from the VFS
fixes. If so, would it make sense for me to take charge of this patch in
the NFS tree, while Andrew queues up the other two VFS changes in -mm?


Yes, I think that this would make good sense.

   Thanx...

  ps



Re: [PATCH 2/3] enhanced syscall ESTALE error handling (v2)

2008-02-01 Thread Peter Staubach

Miklos Szeredi wrote:

This doesn't apply to -mm, because the ro-mounts stuff touches a lot
of the same places as this patch.  You probably need to rebase this on
top of those changes.

  
  

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process and then to unwind
appropriately and then to perform the lookup process again.
Eventually, either the lookup process will return an error
or a valid dentry/inode combination and then the operation can
succeed or fail based on its own merits.



If a broken NFS server or FUSE filesystem keeps returning ESTALE, this
goes into an infinite loop.  How are we planning to deal with that?

  
  

Would you describe the situation that would cause the kernel to
go into an infinite loop, please?



The patch basically does:

do {
...
error = inode->i_op->foo()
...
} while (error == ESTALE);

What is the guarantee, that ->foo() will not always return ESTALE?


You skimmed over some stuff, like the pathname lookup component
contained in the first set of dots...

I can't guarantee that ->foo() won't always return ESTALE.

That said, the loop is not unbreakable.  At least for NFS, a signal
to the process will interrupt the loop because the error returned
will change from ESTALE to EINTR.
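
In rough, hypothetical form (the helper name is made up; this is not
the patch code itself), the retry behaves like:

	/* Sketch only: redo the lookup plus the operation while the result
	 * is ESTALE.  Any other result -- success, a hard error, or EINTR
	 * once a signal is pending -- ends the loop. */
	static int retry_on_estale(int (*op)(void))
	{
		int error;

		do {
			error = op();	/* pathname lookup + the actual operation */
		} while (error == -ESTALE);

		return error;
	}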

These changes include the base assumption that the components of
the underlying file system are basically reliable, that there is
a way to deal with bugs and/or malicious entities in the short
term, and that these things will be dealt with appropriately
in the longer term.

The short term resolution is a signal.  The longer term fix is
to hunt down the bug or the malicious entity and either make it
go away or fence it off via some security measure or another to
prevent it from causing another problem.

If the underlying file system is the type that could potentially
return ESTALE, then it needs to be aware of the system architecture
and handle things appropriately.

   Thanx...

  ps


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Peter Staubach

Linus Torvalds wrote:

On Fri, 18 Jan 2008, Ingo Oeser wrote:
  

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"? 


That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)



I certainly don't mind it if we have some tighter guarantees, but what I'd 
want is:


 - keep it simple. Let's face it, Linux has never ever given those 
   guarantees before, and it's not as if anybody has really cared. Even 
   now, the issue seems to be more about paper standards conformance than 
   anything else.


  


I have been working on getting something supported here
because I have some very large Wall Street customers who do
care about getting the mtime updated because their backups
are getting corrupted.  They are incomplete because although
their applications update files, they don't get backed up
because the mtime never changes.
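
As a concrete (hypothetical) way to see the problem, dirty a file
through a shared mapping and compare the mtime before and after; on
kernels without the mmap time-stamp work the two values typically
come out identical, which is exactly what defeats mtime-driven backups:

	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		struct stat before, after;
		/* hypothetical pre-existing, non-empty file on the fs under test */
		int fd = open("testfile", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

		fstat(fd, &before);
		sleep(1);			/* step past 1-second timestamp granularity */
		p[0] ^= 1;			/* store through the mapping */
		msync(p, 4096, MS_SYNC);
		fstat(fd, &after);
		printf("mtime %s\n",
		       before.st_mtime == after.st_mtime ? "unchanged" : "updated");
		return 0;
	}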

 - I get worried about people playing around with the dirty bit in 
   particular. We have had some really rather nasty bugs here. Most of 
   which are totally impossible to trigger under normal loads (for 
   example the old random-access utorrent writable mmap issue from about 
   a year ago).


So these two issues - the big red danger signs flashing in my brain, 
coupled with the fact that no application has apparently ever really 
noticed in the last 15 years - just makes it a case where I'd like each 
step of the way to be obvious and simple and no larger than really 
absolutely necessary.


Simple is good.  However, too simple is not good.  I would suggest
that we implement file time updates which make sense and if they
happen to follow POSIX, then nifty, otherwise, oh well.

   Thanx...

  ps




Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

J. Bruce Fields wrote:

On Fri, Jan 18, 2008 at 01:12:03PM -0500, Peter Staubach wrote:
  

Chuck Lever wrote:


On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
  

I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.

Unless someone is doing something malicious.  Or if the server is  
repeatedly returning ESTALE for some reason.


  

If the server is repeatedly returning ESTALE, then the pathname
resolution will fail to make progress and give up, return ENOENT
to the user level.

A malicious user on the network can cause so many other problems
than just something like this too.  But, in this case, the user
would have to predict why and when the client was issuing a
specific operation and know whether or not to return ESTALE.
This seems quite far fetched and quite unlikely to me.



Any idea what the consequences would be in this case?  It at least
shouldn't overflow the stack, or freeze the whole machine (because it
spins indefinitely under some crucial lock), or panic, etc.  (If the one
filesystem just becomes unusable--well, fine, what better can you hope
for in the presence of a malicious server or network?)


Assuming that such a user could precisely and accurately predict
when to return ESTALE, the particular system call would just stay
in the kernel, sending out requests to the NFS server.

It wouldn't overflow the stack because the recovery is done by
looping and not by recursion and unless there is a bug that needs
to be fixed, all necessary resources are released before the
retries occur.  The machine wouldn't freeze because as soon as
the request is sent, the process blocks and some other process
can be scheduled.  The process should be interruptible, so it
can even be signaled to stop the activity.

It seems to me that mostly, the file system will become unusable,
but as Bruce points out, what do you expect in the presence of a
malicious entity?  If such are a concern, then measures such as
stronger security can be employed to prevent them from wreaking
havoc.

   Thanx...

  ps


Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is 
passed to the file system during each lookup that forces each 
component in the path to be revalidated on the server.  This has 
no possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we should 
give up.  When I addressed this issue two years ago, the two-try 
logic was the only acceptable solution because there's no way to 
guarantee the pathname resolution will ever finish unless we put a 
hard limit on it.




I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.


Unless someone is doing something malicious.  Or if the server is 
repeatedly returning ESTALE for some reason.




If the server is repeatedly returning ESTALE, then the pathname
resolution will fail to make progress and give up, return ENOENT
to the user level.

A malicious user on the network can cause so many other problems
than just something like this too.  But, in this case, the user
would have to predict why and when the client was issuing a
specific operation and know whether or not to return ESTALE.
This seems quite far fetched and quite unlikely to me.


There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close all 
the windows that result from sloppy file time stamps without 
completely disabling client-side caching.  The NFS protocol relies 
on file time stamps to manage cache coherence.  If the server is 
lying about time stamps, there's no way the client can cache 
coherently.




Server bug or not, it is something that the client has to live
with.  We can't get the server file system fixed, so it is
something that we should find a way to live with.  This support
can help.


We haven't identified a server-side solution yet, but that doesn't 
mean it doesn't exist.




No, it doesn't and I, and most everyone else, would also like to
see such a solution.  That said, I am pretty sure that we are not
going to get a fix for ext3 and forcing everyone to move away from
ext3 is not a good solution either.

If we address the time stamp problem in the client, should we also go 
to lengths to address it in every other corner of the NFS client?  
Should we also address every other server bug we discover with a 
client side fix?




These aren't asked seriously, are they?

When possible, we get the server bug fixed.  When not possible,
such as the time stamp issue with ext3, we attempt work around
it as best as possible.



Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then it's 
a real error.  Is there a fixed and valid recovery script for the 
client in this case that will allow the mkdir to proceed?




Why do you think that it is an error?


Because this is a problem that sometimes requires application-level 
recovery.  Can we guarantee that retrying the mkdir is the right thing 
to do every time?




When would not retrying the MKDIR be the right thing to do?
When doing a mkdir("a/b"), the user can not tell nor cares
which instance of directory "a" is the one that gets "b" created
in it.

Which cases are the ones that you see that require user
level recovery?


It can easily occur if the directory in which the new directory
is to be created disappears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.


Have you tried this client against a file server when you unexport the 
filesystem under test?  The server returns ESTALE no matter what the 
client does.  Should the client continue to retry the request if the 
file system has been permanently taken offline?




Since the NFS client supports "intr", then why not continue to
retry the request?  It certainly won't hurt the network, trying
at most once every acdirmin timeout seconds.  This, by default,
would be once every 30 seconds.
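
The rate limiting falls out of the attribute cache.  The revalidation
gate in the PATCH 3/3 hunk is, in rough paraphrase:

	/* A stale non-directory stays stale; a stale directory is only
	 * re-probed on the wire once its cached attributes (acdirmin,
	 * 30 seconds by default) have timed out. */
	if (NFS_STALE(inode) &&
	    (!S_ISDIR(inode->i_mode) || !nfs_attribute_timeout(inode)))
		return -ESTALE;		/* still believed stale, no new GETATTR */

so a dead export costs at most one GETATTR per directory per timeout
period.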

This would alleviate a long standing 

Re: [PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

J. Bruce Fields wrote:

On Fri, Jan 18, 2008 at 11:45:52AM -0500, Peter Staubach wrote:
  

Matthew Wilcox wrote:


On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote:
  

 static int path_lookup_create(int dfd, const char *name,
- unsigned int lookup_flags, struct nameidata *nd,
- int open_flags, int create_mode)
+   unsigned int lookup_flags, struct nameidata *nd,
+   int open_flags, int create_mode)



Gratuitous reformatting?

  
  

Elimination of an overly long line?



I usually try to gather any coding style, comment grammar, etc., fixes
into a single patch or two at the beginning of a series.  That keeps the
substantive patches (the hardest to understand) shorter.

  


That's probably great advice.  I can easily enough undo the change
since it does not affect the functionality of the patch.  It was
made while I was doing the analysis for the patch and to make the
style better match the style used in other surrounding routines.

   Thanx...

  ps


--b.

  

@@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
acc_mode = ACC_MODE(flag);
 @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 @@ -1812,10 +1833,17 @@ ok:
return 0;
  exit_dput:
+   if (error == -ESTALE)
+   d_drop(path.dentry);
dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }



I wonder if a tail-call might not work better here.
  

"Tail-call"?

   Thanx...

  ps


Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is 
passed to the file system during each lookup that forces each 
component in the path to be revalidated on the server.  This has no 
possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we should 
give up.  When I addressed this issue two years ago, the two-try logic 
was the only acceptable solution because there's no way to guarantee 
the pathname resolution will ever finish unless we put a hard limit on 
it.




I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.


There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close all 
the windows that result from sloppy file time stamps without 
completely disabling client-side caching.  The NFS protocol relies on 
file time stamps to manage cache coherence.  If the server is lying 
about time stamps, there's no way the client can cache coherently.




Server bug or not, it is something that the client has to live
with.  We can't get the server file system fixed, so it is
something that we should find a way to live with.  This support
can help.


and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.


A use case or two would be useful to explore (on linux-nfs or 
linux-fsdevel, rather than lkml).




I created a bunch of use cases in the gensyscall.c program that
I attached to the original description of the problem and my
proposed solution.  It was very useful in generating many, many
ESTALE errors over the wire from a variety of different over the
wire operations, which were originally getting returned to the
user level.


Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then it's a 
real error.  Is there a fixed and valid recovery script for the client 
in this case that will allow the mkdir to proceed?




Why do you think that it is an error?

It can easily occur if the directory in which the new directory
is to be created disappears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.

Admittedly, the NFS client could recover more cleanly from some of 
these problems, but given the architecture of the Linux VFS, it will 
be difficult to address some of the corner cases. 


Could you outline some of these corner cases that this proposal
would not address, please?

I ran the test program for many hours, against several different
servers, and although I can't prove completeness, was not able to
show any ESTALE errors being returned unexpectedly.

   Thanx...

  ps


Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is passed 
to the file system during each lookup that forces each component in 
the path to be revalidated on the server.  This has no possibility of 
causing an infinite loop.


Is there some part of this logic that is no longer working? 


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.  There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.
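
For reference, the retry being talked about here is the single
forced-revalidation pass in link_path_walk(); in the current code (it
shows up as context in the PATCH 1/3 diff later in this series) it is
essentially:

	result = __link_path_walk(name, nd);
	if (result == -ESTALE) {
		*nd = save;			/* restart once from the saved starting point */
		dget(nd->dentry);
		mntget(nd->mnt);
		nd->flags |= LOOKUP_REVAL;	/* force every component to be revalidated */
		result = __link_path_walk(name, nd);
	}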

Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.
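
A hypothetical interleaving that produces exactly this:

	/*
	 * mkdir("a/b") against an NFS mount:
	 *
	 *   client                              other client / server admin
	 *   ------                              ----------------------------
	 *   LOOKUP "a"       -> fh_a (OK)
	 *                                        "a" removed and re-created,
	 *                                        or the export is bounced
	 *   MKDIR fh_a, "b"  -> ESTALE          (fh_a no longer valid)
	 *
	 * The pathname walk itself saw no error, so without this handling
	 * the ESTALE from MKDIR goes straight back to the application.
	 */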

   Thanx...

  ps


Re: [PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Matthew Wilcox wrote:

On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote:
  

@@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const
mntget(save.mnt);
 
 	result = __link_path_walk(name, nd);

-   if (result == -ESTALE) {
+   while (result == -ESTALE) {
+   /*
+* If no progress was made looking up the pathname,
+* then stop and return ENOENT instead of ESTALE.
+*/
+   if (nd->dentry == save.dentry) {
+   result = -ENOENT;
+   break;
+   }
*nd = save;
dget(nd->dentry);
mntget(nd->mnt);
nd->flags |= LOOKUP_REVAL;
result = __link_path_walk(name, nd);
+   /*
+* If no progress was made this time, then return
+* ENOENT instead of ESTALE because no recovery
+* is possible to recover the stale file handle.
+*/
+   if (result == -ESTALE && nd->dentry == save.dentry)
+   result = -ENOENT;
}
 
 	dput(save.dentry);



Why do you need both of these tests?  The first one should be enough,
surely?

  


Yes, good point.


@@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char
  * @create_mode: create intent flags
  */
 static int path_lookup_create(int dfd, const char *name,
- unsigned int lookup_flags, struct nameidata *nd,
- int open_flags, int create_mode)
+   unsigned int lookup_flags, struct nameidata *nd,
+   int open_flags, int create_mode)



Gratuitous reformatting?

  


Elimination of an overly long line?


@@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
 
 	acc_mode = ACC_MODE(flag);
 
@@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path

/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 
@@ -1812,10 +1833,17 @@ ok:

return 0;
 
 exit_dput:

+   if (error == -ESTALE)
+   d_drop(path.dentry);
dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }



I wonder if a tail-call might not work better here.


"Tail-call"?

   Thanx...

  ps


[PATCH 3/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

The patch enhanced the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check to see whether the
directory becomes valid again.  This sounds odd, but can occur
when a systems administrator, accidentally or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years and this
support can help to alleviate their situations.

   Thanx...

  ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.23.i686/fs/nfs/inode.c.org
+++ linux-2.6.23.i686/fs/nfs/inode.c
@@ -192,7 +192,7 @@ void nfs_invalidate_atime(struct inode *
  */
 static void nfs_invalidate_inode(struct inode *inode)
 {
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
+   nfs_handle_estale(-ESTALE, inode);
nfs_zap_caches_locked(inode);
 }
 
@@ -385,6 +385,8 @@ nfs_setattr(struct dentry *dentry, struc
error = NFS_PROTO(inode)->setattr(dentry, &fattr, attr);
if (error == 0)
nfs_refresh_inode(inode, &fattr);
+   else
+   nfs_handle_estale(error, inode);
unlock_kernel();
return error;
 }
@@ -629,7 +631,7 @@ int nfs_release(struct inode *inode, str
 int
 __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
-   int  status = -ESTALE;
+   int status = -ESTALE;
struct nfs_fattr fattr;
struct nfs_inode *nfsi = NFS_I(inode);
 
@@ -640,15 +642,25 @@ __nfs_revalidate_inode(struct nfs_server
lock_kernel();
if (is_bad_inode(inode))
goto out_nowait;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) && !S_ISDIR(inode->i_mode))
goto out_nowait;
 
status = nfs_wait_on_inode(inode);
if (status < 0)
goto out;
 
+   /*
+* Do we believe that the file handle is still stale?
+* For non-directories, once stale, always stale.
+* For directories, believe the stale status for the
+* attribute cache timeout period, and then try again.
+* This will help to address the problem of the server
+* admin "accidently" unexporting a file system without
+* stopping the NFS server first.
+*/
status = -ESTALE;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) &&
+   (!S_ISDIR(inode->i_mode) || !nfs_attribute_timeout(inode)))
goto out;
 
status = NFS_PROTO(inode)->getattr(server, NFS_FH(inode), &fattr);
@@ -656,11 +668,9 @@ __nfs_revalidate_inode(struct nfs_server
dfprintk(PAGECACHE, "nfs_revalidate_inode: (%s/%Ld) getattr 
failed, error=%d\n",
 inode->i_sb->s_id,
 (long long)NFS_FILEID(inode), status);
-   if (status == -ESTALE) {
+   nfs_handle_estale(status, inode);
+   if (status == -ESTALE)
nfs_zap_caches(inode);
-   if (!S_ISDIR(inode->i_mode))
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
-   }
goto out;
}
 
@@ -986,14 +996,28 @@ static int nfs_update_inode(struct inode
__FUNCTION__, inode->i_sb->s_id, inode->i_ino,
atomic_read(&inode->i_count), fattr->valid);
 
-   if (nfsi->fileid != fattr->fileid)
-   goto out_fileid;
+   if (nfsi->fileid != fattr->fileid) {
+   printk(KERN_ERR "NFS: server %s error: fileid changed\n"
+   "fsid %s: expected fileid 0x%Lx, got 0x%Lx\n",
+   NFS_SERVER(inode)->nfs_client->cl_hostname,
+   inode->i_sb->s_id,
+   (long long)nfsi->fileid, (long long)fattr->fileid);
+   goto out_err;
+   }
 
/*
 * Make sure the inode's type hasn't changed.
 */
-   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
-   goto out_changed;
+   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT)) {
+   /*
+* Big trouble! The inode has become a different object.
+*/
+   printk(KERN_DEBUG "%s: inode %ld mode changed, %07o to %07o\n",
+   __FUNCTION__, inode->i_ino, inode->i_mode, fattr->mode);
+   goto out_err;
+   }
+
+   nfs_clear_estale(inode);
 
server = NFS_SERVER(inode);
/* Update the fsid? */
@@ -1099,12 +1123,7 @@ s

[PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

This is a patch to enhance ESTALE error handling during the
lookup process.  The error, ESTALE, can occur when out of date
dentries, stored in the dcache, are used to translate a pathname
component to a dentry.  When this occurs, the dentry which
contains the pointer to the inode which refers to the non-existent
file is dropped from the dcache and then the lookup process
started again.  Care is taken to ensure that forward progress is
always being made.  If forward progress is not detected, then the
lookup process is terminated and the error, ENOENT, is returned
to the caller.
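
Condensed, the forward-progress check added below amounts to:

	/* If a retried walk ends up back at the dentry it started from,
	 * no progress is possible; report ENOENT rather than looping on
	 * ESTALE forever. */
	if (result == -ESTALE && nd->dentry == save.dentry)
		result = -ENOENT;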

   Thanx...

  ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.23.i686/fs/namei.c.org
+++ linux-2.6.23.i686/fs/namei.c
@@ -741,7 +741,7 @@ static __always_inline void follow_dotdo
 {
struct fs_struct *fs = current->fs;
 
-   while(1) {
+   while (1) {
struct vfsmount *parent;
struct dentry *old = nd->dentry;
 
@@ -840,7 +840,7 @@ static fastcall int __link_path_walk(con
lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
/* At this point we know we have a real path component. */
-   for(;;) {
+   for (;;) {
unsigned long hash;
struct qstr this;
unsigned int c;
@@ -992,7 +992,7 @@ return_reval:
 */
if (nd->dentry && nd->dentry->d_sb &&
(nd->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
-   err = -ESTALE;
+   err = -ENOENT;
/* Note: we do not d_invalidate() */
if (!nd->dentry->d_op->d_revalidate(nd->dentry, nd))
break;
@@ -1003,6 +1003,8 @@ out_dput:
dput_path(&next, nd);
break;
}
+   if (err == -ESTALE)
+   d_drop(nd->dentry);
path_release(nd);
 return_err:
return err;
@@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const
mntget(save.mnt);
 
result = __link_path_walk(name, nd);
-   if (result == -ESTALE) {
+   while (result == -ESTALE) {
+   /*
+* If no progress was made looking up the pathname,
+* then stop and return ENOENT instead of ESTALE.
+*/
+   if (nd->dentry == save.dentry) {
+   result = -ENOENT;
+   break;
+   }
*nd = save;
dget(nd->dentry);
mntget(nd->mnt);
nd->flags |= LOOKUP_REVAL;
result = __link_path_walk(name, nd);
+   /*
+* If no progress was made this time, then return
+* ENOENT instead of ESTALE because no recovery
+* is possible to recover the stale file handle.
+*/
+   if (result == -ESTALE && nd->dentry == save.dentry)
+   result = -ENOENT;
}
 
dput(save.dentry);
@@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char
  * @create_mode: create intent flags
  */
 static int path_lookup_create(int dfd, const char *name,
- unsigned int lookup_flags, struct nameidata *nd,
- int open_flags, int create_mode)
+   unsigned int lookup_flags, struct nameidata *nd,
+   int open_flags, int create_mode)
 {
return __path_lookup_intent_open(dfd, name, lookup_flags|LOOKUP_CREATE,
nd, open_flags, create_mode);
@@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
 
acc_mode = ACC_MODE(flag);
 
@@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 
@@ -1812,10 +1833,17 @@ ok:
return 0;
 
 exit_dput:
+   if (error == -ESTALE)
+   d_drop(path.dentry);
dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }
path_release(nd);
return error;
 
@@ -1825,7 +1853,7 @@ do_link:
goto exit_dput;
/*
 * This is subtle. Instead of calling do_follow_link() we do the
-* thing by hands. The reason is that this w

[PATCH 2/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

This patch adds handling for the error, ESTALE, to the system
calls which take pathnames as arguments.  The algorithm used
is to detect that an ESTALE error has occurred during an
operation subsequent to the lookup process and then to unwind
appropriately and then to perform the lookup process again.
Eventually, either the lookup process will return an error
or a valid dentry/inode combination and then the operation can
succeed or fail based on its own merits.

A partial list of the updated system calls are stat, stat64,
lstat, lstat64, mkdir, link, open, access, chmod, chown,
readlink, utime, utimes, chdir, chroot, rename, exec, mknod,
statfs, inotify, setxattr, getxattr, and listxattr.  Due to
common code factoring, other system calls may have been
included too, but were not explicitly tested.

   Thanx...

  ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>
--- linux-2.6.23.i686/fs/namei.c.org
+++ linux-2.6.23.i686/fs/namei.c
@@ -1956,6 +1986,7 @@ asmlinkage long sys_mknodat(int dfd, con
if (IS_ERR(tmp))
return PTR_ERR(tmp);
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -1986,6 +2017,8 @@ asmlinkage long sys_mknodat(int dfd, con
}
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 
@@ -2021,8 +2054,8 @@ int vfs_mkdir(struct inode *dir, struct 
 
 asmlinkage long sys_mkdirat(int dfd, const char __user *pathname, int mode)
 {
-   int error = 0;
-   char * tmp;
+   int error;
+   char *tmp;
struct dentry *dentry;
struct nameidata nd;
 
@@ -2031,6 +2064,7 @@ asmlinkage long sys_mkdirat(int dfd, con
if (IS_ERR(tmp))
goto out_err;
 
+top:
error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -2046,6 +2080,8 @@ asmlinkage long sys_mkdirat(int dfd, con
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(tmp);
 out_err:
@@ -2125,23 +2161,24 @@ static long do_rmdir(int dfd, const char
struct nameidata nd;
 
name = getname(pathname);
-   if(IS_ERR(name))
+   if (IS_ERR(name))
return PTR_ERR(name);
 
+top:
error = do_path_lookup(dfd, name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
 
-   switch(nd.last_type) {
-   case LAST_DOTDOT:
-   error = -ENOTEMPTY;
-   goto exit1;
-   case LAST_DOT:
-   error = -EINVAL;
-   goto exit1;
-   case LAST_ROOT:
-   error = -EBUSY;
-   goto exit1;
+   switch (nd.last_type) {
+   case LAST_DOTDOT:
+   error = -ENOTEMPTY;
+   goto exit1;
+   case LAST_DOT:
+   error = -EINVAL;
+   goto exit1;
+   case LAST_ROOT:
+   error = -EBUSY;
+   goto exit1;
}
mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
@@ -2154,6 +2191,8 @@ exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
 exit1:
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2206,12 +2245,14 @@ static long do_unlinkat(int dfd, const c
char * name;
struct dentry *dentry;
struct nameidata nd;
-   struct inode *inode = NULL;
+   struct inode *inode;
 
name = getname(pathname);
if(IS_ERR(name))
return PTR_ERR(name);
 
+top:
+   inode = NULL;
error = do_path_lookup(dfd, name, LOOKUP_PARENT, &nd);
if (error)
goto exit;
@@ -2237,6 +2278,8 @@ static long do_unlinkat(int dfd, const c
iput(inode);/* truncate the inode here */
 exit1:
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 exit:
putname(name);
return error;
@@ -2301,6 +2344,7 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(to))
goto out_putname;
 
+top:
error = do_path_lookup(newdfd, to, LOOKUP_PARENT, &nd);
if (error)
goto out;
@@ -2314,6 +2358,8 @@ asmlinkage long sys_symlinkat(const char
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
+   if (error == -ESTALE)
+   goto top;
 out:
putname(to);
 out_putname:
@@ -2389,6 +2435,7 @@ asmlinkage long sys_linkat(int olddfd, c
if (IS_ERR(to))
return PTR_ERR(to);
 
+top:
error = __user_walk_fd(olddfd, oldname,
   flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
   &old_nd);
@@ -2408,6 +2

[PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.

The error, ESTALE, was originally introduced to handle the
situation where a file handle, which NFS uses to uniquely
identify a file on the server, no longer refers to a valid file
on the server.  This can happen when the file is removed on the
server, either by an application on the server, some other
client accessing the server, or sometimes even by another
mounted file system from the same client.  It can also happen
when the file resides upon a file system which is no longer
exported.

The error, ESTALE, is usually seen when cached directory
information is used to convert a pathname to a dentry/inode pair.
The information is discovered to be out of date or stale when a
subsequent operation is sent to the NFS server.  This can easily
happen in system calls such as stat(2) when the pathname is
converted to a dentry/inode pair using cached information, but then
a subsequent GETATTR call to the server discovers that the file
handle is no longer valid.
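
A small user-space illustration of the symptom (the pathname is
hypothetical):

	#include <sys/stat.h>
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>

	/* Returns 1 if a pathname-based call leaked the NFS-internal error. */
	int leaked_estale(const char *path)
	{
		struct stat st;

		if (stat(path, &st) < 0 && errno == ESTALE) {
			fprintf(stderr, "stat(%s): %s\n", path, strerror(errno));
			return 1;	/* the caller passed a name, not a handle,
					 * so it has nothing it can revalidate */
		}
		return 0;
	}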

System calls which take pathnames as arguments should never see
ESTALE errors from situations like this.  These system calls
should either fail with an ENOENT error if the pathname can not
be successfully translated to a dentry/inode pair or succeed
or fail based on their own semantics.

ESTALE errors which occur during the lookup process can be
handled by dropping the dentry which refers to the non-existent
file from the dcache and then restarting the lookup process.
Care can be taken to ensure that forward progress is always
being made in order to avoid infinite loops.

ESTALE errors which occur during operations subsequent to the
lookup process can be handled by unwinding appropriately and
then performing the lookup process again.  Eventually, either
the lookup process will succeed or fail correctly or the
subsequent operation will succeed or fail on its own merits.
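
In the system call paths this takes the shape of a retry label wrapped
around the lookup.  A condensed, hypothetical version of the pattern
used in the second patch (do_the_operation() stands in for vfs_mkdir(),
vfs_unlink() and friends):

top:
	error = do_path_lookup(dfd, name, LOOKUP_PARENT, &nd);
	if (error)
		goto out;

	error = do_the_operation(&nd);	/* the real work, e.g. the over-the-wire MKDIR */

	path_release(&nd);		/* unwind the references taken by the lookup */
	if (error == -ESTALE)
		goto top;		/* stale dentries have been dropped; look up again */
out:
	return error;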

This support is desired in order to tighten up recovery from
discovering stale resources due to the loose cache consistency
semantics that file systems such as NFS employ.  In particular,
there are several large Red Hat customers, converting from
Solaris to Linux, who desire this support in order that their
applications environments continue to work.

Please note that system calls which do not take pathnames as
arguments or perhaps use file descriptors to identify the
file to be manipulated may still fail with ESTALE errors.
There is no recovery possible with these system calls like
there is with system calls which take pathnames as arguments.

This support was tested using the attached programs and
running multiple copies on mounted file systems which do not
share superblocks.  When two or more copies of this program
are running, many ESTALE errors can be seen over the network.

Comments?

   Thanx...

  ps
#
#define _XOPEN_SOURCE 500
#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/statfs.h>
#include <sys/inotify.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

void mkdir_test(void);
void link_test(void);
void open_test(void);
void access_test(void);
void chmod_test(void);
void chown_test(void);
void readlink_test(void);
void utimes_test(void);
void chdir_test(void);
void chroot_test(void);
void rename_test(void);
void exec_test(void);
void mknod_test(void);
void statfs_test(void);
void truncate_test(void);
void xattr_test(void);
void inotify_test(void);

struct tests {
	void (*test)(void);
};

struct tests tests[] = {
	mkdir_test,
	link_test,
	open_test,
	access_test,
	chmod_test,
	chown_test,
	readlink_test,
	utimes_test,
	chdir_test,
	chroot_test,
	rename_test,
	exec_test,
	mknod_test,
	statfs_test,
	truncate_test,
	xattr_test,
	inotify_test
};

pid_t test_pids[sizeof(tests) / sizeof(tests[0])];

pid_t parent_pid;

void kill_tests(int);

int
main(int argc, char *argv[])
{
	int i;

	parent_pid = getpid();

	sigset(SIGINT, kill_tests);

	sighold(SIGINT);

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
		test_pids[i] = fork();
		if (test_pids[i] == 0) {
			for (;;)
(*tests[i].test)();
			/* NOTREACHED */
		}
	}

	sigrelse(SIGINT);

	pause();
}

void
kill_tests(int sig)
{
	int i;

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
		if (test_pids[i] != -1) {
			if (kill(test_pids[i], SIGTERM) < 0)
perror("kill");
		}
	}

	exit(0);
}

void
check_error(int error, char *operation)
{

	if (error < 0 && errno == ESTALE) {
		perror(operation);
		kill(parent_pid, SIGINT);
		pause();
	}
}

void
check_error_child(int error, char *operation)
{

	if (error < 0 && errno == ESTALE) {
		perror(operation);
		kill(parent_pid, SIGINT);
		exit(1);
	}
}

void
do_stats(char *file)
{
	int error;
	struct stat stbuf;
	struct stat64 stbuf64;

	error = stat(file, &stbuf);
	check_error(error, "stat");

	error = stat64(file, &stbuf64);
	check_error(error, "stat64");

	error = lstat(file, &stbuf);
	check_error(error, "lstat");

	error = lstat64(file, &stbuf64);
	

[PATCH 3/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

The patch enhanced the ESTALE error handling for NFS mounted
file systems.  It expands the number of places that the NFS
client checks for ESTALE returns from the server.

It also enhances the ESTALE handling for directories by
occasionally retrying revalidation to check to see whether the
directory becomes valid again.  This sounds odd, but can occur
when a systems administrator, accidently or unknowingly,
unexports a file system which is in use.  All active
non-directory files become permanently inaccessible, but
directories can be become accessible again after the
administrator re-exports the file system.  This is a situation
that users have been complaining about for years and this
support can help to alleviate their situations.

   Thanx...

  ps

Signed-off-by: Peter Staubach [EMAIL PROTECTED]
--- linux-2.6.23.i686/fs/nfs/inode.c.org
+++ linux-2.6.23.i686/fs/nfs/inode.c
@@ -192,7 +192,7 @@ void nfs_invalidate_atime(struct inode *
  */
 static void nfs_invalidate_inode(struct inode *inode)
 {
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
+   nfs_handle_estale(-ESTALE, inode);
nfs_zap_caches_locked(inode);
 }
 
@@ -385,6 +385,8 @@ nfs_setattr(struct dentry *dentry, struc
error = NFS_PROTO(inode)->setattr(dentry, &fattr, attr);
if (error == 0)
nfs_refresh_inode(inode, &fattr);
+   else
+   nfs_handle_estale(error, inode);
unlock_kernel();
return error;
 }
@@ -629,7 +631,7 @@ int nfs_release(struct inode *inode, str
 int
 __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
-   int  status = -ESTALE;
+   int status = -ESTALE;
struct nfs_fattr fattr;
struct nfs_inode *nfsi = NFS_I(inode);
 
@@ -640,15 +642,25 @@ __nfs_revalidate_inode(struct nfs_server
lock_kernel();
if (is_bad_inode(inode))
goto out_nowait;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) && !S_ISDIR(inode->i_mode))
goto out_nowait;
 
status = nfs_wait_on_inode(inode);
if (status < 0)
goto out;
 
+   /*
+* Do we believe that the file handle is still stale?
+* For non-directories, once stale, always stale.
+* For directories, believe the stale status for the
+* attribute cache timeout period, and then try again.
+* This will help to address the problem of the server
+* admin accidentally unexporting a file system without
+* stopping the NFS server first.
+*/
status = -ESTALE;
-   if (NFS_STALE(inode))
+   if (NFS_STALE(inode) &&
+   (!S_ISDIR(inode->i_mode) || !nfs_attribute_timeout(inode)))
goto out;
 
status = NFS_PROTO(inode)->getattr(server, NFS_FH(inode), &fattr);
@@ -656,11 +668,9 @@ __nfs_revalidate_inode(struct nfs_server
dfprintk(PAGECACHE, "nfs_revalidate_inode: (%s/%Ld) getattr "
"failed, error=%d\n",
 inode->i_sb->s_id,
 (long long)NFS_FILEID(inode), status);
-   if (status == -ESTALE) {
+   nfs_handle_estale(status, inode);
+   if (status == -ESTALE)
nfs_zap_caches(inode);
-   if (!S_ISDIR(inode->i_mode))
-   set_bit(NFS_INO_STALE, &NFS_FLAGS(inode));
-   }
goto out;
}
 
@@ -986,14 +996,28 @@ static int nfs_update_inode(struct inode
__FUNCTION__, inode->i_sb->s_id, inode->i_ino,
atomic_read(&inode->i_count), fattr->valid);
 
-   if (nfsi->fileid != fattr->fileid)
-   goto out_fileid;
+   if (nfsi->fileid != fattr->fileid) {
+   printk(KERN_ERR "NFS: server %s error: fileid changed\n"
+   "fsid %s: expected fileid 0x%Lx, got 0x%Lx\n",
+   NFS_SERVER(inode)->nfs_client->cl_hostname,
+   inode->i_sb->s_id,
+   (long long)nfsi->fileid, (long long)fattr->fileid);
+   goto out_err;
+   }
 
/*
 * Make sure the inode's type hasn't changed.
 */
-   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT))
-   goto out_changed;
+   if ((inode->i_mode & S_IFMT) != (fattr->mode & S_IFMT)) {
+   /*
+* Big trouble! The inode has become a different object.
+*/
+   printk(KERN_DEBUG "%s: inode %ld mode changed, %07o to %07o\n",
+   __FUNCTION__, inode->i_ino, inode->i_mode, fattr->mode);
+   goto out_err;
+   }
+
+   nfs_clear_estale(inode);
 
server = NFS_SERVER(inode);
/* Update the fsid? */
@@ -1099,12 +1123,7 @@ static int nfs_update_inode(struct inode
nfsi->cache_validity &= ~NFS_INO_REVAL_FORCED;
 
return 0;
- out_changed:
-   /*
-* Big trouble! The inode has become a different object

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is passed 
to the file system during each lookup that forces each component in 
the path to be revalidated on the server.  This has no possibility of 
causing an infinite loop.


Is there some part of this logic that is no longer working? 


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.  There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.

Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.
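
To make that failure mode concrete, this is roughly the kind of
retry an application would have to do on its own today; it is a
hypothetical user-level wrapper, not part of the patch set, and the
point of the patches is exactly that the VFS should perform this
recovery transparently instead:

#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical user-level workaround: retry mkdir() a bounded number
 * of times when the server returns ESTALE even though the LOOKUPs of
 * the parent directory succeeded.
 */
int mkdir_retry(const char *path, mode_t mode, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (mkdir(path, mode) == 0)
			return 0;
		if (errno != ESTALE)
			break;
	}
	return -1;
}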

   Thanx...

  ps
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is 
passed to the file system during each lookup that forces each 
component in the path to be revalidated on the server.  This has 
no possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we should 
give up.  When I addressed this issue two years ago, the two-try 
logic was the only acceptable solution because there's no way to 
guarantee the pathname resolution will ever finish unless we put a 
hard limit on it.




I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.


Unless someone is doing something malicious.  Or if the server is 
repeatedly returning ESTALE for some reason.




If the server is repeatedly returning ESTALE, then the pathname
resolution will fail to make progress and give up, return ENOENT
to the user level.

A malicious user on the network can cause so many other problems
than just something like this too.  But, in this case, the user
would have to predict why and when the client was issuing a
specific operation and know whether or not to return ESTALE.
This seems quite far fetched and quite unlikely to me.


There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close all 
the windows that result from sloppy file time stamps without 
completely disabling client-side caching.  The NFS protocol relies 
on file time stamps to manage cache coherence.  If the server is 
lying about time stamps, there's no way the client can cache 
coherently.




Server bug or not, it is something that the client has to live
with.  We can't get the server file system fixed, so it is
something that we should find a way to live with.  This support
can help.


We haven't identified a server-side solution yet, but that doesn't 
mean it doesn't exist.




No, it doesn't and I, and most everyone else, would also like to
see such a solution.  That said, I am pretty sure that we are not
going to get a fix for ext3 and forcing everyone to move away from
ext3 is not a good solution either.

If we address the time stamp problem in the client, should we also go 
to lengths to address it in every other corner of the NFS client?  
Should we also address every other server bug we discover with a 
client side fix?




These aren't asked seriously, are they?

When possible, we get the server bug fixed.  When not possible,
such as the time stamp issue with ext3, we attempt to work around
it as best as possible.



Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then it's 
a real error.  Is there a fixed and valid recovery script for the 
client in this case that will allow the mkdir to proceed?




Why do you think that it is an error?


Because this is a problem that sometimes requires application-level 
recovery.  Can we guarantee that retrying the mkdir is the right thing 
to do every time?




When would not retrying the MKDIR be the right thing to do?
When doing a mkdir(a/b), the user can not tell nor cares
which instance of directory a is the one that gets b created
in it.

Which cases are the ones that you see that require user
level recovery?


It can easily occur if the directory in which the new directory
is to be created disappears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.


Have you tried this client against a file server when you unexport the 
filesystem under test?  The server returns ESTALE no matter what the 
client does.  Should the client continue to retry the request if the 
file system has been permanently taken offline?




Since the NFS client supports intr, why not continue to
retry the request?  It certainly won't hurt the network, trying
at most once every acdirmin timeout seconds.  This, by default,
would be once every 30 seconds.

This would alleviate a long standing complaint that when an
admin uses a poor

Re: [PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

J. Bruce Fields wrote:

On Fri, Jan 18, 2008 at 11:45:52AM -0500, Peter Staubach wrote:
  

Matthew Wilcox wrote:


On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote:
  

 static int path_lookup_create(int dfd, const char *name,
- unsigned int lookup_flags, struct nameidata *nd,
- int open_flags, int create_mode)
+   unsigned int lookup_flags, struct nameidata *nd,
+   int open_flags, int create_mode)



Gratuitous reformatting?

  
  

Elimination of an overly long line?



I usually try to gather any coding style, comment grammar, etc., fixes
into a single patch or two at the beginning of a series.  That keeps the
substantive patches (the hardest to understand) shorter.

  


That's probably great advice.  I can easily enough undo the change
since it does not affect the functionality of the patch.  It was
made while I was doing the analysis for the patch and to make the
style better match the style used in other surrounding routines.

   Thanx...

  ps


--b.

  

@@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
acc_mode = ACC_MODE(flag);
 @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 @@ -1812,10 +1833,17 @@ ok:
return 0;
  exit_dput:
+   if (error == -ESTALE)
+   d_drop(path.dentry);
dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }



I wonder if a tail-call might not work better here.
  

Tail-call?

   Thanx...

  ps
-
To unsubscribe from this list: send the line unsubscribe linux-nfs in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Chuck Lever wrote:

On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:

Chuck Lever wrote:

Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:

Hi.

Here is a patch set which modifies the system to enhance the
ESTALE error handling for system calls which take pathnames
as arguments.


The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the 
resolution is restarted exactly once, and an additional flag is 
passed to the file system during each lookup that forces each 
component in the path to be revalidated on the server.  This has no 
possibility of causing an infinite loop.


Is there some part of this logic that is no longer working?


The VFS does not fully handle ESTALE.  An ESTALE error can occur
during the second pathname resolution attempt.


If an ESTALE occurs during the second resolution attempt, we should 
give up.  When I addressed this issue two years ago, the two-try logic 
was the only acceptable solution because there's no way to guarantee 
the pathname resolution will ever finish unless we put a hard limit on 
it.




I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.


There are lots of
reasons, some of which are the 1 second resolution from some file
systems on the server


Which is a server bug, AFAICS.  It's simply impossible to close all 
the windows that result from sloppy file time stamps without 
completely disabling client-side caching.  The NFS protocol relies on 
file time stamps to manage cache coherence.  If the server is lying 
about time stamps, there's no way the client can cache coherently.




Server bug or not, it is something that the client has to live
with.  We can't get the server file system fixed, so it is
something that we should find a way to live with.  This support
can help.


and the window in between the revalidation
and the actual use of the file handle associated with each
dentry/inode pair.


A use case or two would be useful to explore (on linux-nfs or 
linux-fsdevel, rather than lkml).




I created a bunch of use cases in the gensyscall.c program that
I attached to the original description of the problem and my
proposed solution.  It was very useful in generating many, many
ESTALE errors over the wire from a variety of different over the
wire operations, which were originally getting returned to the
user level.


Also, there was no support for ESTALE errors which occur during
subsequent operations to the pathname resolution process.  For
example, during a mkdir(2) operation, the ESTALE can occur from
the over the wire MKDIR operation after the LOOKUP operations
have all succeeded.


If the final operation fails after a pathname resolution, then it's a 
real error.  Is there a fixed and valid recovery script for the client 
in this case that will allow the mkdir to proceed?




Why do you think that it is an error?

It can easily occur if the directory in which the new directory
is to be created disappears after it is looked up and before the
MKDIR is issued.

The recovery is to perform the lookup again.

Admittedly, the NFS client could recover more cleanly from some of 
these problems, but given the architecture of the Linux VFS, it will 
be difficult to address some of the corner cases. 


Could you outline some of these corner cases that this proposal
would not address, please?

I ran the test program for many hours, against several different
servers, and although I can't prove completeness, was not able to
show any ESTALE errors being returned unexpectedly.

   Thanx...

  ps
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

Hi.

This is a patch to enhance ESTALE error handling during the
lookup process.  The error, ESTALE, can occur when out-of-date
dentries, stored in the dcache, are used to translate a pathname
component to a dentry.  When this occurs, the dentry which
contains the pointer to the inode which refers to the non-existent
file is dropped from the dcache and then the lookup process is
started again.  Care is taken to ensure that forward progress is
always being made.  If forward progress is not detected, then the
lookup process is terminated and the error, ENOENT, is returned
to the caller.

   Thanx...

  ps

Signed-off-by: Peter Staubach [EMAIL PROTECTED]
--- linux-2.6.23.i686/fs/namei.c.org
+++ linux-2.6.23.i686/fs/namei.c
@@ -741,7 +741,7 @@ static __always_inline void follow_dotdo
 {
struct fs_struct *fs = current->fs;
 
-   while(1) {
+   while (1) {
struct vfsmount *parent;
struct dentry *old = nd->dentry;
 
@@ -840,7 +840,7 @@ static fastcall int __link_path_walk(con
lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
/* At this point we know we have a real path component. */
-   for(;;) {
+   for (;;) {
unsigned long hash;
struct qstr this;
unsigned int c;
@@ -992,7 +992,7 @@ return_reval:
 */
if (nd->dentry && nd->dentry->d_sb &&
(nd->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
-   err = -ESTALE;
+   err = -ENOENT;
/* Note: we do not d_invalidate() */
if (!nd->dentry->d_op->d_revalidate(nd->dentry, nd))
break;
@@ -1003,6 +1003,8 @@ out_dput:
dput_path(&next, nd);
break;
}
+   if (err == -ESTALE)
+   d_drop(nd->dentry);
path_release(nd);
 return_err:
return err;
@@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const
mntget(save.mnt);
 
result = __link_path_walk(name, nd);
-   if (result == -ESTALE) {
+   while (result == -ESTALE) {
+   /*
+* If no progress was made looking up the pathname,
+* then stop and return ENOENT instead of ESTALE.
+*/
+   if (nd->dentry == save.dentry) {
+   result = -ENOENT;
+   break;
+   }
*nd = save;
dget(nd->dentry);
mntget(nd->mnt);
nd->flags |= LOOKUP_REVAL;
result = __link_path_walk(name, nd);
+   /*
+* If no progress was made this time, then return
+* ENOENT instead of ESTALE because no recovery
+* is possible to recover the stale file handle.
+*/
+   if (result == -ESTALE && nd->dentry == save.dentry)
+   result = -ENOENT;
}
 
dput(save.dentry);
@@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char
  * @create_mode: create intent flags
  */
 static int path_lookup_create(int dfd, const char *name,
- unsigned int lookup_flags, struct nameidata *nd,
- int open_flags, int create_mode)
+   unsigned int lookup_flags, struct nameidata *nd,
+   int open_flags, int create_mode)
 {
return __path_lookup_intent_open(dfd, name, lookup_flags|LOOKUP_CREATE,
nd, open_flags, create_mode);
@@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
int acc_mode, error;
struct path path;
struct dentry *dir;
-   int count = 0;
+   int count;
+
+top:
+   count = 0;
 
acc_mode = ACC_MODE(flag);
 
@@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
+   flag, mode);
if (error)
return error;
 
@@ -1812,10 +1833,17 @@ ok:
return 0;
 
 exit_dput:
+   if (error == -ESTALE)
+   d_drop(path.dentry);
dput_path(&path, nd);
 exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
+   if (error == -ESTALE) {
+   d_drop(nd->dentry);
+   path_release(nd);
+   goto top;
+   }
path_release(nd);
return error;
 
@@ -1825,7 +1853,7 @@ do_link:
goto exit_dput;
/*
 * This is subtle. Instead of calling do_follow_link() we do the
-* thing by hands. The reason is that this way we have zero link_count
+* thing by hand. The reason is that this way we have zero link_count
 * and path_walk

Re: [PATCH 0/3] enhanced ESTALE error handling

2008-01-18 Thread Peter Staubach

J. Bruce Fields wrote:

On Fri, Jan 18, 2008 at 01:12:03PM -0500, Peter Staubach wrote:
  

Chuck Lever wrote:


On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
  

I can probably imagine a situation where the pathname resolution
would never finish, but I am not sure that it could ever happen
in nature.

Unless someone is doing something malicious.  Or if the server is  
repeatedly returning ESTALE for some reason.


  

If the server is repeatedly returning ESTALE, then the pathname
resolution will fail to make progress and give up, return ENOENT
to the user level.

A malicious user on the network can cause so many other problems
than just something like this too.  But, in this case, the user
would have to predict why and when the client was issuing a
specific operation and know whether or not to return ESTALE.
This seems quite far fetched and quite unlikely to me.



Any idea what the consequences would be in this case?  It at least
shouldn't overflow the stack, or freeze the whole machine (because it
spins indefinitely under some crucial lock), or panic, etc.  (If the one
filesystem just becomes unusable--well, fine, what better can you hope
for in the presence of a malicious server or network?)


Assuming that such a user could precisely and accurately predict
when to return ESTALE, the particular system call would just stay
in the kernel, sending out requests to the NFS server.

It wouldn't overflow the stack because the recovery is done by
looping and not by recursion and unless there is a bug that needs
to be fixed, all necessary resources are released before the
retries occur.  The machine wouldn't freeze because as soon as
the request is sent, the process blocks and some other process
can be scheduled.  The process should be interruptible, so it
could even be signaled to stop the activity.

It seems to me that mostly, the file system will become unusable,
but as Bruce points out, what do you expect in the presence of a
malicious entity?  If such are a concern, then measures such as
stronger security can be employed to prevent them from wreaking
havoc.

   Thanx...

  ps
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Peter Staubach

Anton Salikhmetov wrote:

2008/1/11, Peter Staubach <[EMAIL PROTECTED]>:
  

Anton Salikhmetov wrote:


From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains changes for updating the ctime and mtime fields for memory 
mapped files:

1) adding a new flag triggering update of the inode data;
2) implementing a helper function for checking that flag and updating ctime and 
mtime;
3) updating time stamps for mapped files in sys_msync() and do_fsync().
  

Sorry, one other issue to throw out too -- an mmap'd block device
should also have its inode time fields updated.  This is a little
tricky because the inode referenced via mapping->host isn't the
one that needs to have the time fields updated on.

I have attached the patch that I submitted last.  It is quite out
of date, but does show my attempt to resolve some of these issues.



Thanks for your feedback!

Now I'm looking at your solution and thinking about which parts of it
I could adapt to the infrastructure I'm trying to develop.

However, I would like to address the block device case within
a separate project. But for now, I want the msync() and fsync()
system calls to update ctime and mtime at least for memory-mapped
regular files properly. I feel that even this little improvement could address
many customers' troubles such as the one Jacob Oestergaard reported
in the bug #2645.


Not that I disagree and I also have customers who would really like
to see this situation addressed so that I can then fix it in RHEL,
but the block device issue was raised by Andrew Morton during my
first attempt to get a patch integrated.

Just so that you are aware of who has raised which issues...  :-)

   Thanx...

  ps
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Peter Staubach

Anton Salikhmetov wrote:

From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains changes for updating the ctime and mtime fields for memory 
mapped files:

1) adding a new flag triggering update of the inode data;
2) implementing a helper function for checking that flag and updating ctime and 
mtime;
3) updating time stamps for mapped files in sys_msync() and do_fsync().


Sorry, one other issue to throw out too -- an mmap'd block device
should also have its inode time fields updated.  This is a little
tricky because the inode referenced via mapping->host isn't the
one that needs to have the time fields updated on.

I have attached the patch that I submitted last.  It is quite out
of date, but does show my attempt to resolve some of these issues.

   Thanx...

  ps
--- linux-2.6.20.i686/fs/buffer.c.org
+++ linux-2.6.20.i686/fs/buffer.c
@@ -710,6 +710,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty_buffers(struct page *page)
 {
struct address_space * const mapping = page_mapping(page);
+   int ret = 0;
 
if (unlikely(!mapping))
return !TestSetPageDirty(page);
@@ -727,7 +728,7 @@ int __set_page_dirty_buffers(struct page
	spin_unlock(&mapping->private_lock);
 
if (TestSetPageDirty(page))
-   return 0;
+   goto out;
 
	write_lock_irq(&mapping->tree_lock);
if (page->mapping) {/* Race with truncate? */
@@ -740,7 +741,11 @@ int __set_page_dirty_buffers(struct page
}
	write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-   return 1;
+   ret = 1;
+out:
+   if (page_mapped(page))
+   set_bit(AS_MCTIME, &mapping->flags);
+   return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
 
--- linux-2.6.20.i686/fs/fs-writeback.c.org
+++ linux-2.6.20.i686/fs/fs-writeback.c
@@ -167,6 +167,13 @@ __sync_single_inode(struct inode *inode,
 
	spin_unlock(&inode_lock);
 
+   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+   if (S_ISBLK(inode->i_mode))
+   bd_inode_update_time(inode);
+   else
+   inode_update_time(inode);
+   }
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
--- linux-2.6.20.i686/fs/inode.c.org
+++ linux-2.6.20.i686/fs/inode.c
@@ -1201,8 +1201,8 @@ void touch_atime(struct vfsmount *mnt, s
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: file accessed
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1212,9 +1212,8 @@ EXPORT_SYMBOL(touch_atime);
  * timestamps are handled by the server.
  */
 
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
 
@@ -1238,7 +1237,7 @@ void file_update_time(struct file *file)
mark_inode_dirty_sync(inode);
 }
 
-EXPORT_SYMBOL(file_update_time);
+EXPORT_SYMBOL(inode_update_time);
 
 int inode_needs_sync(struct inode *inode)
 {
--- linux-2.6.20.i686/fs/block_dev.c.org
+++ linux-2.6.20.i686/fs/block_dev.c
@@ -608,6 +608,22 @@ void bdput(struct block_device *bdev)
 
 EXPORT_SYMBOL(bdput);
  
+void bd_inode_update_time(struct inode *inode)
+{
+   struct block_device *bdev = inode->i_bdev;
+   struct list_head *p;
+
+   if (bdev == NULL)
+   return;
+
+   spin_lock(&bdev_lock);
+   list_for_each(p, &bdev->bd_inodes) {
+   inode = list_entry(p, struct inode, i_devices);
+   inode_update_time(inode);
+   }
+   spin_unlock(&bdev_lock);
+}
+
 static struct block_device *bd_acquire(struct inode *inode)
 {
struct block_device *bdev;
--- linux-2.6.20.i686/include/linux/fs.h.org
+++ linux-2.6.20.i686/include/linux/fs.h
@@ -1488,6 +1488,7 @@ extern struct block_device *bdget(dev_t)
 extern void bd_set_size(struct block_device *, loff_t size);
 extern void bd_forget(struct inode *inode);
 extern void bdput(struct block_device *);
+extern void bd_inode_update_time(struct inode *);
 extern struct block_device *open_by_devnum(dev_t, unsigned);
 extern const struct address_space_operations def_blk_aops;
 #else
@@ -1892,7 +1893,11 @@ extern int buffer_migrate_page(struct ad
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
-extern void file_update_time(struct file *file);
+extern void inode_update_time(struct inode *inode);
+static inline void file_update_time(struct file *file)
+{
+   inode_update_time(file->f_path.dentry->d_inode);
+}
 
 static inline ino_t parent_ino(struct dentry *dentry)
 {
--- 

Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Peter Staubach

Anton Salikhmetov wrote:

From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains changes for updating the ctime and mtime fields for memory 
mapped files:

1) adding a new flag triggering update of the inode data;
2) implementing a helper function for checking that flag and updating ctime and 
mtime;
3) updating time stamps for mapped files in sys_msync() and do_fsync().

  


What happens if the application does not issue either an msync
or an fsync call, but either just munmap's the region or just
keeps on manipulating it?  It appears to me that the file times
will never be updated in these cases.

It seems to me that the file times should be updated eventually,
and perhaps even regularly if the file is being constantly
updated via the mmap'd region.

   Thanx...

  ps


Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..09adf7e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
}
	write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   set_bit(AS_MCTIME, &mapping->flags);
 
 	return 1;

 }
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..c5b954e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
 #include <linux/mount.h>
+#include <linux/file.h>
 
 /*

  * This is needed for the following functions:
@@ -1282,6 +1283,18 @@ void file_update_time(struct file *file)
 
 EXPORT_SYMBOL(file_update_time);
 
+/*

+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapped_file_update_time(struct file *file)
+{
+   if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
+   get_file(file);
+   file_update_time(file);
+   fput(file);
+   }
+}
+
 int inode_needs_sync(struct inode *inode)
 {
if (IS_SYNC(inode))
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..df57507 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+	mapped_file_update_time(file);

+
ret = filemap_fdatawrite(mapping);
 
 	/*

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..0b05118 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1978,6 +1978,7 @@ extern int inode_change_ok(struct inode *, struct iattr 
*);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);

+extern void mapped_file_update_time(struct file *file);
 
 static inline ino_t parent_ino(struct dentry *dentry)

 {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..bf0f9e7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -17,8 +17,9 @@
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
  * allocation mode flags.
  */
-#defineAS_EIO  (__GFP_BITS_SHIFT + 0)  /* IO error on async 
write */
+#define AS_EIO (__GFP_BITS_SHIFT + 0)  /* IO error on async write */
 #define AS_ENOSPC  (__GFP_BITS_SHIFT + 1)  /* ENOSPC on async write */
+#define AS_MCTIME  (__GFP_BITS_SHIFT + 2)  /* mtime and ctime to update */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)

 {
diff --git a/mm/msync.c b/mm/msync.c
index e788f7b..9d0a8f9 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 1994-1999  Linus Torvalds
  *
  * Substantial code cleanup.
+ * Updating the ctime and mtime stamps for memory mapped files.
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
@@ -22,6 +23,10 @@

  * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -74,14 +79,17 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
break;
}
file = vma->vm_file;
-   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error)
-   return error;
-   down_read(&mm->mmap_sem);
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   mapped_file_update_time(file);
+   if (flags & MS_SYNC) {
+   get_file(file);
+  


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Peter Staubach

Anton Salikhmetov wrote:

2008/1/10, Rik van Riel <[EMAIL PROTECTED]>:
  

On Thu, 10 Jan 2008 18:56:07 +0300
"Anton Salikhmetov" <[EMAIL PROTECTED]> wrote:



However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.
  

Agreed. The mtime and ctime should probably also be updated
when I_DIRTY_PAGES is cleared.

The alternative would be to remember that the inode had been
dirty in the past, and have the mtime and ctime updated on
msync or close - which would be more complex.



Adding the new flag (AS_MCTIME) has already been suggested by Peter
Staubach in his first solution for this bug. Now I understand that the
AS_MCTIME flag is required for fixing the bug.


Well, that was the approach before we had I_DIRTY_PAGES.  I am
still wondering whether we can get this approach to work, with
a little more support and heuristics.  PeterZ's work to better
track dirty pages should be helpful in determining when and why
a page was dirty.

I keep thinking that by recording the time at which a page is found
to be dirty while the file is mmap'd, and then updating the mtime
and ctime fields in the inode during msync() and sync_single_inode()
whenever that recorded time is newer than the current mtime and
ctime, we can solve the problem of when and when not to update
those two time fields.
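
As a rough, stand-alone model of that idea (the types and helpers
below are made up purely for illustration and are not proposed
kernel code), the bookkeeping would look something like this:

#include <stdbool.h>
#include <time.h>

/* Toy stand-ins for the address_space and inode state involved. */
struct model_mapping {
	struct timespec dirtied_when;	/* when a mapped page was last dirtied */
	bool mapped_dirty;
};

struct model_inode {
	struct timespec mtime;
	struct timespec ctime;
};

/* Record the time at which a page is dirtied through a shared mapping. */
static void note_mapped_dirty(struct model_mapping *m)
{
	clock_gettime(CLOCK_REALTIME, &m->dirtied_when);
	m->mapped_dirty = true;
}

/* At msync()/writeback time, bump the stamps only if they are older. */
static void maybe_update_times(struct model_inode *ip, struct model_mapping *m)
{
	if (!m->mapped_dirty)
		return;
	if (m->dirtied_when.tv_sec > ip->mtime.tv_sec ||
	    (m->dirtied_when.tv_sec == ip->mtime.tv_sec &&
	     m->dirtied_when.tv_nsec > ip->mtime.tv_nsec))
		ip->mtime = ip->ctime = m->dirtied_when;
	m->mapped_dirty = false;
}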

I haven't had a chance to think it all through completely or do
the appropriate analysis yet though.

  ps
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Peter Staubach

Rik van Riel wrote:

On Thu, 10 Jan 2008 18:56:07 +0300
"Anton Salikhmetov" <[EMAIL PROTECTED]> wrote:

  

However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.



Agreed. The mtime and ctime should probably also be updated
when I_DIRTY_PAGES is cleared.

The alternative would be to remember that the inode had been
dirty in the past, and have the mtime and ctime updated on
msync or close - which would be more complex.


And also remembering that the file times should not be updated
if the pages were modified via a write(2) operation.  Or if
there has been an intervening write(2) operation...

The number of cases to consider and the boundary conditions
quickly make this reasonably complex to get right.  That's why
this is the 4'th or 5'th attempt in the last 18 months or so
to get this situation addressed.

  ps
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Peter Staubach

Rik van Riel wrote:

On Wed, 09 Jan 2008 16:06:17 -0500
[EMAIL PROTECTED] wrote:
  

On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:



Could you explain (using short words and simple sentences) what the
exact problem is?
  

It's like this:

Monday  9:04AM:  System boots, database server starts up, mmaps file
Monday  9:06AM:  Database server writes to mmap area, updates mtime/ctime
Monday  Database server writes to mmap area, no further update..
Monday 11:45PM:  Backup sees "file modified 9:06AM, let's back it up"
Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
Wed  9:00AM-5:00PM: More updates, more not touching the mtime
Wed  11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it 
up..

Lather, rinse, repeat



On the other hand, updating the mtime and ctime whenever a page is dirtied
also does not work right.  Apparently that can break mutt.

  


Could you elaborate on why that would break mutt?  I am assuming
that the pages being modified are mmap'd, but if they are not, then
it is very clear why mutt (and anything else) would break.


Calling msync() every once in a while with Anton's patch does not look like a
fool proof method to me either, because the VM can write all the dirty pages
to disk by itself, leaving nothing for msync() to detect.  (I think...)

Can we get by with simply updating the ctime and mtime every time msync()
is called, regardless of whether or not the mmaped pages were still dirty
by the time we called msync() ?


As long as we can keep track of that information and then remember
it for an munmap so that eventually the file times do get updated,
then this should work.

It would seem that a better solution would be to update the file
times whenever the inode gets cleaned, ie. modified pages written
out and the inode synchronized to the disk.  That way, long running
programs would not have to msync occasionally in order to have
the data file properly backed up.

   Thanx...

  ps
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Peter Staubach

Anton Salikhmetov wrote:

Since no reaction in LKML was received for this message it seemed
logical to suggest closing the bug #2645 as "WONTFIX":

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15

However, the reporter of the bug, Jacob Oestergaard, insisted the
solution to be resubmitted once more:

  


Please re-submit to LKML.

  


Yes, please!

Let's have the right discussion and get this bug addressed for real.
It is a real bug and is causing data corruption for some very large
Red Hat customers because their applications were architected to
use mmap, but their backups are not backing up the modified files
due to this aspect of the system.

This is the 4'th or 5'th attempt in the last 2 years to submit a
patch to address this situation.  None have been able to make it
all of the way through the process and to be integrated.

I posted some comments.

   Thanx...

  ps


This bug causes backup systems to *miss* changed files.

This bug does cause data loss in common real-world deployments (I gave an
example with a database when posting the bug, but this affects the data from
all mmap using applications with common backup systems).

Silent exclusion from backups is very very nasty.

<<<

Please comment on my solution or commit it if it's acceptable in its
present form.

2008/1/7, Anton Salikhmetov <[EMAIL PROTECTED]>:
  

From: Anton Salikhmetov <[EMAIL PROTECTED]>

Due to the lack of reaction in LKML I presume the message was lost
in the high traffic of that list. Resending it now with the addressee changed
to the memory management mailing list.

I would like to propose my solution for the bug #2645 from the kernel bug 
tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.

I changed his test program a little. The changed unit test can be downloaded
using the following link:

http://pygx.sourceforge.net/mmap.c

This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.

The program shows appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.
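
A minimal userspace sketch of that check -- not Anton's actual mmap.c test,
just the shape of the mmap/write/msync/fstat sequence it relies on:

/* Sketch only: write through a shared mapping, msync() it, and see
 * whether st_mtime moved.  Before the fix it typically does not. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
	struct stat before, after;
	int fd = open("testfile", O_RDWR | O_CREAT, 0644);
	char *p;

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	fstat(fd, &before);
	sleep(1);                      /* make a one-second mtime step visible */
	memcpy(p, "x", 1);             /* write reference through the mapping */
	msync(p, 4096, MS_SYNC);       /* POSIX: times must be marked for update */
	fstat(fd, &after);

	printf("mtime %s\n", after.st_mtime != before.st_mtime ?
	       "updated" : "NOT updated (the reported bug)");
	return 0;
}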

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

The patch adds a call to the file_update_time() function to change
the file metadata before syncing. The patch also contains
substantial code cleanup: consolidated error check
for function parameters, using the PAGE_ALIGN() macro instead of
"manual" alignment, improved readability of the loop,
which traverses the process memory regions, updated comments.
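
The diff is cut off below in this archive before the functional change; very
roughly, the described call amounts to something like this inside the loop
over the affected VMAs (an illustrative sketch, not the actual patch):

/* Illustrative sketch only -- not the actual patch.  Mark ctime/mtime
 * for update on every shared file mapping in the msync() range before
 * the pages are written out. */
static void msync_update_file_times(struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma;

	for (vma = find_vma(mm, start); vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		struct file *file = vma->vm_file;

		if (file && (vma->vm_flags & VM_SHARED))
			file_update_time(file);
	}
}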

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..cb973eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,26 +1,32 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Updating the mtime and ctime stamps for mapped files
+ * and code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/syscalls.h>

 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check 

Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Peter Staubach

Anton Salikhmetov wrote:
> From: Anton Salikhmetov <[EMAIL PROTECTED]>
>
> I would like to propose my solution for the bug #2645 from the kernel 
bug tracker:

>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> The Open Group defines the behavior of the mmap() function as follows.
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to 
msync()

> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> If there is no such call and if the underlying file is modified as a 
result
> of a write reference, then these fields shall be marked for update at 
some

> time after the write reference.
>
> The above citation was taken from the following link:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>
> Therefore, the msync() function should be called before verifying the 
time
> stamps st_mtime and st_ctime in the test program Badari wrote in the 
context

> of the bug #2645. Otherwise, the time stamps may be updated
> at some unspecified moment according to the POSIX standard.
>
> I changed his test program a little. The changed unit test can be 
downloaded

> using the following link:
>
> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows the appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
>  


Sorry, I don't see where the test program shows that the file
times did not change if there had not been an intervening
modification to the mmap'd region.  It appears to me that it
just shows the file times changing or not when there has been
intervening modification after the mmap call and before the
fstat call.

Or am I looking in the wrong place?  :-)

> Additionally, the test cases for the msync() system call from
> the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
> and mmapstress10) successfully passed using the kernel
> with the patch included into this email.
>
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment check, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
>  


These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.

However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.

--

A better architecture would be to arrange for the file times
to be updated when the page makes the transition from being
unmodified to modified.  This is not straightforward due to
the current locking, but should be doable, I think.  Perhaps
recording the current time and then using it to update the
file times at a more suitable time (no pun intended) might
work.
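
One way to picture that suggestion (purely a sketch; the structure and hook
names here are hypothetical, not existing kernel code):

/* Hypothetical sketch: remember when a shared mapping first dirties the
 * file, and fold that into the inode times when the pages are cleaned. */
struct mmap_time_state {
	unsigned long	dirtied_when;	/* when the first write fault happened */
	int		times_pending;	/* ctime/mtime update still owed */
};

/* called from the write-fault path, i.e. on the clean->dirty transition */
static void note_mmap_write(struct mmap_time_state *t)
{
	if (!t->times_pending) {
		t->times_pending = 1;
		t->dirtied_when = jiffies;
	}
}

/* called from msync()/munmap()/writeback when the inode is cleaned */
static void flush_mmap_times(struct mmap_time_state *t, struct file *file)
{
	if (t->times_pending) {
		t->times_pending = 0;
		file_update_time(file);	/* or apply t->dirtied_when explicitly */
	}
}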

  Thanx...

 ps


> Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
>
> ---
>
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..cb973eb 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -1,26 +1,32 @@
>  /*
>   *linux/mm/msync.c
>   *
> + * The msync() system call.
>   * Copyright (C) 1994-1999  Linus Torvalds
> + *
> + * Updating the mtime and ctime stamps for mapped files
> + * and code cleanup.
> + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>   */
> 
> -/*

> - * The msync() system call.
> - */
> +#include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> -#include <linux/file.h>
> -#include <linux/syscalls.h>
>  #include <linux/sched.h>
> +#include <linux/syscalls.h>
> 
>  /*

>   * MS_SYNC syncs the entire file - including mappings.
>   *
>   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
>   * Now it doesn't do anything, since dirty pages are properly tracked.
>   *
> + * The msync() system call updates the ctime and mtime fields for
> + * the mapped file when called with the MS_SYNC or MS_ASYNC flags
> + * according to the POSIX standard.
> + *
>   * The application may now run fsync() 

Re: [PATCH] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Peter Staubach

Anton Salikhmetov wrote:

From: Anton Salikhmetov <[EMAIL PROTECTED]>

I would like to propose my solution for the bug #2645 from the kernel bug 
tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.

I changed his test program a little. The changed unit test can be downloaded
using the following link:

http://pygx.sourceforge.net/mmap.c

This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.

The program shows the appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.

  


Sorry, I don't see where the test program shows that the file
times did not change if there had not been an intervening
modification to the mmap'd region.  It appears to me that it
just shows the file times changing or not when there has been
intervening modification after the mmap call and before the
fstat call.

Or am I looking in the wrong place?  :-)


Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

The patch adds a call to the file_update_time() function to change
the file metadata before syncing. The patch also contains
substantial code cleanup: consolidated error check
for function parameters, using the PAGE_ALIGN() macro instead of
"manual" alignment check, improved readability of the loop,
which traverses the process memory regions, updated comments.

  


These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.

However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.

--

A better architecture would be to arrange for the file times
to be updated when the page makes the transition from being
unmodified to modified.  This is not straightforward due to
the current locking, but should be doable, I think.  Perhaps
recording the current time and then using it to update the
file times at a more suitable time (no pun intended) might
work.

   Thanx...

  ps



Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..cb973eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,26 +1,32 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Updating the mtime and ctime stamps for mapped files
+ * and code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*

- * The msync() system call.
- */
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/syscalls.h>
 
 /*

  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -33,70 +39,68 @@ 

Re: [PATCH 6/6] NLM: Add reference counting to lockd

2008-01-08 Thread Peter Staubach

Jeff Layton wrote:

On Tue, 8 Jan 2008 17:46:33 +1100
Neil Brown <[EMAIL PROTECTED]> wrote:

The comments about patch 5/6 seem sane. I'll plan to incorporate them
in the respin...

  

On Saturday January 5, [EMAIL PROTECTED] wrote:


@@ -357,7 +375,18 @@ lockd_down(void)
goto out;
}
warned = 0;
-   kthread_stop(nlmsvc_task);
+   if (atomic_sub_return(1, &nlmsvc_ref) != 0)
+   printk(KERN_WARNING "lockd_down: lockd is waiting for "
+   "outstanding requests to complete before exiting.\n");
  

Why not "atomic_dec_and_test" ??




Temporary amnesia? :-) I'll change that, atomic_dec_and_test will be
clearer.
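
For reference, the two forms behave identically here; the second just reads
more directly (context trimmed, not the full patch):

/* form used in the patch */
if (atomic_sub_return(1, &nlmsvc_ref) != 0)
	printk(KERN_WARNING "lockd_down: lockd is waiting for "
		"outstanding requests to complete before exiting.\n");

/* equivalent: atomic_dec_and_test() returns true only when the
 * counter has just reached zero */
if (!atomic_dec_and_test(&nlmsvc_ref))
	printk(KERN_WARNING "lockd_down: lockd is waiting for "
		"outstanding requests to complete before exiting.\n");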

  

+
+   /*
+    * Sending a signal is necessary here. If we get to this point and
+    * nlm_blocked isn't empty then lockd may be held hostage by clients
+    * that are still blocking. Sending the signal makes sure that lockd
+    * invalidates all of its locks so that it's just waiting on RPC
+    * callbacks to complete
+    */
+   kill_proc(nlmsvc_task->pid, SIGKILL, 1);
  

The previous patch removes a kill_proc(... SIGKILL),  this one adds it
back.
That makes me wonder if the intermediate state is 'correct'.

But I also wonder what "correct" means.
Do we want all locks to be dropped when the last nfsd thread dies?
The answer is presumably either "yes" or "no".
If "yes", then we don't have that because if there are any NFS mounts
active, lockd will not be killed.
If "no", then we don't want this kill_proc here.

The comment in lockd() which currently reads:

/*
 * The main request loop. We don't terminate until the last
 * NFS mount or NFS daemon has gone away, and we've been sent a
 * signal, or else another process has taken over our job.
 */

suggests that someone once thought that lockd could hang around after
all nfsd threads and nfs mounts had gone, but I don't think it does.

We really should think this through and get it right, because if lockd
ever drops its locks, then we really need to make sure sm_notify gets
run.  So it needs to be a well defined event.

Thoughts?




This is the part I've been struggling with the most -- defining what
proper behavior should be when lockd is restarted. As you point out,
restarting lockd without doing a sm_notify could be bad news for data
integrity.

Then again, we'd like someone to be able to shut down the NFS "service"
and be able to unmount underlying filesystems without jumping through
special hoops

Overall, I think I'd vote "yes". We need to drop locks when the last
nfsd goes down. If userspace brings down nfsd, then it's userspace's
responsibility to make sure that a sm_notify is sent when nfsd and lockd
are restarted.
  


I would vote for the simplest possible model that makes sense.
We need a simple model for admins as well as a simple model
which is easy to implement in as bug-free a way as possible.  The
trick is not making it too simple because that can cost
performance, but not making it too complicated to implement
reasonably and for admins to be able to figure out.

So, I would vote for "yes" as well.  That will yield an
architecture where we can shutdown systems cleanly and will
be easy to understand when locks for clients exist and when
they do not.

   Thanx...

  ps




As a side note, I'm not thrilled with this design that mixes signals
and kthreads, but didn't see another way to do this. I'm open to
suggestions if anyone has them...

  

Also, it is sad that the inc/dec of nlmsvc_ref is called in somewhat
non-obvious ways.
e.g.



+   if (!nlmsvc_users && error)
+   atomic_dec(&nlmsvc_ref);
  

and



+   if (list_empty(&nlm_blocked))
+   atomic_inc(&nlmsvc_ref);
+
if (list_empty(&block->b_list)) {
kref_get(&block->b_count);
} else {
  

where if we moved the atomic_inc a little bit later next to the
"list_add_tail" (which seems to make more sense) it would actually be
wrong... But I think that code is correct as it is - just non-obvious.




The nlmsvc_ref logic is pretty convoluted, unfortunately. I'll plan to
add some comments to clarify what I'm doing there.

Thanks for the review, Neil. I'll see if I can get a new patchset done
in the next few days.

Cheers,
  


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] VFS: new fgetattr() file operation

2007-10-24 Thread Peter Staubach

Miklos Szeredi wrote:

Miklos Szeredi wrote:


I don't think Christoph will like the patch better, regardless of how
I change the description.

Of course, I'm open to suggestion on how to improve the interface, but
I think fundamentally this is the only way to correctly deal with the
below problem.

Anyway, here's the patch another time, please consider adding it to
-mm.  For 2.6.25 obviously.

Thanks,
Miklos


From: Miklos Szeredi <[EMAIL PROTECTED]>

Add a new file operation: f_op->fgetattr(), that is invoked by
fstat().  Fall back to i_op->getattr() if it is not defined.

We need this because fstat() semantics can in some cases be better
implemented if the filesystem has the open file available.

Let's take the following example: we have a network filesystem, with
the server implemented as an unprivileged userspace process running on
a UNIX system (this is basically what sshfs does).

We want the filesystem to follow the familiar UNIX file semantics as
closely as possible.  If for example we have this sequence of events,
we still would like fstat to work correctly:

 1) file X is opened on client
 2) file X is renamed to Y on server
 3) fstat() is performed on open file descriptor on client

This is only possible if the filesystem server actually uses fstat()
on a file descriptor obtained when the file was opened.  Which means,
the filesystem client needs a way to get this information from the
VFS.

  
  

This is true iff the protocol of this mythical



Not mythical at all.  As noted in the description, there's sshfs, a
live and quite popular example of this sort of filesystem.

  

network file system uses the name of the file on the server to
actually identify the file on the server.



The constraint is that the server has to be an ordinary unprivileged
process.  How should it identify the file, other than by name, or by
an open file descriptor?

  


I explained this.  The fileid and the generation count along
with the file system id will uniquely identify the file.


Clearly, this is broken on many levels.  It can't handle
situations as described nor can it handle different instances
of the same filename being used.



Can you please give concrete examples what it can't handle, and how
should the implementation be improved to be able to handle it, given
the above constraints?

  

This is why NFS, a network file system, does not use the filename
as part of the file handle.



And the nfs server isn't a userspace process, or if it is, it must use
horrible hacks to convert the file handle to a name, that don't work
half the time.

  


Nice try.  Wrong.  Try a different rationalization.


Wouldn't you be better off by attempting to implement an "open
by ino" operation and an operation to get the generation count
for the file and then modifying the network protocol of interest
to use these as identifiers for the file to be manipulated?



You mean an "open by inode" on the userspace API?  My guess, it
wouldn't get very far.

  


This isn't a new idea and has been implemented on a variety of
different systems.


Anyway, that would still not work on old servers, and servers running
other OS's.

  


I didn't think that we were talking about old servers and other
OS's.  My concern at the moment is Linux and the changes being
made to it.


Note, the point is _not_ to make a brand new NFS replacement
filesystem, that can use names instead of file handles.  The point is
to use existing infrastructure, to make the setup as easy as ssh'ing
to a different machine.  And sshfs does just that.


And the solution is limiting.  It is not scalable nor particularly
interesting to anyone interested in security.  Unless there is a
way of limiting access to a particular set of files, then it is
not generally useful outside of hackers or perhaps small groups
of users not concerned about too many aspects of security.

I am not interested in an extended discussion of this topic.

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] VFS: new fgetattr() file operation

2007-10-24 Thread Peter Staubach

Miklos Szeredi wrote:

I don't think Christoph will like the patch better, regardless of how
I change the description.

Of course, I'm open to suggestion on how to improve the interface, but
I think fundamentally this is the only way to correctly deal with the
below problem.

Anyway, here's the patch another time, please consider adding it to
-mm.  For 2.6.25 obviously.

Thanks,
Miklos


From: Miklos Szeredi <[EMAIL PROTECTED]>

Add a new file operation: f_op->fgetattr(), that is invoked by
fstat().  Fall back to i_op->getattr() if it is not defined.

We need this because fstat() semantics can in some cases be better
implemented if the filesystem has the open file available.

Let's take the following example: we have a network filesystem, with
the server implemented as an unprivileged userspace process running on
a UNIX system (this is basically what sshfs does).

We want the filesystem to follow the familiar UNIX file semantics as
closely as possible.  If for example we have this sequence of events,
we still would like fstat to work correctly:

 1) file X is opened on client
 2) file X is renamed to Y on server
 3) fstat() is performed on open file descriptor on client

This is only possible if the filesystem server actually uses fstat()
on a file descriptor obtained when the file was opened.  Which means,
the filesystem client needs a way to get this information from the
VFS.

  


This is true iff the protocol of this mythical network file
system uses the name of the file on the server to actually
identify the file on the server.

Clearly, this is broken on many levels.  It can't handle
situations as described nor can it handle different instances
of the same filename being used.

This is why NFS, a network file system, does not use the filename
as part of the file handle.

Wouldn't you be better off by attempting to implement an "open
by ino" operation and an operation to get the generation count
for the file and then modifying the network protocol of interest
to use these as identifiers for the file to be manipulated?

I agree with Christoph on this one.  It is the wrong path.

  ps
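
A rough sketch of the sort of name-independent identifier being suggested
(hypothetical types and field names, for illustration only):

/* Hypothetical illustration: identify a remote file the way an NFS file
 * handle does, rather than by path name. */
struct remote_file_id {
	uint64_t	fsid;		/* which exported filesystem */
	uint64_t	fileid;		/* inode number on the server */
	uint32_t	generation;	/* distinguishes reuse of that inode */
};

/* A server offering an "open by identity" operation would resolve this
 * triple to an open descriptor instead of re-walking a path, so a rename
 * on the server does not change which file is meant. */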



Even if we assume, that the remote filesystem never changes, it is
difficult to implement open-unlink-fstat semantics correctly in the
client, without having this information.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/stat.c
===
--- linux.orig/fs/stat.c    2007-10-24 11:59:46.0 +0200
+++ linux/fs/stat.c 2007-10-24 11:59:47.0 +0200
@@ -55,6 +55,33 @@ int vfs_getattr(struct vfsmount *mnt, st
 
 EXPORT_SYMBOL(vfs_getattr);
 
+/*

+ * Perform getattr on an open file
+ *
+ * Fall back to i_op->getattr (or generic_fillattr) if the filesystem
+ * doesn't define an f_op->fgetattr operation.
+ */
+static int vfs_fgetattr(struct file *file, struct kstat *stat)
+{
+   struct vfsmount *mnt = file->f_path.mnt;
+   struct dentry *dentry = file->f_path.dentry;
+   struct inode *inode = dentry->d_inode;
+   int retval;
+
+   retval = security_inode_getattr(mnt, dentry);
+   if (retval)
+   return retval;
+
+   if (file->f_op && file->f_op->fgetattr) {
+   return file->f_op->fgetattr(file, stat);
+   } else if (inode->i_op->getattr) {
+   return inode->i_op->getattr(mnt, dentry, stat);
+   } else {
+   generic_fillattr(inode, stat);
+   return 0;
+   }
+}
+
 int vfs_stat_fd(int dfd, char __user *name, struct kstat *stat)
 {
struct nameidata nd;
@@ -101,7 +128,7 @@ int vfs_fstat(unsigned int fd, struct ks
int error = -EBADF;
 
 	if (f) {

-   error = vfs_getattr(f->f_path.mnt, f->f_path.dentry, stat);
+   error = vfs_fgetattr(f, stat);
fput(f);
}
return error;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-10-24 11:59:46.0 +0200
+++ linux/include/linux/fs.h    2007-10-24 11:59:47.0 +0200
@@ -1195,6 +1195,7 @@ struct file_operations {
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info 
*, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
int (*revoke)(struct file *, struct address_space *);
+   int (*fgetattr)(struct file *, struct kstat *);
 };
 
 struct inode_operations {

Index: linux/fs/fuse/file.c
===
--- linux.orig/fs/fuse/file.c   2007-10-24 11:59:46.0 +0200
+++ linux/fs/fuse/file.c    2007-10-24 12:01:00.0 +0200
@@ -871,6 +871,17 @@ static int fuse_file_flock(struct file *
return err;
 }
 
+static int fuse_file_fgetattr(struct file *file, struct kstat *stat)

+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct fuse_conn *fc = get_fuse_conn(inode);
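
The quoted fuse hunk is cut off here in the archive.  As background, the
server-side behaviour the description relies on is plain POSIX descriptor
semantics, roughly:

/* Sketch: fstat() on an already-open descriptor keeps working after the
 * file is renamed, which is what an sshfs-like server forwards to the
 * client via the new fgetattr() path. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;
	int fd = open("X", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	if (rename("X", "Y") < 0)       /* file renamed behind the open fd */
		return 1;
	if (fstat(fd, &st) == 0)        /* still reaches the same inode */
		printf("ino=%llu size=%lld\n",
		       (unsigned long long)st.st_ino, (long long)st.st_size);
	return 0;
}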

Re: [PATCH] VFS: new fgetattr() file operation

2007-10-24 Thread Peter Staubach

Miklos Szeredi wrote:

Miklos Szeredi wrote:


I don't think Christoph will like the patch better, regardless of how
I change the description.

Of course, I'm open to suggestion on how to improve the interface, but
I think fundamentally this is the only way to correctly deal with the
below problem.

Anyway, here's the patch another time, please consider adding it to
-mm.  For 2.6.25 obviously.

Thanks,
Miklos


From: Miklos Szeredi [EMAIL PROTECTED]

Add a new file operation: f_op-fgetattr(), that is invoked by
fstat().  Fall back to i_op-getattr() if it is not defined.

We need this because fstat() semantics can in some cases be better
implemented if the filesystem has the open file available.

Let's take the following example: we have a network filesystem, with
the server implemented as an unprivileged userspace process running on
a UNIX system (this is basically what sshfs does).

We want the filesystem to follow the familiar UNIX file semantics as
closely as possible.  If for example we have this sequence of events,
we still would like fstat to work correctly:

 1) file X is opened on client
 2) file X is renamed to Y on server
 3) fstat() is performed on open file descriptor on client

This is only possible if the filesystem server acutally uses fstat()
on a file descriptor obtained when the file was opened.  Which means,
the filesystem client needs a way to get this information from the
VFS.

  
  

This true iff the protocol that this mythical



Not mythical at all.  As noted in the description, there's sshfs, a
live and quite popular example of this sort of filesystem.

  

network file system uses the name of the file on the server to
actually identify the file on the server.



The constraint is that the server has to be an ordinary unprivileged
process.  How should it identify the file, other than by name, or by
an open file descriptor?

  


I explained this.  The fileid and the generation count along
with the file system id will uniquely identify the file.


Clearly, this is broken on many levels.  It can't handle
situations as described nor can it handle different instances
of the same filename being used.



Can you please give concrete examples what it can't handle, and how
should the implementation be improved to be able to handle it, given
the above constraints?

  

This is why NFS, a network file system, does not use the filename
as part of the file handle.



And the nfs server isn't a userspace process, or if it is, it must use
horrible hacks to convert the file handle to a name, that don't work
half the time.

  


Nice try.  Wrong.  Try a different rationalization.


Wouldn't you be better off by attempting to implement an open
by ino operation and an operation to get the generation count
for the file and then modifying the network protocol of interest
to use these as identifiers for the file to be manipulated?



You mean an open by inode on the userspace API?  My guess, it
wouldn't get very far.

  


This isn't a new idea and has been implemented on a variety of
different systems.


Anyway, that would still not work on old servers, and servers running
other OS's.

  


I didn't think that we were talking about old servers and other
OS's.  My concern at the moment is Linux and the changes being
made to it.


Note, the point is _not_ to make a brand new NFS replacement
filesystem, that can use names instead of file handles.  The point is
to use existing infrastructure, to make the setup as easy as ssh'ing
to a different machine.  And sshfs does just that.


And the solution is limiting.  It is neither scalable nor particularly
interesting to anyone concerned with security.  Unless there is a
way of limiting access to a particular set of files, it is not
generally useful outside of hackers or perhaps small groups of
users unconcerned with many aspects of security.

I am not interested in an extended discussion of this topic.

   Thanx...

  ps
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [NFS] What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree...

2007-10-04 Thread Peter Staubach

Trond Myklebust wrote:

On Thu, 2007-10-04 at 11:42 -0700, Andrew Morton wrote:
  

On Thu, 4 Oct 2007 18:43:04 +0200
Pierre Ossman <[EMAIL PROTECTED]> wrote:



On Thu, 04 Oct 2007 10:00:50 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

  

On Thu, 2007-10-04 at 08:52 +0200, Pierre Ossman wrote:


On Wed, 03 Oct 2007 19:41:16 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

  

We also have the 64-bit inode support from RedHat/Peter Staubach.



As has been pointed[1] out[2], this will cause regressions for
non-LFS applications (of which there are still lots and lots). This
change should be in feature-removal (the "feature" being removed is
legacy support for non-LFS applications using NFS servers that make
full use of the protocol) and preferably accompanied with
appropriate user space changes (e.g. compatibility option in glibc).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=241348
[2] http://marc.info/?l=linux-nfs&m=118701088726477&w=2

Rgds
  

How about a boot/module parameter to turn it on or off?



That would be perfect. It can even be in non-legacy mode by default,
just as long as you can go back to the old behaviour when/if you run
into a non-LFS application.

  

Wouldn't a mount option be better?



I suppose that might be OK if you know that the 32-bit legacy
applications will only touch one or two servers, but that sounds like a
niche thing.

On the downside, forcing all those people who have portable 64-bit aware
applications to upgrade their version of mount just in order to have
stat64() work correctly seems unnecessarily complicated. I'd prefer not
to have to do that unless someone comes up with a good reason why we
must.


I would agree.  The 64 bit fileids will only become visible when
the server is exporting file systems which contain fileids which
are bigger than 32 bits and then only when the application
encounters these files.

Also, these 32-bit legacy applications are going to have a
problem if they are ever run on a system which contains local
file systems which expose the large fileids.

It would be better to identify these applications and get them
fixed.  The world is evolving and it is time for them to do so.
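
For readers following the boot/module parameter idea above, such a switch
might look roughly like the sketch below.  The names and the folding scheme
are illustrative assumptions, not the code that was eventually merged:

	/* When 64-bit fileids are disabled, fold them down so legacy
	 * (non-LFS) stat() callers do not see EOVERFLOW. */
	static int enable_ino64 = 1;
	module_param(enable_ino64, bool, 0644);

	static inline u64 nfs_compat_user_ino64(u64 fileid)
	{
		if (enable_ino64)
			return fileid;
		return (u32)fileid ^ (u32)(fileid >> 32);
	}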

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 14/22] NFS: Use local caching

2007-09-24 Thread Peter Staubach

David Howells wrote:

David Howells <[EMAIL PROTECTED]> wrote:

  

Peter Staubach <[EMAIL PROTECTED]> wrote:



Did I miss the section where the modified semantics about which
mounted file systems can use the cache and which ones can not
was implemented?
  

Yes.



fs/nfs/super.c:

case Opt_sharecache:
mnt->flags &= ~NFS_MOUNT_UNSHARED;
break;
case Opt_nosharecache:
mnt->flags |= NFS_MOUNT_UNSHARED;
mnt->options &= ~NFS_OPTION_FSCACHE;
break;
case Opt_fscache:
/* sharing is mandatory with fscache */
mnt->options |= NFS_OPTION_FSCACHE;
mnt->flags &= ~NFS_MOUNT_UNSHARED;
break;
case Opt_nofscache:
mnt->options &= ~NFS_OPTION_FSCACHE;
break;

Hmmm...  Actually, I'm not sure this is sufficient.


This doesn't seem to take into account any of the other options
which can cause sharing to be disabled.  Perhaps SteveD can add
his patch to the mix which does resolve the issues?
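
One way the missing check could be expressed, purely as a sketch of the
idea rather than SteveD's actual patch, is a final validation after all
options have been parsed:

	/* Whatever set NFS_MOUNT_UNSHARED (nosharecache or any other
	 * option) is incompatible with "fsc". */
	if ((mnt->options & NFS_OPTION_FSCACHE) &&
	    (mnt->flags & NFS_MOUNT_UNSHARED)) {
		printk(KERN_INFO "NFS: 'fsc' requires shared superblocks\n");
		return -EINVAL;	/* or however the surrounding parser reports errors */
	}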

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 14/22] NFS: Use local caching

2007-09-21 Thread Peter Staubach

David Howells wrote:

The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

mount warthog:/ /a -o fsc

Signed-Off-By: David Howells <[EMAIL PROTECTED]>
---


Did I miss the section where the modified semantics about which
mounted file systems can use the cache and which ones can not
was implemented?  For example, mounts of the same file system
from the server with "fsc", but with different mount options
such as "rw" or "ro" or NFS dependent mount options, must fail
because of the way that the cache is accessed.  Also, perhaps
a little confusingly, mounts of different paths on a server
which land on the same mounted file system on the server, but
with these differing mount options, must also fail?

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [NFS] [PATCH 1/3] VFS: make notify_change pass ATTR_KILL_S*ID to setattr operations

2007-08-30 Thread Peter Staubach

Jeff Layton wrote:

Make notify_change not clear the ATTR_KILL_S*ID bits in the ia_valid that
gets passed to the setattr inode operation. This allows the filesystems
to reinterpret whether this mode change is simply intended to clear the
setuid/setgid bits.

This means that notify_change should never be called with both ATTR_MODE
and either of the ATTR_KILL_S*ID bits set, since the filesystem would
have no way to know what part of the mode change was intentional. If
it is called this way, consider it a BUG().

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
---
 fs/attr.c |   22 --
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index ae58bd3..f98d10c 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -103,12 +103,11 @@ EXPORT_SYMBOL(inode_setattr);
 int notify_change(struct dentry * dentry, struct iattr * attr)
 {
struct inode *inode = dentry->d_inode;
-   mode_t mode;
+   mode_t mode = inode->i_mode;
int error;
struct timespec now;
unsigned int ia_valid = attr->ia_valid;
 
-	mode = inode->i_mode;

now = current_fs_time(inode->i_sb);
 
 	attr->ia_ctime = now;

@@ -125,18 +124,21 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr)
if (error)
return error;
}
+
+   /*
+* It's not valid to pass an iattr with both ATTR_MODE and
+* ATTR_KILL_S*ID set.
+*/
+   if (ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID) && ia_valid & ATTR_MODE)
  


If you would, please add some parentheses to show and make
explicit what the bindings are.  This is:

  if ((ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) &&
  (ia_valid & ATTR_MODE))
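
The same check can also be collapsed into a single statement; this is an
equivalent form rather than necessarily what was merged:

	BUG_ON((ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) &&
	       (ia_valid & ATTR_MODE));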

Thanx...

  ps


+   BUG();
+
if (ia_valid & ATTR_KILL_SUID) {
-   attr->ia_valid &= ~ATTR_KILL_SUID;
if (mode & S_ISUID) {
-   if (!(ia_valid & ATTR_MODE)) {
-   ia_valid = attr->ia_valid |= ATTR_MODE;
-   attr->ia_mode = inode->i_mode;
-   }
-   attr->ia_mode &= ~S_ISUID;
+   ia_valid = attr->ia_valid |= ATTR_MODE;
+   attr->ia_mode = (inode->i_mode & ~S_ISUID);
}
}
if (ia_valid & ATTR_KILL_SGID) {
-   attr->ia_valid &= ~ ATTR_KILL_SGID;
if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
if (!(ia_valid & ATTR_MODE)) {
ia_valid = attr->ia_valid |= ATTR_MODE;
@@ -145,7 +147,7 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr)
attr->ia_mode &= ~S_ISGID;
}
}
-   if (!attr->ia_valid)
+   if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
return 0;
 
 	if (ia_valid & ATTR_SIZE)
  


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add source address to sunrpc svc errors

2007-08-29 Thread Peter Staubach

[EMAIL PROTECTED] wrote:

On Mon, 27 Aug 2007 17:43:33 EDT, "J. Bruce Fields" said:

  

Looks like a reasonable idea to me, thanks!  Any objection to just
calling it "svc_printk" instead of "svc_printkerr"?

I also wonder whether these shouldn't all be dprintk's instead of
printk's.  One misbehaving client could create a lot of noise in the
logs.



I shouldn't have to rebuild my kernel with debugging enabled just to see
who is throwing trash at my machine.  printk(KERN_INFO maybe and/or using
a printk_ratelimit.
  


There are a lot of ways to discover who is throwing trash
at your system other than the kernel printing messages.

Tools such as tcpdump and tethereal/wireshark are much better
suited for this purpose.
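
For reference, the rate-limited KERN_INFO form suggested above might look
roughly like this (a sketch only; the message text is illustrative):

	/* Keep the message visible without a debugging build, but stop
	 * one misbehaving client from flooding the log. */
	if (printk_ratelimit())
		printk(KERN_INFO "svc: dropped malformed RPC request\n");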

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: NFS hang + umount -f: better behaviour requested.

2007-08-24 Thread Peter Staubach

Ric Wheeler wrote:

J. Bruce Fields wrote:

On Tue, Aug 21, 2007 at 02:50:42PM -0400, John Stoffel wrote:
 

Not in my experience.  We use NetApps as our backing NFS servers, so
maybe my experience isn't totally relevant.  But with a mix of Linux
and Solaris clients, we've never had problems with soft,intr on our
NFS clients.

We also don't see file corruption, mysterious executables failing to
run, etc. 
Now maybe those issues are raised when you have a Linux NFS server

with Solaris clients.  But in my book, reliable NFS servers are key,
and if they are reliable, 'soft,intr' works just fine.



The NFS server alone can't prevent the problems Peter Staubach refers
to.  Their frequency also depends on the network and the way you're
using the filesystem.  (A sufficiently paranoid application accessing
the filesystem could function correctly despite the problems caused by
soft mounts, but the degree of paranoia required probably isn't common.)
  
Would it be sufficient to ensure that the application always issues 
an fsync() before closing any recently written/updated file? Are there 
some other subtle paranoid techniques that should be used?


I suspect that this is not sufficient.  The application should
be prepared to rewrite data if it can determine what data did
not get written.  Using fsync will tell the application when
data was not written to the server correctly, but not which
part of the data.

Perhaps O_SYNC or fsync following each write, but either one of
these options will also cause a large performance degradation.

The right solution is the use of TCP and hard mounting.
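
To illustrate the point, a sufficiently paranoid application would have to
keep its own copy of the data and rewrite the whole range whenever write()
or fsync() reports an error, since fsync() does not say which part failed.
A sketch, under that assumption:

	#include <fcntl.h>
	#include <unistd.h>

	/* Rewrite all of the data on any error; a failed fsync() does
	 * not say which bytes were lost on the way to the server. */
	static int careful_write(const char *path, const char *buf, size_t len)
	{
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		int tries;

		if (fd < 0)
			return -1;
		for (tries = 0; tries < 3; tries++) {
			if (lseek(fd, 0, SEEK_SET) == 0 &&
			    write(fd, buf, len) == (ssize_t)len &&
			    fsync(fd) == 0) {
				close(fd);
				return 0;	/* data is known to be on the server */
			}
		}
		close(fd);
		return -1;
	}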

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NFS hang + umount -f: better behaviour requested.

2007-08-21 Thread Peter Staubach

John Stoffel wrote:

"Peter" == Peter Staubach <[EMAIL PROTECTED]> writes:



Peter> John Stoffel wrote:
Robin> I'm bringing this up again (I know it's been mentioned here
Robin> before) because I had been told that NFS support had gotten
Robin> better in Linux recently, so I have been (for my $dayjob)
Robin> testing the behaviour of NFS (autofs NFS, specifically) under
Robin> Linux with hard,intr and using iptables to simulate a hang.
  

So why are you mounting with hard,intr semantics?  At my current
SysAdmin job, we mount everything (solaris included) with 'soft,intr'
and it works well.  If an NFS server goes down, clients don't hang for
large periods of time. 
  


Peter> Wow!  That's _really_ a bad idea.  NFS READ operations which
Peter> timeout can lead to executables which mysteriously fail, file
Peter> corruption, etc.  NFS WRITE operations which fail may or may
Peter> not lead to file corruption.

Peter> Anything writable should _always_ be mounted "hard" for safety
Peter> purposes.  Readonly mounted file systems _may_ be mounted
Peter> "soft", depending upon what is located on them.

Not in my experience.  We use NetApps as our backing NFS servers, so
maybe my experience isn't totally relevant.  But with a mix of Linux
and Solaris clients, we've never had problems with soft,intr on our
NFS clients.

We also don't see file corruption, mysterious executables failing to
run, etc.  


Now maybe those issues are raised when you have a Linux NFS server
with Solaris clients.  But in my book, reliable NFS servers are key,
and if they are reliable, 'soft,intr' works just fine.

Now maybe if we had NFS exported directories everywhere, and stuff
cross mounted all over the place with autofs, then we might change our
minds.  


In any case, I don't dis-agree with the fundamental request to make
the NFS client code on Linux easier to work with.  I bet Trond (who
works at NetApp) will have something to say on this issue.


Just for the others who may be reading this thread --

If you use sufficient network bandwidth and high quality
enough networks and NFS servers with plenty of resources,
then you _may_ be able to get away with "soft" mounting
for some period of time.

However, any server, including Solaris and NetApp servers,
will fail, and those failures may or may not affect the
NFS service being provided.  In fact, unless the system
is being carefully administrated and the applications are
written very well, with error detection and recovery in
mind, then corruption can occur, and it can be silent and
unnoticed until too late.  In fact, most failures do occur
silently and get chalked up to other causes because it will
not be possible to correlate the badness with the NFS
client giving up when attempting to communicate with an
NFS server.

I wish you the best of luck, although with the environment
that you describe, it seems like "hard" mounts would work
equally well and would not incur the risks.

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NFS hang + umount -f: better behaviour requested.

2007-08-21 Thread Peter Staubach

Robin Lee Powell wrote:

On Tue, Aug 21, 2007 at 01:01:44PM -0400, Peter Staubach wrote:
  

John Stoffel wrote:


Robin> I'm bringing this up again (I know it's been mentioned here
Robin> before) because I had been told that NFS support had gotten
Robin> better in Linux recently, so I have been (for my $dayjob)
Robin> testing the behaviour of NFS (autofs NFS, specifically) under
Robin> Linux with hard,intr and using iptables to simulate a hang.

So why are you mounting with hard,intr semantics?  At my current
SysAdmin job, we mount everything (solaris included) with
'soft,intr' and it works well.  If an NFS server goes down,
clients don't hang for large periods of time. 
  

Wow!  That's _really_ a bad idea.  NFS READ operations which
timeout can lead to executables which mysteriously fail, file
corruption, etc.  NFS WRITE operations which fail may or may not
lead to file corruption.

Anything writable should _always_ be mounted "hard" for safety
purposes.  Readonly mounted file systems _may_ be mounted "soft",
depending upon what is located on them.



Does write + tcp make this any different?


Nope...

TCP may make a difference if the problem is related to the network
being slow or lossy, but will not affect anything if the server
is just slow or down.  Even if TCP would have eventually gotten
all of the packets in a request or response through, the client
may time out, cease waiting, and corruption may occur again.

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NFS hang + umount -f: better behaviour requested.

2007-08-21 Thread Peter Staubach

John Stoffel wrote:

Robin> I'm bringing this up again (I know it's been mentioned here
Robin> before) because I had been told that NFS support had gotten
Robin> better in Linux recently, so I have been (for my $dayjob)
Robin> testing the behaviour of NFS (autofs NFS, specifically) under
Robin> Linux with hard,intr and using iptables to simulate a hang.

So why are you mounting with hard,intr semantics?  At my current
SysAdmin job, we mount everything (solaris included) with 'soft,intr'
and it works well.  If an NFS server goes down, clients don't hang for
large periods of time. 

  


Wow!  That's _really_ a bad idea.  NFS READ operations which
timeout can lead to executables which mysteriously fail, file
corruption, etc.  NFS WRITE operations which fail may or may
not lead to file corruption.

Anything writable should _always_ be mounted "hard" for safety
purposes.  Readonly mounted file systems _may_ be mounted "soft",
depending upon what is located on them.


Robin> fuser hangs, as far as I can tell indefinately, as does
Robin> lsof. umount -f returns after a long time with "busy", umount
Robin> -l works after a long time but leaves the system in a very
Robin> unfortunate state such that I have to kill things by hand and
Robin> manually edit /etc/mtab to get autofs to work again.

Robin> The "correct solution" to this situation according to
Robin> http://nfs.sourceforge.net/ is cycles of "kill processes" and
Robin> "umount -f".  This has two problems:  1.  It sucks.  2.  If fuser
Robin> and lsof both hang (and they do: fuser has been on
Robin> "stat("/home/rpowell/"," for > 30 minutes now), I have no way to
Robin> pick which processes to kill.

Robin> I've read every man page I could find, and the only nfs option
Robin> that semes even vaguely helpful is "soft", but everything that
Robin> mentions "soft" also says to never use it.

I think the man pages are out of date, or ignoring reality.  Try
mounting with soft,intr and see how it works for you.  I think you'll
be happy.  

  


Please don't.  You will end up regretting it in the long run.
Taking a chance on corrupted data or critical applications which
just fail is not worth the benefit.

It would be safer for us to implement something which works like
the Solaris forced umount support for NFS.

   Thanx...

  ps


Robin> This is the single worst aspect of adminning a Linux system that I,
Robin> as a career sysadmin, have to deal with.  In fact, it's really the
Robin> only one I even dislike. At my current work place, we've lost
Robin> multiple person-days to this issue, having to go around and reboot
Robin> every Linux box that was hanging off a down NFS server.

Robin> I know many other admins who also really want Solaris style
Robin> "umount -f"; I'm sure if I passed the hat I could get a decent
Robin> bounty together for this feature; let me know if you're interested.

Robin> Thanks.

Robin> -Robin

Robin> -- 
Robin> http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

Robin> Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Robin> Proud Supporter of the Singularity Institute - http://singinst.org/
Robin> -
Robin> To unsubscribe from this list: send the line "unsubscribe linux-kernel" 
in
Robin> the body of a message to [EMAIL PROTECTED]
Robin> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robin> Please read the FAQ at  http://www.tux.org/lkml/


Robin> !DSPAM:46ca1d9676791030010506!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] autofs4: reinstate negative timeout of mount fails

2007-08-21 Thread Peter Staubach

Ian Kent wrote:

Hi,

Due to a change to fs/dcache.c:d_lookup() in the 2.6 kernel whereby only
hashed dentrys are returned the negative caching of mount failures
stopped working in the autofs4 module for nobrowse mount (ie. directory
created at mount time and removed at umount or following a mount
failure).

This patch keeps track of the dentrys from mount fails in order to be
able check the timeout since the last fail and return the appropriate
status. In addition the timeout value is settable at load time as a
module option and via sysfs using the module
parameter /sys/module/autofs4/parameters/negative_timeout.

Signed-off-by: Ian Kent <[EMAIL PROTECTED]>

---
--- linux-2.6.23-rc2-mm2/fs/autofs4/init.c.negative-timeout 2007-07-09 
07:32:17.0 +0800
+++ linux-2.6.23-rc2-mm2/fs/autofs4/init.c  2007-08-21 15:44:34.0 
+0800
@@ -14,6 +14,10 @@
 #include 
 #include "autofs_i.h"
 
+unsigned int negative_timeout = AUTOFS_NEGATIVE_TIMEOUT;

+module_param(negative_timeout, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(negative_timeout, "Cache mount fails negatively for this many 
seconds");
+
 static int autofs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
--- linux-2.6.23-rc2-mm2/fs/autofs4/inode.c.negative-timeout2007-08-17 
11:52:33.0 +0800
+++ linux-2.6.23-rc2-mm2/fs/autofs4/inode.c 2007-08-21 15:44:34.0 
+0800
@@ -46,6 +46,7 @@ struct autofs_info *autofs4_init_ino(str
ino->inode = NULL;
ino->dentry = NULL;
ino->size = 0;
+   ino->negative_timeout = negative_timeout;
 
	INIT_LIST_HEAD(&ino->rehash);
 
@@ -98,11 +99,24 @@ void autofs4_free_ino(struct autofs_info

 static void autofs4_force_release(struct autofs_sb_info *sbi)
 {
struct dentry *this_parent = sbi->sb->s_root;
-   struct list_head *next;
+   struct list_head *p, *next;
 
 	if (!sbi->sb->s_root)

return;
 
+	/* Cleanup the negative dentry cache */

+   spin_lock(&sbi->rehash_lock);
+   list_for_each_safe(p, next, &sbi->rehash_list) {
+   struct autofs_info *ino;
+   struct dentry *dentry;
+   ino = list_entry(p, struct autofs_info, rehash);
+   dentry = ino->dentry;
+   spin_unlock(&sbi->rehash_lock);
+   dput(ino->dentry);
  


Should this be dput(dentry);?

   Thanx...

  ps



+   spin_lock(&sbi->rehash_lock);
+   }
+   spin_unlock(&sbi->rehash_lock);
+
	spin_lock(&dcache_lock);
 repeat:
next = this_parent->d_subdirs.next;
--- linux-2.6.23-rc2-mm2/fs/autofs4/autofs_i.h.negative-timeout 2007-08-17 
11:52:33.0 +0800
+++ linux-2.6.23-rc2-mm2/fs/autofs4/autofs_i.h  2007-08-21 15:44:34.0 
+0800
@@ -40,6 +40,14 @@
 #define DPRINTK(fmt,args...) do {} while(0)
 #endif
 
+/*

+ * If the daemon returns a negative response (AUTOFS_IOC_FAIL) then we keep
+ * the negative response cached for up to the time given here, although
+ * the time can be shorter if the kernel throws the dcache entry away.
+ */
+#define AUTOFS_NEGATIVE_TIMEOUT	60	/* default 1 minute */
+extern unsigned int negative_timeout;
+
 /* Unified info structure.  This is pointed to by both the dentry and
inode structures.  Each file in the filesystem has an instance of this
structure.  It holds a reference to the dentry, so dentries are never
@@ -52,8 +60,16 @@ struct autofs_info {
 
 	int		flags;
 
+	/*

+* Two types of unhashed dentry can exist on this list.
+* Negative dentrys from failed mounts and positive dentrys
+	 * resulting from a race between expire and mount. This 
+	 * fact is used when looking for dentrys in the list.

+*/
struct list_head rehash;
 
+	unsigned int negative_timeout;

+
struct autofs_sb_info *sbi;
unsigned long last_used;
atomic_t count;
--- linux-2.6.23-rc2-mm2/fs/autofs4/root.c.negative-timeout 2007-08-17 
11:53:38.0 +0800
+++ linux-2.6.23-rc2-mm2/fs/autofs4/root.c  2007-08-21 15:44:34.0 
+0800
@@ -238,6 +238,125 @@ out:
return dcache_readdir(file, dirent, filldir);
 }
 
+static int autofs4_compare_dentry(struct dentry *parent, struct dentry *dentry, struct qstr *name)

+{
+   unsigned int len = name->len;
+   unsigned int hash = name->hash;
+   const unsigned char *str = name->name;
+   struct qstr *qstr = &dentry->d_name;
+
+   if (dentry->d_name.hash != hash)
+   return 0;
+   if (dentry->d_parent != parent)
+   return 0;
+
+   if (qstr->len != len)
+   return 0;
+   if (memcmp(qstr->name, str, len))
+   return 0;
+
+   return 1;
+}
+
+static struct dentry *autofs4_lookup_dentry(struct autofs_sb_info *sbi, struct 
dentry *dentry)
+{
+   struct dentry *parent = dentry->d_parent;
+   struct qstr *name = &dentry->d_name;
+   struct list_head *p, *head;
+
+   head = &sbi->rehash_list;
+   
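
On the dput() question raised above: the local dentry pointer is saved
under the lock precisely so it can be used once the lock has been dropped.
A sketch of the cleanup loop with that fix applied (same structure as the
quoted patch):

	spin_lock(&sbi->rehash_lock);
	list_for_each_safe(p, next, &sbi->rehash_list) {
		struct autofs_info *ino;
		struct dentry *dentry;

		ino = list_entry(p, struct autofs_info, rehash);
		dentry = ino->dentry;
		spin_unlock(&sbi->rehash_lock);
		dput(dentry);		/* the pointer saved under the lock */
		spin_lock(&sbi->rehash_lock);
	}
	spin_unlock(&sbi->rehash_lock);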


Re: [PATCH 2.6.21.1] nfs-root: added possibility to override default MTU (for UDP jumbo frames)

2007-07-02 Thread Peter Staubach

[EMAIL PROTECTED] wrote:

To use a NFS-root for UDP jumbo frames the kernel on the client need to bring
up interface with MTU set to 9000 bytes - otherwise it cannot contact server
with jumbo frames enabled (nfs server not responding, still trying) and cannot
boot. Added a kernel parameter named 'ipmtu' which can be used to specify
initial MTU size when booting via nfsroot.

  


Could you describe the problem better, please?  Something does not
sound right.  Both ends need to have jumbo frames enabled in order
to use jumbo frames, but if one end or the other does not, the systems
still should be able to exchange packets using normal sized ethernet
packets.  Isn't this a problem that mtu discovery should handle?

   Thanx...

  ps


Signed-off-by: Mariusz Bialonczyk <[EMAIL PROTECTED]>

diff -Nru linux-2.6.21.1-orig/Documentation/kernel-parameters.txt 
linux-2.6.21.1/Documentation/kernel-parameters.txt
--- linux-2.6.21.1-orig/Documentation/kernel-parameters.txt 2007-04-27 
23:49:26.0 +0200
+++ linux-2.6.21.1/Documentation/kernel-parameters.txt  2007-07-01 
18:47:11.0 +0200
@@ -720,6 +720,9 @@
ip2=[HW] Set IO/IRQ pairs for up to 4 IntelliPort boards
See comment before ip2_setup() in drivers/char/ip2.c.
 
+	ipmtu=		[IP_PNP]

+   See Documentation/nfsroot.txt.
+
ips=[HW,SCSI] Adaptec / IBM ServeRAID controller
See header of drivers/scsi/ips.c.
 
diff -Nru linux-2.6.21.1-orig/Documentation/nfsroot.txt linux-2.6.21.1/Documentation/nfsroot.txt

--- linux-2.6.21.1-orig/Documentation/nfsroot.txt   2007-04-27 
23:49:26.0 +0200
+++ linux-2.6.21.1/Documentation/nfsroot.txt2007-07-01 19:02:40.0 
+0200
@@ -153,6 +153,16 @@
 Default: any
 
 
+ipmtu=<mtu_value>

+
+  This parameter tells the kernel to override default MTU size to specified
+  <mtu_value>. Useful in cases where NFS server have jumbo frames enabled and
+  client can't connect via UDP because of default MTU value (in ethernet
+  usually 1500 bytes). With this option before bringing interface up, kernel
+  will set the passed MTU size. In case of NFS-root booting server and client
+  can use UDP jumbo frames (NFS's rsize and wsize set to 8192 for instance).
+
+
 
 
 3.) Boot Loader

diff -Nru linux-2.6.21.1-orig/net/ipv4/ipconfig.c 
linux-2.6.21.1/net/ipv4/ipconfig.c
--- linux-2.6.21.1-orig/net/ipv4/ipconfig.c 2007-04-27 23:49:26.0 
+0200
+++ linux-2.6.21.1/net/ipv4/ipconfig.c  2007-07-01 15:44:44.0 +0200
@@ -113,6 +113,8 @@
  */
 int ic_set_manually __initdata = 0;/* IPconfig parameters set 
manually */
 
+unsigned int ic_mtu __initdata = 0;		/* IPconfig MTU parameter: 0 - defaults, other - override */

+
 static int ic_enable __initdata = 0;   /* IP config enabled? */
 
 /* Protocol choice */

@@ -209,6 +211,11 @@
able &= ic_proto_enabled;
if (ic_proto_enabled && !able)
continue;
+   if (ic_mtu > 0)
+   {
+   printk(KERN_ERR "IP-Config: Overriding %s MTU to %d 
bytes\n", dev->name, ic_mtu);
+   dev->mtu = ic_mtu;
+   }
oflags = dev->flags;
if (dev_change_flags(dev, oflags | IFF_UP) < 0) {
printk(KERN_ERR "IP-Config: Failed to open %s\n", 
dev->name);
@@ -1506,5 +1513,14 @@
return ip_auto_config_setup(addrs);
 }
 
+static int __init mtu_config_setup(char *str)

+{
+   if (!str)
+   return 0;
+   ic_mtu = simple_strtoul(str, &str, 0);
+   return 1;
+}
+
 __setup("ip=", ip_auto_config_setup);
+__setup("ipmtu=", mtu_config_setup);
 __setup("nfsaddrs=", nfsaddrs_config_setup);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.21.1] nfs-root: added possibility to override default MTU (for UDP jumbo frames)

2007-07-02 Thread Peter Staubach

[EMAIL PROTECTED] wrote:

To use a NFS-root for UDP jumbo frames the kernel on the client need to bring
up interface with MTU set to 9000 bytes - otherwise it cannot contact server
with jumbo frames enabled (nfs server not responding, still trying) and cannot
boot. Added a kernel parameter named 'ipmtu' which can be used to specify
initial MTU size when booting via nfsroot.

  


Could you describe the problem better, please?  Something does not
sound right.  Both ends need to have jumbo frames enabled in order
to use jumbo frames, but if one end or the other does not, the systems
still should be able to exchange packets using normal sized ethernet
packets.  Isn't this a problem that mtu discovery should handle?

   Thanx...

  ps


Signed-off-by: Mariusz Bialonczyk [EMAIL PROTECTED]

diff -Nru linux-2.6.21.1-orig/Documentation/kernel-parameters.txt 
linux-2.6.21.1/Documentation/kernel-parameters.txt
--- linux-2.6.21.1-orig/Documentation/kernel-parameters.txt 2007-04-27 
23:49:26.0 +0200
+++ linux-2.6.21.1/Documentation/kernel-parameters.txt  2007-07-01 
18:47:11.0 +0200
@@ -720,6 +720,9 @@
ip2=[HW] Set IO/IRQ pairs for up to 4 IntelliPort boards
See comment before ip2_setup() in drivers/char/ip2.c.
 
+	ipmtu=		[IP_PNP]

+   See Documentation/nfsroot.txt.
+
ips=[HW,SCSI] Adaptec / IBM ServeRAID controller
See header of drivers/scsi/ips.c.
 
diff -Nru linux-2.6.21.1-orig/Documentation/nfsroot.txt linux-2.6.21.1/Documentation/nfsroot.txt

--- linux-2.6.21.1-orig/Documentation/nfsroot.txt   2007-04-27 
23:49:26.0 +0200
+++ linux-2.6.21.1/Documentation/nfsroot.txt2007-07-01 19:02:40.0 
+0200
@@ -153,6 +153,16 @@
 Default: any
 
 
+ipmtu=mtu_value

+
+  This parameter tells the kernel to override default MTU size to specified
+  mtu_value. Useful in cases where NFS server have jumbo frames enabled and
+  client can't connect via UDP because of default MTU value (in ethernet
+  usually 1500 bytes). With this option before bringing interface up, kernel
+  will set the passed MTU size. In case of NFS-root booting server and client
+  can use UDP jumbo frames (NFS's rsize and wsize set to 8192 for instance).
+
+
 
 
 3.) Boot Loader

diff -Nru linux-2.6.21.1-orig/net/ipv4/ipconfig.c 
linux-2.6.21.1/net/ipv4/ipconfig.c
--- linux-2.6.21.1-orig/net/ipv4/ipconfig.c 2007-04-27 23:49:26.0 
+0200
+++ linux-2.6.21.1/net/ipv4/ipconfig.c  2007-07-01 15:44:44.0 +0200
@@ -113,6 +113,8 @@
  */
 int ic_set_manually __initdata = 0;/* IPconfig parameters set 
manually */
 
+unsigned int ic_mtu __initdata = 0;		/* IPconfig MTU parameter: 0 - defaults, other - override */

+
 static int ic_enable __initdata = 0;   /* IP config enabled? */
 
 /* Protocol choice */

@@ -209,6 +211,11 @@
able = ic_proto_enabled;
if (ic_proto_enabled  !able)
continue;
+   if (ic_mtu  0)
+   {
+   printk(KERN_ERR IP-Config: Overriding %s MTU to %d 
bytes\n, dev-name, ic_mtu);
+   dev-mtu = ic_mtu;
+   }
oflags = dev-flags;
if (dev_change_flags(dev, oflags | IFF_UP)  0) {
printk(KERN_ERR IP-Config: Failed to open %s\n, 
dev-name);
@@ -1506,5 +1513,14 @@
return ip_auto_config_setup(addrs);
 }
 
+static int __init mtu_config_setup(char *str)

+{
+   if (!str)
+   return 0;
+   ic_mtu = simple_strtoul(str, &str, 0);
+   return 1;
+}
+
 __setup("ip=", ip_auto_config_setup);
+__setup("ipmtu=", mtu_config_setup);
 __setup("nfsaddrs=", nfsaddrs_config_setup);
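
For illustration, with this patch applied, a jumbo-frame NFS-root client might
boot with a command line along these lines (server address, export path and
sizes are hypothetical):

    root=/dev/nfs nfsroot=192.168.1.1:/export/client,rsize=8192,wsize=8192 ip=dhcp ipmtu=9000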
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  




Re: [PATCH] NFS: Make NFS root work again

2007-06-07 Thread Peter Staubach

David Howells wrote:

Make NFS root work by creating a "/root" directory to satisfy the mount,
otherwise the path lookup for the mount fails with ENOENT.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 init/do_mounts.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/init/do_mounts.c b/init/do_mounts.c
index 46fe407..967b852 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -270,7 +270,10 @@ static void __init get_fs_names(char *page)
 
 static int __init do_mount_root(char *name, char *fs, int flags, void *data)

 {
-   int err = sys_mount(name, "/root", fs, flags, data);
+   int err;
+
+   sys_mkdir("/root", 0755);
+   err = sys_mount(name, "/root", fs, flags, data);
if (err)
return err;


It seems to me that if sys_mkdir() fails with anything other
than EEXIST, then sys_mount() will continue to fail.  Is this
something that we care about?
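
A sketch of the check being suggested, for clarity (hypothetical, not part of
David's patch):

	err = sys_mkdir("/root", 0755);
	if (err < 0 && err != -EEXIST)
		return err;
	err = sys_mount(name, "/root", fs, flags, data);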

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] allow file system to configure for no leases

2007-06-06 Thread Peter Staubach

Hi.

Attached is a small patch to allow file systems to inform the file
system independent layers that they don't support file leases.

The problem is that some file systems such as NFSv2 and NFSv3 do
not have sufficient support to be able to support leases correctly.
In particular for these two file systems, there is no over the wire
protocol support.

Currently, these two file systems fail the fcntl(F_SETLEASE) call
accidentally, due to a reference counting difference.  These file
systems should fail more consciously, with a proper error to
indicate that the call is invalid for them.

  Thanx...

 ps

Signed-off-by: Peter Staubach <[EMAIL PROTECTED]>

--- linux-2.6.21.i686/fs/nfs/super.c.org
+++ linux-2.6.21.i686/fs/nfs/super.c
@@ -522,6 +522,8 @@ static inline void nfs_initialise_sb(str
 
sb->s_magic = NFS_SUPER_MAGIC;
 
+   sb->s_flags |= MS_NO_LEASES;
+
/* We probably want something more informative here */
snprintf(sb->s_id, sizeof(sb->s_id),
 "%x:%x", MAJOR(sb->s_dev), MINOR(sb->s_dev));
--- linux-2.6.21.i686/fs/locks.c.org
+++ linux-2.6.21.i686/fs/locks.c
@@ -1493,6 +1493,8 @@ int fcntl_setlease(unsigned int fd, stru
error = security_file_lock(filp, arg);
if (error)
return error;
+   if (IS_NO_LEASES(inode))
+   return -EINVAL;
 
locks_init_lock(&fl);
error = lease_init(filp, arg, &fl);
--- linux-2.6.21.i686/include/linux/fs.h.org
+++ linux-2.6.21.i686/include/linux/fs.h
@@ -121,6 +121,7 @@ extern int dir_notify_enable;
 #define MS_SLAVE   (1<<19) /* change to slave */
 #define MS_SHARED  (1<<20) /* change to shared */
 #define MS_RELATIME    (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_NO_LEASES   (1<<22) /* fs does not support leases */
 #define MS_ACTIVE  (1<<30)
 #define MS_NOUSER  (1<<31)
 
@@ -180,6 +181,7 @@ extern int dir_notify_enable;
 #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
 #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)  ((inode)->i_flags & S_PRIVATE)
+#define IS_NO_LEASES(inode)    __IS_FLG(inode, MS_NO_LEASES)
 
 /* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
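
As a quick illustration of the behaviour this patch introduces, a user-space
check might look like the following (a sketch for clarity, not from the
original posting):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	/* On a file system flagged MS_NO_LEASES this should now fail
	 * immediately with EINVAL rather than by accident. */
	if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
		printf("F_SETLEASE: %s\n", strerror(errno));
	return 0;
}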


Re: [PATCH] allow file system to configure for no leases

2007-06-05 Thread Peter Staubach

Trond Myklebust wrote:

On Tue, 2007-06-05 at 15:10 -0400, Peter Staubach wrote:
  

Hi.

Attached is a small patch to allow file systems to inform the file
system independent layers that they don't support file leases.

The problem is that some file systems such as NFSv2 and NFSv3 do
not have sufficient support to be able to support leases correctly.
In particular for these two file systems, there is no over the wire
protocol support.

Currently, these two file systems fail the fcntl(F_SETLEASE) call
accidentally, due to a reference counting difference.  These file systems
should fail more consciously, with a proper error to indicate that
the call is invalid for them.

Thanx...

   ps
plain text document attachment (devel.tototoday)
--- linux-2.6.21.i686/fs/nfs/super.c.org
+++ linux-2.6.21.i686/fs/nfs/super.c
@@ -522,6 +522,9 @@ static inline void nfs_initialise_sb(str
 
 	sb->s_magic = NFS_SUPER_MAGIC;
 
+	if (server->nfs_client->cl_nfsversion < 4)

+   sb->s_flags |= MS_NO_LEASES;
+



This should be unconditional since we have no support for "lease locks"
under NFSv4 either. The NFSv4 concept of leases and delegations is very
different, since it is really tied to the ability to cache data.


No problem.  I wasn't sure, what with the changes that Bruce Fields
is constructing.

Attached is the simplified version.

   Thanx...

  ps
--- linux-2.6.21.i686/fs/nfs/super.c.org
+++ linux-2.6.21.i686/fs/nfs/super.c
@@ -522,6 +522,8 @@ static inline void nfs_initialise_sb(str
 
sb->s_magic = NFS_SUPER_MAGIC;
 
+   sb->s_flags |= MS_NO_LEASES;
+
/* We probably want something more informative here */
snprintf(sb->s_id, sizeof(sb->s_id),
 "%x:%x", MAJOR(sb->s_dev), MINOR(sb->s_dev));
--- linux-2.6.21.i686/fs/locks.c.org
+++ linux-2.6.21.i686/fs/locks.c
@@ -1493,6 +1493,8 @@ int fcntl_setlease(unsigned int fd, stru
error = security_file_lock(filp, arg);
if (error)
return error;
+   if (IS_NO_LEASES(inode))
+   return -EINVAL;
 
locks_init_lock(&fl);
error = lease_init(filp, arg, &fl);
--- linux-2.6.21.i686/include/linux/fs.h.org
+++ linux-2.6.21.i686/include/linux/fs.h
@@ -121,6 +121,7 @@ extern int dir_notify_enable;
 #define MS_SLAVE   (1<<19) /* change to slave */
 #define MS_SHARED  (1<<20) /* change to shared */
 #define MS_RELATIME    (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_NO_LEASES   (1<<22) /* fs does not support leases */
 #define MS_ACTIVE  (1<<30)
 #define MS_NOUSER  (1<<31)
 
@@ -180,6 +181,7 @@ extern int dir_notify_enable;
 #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
 #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)  ((inode)->i_flags & S_PRIVATE)
+#define IS_NO_LEASES(inode)    __IS_FLG(inode, MS_NO_LEASES)
 
 /* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */


[PATCH] allow file system to configure for no leases

2007-06-05 Thread Peter Staubach

Hi.

Attached is a small patch to allow file systems to inform the file
system independent layers that they don't support file leases.

The problem is that some file systems such as NFSv2 and NFSv3 do
not have sufficient support to be able to support leases correctly.
In particular for these two file systems, there is no over the wire
protocol support.

Currently, these two file systems fail the fcntl(F_SETLEASE) call
accidentally, due to a reference counting difference.  These file systems
should fail more consciously, with a proper error to indicate that
the call is invalid for them.

   Thanx...

  ps
--- linux-2.6.21.i686/fs/nfs/super.c.org
+++ linux-2.6.21.i686/fs/nfs/super.c
@@ -522,6 +522,9 @@ static inline void nfs_initialise_sb(str
 
sb->s_magic = NFS_SUPER_MAGIC;
 
+   if (server->nfs_client->cl_nfsversion < 4)
+   sb->s_flags |= MS_NO_LEASES;
+
/* We probably want something more informative here */
snprintf(sb->s_id, sizeof(sb->s_id),
 "%x:%x", MAJOR(sb->s_dev), MINOR(sb->s_dev));
--- linux-2.6.21.i686/fs/locks.c.org
+++ linux-2.6.21.i686/fs/locks.c
@@ -1493,6 +1493,8 @@ int fcntl_setlease(unsigned int fd, stru
error = security_file_lock(filp, arg);
if (error)
return error;
+   if (IS_NO_LEASES(inode))
+   return -EINVAL;
 
locks_init_lock(&fl);
error = lease_init(filp, arg, &fl);
--- linux-2.6.21.i686/include/linux/fs.h.org
+++ linux-2.6.21.i686/include/linux/fs.h
@@ -121,6 +121,7 @@ extern int dir_notify_enable;
 #define MS_SLAVE   (1<<19) /* change to slave */
 #define MS_SHARED  (1<<20) /* change to shared */
 #define MS_RELATIME    (1<<21) /* Update atime relative to mtime/ctime. */
+#define MS_NO_LEASES   (1<<22) /* fs does not support leases */
 #define MS_ACTIVE  (1<<30)
 #define MS_NOUSER  (1<<31)
 
@@ -180,6 +181,7 @@ extern int dir_notify_enable;
 #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
 #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)  ((inode)->i_flags & S_PRIVATE)
+#define IS_NO_LEASES(inode)    __IS_FLG(inode, MS_NO_LEASES)
 
 /* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */


Re: [PATCH 30/40] nfs: fixup missing error code

2007-05-04 Thread Peter Staubach

Peter Zijlstra wrote:

Commit 0b67130149b006628389ff3e8f46be9957af98aa lost the setting of tk_status
to -EIO when there was no progress with short reads.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/read.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-git/fs/nfs/read.c
===
--- linux-2.6-git.orig/fs/nfs/read.c2007-03-13 14:35:53.0 +0100
+++ linux-2.6-git/fs/nfs/read.c 2007-03-13 14:36:05.0 +0100
@@ -384,8 +384,10 @@ static int nfs_readpage_retry(struct rpc
/* This is a short read! */
nfs_inc_stats(data->inode, NFSIOS_SHORTREAD);
/* Has the server at least made some progress? */
-   if (resp->count == 0)
+   if (resp->count == 0) {
+   task->tk_status = -EIO;
return 0;
+   }
 
 	/* Yes, so retry the read at the end of the data */

argp->offset += resp->count;


This doesn't look right to me.  It is not an error for the NFS server
to return 0 bytes.  It is usually an indication of EOF.  If an error
occurred, then the NFS server would have returned an error.
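
One way to capture that distinction in the retry path would be something like
this (a sketch only, assuming the read reply's EOF indication is available
here as resp->eof; not a proposed patch):

	if (resp->count == 0) {
		if (resp->eof)
			return 0;		/* genuine EOF, not an error */
		task->tk_status = -EIO;		/* no progress and no EOF */
		return 0;
	}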

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Nfs over tcp retries

2007-03-05 Thread Peter Staubach

Andy Chittenden wrote:

Here's a sequence of packets captured at the end of a NFS connection and
the start of the next for a RH Fedora Core 6 client:

# cat ~/tmp/28852a.txt
...

As you can see in packet 3, the nfs server's sent a FIN-ACK which is
acknowledged in packet 6 by the client. So by packet 8, the connection's
closed. The client attempts to reconnect to the server in packet 8 which
is refused by the server in packet 9 as the client is using the same
port number as the previous session: the server's in TIME WAIT from the
previous connection and the initial send sequence number of this new
connection is below the highest sequence number of the previous
connection. The client's attempts to reconnect continue unsuccessfully
until 2MSL is exceeded.

So, a few questions:

* why does the NFS client reuse the same source port number (894 in the
example above)?
* if the socket's being reused, why is the ISS being chosen such that
it's within the same range as the last successful connection?
* why does the ISS seem to go up by only 3 since the last attempt to
connect?

If the linux NFS client had used a different source port number or
chosen an out-of-range ISS, then its reconnection attempts would have
been successful in a more timely manner.


I suspect that the NFS client attempts to reuse the same port number
for the new connection so that it does not invalidate the duplicate
request cache on the server.  NFS servers typically use the client's
entire network address, including the source port number, when checking
whether the current request is a duplicate of a previous one.
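
For context, the server-side duplicate request cache entry that motivates this
behaviour is keyed on roughly the following fields (a simplified sketch, not
the actual svc cache structure):

	struct drc_key {
		__be32			xid;	/* RPC transaction id */
		u32			prog, vers, proc;
		struct sockaddr_in	addr;	/* client IP address and source port */
	};

Changing the source port on reconnect would therefore make retransmitted
requests miss the cache and risk being executed a second time.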

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 01/22] update ctime and mtime for mmaped write

2007-02-28 Thread Peter Staubach

Miklos Szeredi wrote:

What happens if the application overwrites what it had written some
time later?  Nothing.  The page is already read-write, the pte dirty,
so even though the file was clearly modified, there's absolutely no
way in which this can be used to force an update to the timestamp.



Which, I realize now, actually means, that the patch is wrong.  Msync
will have to write protect the page table entries, so that later
dirtyings may have an effect on the timestamp.


I thought that PeterZ's changes were to write-protect the page after
cleaning it so that future modifications could be detected and tracked
accordingly?  Does the right thing not happen already?
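
A small user-space illustration of the case under discussion (a sketch for
clarity, not from the thread): the second store below should still be
reflected in a later mtime update, which is only possible if the pte is
write-protected again after the page has been cleaned.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	char *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	memset(p, 'a', 4096);		/* first dirtying of the page */
	msync(p, 4096, MS_SYNC);	/* cleans the page, updates mtime */

	sleep(1);
	memset(p, 'b', 4096);		/* dirtied again after the clean */
	msync(p, 4096, MS_SYNC);

	fstat(fd, &st);			/* should reflect the second store */
	printf("mtime: %ld\n", (long)st.st_mtime);
	return 0;
}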

   Thanx...

  ps
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

