On 2014-10-20 09:02, Zygo Blaxell wrote:
On Mon, Oct 20, 2014 at 04:38:28AM +0000, Duncan wrote:
Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted:

# find . -name "*546"
./1412233213.M638209P10546
# ls -l ./1412233213.M638209P10546
ls: cannot access ./1412233213.M638209P10546: No such file or directory

Does your mail server do a lot of renames?  Is one perhaps stuck?  If so,
that sounds like the same thing "Zygo Blaxell" is reporting in the
"3.16.3..3.17.1 hang in renameat2()" thread, OP on Sun, 19 Oct 2014
15:25:26 -0400, Msg-ID: <20141019192525.ga29...@hungrycats.org>, as linked
here:

<http://permalink.gmane.org/gmane.comp.file-systems.btrfs/39539>

I pointed him at this thread too.  I hadn't seen you mention a hung
rename, but the other symptoms sound similar.

Not really.  It looks like Russell is having an NFS client-side problem;
I'm having a server-side one (maybe).  Also, all Russell's system calls
seem to be returning promptly, while some of mine are not.  Even if
there were timeouts, an NFS server timeout gives a different error than
'No such file or directory'.  Finally, the one and only thing I _can_
do with my bug is 'ls' on the renamed files (for me, the find would get
stuck before returning any output).

For Russell's issue...most of the stuff I can think of has been
tried already.  I didn't see if there was any attempt to ls the
file from the NFS server as well as the client side.  If ls is OK on
the server but not the client, it's an NFS issue (possibly interacting
with some btrfs-specific quirk); otherwise, it's likely a corrupted
filesystem (mail servers seem to be unusually good at making these).

Most of the I/O time on mail servers tends to land in the fsync() system
call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after
3.16, and not in the 3.16.x stable update for x <= 5 (the last one
I've checked)).  That said, I'm not familiar with how fsync() translates
over NFS, so it might not be relevant after all.

If the NFS server's view of the filesystem is OK, check the NFS protocol
version from /proc/mounts on the client.  Sometimes NFS clients will
get some transient network error during connection and fall back to some
earlier (and potentially buggier) NFS version.  I've seen very different
behavior in some important corner cases from v4 and v3 clients, for
example, and if the client is falling all the way back to v2 the bugs
and their workarounds start to get just plain _weird_ (e.g. filenames
which produce specific values from some hash function or that contain
specific character sequences are unusable).  v2 is so old it may even
have issues with 64-bit inode numbers.
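
For the /proc/mounts check, something like the following might save some squinting.  It is only a rough sketch: nfs_versions is a name made up here, and it assumes the client kernel reports the negotiated version as a vers= option in the fourth field of /proc/mounts, which is what current Linux NFS clients appear to do.

```shell
# Sketch only: nfs_versions is a hypothetical helper, not an existing tool.
# It reads a mount table in /proc/mounts format (default: /proc/mounts) and
# prints "<mountpoint> <fstype> vers=<version>" for each NFS mount.
# Assumption: the negotiated version shows up as a "vers=" mount option.
nfs_versions() {
    awk '$3 ~ /^nfs4?$/ {
        vers = "unknown"                  # no vers= option seen yet
        n = split($4, opts, ",")          # field 4 is the option list
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^vers=/) {
                vers = opts[i]
                sub(/^vers=/, "", vers)   # keep only the version number
            }
        }
        print $2, $3, "vers=" vers
    }' "${1:-/proc/mounts}"
}
```

Running nfs_versions with no argument on the client reads the live mount table.  A mount that shows an older version than the one requested in fstab would point at the fallback described above.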

Just now saw this thread, but IIRC 'No such file or directory' also gets returned sometimes when trying to automount a share that can't be enumerated by the client, and also sometimes when there is a stale NFS file handle.
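
When the client does report a stale handle as such, the message from ls differs from the plain "not found" one, so it may be worth telling the two apart explicitly.  A rough sketch, assuming the C-locale error strings used by glibc ("Stale file handle", or "Stale NFS file handle" on older versions); classify_ls_error is a name made up for this example:

```shell
# Sketch only: classify_ls_error is a hypothetical helper.
# It runs ls on one path and reports which kind of failure, if any, it saw.
# Assumption: error text matches the C-locale strerror() strings.
classify_ls_error() {
    # Capture stderr only; discard the normal listing.
    err=$(LC_ALL=C ls -ld -- "$1" 2>&1 >/dev/null) && { echo ok; return 0; }
    case $err in
        *"Stale file handle"*|*"Stale NFS file handle"*) echo stale ;;
        *"No such file or directory"*) echo missing ;;
        *) echo "other: $err" ;;
    esac
}
```

For example, classify_ls_error ./1412233213.M638209P10546 on the client would print "missing" for the case in the original report.  That alone does not rule out a stale handle being reported as a missing file, as noted above.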
