Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11332



Some snippets of updated info from Andreas & Brian, who continues
to work on this problem:

(11:53:46) behlendo: adilger: In other news, and I'll put this all in the bug
once I get to the bottom of it.  But our 11040 appears to be server side now,
our 1.4.7.2-pre-6llnl clients talking to 1.4.6.95-17.2llnl servers hit the bug
when accessing a symlink from lustre to NFS.  But matched 1.4.7.2-pre-6llnl
clients and servers work fine.  The older server returns a larger reply to the
newer client, the newer server returns the expected reply size.  So I suppose I
need to look at the server now and see what changed here, and why they're not
100% interoperable.
(12:05:23) adilger: behlendo: is it that the symlink is too long on the server,
or is there a bug because the symlink is to NFS?

(12:34:57) behlendo: adilger: As for our symlink issue, no the symlink is very
dull.  I looked at the inode on the MDS and it's a fast link for a very short
path.  Nothing special about the inode, no EAs, etc.
(12:37:10) behlendo: adilger: But on a single client mounting lscratcha
(1.4.6.95 servers), and lscratchb (1.4.7.2 servers) with a symlink from each FS
to the same NFS file.  It caused the issue every time for the lscratcha symlink
and never for the lscratchb symlink.  Now that we know what to look for we're
also seeing it on non-Peloton-style systems.  In fact I'm about to go reproduce
it in the testbed so I can get clean MDS logs of the failure.

(14:28:30) behlendo: adilger: Got a sec, I've got an ugly ugly hack as a
workaround for our 11332 issue, the symlink thing between mismatched Lustre
versions.  But I want to run it past you just in case I've made some oversight.
Here are the ground rules since we're going into a holiday:
1) We don't want to be putting new code on a server, so the change needs to be
client side.
2) It should be as minimal as possible to avoid introducing new issues while
we're all away
3) It doesn't have to be pretty, we can get a -correct- fix in after the new 
year
So with that in mind I basically adjusted the getattr case in mdc_enqueue() to
add in an arbitrary extra 512 bytes to the repsize[3] to increase the allocated
replen.  This seems to work... and is ugly as sin... but since we can't tweak
the server it seemed reasonable.  Can you think of any bad side effect this
might cause?
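
For readers following along, the workaround described above can be sketched
roughly as follows. This is a standalone illustration, not the actual
mdc_enqueue() change: the helper names, the four-buffer layout, and the 8-byte
rounding are all assumptions made for the example. Only the idea of adding an
arbitrary 512 bytes to repsize[3] comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration; not taken from the Lustre source. */
#define REPLY_BUF_COUNT   4    /* assumed number of reply buffers            */
#define SYMLINK_PAD_BYTES 512  /* the "arbitrary extra 512 bytes" in the chat */

/* Round a length up to an 8-byte boundary (an assumed alignment rule). */
static size_t round_up8(size_t len)
{
    return (len + 7) & ~(size_t)7;
}

/* Total bytes the client would preallocate for a reply of `count` buffers. */
static size_t reply_len(const size_t *sizes, size_t count)
{
    size_t total = 0;

    for (size_t i = 0; i < count; i++)
        total += round_up8(sizes[i]);
    return total;
}

/* Enlarge one reply buffer so that an oversized server reply still fits. */
static void pad_reply_buf(size_t *sizes, size_t count, size_t idx, size_t pad)
{
    assert(idx < count);
    sizes[idx] += pad;
}

/* Return 1 if a reply of `server_len` bytes fits in `allocated` bytes. */
static int reply_fits(size_t allocated, size_t server_len)
{
    return server_len <= allocated;
}
```

In this sketch, calling pad_reply_buf(sizes, REPLY_BUF_COUNT, 3,
SYMLINK_PAD_BYTES) before computing reply_len() mirrors the hack: the client
simply reserves more room than it expects to need, which costs some memory per
request but requires no server-side change.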

(14:47:49) adilger: behlendo: it likely isn't harmful, but yes it's ugly
(14:48:28) adilger: do you know WHY the MDS is trying to reply with a larger
buffer? or conversely why the client thinks it only needs a smaller one?
(14:56:37) behlendo: adilger: Nope, not yet.  Since my time is short before the
holidays I was focusing on a quick hack, a little sanity testing, then put it
out where folks are suffering.  Once I've got that handed off to the admins I'm
planning to look into it in the testbed.  This afternoon I hope.
(14:57:32) behlendo: So, the 1.4.7.2 servers respond with the correctly sized
reply, so I'm going to try and see why the older server code thinks it should be
bigger.
(14:58:06) behlendo: Presumably the 1.4.6.95 clients expect the larger reply
too, but I don't know that for sure.
(15:06:49) adilger: I'd imagine yes, but nothing pops out at me why this would
have changed
(15:22:59) behlendo: Me either...  well I've just built a version with the hack
for zeus so I'll turn my attention to the real reason WHY this changed.  Thanks
for the quick sanity thoughts on that ugly ugly client mod.

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
