On Aug 29, 2012, at 19:12 , Marc Dionne <[email protected]> wrote:

> On Wed, Aug 29, 2012 at 11:21 AM, Stephan Wiesand
> <[email protected]> wrote:
>> Hi All,
>> 
>> this is just a question. I'm not asserting an openafs bug.
>> 
>> Since SL6, we have been using "kABI tracking kmods" for installing 
>> the OpenAFS kernel module on clients. For full information on this 
>> mechanism, see http://people.redhat.com/jcm/el6/dup/docs/dup_book.pdf . In 
>> short, you only have to compile and install the module once, and it will be 
>> used with future kernels as long as it doesn't use parts of the ABI that 
>> changed.
>> 
>> Trying this may have been stupid in the first place. If so, happy bashing :-)
>> 
>> But in practice, it has worked perfectly for a long time. The modules built 
>> against the EL6 GA kernel (2.6.32-71.el6) work fine with every released 
>> kernel up to the latest EL6.2 kernels (2.6.32-220.23.1.el6), on both 32-bit 
>> and 64-bit systems.
>> 
>> But with the EL6.3 update (2.6.32-279.el6), something changed that broke at 
>> least the interface to the 32-bit module. The symptoms are reads getting 
>> stuck at the very beginning, except for very small files. The reads can be 
>> interrupted, but the client can no longer be stopped cleanly.
> 
> Anything interesting in the syslog when this occurs?

No. When shutting down afterwards, it gets as far as "WARM shutting down of: 
vcaches..." before umount and afsd get stuck in D state, but that's all I see.
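
In case someone wants to check for the same symptom, stuck processes can be listed by reading the state field from /proc/<pid>/stat. This is only a rough sketch; the function names are made up for this mail and it assumes a Linux /proc:

```python
import os

def parse_stat(text):
    """Split a /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]

def d_state_processes(proc="/proc"):
    """Return (pid, comm) for every process in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were looking
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

It only reports that a process is blocked, not where in the kernel it is blocked.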

I fstraced a read getting stuck. This is what it looks like:

time 196.758223, pid 1608: Lookup adp 0xeb949300 name 
openafs.SLx-1.6.0-89.pre1.src.rpm fid (178:536891922.210.509), code=0 
time 196.758632, pid 1608: Analyze RPC op 2 conn 0xebea2320 code 0x0 user 
0x413f6096 
time 196.758635, pid 1608: ProcessFS vp 0xe938ed40 old len (0x0, 0x0) new len 
(0x0, 0x11ed320) 
time 196.758637, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320) 
time 196.758991, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320) 
time 196.759225, pid 1608: Access vp 0xe938acc0 mode 0x40 len (0x0, 0x800) 
time 196.759229, pid 1608: Access vp 0xe938a040 mode 0x40 len (0x0, 0x11) 
time 196.759231, pid 1608: GetdCache vp 0xe938a2c0 dcache 0xedc49000 dcache 
low-version 0x463c, vcache low-version 0x463c 
time 196.759231, pid 1608: GetdCache tlen 0x800 flags 0x1 abyte (0x0, 0x0) 
Position (0x0, 0x0) 
time 196.759232, pid 1608: Lookup adp 0xe938a2c0 name packages fid 
(178:536870916.2.17768), code=0 
time 196.759233, pid 1608: Mount point is to vp 0xe938a540 fid 
(178:536870916.2.17768) 
time 196.759235, pid 1608: Access vp 0xe938a7c0 mode 0x40 len (0x0, 0x800) 
time 196.759236, pid 1608: Access vp 0xeb949d00 mode 0x40 len (0x0, 0x2000) 
time 196.759237, pid 1608: Access vp 0xeb949800 mode 0x40 len (0x0, 0x12800) 
time 196.759237, pid 1608: Access vp 0xeb949300 mode 0x40 len (0x0, 0x2800) 
time 196.759238, pid 1608: Access vp 0xe938ed40 mode 0x100 len (0x0, 0x11ed320) 
time 196.759241, pid 1608: Open 0xe938ed40 flags 0x8000 
time 196.759242, pid 1608: Open 0xe938ed40 flags 0xf423f 
time 196.759248, pid 1608: Getattr vp 0xe938ed40 len (0x0, 0x11ed320) 
time 196.759394, pid 1608: Iread ip xe938ed40 pos (0x0, 0x0) count 0x8000 code 
1869f 
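
For anyone reading the trace: as far as I can tell, lengths and offsets are printed as a (high, low) pair of 32-bit words. Assuming that reading is right, a small sketch for turning them into byte counts (the helper name is mine, not part of any OpenAFS tool):

```python
import re

# Matches pairs such as "(0x0, 0x11ed320)" in a trace line.
PAIR = re.compile(r"\(0x([0-9a-fA-F]+), 0x([0-9a-fA-F]+)\)")

def decode_pairs(line):
    """Return every (high, low) pair in a trace line as one 64-bit integer.

    Assumption: the first word is the high half, the second the low half.
    """
    return [(int(hi, 16) << 32) | int(lo, 16) for hi, lo in PAIR.findall(line)]

if __name__ == "__main__":
    print(decode_pairs("Getattr vp 0xe938ed40 len (0x0, 0x11ed320)"))
    # Plain hex fields from the Iread and Open lines, in decimal.
    print(int("8000", 16), int("1869f", 16), int("f423f", 16))
```

This gives 18797344 bytes for the file length and 32768 for the read count. The "code" 1869f and the flags 0xf423f on the second Open line come out as 99999 and 999999, which look more like marker values than a real return code or real open flags. If so, the trace shows the read being entered but never completing.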

NB rxdebug, cmdebug, fs getcache etc. all still work.

>> Using a module built against the 6.3 kernel with pre-6.3 ones has worse 
>> effects. BUGs, panics, spontaneous reboots.
>> 
>> All this was only observed on 32-bit systems, and only if the cache is on 
>> ext4. I have a suspicion that it might be related to a change described 
>> here: http://joejulian.name/blog/glusterfs-bit-by-ext4-structure-change/ . 
>> Quote: << a patch against ext4 to "return 32/64-bit dir name hash according 
>> to usage type". Prior to that, ext2/3/4 would return a 32-bit hash value 
>> from telldir()/seekdir() [. . .] That patch was for kernel v3.3-rc2. To make 
>> things more fun, [. . .] merged in that patch in 2.6.32-268.el6 >>
>> 
>> The direct link to the patch is 
>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=d1f5273e9adb40724a85272f248f210dc4ce919a
>>  .
>> 
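
To see which of the kernels mentioned in this thread would carry that ext4 change, the release strings can be compared numerically. A rough sketch, assuming the quoted blog post is right that the patch went in with 2.6.32-268.el6; the function names are made up for illustration:

```python
def release_key(release):
    """Turn '2.6.32-220.23.1.el6' into ((2, 6, 32), (220, 23, 1))."""
    version, _, rest = release.partition("-")
    # Keep only the leading numeric components of the release part and
    # stop at the first non-numeric one (e.g. 'el6').
    build = []
    for part in rest.split("."):
        if not part.isdigit():
            break
        build.append(int(part))
    return tuple(int(p) for p in version.split(".")), tuple(build)

# Assumption taken from the quoted blog post, not verified independently.
PATCH_FIRST_SEEN = release_key("2.6.32-268.el6")

def has_dirhash_patch(release):
    """True if the release sorts at or after the assumed first patched build."""
    return release_key(release) >= PATCH_FIRST_SEEN

if __name__ == "__main__":
    for r in ("2.6.32-71.el6", "2.6.32-220.23.1.el6", "2.6.32-279.el6"):
        print(r, has_dirhash_patch(r))
```

By this comparison only the -279 kernel includes the change. It is a plain ordering of EL6 release numbers and says nothing about fixes backported into an older update stream.
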
>> Does anyone familiar with the openafs module's inner workings see whether 
>> that patch would have the effects described above, on 32-bit systems only?
>> 
>> Thanks a lot in advance for any insights.
>> 
>>        Stephan
> 
> Offhand I don't see anything in that change that should affect
> openafs.  Within the kernel module each cache file is looked up
> individually with a full path that it receives from afsd - the
> directory scanning is done by afsd in user space using readdir.  The
> lookup returns a dentry which is then converted to a file handle by
> the underlying file system's own conversion function.  The file
> handles for all cache files are stored in memory by the module.  When
> a file is used, the file handle is converted to a dentry with the fs's
> conversion function, and the file is opened with dentry_open.

Thanks for the explanation.

> Any other changes to ext4 in that update?

Yes, quite a few. Alas, the patches are no longer available separately, and 
practically all the BZs are private. But I'll run a diff later.

> Does the module work correctly on this system, with ext4, if it is recompiled?

Yes, if it is built against a 6.3 (-279) kernel. Rebuilding against an old 
kernel with the current toolchain makes no difference.

It gets weirder: I can't reproduce the problem with an ext4 cache filesystem 
created with mkfs.ext4 on the running system. Only with filesystems created by 
the installer (only SL6.2 is confirmed so far). fsck doesn't find anything wrong 
with the fs.

I guess it's an ext4 issue in EL6. But I'd still feel better if I understood 
what's going on.

Thanks a lot for your help
        Stephan

-- 
Stephan Wiesand
DESY - DV -
Platanenallee 6
15732 Zeuthen, Germany

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
