Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Similar symptoms (uid/gid 4294967294) can also be caused by running up against root's keyring quota in the kernel, which is used to store all the id_resolver/id_legacy keys for each uid and gid.

The default kernel keyring quotas (200 keys!) in jessie's 3.16 kernel work poorly for NFSv4 setups with more than ~100 users. The kernel.keys.root_{maxkeys,maxbytes} sysctls should be bumped higher:

$ sudo sysctl kernel.keys.root_maxbytes=25000000 kernel.keys.root_maxkeys=1000000

This uses the new Linux 3.18 default values:

> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=738c5d190f6540539a04baf36ce21d46b5da04bd

It works much better, with all of the 4294967294 UIDs now showing correctly.

$ sudo head -n1 /proc/key-users
0: 1248 1247/1247 1243/1000000 31475/25000000

With the current kernel.keys.root_maxkeys=200 default in 3.16, the quota is quickly saturated.

Relevant Ubuntu bug: https://bugs.launchpad.net/fedora/+bug/1124250

-- Tero Marttila
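A rough sketch of how the quota figures in /proc/key-users could be checked from a script. It assumes the field layout given in the kernel's key-retention documentation (uid, usage, nkeys/nikeys, qnkeys/maxkeys, qnbytes/maxbytes). The names KeyUser, parse_key_users and quota_fraction are invented for this illustration and do not belong to any existing tool.

```python
from typing import List, NamedTuple


class KeyUser(NamedTuple):
    """One line of /proc/key-users (field names follow the kernel docs)."""
    uid: int
    usage: int
    nkeys: int
    nikeys: int
    qnkeys: int
    maxkeys: int
    qnbytes: int
    maxbytes: int


def parse_key_users(text: str) -> List[KeyUser]:
    """Parse the contents of /proc/key-users into KeyUser records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        uid, usage, keys, key_quota, byte_quota = fields
        nkeys, nikeys = keys.split("/")
        qnkeys, maxkeys = key_quota.split("/")
        qnbytes, maxbytes = byte_quota.split("/")
        entries.append(KeyUser(
            int(uid.rstrip(":")), int(usage),
            int(nkeys), int(nikeys),
            int(qnkeys), int(maxkeys),
            int(qnbytes), int(maxbytes),
        ))
    return entries


def quota_fraction(entry: KeyUser) -> float:
    """Return the larger of the key-count and byte-quota usage ratios."""
    return max(entry.qnkeys / entry.maxkeys, entry.qnbytes / entry.maxbytes)
```

On a client, passing the text of /proc/key-users to parse_key_users and looking at the uid 0 entry should show how close root is to its quota. A fraction near 1.0 would be consistent with the saturation described above.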
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Hi all,

As far as I can see, the nfs4 jessie clients caused serious problems on the serving nfs server (which is a wheezy system), leading to an nfsd kernel crash on the wheezy system.

FYI: the server is an up-to-date wheezy system (kernel 3.2.60-1+deb7u3). The dumps went away as soon as I unmounted the nfs4 clients that are running jessie; those jessie systems are also up-to-date (kernel 3.14.15-2 (2014-08-09)).

This took down our production environment. I fixed it by disabling all jessie nfs v4 clients.

Regards,
Piet

Here is a set of kernel dumps from the nfs server system:

[ 270.413977] BUG: unable to handle kernel NULL pointer dereference at 0010
[ 270.414297] IP: [a0381311] set_nfsv4_acl_one+0x15/0x7e [nfsd]
[ 270.414516] PGD 0
[ 270.414709] Oops: [#3] SMP
[ 270.414989] CPU 1
[ 270.415078] Modules linked in: parport_pc ppdev lp parport binfmt_misc nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_pcm snd_page_alloc snd_timer coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic snd iTCO_wdt iTCO_vendor_support soundcore cryptd joydev evdev pcspkr sb_edac edac_core dcdbas shpchp wmi acpi_power_meter processor button thermal_sys xfs usbhid hid usb_storage sg sd_mod crc_t10dif igb ehci_hcd i2c_algo_bit i2c_core usbcore ixgbe usb_common dca megaraid_sas scsi_mod mdio [last unloaded: scsi_wait_scan]
[ 270.420471]
[ 270.420583] Pid: 3504, comm: nfsd Tainted: G D W 3.2.0-4-amd64 #1 Debian 3.2.60-1+deb7u3 Dell Inc. PowerEdge R720xd/0C4Y3R
[ 270.420989] RIP: 0010:[a0381311] [a0381311] set_nfsv4_acl_one+0x15/0x7e [nfsd]
[ 270.421229] RSP: 0018:880fe48a7d30 EFLAGS: 00010286
[ 270.421347] RAX: 4000 RBX: 880fbac5ecc0 RCX: 0024
[ 270.421471] RDX: a039d6fb RSI: RDI: 880fbac5ecc0
[ 270.421595] RBP: 880fbac5ecc0 R08: R09:
[ 270.421719] R10: 880fbe8f6de0 R11: 880fbe8f6de0 R12: a039d6fb
[ 270.421842] R13: R14: 0440 R15: 880fe4f89180
[ 270.421967] FS: () GS:88102f22() knlGS:
[ 270.422115] CS: 0010 DS: ES: CR0: 8005003b
[ 270.422241] CR2: 0010 CR3: 01605000 CR4: 000406e0
[ 270.422367] DR0: DR1: DR2:
[ 270.422491] DR3: DR6: 0ff0 DR7: 0400
[ 270.422616] Process nfsd (pid: 3504, threadinfo 880fe48a6000, task 880fecb42180)
[ 270.422762] Stack:
[ 270.422872] 0451 880f 880fbac5ecc0 880fe076b1c0
[ 270.423351] 0440 a0381bf4
[ 270.423829] 880f9f9e4600 880fe4f89040 880fe4f8c000
[ 270.424306] Call Trace:
[ 270.424428] [a0381bf4] ? nfsd4_set_nfs4_acl+0xb4/0xe6 [nfsd]
[ 270.424559] [a038b637] ? nfsd4_setattr+0xae/0xe8 [nfsd]
[ 270.424687] [a038a8d6] ? nfsd4_proc_compound+0x251/0x41d [nfsd]
[ 270.424816] [a037e7cd] ? nfsd_dispatch+0xd7/0x1ba [nfsd]
[ 270.424950] [a030ec3f] ? svc_process_common+0x2c3/0x4c4 [sunrpc]
[ 270.425078] [8103f6e2] ? try_to_wake_up+0x197/0x197
[ 270.425207] [a030f050] ? svc_process+0x110/0x12c [sunrpc]
[ 270.425332] [a037e0e3] ? nfsd+0xe3/0x127 [nfsd]
[ 270.425455] [a037e000] ? 0xa037dfff
[ 270.425577] [8105f701] ? kthread+0x76/0x7e
[ 270.425702] [813575b4] ? kernel_thread_helper+0x4/0x10
[ 270.425826] [8105f68b] ? kthread_worker_fn+0x139/0x139
[ 270.425945] [813575b0] ? gs_change+0x13/0x13
[ 270.426063] Code: 48 8b 80 20 01 00 00 c6 87 9e 00 00 00 01 48 89 87 c8 00 00 00 c3 41 56 41 55 49 89 f5 41 54 49 89 d4 55 48 89 fd 53 48 83 ec 10 48 63 46 10 be d0 00 00 00 4c 8d 34 c5 04 00 00 00 4c 89 f7 e8
[ 270.431678] RIP [a0381311] set_nfsv4_acl_one+0x15/0x7e [nfsd]
[ 270.431889] RSP 880fe48a7d30
[ 270.432003] CR2: 0010
[ 270.432121] BUG: unable to handle kernel
[ 270.432142] ---[ end trace 5c2d08be1ab6f7cb ]---
[ 270.432437] NULL pointer dereference at 0010
[ 270.432639] IP: [a0381311] set_nfsv4_acl_one+0x15/0x7e [nfsd]
[ 270.432853] PGD 0
[ 270.432945] Oops: [#4] SMP
[ 270.432948] CPU 0
[ 270.432949] Modules linked in: parport_pc ppdev lp parport binfmt_misc nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_pcm snd_page_alloc snd_timer coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic snd iTCO_wdt iTCO_vendor_support soundcore cryptd joydev evdev pcspkr sb_edac edac_core dcdbas shpchp wmi acpi_power_meter processor button thermal_sys xfs usbhid hid usb_storage sg sd_mod crc_t10dif igb ehci_hcd i2c_algo_bit i2c_core usbcore ixgbe usb_common dca megaraid_sas scsi_mod mdio [last unloaded: scsi_wait_scan]
[
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
On Fri, Sep 05, 2014 at 11:19:35AM +0200, Piet Plomp wrote:
> Hi Iustin,
>
> On 2014-09-04 19:11, Iustin Pop wrote:
> > [...]
> > Just another datapoint: this is different from my case. No new users created, randomly new files get -1 for a while, after which the correct UID is listed.
>
> No, this is not different: all new users get new files, which have never been served by the nfs server before. With me, a while might last forever for some identities.

I still don't understand what "new users" means - as I said, I don't have new users.

> When I create files or dirs, they may be owned by the infamous -2 (4294967294), regardless of _where_ I created them (i.e. through nfs or locally on the filesystem).

Exactly.

> You report that after a while the correct uid and gid are listed. Same for me, but sadly not always, some identities get stuck on 4294967294 forever.
>
> I'm curious if we have any differences in our setups:
>
> - Do you also have a mixture of wheezy and jessie systems? Is your nfs server also on a wheezy system? Are your clients both jessie and wheezy systems?

Only sid (unstable) clients. Server and some clients run custom (upstream) kernels, some clients run the sid kernel.

> - Did you see any changes in the behaviour of the wheezy clients after the jessie clients mounted?

I don't have wheezy clients, so N/A.

> - Do you have inet6 entries in /etc/netconfig enabled on the jessie clients (which is the default)?

Yes.

> - Did you change /etc/idmapd.conf?

Yes. I tried to add static mappings for some users, but it didn't have any positive effect.

> - Did you change or add any files in /etc/request-key.d/ ? (small test: rename the id_resolver file, and suddenly _all_ identities are 4294967294)

No.

> - Is the serving filesystem XFS formatted?

Interestingly, yes. Only XFS.

> - Is NIS involved? Or LDAP? (A small test by copying the passwd, shadow and group entries to the client system: everything is ok).

No. Only 'compat' nsswitch entries.

> - Do you use nsswitch to resolve identities (uid/gid)?

I don't understand - nsswitch is always used. Did you mean what nsswitch configuration I have? If so, it's just 'compat'.

> - Does your client run a name service caching daemon (nscd or unscd)?

No.

> - Did you see nobody/nogroup (65534/65534) identities too?

Yes.

> Just to make sure: this is nfs v4 (v4.0) only. Mounting with nfs version 3 over tcp works fine.

Not using anything but kerberised nfs v4.

regards,
iustin
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Hi Iustin,

On 2014-09-04 19:11, Iustin Pop wrote:
> [...]
> Just another datapoint: this is different from my case. No new users created, randomly new files get -1 for a while, after which the correct UID is listed.

No, this is not different: all new users get new files, which have never been served by the nfs server before. With me, a while might last forever for some identities.

When I create files or dirs, they may be owned by the infamous -2 (4294967294), regardless of _where_ I created them (i.e. through nfs or locally on the filesystem).

You report that after a while the correct uid and gid are listed. Same for me, but sadly not always, some identities get stuck on 4294967294 forever.

I'm curious if we have any differences in our setups:

- Do you also have a mixture of wheezy and jessie systems? Is your nfs server also on a wheezy system? Are your clients both jessie and wheezy systems?
- Did you see any changes in the behaviour of the wheezy clients after the jessie clients mounted?
- Do you have inet6 entries in /etc/netconfig enabled on the jessie clients (which is the default)?
- Did you change /etc/idmapd.conf?
- Did you change or add any files in /etc/request-key.d/ ? (small test: rename the id_resolver file, and suddenly _all_ identities are 4294967294)
- Is the serving filesystem XFS formatted?
- Is NIS involved? Or LDAP? (A small test by copying the passwd, shadow and group entries to the client system: everything is ok).
- Do you use nsswitch to resolve identities (uid/gid)?
- Does your client run a name service caching daemon (nscd or unscd)?
- Did you see nobody/nogroup (65534/65534) identities too?

Just to make sure: this is nfs v4 (v4.0) only. Mounting with nfs version 3 over tcp works fine.

Regards,
Piet
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Dear all,

This bug might be related to recency. I created accounts for our new students last week. Now, a listing of the home directories on the jessie systems shows about half of the _new_ accounts with the infamous 4294967294 as their identity. Since the new accounts were created, no reboots were done and no relevant services were restarted. The identities of older accounts are now all present. As always, id lists the correct identities in all cases.

On the wheezy systems, all identities are shown correctly.

Hope this helps in some way.

Regards,
Piet
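The listing check described above could be approximated with a few lines of Python. This is only a sketch: UNMAPPED_ID, as_signed32, is_unmapped and find_unmapped are names invented here. The only fact relied on is arithmetic: 4294967294 equals 2**32 - 2, which is -2 when read as a signed 32-bit value.

```python
import os
from typing import Iterator, Tuple

# 4294967294 is 2**32 - 2, i.e. -2 stored in an unsigned 32-bit uid/gid.
UNMAPPED_ID = 2**32 - 2


def as_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit id as a signed 32-bit integer."""
    return value - 2**32 if value >= 2**31 else value


def is_unmapped(uid: int, gid: int) -> bool:
    """True if either the owner or the group is the unmapped id."""
    return uid == UNMAPPED_ID or gid == UNMAPPED_ID


def find_unmapped(root: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, uid, gid) for direct children of root with an unmapped id."""
    with os.scandir(root) as entries:
        for entry in entries:
            # lstat semantics: report the entry itself, not a symlink target
            st = entry.stat(follow_symlinks=False)
            if is_unmapped(st.st_uid, st.st_gid):
                yield entry.path, st.st_uid, st.st_gid
```

Running `list(find_unmapped("/homes"))` on a jessie client and on a wheezy client (with /homes standing in for the actual mount point) would give a directly comparable count of affected home directories.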
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
On Thu, Sep 04, 2014 at 12:32:07PM +0200, Piet Plomp wrote:
> Dear all,
>
> This bug might be related to recency. I created accounts for our new students last week. Now, a listing of the home directories on the jessie systems shows about half of the _new_ accounts with the infamous 4294967294 as their identity. Since the new accounts were created, no reboots were done and no relevant services were restarted. The identities of older accounts are now all present. As always, id lists the correct identities in all cases.
>
> On the wheezy systems, all identities are shown correctly.
>
> Hope this helps in some way.

Just another datapoint: this is different from my case. No new users created, randomly new files get -1 for a while, after which the correct UID is listed.

regards,
iustin
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Hi Ben and other readers,

I tried to find out more about this issue. I have a fix that makes things better, but not good.

First, on one of the jessie systems, I created the file /etc/request-key.d/id_legacy.conf containing:

create id_legacy * * /usr/sbin/nfsidmap -t 600 %k %d

This by itself does not make the situation better.

Since I suspect that rpc might have something to do with this, I looked at the ti-rpc library (libtirpc1 pkg) on another jessie system. Ti-rpc comes with the /etc/netconfig file. Since we don't have ipv6 here, I had commented out the inet6 lines. This prevents rpcbind from listening on ipv6 addresses. When I UNcommented the inet6 lines, things got better (identities were resolved in minutes). But when I rebooted this system, things were as bad as before. Note: the commented inet6 lines pose no problem on wheezy systems.

However, when I applied _both_ changes to both systems things got better. Identities are always resolved (some in seconds, some in minutes), but some files still have bad identities (0xFFFFFFFE), notably new ones. Where I created the new files (i.e. on a jessie-mounted partition, a wheezy-mounted partition or directly on the xfs filesystem itself) did not make a difference. This situation has been stable for over a day now.

Btw, both systems run the regular jessie kernel (3.14.15-2), both have amd64 architecture.

Hope this helps in narrowing the search.

Best regards,
Piet
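To make the id_legacy.conf line above easier to read, here is a small sketch that splits such a rule into its columns. It assumes the column order documented in request-key.conf(5): OP, TYPE, DESCRIPTION, CALLOUT-INFO, PROGRAM, then arguments. RequestKeyRule, parse_rule and timeout_seconds are hypothetical helper names, not part of keyutils or nfs-utils.

```python
from typing import List, NamedTuple, Optional


class RequestKeyRule(NamedTuple):
    """Columns of one rule, in the order given by request-key.conf(5)."""
    op: str
    key_type: str
    description: str
    callout_info: str
    program: str
    args: List[str]


def parse_rule(line: str) -> RequestKeyRule:
    """Split one non-comment request-key rule into its columns."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected OP TYPE DESCRIPTION CALLOUT-INFO PROGRAM")
    op, key_type, description, callout_info, program = fields[:5]
    return RequestKeyRule(op, key_type, description, callout_info,
                          program, fields[5:])


def timeout_seconds(rule: RequestKeyRule) -> Optional[int]:
    """Return the value following a -t argument, or None if absent."""
    if "-t" in rule.args:
        idx = rule.args.index("-t")
        if idx + 1 < len(rule.args):
            return int(rule.args[idx + 1])
    return None
```

For the rule quoted above, parse_rule reports the key type id_legacy and the program /usr/sbin/nfsidmap, and timeout_seconds picks out the 600 passed with -t. The trailing %k and %d are request-key substitutions for the key ID and the key description.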
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
On Mon, 2014-08-26 20:47, Ben Hutchings wrote:
> Can you also test with Linux 3.16, which is packaged in experimental?

I did. This does _not_ solve the problem.

Piet
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Hi Ben,

Here are some tests:

A wheezy system:
For a new test I took a standard _wheezy_ system without systemd, 3.2.0-4 kernel (Debian 3.2.60-1+deb7u3). No nfs problem. I upgraded libc6 to jessie's 2.19.9: no nfs problem. Then I installed the linux-image-3.14-2-amd64 (3.14.15-2) kernel (which pulled in initramfs-tools) and rebooted: YES, there is the nfs problem!

A jessie system:
Another system, one of the jessie systems with older kernels installed:
- kernel 3.13.10: nfs problem YES
- kernel 3.14.12: nfs problem YES
- kernel 3.14.15: nfs problem YES
- kernel 3.2.0-4 (3.2.54 from wheezy): nfs problem NO
This system uses systemd.

It looks like a kernel problem; the problem was not introduced in 3.14.11 or 12, as I thought earlier.

Piet

On Sun, 2014-08-24 08:04, Ben Hutchings wrote:
> Control: tag -1 moreinfo
>
> On Fri, 2014-08-22 at 12:06 +0200, Piet Plomp wrote:
> [...]
> > * What exactly did you do (or not do) that was effective (or ineffective)?
> > aptitude update, which pulled in kernel 3.14.12, systemd for the first time, and libc6 2.19.9? Problem has also been reproduced on older libc6's

This libc version cannot be correct. But it is one of the 2.19 series, and the problem still exists in 2.19.9.

> [...]
> So can you test with the newer kernel and sysvinit instead of systemd? Or alternately, the older kernel and systemd?
>
> Ben.
>
> --
> Ben Hutchings
> One of the nice things about standards is that there are so many of them.
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Control: reassign -1 src:linux 3.14.15-2
Control: found -1 3.13.10-1
Control: found -1 3.14.12-1

On Mon, 2014-08-25 at 14:43 +0200, Piet Plomp wrote:
> Hi Ben,
>
> Here are some tests:
>
> A wheezy system:
> For a new test I took a standard _wheezy_ system without systemd, 3.2.0-4 kernel (Debian 3.2.60-1+deb7u3). No nfs problem. I upgraded libc6 to jessie's 2.19.9: no nfs problem. Then I installed the linux-image-3.14-2-amd64 (3.14.15-2) kernel (which pulled in initramfs-tools) and rebooted: YES, there is the nfs problem!
>
> A jessie system:
> Another system, one of the jessie systems with older kernels installed:
> - kernel 3.13.10: nfs problem YES
> - kernel 3.14.12: nfs problem YES
> - kernel 3.14.15: nfs problem YES
> - kernel 3.2.0-4 (3.2.54 from wheezy): nfs problem NO
> This system uses systemd.
>
> It looks like a kernel problem; the problem was not introduced in 3.14.11 or 12, as I thought earlier.
[...]

Thanks for testing. Can you also test with Linux 3.16, which is packaged in experimental?

Ben.

--
Ben Hutchings
All extremists should be taken out and shot.
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
On Mon, Aug 25, 2014 at 11:47:46AM -0700, Ben Hutchings wrote:
> On Mon, 2014-08-25 at 14:43 +0200, Piet Plomp wrote:
> > Hi Ben,
> >
> > Here are some tests:
> >
> > A wheezy system:
> > For a new test I took a standard _wheezy_ system without systemd, 3.2.0-4 kernel (Debian 3.2.60-1+deb7u3). No nfs problem. I upgraded libc6 to jessie's 2.19.9: no nfs problem. Then I installed the linux-image-3.14-2-amd64 (3.14.15-2) kernel (which pulled in initramfs-tools) and rebooted: YES, there is the nfs problem!
> >
> > A jessie system:
> > Another system, one of the jessie systems with older kernels installed:
> > - kernel 3.13.10: nfs problem YES
> > - kernel 3.14.12: nfs problem YES
> > - kernel 3.14.15: nfs problem YES
> > - kernel 3.2.0-4 (3.2.54 from wheezy): nfs problem NO
> > This system uses systemd.
> >
> > It looks like a kernel problem; the problem was not introduced in 3.14.11 or 12, as I thought earlier.
> [...]
>
> Thanks for testing. Can you also test with Linux 3.16, which is packaged in experimental?

Just FYI: I have the same problem, but as I use custom kernels built from upstream I didn't report it yet (I thought it might be my config or such). But I know that this was not a problem with 3.7; it appeared when I switched from 3.7 to 3.12, so it was introduced sometime between 3.8 and 3.12.

regards,
iustin
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Control: tag -1 moreinfo

On Fri, 2014-08-22 at 12:06 +0200, Piet Plomp wrote:
[...]
> * What exactly did you do (or not do) that was effective (or ineffective)?
> aptitude update, which pulled in kernel 3.14.12, systemd for the first time, and libc6 2.19.9? Problem has also been reproduced on older libc6's
[...]

So can you test with the newer kernel and sysvinit instead of systemd? Or alternately, the older kernel and systemd?

Ben.

--
Ben Hutchings
One of the nice things about standards is that there are so many of them.
Bug#758870: nfs-common: nfs v4: uid/gid lookup fails for some of the users
Package: nfs-common
Version: 1:1.2.8-9
Severity: important

Dear Maintainer,

I'm reporting this bug in nfs (v4?), where clients' uids/gids are shown as 4294967294 instead of their regular ownerships. This has been puzzling me for over a month now. The problem appeared after a system upgrade in jessie. Before this update everything worked well. More details are below.

Problem description:

We (a school) serve home directories over nfs from a wheezy system. The clients mount them using nfs v4. Some client systems were upgraded to jessie. On jessie systems, nfs v4 clients worked ok, until one day (around mid-July) systemd, a new libc6 and a new kernel appeared. Since then, identities of the users' home directories are shown as 4294967294. When the main login directory has the right identities, subdirectories and files within it can still have 4294967294 for uid and gid. This behaviour is consistent on all the jessie systems I upgraded. And yes, identities for the users, as shown with id, _are_ available in all cases.

When booting, I see that systemd registers the id_resolver in the kernel, as well as the id_legacy resolver. Since the id_resolver is supposed to call nfsidmap, I raised Verbosity in /etc/idmapd.conf to 65535. This shows that nfsidmap is simply not called for some users, some at random, some rather consistently. About a third are lacking a call. When a user lookup is successful, the corresponding group is looked up as well.

From time to time users, whether logged in or not, lose their identities. This can affect individual files or directories. Later on all identities are present again.

I see no request-key program; it looks like systemd performs this role. I've been debugging a lot, and still have no clue who is actually calling the id_resolver (as defined in /etc/request-key.d/id_resolver.conf). When I remove the file, _all_ users and groups immediately show up as 4294967294.

I have no idea what happens when the nfsidmap key expires after 600 seconds; tracing listings shows that known users stay known, and some of the unknown ones become known over time.

System setup:

The nfs server is an up-to-date wheezy system. The partition with the home directories is an xfs partition.

The export entry looks like:
client(rw,secure,no_subtree_check,sync)

The mount option on the client looks like:
nfs rw,vers=4,proto=tcp,rsize=524288,wsize=524288,nosuid,nodev,noatime

There is no gss setup here. Identities are distributed through nis. Identities on the clients (as shown by id) are always available.

Ideas:

- The problem suggests something is in the way. This might be the rpc.idmapd daemon. Killing it did not resolve the problem.
- Bug #744768 might be related, although I don't see the (null) so often.
- Maybe there is some 32-bit/16-bit uid/gid mixup.
- Tested whether the uids/gids of the affected users follow some kind of pattern. No, these are just in between other users. Same for permissions.
- nfsv4 patches in kernel 3.14.11 (in jessie from 3.14.12)?
- Looks like bug #562821, but we're talking clients instead of server.

Debugging:

I'm willing to help debugging this, but I'm out of options.

* What led up to the situation?
A regular upgrade in jessie around mid-July 2014

* What exactly did you do (or not do) that was effective (or ineffective)?
aptitude update, which pulled in kernel 3.14.12, systemd for the first time, and libc6 2.19.9? Problem has also been reproduced on older libc6's

* What was the outcome of this action?
uids/gids showed up as 4294967294 instead; id lists ids correctly

* What outcome did you expect instead?
user directories with user ids and group ids shown correctly

Thanks,
Piet

-- Package-specific info:

-- rpcinfo --
 program vers proto  port  service
  100000    4   tcp   111  portmapper
  100000    3   tcp   111  portmapper
  100000    2   tcp   111  portmapper
  100000    4   udp   111  portmapper
  100000    3   udp   111  portmapper
  100000    2   udp   111  portmapper
  100007    2   udp 37063  ypbind
  100007    1   udp 37063  ypbind
  100007    2   tcp 48220  ypbind
  100007    1   tcp 48220  ypbind
  100024    1   udp 60043  status
  100024    1   tcp 45346  status

-- /etc/default/nfs-common --
NEED_STATD=
STATDOPTS=
NEED_IDMAPD=
NEED_GSSD=

-- /etc/idmapd.conf --
[General]
Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup
[Translation]
Method = nsswitch

-- /etc/fstab --
# nfshost
nfshost:/homes/homes nfs4 rw,vers=4,proto=tcp,rsize=524288,wsize=524288,nosuid,nodev,noatime 0 0

-- /proc/mounts --

-- System Information:
Debian Release: jessie/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64