Re: Kernel NFS boot failure
Hi All,

On 08/11/2016 07:25 PM, Grygorii Strashko wrote:
> On 08/03/2016 06:04 PM, Grygorii Strashko wrote:
>> Hi Vladimir,
>>
>> On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
>>> On 03/08/16 12:41, Grygorii Strashko wrote:
>>>> We observe a kernel boot failure while running an NFS boot stress test
>>>> (1000 iterations):
>>>> - Linux version 4.7.0
>>
>> I'd like to draw your attention to the fact that this issue is also
>> reproducible with Kernel 4.7.0!
>> The same can be seen from the log I provided in the first e-mail:
>> [0.00] Linux version 4.7.0
>> (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413
>> (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016
>>
>> I've not run the test with current master as it has not been tagged yet.
>
> Still in progress. rc1 is unstable on my platforms due to other issues :(
>
>>>> - am335x-evm (TI AM335x EVM)
>>>> - failure rate: 10-20 times per test.
>>>> Originally this issue was reproduced using TI Kernel 4.4
>>>> (git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y)
>>>> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
>>>> This issue had not been reproduced with TI Kernel 4.1 before.
The SysRq output shows that the system is stuck in nfs_fs_mount():

[ 207.904632] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xd8)
[ 207.912996] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x84/0xc0)
[ 207.921812] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x64/0x70)
[ 207.930810] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x18c/0x544)
[ 207.939988] [] (__rpc_execute) from [] (rpc_run_task+0x13c/0x158)
[ 207.948166] [] (rpc_run_task) from [] (rpc_call_sync+0x44/0xc4)
[ 207.956163] [] (rpc_call_sync) from [] (rpc_ping+0x48/0x68)
[ 207.963796] [] (rpc_ping) from [] (rpc_create_xprt+0xec/0x164)
[ 207.971702] [] (rpc_create_xprt) from [] (rpc_create+0xf0/0x1a0)
[ 207.979794] [] (rpc_create) from [] (nfs_create_rpc_client+0xd4/0xec)
[ 207.988338] [] (nfs_create_rpc_client) from [] (nfs_init_client+0x20/0x78)
[ 207.997332] [] (nfs_init_client) from [] (nfs_create_server+0xa0/0x3bc)
[ 208.006057] [] (nfs_create_server) from [] (nfs3_create_server+0x8/0x20)
[ 208.014879] [] (nfs3_create_server) from [] (nfs_try_mount+0xc4/0x1f0)
[ 208.023513] [] (nfs_try_mount) from [] (nfs_fs_mount+0x290/0x910)
[ 208.031702] [] (nfs_fs_mount) from [] (mount_fs+0x44/0x168)

I was able to reproduce this issue with NFS/SUNRPC debug logs enabled using TI kernel 4.4. I'd appreciate any help :( and would be happy to run any kind of tests or check debug patches if needed.

From the logs I found that the problem happens when rpc_task 2 is freed too late, as in the log below (marked as [GS]).

Starting kernel ...
[0.00] Booting Linux on physical CPU 0x0
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.18-32575-g5db1dce (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 (prerelease) (Linaro GCC 4.9-2015.05) ) #2 SMP PREEMPT Wed Aug 17 10:49:14 CDT 2016
[0.00] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=30c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[0.00] Machine model: TI AM5728 EVM
[0.00] Reserved memory: created CMA memory pool at 0x9580, size 56 MiB
[0.00] Reserved memory: initialized node ipu2_cma@9580, compatible id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9900, size 64 MiB
[0.00] Reserved memory: initialized node dsp1_cma@9900, compatible id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9d00, size 32 MiB
[0.00] Reserved memory: initialized node ipu1_cma@9d00, compatible id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9f00, size 8 MiB
[0.00] Reserved memory: initialized node dsp2_cma@9f00, compatible id shared-dma-pool
[0.00] cma: Reserved 24 MiB at 0xfe40
[0.00] Forcing write-allocate cache policy for SMP
[0.00] Memory policy: Data cache writealloc
[0.00] OMAP4: Map 0xffd0 to fe60 for dram barrier
[0.00] DRA752 ES1.1
[0.00] PERCPU: Embedded 11 pages/cpu @eed31000 s14848 r8192 d22016 u45056
[0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 521792
[0.00] Kernel command line: console=ttyO2,115200n8 earlyprintk=serial,ttyO2,115200n8 rootwait ip=:eth0:dhcp root=/dev/nfs rw
Re: Kernel NFS boot failure
On 08/03/2016 06:04 PM, Grygorii Strashko wrote:
> Hi Vladimir,
>
> On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
>> On 03/08/16 12:41, Grygorii Strashko wrote:
>>> We observe a kernel boot failure while running an NFS boot stress test
>>> (1000 iterations):
>>> - Linux version 4.7.0
>
> I'd like to draw your attention to the fact that this issue is also
> reproducible with Kernel 4.7.0!
> The same can be seen from the log I provided in the first e-mail:
> [0.00] Linux version 4.7.0
> (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413
> (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016
>
> I've not run the test with current master as it has not been tagged yet.

Still in progress. rc1 is unstable on my platforms due to other issues :(

>>> - am335x-evm (TI AM335x EVM)
>>> - failure rate: 10-20 times per test.
>>> Originally this issue was reproduced using TI Kernel 4.4
>>> (git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y)
>>> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
>>> This issue had not been reproduced with TI Kernel 4.1 before.
>>> The SysRq output shows that the system is stuck in nfs_fs_mount():
>>>
>>> [ 207.904632] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xd8)
>>> [ 207.912996] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x84/0xc0)
>>> [ 207.921812] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x64/0x70)
>>> [ 207.930810] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x18c/0x544)
>>> [ 207.939988] [] (__rpc_execute) from [] (rpc_run_task+0x13c/0x158)
>>> [ 207.948166] [] (rpc_run_task) from [] (rpc_call_sync+0x44/0xc4)
>>> [ 207.956163] [] (rpc_call_sync) from [] (rpc_ping+0x48/0x68)
>>> [ 207.963796] [] (rpc_ping) from [] (rpc_create_xprt+0xec/0x164)
>>> [ 207.971702] [] (rpc_create_xprt) from [] (rpc_create+0xf0/0x1a0)
>>> [ 207.979794] [] (rpc_create) from [] (nfs_create_rpc_client+0xd4/0xec)
>>> [ 207.988338] [] (nfs_create_rpc_client) from [] (nfs_init_client+0x20/0x78)
>>> [ 207.997332] [] (nfs_init_client) from [] (nfs_create_server+0xa0/0x3bc)
>>> [ 208.006057] [] (nfs_create_server) from [] (nfs3_create_server+0x8/0x20)
>>> [ 208.014879] [] (nfs3_create_server) from [] (nfs_try_mount+0xc4/0x1f0)
>>> [ 208.023513] [] (nfs_try_mount) from [] (nfs_fs_mount+0x290/0x910)
>>> [ 208.031702] [] (nfs_fs_mount) from [] (mount_fs+0x44/0x168)
>>>
>>> Has anyone else seen this issue?
>>>
>>> I'd appreciate any help or advice related to this issue.
>>
>> I did not look at details, but because it is 4.4 and __wait_on_bit
>> showed up you might want to look at [1]
>>
>> [1] https://lkml.org/lkml/2015/11/20/472
>
> Thanks. I'll take a look.
I've checked this thread and all three commits mentioned there are present in K4.4:
>=3.17 commit 743162013d40 "sched: Remove proliferation of wait_on_bit() action functions"
>=4.4  commit 68985633bccb "sched/wait: Fix signal handling in bit wait helpers"
>=4.4  commit dfd01f026058 "sched/wait: Fix the signal handling fix"

Also, it seems the first patch probably has a copy-paste error. I'm not sure, and it may be that the patch is correct :) Anyway, it doesn't help with this issue if I use wait_on_bit_lock_io() in nfs_page_group_lock().

743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") does:

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index b6ee3a6..6104d35 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -138,12 +138,6 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
 	return __nfs_iocounter_wait(c);
 }
 
-static int nfs_wait_bit_uninterruptible(void *word)
-{
-	io_schedule();
-	return 0;
-}
-
 /*
  * nfs_page_group_lock - lock the head of the page group
  * @req - request in group that is to be locked
@@ -158,7 +152,6 @@ nfs_page_group_lock(struct nfs_page *req)
 	WARN_ON_ONCE(head != head->wb_head);
 
 	wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
-			nfs_wait_bit_uninterruptible,
 			TASK_UNINTERRUPTIBLE);	[GS] But it seems this should be wait_on_bit_lock_io()
 }
 
@@ -425,9 +418,8 @@ void nfs_release_request(struct nfs_page *req)
 int
 nfs_wait_on_request(struct nfs_page *req)
 {
-	return wait_on_bit(&req->wb_flags, PG_BUSY,
-			nfs_wait_bit_uninterruptible,
-			TASK_UNINTERRUPTIBLE);
+	return wait_on_bit_io(&req->wb_flags, PG_BUSY,
+			TASK_UNINTERRUPTIBLE);
 }

--
regards,
-grygorii
Re: Kernel NFS boot failure
Hi Vladimir,

On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
> On 03/08/16 12:41, Grygorii Strashko wrote:
>> We observe a kernel boot failure while running an NFS boot stress test
>> (1000 iterations):
>> - Linux version 4.7.0

I'd like to draw your attention to the fact that this issue is also reproducible with Kernel 4.7.0!
The same can be seen from the log I provided in the first e-mail:
[0.00] Linux version 4.7.0 (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016

I've not run the test with current master as it has not been tagged yet.

>> - am335x-evm (TI AM335x EVM)
>> - failure rate: 10-20 times per test.
>> Originally this issue was reproduced using TI Kernel 4.4
>> (git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y)
>> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
>> This issue had not been reproduced with TI Kernel 4.1 before.
>>
>> The SysRq output shows that the system is stuck in nfs_fs_mount():
>>
>> [ 207.904632] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xd8)
>> [ 207.912996] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x84/0xc0)
>> [ 207.921812] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x64/0x70)
>> [ 207.930810] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x18c/0x544)
>> [ 207.939988] [] (__rpc_execute) from [] (rpc_run_task+0x13c/0x158)
>> [ 207.948166] [] (rpc_run_task) from [] (rpc_call_sync+0x44/0xc4)
>> [ 207.956163] [] (rpc_call_sync) from [] (rpc_ping+0x48/0x68)
>> [ 207.963796] [] (rpc_ping) from [] (rpc_create_xprt+0xec/0x164)
>> [ 207.971702] [] (rpc_create_xprt) from [] (rpc_create+0xf0/0x1a0)
>> [ 207.979794] [] (rpc_create) from [] (nfs_create_rpc_client+0xd4/0xec)
>> [ 207.988338] [] (nfs_create_rpc_client) from [] (nfs_init_client+0x20/0x78)
>> [ 207.997332] [] (nfs_init_client) from [] (nfs_create_server+0xa0/0x3bc)
>> [ 208.006057] [] (nfs_create_server) from [] (nfs3_create_server+0x8/0x20)
>> [ 208.014879] [] (nfs3_create_server) from [] (nfs_try_mount+0xc4/0x1f0)
>> [ 208.023513] [] (nfs_try_mount) from [] (nfs_fs_mount+0x290/0x910)
>> [ 208.031702] [] (nfs_fs_mount) from [] (mount_fs+0x44/0x168)
>>
>> Has anyone else seen this issue?
>>
>> I'd appreciate any help or advice related to this issue.
>
> I did not look at details, but because it is 4.4 and __wait_on_bit
> showed up you might want to look at [1]
>
> [1] https://lkml.org/lkml/2015/11/20/472

Thanks. I'll take a look.

--
regards,
-grygorii
Re: Kernel NFS boot failure
Hi,

On 03/08/16 12:41, Grygorii Strashko wrote:
> Hi All,
>
> We observe a kernel boot failure while running an NFS boot stress test
> (1000 iterations):
> - Linux version 4.7.0
> - am335x-evm (TI AM335x EVM)
> - failure rate: 10-20 times per test.
> Originally this issue was reproduced using TI Kernel 4.4
> (git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y)
> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
> This issue had not been reproduced with TI Kernel 4.1 before.
>
> The SysRq output shows that the system is stuck in nfs_fs_mount():
>
> [ 207.904632] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xd8)
> [ 207.912996] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x84/0xc0)
> [ 207.921812] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x64/0x70)
> [ 207.930810] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x18c/0x544)
> [ 207.939988] [] (__rpc_execute) from [] (rpc_run_task+0x13c/0x158)
> [ 207.948166] [] (rpc_run_task) from [] (rpc_call_sync+0x44/0xc4)
> [ 207.956163] [] (rpc_call_sync) from [] (rpc_ping+0x48/0x68)
> [ 207.963796] [] (rpc_ping) from [] (rpc_create_xprt+0xec/0x164)
> [ 207.971702] [] (rpc_create_xprt) from [] (rpc_create+0xf0/0x1a0)
> [ 207.979794] [] (rpc_create) from [] (nfs_create_rpc_client+0xd4/0xec)
> [ 207.988338] [] (nfs_create_rpc_client) from [] (nfs_init_client+0x20/0x78)
> [ 207.997332] [] (nfs_init_client) from [] (nfs_create_server+0xa0/0x3bc)
> [ 208.006057] [] (nfs_create_server) from [] (nfs3_create_server+0x8/0x20)
> [ 208.014879] [] (nfs3_create_server) from [] (nfs_try_mount+0xc4/0x1f0)
> [ 208.023513] [] (nfs_try_mount) from [] (nfs_fs_mount+0x290/0x910)
> [ 208.031702] [] (nfs_fs_mount) from [] (mount_fs+0x44/0x168)
>
> Has anyone else seen this issue?
>
> I'd appreciate any help or advice related to this issue.
I did not look at details, but because it is 4.4 and __wait_on_bit showed up you might want to look at [1]

[1] https://lkml.org/lkml/2015/11/20/472

Just my 2p.

Vladimir

> Thanks in advance.
>
> regards,
> -grygorii
Kernel NFS boot failure
Hi All,

We observe a kernel boot failure while running an NFS boot stress test (1000 iterations):
- Linux version 4.7.0
- am335x-evm (TI AM335x EVM)
- failure rate: 10-20 times per test.

Originally this issue was reproduced using TI Kernel 4.4 (git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y) on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms. This issue had not been reproduced with TI Kernel 4.1 before.

The SysRq output shows that the system is stuck in nfs_fs_mount():

[ 207.904632] [] (schedule) from [] (rpc_wait_bit_killable+0x2c/0xd8)
[ 207.912996] [] (rpc_wait_bit_killable) from [] (__wait_on_bit+0x84/0xc0)
[ 207.921812] [] (__wait_on_bit) from [] (out_of_line_wait_on_bit+0x64/0x70)
[ 207.930810] [] (out_of_line_wait_on_bit) from [] (__rpc_execute+0x18c/0x544)
[ 207.939988] [] (__rpc_execute) from [] (rpc_run_task+0x13c/0x158)
[ 207.948166] [] (rpc_run_task) from [] (rpc_call_sync+0x44/0xc4)
[ 207.956163] [] (rpc_call_sync) from [] (rpc_ping+0x48/0x68)
[ 207.963796] [] (rpc_ping) from [] (rpc_create_xprt+0xec/0x164)
[ 207.971702] [] (rpc_create_xprt) from [] (rpc_create+0xf0/0x1a0)
[ 207.979794] [] (rpc_create) from [] (nfs_create_rpc_client+0xd4/0xec)
[ 207.988338] [] (nfs_create_rpc_client) from [] (nfs_init_client+0x20/0x78)
[ 207.997332] [] (nfs_init_client) from [] (nfs_create_server+0xa0/0x3bc)
[ 208.006057] [] (nfs_create_server) from [] (nfs3_create_server+0x8/0x20)
[ 208.014879] [] (nfs3_create_server) from [] (nfs_try_mount+0xc4/0x1f0)
[ 208.023513] [] (nfs_try_mount) from [] (nfs_fs_mount+0x290/0x910)
[ 208.031702] [] (nfs_fs_mount) from [] (mount_fs+0x44/0x168)

Has anyone else seen this issue?

I'd appreciate any help or advice related to this issue.

Thanks in advance.

regards,
-grygorii

=== kernel log ===

Starting kernel ...
[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.7.0 (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016
[0.00] CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7), cr=10c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[0.00] Machine model: TI AM335x EVM
[0.00] cma: Reserved 16 MiB at 0xbe80
[0.00] Memory policy: Data cache writeback
[0.00] CPU: All CPU(s) started in SVC mode.
[0.00] AM335X ES2.1 (sgx neon )
[0.00] percpu: Embedded 15 pages/cpu @ef6b9000 s29440 r8192 d23808 u61440
[0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 259904
[0.00] Kernel command line: console=ttyO0,115200n8 earlyprintk=serial,ttyO0,115200n8 rootwait ip=:eth0:dhcp root=/dev/nfs rw nfsroot=192.168.0.1:/home/tigtfarm04/NFS_exports/linux/am335x-evm/autofs/bdb57ec4c06590e6172879867ab30d1e,nolock,v3,tcp,rsize=4096,wsize=4096
[0.00] PID hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Memory: 999648K/1046528K available (7354K kernel code, 768K rwdata, 2360K rodata, 1024K init, 8122K bss, 30496K reserved, 16384K cma-reserved, 243712K highmem)
[0.00] Virtual kernel memory layout:
[0.00]     vector  : 0x - 0x1000   (   4 kB)
[0.00]     fixmap  : 0xffc0 - 0xfff0   (3072 kB)
[0.00]     vmalloc : 0xf080 - 0xff80   ( 240 MB)
[0.00]     lowmem  : 0xc000 - 0xf000   ( 768 MB)
[0.00]     pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00]     modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]       .text : 0xc0008000 - 0xc0a7c8e8   (10707 kB)
[0.00]       .init : 0xc0b0 - 0xc0c0   (1024 kB)
[0.00]       .data : 0xc0c0 - 0xc0cc0304   ( 769 kB)
[0.00]        .bss : 0xc0cc2000 - 0xc14b0a40   (8123 kB)
[0.00] Running RCU self tests
[0.00] Hierarchical RCU implementation.
[0.00] RCU lockdep checking is enabled.
[0.00] Build-time adjustment of leaf fanout to 32.
[0.00] RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=32, nr_cpu_ids=1
[0.00] NR_IRQS:16 nr_irqs:16 16
[0.00] IRQ: Found an INTC at 0xfa20 (revision 5.0) with 128 interrupts
[0.00] OMAP clockevent source: timer2 at 2400 Hz
[0.17] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
[0.39] clocksource: timer1: mask: 0x max_cycles: 0x, max_idle_ns: 79635851949 ns
[0.90] OMAP clocksource: timer1 at 2400 Hz
[0.000600] clocksource_probe: no matching clocksources found
[0.001543] Console: colour dummy device 80x30
[0.001605] Lock