Re: Kernel NFS boot failure

2016-08-19 Thread Grygorii Strashko
Hi All,
On 08/11/2016 07:25 PM, Grygorii Strashko wrote:
> On 08/03/2016 06:04 PM, Grygorii Strashko wrote:
>> Hi Vladimir,
>>
>> On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
>>> On 03/08/16 12:41, Grygorii Strashko wrote:
 We observe a kernel boot failure while running an NFS boot stress test
 (1000 iterations):
 - Linux version 4.7.0 
>>
>> I'd like to draw your attention to the fact that this issue is also
>> reproducible with Kernel 4.7.0!
>> The same can be seen from the log I provided in the first e-mail:
>> [0.00] Linux version 4.7.0 
>> (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 
>> (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016
>>
>>
>> I've not run the test with current master as it's not been tagged yet.
> 
> Still in progress. rc1 is unstable on my platforms due to other issues :(
> 
>>
 - am335x-evm (TI AM335x EVM)
 - failure rate: 10-20 failures per 1000-iteration run.
 Originally this issue was reproduced using TI Kernel 4.4
 ( git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: 
 ti-linux-4.4.y)
 on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
 This issue has not been reproduced with TI Kernel 4.1 before.

 The SysRq output shows that the system is stuck in nfs_fs_mount():

 [  207.904632] [] (schedule) from [] 
 (rpc_wait_bit_killable+0x2c/0xd8)
 [  207.912996] [] (rpc_wait_bit_killable) from [] 
 (__wait_on_bit+0x84/0xc0)
 [  207.921812] [] (__wait_on_bit) from [] 
 (out_of_line_wait_on_bit+0x64/0x70)
 [  207.930810] [] (out_of_line_wait_on_bit) from [] 
 (__rpc_execute+0x18c/0x544)
 [  207.939988] [] (__rpc_execute) from [] 
 (rpc_run_task+0x13c/0x158)
 [  207.948166] [] (rpc_run_task) from [] 
 (rpc_call_sync+0x44/0xc4)
 [  207.956163] [] (rpc_call_sync) from [] 
 (rpc_ping+0x48/0x68)
 [  207.963796] [] (rpc_ping) from [] 
 (rpc_create_xprt+0xec/0x164)
 [  207.971702] [] (rpc_create_xprt) from [] 
 (rpc_create+0xf0/0x1a0)
 [  207.979794] [] (rpc_create) from [] 
 (nfs_create_rpc_client+0xd4/0xec)
 [  207.988338] [] (nfs_create_rpc_client) from [] 
 (nfs_init_client+0x20/0x78)
 [  207.997332] [] (nfs_init_client) from [] 
 (nfs_create_server+0xa0/0x3bc)
 [  208.006057] [] (nfs_create_server) from [] 
 (nfs3_create_server+0x8/0x20)
 [  208.014879] [] (nfs3_create_server) from [] 
 (nfs_try_mount+0xc4/0x1f0)
 [  208.023513] [] (nfs_try_mount) from [] 
 (nfs_fs_mount+0x290/0x910)
 [  208.031702] [] (nfs_fs_mount) from [] 
 (mount_fs+0x44/0x168)


I was able to reproduce this issue with NFS/SUNRPC debug logs enabled using TI
kernel 4.4.
I'd appreciate any help :( and would be happy to run any kind of tests or
check debug patches if needed.

From the logs I found that the problem happens when rpc_task 2 is freed too
late, as in the log below (marked as [GS]).
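
For reference, the wait that the backtrace points at is the sync-task sleep in
__rpc_execute(). A rough sketch from my reading of net/sunrpc/sched.c in our
4.4 tree (not a verbatim quote), showing where the mount sits while rpc_task 2
is still around:

	/* __rpc_execute(): a sync task sleeps here until it is made runnable again */
	status = out_of_line_wait_on_bit(&task->tk_runstate,
			RPC_TASK_QUEUED, rpc_wait_bit_killable,
			TASK_KILLABLE);
	/*
	 * If RPC_TASK_QUEUED is never cleared and the waiter is never woken,
	 * the mount blocks here forever, which matches the trace above.
	 */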


Starting kernel ...

[0.00] Booting Linux on physical CPU 0x0
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.18-32575-g5db1dce 
(lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 
(prerelease) (Linaro GCC 4.9-2015.05) ) #2 SMP PREEMPT Wed Aug 17 10:49:14 CDT 
2016
[0.00] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=30c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[0.00] Machine model: TI AM5728 EVM
[0.00] Reserved memory: created CMA memory pool at 0x9580, 
size 56 MiB
[0.00] Reserved memory: initialized node ipu2_cma@9580, compatible 
id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9900, 
size 64 MiB
[0.00] Reserved memory: initialized node dsp1_cma@9900, compatible 
id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9d00, 
size 32 MiB
[0.00] Reserved memory: initialized node ipu1_cma@9d00, compatible 
id shared-dma-pool
[0.00] Reserved memory: created CMA memory pool at 0x9f00, 
size 8 MiB
[0.00] Reserved memory: initialized node dsp2_cma@9f00, compatible 
id shared-dma-pool
[0.00] cma: Reserved 24 MiB at 0xfe40
[0.00] Forcing write-allocate cache policy for SMP
[0.00] Memory policy: Data cache writealloc
[0.00] OMAP4: Map 0xffd0 to fe60 for dram barrier
[0.00] DRA752 ES1.1
[0.00] PERCPU: Embedded 11 pages/cpu @eed31000 s14848 r8192 d22016 
u45056
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 521792
[0.00] Kernel command line: console=ttyO2,115200n8  
earlyprintk=serial,ttyO2,115200n8 rootwait  ip=:eth0:dhcp  root=/dev/nfs rw 

Re: Kernel NFS boot failure

2016-08-11 Thread Grygorii Strashko
On 08/03/2016 06:04 PM, Grygorii Strashko wrote:
> Hi Vladimir,
> 
> On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
>> On 03/08/16 12:41, Grygorii Strashko wrote:
>>> We observe a kernel boot failure while running an NFS boot stress test
>>> (1000 iterations):
>>> - Linux version 4.7.0 
> 
> I'd like to draw your attention to the fact that this issue is also
> reproducible with Kernel 4.7.0!
> The same can be seen from the log I provided in the first e-mail:
> [0.00] Linux version 4.7.0 
> (lcpd...@dflsdit-build06.dal.design.ti.com) (gcc version 4.9.3 20150413 
> (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri Jul 29 17:41:27 CDT 2016
> 
> 
> I've not run the test with current master as it's not been tagged yet.

Still in progress. rc1 is unstable on my platforms due to other issues :(

> 
>>> - am335x-evm (TI AM335x EVM)
>>> - failure rate: 10-20 failures per 1000-iteration run.
>>> Originally this issue was reproduced using TI Kernel 4.4
>>> ( git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: 
>>> ti-linux-4.4.y)
>>> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
>>> This issue has not been reproduced with TI Kernel 4.1 before.
>>>
>>> The SysRq output shows that the system is stuck in nfs_fs_mount():
>>>
>>> [  207.904632] [] (schedule) from [] 
>>> (rpc_wait_bit_killable+0x2c/0xd8)
>>> [  207.912996] [] (rpc_wait_bit_killable) from [] 
>>> (__wait_on_bit+0x84/0xc0)
>>> [  207.921812] [] (__wait_on_bit) from [] 
>>> (out_of_line_wait_on_bit+0x64/0x70)
>>> [  207.930810] [] (out_of_line_wait_on_bit) from [] 
>>> (__rpc_execute+0x18c/0x544)
>>> [  207.939988] [] (__rpc_execute) from [] 
>>> (rpc_run_task+0x13c/0x158)
>>> [  207.948166] [] (rpc_run_task) from [] 
>>> (rpc_call_sync+0x44/0xc4)
>>> [  207.956163] [] (rpc_call_sync) from [] 
>>> (rpc_ping+0x48/0x68)
>>> [  207.963796] [] (rpc_ping) from [] 
>>> (rpc_create_xprt+0xec/0x164)
>>> [  207.971702] [] (rpc_create_xprt) from [] 
>>> (rpc_create+0xf0/0x1a0)
>>> [  207.979794] [] (rpc_create) from [] 
>>> (nfs_create_rpc_client+0xd4/0xec)
>>> [  207.988338] [] (nfs_create_rpc_client) from [] 
>>> (nfs_init_client+0x20/0x78)
>>> [  207.997332] [] (nfs_init_client) from [] 
>>> (nfs_create_server+0xa0/0x3bc)
>>> [  208.006057] [] (nfs_create_server) from [] 
>>> (nfs3_create_server+0x8/0x20)
>>> [  208.014879] [] (nfs3_create_server) from [] 
>>> (nfs_try_mount+0xc4/0x1f0)
>>> [  208.023513] [] (nfs_try_mount) from [] 
>>> (nfs_fs_mount+0x290/0x910)
>>> [  208.031702] [] (nfs_fs_mount) from [] 
>>> (mount_fs+0x44/0x168)
>>>
>>> Has anyone else seen this issue?
>>>
>>> I'd appreciate any help or advice related to this issue.
>>
>> I did not look at the details, but since it is 4.4 and __wait_on_bit
>> showed up, you might want to look at [1]
>>
>> [1] https://lkml.org/lkml/2015/11/20/472
> 
> Thanks. I'll take a look.
> 

I've checked this thread and all three commits mentioned there are present in
kernel 4.4:

>= 3.17: commit 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
>= 4.4:  commit 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
>= 4.4:  commit dfd01f026058 ("sched/wait: Fix the signal handling fix")
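
For context, the interface change these commits revolve around is, as I
understand it, roughly the following (my own illustration, not quoted from the
patches): the explicit action callback is dropped and the blocking behaviour is
encoded in the helper variant instead.

/* Before 743162013d40 callers passed an action callback plus the task state: */
static int nfs_wait_bit_uninterruptible(void *word)
{
	io_schedule();
	return 0;
}

static int wait_old_style(struct nfs_page *req)
{
	return wait_on_bit(&req->wb_flags, PG_BUSY,
			   nfs_wait_bit_uninterruptible,
			   TASK_UNINTERRUPTIBLE);
}

/* After it, the _io variant implies the io_schedule()-based waiting: */
static int wait_new_style(struct nfs_page *req)
{
	return wait_on_bit_io(&req->wb_flags, PG_BUSY,
			      TASK_UNINTERRUPTIBLE);
}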


Also, it seems the first patch probably has a copy-paste error.
I'm not sure, and it may be that the patch is correct :)
Anyway, it doesn't help with this issue if I use wait_on_bit_lock_io() in
nfs_page_group_lock().

743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
-- does:

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index b6ee3a6..6104d35 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -138,12 +138,6 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
return __nfs_iocounter_wait(c);
 }
 
-static int nfs_wait_bit_uninterruptible(void *word)
-{
-   io_schedule();
-   return 0;
-}
-
 /*
  * nfs_page_group_lock - lock the head of the page group
  * @req - request in group that is to be locked
@@ -158,7 +152,6 @@ nfs_page_group_lock(struct nfs_page *req)
WARN_ON_ONCE(head != head->wb_head);
 
wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
-   nfs_wait_bit_uninterruptible,

[GS] But it seems it should be wait_on_bit_lock_io() <

TASK_UNINTERRUPTIBLE);
 }
 
@@ -425,9 +418,8 @@ void nfs_release_request(struct nfs_page *req)
 int
 nfs_wait_on_request(struct nfs_page *req)
 {
-   return wait_on_bit(&req->wb_flags, PG_BUSY,
-   nfs_wait_bit_uninterruptible,
-   TASK_UNINTERRUPTIBLE);
+   return wait_on_bit_io(&req->wb_flags, PG_BUSY,
+ TASK_UNINTERRUPTIBLE);
 }
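
To be explicit about what I meant with the [GS] note: if the _io variant had
been intended there, I would have expected the result to look roughly like this
(hypothetical code, not what the commit actually does, and as noted above it
does not fix my hang either):

	/* hypothetical nfs_page_group_lock() wait, matching the [GS] note above */
	wait_on_bit_lock_io(&head->wb_flags, PG_HEADLOCK,
			    TASK_UNINTERRUPTIBLE);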




-- 
regards,
-grygorii


Re: Kernel NFS boot failure

2016-08-03 Thread Grygorii Strashko
Hi Vladimir,

On 08/03/2016 03:06 PM, Vladimir Murzin wrote:
> On 03/08/16 12:41, Grygorii Strashko wrote:
>> We observe a kernel boot failure while running an NFS boot stress test
>> (1000 iterations):
>> - Linux version 4.7.0 

I'd like to draw your attention to the fact that this issue is also
reproducible with Kernel 4.7.0!
The same can be seen from the log I provided in the first e-mail:
[0.00] Linux version 4.7.0 (lcpd...@dflsdit-build06.dal.design.ti.com) 
(gcc version 4.9.3 20150413 (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri 
Jul 29 17:41:27 CDT 2016


I've not run the test with current master as it's not been tagged yet.

>> - am335x-evm (TI AM335x EVM)
>> - failure rate: 10-20 failures per 1000-iteration run.
>> Originally this issue was reproduced using TI Kernel 4.4
>> ( git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: 
>> ti-linux-4.4.y)
>> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
>> This issue has not been reproduced with TI Kernel 4.1 before.
>>
>> The SysRq output shows that the system is stuck in nfs_fs_mount():
>>
>> [  207.904632] [] (schedule) from [] 
>> (rpc_wait_bit_killable+0x2c/0xd8)
>> [  207.912996] [] (rpc_wait_bit_killable) from [] 
>> (__wait_on_bit+0x84/0xc0)
>> [  207.921812] [] (__wait_on_bit) from [] 
>> (out_of_line_wait_on_bit+0x64/0x70)
>> [  207.930810] [] (out_of_line_wait_on_bit) from [] 
>> (__rpc_execute+0x18c/0x544)
>> [  207.939988] [] (__rpc_execute) from [] 
>> (rpc_run_task+0x13c/0x158)
>> [  207.948166] [] (rpc_run_task) from [] 
>> (rpc_call_sync+0x44/0xc4)
>> [  207.956163] [] (rpc_call_sync) from [] 
>> (rpc_ping+0x48/0x68)
>> [  207.963796] [] (rpc_ping) from [] 
>> (rpc_create_xprt+0xec/0x164)
>> [  207.971702] [] (rpc_create_xprt) from [] 
>> (rpc_create+0xf0/0x1a0)
>> [  207.979794] [] (rpc_create) from [] 
>> (nfs_create_rpc_client+0xd4/0xec)
>> [  207.988338] [] (nfs_create_rpc_client) from [] 
>> (nfs_init_client+0x20/0x78)
>> [  207.997332] [] (nfs_init_client) from [] 
>> (nfs_create_server+0xa0/0x3bc)
>> [  208.006057] [] (nfs_create_server) from [] 
>> (nfs3_create_server+0x8/0x20)
>> [  208.014879] [] (nfs3_create_server) from [] 
>> (nfs_try_mount+0xc4/0x1f0)
>> [  208.023513] [] (nfs_try_mount) from [] 
>> (nfs_fs_mount+0x290/0x910)
>> [  208.031702] [] (nfs_fs_mount) from [] 
>> (mount_fs+0x44/0x168)
>>
>> Has anyone else seen this issue?
>>
>> I'd appreciate any help or advice related to this issue.
> 
> I did not look at the details, but since it is 4.4 and __wait_on_bit
> showed up, you might want to look at [1]
> 
> [1] https://lkml.org/lkml/2015/11/20/472

Thanks. I'll take a look.

-- 
regards,
-grygorii


Re: Kernel NFS boot failure

2016-08-03 Thread Vladimir Murzin
Hi,

On 03/08/16 12:41, Grygorii Strashko wrote:
> Hi All,
> 
> We observe a kernel boot failure while running an NFS boot stress test
> (1000 iterations):
> - Linux version 4.7.0 
> - am335x-evm (TI AM335x EVM)
> - failure rate: 10-20 failures per 1000-iteration run.
> Originally this issue was reproduced using TI Kernel 4.4
> ( git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: 
> ti-linux-4.4.y)
> on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
> This issue has not been reproduced with TI Kernel 4.1 before.
> 
> The SysRq output shows that the system is stuck in nfs_fs_mount():
> 
> [  207.904632] [] (schedule) from [] 
> (rpc_wait_bit_killable+0x2c/0xd8)
> [  207.912996] [] (rpc_wait_bit_killable) from [] 
> (__wait_on_bit+0x84/0xc0)
> [  207.921812] [] (__wait_on_bit) from [] 
> (out_of_line_wait_on_bit+0x64/0x70)
> [  207.930810] [] (out_of_line_wait_on_bit) from [] 
> (__rpc_execute+0x18c/0x544)
> [  207.939988] [] (__rpc_execute) from [] 
> (rpc_run_task+0x13c/0x158)
> [  207.948166] [] (rpc_run_task) from [] 
> (rpc_call_sync+0x44/0xc4)
> [  207.956163] [] (rpc_call_sync) from [] 
> (rpc_ping+0x48/0x68)
> [  207.963796] [] (rpc_ping) from [] 
> (rpc_create_xprt+0xec/0x164)
> [  207.971702] [] (rpc_create_xprt) from [] 
> (rpc_create+0xf0/0x1a0)
> [  207.979794] [] (rpc_create) from [] 
> (nfs_create_rpc_client+0xd4/0xec)
> [  207.988338] [] (nfs_create_rpc_client) from [] 
> (nfs_init_client+0x20/0x78)
> [  207.997332] [] (nfs_init_client) from [] 
> (nfs_create_server+0xa0/0x3bc)
> [  208.006057] [] (nfs_create_server) from [] 
> (nfs3_create_server+0x8/0x20)
> [  208.014879] [] (nfs3_create_server) from [] 
> (nfs_try_mount+0xc4/0x1f0)
> [  208.023513] [] (nfs_try_mount) from [] 
> (nfs_fs_mount+0x290/0x910)
> [  208.031702] [] (nfs_fs_mount) from [] 
> (mount_fs+0x44/0x168)
> 
> Has anyone else seen this issue?
> 
> I'd appreciate any help or advice related to this issue.

I did not look at the details, but since it is 4.4 and __wait_on_bit
showed up, you might want to look at [1]

[1] https://lkml.org/lkml/2015/11/20/472

Just my 2p.
Vladimir

> 
> Thanks in advance.
> 
> regards,
> -grygorii



Kernel NFS boot failure

2016-08-03 Thread Grygorii Strashko
Hi All,

We observe a kernel boot failure while running an NFS boot stress test
(1000 iterations):
- Linux version 4.7.0 
- am335x-evm (TI AM335x EVM)
- failure rate: 10-20 failures per 1000-iteration run.
Originally this issue was reproduced using TI Kernel 4.4
( git://git.ti.com/ti-linux-kernel/ti-linux-kernel.git, branch: ti-linux-4.4.y)
on both am335x-evm and am57xx-beagle-x15 (am57xx-evm) platforms.
This issue has not been reproduced with TI Kernel 4.1 before.

The SysRq output shows that the system is stuck in nfs_fs_mount():

[  207.904632] [] (schedule) from [] 
(rpc_wait_bit_killable+0x2c/0xd8)
[  207.912996] [] (rpc_wait_bit_killable) from [] 
(__wait_on_bit+0x84/0xc0)
[  207.921812] [] (__wait_on_bit) from [] 
(out_of_line_wait_on_bit+0x64/0x70)
[  207.930810] [] (out_of_line_wait_on_bit) from [] 
(__rpc_execute+0x18c/0x544)
[  207.939988] [] (__rpc_execute) from [] 
(rpc_run_task+0x13c/0x158)
[  207.948166] [] (rpc_run_task) from [] 
(rpc_call_sync+0x44/0xc4)
[  207.956163] [] (rpc_call_sync) from [] 
(rpc_ping+0x48/0x68)
[  207.963796] [] (rpc_ping) from [] 
(rpc_create_xprt+0xec/0x164)
[  207.971702] [] (rpc_create_xprt) from [] 
(rpc_create+0xf0/0x1a0)
[  207.979794] [] (rpc_create) from [] 
(nfs_create_rpc_client+0xd4/0xec)
[  207.988338] [] (nfs_create_rpc_client) from [] 
(nfs_init_client+0x20/0x78)
[  207.997332] [] (nfs_init_client) from [] 
(nfs_create_server+0xa0/0x3bc)
[  208.006057] [] (nfs_create_server) from [] 
(nfs3_create_server+0x8/0x20)
[  208.014879] [] (nfs3_create_server) from [] 
(nfs_try_mount+0xc4/0x1f0)
[  208.023513] [] (nfs_try_mount) from [] 
(nfs_fs_mount+0x290/0x910)
[  208.031702] [] (nfs_fs_mount) from [] 
(mount_fs+0x44/0x168)

Has anyone else seen this issue?

I'd appreciate any help or advice related to this issue.

Thanks in advance.

regards,
-grygorii

=== kernel log ===

Starting kernel ...

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.7.0 (lcpd...@dflsdit-build06.dal.design.ti.com) 
(gcc version 4.9.3 20150413 (prerelease) (Linaro GCC 4.9-2015.05) ) #1 SMP Fri 
Jul 29 17:41:27 CDT 2016
[0.00] CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7), cr=10c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[0.00] Machine model: TI AM335x EVM
[0.00] cma: Reserved 16 MiB at 0xbe80
[0.00] Memory policy: Data cache writeback
[0.00] CPU: All CPU(s) started in SVC mode.
[0.00] AM335X ES2.1 (sgx neon )
[0.00] percpu: Embedded 15 pages/cpu @ef6b9000 s29440 r8192 d23808 
u61440
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 259904
[0.00] Kernel command line: console=ttyO0,115200n8 
earlyprintk=serial,ttyO0,115200n8 rootwait   ip=:eth0:dhcp  root=/dev/nfs 
rw 
nfsroot=192.168.0.1:/home/tigtfarm04/NFS_exports/linux/am335x-evm/autofs/bdb57ec4c06590e6172879867ab30d1e,nolock,v3,tcp,rsize=4096,wsize=4096
 
[0.00] PID hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Memory: 999648K/1046528K available (7354K kernel code, 768K 
rwdata, 2360K rodata, 1024K init, 8122K bss, 30496K reserved, 16384K 
cma-reserved, 243712K highmem)
[0.00] Virtual kernel memory layout:
[0.00] vector  : 0x - 0x1000   (   4 kB)
[0.00] fixmap  : 0xffc0 - 0xfff0   (3072 kB)
[0.00] vmalloc : 0xf080 - 0xff80   ( 240 MB)
[0.00] lowmem  : 0xc000 - 0xf000   ( 768 MB)
[0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]   .text : 0xc0008000 - 0xc0a7c8e8   (10707 kB)
[0.00]   .init : 0xc0b0 - 0xc0c0   (1024 kB)
[0.00]   .data : 0xc0c0 - 0xc0cc0304   ( 769 kB)
[0.00].bss : 0xc0cc2000 - 0xc14b0a40   (8123 kB)
[0.00] Running RCU self tests
[0.00] Hierarchical RCU implementation.
[0.00]  RCU lockdep checking is enabled.
[0.00]  Build-time adjustment of leaf fanout to 32.
[0.00]  RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=32, nr_cpu_ids=1
[0.00] NR_IRQS:16 nr_irqs:16 16
[0.00] IRQ: Found an INTC at 0xfa20 (revision 5.0) with 128 
interrupts
[0.00] OMAP clockevent source: timer2 at 2400 Hz
[0.17] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 
89478484971ns
[0.39] clocksource: timer1: mask: 0x max_cycles: 0x, 
max_idle_ns: 79635851949 ns
[0.90] OMAP clocksource: timer1 at 2400 Hz
[0.000600] clocksource_probe: no matching clocksources found
[0.001543] Console: colour dummy device 80x30
[0.001605] Lock