Re: [PATCH] tracing: Remove precision vsnprintf() check from print event

2024-03-04 Thread Sachin Sant



> On 05-Mar-2024, at 4:13 AM, Steven Rostedt  wrote:
> 
> From: "Steven Rostedt (Google)" 
> 
> This reverts 60be76eeabb3d ("tracing: Add size check when printing
> trace_marker output"). The only reason the precision check was added
> was because of a bug that miscalculated the write size of the string into
> the ring buffer and it truncated it removing the terminating nul byte. On
> reading the trace it crashed the kernel. But this was due to the bug in
> the code that happened during development and should never happen in
> practice. If anything, the precision can hide bugs where the string in the
> ring buffer isn't nul terminated and it will not be checked.
> 
> Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/
> Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home
> Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/
> 
> Reported-by: Sachin Sant 
> Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output")
> Signed-off-by: Steven Rostedt (Google) 
> ---

This fixes the reported problem for me.

All the ftrace selftests complete without any failures.
# of passed:  121
# of failed:  0
# of unresolved:  6
# of untested:  0
# of unsupported:  7
# of xfailed:  1
# of undefined(test bug):  0

Tested-by: Sachin Sant 


— Sachin
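The point that a precision can hide a missing terminator is ordinary C printf behaviour, not something specific to the tracing code. A minimal userspace sketch, assuming nothing about the kernel's trace_marker implementation (both helper names are hypothetical):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a buffer that may NOT be nul-terminated. With a precision
 * ("%.*s"), snprintf() reads at most `len` bytes, so a missing
 * terminator is never noticed: the output looks correct.
 */
static int format_with_precision(char *out, size_t outsz,
				 const char *buf, int len)
{
	return snprintf(out, outsz, "%.*s", len, buf);
}

/*
 * Hypothetical debug check: report whether a terminator exists within
 * the first `size` bytes, making a truncated string visible instead of
 * silently masking it.
 */
static int is_nul_terminated(const char *buf, size_t size)
{
	return memchr(buf, '\0', size) != NULL;
}
```

With "%.*s" the four unterminated bytes print cleanly, so the defect only shows up through an explicit check such as is_nul_terminated() or, without the precision, when "%s" runs past the end of the buffer.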



Re: [PATCH V2] selftests/cgroup: Fix build on older distros

2020-11-10 Thread Sachin Sant



> On 11-Nov-2020, at 3:45 AM, Shuah Khan  wrote:
> 
> On 11/6/20 12:40 AM, Sachin Sant wrote:
>> ---
>> V2: Replace all instances of clone_args by __clone_args
>> ---
>> diff --git a/a/tools/testing/selftests/cgroup/cgroup_util.c b/b/tools/testing/selftests/cgroup/cgroup_util.c
>> index 05853b0..0270146 100644
>> --- a/a/tools/testing/selftests/cgroup/cgroup_util.c
>> +++ b/b/tools/testing/selftests/cgroup/cgroup_util.c
> 
> Not sure how you generated the patch. I had to use git am -p2
> 
Sorry about that. Not sure what happened. Thanks.

-Sachin
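The quoted diff headers carry doubled prefixes (a/a/tools/..., b/b/tools/...), which is why git am -p2 was needed: by default git strips only one leading path component (-p1). A simplified sketch of that stripping rule; strip_components() is a hypothetical helper, not git's implementation, and it ignores corner cases such as repeated slashes:

```c
#include <assert.h>
#include <string.h>

/*
 * Drop the first n slash-separated components from a diff path,
 * similar in spirit to -p<n>. Returns NULL if the path has fewer
 * than n separators.
 */
static const char *strip_components(const char *path, int n)
{
	const char *p = path;

	while (n-- > 0) {
		const char *slash = strchr(p, '/');

		if (!slash)
			return NULL;
		p = slash + 1;
	}
	return p;
}
```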


[PATCH V2] selftests/cgroup: Fix build on older distros

2020-11-05 Thread Sachin Sant
On older distros struct clone_args does not have a cgroup member,
leading to build errors:

 cgroup_util.c: In function 'clone_into_cgroup':
 cgroup_util.c:343:4: error: 'struct clone_args' has no member named 'cgroup'
 cgroup_util.c:346:33: error: invalid application of 'sizeof' to incomplete type 'struct clone_args'

But the selftests already have a locally defined version of the
structure which is up to date, called __clone_args.

So use __clone_args which fixes the error.

Signed-off-by: Michael Ellerman 
Signed-off-by: Sachin Sant
Acked-by: Christian Brauner 
---

V2: Replace all instances of clone_args by __clone_args
---

diff --git a/a/tools/testing/selftests/cgroup/cgroup_util.c b/b/tools/testing/selftests/cgroup/cgroup_util.c
index 05853b0..0270146 100644
--- a/a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -337,13 +337,13 @@ pid_t clone_into_cgroup(int cgroup_fd)
 #ifdef CLONE_ARGS_SIZE_VER2
pid_t pid;
 
-   struct clone_args args = {
+   struct __clone_args args = {
.flags = CLONE_INTO_CGROUP,
.exit_signal = SIGCHLD,
.cgroup = cgroup_fd,
};
 
-   pid = sys_clone3(&args, sizeof(struct clone_args));
+   pid = sys_clone3(&args, sizeof(struct __clone_args));
/*
 * Verify that this is a genuine test failure:
 * ENOSYS -> clone3() not available


Re: [PATCH] selftests/cgroup: Fix build on older distros

2020-11-04 Thread Sachin Sant



> On 04-Nov-2020, at 3:35 PM, Michael Ellerman  wrote:
> 
> On older distros struct clone_args does not have a cgroup member,
> leading to build errors:
> 
>  cgroup_util.c: In function 'clone_into_cgroup':
>  cgroup_util.c:343:4: error: 'struct clone_args' has no member named 'cgroup'
> 
> But the selftests already have a locally defined version of the
> structure which is up to date, called __clone_args.
> 
> So use __clone_args which fixes the error.
> 

The size argument passed to sys_clone3() also requires a similar change:

-   pid = sys_clone3(&args, sizeof(struct clone_args));
+   pid = sys_clone3(&args, sizeof(struct __clone_args));

Without this, compilation still fails (at least for me) with the following error:

cgroup_util.c: In function 'clone_into_cgroup':
cgroup_util.c:346:33: error: invalid application of 'sizeof' to incomplete type 'struct clone_args'
  pid = sys_clone3(&args, sizeof(struct clone_args));
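The size argument matters because clone3() takes the structure size as an explicit parameter and uses it to determine which version of the argument layout the caller is passing, so sizeof() must name the same type as the object. A minimal sketch, assuming a layout modelled on the selftests' struct __clone_args (the name local_clone_args, its size constant and make_clone_args() are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical local copy of the clone3() argument layout (VER2), so the
 * build does not depend on how recent the installed kernel headers are.
 * Every member is a 64-bit field aligned to 8 bytes.
 */
struct local_clone_args {
	_Alignas(8) uint64_t flags;
	_Alignas(8) uint64_t pidfd;
	_Alignas(8) uint64_t child_tid;
	_Alignas(8) uint64_t parent_tid;
	_Alignas(8) uint64_t exit_signal;
	_Alignas(8) uint64_t stack;
	_Alignas(8) uint64_t stack_size;
	_Alignas(8) uint64_t tls;
	_Alignas(8) uint64_t set_tid;      /* added in VER1 */
	_Alignas(8) uint64_t set_tid_size;
	_Alignas(8) uint64_t cgroup;       /* added in VER2 */
};

/* 11 fields x 8 bytes, the same value as CLONE_ARGS_SIZE_VER2. */
#define LOCAL_CLONE_ARGS_SIZE_VER2 88

_Static_assert(sizeof(struct local_clone_args) == LOCAL_CLONE_ARGS_SIZE_VER2,
	       "layout must match the VER2 size");

/*
 * Build the arguments and report the size to pass alongside them.
 * The size is taken from this same type, never from another header's.
 */
static struct local_clone_args make_clone_args(uint64_t flags, int exit_signal,
					       int cgroup_fd, size_t *size)
{
	struct local_clone_args args = {
		.flags = flags,
		.exit_signal = (uint64_t)exit_signal,
		.cgroup = (uint64_t)cgroup_fd,
	};

	*size = sizeof(args);
	return args;
}
```

Deriving the size from the same type as the declaration avoids exactly the mismatch reported above, where only the declaration had been switched.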

> Signed-off-by: Michael Ellerman 
> ---
> tools/testing/selftests/cgroup/cgroup_util.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
> index 05853b0b8831..58e30f65df5e 100644
> --- a/tools/testing/selftests/cgroup/cgroup_util.c
> +++ b/tools/testing/selftests/cgroup/cgroup_util.c
> @@ -337,7 +337,7 @@ pid_t clone_into_cgroup(int cgroup_fd)
> #ifdef CLONE_ARGS_SIZE_VER2
>   pid_t pid;
> 
> - struct clone_args args = {
> + struct __clone_args args = {
>   .flags = CLONE_INTO_CGROUP,
>   .exit_signal = SIGCHLD,
>   .cgroup = cgroup_fd,
> 
> base-commit: cf7cd542d1b538f6e9e83490bc090dd773f4266d
> -- 
> 2.25.1
> 



Re: "fs/namei.c: keep track of nd->root refcount status" causes boot panic

2019-09-03 Thread Sachin Sant



> On 03-Sep-2019, at 1:43 PM, Naresh Kamboju  wrote:
> 
> On Tue, 3 Sep 2019 at 09:51, Qian Cai  wrote:
>> 
>> The linux-next commit "fs/namei.c: keep track of nd->root refcount status" [1]
>> causes boot panic on all architectures here on today’s linux-next (0902).
>> Reverted it will fix the issue.

A similar problem is seen on ppc64le.

[0.493235] BUG: Kernel NULL pointer dereference at 0x0cc0
[0.493241] Faulting instruction address: 0xc03e9260
[0.493245] Oops: Kernel access of bad area, sig: 11 [#1]
[0.493250] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[0.493254] Modules linked in:
[0.493260] CPU: 1 PID: 1 Comm: systemd Not tainted 5.3.0-rc6-next-20190902-autotest-autotest #1
[0.493265] NIP:  c03e9260 LR: c03e925c CTR: 01fc
[0.493270] REGS: c004f85038c0 TRAP: 0300   Not tainted  (5.3.0-rc6-next-20190902-autotest-autotest)
[0.493274] MSR:  80009033   CR: 28002842  XER:
[0.493282] CFAR: c000df44 DAR: 0cc0 DSISR: 4000 IRQMASK: 0
[0.493282] GPR00: c03e925c c004f8503b50 c1458e00 
 
[0.493282] GPR04: c004f8503ce0  0064 
 
[0.493282] GPR08:  c0ff7a65  
c004f70100c0 
[0.493282] GPR12: 2200 c0001ecaee00  
 
[0.493282] GPR16:    
 
[0.493282] GPR20: 00077624   
7fffa1099e20 
[0.493282] GPR24:  00010f9572a4  
0001 
[0.493282] GPR28: 00080060 00080040  
0cc0 
[0.493327] NIP [c03e9260] dput+0x70/0x4e0
[0.493332] LR [c03e925c] dput+0x6c/0x4e0
[0.493334] Call Trace:
[0.493338] [c004f8503b50] [c03e925c] dput+0x6c/0x4e0 (unreliable)
[0.493345] [c004f8503bc0] [c03d5da4] terminate_walk+0x104/0x130
[0.493351] [c004f8503c00] [c03da9d8] path_lookupat+0xe8/0x2b0
[0.493356] [c004f8503c70] [c03dd668] filename_lookup+0xa8/0x1c0
[0.493362] [c004f8503da0] [c046c4d4] sys_name_to_handle_at+0xe4/0x2d0
[0.493369] [c004f8503e20] [c000b378] system_call+0x5c/0x68
[0.493373] Instruction dump:
[0.493376] f8010010 f821ff91 7c7f1b79 41820050 3d28 3b61 613c0060 613d0040
[0.493383] 3b40 3b00 48707b11 6000 <813f> 3bdf0058 7fc3f378 71390008
[0.493391] ---[ end trace 7701d360352c734d ]---

Reverting the mentioned commit allows next to boot.

Thanks
-Sachin


Re: Oops (request_key_auth_describe) while running cve-2016-7042 from LTP

2019-08-30 Thread Sachin Sant



> On 30-Aug-2019, at 8:43 PM, David Howells  wrote:
> 
> Can you try this patch instead of Hillf’s?

Works for me. The test ran fine without any problems.

Tested-by: Sachin Sant 

Thanks
-Sachin



Re: Oops (request_key_auth_describe) while running cve-2016-7042 from LTP

2019-08-30 Thread Sachin Sant



> On 30-Aug-2019, at 2:26 PM, Hillf Danton  wrote:
> 
> 
> On Fri, 30 Aug 2019 12:18:07 +0530 Sachin Sant wrote:
>> 
>> [ 8074.351033] BUG: Kernel NULL pointer dereference at 0x0038
>> [ 8074.351046] Faulting instruction address: 0xc04ddf30
>> [ 8074.351052] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 8074.351056] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> 
> Add rcu gp.
> 
> --- a/security/keys/request_key_auth.c
> +++ b/security/keys/request_key_auth.c
> @@ -64,12 +64,19 @@ static int request_key_auth_instantiate(
> static void request_key_auth_describe(const struct key *key,
> struct seq_file *m)
> {
> - struct request_key_auth *rka = dereference_key_rcu(key);
> + struct request_key_auth *rka;
> +
> + rcu_read_lock();
> + rka = dereference_key_rcu(key);
> + if (!rka)
> + goto out;
> 

Thanks for the patch. It works for me; the test ran fine without any problems.

Tested-by: Sachin Sant 

Thanks
-Sachin
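The fix follows the usual pattern for an RCU-protected pointer: read it once inside the read-side critical section and check for NULL before touching any field. The oops above is a NULL dereference at a small offset (0x38), which is consistent with reading a field through a NULL payload pointer. The sketch below is only a userspace analogue using C11 atomics: it shows the load-once-and-check part, not RCU's guarantee that the object stays valid until the read-side section ends. All names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for a key and its payload. */
struct auth_payload {
	int pid;
};

struct key_model {
	_Atomic(struct auth_payload *) payload; /* may be cleared concurrently */
};

/*
 * Load the payload pointer once, then test it before use. Re-reading
 * the field, or dereferencing without the NULL check, is what turns a
 * concurrently cleared payload into a NULL-pointer crash.
 */
static int describe_key(struct key_model *key, char *out, size_t outsz)
{
	struct auth_payload *p =
		atomic_load_explicit(&key->payload, memory_order_acquire);

	if (!p)
		return snprintf(out, outsz, "key:<revoked>");
	return snprintf(out, outsz, "key:pid=%d", p->pid);
}
```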



Oops (request_key_auth_describe) while running cve-2016-7042 from LTP

2019-08-29 Thread Sachin Sant
While running LTP tests (specifically cve-2016-7042) against 5.3-rc6
(commit 4a64489cf8) on a POWER9 LPAR, the following problem is seen:

[ 3373.814425] FS-Cache: Netfs 'nfs' registered for caching
[ 7695.250230] Clock: inserting leap second 23:59:60 UTC
[ 8074.351033] BUG: Kernel NULL pointer dereference at 0x0038
[ 8074.351046] Faulting instruction address: 0xc04ddf30
[ 8074.351052] Oops: Kernel access of bad area, sig: 11 [#1]
[ 8074.351056] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 8074.351067] Dumping ftrace buffer:
[ 8074.351081](ftrace buffer empty)
[ 8074.351085] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache sctp 
tun brd vfat fat fuse xfs overlay loop iscsi_target_mod target_core_mod macsec 
tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc 
bridge stp llc ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute 
ip6table_nat ip6table_mangle ip6table_raw iptable_nat nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc 
uio_pdrv_genirq pseries_rng sg uio ip_tables ext4 mbcache jbd2 sr_mod cdrom 
sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log 
dm_mod [last unloaded: dummy_del_mod]
[ 8074.351153] CPU: 10 PID: 8314 Comm: cve-2016-7042 Tainted: G   O  5.3.0-rc6-autotest #1
[ 8074.351158] NIP:  c04ddf30 LR: c04ddef4 CTR: c04ddea0
[ 8074.351164] REGS: c000e74fb800 TRAP: 0300   Tainted: G   O   (5.3.0-rc6-autotest)
[ 8074.351170] MSR:  80009033   CR: 88002482  XER:
[ 8074.351177] CFAR: c000dfc4 DAR: 0038 DSISR: 4000 IRQMASK: 0
[ 8074.351177] GPR00: c04ddef4 c000e74fba90 c13cc200 
c008b0d7039b 
[ 8074.351177] GPR04: c008b0dabe3e 0007 00090a020904 
c008b0d8 
[ 8074.351177] GPR08: 03a2 0001 039b 
c0d03ac0 
[ 8074.351177] GPR12: c04ddea0 c0001ec5dc00  
 
[ 8074.351177] GPR16:  0002 1b01 
 
[ 8074.351177] GPR20: 3bc24df7 c000e74fbc28 0049 
0052 
[ 8074.351177] GPR24: 002d c000ffe30780 c008a991d800 
002d 
[ 8074.351177] GPR28: 0069  c000ffe30780 
c008a991d800 
[ 8074.351224] NIP [c04ddf30] request_key_auth_describe+0x90/0xd0
[ 8074.351230] LR [c04ddef4] request_key_auth_describe+0x54/0xd0
[ 8074.351233] Call Trace:
[ 8074.351237] [c000e74fba90] [c04ddef4] request_key_auth_describe+0x54/0xd0 (unreliable)
[ 8074.351244] [c000e74fbb10] [c04df718] proc_keys_show+0x308/0x4c0
[ 8074.351250] [c000e74fbcc0] [c0404950] seq_read+0x3d0/0x540
[ 8074.351255] [c000e74fbd40] [c04865e0] proc_reg_read+0x90/0x110
[ 8074.351261] [c000e74fbd70] [c03c901c] __vfs_read+0x3c/0x70
[ 8074.351267] [c000e74fbd90] [c03c9104] vfs_read+0xb4/0x1b0
[ 8074.351272] [c000e74fbdd0] [c03c95ec] ksys_read+0x7c/0x130
[ 8074.351277] [c000e74fbe20] [c000b388] system_call+0x5c/0x70
[ 8074.351281] Instruction dump:
[ 8074.351285] 2b890001 419e002c 38210080 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6
[ 8074.351292] 4e800020 6000 6000 6042  e8dd0030 3c82ff93 7fc3f378
[ 8074.351301] ---[ end trace d3304a3a5a0a0ca1 ]---

These CVE tests from LTP were recently added to the automated regression test
bucket that I run against upstream. I can’t tell if this is a regression or a
new problem.

Thanks
-Sachin


Re: [PATCH] tpm: fixes uninitialized allocated banks for IBM vtpm driver

2019-07-04 Thread Sachin Sant


> On 04-Jul-2019, at 5:29 PM, Mimi Zohar  wrote:
> 
> On Wed, 2019-07-03 at 23:32 -0400, Nayna Jain wrote:
>> The nr_allocated_banks and allocated banks are initialized as part of
>> tpm_chip_register. Currently, this is done as part of auto startup
>> function. However, some drivers, like the ibm vtpm driver, do not run
>> auto startup during initialization. This results in uninitialized memory
>> issue and causes a kernel panic during boot.
>> 
>> This patch moves the pcr allocation outside the auto startup function
>> into tpm_chip_register. This ensures that allocated banks are initialized
>> in any case.
>> 
>> Fixes: 879b589210a9 ("tpm: retrieve digest size of unknown algorithms with
>> PCR read")
>> Signed-off-by: Nayna Jain 
> Reviewed-by: Mimi Zohar 

Thanks for the fix. The kernel boots fine with it.

Tested-by: Sachin Sant 

Thanks
-Sachin
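The ordering change described in the quoted patch can be modelled as follows. This is a heavily simplified, hypothetical sketch (none of these names are the TPM driver's): the point is only that the bank allocation now sits on the register path itself rather than inside the optional auto-startup step, so a driver that skips auto startup, like the IBM vTPM one, still ends up with initialized banks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of a TPM chip. */
struct chip_model {
	bool auto_startup;      /* false for drivers that skip auto startup */
	int nr_allocated_banks; /* 0 until the PCR allocation is read */
};

static int read_pcr_allocation(struct chip_model *chip)
{
	chip->nr_allocated_banks = 1; /* pretend the TPM reported one bank */
	return 0;
}

static int run_auto_startup(struct chip_model *chip)
{
	(void)chip; /* startup work would go here */
	return 0;
}

/*
 * After the fix: the allocation is read on the register path itself,
 * whether or not the driver asked for auto startup.
 */
static int chip_register(struct chip_model *chip)
{
	int rc;

	if (chip->auto_startup) {
		rc = run_auto_startup(chip);
		if (rc)
			return rc;
	}
	return read_pcr_allocation(chip);
}
```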



Re: [next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR

2019-06-24 Thread Sachin Sant



> On 24-Jun-2019, at 8:12 PM, David Hildenbrand  wrote:
> 
> On 24.06.19 16:09, Sachin Sant wrote:
>> Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls.
>> 
>> This problem was introduced with next-20190620 (dc636f5d78).
>> next-20190619 was last good kernel.
>> 
>> Reverting following commit allows the kernel to boot.
>> 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()
>> 
>> 
>> [0.014409] Using shared cache scheduler topology
>> [0.016302] devtmpfs: initialized
>> [0.031022] clocksource: jiffies: mask: 0x max_cycles: 
>> 0x, max_idle_ns: 1911260446275 ns
>> [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, 
>> linear)
>> [0.031575] NET: Registered protocol family 16
>> [0.031724] audit: initializing netlink subsys (disabled)
>> [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized 
>> audit_enabled=0 res=1
>> [0.032249] cpuidle: using governor menu
>> [0.032403] pstore: Registered nvram as persistent store backend
>> [   60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
>> [   60.061254] rcu:  0-: (5999 ticks this GP) 
>> idle=1ea/1/0x4002 softirq=5/5 fqs=2999 
>> [   60.061261]   (t=6000 jiffies g=-1187 q=0)
>> [   60.061265] NMI backtrace for cpu 0
>> [   60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
>> 5.2.0-rc5-next-20190621-autotest-autotest #1
>> [   60.061275] Call Trace:
>> [   60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 
>> (unreliable)
>> [   60.061287] [c018ee85f3c0] [c0b6d464] 
>> nmi_cpu_backtrace+0x144/0x150
>> [   60.061293] [c018ee85f450] [c0b6d61c] 
>> nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
>> [   60.061300] [c018ee85f4f0] [c00692c8] 
>> arch_trigger_cpumask_backtrace+0x28/0x40
>> [   60.061306] [c018ee85f510] [c01c5f90] 
>> rcu_dump_cpu_stacks+0x10c/0x16c
>> [   60.061313] [c018ee85f560] [c01c4fe4] 
>> rcu_sched_clock_irq+0x744/0x990
>> [   60.061318] [c018ee85f630] [c01d5b58] 
>> update_process_times+0x48/0x90
>> [   60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
>> [   60.061330] [c018ee85f690] [c01ea150] 
>> tick_handle_periodic+0x40/0xe0
>> [   60.061336] [c018ee85f6d0] [c002b5cc] 
>> timer_interrupt+0x10c/0x2e0
>> [   60.061342] [c018ee85f730] [c0009204] 
>> decrementer_common+0x134/0x140
>> [   60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>> [   60.061350] LR = arch_local_irq_restore+0x84/0x90
>> [   60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac 
>> (unreliable)
>> [   60.061364] [c018ee85fa50] [c0b88300] 
>> _raw_spin_unlock_irqrestore+0x50/0x80
>> [   60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
>> [   60.061376] [c018ee85fac0] [c0766ea0] 
>> subsys_find_device_by_id+0xf0/0x1a0
>> [   60.061382] [c018ee85fb20] [c0797a94] 
>> walk_memory_blocks+0x84/0x100
>> [   60.061388] [c018ee85fb80] [c0795ea0] 
>> link_mem_sections+0x40/0x60
>> [   60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
>> [   60.061400] [c018ee85fc10] [c0010448] 
>> do_one_initcall+0x68/0x2c0
>> [   60.061406] [c018ee85fce0] [c0f247dc] 
>> kernel_init_freeable+0x318/0x47c
>> [   60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
>> [   60.061417] [c018ee85fe20] [c000ba54] 
>> ret_from_kernel_thread+0x5c/0x68
>> [   88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
>> [swapper/0:1]
>> [   88.016569] Modules linked in:
>> 
> 
> Hi, thanks! Please see
> 
> https://lkml.org/lkml/2019/6/21/600
> 
> and especially
> 
> https://lkml.org/lkml/2019/6/21/908
> 
> Does this fix your problem? The fix is on its way to next.

Yes, this patch fixes the problem for me.

Thanks
-Sachin



[next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR

2019-06-24 Thread Sachin Sant
The latest -next fails to boot on a POWER9 PowerVM LPAR due to RCU stalls.

This problem was introduced with next-20190620 (dc636f5d78).
next-20190619 was the last good kernel.

Reverting the following commit allows the kernel to boot:
2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()


[0.014409] Using shared cache scheduler topology
[0.016302] devtmpfs: initialized
[0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275 ns
[0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, linear)
[0.031575] NET: Registered protocol family 16
[0.031724] audit: initializing netlink subsys (disabled)
[0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized audit_enabled=0 res=1
[0.032249] cpuidle: using governor menu
[0.032403] pstore: Registered nvram as persistent store backend
[   60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
[   60.061254] rcu: 0-: (5999 ticks this GP) idle=1ea/1/0x4002 softirq=5/5 fqs=2999
[   60.061261]  (t=6000 jiffies g=-1187 q=0)
[   60.061265] NMI backtrace for cpu 0
[   60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-next-20190621-autotest-autotest #1
[   60.061275] Call Trace:
[   60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 (unreliable)
[   60.061287] [c018ee85f3c0] [c0b6d464] nmi_cpu_backtrace+0x144/0x150
[   60.061293] [c018ee85f450] [c0b6d61c] nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
[   60.061300] [c018ee85f4f0] [c00692c8] arch_trigger_cpumask_backtrace+0x28/0x40
[   60.061306] [c018ee85f510] [c01c5f90] rcu_dump_cpu_stacks+0x10c/0x16c
[   60.061313] [c018ee85f560] [c01c4fe4] rcu_sched_clock_irq+0x744/0x990
[   60.061318] [c018ee85f630] [c01d5b58] update_process_times+0x48/0x90
[   60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
[   60.061330] [c018ee85f690] [c01ea150] tick_handle_periodic+0x40/0xe0
[   60.061336] [c018ee85f6d0] [c002b5cc] timer_interrupt+0x10c/0x2e0
[   60.061342] [c018ee85f730] [c0009204] decrementer_common+0x134/0x140
[   60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[   60.061350] LR = arch_local_irq_restore+0x84/0x90
[   60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac (unreliable)
[   60.061364] [c018ee85fa50] [c0b88300] _raw_spin_unlock_irqrestore+0x50/0x80
[   60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
[   60.061376] [c018ee85fac0] [c0766ea0] subsys_find_device_by_id+0xf0/0x1a0
[   60.061382] [c018ee85fb20] [c0797a94] walk_memory_blocks+0x84/0x100
[   60.061388] [c018ee85fb80] [c0795ea0] link_mem_sections+0x40/0x60
[   60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
[   60.061400] [c018ee85fc10] [c0010448] do_one_initcall+0x68/0x2c0
[   60.061406] [c018ee85fce0] [c0f247dc] kernel_init_freeable+0x318/0x47c
[   60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
[   60.061417] [c018ee85fe20] [c000ba54] ret_from_kernel_thread+0x5c/0x68
[   88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[   88.016569] Modules linked in:


Thanks
-Sachin




[attachment: boot.log]


Re: [POWERPC][next-20190603] Boot failure : Kernel BUG at mm/vmalloc.c:470

2019-06-04 Thread Sachin Sant



> On 04-Jun-2019, at 3:59 PM, Stephen Rothwell  wrote:
> 
> Hi Sachin,
> 
> On Tue, 4 Jun 2019 14:45:43 +0530 Sachin Sant  
> wrote:
>> 
>> While booting linux-next [next-20190603] on a POWER9 LPAR following
>> BUG is encountered and the boot fails.
>> 
>> If I revert the following 2 patches I no longer see this BUG message
>> 
>> 07031d37b2f9 ( mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va() )
>> 728e0fbf263e ( mm/vmalloc.c: get rid of one single unlink_va() when merge )
> 
> This latter patch has been fixed in today's linux-next …

Thanks Stephen. 
With today’s next (20190604) I no longer see this issue.

Thanks
-Sachin


[PowerPC][next-20190603] WARNING: at kernel/fork.c:721

2019-06-04 Thread Sachin Sant
While booting linux-next [20190603] on a POWER9 LPAR, I ran into the
following warning:

[9.002935] WARNING: CPU: 0 PID: 1 at kernel/fork.c:721 __put_task_struct+0x34/0x170
[9.002947] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[9.002960] CPU: 0 PID: 1 Comm: systemd Not tainted 5.2.0-rc3-next-20190603-autotest #1
[9.002971] NIP:  c01191e4 LR: c020c53c CTR:
[9.002980] REGS: c008b2783810 TRAP: 0700   Not tainted  (5.2.0-rc3-next-20190603-autotest)
[9.002990] MSR:  80029033   CR: 24222842  XER: 2004
[9.003004] CFAR: c020c538 IRQMASK: 0 
[9.003004] GPR00: c020c53c c008b2783aa0 c138ca00 
c008b92e19f8 
[9.003004] GPR04: c008b2783b98 c008b2783b98 c008b92e24b0 
 
[9.003004] GPR08:  0001  
c0a81060 
[9.003004] GPR12: 24224842 c17c  
 
[9.003004] GPR16:    
0001 
[9.003004] GPR20:  7fff95f9 c008b2756dc0 
c008b2783df0 
[9.003004] GPR24: 2000 c008ad74d200 c008b926a218 
 
[9.003004] GPR28: c008b92e1400   
c008b92e19f8 
[9.003083] NIP [c01191e4] __put_task_struct+0x34/0x170
[9.003094] LR [c020c53c] css_task_iter_end+0x11c/0x1b0
[9.003101] Call Trace:
[9.003108] [c008b2783aa0] [c008b92e1400] 0xc008b92e1400 (unreliable)
[9.003119] [c008b2783ad0] [c020c53c] css_task_iter_end+0x11c/0x1b0
[9.003129] [c008b2783b10] [c020f60c] pidlist_array_load+0x12c/0x390
[9.003140] [c008b2783bf0] [c020fa20] cgroup_pidlist_start+0x1b0/0x1e0
[9.003151] [c008b2783c40] [c01ffd98] cgroup_seqfile_start+0x38/0x50
[9.003163] [c008b2783c60] [c049c270] kernfs_seq_start+0x80/0x120
[9.003175] [c008b2783ca0] [c03fea08] seq_read+0x208/0x540
[9.003184] [c008b2783d20] [c049cdd4] kernfs_fop_read+0x1a4/0x260
[9.003196] [c008b2783d70] [c03c3cec] __vfs_read+0x3c/0x70
[9.003205] [c008b2783d90] [c03c3dd4] vfs_read+0xb4/0x1b0
[9.003214] [c008b2783dd0] [c03c42bc] ksys_read+0x7c/0x130
[9.003224] [c008b2783e20] [c000b688] system_call+0x5c/0x70
[9.003232] Instruction dump:
[9.003237] 38423850 7c0802a6 6000 7c0802a6 fbc1fff0 fbe1fff8 f8010010 f821ffd1
[9.003251] 7c7f1b78 8123067c 7d290034 5529d97e <0b09> 81230110 7d290034 5529d97e
[9.003264] ---[ end trace 2194bb4cf2567482 ]---

I have not seen this warning before; it is new with this next build.

void __put_task_struct(struct task_struct *tsk)
{
WARN_ON(!tsk->exit_state); <<== 
WARN_ON(refcount_read(&tsk->usage));

Since I have been running into various boot failures with the next tree for
the last week or so, I am not able to bisect.

Thanks
-Sachin
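The first WARN_ON() fires when the last reference to a task is dropped while exit_state is still zero, meaning the task_struct is being freed although the task never exited. That usually points to an unbalanced get/put. The call trace shows the final put coming from css_task_iter_end(), though the sketch below does not claim that is where the imbalance lies. A hypothetical userspace model of the check (not the kernel's code):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace model of a task's lifetime fields. */
struct task_model {
	int usage;      /* reference count */
	int exit_state; /* non-zero once the task has exited */
};

static int warnings;

/* Count and report a violated sanity check, in the spirit of WARN_ON(). */
#define MODEL_WARN_ON(cond)                                     \
	do {                                                    \
		if (cond) {                                     \
			warnings++;                             \
			fprintf(stderr, "WARNING: %s\n", #cond); \
		}                                               \
	} while (0)

/* Final free: the task must already have exited and hold no references. */
static void model_free_task(struct task_model *tsk)
{
	MODEL_WARN_ON(!tsk->exit_state);
	MODEL_WARN_ON(tsk->usage != 0);
}

/* Drop one reference; free on the last one. */
static void model_put_task(struct task_model *tsk)
{
	if (--tsk->usage == 0)
		model_free_task(tsk);
}
```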


[POWERPC][next-20190603] Boot failure : Kernel BUG at mm/vmalloc.c:470

2019-06-04 Thread Sachin Sant
While booting linux-next [next-20190603] on a POWER9 LPAR, the following
BUG is encountered and the boot fails.

If I revert the following two patches, I no longer see this BUG message:

07031d37b2f9 ( mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va() )
728e0fbf263e ( mm/vmalloc.c: get rid of one single unlink_va() when merge )

[1.130734] [ cut here ]
[1.130745] kernel BUG at mm/vmalloc.c:470!
[1.130753] Oops: Exception in kernel mode, sig: 5 [#1]
[1.130761] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[1.130768] Modules linked in: ibmvscsi(+) ibmveth scsi_transport_srp
[1.130781] CPU: 23 PID: 127 Comm: kworker/23:0 Not tainted 5.2.0-rc3-next-20190603 #1
[1.130796] Workqueue: events do_free_init
[1.130803] NIP:  c03bcc48 LR: c03bcbc4 CTR: c03bc860
[1.130812] REGS: c018ec7877e0 TRAP: 0700   Not tainted  (5.2.0-rc3-next-20190603)
[1.130820] MSR:  80010282b033   CR: 2400  XER: 20040005
[1.130837] CFAR: c03bcc6c IRQMASK: 0
[1.130837] GPR00: c03bd0e4 c018ec787a70 c1929400 
0001
[1.130837] GPR04: c018e8d2e8f0 c018e8d2e8f0 c0080657 
c0080655
[1.130837] GPR08: c0080967 c018e8d0e098 c018e8d0e5f8 
c018e8ca93f0
[1.130837] GPR12: c03bc860 c0001ec4e200 c0160188 
c018f4044d00
[1.130837] GPR16:    

[1.130837] GPR20:   c1485108 
c19620d8
[1.130837] GPR24: c1962464 0800 0001 
1800
[1.130837] GPR28: c1ccc0a8 c1ccc080 c018e8d2e8b8 
c018e8d2e8d8
[1.130899] NIP [c03bcc48] __free_vmap_area+0xf8/0x480
[1.130906] LR [c03bcbc4] __free_vmap_area+0x74/0x480
[1.130912] Call Trace:
[1.130916] [c018ec787a70] [c03bcf70] __free_vmap_area+0x420/0x480 (unreliable)
[1.130924] [c018ec787ac0] [c03bd0e4] __purge_vmap_area_lazy+0x114/0x1e0
[1.130932] [c018ec787b10] [c03bef44] _vm_unmap_aliases+0x1a4/0x210
[1.130939] [c018ec787b90] [c03c1c48] __vunmap+0xe8/0x220
[1.130946] [c018ec787c20] [c021102c] module_memfree+0x3c/0x50
[1.130953] [c018ec787c40] [c02110ac] do_free_init+0x6c/0xa0
[1.130964] [c018ec787c70] [c0156df0] process_one_work+0x260/0x520
[1.130976] [c018ec787d10] [c0157138] worker_thread+0x88/0x5f0
[1.130985] [c018ec787db0] [c0160328] kthread+0x1a8/0x1b0
[1.130996] [c018ec787e20] [c000ba54] ret_from_kernel_thread+0x5c/0x68
[1.131004] Instruction dump:
[1.131011] e9292c90 2fa9 419e0378 e8fe e8de0008 6000 e909ffe8 e949ffe0
[1.131021] 7fa74040 409c0014 7faa3040 409c002c <0fe0> 6000 7faa3040 409cfff4
[1.131032] ---[ end trace b0b43434aedbb78e ]---

I have attached the boot log for reference.

Thanks
-Sachin




[attachment: next-20190603.log]


Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-06 Thread Sachin Sant

> On 07-Apr-2017, at 2:14 AM, Tyrel Datwyler  wrote:
> 
> On 04/06/2017 03:27 AM, Sachin Sant wrote:
>> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
>> any I/O adapter results in the following warning
> 
> I remember you mentioning this when the issue was brought up for CPUs. I
> assume the case is the same here where the issue is only seen with
> adapters that were hot-added after boot (ie. hot-remove of adapter
> present at boot doesn't trip the warning)?
> 

Correct, it can be recreated only with adapters that were hot-added after boot.

> -Tyrel
> 
>> 
>> Thanks
>> -Sachin
>> 
>> 
> 



WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-06 Thread Sachin Sant
On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
any I/O adapter results in the following warning:

This problem has been in the code for some time now; I first saw it in
the -next tree.

[  269.589441] rpadlpar_io: slot PHB 72 removed
[  270.589997] refcount_t: underflow; use-after-free.
[  270.590019] [ cut here ]
[  270.590025] WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110
[  270.590028] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge 
stp llc rpadlpar_io rpaphp kvm_pr kvm ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag 
af_packet_diag netlink_diag ghash_generic xts gf128mul vmx_crypto tpm_ibmvtpm 
tpm sg pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc 
ip_tables xfs libcrc32c sr_mod sd_mod cdrom ibmvscsi ibmveth scsi_transport_srp 
dm_mirror dm_region_hash dm_log dm_mod
[  270.590076] CPU: 5 PID: 3335 Comm: drmgr Not tainted 4.11.0-rc5 #3
[  270.590079] task: c005d8df8600 task.stack: c000fb3a8000
[  270.590081] NIP: c1aa3ca4 LR: c1aa3ca0 CTR: 006338e4
[  270.590084] REGS: c000fb3ab8a0 TRAP: 0700   Not tainted  (4.11.0-rc5)
[  270.590087] MSR: 80029033 
[  270.590090]   CR: 22002422  XER: 0007
[  270.590093] CFAR: c1edaabc SOFTE: 1 
[  270.590093] GPR00: c1aa3ca0 c000fb3abb20 c25ea900 
0026 
[  270.590093] GPR04: c0077fc4ada0 c0077fc617b8 000f0c33 
 
[  270.590093] GPR08:  c227146c 00077d9e 
3ff0 
[  270.590093] GPR12: 2200 ce802d00  
 
[  270.590093] GPR16:    
 
[  270.590093] GPR20:  1001b5a8 10018338 
10016650 
[  270.590093] GPR24: 1001b278 c00776e0fdcc 10016650 
 
[  270.590093] GPR28: c0077ffea910 c000fbf79180 c00776e0fdc0 
c000fbf791d8 
[  270.590126] NIP [c1aa3ca4] refcount_sub_and_test+0xf4/0x110
[  270.590129] LR [c1aa3ca0] refcount_sub_and_test+0xf0/0x110
[  270.590132] Call Trace:
[  270.590134] [c000fb3abb20] [c1aa3ca0] refcount_sub_and_test+0xf0/0x110 (unreliable)
[  270.590139] [c000fb3abb80] [c1a8221c] kobject_put+0x3c/0xa0
[  270.590143] [c000fb3abbf0] [c1d22d34] of_node_put+0x24/0x40
[  270.590147] [c000fb3abc10] [c165c874] ofdt_write+0x204/0x6b0
[  270.590151] [c000fb3abcd0] [c197a220] proc_reg_write+0x80/0xd0
[  270.590155] [c000fb3abd00] [c18de680] __vfs_write+0x40/0x1c0
[  270.590158] [c000fb3abd90] [c18dffd8] vfs_write+0xc8/0x240
[  270.590162] [c000fb3abde0] [c18e1c40] SyS_write+0x60/0x110
[  270.590165] [c000fb3abe30] [c15cb184] system_call+0x38/0xe0
[  270.590168] Instruction dump:
[  270.590170] 7863d182 4e800020 7c0802a6 3921 3d42fff8 3c62ffb1 386371a8 992a0171
[  270.590175] f8010010 f821ffa1 48436de1 6000 <0fe0> 38210060 3860 e8010010
[  270.590180] ---[ end trace 08c7a2f3c8bead33 ]---

I have attached the dmesg log from the system. Let me know if any additional
information is required to help debug this problem.

Thanks
-Sachin
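The warning comes from refcount_t's underflow check. The trace shows of_node_put() dropping a device-tree node's kobject reference when the count was already zero, which implies one more put than get somewhere. A minimal userspace sketch of such a checked decrement, using C11 atomics; the type and function names are hypothetical, and the kernel's refcount_t additionally saturates instead of wrapping, which this sketch does not model:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of a checked reference counter. */
struct checked_ref {
	atomic_uint refs;
};

/*
 * Subtract i and return true when the count reaches zero (caller frees).
 * If the subtraction would go below zero, report an underflow and leave
 * the counter unchanged instead of letting it wrap around.
 */
static bool checked_sub_and_test(struct checked_ref *r, unsigned int i)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (i > old) {
			fprintf(stderr, "underflow; use-after-free\n");
			return false;
		}
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - i));

	return old - i == 0;
}
```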




[attachment: dmesg-4.11-rc5.log]


Re: [PATCH] jump_label: align jump_entry table to at least 4-bytes

2017-03-01 Thread Sachin Sant

> I also checked all the other .ko files and they were properly aligned. So I 
> think this should hopefully work, and I like that its not a per-arch fix.
> 
> Sachin, sorry to bother you again, but I'm hoping you can try David's latest 
> patch to scripts/module-common.lds, just to test in your setup.

I tested the patch on two different systems where I ran into this problem. In
both cases the system boots without any warning. A quick module load/unload
test also worked correctly.

Tested-by: Sachin Sant 

Thanks
-Sachin



Re: [PATCH] jump_label: align jump_entry table to at least 4-bytes

2017-02-27 Thread Sachin Sant

> Thanks for the suggestion! I would like to see if this resolves the ppc issue 
> we had. I'm attaching a powerpc patch based on your suggestion. Hopefully, 
> Sachin can try it.
> 
> Thanks,
> 

I tried this patch. It does not fix the warning. 

[   11.709071] mount (2956) used greatest stack depth: 10176 bytes left
[   11.731883] [ cut here ]
[   11.731911] WARNING: CPU: 3 PID: 2972 at kernel/jump_label.c:287 static_key_set_entries.isra.10+0x3c/0x50
[   11.731915] Modules linked in: nfsd(+) ip_tables x_tables autofs4
[   11.731925] CPU: 3 PID: 2972 Comm: modprobe Not tainted 4.10.0-next-20170227 #4
[   11.731930] task: c0077b284a00 task.stack: c0077b8b8000
[   11.731933] NIP: c17bf84c LR: c17bfcbc CTR:
[   11.731937] REGS: c0077b8bb800 TRAP: 0700   Not tainted  (4.10.0-next-20170227)
[   11.731940] MSR: 8282b033 
[   11.731948]   CR: 48248282  XER: 0001
[   11.731953] CFAR: c17bf81c SOFTE: 1 
GPR00: c17bfc7c c0077b8bba80 c266c300 d63e28f8 
GPR04: d63e5b57 00010017 c17bf5a0  
GPR08: 00052eb3 0001 c258c300 0001 
GPR12: c1b5b460 cea80c00 0020 d6380bb0 
GPR16: c0077b8bbda0 c0077b8bbdec  8580 
GPR20: d641 d63e7ea8 c256db90 0001 
GPR24: c258ca14  c25737f8 d63e5c17 
GPR28:  d63e6780 d63e28f0 d63e5b57 
[   11.732000] NIP [c17bf84c] static_key_set_entries.isra.10+0x3c/0x50
[   11.732004] LR [c17bfcbc] jump_label_module_notify+0x20c/0x420
[   11.732007] Call Trace:
[   11.732011] [c0077b8bba80] [c17bfc7c] jump_label_module_notify+0x1cc/0x420 (unreliable)
[   11.732019] [c0077b8bbb40] [c16b69b0] notifier_call_chain+0x90/0x100
[   11.732024] [c0077b8bbb90] [c16b6e80] __blocking_notifier_call_chain+0x60/0x90
[   11.732029] [c0077b8bbbe0] [c17380ec] load_module+0x1c2c/0x2760
[   11.732034] [c0077b8bbd70] [c1738e80] SyS_finit_module+0xc0/0xf0
[   11.732040] [c0077b8bbe30] [c15cb8e0] system_call+0x38/0xfc
[   11.732043] Instruction dump:
[   11.732046] 40c20018 e923 792907a0 7c844b78 f883 4e800020 3d42fff2 892a0714
[   11.732053] 2f89 40feffe0 3921 992a0714 <0fe0> 4bd0 6000 6000
[   11.732061] ---[ end trace 13c67d418143453c ]---
[   11.732319] Installing knfsd (copyright (C) 1996 o...@monad.swb.de).

I have collected the output of the command suggested by David. Here is a
snippet from the run:

File: ./arch/powerpc/kernel/built-in.o
  [383] __jump_table  PROGBITS 068020 000c78 18 WAM  0   0  1
File: ./arch/powerpc/kernel/rtasd.o
File: ./arch/powerpc/kernel/of_platform.o
File: ./arch/powerpc/kernel/eeh_event.o
File: ./arch/powerpc/kernel/setup_64.o
  [18] __jump_table  PROGBITS 001240 48 18 WAM  0   0  1
File: ./arch/powerpc/kernel/rtas-proc.o
File: ./arch/powerpc/kernel/signal_64.o
  [13] __jump_table  PROGBITS 001c68 60 18 WAM  0   0  1

I have attached the complete output for reference.
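To illustrate why the alignment column matters, here is a minimal sketch in C. It assumes, as the patch subject ("align jump_entry table to at least 4-bytes") suggests, that after commit 3821fd35b5 the static key keeps its `jump_entry` table address in the same word as type flags stored in the low bits. A table placed at an address that is not a multiple of 4 would then overlap those flag bits and trip the WARNING in `static_key_set_entries()`. The readelf lines above report an alignment (last column) of 1 for `__jump_table`, which permits such an address. `TYPE_MASK`, `struct key_sketch`, `set_entries_addr()` and the other helpers are hypothetical names, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the low two bits of the stored word carry type flags,
 * so the table address must be at least 4-byte aligned. */
#define TYPE_MASK ((uintptr_t)3)

/* Hypothetical stand-in for struct static_key: one word holding
 * (entries address | type flags). */
struct key_sketch {
	uintptr_t word;
};

/* Store a new table address while preserving the existing type bits.
 * Returns false (where the kernel would WARN) if the address would
 * overlap the flag bits. */
static bool set_entries_addr(struct key_sketch *key, uintptr_t entries)
{
	uintptr_t type = key->word & TYPE_MASK;

	if (entries & TYPE_MASK)
		return false;
	key->word = entries | type;
	return true;
}

/* Recover the table address by masking off the flag bits. */
static uintptr_t get_entries_addr(const struct key_sketch *key)
{
	return key->word & ~TYPE_MASK;
}

/* Recover the type flags. */
static uintptr_t get_type(const struct key_sketch *key)
{
	return key->word & TYPE_MASK;
}

/* Demo: an 8-byte aligned table is accepted; the same table shifted by
 * one byte (possible with an align-1 section) is rejected.
 * Returns true if the misaligned address was rejected. */
static bool alignment_demo(void)
{
	static _Alignas(8) unsigned char section[64];
	struct key_sketch key = { .word = 1 }; /* type flag already set */
	uintptr_t base = (uintptr_t)section;
	bool ok = set_entries_addr(&key, base);

	assert(ok);
	assert(get_entries_addr(&key) == base && get_type(&key) == 1);

	/* base + 1 has bit 0 set, so storing it would corrupt the flags */
	return !set_entries_addr(&key, base + 1);
}
```

Masking the flags off on every read is what makes packing them into the pointer cheap, but it only works if the linker never places the table at an address with those bits set.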

Thanks
-Sachin



jump_table.log
Description: Binary data


Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140

2017-02-19 Thread Sachin Sant

>> While booting next-20170217 on a POWER6 box, I ran into following
>> warning. This is a full system lpar. Previous next tree was good.
>> I will try a bisect tomorrow.
> 
> Do you have CONFIG_DEBUG_SHIRQ=y ?
> 

Yes. CONFIG_DEBUG_SHIRQ is enabled.

As you suggested, reverting the following commit allows a clean boot:
f91f694540f3 ("genirq: Reenable shared irq debugging in request_*_irq()")

>> ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015)
>> ipr 0200:00:01.0: Found IOA with IRQ: 305
>> [ cut here ]
>> WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140
>> Modules linked in:
>> CPU: 12 PID: 1 Comm: swapper/14 Not tainted 4.10.0-rc8-next-20170217-autotest #1
>> task: c002b2a4a580 task.stack: c002b2a5c000
>> NIP: c00731b0 LR: c01389f8 CTR: c0073170
>> REGS: c002b2a5f050 TRAP: 0700   Not tainted  (4.10.0-rc8-next-20170217-autotest)
>> MSR: 80029032 
>>  CR: 28004082  XER: 2004
>> CFAR: c01389e0 SOFTE: 0 
>> GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 
>> GPR04:   0064 0131 
>> GPR08: 0001 c000d3104cb8  0009b1f8 
>> GPR12: 48004082 cedc2400 c000dad0  
>> GPR16:  3c007efc c0a9e848  
>> GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 
>> GPR24: c0a9e848  c002af4d4fb8  
>> GPR28:  c002b203f498 c0ef8928 c002b203f400 
>> NIP [c00731b0] .icp_hv_eoi+0x40/0x140
>> LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
>> Call Trace:
>> [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable)
>> [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
>> [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370
>> [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390
>> [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0
>> [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130
>> [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0
>> [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0
>> [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190
>> [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0
>> [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40
>> [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370
>> [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170
>> [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60
>> [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70
>> [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0
>> [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360
>> [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130
>> [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8
>> Instruction dump:
>> f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 
>> 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 
>> ---[ end trace 5e18ae409f46392c ]---
>> ipr 0200:00:01.0: Initializing IOA.
>> 
>> Thanks
>> -Sachin
> 



next-20170217 boot on POWER8 LPAR : WARNING @kernel/jump_label.c:287

2017-02-19 Thread Sachin Sant
While booting next-20170217 on a POWER8 LPAR, the following
warning is displayed.

Reverting the following commit allows a clean boot:
commit 3821fd35b5 ("jump_label: Reduce the size of struct static_key")

[   11.393008] [ cut here ]
[   11.393031] WARNING: CPU: 5 PID: 2890 at kernel/jump_label.c:287 static_key_set_entries.isra.10+0x3c/0x50
[   11.393035] Modules linked in: nfsd(+) ip_tables x_tables autofs4
[   11.393043] CPU: 5 PID: 2890 Comm: modprobe Not tainted 4.10.0-rc8-next-20170217-autotest #1
[   11.393047] task: c003a5692500 task.stack: c003a7774000
[   11.393051] NIP: c17bcffc LR: c17bd46c CTR: 
[   11.393054] REGS: c003a800 TRAP: 0700   Not tainted  (4.10.0-rc8-next-20170217-autotest)
[   11.393058] MSR: 8282b033 
[   11.393065]   CR: 48248282  XER: 0001
[   11.393070] CFAR: c17bcfcc SOFTE: 1
GPR00: c17bd42c c003aa80 c262ce00 d3fdd580
GPR04: d3fe07df 00010017 c17bcd50 
GPR08: 00053a09 0001 c254ce00 0001
GPR12: c1b56c40 cea81400 0020 d5081098
GPR16: c003ada0 c003adec  84a8
GPR20: d3fef000 d3fe2b28 c252dc90 0001
GPR24: c254d314  c25338f8 d3fe089f
GPR28:  d3fe1400 d3fdd578 d3fe07df
[   11.393115] NIP [c17bcffc] static_key_set_entries.isra.10+0x3c/0x50
[   11.393119] LR [c17bd46c] jump_label_module_notify+0x20c/0x420
[   11.393122] Call Trace:
[   11.393125] [c003aa80] [c17bd42c] jump_label_module_notify+0x1cc/0x420 (unreliable)
[   11.393132] [c003ab40] [c16b38e0] notifier_call_chain+0x90/0x100
[   11.393137] [c003ab90] [c16b3db0] __blocking_notifier_call_chain+0x60/0x90
[   11.393142] [c003abe0] [c17357bc] load_module+0x1c1c/0x2750
[   11.393147] [c003ad70] [c1736550] SyS_finit_module+0xc0/0xf0
[   11.393152] [c003ae30] [c15cb8e0] system_call+0x38/0xfc
[   11.393156] Instruction dump:
[   11.393158] 40c20018 e923 792907a0 7c844b78 f883 4e800020 3d42fff2 892a0514
[   11.393166] 2f89 40feffe0 3921 992a0514 <0fe0> 4bd0 6000 6000
[   11.393173] ---[ end trace a5f8fbc5d8226aec ]---

I have attached the boot log.

Thanks
-Sachin

dmesg_next_20170217.log
Description: Binary data


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-05 Thread Sachin Sant

>>> I've seen it on tip. It looks like hot unplug goes really slow when
>>> there's running tasks on the CPU being taken down.
>>> 
>>> What I did was something like:
>>> 
>>>  taskset -p $((1<<1)) $$
>>>  for ((i=0; i<20; i++)) do while :; do :; done & done
>>> 
>>>  taskset -p $((1<<0)) $$
>>>  echo 0 > /sys/devices/system/cpu/cpu1/online
>>> 
>>> And with those 20 tasks stuck sucking cycles on CPU1, the unplug goes
>>> _really_ slow and the RCU stall triggers. What I suspect happens is that
>>> hotplug stops participating in the RCU state machine early, but only
>>> tells RCU about it really late, and in between it gets suspicious it
>>> takes too long.
>>> 
>>> I've yet to dig through the RCU code to figure out the exact sequence of
>>> events, but found the above to be fairly reliable in triggering the
>>> issue.
> 
>> If you send me the full splat from the dmesg and the RCU portions of
>> .config, I will take a look.  Is this new behavior, or a new test?
> 

I have sent the required files to you via separate email.

> If new behavior, I would be most suspicious of these commits in -rcu which
> recently entered -tip:
> 
> 19e4d983cda1 rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions
> 913324b1364f rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle()
> fcdcfefafa45 rcu: Pull rcu_qs_ctr into rcu_dynticks structure
> 0919a0b7e7a5 rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure
> caa7c8e34293 rcu: Make rcu_note_context_switch() do deferred NOCB wakeups
> 41e4b159d516 rcu: Make rcu_all_qs() do deferred NOCB wakeups
> b457a3356a68 rcu: Make call_rcu() do deferred NOCB wakeups
> 
> Does reverting any of these help?

I tried reverting the above commits. That does not help; I can still
recreate the issue.

Thanks
-Sachin


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-02 Thread Sachin Sant

> On 02-Feb-2017, at 9:25 PM, Peter Zijlstra  wrote:
> 
> On Tue, Jan 31, 2017 at 10:22:47AM -0700, Ross Zwisler wrote:
>> On Tue, Jan 31, 2017 at 4:48 AM, Mike Galbraith  wrote:
>>> On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote:
> 
> 
> Could some of you test this? It seems to cure things in my (very)
> limited testing.
> 

I ran a few cycles of CPU hot(un)plug tests. In most cases it works, except
for one run where I ran into an RCU stall:

[  173.493453] INFO: rcu_sched detected stalls on CPUs/tasks:
[  173.493473]  8-...: (2 GPs behind) idle=006/140/0 softirq=0/0 fqs=2996
[  173.493476]  (detected by 0, t=6002 jiffies, g=885, c=884, q=6350)
[  173.493482] Task dump for CPU 8:
[  173.493484] cpuhp/8 R  running task0  3416  2 0x0884
[  173.493489] Call Trace:
[  173.493492] [c004f7b834a0] [c004f7b83560] 0xc004f7b83560 (unreliable)
[  173.493498] [c004f7b83670] [c0008d28] alignment_common+0x128/0x130
[  173.493503] --- interrupt: 600 at _raw_spin_lock+0x2c/0xc0
[  173.493503] LR = try_to_wake_up+0x204/0x5c0
[  173.493507] [c004f7b83960] [c004f4d8084c] 0xc004f4d8084c (unreliable)
[  173.493511] [c004f7b83990] [c00fef54] try_to_wake_up+0x204/0x5c0
[  173.493515] [c004f7b83a10] [c00e2b88] create_worker+0x148/0x250
[  173.493519] [c004f7b83ab0] [c00e6e1c] alloc_unbound_pwq+0x3bc/0x4c0
[  173.493522] [c004f7b83b10] [c00e7084] wq_update_unbound_numa+0x164/0x270
[  173.493526] [c004f7b83bb0] [c00e8990] workqueue_online_cpu+0x250/0x3b0
[  173.493529] [c004f7b83c70] [c00c2758] cpuhp_invoke_callback+0x148/0x5b0
[  173.493533] [c004f7b83ce0] [c00c2df8] cpuhp_up_callbacks+0x48/0x140
[  173.493536] [c004f7b83d30] [c00c3e98] cpuhp_thread_fun+0x148/0x180
[  173.493540] [c004f7b83d60] [c00f3930] smpboot_thread_fn+0x290/0x2a0
[  173.493544] [c004f7b83dc0] [c00edb3c] kthread+0x14c/0x190
[  173.493547] [c004f7b83e30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
[  243.913715] INFO: task kworker/0:2:380 blocked for more than 120 seconds.
[  243.913732]   Not tainted 4.10.0-rc6-next-20170202 #6
[  243.913735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.913738] kworker/0:2 D0   380  2 0x0800
[  243.913746] Workqueue: events vmstat_shepherd
[  243.913748] Call Trace:
[  243.913752] [c000ff07f820] [c011135c] enqueue_entity+0x81c/0x1200 (unreliable)
[  243.913757] [c000ff07f9f0] [c001a660] __switch_to+0x300/0x400
[  243.913762] [c000ff07fa50] [c08df4f4] __schedule+0x314/0xb10
[  243.913766] [c000ff07fb20] [c08dfd30] schedule+0x40/0xb0
[  243.913769] [c000ff07fb50] [c08e02b8] schedule_preempt_disabled+0x18/0x30
[  243.913773] [c000ff07fb70] [c08e1654] __mutex_lock.isra.6+0x1a4/0x660
[  243.913777] [c000ff07fc00] [c00c3828] get_online_cpus+0x48/0x90
[  243.913780] [c000ff07fc30] [c025fd78] vmstat_shepherd+0x38/0x150
[  243.913784] [c000ff07fc80] [c00e5794] process_one_work+0x1a4/0x4d0
[  243.913788] [c000ff07fd20] [c00e5b58] worker_thread+0x98/0x5a0
[  243.913791] [c000ff07fdc0] [c00edb3c] kthread+0x14c/0x190
[  243.913795] [c000ff07fe30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
[  243.913824] INFO: task drmgr:3413 blocked for more than 120 seconds.
[  243.913826]   Not tainted 4.10.0-rc6-next-20170202 #6
[  243.913829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.913831] drmgr   D0  3413   3114 0x00040080
[  243.913834] Call Trace:
[  243.913836] [c00257ff3380] [c00257ff3440] 0xc00257ff3440 (unreliable)
[  243.913840] [c00257ff3550] [c001a660] __switch_to+0x300/0x400
[  243.913844] [c00257ff35b0] [c08df4f4] __schedule+0x314/0xb10
[  243.913847] [c00257ff3680] [c08dfd30] schedule+0x40/0xb0
[  243.913851] [c00257ff36b0] [c08e4594] schedule_timeout+0x274/0x470
[  243.913855] [c00257ff37b0] [c08e0efc] wait_for_common+0x1ac/0x2c0
[  243.913858] [c00257ff3830] [c00c50e4] bringup_cpu+0x84/0xe0
[  243.913862] [c00257ff3860] [c00c2758] cpuhp_invoke_callback+0x148/0x5b0
[  243.913865] [c00257ff38d0] [c00c2df8] cpuhp_up_callbacks+0x48/0x140
[  243.913868] [c00257ff3920] [c00c5438] _cpu_up+0xe8/0x1c0
[  243.913872] [c00257ff3980] [c00c5630] do_cpu_up+0x120/0x150
[  243.913876] [c00257ff3a00] [c05c005c] cpu_subsys_online+0x5c/0xe0
[  243.913879] [c00257ff3a50] [c05b7d84] device_online+0xb4/0x120
[  243.913883] [c00257ff3a90] [c0093424] dlpar_online_cpu+0x144/0x1e0
[  243.913887] [c00257ff3b50] [c0093c08] dlpar_cpu_ad

Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-01-31 Thread Sachin Sant
Trimming the cc list.

>> I assume I should be worried?
> 
> Thanks for the report. No need to worry, the bug has existed for a
> while, this patch just turns on the warning ;-)
> 
> The following commit queued up in tip/sched/core should fix your
> issues (assuming you see the same callstack on all your powerpc
> machines):
> 
>  
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=1b1d62254df0fe42a711eb71948f915918987790

I still see this warning with today's linux-next running inside a PowerVM
LPAR on a POWER8 box. The stack trace is different from the one Michael
reported.

The easiest way to recreate this is to online/offline CPUs.

[  114.795609] rq->clock_update_flags < RQCF_ACT_SKIP
[  114.795621] [ cut here ]
[  114.795632] WARNING: CPU: 2 PID: 27 at kernel/sched/sched.h:804 set_next_entity+0xbc8/0xcc0
[  114.795634] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc rpadlpar_io rpaphp kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 ib_core ghash_generic xts gf128mul tpm_ibmvtpm tpm sg vmx_crypto pseries_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sr_mod sd_mod cdrom cxgb3 ibmvscsi ibmveth scsi_transport_srp mdio
[  114.795751]  dm_mirror dm_region_hash dm_log dm_mod
[  114.795762] CPU: 2 PID: 27 Comm: migration/2 Not tainted 4.10.0-rc6-next-20170131 #1
[  114.795765] task: c004fa2f8600 task.stack: c004fa49c000
[  114.795768] NIP: c0114ed8 LR: c0114ed4 CTR: c04a8cf0
[  114.795771] REGS: c004fa49f6a0 TRAP: 0700   Not tainted  (4.10.0-rc6-next-20170131)
[  114.795773] MSR: 82823033 
[  114.795787]   CR: 28004022  XER: 
[  114.795789] CFAR: c08ec5c4 SOFTE: 0 
GPR00: c0114ed4 c004fa49f920 c100dd00 0026 
GPR04:  0006 6574616470755f6b c11cdd00 
GPR08:  c0c6edb0 00015ef2 d6488538 
GPR12: 4400 ce801200 c00ecc38 c004fe064300 
GPR16:  0001  c0f27e08 
GPR20: c0f277c5  0004  
GPR24: c0015fba49f0 c0f27e08 c0ef9e80 c004fa49fb00 
GPR28: c0015fba4980 c0015fba49f0 c004f34c1000 c0015fba49f0 
[  114.795850] NIP [c0114ed8] set_next_entity+0xbc8/0xcc0
[  114.795855] LR [c0114ed4] set_next_entity+0xbc4/0xcc0
[  114.795857] Call Trace:
[  114.795862] [c004fa49f920] [c0114ed4] set_next_entity+0xbc4/0xcc0 (unreliable)
[  114.795869] [c004fa49f9d0] [c0119f4c] pick_next_task_fair+0xfc/0x6f0
[  114.795874] [c004fa49fae0] [c0104820] sched_cpu_dying+0x3c0/0x450
[  114.795880] [c004fa49fb80] [c00c1958] cpuhp_invoke_callback+0x148/0x5b0
[  114.795886] [c004fa49fbf0] [c00c3340] take_cpu_down+0xb0/0x110
[  114.795893] [c004fa49fc50] [c01a1e58] multi_cpu_stop+0x1a8/0x1e0
[  114.795899] [c004fa49fca0] [c01a20c4] cpu_stopper_thread+0x104/0x1e0
[  114.795905] [c004fa49fd60] [c00f2b90] smpboot_thread_fn+0x290/0x2a0
[  114.795911] [c004fa49fdc0] [c00ecd7c] kthread+0x14c/0x190
[  114.795919] [c004fa49fe30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
[  114.795921] Instruction dump:
[  114.795924] 0fe0 4bfff884 3d02fff2 89289ac5 2f89 40fef4ec 3921 3c62ffac
[  114.795936] 38633698 99289ac5 487d76b5 6000 <0fe0> 4bfff4cc eb9f0118 e93f0120
[  114.795948] ---[ end trace 5c822f32f967fbc5 ]---
[  123.059141] nr_pdflush_threads exported in /proc is scheduled for removal
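The first line of that log, `rq->clock_update_flags < RQCF_ACT_SKIP`, is the condition the new debugging code tests. The trace suggests the runqueue clock was read in `set_next_entity()`, reached from `sched_cpu_dying()` during CPU offline, without an `update_rq_clock()` call since the runqueue lock was pinned. The simplified C model below shows that bookkeeping. It assumes the flag values 0x01, 0x02 and 0x04 for `RQCF_REQ_SKIP`, `RQCF_ACT_SKIP` and `RQCF_UPDATED`, and that pinning the lock clears only the "updated" bit; the `*_sketch` names are hypothetical and the model omits locking entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values, mirroring the names in the warning. */
#define RQCF_REQ_SKIP 0x01u
#define RQCF_ACT_SKIP 0x02u
#define RQCF_UPDATED  0x04u

/* Hypothetical, heavily simplified runqueue. */
struct rq_sketch {
	unsigned int clock_update_flags;
	uint64_t clock;
	unsigned int warnings; /* counts would-be WARNs */
};

/* Pinning the lock forgets any earlier update, keeping only the
 * skip-request bits. */
static void rq_pin_lock_sketch(struct rq_sketch *rq)
{
	rq->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
}

/* Refresh the clock and record that it was refreshed under this pin,
 * unless skipping is active. */
static void update_rq_clock_sketch(struct rq_sketch *rq, uint64_t now)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;
	rq->clock = now;
	rq->clock_update_flags |= RQCF_UPDATED;
}

/* Reading the clock checks that it was updated (or that skipping was
 * activated) since the lock was pinned. */
static uint64_t rq_clock_sketch(struct rq_sketch *rq)
{
	if (rq->clock_update_flags < RQCF_ACT_SKIP)
		rq->warnings++; /* "rq->clock_update_flags < RQCF_ACT_SKIP" */
	return rq->clock;
}

/* Demo: a read with no prior update warns; update-then-read does not. */
static unsigned int clock_demo(void)
{
	struct rq_sketch rq = { 0 };

	rq_pin_lock_sketch(&rq);
	(void)rq_clock_sketch(&rq);        /* missing update: warns */
	rq_pin_lock_sketch(&rq);
	update_rq_clock_sketch(&rq, 100);
	uint64_t c = rq_clock_sketch(&rq); /* updated: no warning */
	assert(c == 100);
	return rq.warnings;                /* 1 */
}
```

Because pinning clears `RQCF_UPDATED`, an update made before the lock was (re)taken does not count, which is how a path like CPU offline can trip the check even if the clock was refreshed earlier.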

Thanks
-Sachin