ast: cursor flashing softlockups

2016-05-17 Thread Ming Lei
Hi,

On Tue, May 17, 2016 at 4:07 AM, Dann Frazier
 wrote:
> Hi,
>  I'm observing a soft lockup issue w/ the ASPEED controller on an
> arm64 server platform. This was originally seen on Ubuntu's 4.4
> kernel, but it is reproducible w/ vanilla 4.6-rc7 as well.
>
> [   32.792656] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s!
> [swapper/38:0]
>
> I observe this just once each time I boot into debian-installer (I'm
> using a serial console, but the ast module gets loaded during
> startup).

I have figured out that it is caused by 'mod_timer(timer, jiffies)' and
'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
when the issue happened.

Looks it is a real fbcon/vt issue, see following:

fbcon_init()
<-.con_init
  <-visual_init()

reset_terminal()
<-vc_init()

vc->vc_cur_blink_ms is just set in reset_terminal() from vc_init() path,
and ops->cur_blink_jiffies is figured out from vc->vc_cur_blink_ms
in fbcon_init().

And visual_init() is always run before vc_init(), so ops->cur_blink_jiffies
is initialized as zero and cause the soft lockup issue finally.

Thanks,
Ming

>
> perf shows that the CPU caught by the NMI is typically in code
> updating the cursor timer:
>
> -   16.92%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>- _raw_spin_unlock_irqrestore
>   + 16.87% mod_timer
>   + 0.05% cursor_timer_handler
> -   12.15%  swapper  [kernel.kallsyms]  [k] queue_work_on
>- queue_work_on
>   + 12.00% cursor_timer_handler
>   + 0.15% call_timer_fn
> +   10.98%  swapper  [kernel.kallsyms]  [k] run_timer_softirq
> -2.23%  swapper  [kernel.kallsyms]  [k] mod_timer
>- mod_timer
>   + 1.97% cursor_timer_handler
>   + 0.26% call_timer_fn
>
> During the same period, I can see that another CPU is actively
> executing the timer function:
>
> -   42.18%  kworker/u96:2  [kernel.kallsyms]  [k] ww_mutex_unlock
>- ww_mutex_unlock
>   - 40.70% ast_dirty_update
>ast_imageblit
>soft_cursor
>bit_cursor
>fb_flashcursor
>process_one_work
>worker_thread
>kthread
>ret_from_fork
>   + 1.48% ast_imageblit
> -   40.15%  kworker/u96:2  [kernel.kallsyms]  [k] __memcpy_toio
>- __memcpy_toio
>   + 31.54% ast_dirty_update
>   + 8.61% ast_imageblit
>
> Using the graph function tracer on fb_flashcursor(), I see that
> ast_dirty_update usually takes around 60 us, in which it makes 16
> calls to __memcpy_toio(). However, there is always one instance on
> every boot of the installer where ast_dirty_update() takes ~98 *ms* to
> complete, during which it makes 743 calls to __memcpy_toio(). While
> that  doesn't directly account for the full 22s, I do wonder if that
> maybe a smoking gun.
>
> fyi, this is being tracked at: https://bugs.launchpad.net/bugs/1574814
>
>   -dann


ast: cursor flashing softlockups

2016-05-17 Thread David Daney
On 05/17/2016 10:39 AM, David Daney wrote:
> I can confirm this.

I have a patch that I think is the proper fix.  You should see it soon 
(after I test it a bit more)

Thanks,
David Daney


[...]

>>>
>>> On Tue, May 17, 2016 at 4:07 AM, Dann Frazier
>>>  wrote:
 Hi,
   I'm observing a soft lockup issue w/ the ASPEED controller on an
 arm64 server platform. This was originally seen on Ubuntu's 4.4
 kernel, but it is reproducible w/ vanilla 4.6-rc7 as well.

 [   32.792656] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s!
 [swapper/38:0]

 I observe this just once each time I boot into debian-installer (I'm
 using a serial console, but the ast module gets loaded during
 startup).
>>>
>>> I have figured out that it is caused by 'mod_timer(timer, jiffies)' and
>>> 'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
>>> when the issue happened.
>>
>> Thanks for tracking this down.
>>
>> This softlockup looks to be caused by:
>>
>> commit 27a4c827c34ac4256a190cc9d24607f953c1c459
>> Author: Scot Doyle 
>> Date:   Thu Mar 26 13:56:38 2015 +
>>
>> fbcon: use the cursor blink interval provided by vt
>>
>> vt now provides a cursor blink interval via vc_data. Use this
>> interval instead of the currently hardcoded 200 msecs. Store
>> it in
>> fbcon_ops to avoid locking the console in cursor_timer_handler().
>>
>> Signed-off-by: Scot Doyle 
>> Acked-by: Pavel Machek 
>> Signed-off-by: Greg Kroah-Hartman 
>>
>> and
>>
>> commit bd63364caa8df38bad2b25b11b2a1b849475cce5
>> Author: Scot Doyle 
>> Date:   Thu Mar 26 13:54:39 2015 +
>>
>> vt: add cursor blink interval escape sequence
>>
>> Add an escape sequence to specify the current console's cursor
>> blink
>> interval. The interval is specified as a number of
>> milliseconds until
>> the next cursor display state toggle, from 50 to 65535.
>> /proc/loadavg
>> did not show a difference with a one msec interval, but the lower
>> bound is set to 50 msecs since slower hardware wasn't tested.
>>
>> Store the interval in the vc_data structure for later access
>> by fbcon,
>> initializing the value to fbcon's current hardcoded value of
>> 200 msecs.
>>
>> Signed-off-by: Scot Doyle 
>> Acked-by: Pavel Machek 
>> Signed-off-by: Greg Kroah-Hartman 
>>
>>
>>
>>> Looks it is a real fbcon/vt issue, see following:
>>>
>>> fbcon_init()
>>>  <-.con_init
>>><-visual_init()
>>>
>>> reset_terminal()
>>>  <-vc_init()
>>>
>>> vc->vc_cur_blink_ms is just set in reset_terminal() from vc_init() path,
>>> and ops->cur_blink_jiffies is figured out from vc->vc_cur_blink_ms
>>> in fbcon_init().
>>>
>>> And visual_init() is always run before vc_init(), so
>>> ops->cur_blink_jiffies
>>> is initialized as zero and cause the soft lockup issue finally.
>>>
>>> Thanks,
>>> Ming
>>>

 perf shows that the CPU caught by the NMI is typically in code
 updating the cursor timer:

 -   16.92%  swapper  [kernel.kallsyms]  [k]
 _raw_spin_unlock_irqrestore
 - _raw_spin_unlock_irqrestore
+ 16.87% mod_timer
+ 0.05% cursor_timer_handler
 -   12.15%  swapper  [kernel.kallsyms]  [k] queue_work_on
 - queue_work_on
+ 12.00% cursor_timer_handler
+ 0.15% call_timer_fn
 +   10.98%  swapper  [kernel.kallsyms]  [k] run_timer_softirq
 -2.23%  swapper  [kernel.kallsyms]  [k] mod_timer
 - mod_timer
+ 1.97% cursor_timer_handler
+ 0.26% call_timer_fn

 During the same period, I can see that another CPU is actively
 executing the timer function:

 -   42.18%  kworker/u96:2  [kernel.kallsyms]  [k] ww_mutex_unlock
 - ww_mutex_unlock
- 40.70% ast_dirty_update
 ast_imageblit
 soft_cursor
 bit_cursor
 fb_flashcursor
 process_one_work
 worker_thread
 kthread
 ret_from_fork
+ 1.48% ast_imageblit
 -   40.15%  kworker/u96:2  [kernel.kallsyms]  [k] __memcpy_toio
 - __memcpy_toio
+ 31.54% ast_dirty_update
+ 8.61% ast_imageblit

 Using the graph function tracer on fb_flashcursor(), I see that
 ast_dirty_update usually takes around 60 us, in which it makes 16
 calls to __memcpy_toio(). However, there is always one instance on
 every boot of the installer where ast_dirty_update() takes ~98 *ms* to
 complete, during which it makes 743 calls to __memcpy_toio(). While
 that  doesn't directly account for the full 22s, I do wonder if that
 maybe a smoking gun.

 fyi, this is being tracked at: https://bugs.launchpad.net/bugs/1574814

-dann
>>
>



ast: cursor flashing softlockups

2016-05-17 Thread David Daney
I can confirm this.  The cursor is blinking along nicely with a 200mS 
on/off time then with this patch installed:

diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
index 6e92917..7855446 100644
--- a/drivers/video/console/fbcon.c
+++ b/drivers/video/console/fbcon.c
@@ -402,6 +402,8 @@ static void cursor_timer_handler(unsigned long dev_addr)
struct fbcon_ops *ops = info->fbcon_par;

queue_work(system_power_efficient_wq, &info->queue);
+   if (WARN_ON(ops->cur_blink_jiffies < 10))
+   ops->cur_blink_jiffies = 200;
mod_timer(&ops->cursor_timer, jiffies + ops->cur_blink_jiffies);
  }

@@ -417,6 +419,8 @@ static void fbcon_add_cursor_timer(struct fb_info *info)

init_timer(&ops->cursor_timer);
ops->cursor_timer.function = cursor_timer_handler;
+   if (WARN_ON(ops->cur_blink_jiffies < 10))
+   ops->cur_blink_jiffies = 200;
ops->cursor_timer.expires = jiffies + ops->cur_blink_jiffies;
ops->cursor_timer.data = (unsigned long ) info;
add_timer(&ops->cursor_timer);
@@ -1096,6 +1100,7 @@ static void fbcon_init(struct vc_data *vc, int init)

ops = info->fbcon_par;
ops->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms);
+   WARN_ON(ops->cur_blink_jiffies < 10);
p->con_rotate = initial_rotation;
set_blitting_type(vc, info);

@@ -1310,6 +1315,7 @@ static void fbcon_cursor(struct vc_data *vc, int mode)
int c = scr_readw((u16 *) vc->vc_pos);

ops->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms);
+   WARN_ON(ops->cur_blink_jiffies < 10);

if (fbcon_is_inactive(vc, info) || vc->vc_deccm != 1)
return;

-

I get:


[   29.386066] [ cut here ]
[   29.386080] WARNING: CPU: 0 PID: 1688 at 
drivers/video/console/fbcon.c:1103 fbcon_init+0x47c/0x4b8
[   29.386145] Modules linked in: vfat(E) fat(E) aes_ce_blk(E) 
ablk_helper(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E) 
sha1_ce(E) sg(E) ip_tables(E) xfs(E) libcrc32c(E) nicvf(E) ast(E) 
i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) i2c_core(E) nicpf(E) 
thunder_bgx(E) mdio_thunder(E) mdio_cavium(E) dm_mirror(E) 
dm_region_hash(E) dm_log(E) dm_mod(E)
[   29.386147]
[   29.386153] CPU: 0 PID: 1688 Comm: systemd-logind Tainted: G 
E   4.6.0-rc3-arm64next+ #278
[   29.386155] Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX 
CRB-2S, BIOS 0.3 Apr 25 2016
[   29.386159] task: fe0fd1109d80 ti: fe0fd119c000 task.ti: 
fe0fd119c000
[   29.386163] PC is at fbcon_init+0x47c/0x4b8
[   29.386167] LR is at fbcon_init+0x144/0x4b8
[   29.386170] pc : [] lr : [] 
pstate: 8145
[   29.386172] sp : fe0fd119f900
[   29.386179] x29: fe0fd119f900 x28: fe0fdc059c00
[   29.386186] x27: fe001c7c8800 x26: fe0fd6027910
[   29.386192] x25: fe0fd6027800 x24: fc0009b56f98
[   29.386198] x23: 0001 x22: fc0008d1a000
[   29.386205] x21: fe0ff4062000 x20: fc0009b57018
[   29.386211] x19: fc0009b56000 x18: 02aae7bd51f0
[   29.386222] x17: 03ffb542f9f0 x16: fc0008242b80
[   29.386228] x15: 02aae7bdb328 x14: 0006
[   29.386234] x13: 02aae7bc72c8 x12: 
[   29.386240] x11:  x10: 
[   29.386247] x9 :  x8 : fe001c7c8b18
[   29.386253] x7 :  x6 : 007f
[   29.386259] x5 : fc0008c91640 x4 : 
[   29.386266] x3 : 0800 x2 : fe001c7c8ce8
[   29.386272] x1 : fe0fd4df4e00 x0 : 
[   29.386274]
[   29.386277] ---[ end trace 6c32ddc01008c9ba ]---
[   29.386280] Call trace:
[   29.386284] Exception stack(0xfe0fd119f730 to 0xfe0fd119f850)
[   29.386288] f720:   fc0009b56000 
fc0009b57018
[   29.386292] f740: fe0fd119f900 fc0008468930 8145 
003d
[   29.386296] f760: fc0008c36ce8 fc00081c2a30 fe0fd119f7f0 
fc00080f47d8
[   29.386299] f780: fe0fd119f800 fc00080f47d8 010c 
fe0fd1109d80
[   29.386303] f7a0:  0025810c fe0fd110a5a0 

[   29.386307] f7c0: fc000880d2e8  fe0fd119f850 
fc00080f47d8
[   29.386311] f7e0:  fe0fd4df4e00 fe001c7c8ce8 
0800
[   29.386314] f800:  fc0008c91640 007f 

[   29.386318] f820: fe001c7c8b18   

[   29.386321] f840:  02aae7bc72c8
[   29.386327] [] fbcon_init+0x47c/0x4b8
[   29.386332] [] visual_init+0xbc/0x114
[   29.386336] [] vc_allocate+0x108/0x1e0
[   29.386340] [] con_install+0x34/0x100
[

ast: cursor flashing softlockups

2016-05-17 Thread Peter Hurley
[ +to Scot Doyle ]

Scot, please take a look at this soft lockup.

Regards,
Peter Hurley


Hi Ming,

On 05/17/2016 02:12 AM, Ming Lei wrote:
> Hi,
> 
> On Tue, May 17, 2016 at 4:07 AM, Dann Frazier
>  wrote:
>> Hi,
>>  I'm observing a soft lockup issue w/ the ASPEED controller on an
>> arm64 server platform. This was originally seen on Ubuntu's 4.4
>> kernel, but it is reproducible w/ vanilla 4.6-rc7 as well.
>>
>> [   32.792656] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s!
>> [swapper/38:0]
>>
>> I observe this just once each time I boot into debian-installer (I'm
>> using a serial console, but the ast module gets loaded during
>> startup).
> 
> I have figured out that it is caused by 'mod_timer(timer, jiffies)' and
> 'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
> when the issue happened.

Thanks for tracking this down.

This softlockup looks to be caused by:

commit 27a4c827c34ac4256a190cc9d24607f953c1c459
Author: Scot Doyle 
Date:   Thu Mar 26 13:56:38 2015 +

fbcon: use the cursor blink interval provided by vt

vt now provides a cursor blink interval via vc_data. Use this
interval instead of the currently hardcoded 200 msecs. Store it in
fbcon_ops to avoid locking the console in cursor_timer_handler().

Signed-off-by: Scot Doyle 
Acked-by: Pavel Machek 
Signed-off-by: Greg Kroah-Hartman 

and

commit bd63364caa8df38bad2b25b11b2a1b849475cce5
Author: Scot Doyle 
Date:   Thu Mar 26 13:54:39 2015 +

vt: add cursor blink interval escape sequence

Add an escape sequence to specify the current console's cursor blink
interval. The interval is specified as a number of milliseconds 
until
the next cursor display state toggle, from 50 to 65535. 
/proc/loadavg
did not show a difference with a one msec interval, but the lower
bound is set to 50 msecs since slower hardware wasn't tested.

Store the interval in the vc_data structure for later access by 
fbcon,
initializing the value to fbcon's current hardcoded value of 200 
msecs.

Signed-off-by: Scot Doyle 
Acked-by: Pavel Machek 
Signed-off-by: Greg Kroah-Hartman 



> Looks it is a real fbcon/vt issue, see following:
> 
> fbcon_init()
> <-.con_init
>   <-visual_init()
> 
> reset_terminal()
> <-vc_init()
> 
> vc->vc_cur_blink_ms is just set in reset_terminal() from vc_init() path,
> and ops->cur_blink_jiffies is figured out from vc->vc_cur_blink_ms
> in fbcon_init().
> 
> And visual_init() is always run before vc_init(), so ops->cur_blink_jiffies
> is initialized as zero and cause the soft lockup issue finally.
> 
> Thanks,
> Ming
> 
>>
>> perf shows that the CPU caught by the NMI is typically in code
>> updating the cursor timer:
>>
>> -   16.92%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>- _raw_spin_unlock_irqrestore
>>   + 16.87% mod_timer
>>   + 0.05% cursor_timer_handler
>> -   12.15%  swapper  [kernel.kallsyms]  [k] queue_work_on
>>- queue_work_on
>>   + 12.00% cursor_timer_handler
>>   + 0.15% call_timer_fn
>> +   10.98%  swapper  [kernel.kallsyms]  [k] run_timer_softirq
>> -2.23%  swapper  [kernel.kallsyms]  [k] mod_timer
>>- mod_timer
>>   + 1.97% cursor_timer_handler
>>   + 0.26% call_timer_fn
>>
>> During the same period, I can see that another CPU is actively
>> executing the timer function:
>>
>> -   42.18%  kworker/u96:2  [kernel.kallsyms]  [k] ww_mutex_unlock
>>- ww_mutex_unlock
>>   - 40.70% ast_dirty_update
>>ast_imageblit
>>soft_cursor
>>bit_cursor
>>fb_flashcursor
>>process_one_work
>>worker_thread
>>kthread
>>ret_from_fork
>>   + 1.48% ast_imageblit
>> -   40.15%  kworker/u96:2  [kernel.kallsyms]  [k] __memcpy_toio
>>- __memcpy_toio
>>   + 31.54% ast_dirty_update
>>   + 8.61% ast_imageblit
>>
>> Using the graph function tracer on fb_flashcursor(), I see that
>> ast_dirty_update usually takes around 60 us, in which it makes 16
>> calls to __memcpy_toio(). However, there is always one instance on
>> every boot of the installer where ast_dirty_update() takes ~98 *ms* to
>> complete, during which it makes 743 calls to __memcpy_toio(). While
>> that  doesn't directly account for the full 22s, I do wonder if that
>> maybe a smoking gun.
>>
>> fyi, this is being tracked at: https://bugs.launchpad.net/bugs/1574814
>>
>>   -dann



ast: cursor flashing softlockups

2016-05-16 Thread Dann Frazier
Hi,
 I'm observing a soft lockup issue w/ the ASPEED controller on an
arm64 server platform. This was originally seen on Ubuntu's 4.4
kernel, but it is reproducible w/ vanilla 4.6-rc7 as well.

[   32.792656] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s!
[swapper/38:0]

I observe this just once each time I boot into debian-installer (I'm
using a serial console, but the ast module gets loaded during
startup).

perf shows that the CPU caught by the NMI is typically in code
updating the cursor timer:

-   16.92%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
  + 16.87% mod_timer
  + 0.05% cursor_timer_handler
-   12.15%  swapper  [kernel.kallsyms]  [k] queue_work_on
   - queue_work_on
  + 12.00% cursor_timer_handler
  + 0.15% call_timer_fn
+   10.98%  swapper  [kernel.kallsyms]  [k] run_timer_softirq
-2.23%  swapper  [kernel.kallsyms]  [k] mod_timer
   - mod_timer
  + 1.97% cursor_timer_handler
  + 0.26% call_timer_fn

During the same period, I can see that another CPU is actively
executing the timer function:

-   42.18%  kworker/u96:2  [kernel.kallsyms]  [k] ww_mutex_unlock
   - ww_mutex_unlock
  - 40.70% ast_dirty_update
   ast_imageblit
   soft_cursor
   bit_cursor
   fb_flashcursor
   process_one_work
   worker_thread
   kthread
   ret_from_fork
  + 1.48% ast_imageblit
-   40.15%  kworker/u96:2  [kernel.kallsyms]  [k] __memcpy_toio
   - __memcpy_toio
  + 31.54% ast_dirty_update
  + 8.61% ast_imageblit

Using the graph function tracer on fb_flashcursor(), I see that
ast_dirty_update usually takes around 60 us, in which it makes 16
calls to __memcpy_toio(). However, there is always one instance on
every boot of the installer where ast_dirty_update() takes ~98 *ms* to
complete, during which it makes 743 calls to __memcpy_toio(). While
that  doesn't directly account for the full 22s, I do wonder if that
maybe a smoking gun.

fyi, this is being tracked at: https://bugs.launchpad.net/bugs/1574814

  -dann