Re: [Kgdb-bugreport] KGDB/RT integration woes

Sergei Shtylyov Tue, 12 Jun 2007 18:43:20 -0700

Wessel, Jason wrote:

>>>>>>  I'm also getting this with RT patch applied on x86_64 SMP machine
>>>>>>(with low-latency desktop kernel) after hitting initial breakpoint:
>>>>>>
>>>>>>BUG: at kernel/softirq.c:647 __tasklet_action()
>>>>>>
>>>>>>Call Trace:
>>>>>>[<ffffffff8022e61a>] __tasklet_action+0xe7/0x138
>>>>>>[<ffffffff8022e693>] tasklet_action+0x28/0x2a
>>>>>>[<ffffffff8022e892>] ksoftirqd+0x149/0x1f3
>>>>>>[<ffffffff8022e749>] ksoftirqd+0x0/0x1f3
>>>>>>[<ffffffff8023d324>] kthread+0xdc/0x113
>>>>>>[<ffffffff8020adf8>] child_rip+0xa/0x12
>>>>>>[<ffffffff8023d44f>] kthread_create+0x6a/0x15c
>>>>>>[<ffffffff8023d248>] kthread+0x0/0x113
>>>>>>[<ffffffff8020adee>] child_rip+0x0/0x12
>>>>>>
>>>>>>---------------------------
>>>>>>| preempt count: 00000100 ]
>>>>>>| 0-level deep critical section nesting:
>>>>>>----------------------------------------

>>>   Ugh, this one was really nasty. The actual reason has turned out to be 
>>>that the KGDB's tasklet gets scheduled *before* per-CPU data gets 
>>>replicated for each CPU, therefore it modifies the .data.percpu 
>>>section itself.  But the tasklet is actually run *after* the 
>>>replication, so it gets into the tasklet lists on every CPU -- and so 
>>>I get that BUG on every CPU!  Any thoughts on how to avoid this 
>>>nuisance? :-/

>>    Looks like a design issue to me: KGDB (ab)uses tasklets 
>>before per-CPU data gets replicated. This only happens on 
>>x86_64 SMP machines because those don't have the exception stack 
>>set up by the time the initial breakpoint is hit.  What I don't 
>>understand yet is why these BUGs don't show up without the 
>>-rt patch...

    The mainline code has the same TASKLET_STATE_SCHED bit check and BUG, yet 
it didn't seem to give the trace -- I'll investigate today...

> Is this during the boot cycle or attaching afterwards?

    The former -- it's caused by the 'kgdbwait' option.

> The tasklet at
> runtime should only be used to break in initially.

    And it is.

>   It sounds like the problem might be elsewhere though.

    No, it lies exactly where I've described. Quoting the boot log with some 
of my printk() calls added:

Command line: BOOT_IMAGE=vmlinuz-headless ip=any root=/dev/nfs kgdbwait 
[EMAIL PROTECTED]/,@192.168.222.1/ console=ttyS0,115200n1
BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009ac00 (usable)
  BIOS-e820: 000000000009ac00 - 00000000000a0000 (reserved)
  BIOS-e820: 00000000000d4000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
  BIOS-e820: 000000007ff70000 - 000000007ff73000 (ACPI data)
  BIOS-e820: 000000007ff73000 - 000000007ff80000 (ACPI NVS)
  BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
  BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
  BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
__tasklet_common_schedule called on CPU0 with t = ffffffff80810ba0, head = 
ffffffff808ec5c8, nr = 5
Schedule tasklet with next = 0000000000000000, state = 1, count = 0, func = 
ffffffff80259598

    This is where kgdb_early_init() schedules the tasklet -- note the value of 
the 'head' argument: it's in the initial .data.percpu section.

KGDB cannot initialize I/O yet.
end_pfn_map = 1048576
[...]
Allocating PCI resources starting at 88000000 (gap: 80000000:7ec00000)
PERCPU: Allocating 34816 bytes of per cpu data

    This is where per-CPU data gets allocated and copied from the .data.percpu 
section. Our tasklet is still queued, so it gets into each CPU's data!
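
    To make the replication step concrete, here is a minimal userspace 
sketch in C.  It is not kernel code: the structures are cut down, and every 
name in it (percpu_template, percpu_copy, early_schedule, 
setup_per_cpu_areas_model, count_cpus_with_tasklet) is hypothetical and 
chosen for illustration only.  It assumes two CPUs, as in the log, and 
models the per-CPU setup as a plain byte copy of the template section.

```c
#include <string.h>

#define NR_CPUS 2   /* assumption: two CPUs, as in the boot log */

/* Cut-down stand-in for the kernel's tasklet structure. */
struct tasklet {
    struct tasklet *next;
    unsigned long state;    /* bit 0 models TASKLET_STATE_SCHED */
};

/* Cut-down stand-in for the per-CPU tasklet list head. */
struct tasklet_head {
    struct tasklet *list;
};

/* Models the initial .data.percpu section (the template). */
static struct tasklet_head percpu_template;

/* Models the per-CPU areas that are allocated later during boot. */
static struct tasklet_head percpu_copy[NR_CPUS];

static struct tasklet kgdb_tasklet;

/* Early boot: no per-CPU copy exists yet, so the template gets modified. */
static void early_schedule(struct tasklet *t)
{
    t->state |= 1UL;
    t->next = percpu_template.list;
    percpu_template.list = t;
}

/* Later in boot: every CPU's area starts as a byte copy of the template. */
static void setup_per_cpu_areas_model(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        memcpy(&percpu_copy[cpu], &percpu_template, sizeof(percpu_template));
}

/* Returns the number of per-CPU lists that ended up holding the tasklet. */
int count_cpus_with_tasklet(void)
{
    int n = 0;

    early_schedule(&kgdb_tasklet);
    setup_per_cpu_areas_model();
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (percpu_copy[cpu].list == &kgdb_tasklet)
            n++;
    return n;
}
```

    In this model count_cpus_with_tasklet() reports that both per-CPU list 
heads point at the same tasklet object, which is the situation the log 
shows after the "PERCPU: Allocating" line.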

Built 1 zonelists.  Total pages: 514985
Kernel command line: BOOT_IMAGE=vmlinuz-headless ip=any root=/dev/nfs kgdbwait 
[EMAIL PROTECTED]/,@192.168.222.1/ console=ttyS0,115200n1
kgdboe: local port 6443
kgdboe: local IP 192.168.222.22
kgdboe: interface eth0
kgdboe: remote port 6442
kgdboe: remote IP 192.168.222.1
kgdboe: remote ethernet address ff:ff:ff:ff:ff:ff
Initializing CPU#0
WARNING: experimental RCU implementation.
PID hash table entries: 4096 (order: 12, 32768 bytes)
Extended CMOS year: 2000
TSC calibrated against PM_TIMER
time.c: Detected 1794.068 MHz processor.
tasklet_action: &tasklet_vec = ffff810002c155c8
Execute tasklet on CPU0 with next = 0000000000000000, state = 1, count = 0, 
func = ffffffff80259598

    And here it is executed at last on CPU0.  Later it gets wrongly 
re-executed on CPU1 (note the zero state, meaning the TASKLET_STATE_SCHED bit 
was already cleared by that time):

[...]
SCSI subsystem initialized
Execute tasklet on CPU1 with next = 0000000000000000, state = 0, count = 0, 
func = ffffffff80259598
BUG: at kernel/softirq.c:667 __tasklet_action()

Call Trace:
  <IRQ>  [<ffffffff8022e5b4>] __tasklet_action+0x108/0x159
  [<ffffffff8022e663>] tasklet_action+0x5e/0x6c
  [<ffffffff8022e028>] ___do_softirq+0xb1/0x183
  [<ffffffff8022e13c>] __do_softirq+0x42/0x5b
  [<ffffffff80243388>] tick_periodic+0x71/0x73
  [<ffffffff8020b18c>] call_softirq+0x1c/0x28
  [<ffffffff8020cfb9>] do_softirq+0x3d/0xb4
  [<ffffffff8022e23c>] irq_exit+0x40/0x52
  [<ffffffff80216f76>] smp_apic_timer_interrupt+0x49/0x64
  [<ffffffff80208206>] default_idle+0x0/0x4c
  [<ffffffff8020ac36>] apic_timer_interrupt+0x66/0x70
  <EOI>  [<ffffffff8020823d>] default_idle+0x37/0x4c
  [<ffffffff802081bb>] enter_idle+0x22/0x24
  [<ffffffff8020840c>] cpu_idle+0x58/0xa3
  [<ffffffff808b094d>] start_secondary+0x2fe/0x30d

KGDB cannot initialize I/O yet.

    (The debug output could have been more verbose but it was hard to 
cut/paste the complete log that way; I've also skipped large irrelevant parts 
of it).
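
    The double execution seen in the log can be reproduced with another 
small userspace model in C.  Again, this is only a sketch under stated 
assumptions, not the kernel's or the -rt patch's actual code: 
tasklet_action_model, run_on_all_cpus and struct result are hypothetical 
names, the SCHED bit is modelled as a plain flag, and the failed check is 
merely counted instead of triggering BUG.

```c
#include <stddef.h>

#define NR_CPUS     2      /* assumption: two CPUs, as in the boot log */
#define STATE_SCHED 1UL    /* models the TASKLET_STATE_SCHED bit */

/* Cut-down stand-in for the kernel's tasklet structure. */
struct tasklet {
    struct tasklet *next;
    unsigned long state;
    void (*func)(void);
};

struct result {
    int runs;   /* how many times the tasklet body ran */
    int bugs;   /* how many times the SCHED check failed */
};

static struct result res;

static void tasklet_body(void)
{
    res.runs++;
}

/* Model of one CPU draining its own list: check and clear SCHED, then run. */
static void tasklet_action_model(struct tasklet **head)
{
    struct tasklet *t = *head;

    *head = NULL;
    while (t != NULL) {
        struct tasklet *next = t->next;

        if (t->state & STATE_SCHED) {
            t->state &= ~STATE_SCHED;
            t->func();
        } else {
            res.bugs++;     /* this is where the BUG would be reported */
        }
        t = next;
    }
}

/* The same tasklet sits on every CPU's list, as after the replication. */
struct result run_on_all_cpus(void)
{
    struct tasklet t = { NULL, STATE_SCHED, tasklet_body };
    struct tasklet *percpu_list[NR_CPUS];

    res.runs = 0;
    res.bugs = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        percpu_list[cpu] = &t;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        tasklet_action_model(&percpu_list[cpu]);
    return res;
}
```

    In this model the first CPU finds the bit set, clears it and runs the 
body; the second CPU finds the same tasklet with state 0 and takes the 
failure branch, matching the "state = 1" line for CPU0 and the "state = 0" 
line for CPU1 above.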

> The tasklet should run on a single
> CPU and that CPU will have an exception to put KGDB into the exception
> context where it should try to obtain control of all the other
> processors via an IPI.

    Yeah, that's clear.  But in reality, in an x86_64 SMP kernel, the tasklet 
gets executed on each CPU!

> Perhaps the RT code you have patched the kernel with has not changed the
> lock semantics in the kernel/kgdb.c to lock down all the processors?

    No, that I have fixed (and it would have caused a different kind of BUG).

> Jason.

WBR, Sergei
