On Tue, 19 Sep 2017 16:51:06 +0100 Marc Zyngier <marc.zyng...@arm.com> wrote:
> On 19/09/17 16:40, Yanko Kaneti wrote: > > On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote: > >> On 19/09/17 16:12, Yanko Kaneti wrote: > >>> Hello, > >>> > >>> Fedora rawhide config here. > >>> AMD FX-8370E > >>> > >>> Bisected a problem to: > >>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts > >>> actually using it) > >>> > >>> It seems to be causing stalls, short lived or long lived lockups > >>> very shortly after boot. Everything becomes jerky. > >>> > >>> The only visible in the log indication is something like : > >>> .... > >>> [ 59.802129] clocksource: timekeeping watchdog on CPU3: Marking > >>> clocksource 'tsc' as unstable because the skew is too large: > >>> [ 59.802134] clocksource: 'hpet' wd_now: > >>> 3326e7aa wd_last: 329956f8 mask: ffffffff [ 59.802137] > >>> clocksource: 'tsc' cs_now: 423662bc6f > >>> cs_last: 41dfc91650 mask: ffffffffffffffff [ 59.802140] tsc: > >>> Marking TSC unstable due to clocksource watchdog [ 59.802158] > >>> TSC found unstable after boot, most likely due to broken BIOS. > >>> Use 'tsc=unstable'. [ 59.802161] sched_clock: Marking unstable > >>> (59802142067, 15510)<-(59920871789, -118714277) [ 60.015604] > >>> clocksource: Switched to clocksource hpet [ 89.015994] INFO: > >>> NMI handler (perf_event_nmi_handler) took too long to run: > >>> 209.660 msecs [ 89.016003] perf: interrupt took too long > >>> (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to > >>> 1000 .... > >>> > >>> Just reverting that commit on top of linus mainline cures all the > >>> symptoms > >> > >> Interesting. Do you still get HPET interrupts? > > > > Sorry, I might need some basic help here (i.e where do I count > > them...) > > /proc/interrupts should display them. > > > After the watchdog switches the clocksource to hpet the system is > > still somewhat alive, so I'll guess some clock is still > > ticking.... > Probably, but I suspect they're not hitting the right CPU, hence the > lockups. > > Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off > the net for a few days. > > Thomas, any insight? Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0 can be correct: struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata); unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS; if (!cpu_mask) return -EINVAL; *apicid = (unsigned int)cpu_mask; cpumask_bits(effmsk)[0] = cpu_mask; Before that patch, this function wrote to the effective mask unconditionally. After, it only writes to effective_mask if it is already non-zero.