On Wed, Apr 19, 2017 at 08:39:05PM +0100, Ben Hutchings wrote:
> On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
> [...]
> > Could CPU hotplug be buggy in 3.16, with Xen triggering this bug after 5
> > minutes even without any 'xl vcpu-set'?
> 
> The MCE polling timer for each CPU runs every 5 minutes, so this is
> presumably the first time it runs.  Perhaps this domain is configured
> such that CPUs are hot-removed shortly after boot?

I didn't explicitly set anything like that, but I guess it could also be
a default configuration in Xen.

> In the first crash, it looks like the timer for CPU x!=0 is being
> called on CPU 0.  In general this can happen if CPU x is hot-removed;
> its timers are migrated to another CPU.  This should *not* be possible
> with the MCE timer, as there is a hotplug callback that removes the
> timer when a CPU is removed.  There is a check for the timer having
> been migrated anyway, which triggers the WARNING.  The timer function
> then tries to re-add the timer for the current CPU, but that's still
> pending, which triggers the BUG.  Either the hotplug callback was not
> called, or the timer was migrated before being removed resulting in a
> race condition.
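
For anyone following along, the sequence described above can be modelled
with a small toy simulation. This is only a sketch in Python, not kernel
code: ToyKernel, Timer, KernelBug and simulate are made-up names, and it
only covers the "hotplug callback was not called" case, not the race
variant.

```python
class KernelBug(Exception):
    """Stands in for a kernel BUG_ON() firing."""


class Timer:
    def __init__(self, owner_cpu):
        self.owner_cpu = owner_cpu  # CPU this timer was set up for
        self.pending = False


class ToyKernel:
    def __init__(self, ncpus):
        self.timers = {cpu: Timer(cpu) for cpu in range(ncpus)}
        self.queues = {cpu: [] for cpu in range(ncpus)}
        self.warnings = []
        # Arm one polling timer per CPU, as at boot.
        for cpu in range(ncpus):
            self.add_timer_on(self.timers[cpu], cpu)

    def add_timer_on(self, timer, cpu):
        # Adding a timer that is already pending is treated as a bug.
        if timer.pending:
            raise KernelBug("timer for CPU %d already pending" % timer.owner_cpu)
        timer.pending = True
        self.queues[cpu].append(timer)

    def hot_remove(self, cpu, run_callback):
        # Toy model only: assumes cpu != 0, since CPU 0 receives the timers.
        timer = self.timers[cpu]
        if run_callback and timer.pending:
            # The hotplug callback deletes the timer before the CPU goes away.
            self.queues[cpu].remove(timer)
            timer.pending = False
        # Whatever is still queued on the removed CPU migrates to CPU 0.
        self.queues[0].extend(self.queues.pop(cpu))

    def run_expired(self, cpu):
        # Run every timer currently queued on this CPU.
        expired, self.queues[cpu] = self.queues[cpu], []
        for timer in expired:
            timer.pending = False
            self.timer_fn(cpu, timer.owner_cpu)

    def timer_fn(self, cpu, data):
        # Check for a migrated timer (the WARNING in the trace).
        if cpu != data:
            self.warnings.append("timer for CPU %d ran on CPU %d" % (data, cpu))
        # Re-arm the timer of the CPU we are running on.
        self.add_timer_on(self.timers[cpu], cpu)


def simulate(run_callback):
    """Hot-remove CPU 1, then let the timers on CPU 0 expire."""
    kernel = ToyKernel(ncpus=2)
    kernel.hot_remove(1, run_callback)
    try:
        kernel.run_expired(0)
    except KernelBug:
        return kernel.warnings, True
    return kernel.warnings, False
```

With simulate(True) the callback removes CPU 1's timer and nothing goes
wrong. With simulate(False) the migrated timer runs on CPU 0, records the
warning, and then hits the "already pending" check when it tries to
re-arm CPU 0's own timer.
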
> 
> > With "maxvcpus" set larger than "vcpus", xl vcpu-set seems to work most of
> > the time (between 1 and 16 vcpus), but after several tries, I got the
> > attached trace.
> 
> I'm not sure what's going on in this crash, but as it's a null
> dereference in migrate_timer_list it seems somewhat related.
> 
> I didn't find any changes that would explain how this was fixed between
> 4.0 and 4.2.  I suggest you work around it by adding 'nomce' to the
> kernel command line as I would expect Xen or dom0 to handle MCEs.

Thanks a lot Ben, I can't reproduce the issue with 'nomce'.
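
In case it helps someone else testing the workaround, here is a minimal
sketch for confirming that the running kernel was actually booted with
'nomce'. It assumes a Linux guest where /proc/cmdline is readable; the
function names are made up for illustration.

```python
def has_nomce(cmdline):
    """Return True if 'nomce' appears as its own parameter in a kernel command line."""
    return "nomce" in cmdline.split()


def running_kernel_has_nomce(path="/proc/cmdline"):
    """Check the command line of the currently running kernel."""
    with open(path) as f:
        return has_nomce(f.read())
```

For example, has_nomce("root=/dev/xvda1 ro nomce") is true, while a
command line without that parameter gives false.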

Thanks,
Vincent
