Re: Lockup with --accel tcg,thread=single

2019-10-01 Thread Alex Bennée


Paolo Bonzini  writes:

> On 01/10/19 10:42, Alex Bennée wrote:
>>
>> Paolo Bonzini  writes:
>>
>>> On 30/09/19 21:20, Alex Bennée wrote:
>>>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>>>> to say there shouldn't be a single work queue for thread=single.
>>>
>>> Indeed it doesn't.  I confused this with commit a8efa60633 ("cpus: run
>>> work items for all vCPUs if single-threaded", 2018-11-27).
>>>
>>> Are you going to make a patch to have a single work queue, or should
>>> I?
>>
>> What's the neatest way to do it? Are you thinking we just special-case
>> queue_work_on_cpu to use first_cpu when mttcg is not enabled?
>
> Yes, I cannot think of anything better.

And I am immediately stymied by the fact that cpus-common.c is a common
blob, so it can't have any differentiation for the SOFTMMU case in it.
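
For illustration, the shape of that special-casing would be something like
the sketch below (hypothetical, not the patch referenced next; it assumes
something like qemu_tcg_mttcg_enabled() were usable from cpus-common.c,
which is exactly what the common blob doesn't allow):

  static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
  {
      if (!qemu_tcg_mttcg_enabled()) {
          /* thread=single: funnel all work through first_cpu's queue */
          cpu = first_cpu;
      }

      qemu_mutex_lock(&cpu->work_mutex);
      /* ... append wi to cpu->queued_work_* exactly as today ... */
      qemu_mutex_unlock(&cpu->work_mutex);

      qemu_cpu_kick(cpu);
  }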

Did you have a look at:

  Date: Mon, 30 Sep 2019 17:38:15 +0100
  Message-ID: <87h84tloy0@linaro.org>

Is that too ugly?

>
> Paolo


--
Alex Bennée



Re: Lockup with --accel tcg,thread=single

2019-10-01 Thread Peter Maydell
On Mon, 30 Sep 2019 at 18:47, Paolo Bonzini  wrote:
> On 30/09/19 17:37, Peter Maydell wrote:
> > Side note -- this use of run_on_cpu() means that we now drop
> > the iothread lock within memory_region_snapshot_and_clear_dirty(),
> > which we didn't before. This means that a vCPU thread can now
> > get in and execute an access to the device registers of
> > hw/display/vga.c, updating its state in a way I suspect that the
> > device model code is not expecting... So maybe the right answer
> > here should be to come up with a fix for the race that 9458a9a1
> > addresses that doesn't require us to drop the iothread lock in
> > memory_region_snapshot_and_clear_dirty() ? Alternatively we need
> > to audit the callers and flag in the documentation that this
> > function has the unexpected side effect of briefly dropping the
> > iothread lock.
>
> Yes, this is intended.  There shouldn't be side effects other than
> possibly a one-frame flash for all callers.

This seems quite tricky to audit to me -- for instance is the code
in vga_draw_graphic() really designed for and expecting to cope with
races where information it reads from s->foo after the call to
memory_region_snapshot_and_clear_dirty() might have been updated
by the guest and not be consistent with the information it read
from s->bar before the call and cached in a local variable?
I think most people write device code assuming that while execution
is within a function in the device model the device won't have to
deal with possible reentrancy where another thread comes in and
alters the device state underfoot.
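
Concretely the worrying pattern is something like this (a purely
hypothetical sketch of the shape of the hazard -- the function name and
locals are made up, it is not actual vga.c code):

  static void illustrate_race(VGACommonState *s, hwaddr addr, hwaddr size)
  {
      DirtyBitmapSnapshot *snap;
      uint32_t cached_offset = s->line_offset;  /* read with the BQL held */

      /*
       * The BQL is now briefly dropped inside this call, so a vCPU thread
       * can execute a VGA register access and update s->line_offset,
       * s->start_addr, etc. in the middle of this function.
       */
      snap = memory_region_snapshot_and_clear_dirty(&s->vram, addr, size,
                                                    DIRTY_MEMORY_VGA);

      if (s->line_offset != cached_offset) {
          /*
           * The guest changed the register behind our back -- the kind of
           * reentrancy the drawing code was never written to expect.
           */
      }
      g_free(snap);
  }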

thanks
-- PMM



Re: Lockup with --accel tcg,thread=single

2019-10-01 Thread Paolo Bonzini
On 01/10/19 10:42, Alex Bennée wrote:
> 
> Paolo Bonzini  writes:
> 
>> On 30/09/19 21:20, Alex Bennée wrote:
>>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>>> to say there shouldn't be a single work queue for thread=single.
>>
>> Indeed it doesn't.  I confused this with commit a8efa60633 ("cpus: run
>> work items for all vCPUs if single-threaded", 2018-11-27).
>>
>> Are you going to make a patch to have a single work queue, or should
>> I?
> 
> What's the neatest way to do it? Are you thinking we just special-case
> queue_work_on_cpu to use first_cpu when mttcg is not enabled?

Yes, I cannot think of anything better.

Paolo




Re: Lockup with --accel tcg,thread=single

2019-10-01 Thread Alex Bennée


Paolo Bonzini  writes:

> On 30/09/19 21:20, Alex Bennée wrote:
>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>> to say there shouldn't be a single work queue for thread=single.
>
> Indeed it doesn't.  I confused this with commit a8efa60633 ("cpus: run
> work items for all vCPUs if single-threaded", 2018-11-27).
>
> Are you going to make a patch to have a single work queue, or should
> I?

What's the neatest way to do it? Are you thinking we just special-case
queue_work_on_cpu to use first_cpu when mttcg is not enabled?

--
Alex Bennée



Re: Lockup with --accel tcg,thread=single

2019-09-30 Thread Paolo Bonzini
On 30/09/19 21:20, Alex Bennée wrote:
> Does seem to imply the vCPU CPUState is where the queue is. That's not
> to say there shouldn't be a single work queue for thread=single.

Indeed it doesn't.  I confused this with commit a8efa60633 ("cpus: run
work items for all vCPUs if single-threaded", 2018-11-27).

Are you going to make a patch to have a single work queue, or should I?

Paolo




Re: Lockup with --accel tcg,thread=single

2019-09-30 Thread Alex Bennée


Paolo Bonzini  writes:

> On 30/09/19 17:37, Peter Maydell wrote:
>> Not sure currently what the best fix is here.
>
> Since thread=single uses just one queued work list, could it be as
> simple as

Does it? I thought this was the case but:

  static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
  {
  qemu_mutex_lock(&cpu->work_mutex);
  if (cpu->queued_work_first == NULL) {
  cpu->queued_work_first = wi;
  } else {
  cpu->queued_work_last->next = wi;
  }
  cpu->queued_work_last = wi;
  wi->next = NULL;
  wi->done = false;
  qemu_mutex_unlock(&cpu->work_mutex);

  qemu_cpu_kick(cpu);
  }

Does seem to imply the vCPU CPUState is where the queue is. That's not
to say there shouldn't be a single work queue for thread=single.

>
> diff --git a/cpus.c b/cpus.c
> index d2c61ff..314f9aa 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1539,7 +1539,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
>  cpu = first_cpu;
>  }
>
> -while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
> +while (cpu && !first_cpu->queued_work_first && !cpu->exit_request) {
>
>  atomic_mb_set(&tcg_current_rr_cpu, cpu);
>  current_cpu = cpu;
>
> or something like that?
>
>> Side note -- this use of run_on_cpu() means that we now drop
>> the iothread lock within memory_region_snapshot_and_clear_dirty(),
>> which we didn't before. This means that a vCPU thread can now
>> get in and execute an access to the device registers of
>> hw/display/vga.c, updating its state in a way I suspect that the
>> device model code is not expecting... So maybe the right answer
>> here should be to come up with a fix for the race that 9458a9a1
>> addresses that doesn't require us to drop the iothread lock in
>> memory_region_snapshot_and_clear_dirty() ? Alternatively we need
>> to audit the callers and flag in the documentation that this
>> function has the unexpected side effect of briefly dropping the
>> iothread lock.
>
> Yes, this is intended.  There shouldn't be side effects other than
> possibly a one-frame flash for all callers.
>
> Paolo


--
Alex Bennée



Re: Lockup with --accel tcg,thread=single

2019-09-30 Thread Paolo Bonzini
On 30/09/19 17:37, Peter Maydell wrote:
> Not sure currently what the best fix is here.

Since thread=single uses just one queued work list, could it be as simple as

diff --git a/cpus.c b/cpus.c
index d2c61ff..314f9aa 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1539,7 +1539,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
 cpu = first_cpu;
 }
 
-while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
+while (cpu && !first_cpu->queued_work_first && !cpu->exit_request) {
 
 atomic_mb_set(&tcg_current_rr_cpu, cpu);
 current_cpu = cpu;

or something like that?

> Side note -- this use of run_on_cpu() means that we now drop
> the iothread lock within memory_region_snapshot_and_clear_dirty(),
> which we didn't before. This means that a vCPU thread can now
> get in and execute an access to the device registers of
> hw/display/vga.c, updating its state in a way I suspect that the
> device model code is not expecting... So maybe the right answer
> here should be to come up with a fix for the race that 9458a9a1
> addresses that doesn't require us to drop the iothread lock in
> memory_region_snapshot_and_clear_dirty() ? Alternatively we need
> to audit the callers and flag in the documentation that this
> function has the unexpected side effect of briefly dropping the
> iothread lock.

Yes, this is intended.  There shouldn't be side effects other than
possibly a one-frame flash for all callers.

Paolo




Re: Lockup with --accel tcg,thread=single

2019-09-30 Thread Alex Bennée


Peter Maydell  writes:

> On Mon, 30 Sep 2019 at 14:17, Doug Gale  wrote:
>>
>> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>>
>> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
>> /usr/share/ovmf/OVMF.fd
>>
>> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
>> there is no hard drive, just the OVMF firmware locks it up. (I
>> narrowed it down to this from a much larger repro)
>
>> Peter Maydell helped me bisect it in IRC.
>>  Works fine at commit 1e8a98b53867f61
>>  Fails at commit 9458a9a1df1a4c7
>>
>> MTTCG works fine all the way up to master.
>
> Thanks for this bug report. I've reproduced it and think
> I have figured out what is going on here.
>
> Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
> TCG-vs-dirty-bitmap race by having the thread which is
> doing a memory access wait for the vCPU thread to finish
> its current TB using a no-op run_on_cpu() operation.
>
> In the case of the hang the thread doing the memory access
> is the iothread, like this:
>
> #14 0x55c150c0a98c in run_on_cpu (cpu=0x55c153801c60,
> func=0x55c150bbb542 <do_nothing>, data=...)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
> #15 0x55c150bbb58c in tcg_log_global_after_sync
> (listener=0x55c1538410c8) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
> #16 0x55c150c1fe18 in memory_global_after_dirty_log_sync () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
> #17 0x55c150c1e6b8 in memory_region_snapshot_and_clear_dirty
> (mr=0x55c1541e82b0, addr=0, size=192, client=0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
> #18 0x55c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
> #19 0x55c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
> #20 0x55c151052a83 in graphic_hw_update (con=0x55c1536dfaa0) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
> #21 0x55c151091490 in gd_refresh (dcl=0x55c1549af090) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
> #22 0x55c151056571 in dpy_refresh (s=0x55c1542f9d90) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
> #23 0x55c1510527f0 in gui_update (opaque=0x55c1542f9d90) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
> #24 0x55c1511ee67c in timerlist_run_timers
> (timer_list=0x55c15370c280) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
> #25 0x55c1511ee726 in qemu_clock_run_timers
> (type=QEMU_CLOCK_REALTIME) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
> #26 0x55c1511ee9e5 in qemu_clock_run_all_timers () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
> #27 0x55c1511ef181 in main_loop_wait (nonblocking=0) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
> #28 0x55c150db03fe in main_loop () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806
>
> run_on_cpu() adds the do_nothing function to the vCPU's queued-work list.
>
> However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
> has the basic structure:
>
>while (1) {
>for (each vCPU) {
>run this vCPU for a timeslice;
>}
>qemu_tcg_rr_wait_io_event();
>}
>
> and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().
> This is a problem, because the thing which causes us to stop running
> one vCPU when its timeslice ends and move on to the next is the
> tcg_kick_vcpu_timer -- and the iothread will never process that timer
> and kick the vcpu because it is currently blocked in run_on_cpu() !
>
> Not sure currently what the best fix is here.

We seem to be repeating ourselves because:

  a8efa60633575a2ee4dbf807a71cb44d44b0e0f8
  Author: Paolo Bonzini 
  AuthorDate: Wed Nov 14 12:36:57 2018 +0100
  cpus: run work items for all vCPUs if single-threaded

However, looking at the commit I can see we still have the problem of not
advancing to the next vCPU if the kick timer (or some other event) doesn't
bring execution to an exit. I suspect you could hit this with a Linux
guest too, but Linux is probably busy enough that vCPUs are always exiting
for one reason or another.

So options I can think of so far are:

1. Kick 'em all when not inter-vCPU

Something like this untested patch...

--8<---cut here---start->8---
1 file changed, 17 insertions(+), 5 deletions(-)
cpus.c | 22 +-

modified   cpus.c
@@ -949,8 +949,8 @@ static inline int64_t qemu_tcg_next_kick(void)
 return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + TCG_KICK_PERIOD;
 }

-/* Kick the currently round-robin scheduled vCPU */
-static void qemu_cpu_kick_rr_cpu(void)
+/* Kick the currently round-robin scheduled vCPU to next */
+static void qemu_cpu_kick_rr_
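
Purely as an illustrative sketch of the "kick 'em all" idea (hypothetical
code, not the -- truncated -- patch above, and the function name is made
up): kick every vCPU rather than only the currently scheduled one, so the
RR loop drops out of execution and drains the work queues without needing
the kick timer:

  /* Hypothetical sketch of option 1, not the patch above. */
  static void qemu_cpu_kick_all_rr_cpus(void)
  {
      CPUState *cpu;

      CPU_FOREACH(cpu) {
          cpu_exit(cpu);          /* ask the vCPU to leave its exec loop */
          qemu_cpu_kick(cpu);
      }
  }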

Re: Lockup with --accel tcg,thread=single

2019-09-30 Thread Peter Maydell
On Mon, 30 Sep 2019 at 14:17, Doug Gale  wrote:
>
> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>
> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
> /usr/share/ovmf/OVMF.fd
>
> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
> there is no hard drive, just the OVMF firmware locks it up. (I
> narrowed it down to this from a much larger repro)

> Peter Maydell helped me bisect it in IRC.
>  Works fine at commit 1e8a98b53867f61
>  Fails at commit 9458a9a1df1a4c7
>
> MTTCG works fine all the way up to master.

Thanks for this bug report. I've reproduced it and think
I have figured out what is going on here.

Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
TCG-vs-dirty-bitmap race by having the thread which is
doing a memory access wait for the vCPU thread to finish
its current TB using a no-op run_on_cpu() operation.

In the case of the hang the thread doing the memory access
is the iothread, like this:

#14 0x55c150c0a98c in run_on_cpu (cpu=0x55c153801c60,
func=0x55c150bbb542 <do_nothing>, data=...)
at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
#15 0x55c150bbb58c in tcg_log_global_after_sync
(listener=0x55c1538410c8) at
/home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
#16 0x55c150c1fe18 in memory_global_after_dirty_log_sync () at
/home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
#17 0x55c150c1e6b8 in memory_region_snapshot_and_clear_dirty
(mr=0x55c1541e82b0, addr=0, size=192, client=0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
#18 0x55c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
#19 0x55c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
#20 0x55c151052a83 in graphic_hw_update (con=0x55c1536dfaa0) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
#21 0x55c151091490 in gd_refresh (dcl=0x55c1549af090) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
#22 0x55c151056571 in dpy_refresh (s=0x55c1542f9d90) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
#23 0x55c1510527f0 in gui_update (opaque=0x55c1542f9d90) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
#24 0x55c1511ee67c in timerlist_run_timers
(timer_list=0x55c15370c280) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
#25 0x55c1511ee726 in qemu_clock_run_timers
(type=QEMU_CLOCK_REALTIME) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
#26 0x55c1511ee9e5 in qemu_clock_run_all_timers () at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
#27 0x55c1511ef181 in main_loop_wait (nonblocking=0) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
#28 0x55c150db03fe in main_loop () at
/home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806

run_on_cpu() adds the do_nothing function to the vCPU's queued-work list.

However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
has the basic structure:

   while (1) {
   for (each vCPU) {
   run this vCPU for a timeslice;
   }
   qemu_tcg_rr_wait_io_event();
   }

and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().
This is a problem, because the thing which causes us to stop running
one vCPU when its timeslice ends and move on to the next is the
tcg_kick_vcpu_timer -- and the iothread will never process that timer
and kick the vcpu because it is currently blocked in run_on_cpu() !
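
To spell out the shape of the deadlock, the RR thread's inner loop looks
roughly like this (simplified and annotated; not the literal cpus.c
source):

  while (1) {
      /* ... icount/timer housekeeping; restart the round robin ... */

      while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
          current_cpu = cpu;
          /*
           * tcg_cpu_exec(cpu) runs this vCPU's TBs and only returns when
           * something -- normally the tcg_kick_vcpu_timer firing in the
           * main loop -- forces the vCPU out of its execution loop.
           */
          cpu = CPU_NEXT(cpu);
      }

      /*
       * Work queues are drained only from here.  In this hang we never
       * get here: the thread that would fire the kick timer is the
       * iothread, and it is itself blocked in run_on_cpu() waiting for
       * its queued work item to be processed.
       */
      qemu_tcg_rr_wait_io_event();
  }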

Not sure currently what the best fix is here.

Side note -- this use of run_on_cpu() means that we now drop
the iothread lock within memory_region_snapshot_and_clear_dirty(),
which we didn't before. This means that a vCPU thread can now
get in and execute an access to the device registers of
hw/display/vga.c, updating its state in a way I suspect that the
device model code is not expecting... So maybe the right answer
here should be to come up with a fix for the race that 9458a9a1
addresses that doesn't require us to drop the iothread lock in
memory_region_snapshot_and_clear_dirty() ? Alternatively we need
to audit the callers and flag in the documentation that this
function has the unexpected side effect of briefly dropping the
iothread lock.

thanks
-- PMM