Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-11-21 Thread Peter Maydell
On Fri, 19 Nov 2021 at 20:56, Garrick Toubassi  wrote:
> But the question remains: if you have pages which are both executable and 
> written to, is it expected that the writes will be persistently slow?  Or 
> is that a possible (fringe) bug?

I think that is just always going to be slow, because we generally work
at page granularity for access permissions and for deciding whether we
need to take slow paths, so we can't have a page that is fast-path
writable and at the same time has generated translated code.
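To illustrate the effect, here is a toy Python model of page-granularity dirty tracking (my own sketch, not QEMU's actual code; all names are invented): once a page holds translated code, every write to that page takes the slow, invalidation-checking path.

```python
PAGE_SIZE = 4096  # track permissions at page granularity, as QEMU does

class ToyCore:
    """Toy model: pages that contain translated code force slow writes."""

    def __init__(self):
        self.code_pages = set()  # pages known to hold translated code
        self.slow_writes = 0     # writes that had to check for invalidation
        self.fast_writes = 0     # ordinary fast-path writes

    def translate(self, pc):
        # Translating a block marks its page, loosely analogous to
        # clearing the page's dirty flag so later writes take a slow path.
        self.code_pages.add(pc // PAGE_SIZE)

    def write(self, addr):
        if addr // PAGE_SIZE in self.code_pages:
            self.slow_writes += 1  # slow path on every single write
        else:
            self.fast_writes += 1

core = ToyCore()
core.translate(0x7C00)       # boot block code at 0x7c00 gets translated
for _ in range(1_000_000):
    core.write(0x7BE0)       # spin-loop variable lives on the same page
print(core.slow_writes, core.fast_writes)  # → 1000000 0
```

In this model, as in the report above, moving the written address onto a page with no translated code would make every write take the fast path.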

-- PMM



Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-11-19 Thread Garrick Toubassi
I debugged this a bit more and am closer to an explanation.  Some crude
rdtsc cycle profiling led me to the fact that notdirty_write is called a
ton in the slow case, and, in aggregate, takes up most of the time.
Running qemu with "-trace memory_notdirty_write_access" shows that in the
slow case it is logging this millions of times:

memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4
memory_notdirty_write_access 0x7be0 ram_addr 0x7be0 size 4

This is almost certainly the local variable used for the spin loop (which
iterates 10M times).  Note the address is 0x7be0 which shares the same page
as the boot block (0x7c00).  I am setting the stack to 0x7c00 to grow down
from the boot block based on a suggestion for early bootstrapping from
wiki.osdev.org/MBR_(x86).
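For concreteness, a quick check (assuming the standard 4 KiB page size) shows why the spin variable and the boot block collide, and why moving the stack to 0x6900 helps:

```python
PAGE_SIZE = 4096  # standard x86 small-page size

def page_of(addr):
    """Page number containing addr."""
    return addr // PAGE_SIZE

assert page_of(0x7BE0) == page_of(0x7C00)  # spin variable shares the boot-block page
assert page_of(0x6900) != page_of(0x7C00)  # relocated stack is on its own page
print(hex(page_of(0x7BE0) * PAGE_SIZE))    # → 0x7000, base of the shared page
```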

So I am doing a lot of writes against a page which is also executable,
which I expect is part of the problem.  The reason it's so fast in user
mode has nothing to do with privilege level; it's just that I happen to be
using a stack that is not on a page shared with executable code.  In fact,
if I change the slow version's stack so it does not share a page with
executable code (just moving it to 0x6900), it runs fast.  Again, this has
nothing to do with privilege level.

This is easily avoided on my end and of course having pages that are both
executable and written to is not a common practice.

But the question remains: if you have pages which are both executable and
written to, is it expected that the writes will be persistently slow?  Or
is that a possible (fringe) bug?

Thank you!

gt




Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-11-12 Thread Garrick Toubassi
Thanks Alex!

Thanks for the pointer to gdbstub.  I just single-stepped through the loop
and see it behaving "as expected", which leads me to believe the
performance issue doesn't show up in the execution of the client code.  But
it sounds like you are saying you see evidence of it executing at 0x9fffc?
Can you elaborate?

Here's what I did, let me know if I'm misunderstanding.  I ran

% qemu-system-x86_64 -s -S ./kernel.img

Then

% gdb ./kernel.img
(gdb) target remote localhost:1234

Then set a breakpoint at the spin() function (see source):
(gdb) b *0x7e00
(gdb) cont

At that point I stepped through the loop several times and it behaved as
expected.  I even "let it rip" with:

(gdb) while 1
 > x/i $rip
 > stepi
 > end

And it stayed well behaved operating on the "client" code as I'd expect.

My next step would be to step through the emulator itself but it sounds
like you are seeing something that would short circuit that labor intensive
exercise.  Pointers appreciated!

gt



Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-11-12 Thread Alex Bennée


Hmm ok that's just a result of the code disappearing down a hole:

  0x0009fffc:  00 00                addb     %al, (%bx, %si)
  0x0009fffe:  00 00                addb     %al, (%bx, %si)
  0x000a0000:  ff                   .byte    0xff
  0x000a0001:  ff                   .byte    0xff

and as that code is being executed out of a place without a phys_pc we
don't cache the TB (by design). Usually this isn't a massive problem but
obviously something has gone wrong in the code to be executing these
junk instructions.

Have you traced the execution of your code via gdbstub?

-- 
Alex Bennée



Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-11-12 Thread Alex Bennée


Garrick Toubassi  writes:

> I went ahead and created a short repro case which can be found at 
> https://github.com/gtoubassi/qemu-spinrepro.  Would appreciate
> thoughts from anyone or guidance on how to debug.

Well something weird is going on that is chewing through the code
generation logic. If you run with:

 ./qemu-system-x86_64 -serial mon:stdio -kernel ~/Downloads/kernel.img

And then C-a c to bring up the monitor you can type "info jit" and see:

  (qemu) info jit
  Translation buffer state:
  gen code size   1063758051/1073736704
  TB count1
  TB avg target size  1 max=1 bytes
  TB avg host size64 bytes (expansion ratio: 64.0)
  cross page TB count 0 (0%)
  direct jump count   0 (0%) (2 jumps=0 0%)
  TB hash buckets 1/8192 (0.01% head buckets used)
  TB hash occupancy   0.00% avg chain occ. Histogram: [0.0,2.5)%|█ ▁|[22.5,25.0]%
  TB hash avg chain   1.000 buckets. Histogram: 1|█|1

  Statistics:
  TB flush count  1
  TB invalidate count 237
  TLB full flushes0
  TLB partial flushes 514
  TLB elided flushes  1748
  [TCG profiler not compiled]

the gen code size just grows and grows until an eventual flush. So it's
spending most of its time in the code generator. Weirdly it's only
translated one block but is happily flushing the code buffer quite
frequently:

  Statistics:
  TB flush count  14
  TB invalidate count 237
  TLB full flushes0
  TLB partial flushes 514
  TLB elided flushes  1748
  [TCG profiler not compiled]



-- 
Alex Bennée



Re: spin loop 100x faster in user mode (CPL=3) than superuser (CPL=0)?

2021-10-29 Thread Garrick Toubassi
I went ahead and created a short repro case which can be found at
https://github.com/gtoubassi/qemu-spinrepro.  Would appreciate thoughts
from anyone or guidance on how to debug.

On Tue, Oct 19, 2021 at 3:05 PM Garrick Toubassi 
wrote:

> Hello
>
> I have a mystery I haven't been able to run down and would appreciate any
> explanation or advice.
>
> On a mac/intel I am running qemu-system-x86_64 on a simple image which
> bootstraps into 64 bit long mode and then runs a simple spin loop
> (literally for (int i = 0; i < 1000; i++) {}).  This completes in ~5
> seconds of wall time.  After completion it then enters user mode (CPL=3)
> via a fabricated interrupt stack frame and an iretq, returning to the same
> spin loop.  In this case it runs about 100x faster.
>
> I at first thought maybe the TCG jit somehow isn't kicking in and maybe
> there is some pure interpretation going on but I've run with "-trace
> exec_tb -trace translate_block -d out_asm,guest_errors,nochain,int,plugin"
> and it seems to be running "translation blocks", just a lot more of them
> when running the slow loop (or to be more precise running one tb many more
> times according to exec_tb logging).  Upon inspection the relevant
> generated assembly is morally equivalent between the two as best I can
> tell.  Which implies to me it's something outside of the tb.  I was thinking
> perhaps it's regenerating the code every time, but logging doesn't show that.
>
> I was also wondering whether something about the MMU implementation might
> slow things down in user mode.  In this case both loops are running under
> the same GDT/page table which just happens to mark all pages as "user"
> pages so that when jumping to CPL=3 it will still run.
>
> I can package up a reproducible case if it's helpful but wanted to see if
> there is something obvious I am missing in terms of expected behavior
> before doing that.
>
> Thanks!
>
> gt
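As a footnote to the page-table setup described in the question above: marking every page as a "user" page means setting the U/S bit in each entry. A minimal sketch of the relevant x86-64 PTE flag bits (constant names are mine, for illustration):

```python
# Low flag bits of an x86-64 page-table entry (names invented for clarity).
PTE_PRESENT  = 1 << 0  # P:   mapping is valid
PTE_WRITABLE = 1 << 1  # R/W: writes allowed
PTE_USER     = 1 << 2  # U/S: accessible from CPL=3, not just CPL<3

# A page usable by both the CPL=0 and CPL=3 copies of the spin loop:
flags = PTE_PRESENT | PTE_WRITABLE | PTE_USER
print(hex(flags))  # → 0x7
```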