[PATCH 4/4] tests/qtest/migration-test: Remove duplicated test_postcopy from the test plan

2022-08-18 Thread Thomas Huth
test_postcopy() is currently run twice - which is just a waste of resources
and time. The commit d1a27b169b2d that introduced the duplicate talked about
renaming the "postcopy/unix" test, but apparently it forgot to remove the
old entry. Let's do that now.

Fixes: d1a27b169b ("tests: Add postcopy tls migration test")
Signed-off-by: Thomas Huth 
---
 tests/qtest/migration-test.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 7be321b62d..f63edd0bc8 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2461,7 +2461,6 @@ int main(int argc, char **argv)
 module_call_init(MODULE_INIT_QOM);
 
 if (has_uffd) {
-qtest_add_func("/migration/postcopy/unix", test_postcopy);
 qtest_add_func("/migration/postcopy/plain", test_postcopy);
 qtest_add_func("/migration/postcopy/recovery/plain",
test_postcopy_recovery);
-- 
2.31.1




[PATCH 3/4] tests/migration/i386: Speed up the i386 migration test (when using TCG)

2022-08-18 Thread Thomas Huth
When KVM is not available, the i386 migration test also runs in a rather
slow fashion, since the guest code takes a couple of seconds to print
the "B"s on the serial console, and the migration test has to wait for
this each time. Let's increase the frequency here, too, so that the
delays in the migration tests get smaller.

Signed-off-by: Thomas Huth 
---
 tests/migration/i386/a-b-bootblock.h | 12 ++--
 tests/migration/i386/a-b-bootblock.S |  1 +
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/migration/i386/a-b-bootblock.h 
b/tests/migration/i386/a-b-bootblock.h
index 7d459d4fde..b7b0fce2ee 100644
--- a/tests/migration/i386/a-b-bootblock.h
+++ b/tests/migration/i386/a-b-bootblock.h
@@ -4,17 +4,17 @@
  * the header and the assembler differences in your patch submission.
  */
 unsigned char x86_bootsect[] = {
-  0xfa, 0x0f, 0x01, 0x16, 0x74, 0x7c, 0x66, 0xb8, 0x01, 0x00, 0x00, 0x00,
+  0xfa, 0x0f, 0x01, 0x16, 0x78, 0x7c, 0x66, 0xb8, 0x01, 0x00, 0x00, 0x00,
   0x0f, 0x22, 0xc0, 0x66, 0xea, 0x20, 0x7c, 0x00, 0x00, 0x08, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xe4, 0x92, 0x0c, 0x02,
   0xe6, 0x92, 0xb8, 0x10, 0x00, 0x00, 0x00, 0x8e, 0xd8, 0x66, 0xb8, 0x41,
   0x00, 0x66, 0xba, 0xf8, 0x03, 0xee, 0xb3, 0x00, 0xb8, 0x00, 0x00, 0x10,
   0x00, 0xfe, 0x00, 0x05, 0x00, 0x10, 0x00, 0x00, 0x3d, 0x00, 0x00, 0x40,
-  0x06, 0x7c, 0xf2, 0xfe, 0xc3, 0x75, 0xe9, 0x66, 0xb8, 0x42, 0x00, 0x66,
-  0xba, 0xf8, 0x03, 0xee, 0xeb, 0xde, 0x66, 0x90, 0x00, 0x00, 0x00, 0x00,
-  0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0x9a, 0xcf, 0x00,
-  0xff, 0xff, 0x00, 0x00, 0x00, 0x92, 0xcf, 0x00, 0x27, 0x00, 0x5c, 0x7c,
-  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+  0x06, 0x7c, 0xf2, 0xfe, 0xc3, 0x80, 0xe3, 0x3f, 0x75, 0xe6, 0x66, 0xb8,
+  0x42, 0x00, 0x66, 0xba, 0xf8, 0x03, 0xee, 0xeb, 0xdb, 0x8d, 0x76, 0x00,
+  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0x00, 0x00,
+  0x00, 0x9a, 0xcf, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0x92, 0xcf, 0x00,
+  0x27, 0x00, 0x60, 0x7c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
diff --git a/tests/migration/i386/a-b-bootblock.S 
b/tests/migration/i386/a-b-bootblock.S
index 3f97f28023..3d464c7568 100644
--- a/tests/migration/i386/a-b-bootblock.S
+++ b/tests/migration/i386/a-b-bootblock.S
@@ -50,6 +50,7 @@ innerloop:
 jl innerloop
 
 inc %bl
+andb $0x3f,%bl
 jnz mainloop
 
 mov $66,%ax
-- 
2.31.1




[PATCH 1/4] tests/qtest/migration-test: Only wait for serial output where migration succeeds

2022-08-18 Thread Thomas Huth
Waiting for the serial output can take a couple of seconds - and since
we're doing a lot of migration tests, this time easily sums up to
multiple minutes. But if a test is supposed to fail, it does not make
much sense to wait for the source to be in the right state first, so
we can skip the waiting here. This way we can speed up all tests where
the migration is supposed to fail. In the gitlab-CI gcov-gprof test,
each of the migration tests now runs two minutes faster!

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Thomas Huth 
---
 tests/qtest/migration-test.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 520a5f917c..7be321b62d 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1307,7 +1307,9 @@ static void test_precopy_common(MigrateCommon *args)
 }
 
 /* Wait for the first serial output from the source */
-wait_for_serial("src_serial");
+if (args->result == MIG_TEST_SUCCEED) {
+wait_for_serial("src_serial");
+}
 
 if (!args->connect_uri) {
 g_autofree char *local_connect_uri =
-- 
2.31.1




[PATCH 0/4] Speed up migration tests

2022-08-18 Thread Thomas Huth
We are currently facing the problem that the "gcov-gprof" CI jobs
in the gitlab-CI are running way too long - which happens since
the migration-tests have been enabled there recently.

These patches now speed up the migration tests, so that the
CI job should be fine again.

This is how it looked before my modifications:

 https://gitlab.com/thuth/qemu/-/jobs/2888957948#L46
 ...
 5/243 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test  OK  1265.22s
 8/243 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-testOK  1138.82s

And this is how it looks after the patches have been applied:

 https://gitlab.com/thuth/qemu/-/jobs/2905108018#L48
 ...
 5/243 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test  OK   251.14s
 8/243 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-testOK   336.94s

That means the CI job is running ca. 30 minutes faster here now!

Thomas Huth (4):
  tests/qtest/migration-test: Only wait for serial output where
migration succeeds
  tests/migration/aarch64: Speed up the aarch64 migration test
  tests/migration/i386: Speed up the i386 migration test (when using
TCG)
  tests/qtest/migration-test: Remove duplicated test_postcopy from the
test plan

 tests/migration/aarch64/a-b-kernel.h | 10 +-
 tests/migration/i386/a-b-bootblock.h | 12 ++--
 tests/qtest/migration-test.c |  5 +++--
 tests/migration/aarch64/a-b-kernel.S |  3 +--
 tests/migration/i386/a-b-bootblock.S |  1 +
 5 files changed, 16 insertions(+), 15 deletions(-)

-- 
2.31.1




[PATCH 2/4] tests/migration/aarch64: Speed up the aarch64 migration test

2022-08-18 Thread Thomas Huth
The migration tests spend a lot of time waiting for a sign of life
from the guest on the serial console. The aarch64 migration code only
outputs "B"s every couple of seconds (at least it takes more than 4
seconds between each character on my x86 laptop). There are a lot
of migration tests, and if each test that checks for a successful
migration waits for these characters before and after migration, the
wait time sums up to multiple minutes! Let's use a shorter delay to
speed things up.

While we're at it, also remove a superfluous masking with 0xff - we're
reading and storing bytes, so the upper bits of the register do not
matter anyway.

With these changes, the test runs twice as fast on my laptop, decreasing
the total run time from approx. 8 minutes to only 4 minutes!
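To see why the tighter mask speeds things up, here is a small host-side simulation (an illustrative sketch, not part of the patch): with the counter masked to 0xff the loop falls through and prints a "B" every 256 passes, with 0x1f every 32, i.e. eight times as often.

```c
#include <stdint.h>

/*
 * Simulate the guest's outer-loop counter:
 *     add w5, w5, #1
 *     and w5, w5, #mask
 *     cmp w5, #0 ; bne mainloop
 * Returns how many passes of the main loop occur between two
 * serial outputs (the loop exits, i.e. the guest prints 'B',
 * only when the masked counter wraps back to zero).
 */
static unsigned passes_per_output(uint32_t mask)
{
    uint32_t w5 = 0;
    unsigned passes = 0;

    do {
        w5 = (w5 + 1) & mask;
        passes++;
    } while (w5 != 0);

    return passes;
}
```

The i386 patch later in this series applies the same trick with a 0x3f mask, making that guest's output four times as frequent.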

Signed-off-by: Thomas Huth 
---
 tests/migration/aarch64/a-b-kernel.h | 10 +-
 tests/migration/aarch64/a-b-kernel.S |  3 +--
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/tests/migration/aarch64/a-b-kernel.h 
b/tests/migration/aarch64/a-b-kernel.h
index 0a9b01137e..34e518d061 100644
--- a/tests/migration/aarch64/a-b-kernel.h
+++ b/tests/migration/aarch64/a-b-kernel.h
@@ -10,9 +10,9 @@ unsigned char aarch64_kernel[] = {
   0x03, 0x00, 0x80, 0x52, 0xe4, 0x03, 0x00, 0xaa, 0x83, 0x00, 0x00, 0x39,
   0x84, 0x04, 0x40, 0x91, 0x9f, 0x00, 0x01, 0xeb, 0xad, 0xff, 0xff, 0x54,
   0x05, 0x00, 0x80, 0x52, 0xe4, 0x03, 0x00, 0xaa, 0x83, 0x00, 0x40, 0x39,
-  0x63, 0x04, 0x00, 0x11, 0x63, 0x1c, 0x00, 0x12, 0x83, 0x00, 0x00, 0x39,
-  0x24, 0x7e, 0x0b, 0xd5, 0x84, 0x04, 0x40, 0x91, 0x9f, 0x00, 0x01, 0xeb,
-  0x2b, 0xff, 0xff, 0x54, 0xa5, 0x04, 0x00, 0x11, 0xa5, 0x1c, 0x00, 0x12,
-  0xbf, 0x00, 0x00, 0x71, 0x81, 0xfe, 0xff, 0x54, 0x43, 0x08, 0x80, 0x52,
-  0x43, 0x00, 0x00, 0x39, 0xf1, 0xff, 0xff, 0x17
+  0x63, 0x04, 0x00, 0x11, 0x83, 0x00, 0x00, 0x39, 0x24, 0x7e, 0x0b, 0xd5,
+  0x84, 0x04, 0x40, 0x91, 0x9f, 0x00, 0x01, 0xeb, 0x4b, 0xff, 0xff, 0x54,
+  0xa5, 0x04, 0x00, 0x11, 0xa5, 0x10, 0x00, 0x12, 0xbf, 0x00, 0x00, 0x71,
+  0xa1, 0xfe, 0xff, 0x54, 0x43, 0x08, 0x80, 0x52, 0x43, 0x00, 0x00, 0x39,
+  0xf2, 0xff, 0xff, 0x17
 };
diff --git a/tests/migration/aarch64/a-b-kernel.S 
b/tests/migration/aarch64/a-b-kernel.S
index 0225945348..a4103ecb71 100644
--- a/tests/migration/aarch64/a-b-kernel.S
+++ b/tests/migration/aarch64/a-b-kernel.S
@@ -53,7 +53,6 @@ innerloop:
 /* increment the first byte of each page by 1 */
 ldrb    w3, [x4]
 add w3, w3, #1
-and w3, w3, #0xff
 strb    w3, [x4]
 
 /* make sure QEMU user space can see consistent data as MMU is off */
@@ -64,7 +63,7 @@ innerloop:
 blt innerloop
 
 add w5, w5, #1
-and w5, w5, #0xff
+and w5, w5, #0x1f
 cmp w5, #0
 bne mainloop
 
-- 
2.31.1




Re: [PATCH] target/riscv: Use official extension names for AIA CSRs

2022-08-18 Thread Anup Patel
On Fri, Aug 19, 2022 at 10:24 AM Weiwei Li  wrote:
>
>
> On 2022/8/19 at 11:09 AM, Anup Patel wrote:
> > The arch review of AIA spec is completed and we now have official
> > extension names for AIA: Smaia (M-mode AIA CSRs) and Ssaia (S-mode
> > AIA CSRs).
> >
> > Refer to section 1.6 of the latest AIA v0.3.1 stable specification at
> > https://github.com/riscv/riscv-aia/releases/download/0.3.1-draft.32/riscv-interrupts-032.pdf
> >
> > Based on above, we update QEMU RISC-V to:
> > 1) Have separate config options for Smaia and Ssaia extensions
> > which replace RISCV_FEATURE_AIA in CPU features
> > 2) Not generate AIA INTC compatible string in virt machine
> >
> > Signed-off-by: Anup Patel 
> > Reviewed-by: Andrew Jones 
> > ---
> >   hw/intc/riscv_imsic.c |  4 +++-
> >   hw/riscv/virt.c   | 13 ++---
> >   target/riscv/cpu.c|  9 -
> >   target/riscv/cpu.h|  4 ++--
> >   target/riscv/cpu_helper.c | 30 ++
> >   target/riscv/csr.c| 30 --
> >   6 files changed, 57 insertions(+), 33 deletions(-)
> >
> > diff --git a/hw/intc/riscv_imsic.c b/hw/intc/riscv_imsic.c
> > index 8615e4cc1d..4d4d5b50ca 100644
> > --- a/hw/intc/riscv_imsic.c
> > +++ b/hw/intc/riscv_imsic.c
> > @@ -344,9 +344,11 @@ static void riscv_imsic_realize(DeviceState *dev, 
> > Error **errp)
> >
> >   /* Force select AIA feature and setup CSR read-modify-write callback 
> > */
> >   if (env) {
> > -riscv_set_feature(env, RISCV_FEATURE_AIA);
> >   if (!imsic->mmode) {
> > +rcpu->cfg.ext_ssaia = true;
> >   riscv_cpu_set_geilen(env, imsic->num_pages - 1);
> > +} else {
> > +rcpu->cfg.ext_smaia = true;
> >   }
> >   riscv_cpu_set_aia_ireg_rmw_fn(env, (imsic->mmode) ? PRV_M : PRV_S,
> > riscv_imsic_rmw, imsic);
> > diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> > index e779d399ae..b041b33afc 100644
> > --- a/hw/riscv/virt.c
> > +++ b/hw/riscv/virt.c
> > @@ -261,17 +261,8 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, 
> > int socket,
> >   qemu_fdt_add_subnode(mc->fdt, intc_name);
> >   qemu_fdt_setprop_cell(mc->fdt, intc_name, "phandle",
> >   intc_phandles[cpu]);
> > -if (riscv_feature(&s->soc[socket].harts[cpu].env,
> > -  RISCV_FEATURE_AIA)) {
> > -static const char * const compat[2] = {
> > -"riscv,cpu-intc-aia", "riscv,cpu-intc"
> > -};
> > -qemu_fdt_setprop_string_array(mc->fdt, intc_name, "compatible",
> > -  (char **)&compat, ARRAY_SIZE(compat));
> > -} else {
> > -qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
> > -"riscv,cpu-intc");
> > -}
> > +qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
> > +"riscv,cpu-intc");
> >   qemu_fdt_setprop(mc->fdt, intc_name, "interrupt-controller", 
> > NULL, 0);
> >   qemu_fdt_setprop_cell(mc->fdt, intc_name, "#interrupt-cells", 1);
> >
> > diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> > index d3fbaa..3cf0c86661 100644
> > --- a/target/riscv/cpu.c
> > +++ b/target/riscv/cpu.c
> > @@ -101,6 +101,8 @@ static const struct isa_ext_data isa_edata_arr[] = {
> >   ISA_EXT_DATA_ENTRY(zve64f, true, PRIV_VERSION_1_12_0, ext_zve64f),
> >   ISA_EXT_DATA_ENTRY(zhinx, true, PRIV_VERSION_1_12_0, ext_zhinx),
> >   ISA_EXT_DATA_ENTRY(zhinxmin, true, PRIV_VERSION_1_12_0, ext_zhinxmin),
> > +ISA_EXT_DATA_ENTRY(smaia, true, PRIV_VERSION_1_12_0, ext_smaia),
> > +ISA_EXT_DATA_ENTRY(ssaia, true, PRIV_VERSION_1_12_0, ext_ssaia),
> >   ISA_EXT_DATA_ENTRY(sscofpmf, true, PRIV_VERSION_1_12_0, ext_sscofpmf),
> >   ISA_EXT_DATA_ENTRY(sstc, true, PRIV_VERSION_1_12_0, ext_sstc),
> >   ISA_EXT_DATA_ENTRY(svinval, true, PRIV_VERSION_1_12_0, ext_svinval),
> > @@ -669,10 +671,6 @@ static void riscv_cpu_realize(DeviceState *dev, Error 
> > **errp)
> >   }
> >   }
> >
> > -if (cpu->cfg.aia) {
> > -riscv_set_feature(env, RISCV_FEATURE_AIA);
> > -}
> > -
> >   if (cpu->cfg.debug) {
> >   riscv_set_feature(env, RISCV_FEATURE_DEBUG);
> >   }
> > @@ -1058,7 +1056,8 @@ static Property riscv_cpu_extensions[] = {
> >   DEFINE_PROP_BOOL("x-j", RISCVCPU, cfg.ext_j, false),
> >   /* ePMP 0.9.3 */
> >   DEFINE_PROP_BOOL("x-epmp", RISCVCPU, cfg.epmp, false),
> > -DEFINE_PROP_BOOL("x-aia", RISCVCPU, cfg.aia, false),
> > +DEFINE_PROP_BOOL("x-smaia", RISCVCPU, cfg.ext_smaia, false),
> > +DEFINE_PROP_BOOL("x-ssaia", RISCVCPU, cfg.ext_ssaia, false),
> >
> >   DEFINE_PROP_END_OF_LIST(),
> >   };
> > diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
> > index 42edfa4558..15cad73def 100644
> > --- a/target/riscv/cpu.h
> > +++ b/target/riscv/cpu.h
> > @@ -85,7 +85,6 @@ enum 

Re: [PATCH for-7.2 v2 10/20] hw/ppc: set machine->fdt in spapr machine

2022-08-18 Thread David Gibson
On Fri, Aug 19, 2022 at 12:11:40PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 05/08/2022 19:39, Daniel Henrique Barboza wrote:
> > The pSeries machine never bothered with the common machine->fdt
> > attribute. We do all the FDT related work using spapr->fdt_blob.
> > 
> > We're going to introduce HMP commands to read and save the FDT, which
> > will rely on setting machine->fdt properly to work across all machine
> > archs/types.
> 
> Out of curiosity - why new HMP command, is not QOM'ing this ms::fdt property
> enough?

Huh.. I didn't think of that.  For dumpdtb you could be right, that
you might be able to use existing qom commands to extract the
property.  Would need to check that the size is handled properly,
fdt's are a bit weird in having their size "in band".

"info fdt" etc. obviously have additional funtionality in formatting
the contents more helpfully.
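The size being "in band" means a consumer only needs the first eight bytes of the blob to recover it. A minimal sketch (hypothetical helper names, not QEMU code) of what a client would do with a raw blob handed back over QOM:

```c
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu   /* first 32-bit header field, big-endian */

/* Load a big-endian 32-bit value from a blob. */
static uint32_t be32_ld(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Recover the in-band size of an fdt blob: per the devicetree spec,
 * the header starts with the magic, followed by "totalsize" at byte
 * offset 4. Returns 0 if the blob does not look like an fdt.
 */
static uint32_t fdt_blob_size(const uint8_t *blob)
{
    if (be32_ld(blob) != FDT_MAGIC) {
        return 0;
    }
    return be32_ld(blob + 4);
}
```

So in principle the client can derive the length from the blob itself; in real code libfdt's fdt_totalsize() does this.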


> Another thing is that on every HMP dump I'd probably rebuild the entire FDT
> for the reasons David explained. Thanks,

This would require per-machine hooks, however.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] target/riscv: Use official extension names for AIA CSRs

2022-08-18 Thread Weiwei Li



On 2022/8/19 at 11:09 AM, Anup Patel wrote:

The arch review of AIA spec is completed and we now have official
extension names for AIA: Smaia (M-mode AIA CSRs) and Ssaia (S-mode
AIA CSRs).

Refer to section 1.6 of the latest AIA v0.3.1 stable specification at
https://github.com/riscv/riscv-aia/releases/download/0.3.1-draft.32/riscv-interrupts-032.pdf

Based on above, we update QEMU RISC-V to:
1) Have separate config options for Smaia and Ssaia extensions
which replace RISCV_FEATURE_AIA in CPU features
2) Not generate AIA INTC compatible string in virt machine

Signed-off-by: Anup Patel 
Reviewed-by: Andrew Jones 
---
  hw/intc/riscv_imsic.c |  4 +++-
  hw/riscv/virt.c   | 13 ++---
  target/riscv/cpu.c|  9 -
  target/riscv/cpu.h|  4 ++--
  target/riscv/cpu_helper.c | 30 ++
  target/riscv/csr.c| 30 --
  6 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/hw/intc/riscv_imsic.c b/hw/intc/riscv_imsic.c
index 8615e4cc1d..4d4d5b50ca 100644
--- a/hw/intc/riscv_imsic.c
+++ b/hw/intc/riscv_imsic.c
@@ -344,9 +344,11 @@ static void riscv_imsic_realize(DeviceState *dev, Error 
**errp)
  
  /* Force select AIA feature and setup CSR read-modify-write callback */

  if (env) {
-riscv_set_feature(env, RISCV_FEATURE_AIA);
  if (!imsic->mmode) {
+rcpu->cfg.ext_ssaia = true;
  riscv_cpu_set_geilen(env, imsic->num_pages - 1);
+} else {
+rcpu->cfg.ext_smaia = true;
  }
  riscv_cpu_set_aia_ireg_rmw_fn(env, (imsic->mmode) ? PRV_M : PRV_S,
riscv_imsic_rmw, imsic);
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index e779d399ae..b041b33afc 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -261,17 +261,8 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int 
socket,
  qemu_fdt_add_subnode(mc->fdt, intc_name);
  qemu_fdt_setprop_cell(mc->fdt, intc_name, "phandle",
  intc_phandles[cpu]);
-if (riscv_feature(&s->soc[socket].harts[cpu].env,
-  RISCV_FEATURE_AIA)) {
-static const char * const compat[2] = {
-"riscv,cpu-intc-aia", "riscv,cpu-intc"
-};
-qemu_fdt_setprop_string_array(mc->fdt, intc_name, "compatible",
-  (char **)&compat, ARRAY_SIZE(compat));
-} else {
-qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
-"riscv,cpu-intc");
-}
+qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
+"riscv,cpu-intc");
  qemu_fdt_setprop(mc->fdt, intc_name, "interrupt-controller", NULL, 0);
  qemu_fdt_setprop_cell(mc->fdt, intc_name, "#interrupt-cells", 1);
  
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c

index d3fbaa..3cf0c86661 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -101,6 +101,8 @@ static const struct isa_ext_data isa_edata_arr[] = {
  ISA_EXT_DATA_ENTRY(zve64f, true, PRIV_VERSION_1_12_0, ext_zve64f),
  ISA_EXT_DATA_ENTRY(zhinx, true, PRIV_VERSION_1_12_0, ext_zhinx),
  ISA_EXT_DATA_ENTRY(zhinxmin, true, PRIV_VERSION_1_12_0, ext_zhinxmin),
+ISA_EXT_DATA_ENTRY(smaia, true, PRIV_VERSION_1_12_0, ext_smaia),
+ISA_EXT_DATA_ENTRY(ssaia, true, PRIV_VERSION_1_12_0, ext_ssaia),
  ISA_EXT_DATA_ENTRY(sscofpmf, true, PRIV_VERSION_1_12_0, ext_sscofpmf),
  ISA_EXT_DATA_ENTRY(sstc, true, PRIV_VERSION_1_12_0, ext_sstc),
  ISA_EXT_DATA_ENTRY(svinval, true, PRIV_VERSION_1_12_0, ext_svinval),
@@ -669,10 +671,6 @@ static void riscv_cpu_realize(DeviceState *dev, Error 
**errp)
  }
  }
  
-if (cpu->cfg.aia) {

-riscv_set_feature(env, RISCV_FEATURE_AIA);
-}
-
  if (cpu->cfg.debug) {
  riscv_set_feature(env, RISCV_FEATURE_DEBUG);
  }
@@ -1058,7 +1056,8 @@ static Property riscv_cpu_extensions[] = {
  DEFINE_PROP_BOOL("x-j", RISCVCPU, cfg.ext_j, false),
  /* ePMP 0.9.3 */
  DEFINE_PROP_BOOL("x-epmp", RISCVCPU, cfg.epmp, false),
-DEFINE_PROP_BOOL("x-aia", RISCVCPU, cfg.aia, false),
+DEFINE_PROP_BOOL("x-smaia", RISCVCPU, cfg.ext_smaia, false),
+DEFINE_PROP_BOOL("x-ssaia", RISCVCPU, cfg.ext_ssaia, false),
  
  DEFINE_PROP_END_OF_LIST(),

  };
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 42edfa4558..15cad73def 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -85,7 +85,6 @@ enum {
  RISCV_FEATURE_PMP,
  RISCV_FEATURE_EPMP,
  RISCV_FEATURE_MISA,
-RISCV_FEATURE_AIA,
  RISCV_FEATURE_DEBUG
  };
  
@@ -452,6 +451,8 @@ struct RISCVCPUConfig {

  bool ext_zve64f;
  bool ext_zmmul;
  bool ext_sscofpmf;
+bool ext_smaia;
+bool ext_ssaia;
  bool rvv_ta_all_1s;
  bool rvv_ma_all_1s;
  
@@ -472,7 +473,6 @@ struct RISCVCPUConfig {

  bool mmu;
  bool pmp;
  bool epmp;
-bool aia;
  bool 

Re: [PATCH v4 4/6] vdpa: Add asid parameter to vhost_vdpa_dma_map/unmap

2022-08-18 Thread Jason Wang
On Wed, Aug 10, 2022 at 1:04 AM Eugenio Perez Martin
 wrote:
>
> On Tue, Aug 9, 2022 at 9:21 AM Jason Wang  wrote:
> >
> > On Sat, Aug 6, 2022 at 12:39 AM Eugenio Pérez  wrote:
> > >
> > > So the caller can choose which ASID is the destination.
> > >
> > > No need to update the batch functions as they will always be called from
> > > memory listener updates at the moment. Memory listener updates will
> > > always update ASID 0, as it's the passthrough ASID.
> > >
> > > All vhost devices' ASIDs are 0 at this moment.
> > >
> > > Signed-off-by: Eugenio Pérez 
> > > ---
> > > v4: Add comment specifying behavior if device does not support _F_ASID
> > >
> > > v3: Deleted unneeded space
> > > ---
> > >  include/hw/virtio/vhost-vdpa.h |  8 +---
> > >  hw/virtio/vhost-vdpa.c | 25 +++--
> > >  net/vhost-vdpa.c   |  6 +++---
> > >  hw/virtio/trace-events |  4 ++--
> > >  4 files changed, 25 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/include/hw/virtio/vhost-vdpa.h 
> > > b/include/hw/virtio/vhost-vdpa.h
> > > index d85643..6560bb9d78 100644
> > > --- a/include/hw/virtio/vhost-vdpa.h
> > > +++ b/include/hw/virtio/vhost-vdpa.h
> > > @@ -29,6 +29,7 @@ typedef struct vhost_vdpa {
> > >  int index;
> > >  uint32_t msg_type;
> > >  bool iotlb_batch_begin_sent;
> > > +uint32_t address_space_id;
> > >  MemoryListener listener;
> > >  struct vhost_vdpa_iova_range iova_range;
> > >  uint64_t acked_features;
> > > @@ -42,8 +43,9 @@ typedef struct vhost_vdpa {
> > >  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > >  } VhostVDPA;
> > >
> > > -int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
> > > -   void *vaddr, bool readonly);
> > > -int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova, hwaddr size);
> > > +int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
> > > +   hwaddr size, void *vaddr, bool readonly);
> > > +int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr 
> > > iova,
> > > + hwaddr size);
> > >
> > >  #endif
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index 34922ec20d..3eb67b27b7 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -72,22 +72,24 @@ static bool 
> > > vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
> > >  return false;
> > >  }
> > >
> > > -int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
> > > -   void *vaddr, bool readonly)
> > > +int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
> > > +   hwaddr size, void *vaddr, bool readonly)
> > >  {
> > >  struct vhost_msg_v2 msg = {};
> > >  int fd = v->device_fd;
> > >  int ret = 0;
> > >
> > >  msg.type = v->msg_type;
> > > +msg.asid = asid; /* 0 if vdpa device does not support asid */
> >
> > So this comment is still kind of confusing.
> >
> > Does it mean the caller can guarantee that asid is 0 when ASID is not
> > supported?
>
> That's right.
>
> > Even if this is true, does it silently depend on the
> > behaviour that the asid field is extended from the reserved field of
> > the ABI?
> >
>
> I don't get this part.
>
> Regarding the ABI, the reserved bytes will be there whether the device
> supports asid or not, since the actual iotlb message is after the
> reserved field. And they were already zeroed by msg = {} on top of the
> function. So if the caller always sets asid = 0, there is no change on
> this part.
>
> > (I still wonder if it's better to avoid using msg.asid if the kernel
> > doesn't support that).
> >
>
> We can add a conditional on v->dev->backend_features & _F_ASID.
>
> But that is not the only case where ASID will not be used: If the vq
> group does not match with the supported configuration (like if CVQ is
> not in the independent group). This case is already handled by setting
> all vhost_vdpa of the virtio device to asid = 0, so adding that extra
> check seems redundant to me.

I see.

>
> I'm not against adding it though: It can prevent bugs. Since it would
> be a bug of qemu, maybe it's better to add an assertion?

I'd suggest adding a comment here.
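For illustration, the bug-preventing check discussed here could be sketched as below. The struct, helper, and feature-bit number are assumptions made for the sketch; the real definitions live in QEMU's hw/virtio/vhost-vdpa.h and the kernel's vhost UAPI headers.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative stand-in for the negotiated-features check; the real
 * feature bit is defined in the kernel's vhost UAPI (bit number here
 * is an assumption for the sketch). */
#define FAKE_BACKEND_F_IOTLB_ASID_BIT 3

struct fake_vhost_vdpa {
    uint64_t backend_features;
};

/*
 * The invariant discussed in the thread: a caller may only pass a
 * non-zero asid when the backend actually negotiated the ASID
 * feature; otherwise asid must be 0.
 */
static bool asid_is_valid(const struct fake_vhost_vdpa *v, uint32_t asid)
{
    return asid == 0 ||
           (v->backend_features & (1ULL << FAKE_BACKEND_F_IOTLB_ASID_BIT));
}
```

In vhost_vdpa_dma_map() this would become either an assert(asid_is_valid(v, asid)) or, as suggested, just a comment stating the invariant.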

Thanks

>
> Thanks!
>
> > Thanks
> >
> > >  msg.iotlb.iova = iova;
> > >  msg.iotlb.size = size;
> > >  msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
> > >  msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
> > >  msg.iotlb.type = VHOST_IOTLB_UPDATE;
> > >
> > > -   trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.iotlb.iova, 
> > > msg.iotlb.size,
> > > -msg.iotlb.uaddr, msg.iotlb.perm, 
> > > msg.iotlb.type);
> > > +trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.asid, msg.iotlb.iova,
> > > + msg.iotlb.size, msg.iotlb.uaddr, 
> > > msg.iotlb.perm,
> > > + 

Re: [PATCH v8 00/12] NIC vhost-vdpa state restore via Shadow CVQ

2022-08-18 Thread Jason Wang
On Thu, Aug 11, 2022 at 2:57 PM Eugenio Perez Martin
 wrote:
>
> On Tue, Aug 9, 2022 at 7:43 PM Eugenio Pérez  wrote:
> >
> > CVQ of net vhost-vdpa devices can be intercepted since the addition of 
> > x-svq.
> > The virtio-net device model is updated. The migration was blocked because
> > although the state can be migrated between VMMs it was not possible to 
> > restore
> > on the destination NIC.
> >
> > This series adds support for SVQ to inject external messages without the 
> > guest's
> > knowledge, so before the guest is resumed all the guest visible state is
> > restored. It is done using standard CVQ messages, so the vhost-vdpa device 
> > does
> > not need to learn how to restore it: As long as they have the feature, they
> > know how to handle it.
> >
> > This series needs fix [1] to be applied to achieve full live
> > migration.
> >
> > Thanks!
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2022-08/msg00325.html
> >
> > v8:
> > - Rename NetClientInfo load to start, so is symmetrical with stop()
> > - Delete copy of device's in buffer at vhost_vdpa_net_load
> >
> > v7:
> > - Remove accidental double free.
> >
> > v6:
> > - Move map and unmap of the buffers to the start and stop of the device. 
> > This
> >   implies more callbacks on NetClientInfo, but simplifies the SVQ CVQ code.
> > - Not assume that in buffer is sizeof(virtio_net_ctrl_ack) in
> >   vhost_vdpa_net_cvq_add
> > - Reduce the number of changes from previous versions
> > - Delete unused memory barrier
> >
> > v5:
> > - Rename s/start/load/
> > - Use independent NetClientInfo to only add load callback on cvq.
> > - Accept out sg instead of dev_buffers[] at vhost_vdpa_net_cvq_map_elem
> > - Use only out size instead of iovec dev_buffers to know if the descriptor 
> > is
> >   effectively available, allowing to delete artificial !NULL 
> > VirtQueueElement
> >   on vhost_svq_add call.
> >
> > v4:
> > - Actually use NetClientInfo callback.
> >
> > v3:
> > - Route vhost-vdpa start code through NetClientInfo callback.
> > - Delete extra vhost_net_stop_one() call.
> >
> > v2:
> > - Fix SIGSEGV dereferencing SVQ when not in svq mode
> >
> > v1 from RFC:
> > - Do not reorder DRIVER_OK & enable patches.
> > - Delete leftovers
> >
> > Eugenio Pérez (12):
> >   vhost: stop transfer elem ownership in vhost_handle_guest_kick
> >   vhost: use SVQ element ndescs instead of opaque data for desc
> > validation
> >   vhost: Delete useless read memory barrier
> >   vhost: Do not depend on !NULL VirtQueueElement on vhost_svq_flush
> >   vhost_net: Add NetClientInfo prepare callback
> >   vhost_net: Add NetClientInfo stop callback
> >   vdpa: add net_vhost_vdpa_cvq_info NetClientInfo
> >   vdpa: Move command buffers map to start of net device
> >   vdpa: extract vhost_vdpa_net_cvq_add from
> > vhost_vdpa_net_handle_ctrl_avail
> >   vhost_net: add NetClientState->load() callback
> >   vdpa: Add virtio-net mac address via CVQ at start
> >   vdpa: Delete CVQ migration blocker
> >
> >  include/hw/virtio/vhost-vdpa.h |   1 -
> >  include/net/net.h  |   6 +
> >  hw/net/vhost_net.c |  17 +++
> >  hw/virtio/vhost-shadow-virtqueue.c |  27 ++--
> >  hw/virtio/vhost-vdpa.c |  14 --
> >  net/vhost-vdpa.c   | 225 ++---
> >  6 files changed, 178 insertions(+), 112 deletions(-)
> >
> > --
> > 2.31.1
> >
> >
> >
>
> Hi Jason,
>
> Should I send a new version of this series with the changes you
> proposed, or can they be done at pull time? (Mostly changes in patch
> messages).

A new series please.


> Can you confirm to me that there is no other action I need
> to perform?

No other from my side.

Thanks

>
> Thanks!
>




Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-18 Thread Hugh Dickins
On Fri, 19 Aug 2022, Sean Christopherson wrote:
> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > On Wed, 6 Jul 2022, Chao Peng wrote:
> > > But since then, TDX in particular has forced an effort into preventing
> > > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > > 
> > > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> But QEMU and other VMMs are users of shmem and memfd.  The new features
> certainly aren't useful for _all_ existing users, but I don't think it's
> fair to say that they're not useful for _any_ existing users.

Okay, I stand corrected: there exist some users of memfd_create()
who will also have use for "INACCESSIBLE" memory.

> 
> > > What use do you have for a filesystem here?  Almost none.
> > > IIUC, what you want is an fd through which QEMU can allocate kernel
> > > memory, selectively free that memory, and communicate fd+offset+length
> > > to KVM.  And perhaps an interface to initialize a little of that memory
> > > from a template (presumably copied from a real file on disk somewhere).
> > > 
> > > You don't need shmem.c or a filesystem for that!
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in the pipeline. It is part of TDX 1.5 [1]. 
> 
> And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_
> far off from being able to use UPM for "regular" VMs as a way to provide
> defense-in-depth

UPM? That's an acronym from your side of the fence, I spy references to
it in the mail threads, but haven't tracked down a definition.  I'll
just take it to mean the fd-based memory we're discussing.

> without having to take on the overhead of confidential VMs.  At that point,
> migration and probably even swap are on the table.

Good, the more "flexible" that memory is, the better for competing users
of memory.  But an fd supplied by KVM gives you freedom to change to a
better implementation of allocation underneath, whenever it suits you.
Maybe shmem beneath is good from the start, maybe not.

Hugh



[PATCH v6 17/21] accel/tcg: Add fast path for translator_ld*

2022-08-18 Thread Richard Henderson
Cache the translation from guest to host address, so we may
use direct loads when we hit on the primary translation page.

Look up the second translation page only once, during translation.
This obviates another lookup of the second page within tb_gen_code
after translation.

Fixes a bug in that plugin_insn_append should be passed the bytes
in the original memory order, not bswapped by pieces.

Signed-off-by: Richard Henderson 
---
 include/exec/translator.h |  63 +++
 accel/tcg/translate-all.c |  26 +++-
 accel/tcg/translator.c| 127 +-
 3 files changed, 144 insertions(+), 72 deletions(-)
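The caching idea can be illustrated abstractly (a deliberately simplified sketch with invented names, not the actual QEMU code): the context remembers the host pointer backing the primary translation page, and a fetch resolves to a direct host load whenever it stays on that page; anything else takes the slow path.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (~(uintptr_t)(PAGE_SIZE - 1))

/* Minimal stand-in for DisasContextBase: the guest pc of the first
 * instruction and the cached host pointer backing that guest page. */
struct disas_ctx {
    uintptr_t pc_first;
    const uint8_t *host_addr;
};

/* Counts slow-path fetches so the effect of the cache is observable. */
static int slow_path_hits;

static uint8_t slow_load(struct disas_ctx *db, uintptr_t pc)
{
    (void)db;
    (void)pc;
    slow_path_hits++;
    /* The real code would do a full guest-to-host page lookup here. */
    return 0;
}

static uint8_t ldub_sketch(struct disas_ctx *db, uintptr_t pc)
{
    /* Fast path: pc is on the primary page -> direct host load. */
    if ((pc & PAGE_MASK) == (db->pc_first & PAGE_MASK)) {
        return db->host_addr[pc & (PAGE_SIZE - 1)];
    }
    return slow_load(db, pc);
}
```

The patch additionally caches the second translation page (host_addr[1] in the diff below), looked up only once during translation, so a block spanning two pages still avoids repeated lookups.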

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 69db0f5c21..329a42fe46 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -81,24 +81,14 @@ typedef enum DisasJumpType {
  * Architecture-agnostic disassembly context.
  */
 typedef struct DisasContextBase {
-const TranslationBlock *tb;
+TranslationBlock *tb;
 target_ulong pc_first;
 target_ulong pc_next;
 DisasJumpType is_jmp;
 int num_insns;
 int max_insns;
 bool singlestep_enabled;
-#ifdef CONFIG_USER_ONLY
-/*
- * Guest address of the last byte of the last protected page.
- *
- * Pages containing the translated instructions are made non-writable in
- * order to achieve consistency in case another thread is modifying the
- * code while translate_insn() fetches the instruction bytes piecemeal.
- * Such writer threads are blocked on mmap_lock() in page_unprotect().
- */
-target_ulong page_protect_end;
-#endif
+void *host_addr[2];
 } DisasContextBase;
 
 /**
@@ -183,24 +173,43 @@ bool translator_use_goto_tb(DisasContextBase *db, 
target_ulong dest);
  * the relevant information at translation time.
  */
 
-#define GEN_TRANSLATOR_LD(fullname, type, load_fn, swap_fn) \
-type fullname ## _swap(CPUArchState *env, DisasContextBase *dcbase, \
-   abi_ptr pc, bool do_swap);   \
-static inline type fullname(CPUArchState *env,  \
-DisasContextBase *dcbase, abi_ptr pc)   \
-{   \
-return fullname ## _swap(env, dcbase, pc, false);   \
+uint8_t translator_ldub(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint16_t translator_lduw(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint32_t translator_ldl(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+uint64_t translator_ldq(CPUArchState *env, DisasContextBase *db, abi_ptr pc);
+
+static inline uint16_t
+translator_lduw_swap(CPUArchState *env, DisasContextBase *db,
+ abi_ptr pc, bool do_swap)
+{
+uint16_t ret = translator_lduw(env, db, pc);
+if (do_swap) {
+ret = bswap16(ret);
 }
+return ret;
+}
 
-#define FOR_EACH_TRANSLATOR_LD(F)   \
-F(translator_ldub, uint8_t, cpu_ldub_code, /* no swap */)   \
-F(translator_lduw, uint16_t, cpu_lduw_code, bswap16)\
-F(translator_ldl, uint32_t, cpu_ldl_code, bswap32)  \
-F(translator_ldq, uint64_t, cpu_ldq_code, bswap64)
+static inline uint32_t
+translator_ldl_swap(CPUArchState *env, DisasContextBase *db,
+abi_ptr pc, bool do_swap)
+{
+uint32_t ret = translator_ldl(env, db, pc);
+if (do_swap) {
+ret = bswap32(ret);
+}
+return ret;
+}
 
-FOR_EACH_TRANSLATOR_LD(GEN_TRANSLATOR_LD)
-
-#undef GEN_TRANSLATOR_LD
+static inline uint64_t
+translator_ldq_swap(CPUArchState *env, DisasContextBase *db,
+abi_ptr pc, bool do_swap)
+{
+uint64_t ret = translator_ldq(env, db, pc);
+if (do_swap) {
+ret = bswap64(ret);
+}
+return ret;
+}
 
 /*
  * Return whether addr is on the same page as where disassembly started.
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index b224f856d0..e44f40b234 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1385,10 +1385,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 {
 CPUArchState *env = cpu->env_ptr;
 TranslationBlock *tb, *existing_tb;
-tb_page_addr_t phys_pc, phys_page2;
-target_ulong virt_page2;
+tb_page_addr_t phys_pc;
 tcg_insn_unit *gen_code_buf;
 int gen_code_size, search_size, max_insns;
+void *host_pc;
 #ifdef CONFIG_PROFILER
 TCGProfile *prof = &tcg_ctx->prof;
 int64_t ti;
@@ -1397,7 +1397,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 assert_memory_lock();
 qemu_thread_jit_write();
 
-phys_pc = get_page_addr_code_hostp(env, pc, false, NULL);
+phys_pc = get_page_addr_code_hostp(env, pc, false, &host_pc);
 
 if (phys_pc == -1) {
 /* Generate a one-shot TB with 1 insn in it */
@@ -1428,6 +1428,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
  

[PATCH v6 20/21] target/riscv: Add MAX_INSN_LEN and insn_len

2022-08-18 Thread Richard Henderson
These will be useful in properly ending the TB.

Signed-off-by: Richard Henderson 
---
 target/riscv/translate.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 38666ddc91..a719aa6e63 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -1022,6 +1022,14 @@ static uint32_t opcode_at(DisasContextBase *dcbase, 
target_ulong pc)
 /* Include decoders for factored-out extensions */
 #include "decode-XVentanaCondOps.c.inc"
 
+/* The specification allows for longer insns, but they are not supported by qemu. */
+#define MAX_INSN_LEN  4
+
+static inline int insn_len(uint16_t first_word)
+{
+return (first_word & 3) == 3 ? 4 : 2;
+}
+
 static void decode_opc(CPURISCVState *env, DisasContext *ctx, uint16_t opcode)
 {
 /*
@@ -1037,7 +1045,7 @@ static void decode_opc(CPURISCVState *env, DisasContext 
*ctx, uint16_t opcode)
 };
 
 /* Check for compressed insn */
-if (extract16(opcode, 0, 2) != 3) {
+if (insn_len(opcode) == 2) {
 if (!has_ext(ctx, RVC)) {
 gen_exception_illegal(ctx);
 } else {
-- 
2.34.1




[PATCH v6 16/21] accel/tcg: Add pc and host_pc params to gen_intermediate_code

2022-08-18 Thread Richard Henderson
Pass these along to translator_loop -- pc may be used instead
of tb->pc, and host_pc is currently unused.  Adjust all targets
at one time.

Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h   |  1 -
 include/exec/translator.h | 24 
 accel/tcg/translate-all.c |  3 ++-
 accel/tcg/translator.c|  9 +
 target/alpha/translate.c  |  5 +++--
 target/arm/translate.c|  5 +++--
 target/avr/translate.c|  5 +++--
 target/cris/translate.c   |  5 +++--
 target/hexagon/translate.c|  6 --
 target/hppa/translate.c   |  5 +++--
 target/i386/tcg/translate.c   |  5 +++--
 target/loongarch/translate.c  |  6 --
 target/m68k/translate.c   |  5 +++--
 target/microblaze/translate.c |  5 +++--
 target/mips/tcg/translate.c   |  5 +++--
 target/nios2/translate.c  |  5 +++--
 target/openrisc/translate.c   |  6 --
 target/ppc/translate.c|  5 +++--
 target/riscv/translate.c  |  5 +++--
 target/rx/translate.c |  5 +++--
 target/s390x/tcg/translate.c  |  5 +++--
 target/sh4/translate.c|  5 +++--
 target/sparc/translate.c  |  5 +++--
 target/tricore/translate.c|  6 --
 target/xtensa/translate.c |  6 --
 25 files changed, 95 insertions(+), 52 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7a6dc44d86..4ad166966b 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -39,7 +39,6 @@ typedef ram_addr_t tb_page_addr_t;
 #define TB_PAGE_ADDR_FMT RAM_ADDR_FMT
 #endif
 
-void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns);
 void restore_state_to_opc(CPUArchState *env, TranslationBlock *tb,
   target_ulong *data);
 
diff --git a/include/exec/translator.h b/include/exec/translator.h
index 45b9268ca4..69db0f5c21 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -26,6 +26,19 @@
 #include "exec/translate-all.h"
 #include "tcg/tcg.h"
 
+/**
+ * gen_intermediate_code
+ * @cpu: cpu context
+ * @tb: translation block
+ * @max_insns: max number of instructions to translate
+ * @pc: guest virtual program counter address
+ * @host_pc: host physical program counter address
+ *
+ * This function must be provided by the target, which should create
+ * the target-specific DisasContext, and then invoke translator_loop.
+ */
+void gen_intermediate_code(CPUState *cpu, TranslationBlock *tb, int max_insns,
+   target_ulong pc, void *host_pc);
 
 /**
  * DisasJumpType:
@@ -123,11 +136,13 @@ typedef struct TranslatorOps {
 
 /**
  * translator_loop:
- * @ops: Target-specific operations.
- * @db: Disassembly context.
  * @cpu: Target vCPU.
  * @tb: Translation block.
  * @max_insns: Maximum number of insns to translate.
+ * @pc: guest virtual program counter address
+ * @host_pc: host physical program counter address
+ * @ops: Target-specific operations.
+ * @db: Disassembly context.
  *
  * Generic translator loop.
  *
@@ -141,8 +156,9 @@ typedef struct TranslatorOps {
  * - When single-stepping is enabled (system-wide or on the current vCPU).
  * - When too many instructions have been translated.
  */
-void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
- CPUState *cpu, TranslationBlock *tb, int max_insns);
+void translator_loop(CPUState *cpu, TranslationBlock *tb, int max_insns,
+ target_ulong pc, void *host_pc,
+ const TranslatorOps *ops, DisasContextBase *db);
 
 void translator_loop_temp_check(DisasContextBase *db);
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 069ed67bac..b224f856d0 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -46,6 +46,7 @@
 
 #include "exec/cputlb.h"
 #include "exec/translate-all.h"
+#include "exec/translator.h"
 #include "qemu/bitmap.h"
 #include "qemu/qemu-print.h"
 #include "qemu/timer.h"
@@ -1444,7 +1445,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 tcg_func_start(tcg_ctx);
 
 tcg_ctx->cpu = env_cpu(env);
-gen_intermediate_code(cpu, tb, max_insns);
+gen_intermediate_code(cpu, tb, max_insns, pc, host_pc);
 assert(tb->size != 0);
 tcg_ctx->cpu = NULL;
 max_insns = tb->icount;
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index fe7af9b943..3eef30d93a 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -51,16 +51,17 @@ static inline void translator_page_protect(DisasContextBase 
*dcbase,
 #endif
 }
 
-void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
- CPUState *cpu, TranslationBlock *tb, int max_insns)
+void translator_loop(CPUState *cpu, TranslationBlock *tb, int max_insns,
+ target_ulong pc, void *host_pc,
+ const TranslatorOps *ops, DisasContextBase *db)
 {
 uint32_t cflags = tb_cflags(tb);
 bool plugin_enabled;
 
 /* Initialize DisasContext */
 

[PATCH v6 21/21] target/riscv: Make translator stop before the end of a page

2022-08-18 Thread Richard Henderson
Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1155
Signed-off-by: Richard Henderson 
---
 target/riscv/translate.c  | 17 +--
 tests/tcg/riscv64/noexec.c| 79 +++
 tests/tcg/riscv64/Makefile.target |  1 +
 3 files changed, 93 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/riscv64/noexec.c

diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index a719aa6e63..f8af6daa70 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -1154,12 +1154,21 @@ static void riscv_tr_translate_insn(DisasContextBase 
*dcbase, CPUState *cpu)
 }
 ctx->nftemp = 0;
 
+/* Only the first insn within a TB is allowed to cross a page boundary. */
 if (ctx->base.is_jmp == DISAS_NEXT) {
-target_ulong page_start;
-
-page_start = ctx->base.pc_first & TARGET_PAGE_MASK;
-if (ctx->base.pc_next - page_start >= TARGET_PAGE_SIZE) {
+if (!is_same_page(&ctx->base, ctx->base.pc_next)) {
 ctx->base.is_jmp = DISAS_TOO_MANY;
+} else {
+unsigned page_ofs = ctx->base.pc_next & ~TARGET_PAGE_MASK;
+
+if (page_ofs > TARGET_PAGE_SIZE - MAX_INSN_LEN) {
+uint16_t next_insn = cpu_lduw_code(env, ctx->base.pc_next);
+int len = insn_len(next_insn);
+
+if (!is_same_page(&ctx->base, ctx->base.pc_next + len)) {
+ctx->base.is_jmp = DISAS_TOO_MANY;
+}
+}
 }
 }
 }
diff --git a/tests/tcg/riscv64/noexec.c b/tests/tcg/riscv64/noexec.c
new file mode 100644
index 00..86f64b28db
--- /dev/null
+++ b/tests/tcg/riscv64/noexec.c
@@ -0,0 +1,79 @@
+#include "../multiarch/noexec.c.inc"
+
+static void *arch_mcontext_pc(const mcontext_t *ctx)
+{
+return (void *)ctx->__gregs[REG_PC];
+}
+
+static int arch_mcontext_arg(const mcontext_t *ctx)
+{
+return ctx->__gregs[REG_A0];
+}
+
+static void arch_flush(void *p, int len)
+{
+__builtin___clear_cache(p, p + len);
+}
+
+extern char noexec_1[];
+extern char noexec_2[];
+extern char noexec_end[];
+
+asm(".option push\n"
+".option norvc\n"
+"noexec_1:\n"
+"   li a0,1\n"   /* a0 is 0 on entry, set 1. */
+"noexec_2:\n"
+"   li a0,2\n"  /* a0 is 0/1; set 2. */
+"   ret\n"
+"noexec_end:\n"
+".option pop");
+
+int main(void)
+{
+struct noexec_test noexec_tests[] = {
+{
+.name = "fallthrough",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = noexec_1 - noexec_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 1,
+},
+{
+.name = "jump",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = 0,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 0,
+},
+{
+.name = "fallthrough [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = noexec_1 - noexec_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 1,
+},
+{
+.name = "jump [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = -2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 0,
+},
+};
+
+return test_noexec(noexec_tests,
+   sizeof(noexec_tests) / sizeof(noexec_tests[0]));
+}
diff --git a/tests/tcg/riscv64/Makefile.target 
b/tests/tcg/riscv64/Makefile.target
index d41bf6d60d..b5b89dfb0e 100644
--- a/tests/tcg/riscv64/Makefile.target
+++ b/tests/tcg/riscv64/Makefile.target
@@ -3,3 +3,4 @@
 
 VPATH += $(SRC_PATH)/tests/tcg/riscv64
 TESTS += test-div
+TESTS += noexec
-- 
2.34.1




[PATCH v6 08/21] accel/tcg: Properly implement get_page_addr_code for user-only

2022-08-18 Thread Richard Henderson
The current implementation is a no-op, simply returning addr.
This is incorrect, because we ought to be checking the page
permissions for execution.

Make get_page_addr_code inline for both implementations.

Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h | 85 ++---
 accel/tcg/cputlb.c  |  5 ---
 accel/tcg/user-exec.c   | 15 
 3 files changed, 43 insertions(+), 62 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 311e5fb422..0475ec6007 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -598,43 +598,44 @@ struct MemoryRegionSection *iotlb_to_section(CPUState 
*cpu,
  hwaddr index, MemTxAttrs attrs);
 #endif
 
-#if defined(CONFIG_USER_ONLY)
-void mmap_lock(void);
-void mmap_unlock(void);
-bool have_mmap_lock(void);
-
 /**
- * get_page_addr_code() - user-mode version
+ * get_page_addr_code_hostp()
  * @env: CPUArchState
  * @addr: guest virtual address of guest code
  *
- * Returns @addr.
+ * See get_page_addr_code() (full-system version) for documentation on the
+ * return value.
+ *
+ * Sets *@hostp (when @hostp is non-NULL) as follows.
+ * If the return value is -1, sets *@hostp to NULL. Otherwise, sets *@hostp
+ * to the host address where @addr's content is kept.
+ *
+ * Note: this function can trigger an exception.
+ */
+tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
+void **hostp);
+
+/**
+ * get_page_addr_code()
+ * @env: CPUArchState
+ * @addr: guest virtual address of guest code
+ *
+ * If we cannot translate and execute from the entire RAM page, or if
+ * the region is not backed by RAM, returns -1. Otherwise, returns the
+ * ram_addr_t corresponding to the guest code at @addr.
+ *
+ * Note: this function can trigger an exception.
  */
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env,
 target_ulong addr)
 {
-return addr;
+return get_page_addr_code_hostp(env, addr, NULL);
 }
 
-/**
- * get_page_addr_code_hostp() - user-mode version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * Returns @addr.
- *
- * If @hostp is non-NULL, sets *@hostp to the host address where @addr's 
content
- * is kept.
- */
-static inline tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env,
-  target_ulong addr,
-  void **hostp)
-{
-if (hostp) {
-*hostp = g2h_untagged(addr);
-}
-return addr;
-}
+#if defined(CONFIG_USER_ONLY)
+void mmap_lock(void);
+void mmap_unlock(void);
+bool have_mmap_lock(void);
 
 /**
  * adjust_signal_pc:
@@ -691,36 +692,6 @@ G_NORETURN void cpu_loop_exit_sigbus(CPUState *cpu, 
target_ulong addr,
 static inline void mmap_lock(void) {}
 static inline void mmap_unlock(void) {}
 
-/**
- * get_page_addr_code() - full-system version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * If we cannot translate and execute from the entire RAM page, or if
- * the region is not backed by RAM, returns -1. Otherwise, returns the
- * ram_addr_t corresponding to the guest code at @addr.
- *
- * Note: this function can trigger an exception.
- */
-tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr);
-
-/**
- * get_page_addr_code_hostp() - full-system version
- * @env: CPUArchState
- * @addr: guest virtual address of guest code
- *
- * See get_page_addr_code() (full-system version) for documentation on the
- * return value.
- *
- * Sets *@hostp (when @hostp is non-NULL) as follows.
- * If the return value is -1, sets *@hostp to NULL. Otherwise, sets *@hostp
- * to the host address where @addr's content is kept.
- *
- * Note: this function can trigger an exception.
- */
-tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp);
-
 void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length);
 void tlb_set_dirty(CPUState *cpu, target_ulong vaddr);
 
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index a46f3a654d..43bd65c973 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1544,11 +1544,6 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState 
*env, target_ulong addr,
 return qemu_ram_addr_from_host_nofail(p);
 }
 
-tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr)
-{
-return get_page_addr_code_hostp(env, addr, NULL);
-}
-
 static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
CPUIOTLBEntry *iotlbentry, uintptr_t retaddr)
 {
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 20ada5472b..a20234fb02 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -199,6 +199,21 @@ void *probe_access(CPUArchState *env, 

[PATCH v6 14/21] accel/tcg: Raise PROT_EXEC exception early

2022-08-18 Thread Richard Henderson
We currently ignore PROT_EXEC on the initial lookup, and
defer raising the exception until cpu_ld*_code().
It makes more sense to raise the exception early.

Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c  | 2 +-
 accel/tcg/translate-all.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 7887af6f45..7b8977a0a4 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -222,7 +222,7 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu, 
target_ulong pc,
 desc.cflags = cflags;
 desc.trace_vcpu_dstate = *cpu->trace_dstate;
 desc.pc = pc;
-phys_pc = get_page_addr_code(desc.env, pc);
+phys_pc = get_page_addr_code_hostp(desc.env, pc, false, NULL);
 if (phys_pc == -1) {
 return NULL;
 }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index b83161a081..069ed67bac 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1396,7 +1396,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 assert_memory_lock();
 qemu_thread_jit_write();
 
-phys_pc = get_page_addr_code(env, pc);
+phys_pc = get_page_addr_code_hostp(env, pc, false, NULL);
 
 if (phys_pc == -1) {
 /* Generate a one-shot TB with 1 insn in it */
-- 
2.34.1




[PATCH v6 18/21] target/s390x: Make translator stop before the end of a page

2022-08-18 Thread Richard Henderson
From: Ilya Leoshkevich 

Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
Message-Id: <20220817150506.592862-3-...@linux.ibm.com>
Signed-off-by: Richard Henderson 
---
 target/s390x/tcg/translate.c |  15 +++-
 tests/tcg/s390x/noexec.c | 106 +++
 tests/tcg/multiarch/noexec.c.inc | 141 +++
 tests/tcg/s390x/Makefile.target  |   1 +
 4 files changed, 259 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/s390x/noexec.c
 create mode 100644 tests/tcg/multiarch/noexec.c.inc

diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c
index d4c0b9b3a2..1d2dddab1c 100644
--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -6609,6 +6609,14 @@ static void s390x_tr_insn_start(DisasContextBase 
*dcbase, CPUState *cs)
 dc->insn_start = tcg_last_op();
 }
 
+static target_ulong get_next_pc(CPUS390XState *env, DisasContext *s,
+uint64_t pc)
+{
+uint64_t insn = ld_code2(env, s, pc);
+
+return pc + get_ilen((insn >> 8) & 0xff);
+}
+
 static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
 {
 CPUS390XState *env = cs->env_ptr;
@@ -6616,10 +6624,9 @@ static void s390x_tr_translate_insn(DisasContextBase 
*dcbase, CPUState *cs)
 
 dc->base.is_jmp = translate_one(env, dc);
 if (dc->base.is_jmp == DISAS_NEXT) {
-uint64_t page_start;
-
-page_start = dc->base.pc_first & TARGET_PAGE_MASK;
-if (dc->base.pc_next - page_start >= TARGET_PAGE_SIZE || dc->ex_value) {
+if (!is_same_page(dcbase, dc->base.pc_next) ||
+!is_same_page(dcbase, get_next_pc(env, dc, dc->base.pc_next)) ||
+dc->ex_value) {
 dc->base.is_jmp = DISAS_TOO_MANY;
 }
 }
diff --git a/tests/tcg/s390x/noexec.c b/tests/tcg/s390x/noexec.c
new file mode 100644
index 00..15d007d07f
--- /dev/null
+++ b/tests/tcg/s390x/noexec.c
@@ -0,0 +1,106 @@
+#include "../multiarch/noexec.c.inc"
+
+static void *arch_mcontext_pc(const mcontext_t *ctx)
+{
+return (void *)ctx->psw.addr;
+}
+
+static int arch_mcontext_arg(const mcontext_t *ctx)
+{
+return ctx->gregs[2];
+}
+
+static void arch_flush(void *p, int len)
+{
+}
+
+extern char noexec_1[];
+extern char noexec_2[];
+extern char noexec_end[];
+
+asm("noexec_1:\n"
+"   lgfi %r2,1\n"   /* %r2 is 0 on entry, set 1. */
+"noexec_2:\n"
+"   lgfi %r2,2\n"   /* %r2 is 0/1; set 2. */
+"   br %r14\n"  /* return */
+"noexec_end:");
+
+extern char exrl_1[];
+extern char exrl_2[];
+extern char exrl_end[];
+
+asm("exrl_1:\n"
+"   exrl %r0, exrl_2\n"
+"   br %r14\n"
+"exrl_2:\n"
+"   lgfi %r2,2\n"
+"exrl_end:");
+
+int main(void)
+{
+struct noexec_test noexec_tests[] = {
+{
+.name = "fallthrough",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = noexec_1 - noexec_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 1,
+},
+{
+.name = "jump",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = 0,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 0,
+},
+{
+.name = "exrl",
+.test_code = exrl_1,
+.test_len = exrl_end - exrl_1,
+.page_ofs = exrl_1 - exrl_2,
+.entry_ofs = exrl_1 - exrl_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = exrl_1 - exrl_2,
+.expected_arg = 0,
+},
+{
+.name = "fallthrough [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = noexec_1 - noexec_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 1,
+},
+{
+.name = "jump [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+.entry_ofs = -2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = -2,
+.expected_arg = 0,
+},
+{
+.name = "exrl [cross]",
+.test_code = exrl_1,
+.test_len = exrl_end - exrl_1,
+.page_ofs = exrl_1 - exrl_2 - 2,
+.entry_ofs = exrl_1 - exrl_2 - 2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = exrl_1 - exrl_2 - 2,
+

[PATCH v6 19/21] target/i386: Make translator stop before the end of a page

2022-08-18 Thread Richard Henderson
From: Ilya Leoshkevich 

Right now the translator stops right *after* the end of a page, which
breaks reporting of fault locations when the last instruction of a
multi-insn translation block crosses a page boundary.

An implementation, like the one arm and s390x have, would require an
i386 length disassembler, which is burdensome to maintain. Another
alternative would be to single-step at the end of a guest page, but
this may come with a performance impact.

Fix by snapshotting disassembly state and restoring it after we figure
out we crossed a page boundary. This includes rolling back cc_op
updates and emitted ops.

Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1143
Message-Id: <20220817150506.592862-4-...@linux.ibm.com>
Signed-off-by: Richard Henderson 
---
 target/i386/tcg/translate.c  | 25 ++-
 tests/tcg/x86_64/noexec.c| 75 
 tests/tcg/x86_64/Makefile.target |  3 +-
 3 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100644 tests/tcg/x86_64/noexec.c

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 4836c889e0..6481ae5c24 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -130,6 +130,7 @@ typedef struct DisasContext {
 TCGv_i64 tmp1_i64;
 
 sigjmp_buf jmpbuf;
+TCGOp *prev_insn_end;
 } DisasContext;
 
 /* The environment in which user-only runs is constrained. */
@@ -2008,6 +2009,12 @@ static uint64_t advance_pc(CPUX86State *env, 
DisasContext *s, int num_bytes)
 {
 uint64_t pc = s->pc;
 
+/* This is a subsequent insn that crosses a page boundary.  */
+if (s->base.num_insns > 1 &&
+!is_same_page(&s->base, s->pc + num_bytes - 1)) {
+siglongjmp(s->jmpbuf, 2);
+}
+
 s->pc += num_bytes;
 if (unlikely(s->pc - s->pc_start > X86_MAX_INSN_LENGTH)) {
 /* If the instruction's 16th byte is on a different page than the 1st, 
a
@@ -4556,6 +4563,8 @@ static target_ulong disas_insn(DisasContext *s, CPUState 
*cpu)
 int modrm, reg, rm, mod, op, opreg, val;
 target_ulong next_eip, tval;
 target_ulong pc_start = s->base.pc_next;
+bool orig_cc_op_dirty = s->cc_op_dirty;
+CCOp orig_cc_op = s->cc_op;
 
 s->pc_start = s->pc = pc_start;
 s->override = -1;
@@ -4568,9 +4577,22 @@ static target_ulong disas_insn(DisasContext *s, CPUState 
*cpu)
 s->rip_offset = 0; /* for relative ip address */
 s->vex_l = 0;
 s->vex_v = 0;
-if (sigsetjmp(s->jmpbuf, 0) != 0) {
+switch (sigsetjmp(s->jmpbuf, 0)) {
+case 0:
+break;
+case 1:
 gen_exception_gpf(s);
 return s->pc;
+case 2:
+/* Restore state that may affect the next instruction. */
+s->cc_op_dirty = orig_cc_op_dirty;
+s->cc_op = orig_cc_op;
+s->base.num_insns--;
+tcg_remove_ops_after(s->prev_insn_end);
+s->base.is_jmp = DISAS_TOO_MANY;
+return pc_start;
+default:
+g_assert_not_reached();
 }
 
 prefixes = 0;
@@ -8632,6 +8654,7 @@ static void i386_tr_insn_start(DisasContextBase *dcbase, 
CPUState *cpu)
 {
 DisasContext *dc = container_of(dcbase, DisasContext, base);
 
+dc->prev_insn_end = tcg_last_op();
 tcg_gen_insn_start(dc->base.pc_next, dc->cc_op);
 }
 
diff --git a/tests/tcg/x86_64/noexec.c b/tests/tcg/x86_64/noexec.c
new file mode 100644
index 00..9b124901be
--- /dev/null
+++ b/tests/tcg/x86_64/noexec.c
@@ -0,0 +1,75 @@
+#include "../multiarch/noexec.c.inc"
+
+static void *arch_mcontext_pc(const mcontext_t *ctx)
+{
+return (void *)ctx->gregs[REG_RIP];
+}
+
+int arch_mcontext_arg(const mcontext_t *ctx)
+{
+return ctx->gregs[REG_RDI];
+}
+
+static void arch_flush(void *p, int len)
+{
+}
+
+extern char noexec_1[];
+extern char noexec_2[];
+extern char noexec_end[];
+
+asm("noexec_1:\n"
+"movq $1,%rdi\n"/* %rdi is 0 on entry, set 1. */
+"noexec_2:\n"
+"movq $2,%rdi\n"/* %rdi is 0/1; set 2. */
+"ret\n"
+"noexec_end:");
+
+int main(void)
+{
+struct noexec_test noexec_tests[] = {
+{
+.name = "fallthrough",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = noexec_1 - noexec_2,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 1,
+},
+{
+.name = "jump",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2,
+.entry_ofs = 0,
+.expected_si_ofs = 0,
+.expected_pc_ofs = 0,
+.expected_arg = 0,
+},
+{
+.name = "fallthrough [cross]",
+.test_code = noexec_1,
+.test_len = noexec_end - noexec_1,
+.page_ofs = noexec_1 - noexec_2 - 2,
+

[PATCH v6 13/21] accel/tcg: Add nofault parameter to get_page_addr_code_hostp

2022-08-18 Thread Richard Henderson
Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h | 10 +-
 accel/tcg/cputlb.c  |  8 
 accel/tcg/plugin-gen.c  |  4 ++--
 accel/tcg/user-exec.c   |  4 ++--
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 9f35e3b7a9..7a6dc44d86 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -599,6 +599,8 @@ struct MemoryRegionSection *iotlb_to_section(CPUState *cpu,
  * get_page_addr_code_hostp()
  * @env: CPUArchState
  * @addr: guest virtual address of guest code
+ * @nofault: do not raise an exception
+ * @hostp: output for host pointer
  *
  * See get_page_addr_code() (full-system version) for documentation on the
  * return value.
@@ -607,10 +609,10 @@ struct MemoryRegionSection *iotlb_to_section(CPUState 
*cpu,
  * If the return value is -1, sets *@hostp to NULL. Otherwise, sets *@hostp
  * to the host address where @addr's content is kept.
  *
- * Note: this function can trigger an exception.
+ * Note: Unless @nofault, this function can trigger an exception.
  */
 tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp);
+bool nofault, void **hostp);
 
 /**
  * get_page_addr_code()
@@ -620,13 +622,11 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState 
*env, target_ulong addr,
  * If we cannot translate and execute from the entire RAM page, or if
  * the region is not backed by RAM, returns -1. Otherwise, returns the
  * ram_addr_t corresponding to the guest code at @addr.
- *
- * Note: this function can trigger an exception.
  */
 static inline tb_page_addr_t get_page_addr_code(CPUArchState *env,
 target_ulong addr)
 {
-return get_page_addr_code_hostp(env, addr, NULL);
+return get_page_addr_code_hostp(env, addr, true, NULL);
 }
 
 #if defined(CONFIG_USER_ONLY)
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 2dc2affa12..ae7b40dd51 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1644,16 +1644,16 @@ void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
  * of RAM.  This will force us to execute by loading and translating
  * one insn at a time, without caching.
  *
- * NOTE: This function will trigger an exception if the page is
- * not executable.
+ * NOTE: Unless @nofault, this function will trigger an exception
+ * if the page is not executable.
  */
 tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp)
+bool nofault, void **hostp)
 {
 void *p;
 
 (void)probe_access_internal(env, addr, 1, MMU_INST_FETCH,
-cpu_mmu_index(env, true), true, &p, 0);
+cpu_mmu_index(env, true), nofault, &p, 0);
 if (p == NULL) {
 return -1;
 }
diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
index 3d0b101e34..8377c15383 100644
--- a/accel/tcg/plugin-gen.c
+++ b/accel/tcg/plugin-gen.c
@@ -872,7 +872,7 @@ bool plugin_gen_tb_start(CPUState *cpu, const 
TranslationBlock *tb, bool mem_onl
 
 ptb->vaddr = tb->pc;
 ptb->vaddr2 = -1;
-get_page_addr_code_hostp(cpu->env_ptr, tb->pc, &ptb->haddr1);
+get_page_addr_code_hostp(cpu->env_ptr, tb->pc, true, &ptb->haddr1);
 ptb->haddr2 = NULL;
 ptb->mem_only = mem_only;
 
@@ -902,7 +902,7 @@ void plugin_gen_insn_start(CPUState *cpu, const 
DisasContextBase *db)
 unlikely((db->pc_next & TARGET_PAGE_MASK) !=
  (db->pc_first & TARGET_PAGE_MASK))) {
 get_page_addr_code_hostp(cpu->env_ptr, db->pc_next,
- &ptb->haddr2);
+ true, &ptb->haddr2);
 ptb->vaddr2 = db->pc_next;
 }
 if (likely(ptb->vaddr2 == -1)) {
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 58edd33896..e7fec960c2 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -197,11 +197,11 @@ void *probe_access(CPUArchState *env, target_ulong addr, 
int size,
 }
 
 tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp)
+bool nofault, void **hostp)
 {
 int flags;
 
-flags = probe_access_internal(env, addr, 1, MMU_INST_FETCH, true, 0);
+flags = probe_access_internal(env, addr, 1, MMU_INST_FETCH, nofault, 0);
 if (unlikely(flags)) {
 return -1;
 }
-- 
2.34.1




[PATCH v6 15/21] accel/tcg: Remove translator_ldsw

2022-08-18 Thread Richard Henderson
The only user can easily use translator_lduw and
convert the result to signed in the return statement.

Signed-off-by: Richard Henderson 
---
 include/exec/translator.h   | 1 -
 target/i386/tcg/translate.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 0d0bf3a31e..45b9268ca4 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -178,7 +178,6 @@ bool translator_use_goto_tb(DisasContextBase *db, target_ulong dest);
 
 #define FOR_EACH_TRANSLATOR_LD(F)   \
 F(translator_ldub, uint8_t, cpu_ldub_code, /* no swap */)   \
-F(translator_ldsw, int16_t, cpu_ldsw_code, bswap16) \
 F(translator_lduw, uint16_t, cpu_lduw_code, bswap16)\
 F(translator_ldl, uint32_t, cpu_ldl_code, bswap32)  \
 F(translator_ldq, uint64_t, cpu_ldq_code, bswap64)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index b7972f0ff5..a23417d058 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2033,7 +2033,7 @@ static inline uint8_t x86_ldub_code(CPUX86State *env, DisasContext *s)
 
 static inline int16_t x86_ldsw_code(CPUX86State *env, DisasContext *s)
 {
-return translator_ldsw(env, &s->base, advance_pc(env, s, 2));
+return translator_lduw(env, &s->base, advance_pc(env, s, 2));
 }
 
 static inline uint16_t x86_lduw_code(CPUX86State *env, DisasContext *s)
-- 
2.34.1




[PATCH v6 11/21] accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c

2022-08-18 Thread Richard Henderson
The base qemu_ram_addr_from_host function is already in
softmmu/physmem.c; move the nofail version to be adjacent.

Signed-off-by: Richard Henderson 
---
 include/exec/cpu-common.h |  1 +
 accel/tcg/cputlb.c| 12 
 softmmu/physmem.c | 12 
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2281be4e10..d909429427 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -72,6 +72,7 @@ typedef uintptr_t ram_addr_t;
 void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should not be used by devices.  */
 ram_addr_t qemu_ram_addr_from_host(void *ptr);
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
 RAMBlock *qemu_ram_block_by_name(const char *name);
 RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
ram_addr_t *offset);
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 43bd65c973..80a3eb4f1c 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1283,18 +1283,6 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 prot, mmu_idx, size);
 }
 
-static inline ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
-{
-ram_addr_t ram_addr;
-
-ram_addr = qemu_ram_addr_from_host(ptr);
-if (ram_addr == RAM_ADDR_INVALID) {
-error_report("Bad ram pointer %p", ptr);
-abort();
-}
-return ram_addr;
-}
-
 /*
  * Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
  * caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index dc3c3e5f2e..d4c30e99ea 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2460,6 +2460,18 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
 return block->offset + offset;
 }
 
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
+{
+ram_addr_t ram_addr;
+
+ram_addr = qemu_ram_addr_from_host(ptr);
+if (ram_addr == RAM_ADDR_INVALID) {
+error_report("Bad ram pointer %p", ptr);
+abort();
+}
+return ram_addr;
+}
+
 static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
  MemTxAttrs attrs, void *buf, hwaddr len);
 static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
-- 
2.34.1




[PATCH v6 10/21] accel/tcg: Make tb_htable_lookup static

2022-08-18 Thread Richard Henderson
The function is not used outside of cpu-exec.c.  Move it and
its subroutines up in the file, before the first use.

Signed-off-by: Richard Henderson 
---
 include/exec/exec-all.h |   3 -
 accel/tcg/cpu-exec.c| 122 
 2 files changed, 61 insertions(+), 64 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 0475ec6007..9f35e3b7a9 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -552,9 +552,6 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr, MemTxAttrs attrs);
 #endif
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
-TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-   target_ulong cs_base, uint32_t flags,
-   uint32_t cflags);
 void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr);
 
 /* GETPC is the true target of the return instruction that we'll execute.  */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index d18081ca6f..7887af6f45 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -170,6 +170,67 @@ uint32_t curr_cflags(CPUState *cpu)
 return cflags;
 }
 
+struct tb_desc {
+target_ulong pc;
+target_ulong cs_base;
+CPUArchState *env;
+tb_page_addr_t phys_page1;
+uint32_t flags;
+uint32_t cflags;
+uint32_t trace_vcpu_dstate;
+};
+
+static bool tb_lookup_cmp(const void *p, const void *d)
+{
+const TranslationBlock *tb = p;
+const struct tb_desc *desc = d;
+
+if (tb->pc == desc->pc &&
+tb->page_addr[0] == desc->phys_page1 &&
+tb->cs_base == desc->cs_base &&
+tb->flags == desc->flags &&
+tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
+tb_cflags(tb) == desc->cflags) {
+/* check next page if needed */
+if (tb->page_addr[1] == -1) {
+return true;
+} else {
+tb_page_addr_t phys_page2;
+target_ulong virt_page2;
+
+virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+phys_page2 = get_page_addr_code(desc->env, virt_page2);
+if (tb->page_addr[1] == phys_page2) {
+return true;
+}
+}
+}
+return false;
+}
+
+static TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
+  target_ulong cs_base, uint32_t flags,
+  uint32_t cflags)
+{
+tb_page_addr_t phys_pc;
+struct tb_desc desc;
+uint32_t h;
+
+desc.env = cpu->env_ptr;
+desc.cs_base = cs_base;
+desc.flags = flags;
+desc.cflags = cflags;
+desc.trace_vcpu_dstate = *cpu->trace_dstate;
+desc.pc = pc;
+phys_pc = get_page_addr_code(desc.env, pc);
+if (phys_pc == -1) {
+return NULL;
+}
+desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
+h = tb_hash_func(phys_pc, pc, flags, cflags, *cpu->trace_dstate);
+return qht_lookup_custom(&tb_ctx.htable, &desc, h, tb_lookup_cmp);
+}
+
 /* Might cause an exception, so have a longjmp destination ready */
 static inline TranslationBlock *tb_lookup(CPUState *cpu, target_ulong pc,
   target_ulong cs_base,
@@ -485,67 +546,6 @@ void cpu_exec_step_atomic(CPUState *cpu)
 end_exclusive();
 }
 
-struct tb_desc {
-target_ulong pc;
-target_ulong cs_base;
-CPUArchState *env;
-tb_page_addr_t phys_page1;
-uint32_t flags;
-uint32_t cflags;
-uint32_t trace_vcpu_dstate;
-};
-
-static bool tb_lookup_cmp(const void *p, const void *d)
-{
-const TranslationBlock *tb = p;
-const struct tb_desc *desc = d;
-
-if (tb->pc == desc->pc &&
-tb->page_addr[0] == desc->phys_page1 &&
-tb->cs_base == desc->cs_base &&
-tb->flags == desc->flags &&
-tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
-tb_cflags(tb) == desc->cflags) {
-/* check next page if needed */
-if (tb->page_addr[1] == -1) {
-return true;
-} else {
-tb_page_addr_t phys_page2;
-target_ulong virt_page2;
-
-virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
-phys_page2 = get_page_addr_code(desc->env, virt_page2);
-if (tb->page_addr[1] == phys_page2) {
-return true;
-}
-}
-}
-return false;
-}
-
-TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-   target_ulong cs_base, uint32_t flags,
-   uint32_t cflags)
-{
-tb_page_addr_t phys_pc;
-struct tb_desc desc;
-uint32_t h;
-
-desc.env = cpu->env_ptr;
-desc.cs_base = cs_base;
-desc.flags = flags;
-desc.cflags = cflags;
-desc.trace_vcpu_dstate = *cpu->trace_dstate;
-desc.pc = pc;
-phys_pc = 

[PATCH v6 12/21] accel/tcg: Use probe_access_internal for softmmu get_page_addr_code_hostp

2022-08-18 Thread Richard Henderson
Simplify the implementation of get_page_addr_code_hostp
by reusing the existing probe_access infrastructure.

Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c | 76 --
 1 file changed, 26 insertions(+), 50 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 80a3eb4f1c..2dc2affa12 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1482,56 +1482,6 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
   victim_tlb_hit(env, mmu_idx, index, offsetof(CPUTLBEntry, TY), \
  (ADDR) & TARGET_PAGE_MASK)
 
-/*
- * Return a ram_addr_t for the virtual address for execution.
- *
- * Return -1 if we can't translate and execute from an entire page
- * of RAM.  This will force us to execute by loading and translating
- * one insn at a time, without caching.
- *
- * NOTE: This function will trigger an exception if the page is
- * not executable.
- */
-tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
-void **hostp)
-{
-uintptr_t mmu_idx = cpu_mmu_index(env, true);
-uintptr_t index = tlb_index(env, mmu_idx, addr);
-CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-void *p;
-
-if (unlikely(!tlb_hit(entry->addr_code, addr))) {
-if (!VICTIM_TLB_HIT(addr_code, addr)) {
-tlb_fill(env_cpu(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0);
-index = tlb_index(env, mmu_idx, addr);
-entry = tlb_entry(env, mmu_idx, addr);
-
-if (unlikely(entry->addr_code & TLB_INVALID_MASK)) {
-/*
- * The MMU protection covers a smaller range than a target
- * page, so we must redo the MMU check for every insn.
- */
-return -1;
-}
-}
-assert(tlb_hit(entry->addr_code, addr));
-}
-
-if (unlikely(entry->addr_code & TLB_MMIO)) {
-/* The region is not backed by RAM.  */
-if (hostp) {
-*hostp = NULL;
-}
-return -1;
-}
-
-p = (void *)((uintptr_t)addr + entry->addend);
-if (hostp) {
-*hostp = p;
-}
-return qemu_ram_addr_from_host_nofail(p);
-}
-
 static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
CPUIOTLBEntry *iotlbentry, uintptr_t retaddr)
 {
@@ -1687,6 +1637,32 @@ void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
 return flags ? NULL : host;
 }
 
+/*
+ * Return a ram_addr_t for the virtual address for execution.
+ *
+ * Return -1 if we can't translate and execute from an entire page
+ * of RAM.  This will force us to execute by loading and translating
+ * one insn at a time, without caching.
+ *
+ * NOTE: This function will trigger an exception if the page is
+ * not executable.
+ */
+tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
+void **hostp)
+{
+void *p;
+
+(void)probe_access_internal(env, addr, 1, MMU_INST_FETCH,
+cpu_mmu_index(env, true), true, &p, 0);
+if (p == NULL) {
+return -1;
+}
+if (hostp) {
+*hostp = p;
+}
+return qemu_ram_addr_from_host_nofail(p);
+}
+
 #ifdef CONFIG_PLUGIN
 /*
  * Perform a TLB lookup and populate the qemu_plugin_hwaddr structure.
-- 
2.34.1




[PATCH v6 06/21] tests/tcg/i386: Move smc_code2 to an executable section

2022-08-18 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means
that we've got to put this code into a section that is
both writable and executable.

Note that this test did not run on hardware beforehand either.

Signed-off-by: Richard Henderson 
---
 tests/tcg/i386/test-i386.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/tcg/i386/test-i386.c b/tests/tcg/i386/test-i386.c
index ac8d5a3c1f..e6b308a2c0 100644
--- a/tests/tcg/i386/test-i386.c
+++ b/tests/tcg/i386/test-i386.c
@@ -1998,7 +1998,7 @@ uint8_t code[] = {
 0xc3, /* ret */
 };
 
-asm(".section \".data\"\n"
+asm(".section \".data_x\",\"awx\"\n"
 "smc_code2:\n"
 "movl 4(%esp), %eax\n"
 "movl %eax, smc_patch_addr2 + 1\n"
-- 
2.34.1




[PATCH v6 04/21] linux-user: Honor PT_GNU_STACK

2022-08-18 Thread Richard Henderson
Map the stack executable if required by default or on demand.

Signed-off-by: Richard Henderson 
---
 include/elf.h|  1 +
 linux-user/qemu.h|  1 +
 linux-user/elfload.c | 19 ++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/elf.h b/include/elf.h
index 3a4bcb646a..3d6b9062c0 100644
--- a/include/elf.h
+++ b/include/elf.h
@@ -31,6 +31,7 @@ typedef int64_t  Elf64_Sxword;
 #define PT_LOPROC  0x7000
 #define PT_HIPROC  0x7fff
 
+#define PT_GNU_STACK  (PT_LOOS + 0x474e551)
 #define PT_GNU_PROPERTY   (PT_LOOS + 0x474e553)
 
 #define PT_MIPS_REGINFO   0x7000
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 7d90de1b15..e2e93fbd1d 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -48,6 +48,7 @@ struct image_info {
 uint32_telf_flags;
 int personality;
 abi_ulong   alignment;
+boolexec_stack;
 
 /* Generic semihosting knows about these pointers. */
 abi_ulong   arg_strings;   /* strings for argv */
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index b20d513929..90375c6b74 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -232,6 +232,7 @@ static bool init_guest_commpage(void)
 #define ELF_ARCHEM_386
 
 #define ELF_PLATFORM get_elf_platform()
+#define EXSTACK_DEFAULT true
 
 static const char *get_elf_platform(void)
 {
@@ -308,6 +309,7 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, const CPUX86State *env)
 
 #define ELF_ARCHEM_ARM
 #define ELF_CLASS   ELFCLASS32
+#define EXSTACK_DEFAULT true
 
 static inline void init_thread(struct target_pt_regs *regs,
struct image_info *infop)
@@ -776,6 +778,7 @@ static inline void init_thread(struct target_pt_regs *regs,
 #else
 
 #define ELF_CLASS   ELFCLASS32
+#define EXSTACK_DEFAULT true
 
 #endif
 
@@ -973,6 +976,7 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, const CPUPPCState *env)
 
 #define ELF_CLASS   ELFCLASS64
 #define ELF_ARCHEM_LOONGARCH
+#define EXSTACK_DEFAULT true
 
 #define elf_check_arch(x) ((x) == EM_LOONGARCH)
 
@@ -1068,6 +1072,7 @@ static uint32_t get_elf_hwcap(void)
 #define ELF_CLASS   ELFCLASS32
 #endif
 #define ELF_ARCHEM_MIPS
+#define EXSTACK_DEFAULT true
 
 #ifdef TARGET_ABI_MIPSN32
 #define elf_check_abi(x) ((x) & EF_MIPS_ABI2)
@@ -1806,6 +1811,10 @@ static inline void init_thread(struct target_pt_regs *regs,
 #define bswaptls(ptr) bswap32s(ptr)
 #endif
 
+#ifndef EXSTACK_DEFAULT
+#define EXSTACK_DEFAULT false
+#endif
+
 #include "elf.h"
 
 /* We must delay the following stanzas until after "elf.h". */
@@ -2081,6 +2090,7 @@ static abi_ulong setup_arg_pages(struct linux_binprm *bprm,
  struct image_info *info)
 {
 abi_ulong size, error, guard;
+int prot;
 
 size = guest_stack_size;
 if (size < STACK_LOWER_LIMIT) {
@@ -2091,7 +2101,11 @@ static abi_ulong setup_arg_pages(struct linux_binprm *bprm,
 guard = qemu_real_host_page_size();
 }
 
-error = target_mmap(0, size + guard, PROT_READ | PROT_WRITE,
+prot = PROT_READ | PROT_WRITE;
+if (info->exec_stack) {
+prot |= PROT_EXEC;
+}
+error = target_mmap(0, size + guard, prot,
 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 if (error == -1) {
 perror("mmap stack");
@@ -2921,6 +2935,7 @@ static void load_elf_image(const char *image_name, int image_fd,
  */
 loaddr = -1, hiaddr = 0;
 info->alignment = 0;
+info->exec_stack = EXSTACK_DEFAULT;
 for (i = 0; i < ehdr->e_phnum; ++i) {
 struct elf_phdr *eppnt = phdr + i;
 if (eppnt->p_type == PT_LOAD) {
@@ -2963,6 +2978,8 @@ static void load_elf_image(const char *image_name, int image_fd,
 if (!parse_elf_properties(image_fd, info, eppnt, bprm_buf, &err)) {
 goto exit_errmsg;
 }
+} else if (eppnt->p_type == PT_GNU_STACK) {
+info->exec_stack = eppnt->p_flags & PF_X;
 }
 }
 
-- 
2.34.1




[PATCH v6 02/21] linux-user/hppa: Allocate page zero as a commpage

2022-08-18 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means that we've
got to mark page zero executable.  We had been special casing this
entirely within translate.

Signed-off-by: Richard Henderson 
---
 linux-user/elfload.c | 34 +++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 3e3dc02499..29d910c4cc 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -1646,6 +1646,34 @@ static inline void init_thread(struct target_pt_regs *regs,
 regs->gr[31] = infop->entry;
 }
 
+#define LO_COMMPAGE  0
+
+static bool init_guest_commpage(void)
+{
+void *want = g2h_untagged(LO_COMMPAGE);
+void *addr = mmap(want, qemu_host_page_size, PROT_NONE,
+  MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+
+if (addr == MAP_FAILED) {
+perror("Allocating guest commpage");
+exit(EXIT_FAILURE);
+}
+if (addr != want) {
+return false;
+}
+
+/*
+ * On Linux, page zero is normally marked execute only + gateway.
+ * Normal read or write is supposed to fail (thus PROT_NONE above),
+ * but specific offsets have kernel code mapped to raise permissions
+ * and implement syscalls.  Here, simply mark the page executable.
+ * Special case the entry points during translation (see do_page_zero).
+ */
+page_set_flags(LO_COMMPAGE, LO_COMMPAGE + TARGET_PAGE_SIZE,
+   PAGE_EXEC | PAGE_VALID);
+return true;
+}
+
 #endif /* TARGET_HPPA */
 
 #ifdef TARGET_XTENSA
@@ -2326,12 +2354,12 @@ static abi_ulong create_elf_tables(abi_ulong p, int argc, int envc,
 }
 
 #if defined(HI_COMMPAGE)
-#define LO_COMMPAGE 0
+#define LO_COMMPAGE -1
 #elif defined(LO_COMMPAGE)
 #define HI_COMMPAGE 0
 #else
 #define HI_COMMPAGE 0
-#define LO_COMMPAGE 0
+#define LO_COMMPAGE -1
 #define init_guest_commpage() true
 #endif
 
@@ -2555,7 +2583,7 @@ static void pgb_static(const char *image_name, abi_ulong orig_loaddr,
 } else {
 offset = -(HI_COMMPAGE & -align);
 }
-} else if (LO_COMMPAGE != 0) {
+} else if (LO_COMMPAGE != -1) {
 loaddr = MIN(loaddr, LO_COMMPAGE & -align);
 }
 
-- 
2.34.1




[PATCH v6 09/21] accel/tcg: Unlock mmap_lock after longjmp

2022-08-18 Thread Richard Henderson
The mmap_lock is held around tb_gen_code.  While the comment
is correct that the lock is dropped when tb_gen_code runs out
of memory, the lock is *not* dropped when an exception is
raised reading code for translation.

Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c  | 12 ++--
 accel/tcg/user-exec.c |  3 ---
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index a565a3f8ec..d18081ca6f 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -462,13 +462,11 @@ void cpu_exec_step_atomic(CPUState *cpu)
 cpu_tb_exec(cpu, tb, &tb_exit);
 cpu_exec_exit(cpu);
 } else {
-/*
- * The mmap_lock is dropped by tb_gen_code if it runs out of
- * memory.
- */
 #ifndef CONFIG_SOFTMMU
 clear_helper_retaddr();
-tcg_debug_assert(!have_mmap_lock());
+if (have_mmap_lock()) {
+mmap_unlock();
+}
 #endif
 if (qemu_mutex_iothread_locked()) {
 qemu_mutex_unlock_iothread();
@@ -936,7 +934,9 @@ int cpu_exec(CPUState *cpu)
 
 #ifndef CONFIG_SOFTMMU
 clear_helper_retaddr();
-tcg_debug_assert(!have_mmap_lock());
+if (have_mmap_lock()) {
+mmap_unlock();
+}
 #endif
 if (qemu_mutex_iothread_locked()) {
 qemu_mutex_unlock_iothread();
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index a20234fb02..58edd33896 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -80,10 +80,7 @@ MMUAccessType adjust_signal_pc(uintptr_t *pc, bool is_write)
  * (and if the translator doesn't handle page boundaries correctly
  * there's little we can do about that here).  Therefore, do not
  * trigger the unwinder.
- *
- * Like tb_gen_code, release the memory lock before cpu_loop_exit.
  */
-mmap_unlock();
 *pc = 0;
 return MMU_INST_FETCH;
 }
-- 
2.34.1




[PATCH v6 05/21] linux-user: Clear translations and tb_jmp_cache on mprotect()

2022-08-18 Thread Richard Henderson
From: Ilya Leoshkevich 

Currently it's possible to execute pages that do not have PAGE_EXEC
if there is an existing translation block. Fix by clearing tb_jmp_cache
and invalidating TBs, which forces recheck of permission bits.

Signed-off-by: Ilya Leoshkevich 
Message-Id: <20220817150506.592862-2-...@linux.ibm.com>
[rth: Invalidate is required -- e.g. riscv fallthrough cross test]
Signed-off-by: Richard Henderson 

fixup mprotect
---
 linux-user/mmap.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 048c4135af..e9dc8848be 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -115,6 +115,7 @@ int target_mprotect(abi_ulong start, abi_ulong len, int target_prot)
 {
 abi_ulong end, host_start, host_end, addr;
 int prot1, ret, page_flags, host_prot;
+CPUState *cpu;
 
 trace_target_mprotect(start, len, target_prot);
 
@@ -177,7 +178,14 @@ int target_mprotect(abi_ulong start, abi_ulong len, int target_prot)
 goto error;
 }
 }
+
 page_set_flags(start, start + len, page_flags);
+tb_invalidate_phys_range(start, start + len);
+
+CPU_FOREACH(cpu) {
+cpu_tb_jmp_cache_clear(cpu);
+}
+
 mmap_unlock();
 return 0;
 error:
-- 
2.34.1




[PATCH v6 07/21] accel/tcg: Introduce is_same_page()

2022-08-18 Thread Richard Henderson
From: Ilya Leoshkevich 

Introduce a function that checks whether a given address is on the same
page as where disassembly started. Having it improves readability of
the following patches.

Signed-off-by: Ilya Leoshkevich 
Message-Id: <20220811095534.241224-3-...@linux.ibm.com>
Reviewed-by: Richard Henderson 
[rth: Make the DisasContextBase parameter const.]
Signed-off-by: Richard Henderson 
---
 include/exec/translator.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/exec/translator.h b/include/exec/translator.h
index 7db6845535..0d0bf3a31e 100644
--- a/include/exec/translator.h
+++ b/include/exec/translator.h
@@ -187,4 +187,14 @@ FOR_EACH_TRANSLATOR_LD(GEN_TRANSLATOR_LD)
 
 #undef GEN_TRANSLATOR_LD
 
+/*
+ * Return whether addr is on the same page as where disassembly started.
+ * Translators can use this to enforce the rule that only single-insn
+ * translation blocks are allowed to cross page boundaries.
+ */
+static inline bool is_same_page(const DisasContextBase *db, target_ulong addr)
+{
+return ((addr ^ db->pc_first) & TARGET_PAGE_MASK) == 0;
+}
+
 #endif /* EXEC__TRANSLATOR_H */
-- 
2.34.1




[PATCH v6 03/21] linux-user/x86_64: Allocate vsyscall page as a commpage

2022-08-18 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means that we've
got to mark the vsyscall page executable.  We had been special casing
this entirely within translate.

Signed-off-by: Richard Henderson 
---
 linux-user/elfload.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 29d910c4cc..b20d513929 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -195,6 +195,27 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, const CPUX86State *env)
 (*regs)[26] = tswapreg(env->segs[R_GS].selector & 0xffff);
 }
 
+#if ULONG_MAX >= TARGET_VSYSCALL_PAGE
+#define INIT_GUEST_COMMPAGE
+static bool init_guest_commpage(void)
+{
+/*
+ * The vsyscall page is at a high negative address aka kernel space,
+ * which means that we cannot actually allocate it with target_mmap.
+ * We still should be able to use page_set_flags, unless the user
+ * has specified -R reserved_va, which would trigger an assert().
+ */
+if (reserved_va != 0 &&
+TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE >= reserved_va) {
+error_report("Cannot allocate vsyscall page");
+exit(EXIT_FAILURE);
+}
+page_set_flags(TARGET_VSYSCALL_PAGE,
+   TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE,
+   PAGE_EXEC | PAGE_VALID);
+return true;
+}
+#endif
 #else
 
 #define ELF_START_MMAP 0x8000
@@ -2360,8 +2381,10 @@ static abi_ulong create_elf_tables(abi_ulong p, int argc, int envc,
 #else
 #define HI_COMMPAGE 0
 #define LO_COMMPAGE -1
+#ifndef INIT_GUEST_COMMPAGE
 #define init_guest_commpage() true
 #endif
+#endif
 
 static void pgb_fail_in_use(const char *image_name)
 {
-- 
2.34.1




[PATCH v6 01/21] linux-user/arm: Mark the commpage executable

2022-08-18 Thread Richard Henderson
We're about to start validating PAGE_EXEC, which means
that we've got to mark the commpage executable.  We had
been placing the commpage outside of reserved_va, which
was incorrect and led to an abort.

Signed-off-by: Richard Henderson 
---
 linux-user/arm/target_cpu.h | 4 ++--
 linux-user/elfload.c| 6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/linux-user/arm/target_cpu.h b/linux-user/arm/target_cpu.h
index 709d19bc9e..89ba274cfc 100644
--- a/linux-user/arm/target_cpu.h
+++ b/linux-user/arm/target_cpu.h
@@ -34,9 +34,9 @@ static inline unsigned long arm_max_reserved_va(CPUState *cs)
 } else {
 /*
  * We need to be able to map the commpage.
- * See validate_guest_space in linux-user/elfload.c.
+ * See init_guest_commpage in linux-user/elfload.c.
  */
-return 0xul;
+return 0xul;
 }
 }
 #define MAX_RESERVED_VA  arm_max_reserved_va
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index ce902dbd56..3e3dc02499 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -398,7 +398,8 @@ enum {
 
 static bool init_guest_commpage(void)
 {
-void *want = g2h_untagged(HI_COMMPAGE & -qemu_host_page_size);
+abi_ptr commpage = HI_COMMPAGE & -qemu_host_page_size;
+void *want = g2h_untagged(commpage);
 void *addr = mmap(want, qemu_host_page_size, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
 
@@ -417,6 +418,9 @@ static bool init_guest_commpage(void)
 perror("Protecting guest commpage");
 exit(EXIT_FAILURE);
 }
+
+page_set_flags(commpage, commpage + qemu_host_page_size,
+   PAGE_READ | PAGE_EXEC | PAGE_VALID);
 return true;
 }
 
-- 
2.34.1




[PATCH v6 00/21] linux-user: Fix siginfo_t contents when jumping to non-readable pages

2022-08-18 Thread Richard Henderson
Hi Ilya,

After adding support for riscv (similar to s390x, in that we can
find the total insn length from the first couple of bits, so, easy),
I find that the test case doesn't work without all of the other
changes for PROT_EXEC, including the translator_ld changes.

Other changes from your v5:
  - mprotect invalidates tbs.  The test case is riscv, with a
4-byte insn at offset 0xffe, which was chained to from the
insn at offset 0xffa.  The fact that the 0xffe tb was not
invalidated meant that we chained to it and re-executed
without revalidating page protections.

  - rewrote the test framework to be agnostic of page size, which
reduces some of the repetition.  I ran into trouble with the
riscv linker, which relaxed the segment such that .align+.org
wasn't actually honored.  This new form doesn't require the
test bytes to be aligned in the binary.


r~


Ilya Leoshkevich (4):
  linux-user: Clear translations and tb_jmp_cache on mprotect()
  accel/tcg: Introduce is_same_page()
  target/s390x: Make translator stop before the end of a page
  target/i386: Make translator stop before the end of a page

Richard Henderson (17):
  linux-user/arm: Mark the commpage executable
  linux-user/hppa: Allocate page zero as a commpage
  linux-user/x86_64: Allocate vsyscall page as a commpage
  linux-user: Honor PT_GNU_STACK
  tests/tcg/i386: Move smc_code2 to an executable section
  accel/tcg: Properly implement get_page_addr_code for user-only
  accel/tcg: Unlock mmap_lock after longjmp
  accel/tcg: Make tb_htable_lookup static
  accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c
  accel/tcg: Use probe_access_internal for softmmu
get_page_addr_code_hostp
  accel/tcg: Add nofault parameter to get_page_addr_code_hostp
  accel/tcg: Raise PROT_EXEC exception early
  accel/tcg: Remove translator_ldsw
  accel/tcg: Add pc and host_pc params to gen_intermediate_code
  accel/tcg: Add fast path for translator_ld*
  target/riscv: Add MAX_INSN_LEN and insn_len
  target/riscv: Make translator stop before the end of a page

 include/elf.h |   1 +
 include/exec/cpu-common.h |   1 +
 include/exec/exec-all.h   |  87 ++
 include/exec/translator.h |  96 +---
 linux-user/arm/target_cpu.h   |   4 +-
 linux-user/qemu.h |   1 +
 accel/tcg/cpu-exec.c  | 134 ++--
 accel/tcg/cputlb.c|  93 ++--
 accel/tcg/plugin-gen.c|   4 +-
 accel/tcg/translate-all.c |  29 +++---
 accel/tcg/translator.c| 136 +---
 accel/tcg/user-exec.c |  18 +++-
 linux-user/elfload.c  |  82 +++--
 linux-user/mmap.c |   8 ++
 softmmu/physmem.c |  12 +++
 target/alpha/translate.c  |   5 +-
 target/arm/translate.c|   5 +-
 target/avr/translate.c|   5 +-
 target/cris/translate.c   |   5 +-
 target/hexagon/translate.c|   6 +-
 target/hppa/translate.c   |   5 +-
 target/i386/tcg/translate.c   |  32 ++-
 target/loongarch/translate.c  |   6 +-
 target/m68k/translate.c   |   5 +-
 target/microblaze/translate.c |   5 +-
 target/mips/tcg/translate.c   |   5 +-
 target/nios2/translate.c  |   5 +-
 target/openrisc/translate.c   |   6 +-
 target/ppc/translate.c|   5 +-
 target/riscv/translate.c  |  32 +--
 target/rx/translate.c |   5 +-
 target/s390x/tcg/translate.c  |  20 +++--
 target/sh4/translate.c|   5 +-
 target/sparc/translate.c  |   5 +-
 target/tricore/translate.c|   6 +-
 target/xtensa/translate.c |   6 +-
 tests/tcg/i386/test-i386.c|   2 +-
 tests/tcg/riscv64/noexec.c|  79 +
 tests/tcg/s390x/noexec.c  | 106 ++
 tests/tcg/x86_64/noexec.c |  75 
 tests/tcg/multiarch/noexec.c.inc  | 141 ++
 tests/tcg/riscv64/Makefile.target |   1 +
 tests/tcg/s390x/Makefile.target   |   1 +
 tests/tcg/x86_64/Makefile.target  |   3 +-
 44 files changed, 951 insertions(+), 342 deletions(-)
 create mode 100644 tests/tcg/riscv64/noexec.c
 create mode 100644 tests/tcg/s390x/noexec.c
 create mode 100644 tests/tcg/x86_64/noexec.c
 create mode 100644 tests/tcg/multiarch/noexec.c.inc

-- 
2.34.1




[PATCH] target/riscv: Use official extension names for AIA CSRs

2022-08-18 Thread Anup Patel
The arch review of the AIA spec is complete and we now have official
extension names for AIA: Smaia (M-mode AIA CSRs) and Ssaia (S-mode
AIA CSRs).

Refer to section 1.6 of the latest AIA v0.3.1 stable specification at
https://github.com/riscv/riscv-aia/releases/download/0.3.1-draft.32/riscv-interrupts-032.pdf)

Based on above, we update QEMU RISC-V to:
1) Have separate config options for Smaia and Ssaia extensions
   which replace RISCV_FEATURE_AIA in CPU features
2) Not generate AIA INTC compatible string in virt machine

Signed-off-by: Anup Patel 
Reviewed-by: Andrew Jones 
---
 hw/intc/riscv_imsic.c |  4 +++-
 hw/riscv/virt.c   | 13 ++---
 target/riscv/cpu.c|  9 -
 target/riscv/cpu.h|  4 ++--
 target/riscv/cpu_helper.c | 30 ++
 target/riscv/csr.c| 30 --
 6 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/hw/intc/riscv_imsic.c b/hw/intc/riscv_imsic.c
index 8615e4cc1d..4d4d5b50ca 100644
--- a/hw/intc/riscv_imsic.c
+++ b/hw/intc/riscv_imsic.c
@@ -344,9 +344,11 @@ static void riscv_imsic_realize(DeviceState *dev, Error **errp)
 
 /* Force select AIA feature and setup CSR read-modify-write callback */
 if (env) {
-riscv_set_feature(env, RISCV_FEATURE_AIA);
 if (!imsic->mmode) {
+rcpu->cfg.ext_ssaia = true;
 riscv_cpu_set_geilen(env, imsic->num_pages - 1);
+} else {
+rcpu->cfg.ext_smaia = true;
 }
 riscv_cpu_set_aia_ireg_rmw_fn(env, (imsic->mmode) ? PRV_M : PRV_S,
   riscv_imsic_rmw, imsic);
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index e779d399ae..b041b33afc 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -261,17 +261,8 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int socket,
 qemu_fdt_add_subnode(mc->fdt, intc_name);
 qemu_fdt_setprop_cell(mc->fdt, intc_name, "phandle",
 intc_phandles[cpu]);
-if (riscv_feature(&s->soc[socket].harts[cpu].env,
-  RISCV_FEATURE_AIA)) {
-static const char * const compat[2] = {
-"riscv,cpu-intc-aia", "riscv,cpu-intc"
-};
-qemu_fdt_setprop_string_array(mc->fdt, intc_name, "compatible",
-  (char **)&compat, ARRAY_SIZE(compat));
-} else {
-qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
-"riscv,cpu-intc");
-}
+qemu_fdt_setprop_string(mc->fdt, intc_name, "compatible",
+"riscv,cpu-intc");
 qemu_fdt_setprop(mc->fdt, intc_name, "interrupt-controller", NULL, 0);
 qemu_fdt_setprop_cell(mc->fdt, intc_name, "#interrupt-cells", 1);
 
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index d3fbaa..3cf0c86661 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -101,6 +101,8 @@ static const struct isa_ext_data isa_edata_arr[] = {
 ISA_EXT_DATA_ENTRY(zve64f, true, PRIV_VERSION_1_12_0, ext_zve64f),
 ISA_EXT_DATA_ENTRY(zhinx, true, PRIV_VERSION_1_12_0, ext_zhinx),
 ISA_EXT_DATA_ENTRY(zhinxmin, true, PRIV_VERSION_1_12_0, ext_zhinxmin),
+ISA_EXT_DATA_ENTRY(smaia, true, PRIV_VERSION_1_12_0, ext_smaia),
+ISA_EXT_DATA_ENTRY(ssaia, true, PRIV_VERSION_1_12_0, ext_ssaia),
 ISA_EXT_DATA_ENTRY(sscofpmf, true, PRIV_VERSION_1_12_0, ext_sscofpmf),
 ISA_EXT_DATA_ENTRY(sstc, true, PRIV_VERSION_1_12_0, ext_sstc),
 ISA_EXT_DATA_ENTRY(svinval, true, PRIV_VERSION_1_12_0, ext_svinval),
@@ -669,10 +671,6 @@ static void riscv_cpu_realize(DeviceState *dev, Error **errp)
 }
 }
 
-if (cpu->cfg.aia) {
-riscv_set_feature(env, RISCV_FEATURE_AIA);
-}
-
 if (cpu->cfg.debug) {
 riscv_set_feature(env, RISCV_FEATURE_DEBUG);
 }
@@ -1058,7 +1056,8 @@ static Property riscv_cpu_extensions[] = {
 DEFINE_PROP_BOOL("x-j", RISCVCPU, cfg.ext_j, false),
 /* ePMP 0.9.3 */
 DEFINE_PROP_BOOL("x-epmp", RISCVCPU, cfg.epmp, false),
-DEFINE_PROP_BOOL("x-aia", RISCVCPU, cfg.aia, false),
+DEFINE_PROP_BOOL("x-smaia", RISCVCPU, cfg.ext_smaia, false),
+DEFINE_PROP_BOOL("x-ssaia", RISCVCPU, cfg.ext_ssaia, false),
 
 DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 42edfa4558..15cad73def 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -85,7 +85,6 @@ enum {
 RISCV_FEATURE_PMP,
 RISCV_FEATURE_EPMP,
 RISCV_FEATURE_MISA,
-RISCV_FEATURE_AIA,
 RISCV_FEATURE_DEBUG
 };
 
@@ -452,6 +451,8 @@ struct RISCVCPUConfig {
 bool ext_zve64f;
 bool ext_zmmul;
 bool ext_sscofpmf;
+bool ext_smaia;
+bool ext_ssaia;
 bool rvv_ta_all_1s;
 bool rvv_ma_all_1s;
 
@@ -472,7 +473,6 @@ struct RISCVCPUConfig {
 bool mmu;
 bool pmp;
 bool epmp;
-bool aia;
 bool debug;
 uint64_t resetvec;
 
diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c

Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-18 Thread Hugh Dickins
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] 
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

I always forget, migration means different things to different audiences.
As an mm person, I was meaning page migration, whereas a virtualization
person thinks VM live migration (which that reference appears to be about),
a scheduler person task migration, an ornithologist bird migration, etc.

But you're an mm person too: you may have cited that reference in the
knowledge that TDX 1.5 Live Migration will entail page migration of the
kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

> 
> > Some of these impressions may come from earlier iterations of the
> > patchset (v7 looks better in several ways than v5).  I am probably
> > underestimating the extent to which you have taken on board other
> > usages beyond TDX and SEV private memory, and rightly want to serve
> > them all with similar interfaces: perhaps there is enough justification
> > for shmem there, but I don't see it.  There was mention of userfaultfd
> > in one link: does that provide the justification for using shmem?
> > 
> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.

Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.

But I've yet to see why you want to involve this or that filesystem
(with all its filesystem-icity suppressed) at all.  The backing store
is host memory, and tmpfs and hugetlbfs just impose their own
idiosyncrasies on how that memory is allocated; but I think you would
do better to choose your own idiosyncrasies in allocation directly -
you don't need a different "backing store" to choose between 4k or 2M
or 1G or whatever allocations.

tmpfs and hugetlbfs and page cache are designed around sharing memory:
TDX is designed around absolutely not sharing memory; and the further
uses which Sean foresees appear not to need it as page cache either.

Except perhaps for page migration reasons.  It's somewhat incidental,  
but of course page migration knows how to migrate page cache, so
masquerading as page cache will give a short cut to page migration,
when page migration becomes at all possible.

> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).
> 
> > Here's what I would prefer, and imagine much easier for you to maintain;
> > but I'm no system designer, and may be misunderstanding throughout.
> > 
> > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > the fallocate syscall interface itself) to allocate and free the memory,
> > ioctl for initializing some of it too.  KVM in control of whether that
> > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > in shmem.c, no need for flags, seals, notifications to and fro because
> > KVM is already in control and knows the history.  If shmem actually has
> > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > add, just allocate and free kernel memory directly, recorded in your
> > own xarray.
> 
> I guess shim layer on top of shmem *can* work. I don't see immediately why
> it would not. But I'm not sure it is right direction. We risk creating yet
> another parallel VM with own rules/locking/accounting that opaque to
> core-mm.

You are already proposing a new set of rules, foreign to how tmpfs works
for others.  You're right that KVM allocating large amounts of memory,
opaque to core-mm, carries risk: and you'd be right to say that shmem.c
provides some clues 

RE: [PATCH V4 RESEND] net/colo.c: Fix the pointer issue reported by Coverity.

2022-08-18 Thread Zhang, Chen


> -Original Message-
> From: Jason Wang 
> Sent: Thursday, August 18, 2022 4:04 PM
> To: Zhang, Chen 
> Cc: Peter Maydell ; Li Zhijian
> ; qemu-dev 
> Subject: Re: [PATCH V4 RESEND] net/colo.c: Fix the pointer issue reported by
> Coverity.
> 
> On Wed, Aug 17, 2022 at 3:45 PM Zhang, Chen  wrote:
> >
> > Ping  Jason and Peter, any comments for this patch?
> >
> > Thanks
> > Chen
> >
> > > -Original Message-
> > > From: Zhang, Chen 
> > > Sent: Tuesday, August 9, 2022 4:49 PM
> > > To: Jason Wang ; Peter Maydell
> > > ; Li Zhijian ;
> > > qemu-dev 
> > > Cc: Zhang, Chen 
> > > Subject: [PATCH V4 RESEND] net/colo.c: Fix the pointer issue
> > > reported by Coverity.
> > >
> > > When virtio-net-pci is enabled, the guest network packet will carry the
> > > vnet_hdr. In COLO status, the primary VM's network packet may be redirected
> > > to another VM; this needs filter-redirect to enable the vnet_hdr flag at
> > > the same time, so that COLO-proxy will correctly parse the original network
> > > packet. With any misconfiguration here, the vnet_hdr_len is wrong for
> > > parsing the packet, and data+offset will point to the wrong place.
> > >
> > > Signed-off-by: Zhang Chen 
> > > ---
> > >  net/colo.c | 18 ++
> > >  net/colo.h |  1 +
> > >  2 files changed, 11 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/net/colo.c b/net/colo.c
> > > index 6b0ff562ad..2b5568fff4 100644
> > > --- a/net/colo.c
> > > +++ b/net/colo.c
> > > @@ -44,21 +44,23 @@ int parse_packet_early(Packet *pkt)  {
> > >  int network_length;
> > >  static const uint8_t vlan[] = {0x81, 0x00};
> > > -uint8_t *data = pkt->data + pkt->vnet_hdr_len;
> > > +uint8_t *data = pkt->data;
> > >  uint16_t l3_proto;
> > >  ssize_t l2hdr_len;
> > >
> > > -if (data == NULL) {
> > > -trace_colo_proxy_main_vnet_info("This packet is not parsed correctly, "
> > > +assert(data);
> > > +
> > > +/* Check the received vnet_hdr_len then add the offset */
> > > +if ((pkt->vnet_hdr_len > sizeof(struct virtio_net_hdr_v1_hash)) ||
> > > +(pkt->size < sizeof(struct eth_header) + sizeof(struct vlan_header)
> > > ++ pkt->vnet_hdr_len)) {
> > > +trace_colo_proxy_main_vnet_info("This packet may be load wrong "
> > >  "pkt->vnet_hdr_len",
> > > pkt->vnet_hdr_len);
> 
> Nit: I think we need to be verbose here, e.g put the pkt_size here at least.

OK, I will change it here to:
/*
 * The received remote packet may be misconfigured here.
 * Please enable/disable the filter module's vnet_hdr flag at the same time.
 */
trace_colo_proxy_main_vnet_info("This received packet load wrong "
                                "pkt->vnet_hdr_len",
                                pkt->vnet_hdr_len, pkt->size);

Thanks
Chen

> 
> Thanks
> 
> > >  return 1;
> > >  }
> > > -l2hdr_len = eth_get_l2_hdr_length(data);
> > > +data += pkt->vnet_hdr_len;
> > >
> > > -if (pkt->size < ETH_HLEN + pkt->vnet_hdr_len) {
> > > -trace_colo_proxy_main("pkt->size < ETH_HLEN");
> > > -return 1;
> > > -}
> > > +l2hdr_len = eth_get_l2_hdr_length(data);
> > >
> > >  /*
> > >   * TODO: support vlan.
> > > diff --git a/net/colo.h b/net/colo.h
> > > index 8b3e8d5a83..22fc3031f7 100644
> > > --- a/net/colo.h
> > > +++ b/net/colo.h
> > > @@ -18,6 +18,7 @@
> > >  #include "qemu/jhash.h"
> > >  #include "qemu/timer.h"
> > >  #include "net/eth.h"
> > > +#include "standard-headers/linux/virtio_net.h"
> > >
> > >  #define HASHTABLE_MAX_SIZE 16384
> > >
> > > --
> > > 2.25.1
> >



Re: [PATCH for-7.2 v2 10/20] hw/ppc: set machine->fdt in spapr machine

2022-08-18 Thread Alexey Kardashevskiy




On 05/08/2022 19:39, Daniel Henrique Barboza wrote:

The pSeries machine never bothered with the common machine->fdt
attribute. We do all the FDT related work using spapr->fdt_blob.

We're going to introduce HMP commands to read and save the FDT, which
will rely on setting machine->fdt properly to work across all machine
archs/types.



Out of curiosity - why new HMP command, is not QOM'ing this ms::fdt 
property enough?


Another thing is that on every HMP dump I'd probably rebuild the entire 
FDT for the reasons David explained. Thanks,





Let's set machine->fdt in the two places where we manipulate the FDT:
spapr_machine_reset() and CAS. spapr->fdt_blob is left untouched: what
we want is a way to access the FDT from HMP, not replace
spapr->fdt_blob.

Cc: Cédric Le Goater 
Cc: qemu-...@nongnu.org
Signed-off-by: Daniel Henrique Barboza 
---
  hw/ppc/spapr.c   | 6 ++
  hw/ppc/spapr_hcall.c | 8 
  2 files changed, 14 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index bc9ba6e6dc..94c90f0351 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1713,6 +1713,12 @@ static void spapr_machine_reset(MachineState *machine)
  spapr->fdt_initial_size = spapr->fdt_size;
  spapr->fdt_blob = fdt;
  
+/*
+ * Set the common machine->fdt pointer to enable support
+ * for 'dumpdtb' and 'info fdt' commands.
+ */
+machine->fdt = fdt;
+
  /* Set up the entry state */
  first_ppc_cpu->env.gpr[5] = 0;
  
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index a8d4a6bcf0..0079bc6fdc 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1256,6 +1256,14 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
  spapr->fdt_initial_size = spapr->fdt_size;
  spapr->fdt_blob = fdt;
  
+/*
+ * Set the machine->fdt pointer again since we just freed
+ * it above (by freeing spapr->fdt_blob). We set this
+ * pointer to enable support for 'dumpdtb' and 'info fdt'
+ * HMP commands.
+ */
+MACHINE(spapr)->fdt = fdt;
+
  return H_SUCCESS;
  }
  


--
Alexey



Re: [Virtio-fs] [PATCH] virtiofsd: use g_date_time_get_microsecond to get subsecond

2022-08-18 Thread liuyd.f...@fujitsu.com
It works. I tested on RHEL8
Before this fix:
```
# /root/qemu/build/tools/virtiofsd/virtiofsd --socket-path=/tmp/sock -o source=/home/test -d

[(null)] [ID: 00133152] virtio_session_mount: Waiting for vhost-user 
socket connection...


```

After applying this patch
```
# /root/qemu/build/tools/virtiofsd/virtiofsd --socket-path=/tmp/sock -o source=/home/test -d

[2022-08-19 01:45:41.981608+] [ID: 00134587] virtio_session_mount: 
Waiting for vhost-user socket connection...

``` 


On 8/19/22 01:46, Yusuke Okada wrote:
> The "%f" specifier in g_date_time_format() is only available in glib
> 2.65.2 or later. If combined with older glib, the function returns null
> and the timestamp displayed as "(null)".
> 
> For backward compatibility, g_date_time_get_microsecond should be used
> to retrieve subsecond.
> 
> In this patch the g_date_time_format() leaves subsecond field as "%06d"
> and let next snprintf to format with g_date_time_get_microsecond.
> 
> Signed-off-by: Yusuke Okada 
> ---
>   tools/virtiofsd/passthrough_ll.c | 7 +--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 371a7bead6..20f0f41f99 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -4185,6 +4185,7 @@ static void setup_nofile_rlimit(unsigned long rlimit_nofile)
>   static void log_func(enum fuse_log_level level, const char *fmt, va_list ap)
>   {
>   g_autofree char *localfmt = NULL;
> +char buf[64];
>   
>   if (current_log_level < level) {
>   return;
> @@ -4197,9 +4198,11 @@ static void log_func(enum fuse_log_level level, const char *fmt, va_list ap)
>  fmt);
>   } else {
>   g_autoptr(GDateTime) now = g_date_time_new_now_utc();
> -g_autofree char *nowstr = g_date_time_format(now, "%Y-%m-%d %H:%M:%S.%f%z");
> +g_autofree char *nowstr = g_date_time_format(now,
> +   "%Y-%m-%d %H:%M:%S.%%06d%z");
> +snprintf(buf, 64, nowstr, g_date_time_get_microsecond(now));
>   localfmt = g_strdup_printf("[%s] [ID: %08ld] %s",
> -   nowstr, syscall(__NR_gettid), fmt);
> +   buf, syscall(__NR_gettid), fmt);
>   }
>   fmt = localfmt;
>   }

-- 
Thanks,
Yiding

RE: [PATCH] i386: Disable BTS and PEBS

2022-08-18 Thread Duan, Zhenzhong


>-Original Message-
>From: Paolo Bonzini  On Behalf Of Paolo Bonzini
>Sent: Wednesday, July 20, 2022 2:19 AM
>To: Christopherson,, Sean 
>Cc: Duan, Zhenzhong ; qemu-
>de...@nongnu.org; mtosa...@redhat.com; lik...@tencent.com; Ma,
>XiangfeiX 
>Subject: Re: [PATCH] i386: Disable BTS and PEBS
>
>On 7/18/22 22:12, Sean Christopherson wrote:
>> On Mon, Jul 18, 2022, Paolo Bonzini wrote:
>>> This needs to be fixed in the kernel because old QEMU/new KVM is
>supported.
>>
>> I can't object to adding a quirk for this since KVM is breaking
>> userspace, but on the KVM side we really need to stop "sanitizing"
>> userspace inputs unless it puts the host at risk, because inevitably it leads
>to needing a quirk.
>
>The problem is not the sanitizing, it's that userspace literally cannot know
>that this needs to be done because the feature bits are "backwards"
>(1 = unavailable).
>
>The right way to fix it is probably to use feature MSRs and, by default, leave
>the features marked as unavailable.  I'll think it through and post a patch
>tomorrow for both KVM and QEMU (to enable PEBS).
Hi Paolo,

Can we ask about the status of your patch? QA can still reproduce the issue
with the newest upstream code.

Thanks
Zhenzhong


Re: [PULL 1/3] linux-user: un-parent OBJECT(cpu) when closing thread

2022-08-18 Thread Richard Henderson

On 8/16/22 05:26, Alex Bennée wrote:

While forcing the CPU to unrealize by hand does trigger the clean-up
code we never fully free resources because refcount never reaches
zero. This is because QOM automatically added objects without an
explicit parent to /unattached/, incrementing the refcount.

Instead of manually triggering unrealization just unparent the object
and let the device machinery deal with that for us.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/866
Signed-off-by: Alex Bennée 
Reviewed-by: Laurent Vivier 
Message-Id: <20220811151413.3350684-2-alex.ben...@linaro.org>

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index f409121202..bfdd60136b 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -8594,7 +8594,13 @@ static abi_long do_syscall1(CPUArchState *cpu_env, int num, abi_long arg1,
  if (CPU_NEXT(first_cpu)) {
  TaskState *ts = cpu->opaque;
  
-object_property_set_bool(OBJECT(cpu), "realized", false, NULL);
+if (ts->child_tidptr) {
+put_user_u32(0, ts->child_tidptr);
+do_sys_futex(g2h(cpu, ts->child_tidptr),
+ FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
+}
+
+object_unparent(OBJECT(cpu));


This has caused a regression in arm/aarch64.

We hard-code ARMCPRegInfo pointers into TranslationBlocks, for calling into 
helper_{get,set}cp_reg{,64}.  So we have a race condition between whichever cpu thread 
translates the code first (encoding the pointer), and that cpu thread exiting, so that the 
next execution of the TB references a freed data structure.


We shouldn't have N copies of these pointers in the first place.  This seems like 
something that ought to be placed on the ARMCPUClass, so that it could be shared by each cpu.


But that's going to be a complex fix, so I'm reverting this for rc4.


r~



Re: [PATCH 7/7] target/riscv: Honour -semihosting-config userspace=on and enable=on

2022-08-18 Thread Alistair Francis
On Thu, Aug 18, 2022 at 11:58 PM Peter Maydell  wrote:
>
> On Thu, 18 Aug 2022 at 05:20, Alistair Francis  wrote:
> >
> > On Tue, Aug 16, 2022 at 5:11 AM Peter Maydell  
> > wrote:
> > >
> > > The riscv target incorrectly enabled semihosting always, whether the
> > > user asked for it or not.  Call semihosting_enabled() passing the
> > > correct value to the is_userspace argument, which fixes this and also
> > > handles the userspace=on argument.
> > >
> > > Note that this is a behaviour change: we used to default to
> > > semihosting being enabled, and now the user must pass
> > > "-semihosting-config enable=on" if they want it.
> > >
> > > Signed-off-by: Peter Maydell 
> >
> > I agree with Richard that a check in translate would be better, but
> > this is also an improvement on the broken implementation we have now
>
> Do you have an opinion on whether there are likely to be many
> users who are using riscv semihosting without explicitly enabling it
> on the command line ?

I don't think there are many users. We have always stated that
semihosting had to be enabled via the command line

Alistair

>
> -- PMM



Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

2022-08-18 Thread Leonardo Bras Soares Passos
On Thu, Aug 18, 2022 at 5:05 PM Peter Xu  wrote:
>
> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > +   struct kvm_userspace_memory_region 
> > *mem)
> > +{
> > +MemoryRegionNode *node;
> > +
> > +node = g_malloc(sizeof(MemoryRegionNode));
> > +*node = (MemoryRegionNode) {
> > +.mem = mem,
> > +};
>
> Nit: direct assignment of struct looks okay, but maybe pointer assignment
> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
>
>   node = g_new0(MemoryRegionNode, 1);
>   node->mem = mem;
>
> [...]
>
> > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > +struct kvm_userspace_memory_region_list {
> > + __u32 nent;
> > + __u32 flags;
> > + struct kvm_userspace_memory_region entries[0];
> > +};
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> >   struct kvm_userspace_memory_region)
> >  #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
> >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > + struct kvm_userspace_memory_region_list)
>
> I think this is probably good enough, but just to provide the other small
> (but may not be important) piece of puzzle here.  I wanted to think through
> to understand better but I never did..
>
> For a quick look, please read the comment in kvm_set_phys_mem().
>
> /*
>  * NOTE: We should be aware of the fact that here we're only
>  * doing a best effort to sync dirty bits.  No matter whether
>  * we're using dirty log or dirty ring, we ignored two facts:
>  *
>  * (1) dirty bits can reside in hardware buffers (PML)
>  *
> >  * (2) after we collected dirty bits here, pages can be dirtied
>  * again before we do the final KVM_SET_USER_MEMORY_REGION to
>  * remove the slot.
>  *
>  * Not easy.  Let's cross the fingers until it's fixed.
>  */
>
> One example is if we have 16G mem, we enable dirty tracking and we punch a
> hole of 1G at offset 1G, it'll change from this:
>
>  (a)
>   |- 16G ---|
>
> To this:
>
>  (b)(c)  (d)
>   |--1G--|XX|14G|
>
> Here (c) will be a 1G hole.
>
> With current code, the hole punching will del region (a) and add back
> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
>
> Here the question is if we're with dirty tracking it means for each region
> we have a dirty bitmap.  Currently we do the best effort of doing below
> sequence:
>
>   (1) fetching dirty bmap of (a)
>   (2) delete region (a)
>   (3) add region (b) (d)
>
> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> dirty pages written between step (1) and (2) (and actually if the write
> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> we're going to fix..).
>
> So ideally the atomic op can be:
>
>   "atomically fetch dirty bmap for removed regions, remove regions, and add
>new regions"
>
> Rather than only:
>
>   "atomically remove regions, and add new regions"
>
> as what the new _LIST ioctl do.
>
> But... maybe that's not a real problem, at least I didn't know any report
> showing issue with current code yet caused by losing of dirty bits during
> step (1) and (2).  Neither do I know how to trigger an issue with it.
>
> I'm just trying to still provide this information so that you should be
> aware of this problem too, at the meantime when proposing the new ioctl
> change for qemu we should also keep in mind that we won't easily lose the
> dirty bmap of (a) here, which I think this patch does the right thing.
>

Thanks for bringing these details Peter!

What do you think of adding?
(4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
dirty bitmaps.


Best regards,
Leo

> Thanks!
>
> --
> Peter Xu
>




Re: [PATCH v6 6/8] KVM: Handle page fault for private memory

2022-08-18 Thread Kirill A. Shutemov
On Fri, Jun 17, 2022 at 09:30:53PM +, Sean Christopherson wrote:
> > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > read_unlock(&vcpu->kvm->mmu_lock);
> > else
> > write_unlock(&vcpu->kvm->mmu_lock);
> > -   kvm_release_pfn_clean(fault->pfn);
> > +
> > +   if (fault->is_private)
> > +   kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> 
> Why does the shmem path lock the page, and then unlock it here?

The lock is required to avoid a race with truncate / punch hole. E.g. if truncate
happens after get_pfn(), but before the pfn gets into the SEPT, we are screwed.

> Same question for why this path marks it dirty?  The guest has the page mapped
> so the dirty flag is immediately stale.

If page is clean and refcount is not elevated, vmscan is free to drop the
page from page cache. I don't think we want this.

> In other words, why does KVM need to do something different for private pfns?

Because in the traditional KVM memslot scheme, core mm takes care of this.

The changes in v7 are wrong. The page has to be locked until it lands in the
SEPT, and it must be made dirty before unlocking.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-18 Thread Sean Christopherson
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > On Wed, 6 Jul 2022, Chao Peng wrote:
> > But since then, TDX in particular has forced an effort into preventing
> > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > 
> > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
aren't useful for _all_ existing users, but I don't think it's fair to say that
they're not useful for _any_ existing users.

> > What use do you have for a filesystem here?  Almost none.
> > IIUC, what you want is an fd through which QEMU can allocate kernel
> > memory, selectively free that memory, and communicate fd+offset+length
> > to KVM.  And perhaps an interface to initialize a little of that memory
> > from a template (presumably copied from a real file on disk somewhere).
> > 
> > You don't need shmem.c or a filesystem for that!
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. 

And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
without having to take on the overhead of confidential VMs.  At that point,
migration and probably even swap are on the table.

> And swapping theoretically possible, but I'm not aware of any plans as of
> now.

Ya, I highly doubt confidential VMs will ever bother with swap.

> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.
> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).

But secretmem _could_ be a fit.  If a use case wants to unmap guest private 
memory
from both userspace and the kernel then KVM should absolutely be able to support
that, but at the same time I don't want to have to update KVM to enable secretmem
(and I definitely don't want KVM poking into the directmap itself).

MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace",
any other properties should be completely separate, e.g. the inability to migrate
pages is effectively a restriction from KVM (acting on behalf of TDX/SNP), it's not
a fundamental property of MFD_INACCESSIBLE.



[PATCH v2] target/arm: Add cortex-a35

2022-08-18 Thread Hao Wu
Add cortex A35 core and enable it for virt board.

Signed-off-by: Hao Wu 
Reviewed-by: Joe Komlodi 
---
 docs/system/arm/virt.rst |  1 +
 hw/arm/virt.c|  1 +
 target/arm/cpu64.c   | 80 
 3 files changed, 82 insertions(+)

diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 3b6ba69a9a..20442ea2c1 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -52,6 +52,7 @@ Supported guest CPU types:
 
 - ``cortex-a7`` (32-bit)
 - ``cortex-a15`` (32-bit; the default)
+- ``cortex-a35`` (64-bit)
 - ``cortex-a53`` (64-bit)
 - ``cortex-a57`` (64-bit)
 - ``cortex-a72`` (64-bit)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9633f822f3..ee06003aed 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -199,6 +199,7 @@ static const int a15irqmap[] = {
 static const char *valid_cpus[] = {
 ARM_CPU_TYPE_NAME("cortex-a7"),
 ARM_CPU_TYPE_NAME("cortex-a15"),
+ARM_CPU_TYPE_NAME("cortex-a35"),
 ARM_CPU_TYPE_NAME("cortex-a53"),
 ARM_CPU_TYPE_NAME("cortex-a57"),
 ARM_CPU_TYPE_NAME("cortex-a72"),
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 78e27f778a..9d1ea32057 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -36,6 +36,85 @@
 #include "hw/qdev-properties.h"
 #include "internals.h"
 
+static void aarch64_a35_initfn(Object *obj)
+{
+ARMCPU *cpu = ARM_CPU(obj);
+
+cpu->dtb_compatible = "arm,cortex-a35";
+set_feature(>env, ARM_FEATURE_V8);
+set_feature(>env, ARM_FEATURE_NEON);
+set_feature(>env, ARM_FEATURE_GENERIC_TIMER);
+set_feature(>env, ARM_FEATURE_AARCH64);
+set_feature(>env, ARM_FEATURE_CBAR_RO);
+set_feature(>env, ARM_FEATURE_EL2);
+set_feature(>env, ARM_FEATURE_EL3);
+set_feature(>env, ARM_FEATURE_PMU);
+
+/* From B2.2 AArch64 identification registers. */
+cpu->midr = 0x411fd040;
+cpu->revidr = 0;
+cpu->ctr = 0x84448004;
+cpu->isar.id_pfr0 = 0x0131;
+cpu->isar.id_pfr1 = 0x00011011;
+cpu->isar.id_dfr0 = 0x03010066;
+cpu->id_afr0 = 0;
+cpu->isar.id_mmfr0 = 0x10201105;
+cpu->isar.id_mmfr1 = 0x4000;
+cpu->isar.id_mmfr2 = 0x0126;
+cpu->isar.id_mmfr3 = 0x02102211;
+cpu->isar.id_isar0 = 0x02101110;
+cpu->isar.id_isar1 = 0x13112111;
+cpu->isar.id_isar2 = 0x21232042;
+cpu->isar.id_isar3 = 0x01112131;
+cpu->isar.id_isar4 = 0x00011142;
+cpu->isar.id_isar5 = 0x00011121;
+cpu->isar.id_aa64pfr0 = 0x;
+cpu->isar.id_aa64pfr1 = 0;
+cpu->isar.id_aa64dfr0 = 0x10305106;
+cpu->isar.id_aa64dfr1 = 0;
+cpu->isar.id_aa64isar0 = 0x00011120;
+cpu->isar.id_aa64isar1 = 0;
+cpu->isar.id_aa64mmfr0 = 0x00101122;
+cpu->isar.id_aa64mmfr1 = 0;
+cpu->clidr = 0x0a200023;
+cpu->dcz_blocksize = 4;
+
+/* From B2.4 AArch64 Virtual Memory control registers */
+cpu->reset_sctlr = 0x00c50838;
+
+/* From B2.10 AArch64 performance monitor registers */
+cpu->isar.reset_pmcr_el0 = 0x410a3000;
+
+/* From B2.29 Cache ID registers */
+cpu->ccsidr[0] = 0x700fe01a; /* 32KB L1 dcache */
+cpu->ccsidr[1] = 0x201fe00a; /* 32KB L1 icache */
+cpu->ccsidr[2] = 0x703fe03a; /* 512KB L2 cache */
+
+/* From B3.5 VGIC Type register */
+cpu->gic_num_lrs = 4;
+cpu->gic_vpribits = 5;
+cpu->gic_vprebits = 5;
+cpu->gic_pribits = 5;
+
+/* From C6.4 Debug ID Register */
+cpu->isar.dbgdidr = 0x3516d000;
+/* From C6.5 Debug Device ID Register */
+cpu->isar.dbgdevid = 0x00110f13;
+/* From C6.6 Debug Device ID Register 1 */
+cpu->isar.dbgdevid1 = 0x2;
+
+/* From Cortex-A35 SIMD and Floating-point Support r1p0 */
+/* From 3.2 AArch32 register summary */
+cpu->reset_fpsid = 0x41034043;
+
+/* From 2.2 AArch64 register summary */
+cpu->isar.mvfr0 = 0x10110222;
+cpu->isar.mvfr1 = 0x1211;
+cpu->isar.mvfr2 = 0x0043;
+
+/* These values are the same with A53/A57/A72. */
+define_cortex_a72_a57_a53_cp_reginfo(cpu);
+}
 
 static void aarch64_a57_initfn(Object *obj)
 {
@@ -1158,6 +1237,7 @@ static void aarch64_a64fx_initfn(Object *obj)
 }
 
 static const ARMCPUInfo aarch64_cpus[] = {
+{ .name = "cortex-a35", .initfn = aarch64_a35_initfn },
 { .name = "cortex-a57", .initfn = aarch64_a57_initfn },
 { .name = "cortex-a53", .initfn = aarch64_a53_initfn },
 { .name = "cortex-a72", .initfn = aarch64_a72_initfn },
-- 
2.37.1.595.g718a3a8f04-goog




[PATCH for-7.2 2/2] ppc/pnv: fix QOM parenting of user creatable root ports

2022-08-18 Thread Daniel Henrique Barboza
User creatable root ports are being parented by the 'peripheral' or the
'peripheral-anon' container. This happens because this is the regular
QOM schema for sysbus devices that are added via the command line.

Let's make this QOM hierarchy similar to what we have with default root
ports, i.e. the root port must be parented by the pnv-root-bus. To do
that we change the qom and bus parent of the root port during
root_port_realize(). The realize() is shared by the default root port
code path, so we can remove the code inside pnv_phb_attach_root_port()
that was adding the root port as a child of the bus as well.

While we're at it, change pnv_phb_attach_root_port() to receive a PCIBus
instead of a PCIHostState to make it clear that the function does not
make use of the PHB.

Signed-off-by: Daniel Henrique Barboza 
---
 hw/pci-host/pnv_phb.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/hw/pci-host/pnv_phb.c b/hw/pci-host/pnv_phb.c
index 4ea33fb6ba..38ec8571b7 100644
--- a/hw/pci-host/pnv_phb.c
+++ b/hw/pci-host/pnv_phb.c
@@ -62,27 +62,11 @@ static bool pnv_parent_fixup(Object *parent, BusState *parent_bus,
 return true;
 }
 
-/*
- * Attach a root port device.
- *
- * 'index' will be used both as a PCIE slot value and to calculate
- * QOM id. 'chip_id' is going to be used as PCIE chassis for the
- * root port.
- */
-static void pnv_phb_attach_root_port(PCIHostState *pci)
+static void pnv_phb_attach_root_port(PCIBus *bus)
 {
 PCIDevice *root = pci_new(PCI_DEVFN(0, 0), TYPE_PNV_PHB_ROOT_PORT);
-const char *dev_id = DEVICE(root)->id;
-g_autofree char *default_id = NULL;
-int index;
 
-index = object_property_get_int(OBJECT(pci->bus), "phb-id", &error_fatal);
-default_id = g_strdup_printf("%s[%d]", TYPE_PNV_PHB_ROOT_PORT, index);
-
-object_property_add_child(OBJECT(pci->bus), dev_id ? dev_id : default_id,
-  OBJECT(root));
-
-pci_realize_and_unref(root, pci->bus, &error_fatal);
+pci_realize_and_unref(root, bus, &error_fatal);
 }
 
 /*
@@ -184,7 +168,7 @@ static void pnv_phb_realize(DeviceState *dev, Error **errp)
 return;
 }
 
-pnv_phb_attach_root_port(pci);
+pnv_phb_attach_root_port(pci->bus);
 }
 
 static const char *pnv_phb_root_bus_path(PCIHostState *host_bridge,
@@ -259,6 +243,11 @@ static void pnv_phb_root_port_realize(DeviceState *dev, Error **errp)
 Error *local_err = NULL;
 int chip_id, index;
 
+/*
+ * 'index' will be used both as a PCIE slot value and to calculate
+ * QOM id. 'chip_id' is going to be used as PCIE chassis for the
+ * root port.
+ */
chip_id = object_property_get_int(OBJECT(bus), "chip-id", &error_fatal);
index = object_property_get_int(OBJECT(bus), "phb-id", &error_fatal);
 
@@ -266,6 +255,17 @@ static void pnv_phb_root_port_realize(DeviceState *dev, Error **errp)
 qdev_prop_set_uint8(dev, "chassis", chip_id);
 qdev_prop_set_uint16(dev, "slot", index);
 
+/*
+ * User created root ports are QOM parented to one of
+ * the peripheral containers, but they are already assigned
+ * to the right parent bus. Change the QOM parent to match
+ * the parent bus they are already assigned to.
+ */
+if (!pnv_parent_fixup(OBJECT(bus), BUS(bus), OBJECT(dev),
+  index, errp)) {
+return;
+}
+
 rpc->parent_realize(dev, _err);
 if (local_err) {
 error_propagate(errp, local_err);
-- 
2.37.2




[PATCH for-7.2 1/2] ppc/pnv: consolidate pnv_parent_*_fixup() helpers

2022-08-18 Thread Daniel Henrique Barboza
We have 2 helpers that amend the QOM parent and the parent bus of a given
object, respectively. These 2 helpers are called together, and not by accident.
Due to QOM internals, doing an object_unparent() will result in the
device being removed from its parent bus. This means that changing the
QOM parent requires reassigning the parent bus again.

Create a single helper called pnv_parent_fixup() that documents some of
the QOM specifics of the unparenting/parenting mechanics we're dealing
with, and handles both the QOM and the parent bus assignment.

Next patch will make use of this function to handle a case where we need
to change the QOM parent while keeping the same parent bus assigned
beforehand.
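The reason the temporary reference matters can be modelled with a toy refcount, purely as an illustration of the object_ref()/object_unparent()/object_unref() dance: unparenting drops the reference the old parent held, so without a pin the child could be freed before the new parent takes over. The Obj type and helpers below are invented for this sketch; they are not QOM APIs.

```c
#include <stdlib.h>

/* Toy refcounted object standing in for a QOM child (illustration only). */
typedef struct Obj {
    int refs;
    struct Obj *parent;
} Obj;

static void obj_ref(Obj *o)   { o->refs++; }

static int obj_unref(Obj *o)  /* returns 1 if the object was freed */
{
    if (--o->refs == 0) {
        free(o);
        return 1;
    }
    return 0;
}

/* Unparenting drops the reference the parent held. */
static void obj_unparent(Obj *o)
{
    o->parent = NULL;
    obj_unref(o);
}

/* Reparent safely: pin the child so the unparent step cannot free it. */
static void obj_reparent(Obj *o, Obj *new_parent)
{
    obj_ref(o);          /* temporary pin across the unparent */
    obj_unparent(o);
    o->parent = new_parent;
    obj_ref(o);          /* reference now owned by the new parent */
    obj_unref(o);        /* drop our temporary pin */
}
```

Without the first obj_ref(), a child whose only reference was its parent's would hit refcount zero inside obj_unparent() and be freed mid-reparent.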

Signed-off-by: Daniel Henrique Barboza 
---
 hw/pci-host/pnv_phb.c | 43 ---
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/hw/pci-host/pnv_phb.c b/hw/pci-host/pnv_phb.c
index 17d9960aa1..4ea33fb6ba 100644
--- a/hw/pci-host/pnv_phb.c
+++ b/hw/pci-host/pnv_phb.c
@@ -21,34 +21,45 @@
 
 
 /*
- * Set the QOM parent of an object child. If the device state
- * associated with the child has an id, use it as QOM id. Otherwise
- * use object_typename[index] as QOM id.
+ * Set the QOM parent and parent bus of an object child. If the device
+ * state associated with the child has an id, use it as QOM id.
+ * Otherwise use object_typename[index] as QOM id.
+ *
+ * This helper does both operations at the same time because setting
+ * a new QOM child will erase the bus parent of the device. This happens
+ * because object_unparent() will call object_property_del_child(),
+ * which in turn calls the property release callback prop->release if
+ * it's defined. In our case this callback is set to
+ * object_finalize_child_property(), which was assigned during the
+ * first object_property_add_child() call. This callback will end up
+ * calling device_unparent(), and this function removes the device
+ * from its parent bus.
+ *
+ * The QOM parent and the parent bus to be set aren't necessarily related, so
+ * let's receive both as arguments.
  */
-static void pnv_parent_qom_fixup(Object *parent, Object *child, int index)
+static bool pnv_parent_fixup(Object *parent, BusState *parent_bus,
+ Object *child, int index,
+ Error **errp)
 {
 g_autofree char *default_id =
 g_strdup_printf("%s[%d]", object_get_typename(child), index);
 const char *dev_id = DEVICE(child)->id;
 
 if (child->parent == parent) {
-return;
+return true;
 }
 
 object_ref(child);
 object_unparent(child);
 object_property_add_child(parent, dev_id ? dev_id : default_id, child);
 object_unref(child);
-}
-
-static void pnv_parent_bus_fixup(DeviceState *parent, DeviceState *child,
- Error **errp)
-{
-BusState *parent_bus = qdev_get_parent_bus(parent);
 
-if (!qdev_set_parent_bus(child, parent_bus, errp)) {
-return;
+if (!qdev_set_parent_bus(DEVICE(child), parent_bus, errp)) {
+return false;
 }
+
+return true;
 }
 
 /*
@@ -101,8 +112,10 @@ static bool pnv_phb_user_device_init(PnvPHB *phb, Error **errp)
  * correctly the device tree. pnv_xscom_dt() needs every
  * PHB to be a child of the chip to build the DT correctly.
  */
-pnv_parent_qom_fixup(parent, OBJECT(phb), phb->phb_id);
-pnv_parent_bus_fixup(DEVICE(chip), DEVICE(phb), errp);
+if (!pnv_parent_fixup(parent, qdev_get_parent_bus(DEVICE(chip)),
+  OBJECT(phb), phb->phb_id, errp)) {
+return false;
+}
 
 return true;
 }
-- 
2.37.2




[PATCH for-7.2 0/2] ppc/pnv: fix root port QOM parenting

2022-08-18 Thread Daniel Henrique Barboza
Hi,

These are a couple of patches that got separated from the main series it
belonged to [1] that got already queued for 7.2. Patch 1 is new, patch
2 is a new version of patch 11 of [1].

The patches are based on ppc-7.2 [2].

[1] https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01847.html
[2] https://gitlab.com/danielhb/qemu/-/tree/ppc-7.2


Daniel Henrique Barboza (2):
  ppc/pnv: consolidate pnv_parent_*_fixup() helpers
  ppc/pnv: fix QOM parenting of user creatable root ports

 hw/pci-host/pnv_phb.c | 81 +--
 1 file changed, 47 insertions(+), 34 deletions(-)

-- 
2.37.2




[python-qemu-qmp MR #18] New release - v0.0.2

2022-08-18 Thread GitLab Bot
Author: John Snow - https://gitlab.com/jsnow
Merge Request: 
https://gitlab.com/qemu-project/python-qemu-qmp/-/merge_requests/18
... from: jsnow/python-qemu-qmp:new_release
... into: qemu-project/python-qemu-qmp:main

***If this MR is approved, after merge I will be tagging this commit as 
"v0.0.2", building packages, and publishing them to PyPI.***

New release; primarily for the benefit of downstream packaging. This is
a minor release that should be safe to upgrade to, unless you are
relying on string-matching repr() output for certain error classes,
which have changed slightly.

Changelog:

This release primarily fixes development tooling, documentation, and packaging 
issues that have no impact on the library itself. A handful of small, runtime 
visible changes were added as polish.

* Milestone: %"v0.0.2" 
* #28: Added manual pages and web docs for qmp-shell[-wrap]
* #27: Support building Sphinx docs from SDist files
* #26: Add coverage.py support to GitLab merge requests
* #25: qmp-shell-wrap now exits gracefully when qemu-system not found.
* #24: Minor packaging fixes.
* #10: qmp-tui exits gracefully when [tui] extras are not installed.
* #09: `__repr__` methods have been improved for all custom classes.
* #04: Mutating QMPClient.name now also changes logging messages.


Thanks!

--js

---

This is an automated message. This bot will only relay the creation of new merge
requests and will not relay review comments, new revisions, or concluded merges.
Please follow the GitLab link to participate in review.



Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

2022-08-18 Thread Peter Xu
On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> +   struct kvm_userspace_memory_region 
> *mem)
> +{
> +MemoryRegionNode *node;
> +
> +node = g_malloc(sizeof(MemoryRegionNode));
> +*node = (MemoryRegionNode) {
> +.mem = mem,
> +};

Nit: direct assignment of struct looks okay, but maybe pointer assignment
is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):

  node = g_new0(MemoryRegionNode, 1);
  node->mem = mem;

[...]

> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_list {
> + __u32 nent;
> + __u32 flags;
> + struct kvm_userspace_memory_region entries[0];
> +};
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>   struct kvm_userspace_memory_region)
>  #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> + struct kvm_userspace_memory_region_list)
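The `entries[0]` trailing array in the quoted struct is the classic variable-length ioctl argument pattern. As an illustration, building such an argument from userspace could look like the sketch below; `demo_region` and `demo_region_list` are stand-in types (C99 `entries[]` spelling of the kernel's `entries[0]`), not the real KVM structures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct kvm_userspace_memory_region (illustration only). */
struct demo_region {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
};

/* Header plus a flexible array member, mirroring the proposed layout. */
struct demo_region_list {
    uint32_t nent;
    uint32_t flags;
    struct demo_region entries[];   /* C99 spelling of entries[0] */
};

static struct demo_region_list *demo_region_list_new(uint32_t nent)
{
    /* One allocation sized for the header plus nent trailing entries. */
    struct demo_region_list *list =
        calloc(1, sizeof(*list) + nent * sizeof(list->entries[0]));

    if (list) {
        list->nent = nent;
    }
    return list;
}
```

The whole list is then handed to the ioctl in a single call, which is what makes the replace-regions operation atomic from the caller's point of view.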

I think this is probably good enough, but just to provide the other small
(but may not be important) piece of puzzle here.  I wanted to think through
to understand better but I never did..

For a quick look, please read the comment in kvm_set_phys_mem().

/*
 * NOTE: We should be aware of the fact that here we're only
 * doing a best effort to sync dirty bits.  No matter whether
 * we're using dirty log or dirty ring, we ignored two facts:
 *
 * (1) dirty bits can reside in hardware buffers (PML)
 *
 * (2) after we collected dirty bits here, pages can be dirtied
 * again before we do the final KVM_SET_USER_MEMORY_REGION to
 * remove the slot.
 *
 * Not easy.  Let's cross the fingers until it's fixed.
 */

One example is if we have 16G mem, we enable dirty tracking and we punch a
hole of 1G at offset 1G, it'll change from this:

 (a)
  |- 16G ---|

To this:

 (b)(c)  (d)
  |--1G--|XX|14G|

Here (c) will be a 1G hole.

With current code, the hole punching will del region (a) and add back
region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
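The (a) → (b)/(d) split above is plain interval arithmetic; a minimal standalone sketch (types and names invented for the demo):

```c
#include <stdint.h>

#define GiB (1ULL << 30)

typedef struct { uint64_t start, size; } DemoRegion;

/*
 * Punch [hole_start, hole_start + hole_size) out of 'slot', mirroring
 * the del-(a) / add-(b)(d) sequence described above.  Returns how many
 * sub-regions remain (0, 1 or 2), written to 'out'.
 */
static int demo_punch_hole(DemoRegion slot, uint64_t hole_start,
                           uint64_t hole_size, DemoRegion out[2])
{
    uint64_t slot_end = slot.start + slot.size;
    uint64_t hole_end = hole_start + hole_size;
    int n = 0;

    if (hole_start > slot.start) {          /* (b): part before the hole */
        out[n++] = (DemoRegion){ slot.start, hole_start - slot.start };
    }
    if (hole_end < slot_end) {              /* (d): part after the hole */
        out[n++] = (DemoRegion){ hole_end, slot_end - hole_end };
    }
    return n;
}
```

With the 16G slot and a 1G hole at offset 1G this yields exactly the 1G region (b) and the 14G region (d) from the example.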

Here the question is if we're with dirty tracking it means for each region
we have a dirty bitmap.  Currently we do the best effort of doing below
sequence:

  (1) fetching dirty bmap of (a)
  (2) delete region (a)
  (3) add region (b) (d)

Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
dirty pages written between step (1) and (2) (and actually if the write
comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
we're going to fix..).

So ideally the atomic op can be:

  "atomically fetch dirty bmap for removed regions, remove regions, and add
   new regions"

Rather than only:

  "atomically remove regions, and add new regions"

as what the new _LIST ioctl do.

But... maybe that's not a real problem, at least I didn't know any report
showing issue with current code yet caused by losing of dirty bits during
step (1) and (2).  Neither do I know how to trigger an issue with it.

I'm just trying to still provide this information so that you should be
aware of this problem too, at the meantime when proposing the new ioctl
change for qemu we should also keep in mind that we won't easily lose the
dirty bmap of (a) here, which I think this patch does the right thing.

Thanks!

--
Peter Xu




Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization

2022-08-18 Thread Stefan Hajnoczi
On Thu, Jul 14, 2022 at 12:13:53PM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > Avoid bounce buffers when QEMUIOVector elements are within previously
> > registered bdrv_register_buf() buffers.
> > 
> > The idea is that emulated storage controllers will register guest RAM
> > using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
> > requests. Therefore no blkio_map_mem_region() calls are necessary in the
> > performance-critical I/O code path.
> > 
> > This optimization doesn't apply if the I/O buffer is internally
> > allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
> > path because BDRV_REQ_REGISTERED_BUF is not set.
> 
> Which keeps the question relevant of how slow the slow path is, i.e. whether
> it wouldn’t make sense to keep some of the mem regions allocated there in a
> cache instead of allocating/freeing them on every I/O request.

Yes, bounce buffer reuse would be possible, but let's keep it simple for
now.

> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >   block/blkio.c | 104 --
> >   1 file changed, 101 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blkio.c b/block/blkio.c
> > index 7fbdbd7fae..37d593a20c 100644
> > --- a/block/blkio.c
> > +++ b/block/blkio.c
> 
> [...]
> 
> > @@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> >   BlockCompletionFunc *cb, void *opaque)
> >   {
> >   BDRVBlkioState *s = bs->opaque;
> > +bool needs_mem_regions =
> > +s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
> 
> Is that condition sufficient?  bdrv_register_buf() has no way of returning
> an error, so it’s possible that buffers are silently not registered.  (And
> there are conditions in blkio_register_buf() where the buffer will not be
> registered, e.g. because it isn’t aligned.)
> 
> The caller knows nothing of this and will still pass
> BDRV_REQ_REGISTERED_BUF, and then we’ll assume the region is mapped but it
> won’t be.
> 
> >   struct iovec *iov = qiov->iov;
> >   int iovcnt = qiov->niov;
> >   BlkioAIOCB *acb;
> 
> [...]
> 
> > @@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
> >   }
> >   }
> > +static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
> > +{
> > +BDRVBlkioState *s = bs->opaque;
> > +int ret;
> > +struct blkio_mem_region region = (struct blkio_mem_region){
> > +.addr = host,
> > +.len = size,
> > +.fd = -1,
> > +};
> > +
> > +if (((uintptr_t)host | size) % s->mem_region_alignment) {
> > +error_report_once("%s: skipping unaligned buf %p with size %zu",
> > +  __func__, host, size);
> > +return; /* skip unaligned */
> > +}
> 
> How big is mem-region-alignment generally?  Is it like 4k or is it going to
> be a real issue?

Yes, it's usually the page size of the MMU/IOMMU. vhost-user and VFIO
have the same requirements so I don't think anything special is
necessary.
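The `((uintptr_t)host | size) % alignment` test quoted above folds both checks into one expression: a bit set below the alignment boundary in either the address or the length survives the OR and makes the remainder non-zero. A standalone sketch of the same trick (function name made up):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True iff both the start address and the length are multiples of
 * 'alignment' (e.g. the 4k page size mentioned above).
 */
static bool buf_is_aligned(const void *host, size_t size, size_t alignment)
{
    return (((uintptr_t)host | size) % alignment) == 0;
}
```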

> (Also, we could probably register a truncated region.  I know, that’ll break
> the BDRV_REQ_REGISTERED_BUF idea because the caller won’t know we’ve
> truncated it, but that’s no different than just not registering the buffer
> at all.)
> 
> > +
> > +/* Attempt to find the fd for a MemoryRegion */
> > +if (s->needs_mem_region_fd) {
> > +int fd = -1;
> > +ram_addr_t offset;
> > +MemoryRegion *mr;
> > +
> > +/*
> > + * bdrv_register_buf() is called with the BQL held so mr lives at least
> > + * until this function returns.
> > + */
> > +mr = memory_region_from_host(host, &offset);
> > +if (mr) {
> > +fd = memory_region_get_fd(mr);
> > +}
> 
> I don’t think it’s specified that buffers registered with
> bdrv_register_buf() must be within a single memory region, is it? So can we
> somehow verify that the memory region covers the whole buffer?

You are right, there is no guarantee. However, the range will always be
within a RAMBlock at the moment because the bdrv_register_buf() calls
are driven by a RAMBlock notifier and match the boundaries of the
RAMBlocks.

I will add a check so this starts failing when that assumption is
violated.

> 
> > +if (fd == -1) {
> > +error_report_once("%s: skipping fd-less buf %p with size %zu",
> > +  __func__, host, size);
> > +return; /* skip if there is no fd */
> > +}
> > +
> > +region.fd = fd;
> > +region.fd_offset = offset;
> > +}
> > +
> > +WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +ret = blkio_map_mem_region(s->blkio, &region);
> > +}
> > +
> > +if (ret < 0) {
> > +error_report_once("Failed to add blkio mem region %p with size %zu: %s",
> > +  host, size, blkio_get_error_msg());
> > +}
> > +}
> > +
> > 

Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls

2022-08-18 Thread Peter Xu
On Tue, Aug 16, 2022 at 06:12:49AM -0400, Emanuele Giuseppe Esposito wrote:
> kvm listeners now need ->commit callback in order to actually send
> the ioctl to the hypervisor. Therefore, add missing callers around
> address_space_set_flatview(), which in turn calls
> address_space_update_topology_pass() which calls ->region_* and
> ->log_* callbacks.
> 
> Using MEMORY_LISTENER_CALL_GLOBAL is a little bit an overkill,
> but it is harmless, considering that other listeners that are not
> invoked in address_space_update_topology_pass() won't do anything,
> since they won't have anything to commit.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  softmmu/memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7ba2048836..1afd3f9703 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1076,7 +1079,9 @@ static void address_space_update_topology(AddressSpace *as)
>  if (!g_hash_table_lookup(flat_views, physmr)) {
>  generate_memory_topology(physmr);
>  }
> +MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
>  address_space_set_flatview(as);
> +MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);

Should the pair be with MEMORY_LISTENER_CALL() rather than the global
version?  Since it's only updating one address space.

Besides the perf implication (walking per-as list should be faster than
walking global memory listener list?), I think it feels broken too since
we'll call begin() then commit() (with no region_add()/region_del()/..) for
all the listeners that are not registered against this AS.  IIUC it will
empty all regions with those listeners?

Thanks,

-- 
Peter Xu




Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Claudio Fontana
On 8/18/22 20:49, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
>>> * Claudio Fontana (cfont...@suse.de) wrote:
 On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
> * Nikolay Borisov (nbori...@suse.com) wrote:
>> [adding Juan and David to cc as I had missed them. ]
>
> Hi Nikolay,
>
>> On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
>>> Hello,
>>>
>>> I'm currently looking into implementing a 'file:' uri for migration save
>>> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
>>> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
>>> process of brainstorming what a solution would look like, a couple of
>>> questions transpired that I think warrant wider discussion in the
>>> community.
>
> OK, so this seems to be a continuation with Claudio and Daniel and co as
> of a few months back.  I'd definitely be leaving libvirt sides of the
> question here to Dan, and so that also means definitely looking at that
> tree above.

 Hi Dave, yes, Nikolai is trying to continue on the qemu side.

 We have something working with libvirt for our short term needs which 
 offers good performance,
 but it is clear that that simple solution is barred for upstream libvirt 
 merging.


>
>>> First, implementing a solution which is self-contained within qemu would
>>> be easy enough (famous last words), but the gist is one has to only care
>>> about the format within qemu. However, I'm being told that what libvirt
>>> does is prepend its own custom header to the resulting saved file, then
>>> slipstreams the migration stream from qemu. Now with the solution that I
>>> envision I intend to keep all write-related logic inside qemu, this
>>> means there's no way to incorporate the logic of libvirt. The reason I'd
>>> like to keep the write process within qemu is to avoid an extra copy of
>>> data between the two processes (qemu outgoing migration and libvirt),
>>> with the current fd approach qemu is passed an fd, data is copied
>>> between qemu/libvirt and finally the libvirt_iohelper writes the data.
>>> So the question which remains to be answered is how would libvirt make
>>> use of this new functionality in qemu? I was thinking something along
>>> the lines of :
>>>
>>> 1. Qemu writes its migration stream to a file, ideally on a filesystem
>>> which supports reflink - xfs/btrfs
>>>
>>> 2. Libvirt writes it's header to a separate file
>>> 2.1 Reflinks the qemu's stream right after its header
>>> 2.2 Writes its trailer
>>>
>>> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
>>>
>>> I wouldn't call this solution hacky though it definitely leaves some
>>> bitter aftertaste.
>
> Wouldn't it be simpler to tell libvirt to write it's header, then tell
> qemu to append everything?

 I would think so as well. 

>
>>> Another solution would be to extend the 'fd:' protocol to allow multiple
>>> descriptors (for multifd) support to be passed in. The reason dup()
>>> can't be used is because in order for multifd to be supported it's
>>> required to be able to write to multiple, non-overlapping regions of the
>>> file. And duplicated fd's share their offsets etc. But that really seems
>>> more or less hacky. Alternatively it's possible that pwrite() are used
>>> to write to non-overlapping regions in the file. Any feedback is
>>> welcomed.
>
> I do like the idea of letting fd: take multiple fd's.

 Fine in my view, I think we will still need then a helper process in 
 libvirt to merge the data into a single file, no?
 In case the libvirt multifd to single file multithreaded helper I proposed 
 before is helpful as a reference you could reuse/modify those patches.
>>>
>>> Eww that's messy isn't it.
>>> (You don't fancy a huge sparse file do you?)
>>
>> Wait am I missing something obvious here?
>>
>> Maybe we don't need any libvirt extra process.
>>
>> why don't we open the _single_ file multiple times from libvirt,
>>
>> Lets say the "main channel" fd is opened, we write the libvirt header,
>> then reopen again the same file multiple times,
>> and finally pass all fds to qemu, one fd for each parallel transfer channel 
>> we want to use
>> (so we solve all the permissions, security labels issues etc).
>>
>> And then from QEMU we can write to those fds at the right offsets for each 
>> separate channel,
>> which is easier from QEMU because we can know exactly how much data we need 
>> to transfer before starting the migration,
>> so we have even less need for "holes", possibly only minor ones for single 
>> byte adjustments
>> for uneven division of the interleaved file.
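A minimal sketch of the scheme proposed above: the same file opened once per channel, each channel pwrite()-ing only its own fixed, non-overlapping slice, so write order between channels is irrelevant. The path, slice size, and function name are invented for the demo.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SLICE 8   /* bytes per channel slice; header assumed 0 for brevity */

static int write_channels(const char *path, const char *chan0,
                          const char *chan1)
{
    /* Two independent opens of the same file, one per channel. */
    int fd0 = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    int fd1 = open(path, O_WRONLY);

    if (fd0 < 0 || fd1 < 0) {
        return -1;
    }
    /* Each fd writes at its channel's fixed offset; order is irrelevant. */
    (void)!pwrite(fd1, chan1, strlen(chan1), 1 * SLICE);
    (void)!pwrite(fd0, chan0, strlen(chan0), 0 * SLICE);
    close(fd0);
    close(fd1);
    return 0;
}
```

In a real layout the per-channel offset would be `header_len + channel * slice_len`, computable up front since the amount of data to transfer is known before the migration starts.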
>>
>> What is wrong with this one, or 

Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Dr. David Alan Gilbert
* Claudio Fontana (cfont...@suse.de) wrote:
> On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
> > * Claudio Fontana (cfont...@suse.de) wrote:
> >> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
> >>> * Nikolay Borisov (nbori...@suse.com) wrote:
>  [adding Juan and David to cc as I had missed them. ]
> >>>
> >>> Hi Nikolay,
> >>>
>  On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
> > Hello,
> >
> > I'm currently looking into implementing a 'file:' uri for migration save
> > in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
> > the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
> > process of brainstorming how a solution would like the a couple of
> > questions transpired that I think warrant wider discussion in the
> > community.
> >>>
> >>> OK, so this seems to be a continuation with Claudio and Daniel and co as
> >>> of a few months back.  I'd definitely be leaving libvirt sides of the
> >>> question here to Dan, and so that also means definitely looking at that
> >>> tree above.
> >>
> >> Hi Dave, yes, Nikolai is trying to continue on the qemu side.
> >>
> >> We have something working with libvirt for our short term needs which 
> >> offers good performance,
> >> but it is clear that that simple solution is barred for upstream libvirt 
> >> merging.
> >>
> >>
> >>>
> > First, implementing a solution which is self-contained within qemu would
> > be easy enough (famous last words), but the gist is one has to only care
> > about the format within qemu. However, I'm being told that what libvirt
> > does is prepend its own custom header to the resulting saved file, then
> > slipstreams the migration stream from qemu. Now with the solution that I
> > envision I intend to keep all write-related logic inside qemu, this
> > means there's no way to incorporate the logic of libvirt. The reason I'd
> > like to keep the write process within qemu is to avoid an extra copy of
> > data between the two processes (qemu outging migration and libvirt),
> > with the current fd approach qemu is passed an fd, data is copied
> > between qemu/libvirt and finally the libvirt_iohelper writes the data.
> > So the question which remains to be answered is how would libvirt make
> > use of this new functionality in qemu? I was thinking something along
> > the lines of :
> >
> > 1. Qemu writes its migration stream to a file, ideally on a filesystem
> > which supports reflink - xfs/btrfs
> >
> > 2. Libvirt writes it's header to a separate file
> > 2.1 Reflinks the qemu's stream right after its header
> > 2.2 Writes its trailer
> >
> > 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
> >
> > I wouldn't call this solution hacky though it definitely leaves some
> > bitter aftertaste.
> >>>
> >>> Wouldn't it be simpler to tell libvirt to write it's header, then tell
> >>> qemu to append everything?
> >>
> >> I would think so as well. 
> >>
> >>>
> > Another solution would be to extend the 'fd:' protocol to allow multiple
> > descriptors (for multifd) support to be passed in. The reason dup()
> > can't be used is because in order for multifd to be supported it's
> > required to be able to write to multiple, non-overlapping regions of the
> > file. And duplicated fd's share their offsets etc. But that really seems
> > more or less hacky. Alternatively it's possible that pwrite() are used
> > to write to non-overlapping regions in the file. Any feedback is
> > welcomed.
> >>>
> >>> I do like the idea of letting fd: take multiple fd's.
> >>
> >> Fine in my view, I think we will still need then a helper process in 
> >> libvirt to merge the data into a single file, no?
> >> In case the libvirt multifd to single file multithreaded helper I proposed 
> >> before is helpful as a reference you could reuse/modify those patches.
> > 
> > Eww that's messy isn't it.
> > (You don't fancy a huge sparse file do you?)
> 
> Wait am I missing something obvious here?
> 
> Maybe we don't need any libvirt extra process.
> 
> why don't we open the _single_ file multiple times from libvirt,
> 
> Lets say the "main channel" fd is opened, we write the libvirt header,
> then reopen again the same file multiple times,
> and finally pass all fds to qemu, one fd for each parallel transfer channel 
> we want to use
> (so we solve all the permissions, security labels issues etc).
> 
> And then from QEMU we can write to those fds at the right offsets for each 
> separate channel,
> which is easier from QEMU because we can know exactly how much data we need 
> to transfer before starting the migration,
> so we have even less need for "holes", possibly only minor ones for single 
> byte adjustments
> for uneven division of the interleaved file.
> 
> What is wrong with this one, or does anyone see some other better approach?

You'd have to know 

Re: [RFC PATCH] pnv/chiptod: Add basic P9 chiptod model

2022-08-18 Thread Daniel Henrique Barboza




On 8/11/22 13:40, Nicholas Piggin wrote:

The chiptod is a pervasive facility which can keep a time, synchronise
it across multiple chips, and can move that time to or from the core
timebase units.

This adds a very basic initial emulation of chiptod registers. The
interesting thing about chiptod is that it targets cores and interacts
with core registers (e.g., TB, TFMR). So far there is no actual time
keeping or TB interaction, but core targeting and initial TFMR is
implemented.

This implements enough for skiboot to boot and go through the chiptod
code (with a small patch to remove QUIRK_NO_CHIPTOD from qemu).

POWER10 is much the same, not implemented yet because skiboot uses a
different core target addressing mode (due to hardware issues) that is
not implemented yet.

This is not completely tidy yet, just thought I would see if there are
comments, particularly with the core TFMR interactions.

Thanks,
Nick



Apart from Cedric's comment about sending a separate TFMR patch, I have a few
other small comments:
other small comments:




---
  hw/ppc/meson.build   |   1 +
  hw/ppc/pnv.c |   9 +
  hw/ppc/pnv_chiptod.c | 320 +++
  hw/ppc/pnv_xscom.c   |   2 +
  hw/ppc/trace-events  |   4 +
  include/hw/ppc/pnv.h |   2 +
  include/hw/ppc/pnv_chiptod.h |  51 ++
  include/hw/ppc/pnv_xscom.h   |   6 +
  target/ppc/cpu.h |  13 ++
  target/ppc/cpu_init.c|   2 +-
  target/ppc/helper.h  |   2 +
  target/ppc/misc_helper.c |  25 +++
  target/ppc/spr_common.h  |   2 +
  target/ppc/translate.c   |  10 ++
  14 files changed, 448 insertions(+), 1 deletion(-)
  create mode 100644 hw/ppc/pnv_chiptod.c
  create mode 100644 include/hw/ppc/pnv_chiptod.h

diff --git a/hw/ppc/meson.build b/hw/ppc/meson.build
index 62801923f3..7eb5031055 100644
--- a/hw/ppc/meson.build
+++ b/hw/ppc/meson.build
@@ -45,6 +45,7 @@ ppc_ss.add(when: 'CONFIG_POWERNV', if_true: files(
'pnv_core.c',
'pnv_lpc.c',
'pnv_psi.c',
+  'pnv_chiptod.c',
'pnv_occ.c',
'pnv_sbe.c',
'pnv_bmc.c',
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 7ff1f464d3..bdd641381c 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1395,6 +1395,8 @@ static void pnv_chip_power9_instance_init(Object *obj)
  
object_initialize_child(obj, "lpc", &chip9->lpc, TYPE_PNV9_LPC);
  
+object_initialize_child(obj, "chiptod", &chip9->chiptod, TYPE_PNV9_CHIPTOD);

+
object_initialize_child(obj, "occ", &chip9->occ, TYPE_PNV9_OCC);
  
object_initialize_child(obj, "sbe", &chip9->sbe, TYPE_PNV9_SBE);

@@ -1539,6 +1541,13 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
  chip->dt_isa_nodename = g_strdup_printf("/lpcm-opb@%" PRIx64 "/lpc@0",
  (uint64_t) PNV9_LPCM_BASE(chip));
  
+/* ChipTOD */
+if (!qdev_realize(DEVICE(&chip->chiptod), NULL, errp)) {
+return;
+}
+pnv_xscom_add_subregion(chip, PNV9_XSCOM_CHIPTOD_BASE,
+&chip->chiptod.xscom_regs);
+
  /* Create the simplified OCC model */
  if (!qdev_realize(DEVICE(&chip->occ), NULL, errp)) {
  return;
diff --git a/hw/ppc/pnv_chiptod.c b/hw/ppc/pnv_chiptod.c
new file mode 100644
index 00..9ef463e640
--- /dev/null
+++ b/hw/ppc/pnv_chiptod.c
@@ -0,0 +1,320 @@
+/*
+ * QEMU PowerPC PowerNV Emulation of some CHIPTOD behaviour
+ *
+ * Copyright (c) 2022, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "target/ppc/cpu.h"
+#include "qapi/error.h"
+#include "qemu/log.h"
+#include "qemu/module.h"
+#include "hw/irq.h"
+#include "hw/qdev-properties.h"
+#include "hw/ppc/fdt.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_chiptod.h"
+#include "trace.h"
+
+#include <libfdt.h>
+
+/* TOD chip XSCOM addresses */
+#define TOD_MASTER_PATH_CTRL0x0000 /* Master Path ctrl reg */
+#define TOD_PRI_PORT0_CTRL  0x0001 /* Primary port0 ctrl reg */
+#define TOD_PRI_PORT1_CTRL  0x0002 /* Primary port1 ctrl reg */
+#define TOD_SEC_PORT0_CTRL  0x0003 /* Secondary p0 ctrl reg */
+#define TOD_SEC_PORT1_CTRL  0x0004 /* Secondary p1 ctrl reg */
+#define TOD_SLAVE_PATH_CTRL 0x0005 /* Slave Path ctrl reg */
+#define TOD_INTERNAL_PATH_CTRL  0x0006 /* Internal Path ctrl reg */
+
+/* -- TOD primary/secondary master/slave 

Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Claudio Fontana
On 8/18/22 20:09, Claudio Fontana wrote:
> On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
>> * Claudio Fontana (cfont...@suse.de) wrote:
>>> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
 * Nikolay Borisov (nbori...@suse.com) wrote:
> [adding Juan and David to cc as I had missed them. ]

 Hi Nikolay,

> On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
>> Hello,
>>
>> I'm currently looking into implementing a 'file:' uri for migration save
>> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
>> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
>> process of brainstorming how a solution would look like, a couple of
>> questions transpired that I think warrant wider discussion in the
>> community.

 OK, so this seems to be a continuation with Claudio and Daniel and co as
 of a few months back.  I'd definitely be leaving libvirt sides of the
 question here to Dan, and so that also means definitely looking at that
 tree above.
>>>
>>> Hi Dave, yes, Nikolay is trying to continue on the qemu side.
>>>
>>> We have something working with libvirt for our short term needs which 
>>> offers good performance,
>>> but it is clear that that simple solution is barred for upstream libvirt 
>>> merging.
>>>
>>>

>> First, implementing a solution which is self-contained within qemu would
>> be easy enough( famous last words) but the gist is one  has to only care
>> about the format within qemu. However, I'm being told that what libvirt
>> does is prepend its own custom header to the resulting saved file, then
>> slipstreams the migration stream from qemu. Now with the solution that I
>> envision I intend to keep all write-related logic inside qemu, this
>> means there's no way to incorporate the logic of libvirt. The reason I'd
>> like to keep the write process within qemu is to avoid an extra copy of
>> data between the two processes (qemu outgoing migration and libvirt),
>> with the current fd approach qemu is passed an fd, data is copied
>> between qemu/libvirt and finally the libvirt_iohelper writes the data.
>> So the question which remains to be answered is how would libvirt make
>> use of this new functionality in qemu? I was thinking something along
>> the lines of :
>>
>> 1. Qemu writes its migration stream to a file, ideally on a filesystem
>> which supports reflink - xfs/btrfs
>>
>> 2. Libvirt writes its header to a separate file
>> 2.1 Reflinks the qemu's stream right after its header
>> 2.2 Writes its trailer
>>
>> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
>>
>> I wouldn't call this solution hacky though it definitely leaves some
>> bitter aftertaste.

 Wouldn't it be simpler to tell libvirt to write its header, then tell
 qemu to append everything?
>>>
>>> I would think so as well. 
>>>

>> Another solution would be to extend the 'fd:' protocol to allow multiple
>> descriptors (for multifd) support to be passed in. The reason dup()
>> can't be used is because in order for multifd to be supported it's
>> required to be able to write to multiple, non-overlapping regions of the
>> file. And duplicated fd's share their offsets etc. But that really seems
>> more or less hacky. Alternatively it's possible that pwrite() are used
>> to write to non-overlapping regions in the file. Any feedback is
>> welcomed.

 I do like the idea of letting fd: take multiple fd's.
>>>
>>> Fine in my view, I think we will still need then a helper process in 
>>> libvirt to merge the data into a single file, no?
>>> In case the libvirt multifd to single file multithreaded helper I proposed 
>>> before is helpful as a reference you could reuse/modify those patches.
>>
>> Eww that's messy isn't it.
>> (You don't fancy a huge sparse file do you?)
> 
> Wait am I missing something obvious here?
> 
> Maybe we don't need any libvirt extra process.
> 
> why don't we open the _single_ file multiple times from libvirt,
> 
> Lets say the "main channel" fd is opened, we write the libvirt header,
> then reopen again the same file multiple times,
> and finally pass all fds to qemu, one fd for each parallel transfer channel 
> we want to use
> (so we solve all the permissions, security labels issues etc).
> 
> And then from QEMU we can write to those fds at the right offsets for each 
> separate channel,
> which is easier from QEMU because we can know exactly how much data we need 
> to transfer before starting the migration,
> so we have even less need for "holes", possibly only minor ones for single 
> byte adjustments
> for uneven division of the interleaved file.

Or even better, not pass multiple fds, but just _one_ fd,
and then from qemu write using multiple threads and pread / pwrite , so we 
don't have the additional complication of managing a 

[PATCH] virtiofsd: use g_date_time_get_microsecond to get subsecond

2022-08-18 Thread Yusuke Okada
From: Yusuke Okada 

The "%f" specifier in g_date_time_format() is only available in glib
2.65.2 or later. If combined with an older glib, the function returns NULL
and the timestamp is displayed as "(null)".

For backward compatibility, g_date_time_get_microsecond() should be used
to retrieve the subsecond part.

In this patch, g_date_time_format() leaves the subsecond field as "%06d"
and lets a following snprintf() format it with g_date_time_get_microsecond().

Signed-off-by: Yusuke Okada 
---
 tools/virtiofsd/passthrough_ll.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 371a7bead6..20f0f41f99 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -4185,6 +4185,7 @@ static void setup_nofile_rlimit(unsigned long rlimit_nofile)
 static void log_func(enum fuse_log_level level, const char *fmt, va_list ap)
 {
 g_autofree char *localfmt = NULL;
+char buf[64];
 
 if (current_log_level < level) {
 return;
@@ -4197,9 +4198,11 @@ static void log_func(enum fuse_log_level level, const char *fmt, va_list ap)
fmt);
 } else {
 g_autoptr(GDateTime) now = g_date_time_new_now_utc();
-g_autofree char *nowstr = g_date_time_format(now, "%Y-%m-%d %H:%M:%S.%f%z");
+g_autofree char *nowstr = g_date_time_format(now,
+   "%Y-%m-%d %H:%M:%S.%%06d%z");
+snprintf(buf, 64, nowstr, g_date_time_get_microsecond(now));
 localfmt = g_strdup_printf("[%s] [ID: %08ld] %s",
-   nowstr, syscall(__NR_gettid), fmt);
+   buf, syscall(__NR_gettid), fmt);
 }
 fmt = localfmt;
 }
-- 
2.31.1




[PATCH v2 2/2] tests/tcg/ppc64le: Added an underflow with UE=1 test

2022-08-18 Thread Lucas Mateus Castro(alqotel)
Added a test to see if the adjustment is being made correctly when an
underflow occurs and UE is set.
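[Editor's note on what "the adjustment" refers to — my reading of the
Power ISA underflow-exception semantics, not something stated in the
patch: with UE=1 the trap is taken and the target register receives the
true result scaled back into representable range by the bias adjustment,]

```latex
% Power ISA: enabled underflow (UE=1) delivers a scaled result
% (double precision; single precision uses 192 instead of 1536)
\mathrm{FRT} = \text{true result} \times 2^{+1536}
% enabled overflow (OE=1) correspondingly scales by 2^{-1536}
```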

Signed-off-by: Lucas Mateus Castro (alqotel) 
---
This patch will also fail without the underflow with UE set bugfix
Message-Id:<20220805141522.412864-3-lucas.ara...@eldorado.org.br>
---
 tests/tcg/ppc64/Makefile.target   |  1 +
 tests/tcg/ppc64le/Makefile.target |  1 +
 tests/tcg/ppc64le/ue_excp.c   | 53 +++
 3 files changed, 55 insertions(+)
 create mode 100644 tests/tcg/ppc64le/ue_excp.c

diff --git a/tests/tcg/ppc64/Makefile.target b/tests/tcg/ppc64/Makefile.target
index 43958ad87b..583677031b 100644
--- a/tests/tcg/ppc64/Makefile.target
+++ b/tests/tcg/ppc64/Makefile.target
@@ -30,5 +30,6 @@ run-plugin-sha512-vector-with-%: QEMU_OPTS+=-cpu POWER10
 PPC64_TESTS += signal_save_restore_xer
 PPC64_TESTS += xxspltw
 PPC64_TESTS += oe_excp
+PPC64_TESTS += ue_excp
 
 TESTS += $(PPC64_TESTS)
diff --git a/tests/tcg/ppc64le/Makefile.target b/tests/tcg/ppc64le/Makefile.target
index 8d11ac731d..b9e689c582 100644
--- a/tests/tcg/ppc64le/Makefile.target
+++ b/tests/tcg/ppc64le/Makefile.target
@@ -28,5 +28,6 @@ PPC64LE_TESTS += mffsce
 PPC64LE_TESTS += signal_save_restore_xer
 PPC64LE_TESTS += xxspltw
 PPC64LE_TESTS += oe_excp
+PPC64LE_TESTS += ue_excp
 
 TESTS += $(PPC64LE_TESTS)
diff --git a/tests/tcg/ppc64le/ue_excp.c b/tests/tcg/ppc64le/ue_excp.c
new file mode 100644
index 00..028ef3bbc7
--- /dev/null
+++ b/tests/tcg/ppc64le/ue_excp.c
@@ -0,0 +1,53 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <signal.h>
+#include <ucontext.h>
+#include <sys/prctl.h>
+
+#define FP_UE (1ull << 5)
+#define MTFSF(FLM, FRB) asm volatile ("mtfsf %0, %1" :: "i" (FLM), "f" (FRB))
+
+void sigfpe_handler(int sig, siginfo_t *si, void *ucontext)
+{
+union {
+uint64_t ll;
+double dp;
+} r;
+uint64_t ch = 0x1b64f1c1b000ull;
+r.dp = ((ucontext_t *)ucontext)->uc_mcontext.fp_regs[2];
+if (r.ll == ch) {
+exit(0);
+}
+fprintf(stderr, "expected result: %lx\n result: %lx\n", ch, r.ll);
+exit(1);
+}
+
+int main()
+{
+uint64_t fpscr;
+uint64_t a = 0x5ca8ull;
+uint64_t b = 0x1cefull;
+
+struct sigaction sa = {
+.sa_sigaction = sigfpe_handler,
+.sa_flags = SA_SIGINFO
+};
+
+prctl(PR_SET_FPEXC, PR_FP_EXC_PRECISE);
+sigaction(SIGFPE, &sa, NULL);
+
+fpscr = FP_UE;
+MTFSF(0b11111111, fpscr);
+
+asm (
+"lfd 0, %0\n\t"
+"lfd 1, %1\n\t"
+"fmul 2, 0, 1\n\t"
+:
+: "m"(a), "m"(b)
+: "memory", "fr0", "fr1", "fr2"
+);
+
+abort();
+}
-- 
2.25.1




[PATCH v2 1/2] tests/tcg/ppc64le: Added an overflow with OE=1 test

2022-08-18 Thread Lucas Mateus Castro(alqotel)
Added a test to see if the adjustment is being made correctly when an
overflow occurs and OE is set.

Signed-off-by: Lucas Mateus Castro (alqotel) 
---
The prctl patch is not ready yet, so this patch does as Richard
Henderson suggested and check the fp register in the signal handler

This patch will fail without the overflow with OE set bugfix
Message-Id:<20220805141522.412864-3-lucas.ara...@eldorado.org.br>
---
 tests/tcg/ppc64/Makefile.target   |  1 +
 tests/tcg/ppc64le/Makefile.target |  1 +
 tests/tcg/ppc64le/oe_excp.c   | 53 +++
 3 files changed, 55 insertions(+)
 create mode 100644 tests/tcg/ppc64le/oe_excp.c

diff --git a/tests/tcg/ppc64/Makefile.target b/tests/tcg/ppc64/Makefile.target
index 331fae628e..43958ad87b 100644
--- a/tests/tcg/ppc64/Makefile.target
+++ b/tests/tcg/ppc64/Makefile.target
@@ -29,5 +29,6 @@ run-plugin-sha512-vector-with-%: QEMU_OPTS+=-cpu POWER10
 
 PPC64_TESTS += signal_save_restore_xer
 PPC64_TESTS += xxspltw
+PPC64_TESTS += oe_excp
 
 TESTS += $(PPC64_TESTS)
diff --git a/tests/tcg/ppc64le/Makefile.target b/tests/tcg/ppc64le/Makefile.target
index 6ca3003f02..8d11ac731d 100644
--- a/tests/tcg/ppc64le/Makefile.target
+++ b/tests/tcg/ppc64le/Makefile.target
@@ -27,5 +27,6 @@ PPC64LE_TESTS += mtfsf
 PPC64LE_TESTS += mffsce
 PPC64LE_TESTS += signal_save_restore_xer
 PPC64LE_TESTS += xxspltw
+PPC64LE_TESTS += oe_excp
 
 TESTS += $(PPC64LE_TESTS)
diff --git a/tests/tcg/ppc64le/oe_excp.c b/tests/tcg/ppc64le/oe_excp.c
new file mode 100644
index 00..c8f07d80b6
--- /dev/null
+++ b/tests/tcg/ppc64le/oe_excp.c
@@ -0,0 +1,53 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <signal.h>
+#include <ucontext.h>
+#include <sys/prctl.h>
+
+#define FP_OE (1ull << 6)
+#define MTFSF(FLM, FRB) asm volatile ("mtfsf %0, %1" :: "i" (FLM), "f" (FRB))
+
+void sigfpe_handler(int sig, siginfo_t *si, void *ucontext)
+{
+union {
+uint64_t ll;
+double dp;
+} r;
+uint64_t ch = 0x5fcfffe4965a17e0ull;
+r.dp = ((ucontext_t *)ucontext)->uc_mcontext.fp_regs[2];
+if (r.ll == ch) {
+exit(0);
+}
+fprintf(stderr, "expected result: %lx\n result: %lx\n", ch, r.ll);
+exit(1);
+}
+
+int main()
+{
+uint64_t fpscr;
+uint64_t a = 0x7fdfffe816d77b00ull;
+uint64_t b = 0x7fdfffFC7F7FFF00ull;
+
+struct sigaction sa = {
+.sa_sigaction = sigfpe_handler,
+.sa_flags = SA_SIGINFO
+};
+
+prctl(PR_SET_FPEXC, PR_FP_EXC_PRECISE);
+sigaction(SIGFPE, &sa, NULL);
+
+fpscr = FP_OE;
+MTFSF(0b11111111, fpscr);
+
+asm (
+"lfd 0, %0\n\t"
+"lfd 1, %1\n\t"
+"fmul 2, 0, 1\n\t"
+:
+: "m"(a), "m"(b)
+: "memory", "fr0", "fr1", "fr2"
+);
+
+abort();
+}
-- 
2.25.1




Re: [PATCH v5 0/4] linux-user: Fix siginfo_t contents when jumping to non-readable pages

2022-08-18 Thread Richard Henderson

On 8/18/22 09:55, Vivian Wang wrote:

On 8/17/22 23:05, Ilya Leoshkevich wrote:

Hi,

I noticed that when we get a SEGV due to jumping to non-readable
memory, sometimes si_addr and program counter in siginfo_t are slightly
off. I tracked this down to the assumption that translators stop before
the end of a page, while in reality they may stop right after it.


Hi,

Could this be related to issue 1155 [1]? On RISC-V, I'm getting incorrect 
[m|s]tval/[m|s]epc combinations for page faults in system emulation and incorrect si_addr 
and program counter on SIGSEGV in user emulation. Since it seems to only affect 
instructions that cross page boundaries, and RISC-V also has variable length instructions, 
it seems that I've run into the same problem as what is fixed here.


It seems likely, and the code at the end of riscv_tr_translate_insn is wrong.


Could this fix be extended to targets/riscv?


I'll write up something.


r~



Re: [PATCH v5 0/4] linux-user: Fix siginfo_t contents when jumping to non-readable pages

2022-08-18 Thread Ilya Leoshkevich
On Fri, 2022-08-19 at 00:55 +0800, Vivian Wang wrote:
> Hi,
> Could this be related to issue 1155 [1]? On RISC-V, I'm getting
> incorrect [m|s]tval/[m|s]epc combinations for page faults in system
> emulation and incorrect si_addr and program counter on SIGSEGV in
> user emulation. Since it seems to only affect instructions that cross
> page boundaries, and RISC-V also has variable length instructions, it
> seems that I've run into the same problem as what is fixed here.
> Could this fix be extended to targets/riscv?
> dram
> [1]: https://gitlab.com/qemu-project/qemu/-/issues/1155

Yes, this looks quite similar.
I'm not too familiar with riscv, but I just googled [1].
If the following is correct:

---
However, the instruction set reserves enough opcode space to make it
possible to differentiate between 16-bit, 32-bit, 48-bit, and 64-bit
instructions.  Instructions that start with binary 11 (in the lowest
bit positions of the instruction) are 32-bit sized instructions (but one
pattern is reserved, so they cannot start with 11111).  The compact
instructions use 00, 01, and 10 in that same position.  48-bit
instructions use the starting sequence 011111, and 64-bit instructions
start with 0111111.
---

then we can fix this the same way s390x is being fixed here.

[1]
https://stackoverflow.com/questions/56874101/how-does-risc-v-variable-length-of-instruction-work-in-detail




Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Claudio Fontana
On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
>>> * Nikolay Borisov (nbori...@suse.com) wrote:
 [adding Juan and David to cc as I had missed them. ]
>>>
>>> Hi Nikolay,
>>>
 On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
> Hello,
>
> I'm currently looking into implementing a 'file:' uri for migration save
> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
> process of brainstorming how a solution would look like, a couple of
> questions transpired that I think warrant wider discussion in the
> community.
>>>
>>> OK, so this seems to be a continuation with Claudio and Daniel and co as
>>> of a few months back.  I'd definitely be leaving libvirt sides of the
>>> question here to Dan, and so that also means definitely looking at that
>>> tree above.
>>
>> Hi Dave, yes, Nikolay is trying to continue on the qemu side.
>>
>> We have something working with libvirt for our short term needs which offers 
>> good performance,
>> but it is clear that that simple solution is barred for upstream libvirt 
>> merging.
>>
>>
>>>
> First, implementing a solution which is self-contained within qemu would
> be easy enough( famous last words) but the gist is one  has to only care
> about the format within qemu. However, I'm being told that what libvirt
> does is prepend its own custom header to the resulting saved file, then
> slipstreams the migration stream from qemu. Now with the solution that I
> envision I intend to keep all write-related logic inside qemu, this
> means there's no way to incorporate the logic of libvirt. The reason I'd
> like to keep the write process within qemu is to avoid an extra copy of
> data between the two processes (qemu outgoing migration and libvirt),
> with the current fd approach qemu is passed an fd, data is copied
> between qemu/libvirt and finally the libvirt_iohelper writes the data.
> So the question which remains to be answered is how would libvirt make
> use of this new functionality in qemu? I was thinking something along
> the lines of :
>
> 1. Qemu writes its migration stream to a file, ideally on a filesystem
> which supports reflink - xfs/btrfs
>
> 2. Libvirt writes its header to a separate file
> 2.1 Reflinks the qemu's stream right after its header
> 2.2 Writes its trailer
>
> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
>
> I wouldn't call this solution hacky though it definitely leaves some
> bitter aftertaste.
>>>
>>> Wouldn't it be simpler to tell libvirt to write its header, then tell
>>> qemu to append everything?
>>
>> I would think so as well. 
>>
>>>
> Another solution would be to extend the 'fd:' protocol to allow multiple
> descriptors (for multifd) support to be passed in. The reason dup()
> can't be used is because in order for multifd to be supported it's
> required to be able to write to multiple, non-overlapping regions of the
> file. And duplicated fd's share their offsets etc. But that really seems
> more or less hacky. Alternatively it's possible that pwrite() are used
> to write to non-overlapping regions in the file. Any feedback is
> welcomed.
>>>
>>> I do like the idea of letting fd: take multiple fd's.
>>
>> Fine in my view, I think we will still need then a helper process in libvirt 
>> to merge the data into a single file, no?
>> In case the libvirt multifd to single file multithreaded helper I proposed 
>> before is helpful as a reference you could reuse/modify those patches.
> 
> Eww that's messy isn't it.
> (You don't fancy a huge sparse file do you?)
> 
>> Maybe this new way will be acceptable to libvirt,
>> ie avoiding the multifd code -> socket, but still merging the data from the 
>> multiple fds into a single file?
> 
> It feels to me like the problem here is really what we want is something
> closer to a dump than the migration code; you don't need all that
> overhead of the code to deal with live migration bitmaps and dirty pages

well yes you are right, we don't care about live migration bitmaps and dirty 
pages,
but we don't incur in any of that anyway since (at least for what I have in 
mind, virsh save and restore),
the VM is stopped.

> that aren't going to happen.
> Something that just does a nice single write(2) (for each memory
> region);
> and then ties the device state on.

ultimately yes, it's the same thing though, whether we trigger it via migrate 
fd: or via another non-migration-related mechanism,
any approach would work.

Ciao,

C

> 
> Dave
> 
>>>
>>> Dave
>>>
>>
>> Thanks for your comments,
>>
>> Claudio
>
>
> Regards,
> Nikolay

>>




Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Claudio Fontana
On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
>>> * Nikolay Borisov (nbori...@suse.com) wrote:
 [adding Juan and David to cc as I had missed them. ]
>>>
>>> Hi Nikolay,
>>>
 On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
> Hello,
>
> I'm currently looking into implementing a 'file:' uri for migration save
> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
> process of brainstorming how a solution would look like, a couple of
> questions transpired that I think warrant wider discussion in the
> community.
>>>
>>> OK, so this seems to be a continuation with Claudio and Daniel and co as
>>> of a few months back.  I'd definitely be leaving libvirt sides of the
>>> question here to Dan, and so that also means definitely looking at that
>>> tree above.
>>
>> Hi Dave, yes, Nikolay is trying to continue on the qemu side.
>>
>> We have something working with libvirt for our short term needs which offers 
>> good performance,
>> but it is clear that that simple solution is barred for upstream libvirt 
>> merging.
>>
>>
>>>
> First, implementing a solution which is self-contained within qemu would
> be easy enough( famous last words) but the gist is one  has to only care
> about the format within qemu. However, I'm being told that what libvirt
> does is prepend its own custom header to the resulting saved file, then
> slipstreams the migration stream from qemu. Now with the solution that I
> envision I intend to keep all write-related logic inside qemu, this
> means there's no way to incorporate the logic of libvirt. The reason I'd
> like to keep the write process within qemu is to avoid an extra copy of
> data between the two processes (qemu outgoing migration and libvirt),
> with the current fd approach qemu is passed an fd, data is copied
> between qemu/libvirt and finally the libvirt_iohelper writes the data.
> So the question which remains to be answered is how would libvirt make
> use of this new functionality in qemu? I was thinking something along
> the lines of :
>
> 1. Qemu writes its migration stream to a file, ideally on a filesystem
> which supports reflink - xfs/btrfs
>
> 2. Libvirt writes its header to a separate file
> 2.1 Reflinks the qemu's stream right after its header
> 2.2 Writes its trailer
>
> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
>
> I wouldn't call this solution hacky though it definitely leaves some
> bitter aftertaste.
>>>
>>> Wouldn't it be simpler to tell libvirt to write its header, then tell
>>> qemu to append everything?
>>
>> I would think so as well. 
>>
>>>
> Another solution would be to extend the 'fd:' protocol to allow multiple
> descriptors (for multifd) support to be passed in. The reason dup()
> can't be used is because in order for multifd to be supported it's
> required to be able to write to multiple, non-overlapping regions of the
> file. And duplicated fd's share their offsets etc. But that really seems
> more or less hacky. Alternatively it's possible that pwrite() are used
> to write to non-overlapping regions in the file. Any feedback is
> welcomed.
>>>
>>> I do like the idea of letting fd: take multiple fd's.
>>
>> Fine in my view, I think we will still need then a helper process in libvirt 
>> to merge the data into a single file, no?
>> In case the libvirt multifd to single file multithreaded helper I proposed 
>> before is helpful as a reference you could reuse/modify those patches.
> 
> Eww that's messy isn't it.
> (You don't fancy a huge sparse file do you?)

Wait am I missing something obvious here?

Maybe we don't need any libvirt extra process.

why don't we open the _single_ file multiple times from libvirt,

Lets say the "main channel" fd is opened, we write the libvirt header,
then reopen again the same file multiple times,
and finally pass all fds to qemu, one fd for each parallel transfer channel we 
want to use
(so we solve all the permissions, security labels issues etc).

And then from QEMU we can write to those fds at the right offsets for each 
separate channel,
which is easier from QEMU because we can know exactly how much data we need to 
transfer before starting the migration,
so we have even less need for "holes", possibly only minor ones for single byte 
adjustments
for uneven division of the interleaved file.

What is wrong with this one, or does anyone see some other better approach?

Thanks,

C

> 
>> Maybe this new way will be acceptable to libvirt,
>> ie avoiding the multifd code -> socket, but still merging the data from the 
>> multiple fds into a single file?
> 
> It feels to me like the problem here is really what we want is something
> closer to 

Re: [PULL 0/3] Fixes for QEMU 7.1-rc4

2022-08-18 Thread Richard Henderson

On 8/17/22 23:56, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau 

The following changes since commit c7208a6e0d049f9e8af15df908168a79b1f99685:

   Update version for v7.1.0-rc3 release (2022-08-16 20:45:19 -0500)

are available in the Git repository at:

   g...@gitlab.com:marcandre.lureau/qemu.git tags/fixes-pull-request

for you to fetch changes up to 88738ea40bee4c2cf9aae05edd2ec87e0cbeaf36:

   ui/console: fix qemu_console_resize() regression (2022-08-18 10:46:55 +0400)


Some fixes pending on the ML:
* console regression fix
* dbus-vmstate error handling fix
* a build-sys fix


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~






Marc-André Lureau (2):
   build-sys: disable vhost-user-gpu if !opengl
   ui/console: fix qemu_console_resize() regression

Priyankar Jain (1):
   dbus-vmstate: Restrict error checks to registered proxies in
 dbus_get_proxies

  meson.build |  2 +-
  backends/dbus-vmstate.c | 13 +
  ui/console.c|  6 --
  3 files changed, 14 insertions(+), 7 deletions(-)






Re: [PATCH v2 00/31] QOMify PPC4xx devices and minor clean ups

2022-08-18 Thread Daniel Henrique Barboza




On 8/18/22 10:17, Cédric Le Goater wrote:

Daniel,

On 8/17/22 17:08, BALATON Zoltan wrote:

Hello,

This is based on gitlab.com/danielhb/qemu/tree/ppc-7.2

This series contains the rest of Cédric's QOM'ify patches, modified
according to my review comments, and some other clean-ups I've noticed
along the way.


I think patches 01-24 are good for merge.


Queued in gitlab.com/danielhb/qemu/tree/ppc-7.2 (with the v3 of patch 21).


Daniel




v2 now also includes the sdram changes after some clean up to simplify
it. This should now be the same state as Cédric's series. I shall
continue with the ppc440_sdram DDR2 controller model used by the
sam460ex but that needs a bit more chnages. But it is independent of
this series so this can be merged now and I can follow up later in a
separate series.


I will take a look at the SDRAM changes later.

Thanks,

C.




Regards,
BALATON Zoltan

BALATON Zoltan (31):
   ppc/ppc4xx: Introduce a DCR device model
   ppc/ppc405: QOM'ify CPC
   ppc/ppc405: QOM'ify GPT
   ppc/ppc405: QOM'ify OCM
   ppc/ppc405: QOM'ify GPIO
   ppc/ppc405: QOM'ify DMA
   ppc/ppc405: QOM'ify EBC
   ppc/ppc405: QOM'ify OPBA
   ppc/ppc405: QOM'ify POB
   ppc/ppc405: QOM'ify PLB
   ppc/ppc405: QOM'ify MAL
   ppc4xx: Move PLB model to ppc4xx_devs.c
   ppc4xx: Rename ppc405-plb to ppc4xx-plb
   ppc4xx: Move EBC model to ppc4xx_devs.c
   ppc4xx: Rename ppc405-ebc to ppc4xx-ebc
   ppc/ppc405: Use an embedded PPCUIC model in SoC state
   hw/intc/ppc-uic: Convert ppc-uic to a PPC4xx DCR device
   ppc/ppc405: Use an explicit I2C object
   ppc/ppc405: QOM'ify FPGA
   ppc405: Move machine specific code to ppc405_boards.c
   hw/ppc/Kconfig: Remove PPC405 dependency from sam460ex
   hw/ppc/Kconfig: Move imply before select
   ppc/ppc4xx: Fix sdram trace events
   ppc4xx: Fix code style problems reported by checkpatch
   ppc440_bamboo: Remove unnecessary memsets
   ppc4xx: Introduce Ppc4xxSdramBank struct
   ppc4xx_sdram: Get rid of the init RAM hack
   ppc4xx: Use Ppc4xxSdramBank in ppc4xx_sdram_banks()
   ppc440_bamboo: Add missing 4 MiB valid memory size
   ppc4xx_sdram: Move size check to ppc4xx_sdram_init()
   ppc4xx_sdram: QOM'ify

  hw/intc/ppc-uic.c |   26 +-
  hw/ppc/Kconfig    |    3 +-
  hw/ppc/ppc405.h   |  190 +--
  hw/ppc/ppc405_boards.c    |  384 -
  hw/ppc/ppc405_uc.c    | 1078 -
  hw/ppc/ppc440.h   |    5 +-
  hw/ppc/ppc440_bamboo.c    |   63 ++-
  hw/ppc/ppc440_uc.c    |   57 +-
  hw/ppc/ppc4xx_devs.c  |  670 +--
  hw/ppc/ppc4xx_pci.c   |   31 +-
  hw/ppc/sam460ex.c |   52 +-
  hw/ppc/trace-events   |    3 -
  hw/ppc/virtex_ml507.c |    7 +-
  include/hw/intc/ppc-uic.h |    6 +-
  include/hw/ppc/ppc4xx.h   |  118 +++-
  15 files changed, 1477 insertions(+), 1216 deletions(-)







[PATCH] kvm: fix segfault with query-stats-schemas and -M none

2022-08-18 Thread Paolo Bonzini
-M none creates a guest without a vCPU, causing the following error:

$ ./qemu-system-x86_64 -qmp stdio -M none -accel kvm
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "query-stats-schemas"}
Segmentation fault (core dumped)

Fix it by not querying the vCPU stats if first_cpu is NULL.

Signed-off-by: Paolo Bonzini 
---
 accel/kvm/kvm-all.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 645f0a249a..8d81ab74de 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -4131,7 +4131,9 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
 query_stats_schema(result, STATS_TARGET_VM, stats_fd, errp);
 close(stats_fd);
 
-    stats_args.result.schema = result;
-    stats_args.errp = errp;
-    run_on_cpu(first_cpu, query_stats_schema_vcpu, RUN_ON_CPU_HOST_PTR(&stats_args));
+    if (first_cpu) {
+        stats_args.result.schema = result;
+        stats_args.errp = errp;
+        run_on_cpu(first_cpu, query_stats_schema_vcpu, RUN_ON_CPU_HOST_PTR(&stats_args));
+    }
 }
-- 
2.37.1




Re: [PATCH] tests/qtest/migration-test: Only wait for serial output where migration succeeds

2022-08-18 Thread Dr. David Alan Gilbert
* Thomas Huth (th...@redhat.com) wrote:
> Waiting for the serial output can take a couple of seconds - and since
> we're doing a lot of migration tests, this time easily sums up to
> multiple minutes. But if a test is supposed to fail, it does not make
> much sense to wait for the source to be in the right state first, so
> we can skip the waiting here. This way we can speed up all tests where
> the migration is supposed to fail. In the gitlab-CI gprof-gcov test,
> each of the migration-tests now runs two minutes faster!
> 
> Signed-off-by: Thomas Huth 
> ---
>  tests/qtest/migration-test.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 520a5f917c..7be321b62d 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -1307,7 +1307,9 @@ static void test_precopy_common(MigrateCommon *args)
>  }
>  
>  /* Wait for the first serial output from the source */
> -wait_for_serial("src_serial");
> +if (args->result == MIG_TEST_SUCCEED) {
> +wait_for_serial("src_serial");
> +}

I think this is OK, albeit only because all of the current fail-tests
are ones where the connection fails; we're not relying on the behaviour
of the emulator at all.  I wonder if it's worth going further and
running the source qemu's with -S (which may or not fail in other ways).

Reviewed-by: Dr. David Alan Gilbert 

>  
>  if (!args->connect_uri) {
>  g_autofree char *local_connect_uri =
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v5 0/4] linux-user: Fix siginfo_t contents when jumping to non-readable pages

2022-08-18 Thread Vivian Wang
On 8/17/22 23:05, Ilya Leoshkevich wrote:
> Hi,
>
> I noticed that when we get a SEGV due to jumping to non-readable
> memory, sometimes si_addr and program counter in siginfo_t are slightly
> off. I tracked this down to the assumption that translators stop before
> the end of a page, while in reality they may stop right after it.

Hi,

Could this be related to issue 1155 [1]? On RISC-V, I'm getting
incorrect [m|s]tval/[m|s]epc combinations for page faults in system
emulation and incorrect si_addr and program counter on SIGSEGV in user
emulation. Since it seems to only affect instructions that cross page
boundaries, and RISC-V also has variable length instructions, it seems
that I've run into the same problem as what is fixed here.

Could this fix be extended to target/riscv?

dram

[1]: https://gitlab.com/qemu-project/qemu/-/issues/1155

> Patch 1 fixes an invalidation issue, which may prevent SEGV from
> happening altogether.
> Patches 2-3 fix the main issue on x86_64 and s390x. Many other
> architectures have fixed-size instructions and are not affected.
> Patch 4 adds tests.
>
> Note: this series depends on [1].
>
> Best regards,
> Ilya
>
> v1: https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg00822.html
> v1 -> v2: Fix individual translators instead of translator_loop
>   (Peter).
>
> v2: https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01079.html
> v2 -> v3: Peek at the next instruction on s390x (Richard).
>   Undo more on i386 (Richard).
>   Check PAGE_EXEC, not PAGE_READ (Peter, Richard).
>
> v3: https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01306.html
> v3 -> v4: Improve the commit message in patch 1 to better reflect what
>   exactly is being fixed there.
>   Factor out the is_same_page() patch (Richard).
>   Do not touch the common code in the i386 fix (Richard).
>
> v4: https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01747.html
> v4 -> v5: Drop patch 2.
>   Use a different fix for the invalidation issue based on
>   discussion with Richard [2].
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg02472.html
> [2] https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg02556.html
>
> Ilya Leoshkevich (4):
>   linux-user: Clear tb_jmp_cache on mprotect()
>   target/s390x: Make translator stop before the end of a page
>   target/i386: Make translator stop before the end of a page
>   tests/tcg: Test siginfo_t contents when jumping to non-readable pages
>
>  linux-user/mmap.c|  14 +++
>  target/i386/tcg/translate.c  |  25 +-
>  target/s390x/tcg/translate.c |  15 +++-
>  tests/tcg/multiarch/noexec.h | 114 
>  tests/tcg/s390x/Makefile.target  |   1 +
>  tests/tcg/s390x/noexec.c | 145 +++
>  tests/tcg/x86_64/Makefile.target |   3 +-
>  tests/tcg/x86_64/noexec.c| 116 +
>  8 files changed, 427 insertions(+), 6 deletions(-)
>  create mode 100644 tests/tcg/multiarch/noexec.h
>  create mode 100644 tests/tcg/s390x/noexec.c
>  create mode 100644 tests/tcg/x86_64/noexec.c
>

Re: [PATCH 4/8] migration: Implement dirty-limit convergence algo

2022-08-18 Thread Hyman




On 2022/8/18 6:09, Peter Xu wrote:

On Sat, Jul 23, 2022 at 03:49:16PM +0800, huang...@chinatelecom.cn wrote:

From: Hyman Huang(黄勇) 

Implement dirty-limit convergence algo for live migration,
which is kind of like auto-converge algo but using dirty-limit
instead of cpu throttle to make migration convergent.

Signed-off-by: Hyman Huang(黄勇) 
---
  migration/ram.c| 53 +-
  migration/trace-events |  1 +
  2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index b94669b..2a5cd23 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -45,6 +45,7 @@
  #include "qapi/error.h"
  #include "qapi/qapi-types-migration.h"
  #include "qapi/qapi-events-migration.h"
+#include "qapi/qapi-commands-migration.h"
  #include "qapi/qmp/qerror.h"
  #include "trace.h"
  #include "exec/ram_addr.h"
@@ -57,6 +58,8 @@
  #include "qemu/iov.h"
  #include "multifd.h"
  #include "sysemu/runstate.h"
+#include "sysemu/dirtylimit.h"
+#include "sysemu/kvm.h"
  
  #include "hw/boards.h" /* for machine_dump_guest_core() */
  
@@ -1139,6 +1142,21 @@ static void migration_update_rates(RAMState *rs, int64_t end_time)

  }
  }
  
+/*
+ * Enable dirty-limit to throttle down the guest
+ */
+static void migration_dirty_limit_guest(void)
+{
+if (!dirtylimit_in_service()) {
+MigrationState *s = migrate_get_current();
+int64_t quota_dirtyrate = s->parameters.vcpu_dirty_limit;
+
+/* Set quota dirtyrate if dirty limit not in service */
+qmp_set_vcpu_dirty_limit(false, -1, quota_dirtyrate, NULL);
+trace_migration_dirty_limit_guest(quota_dirtyrate);
+}
+}


What if migration is cancelled?  Do we have logic to stop the dirty limit,
or should we?

Yes, we should have logic to stop dirty limit, i'll add that.
Thanks for your suggestion. :)

Yong






[qemu-web PATCH] Add signing pubkey for python-qemu-qmp package

2022-08-18 Thread John Snow
Add the pubkey currently used for signing PyPI releases of qemu.qmp to a
stable location where it can be referenced by e.g. Fedora RPM specfiles.

At present, the key happens to just simply be my own -- but future
releases may be signed by a different key. In that case, we can
increment '1.txt' to '2.txt' and so on. The old keys should be left in
place.

The format for the keyfile was chosen by copying what OpenStack was
doing:
https://releases.openstack.org/_static/0x2426b928085a020d8a90d0d879ab7008d0896c8a.txt

Generated with:
> gpg --with-fingerprint --list-keys js...@redhat.com > pubkey
> gpg --armor --export js...@redhat.com >> pubkey

Signed-off-by: John Snow 
---
 assets/keys/python-qemu-qmp.1.txt | 288 ++
 1 file changed, 288 insertions(+)
 create mode 100644 assets/keys/python-qemu-qmp.1.txt

diff --git a/assets/keys/python-qemu-qmp.1.txt b/assets/keys/python-qemu-qmp.1.txt
new file mode 100644
index 000..54edbbd
--- /dev/null
+++ b/assets/keys/python-qemu-qmp.1.txt
@@ -0,0 +1,288 @@
+pub   rsa4096 2015-01-29 [SC] [expires: 2023-05-28]
+  FAEB 9711 A12C F475 812F  18F2 88A9 064D 1835 61EB
+uid   [ultimate] John Snow (John Huston) 
+sub   rsa4096 2015-01-29 [E] [expires: 2023-05-28]
+sub   rsa4096 2015-01-29 [S] [expires: 2023-05-28]
+
+-----BEGIN PGP PUBLIC KEY BLOCK-----
+
+mQINBFTKefwBEAChvwqYC6saTzawbih87LqBYq0d5A8jXYXaiFMV/EvMSDqqY4EY
+6whXliNOIYzhgrPEe7ZmPxbCSe4iMykjhwMh5byIHDoPGDU+FsQty2KXuoxto+Zd
+rP9gymAgmyqdk3aVvzzmCa3cOppcqKvA0Kqr10UeX/z4OMVV390V+DVWUvzXpda4
+5/Sxup57pk+hyY52wxxjIqefrj8u5BN93s5uCVTus0oiVA6W+iXYzTvVDStMFVqn
+TxSxlpZoH5RGKvmoWV3uutByQyBPHW2U1Y6n6iEZ9MlP3hcDqlo0S8jeP03HaD4g
+OqCuqLceWF5+2WyHzNfylpNMFVi+Hp0H/nSDtCvQua7j+6Pt7q5rvqgHvRipkDDV
+sjqwasuNc3wyoHexrBeLU/iJBuDld5iLy+dHXoYMB3HmjMxj3K5/8XhGrDx6BDFe
+O3HIpi3u2z1jniB7RtyVEtdupED6lqsDj0oSz9NxaOFZrS3Jf6z/kHIfh42mM9Sx
+7+s4c07N2LieUxcfqhFTaa/voRibF4cmkBVUhOD1AKXNfhEsTvmcz9NbUchCkcvA
+T9119CrsxfVsE7bXiGvdXnzyGLXdsoosjzwacKdOrVaDmN3Uy+SHiQXo6TlkSdV0
+XH2PUxTMLsBFIO9qXO43Ai6J6iPAP/01l8fuZfpJE0/L/c25yyaND7xA3wARAQAB
+tCpKb2huIFNub3cgKEpvaG4gSHVzdG9uKSA8anNub3dAcmVkaGF0LmNvbT6JAj0E
+EwECACcCGwMCHgECF4AFCwkIBwMFFQoJCAsFFgIDAQAFAlTKigkFCQPCdwsACgkQ
+iKkGTRg1Yet1Pw/+KGEA0n30z1oSgFLPs2XyVvpeH8bpanTVufOHjwlcaBgmUEk8
+KnPRd7oL8y4cq9KjmJwip2hH2vjeBR1HtxmEx06GvGBA9X/YDMaihmJmIHSlxJfl
+YpaK52R1bJYWBTNyK7X5VCU+nQdhdz80X10MLQcdwX13HkP8DfxnbTSj1oSgoOwZ
+zb4ni9xOmwHOpdKUSCm6hJUlgsIHWB193CVpV9CHoU8ovUoGIDEt8l17tPtf/QcP
+wdW65Bfqq0k1WeVBjdq7birH216rcdP6FkEwyJcFBJWUk4U44iZPKJMiqhAysujH
++JCwOk3n4+/SUQd4uO8gdnkfTIqGu6wwOUq63B0B0qm50OOZ6Ir2tyQ44ae5X0PG
+13wJqvmWi9umlK1qiXDACCJX0xW6hRvLAnHYnGllidfZSopkFvxUvs+CpCwJYZuH
+DLbfUQnl/eF8oYR5QjQRxrFOr2l7TJVgxTEJQRuyWDFtJE4c1krB3IQPDA4f5jpM
+FagWp6J+oIzdLhMabxFlSTpDnrbkZxy1qra0FW1oWBoV83/nR+8rXY1q94/9+4ib
+cBKDdIYrQX22CCU3MRlksQVGPk7swNdlaucRuED6Ow5rQU/0GDWEkNsWrtb/EQ/e
+ZH4RcLifgEKfFvWhuxP3za/kWu0cmFtyhcxAMsJUolh4FzQf+LMJ8Y1/LimJARsE
+EAECAAYFAlXWjaEACgkQUhGOPAsp2mvr5gf49Dxc3tJ96er+pH/EoBZ4b+Q+0kWX
+NA2FQY8fDeNvHlvB7pn4mZ2wnFAhc94dbmFWe+Zd067tSC66wQboInaANSpt6PYC
+CazbxGtqxOimSpoPi1awQDk0rCJ3UBYnIPhiJUP52mH0hhgwo6Y9pWMCpNwyuVng
+XaZLWxnN0sL+k00DKpEnPJDDzux7B9dllIk1x91ux7rNWfM+EbUS/iLWUM5KxC/k
+9WTPC+38K46Erzhdd+ZwVH5/d+jXxQXYxPgDTTjmsq5Bq1gwzUlzZuKVt39G9rQW
+m3GJsWbCCtjJSQvYmHglm2t1A1A9aXiG982fsBQ3JZo1/w/8GJhU94MbiQEcBBAB
+AgAGBQJU045eAAoJEJykq7OBq3PInwgIAJ2VQOIdDZ0q6OohWchGZ5qdjk8f25wy
+kreyv7t+nZ/fWr3K4GvdRo9YboBPYe/A44oPBBc9E2JUp4nwlNVzqyuJDcS2T7cU
+lGcRcHdPg7mdq78V1HxRcgMXti8+dht/eReBnuc7Y0Whrst4336u8MoVcIuaix2X
+jMOt/qvZ4MYL7f9OXjT0I0k/FUXpThT/Lb5Yn60ZdeDvfTuSOtV5OIaevy9QgW3g
+tvRo5GHw0Mrn/IFY9ZFH9B9jqVqhm5om1l/9rcaZGGF4gsZ+Lnm6AKP04jGM3t5v
+PoeWYkG8k2Dt7KdpqgheK75U9NTR3E4PpHNSJ5vBnZEae05Prh+vTTqJARwEEgEI
+AAYFAlXbbx4ACgkQp6FrSiUnQ2ou7wf/TpH0oSP3KSn2bGN7+6fqq/OLQ1QXsrAn
+JUzDzf+/JdqvhoGRnFkWH4+6aSqp+tNSnmfqNFl4mSkFFVTCLc4Jg989zrGpGgzx
+G6qb3Dpx3zURGXW26x8b156dxUcCB339Uz0SiocDtwq/w54NQgZXWxob7XJIzx5z
+74biFVwncKGn4v9kr7CryI9bgf3BhSEmCzWCBvUYgeTGSV9qQZyQ01QFP84aS/2I
+I1lnsN0b1NrlBLhhmq8A/TLmGJhh8AbWc+6OC3ImWB/xSnLFebXvNGQfXiTOB7n8
+/p2n8yOaWMV7O/wn5s6tgpmbmArC3tcrRYoIq/2HyAwtFYfnRtO4HYkBHAQTAQgA
+BgUCVdtZyQAKCRD0B9sAYdXPQFQFCACNbrL26QDM2GkMlCXtC7MVyf6tRxF3diXv
+cnWil8BtP3b+Iqv35Udqx8Y49PLRDy7j2ATFDdIn3Pl/fu1mSmbai6hD6P07dwLc
+jzF8nimh/vTOFN1FgOAX3hTlmIyAn5eCW4nKshxsjaX5SwI7BKMELZ77Y1E823//
+yCvtSH1Nwq6sPTfhUiFlrLPJltCg0T7teg3nUsDaVE8FTuQXN/0HwGtpcGHjz3k0
+/vZH1vZd8W2vIzVaAnIUxU4H4myFT7V9vBBn1xlsLxmQALb8HGMQsuTP3zdTmReY
+WKgb3rtGAic18GtSIoqAoRLKqUZKh9AbT4AOHYT1OmFJYRzyix7NiQHwBBABAgAG
+BQJV7DxBAAoJEH4VEAzNNmmxTFkOoMTA5Af0IWHUg4EKbFHVO6gULGkAKw1ZjnIj
+UtfE+JZ8/bs6sAXCS7gGMa8llf6TprUefmkDsp8t6HjKyw6n7TBwfY7RSz9eqGFW
+3/DmAn0iai9c29gfBZXlzsVyrpgy/RcHYkDTW/e1rP3CKz7W2FMTlZGcHx3DFXJY
+fQDVRyt2lF1qUWrgByYXhcSVbva65M8gvUJRt2D6ODfxTU8+HgcA1XsbkPw6Yptu
+XbxRbGCv16hQsaN8dz8FuFCF5qxpeVje9w9N8vHxgEFjxyCGPvX9lHY48w9cOVtv
+6n+H5TFhkl56sWU2Zb0k2QnCwzRT68+o8RQZPDLD4fK7GEhGO9hdQtaV+S3DxoDz
+btruRrgSMpqnYMI2m0yn3bUBUd3qE2ZIiziFj+yuhod0f39802G8NmVRO1MBJBdI

Re: [PATCH 1/8] qapi/migration: Introduce x-vcpu-dirty-limit-period parameter

2022-08-18 Thread Hyman




On 2022/8/18 6:06, Peter Xu wrote:

On Sat, Jul 23, 2022 at 03:49:13PM +0800, huang...@chinatelecom.cn wrote:

From: Hyman Huang(黄勇) 

Introduce "x-vcpu-dirty-limit-period" migration experimental
parameter, which is used to make dirtyrate calculation period
configurable.

Signed-off-by: Hyman Huang(黄勇) 
---
  migration/migration.c | 16 
  monitor/hmp-cmds.c|  8 
  qapi/migration.json   | 31 ---
  3 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e03f698..7b19f85 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -116,6 +116,8 @@
  #define DEFAULT_MIGRATE_ANNOUNCE_ROUNDS5
  #define DEFAULT_MIGRATE_ANNOUNCE_STEP100
  
+#define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT_PERIOD 500 /* ms */


Why 500 but not DIRTYLIMIT_CALC_TIME_MS?
This is an empirical value, actually; the iteration time of migration is
normally less than 1000ms. In my tests it varies from 200ms to 500ms. If
we assume the iteration time is 500ms and the calculation period is
1000ms, then 2 iterations pass for every dirty page rate calculation. We
want the calculation period as close to the iteration time as possible,
so that for each iteration a fresh dirty page rate is calculated and
compared, making the dirty limit work more precisely.


But as the "x-" prefix implies, I'm a little unsure whether the solution
works.


Is it intended to make this parameter experimental, but the other one not?
Since I'm not very sure whether vcpu-dirty-limit-period has an impact on
migration (as described above), it is made experimental. As to
vcpu-dirty-limit, it indeed has an impact on migration in theory, so it
is not made experimental. But from another point of view, both
parameters are being introduced for the first time and neither has seen
a lot of testing, so it would also be reasonable to make both
experimental; I don't insist on that.


Yong


Thanks,





Re: [PATCH] target/arm: Add cortex-a35

2022-08-18 Thread Hao Wu
Hi,

This is used by a new series of Nuvoton SoC (NPCM8XX) which contains 4
Cortex A-35 cores.

I'll update the missing fields in a follow-up patch set.

On Thu, Aug 18, 2022 at 7:59 AM Peter Maydell 
wrote:

> On Mon, 15 Aug 2022 at 22:35, Hao Wu  wrote:
> >
> > Add cortex A35 core and enable it for virt board.
> >
> > Signed-off-by: Hao Wu 
> > Reviewed-by: Joe Komlodi 
>
> > +static void aarch64_a35_initfn(Object *obj)
> > +{
> > +ARMCPU *cpu = ARM_CPU(obj);
> > +
> > +cpu->dtb_compatible = "arm,cortex-a35";
> > +set_feature(&cpu->env, ARM_FEATURE_V8);
> > +set_feature(&cpu->env, ARM_FEATURE_NEON);
> > +set_feature(&cpu->env, ARM_FEATURE_GENERIC_TIMER);
> > +set_feature(&cpu->env, ARM_FEATURE_AARCH64);
> > +set_feature(&cpu->env, ARM_FEATURE_CBAR_RO);
> > +set_feature(&cpu->env, ARM_FEATURE_EL2);
> > +set_feature(&cpu->env, ARM_FEATURE_EL3);
> > +set_feature(&cpu->env, ARM_FEATURE_PMU);
> > +
> > +/* From B2.2 AArch64 identification registers. */
> > +cpu->midr = 0x410fd042;
>
> The r1p0 TRM is out, so we might as well emulate that: 0x411FD040
>
> A few fields are missing:
>
>  cpu->isar.dbgdidr
>  cpu->isar.dbgdevid
>  cpu->isar.dbgdevid1
>  cpu->isar.reset_pmcr_el0
>  cpu->gic_pribits
>
> (these probably landed after you wrote these patch).
>
> Otherwise looks OK.
>
> Remind me, what did you want the Cortex-A35 in particular for ?
>
> thanks
> -- PMM
>


Re: [BUG] cxl can not create region

2022-08-18 Thread Jonathan Cameron via
On Wed, 17 Aug 2022 17:16:19 +0100
Jonathan Cameron  wrote:

> On Thu, 11 Aug 2022 17:46:55 -0700
> Dan Williams  wrote:
> 
> > Dan Williams wrote:  
> > > Bobo WL wrote:
> > > > Hi Dan,
> > > > 
> > > > Thanks for your reply!
> > > > 
> > > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams  
> > > > wrote:
> > > > >
> > > > > What is the output of:
> > > > >
> > > > > cxl list -MDTu -d decoder0.0
> > > > >
> > > > > ...? It might be the case that mem1 cannot be mapped by decoder0.0, or
> > > > > at least not in the specified order, or that validation check is 
> > > > > broken.
> > > > 
> > > > Command "cxl list -MDTu -d decoder0.0" output:
> > > 
> > > Thanks for this, I think I know the problem, but will try some
> > > experiments with cxl_test first.
> > 
> > Hmm, so my cxl_test experiment unfortunately passed so I'm not
> > reproducing the failure mode. This is the result of creating x4 region
> > with devices directly attached to a single host-bridge:
> > 
> > # cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s 
> > $((1<<30))
> > {
> >   "region":"region8",
> >   "resource":"0xf1f000",
> >   "size":"1024.00 MiB (1073.74 MB)",
> >   "interleave_ways":4,
> >   "interleave_granularity":256,
> >   "decode_state":"commit",
> >   "mappings":[
> > {
> >   "position":3,
> >   "memdev":"mem11",
> >   "decoder":"decoder21.0"
> > },
> > {
> >   "position":2,
> >   "memdev":"mem9",
> >   "decoder":"decoder19.0"
> > },
> > {
> >   "position":1,
> >   "memdev":"mem10",
> >   "decoder":"decoder20.0"
> > },
> > {
> >   "position":0,
> >   "memdev":"mem12",
> >   "decoder":"decoder22.0"
> > }
> >   ]
> > }
> > cxl region: cmd_create_region: created 1 region
> >   
> > > Did the commit_store() crash stop reproducing with latest cxl/preview
> > > branch?
> > 
> > I missed the answer to this question.
> > 
> > All of these changes are now in Linus' tree perhaps give that a try and
> > post the debug log again?  
> 
> Hi Dan,
> 
> I've moved onto looking at this one.
> 1 HB, 2RP (to make it configure the HDM decoder in the QEMU HB, I'll tidy
> that up at some stage), 1 switch, 4 downstream switch ports each with a type 3
> 
> I'm not getting a crash, but can't successfully setup a region.
> Upon adding the final target
> It's failing in check_last_peer() as pos < distance.
> Seems distance is 4, which makes me think it's using the wrong level of the
> hierarchy for some reason, or that the distance check is wrong.
> Wasn't a good idea to just skip that step though as it goes boom - though
> stack trace is not useful.

Turns out really weird corruption happens if you accidentally back two
type3 devices with the same memory device. Who would have thought it :)

That aside, ignoring the check_last_peer() failure seems to make everything
work for this topology.  I'm not seeing the crash, so my guess is we fixed
it somewhere along the way.

Now for the fun one.  I've replicated the crash if we have

1HB 1*RP 1SW, 4SW-DSP, 4Type3

Now, I'd expect to see it not 'work' because the QEMU HDM decoder won't be
programmed, but the null pointer dereference isn't related to that.

The bug is straightforward.  Not all decoders have commit callbacks... Will
send out a possible fix shortly.

Jonathan



> 
> Jonathan
> 
> 
> 
> 
> 
> 




Re: towards a workable O_DIRECT outmigration to a file

2022-08-18 Thread Dr. David Alan Gilbert
* Claudio Fontana (cfont...@suse.de) wrote:
> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
> > * Nikolay Borisov (nbori...@suse.com) wrote:
> >> [adding Juan and David to cc as I had missed them. ]
> > 
> > Hi Nikolay,
> > 
> >> On 11.08.22 г. 16:47 ч., Nikolay Borisov wrote:
> >>> Hello,
> >>>
> >>> I'm currently looking into implementing a 'file:' uri for migration save
> >>> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
> >>> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
> >>> process of brainstorming what a solution would look like, a couple of
> >>> questions transpired that I think warrant wider discussion in the
> >>> community.
> > 
> > OK, so this seems to be a continuation with Claudio and Daniel and co as
> > of a few months back.  I'd definitely be leaving libvirt sides of the
> > question here to Dan, and so that also means definitely looking at that
> > tree above.
> 
> Hi Dave, yes, Nikolai is trying to continue on the qemu side.
> 
> We have something working with libvirt for our short term needs which offers 
> good performance,
> but it is clear that that simple solution is barred for upstream libvirt 
> merging.
> 
> 
> > 
> >>> First, implementing a solution which is self-contained within qemu would
> >>> be easy enough (famous last words), but the gist is one only has to care
> >>> about the format within qemu. However, I'm being told that what libvirt
> >>> does is prepend its own custom header to the resulting saved file, then
> >>> slipstreams the migration stream from qemu. Now with the solution that I
> >>> envision I intend to keep all write-related logic inside qemu, this
> >>> means there's no way to incorporate the logic of libvirt. The reason I'd
> >>> like to keep the write process within qemu is to avoid an extra copy of
> >>> data between the two processes (qemu outgoing migration and libvirt),
> >>> with the current fd approach qemu is passed an fd, data is copied
> >>> between qemu/libvirt and finally the libvirt_iohelper writes the data.
> >>> So the question which remains to be answered is how would libvirt make
> >>> use of this new functionality in qemu? I was thinking something along
> >>> the lines of :
> >>>
> >>> 1. Qemu writes its migration stream to a file, ideally on a filesystem
> >>> which supports reflink - xfs/btrfs
> >>>
> >>> 2. Libvirt writes it's header to a separate file
> >>> 2.1 Reflinks the qemu's stream right after its header
> >>> 2.2 Writes its trailer
> >>>
> >>> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
> >>>
> >>> I wouldn't call this solution hacky though it definitely leaves some
> >>> bitter aftertaste.
> > 
> > Wouldn't it be simpler to tell libvirt to write its header, then tell
> > qemu to append everything?
> 
> I would think so as well. 
> 
> > 
> >>> Another solution would be to extend the 'fd:' protocol to allow multiple
> >>> descriptors (for multifd) support to be passed in. The reason dup()
> >>> can't be used is because in order for multifd to be supported it's
> >>> required to be able to write to multiple, non-overlapping regions of the
> >>> file. And duplicated fd's share their offsets etc. But that really seems
> >>> more or less hacky. Alternatively it's possible that pwrite() are used
> >>> to write to non-overlapping regions in the file. Any feedback is
> >>> welcomed.
> > 
> > I do like the idea of letting fd: take multiple fd's.
> 
> Fine in my view; I think we will then still need a helper process in libvirt
> to merge the data into a single file, no?
> In case the libvirt multifd to single file multithreaded helper I proposed 
> before is helpful as a reference you could reuse/modify those patches.

Eww that's messy isn't it.
(You don't fancy a huge sparse file do you?)

> Maybe this new way will be acceptable to libvirt,
> ie avoiding the multifd code -> socket, but still merging the data from the 
> multiple fds into a single file?

It feels to me like the problem here is really what we want is something
closer to a dump than the migration code; you don't need all that
overhead of the code to deal with live migration bitmaps and dirty pages
that aren't going to happen.
Something that just does a nice single write(2) (for each memory
region);
and then ties the device state on.

Dave

> > 
> > Dave
> > 
> 
> Thanks for your comments,
> 
> Claudio
> >>>
> >>>
> >>> Regards,
> >>> Nikolay
> >>
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




[PATCH] tests/qtest/migration-test: Only wait for serial output where migration succeeds

2022-08-18 Thread Thomas Huth
Waiting for the serial output can take a couple of seconds - and since
we're doing a lot of migration tests, this time easily sums up to
multiple minutes. But if a test is supposed to fail, it does not make
much sense to wait for the source to be in the right state first, so
we can skip the waiting here. This way we can speed up all tests where
the migration is supposed to fail. In the gitlab-CI gprof-gcov test,
each of the migration-tests now runs two minutes faster!

Signed-off-by: Thomas Huth 
---
 tests/qtest/migration-test.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 520a5f917c..7be321b62d 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1307,7 +1307,9 @@ static void test_precopy_common(MigrateCommon *args)
 }
 
 /* Wait for the first serial output from the source */
-wait_for_serial("src_serial");
+if (args->result == MIG_TEST_SUCCEED) {
+wait_for_serial("src_serial");
+}
 
 if (!args->connect_uri) {
 g_autofree char *local_connect_uri =
-- 
2.31.1




Re: Using Unicamp's Minicloud for the QEMU CI

2022-08-18 Thread Peter Maydell
On Thu, 18 Aug 2022 at 17:11, Lucas Mateus Martins Araujo e Castro
 wrote:
> Lucas wrote:
> >> I would like to gauge the interest in using Minicloud's infrastructure[1]
>> for the CI, talking with some people from there they are interested.
>> It has both ppc64 and pp64le images, multiple versions of 4 distros
>> (Ubuntu, Fedora, Debian and CentOS).

> ping
>
> Any interest in this?

PPC host is something we're currently missing in our testing, so definitely
yes in principle. I don't know what the specifics of getting new runners
set up is, though. Alex ?

thanks
-- PMM



Re: [PATCH 2/8] qapi/migration: Introduce vcpu-dirty-limit parameters

2022-08-18 Thread Hyman




在 2022/8/18 6:07, Peter Xu 写道:

On Sat, Jul 23, 2022 at 03:49:14PM +0800, huang...@chinatelecom.cn wrote:

From: Hyman Huang(黄勇) 

Introduce "vcpu-dirty-limit" migration parameter used
to limit dirty page rate during live migration.

"vcpu-dirty-limit" and "x-vcpu-dirty-limit-period" are
two dirty-limit-related migration parameters, which can
be set before and during live migration by qmp
migrate-set-parameters.

This two parameters are used to help implement the dirty
page rate limit algo of migration.

Signed-off-by: Hyman Huang(黄勇) 
---
  migration/migration.c | 14 ++
  monitor/hmp-cmds.c|  8 
  qapi/migration.json   | 18 +++---
  3 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 7b19f85..ed1a47b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -117,6 +117,7 @@
  #define DEFAULT_MIGRATE_ANNOUNCE_STEP100
  
  #define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT_PERIOD 500 /* ms */

+#define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT1   /* MB/s */


This default value also looks a bit weird.. why 1MB/s?  Thanks,
Indeed, it seems kind of weird. The reason to set the default dirty
limit to 1MB/s is that we want to keep the dirty limit working until the
vcpu dirty page rate drops to 1MB/s, once the dirtylimit capability is
enabled during migration. In this way, migration has the largest chance
of converging before the vcpu dirty page rate drops to 1MB/s. If we set
the default dirty limit greater than 1MB/s, the probability of a
successful migration may be reduced, and the default behavior of
migration is to try its best to succeed.






Re: [PATCH 1/2] tests/tcg/ppc64le: Added an overflow with OE=1 test

2022-08-18 Thread Lucas Mateus Martins Araujo e Castro


On 18/08/2022 12:32, Richard Henderson wrote:

On 8/17/22 09:57, Lucas Mateus Castro(alqotel) wrote:

+void sigfpe_handler(int sig, siginfo_t *si, void *ucontext)
+{
+    uint64_t t;
+    uint64_t ch = 0x5fcfffe4965a17e0ull;
+    asm (
+    "stfd 2, %0\n\t"
+    : "=m"(t)
+    :
+    : "memory", "fr2"
+    );


No, you need to fetch f2 from ucontext.  There's no guarantee of any
specific values being present in the signal handler otherwise.
Yeah, for some reason I completely forgot about this, my bad. I'll send 
a second version fixing this



+    return -1;


exit(-1), which return from main equates to, helpful over EXIT_FAILURE.
But here I'd tend to abort(), since it really shouldn't be reachable.

Good point, I'll change in v2



r~

--
Lucas Mateus M. Araujo e Castro
Instituto de Pesquisas ELDORADO 


Departamento Computação Embarcada
Analista de Software Trainee
Aviso Legal - Disclaimer 


Re: Using Unicamp's Minicloud for the QEMU CI

2022-08-18 Thread Lucas Mateus Martins Araujo e Castro

ping

Any interest in this?

On 12/07/2022 11:51, Lucas Mateus Martins Araujo e Castro wrote:


Hi everyone!

I would like to gauge the interest in using Minicloud's infrastructure[1]
for the CI, talking with some people from there they are interested. 
It has both ppc64 and pp64le images, multiple versions of 4 distros 
(Ubuntu, Fedora, Debian and CentOS).




--
Lucas Mateus M. Araujo e Castro
Instituto de Pesquisas ELDORADO 


Departamento Computação Embarcada
Analista de Software Trainee
Aviso Legal - Disclaimer 


Re: [PULL 00/12] pc,virtio: fixes

2022-08-18 Thread Richard Henderson

On 8/17/22 13:05, Michael S. Tsirkin wrote:

The following changes since commit c7208a6e0d049f9e8af15df908168a79b1f99685:

   Update version for v7.1.0-rc3 release (2022-08-16 20:45:19 -0500)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

for you to fetch changes up to 9afb4177d66ac1eee858aba07fa2fc729b274eb4:

   virtio-pci: don't touch pci on virtio reset (2022-08-17 13:08:11 -0400)


pc,virtio: fixes

Several bugfixes, they all look very safe to me. Revert
seed support since we aren't any closer to a proper fix.

Signed-off-by: Michael S. Tsirkin 


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~





Alex Bennée (3):
   hw/virtio: gracefully handle unset vhost_dev vdev
   hw/virtio: handle un-configured shutdown in virtio-pci
   hw/virtio: fix vhost_user_read tracepoint

Gerd Hoffmann (1):
   x86: disable rng seeding via setup_data

Igor Mammedov (1):
   tests: acpi: silence applesmc warning about invalid key

Jonathan Cameron (5):
   hw/cxl: Fix memory leak in error paths
   hw/cxl: Fix wrong query of target ports
   hw/cxl: Add stub write function for RO MemoryRegionOps entries.
   hw/cxl: Fix Get LSA input payload size which should be 8 bytes.
   hw/cxl: Correctly handle variable sized mailbox input payloads.

Michael S. Tsirkin (1):
   virtio-pci: don't touch pci on virtio reset

Stefan Hajnoczi (1):
   virtio-scsi: fix race in virtio_scsi_dataplane_start()

  hw/block/dataplane/virtio-blk.c |  5 +
  hw/cxl/cxl-device-utils.c   | 12 +---
  hw/cxl/cxl-host.c   | 17 -
  hw/cxl/cxl-mailbox-utils.c  |  4 ++--
  hw/i386/microvm.c   |  2 +-
  hw/i386/pc_piix.c   |  2 +-
  hw/i386/pc_q35.c|  2 +-
  hw/scsi/virtio-scsi-dataplane.c | 11 ---
  hw/virtio/vhost-user.c  |  4 ++--
  hw/virtio/vhost.c   | 10 +++---
  hw/virtio/virtio-pci.c  | 19 +++
  tests/qtest/bios-tables-test.c  |  4 +++-
  12 files changed, 62 insertions(+), 30 deletions(-)







Re: [PATCH for-7.1 3/4] target/loongarch: rename the TCG CPU "la464" to "qemu64-v1.00"

2022-08-18 Thread Richard Henderson

On 8/17/22 19:31, WANG Xuerui wrote:
Hmm, I've looked up more context and it is indeed reasonable to generally name the QEMU 
models after real existing models. But in this case we could face a problem with 
Loongson's nomenclature: all of Loongson 3A5000, 3C5000 and 3C5000L are LA464, yet they 
should be distinguishable software-side by checking the model name CSR. But with only one 
CPU model that is LA464, currently this CSR is hard-coded to read "3A5000", and this can 
hurt IMO. And when we finally add LA264 and LA364 they would be identical ISA-level-wise, 
again the only differentiator is the model name CSR.


Indeed, I believe that I pointed this out during review, and asked for loongarch_qemu_read 
to be moved.  But apparently I missed it the next time around, and it snuck in.  There's 
nothing in that memory region that is related to the core.



And by "not high-fidelity", I mean some of the features present on real HW might never get 
implemented, or actually implementable, like the DVFS mechanism needed by cpufreq.


Certainly we can add stub versions of any such registers.  Such things are extremely 
common under target/arm/.


Lastly, the "ISA level" I proposed is not arbitrarily made up; it's a direct reference to 
the ISA manual revision. Each time the ISA gets some addition/revision the ISA manual has 
to be updated, and currently the manual's revision is the only reliable source of said 
information. (Loongson has a history of naming cores badly, like with the MIPS 3B1500 and 
3A4000, both were "GS464V"; and 3A5000 was originally GS464V too, even though the insn 
encodings and some semantics have been entirely different.)


That is a good argument for your isa level scheme, at least as aliases.


r~



Re: [PATCH 1/2] tests/tcg/ppc64le: Added an overflow with OE=1 test

2022-08-18 Thread Richard Henderson

On 8/17/22 09:57, Lucas Mateus Castro(alqotel) wrote:

+void sigfpe_handler(int sig, siginfo_t *si, void *ucontext)
+{
+uint64_t t;
+uint64_t ch = 0x5fcfffe4965a17e0ull;
+asm (
+"stfd 2, %0\n\t"
+: "=m"(t)
+:
+: "memory", "fr2"
+);


No, you need to fetch f2 from ucontext.  There's no guarantee of any specific values being 
present in the signal handler otherwise.



+return -1;


exit(-1), which returning from main equates to, is not as helpful as EXIT_FAILURE.
But here I'd tend to abort(), since it really shouldn't be reachable.


r~



Re: [PULL 05/10] x86: disable rng seeding via setup_data

2022-08-18 Thread Jason A. Donenfeld
Hi Gerd, Michael, Paolo,

On Thu, Aug 18, 2022 at 01:56:14PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> > > index 3a35193ff7..2e5dae9a89 100644
> > > --- a/hw/i386/pc_q35.c
> > > +++ b/hw/i386/pc_q35.c
> > > @@ -376,6 +376,7 @@ static void pc_q35_7_1_machine_options(MachineClass 
> > > *m)
> > >   pc_q35_machine_options(m);
> > >   m->alias = "q35";
> > >   pcmc->default_cpu_version = 1;
> > > +pcmc->legacy_no_rng_seed = true;
> > >   }
> > >   DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
> > > @@ -386,7 +387,6 @@ static void pc_q35_7_0_machine_options(MachineClass 
> > > *m)
> > >   PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
> > >   pc_q35_7_1_machine_options(m);
> > >   m->alias = NULL;
> > > -pcmc->legacy_no_rng_seed = true;
> > >   pcmc->enforce_amd_1tb_hole = false;
> > >   compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
> > >   compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
> > 
> > Why not just revert the whole patch?
> 
> Tried that first.  Plain revert not working, there are conflicts.
> So just disabling the code looked simpler and safer to me.

Yea, this is fine with me. This commit will be easy enough to revert in
7.2 when things are hopefully working properly in all circumstances.

Jason



[RFC 2/2] virtio: enable f_in_order feature for virtio-net

2022-08-18 Thread Guo Zhi
The in-order feature is not a transparent feature in QEMU; only specific
devices (e.g. virtio-net) support it.

Signed-off-by: Guo Zhi 
---
 hw/net/virtio-net.c| 1 +
 include/hw/virtio/virtio.h | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c8e83921..cf0b23d8 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -719,6 +719,7 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, 
uint64_t features,
 features |= n->host_features;
 
 virtio_add_feature(&features, VIRTIO_NET_F_MAC);
+virtio_add_feature(&features, VIRTIO_F_IN_ORDER);
 
 if (!peer_has_vnet_hdr(n)) {
 virtio_clear_feature(&features, VIRTIO_NET_F_CSUM);
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index db1c0ddf..578f22c8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -291,7 +291,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
 DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
   VIRTIO_F_IOMMU_PLATFORM, false), \
 DEFINE_PROP_BIT64("packed", _state, _field, \
-  VIRTIO_F_RING_PACKED, false)
+  VIRTIO_F_RING_PACKED, false), \
+DEFINE_PROP_BIT64("in_order", _state, _field, \
+  VIRTIO_F_IN_ORDER, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
-- 
2.17.1




Re: [PATCH v2] hw/i386: place setup_data at fixed place in memory

2022-08-18 Thread Jason A. Donenfeld
Hey Gerd,

On Tue, Aug 16, 2022 at 10:55:11AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > We can make setup_data chaining work with OVMF, but the whole chain
> > > should be located in a GPA range that OVMF dictates.
> > 
> > It sounds like what you describe is pretty OVMF-specific though,
> > right? Do we want to tie things together so tightly like that?
> > 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> Joining the party late (and still catching up the thread).  Given we
> don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> those 48 bytes random seed?

Actually, I want this to work with EFI, very much so.

If our objective was to just not break EFI, the solution would be
simple: in the kernel we can have EFISTUB ignore the setup_data field
from the image, and then bump the boot header protocol number. If QEMU
sees the boot protocol number is below this one, then it won't set
setup_data. Done, fixed.

Except I think there's value in passing seeds even through with EFI.

Your option ROM idea is interesting; somebody mentioned that elsewhere
too I think. I'm wondering, though: do option ROMs still run when
EFI/OVMF is being used?

Jason



[RFC 1/2] virtio: expose used buffers

2022-08-18 Thread Guo Zhi
Following the VIRTIO 1.1 spec, we can write out a single used ring entry
for a batch of descriptors, and only notify the guest when all the
descriptors of the batch have been used.

We do this batching for tx, because the driver doesn't need to know the
length of the tx buffer; for rx, we don't apply the batch strategy.

Signed-off-by: Guo Zhi 
---
 hw/net/virtio-net.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056f..c8e83921 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2542,8 +2542,10 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
 VirtIONet *n = q->n;
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 VirtQueueElement *elem;
+VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
 int32_t num_packets = 0;
 int queue_index = vq2q(virtio_get_queue_index(q->tx_vq));
+size_t j;
 if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
 return num_packets;
 }
@@ -2621,14 +2623,35 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
 }
 
 drop:
-virtqueue_push(q->tx_vq, elem, 0);
-virtio_notify(vdev, q->tx_vq);
-g_free(elem);
+if (!virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
+virtqueue_push(q->tx_vq, elem, 0);
+virtio_notify(vdev, q->tx_vq);
+g_free(elem);
+} else {
+elems[num_packets] = elem;
+}
 
 if (++num_packets >= n->tx_burst) {
 break;
 }
 }
+
+if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER) && num_packets) {
+/**
+ * If in order feature negotiated, devices can notify the use of a batch
+ * of buffers to the driver by only writing out a single used ring entry
+ * with the id corresponding to the head entry of the descriptor chain
+ * describing the last buffer in the batch.
+ */
+virtqueue_fill(q->tx_vq, elems[num_packets - 1], 0, 0);
+for (j = 0; j < num_packets; j++) {
+g_free(elems[j]);
+}
+
+virtqueue_flush(q->tx_vq, num_packets);
+virtio_notify(vdev, q->tx_vq);
+}
+
 return num_packets;
 }
 
-- 
2.17.1




[RFC 0/2] Virtio in order feature support for virtio-net device.

2022-08-18 Thread Guo Zhi
In virtio-spec 1.1, a new feature bit VIRTIO_F_IN_ORDER was introduced.
When this feature has been negotiated, the virtio driver will use
descriptors in ring order: starting from offset 0 in the table, and
wrapping around at the end of the table. Virtio devices will always use
descriptors in the same order in which they have been made available.
This can reduce virtio accesses to the used ring.

Based on the updated virtio spec, this series realizes an IN_ORDER prototype
for the virtio-net device in QEMU.

Some work hasn't been done in this patch series:
1. Virtio device in_order support for packed vq is left for the future.

Related patches:
In order feature in Linux(support virtio driver, vhost_test and vsock device): 
https://lkml.org/lkml/2022/8/17/643

Guo Zhi (2):
  virtio: expose used buffers
  virtio: enable f_in_order feature for virtio-net

 hw/net/virtio-net.c| 30 +++---
 include/hw/virtio/virtio.h |  4 +++-
 2 files changed, 30 insertions(+), 4 deletions(-)

-- 
2.17.1




[PATCH 9/9] parallels: Replace qemu_co_mutex_lock by WITH_QEMU_LOCK_GUARD

2022-08-18 Thread Alexander Ivanov
Replace the way we use the mutex in parallels_co_check() for simpler
and less error-prone code.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index f19e86d5d2..173c5d3721 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -563,24 +563,22 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 BDRVParallelsState *s = bs->opaque;
 int ret = 0;
 
-qemu_co_mutex_lock(&s->lock);
+WITH_QEMU_LOCK_GUARD(&s->lock) {
+parallels_check_unclean(bs, res, fix);
 
-parallels_check_unclean(bs, res, fix);
+ret = parallels_check_outside_image(bs, res, fix);
+if (ret < 0) {
+return ret;
+}
 
-ret = parallels_check_outside_image(bs, res, fix);
-if (ret < 0) {
-goto out;
-}
-
-ret = parallels_check_leak(bs, res, fix);
-if (ret < 0) {
-goto out;
-}
+ret = parallels_check_leak(bs, res, fix);
+if (ret < 0) {
+return ret;
+}
 
-parallels_collect_statistics(bs, res, fix);
+parallels_collect_statistics(bs, res, fix);
 
-out:
-qemu_co_mutex_unlock(&s->lock);
+}
 
 if (ret == 0) {
 ret = bdrv_co_flush(bs);
-- 
2.34.1




[PATCH 7/9] parallels: Move check of leaks to a separate function

2022-08-18 Thread Alexander Ivanov
We will add more and more checks so we need a better code structure
in parallels_co_check(). Let each check be performed in a separate loop
in a separate helper.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 84 +--
 1 file changed, 52 insertions(+), 32 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 1c7626c867..6a5fe8e5b2 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -478,14 +478,14 @@ static int parallels_check_outside_image(BlockDriverState 
*bs,
 return 0;
 }
 
-static int coroutine_fn parallels_co_check(BlockDriverState *bs,
-   BdrvCheckResult *res,
-   BdrvCheckMode fix)
+static int parallels_check_leak(BlockDriverState *bs,
+BdrvCheckResult *res,
+BdrvCheckMode fix)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t size, prev_off, high_off;
-int ret = 0;
+int64_t size, off, high_off, count;
 uint32_t i;
+int ret;
 
 size = bdrv_getlength(bs->file->bs);
 if (size < 0) {
@@ -493,41 +493,16 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 return size;
 }
 
-qemu_co_mutex_lock(&s->lock);
-
-parallels_check_unclean(bs, res, fix);
-
-ret = parallels_check_outside_image(bs, res, fix);
-if (ret < 0) {
-goto out;
-}
-
-res->bfi.total_clusters = s->bat_size;
-res->bfi.compressed_clusters = 0; /* compression is not supported */
-
 high_off = 0;
-prev_off = 0;
 for (i = 0; i < s->bat_size; i++) {
-int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
-if (off == 0) {
-prev_off = 0;
-continue;
-}
-
-res->bfi.allocated_clusters++;
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
 if (off > high_off) {
 high_off = off;
 }
-
-if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
-res->bfi.fragmented_clusters++;
-}
-prev_off = off;
 }
 
 res->image_end_offset = high_off + s->cluster_size;
 if (size > res->image_end_offset) {
-int64_t count;
 count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
 fprintf(stderr, "%s space leaked at the end of the image %" PRId64 
"\n",
 fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
@@ -545,11 +520,56 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 if (ret < 0) {
 error_report_err(local_err);
 res->check_errors++;
-goto out;
+return ret;
 }
 res->leaks_fixed += count;
 }
 }
+return 0;
+}
+
+static int coroutine_fn parallels_co_check(BlockDriverState *bs,
+   BdrvCheckResult *res,
+   BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t prev_off;
+int ret = 0;
+uint32_t i;
+
+qemu_co_mutex_lock(&s->lock);
+
+parallels_check_unclean(bs, res, fix);
+
+ret = parallels_check_outside_image(bs, res, fix);
+if (ret < 0) {
+goto out;
+}
+
+ret = parallels_check_leak(bs, res, fix);
+if (ret < 0) {
+goto out;
+}
+
+res->bfi.total_clusters = s->bat_size;
+res->bfi.compressed_clusters = 0; /* compression is not supported */
+
+prev_off = 0;
+for (i = 0; i < s->bat_size; i++) {
+int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off == 0) {
+prev_off = 0;
+continue;
+}
+
+res->bfi.allocated_clusters++;
+
+if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
+res->bfi.fragmented_clusters++;
+}
+prev_off = off;
+}
+
 out:
 qemu_co_mutex_unlock(&s->lock);
 
-- 
2.34.1




[PATCH 6/9] parallels: Move check of cluster outside image to a separate function

2022-08-18 Thread Alexander Ivanov
We will add more and more checks so we need a better code structure
in parallels_co_check(). Let each check be performed in a separate loop
in a separate helper.
The s->data_end fix relates to the out-of-image check, so move it
to the helper too.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 67 +++
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 3900a0f4a9..1c7626c867 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -438,6 +438,46 @@ static void parallels_check_unclean(BlockDriverState *bs,
 }
 }
 
+static int parallels_check_outside_image(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+uint32_t i;
+int64_t off, size;
+
+size = bdrv_getlength(bs->file->bs);
+if (size < 0) {
+res->check_errors++;
+return size;
+}
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off > size) {
+fprintf(stderr, "%s cluster %u is outside image\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+res->corruptions++;
+if (fix & BDRV_FIX_ERRORS) {
+parallels_set_bat_entry(s, i, 0);
+res->corruptions_fixed++;
+}
+}
+}
+
+/*
+ * If there were an out-of-image cluster it would be repaired,
+ * but s->data_end still would point outside image.
+ * Fix s->data_end by the file size.
+ */
+size >>= BDRV_SECTOR_BITS;
+if (s->data_end > size) {
+s->data_end = size;
+}
+
+return 0;
+}
+
 static int coroutine_fn parallels_co_check(BlockDriverState *bs,
BdrvCheckResult *res,
BdrvCheckMode fix)
@@ -457,6 +497,11 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 
 parallels_check_unclean(bs, res, fix);
 
+ret = parallels_check_outside_image(bs, res, fix);
+if (ret < 0) {
+goto out;
+}
+
 res->bfi.total_clusters = s->bat_size;
 res->bfi.compressed_clusters = 0; /* compression is not supported */
 
@@ -469,19 +514,6 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 continue;
 }
 
-/* cluster outside the image */
-if (off > size) {
-fprintf(stderr, "%s cluster %u is outside image\n",
-fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
-res->corruptions++;
-if (fix & BDRV_FIX_ERRORS) {
-prev_off = 0;
-parallels_set_bat_entry(s, i, 0);
-res->corruptions_fixed++;
-continue;
-}
-}
-
 res->bfi.allocated_clusters++;
 if (off > high_off) {
 high_off = off;
@@ -518,15 +550,6 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 res->leaks_fixed += count;
 }
 }
-/*
- * If there were an out-of-image cluster it would be repaired,
- * but s->data_end still would point outside image.
- * Fix s->data_end by the file size.
- */
-size >>= BDRV_SECTOR_BITS;
-if (s->data_end > size) {
-s->data_end = size;
-}
 out:
 qemu_co_mutex_unlock(&s->lock);
 
-- 
2.34.1




[PATCH 5/9] parallels: Move check of unclean image to a separate function

2022-08-18 Thread Alexander Ivanov
We will add more and more checks so we need a better code structure
in parallels_co_check(). Let each check be performed in a separate loop
in a separate helper.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 7d76d6ce9d..3900a0f4a9 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -418,6 +418,25 @@ static coroutine_fn int 
parallels_co_readv(BlockDriverState *bs,
 return ret;
 }
 
+static void parallels_check_unclean(BlockDriverState *bs,
+BdrvCheckResult *res,
+BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+
+if (!s->header_unclean) {
+return;
+}
+
+fprintf(stderr, "%s image was not closed correctly\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
+res->corruptions++;
+if (fix & BDRV_FIX_ERRORS) {
+/* parallels_close will do the job right */
+res->corruptions_fixed++;
+s->header_unclean = false;
+}
+}
 
 static int coroutine_fn parallels_co_check(BlockDriverState *bs,
BdrvCheckResult *res,
@@ -435,16 +454,8 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 }
 
 qemu_co_mutex_lock(&s->lock);
-if (s->header_unclean) {
-fprintf(stderr, "%s image was not closed correctly\n",
-fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
-res->corruptions++;
-if (fix & BDRV_FIX_ERRORS) {
-/* parallels_close will do the job right */
-res->corruptions_fixed++;
-s->header_unclean = false;
-}
-}
+
+parallels_check_unclean(bs, res, fix);
 
 res->bfi.total_clusters = s->bat_size;
 res->bfi.compressed_clusters = 0; /* compression is not supported */
-- 
2.34.1




[PATCH 4/9] parallels: Use generic infrastructure for BAT writing in parallels_co_check()

2022-08-18 Thread Alexander Ivanov
The BAT is written in the context of conventional operations over
the image inside bdrv_co_flush() when it calls the
parallels_co_flush_to_os() callback. Thus we should not
modify the BAT array directly, but call the parallels_set_bat_entry()
helper and bdrv_co_flush() further on. After that there is no
need to manually write the BAT and track its modification.

This makes the code more generic and allows splitting
parallels_set_bat_entry() into independent pieces.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index f460b36054..7d76d6ce9d 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -425,9 +425,8 @@ static int coroutine_fn parallels_co_check(BlockDriverState 
*bs,
 {
 BDRVParallelsState *s = bs->opaque;
 int64_t size, prev_off, high_off;
-int ret;
+int ret = 0;
 uint32_t i;
-bool flush_bat = false;
 
 size = bdrv_getlength(bs->file->bs);
 if (size < 0) {
@@ -466,9 +465,8 @@ static int coroutine_fn parallels_co_check(BlockDriverState 
*bs,
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
 prev_off = 0;
-s->bat_bitmap[i] = 0;
+parallels_set_bat_entry(s, i, 0);
 res->corruptions_fixed++;
-flush_bat = true;
 continue;
 }
 }
@@ -484,15 +482,6 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 prev_off = off;
 }
 
-ret = 0;
-if (flush_bat) {
-ret = bdrv_co_pwrite_sync(bs->file, 0, s->header_size, s->header, 0);
-if (ret < 0) {
-res->check_errors++;
-goto out;
-}
-}
-
 res->image_end_offset = high_off + s->cluster_size;
 if (size > res->image_end_offset) {
 int64_t count;
@@ -529,6 +518,14 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 }
 out:
 qemu_co_mutex_unlock(&s->lock);
+
+if (ret == 0) {
+ret = bdrv_co_flush(bs);
+if (ret < 0) {
+res->check_errors++;
+}
+}
+
 return ret;
 }
 
-- 
2.34.1




[PATCH 8/9] parallels: Move statistic collection to a separate function

2022-08-18 Thread Alexander Ivanov
We will add more and more checks so we need a better code structure
in parallels_co_check(). Let each check be performed in a separate loop
in a separate helper.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 53 +++
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 6a5fe8e5b2..f19e86d5d2 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -528,47 +528,56 @@ static int parallels_check_leak(BlockDriverState *bs,
 return 0;
 }
 
-static int coroutine_fn parallels_co_check(BlockDriverState *bs,
-   BdrvCheckResult *res,
-   BdrvCheckMode fix)
+static void parallels_collect_statistics(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode fix)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t prev_off;
-int ret = 0;
+int64_t off, prev_off;
 uint32_t i;
 
-qemu_co_mutex_lock(&s->lock);
-
-parallels_check_unclean(bs, res, fix);
-
-ret = parallels_check_outside_image(bs, res, fix);
-if (ret < 0) {
-goto out;
-}
-
-ret = parallels_check_leak(bs, res, fix);
-if (ret < 0) {
-goto out;
-}
-
 res->bfi.total_clusters = s->bat_size;
 res->bfi.compressed_clusters = 0; /* compression is not supported */
 
 prev_off = 0;
 for (i = 0; i < s->bat_size; i++) {
-int64_t off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
 if (off == 0) {
 prev_off = 0;
 continue;
 }
 
-res->bfi.allocated_clusters++;
-
 if (prev_off != 0 && (prev_off + s->cluster_size) != off) {
 res->bfi.fragmented_clusters++;
 }
+
 prev_off = off;
+res->bfi.allocated_clusters++;
 }
+}
+
+static int coroutine_fn parallels_co_check(BlockDriverState *bs,
+   BdrvCheckResult *res,
+   BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+int ret = 0;
+
+qemu_co_mutex_lock(&s->lock);
+
+parallels_check_unclean(bs, res, fix);
+
+ret = parallels_check_outside_image(bs, res, fix);
+if (ret < 0) {
+goto out;
+}
+
+ret = parallels_check_leak(bs, res, fix);
+if (ret < 0) {
+goto out;
+}
+
+parallels_collect_statistics(bs, res, fix);
 
 out:
 qemu_co_mutex_unlock(&s->lock);
-- 
2.34.1




[PATCH 3/9] parallels: create parallels_set_bat_entry_helper() to assign BAT value

2022-08-18 Thread Alexander Ivanov
This helper will be reused in the next patches during the
parallels_co_check rework to simplify its code.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 24c05b95e8..f460b36054 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -165,6 +165,13 @@ static int64_t block_status(BDRVParallelsState *s, int64_t 
sector_num,
 return start_off;
 }
 
+static void parallels_set_bat_entry(BDRVParallelsState *s,
+uint32_t index, uint32_t offset)
+{
+s->bat_bitmap[index] = cpu_to_le32(offset);
+bitmap_set(s->bat_dirty_bmap, bat_entry_off(index) / s->bat_dirty_block, 
1);
+}
+
 static int64_t allocate_clusters(BlockDriverState *bs, int64_t sector_num,
  int nb_sectors, int *pnum)
 {
@@ -250,10 +257,8 @@ static int64_t allocate_clusters(BlockDriverState *bs, 
int64_t sector_num,
 }
 
 for (i = 0; i < to_allocate; i++) {
-s->bat_bitmap[idx + i] = cpu_to_le32(s->data_end / s->off_multiplier);
+parallels_set_bat_entry(s, idx + i, s->data_end / s->off_multiplier);
 s->data_end += s->tracks;
-bitmap_set(s->bat_dirty_bmap,
-   bat_entry_off(idx + i) / s->bat_dirty_block, 1);
 }
 
 return bat2sect(s, idx) + sector_num % s->tracks;
-- 
2.34.1




[PATCH 2/9] parallels: Fix data_end field value in parallels_co_check()

2022-08-18 Thread Alexander Ivanov
When an image is opened for a check there is no error if an offset in the BAT
points outside the image; this way the check can repair the image.
Out-of-image offsets are repaired by the check, but the data_end field
still points outside the image. Clamp this field to the file size.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/parallels.c b/block/parallels.c
index c245ca35cd..24c05b95e8 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -513,7 +513,15 @@ static int coroutine_fn 
parallels_co_check(BlockDriverState *bs,
 res->leaks_fixed += count;
 }
 }
-
+/*
+ * If there were an out-of-image cluster it would be repaired,
+ * but s->data_end still would point outside image.
+ * Fix s->data_end by the file size.
+ */
+size >>= BDRV_SECTOR_BITS;
+if (s->data_end > size) {
+s->data_end = size;
+}
 out:
 qemu_co_mutex_unlock(&s->lock);
 return ret;
-- 
2.34.1




[PATCH 0/9] parallels: Refactor the code of images checks and fix a bug

2022-08-18 Thread Alexander Ivanov
Fix image inflation when offset in BAT is out of image.

Replace whole BAT syncing by flushing only dirty blocks.

Move all the checks out of the main check function into separate
functions.

Use WITH_QEMU_LOCK_GUARD for simpler code.

v4 changes:

  Move s->data_end fixing to parallels_co_check(). Split the check
  in parallels_open() and the fix in parallels_co_check() to two patches.

  Move offset conversion to parallels_set_bat_entry().

  Fix 'ret' rewriting by bdrv_co_flush() results.

  Keep 'i' as uint32_t.

Alexander Ivanov (9):
  parallels: Out of image offset in BAT leads to image inflation
  parallels: Fix data_end field value in parallels_co_check()
  parallels: create parallels_set_bat_entry_helper() to assign BAT value
  parallels: Use generic infrastructure for BAT writing in
parallels_co_check()
  parallels: Move check of unclean image to a separate function
  parallels: Move check of cluster outside image to a separate function
  parallels: Move check of leaks to a separate function
  parallels: Move statistic collection to a separate function
  parallels: Replace qemu_co_mutex_lock by WITH_QEMU_LOCK_GUARD

 block/parallels.c | 197 +-
 1 file changed, 141 insertions(+), 56 deletions(-)

-- 
2.34.1




[PATCH 1/9] parallels: Out of image offset in BAT leads to image inflation

2022-08-18 Thread Alexander Ivanov
The data_end field in BDRVParallelsState is set to the biggest offset present
in the BAT. If this offset is outside of the image, any further write will
create a cluster at this offset and/or the image will be truncated to this
offset on close. This is definitely not correct.
Raise an error in parallels_open() if data_end points outside the image and
this is not a check (let the check repair the image).

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index a229c06f25..c245ca35cd 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -732,6 +732,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 BDRVParallelsState *s = bs->opaque;
 ParallelsHeader ph;
 int ret, size, i;
+int64_t file_size;
 QemuOpts *opts = NULL;
 Error *local_err = NULL;
 char *buf;
@@ -811,6 +812,19 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+file_size = bdrv_getlength(bs->file->bs);
+if (file_size < 0) {
+ret = file_size;
+goto fail;
+}
+
+file_size >>= BDRV_SECTOR_BITS;
+if (s->data_end > file_size && !(flags & BDRV_O_CHECK)) {
+error_setg(errp, "parallels: Offset in BAT is out of image");
+ret = -EINVAL;
+goto fail;
+}
+
 if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
 /* Image was not closed correctly. The check is mandatory */
 s->header_unclean = true;
-- 
2.34.1




[PATCH v1 1/2] i386: reset KVM nested state upon CPU reset

2022-08-18 Thread Vitaly Kuznetsov
Make sure env->nested_state is cleaned up when a vCPU is reset; it may
be stale after an incoming migration, and kvm_arch_put_registers() may
end up failing or putting the vCPU in a weird state.

Reviewed-by: Maxim Levitsky 
Signed-off-by: Vitaly Kuznetsov 
---
 target/i386/kvm/kvm.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index f148a6d52fa4..4f8dacc1d4b5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1695,6 +1695,30 @@ static void kvm_init_xsave(CPUX86State *env)
env->xsave_buf_len);
 }
 
+static void kvm_init_nested_state(CPUX86State *env)
+{
+struct kvm_vmx_nested_state_hdr *vmx_hdr;
+uint32_t size;
+
+if (!env->nested_state) {
+return;
+}
+
+size = env->nested_state->size;
+
+memset(env->nested_state, 0, size);
+env->nested_state->size = size;
+
+if (cpu_has_vmx(env)) {
+env->nested_state->format = KVM_STATE_NESTED_FORMAT_VMX;
+vmx_hdr = &env->nested_state->hdr.vmx;
+vmx_hdr->vmxon_pa = -1ull;
+vmx_hdr->vmcs12_pa = -1ull;
+} else if (cpu_has_svm(env)) {
+env->nested_state->format = KVM_STATE_NESTED_FORMAT_SVM;
+}
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 struct {
@@ -2122,19 +2146,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
 assert(max_nested_state_len >= offsetof(struct kvm_nested_state, 
data));
 
 if (cpu_has_vmx(env) || cpu_has_svm(env)) {
-struct kvm_vmx_nested_state_hdr *vmx_hdr;
-
 env->nested_state = g_malloc0(max_nested_state_len);
 env->nested_state->size = max_nested_state_len;
 
-if (cpu_has_vmx(env)) {
-env->nested_state->format = KVM_STATE_NESTED_FORMAT_VMX;
-vmx_hdr = &env->nested_state->hdr.vmx;
-vmx_hdr->vmxon_pa = -1ull;
-vmx_hdr->vmcs12_pa = -1ull;
-} else {
-env->nested_state->format = KVM_STATE_NESTED_FORMAT_SVM;
-}
+kvm_init_nested_state(env);
 }
 }
 
@@ -2199,6 +2214,8 @@ void kvm_arch_reset_vcpu(X86CPU *cpu)
 /* enabled by default */
 env->poll_control_msr = 1;
 
+kvm_init_nested_state(env);
+
 sev_es_set_reset_vector(CPU(cpu));
 }
 
-- 
2.37.1




Re: [PATCH v2 00/31] QOMify PPC4xx devices and minor clean ups

2022-08-18 Thread BALATON Zoltan

On Thu, 18 Aug 2022, Cédric Le Goater wrote:

Daniel,

On 8/17/22 17:08, BALATON Zoltan wrote:

Hello,

This is based on gitlab.com/danielhb/qemu/tree/ppc-7.2

This series contains the rest of Cédric's OOM'ify patches modified
according my review comments and some other clean ups I've noticed
along the way.


I think patches 01-24 are good for merge.


When merging note the v3 for patch 21.


v2 now also includes the sdram changes after some clean up to simplify
it. This should now be the same state as Cédric's series. I shall
continue with the ppc440_sdram DDR2 controller model used by the
sam460ex, but that needs a few more changes. But it is independent of
this series so this can be merged now and I can follow up later in a
separate series.


I will take a look at the SDRAM changes later.


OK, I'll probably resend these starting from patch 25 with some small 
changes: the v3 I've sent for the last patch and one I noticed in patch 30, 
which is just making the ppc4xx_sdram_bank_sizes array local to 
ppc4xx_sdram_ddr_realize() as it's not used anywhere else; this avoids a 
clash with a similar array when I move the sam460 version here. Other than 
these, the patches should be stable; I'll just plan to add additional 
patches to handle the sam460ex SDRAM controller as well. If you have any 
comments on the current patches I can incorporate those in this new 
series.


Regards,
BALATON Zoltan
