Re: [PATCH] docs/style: allow C99 mixed declarations
Philippe Mathieu-Daudé writes:

> On 6/2/24 06:53, Markus Armbruster wrote:
>> Daniel P. Berrangé writes:
>>> On Mon, Feb 05, 2024 at 12:18:19PM -0500, Stefan Hajnoczi wrote:
>>>> C99 mixed declarations support interleaving of local variable
>>>> declarations and code. The coding style "generally" forbids C99
>>>> mixed declarations with some exceptions to the rule. This rule is
>>>> not checked by checkpatch.pl and naturally there are violations in
>>>> the source tree. While contemplating adding another exception, I
>>>> came to the conclusion that the best location for declarations
>>>> depends on context. Let the programmer declare variables where it
>>>> is best for legibility. Don't try to define all possible
>>>> scenarios/exceptions.
>
> ...
>
>>> Even if the compiler does reliably warn, I think the code pattern
>>> remains misleading to contributors, as the flow control flaw is
>>> very non-obvious.
>>
>> Yup. Strong dislike.
>>
>>> Rather than accept the status quo and remove the coding guideline,
>>> I think we should strengthen the guidelines, such that it is
>>> explicitly forbidden in any method that uses 'goto'. Personally
>>> I'd go all the way to -Werror=declaration-after-statement, as
>>
>> I support this.
>>
>>> while C99 mixed decl is appealing,
>>
>> Not to me.
>>
>> I much prefer declarations and statements to be visually distinct.
>> Putting declarations first and separating them from statements with a
>> blank line accomplishes that. Less necessary in languages where
>> declarations are syntactically obvious.
>
> But we already implicitly suggest C99, see commit ae7c80a7bd
> ("error: New macro ERRP_GUARD()"):
>
>  * To use ERRP_GUARD(), add it right at the beginning of the function.
>  * @errp can then be used without worrying about the argument being
>  * NULL or &error_fatal.
>
>  #define ERRP_GUARD()                                              \
>      g_auto(ErrorPropagator) _auto_errp_prop = {.errp = errp};     \
>      do {                                                          \
>          if (!errp || errp == &error_fatal) {                      \
>              errp = &_auto_errp_prop.local_err;                    \
>          }                                                         \
>      } while (0)

We can make ERRP_GUARD() expand into just a declaration:

    #define ERRP_GUARD()                                            \
        g_auto(ErrorPropagator) _auto_errp_prop = {                 \
            .errp = errp,                                           \
            .local_err = ((!errp || errp == &error_fatal            \
                           ? errp = &_auto_errp_prop.local_err     \
                           : NULL),                                 \
                          NULL) }

> Or commit 5626f8c6d4 ("rcu: Add automatically released rcu_read_lock
> variants") with WITH_RCU_READ*:
>
>   util/aio-posix.c:540:5: error: mixing declarations and code is
>   incompatible with standards before C99
>   [-Werror,-Wdeclaration-after-statement]
>       RCU_READ_LOCK_GUARD();
>       ^
>   include/qemu/rcu.h:189:28: note: expanded from macro 'RCU_READ_LOCK_GUARD'
>       g_autoptr(RCUReadAuto) _rcu_read_auto __attribute__((unused)) =
>           rcu_read_auto_lock()
>                              ^

Valid example; RCU_READ_LOCK_GUARD() expands into a declaration. To
enable -Wdeclaration-after-statement, we'd have to futz around with
_Pragma.
Re: [PATCH 2/2] tests/qtest/npcm7xx_emc-test: Connect all NICs to a backend
On 06/02/2024 18.12, Peter Maydell wrote:
> Currently QEMU will warn if there is a NIC on the board that is not
> connected to a backend. By default the '-nic user' will get used for
> all NICs, but if you manually connect a specific NIC to a specific
> backend, then the other NICs on the board have no backend and will be
> warned about:
>
> qemu-system-arm: warning: nic npcm7xx-emc.1 has no peer
> qemu-system-arm: warning: nic npcm-gmac.0 has no peer
> qemu-system-arm: warning: nic npcm-gmac.1 has no peer
>
> So suppress those warnings by manually connecting every NIC on the
> board to some backend.
>
> Signed-off-by: Peter Maydell
> ---
>  tests/qtest/npcm7xx_emc-test.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tests/qtest/npcm7xx_emc-test.c b/tests/qtest/npcm7xx_emc-test.c
> index f7646fae2c9..63f6cadb5cc 100644
> --- a/tests/qtest/npcm7xx_emc-test.c
> +++ b/tests/qtest/npcm7xx_emc-test.c
> @@ -228,7 +228,10 @@ static int *packet_test_init(int module_num, GString *cmd_line)
>       * KISS and use -nic. The driver accepts 'emc0' and 'emc1' as aliases
>       * in the 'model' field to specify the device to match.
>       */
> -    g_string_append_printf(cmd_line, " -nic socket,fd=%d,model=emc%d ",
> +    g_string_append_printf(cmd_line, " -nic socket,fd=%d,model=emc%d "
> +                           "-nic user,model=npcm7xx-emc "
> +                           "-nic user,model=npcm-gmac "
> +                           "-nic user,model=npcm-gmac",

Alternatively, use -nic hubport,hubid=0 in case we ever want to run this
test without slirp support, too (but currently there is already a check
for this in the meson.build file, so -nic user should be fine, too).

Anyway,
Reviewed-by: Thomas Huth
[PATCH trivial] qemu-nbd: mention --tls-hostname option in qemu-nbd --help
This option was not documented.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1240
Signed-off-by: Michael Tokarev
---
 qemu-nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/qemu-nbd.c b/qemu-nbd.c
index bac0b5e3ec..d7b3ccab21 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -114,6 +114,7 @@ static void usage(const char *name)
 "  --tls-creds=ID            use id of an earlier --object to provide TLS\n"
 "  --tls-authz=ID            use id of an earlier --object to provide\n"
 "                            authorization\n"
+"  --tls-hostname=HOSTNAME   override hostname used to check x509 certificate\n"
 "  -T, --trace [[enable=]][,events=][,file=]\n"
 "                            specify tracing options\n"
 "  --fork                    fork off the server process and exit the parent\n"
-- 
2.39.2
Re: [RFC PATCH 0/4] Confidential Guest Support: Introduce kvm_init() and kvm_reset() virtual functions
On 2/6/2024 10:19 PM, Daniel P. Berrangé wrote:
> On Tue, Feb 06, 2024 at 03:28:48AM -0500, Xiaoyao Li wrote:
>> This series is inspired and suggested by Daniel:
>> https://lore.kernel.org/qemu-devel/zbfoqseuv6_zw...@redhat.com/
>>
>> Currently, different confidential VMs in different architectures have
>> their own specific *_kvm_init() (and some have *_kvm_reset()) exposed
>> for KVM stuff when it's a confidential VM. e.g., sev_kvm_init() for
>> x86 SEV, pef_kvm_init() and pef_kvm_reset() for PPC PEF, and
>> s390_pv_init() for s390 PV VMs.
>>
>> Introduce generic .kvm_init() and .kvm_reset() functions in
>> ConfidentialGuestSupportClass, so that different cgs technologies in
>> different architectures can implement their own, while the common cgs
>> interface can be used. This RFC implements two helper functions,
>> confidential_guest_kvm_init() and confidential_guest_kvm_reset(), in
>> Patch 1. In the following patches, they are called in arch-specific
>> implementations. X86 will benefit more from the generic
>> implementation when TDX support is added.
>>
>> There is one further step possible: calling
>> confidential_guest_kvm_init() before kvm_arch_init() in kvm_init() in
>> accel/kvm/kvm-all.c. This way, each arch doesn't need to call it in
>> its arch-specific code. X86 fits this, however I'm not sure if ppc
>> and s390 fit it as well, because currently ppc calls it in
>> machine->init() and s390 calls it in MachineClass->init(). I'm not
>> sure if there is any order dependency.
>
> IIUC that s390 call is still a machine->init method, rather than class
> init.

I double-checked the code. Only struct MachineClass has an .init()
function defined, and I found that both ppc and s390 call
confidential_guest_kvm_init() (or their specific cgs kvm_init()) inside
their machine_class->init().

> I think this series is nice, but it's up to the KVM maintainers to
> decide...
>
> With regards,
> Daniel
Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
On Wed, 7 Feb 2024, Richard Henderson wrote: > On 2/7/24 06:48, Alexander Monakov wrote: > > Make buffer_is_zero a 'static inline' function that tests up to three > > bytes from the buffer before handing off to an unrolled loop. This > > eliminates call overhead for most non-zero buffers, and allows to > > optimize out length checks when it is known at compile time (which is > > often the case in Qemu). > > > > Signed-off-by: Alexander Monakov > > Signed-off-by: Mikhail Romanov > > --- > > include/qemu/cutils.h | 28 +++- > > util/bufferiszero.c | 76 --- > > 2 files changed, 47 insertions(+), 57 deletions(-) > > > > diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h > > index 92c927a6a3..62b153e603 100644 > > --- a/include/qemu/cutils.h > > +++ b/include/qemu/cutils.h > > @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); > > /* used to print char* safely */ > > #define STR_OR_NULL(str) ((str) ? (str) : "null") > > > > -bool buffer_is_zero(const void *buf, size_t len); > > +bool buffer_is_zero_len_4_plus(const void *, size_t); > > +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); > > Why 256, when the avx2 routine can handle size 128, and you're about to remove > avx512? (yes, avx2 is bumped to 256-byte chunks in a later patch) > You appear to have missed that select_accel_fn() resolves directly to > buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an > indirect function call. > > I think you should not attempt to expose the 4 vs larger implementation detail > here in the inline function. Presumably the bulk of the benefit in avoiding > the function call is already realized via the three byte spot checks. Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest I'd really like to keep this optimization because so many places in Qemu invoke buffer_is_zero with a constant length, allowing the compiler to optimize out the length test. 
Would you be open to testing availability of optimized variants in the inline wrapper like this: diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 62b153e603..7a2145ffef 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len) return true; } +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) if (len >= 256) { return buffer_is_zero_len_256_plus(vbuf, len); -} else { -return buffer_is_zero_len_4_plus(vbuf, len); } +#endif +return buffer_is_zero_len_4_plus(vbuf, len); } /* Alexander
Re: [RFC PATCH 2/4] i386/sev: Switch to use confidential_guest_kvm_init()
On 2/6/2024 10:16 PM, Daniel P. Berrangé wrote:
> On Tue, Feb 06, 2024 at 03:28:50AM -0500, Xiaoyao Li wrote:
>> Use confidential_guest_kvm_init() instead of calling the SEV-specific
>> sev_kvm_init(). As a bonus, it fits future TDX when TDX implements
>> its own confidential_guest_support and .kvm_init().
>>
>> Move the "TypeInfo sev_guest_info" definition and related functions
>> to the end of the file, to avoid declaring sev_kvm_init() ahead of
>> its definition. Clean up sev-stub.c since it's not needed anymore.
>>
>> Signed-off-by: Xiaoyao Li
>> ---
>>  target/i386/kvm/kvm.c       |   2 +-
>>  target/i386/kvm/meson.build |   2 -
>>  target/i386/kvm/sev-stub.c  |   5 --
>>  target/i386/sev.c           | 120 +++-
>>  target/i386/sev.h           |   2 -
>>  5 files changed, 63 insertions(+), 68 deletions(-)
>>
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 76a66246eb72..bb63bba61fa1 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2534,7 +2534,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>       * mechanisms are supported in future (e.g. TDX), they'll need
>>       * their own initialization either here or elsewhere.
>>       */
>> -    ret = sev_kvm_init(ms->cgs, &local_err);
>> +    ret = confidential_guest_kvm_init(ms->cgs, &local_err);
>
> If you agree with my comment in patch 1 about the API expecting
> non-NULL, then this would need to be conditionalized (same for the 2
> following patches too)

sure. Will change.
> if (ms->cgs) {
>     ret = confidential_guest_kvm_init()
>     if (ret < 0) {
>     }
> }

>>      if (ret < 0) {
>>          error_report_err(local_err);
>>          return ret;
>>
>> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
>> index 84d9143e6029..e7850981e62d 100644
>> --- a/target/i386/kvm/meson.build
>> +++ b/target/i386/kvm/meson.build
>> @@ -7,8 +7,6 @@ i386_kvm_ss.add(files(
>>
>>  i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
>>
>> -i386_kvm_ss.add(when: 'CONFIG_SEV', if_false: files('sev-stub.c'))
>> -
>>  i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'),
>>                     if_false: files('hyperv-stub.c'))
>>
>>  i386_system_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
>>
>> diff --git a/target/i386/kvm/sev-stub.c b/target/i386/kvm/sev-stub.c
>> index 1be5341e8a6a..4a1560cf8ad7 100644
>> --- a/target/i386/kvm/sev-stub.c
>> +++ b/target/i386/kvm/sev-stub.c
>> @@ -14,8 +14,3 @@
>>  #include "qemu/osdep.h"
>>  #include "sev.h"
>>
>> -int sev_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
>> -{
>> -    /* If we get here, cgs must be some non-SEV thing */
>> -    return 0;
>> -}
>
> You can actually delete this entire file, since you removed the only
> method in it, and stopped building it in the meson.build patch above.

I intended to do it. Apparently I missed it somehow and didn't catch it
before sending out. Will fix in the next version.
>> diff --git a/target/i386/sev.c b/target/i386/sev.c
>> index 173de91afe7d..19e79d3631d0 100644
>> --- a/target/i386/sev.c
>> +++ b/target/i386/sev.c
>> @@ -353,63 +353,6 @@ static void sev_guest_set_kernel_hashes(Object *obj, bool value, Error **errp)
>>      sev->kernel_hashes = value;
>>  }
>>
>> -static void
>> -sev_guest_class_init(ObjectClass *oc, void *data)
>> -{
>> -    object_class_property_add_str(oc, "sev-device",
>> -                                  sev_guest_get_sev_device,
>> -                                  sev_guest_set_sev_device);
>> -    object_class_property_set_description(oc, "sev-device",
>> -        "SEV device to use");
>> -    object_class_property_add_str(oc, "dh-cert-file",
>> -                                  sev_guest_get_dh_cert_file,
>> -                                  sev_guest_set_dh_cert_file);
>> -    object_class_property_set_description(oc, "dh-cert-file",
>> -        "guest owners DH certificate (encoded with base64)");
>> -    object_class_property_add_str(oc, "session-file",
>> -                                  sev_guest_get_session_file,
>> -                                  sev_guest_set_session_file);
>> -    object_class_property_set_description(oc, "session-file",
>> -        "guest owners session parameters (encoded with base64)");
>> -    object_class_property_add_bool(oc, "kernel-hashes",
>> -                                   sev_guest_get_kernel_hashes,
>> -                                   sev_guest_set_kernel_hashes);
>> -    object_class_property_set_description(oc, "kernel-hashes",
>> -        "add kernel hashes to guest firmware for measured Linux boot");
>> -}
>> -
>> -static void
>> -sev_guest_instance_init(Object *obj)
>> -{
>> -    SevGuestState *sev = SEV_GUEST(obj);
>> -
>> -    sev->sev_device = g_strdup(DEFAULT_SEV_DEVICE);
>> -    sev->policy = DEFAULT_GUEST_POLICY;
>> -    object_property_add_uint32_ptr(obj, "policy", &sev->policy,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -    object_property_add_uint32_ptr(obj, "handle", &sev->handle,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -    object_property_add_uint32_ptr(obj, "cbitpos", &sev->cbitpos,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -
Re: [PATCH 5/5] monitor: use aio_co_reschedule_self()
Stefan Hajnoczi writes: > The aio_co_reschedule_self() API is designed to avoid the race > condition between scheduling the coroutine in another AioContext and > yielding. > > The QMP dispatch code uses the open-coded version that appears > susceptible to the race condition at first glance: > > aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self()); > qemu_coroutine_yield(); > > The code is actually safe because the iohandler and qemu_aio_context > AioContext run under the Big QEMU Lock. Nevertheless, set a good example > and use aio_co_reschedule_self() so it's obvious that there is no race. > > Suggested-by: Hanna Reitz > Signed-off-by: Stefan Hajnoczi Acked-by: Markus Armbruster Feel free to merge this together with the remainder of the series.
Re: [PATCH] qapi/migration: Add missing tls-authz documentation
On Wed, Feb 07, 2024 at 07:07:58AM +0100, Markus Armbruster wrote: > pet...@redhat.com writes: > > > From: Peter Xu > > > > As reported in Markus's recent enforcement series on qapi doc [1], we > > accidentally miss one entry for tls-authz. Add it. Then we can drop > > @MigrateSetParameters from documentation-exceptions safely later. > > > > [1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com > > > > Cc: Daniel P. Berrangé > > Cc: Fabiano Rosas > > Reported-by: Markus Armbruster > > Signed-off-by: Peter Xu > > --- > > qapi/migration.json | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/qapi/migration.json b/qapi/migration.json > > index 819708321d..f4c5f59e01 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -980,6 +980,10 @@ > > # 2.9) Previously (since 2.7), this was reported by omitting > > # tls-hostname instead. > > # > > +# @tls-authz: ID of the 'authz' object subclass that provides access > > +# control checking of the TLS x509 certificate distinguished name. > > +# (Since 4.0) > > +# > > # @max-bandwidth: to set maximum speed for migration. maximum speed > > # in bytes per second. (Since 2.8) > > # > > Reviewed-by: Markus Armbruster > > I propose I queue this right after [1] with the update to pragma.json > squashed in (appended), and the sentence "Then we can drop ... later" > dropped. > > Thanks for your help! > > > diff --git a/qapi/pragma.json b/qapi/pragma.json > index 7ac05ccc26..6929ab776e 100644 > --- a/qapi/pragma.json > +++ b/qapi/pragma.json > @@ -69,7 +69,6 @@ > 'JSONType', > 'KeyValueKind', > 'MemoryDeviceInfoKind', > -'MigrateSetParameters', > 'NetClientDriver', > 'ObjectType', > 'PciMemoryRegion', > Yes, please. Or queue this prior to that series, then below diff can be squashed into the other patch; either way works. Thanks Markus! -- Peter Xu
Re: [PATCH] hw/intc: Handle the error of IOAPICCommonClass.realize()
Zhao Liu writes:

> Hi Philippe,
>
> On Wed, Jan 31, 2024 at 05:48:24PM +0100, Philippe Mathieu-Daudé wrote:
>> Date: Wed, 31 Jan 2024 17:48:24 +0100
>> From: Philippe Mathieu-Daudé
>> Subject: Re: [PATCH] hw/intc: Handle the error of IOAPICCommonClass.realize()
>>
>> Hi Zhao,
>>
>> On 31/1/24 15:29, Zhao Liu wrote:
>> > From: Zhao Liu
>> >
>> > IOAPICCommonClass implements its own private realize(), and this
>> > private realize() allows error.
>> >
>> > Therefore, return directly if IOAPICCommonClass.realize() meets error.
>> >
>> > Signed-off-by: Zhao Liu
>> > ---
>> >  hw/intc/ioapic_common.c | 3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/hw/intc/ioapic_common.c b/hw/intc/ioapic_common.c
>> > index cb9bf6214608..3772863377c2 100644
>> > --- a/hw/intc/ioapic_common.c
>> > +++ b/hw/intc/ioapic_common.c
>> > @@ -162,6 +162,9 @@ static void ioapic_common_realize(DeviceState *dev, Error **errp)
>> >      info = IOAPIC_COMMON_GET_CLASS(s);
>> >      info->realize(dev, errp);
>> > +    if (*errp) {
>> > +        return;
>> > +    }

This is wrong, although it'll work in practice. It's wrong, because
dereferencing @errp requires ERRP_GUARD(). qapi/error.h:

 * = Why, when and how to use ERRP_GUARD() =
 *
 * Without ERRP_GUARD(), use of the @errp parameter is restricted:
 * - It must not be dereferenced, because it may be null.
 * - It should not be passed to error_prepend() or
 *   error_append_hint(), because that doesn't work with &error_fatal.
 * ERRP_GUARD() lifts these restrictions.
 *
 * To use ERRP_GUARD(), add it right at the beginning of the function.
 * @errp can then be used without worrying about the argument being
 * NULL or &error_fatal.
 *
 * Using it when it's not needed is safe, but please avoid cluttering
 * the source with useless code.

It'll work anyway, because the caller never passes null.
Obvious fix:

diff --git a/hw/intc/ioapic_common.c b/hw/intc/ioapic_common.c
index cb9bf62146..280404cba5 100644
--- a/hw/intc/ioapic_common.c
+++ b/hw/intc/ioapic_common.c
@@ -152,6 +152,7 @@ static int ioapic_dispatch_post_load(void *opaque, int version_id)
 
 static void ioapic_common_realize(DeviceState *dev, Error **errp)
 {
+    ERRP_GUARD();
     IOAPICCommonState *s = IOAPIC_COMMON(dev);
     IOAPICCommonClass *info;

>> Could be clearer to deviate from DeviceRealize and let the
>> handler return a boolean:
>>
>> -- >8 --
>> diff --git a/hw/intc/ioapic_internal.h b/hw/intc/ioapic_internal.h
>> index 37b8565539..9664bb3e00 100644
>> --- a/hw/intc/ioapic_internal.h
>> +++ b/hw/intc/ioapic_internal.h
>> @@ -92,3 +92,3 @@ struct IOAPICCommonClass {
>>
>> -    DeviceRealize realize;
>> +    bool (*realize)(DeviceState *dev, Error **errp);

qapi/error.h advises:

 * - Whenever practical, also return a value that indicates success /
 *   failure.  This can make the error checking more concise, and can
 *   avoid useless error object creation and destruction.  Note that
 *   we still have many functions returning void.  We recommend
 *   • bool-valued functions return true on success / false on failure,
 *   • pointer-valued functions return non-null / null pointer, and
 *   • integer-valued functions return non-negative / negative.

The patch then becomes

     info = IOAPIC_COMMON_GET_CLASS(s);
-    info->realize(dev, errp);
+    if (!info->realize(dev, errp)) {
+        return;
+    }

DeviceClass and BusClass callbacks realize, unrealize ignore this
advice: they return void. Why? Following the advice makes calls easier
to read, but the callees have to do a tiny bit of extra work: return
something. Good trade when we have at least as many callers as
callees. But these callbacks have many more callees: many devices
implement them, but only a few places call. Changing them to return
something looked like more trouble than it's worth, so we didn't.

> What about I change the name of this interface?
> Maybe ioapic_realize(), to distinguish it from DeviceClass.realize().

I wouldn't bother.

>> DeviceUnrealize unrealize;
>
> Additionally, if I change the pattern of realize(), should I also avoid
> the DeviceUnrealize macro for symmetry's sake and just declare a
> similar function pointer as you said?
>
> Further, do you think it's necessary to introduce InternalRealize and
> InternalUnrealize macros for qdev

You mean typedefs?

> for qdev to wrap these special realize/unrealize
> to differentiate them from normal DeviceRealize/DeviceUnrealize?
>
> Because I found that this pattern of realize() (i.e. registering the
> realize() of the child class in the parent class instead of
> DeviceClass, and then calling the registered realize() in parent
> realize()) is also widely used in many cases:
>
> * xen_block_realize()
> * virtser_port_device_realize()
> * x86_iommu_realize()
> * virtio_input_device_realize()
> * apic_common_realize()
> * pc_dimm_realize()
> * virtio_device_realize()
> ...

Yes. When a
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
On Tue, 6 Feb 2024, Elena Ufimtseva wrote: > Hello Alexander > > On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov > wrote: > > > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > > routines are invoked much more rarely in normal use when most buffers > > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > > frequency and voltage transition periods during which the CPU operates > > at reduced performance, as described in > > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html > > > I would like to point out that the frequency scaling is not currently an > issue on AMD Zen4 Genoa CPUs, for example. > And microcode architecture description here: > https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf > Although, the cpu frequency downscaling mentioned in the above document is > only in relation to floating point operations. > But from other online discussions I gather that the data path for the > integer registers in Zen4 is also 256 bits and it allows to avoid > frequency downscaling for FP and heavy instructions. Yes, that's correct: in particular, on Zen 4 512-bit vector loads occupy load ports for two consecutive cycles, so from load throughput perspective there's no difference between 256-bit vectors and 512-bit vectors. Generally AVX-512 still has benefits on Zen 4 since it's a richer instruction set (it also reduces pressure in the CPU front-end and is more power-efficient), but as the new AVX2 buffer_is_zero is saturating load ports I would expect that AVX512 can exceed its performance only by a small margin if at all, not anywhere close to 2x. > And looking at the optimizations for AVX2 in your other patch, would > unrolling the loop for AVX512 ops benefit from the speedup taken that the > data path has the same width? 
No, 256-bit datapath on Zen 4 means that it's easier to saturate it with 512-bit loads than with 256-bit loads, so an AVX512 loop is roughly comparable to a similar AVX-256 loop unrolled twice. Aside: AVX512 variant needs a little more thought to use VPTERNLOG properly. > If the frequency downscaling is not observed on some of the CPUs, can > AVX512 be maintained and used selectively for some > of the CPUs? Please note that a properly optimized buffer_is_zero is limited by load throughput, not ALUs. On Zen 4 AVX2 is sufficient to saturate L1 cache load bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits of AVX-512 diminish more and more. I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial there for buffer_is_zero for reasons like reaching higher turbo clocks or higher memory parallelism. Finally, let's consider a somewhat broader perspective. Let's suppose buffer_is_zero takes 50% of overall application runtime, and 9 out of 10 buffers are found out to be non-zero in the inline wrapper that samples three bytes. Then the vectorized routine takes about 5% of application time, and speeding it up even by 20% only shaves off 1% from overall execution time. Alexander
Re: [PATCH] qapi/migration: Add missing tls-authz documentation
pet...@redhat.com writes: > From: Peter Xu > > As reported in Markus's recent enforcement series on qapi doc [1], we > accidentally miss one entry for tls-authz. Add it. Then we can drop > @MigrateSetParameters from documentation-exceptions safely later. > > [1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com > > Cc: Daniel P. Berrangé > Cc: Fabiano Rosas > Reported-by: Markus Armbruster > Signed-off-by: Peter Xu > --- > qapi/migration.json | 4 > 1 file changed, 4 insertions(+) > > diff --git a/qapi/migration.json b/qapi/migration.json > index 819708321d..f4c5f59e01 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -980,6 +980,10 @@ > # 2.9) Previously (since 2.7), this was reported by omitting > # tls-hostname instead. > # > +# @tls-authz: ID of the 'authz' object subclass that provides access > +# control checking of the TLS x509 certificate distinguished name. > +# (Since 4.0) > +# > # @max-bandwidth: to set maximum speed for migration. maximum speed > # in bytes per second. (Since 2.8) > # Reviewed-by: Markus Armbruster I propose I queue this right after [1] with the update to pragma.json squashed in (appended), and the sentence "Then we can drop ... later" dropped. Thanks for your help! diff --git a/qapi/pragma.json b/qapi/pragma.json index 7ac05ccc26..6929ab776e 100644 --- a/qapi/pragma.json +++ b/qapi/pragma.json @@ -69,7 +69,6 @@ 'JSONType', 'KeyValueKind', 'MemoryDeviceInfoKind', -'MigrateSetParameters', 'NetClientDriver', 'ObjectType', 'PciMemoryRegion',
Re: [PATCH v3 08/17] plugins: add inline operation per vcpu
On 2/7/24 07:45, Richard Henderson wrote:
> On 2/6/24 19:24, Pierrick Bouvier wrote:
>> --- a/accel/tcg/plugin-gen.c
>> +++ b/accel/tcg/plugin-gen.c
>> @@ -442,6 +442,13 @@ static TCGOp *append_inline_cb(const struct qemu_plugin_dyn_cb *cb,
>>      char *ptr = cb->userp;
>>      size_t elem_size = 0;
>>      size_t offset = 0;
>> +    if (!ptr) {
>> +        /* use inline entry */
>> +        ptr = cb->inline_insn.entry.score->data->data;
>
> This value will not survive the first resize. You need to add a
> pointer dereference from the first "data".

If you look at the scoreboard patch, you'll notice TBs are flushed when
we resize, and thus invalidate the pointer. We discussed this with Alex
previously, and he recommended implementing this instead of adding
another indirection.

By the way, this is what created the need to fix the cpu_init hook call
site, to be able to call start/end exclusive. Thus the related patches
at the beginning of the series.

r~
Re: [PATCH v3 07/17] plugins: implement inline operation relative to cpu_index
On 2/7/24 07:42, Richard Henderson wrote: On 2/6/24 19:24, Pierrick Bouvier wrote: Instead of working on a fixed memory location, allow to address it based on cpu_index, an element size and a given offset. Result address: ptr + offset + cpu_index * element_size. With this, we can target a member in a struct array from a base pointer. Current semantic is not modified, thus inline operation still targets always the same memory location. Signed-off-by: Pierrick Bouvier --- plugins/plugin.h | 2 +- accel/tcg/plugin-gen.c | 65 +++--- plugins/api.c | 3 +- plugins/core.c | 12 +--- 4 files changed, 65 insertions(+), 17 deletions(-) diff --git a/plugins/plugin.h b/plugins/plugin.h index fd93a372803..77ed10689ca 100644 --- a/plugins/plugin.h +++ b/plugins/plugin.h @@ -100,7 +100,7 @@ void plugin_register_vcpu_mem_cb(GArray **arr, enum qemu_plugin_mem_rw rw, void *udata); -void exec_inline_op(struct qemu_plugin_dyn_cb *cb); +void exec_inline_op(struct qemu_plugin_dyn_cb *cb, int cpu_index); int plugin_num_vcpus(void); diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c index b37ce7683e6..68dee4c68d3 100644 --- a/accel/tcg/plugin-gen.c +++ b/accel/tcg/plugin-gen.c @@ -132,16 +132,28 @@ static void gen_empty_udata_cb_no_rwg(void) */ static void gen_empty_inline_cb(void) { +TCGv_i32 cpu_index = tcg_temp_ebb_new_i32(); +TCGv_ptr cpu_index_as_ptr = tcg_temp_ebb_new_ptr(); TCGv_i64 val = tcg_temp_ebb_new_i64(); TCGv_ptr ptr = tcg_temp_ebb_new_ptr(); +tcg_gen_ld_i32(cpu_index, tcg_env, + -offsetof(ArchCPU, env) + offsetof(CPUState, cpu_index)); +/* pass an immediate != 0 so that it doesn't get optimized away */ +tcg_gen_muli_i32(cpu_index, cpu_index, 0xdeadbeef); You don't need a random immediate here. You can just as easily use tcg_gen_mul_i32(cpu_index, cpu_index, cpu_index); with a similar comment about the true size being inserted later. Followed the tcg_gen_addi_i64 that was using this pattern in the same file. I'll change this to what you recommend. 
Otherwise, Reviewed-by: Richard Henderson r~
Re: [PATCH v8 10/21] i386: Split topology types of CPUID[0x1F] from the definitions of CPUID[0xB]
On 31/1/24 11:13, Zhao Liu wrote: From: Zhao Liu CPUID[0xB] defines SMT, Core and Invalid types, and this leaf is shared by Intel and AMD CPUs. But for extended topology levels, Intel CPU (in CPUID[0x1F]) and AMD CPU (in CPUID[0x8026]) have the different definitions with different enumeration values. Though CPUID[0x8026] hasn't been implemented in QEMU, to avoid possible misunderstanding, split topology types of CPUID[0x1F] from the definitions of CPUID[0xB] and introduce CPUID[0x1F]-specific topology types. Signed-off-by: Zhao Liu Tested-by: Babu Moger Tested-by: Yongwei Ma Acked-by: Michael S. Tsirkin --- Changes since v3: * New commit to prepare to refactor CPUID[0x1F] encoding. --- target/i386/cpu.c | 14 +++--- target/i386/cpu.h | 13 + 2 files changed, 16 insertions(+), 11 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 05/17] plugins: scoreboard API
On 2/7/24 07:21, Richard Henderson wrote:
> On 2/6/24 19:24, Pierrick Bouvier wrote:
>> We introduce a cpu-local storage, automatically managed (and extended)
>> by QEMU itself. Plugins allocate a scoreboard, and don't have to deal
>> with how many cpus are launched. This API will be used by new inline
>> functions, but callbacks can benefit from this as well. This way,
>> they can operate without a global lock for simple operations. At any
>> point during execution, any scoreboard will be dimensioned with at
>> least qemu_plugin_num_vcpus entries.
>>
>> New functions:
>> - qemu_plugin_scoreboard_find
>> - qemu_plugin_scoreboard_free
>> - qemu_plugin_scoreboard_new
>>
>> In addition, we define a qemu_plugin_u64, which is a simple struct
>> holding a pointer to a scoreboard and a given offset. This allows a
>> scoreboard to contain structs, without having to pass the offset to
>> every operation on a specific field. Since most plugins simply
>> collect a sum of per-cpu values, qemu_plugin_u64 directly supports
>> this operation as well.
>>
>> New functions:
>> - qemu_plugin_u64_add
>> - qemu_plugin_u64_get
>> - qemu_plugin_u64_set
>> - qemu_plugin_u64_sum
>>
>> New macros:
>> - qemu_plugin_scoreboard_u64
>> - qemu_plugin_scoreboard_u64_in_struct
>
> I think the u64 stuff should be a second patch built upon the basic
> scoreboard support.

You're right, should be easier to review.

>> +/* A scoreboard is an array of values, indexed by vcpu_index */
>> +struct qemu_plugin_scoreboard {
>> +    GArray *data;
>> +};
>
> Unnecessary? Generates an extra pointer dereference for no apparent
> benefit. Alternately, might be useful for other data structure
> changes...

I thought to change it to a typedef after removing other members. Will
do if you noticed this too.

>> +/**
>> + * typedef qemu_plugin_u64 - uint64_t member of an entry in a scoreboard
>> + *
>> + * This field allows to access a specific uint64_t member in one given
>> + * entry, located at a specified offset. Inline operations expect this
>> + * as entry.
>> + */
>> +typedef struct {
>> +    struct qemu_plugin_scoreboard *score;
>
> Embed the struct instead?
Several qemu_plugin_u64 can point to the same scoreboard, so it has to
be a pointer. It saves a scoreboard pointer + offset for a given entry.

>> @@ -31,6 +31,9 @@ struct qemu_plugin_state {
>>      * but with the HT we avoid adding a field to CPUState.
>>      */
>>     GHashTable *cpu_ht;
>> +    /* Scoreboards, indexed by their addresses. */
>> +    GHashTable *scoreboards;
>
> Why a hash table? All you want is to be able to iterate through all,
> and add/remove easily. Seems like a QLIST would be better, and the
> QLIST_ENTRY member would make struct qemu_plugin_scoreboard useful.

I thought that having O(1) removal was a nice property, compared to a
linked list. I can switch to a QLIST if you still think it's better.
What do you mean by "make struct qemu_plugin_scoreboard useful"?

r~
Re: [PATCH v8 08/21] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()
On 31/1/24 11:13, Zhao Liu wrote:
> From: Zhao Liu
>
> In cpu_x86_cpuid(), there are many variables representing the cpu
> topology, e.g., topo_info, cs->nr_cores and cs->nr_threads.
>
> Since the names of cs->nr_cores/cs->nr_threads do not accurately
> represent their meaning, their use is prone to confusion and
> mistakes. And the structure X86CPUTopoInfo names its members clearly,
> thus the variable "topo_info" should be preferred.
>
> In addition, to use the topology variables uniformly in
> cpu_x86_cpuid(), replace env->dies with topo_info.dies_per_pkg as
> well.
>
> Suggested-by: Robert Hoo
> Tested-by: Yongwei Ma
> Signed-off-by: Zhao Liu
> Reviewed-by: Xiaoyao Li
> ---
> Changes since v7:
> * Renamed cpus_per_pkg to threads_per_pkg. (Xiaoyao)
> * Dropped Michael/Babu's Acked/Tested tags since the code changed.
> * Re-added Yongwei's Tested tag for his re-testing.
> * Added Xiaoyao's Reviewed tag.
>
> Changes since v3:
> * Fixed typo. (Babu)
>
> Changes since v1:
> * Extracted cores_per_socket from the code block and used it as a
>   local variable for cpu_x86_cpuid(). (Yanan)
> * Removed the vcpus_per_socket variable and used cpus_per_pkg
>   directly. (Yanan)
> * Replaced env->dies with topo_info.dies_per_pkg in cpu_x86_cpuid().
> ---
> target/i386/cpu.c | 31 ++-
> 1 file changed, 18 insertions(+), 13 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] docs/style: allow C99 mixed declarations
On 6/2/24 06:53, Markus Armbruster wrote:
> Daniel P. Berrangé writes:
>> On Mon, Feb 05, 2024 at 12:18:19PM -0500, Stefan Hajnoczi wrote:
>>> C99 mixed declarations support interleaving of local variable
>>> declarations and code. The coding style "generally" forbids C99
>>> mixed declarations with some exceptions to the rule. This rule is
>>> not checked by checkpatch.pl and naturally there are violations in
>>> the source tree.
>>>
>>> While contemplating adding another exception, I came to the
>>> conclusion that the best location for declarations depends on
>>> context. Let the programmer declare variables where it is best for
>>> legibility. Don't try to define all possible
>>> scenarios/exceptions.
>> ...
>> Even if the compiler does reliably warn, I think the code pattern
>> remains misleading to contributors, as the flow control flaw is very
>> non-obvious.
>
> Yup. Strong dislike.
>
>> Rather than accept the status quo and remove the coding guideline, I
>> think we should strengthen the guidelines, such that it is
>> explicitly forbidden in any method that uses 'goto'. Personally I'd
>> go all the way to -Werror=declaration-after-statement, as
>
> I support this.
>
>> while C99 mixed decl is appealing,
>
> Not to me.
>
> I much prefer declarations and statements to be visually distinct.
> Putting declarations first and separating them from statements with a
> blank line accomplishes that. Less necessary in languages where
> declarations are syntactically obvious.

But we already implicitly suggest C99, see commit ae7c80a7bd
("error: New macro ERRP_GUARD()"):

 * To use ERRP_GUARD(), add it right at the beginning of the function.
 * @errp can then be used without worrying about the argument being
 * NULL or &error_fatal.
#define ERRP_GUARD()                                            \
    g_auto(ErrorPropagator) _auto_errp_prop = {.errp = errp};   \
    do {                                                        \
        if (!errp || errp == &error_fatal) {                    \
            errp = &_auto_errp_prop.local_err;                  \
        }                                                       \
    } while (0)

Or commit 5626f8c6d4 ("rcu: Add automatically released rcu_read_lock
variants") with WITH_RCU_READ*:

util/aio-posix.c:540:5: error: mixing declarations and code is
incompatible with standards before C99
[-Werror,-Wdeclaration-after-statement]
    RCU_READ_LOCK_GUARD();
    ^
include/qemu/rcu.h:189:28: note: expanded from macro 'RCU_READ_LOCK_GUARD'
    g_autoptr(RCUReadAuto) _rcu_read_auto __attribute__((unused)) =
rcu_read_auto_lock()
    ^
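One direction discussed in this thread is making a guard macro expand into a single declaration, so it stays legal under -Werror=declaration-after-statement. Below is a minimal, self-contained sketch of that comma-expression trick. The names (`Propagator`, `ERRP_GUARD_DECL`, `set_answer`) and the `int *` error type are illustrative stand-ins, not QEMU's real `Error` machinery:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for QEMU's ErrorPropagator; fields are illustrative only. */
typedef struct {
    int *errp;       /* caller's errp (QEMU uses Error **) */
    int local_err;   /* storage used when the caller passed NULL */
} Propagator;

/*
 * Expands to a single declaration: the second initializer rewrites
 * `errp` to point at our local storage when it is NULL, using a comma
 * expression so the whole macro stays an initializer, not a statement.
 * Like QEMU's ERRP_GUARD(), it requires a variable literally named
 * `errp` in scope.
 */
#define ERRP_GUARD_DECL()                                             \
    Propagator _prop = {                                              \
        .errp = errp,                                                 \
        .local_err = ((!errp ? errp = &_prop.local_err : NULL), 0) }

static int set_answer(int *errp)
{
    ERRP_GUARD_DECL();
    int done = 1;   /* still fine with -Werror=declaration-after-statement */

    *errp = 42;     /* errp is never NULL past the guard */
    (void)done;
    return _prop.local_err;
}
```

The point of the sketch: because the macro is a declaration rather than a `do { } while (0)` statement, further declarations may follow it without tripping the warning.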
[PATCH] hw/char/pl011: Add support for loopback
This patch adds loopback for sent characters as well as modem-control
signals. Loopback of send and modem-control is often used for UART
self-tests in real hardware, but is missing from the current pl011
model, resulting in self-test failures when running in QEMU.

Signed-off-by: Tong Ho
Signed-off-by: Francisco Iglesias
---
 hw/char/pl011.c | 51 +++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/hw/char/pl011.c b/hw/char/pl011.c
index 855cb82d08..3c0e07aa35 100644
--- a/hw/char/pl011.c
+++ b/hw/char/pl011.c
@@ -121,6 +121,51 @@ static void pl011_update(PL011State *s)
     }
 }

+static void pl011_put_fifo(void *opaque, uint32_t value);
+
+static bool pl011_is_loopback(PL011State *s)
+{
+    return !!(s->cr & (1U << 7));
+}
+
+static void pl011_tx_loopback(PL011State *s, uint32_t value)
+{
+    if (pl011_is_loopback(s)) {
+        pl011_put_fifo(s, value);
+    }
+}
+
+static uint32_t pl011_cr_loopback(PL011State *s, bool update)
+{
+    uint32_t cr = s->cr;
+    uint32_t fr = s->flags;
+    uint32_t ri = 1 << 8, dcd = 1 << 2, dsr = 1 << 1, cts = 1 << 0;
+    uint32_t out2 = 1 << 13, out1 = 1 << 12, rts = 1 << 11, dtr = 1 << 10;
+
+    if (!pl011_is_loopback(s)) {
+        return fr;
+    }
+
+    fr &= ~(ri | dcd | dsr | cts);
+    fr |= (cr & out2) ? ri : 0;   /* FR.RI  <= CR.Out2 */
+    fr |= (cr & out1) ? dcd : 0;  /* FR.DCD <= CR.Out1 */
+    fr |= (cr & rts) ? cts : 0;   /* FR.CTS <= CR.RTS  */
+    fr |= (cr & dtr) ? dsr : 0;   /* FR.DSR <= CR.DTR  */
+
+    if (!update) {
+        return fr;
+    }
+
+    s->int_level &= ~(INT_DSR | INT_DCD | INT_CTS | INT_RI);
+    s->int_level |= (fr & dsr) ? INT_DSR : 0;
+    s->int_level |= (fr & dcd) ? INT_DCD : 0;
+    s->int_level |= (fr & cts) ? INT_CTS : 0;
+    s->int_level |= (fr & ri) ? INT_RI : 0;
+    pl011_update(s);
+
+    return fr;
+}
+
 static bool pl011_is_fifo_enabled(PL011State *s)
 {
     return (s->lcr & LCR_FEN) != 0;
@@ -172,7 +217,7 @@ static uint64_t pl011_read(void *opaque, hwaddr offset,
         r = s->rsr;
         break;
     case 6: /* UARTFR */
-        r = s->flags;
+        r = pl011_cr_loopback(s, false);
         break;
     case 8: /* UARTILPR */
         r = s->ilpr;
@@ -267,6 +312,7 @@ static void pl011_write(void *opaque, hwaddr offset,
          * qemu_chr_fe_write and background I/O callbacks */
         qemu_chr_fe_write_all(&s->chr, &ch, 1);
         s->int_level |= INT_TX;
+        pl011_tx_loopback(s, ch);
         pl011_update(s);
         break;
     case 1: /* UARTRSR/UARTECR */
@@ -300,8 +346,9 @@ static void pl011_write(void *opaque, hwaddr offset,
         pl011_set_read_trigger(s);
         break;
     case 12: /* UARTCR */
-        /* ??? Need to implement the enable and loopback bits. */
+        /* ??? Need to implement the enable bit. */
        s->cr = value;
+        pl011_cr_loopback(s, true);
         break;
     case 13: /* UARTIFS */
         s->ifl = value;
-- 
2.25.1
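The CR-to-FR bit routing done by pl011_cr_loopback() above can be isolated as a pure function, which makes the mapping easy to unit-test outside QEMU. This is only a sketch with the same bit positions as the patch; the `PL011State` plumbing and interrupt handling are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Flag-register and control-register bit positions, as in the patch. */
enum {
    FR_RI   = 1 << 8,  FR_DCD  = 1 << 2,  FR_DSR = 1 << 1,  FR_CTS = 1 << 0,
    CR_OUT2 = 1 << 13, CR_OUT1 = 1 << 12, CR_RTS = 1 << 11,
    CR_DTR  = 1 << 10, CR_LBE  = 1 << 7,
};

/* Returns FR with the modem-status bits rerouted from CR when loopback
 * (CR.LBE) is enabled; otherwise FR is returned unchanged. */
static uint32_t loopback_fr(uint32_t cr, uint32_t fr)
{
    if (!(cr & CR_LBE)) {
        return fr;
    }
    fr &= ~(FR_RI | FR_DCD | FR_DSR | FR_CTS);
    fr |= (cr & CR_OUT2) ? FR_RI  : 0;  /* FR.RI  <= CR.Out2 */
    fr |= (cr & CR_OUT1) ? FR_DCD : 0;  /* FR.DCD <= CR.Out1 */
    fr |= (cr & CR_RTS)  ? FR_CTS : 0;  /* FR.CTS <= CR.RTS  */
    fr |= (cr & CR_DTR)  ? FR_DSR : 0;  /* FR.DSR <= CR.DTR  */
    return fr;
}
```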
[PATCH] hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT
This changes the OHCI validation to not assert if invalid data is fed
to the OHCI controller. The poc suggested in
https://bugs.launchpad.net/qemu/+bug/1907042 and then migrated to bug
#303 does the following to feed it a SETUP pid and EndPt of 1:

uint32_t MaxPacket = 64;
uint32_t TDFormat = 0;
uint32_t Skip = 0;
uint32_t Speed = 0;
uint32_t Direction = 0; /* #define OHCI_TD_DIR_SETUP 0 */
uint32_t EndPt = 1;
uint32_t FuncAddress = 0;
ed->attr = (MaxPacket << 16) | (TDFormat << 15) | (Skip << 14) |
           (Speed << 13) | (Direction << 11) | (EndPt << 7) | FuncAddress;
ed->tailp = /*TDQTailPntr= */ 0;
ed->headp = ((/*TDQHeadPntr= */ [0]) & 0xfffffff0) |
            (/* ToggleCarry= */ 0 << 1);
ed->next_ed = (/* NextED= */ 0 & 0xfffffff0);

qemu-fuzz also caught the same issue in #1510. They are both fixed by
this patch.

The if (td.cbp > td.be) logic in ohci_service_td() causes an
ohci_die(). My understanding of OHCI spec 4.3.1.2 Table 4-2 allows
td.cbp to be one byte more than td.be to signal that the buffer has
zero length. The new check in QEMU appears to have been added since
qemu-4.2. This patch includes both fixes since they are located very
close to each other.
Signed-off-by: David Hubbard
---
 hw/usb/hcd-ohci.c   | 9 +++--
 hw/usb/trace-events | 2 ++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/usb/hcd-ohci.c b/hw/usb/hcd-ohci.c
index d73b53f33c..a53808126f 100644
--- a/hw/usb/hcd-ohci.c
+++ b/hw/usb/hcd-ohci.c
@@ -927,6 +927,11 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
     case OHCI_TD_DIR_SETUP:
         str = "setup";
         pid = USB_TOKEN_SETUP;
+        if (OHCI_BM(ed->flags, ED_EN) > 0) { /* setup only allowed to ep 0 */
+            trace_usb_ohci_td_bad_pid(str, ed->flags, td.flags);
+            ohci_die(ohci);
+            return 1;
+        }
         break;
     default:
         trace_usb_ohci_td_bad_direction(dir);
@@ -936,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
     if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) {
         len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
     } else {
-        if (td.cbp > td.be) {
-            trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
+        if (td.cbp > td.be + 1) {
+            trace_usb_ohci_td_bad_buf(td.cbp, td.be);
             ohci_die(ohci);
             return 1;
         }
diff --git a/hw/usb/trace-events b/hw/usb/trace-events
index ed7dc210d3..b47d082fa3 100644
--- a/hw/usb/trace-events
+++ b/hw/usb/trace-events
@@ -28,6 +28,8 @@ usb_ohci_iso_td_data_overrun(int ret, ssize_t len) "DataOverrun %d > %zu"
 usb_ohci_iso_td_data_underrun(int ret) "DataUnderrun %d"
 usb_ohci_iso_td_nak(int ret) "got NAK/STALL %d"
 usb_ohci_iso_td_bad_response(int ret) "Bad device response %d"
+usb_ohci_td_bad_buf(uint32_t cbp, uint32_t be) "Bad cbp = 0x%x > be = 0x%x"
+usb_ohci_td_bad_pid(const char *s, uint32_t edf, uint32_t tdf) "Bad pid %s: ed.flags 0x%x td.flags 0x%x"
 usb_ohci_port_attach(int index) "port #%d"
 usb_ohci_port_detach(int index) "port #%d"
 usb_ohci_port_wakeup(int index) "port #%d"
-- 
2.34.1
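The off-by-one question in the patch (cbp may legitimately be be + 1 to describe a zero-length buffer, per the patch author's reading of OHCI spec 4.3.1.2) can be captured in a small helper. This sketch mirrors the shape of the check in ohci_service_td() but is not QEMU's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the number of bytes described by a TD's CurrentBufferPointer
 * (cbp) and BufferEnd (be). Returns -1 for a malformed TD. A TD buffer
 * spans at most two 4 KiB pages; cbp == be + 1 encodes a zero-length
 * buffer.
 */
static int td_buffer_len(uint32_t cbp, uint32_t be, uint32_t *len)
{
    if ((cbp & 0xfffff000) != (be & 0xfffff000)) {
        /* Buffer crosses a page boundary. */
        *len = (be & 0xfff) + 0x1001 - (cbp & 0xfff);
    } else {
        if (cbp > be + 1) {
            return -1;        /* overrun: reject instead of asserting */
        }
        *len = be + 1 - cbp;  /* cbp == be + 1 yields 0 */
    }
    return 0;
}
```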
Re: [PATCH 4/6] migration/multifd: Zero page transmission on the multifd thread.
On Tue, Feb 06, 2024 at 11:19:06PM +0000, Hao Xiang wrote:
> This implements the zero page detection and handling on the multifd
> threads.
>
> Signed-off-by: Hao Xiang
> ---
> migration/multifd.c | 62 +
> migration/multifd.h | 5
> 2 files changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index a20d0ed10e..c031f947c7 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -11,6 +11,7 @@
>  */
>
> #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> #include "qemu/rcu.h"
> #include "exec/target_page.h"
> #include "sysemu/sysemu.h"
> @@ -278,6 +279,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>
>         packet->offset[i] = cpu_to_be64(temp);
>     }
> +    for (i = 0; i < p->zero_num; i++) {
> +        /* there are architectures where ram_addr_t is 32 bit */
> +        uint64_t temp = p->zero[i];
> +
> +        packet->offset[p->normal_num + i] = cpu_to_be64(temp);
> +    }
> }

Please note that p->normal_num will be dropped very soon, see:

https://lore.kernel.org/all/20240202102857.110210-6-pet...@redhat.com/

Please use p->pages->num instead. This patch also relies on some
changes in the previous patch.. IMHO we can split the patches better in
this way:

- Patch 1: Add the new parameter "zero-page-detection", supporting
  "none" and "legacy". You'll need to implement "none" here: skip zero
  page detection by returning 0 in save_zero_page() if "none".
- Patch 2: Add the new "multifd" mode on top, implemented completely in
  the same patch.
- Patch 3: introduce ram_save_target_page_multifd()
- Patch 4: test case

If you want to add "zeros" accounting, that can be done as more patches
on top.

Thanks,

-- 
Peter Xu
Re: [PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
On Wed, Feb 07, 2024 at 12:13:10PM +0800, Peter Xu wrote:
> On Tue, Feb 06, 2024 at 11:19:04PM +0000, Hao Xiang wrote:
> > This change extends the MigrationStatus interface to track zero
> > pages and zero bytes counter.
> >
> > Signed-off-by: Hao Xiang
>
> Reviewed-by: Peter Xu

I'll need to scratch this, sorry.. The issue is I forgot we have
"duplicate", which is exactly "zero page"s.. See:

    info->ram->duplicate = stat64_get(&mig_stats.zero_pages);

If you think the name too confusing and want a replacement, maybe it's
fine and maybe we can do that. Then we can keep this zero page counter
introduced, reporting the same value as duplicates, then with a follow
up patch deprecate the "duplicate" parameter. See an example of how to
deprecate in 7b24d326348e1672.

One thing I'm not sure about is whether Libvirt will be fine with
losing "duplicates" after 2+ QEMU major releases. Copying Jiri for
this. My understanding is that Libvirt should be keeping an eye on the
deprecation list and react, but I'd like to double check..

Or we can keep using "duplicates", but I agree it just reads weird..

Thanks,

-- 
Peter Xu
Re: [PATCH 3/6] migration/multifd: Support for zero pages transmission in multifd format.
On Tue, Feb 06, 2024 at 11:19:05PM +0000, Hao Xiang wrote:
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 25cbc6dc6b..a20d0ed10e 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -264,6 +264,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>     packet->flags = cpu_to_be32(p->flags);
>     packet->pages_alloc = cpu_to_be32(p->pages->allocated);
>     packet->normal_pages = cpu_to_be32(p->normal_num);
> +    packet->zero_pages = cpu_to_be32(p->zero_num);

This doesn't look right.. If we only want to fill in the zero
accounting, we shouldn't be touching the multifd packet at all, since
multifd zero page detection is not yet supported. We should only
reference mig_stats.zero_pages.

>     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>     packet->packet_num = cpu_to_be64(p->packet_num);

-- 
Peter Xu
Re: [PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
On Tue, Feb 06, 2024 at 11:19:04PM +, Hao Xiang wrote: > This change extends the MigrationStatus interface to track zero pages > and zero bytes counter. > > Signed-off-by: Hao Xiang Reviewed-by: Peter Xu When post anything QAPI relevant, please always remember to copy QAPI maintainers too, thanks. $ ./scripts/get_maintainer.pl -f qapi/migration.json Eric Blake (supporter:QAPI Schema) Markus Armbruster (supporter:QAPI Schema) Peter Xu (maintainer:Migration) Fabiano Rosas (maintainer:Migration) qemu-devel@nongnu.org (open list:All patches CC here) -- Peter Xu
Re: [PATCH 1/6] migration/multifd: Add new migration option multifd-zero-page.
On Tue, Feb 06, 2024 at 11:19:03PM +0000, Hao Xiang wrote:
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 819708321d..ff033a0344 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -874,6 +874,11 @@
> # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> #     (Since 8.2)
> #
> +# @multifd-zero-page: Multifd zero page checking. If the parameter is
> +#     true, zero page checking is done on the multifd sender thread.
> +#     If the parameter is false, zero page checking is done on the
> +#     migration main thread. Default is set to true. (Since 9.0)

I replied somewhere before on this, but I can try again.. Do you think
it'll be better to introduce a generic parameter for zero page
detection?

- "none" if disabled,
- "legacy" for main thread,
- "multifd" for multifd (software-based).

A string could work, but maybe it's cleaner to introduce a
@MigrationZeroPageDetector enum? When you add more, you can keep
extending that with the single field ("multifd-dsa", etc.).

-- 
Peter Xu
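The three-way parameter Peter proposes would dispatch roughly as below. This is a sketch of the control flow only; the enum constants and `save_zero_page()` signature are hypothetical, and `buffer_is_zero()` here is a simplified stand-in for QEMU's real helper from qemu/cutils.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ZERO_PAGE_DETECTION_NONE,    /* never scan: send every page as data */
    ZERO_PAGE_DETECTION_LEGACY,  /* scan on the migration main thread   */
    ZERO_PAGE_DETECTION_MULTIFD, /* scan on the multifd sender threads  */
} ZeroPageDetection;

/* Simplified stand-in for QEMU's buffer_is_zero(). */
static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the main thread handled the page as a zero page. */
static int save_zero_page(ZeroPageDetection mode,
                          const uint8_t *page, size_t size)
{
    if (mode != ZERO_PAGE_DETECTION_LEGACY) {
        /* "none": skip detection; "multifd": sender threads do it. */
        return 0;
    }
    return buffer_is_zero(page, size);
}
```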
Re: [PATCH v3 08/17] plugins: add inline operation per vcpu
On 2/6/24 19:24, Pierrick Bouvier wrote:
> --- a/accel/tcg/plugin-gen.c
> +++ b/accel/tcg/plugin-gen.c
> @@ -442,6 +442,13 @@ static TCGOp *append_inline_cb(const struct qemu_plugin_dyn_cb *cb,
>     char *ptr = cb->userp;
>     size_t elem_size = 0;
>     size_t offset = 0;
> +    if (!ptr) {
> +        /* use inline entry */
> +        ptr = cb->inline_insn.entry.score->data->data;

This value will not survive the first resize. You need to add a pointer
dereference from the first "data".

r~
RE: [PATCH v0 1/2] aspeed: support uart controller both 0 and 1 base
> -----Original Message-----
> From: Cédric Le Goater
> Sent: Wednesday, February 7, 2024 1:00 AM
> To: Jamin Lin; Peter Maydell; Andrew Jeffery; Joel Stanley;
>     open list:ASPEED BMCs; open list:All patches CC here
> Cc: Troy Lee
> Subject: Re: [PATCH v0 1/2] aspeed: support uart controller both 0 and 1 base
>
> On 2/6/24 04:29, Jamin Lin wrote:
> >> The uart definitions on the AST2700 are different:
> >>
> >> https://github.com/AspeedTech-BMC/linux/blob/aspeed-master-v6.6/arch/arm64/boot/dts/aspeed/aspeed-g7.dtsi
> >>
> >>     serial0 = &uart0;
> >>     serial1 = &uart1;
> >>     serial2 = &uart2;
> >>     serial3 = &uart3;
> >>     serial4 = &uart4;
> >>     serial5 = &uart5;
> >>     serial6 = &uart6;
> >>     serial7 = &uart7;
> >>     serial8 = &uart8;
> >> ...
> >>
> >> I think the names in the DT (and consequently in the QEMU models)
> >> follow the IP names in the datasheet.
> >>
> >> I don't think we care in QEMU, so I would be inclined to change the
> >> indexing of the device names in QEMU and start at 0, which would
> >> introduce a discrepancy for the AST2400, AST2500 and AST2600 SoCs.
> >>
> >> Let's see what the other maintainers have to say.
> >>
> >> Thanks,
> >>
> >> C.
> >
> > Hi Cedric,
> >
> > Did you mean to change the naming of the uart devices to 0-based for
> > all ASPEED SoCs? If yes, it seems we need to make the following
> > changes:
> > 1. add ASPEED_DEV_UART0 in aspeed_soc.h
> > 2. re-define the uart memory map for ast2600, ast10x0, ast2500 and
> >    ast2400 (uart0 -> ASPEED_DEV_UART0)
> >
> > Take ast2600 for example:
> > static const hwaddr aspeed_soc_ast2600_memmap[] = {
> >     [ASPEED_DEV_UART1]  = 0x1E783000, ---> [ASPEED_DEV_UART0]
> >     [ASPEED_DEV_UART2]  = 0x1E78D000, ---> [ASPEED_DEV_UART1]
> >     [ASPEED_DEV_UART3]  = 0x1E78E000,
> >     [ASPEED_DEV_UART4]  = 0x1E78F000,
> >     [ASPEED_DEV_UART5]  = 0x1E784000,
> >     [ASPEED_DEV_UART6]  = 0x1E790000,
> >     [ASPEED_DEV_UART7]  = 0x1E790100,
> >     [ASPEED_DEV_UART8]  = 0x1E790200,
> >     [ASPEED_DEV_UART9]  = 0x1E790300,
> >     [ASPEED_DEV_UART10] = 0x1E790400,
> >     [ASPEED_DEV_UART11] = 0x1E790500,
> >     [ASPEED_DEV_UART12] = 0x1E790600,
> >     [ASPEED_DEV_UART13] = 0x1E790700, ---> [ASPEED_DEV_UART12]
> > };
> >
> > If no, could you please describe it in more detail? Then I can
> > change it and re-send this patch series.
>
> Let's keep the datasheet names. I had forgotten the reason initially,
> and from an HW POV it makes sense to keep them in sync. I will add
> some more comments to the patch.
>
> > By the way, I will send a new patch series to support AST2700 in two
> > weeks. We encountered GIC issues. It seems that QEMU supports GICv3,
> > but SPI is not supported yet.
> >
> > https://github.com/qemu/qemu/blob/master/hw/intc/arm_gicv3_dist.c#L383
> > https://github.com/AspeedTech-BMC/linux/blob/aspeed-master-v6.6/arch/arm64/boot/dts/aspeed/aspeed-g7.dtsi#L229
>
> If you did any hacks or workarounds in the QEMU models, please keep
> them separate from the other patches so that we can discuss.

Okay. Will do.
Thanks, Jamin

> I think that we can discuss it in a new AST2700 patch series.

Sure.

> Thanks,
>
> C.
RE: [v0 0/2] uart base and hardcode boot address 0
> -Original Message- > From: Cédric Le Goater > Sent: Wednesday, February 7, 2024 12:48 AM > To: Jamin Lin ; Peter Maydell > ; Andrew Jeffery ; > Joel Stanley ; open list:ASPEED BMCs > ; open list:All patches CC here > > Cc: Troy Lee > Subject: Re: [v0 0/2] uart base and hardcode boot address 0 > > On 2/5/24 10:14, Jamin Lin wrote: > > v0: > > usually we start at v1, so the next version would be a v2. Indexing again :) > Got it. Thanks-Jamin > > Thanks, > > C. > > > > > 1. support uart controller both 0 and 1 base 2. fix hardcode boot > > address 0 > > > > Jamin Lin (2): > >aspeed: support uart controller both 0 and 1 base > >aspeed: fix hardcode boot address 0 > > > > hw/arm/aspeed.c | 12 > > hw/arm/aspeed_ast10x0.c | 1 + > > hw/arm/aspeed_ast2400.c | 2 ++ > > hw/arm/aspeed_ast2600.c | 1 + > > hw/arm/aspeed_soc_common.c | 4 ++-- > > include/hw/arm/aspeed_soc.h | 1 + > > 6 files changed, 15 insertions(+), 6 deletions(-) > >
Re: [PATCH v3 07/17] plugins: implement inline operation relative to cpu_index
On 2/6/24 19:24, Pierrick Bouvier wrote:
> Instead of working on a fixed memory location, allow addressing it
> based on cpu_index, an element size and a given offset.
> Result address: ptr + offset + cpu_index * element_size.
>
> With this, we can target a member in a struct array from a base
> pointer.
>
> Current semantics are not modified, thus an inline operation still
> always targets the same memory location.
>
> Signed-off-by: Pierrick Bouvier
> ---
> plugins/plugin.h | 2 +-
> accel/tcg/plugin-gen.c | 65 +++---
> plugins/api.c | 3 +-
> plugins/core.c | 12 +---
> 4 files changed, 65 insertions(+), 17 deletions(-)
>
> diff --git a/plugins/plugin.h b/plugins/plugin.h
> index fd93a372803..77ed10689ca 100644
> --- a/plugins/plugin.h
> +++ b/plugins/plugin.h
> @@ -100,7 +100,7 @@ void plugin_register_vcpu_mem_cb(GArray **arr,
>                                  enum qemu_plugin_mem_rw rw,
>                                  void *udata);
>
> -void exec_inline_op(struct qemu_plugin_dyn_cb *cb);
> +void exec_inline_op(struct qemu_plugin_dyn_cb *cb, int cpu_index);
>
> int plugin_num_vcpus(void);
>
> diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
> index b37ce7683e6..68dee4c68d3 100644
> --- a/accel/tcg/plugin-gen.c
> +++ b/accel/tcg/plugin-gen.c
> @@ -132,16 +132,28 @@ static void gen_empty_udata_cb_no_rwg(void)
>  */
> static void gen_empty_inline_cb(void)
> {
> +    TCGv_i32 cpu_index = tcg_temp_ebb_new_i32();
> +    TCGv_ptr cpu_index_as_ptr = tcg_temp_ebb_new_ptr();
>     TCGv_i64 val = tcg_temp_ebb_new_i64();
>     TCGv_ptr ptr = tcg_temp_ebb_new_ptr();
>
> +    tcg_gen_ld_i32(cpu_index, tcg_env,
> +                   -offsetof(ArchCPU, env) + offsetof(CPUState, cpu_index));
> +    /* pass an immediate != 0 so that it doesn't get optimized away */
> +    tcg_gen_muli_i32(cpu_index, cpu_index, 0xdeadbeef);

You don't need a random immediate here. You can just as easily use

    tcg_gen_mul_i32(cpu_index, cpu_index, cpu_index);

with a similar comment about the true size being inserted later.

Otherwise,
Reviewed-by: Richard Henderson

r~
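The address computation stated in the commit message, ptr + offset + cpu_index * element_size, is easy to express in plain C and to check against a struct array. The `VcpuStats` type and names below are illustrative, not the plugin API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per vcpu; the inline op targets one u64 member of it. */
typedef struct {
    uint64_t insns;
    uint64_t mem_accesses;
} VcpuStats;

/* Result address: ptr + offset + cpu_index * element_size. */
static inline void *inline_target(void *ptr, size_t offset,
                                  unsigned int cpu_index, size_t elem_size)
{
    return (char *)ptr + offset + cpu_index * elem_size;
}
```

Passing `offset = 0` and `elem_size = 0` reproduces the old fixed-location behaviour, which is why current semantics are unchanged.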
Re: [PATCH 0/6] Introduce multifd zero page checking.
On Tue, Feb 06, 2024 at 11:19:02PM +0000, Hao Xiang wrote:
> This patchset is based on Juan Quintela's old series here
> https://lore.kernel.org/all/20220802063907.18882-1-quint...@redhat.com/
>
> In the multifd live migration model, there is a single migration main
> thread scanning the page map, queuing the pages to multiple multifd
> sender threads. The migration main thread runs zero page checking on
> every page before queuing the page to the sender threads. Zero page
> checking is a CPU intensive task, and hence having a single thread
> doing all that doesn't scale well. This change introduces a new
> function to run the zero page checking on the multifd sender threads.
> This patchset also lays the groundwork for future changes to offload
> the zero page checking task to accelerator hardware.
>
> Use two Intel 4th generation Xeon servers for testing.
>
> Architecture:        x86_64
> CPU(s):              192
> Thread(s) per core:  2
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               143
> Model name:          Intel(R) Xeon(R) Platinum 8457C
> Stepping:            8
> CPU MHz:             2538.624
> CPU max MHz:         3800.0000
> CPU min MHz:         800.0000
>
> Perform multifd live migration with the below setup:
> 1. VM has 100GB memory. All pages in the VM are zero pages.
> 2. Use tcp socket for live migration.
> 3. Use 4 multifd channels and zero page checking on the migration
>    main thread.
> 4. Use 1/2/4 multifd channels and zero page checking on the multifd
>    sender threads.
> 5. Record migration total time from the sender QEMU console's
>    "info migrate" command.
> 6. Calculate throughput with "100GB / total time".
>
> +--------------------+----------------+------------------+
> | zero-page-checking | total-time(ms) | throughput(GB/s) |
> +--------------------+----------------+------------------+
> | main-thread        | 9629           | 10.38GB/s        |
> | multifd-1-threads  | 6182           | 16.17GB/s        |
> | multifd-2-threads  | 4643           | 21.53GB/s        |
> | multifd-4-threads  | 4143           | 24.13GB/s        |
> +--------------------+----------------+------------------+

This "throughput" is slightly confusing; I was initially surprised to
see a large throughput for idle guests.
IMHO the "total-time" column already explains it. Feel free to drop the
throughput column if there's a repost.

Did you check why 4 channels already mostly reached the top line? Is it
because the main thread is already spinning at 100%?

Thanks,

-- 
Peter Xu
Re: [PATCH] target/riscv: Update $pc after linking to $ra in trans_cm_jalt()
You are right. I'll send patch v2 shortly. Thank you for the reply.

Richard Henderson wrote on Wednesday, February 7, 2024 at 4:24 AM:
> On 2/6/24 23:18, Jason Chien wrote:
> > The original implementation sets $pc to the address read from the
> > jump vector table first and links $ra with the address of the next
> > instruction after the updated $pc. After jumping to the updated $pc
> > and executing the next ret instruction, the program jumps to $ra,
> > which is in the same function currently executing, which results in
> > an infinite loop.
> > This commit reverses the two actions. Firstly, $ra is updated with
> > the address of the next instruction after $pc, and then $pc is set
> > to the address read from the jump vector table.
>
> This is unlikely to be correct in the case the vector table read
> faults, leaving $ra updated.
>
> I guess this got broken with CF_PCREL. Anyway, the solution is to use
> a temporary...
>
> > -/*
> > - * Update pc to current for the non-unwinding exception
> > - * that might come from cpu_ld*_code() in the helper.
> > - */
> > -gen_update_pc(ctx, 0);
> > -gen_helper_cm_jalt(cpu_pc, cpu_env, tcg_constant_i32(a->index));
>
> ... here and then ...
>
> > @@ -307,6 +300,13 @@ static bool trans_cm_jalt(DisasContext *ctx, arg_cm_jalt *a)
> >     gen_set_gpr(ctx, xRA, succ_pc);
> > }
>
> ... copy the temp to cpu_pc here.
>
> > tcg_gen_lookup_and_goto_ptr();
> > ctx->base.is_jmp = DISAS_NORETURN;
> > return true;
>
> r~
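The link/jump ordering under discussion can be modelled in plain C. The essential property of the fix: the jump-table read (which may fault) happens first, into a temporary, before $ra or $pc are touched, so a faulting read leaves both registers intact. This is only a model of the discussion, not the TCG implementation:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;
    uint32_t ra;
} CpuModel;

/*
 * cm.jalt: read the target from the jump vector table, link ra to the
 * next instruction, then jump. Reading into a temporary first mirrors
 * the suggested fix: if the table read faulted, neither ra nor pc
 * would have been modified yet.
 */
static void cm_jalt(CpuModel *cpu, const uint32_t *jvt,
                    unsigned int index, uint32_t insn_len)
{
    uint32_t target = jvt[index];  /* may fault: do this first */

    cpu->ra = cpu->pc + insn_len;  /* link to the next instruction */
    cpu->pc = target;              /* then jump */
}
```

The original buggy order (update pc, then link ra relative to the new pc) would have set ra inside the jumped-to function, producing the infinite ret loop described in the commit message.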
[PATCH] qapi/migration: Add missing tls-authz documentation
From: Peter Xu

As reported in Markus's recent enforcement series on qapi docs [1], we
accidentally missed one entry for tls-authz. Add it. Then we can drop
@MigrateSetParameters from documentation-exceptions safely later.

[1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com

Cc: Daniel P. Berrangé
Cc: Fabiano Rosas
Reported-by: Markus Armbruster
Signed-off-by: Peter Xu
---
 qapi/migration.json | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/qapi/migration.json b/qapi/migration.json
index 819708321d..f4c5f59e01 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -980,6 +980,10 @@
 #     2.9) Previously (since 2.7), this was reported by omitting
 #     tls-hostname instead.
 #
+# @tls-authz: ID of the 'authz' object subclass that provides access
+#     control checking of the TLS x509 certificate distinguished name.
+#     (Since 4.0)
+#
 # @max-bandwidth: to set maximum speed for migration.  maximum speed
 #     in bytes per second.  (Since 2.8)
 #
-- 
2.43.0
Re: [PATCH v3 05/17] plugins: scoreboard API
On 2/6/24 19:24, Pierrick Bouvier wrote:
> We introduce a per-cpu local storage, automatically managed (and
> extended) by QEMU itself. Plugins allocate a scoreboard and don't
> have to deal with how many cpus are launched.
>
> This API will be used by new inline functions, but callbacks can
> benefit from this as well. This way, they can operate without a
> global lock for simple operations. At any point during execution, any
> scoreboard will be dimensioned with at least qemu_plugin_num_vcpus
> entries.
>
> New functions:
> - qemu_plugin_scoreboard_find
> - qemu_plugin_scoreboard_free
> - qemu_plugin_scoreboard_new
>
> In addition, we define a qemu_plugin_u64, a simple struct holding a
> pointer to a scoreboard and a given offset. This makes it possible to
> have a scoreboard containing structs, without having to carry the
> offset through all operations on a specific field. Since most plugins
> simply collect a sum of per-cpu values, qemu_plugin_u64 supports this
> operation directly as well.
>
> New functions:
> - qemu_plugin_u64_add
> - qemu_plugin_u64_get
> - qemu_plugin_u64_set
> - qemu_plugin_u64_sum
>
> New macros:
> - qemu_plugin_scoreboard_u64
> - qemu_plugin_scoreboard_u64_in_struct

I think the u64 stuff should be a second patch built upon the basic
scoreboard support.

> +/* A scoreboard is an array of values, indexed by vcpu_index */
> +struct qemu_plugin_scoreboard {
> +    GArray *data;
> +};

Unnecessary? Generates an extra pointer dereference for no apparent
benefit. Alternately, might be useful for other data structure
changes...

> +/**
> + * typedef qemu_plugin_u64 - uint64_t member of an entry in a scoreboard
> + *
> + * This field allows access to a specific uint64_t member in one
> + * given entry, located at a specified offset. Inline operations
> + * expect this as entry.
> + */
> +typedef struct {
> +    struct qemu_plugin_scoreboard *score;

Embed the struct instead?

> @@ -31,6 +31,9 @@ struct qemu_plugin_state {
>      * but with the HT we avoid adding a field to CPUState.
>      */
>     GHashTable *cpu_ht;
> +    /* Scoreboards, indexed by their addresses. */
> +    GHashTable *scoreboards;

Why a hash table? All you want is to be able to iterate through all,
and add/remove easily. Seems like a QLIST would be better, and the
QLIST_ENTRY member would make struct qemu_plugin_scoreboard useful.

r~
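A rough model of what the commit message describes: a scoreboard is a per-vcpu array of entries, and a qemu_plugin_u64 is a (scoreboard, offset) handle naming one uint64_t field of each entry. The names and layout below (`Scoreboard`, `HandleU64`, `Entry`) are illustrative sketches, not the actual plugin API, and resizing on vcpu hotplug is omitted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void *data;        /* num_vcpus * elem_size bytes, zero-initialized */
    size_t elem_size;
    size_t num_vcpus;
} Scoreboard;

typedef struct {
    Scoreboard *score;
    size_t offset;     /* byte offset of the u64 inside one entry */
} HandleU64;

/* An example per-vcpu entry a plugin might use. */
typedef struct {
    uint64_t insns;
    uint64_t bb;
} Entry;

static Scoreboard *scoreboard_new(size_t elem_size, size_t num_vcpus)
{
    Scoreboard *sb = malloc(sizeof(*sb));
    sb->data = calloc(num_vcpus, elem_size);
    sb->elem_size = elem_size;
    sb->num_vcpus = num_vcpus;
    return sb;
}

static uint64_t *u64_ptr(HandleU64 h, size_t vcpu_index)
{
    char *entry = (char *)h.score->data + vcpu_index * h.score->elem_size;
    return (uint64_t *)(entry + h.offset);
}

/* The common "sum over all vcpus" operation the commit message mentions. */
static uint64_t u64_sum(HandleU64 h)
{
    uint64_t total = 0;
    for (size_t i = 0; i < h.score->num_vcpus; i++) {
        total += *u64_ptr(h, i);
    }
    return total;
}
```

Each vcpu increments only its own entry, which is why no global lock is needed for simple counters.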
Re: [PATCH 00/15] qapi: Require member documentation (with loophole)
On Mon, Feb 05, 2024 at 08:46:54AM +0100, Markus Armbruster wrote: > qapi/migration.json > MigrateSetParameters 1 It's tls-authz. I'll send a patch for this one. Thanks, -- Peter Xu
Re: Re: [PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility
On Tue, Feb 6, 2024 at 4:31 PM Stefano Garzarella wrote:
> On Tue, Feb 06, 2024 at 10:47:40AM +0800, Jason Wang wrote:
> > On Mon, Feb 5, 2024 at 6:51 PM Stefano Garzarella wrote:
> >> On Fri, Feb 02, 2024 at 02:25:21PM +0100, Kevin Wolf wrote:
> >>> VDUSE requires that virtqueues are first enabled before the
> >>> DRIVER_OK status flag is set; with the current API of the kernel
> >>> module, it is impossible to enable the opposite order in our block
> >>> export code because userspace is not notified when a virtqueue is
> >>> enabled.
> >
> > Did this mean virtio-blk will enable a virtqueue after DRIVER_OK?
>
> It's not specific to virtio-blk, but to the generic vdpa device we
> have in QEMU (i.e. vhost-vdpa-device). Yep, after commit
> 6c4825476a4351530bcac17abab72295b75ffe98, virtqueues are enabled
> after DRIVER_OK.

Right.

> > Spec is not clear about this and that's why we introduce
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK.
>
> Ah, I didn't know about this new feature. So after commit
> 6c4825476a4351530bcac17abab72295b75ffe98 the vhost-vdpa-device is not
> complying with the specification, right?

Kind of, but as stated, it's just because the spec is unclear about
the behaviour. There's a chance that the spec will explicitly support
it in the future.

> >> Yeah, IMHO the VDUSE protocol is missing a VDUSE_SET_VQ_READY
> >> message,
> >
> > I think you meant when VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is
> > negotiated.
>
> At this point yes. But if VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is
> not negotiated, should we return an error in the vhost-vdpa kernel
> module if VHOST_VDPA_SET_VRING_ENABLE is called when DRIVER_OK is
> already set?

I'm not sure if this can break some setups or not. It might be better
to leave it as is?

Without VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we don't know if the
parent supports vq_ready after driver_ok. With
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we know the parent supports
vq_ready after driver_ok.
> > If this is true, it seems a little more complicated; for example,
> > does get_backend_features need to be forwarded to the userspace?
>
> I'm not understanding, don't we already have
> VHOST_GET_BACKEND_FEATURES for this? Or do you mean userspace on the
> VDUSE side?

Yes, since in this case the parent is in the userspace, there's no way
for VDUSE to know if user space supports vq_ready after driver_ok or
not. As you may have noticed, we don't have a message for vq_ready,
which implies that vq_ready after driver_ok can't be supported.

> > This seems suboptimal; better to implement this in the spec first
> > and then we can leverage the feature. Or we can have another
> > parameter for the ioctl that creates the vduse device.
>
> I got a little lost, though in vhost-user, the device can always
> expect a vring_enable/disable, so I thought it was not complicated in
> VDUSE.

Yes, the problem is that even assuming we have a message for vq_ready,
there could be a "legacy" userspace that doesn't support it. So in
that case, VDUSE needs to know whether the userspace parent can
support it or not.

> >> I'll start another thread about that, but in the meantime I agree
> >> that we should fix QEMU since we need to work properly with old
> >> kernels as well.
> >>
> >>> This requirement also matches the normal initialisation order as
> >>> done by the generic vhost code in QEMU. However, commit 6c482547
> >>> accidentally changed the order for vdpa-dev and broke access to
> >>> VDUSE devices with this.
> >>>
> >>> This changes vdpa-dev to use the normal order again and use the
> >>> standard vhost callback .vhost_set_vring_enable for this. VDUSE
> >>> devices can be used with vdpa-dev again after this fix.
> >>
> >> I like this approach and the patch LGTM, but I'm a bit worried about
> >> this function in hw/net/vhost_net.c:
> >>
> >> int vhost_set_vring_enable(NetClientState *nc, int enable)
> >> {
> >>     VHostNetState *net = get_vhost_net(nc);
> >>     const VhostOps *vhost_ops = net->dev.vhost_ops;
> >>
> >>     nc->vring_enable = enable;
> >>
> >>     if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
> >>         return vhost_ops->vhost_set_vring_enable(&net->dev, enable);
> >>     }
> >>
> >>     return 0;
> >> }
> >>
> >> @Eugenio, @Jason, should we change some things there if vhost-vdpa
> >> implements the vhost_set_vring_enable callback?
> >
> >Eugenio may know more, I remember we need to enable cvq first for
> >shadow virtqueue to restore some states.
> >
> >>
> >> Do you remember why we didn't implement it from the beginning?
> >
> >It seems the vrings parameter is introduced after vhost-vdpa is
> >implemented.
>
> Sorry, I mean why we didn't implement the vhost_set_vring_enable
> callback for vhost-vdpa from the beginning.

Adding Cindy, who wrote that code, for more thoughts.

Thanks

>
> Thanks,
> Stefano
>
Re: [PATCH v3 04/17] cpu: call plugin init hook asynchronously
On 2/6/24 19:24, Pierrick Bouvier wrote:

This ensures we run during a cpu_exec, which allows calling start/end
exclusive from this init hook (needed for the new scoreboard API
introduced later). Async work is run before any tb is
translated/executed, so we can guarantee plugin init will be called
before any other hook.

The previous change made sure that any idle/resume cb call will not be
done before initializing the plugin for a given vcpu.

Signed-off-by: Pierrick Bouvier
---
 hw/core/cpu-common.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson

r~
Re: [PATCH v3 03/17] plugins: fix order of init/idle/resume callback
On 2/6/24 19:24, Pierrick Bouvier wrote:

We found that vcpu_init_hook was called *after* the idle callback.
vcpu_init is called from cpu_realize_fn, while idle/resume cb are called
from qemu_wait_io_event (in the vcpu thread).

This change ensures we call idle and resume cb only once a plugin was
initialized for a given vcpu.

The next change in the series will run vcpu_init asynchronously, which
will make it run *after* the resume callback as well. So we fix this now.

Signed-off-by: Pierrick Bouvier
---
 plugins/core.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson

r~
[PATCH v3 3/6] target/arm: Adjust and validate mtedesc sizem1
When we added SVE_MTEDESC_SHIFT, we effectively limited the maximum size of MTEDESC. Adjust SIZEM1 to consume the remaining bits (32 - 10 - 5 - 12 == 5). Assert that the data to be stored fits within the field (expecting 8 * 4 - 1 == 31, exact fit). Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/internals.h | 2 +- target/arm/tcg/translate-sve.c | 7 --- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/target/arm/internals.h b/target/arm/internals.h index fc337fe40e..50bff44549 100644 --- a/target/arm/internals.h +++ b/target/arm/internals.h @@ -1278,7 +1278,7 @@ FIELD(MTEDESC, TBI, 4, 2) FIELD(MTEDESC, TCMA, 6, 2) FIELD(MTEDESC, WRITE, 8, 1) FIELD(MTEDESC, ALIGN, 9, 3) -FIELD(MTEDESC, SIZEM1, 12, SIMD_DATA_BITS - 12) /* size - 1 */ +FIELD(MTEDESC, SIZEM1, 12, SIMD_DATA_BITS - SVE_MTEDESC_SHIFT - 12) /* size - 1 */ bool mte_probe(CPUARMState *env, uint32_t desc, uint64_t ptr); uint64_t mte_check(CPUARMState *env, uint32_t desc, uint64_t ptr, uintptr_t ra); diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 7108938251..a88e523cba 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4443,17 +4443,18 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, { unsigned vsz = vec_full_reg_size(s); TCGv_ptr t_pg; +uint32_t sizem1; int desc = 0; assert(mte_n >= 1 && mte_n <= 4); +sizem1 = (mte_n << dtype_msz(dtype)) - 1; +assert(sizem1 <= R_MTEDESC_SIZEM1_MASK >> R_MTEDESC_SIZEM1_SHIFT); if (s->mte_active[0]) { -int msz = dtype_msz(dtype); - desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); desc = FIELD_DP32(desc, MTEDESC, TCMA, s->tcma); desc = FIELD_DP32(desc, MTEDESC, WRITE, is_write); -desc = FIELD_DP32(desc, MTEDESC, SIZEM1, (mte_n << msz) - 1); +desc = FIELD_DP32(desc, MTEDESC, SIZEM1, sizem1); desc <<= SVE_MTEDESC_SHIFT; } else { addr = clean_data_tbi(s, addr); -- 
2.34.1
[PATCH v3 5/6] target/arm: Handle mte in do_ldrq, do_ldro
These functions "use the standard load helpers", but fail to clean_data_tbi or populate mtedesc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/translate-sve.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 508f7b6bbd..ada05aa530 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4861,8 +4861,13 @@ static void do_ldrq(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) unsigned vsz = vec_full_reg_size(s); TCGv_ptr t_pg; int poff; +uint32_t desc; /* Load the first quadword using the normal predicated load helpers. */ +if (!s->mte_active[0]) { +addr = clean_data_tbi(s, addr); +} + poff = pred_full_reg_offset(s, pg); if (vsz > 16) { /* @@ -4886,7 +4891,8 @@ static void do_ldrq(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) gen_helper_gvec_mem *fn = ldr_fns[s->mte_active[0]][s->be_data == MO_BE][dtype][0]; -fn(tcg_env, t_pg, addr, tcg_constant_i32(simd_desc(16, 16, zt))); +desc = make_svemte_desc(s, 16, 1, dtype_msz(dtype), false, zt); +fn(tcg_env, t_pg, addr, tcg_constant_i32(desc)); /* Replicate that first quadword. */ if (vsz > 16) { @@ -4929,6 +4935,7 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) unsigned vsz_r32; TCGv_ptr t_pg; int poff, doff; +uint32_t desc; if (vsz < 32) { /* @@ -4941,6 +4948,9 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) } /* Load the first octaword using the normal predicated load helpers. 
*/ +if (!s->mte_active[0]) { +addr = clean_data_tbi(s, addr); +} poff = pred_full_reg_offset(s, pg); if (vsz > 32) { @@ -4965,7 +4975,8 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) gen_helper_gvec_mem *fn = ldr_fns[s->mte_active[0]][s->be_data == MO_BE][dtype][0]; -fn(tcg_env, t_pg, addr, tcg_constant_i32(simd_desc(32, 32, zt))); +desc = make_svemte_desc(s, 32, 1, dtype_msz(dtype), false, zt); +fn(tcg_env, t_pg, addr, tcg_constant_i32(desc)); /* * Replicate that first octaword. -- 2.34.1
[PATCH v3 6/6] target/arm: Fix SVE/SME gross MTE suppression checks
The TBI and TCMA bits are located within mtedesc, not desc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/sme_helper.c | 8 target/arm/tcg/sve_helper.c | 12 ++-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c index 1ee2690ceb..904bfdac43 100644 --- a/target/arm/tcg/sme_helper.c +++ b/target/arm/tcg/sme_helper.c @@ -573,8 +573,8 @@ void sme_ld1_mte(CPUARMState *env, void *za, uint64_t *vg, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -750,8 +750,8 @@ void sme_st1_mte(CPUARMState *env, void *za, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c index bce4295d28..6853f58c19 100644 --- a/target/arm/tcg/sve_helper.c +++ b/target/arm/tcg/sve_helper.c @@ -5800,8 +5800,8 @@ void sve_ldN_r_mte(CPUARMState *env, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. 
*/ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -6156,8 +6156,8 @@ void sve_ldnfff1_r_mte(CPUARMState *env, void *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -6410,8 +6410,8 @@ void sve_stN_r_mte(CPUARMState *env, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } -- 2.34.1
[PATCH v3 1/6] linux-user/aarch64: Choose SYNC as the preferred MTE mode
The API does not generate an error for setting ASYNC | SYNC; that merely constrains the selection vs the per-cpu default. For qemu linux-user, choose SYNC as the default. Cc: qemu-sta...@nongnu.org Reported-by: Gustavo Romero Signed-off-by: Richard Henderson --- linux-user/aarch64/target_prctl.h | 29 + 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/linux-user/aarch64/target_prctl.h b/linux-user/aarch64/target_prctl.h index 5067e7d731..aa8e203c15 100644 --- a/linux-user/aarch64/target_prctl.h +++ b/linux-user/aarch64/target_prctl.h @@ -173,21 +173,26 @@ static abi_long do_prctl_set_tagged_addr_ctrl(CPUArchState *env, abi_long arg2) env->tagged_addr_enable = arg2 & PR_TAGGED_ADDR_ENABLE; if (cpu_isar_feature(aa64_mte, cpu)) { -switch (arg2 & PR_MTE_TCF_MASK) { -case PR_MTE_TCF_NONE: -case PR_MTE_TCF_SYNC: -case PR_MTE_TCF_ASYNC: -break; -default: -return -EINVAL; -} - /* * Write PR_MTE_TCF to SCTLR_EL1[TCF0]. - * Note that the syscall values are consistent with hw. + * + * The kernel has a per-cpu configuration for the sysadmin, + * /sys/devices/system/cpu/cpu/mte_tcf_preferred, + * which qemu does not implement. + * + * Because there is no performance difference between the modes, and + * because SYNC is most useful for debugging MTE errors, choose SYNC + * as the preferred mode. With this preference, and the way the API + * uses only two bits, there is no way for the program to select + * ASYMM mode. */ -env->cp15.sctlr_el[1] = -deposit64(env->cp15.sctlr_el[1], 38, 2, arg2 >> PR_MTE_TCF_SHIFT); +unsigned tcf = 0; +if (arg2 & PR_MTE_TCF_SYNC) { +tcf = 1; +} else if (arg2 & PR_MTE_TCF_ASYNC) { +tcf = 2; +} +env->cp15.sctlr_el[1] = deposit64(env->cp15.sctlr_el[1], 38, 2, tcf); /* * Write PR_MTE_TAG to GCR_EL1[Exclude]. -- 2.34.1
[PATCH v3 2/6] target/arm: Fix nregs computation in do_{ld,st}_zpa
The field is encoded as [0-3], which is convenient for indexing our array of function pointers, but the true value is [1-4]. Adjust before calling do_mem_zpa. Add an assert, and move the comment re passing ZT to the helper back next to the relevant code. Cc: qemu-sta...@nongnu.org Fixes: 206adacfb8d ("target/arm: Add mte helpers for sve scalar + int loads") Signed-off-by: Richard Henderson --- target/arm/tcg/translate-sve.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 296e7d1ce2..7108938251 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4445,11 +4445,7 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, TCGv_ptr t_pg; int desc = 0; -/* - * For e.g. LD4, there are not enough arguments to pass all 4 - * registers as pointers, so encode the regno into the data field. - * For consistency, do this even for LD1. - */ +assert(mte_n >= 1 && mte_n <= 4); if (s->mte_active[0]) { int msz = dtype_msz(dtype); @@ -4463,6 +4459,11 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, addr = clean_data_tbi(s, addr); } +/* + * For e.g. LD4, there are not enough arguments to pass all 4 + * registers as pointers, so encode the regno into the data field. + * For consistency, do this even for LD1. + */ desc = simd_desc(vsz, vsz, zt | desc); t_pg = tcg_temp_new_ptr(); @@ -4600,7 +4601,7 @@ static void do_ld_zpa(DisasContext *s, int zt, int pg, * accessible via the instruction encoding. 
*/ assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, dtype, nreg, false, fn); +do_mem_zpa(s, zt, pg, addr, dtype, nreg + 1, false, fn); } static bool trans_LD_zprr(DisasContext *s, arg_rprr_load *a) @@ -5168,14 +5169,13 @@ static void do_st_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, if (nreg == 0) { /* ST1 */ fn = fn_single[s->mte_active[0]][be][msz][esz]; -nreg = 1; } else { /* ST2, ST3, ST4 -- msz == esz, enforced by encoding */ assert(msz == esz); fn = fn_multiple[s->mte_active[0]][be][nreg - 1][msz]; } assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, msz_dtype(s, msz), nreg, true, fn); +do_mem_zpa(s, zt, pg, addr, msz_dtype(s, msz), nreg + 1, true, fn); } static bool trans_ST_zprr(DisasContext *s, arg_rprr_store *a) -- 2.34.1
[PATCH v3 0/6] target/arm: assorted mte fixes
Changes for v3: - As if /sys/devices/system/cpu/cpu/mte_tcf_preferred is "sync". - Fix do_st_zpa as well as do_ld_zpa. Oops. Because of the above, I dropped Gustavo's t-b. r~ Richard Henderson (6): linux-user/aarch64: Choose SYNC as the preferred MTE mode target/arm: Fix nregs computation in do_{ld,st}_zpa target/arm: Adjust and validate mtedesc sizem1 target/arm: Split out make_svemte_desc target/arm: Handle mte in do_ldrq, do_ldro target/arm: Fix SVE/SME gross MTE suppression checks linux-user/aarch64/target_prctl.h | 29 ++- target/arm/internals.h| 2 +- target/arm/tcg/translate-a64.h| 2 + target/arm/tcg/sme_helper.c | 8 +-- target/arm/tcg/sve_helper.c | 12 ++--- target/arm/tcg/translate-sme.c| 15 ++ target/arm/tcg/translate-sve.c| 83 ++- 7 files changed, 83 insertions(+), 68 deletions(-) -- 2.34.1
[PATCH v3 4/6] target/arm: Split out make_svemte_desc
Share code that creates mtedesc and embeds within simd_desc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/translate-a64.h | 2 ++ target/arm/tcg/translate-sme.c | 15 +++ target/arm/tcg/translate-sve.c | 47 ++ 3 files changed, 31 insertions(+), 33 deletions(-) diff --git a/target/arm/tcg/translate-a64.h b/target/arm/tcg/translate-a64.h index 96ba39b37e..7b811b8ac5 100644 --- a/target/arm/tcg/translate-a64.h +++ b/target/arm/tcg/translate-a64.h @@ -28,6 +28,8 @@ bool logic_imm_decode_wmask(uint64_t *result, unsigned int immn, bool sve_access_check(DisasContext *s); bool sme_enabled_check(DisasContext *s); bool sme_enabled_check_with_svcr(DisasContext *s, unsigned); +uint32_t make_svemte_desc(DisasContext *s, unsigned vsz, uint32_t nregs, + uint32_t msz, bool is_write, uint32_t data); /* This function corresponds to CheckStreamingSVEEnabled. */ static inline bool sme_sm_enabled_check(DisasContext *s) diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c index 8f0dfc884e..46c7fce8b4 100644 --- a/target/arm/tcg/translate-sme.c +++ b/target/arm/tcg/translate-sme.c @@ -206,7 +206,7 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a) TCGv_ptr t_za, t_pg; TCGv_i64 addr; -int svl, desc = 0; +uint32_t desc; bool be = s->be_data == MO_BE; bool mte = s->mte_active[0]; @@ -224,18 +224,11 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a) tcg_gen_shli_i64(addr, cpu_reg(s, a->rm), a->esz); tcg_gen_add_i64(addr, addr, cpu_reg_sp(s, a->rn)); -if (mte) { -desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); -desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); -desc = FIELD_DP32(desc, MTEDESC, TCMA, s->tcma); -desc = FIELD_DP32(desc, MTEDESC, WRITE, a->st); -desc = FIELD_DP32(desc, MTEDESC, SIZEM1, (1 << a->esz) - 1); -desc <<= SVE_MTEDESC_SHIFT; -} else { +if (!mte) { addr = clean_data_tbi(s, addr); } -svl = streaming_vec_reg_size(s); -desc = simd_desc(svl, svl, desc); + +desc = 
make_svemte_desc(s, streaming_vec_reg_size(s), 1, a->esz, a->st, 0); fns[a->esz][be][a->v][mte][a->st](tcg_env, t_za, t_pg, addr, tcg_constant_i32(desc)); diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index a88e523cba..508f7b6bbd 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4437,18 +4437,18 @@ static const uint8_t dtype_esz[16] = { 3, 2, 1, 3 }; -static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, - int dtype, uint32_t mte_n, bool is_write, - gen_helper_gvec_mem *fn) +uint32_t make_svemte_desc(DisasContext *s, unsigned vsz, uint32_t nregs, + uint32_t msz, bool is_write, uint32_t data) { -unsigned vsz = vec_full_reg_size(s); -TCGv_ptr t_pg; uint32_t sizem1; -int desc = 0; +uint32_t desc = 0; -assert(mte_n >= 1 && mte_n <= 4); -sizem1 = (mte_n << dtype_msz(dtype)) - 1; +/* Assert all of the data fits, with or without MTE enabled. */ +assert(nregs >= 1 && nregs <= 4); +sizem1 = (nregs << msz) - 1; assert(sizem1 <= R_MTEDESC_SIZEM1_MASK >> R_MTEDESC_SIZEM1_SHIFT); +assert(data < 1u << SVE_MTEDESC_SHIFT); + if (s->mte_active[0]) { desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); @@ -4456,7 +4456,18 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, desc = FIELD_DP32(desc, MTEDESC, WRITE, is_write); desc = FIELD_DP32(desc, MTEDESC, SIZEM1, sizem1); desc <<= SVE_MTEDESC_SHIFT; -} else { +} +return simd_desc(vsz, vsz, desc | data); +} + +static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, + int dtype, uint32_t nregs, bool is_write, + gen_helper_gvec_mem *fn) +{ +TCGv_ptr t_pg; +uint32_t desc; + +if (!s->mte_active[0]) { addr = clean_data_tbi(s, addr); } @@ -4465,7 +4476,8 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, * registers as pointers, so encode the regno into the data field. * For consistency, do this even for LD1. 
*/ -desc = simd_desc(vsz, vsz, zt | desc); +desc = make_svemte_desc(s, vec_full_reg_size(s), nregs, +dtype_msz(dtype), is_write, zt); t_pg = tcg_temp_new_ptr(); tcg_gen_addi_ptr(t_pg, tcg_env, pred_full_reg_offset(s, pg)); @@ -5224,25 +5236,16 @@ static void do_mem_zpz(DisasContext *s, int zt, int pg, int zm, int scale, TCGv_i64 scalar, int msz, bool is_write, gen_helper_gvec_mem_scatter *fn) { -
Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote: > Based-on: 20240202102857.110210-1-pet...@redhat.com > [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups > https://lore.kernel.org/r/20240202102857.110210-1-pet...@redhat.com > > Hi, > > For v3 I fixed the refcounting issue spotted by Avihai. The situation > there is a bit clunky due to historical reasons. The gist is that we > have an assumption that channel creation never fails after p->c has > been set, so when 'p->c == NULL' we have to unref and when 'p->c != > NULL' the cleanup code will do the unref. Yes, this looks good to me. That's a good catch. I'll leave at least one more day for Avihai and/or Dan to have another look. My r-b persist as of now on patch 5. Actually I think the conditional unref is slightly tricky, but it's not its own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly abused. My understanding is we can avoid that conditional unref with below patch 1 as a cleanup (on top of this series). Then patch 2 comes all alongside. We don't need to rush on these, though, we should fix the thread race first because multiple of us hit it, and all cleanups can be done later. = >From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 7 Feb 2024 10:08:35 +0800 Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake") introduced a thread for TLS channels, which will resolve the issue on blocking the main thread. However in the same commit p->c is slightly abused just to be able to pass over the pointer "p" into the thread. That's the major reason we'll need to conditionally free the io channel in the fault paths. To clean it up, using a separate structure to pass over both "p" and "tioc" in the tls handshake thread. 
Then we can make it a rule that p->c will never be set until the channel is completely setup. With that, we can drop the tricky conditional unref of the io channel in the error path. Signed-off-by: Peter Xu --- migration/multifd.c | 37 +++-- 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index adfe8c9a0a..4a85a6b7b3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -873,16 +873,22 @@ out: static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); +typedef struct { +MultiFDSendParams *p; +QIOChannelTLS *tioc; +} MultiFDTLSThreadArgs; + static void *multifd_tls_handshake_thread(void *opaque) { -MultiFDSendParams *p = opaque; -QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c); +MultiFDTLSThreadArgs *args = opaque; -qio_channel_tls_handshake(tioc, +qio_channel_tls_handshake(args->tioc, multifd_new_send_channel_async, - p, + args->p, NULL, NULL); +g_free(args); + return NULL; } @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, { MigrationState *s = migrate_get_current(); const char *hostname = s->hostname; +MultiFDTLSThreadArgs *args; QIOChannelTLS *tioc; tioc = migration_tls_client_create(ioc, hostname, errp); @@ -906,11 +913,14 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, object_unref(OBJECT(ioc)); trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); -p->c = QIO_CHANNEL(tioc); + +args = g_new0(MultiFDTLSThreadArgs, 1); +args->tioc = tioc; +args->p = p; p->tls_thread_created = true; qemu_thread_create(>tls_thread, "multifd-tls-handshake-worker", - multifd_tls_handshake_thread, p, + multifd_tls_handshake_thread, args, QEMU_THREAD_JOINABLE); return true; } @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p, migration_ioc_register_yank(ioc); p->registered_yank = true; +/* Setup p->c only if the channel is completely setup */ p->c = ioc; 
p->thread_created = true; @@ -976,14 +987,12 @@ out: trace_multifd_new_send_channel_async_error(p->id, local_err); multifd_send_set_error(local_err); -if (!p->c) { -/* - * If no channel has been created, drop the initial - * reference. Otherwise cleanup happens at - * multifd_send_channel_destroy() - */ -object_unref(OBJECT(ioc)); -} +/* + * For error cases (TLS or non-TLS), IO channel is always freed here + * rather than when cleanup multifd: since p->c is not set, multifd + * cleanup code doesn't even know its existence. + */ +object_unref(OBJECT(ioc));
Re: [PATCH v2 1/6] linux-user/aarch64: Extend PR_SET_TAGGED_ADDR_CTRL for FEAT_MTE3
On 2/7/24 00:23, Peter Maydell wrote: +++ b/linux-user/aarch64/target_prctl.h @@ -173,21 +173,22 @@ static abi_long do_prctl_set_tagged_addr_ctrl(CPUArchState *env, abi_long arg2) env->tagged_addr_enable = arg2 & PR_TAGGED_ADDR_ENABLE; if (cpu_isar_feature(aa64_mte, cpu)) { -switch (arg2 & PR_MTE_TCF_MASK) { -case PR_MTE_TCF_NONE: -case PR_MTE_TCF_SYNC: -case PR_MTE_TCF_ASYNC: -break; -default: -return -EINVAL; -} We should probably check here and reject unknown bits being set in arg2, as set_tagged_addr_ctrl() does; but the old code didn't get that right either. This is done higher up in this function: if (arg2 & ~valid_mask) { return -TARGET_EINVAL; } The rejection of ASYNC | SYNC here was either a bug in my original implementation, or the kernel API changed since the initial implementation in June 2020 (not worth digging to find out). r~
[PATCH v2 2/3] ci: Remove tag dependency for build-previous-qemu
From: Peter Xu The new build-previous-qemu job relies on QEMU release tag being present, while that may not be always true for personal git repositories since by default tag is not pushed. The job can fail on those CI kicks, as reported by Peter Maydell. Fix it by fetching the tags remotely from the official repository, as suggested by Dan. [1] https://lore.kernel.org/r/zcc9sckj7vvqe...@redhat.com Reported-by: Peter Maydell Suggested-by: Daniel P. Berrangé Reviewed-by: Daniel P. Berrangé Signed-off-by: Peter Xu --- .gitlab-ci.d/buildtest.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml index 79bbc8585b..cfe95c1b17 100644 --- a/.gitlab-ci.d/buildtest.yml +++ b/.gitlab-ci.d/buildtest.yml @@ -189,6 +189,8 @@ build-previous-qemu: TARGETS: x86_64-softmmu aarch64-softmmu before_script: - export QEMU_PREV_VERSION="$(sed 's/\([0-9.]*\)\.[0-9]*/v\1.0/' VERSION)" +- git remote add upstream https://gitlab.com/qemu-project/qemu +- git fetch upstream $QEMU_PREV_VERSION - git checkout $QEMU_PREV_VERSION after_script: - mv build build-previous -- 2.43.0
[PATCH v2 3/3] ci: Update comment for migration-compat-aarch64
From: Peter Xu It turns out that we may not be able to enable this test even for the upcoming v9.0. Document what we're still missing. Reviewed-by: Daniel P. Berrangé Signed-off-by: Peter Xu --- .gitlab-ci.d/buildtest.yml | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml index cfe95c1b17..f56df59c94 100644 --- a/.gitlab-ci.d/buildtest.yml +++ b/.gitlab-ci.d/buildtest.yml @@ -219,9 +219,10 @@ build-previous-qemu: - QTEST_QEMU_BINARY_DST=./qemu-system-${TARGET} QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} ./tests/qtest/migration-test -# This job is disabled until we release 9.0. The existing -# migration-test in 8.2 is broken on aarch64. The fix was already -# commited, but it will only take effect once 9.0 is out. +# This job needs to be disabled until we can have an aarch64 CPU model that +# will both (1) support both KVM and TCG, and (2) provide a stable ABI. +# Currently only "-cpu max" can provide (1), however it doesn't guarantee +# (2). Mark this test skipped until later. migration-compat-aarch64: extends: .migration-compat-common variables: -- 2.43.0
[PATCH v2 0/3] ci: Fixes on the recent cross-binary test case
From: Peter Xu

v2:
- Fix a typo in patch 2 on QEMU_PREV_VERSION
- Added R-bs for Dan

Hi,

This small patchset updates the recent cross-binary migration test in a
few ways.

Patch 1 changes the aarch64 test GIC version to 3 rather than "max",
paving the way for enabling the test, even if the CPU model is not yet
ready.

Patch 2 removes the tag dependency of the new build-previous-qemu job,
so that in personal CI pipelines the job won't fail if the tag is
missing, as reported by Peter Maydell, with the solution suggested by
Dan.

Patch 3 updates the comment for aarch64 in the test to state the fact,
and what is missing. We no longer target supporting it for v9.0; it
will wait until we have a stable CPU model for aarch64 (if it is ever
possible to support both tcg and kvm).

Comments welcomed, thanks.

Peter Xu (3):
  tests/migration-test: Stick with gicv3 in aarch64 test
  ci: Remove tag dependency for build-previous-qemu
  ci: Update comment for migration-compat-aarch64

 tests/qtest/migration-test.c | 2 +-
 .gitlab-ci.d/buildtest.yml   | 9 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

-- 
2.43.0
[PATCH v2 1/3] tests/migration-test: Stick with gicv3 in aarch64 test
From: Peter Xu

Recently we introduced the cross-binary migration test. It has always
been desirable that migration-test uses a stable guest ABI for both QEMU
binaries in this case, so that both QEMU binaries will be compatible on
the migration stream with the cmdline specified.

Switch to the static gic version "3" rather than version "max", so that
GIC should now be stable across any future QEMU binaries for
migration-test. Here the version can actually be anything as long as the
ABI is stable. We choose "3" because it's the majority of what we
already use in QEMU while still new enough: "git grep gic-version=3"
shows 6 hits, while version 4 has no direct user yet besides "max".

Note that even with this change, aarch64 won't yet work with the
migration cross-binary test, but then the only missing piece will be the
stable CPU model.

Reviewed-by: Daniel P. Berrangé
Signed-off-by: Peter Xu
---
 tests/qtest/migration-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 7675519cfa..8a5bb1752e 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -819,7 +819,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     } else if (strcmp(arch, "aarch64") == 0) {
         memory_size = "150M";
         machine_alias = "virt";
-        machine_opts = "gic-version=max";
+        machine_opts = "gic-version=3";
         arch_opts = g_strdup_printf("-cpu max -kernel %s", bootpath);
         start_address = ARM_TEST_MEM_START;
         end_address = ARM_TEST_MEM_END;
-- 
2.43.0
Re: [PATCH v2 2/6] target/arm: Fix nregs computation in do_ld_zpa
On 2/7/24 00:46, Peter Maydell wrote: @@ -4600,7 +4601,7 @@ static void do_ld_zpa(DisasContext *s, int zt, int pg, * accessible via the instruction encoding. */ assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, dtype, nreg, false, fn); +do_mem_zpa(s, zt, pg, addr, dtype, nreg + 1, false, fn); } static bool trans_LD_zprr(DisasContext *s, arg_rprr_load *a) What about do_st_zpa() ? It's not obvious what the 'nreg' encoding is in the a->nreg field in arg_rprr_store, but it's definitely confusing that do_st_zpa() calls do_mem_zpa() passing "nreg" whereas do_ld_zpa() now passes it "nreg + 1". Can we make it so the handling in these two functions lines up? Yes, I think there may be a bug in store as well. Comparing the two is complicated by the cut outs for LDFF1, LDNF1, LD1R and PRF. r~
RE: [PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx
-Original Message- From: Nabih Estefan Sent: Wednesday, February 7, 2024 7:24 AM To: peter.mayd...@linaro.org Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; CS20 KFTing ; wuhao...@google.com; jasow...@redhat.com; IS20 Avi Fishman ; nabiheste...@google.com; CS20 KWLiu ; IS20 Tomer Maimon ; IN20 Hila Miranda-Kuzi Subject: [PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx Fixing the npcm_gmac-test.c file to run on a Nuvoton 7xx machine instead of 8xx. Also fixing comments referencing this and values expecting 8xx. Change-Id: I07b91e8be473e6a1ece65a2202608b52ed4025b8 Signed-Off-By: Nabih Estefan Reviewed-by: Tyrone Ting --- tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build index 39557d5ecb..2b89e8634b 100644 --- a/tests/qtest/meson.build +++ b/tests/qtest/meson.build @@ -192,7 +192,8 @@ qtests_npcm7xx = \ 'npcm7xx_sdhci-test', 'npcm7xx_smbus-test', 'npcm7xx_timer-test', - 'npcm7xx_watchdog_timer-test'] + \ + 'npcm7xx_watchdog_timer-test', + 'npcm_gmac-test'] + \ (slirp.found() ? ['npcm7xx_emc-test'] : []) qtests_aspeed = \ ['aspeed_hace-test', @@ -231,7 +232,6 @@ qtests_aarch64 = \ (config_all_devices.has_key('CONFIG_RASPI') ? ['bcm2835-dma-test'] : []) + \ (config_all_accel.has_key('CONFIG_TCG') and \ config_all_devices.has_key('CONFIG_TPM_TIS_I2C') ? ['tpm-tis-i2c-test'] : []) + \ - (config_all_devices.has_key('CONFIG_NPCM7XX') ?
qtests_npcm7xx : []) + \ ['arm-cpu-features', 'numa-test', 'boot-serial-test', diff --git a/tests/qtest/npcm_gmac-test.c b/tests/qtest/npcm_gmac-test.c index 9e58b15ca1..0d1bc8107b 100644 --- a/tests/qtest/npcm_gmac-test.c +++ b/tests/qtest/npcm_gmac-test.c @@ -36,7 +36,7 @@ typedef struct TestData { const GMACModule *module; } TestData; -/* Values extracted from hw/arm/npcm8xx.c */ +/* Values extracted from hw/arm/npcm7xx.c */ static const GMACModule gmac_module_list[] = { { .irq= 14, @@ -46,14 +46,6 @@ static const GMACModule gmac_module_list[] = { .irq= 15, .base_addr = 0xf0804000 }, -{ -.irq= 16, -.base_addr = 0xf0806000 -}, -{ -.irq= 17, -.base_addr = 0xf0808000 -} }; /* Returns the index of the GMAC module. */ @@ -196,7 +188,7 @@ static void test_init(gconstpointer test_data) { const TestData *td = test_data; const GMACModule *mod = td->module; -QTestState *qts = qtest_init("-machine npcm845-evb"); +QTestState *qts = qtest_init("-machine npcm750-evb"); #define CHECK_REG32(regno, value) \ do { \ -- 2.43.0.594.gd9cf4e227d-goog
Re: [PATCH v2 1/6] linux-user/aarch64: Extend PR_SET_TAGGED_ADDR_CTRL for FEAT_MTE3
On 2/7/24 00:23, Peter Maydell wrote: On Tue, 6 Feb 2024 at 03:06, Richard Henderson wrote: When MTE3 is supported, the kernel maps PR_MTE_TCF_ASYNC | PR_MTE_TCF_SYNC to MTE_CTRL_TCF_ASYMM and from there to SCTLR_EL1.TCF0 = 3 This depends on the setting of /sys/devices/system/cpu/cpu/mte_tcf_preferred : I think you only get asymm here if the sysadmin has set mte_tcf_preferred to 'asymm'; the default is 'async'. Hmm, I missed that somewhere in the rat's nest. I suspect this is over-engineered, such that no one will understand how to use it. For QEMU's implementation, are there any particular performance differences between sync, async and asymm ? I doubt it. Getting to the error path at all is the bulk of the work. I think "performance" in this case would be highly test-case-centric. Does the test "perform better" with async, which would allow the entire vector operation to finish in one go? I suspect that for debugging purposes, sync is always preferred. That might be the best setting for qemu. r~
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
Hello Alexander On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov wrote: > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > routines are invoked much more rarely in normal use when most buffers > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > frequency and voltage transition periods during which the CPU operates > at reduced performance, as described in > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html I would like to point out that frequency scaling is not currently an issue on AMD Zen4 Genoa CPUs, for example. The processor architecture is described here: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf Although the CPU frequency downscaling mentioned in the above document relates only to floating point operations, from other online discussions I gather that the data path for the integer registers in Zen4 is also 256 bits, which allows it to avoid frequency downscaling for FP and heavy instructions. Looking at the optimizations for AVX2 in your other patch, would unrolling the loop for AVX512 ops benefit from the same speedup, given that the data path has the same width? If the frequency downscaling is not observed on some CPUs, can AVX512 be maintained and used selectively for those CPUs? Thank you! > > > Signed-off-by: Mikhail Romanov > Signed-off-by: Alexander Monakov > --- > util/bufferiszero.c | 36 ++-- > 1 file changed, 2 insertions(+), 34 deletions(-) > > diff --git a/util/bufferiszero.c b/util/bufferiszero.c > index 01050694a6..c037d11d04 100644 > --- a/util/bufferiszero.c > +++ b/util/bufferiszero.c > @@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) > } > } > > -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || > defined(__SSE2__) > +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) > #include > > /* Note that each of these vectorized functions require len >= 64. 
*/ > @@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len) > } > #endif /* CONFIG_AVX2_OPT */ > > -#ifdef CONFIG_AVX512F_OPT > -static bool __attribute__((target("avx512f"))) > -buffer_zero_avx512(const void *buf, size_t len) > -{ > -/* Begin with an unaligned head of 64 bytes. */ > -__m512i t = _mm512_loadu_si512(buf); > -__m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64); > -__m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64); > - > -/* Loop over 64-byte aligned blocks of 256. */ > -while (p <= e) { > -__builtin_prefetch(p); > -if (unlikely(_mm512_test_epi64_mask(t, t))) { > -return false; > -} > -t = p[-4] | p[-3] | p[-2] | p[-1]; > -p += 4; > -} > - > -t |= _mm512_loadu_si512(buf + len - 4 * 64); > -t |= _mm512_loadu_si512(buf + len - 3 * 64); > -t |= _mm512_loadu_si512(buf + len - 2 * 64); > -t |= _mm512_loadu_si512(buf + len - 1 * 64); > - > -return !_mm512_test_epi64_mask(t, t); > - > -} > -#endif /* CONFIG_AVX512F_OPT */ > - > static unsigned __attribute__((noinline)) > select_accel_cpuinfo(unsigned info) > { > @@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info) > unsigned bit; > bool (*fn)(const void *, size_t); > } all[] = { > -#ifdef CONFIG_AVX512F_OPT > -{ CPUINFO_AVX512F, buffer_zero_avx512 }, > -#endif > #ifdef CONFIG_AVX2_OPT > { CPUINFO_AVX2,buffer_zero_avx2 }, > #endif > @@ -191,7 +159,7 @@ static unsigned used_accel > = 0; > #endif > > -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) > +#if defined(CONFIG_AVX2_OPT) > static void __attribute__((constructor)) init_accel(void) > { > used_accel = select_accel_cpuinfo(cpuinfo_init()); > -- > 2.32.0 > > > -- Elena
[PATCH 0/1] Sending small fix for NPCM GMAC test to properly test on Nuvoton 7xx
Nabih Estefan (1): tests/qtest: Fixing GMAC test to run in 7xx tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) -- 2.43.0.594.gd9cf4e227d-goog
[PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx
Fixing the npcm_gmac-test.c file to run on a Nuvoton 7xx machine instead of 8xx. Also fixing comments referencing this and values expecting 8xx. Change-Id: I07b91e8be473e6a1ece65a2202608b52ed4025b8 Signed-off-by: Nabih Estefan --- tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build index 39557d5ecb..2b89e8634b 100644 --- a/tests/qtest/meson.build +++ b/tests/qtest/meson.build @@ -192,7 +192,8 @@ qtests_npcm7xx = \ 'npcm7xx_sdhci-test', 'npcm7xx_smbus-test', 'npcm7xx_timer-test', - 'npcm7xx_watchdog_timer-test'] + \ + 'npcm7xx_watchdog_timer-test', + 'npcm_gmac-test'] + \ (slirp.found() ? ['npcm7xx_emc-test'] : []) qtests_aspeed = \ ['aspeed_hace-test', @@ -231,7 +232,6 @@ qtests_aarch64 = \ (config_all_devices.has_key('CONFIG_RASPI') ? ['bcm2835-dma-test'] : []) + \ (config_all_accel.has_key('CONFIG_TCG') and \ config_all_devices.has_key('CONFIG_TPM_TIS_I2C') ? ['tpm-tis-i2c-test'] : []) + \ - (config_all_devices.has_key('CONFIG_NPCM7XX') ? qtests_npcm7xx : []) + \ ['arm-cpu-features', 'numa-test', 'boot-serial-test', diff --git a/tests/qtest/npcm_gmac-test.c b/tests/qtest/npcm_gmac-test.c index 9e58b15ca1..0d1bc8107b 100644 --- a/tests/qtest/npcm_gmac-test.c +++ b/tests/qtest/npcm_gmac-test.c @@ -36,7 +36,7 @@ typedef struct TestData { const GMACModule *module; } TestData; -/* Values extracted from hw/arm/npcm8xx.c */ +/* Values extracted from hw/arm/npcm7xx.c */ static const GMACModule gmac_module_list[] = { { .irq= 14, @@ -46,14 +46,6 @@ static const GMACModule gmac_module_list[] = { .irq= 15, .base_addr = 0xf0804000 }, -{ -.irq= 16, -.base_addr = 0xf0806000 -}, -{ -.irq= 17, -.base_addr = 0xf0808000 -} }; /* Returns the index of the GMAC module. 
*/ @@ -196,7 +188,7 @@ static void test_init(gconstpointer test_data) { const TestData *td = test_data; const GMACModule *mod = td->module; -QTestState *qts = qtest_init("-machine npcm845-evb"); +QTestState *qts = qtest_init("-machine npcm750-evb"); #define CHECK_REG32(regno, value) \ do { \ -- 2.43.0.594.gd9cf4e227d-goog
[PATCH 4/6] migration/multifd: Zero page transmission on the multifd thread.
This implements the zero page detection and handling on the multifd threads. Signed-off-by: Hao Xiang --- migration/multifd.c | 62 + migration/multifd.h | 5 2 files changed, 62 insertions(+), 5 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index a20d0ed10e..c031f947c7 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -11,6 +11,7 @@ */ #include "qemu/osdep.h" +#include "qemu/cutils.h" #include "qemu/rcu.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" @@ -278,6 +279,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) packet->offset[i] = cpu_to_be64(temp); } +for (i = 0; i < p->zero_num; i++) { +/* there are architectures where ram_addr_t is 32 bit */ +uint64_t temp = p->zero[i]; + +packet->offset[p->normal_num + i] = cpu_to_be64(temp); +} } static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) @@ -360,6 +367,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->normal[i] = offset; } +for (i = 0; i < p->zero_num; i++) { +uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]); + +if (offset > (p->block->used_length - p->page_size)) { +error_setg(errp, "multifd: offset too long %" PRIu64 + " (max " RAM_ADDR_FMT ")", + offset, p->block->used_length); +return -1; +} +p->zero[i] = offset; +} + return 0; } @@ -658,13 +677,37 @@ int multifd_send_sync_main(void) return 0; } +static void zero_page_check_send(MultiFDSendParams *p) +{ +/* + * QEMU versions older than 9.0 don't understand zero pages + * on the multifd channel. This switch is required to + * maintain backward compatibility. 
+ */ +bool use_multifd_zero_page = migrate_multifd_zero_page(); +RAMBlock *rb = p->pages->block; + +for (int i = 0; i < p->pages->num; i++) { +uint64_t offset = p->pages->offset[i]; +if (use_multifd_zero_page && +buffer_is_zero(rb->host + offset, p->page_size)) { +p->zero[p->zero_num] = offset; +p->zero_num++; +ram_release_page(rb->idstr, offset); +} else { +p->normal[p->normal_num] = offset; +p->normal_num++; +} +} +} + static void *multifd_send_thread(void *opaque) { MultiFDSendParams *p = opaque; MigrationThread *thread = NULL; Error *local_err = NULL; -int ret = 0; bool use_zero_copy_send = migrate_zero_copy_send(); +int ret = 0; thread = migration_threads_add(p->name, qemu_get_thread_id()); @@ -699,10 +742,7 @@ static void *multifd_send_thread(void *opaque) p->iovs_num = 1; } -for (int i = 0; i < p->pages->num; i++) { -p->normal[p->normal_num] = p->pages->offset[i]; -p->normal_num++; -} +zero_page_check_send(p); if (p->normal_num) { ret = multifd_send_state->ops->send_prepare(p, _err); @@ -1107,6 +1147,16 @@ void multifd_recv_sync_main(void) trace_multifd_recv_sync_main(multifd_recv_state->packet_num); } +static void zero_page_check_recv(MultiFDRecvParams *p) +{ +for (int i = 0; i < p->zero_num; i++) { +void *page = p->host + p->zero[i]; +if (!buffer_is_zero(page, p->page_size)) { +memset(page, 0, p->page_size); +} +} +} + static void *multifd_recv_thread(void *opaque) { MultiFDRecvParams *p = opaque; @@ -1153,6 +1203,8 @@ static void *multifd_recv_thread(void *opaque) } } +zero_page_check_recv(p); + if (flags & MULTIFD_FLAG_SYNC) { qemu_sem_post(_recv_state->sem_sync); qemu_sem_wait(>sem_sync); diff --git a/migration/multifd.h b/migration/multifd.h index 6be9b2f6c1..7448cb1aa9 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -53,6 +53,11 @@ typedef struct { uint32_t unused32[1];/* Reserved for future use */ uint64_t unused64[3];/* Reserved for future use */ char ramblock[256]; +/* + * This array contains the pointers to: + * - normal pages 
(initial normal_pages entries) + * - zero pages (following zero_pages entries) + */ uint64_t offset[]; } __attribute__((packed)) MultiFDPacket_t; -- 2.30.2
[PATCH 3/6] migration/multifd: Support for zero pages transmission in multifd format.
This change adds zero page counters and updates multifd send/receive tracing format to track the newly added counters. Signed-off-by: Hao Xiang --- migration/migration-hmp-cmds.c | 4 migration/multifd.c| 43 ++ migration/multifd.h| 17 +- migration/trace-events | 8 +++ 4 files changed, 57 insertions(+), 15 deletions(-) diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 8b0c205a41..2dd99b0509 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -111,6 +111,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) info->ram->normal); monitor_printf(mon, "normal bytes: %" PRIu64 " kbytes\n", info->ram->normal_bytes >> 10); +monitor_printf(mon, "zero: %" PRIu64 " pages\n", + info->ram->zero); +monitor_printf(mon, "zero bytes: %" PRIu64 " kbytes\n", + info->ram->zero_bytes >> 10); monitor_printf(mon, "dirty sync count: %" PRIu64 "\n", info->ram->dirty_sync_count); monitor_printf(mon, "page size: %" PRIu64 " kbytes\n", diff --git a/migration/multifd.c b/migration/multifd.c index 25cbc6dc6b..a20d0ed10e 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -264,6 +264,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) packet->flags = cpu_to_be32(p->flags); packet->pages_alloc = cpu_to_be32(p->pages->allocated); packet->normal_pages = cpu_to_be32(p->normal_num); +packet->zero_pages = cpu_to_be32(p->zero_num); packet->next_packet_size = cpu_to_be32(p->next_packet_size); packet->packet_num = cpu_to_be64(p->packet_num); @@ -317,18 +318,26 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->normal_num = be32_to_cpu(packet->normal_pages); if (p->normal_num > packet->pages_alloc) { error_setg(errp, "multifd: received packet " - "with %u pages and expected maximum pages are %u", + "with %u normal pages and expected maximum pages are %u", p->normal_num, packet->pages_alloc) ; return -1; } -p->next_packet_size = be32_to_cpu(packet->next_packet_size); -p->packet_num = 
be64_to_cpu(packet->packet_num); +p->zero_num = be32_to_cpu(packet->zero_pages); +if (p->zero_num > packet->pages_alloc - p->normal_num) { +error_setg(errp, "multifd: received packet " + "with %u zero pages and expected maximum zero pages are %u", + p->zero_num, packet->pages_alloc - p->normal_num) ; +return -1; +} -if (p->normal_num == 0) { +if (p->normal_num == 0 && p->zero_num == 0) { return 0; } +p->next_packet_size = be32_to_cpu(packet->next_packet_size); +p->packet_num = be64_to_cpu(packet->packet_num); + /* make sure that ramblock is 0 terminated */ packet->ramblock[255] = 0; p->block = qemu_ram_block_by_name(packet->ramblock); @@ -430,6 +439,7 @@ static int multifd_send_pages(void) p->packet_num = multifd_send_state->packet_num++; multifd_send_state->pages = p->pages; p->pages = pages; + qemu_mutex_unlock(>mutex); qemu_sem_post(>sem); @@ -551,6 +561,8 @@ void multifd_save_cleanup(void) p->iov = NULL; g_free(p->normal); p->normal = NULL; +g_free(p->zero); +p->zero = NULL; multifd_send_state->ops->send_cleanup(p, _err); if (local_err) { migrate_set_error(migrate_get_current(), local_err); @@ -679,6 +691,7 @@ static void *multifd_send_thread(void *opaque) uint64_t packet_num = p->packet_num; uint32_t flags; p->normal_num = 0; +p->zero_num = 0; if (use_zero_copy_send) { p->iovs_num = 0; @@ -703,12 +716,13 @@ static void *multifd_send_thread(void *opaque) p->flags = 0; p->num_packets++; p->total_normal_pages += p->normal_num; +p->total_zero_pages += p->zero_num; p->pages->num = 0; p->pages->block = NULL; qemu_mutex_unlock(>mutex); -trace_multifd_send(p->id, packet_num, p->normal_num, flags, - p->next_packet_size); +trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num, + flags, p->next_packet_size); if (use_zero_copy_send) { /* Send header first, without zerocopy */ @@ -731,6 +745,8 @@ static void *multifd_send_thread(void *opaque) stat64_add(_stats.multifd_bytes, p->next_packet_size + p->packet_len); +stat64_add(_stats.normal_pages, p->normal_num); 
+stat64_add(_stats.zero_pages, p->zero_num);
[PATCH 6/6] migration/multifd: Add a new migration test case for legacy zero page checking.
Now that zero page checking is done on the multifd sender threads by default, we still provide an option for backward compatibility. This change adds a qtest migration test case to set the multifd-zero-page option to false and run multifd migration with zero page checking on the migration main thread. Signed-off-by: Hao Xiang --- tests/qtest/migration-test.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c index 7675519cfa..2c13df04c3 100644 --- a/tests/qtest/migration-test.c +++ b/tests/qtest/migration-test.c @@ -2621,6 +2621,15 @@ test_migrate_precopy_tcp_multifd_start(QTestState *from, return test_migrate_precopy_tcp_multifd_start_common(from, to, "none"); } +static void * +test_migrate_precopy_tcp_multifd_start_zero_page_legacy(QTestState *from, +QTestState *to) +{ +test_migrate_precopy_tcp_multifd_start_common(from, to, "none"); +migrate_set_parameter_bool(from, "multifd-zero-page", false); +return NULL; +} + static void * test_migrate_precopy_tcp_multifd_zlib_start(QTestState *from, QTestState *to) @@ -2652,6 +2661,21 @@ static void test_multifd_tcp_none(void) test_precopy_common(); } +static void test_multifd_tcp_zero_page_legacy(void) +{ +MigrateCommon args = { +.listen_uri = "defer", +.start_hook = test_migrate_precopy_tcp_multifd_start_zero_page_legacy, +/* + * Multifd is more complicated than most of the features, it + * directly takes guest page buffers when sending, make sure + * everything will work alright even if guest page is changing. 
+ */ +.live = true, +}; +test_precopy_common(); +} + static void test_multifd_tcp_zlib(void) { MigrateCommon args = { @@ -3550,6 +3574,8 @@ int main(int argc, char **argv) } migration_test_add("/migration/multifd/tcp/plain/none", test_multifd_tcp_none); +migration_test_add("/migration/multifd/tcp/plain/zero_page_legacy", + test_multifd_tcp_zero_page_legacy); migration_test_add("/migration/multifd/tcp/plain/cancel", test_multifd_tcp_cancel); migration_test_add("/migration/multifd/tcp/plain/zlib", -- 2.30.2
[PATCH 5/6] migration/multifd: Enable zero page checking from multifd threads.
This change adds a dedicated handler for MigrationOps::ram_save_target_page in multifd live migration. Now zero page checking can be done in the multifd threads and this becomes the default configuration. We still provide backward compatibility where zero page checking is done from the migration main thread. Signed-off-by: Hao Xiang --- migration/multifd.c | 3 ++- migration/ram.c | 49 - 2 files changed, 42 insertions(+), 10 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index c031f947c7..c6833ccb07 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/cutils.h" #include "qemu/rcu.h" +#include "qemu/cutils.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" #include "exec/ramblock.h" @@ -458,7 +459,6 @@ static int multifd_send_pages(void) p->packet_num = multifd_send_state->packet_num++; multifd_send_state->pages = p->pages; p->pages = pages; - qemu_mutex_unlock(>mutex); qemu_sem_post(>sem); @@ -733,6 +733,7 @@ static void *multifd_send_thread(void *opaque) if (p->pending_job) { uint64_t packet_num = p->packet_num; uint32_t flags; + p->normal_num = 0; p->zero_num = 0; diff --git a/migration/ram.c b/migration/ram.c index d5b7cd5ac2..e6742c9593 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1252,6 +1252,10 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss) static int ram_save_multifd_page(RAMBlock *block, ram_addr_t offset) { +assert(migrate_multifd()); +assert(!migrate_compress()); +assert(!migration_in_postcopy()); + if (multifd_queue_page(block, offset) < 0) { return -1; } @@ -2043,7 +2047,6 @@ static bool save_compress_page(RAMState *rs, PageSearchStatus *pss, */ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss) { -RAMBlock *block = pss->block; ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS; int res; @@ -2059,17 +2062,40 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss) return 1; } 
+return ram_save_page(rs, pss); +} + +/** + * ram_save_target_page_multifd: save one target page + * + * Returns the number of pages written + * + * @rs: current RAM state + * @pss: data about the page we want to send + */ +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss) +{ +RAMBlock *block = pss->block; +ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS; + +/* Multifd is not compatible with old compression. */ +assert(!migrate_compress()); + +/* Multifd is not compatible with postcopy. */ +assert(!migration_in_postcopy()); + /* - * Do not use multifd in postcopy as one whole host page should be - * placed. Meanwhile postcopy requires atomic update of pages, so even - * if host page size == guest page size the dest guest during run may - * still see partially copied pages which is data corruption. + * Backward compatibility support. While using multifd live + * migration, we still need to handle zero page checking on the + * migration main thread. */ -if (migrate_multifd() && !migration_in_postcopy()) { -return ram_save_multifd_page(block, offset); +if (!migrate_multifd_zero_page()) { +if (save_zero_page(rs, pss, offset)) { +return 1; +} } -return ram_save_page(rs, pss); +return ram_save_multifd_page(block, offset); } /* Should be called before sending a host page */ @@ -2981,7 +3007,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque) } migration_ops = g_malloc0(sizeof(MigrationOps)); + +if (migrate_multifd()) { +migration_ops->ram_save_target_page = ram_save_target_page_multifd; +} else { +migration_ops->ram_save_target_page = ram_save_target_page_legacy; +} bql_unlock(); ret = multifd_send_sync_main(); -- 2.30.2
[PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
This change extends the MigrationStatus interface to track zero pages and zero bytes counter. Signed-off-by: Hao Xiang --- qapi/migration.json | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/qapi/migration.json b/qapi/migration.json index ff033a0344..69366fe3f4 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -63,6 +63,10 @@ # between 0 and @dirty-sync-count * @multifd-channels. (since # 7.1) # +# @zero: number of zero pages (since 9.0) +# +# @zero-bytes: number of zero bytes sent (since 9.0) +# # Features: # # @deprecated: Member @skipped is always zero since 1.5.3 @@ -81,7 +85,8 @@ 'multifd-bytes': 'uint64', 'pages-per-second': 'uint64', 'precopy-bytes': 'uint64', 'downtime-bytes': 'uint64', 'postcopy-bytes': 'uint64', - 'dirty-sync-missed-zero-copy': 'uint64' } } + 'dirty-sync-missed-zero-copy': 'uint64', + 'zero': 'int', 'zero-bytes': 'int' } } ## # @XBZRLECacheStats: @@ -332,6 +337,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # } # } @@ -358,6 +365,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # } # } @@ -379,6 +388,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # }, # "disk":{ @@ -405,6 +416,8 @@ # "duplicate":10, # "normal":, # "normal-bytes":3412992, +# "zero":, +# "zero-bytes":3412992, # "dirty-sync-count":15 # }, # "xbzrle-cache":{ -- 2.30.2
[PATCH 1/6] migration/multifd: Add new migration option multifd-zero-page.
This new parameter controls where the zero page checking is running. If this parameter is set to true, zero page checking is done in the multifd sender threads. If this parameter is set to false, zero page checking is done in the migration main thread. Signed-off-by: Hao Xiang --- migration/migration-hmp-cmds.c | 7 +++ migration/options.c| 20 migration/options.h| 1 + qapi/migration.json| 24 +--- 4 files changed, 49 insertions(+), 3 deletions(-) diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 99b49df5dd..8b0c205a41 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -344,6 +344,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict) monitor_printf(mon, "%s: %s\n", MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION), MultiFDCompression_str(params->multifd_compression)); +monitor_printf(mon, "%s: %s\n", +MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_ZERO_PAGE), +params->multifd_zero_page ? "on" : "off"); monitor_printf(mon, "%s: %" PRIu64 " bytes\n", MigrationParameter_str(MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE), params->xbzrle_cache_size); @@ -634,6 +637,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict) p->has_multifd_zstd_level = true; visit_type_uint8(v, param, >multifd_zstd_level, ); break; +case MIGRATION_PARAMETER_MULTIFD_ZERO_PAGE: +p->has_multifd_zero_page = true; +visit_type_bool(v, param, >multifd_zero_page, ); +break; case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE: p->has_xbzrle_cache_size = true; if (!visit_type_size(v, param, _size, )) { diff --git a/migration/options.c b/migration/options.c index 3e3e0b93b4..cb18a41267 100644 --- a/migration/options.c +++ b/migration/options.c @@ -179,6 +179,8 @@ Property migration_properties[] = { DEFINE_PROP_MIG_MODE("mode", MigrationState, parameters.mode, MIG_MODE_NORMAL), +DEFINE_PROP_BOOL("multifd-zero-page", MigrationState, + parameters.multifd_zero_page, true), /* Migration capabilities */ 
DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE), @@ -903,6 +905,13 @@ uint64_t migrate_xbzrle_cache_size(void) return s->parameters.xbzrle_cache_size; } +bool migrate_multifd_zero_page(void) +{ +MigrationState *s = migrate_get_current(); + +return s->parameters.multifd_zero_page; +} + /* parameter setters */ void migrate_set_block_incremental(bool value) @@ -1013,6 +1022,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp) params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit; params->has_mode = true; params->mode = s->parameters.mode; +params->has_multifd_zero_page = true; +params->multifd_zero_page = s->parameters.multifd_zero_page; return params; } @@ -1049,6 +1060,7 @@ void migrate_params_init(MigrationParameters *params) params->has_x_vcpu_dirty_limit_period = true; params->has_vcpu_dirty_limit = true; params->has_mode = true; +params->has_multifd_zero_page = true; } /* @@ -1350,6 +1362,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params, if (params->has_mode) { dest->mode = params->mode; } + +if (params->has_multifd_zero_page) { +dest->multifd_zero_page = params->multifd_zero_page; +} } static void migrate_params_apply(MigrateSetParameters *params, Error **errp) @@ -1494,6 +1510,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp) if (params->has_mode) { s->parameters.mode = params->mode; } + +if (params->has_multifd_zero_page) { +s->parameters.multifd_zero_page = params->multifd_zero_page; +} } void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp) diff --git a/migration/options.h b/migration/options.h index 246c160aee..c080a6ba18 100644 --- a/migration/options.h +++ b/migration/options.h @@ -93,6 +93,7 @@ const char *migrate_tls_authz(void); const char *migrate_tls_creds(void); const char *migrate_tls_hostname(void); uint64_t migrate_xbzrle_cache_size(void); +bool migrate_multifd_zero_page(void); /* parameters setters */ diff --git a/qapi/migration.json 
b/qapi/migration.json index 819708321d..ff033a0344 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -874,6 +874,11 @@ # @mode: Migration mode. See description in @MigMode. Default is 'normal'. #(Since 8.2) # +# @multifd-zero-page: Multifd zero page checking. If the parameter is true, +# zero page checking is done on the multifd sender thread. If the parameter +# is false, zero page checking is done on the migration main thread. Default +# is set
[PATCH 0/6] Introduce multifd zero page checking.
This patchset is based on Juan Quintela's old series here https://lore.kernel.org/all/20220802063907.18882-1-quint...@redhat.com/

In the multifd live migration model, there is a single migration main thread scanning the page map, queuing the pages to multiple multifd sender threads. The migration main thread runs zero page checking on every page before queuing the page to the sender threads. Zero page checking is a CPU intensive task and hence having a single thread doing all that doesn't scale well. This change introduces a new function to run the zero page checking on the multifd sender threads. This patchset also lays the groundwork for future changes to offload the zero page checking task to accelerator hardware.

Use two Intel 4th generation Xeon servers for testing.

Architecture:        x86_64
CPU(s):              192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8457C
Stepping:            8
CPU MHz:             2538.624
CPU max MHz:         3800.
CPU min MHz:         800.

Perform multifd live migration with the below setup:
1. VM has 100GB memory. All pages in the VM are zero pages.
2. Use tcp socket for live migration.
3. Use 4 multifd channels and zero page checking on the migration main thread.
4. Use 1/2/4 multifd channels and zero page checking on the multifd sender threads.
5. Record migration total time from the sender QEMU console's "info migrate" command.
6. Calculate throughput with "100GB / total time".

+--------------------+----------------+------------------+
| zero-page-checking | total-time(ms) | throughput(GB/s) |
+--------------------+----------------+------------------+
| main-thread        | 9629           | 10.38GB/s        |
+--------------------+----------------+------------------+
| multifd-1-threads  | 6182           | 16.17GB/s        |
+--------------------+----------------+------------------+
| multifd-2-threads  | 4643           | 21.53GB/s        |
+--------------------+----------------+------------------+
| multifd-4-threads  | 4143           | 24.13GB/s        |
+--------------------+----------------+------------------+

Apply this patchset on top of commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440

Hao Xiang (6): migration/multifd: Add new migration option multifd-zero-page. migration/multifd: Add zero pages and zero bytes counter to migration status interface. 
migration/multifd: Support for zero pages transmission in multifd format. migration/multifd: Zero page transmission on the multifd thread. migration/multifd: Enable zero page checking from multifd threads. migration/multifd: Add a new migration test case for legacy zero page checking. migration/migration-hmp-cmds.c | 11 migration/multifd.c| 106 - migration/multifd.h| 22 ++- migration/options.c| 20 +++ migration/options.h| 1 + migration/ram.c| 49 --- migration/trace-events | 8 +-- qapi/migration.json| 39 ++-- tests/qtest/migration-test.c | 26 9 files changed, 249 insertions(+), 33 deletions(-) -- 2.30.2
Re: [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
On 2/7/24 06:48, Alexander Monakov wrote: Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Ah, that answers my question re 128 vs 256 byte minimum. So as far as this patch goes, Reviewed-by: Richard Henderson r~
Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
On 2/7/24 08:34, Richard Henderson wrote: On 2/7/24 06:48, Alexander Monakov wrote: - /* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple + /* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ - uint64_t t = ldq_he_p(buf); - const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); - const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); + uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); + typedef uint64_t uint64_a __attribute__((may_alias)); + const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); + const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); You appear to be optimizing this routine for x86, which is not the primary consumer. This is going to perform very poorly on hosts that do not support unaligned accesses (e.g. Sparc and some RISC-V). I beg your pardon, I mis-read this. You're only replacing the byte loops, which will be more-or-less identical, modulo unrolling, when unaligned access is not supported. But will be much improved if some unaligned access support is available (e.g. MIPS LWL+LWR). Reviewed-by: Richard Henderson r~
Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
On 2/7/24 06:48, Alexander Monakov wrote: Make buffer_is_zero a 'static inline' function that tests up to three bytes from the buffer before handing off to an unrolled loop. This eliminates call overhead for most non-zero buffers, and allows to optimize out length checks when it is known at compile time (which is often the case in Qemu). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- include/qemu/cutils.h | 28 +++- util/bufferiszero.c | 76 --- 2 files changed, 47 insertions(+), 57 deletions(-) diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 92c927a6a3..62b153e603 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); /* used to print char* safely */ #define STR_OR_NULL(str) ((str) ? (str) : "null") -bool buffer_is_zero(const void *buf, size_t len); +bool buffer_is_zero_len_4_plus(const void *, size_t); +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); Why 256, when the avx2 routine can handle size 128, and you're about to remove avx512? You appear to have missed that select_accel_fn() resolves directly to buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an indirect function call. I think you should not attempt to expose the 4 vs larger implementation detail here in the inline function. Presumably the bulk of the benefit in avoiding the function call is already realized via the three byte spot checks. r~
Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
On 2/7/24 06:48, Alexander Monakov wrote: -/* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple +/* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ -uint64_t t = ldq_he_p(buf); -const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); -const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); +uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); +typedef uint64_t uint64_a __attribute__((may_alias)); +const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); +const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); You appear to be optimizing this routine for x86, which is not the primary consumer. This is going to perform very poorly on hosts that do not support unaligned accesses (e.g. Sparc and some RISC-V). r~
Re: [PATCH] hw/net/tulip: add chip status register values
On 2/5/24 20:47, Sven Schnelle wrote:
NetBSD isn't able to detect a link on the emulated tulip card. That's because NetBSD reads the Chip Status Register of the PHY (address 0x14). The default PHY data in the QEMU tulip driver is all zero, which means no link is established and autonegotiation isn't complete. Therefore set the register to 0x3b40, which means: Link is up, Autonegotiation complete, Full Duplex, 100MBit/s Link speed. Also clear the mask because this register is read-only.
Signed-off-by: Sven Schnelle

Reviewed-by: Helge Deller
Tested-by: Helge Deller

Can be easily tested without installation:
Download: wget https://cdn.netbsd.org/pub/NetBSD/NetBSD-9.3/iso/NetBSD-9.3-hppa.iso
Run: ./qemu-system-hppa -cdrom NetBSD-9.3-hppa.iso -nographic
-> a) Installation on English
-> e) Utility Menu
-> c) configure network

Helge

---
 hw/net/tulip.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/net/tulip.c b/hw/net/tulip.c
index 6d4fb06dad..1f2ef20977 100644
--- a/hw/net/tulip.c
+++ b/hw/net/tulip.c
@@ -421,7 +421,7 @@ static uint16_t tulip_mdi_default[] = {
 /* MDI Registers 8 - 15 */
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 /* MDI Registers 16 - 31 */
-0x0003, 0x, 0x0001, 0x, 0x, 0x, 0x, 0x,
+0x0003, 0x, 0x0001, 0x, 0x3b40, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 };
@@ -429,7 +429,7 @@ static uint16_t tulip_mdi_default[] = {
 static const uint16_t tulip_mdi_mask[] = {
 0x, 0x, 0x, 0x, 0xc01f, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
-0x0fff, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
+0x0fff, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 };
Re: [PATCH v3 4/6] util/bufferiszero: remove useless prefetches
On 2/7/24 06:48, Alexander Monakov wrote: Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 3 --- 1 file changed, 3 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
On 2/7/24 06:48, Alexander Monakov wrote: Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov Signed-off-by: Alexander Monakov --- util/bufferiszero.c | 36 ++-- 1 file changed, 2 insertions(+), 34 deletions(-) Reviewed-by: Richard Henderson Although I think this patch should be ordered second. r~
Re: [PATCH v3 5/9] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
On Wed, Jan 24, 2024 at 03:47:21PM +, Jonathan Cameron wrote: > On Tue, 7 Nov 2023 10:07:09 -0800 > nifan@gmail.com wrote: > > > From: Fan Ni > > > > Add (file/memory backed) host backend, all the dynamic capacity regions > > will share a single, large enough host backend. Set up address space for > > DC regions to support read/write operations to dynamic capacity for DCD. > > > > With the change, following supports are added: > > 1. Add a new property to type3 device "nonvolatile-dc-memdev" to point to > > host > >memory backend for dynamic capacity. Currently, all dc regions share one > >one host backend. > > 2. Add namespace for dynamic capacity for read/write support; > > 3. Create cdat entries for each dynamic capacity region; > > 4. Fix dvsec range registers to include DC regions. > > > > Signed-off-by: Fan Ni > Some minor comments inline, mostly suggesting pulling refactors out before > you do the new stuff. > > Thanks, > > Jonathan Hi Jonathan, One question about DVSEC setting inline. 
Please search ""QUESTION:" > > > --- > > hw/cxl/cxl-mailbox-utils.c | 16 ++- > > hw/mem/cxl_type3.c | 198 +--- > > include/hw/cxl/cxl_device.h | 4 + > > 3 files changed, 179 insertions(+), 39 deletions(-) > > > > > > > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > > index 2d67d2015c..152a51306d 100644 > > --- a/hw/mem/cxl_type3.c > > +++ b/hw/mem/cxl_type3.c > > @@ -31,6 +31,7 @@ > > #include "hw/pci/spdm.h" > > > > #define DWORD_BYTE 4 > > +#define CXL_CAPACITY_MULTIPLIER (256 * MiB) > > > > /* Default CDAT entries for a memory region */ > > enum { > > @@ -44,8 +45,9 @@ enum { > > }; > > > > static int ct3_build_cdat_entries_for_mr(CDATSubHeader **cdat_table, > > - int dsmad_handle, MemoryRegion > > *mr, > > - bool is_pmem, uint64_t dpa_base) > > + int dsmad_handle, uint64_t size, > > + bool is_pmem, bool is_dynamic, > > + uint64_t dpa_base) > > { > > g_autofree CDATDsmas *dsmas = NULL; > > g_autofree CDATDslbis *dslbis0 = NULL; > > @@ -64,9 +66,10 @@ static int ct3_build_cdat_entries_for_mr(CDATSubHeader > > **cdat_table, > > .length = sizeof(*dsmas), > > }, > > .DSMADhandle = dsmad_handle, > > -.flags = is_pmem ? CDAT_DSMAS_FLAG_NV : 0, > > +.flags = (is_pmem ? CDAT_DSMAS_FLAG_NV : 0) | > > +(is_dynamic ? CDAT_DSMAS_FLAG_DYNAMIC_CAP : 0), > > .DPA_base = dpa_base, > > -.DPA_length = memory_region_size(mr), > > +.DPA_length = size, > > }; > > > > /* For now, no memory side cache, plausiblish numbers */ > > @@ -150,7 +153,7 @@ static int ct3_build_cdat_entries_for_mr(CDATSubHeader > > **cdat_table, > > */ > > .EFI_memory_type_attr = is_pmem ? 2 : 1, > > .DPA_offset = 0, > > -.DPA_length = memory_region_size(mr), > > +.DPA_length = size, > > }; > > Might be better to make the change to this function as a precursor patch > before > you introduce the new users. Will separate the DC bits out from the rest. 
> > > > > /* Header always at start of structure */ > > @@ -169,21 +172,28 @@ static int ct3_build_cdat_table(CDATSubHeader > > ***cdat_table, void *priv) > > g_autofree CDATSubHeader **table = NULL; > > CXLType3Dev *ct3d = priv; > > MemoryRegion *volatile_mr = NULL, *nonvolatile_mr = NULL; > > +MemoryRegion *dc_mr = NULL; > > int dsmad_handle = 0; > > int cur_ent = 0; > > int len = 0; > > int rc, i; > > +uint64_t vmr_size = 0, pmr_size = 0; > > Put these next to the memory region definitions above given they are > referring to the > same regions. > > > > > -if (!ct3d->hostpmem && !ct3d->hostvmem) { > > +if (!ct3d->hostpmem && !ct3d->hostvmem && !ct3d->dc.num_regions) { > > return 0; > > } > > > > +if (ct3d->hostpmem && ct3d->hostvmem && ct3d->dc.host_dc) { > > +warn_report("The device has static ram and pmem and dynamic > > capacity"); > > This is the whole how many DVSEC ranges question? > I hope we resolved that so we don't care about this... > > > +} > > + > > if (ct3d->hostvmem) { > > volatile_mr = host_memory_backend_get_memory(ct3d->hostvmem); > > if (!volatile_mr) { > > return -EINVAL; > > } > > len += CT3_CDAT_NUM_ENTRIES; > > +vmr_size = memory_region_size(volatile_mr); > > } > > > > if (ct3d->hostpmem) { > > > > > @@ -210,14 +233,38 @@ static int ct3_build_cdat_table(CDATSubHeader > > ***cdat_table, void *priv) > > } > > > > if (nonvolatile_mr) { > > -uint64_t base = volatile_mr ? memory_region_size(volatile_mr) : 0; > > rc =
Re: [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
On 2/7/24 06:48, Alexander Monakov wrote: The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from dispatch standpoint. Hence, the use of PTEST brings no benefit from throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 29 - 1 file changed, 29 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 03/13] target/arm: Add Cortex-R52 IMPDEF sysregs
On 2/6/24 23:29, Peter Maydell wrote: Add the Cortex-R52 IMPDEF sysregs, by defining them here and also by enabling the AUXCR feature which defines the ACTLR and HACTLR registers. As is our usual practice, we make these simple reads-as-zero stubs for now. Signed-off-by: Peter Maydell --- target/arm/tcg/cpu32.c | 108 + 1 file changed, 108 insertions(+) Reviewed-by: Richard Henderson r~
[PATCH v3 6/6] migration/multifd: Add a synchronization point for channel creation
It is possible that one of the multifd channels fails to be created at multifd_new_send_channel_async() while the rest of the channel creation tasks are still in flight. This could lead to multifd_save_cleanup() executing the qemu_thread_join() loop too early and not waiting for the threads which haven't been created yet, leading to the freeing of resources that the newly created threads will try to access and crash. Add a synchronization point after which there will be no attempts at thread creation and therefore calling multifd_save_cleanup() past that point will ensure it properly waits for the threads. A note about performance: Prior to this patch, if a channel took too long to be established, other channels could finish connecting first and already start taking load. Now we're bounded by the slowest-connecting channel. Reported-by: Avihai Horon Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 339f2428f3..ee77047031 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -62,6 +62,11 @@ struct { * Make it easy for now. */ uintptr_t packet_num; +/* + * Synchronization point past which no more channels will be + * created. + */ +QemuSemaphore channels_created; /* send channels ready */ QemuSemaphore channels_ready; /* @@ -622,10 +627,6 @@ static void multifd_send_terminate_threads(void) /* * Finally recycle all the threads. - * - * TODO: p->running is still buggy, e.g. we can reach here without the - * corresponding multifd_new_send_channel_async() get invoked yet, - * then a new thread can even be created after this function returns. 
*/ for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDSendParams *p = _send_state->params[i]; @@ -670,6 +671,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp) static void multifd_send_cleanup_state(void) { +qemu_sem_destroy(_send_state->channels_created); qemu_sem_destroy(_send_state->channels_ready); g_free(multifd_send_state->params); multifd_send_state->params = NULL; @@ -954,18 +956,26 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) if (migrate_channel_requires_tls_upgrade(ioc)) { ret = multifd_tls_channel_connect(p, ioc, _err); +if (ret) { +return; +} } else { ret = multifd_channel_connect(p, ioc, _err); } +out: +/* + * Here we're not interested whether creation succeeded, only that + * it happened at all. + */ +qemu_sem_post(_send_state->channels_created); + if (ret) { return; } -out: trace_multifd_new_send_channel_async_error(p->id, local_err); multifd_send_set_error(local_err); -multifd_send_kick_main(p); if (!p->c) { /* * If no channel has been created, drop the initial @@ -998,6 +1008,7 @@ bool multifd_send_setup(void) multifd_send_state = g_malloc0(sizeof(*multifd_send_state)); multifd_send_state->params = g_new0(MultiFDSendParams, thread_count); multifd_send_state->pages = multifd_pages_init(page_count); +qemu_sem_init(_send_state->channels_created, 0); qemu_sem_init(_send_state->channels_ready, 0); qatomic_set(_send_state->exiting, 0); multifd_send_state->ops = multifd_ops[migrate_multifd_compression()]; @@ -1023,6 +1034,15 @@ bool multifd_send_setup(void) multifd_new_send_channel_create(p); } +/* + * Wait until channel creation has started for all channels. The + * creation can still fail, but no more channels will be created + * past this point. + */ +for (i = 0; i < thread_count; i++) { +qemu_sem_wait(_send_state->channels_created); +} + for (i = 0; i < thread_count; i++) { MultiFDSendParams *p = _send_state->params[i]; -- 2.35.3
[PATCH v3 4/6] migration/multifd: Move multifd_send_setup into migration thread
We currently have an unfavorable situation around multifd channels creation and the migration thread execution. We create the multifd channels with qio_channel_socket_connect_async -> qio_task_run_in_thread, but only connect them at the multifd_new_send_channel_async callback, called from qio_task_complete, which is registered as a glib event. So at multifd_send_setup() we create the channels, but they will only be actually usable after the whole multifd_send_setup() calling stack returns back to the main loop. Which means that the migration thread is already up and running without any possibility for the multifd channels to be ready on time. We currently rely on the channels-ready semaphore blocking multifd_send_sync_main() until channels start to come up and release it. However there have been bugs recently found when a channel's creation fails and multifd_send_cleanup() is allowed to run while other channels are still being created. Let's start to organize this situation by moving the multifd_send_setup() call into the migration thread. That way we unblock the main-loop to dispatch the completion callbacks and actually have a chance of getting the multifd channels ready for when the migration thread needs them. The next patches will deal with the synchronization aspects. Note that this takes multifd_send_setup() out of the BQL. 
Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/migration.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 2942f8cf42..0675e12c64 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3315,6 +3315,10 @@ static void *migration_thread(void *opaque) object_ref(OBJECT(s)); update_iteration_initial_status(s); +if (!multifd_send_setup()) { +goto out; +} + bql_lock(); qemu_savevm_state_header(s->to_dst_file); bql_unlock(); @@ -3386,6 +3390,7 @@ static void *migration_thread(void *opaque) urgent = migration_rate_limit(); } +out: trace_migration_thread_after_loop(); migration_iteration_finish(s); object_unref(OBJECT(s)); @@ -3623,11 +3628,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } -if (!multifd_send_setup()) { -migrate_fd_cleanup(s); -return; -} - if (migrate_background_snapshot()) { qemu_thread_create(>thread, "bg_snapshot", bg_migration_thread, s, QEMU_THREAD_JOINABLE); -- 2.35.3
[PATCH v3 5/6] migration/multifd: Unify multifd and TLS connection paths
During multifd channel creation (multifd_new_send_channel_async) when TLS is enabled, the multifd_channel_connect function is called twice, once to create the TLS handshake thread and another time after the asynchronous TLS handshake has finished. This creates a slightly confusing call stack where multifd_channel_connect() is called more times than the number of channels. It also splits error handling between the two callers of multifd_channel_connect(), causing some code duplication. Lastly, it gets in the way of having a single point to determine whether all channel creation tasks have been initiated. Refactor the code to move the reentrancy one level up, to the multifd_new_send_channel_async() level, de-duplicating the error handling and allowing the next patch to introduce a synchronization point common to all the multifd channel creation, regardless of TLS. Note that the previous code would never fail once p->c had been set. This patch changes that assumption, which affects refcounting, so add comments around object_unref to explain the situation.
Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 83 ++--- 1 file changed, 40 insertions(+), 43 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index cc10be2c3f..339f2428f3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -869,30 +869,7 @@ out: return NULL; } -static bool multifd_channel_connect(MultiFDSendParams *p, -QIOChannel *ioc, -Error **errp); - -static void multifd_tls_outgoing_handshake(QIOTask *task, - gpointer opaque) -{ -MultiFDSendParams *p = opaque; -QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task)); -Error *err = NULL; - -if (!qio_task_propagate_error(task, )) { -trace_multifd_tls_outgoing_handshake_complete(ioc); -if (multifd_channel_connect(p, ioc, )) { -return; -} -} - -trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); - -multifd_send_set_error(err); -multifd_send_kick_main(p); -error_free(err); -} +static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); static void *multifd_tls_handshake_thread(void *opaque) { @@ -900,7 +877,7 @@ static void *multifd_tls_handshake_thread(void *opaque) QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c); qio_channel_tls_handshake(tioc, - multifd_tls_outgoing_handshake, + multifd_new_send_channel_async, p, NULL, NULL); @@ -920,6 +897,10 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, return false; } +/* + * Ownership of the socket channel now transfers to the newly + * created TLS channel, which has already taken a reference. 
+ */ object_unref(OBJECT(ioc)); trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); @@ -936,18 +917,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc, Error **errp) { -trace_multifd_set_outgoing_channel( -ioc, object_get_typename(OBJECT(ioc)), -migrate_get_current()->hostname); - -if (migrate_channel_requires_tls_upgrade(ioc)) { -/* - * tls_channel_connect will call back to this - * function after the TLS handshake, - * so we mustn't call multifd_send_thread until then - */ -return multifd_tls_channel_connect(p, ioc, errp); -} +qio_channel_set_delay(ioc, false); migration_ioc_register_yank(ioc); p->registered_yank = true; @@ -959,24 +929,51 @@ static bool multifd_channel_connect(MultiFDSendParams *p, return true; } +/* + * When TLS is enabled this function is called once to establish the + * TLS connection and a second time after the TLS handshake to create + * the multifd channel. Without TLS it goes straight into the channel + * creation. + */ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) { MultiFDSendParams *p = opaque; QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task)); Error *local_err = NULL; +bool ret; trace_multifd_new_send_channel_async(p->id); -if (!qio_task_propagate_error(task, _err)) { -qio_channel_set_delay(ioc, false); -if (multifd_channel_connect(p, ioc, _err)) { -return; -} + +if (qio_task_propagate_error(task, _err)) { +ret = false; +goto out; +} + +trace_multifd_set_outgoing_channel(ioc, object_get_typename(OBJECT(ioc)), + migrate_get_current()->hostname);
[PATCH v3 2/6] migration/multifd: Remove p->running
We currently only need p->running to avoid calling qemu_thread_join() on a non-existent thread if the thread has never been created. However, there are at least two bugs in this logic:

1) On the sending side, p->running is set too early and qemu_thread_create() can be skipped due to an error during TLS handshake, leaving the flag set and leading to a crash when multifd_send_cleanup() calls qemu_thread_join().

2) During exit, the multifd thread clears the flag while holding the channel lock. The counterpart at multifd_send_cleanup() reads the flag outside of the lock and might free the mutex while the multifd thread still has it locked.

Fix the first issue by setting the flag right before creating the thread. Rename it from p->running to p->thread_created to clarify its usage. Fix the second issue by not clearing the flag at the multifd thread exit. We don't have any use for that.

Note that these bugs are straightforward logic issues and not race conditions. There is still a gap for races to affect this code due to multifd_send_cleanup() being allowed to run concurrently with the thread creation loop. This issue is solved in the next patches.
Cc: qemu-stable Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake") Reported-by: Avihai Horon Reported-by: Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 27 --- migration/multifd.h | 7 ++- 2 files changed, 14 insertions(+), 20 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 8195c1daf3..515d88e04b 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -634,7 +634,7 @@ static void multifd_send_terminate_threads(void) qemu_thread_join(>tls_thread); } -if (p->running) { +if (p->thread_created) { qemu_thread_join(>thread); } } @@ -862,7 +862,6 @@ out: error_free(local_err); } -p->running = false; rcu_unregister_thread(); migration_threads_remove(thread); trace_multifd_send_thread_end(p->id, p->packets_sent, p->total_normal_pages); @@ -953,6 +952,8 @@ static bool multifd_channel_connect(MultiFDSendParams *p, migration_ioc_register_yank(ioc); p->registered_yank = true; p->c = ioc; + +p->thread_created = true; qemu_thread_create(>thread, p->name, multifd_send_thread, p, QEMU_THREAD_JOINABLE); return true; @@ -967,7 +968,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) trace_multifd_new_send_channel_async(p->id); if (!qio_task_propagate_error(task, _err)) { qio_channel_set_delay(ioc, false); -p->running = true; if (multifd_channel_connect(p, ioc, _err)) { return; } @@ -1128,15 +1128,15 @@ void multifd_recv_cleanup(void) for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDRecvParams *p = _recv_state->params[i]; -if (p->running) { -/* - * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code, - * however try to wakeup it without harm in cleanup phase. - */ -qemu_sem_post(>sem_sync); -} +/* + * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code, + * however try to wakeup it without harm in cleanup phase. 
+ */ +qemu_sem_post(>sem_sync); -qemu_thread_join(>thread); +if (p->thread_created) { +qemu_thread_join(>thread); +} } for (i = 0; i < migrate_multifd_channels(); i++) { multifd_recv_cleanup_channel(_recv_state->params[i]); @@ -1222,9 +1222,6 @@ static void *multifd_recv_thread(void *opaque) multifd_recv_terminate_threads(local_err); error_free(local_err); } -qemu_mutex_lock(>mutex); -p->running = false; -qemu_mutex_unlock(>mutex); rcu_unregister_thread(); trace_multifd_recv_thread_end(p->id, p->packets_recved, p->total_normal_pages); @@ -1330,7 +1327,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp) p->c = ioc; object_ref(OBJECT(ioc)); -p->running = true; +p->thread_created = true; qemu_thread_create(>thread, p->name, multifd_recv_thread, p, QEMU_THREAD_JOINABLE); qatomic_inc(_recv_state->count); diff --git a/migration/multifd.h b/migration/multifd.h index 720c9d50db..7881980ee6 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -73,6 +73,7 @@ typedef struct { char *name; /* channel thread id */ QemuThread thread; +bool thread_created; QemuThread tls_thread; bool tls_thread_created; /* communication channel */ @@ -93,8 +94,6 @@ typedef struct { /* syncs main thread and channels */ QemuSemaphore sem_sync; -/* is this channel thread running */ -bool running; /* multifd flags for each packet */ uint32_t
[PATCH v3 3/6] migration/multifd: Move multifd_send_setup error handling in to the function
Hide the error handling inside multifd_send_setup to make it cleaner for the next patch to move the function around. Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/migration.c | 6 +- migration/multifd.c | 24 +--- migration/multifd.h | 2 +- 3 files changed, 19 insertions(+), 13 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index ba99772e76..2942f8cf42 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3623,11 +3623,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } -if (multifd_send_setup(_err) != 0) { -migrate_set_error(s, local_err); -error_report_err(local_err); -migrate_set_state(>state, MIGRATION_STATUS_SETUP, - MIGRATION_STATUS_FAILED); +if (!multifd_send_setup()) { migrate_fd_cleanup(s); return; } diff --git a/migration/multifd.c b/migration/multifd.c index 515d88e04b..cc10be2c3f 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -985,14 +985,16 @@ static void multifd_new_send_channel_create(gpointer opaque) socket_send_channel_create(multifd_new_send_channel_async, opaque); } -int multifd_send_setup(Error **errp) +bool multifd_send_setup(void) { -int thread_count; +MigrationState *s = migrate_get_current(); +Error *local_err = NULL; +int thread_count, ret = 0; uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size(); uint8_t i; if (!migrate_multifd()) { -return 0; +return true; } thread_count = migrate_multifd_channels(); @@ -1026,14 +1028,22 @@ int multifd_send_setup(Error **errp) for (i = 0; i < thread_count; i++) { MultiFDSendParams *p = _send_state->params[i]; -int ret; -ret = multifd_send_state->ops->send_setup(p, errp); +ret = multifd_send_state->ops->send_setup(p, _err); if (ret) { -return ret; +break; } } -return 0; + +if (ret) { +migrate_set_error(s, local_err); +error_report_err(local_err); +migrate_set_state(>state, MIGRATION_STATUS_SETUP, + MIGRATION_STATUS_FAILED); +return false; +} + +return true; } struct { diff --git a/migration/multifd.h 
b/migration/multifd.h index 7881980ee6..8a1cad0996 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -13,7 +13,7 @@ #ifndef QEMU_MIGRATION_MULTIFD_H #define QEMU_MIGRATION_MULTIFD_H -int multifd_send_setup(Error **errp); +bool multifd_send_setup(void); void multifd_send_shutdown(void); int multifd_recv_setup(Error **errp); void multifd_recv_cleanup(void); -- 2.35.3
[PATCH v3 1/6] migration/multifd: Join the TLS thread
We're currently leaking the resources of the TLS thread by not joining it and also overwriting the p->thread pointer altogether. Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake") Cc: qemu-stable Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 8 +++- migration/multifd.h | 2 ++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/migration/multifd.c b/migration/multifd.c index ef13e2e781..8195c1daf3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void) for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDSendParams *p = _send_state->params[i]; +if (p->tls_thread_created) { +qemu_thread_join(>tls_thread); +} + if (p->running) { qemu_thread_join(>thread); } @@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); p->c = QIO_CHANNEL(tioc); -qemu_thread_create(>thread, "multifd-tls-handshake-worker", + +p->tls_thread_created = true; +qemu_thread_create(>tls_thread, "multifd-tls-handshake-worker", multifd_tls_handshake_thread, p, QEMU_THREAD_JOINABLE); return true; diff --git a/migration/multifd.h b/migration/multifd.h index 78a2317263..720c9d50db 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -73,6 +73,8 @@ typedef struct { char *name; /* channel thread id */ QemuThread thread; +QemuThread tls_thread; +bool tls_thread_created; /* communication channel */ QIOChannel *c; /* is the yank function registered */ -- 2.35.3
[PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
Based-on: 20240202102857.110210-1-pet...@redhat.com
[PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
https://lore.kernel.org/r/20240202102857.110210-1-pet...@redhat.com

Hi,

For v3 I fixed the refcounting issue spotted by Avihai. The situation there is a bit clunky due to historical reasons. The gist is that we have an assumption that channel creation never fails after p->c has been set, so when 'p->c == NULL' we have to unref and when 'p->c != NULL' the cleanup code will do the unref.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1166889341

v2: https://lore.kernel.org/r/20240205194929.28963-1-faro...@suse.de

In this v2 I made sure NO channel is created after the semaphores are posted. Feel free to call me out if that's not the case. Not much changes, except that now both TLS and non-TLS go through the same code, so there's a centralized place to do error handling and release the semaphore.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1165206107
based on Peter's code: https://gitlab.com/farosas/qemu/-/pipelines/1165303276

v1: https://lore.kernel.org/r/20240202191128.1901-1-faro...@suse.de

This contains 2 patches from my previous series addressing the p->running misuse and the TLS thread leak, and 3 new patches to fix the cleanup-while-creating-threads race.

For the p->running I'm keeping the idea from the other series to remove p->running and use a more narrow p->thread_created flag. This flag is used only to inform whether the thread has been created so we can join it.

For the cleanup race I have moved some code around and added a semaphore to make multifd_save_setup() only return once all channel creation tasks have started. The idea is that after multifd_save_setup() returns, no new creations are in flight and the p->thread_created flags will never change again, so they're enough to cause the cleanup code to wait for the threads to join.
CI run: https://gitlab.com/farosas/qemu/-/pipelines/1162798843 @Peter: I can rebase this on top of your series once we decide about it. Fabiano Rosas (6): migration/multifd: Join the TLS thread migration/multifd: Remove p->running migration/multifd: Move multifd_send_setup error handling in to the function migration/multifd: Move multifd_send_setup into migration thread migration/multifd: Unify multifd and TLS connection paths migration/multifd: Add a synchronization point for channel creation migration/migration.c | 14 ++-- migration/multifd.c | 168 +- migration/multifd.h | 11 ++- 3 files changed, 109 insertions(+), 84 deletions(-) -- 2.35.3
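The synchronization point described above — multifd_save_setup() returning only once every channel-creation task has started — can be sketched with POSIX semaphores standing in for QemuSemaphore (the names and channel count here are hypothetical):

```c
#include <pthread.h>
#include <semaphore.h>

enum { NUM_CHANNELS = 4 };

static sem_t channels_created;

static void *channel_create_task(void *opaque)
{
    /* ... channel setup, which may succeed or fail ... */
    /* Post exactly once on both paths so setup can never hang. */
    sem_post(&channels_created);
    return NULL;
}

/* Analogue of multifd_save_setup(): after the sem_wait() loop, every
 * creation task has started and posted, so per-channel flags such as
 * thread_created can no longer change under the cleanup code. */
static void save_setup(pthread_t threads[NUM_CHANNELS])
{
    sem_init(&channels_created, 0, 0);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        pthread_create(&threads[i], NULL, channel_create_task, NULL);
    }
    for (int i = 0; i < NUM_CHANNELS; i++) {
        sem_wait(&channels_created);
    }
}
```

The key design point from the cover letter is that the semaphore is posted on both the success and failure paths, so the setup function can centralize error handling without risking a hang.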
Re: [PATCH 07/13] hw/misc/mps2-scc: Make changes needed for AN536 FPGA image
On 2/6/24 23:29, Peter Maydell wrote: The MPS2 SCC device is broadly the same for all FPGA images, but has minor differences in the behaviour of the CFG registers depending on the image. In many cases we don't really care about the functionality controlled by these registers and a reads-as-written or similar behaviour is sufficient for the moment. For the AN536 the required behaviour is: * A_CFG0 has CPU reset and halt bits - implement as reads-as-written for the moment * A_CFG1 has flash or ATCM address 0 remap handling - QEMU doesn't model this; implement as reads-as-written * A_CFG2 has QSPI select (like AN524) - implemented (no behaviour, as with AN524) * A_CFG3 is MCC_MSB_ADDR "additional MCC addressing bits" - QEMU doesn't care about these, so use the existing RAZ behaviour for convenience * A_CFG4 is board rev (like all other images) - no change needed * A_CFG5 is ACLK frq in hz (like AN524) - implemented as reads-as-written, as for other boards * A_CFG6 is core 0 vector table base address - implemented as reads-as-written for the moment * A_CFG7 is core 1 vector table base address - implemented as reads-as-written for the moment Make the changes necessary for this; leave TODO comments where appropriate to indicate where we might want to come back and implement things like CPU reset. The other aspects of the device specific to this FPGA image (like the values of the board ID and similar registers) will be set via the device's qdev properties. Signed-off-by: Peter Maydell --- include/hw/misc/mps2-scc.h | 1 + hw/misc/mps2-scc.c | 101 + 2 files changed, 92 insertions(+), 10 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 06/13] hw/misc/mps2-scc: Factor out which-board conditionals
On 2/6/24 23:29, Peter Maydell wrote: The MPS SCC device has a lot of different flavours for the various different MPS FPGA images, which look mostly similar but have differences in how particular registers are handled. Currently we deal with this with a lot of open-coded checks on scc_partno(), but as we add more board types this is getting a bit hard to read. Factor out the conditions into some functions which we can give more descriptive names to. Signed-off-by: Peter Maydell --- hw/misc/mps2-scc.c | 45 +++-- 1 file changed, 31 insertions(+), 14 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 05/13] hw/misc/mps2-scc: Fix condition for CFG3 register
On 2/6/24 23:29, Peter Maydell wrote: We currently guard the CFG3 register read with (scc_partno(s) == 0x524 && scc_partno(s) == 0x547) which is clearly wrong as it is never true. This register is present on all board types except AN524 and AN527; correct the condition. Fixes: 6ac80818941829c0 ("hw/misc/mps2-scc: Implement changes for AN547") Signed-off-by: Peter Maydell --- hw/misc/mps2-scc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 04/13] target/arm: Allow access to SPSR_hyp from hyp mode
On 2/6/24 23:29, Peter Maydell wrote: Architecturally, the AArch32 MSR/MRS to/from banked register instructions are UNPREDICTABLE for attempts to access a banked register that the guest could access in a more direct way (e.g. using this insn to access r8_fiq when already in FIQ mode). QEMU has chosen to UNDEF on all of these. However, for the case of accessing SPSR_hyp from hyp mode, it turns out that real hardware permits this, with the same effect as if the guest had directly written to SPSR. Further, there is some guest code out there that assumes it can do this, because it happens to work on hardware: an example Cortex-R52 startup code fragment uses this, and it got copied into various other places, including Zephyr. Zephyr was fixed to not use this: https://github.com/zephyrproject-rtos/zephyr/issues/47330 but other examples are still out there, like the selftest binary for the MPS3-AN536. For convenience of being able to run guest code, permit this UNPREDICTABLE access instead of UNDEFing it. Signed-off-by: Peter Maydell --- Last time this came up I preferred the "keep QEMU behaviour as it is, try to get the guest code fixed" approach: https://www.mail-archive.com/qemu-devel@nongnu.org/msg899970.html but as this is the second time I lean a bit more towards behaving like the hardware. --- target/arm/tcg/op_helper.c | 43 ++ target/arm/tcg/translate.c | 19 +++-- 2 files changed, 43 insertions(+), 19 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PULL v2 00/39] tcg patch queue
On Tue, 6 Feb 2024 at 21:24, Peter Maydell wrote: > > On Tue, 6 Feb 2024 at 03:22, Richard Henderson > wrote: > > > > v2: Fix rebase error in patch 38 (tcg/s390x: Support TCG_COND_TST{EQ,NE}). > > > > > > r~ > > > > > > The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: > > > > Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into > > staging (2024-02-03 13:31:58 +0000) > > > > are available in the Git repository at: > > > > https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20240205-2 > > > > for you to fetch changes up to 23c5692abc3917151dee36c00d751cf5bc46ef19: > > > > tcg/tci: Support TCG_COND_TST{EQ,NE} (2024-02-05 22:45:41 +0000) > > > > > > tcg: Introduce TCG_COND_TST{EQ,NE} > > target/alpha: Use TCG_COND_TST{EQ,NE} > > target/m68k: Use TCG_COND_TST{EQ,NE} in gen_fcc_cond > > target/sparc: Use TCG_COND_TSTEQ in gen_op_mulscc > > target/s390x: Use TCG_COND_TSTNE for CC_OP_{TM,ICM} > > target/s390x: Improve general case of disas_jcc > > This really doesn't want to pass the ubuntu-20.04-s390x-all job: > > https://gitlab.com/qemu-project/qemu/-/jobs/6109442678 > https://gitlab.com/qemu-project/qemu/-/jobs/6108249863 > https://gitlab.com/qemu-project/qemu/-/jobs/6106928534 > https://gitlab.com/qemu-project/qemu/-/jobs/6105718495 > > Now, this has definitely been a flaky job recently, so maybe it's > not this pullreq's fault. > > This is a passing job from the last successful merge: > https://gitlab.com/qemu-project/qemu/-/jobs/6089342252 > That took 24 minutes to run, and all the failed jobs above > took 70 minutes plus. Ruling out anything about this particular merge attempt: This is a passing job from a recent successful merge: https://gitlab.com/qemu-project/qemu/-/jobs/6089089816 That took 37 minutes to run (21 mins in configure-n-compile). This is a failing job for the same commit: https://gitlab.com/qemu-project/qemu/-/jobs/6086439717 That took 58 minutes (26 mins in configure-n-compile).
So there's a lot of between-run variation, though in that case it was not as much as in some of these examples. -- PMM
Re: [PULL v2 00/39] tcg patch queue
On Tue, 6 Feb 2024 at 03:22, Richard Henderson wrote: > > v2: Fix rebase error in patch 38 (tcg/s390x: Support TCG_COND_TST{EQ,NE}). > > > r~ > > > The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: > > Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into > staging (2024-02-03 13:31:58 +0000) > > are available in the Git repository at: > > https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20240205-2 > > for you to fetch changes up to 23c5692abc3917151dee36c00d751cf5bc46ef19: > > tcg/tci: Support TCG_COND_TST{EQ,NE} (2024-02-05 22:45:41 +0000) > > > tcg: Introduce TCG_COND_TST{EQ,NE} > target/alpha: Use TCG_COND_TST{EQ,NE} > target/m68k: Use TCG_COND_TST{EQ,NE} in gen_fcc_cond > target/sparc: Use TCG_COND_TSTEQ in gen_op_mulscc > target/s390x: Use TCG_COND_TSTNE for CC_OP_{TM,ICM} > target/s390x: Improve general case of disas_jcc This really doesn't want to pass the ubuntu-20.04-s390x-all job: https://gitlab.com/qemu-project/qemu/-/jobs/6109442678 https://gitlab.com/qemu-project/qemu/-/jobs/6108249863 https://gitlab.com/qemu-project/qemu/-/jobs/6106928534 https://gitlab.com/qemu-project/qemu/-/jobs/6105718495 Now, this has definitely been a flaky job recently, so maybe it's not this pullreq's fault. This is a passing job from the last successful merge: https://gitlab.com/qemu-project/qemu/-/jobs/6089342252 That took 24 minutes to run, and all the failed jobs above took 70 minutes plus. TBH I think there is something weird with the runner. Looking at the timestamps in the log, it seems like the passing job completed its compile step in about 14 minutes, whereas one of the failing jobs took about 39 minutes. So the entire run of the job slowed down by more than 2.5x, which is enough to put it into the range where either the whole job or individual tests time out. thuth: any idea why that might happen? (I look in on the machine from time to time and it doesn't seem to be doing anything it shouldn't that would be eating CPU.)
Christian: this is on the s390x machine we have. Does the VM setup for that share IO or CPU with other VMs somehow? Is there some reason why it might have very variable performance over time? thanks -- PMM
Re: [PATCH 02/13] target/arm: The Cortex-R52 has a read-only CBAR
On Tue, 6 Feb 2024 at 20:38, Richard Henderson wrote: > > On 2/6/24 23:29, Peter Maydell wrote: > > The Cortex-R52 implements the Configuration Base Address Register > > (CBAR), as a read-only register. Add ARM_FEATURE_CBAR_RO to this CPU > > type, so that our implementation provides the register and the > > associated qdev property. > > > > Signed-off-by: Peter Maydell > > --- > > target/arm/tcg/cpu32.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/target/arm/tcg/cpu32.c b/target/arm/tcg/cpu32.c > > index 11253051156..311d654cdce 100644 > > --- a/target/arm/tcg/cpu32.c > > +++ b/target/arm/tcg/cpu32.c > > @@ -809,6 +809,7 @@ static void cortex_r52_initfn(Object *obj) > > set_feature(>env, ARM_FEATURE_PMSA); > > set_feature(>env, ARM_FEATURE_NEON); > > set_feature(>env, ARM_FEATURE_GENERIC_TIMER); > > +set_feature(>env, ARM_FEATURE_CBAR_RO); > > Reviewed-by: Richard Henderson > > I just noticed that arm_cpu_post_init can be simplified to not check CBAR_RO, > now that we > have arm_cpu_propagate_feature_implications. The other bit of CBAR cleanup I have is that cortex-a55, cortex-a76, neoverse-n1, neoverse-v1, neoverse-v2 and cortex-a710 have all cut-n-pasted the line that sets ARM_FEATURE_CBAR_RO, but none of them actually have a CBAR according to their TRM. The only reason I didn't throw in a patch fixing that is that I think it would be a migration compat break and I didn't feel like it was worth the effort to try to deal with that... -- PMM
Re: [PATCH 01/13] target/arm: Use new CBAR encoding for all v8 CPUs, not all aarch64 CPUs
On Tue, 6 Feb 2024 at 20:34, Richard Henderson wrote: > > On 2/6/24 23:29, Peter Maydell wrote: > > We support two different encodings for the AArch32 IMPDEF > > CBAR register -- older cores like the Cortex A9, A7, A15 > > have this at 4, c15, c0, 0; newer cores like the > > Cortex A35, A53, A57 and A72 have it at 1 c15 c0 0. > > > > When we implemented this we picked which encoding to > > use based on whether the CPU set ARM_FEATURE_AARCH64. > > However this isn't right for three cases: > > * the qemu-system-arm 'max' CPU, which is supposed to be > > a variant on a Cortex-A57; it ought to use the same > > encoding the A57 does and which the AArch64 'max' > > exposes to AArch32 guest code > > * the Cortex-R52, which is AArch32-only but has the CBAR > > at the newer encoding (and where we incorrectly are > > not yet setting ARM_FEATURE_CBAR_RO anyway) > > * any possible future support for other v8 AArch32 > > only CPUs, or for supporting "boot the CPU into > > AArch32 mode" on our existing cores like the A57 etc > > > > Make the decision of the encoding be based on whether > > the CPU implements the ARM_FEATURE_V8 flag instead. > > > > This changes the behaviour only for the qemu-system-arm > > '-cpu max'. We don't expect anybody to be relying on the > > old behaviour because: > > * it's not what the real hardware Cortex-A57 does > > (and that's what our ID register claims we are) > > Not even that, because max resets MIDR. qemu-system-aarch64 max does (in aarch64_max_tcg_initfn()), yes; but qemu-system-arm max is set up in arm_max_initfn() in cpu32.c, and that sets cpu->midr = 0x411fd070 (which is the same as A57's MIDR)... > Anyway, > Reviewed-by: Richard Henderson thanks -- PMM
Re: [PATCH 08/13] hw/arm/mps3r: Initial skeleton for mps3-an536 board
On Tue, 6 Feb 2024 at 19:21, Philippe Mathieu-Daudé wrote: > > Hi Peter, > > On 6/2/24 14:29, Peter Maydell wrote: > > The AN536 is another FPGA image for the MPS3 development board. Unlike > > the existing FPGA images we already model, this board uses a Cortex-R > > family CPU, and it does not use any equivalent to the M-profile > > "Subsystem for Embedded" SoC-equivalent that we model in hw/arm/armsse.c. > > It's therefore more convenient for us to model it as a completely > > separate C file. > > > > This commit adds the basic skeleton of the board model, and the > > code to create all the RAM and ROM. We assume that we're probably > > going to want to add more images in future, so use the same > > base class/subclass setup that mps2-tz.c uses, even though at > > the moment there's only a single subclass. > > > > Following commits will add the CPUs and the peripherals. > > > > Signed-off-by: Peter Maydell > > --- > > MAINTAINERS | 3 +- > > configs/devices/arm-softmmu/default.mak | 1 + > > hw/arm/mps3r.c | 239 > > hw/arm/Kconfig | 5 + > > hw/arm/meson.build | 1 + > > 5 files changed, 248 insertions(+), 1 deletion(-) > > create mode 100644 hw/arm/mps3r.c > > > > +static MemoryRegion *mr_for_raminfo(MPS3RMachineState *mms, > > +const RAMInfo *raminfo) > > +{ > > +/* Return an initialized MemoryRegion for the RAMInfo. */ > > +MemoryRegion *ram; > > + > > +if (raminfo->mrindex < 0) { > > +/* Means this RAMInfo is for QEMU's "system memory" */ > > +MachineState *machine = MACHINE(mms); > > +assert(!(raminfo->flags & IS_ROM)); > > +return machine->ram; > > +} > > + > > +assert(raminfo->mrindex < MPS3R_RAM_MAX); > > +ram = >ram[raminfo->mrindex]; > > + > > +memory_region_init_ram(ram, NULL, raminfo->name, > > You are not using the parent=mms, is that deliberate? > (as in: easier to migrate eventually?) No, I didn't have a particular reason for not setting the parent; I just copied this bit of code from mps2-tz.c, which also doesn't set the parent pointer... -- PMM
Re: [PATCH v2] hw: riscv: Allow large kernels to boot by moving the initrd further away in RAM
On Tue, Feb 6, 2024 at 9:39 PM Daniel Henrique Barboza wrote: > > > > On 2/6/24 12:40, Alexandre Ghiti wrote: > > Currently, the initrd is placed at 128MB, which overlaps with the kernel > > when it is large (for example syzbot kernels are). From the kernel side, > > there is no reason we could not push the initrd further away in memory > > to accommodate large kernels, so move the initrd at 512MB when possible. > > > > The ideal solution would have been to place the initrd based on the > > kernel size but we actually can't since the bss size is not known when > > the image is loaded by load_image_targphys_as() and the initrd would > > then overlap with this section. > > > > Signed-off-by: Alexandre Ghiti > > --- > > Reviewed-by: Daniel Henrique Barboza Thanks for your help! Alex > > > > > Changes in v2: > > - Fix typos in commit log (Daniel) and title > > - Added to the commit log why using the kernel size does not work > >(Daniel) > > > > hw/riscv/boot.c | 12 ++-- > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c > > index 0ffca05189..9a367af2fa 100644 > > --- a/hw/riscv/boot.c > > +++ b/hw/riscv/boot.c > > @@ -188,13 +188,13 @@ static void riscv_load_initrd(MachineState *machine, > > uint64_t kernel_entry) > >* kernel is uncompressed it will not clobber the initrd. However > >* on boards without much RAM we must ensure that we still leave > >* enough room for a decent sized initrd, and on boards with large > > - * amounts of RAM we must avoid the initrd being so far up in RAM > > - * that it is outside lowmem and inaccessible to the kernel. > > - * So for boards with less than 256MB of RAM we put the initrd > > - * halfway into RAM, and for boards with 256MB of RAM or more we put > > - * the initrd at 128MB. > > + * amounts of RAM, we put the initrd at 512MB to allow large kernels > > + * to boot. 
> > + * So for boards with less than 1GB of RAM we put the initrd > > + * halfway into RAM, and for boards with 1GB of RAM or more we put > > + * the initrd at 512MB. > >*/ > > -start = kernel_entry + MIN(mem_size / 2, 128 * MiB); > > +start = kernel_entry + MIN(mem_size / 2, 512 * MiB); > > > > size = load_ramdisk(filename, start, mem_size - start); > > if (size == -1) {
[PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch standpoint. Hence, the use of PTEST brings no benefit from a throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index 3e6a5dfd63..f5a3634f9a 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len) } #ifdef CONFIG_AVX2_OPT -static bool __attribute__((target("sse4"))) -buffer_zero_sse4(const void *buf, size_t len) -{ -__m128i t = _mm_loadu_si128(buf); -__m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16); -__m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16); - -/* Loop over 16-byte aligned blocks of 64. */ -while (likely(p <= e)) { -__builtin_prefetch(p); -if (unlikely(!_mm_testz_si128(t, t))) { -return false; -} -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} - -/* Finish the aligned tail. */ -t |= e[-3]; -t |= e[-2]; -t |= e[-1]; - -/* Finish the unaligned tail. 
*/ -t |= _mm_loadu_si128(buf + len - 16); - -return _mm_testz_si128(t, t); -} - static bool __attribute__((target("avx2"))) buffer_zero_avx2(const void *buf, size_t len) { @@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info) #endif #ifdef CONFIG_AVX2_OPT { CPUINFO_AVX2,128, buffer_zero_avx2 }, -{ CPUINFO_SSE4, 64, buffer_zero_sse4 }, #endif { CPUINFO_SSE2, 64, buffer_zero_sse2 }, { CPUINFO_ALWAYS,0, buffer_zero_int }, -- 2.32.0
[PATCH v3 6/6] util/bufferiszero: improve scalar variant
Take into account that the inline wrapper ensures len >= 4. Use __attribute__((may_alias)) for accesses via non-char pointers. Avoid using out-of-bounds pointers in loop boundary conditions by reformulating the 'for' loop as 'if (...) do { ... } while (...)'. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 30 +++--- 1 file changed, 11 insertions(+), 19 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index d752edd8cc..1f4cbfaea4 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -29,35 +29,27 @@ bool buffer_is_zero_len_4_plus(const void *buf, size_t len) { -if (unlikely(len < 8)) { -/* For a very small buffer, simply accumulate all the bytes. */ -const unsigned char *p = buf; -const unsigned char *e = buf + len; -unsigned char t = 0; - -do { -t |= *p++; -} while (p < e); - -return t == 0; +if (unlikely(len <= 8)) { +/* Our caller ensures len >= 4. */ +return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0; } else { -/* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple +/* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ -uint64_t t = ldq_he_p(buf); -const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); -const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); +uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); +typedef uint64_t uint64_a __attribute__((may_alias)); +const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); +const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); -for (; p + 8 <= e; p += 8) { +if (e - p >= 8) do { if (t) { return false; } t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7]; -} +} while ((p += 8) <= e - 8); while (p < e) { t |= *p++; } -t |= ldq_he_p(buf + len - 8); return t == 0; } -- 2.32.0
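The len <= 8 fast path in this patch relies on two possibly-overlapping 4-byte loads that together cover the whole buffer. A portable stand-alone illustration, with memcpy standing in for QEMU's ldl_he_p() unaligned-load helper:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for ldl_he_p(): a safe unaligned 32-bit load. */
static uint32_t load32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* For 4 <= len <= 8, two loads anchored at buf and buf + len - 4
 * overlap as needed and together cover every byte exactly once or
 * twice, so a single OR suffices to test the whole buffer. */
static bool small_is_zero(const void *buf, size_t len)
{
    return (load32(buf) | load32((const char *)buf + len - 4)) == 0;
}
```

ORing the two loads works regardless of host endianness, since only zero/non-zero matters.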
[PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
Make buffer_is_zero a 'static inline' function that tests up to three bytes from the buffer before handing off to an unrolled loop. This eliminates call overhead for most non-zero buffers, and allows length checks to be optimized out when the length is known at compile time (which is often the case in QEMU). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- include/qemu/cutils.h | 28 +++- util/bufferiszero.c | 76 --- 2 files changed, 47 insertions(+), 57 deletions(-) diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 92c927a6a3..62b153e603 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); /* used to print char* safely */ #define STR_OR_NULL(str) ((str) ? (str) : "null") -bool buffer_is_zero(const void *buf, size_t len); +bool buffer_is_zero_len_4_plus(const void *, size_t); +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); bool test_buffer_is_zero_next_accel(void); +/* + * Check if a buffer is all zeroes. + */ +static inline bool buffer_is_zero(const void *vbuf, size_t len) +{ +const char *buf = vbuf; + +if (len == 0) { +return true; +} +if (buf[0] || buf[len - 1] || buf[len / 2]) { +return false; +} +/* All bytes are covered for any len <= 3. */ +if (len <= 3) { +return true; +} + +if (len >= 256) { +return buffer_is_zero_len_256_plus(vbuf, len); +} else { +return buffer_is_zero_len_4_plus(vbuf, len); +} +} + /* * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128) * Input is limited to 14-bit numbers diff --git a/util/bufferiszero.c b/util/bufferiszero.c index f5a3634f9a..01050694a6 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -26,8 +26,8 @@ #include "qemu/bswap.h" #include "host/cpuinfo.h" -static bool -buffer_zero_int(const void *buf, size_t len) +bool +buffer_is_zero_len_4_plus(const void *buf, size_t len) { if (unlikely(len < 8)) { /* For a very small buffer, simply accumulate all the bytes. 
*/ @@ -157,57 +157,40 @@ buffer_zero_avx512(const void *buf, size_t len) } #endif /* CONFIG_AVX512F_OPT */ -/* - * Make sure that these variables are appropriately initialized when - * SSE2 is enabled on the compiler command-line, but the compiler is - * too old to support CONFIG_AVX2_OPT. - */ -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) -# define INIT_USED 0 -# define INIT_LENGTH 0 -# define INIT_ACCELbuffer_zero_int -#else -# ifndef __SSE2__ -# error "ISA selection confusion" -# endif -# define INIT_USED CPUINFO_SSE2 -# define INIT_LENGTH 64 -# define INIT_ACCELbuffer_zero_sse2 -#endif - -static unsigned used_accel = INIT_USED; -static unsigned length_to_accel = INIT_LENGTH; -static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL; - static unsigned __attribute__((noinline)) select_accel_cpuinfo(unsigned info) { /* Array is sorted in order of algorithm preference. */ static const struct { unsigned bit; -unsigned len; bool (*fn)(const void *, size_t); } all[] = { #ifdef CONFIG_AVX512F_OPT -{ CPUINFO_AVX512F, 256, buffer_zero_avx512 }, +{ CPUINFO_AVX512F, buffer_zero_avx512 }, #endif #ifdef CONFIG_AVX2_OPT -{ CPUINFO_AVX2,128, buffer_zero_avx2 }, +{ CPUINFO_AVX2,buffer_zero_avx2 }, #endif -{ CPUINFO_SSE2, 64, buffer_zero_sse2 }, -{ CPUINFO_ALWAYS,0, buffer_zero_int }, +{ CPUINFO_SSE2,buffer_zero_sse2 }, +{ CPUINFO_ALWAYS, buffer_is_zero_len_4_plus }, }; for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) { if (info & all[i].bit) { -length_to_accel = all[i].len; -buffer_accel = all[i].fn; +buffer_is_zero_len_256_plus = all[i].fn; return all[i].bit; } } return 0; } +static unsigned used_accel +#if defined(__SSE2__) += CPUINFO_SSE2; +#else += 0; +#endif + #if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) static void __attribute__((constructor)) init_accel(void) { @@ -227,35 +210,16 @@ bool test_buffer_is_zero_next_accel(void) return used; } -static bool select_accel_fn(const void *buf, size_t len) -{ -if (likely(len >= length_to_accel)) { 
-return buffer_accel(buf, len); -} -return buffer_zero_int(buf, len); -} - #else -#define select_accel_fn buffer_zero_int bool test_buffer_is_zero_next_accel(void) { return false; } #endif -/* - * Checks if a buffer is all zeroes - */ -bool buffer_is_zero(const void *buf, size_t len) -{ -if (unlikely(len == 0)) { -return true; -} - -/* Fetch the beginning of the buffer while we select the accelerator. */ -__builtin_prefetch(buf); - -/* Use an optimized zero check if possible. Note that this also - includes a
[PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov Signed-off-by: Alexander Monakov --- util/bufferiszero.c | 36 ++---------------------------------- 1 file changed, 2 insertions(+), 34 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index 01050694a6..c037d11d04 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) } } -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__) +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) #include <immintrin.h> /* Note that each of these vectorized functions require len >= 64. */ @@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len) } #endif /* CONFIG_AVX2_OPT */ -#ifdef CONFIG_AVX512F_OPT -static bool __attribute__((target("avx512f"))) -buffer_zero_avx512(const void *buf, size_t len) -{ -/* Begin with an unaligned head of 64 bytes. */ -__m512i t = _mm512_loadu_si512(buf); -__m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64); -__m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64); - -/* Loop over 64-byte aligned blocks of 256. 
*/ -while (p <= e) { -__builtin_prefetch(p); -if (unlikely(_mm512_test_epi64_mask(t, t))) { -return false; -} -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} - -t |= _mm512_loadu_si512(buf + len - 4 * 64); -t |= _mm512_loadu_si512(buf + len - 3 * 64); -t |= _mm512_loadu_si512(buf + len - 2 * 64); -t |= _mm512_loadu_si512(buf + len - 1 * 64); - -return !_mm512_test_epi64_mask(t, t); - -} -#endif /* CONFIG_AVX512F_OPT */ - static unsigned __attribute__((noinline)) select_accel_cpuinfo(unsigned info) { @@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info) unsigned bit; bool (*fn)(const void *, size_t); } all[] = { -#ifdef CONFIG_AVX512F_OPT -{ CPUINFO_AVX512F, buffer_zero_avx512 }, -#endif #ifdef CONFIG_AVX2_OPT { CPUINFO_AVX2,buffer_zero_avx2 }, #endif @@ -191,7 +159,7 @@ static unsigned used_accel = 0; #endif -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) +#if defined(CONFIG_AVX2_OPT) static void __attribute__((constructor)) init_accel(void) { used_accel = select_accel_cpuinfo(cpuinfo_init()); -- 2.32.0
[PATCH v3 4/6] util/bufferiszero: remove useless prefetches
Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index c037d11d04..cb3eb2543f 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -49,7 +49,6 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); for (; p + 8 <= e; p += 8) { -__builtin_prefetch(p + 8); if (t) { return false; } @@ -79,7 +78,6 @@ buffer_zero_sse2(const void *buf, size_t len) /* Loop over 16-byte aligned blocks of 64. */ while (likely(p <= e)) { -__builtin_prefetch(p); t = _mm_cmpeq_epi8(t, zero); if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) { return false; } @@ -110,7 +108,6 @@ buffer_zero_avx2(const void *buf, size_t len) /* Loop over 32-byte aligned blocks of 128. */ while (p <= e) { -__builtin_prefetch(p); if (unlikely(!_mm256_testz_si256(t, t))) { return false; } -- 2.32.0
[PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Avoid using out-of-bounds pointers in loop boundary conditions. Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of PTEST, which is not profitable there (like in the removed SSE4 variant). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 108 1 file changed, 69 insertions(+), 39 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index cb3eb2543f..d752edd8cc 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -66,62 +66,92 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) #include <immintrin.h> -/* Note that each of these vectorized functions require len >= 64. */ +/* Helper for preventing the compiler from reassociating + chains of binary vector operations. */ +#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1)) + +/* Note that these vectorized functions may assume len >= 256. */ static bool __attribute__((target("sse2"))) buffer_zero_sse2(const void *buf, size_t len) { -/* Unaligned loads at head/tail. */ wait
*/ +v |= e[-1]; w |= e[-2]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-3]; w |= e[-4]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-5]; w |= e[-6]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-7]; v |= w; + +/* Loop over complete 128-byte blocks. */ +for (; p < e - 7; p += 8) { +v = _mm_cmpeq_epi8(v, zero); +if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) { return false; } -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; +v = p[0]; w = p[1]; +SSE_REASSOC_BARRIER(v, w); +v |= p[2]; w |= p[3]; +SSE_REASSOC_BARRIER(v, w); +v |= p[4]; w |= p[5]; +SSE_REASSOC_BARRIER(v, w); +v |= p[6]; w |= p[7]; +SSE_REASSOC_BARRIER(v, w); +v |= w; } -/* Finish the aligned tail. */ -t |= e[-3]; -t |= e[-2]; -t |= e[-1]; - -/* Finish the unaligned tail. */ -t |= _mm_loadu_si128(buf + len - 16); - -return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF; +return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF; } #ifdef CONFIG_AVX2_OPT static bool __attribute__((target("avx2"))) buffer_zero_avx2(const void *buf, size_t len) { -/* Begin with an unaligned head of 32 bytes. */ -__m256i t = _mm256_loadu_si256(buf); -__m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32); -__m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32); - -/* Loop over 32-byte aligned blocks of 128. */ -while (p <= e) { -if (unlikely(!_mm256_testz_si256(t, t))) { +/* Unaligned loads at head/tail. */ +__m256i v = *(__m256i_u *)(buf); +__m256i w = *(__m256i_u *)(buf + len - 32); +/* Align head/tail to 32-byte boundaries. */ +__m256i *p = (void *)(((uintptr_t)buf + 32) & -32); +__m256i *e = (void *)(((uintptr_t)buf + len - 1) & -32); +__m256i zero = { 0 }; + +/* Collect a partial block at tail end. */ +v |= e[-1]; w |= e[-2]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-3]; w |= e[-4]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-5]; w |= e[-6]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-7]; v |= w; + +/* Loop over complete 256-byte blocks. */ +for (; p < e - 7; p += 8) { +/* PTEST is not profitable here. 
*/ +v = _mm256_cmpeq_epi8(v, zero); +if (unlikely(_mm256_movemask_epi8(v) != 0x)) { return false; } -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} ; - -/* Finish the last block of 128 unaligned. */ -t |= _mm256_loadu_si256(buf + len - 4 * 32); -t |= _mm256_loadu_si256(buf + len - 3 * 32); -t |= _mm256_loadu_si256(buf + len - 2 * 32); -t |= _mm256_loadu_si256(buf + len - 1 * 32); +v = p[0]; w = p[1]; +SSE_REASSOC_BARRIER(v, w); +v |= p[2]; w |= p[3]; +SSE_REASSOC_BARRIER(v, w); +v |= p[4]; w |= p[5]; +SSE_REASSOC_BARRIER(v, w); +v |=
[PATCH v3 0/6] Optimize buffer_is_zero
I am posting a new revision of buffer_is_zero improvements (v2 can be found at https://patchew.org/QEMU/20231027143704.7060-1-mmroma...@ispras.ru/ ).

In our experiments buffer_is_zero took about 40%-50% of overall qemu-img run time, even though Glib I/O is not very efficient. Hence, it remains an important routine to optimize.

We substantially improve its performance in typical cases, mostly by introducing an inline wrapper that samples three bytes from the head/middle/tail of the buffer, avoiding call overhead when any of those is non-zero. We also provide improvements for the SIMD and portable scalar variants.

Changes for v3:
- separate into 6 patches
- fix an oversight which would break the build on non-x86 hosts
- properly avoid out-of-bounds pointers in the scalar variant

Alexander Monakov (6):
  util/bufferiszero: remove SSE4.1 variant
  util/bufferiszero: introduce an inline wrapper
  util/bufferiszero: remove AVX512 variant
  util/bufferiszero: remove useless prefetches
  util/bufferiszero: optimize SSE2 and AVX2 variants
  util/bufferiszero: improve scalar variant

 include/qemu/cutils.h |  28 -
 util/bufferiszero.c   | 280 +++---
 2 files changed, 128 insertions(+), 180 deletions(-)

-- 
2.32.0
Re: [PATCH v2] hw: riscv: Allow large kernels to boot by moving the initrd further away in RAM
On 2/6/24 12:40, Alexandre Ghiti wrote:

Currently, the initrd is placed at 128MB, which overlaps with the kernel when the kernel is large (as syzbot kernels are, for example). From the kernel side, there is no reason we could not push the initrd further away in memory to accommodate large kernels, so move the initrd to 512MB when possible.

The ideal solution would have been to place the initrd based on the kernel size, but we actually can't since the bss size is not known when the image is loaded by load_image_targphys_as(), and the initrd would then overlap with this section.

Signed-off-by: Alexandre Ghiti
---

Reviewed-by: Daniel Henrique Barboza

Changes in v2:
- Fix typos in commit log (Daniel) and title
- Added to the commit log why using the kernel size does not work (Daniel)

 hw/riscv/boot.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index 0ffca05189..9a367af2fa 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -188,13 +188,13 @@ static void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry)
      * kernel is uncompressed it will not clobber the initrd. However
      * on boards without much RAM we must ensure that we still leave
      * enough room for a decent sized initrd, and on boards with large
-     * amounts of RAM we must avoid the initrd being so far up in RAM
-     * that it is outside lowmem and inaccessible to the kernel.
-     * So for boards with less than 256MB of RAM we put the initrd
-     * halfway into RAM, and for boards with 256MB of RAM or more we put
-     * the initrd at 128MB.
+     * amounts of RAM, we put the initrd at 512MB to allow large kernels
+     * to boot.
+     * So for boards with less than 1GB of RAM we put the initrd
+     * halfway into RAM, and for boards with 1GB of RAM or more we put
+     * the initrd at 512MB.
      */
-    start = kernel_entry + MIN(mem_size / 2, 128 * MiB);
+    start = kernel_entry + MIN(mem_size / 2, 512 * MiB);
     size = load_ramdisk(filename, start, mem_size - start);
     if (size == -1) {
Re: [PATCH 02/13] target/arm: The Cortex-R52 has a read-only CBAR
On 2/6/24 23:29, Peter Maydell wrote:

The Cortex-R52 implements the Configuration Base Address Register (CBAR) as a read-only register. Add ARM_FEATURE_CBAR_RO to this CPU type, so that our implementation provides the register and the associated qdev property.

Signed-off-by: Peter Maydell
---
 target/arm/tcg/cpu32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/arm/tcg/cpu32.c b/target/arm/tcg/cpu32.c
index 11253051156..311d654cdce 100644
--- a/target/arm/tcg/cpu32.c
+++ b/target/arm/tcg/cpu32.c
@@ -809,6 +809,7 @@ static void cortex_r52_initfn(Object *obj)
     set_feature(&cpu->env, ARM_FEATURE_PMSA);
     set_feature(&cpu->env, ARM_FEATURE_NEON);
     set_feature(&cpu->env, ARM_FEATURE_GENERIC_TIMER);
+    set_feature(&cpu->env, ARM_FEATURE_CBAR_RO);

Reviewed-by: Richard Henderson

I just noticed that arm_cpu_post_init can be simplified to not check CBAR_RO, now that we have arm_cpu_propagate_feature_implications.

r~
Re: [PATCH 01/13] target/arm: Use new CBAR encoding for all v8 CPUs, not all aarch64 CPUs
On 2/6/24 23:29, Peter Maydell wrote:

We support two different encodings for the AArch32 IMPDEF CBAR register -- older cores like the Cortex A9, A7, A15 have this at 4, c15, c0, 0; newer cores like the Cortex A35, A53, A57 and A72 have it at 1, c15, c0, 0. When we implemented this we picked which encoding to use based on whether the CPU set ARM_FEATURE_AARCH64. However this isn't right for three cases:

* the qemu-system-arm 'max' CPU, which is supposed to be a variant on a Cortex-A57; it ought to use the same encoding the A57 does and which the AArch64 'max' exposes to AArch32 guest code
* the Cortex-R52, which is AArch32-only but has the CBAR at the newer encoding (and where we incorrectly are not yet setting ARM_FEATURE_CBAR_RO anyway)
* any possible future support for other v8 AArch32-only CPUs, or for supporting "boot the CPU into AArch32 mode" on our existing cores like the A57 etc

Make the decision of the encoding be based on whether the CPU implements the ARM_FEATURE_V8 flag instead. This changes the behaviour only for the qemu-system-arm '-cpu max'. We don't expect anybody to be relying on the old behaviour because:

* it's not what the real hardware Cortex-A57 does (and that's what our ID register claims we are)

Not even that, because max resets MIDR. Anyway,

Reviewed-by: Richard Henderson

r~
Re: [PATCH] target/riscv: Update $pc after linking to $ra in trans_cm_jalt()
On 2/6/24 23:18, Jason Chien wrote:

The original implementation sets $pc to the address read from the jump vector table first and links $ra with the address of the next instruction after the updated $pc. After jumping to the updated $pc and executing the next ret instruction, the program jumps to $ra, which is in the same function currently executing, resulting in an infinite loop. This commit reverses the two actions: firstly, $ra is updated with the address of the next instruction after $pc, and then $pc is set to the address read from the jump vector table.

This is unlikely to be correct in the case the vector table read faults, leaving $ra updated. I guess this got broken with CF_PCREL.

Anyway, the solution is to use a temporary...

-    /*
-     * Update pc to current for the non-unwinding exception
-     * that might come from cpu_ld*_code() in the helper.
-     */
-    gen_update_pc(ctx, 0);
-    gen_helper_cm_jalt(cpu_pc, cpu_env, tcg_constant_i32(a->index));

... here and then ...

@@ -307,6 +300,13 @@ static bool trans_cm_jalt(DisasContext *ctx, arg_cm_jalt *a)
         gen_set_gpr(ctx, xRA, succ_pc);
     }

... copy the temp to cpu_pc here.

     tcg_gen_lookup_and_goto_ptr();
     ctx->base.is_jmp = DISAS_NORETURN;
     return true;

r~