Re: [PATCH] docs/style: allow C99 mixed declarations
Philippe Mathieu-Daudé writes:

> On 6/2/24 06:53, Markus Armbruster wrote:
>> Daniel P. Berrangé writes:
>>> On Mon, Feb 05, 2024 at 12:18:19PM -0500, Stefan Hajnoczi wrote:
>>>> C99 mixed declarations support interleaving of local variable
>>>> declarations and code. The coding style "generally" forbids C99
>>>> mixed declarations with some exceptions to the rule. This rule is
>>>> not checked by checkpatch.pl and naturally there are violations in
>>>> the source tree. While contemplating adding another exception, I
>>>> came to the conclusion that the best location for declarations
>>>> depends on context. Let the programmer declare variables where it
>>>> is best for legibility. Don't try to define all possible
>>>> scenarios/exceptions.
>
> ...
>
>>> Even if the compiler does reliably warn, I think the code pattern
>>> remains misleading to contributors, as the flow control flaw is
>>> very non-obvious.
>>
>> Yup. Strong dislike.
>>
>>> Rather than accept the status quo and remove the coding guideline,
>>> I think we should strengthen the guidelines, such that it is
>>> explicitly forbidden in any method that uses 'goto'. Personally
>>> I'd go all the way to -Werror=declaration-after-statement, as
>>
>> I support this.
>>
>>> while C99 mixed decl is appealing,
>>
>> Not to me.
>>
>> I much prefer declarations and statements to be visually distinct.
>> Putting declarations first and separating them from statements with a
>> blank line accomplishes that. Less necessary in languages where
>> declarations are syntactically obvious.
>
> But we already implicitly suggest C99, see commit ae7c80a7bd
> ("error: New macro ERRP_GUARD()"):
>
>  * To use ERRP_GUARD(), add it right at the beginning of the function.
>  * @errp can then be used without worrying about the argument being
>  * NULL or &error_fatal.
>
>  #define ERRP_GUARD()                                              \
>      g_auto(ErrorPropagator) _auto_errp_prop = {.errp = errp};     \
>      do {                                                          \
>          if (!errp || errp == &error_fatal) {                      \
>              errp = &_auto_errp_prop.local_err;                    \
>          }                                                         \
>      } while (0)

We can make ERRP_GUARD() expand into just a declaration:

    #define ERRP_GUARD()                                            \
        g_auto(ErrorPropagator) _auto_errp_prop = {                 \
            .errp = errp,                                           \
            .local_err = ((!errp || errp == &error_fatal            \
                           ? errp = &_auto_errp_prop.local_err     \
                           : NULL),                                 \
                          NULL) }

> Or commit 5626f8c6d4 ("rcu: Add automatically released rcu_read_lock
> variants") with WITH_RCU_READ*:
>
>   util/aio-posix.c:540:5: error: mixing declarations and code is
>   incompatible with standards before C99
>   [-Werror,-Wdeclaration-after-statement]
>       RCU_READ_LOCK_GUARD();
>       ^
>   include/qemu/rcu.h:189:28: note: expanded from macro 'RCU_READ_LOCK_GUARD'
>       g_autoptr(RCUReadAuto) _rcu_read_auto __attribute__((unused)) =
>           rcu_read_auto_lock()
>                              ^

Valid example; RCU_READ_LOCK_GUARD() expands into a declaration. To
enable -Wdeclaration-after-statement, we'd have to futz around with
_Pragma.
Re: [PATCH 2/2] tests/qtest/npcm7xx_emc-test: Connect all NICs to a backend
On 06/02/2024 18.12, Peter Maydell wrote:
> Currently QEMU will warn if there is a NIC on the board that is not
> connected to a backend. By default the '-nic user' will get used for
> all NICs, but if you manually connect a specific NIC to a specific
> backend, then the other NICs on the board have no backend and will be
> warned about:
>
> qemu-system-arm: warning: nic npcm7xx-emc.1 has no peer
> qemu-system-arm: warning: nic npcm-gmac.0 has no peer
> qemu-system-arm: warning: nic npcm-gmac.1 has no peer
>
> So suppress those warnings by manually connecting every NIC on the
> board to some backend.
>
> Signed-off-by: Peter Maydell
> ---
>  tests/qtest/npcm7xx_emc-test.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tests/qtest/npcm7xx_emc-test.c b/tests/qtest/npcm7xx_emc-test.c
> index f7646fae2c9..63f6cadb5cc 100644
> --- a/tests/qtest/npcm7xx_emc-test.c
> +++ b/tests/qtest/npcm7xx_emc-test.c
> @@ -228,7 +228,10 @@ static int *packet_test_init(int module_num, GString *cmd_line)
>       * KISS and use -nic. The driver accepts 'emc0' and 'emc1' as aliases
>       * in the 'model' field to specify the device to match.
>       */
> -    g_string_append_printf(cmd_line, " -nic socket,fd=%d,model=emc%d ",
> +    g_string_append_printf(cmd_line, " -nic socket,fd=%d,model=emc%d "
> +                           "-nic user,model=npcm7xx-emc "
> +                           "-nic user,model=npcm-gmac "
> +                           "-nic user,model=npcm-gmac",

Alternatively, use -nic hubport,hubid=0 in case we ever want to run this
test without slirp support, too (but currently there is already a check
for this in the meson.build file, so -nic user should be fine, too).

Anyway,
Reviewed-by: Thomas Huth
[PATCH trivial] qemu-nbd: mention --tls-hostname option in qemu-nbd --help
This option was not documented.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1240
Signed-off-by: Michael Tokarev
---
 qemu-nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/qemu-nbd.c b/qemu-nbd.c
index bac0b5e3ec..d7b3ccab21 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -114,6 +114,7 @@ static void usage(const char *name)
 "  --tls-creds=ID            use id of an earlier --object to provide TLS\n"
 "  --tls-authz=ID            use id of an earlier --object to provide\n"
 "                            authorization\n"
+"  --tls-hostname=HOSTNAME   override hostname used to check x509 certificate\n"
 "  -T, --trace [[enable=]][,events=][,file=]\n"
 "                            specify tracing options\n"
 "  --fork                    fork off the server process and exit the parent\n"
-- 
2.39.2
Re: [RFC PATCH 0/4] Confidential Guest Support: Introduce kvm_init() and kvm_reset() virtual functions
On 2/6/2024 10:19 PM, Daniel P. Berrangé wrote:
> On Tue, Feb 06, 2024 at 03:28:48AM -0500, Xiaoyao Li wrote:
>> This series is inspired and suggested by Daniel:
>> https://lore.kernel.org/qemu-devel/zbfoqseuv6_zw...@redhat.com/
>>
>> Currently, different confidential VMs in different architectures have
>> their own specific *_kvm_init() (and some have *_kvm_reset()) exposed
>> for KVM stuff when it's a confidential VM. e.g., sev_kvm_init() for
>> x86 SEV, pef_kvm_init() and pef_kvm_reset() for PPC PEF, and
>> s390_pv_init() for s390 PV VMs.
>>
>> Introduce generic .kvm_init() and .kvm_reset() functions in
>> ConfidentialGuestSupportClass, so that different cgs technologies in
>> different architectures can implement their own, while the common cgs
>> interface can be used. This RFC implements two helper functions,
>> confidential_guest_kvm_init() and confidential_guest_kvm_reset(), in
>> Patch 1. In the following patches, they are called in arch-specific
>> implementations. X86 will benefit more from the generic
>> implementation when TDX support is added.
>>
>> There is one further step possible: calling
>> confidential_guest_kvm_init() before kvm_arch_init() in kvm_init() in
>> accel/kvm/kvm-all.c. This way, each arch doesn't need to call it in
>> its arch-specific code. X86 fits this, however I'm not sure if ppc
>> and s390 fit it as well, because currently ppc calls it in
>> machine->init() and s390 calls it in MachineClass->init(). I'm not
>> sure if there is any order dependency.
>
> IIUC that s390 call is still a machine->init method, rather than class
> init.

I double-checked the code. Only struct MachineClass has an .init()
function defined, and I found that both ppc and s390 call
confidential_guest_kvm_init() (or their specific cgs kvm_init()) inside
their machine_class->init().

> I think this series is nice, but it's up to the KVM maintainers to
> decide...
>
> With regards,
> Daniel
Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
On Wed, 7 Feb 2024, Richard Henderson wrote: > On 2/7/24 06:48, Alexander Monakov wrote: > > Make buffer_is_zero a 'static inline' function that tests up to three > > bytes from the buffer before handing off to an unrolled loop. This > > eliminates call overhead for most non-zero buffers, and allows to > > optimize out length checks when it is known at compile time (which is > > often the case in Qemu). > > > > Signed-off-by: Alexander Monakov > > Signed-off-by: Mikhail Romanov > > --- > > include/qemu/cutils.h | 28 +++- > > util/bufferiszero.c | 76 --- > > 2 files changed, 47 insertions(+), 57 deletions(-) > > > > diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h > > index 92c927a6a3..62b153e603 100644 > > --- a/include/qemu/cutils.h > > +++ b/include/qemu/cutils.h > > @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); > > /* used to print char* safely */ > > #define STR_OR_NULL(str) ((str) ? (str) : "null") > > > > -bool buffer_is_zero(const void *buf, size_t len); > > +bool buffer_is_zero_len_4_plus(const void *, size_t); > > +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); > > Why 256, when the avx2 routine can handle size 128, and you're about to remove > avx512? (yes, avx2 is bumped to 256-byte chunks in a later patch) > You appear to have missed that select_accel_fn() resolves directly to > buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an > indirect function call. > > I think you should not attempt to expose the 4 vs larger implementation detail > here in the inline function. Presumably the bulk of the benefit in avoiding > the function call is already realized via the three byte spot checks. Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest I'd really like to keep this optimization because so many places in Qemu invoke buffer_is_zero with a constant length, allowing the compiler to optimize out the length test. 
Would you be open to testing availability of optimized variants in the inline wrapper like this: diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 62b153e603..7a2145ffef 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len) return true; } +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) if (len >= 256) { return buffer_is_zero_len_256_plus(vbuf, len); -} else { -return buffer_is_zero_len_4_plus(vbuf, len); } +#endif +return buffer_is_zero_len_4_plus(vbuf, len); } /* Alexander
Re: [RFC PATCH 2/4] i386/sev: Switch to use confidential_guest_kvm_init()
On 2/6/2024 10:16 PM, Daniel P. Berrangé wrote:
> On Tue, Feb 06, 2024 at 03:28:50AM -0500, Xiaoyao Li wrote:
>> Use confidential_guest_kvm_init() instead of calling the SEV-specific
>> sev_kvm_init(). As a bonus, it fits future TDX when TDX implements
>> its own confidential_guest_support and .kvm_init().
>>
>> Move the "TypeInfo sev_guest_info" definition and related functions
>> to the end of the file, to avoid declaring sev_kvm_init() ahead of
>> its definition. Clean up sev-stub.c since it's not needed anymore.
>>
>> Signed-off-by: Xiaoyao Li
>> ---
>>  target/i386/kvm/kvm.c       |   2 +-
>>  target/i386/kvm/meson.build |   2 -
>>  target/i386/kvm/sev-stub.c  |   5 --
>>  target/i386/sev.c           | 120 +++-
>>  target/i386/sev.h           |   2 -
>>  5 files changed, 63 insertions(+), 68 deletions(-)
>>
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 76a66246eb72..bb63bba61fa1 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2534,7 +2534,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>       * mechanisms are supported in future (e.g. TDX), they'll need
>>       * their own initialization either here or elsewhere.
>>       */
>> -    ret = sev_kvm_init(ms->cgs, &local_err);
>> +    ret = confidential_guest_kvm_init(ms->cgs, &local_err);
>
> If you agree with my comment in patch 1 about the API expecting
> non-NULL, then this would need to be conditionalized (same for the 2
> following patches too)

sure. Will change.
> if (ms->cgs) {
>     ret = confidential_guest_kvm_init()
>     if (ret < 0) {
>     }
> }

>>      if (ret < 0) {
>>          error_report_err(local_err);
>>          return ret;
>>
>> diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
>> index 84d9143e6029..e7850981e62d 100644
>> --- a/target/i386/kvm/meson.build
>> +++ b/target/i386/kvm/meson.build
>> @@ -7,8 +7,6 @@ i386_kvm_ss.add(files(
>>
>>  i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
>>
>> -i386_kvm_ss.add(when: 'CONFIG_SEV', if_false: files('sev-stub.c'))
>> -
>>  i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'),
>>                     if_false: files('hyperv-stub.c'))
>>
>>  i386_system_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
>>
>> diff --git a/target/i386/kvm/sev-stub.c b/target/i386/kvm/sev-stub.c
>> index 1be5341e8a6a..4a1560cf8ad7 100644
>> --- a/target/i386/kvm/sev-stub.c
>> +++ b/target/i386/kvm/sev-stub.c
>> @@ -14,8 +14,3 @@
>>  #include "qemu/osdep.h"
>>  #include "sev.h"
>>
>> -int sev_kvm_init(ConfidentialGuestSupport *cgs, Error **errp)
>> -{
>> -    /* If we get here, cgs must be some non-SEV thing */
>> -    return 0;
>> -}
>
> You can actually delete this entire file, since you removed the only
> method in it, and stopped building it in the meson.build patch above.

I intended to do it. Apparently I missed it somehow and didn't catch it
before sending out. Will fix in the next version.
>> diff --git a/target/i386/sev.c b/target/i386/sev.c
>> index 173de91afe7d..19e79d3631d0 100644
>> --- a/target/i386/sev.c
>> +++ b/target/i386/sev.c
>> @@ -353,63 +353,6 @@ static void sev_guest_set_kernel_hashes(Object *obj, bool value, Error **errp)
>>      sev->kernel_hashes = value;
>>  }
>>
>> -static void
>> -sev_guest_class_init(ObjectClass *oc, void *data)
>> -{
>> -    object_class_property_add_str(oc, "sev-device",
>> -                                  sev_guest_get_sev_device,
>> -                                  sev_guest_set_sev_device);
>> -    object_class_property_set_description(oc, "sev-device",
>> -        "SEV device to use");
>> -    object_class_property_add_str(oc, "dh-cert-file",
>> -                                  sev_guest_get_dh_cert_file,
>> -                                  sev_guest_set_dh_cert_file);
>> -    object_class_property_set_description(oc, "dh-cert-file",
>> -        "guest owners DH certificate (encoded with base64)");
>> -    object_class_property_add_str(oc, "session-file",
>> -                                  sev_guest_get_session_file,
>> -                                  sev_guest_set_session_file);
>> -    object_class_property_set_description(oc, "session-file",
>> -        "guest owners session parameters (encoded with base64)");
>> -    object_class_property_add_bool(oc, "kernel-hashes",
>> -                                   sev_guest_get_kernel_hashes,
>> -                                   sev_guest_set_kernel_hashes);
>> -    object_class_property_set_description(oc, "kernel-hashes",
>> -        "add kernel hashes to guest firmware for measured Linux boot");
>> -}
>> -
>> -static void
>> -sev_guest_instance_init(Object *obj)
>> -{
>> -    SevGuestState *sev = SEV_GUEST(obj);
>> -
>> -    sev->sev_device = g_strdup(DEFAULT_SEV_DEVICE);
>> -    sev->policy = DEFAULT_GUEST_POLICY;
>> -    object_property_add_uint32_ptr(obj, "policy", &sev->policy,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -    object_property_add_uint32_ptr(obj, "handle", &sev->handle,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -    object_property_add_uint32_ptr(obj, "cbitpos", &sev->cbitpos,
>> -                                   OBJ_PROP_FLAG_READWRITE);
>> -
Re: [PATCH 5/5] monitor: use aio_co_reschedule_self()
Stefan Hajnoczi writes: > The aio_co_reschedule_self() API is designed to avoid the race > condition between scheduling the coroutine in another AioContext and > yielding. > > The QMP dispatch code uses the open-coded version that appears > susceptible to the race condition at first glance: > > aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self()); > qemu_coroutine_yield(); > > The code is actually safe because the iohandler and qemu_aio_context > AioContext run under the Big QEMU Lock. Nevertheless, set a good example > and use aio_co_reschedule_self() so it's obvious that there is no race. > > Suggested-by: Hanna Reitz > Signed-off-by: Stefan Hajnoczi Acked-by: Markus Armbruster Feel free to merge this together with the remainder of the series.
Re: [PATCH] qapi/migration: Add missing tls-authz documentation
On Wed, Feb 07, 2024 at 07:07:58AM +0100, Markus Armbruster wrote: > pet...@redhat.com writes: > > > From: Peter Xu > > > > As reported in Markus's recent enforcement series on qapi doc [1], we > > accidentally miss one entry for tls-authz. Add it. Then we can drop > > @MigrateSetParameters from documentation-exceptions safely later. > > > > [1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com > > > > Cc: Daniel P. Berrangé > > Cc: Fabiano Rosas > > Reported-by: Markus Armbruster > > Signed-off-by: Peter Xu > > --- > > qapi/migration.json | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/qapi/migration.json b/qapi/migration.json > > index 819708321d..f4c5f59e01 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -980,6 +980,10 @@ > > # 2.9) Previously (since 2.7), this was reported by omitting > > # tls-hostname instead. > > # > > +# @tls-authz: ID of the 'authz' object subclass that provides access > > +# control checking of the TLS x509 certificate distinguished name. > > +# (Since 4.0) > > +# > > # @max-bandwidth: to set maximum speed for migration. maximum speed > > # in bytes per second. (Since 2.8) > > # > > Reviewed-by: Markus Armbruster > > I propose I queue this right after [1] with the update to pragma.json > squashed in (appended), and the sentence "Then we can drop ... later" > dropped. > > Thanks for your help! > > > diff --git a/qapi/pragma.json b/qapi/pragma.json > index 7ac05ccc26..6929ab776e 100644 > --- a/qapi/pragma.json > +++ b/qapi/pragma.json > @@ -69,7 +69,6 @@ > 'JSONType', > 'KeyValueKind', > 'MemoryDeviceInfoKind', > -'MigrateSetParameters', > 'NetClientDriver', > 'ObjectType', > 'PciMemoryRegion', > Yes, please. Or queue this prior to that series, then below diff can be squashed into the other patch; either way works. Thanks Markus! -- Peter Xu
Re: [PATCH] hw/intc: Handle the error of IOAPICCommonClass.realize()
Zhao Liu writes:

> Hi Philippe,
>
> On Wed, Jan 31, 2024 at 05:48:24PM +0100, Philippe Mathieu-Daudé wrote:
>> Date: Wed, 31 Jan 2024 17:48:24 +0100
>> From: Philippe Mathieu-Daudé
>> Subject: Re: [PATCH] hw/intc: Handle the error of IOAPICCommonClass.realize()
>>
>> Hi Zhao,
>>
>> On 31/1/24 15:29, Zhao Liu wrote:
>> > From: Zhao Liu
>> >
>> > IOAPICCommonClass implements its own private realize(), and this
>> > private realize() allows error.
>> >
>> > Therefore, return directly if IOAPICCommonClass.realize() meets error.
>> >
>> > Signed-off-by: Zhao Liu
>> > ---
>> >  hw/intc/ioapic_common.c | 3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/hw/intc/ioapic_common.c b/hw/intc/ioapic_common.c
>> > index cb9bf6214608..3772863377c2 100644
>> > --- a/hw/intc/ioapic_common.c
>> > +++ b/hw/intc/ioapic_common.c
>> > @@ -162,6 +162,9 @@ static void ioapic_common_realize(DeviceState *dev, Error **errp)
>> >      info = IOAPIC_COMMON_GET_CLASS(s);
>> >      info->realize(dev, errp);
>> > +    if (*errp) {
>> > +        return;
>> > +    }

This is wrong, although it'll work in practice. It's wrong, because
dereferencing @errp requires ERRP_GUARD(). qapi/error.h:

 * = Why, when and how to use ERRP_GUARD() =
 *
 * Without ERRP_GUARD(), use of the @errp parameter is restricted:
 * - It must not be dereferenced, because it may be null.
 * - It should not be passed to error_prepend() or
 *   error_append_hint(), because that doesn't work with &error_fatal.
 * ERRP_GUARD() lifts these restrictions.
 *
 * To use ERRP_GUARD(), add it right at the beginning of the function.
 * @errp can then be used without worrying about the argument being
 * NULL or &error_fatal.
 *
 * Using it when it's not needed is safe, but please avoid cluttering
 * the source with useless code.

It'll work anyway, because the caller never passes null.
Obvious fix:

diff --git a/hw/intc/ioapic_common.c b/hw/intc/ioapic_common.c
index cb9bf62146..280404cba5 100644
--- a/hw/intc/ioapic_common.c
+++ b/hw/intc/ioapic_common.c
@@ -152,6 +152,7 @@ static int ioapic_dispatch_post_load(void *opaque, int version_id)
 
 static void ioapic_common_realize(DeviceState *dev, Error **errp)
 {
+    ERRP_GUARD();
     IOAPICCommonState *s = IOAPIC_COMMON(dev);
     IOAPICCommonClass *info;

>> Could be clearer to deviate from DeviceRealize and let the
>> handler return a boolean:
>>
>> -- >8 --
>> diff --git a/hw/intc/ioapic_internal.h b/hw/intc/ioapic_internal.h
>> index 37b8565539..9664bb3e00 100644
>> --- a/hw/intc/ioapic_internal.h
>> +++ b/hw/intc/ioapic_internal.h
>> @@ -92,3 +92,3 @@ struct IOAPICCommonClass {
>>
>> -    DeviceRealize realize;
>> +    bool (*realize)(DeviceState *dev, Error **errp);

qapi/error.h advises:

 * - Whenever practical, also return a value that indicates success /
 *   failure.  This can make the error checking more concise, and can
 *   avoid useless error object creation and destruction.  Note that
 *   we still have many functions returning void.  We recommend
 *   • bool-valued functions return true on success / false on failure,
 *   • pointer-valued functions return non-null / null pointer, and
 *   • integer-valued functions return non-negative / negative.

The patch then becomes

     info = IOAPIC_COMMON_GET_CLASS(s);
-    info->realize(dev, errp);
+    if (!info->realize(dev, errp)) {
+        return;
+    }

DeviceClass and BusClass callbacks realize, unrealize ignore this
advice: they return void. Why? Following the advice makes calls easier
to read, but the callees have to do a tiny bit of extra work: return
something. Good trade when we have at least as many callers as
callees. But these callbacks have many more callees: many devices
implement them, but only a few places call. Changing them to return
something looked like more trouble than it's worth, so we didn't.

> What about I change the name of this interface?
> Maybe ioapic_realize(), to distinguish it from DeviceClass.realize().

I wouldn't bother.

>> DeviceUnrealize unrealize;
>
> Additionally, if I change the pattern of realize(), should I also avoid
> the DeviceUnrealize macro for symmetry's sake and just declare a
> similar function pointer as you said?
>
> Further, do you think it's necessary to introduce InternalRealize and
> InternalUnrealize macros for qdev

You mean typedefs?

> for qdev to wrap these special realize/unrealize
> to differentiate them from normal DeviceRealize/DeviceUnrealize?
>
> Because I found that this pattern of realize() (i.e. registering the
> realize() of the child class in the parent class instead of
> DeviceClass, and then calling the registered realize() in parent
> realize()) is also widely used in many cases:
>
> * xen_block_realize()
> * virtser_port_device_realize()
> * x86_iommu_realize()
> * virtio_input_device_realize()
> * apic_common_realize()
> * pc_dimm_realize()
> * virtio_device_realize()
> ...

Yes. When a
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
On Tue, 6 Feb 2024, Elena Ufimtseva wrote: > Hello Alexander > > On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov > wrote: > > > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > > routines are invoked much more rarely in normal use when most buffers > > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > > frequency and voltage transition periods during which the CPU operates > > at reduced performance, as described in > > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html > > > I would like to point out that the frequency scaling is not currently an > issue on AMD Zen4 Genoa CPUs, for example. > And microcode architecture description here: > https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf > Although, the cpu frequency downscaling mentioned in the above document is > only in relation to floating point operations. > But from other online discussions I gather that the data path for the > integer registers in Zen4 is also 256 bits and it allows to avoid > frequency downscaling for FP and heavy instructions. Yes, that's correct: in particular, on Zen 4 512-bit vector loads occupy load ports for two consecutive cycles, so from load throughput perspective there's no difference between 256-bit vectors and 512-bit vectors. Generally AVX-512 still has benefits on Zen 4 since it's a richer instruction set (it also reduces pressure in the CPU front-end and is more power-efficient), but as the new AVX2 buffer_is_zero is saturating load ports I would expect that AVX512 can exceed its performance only by a small margin if at all, not anywhere close to 2x. > And looking at the optimizations for AVX2 in your other patch, would > unrolling the loop for AVX512 ops benefit from the speedup taken that the > data path has the same width? 
No, 256-bit datapath on Zen 4 means that it's easier to saturate it with 512-bit loads than with 256-bit loads, so an AVX512 loop is roughly comparable to a similar AVX-256 loop unrolled twice. Aside: AVX512 variant needs a little more thought to use VPTERNLOG properly. > If the frequency downscaling is not observed on some of the CPUs, can > AVX512 be maintained and used selectively for some > of the CPUs? Please note that a properly optimized buffer_is_zero is limited by load throughput, not ALUs. On Zen 4 AVX2 is sufficient to saturate L1 cache load bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits of AVX-512 diminish more and more. I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial there for buffer_is_zero for reasons like reaching higher turbo clocks or higher memory parallelism. Finally, let's consider a somewhat broader perspective. Let's suppose buffer_is_zero takes 50% of overall application runtime, and 9 out of 10 buffers are found out to be non-zero in the inline wrapper that samples three bytes. Then the vectorized routine takes about 5% of application time, and speeding it up even by 20% only shaves off 1% from overall execution time. Alexander
Re: [PATCH] qapi/migration: Add missing tls-authz documentation
pet...@redhat.com writes: > From: Peter Xu > > As reported in Markus's recent enforcement series on qapi doc [1], we > accidentally miss one entry for tls-authz. Add it. Then we can drop > @MigrateSetParameters from documentation-exceptions safely later. > > [1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com > > Cc: Daniel P. Berrangé > Cc: Fabiano Rosas > Reported-by: Markus Armbruster > Signed-off-by: Peter Xu > --- > qapi/migration.json | 4 > 1 file changed, 4 insertions(+) > > diff --git a/qapi/migration.json b/qapi/migration.json > index 819708321d..f4c5f59e01 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -980,6 +980,10 @@ > # 2.9) Previously (since 2.7), this was reported by omitting > # tls-hostname instead. > # > +# @tls-authz: ID of the 'authz' object subclass that provides access > +# control checking of the TLS x509 certificate distinguished name. > +# (Since 4.0) > +# > # @max-bandwidth: to set maximum speed for migration. maximum speed > # in bytes per second. (Since 2.8) > # Reviewed-by: Markus Armbruster I propose I queue this right after [1] with the update to pragma.json squashed in (appended), and the sentence "Then we can drop ... later" dropped. Thanks for your help! diff --git a/qapi/pragma.json b/qapi/pragma.json index 7ac05ccc26..6929ab776e 100644 --- a/qapi/pragma.json +++ b/qapi/pragma.json @@ -69,7 +69,6 @@ 'JSONType', 'KeyValueKind', 'MemoryDeviceInfoKind', -'MigrateSetParameters', 'NetClientDriver', 'ObjectType', 'PciMemoryRegion',
Re: [PATCH v3 08/17] plugins: add inline operation per vcpu
On 2/7/24 07:45, Richard Henderson wrote:
> On 2/6/24 19:24, Pierrick Bouvier wrote:
>> --- a/accel/tcg/plugin-gen.c
>> +++ b/accel/tcg/plugin-gen.c
>> @@ -442,6 +442,13 @@ static TCGOp *append_inline_cb(const struct qemu_plugin_dyn_cb *cb,
>>      char *ptr = cb->userp;
>>      size_t elem_size = 0;
>>      size_t offset = 0;
>> +    if (!ptr) {
>> +        /* use inline entry */
>> +        ptr = cb->inline_insn.entry.score->data->data;
>
> This value will not survive the first resize. You need to add a
> pointer dereference from the first "data".

If you look at the scoreboard patch, you'll notice TBs are flushed when
we resize, and thus invalidate the pointer. We discussed this with Alex
previously, and he recommended implementing this instead of adding
another indirection.

By the way, this is what created the need to fix the cpu_init hook call
site, to be able to call start/end exclusive. Thus the related patches
at the beginning of the series.

r~
Re: [PATCH v3 07/17] plugins: implement inline operation relative to cpu_index
On 2/7/24 07:42, Richard Henderson wrote: On 2/6/24 19:24, Pierrick Bouvier wrote: Instead of working on a fixed memory location, allow to address it based on cpu_index, an element size and a given offset. Result address: ptr + offset + cpu_index * element_size. With this, we can target a member in a struct array from a base pointer. Current semantic is not modified, thus inline operation still targets always the same memory location. Signed-off-by: Pierrick Bouvier --- plugins/plugin.h | 2 +- accel/tcg/plugin-gen.c | 65 +++--- plugins/api.c | 3 +- plugins/core.c | 12 +--- 4 files changed, 65 insertions(+), 17 deletions(-) diff --git a/plugins/plugin.h b/plugins/plugin.h index fd93a372803..77ed10689ca 100644 --- a/plugins/plugin.h +++ b/plugins/plugin.h @@ -100,7 +100,7 @@ void plugin_register_vcpu_mem_cb(GArray **arr, enum qemu_plugin_mem_rw rw, void *udata); -void exec_inline_op(struct qemu_plugin_dyn_cb *cb); +void exec_inline_op(struct qemu_plugin_dyn_cb *cb, int cpu_index); int plugin_num_vcpus(void); diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c index b37ce7683e6..68dee4c68d3 100644 --- a/accel/tcg/plugin-gen.c +++ b/accel/tcg/plugin-gen.c @@ -132,16 +132,28 @@ static void gen_empty_udata_cb_no_rwg(void) */ static void gen_empty_inline_cb(void) { +TCGv_i32 cpu_index = tcg_temp_ebb_new_i32(); +TCGv_ptr cpu_index_as_ptr = tcg_temp_ebb_new_ptr(); TCGv_i64 val = tcg_temp_ebb_new_i64(); TCGv_ptr ptr = tcg_temp_ebb_new_ptr(); +tcg_gen_ld_i32(cpu_index, tcg_env, + -offsetof(ArchCPU, env) + offsetof(CPUState, cpu_index)); +/* pass an immediate != 0 so that it doesn't get optimized away */ +tcg_gen_muli_i32(cpu_index, cpu_index, 0xdeadbeef); You don't need a random immediate here. You can just as easily use tcg_gen_mul_i32(cpu_index, cpu_index, cpu_index); with a similar comment about the true size being inserted later. Followed the tcg_gen_addi_i64 that was using this pattern in the same file. I'll change this to what you recommend. 
Otherwise, Reviewed-by: Richard Henderson r~
Re: [PATCH v8 10/21] i386: Split topology types of CPUID[0x1F] from the definitions of CPUID[0xB]
On 31/1/24 11:13, Zhao Liu wrote: From: Zhao Liu CPUID[0xB] defines SMT, Core and Invalid types, and this leaf is shared by Intel and AMD CPUs. But for extended topology levels, Intel CPU (in CPUID[0x1F]) and AMD CPU (in CPUID[0x8026]) have the different definitions with different enumeration values. Though CPUID[0x8026] hasn't been implemented in QEMU, to avoid possible misunderstanding, split topology types of CPUID[0x1F] from the definitions of CPUID[0xB] and introduce CPUID[0x1F]-specific topology types. Signed-off-by: Zhao Liu Tested-by: Babu Moger Tested-by: Yongwei Ma Acked-by: Michael S. Tsirkin --- Changes since v3: * New commit to prepare to refactor CPUID[0x1F] encoding. --- target/i386/cpu.c | 14 +++--- target/i386/cpu.h | 13 + 2 files changed, 16 insertions(+), 11 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 05/17] plugins: scoreboard API
On 2/7/24 07:21, Richard Henderson wrote:
> On 2/6/24 19:24, Pierrick Bouvier wrote:
>> We introduce a cpu-local storage, automatically managed (and extended)
>> by QEMU itself. Plugins allocate a scoreboard, and don't have to deal
>> with how many cpus are launched. This API will be used by new inline
>> functions, but callbacks can benefit from this as well. This way,
>> they can operate without a global lock for simple operations. At any
>> point during execution, any scoreboard will be dimensioned with at
>> least qemu_plugin_num_vcpus entries.
>>
>> New functions:
>> - qemu_plugin_scoreboard_find
>> - qemu_plugin_scoreboard_free
>> - qemu_plugin_scoreboard_new
>>
>> In addition, we define a qemu_plugin_u64, which is a simple struct
>> holding a pointer to a scoreboard and a given offset. This allows a
>> scoreboard to contain structs, without having to pass the offset to
>> every operation on a specific field. Since most plugins simply
>> collect a sum of per-cpu values, qemu_plugin_u64 directly supports
>> this operation as well.
>>
>> New functions:
>> - qemu_plugin_u64_add
>> - qemu_plugin_u64_get
>> - qemu_plugin_u64_set
>> - qemu_plugin_u64_sum
>>
>> New macros:
>> - qemu_plugin_scoreboard_u64
>> - qemu_plugin_scoreboard_u64_in_struct
>
> I think the u64 stuff should be a second patch built upon the basic
> scoreboard support.

You're right, should be easier to review.

>> +/* A scoreboard is an array of values, indexed by vcpu_index */
>> +struct qemu_plugin_scoreboard {
>> +    GArray *data;
>> +};
>
> Unnecessary? Generates an extra pointer dereference for no apparent
> benefit. Alternately, might be useful for other data structure
> changes...

I thought to change it to a typedef after removing other members. Will
do if you noticed this too.

>> +/**
>> + * typedef qemu_plugin_u64 - uint64_t member of an entry in a scoreboard
>> + *
>> + * This field allows to access a specific uint64_t member in one given
>> + * entry, located at a specified offset. Inline operations expect this
>> + * as entry.
>> + */
>> +typedef struct {
>> +    struct qemu_plugin_scoreboard *score;
>
> Embed the struct instead?
Several qemu_plugin_u64 can point to the same scoreboard, so it has to
be a pointer. It saves a scoreboard pointer + offset for a given entry.

>> @@ -31,6 +31,9 @@ struct qemu_plugin_state {
>>      * but with the HT we avoid adding a field to CPUState.
>>      */
>>     GHashTable *cpu_ht;
>> +    /* Scoreboards, indexed by their addresses. */
>> +    GHashTable *scoreboards;
>
> Why a hash table? All you want is to be able to iterate through all,
> and add/remove easily. Seems like a QLIST would be better, and the
> QLIST_ENTRY member would make struct qemu_plugin_scoreboard useful.

I thought that having O(1) removal was a nice property, compared to a
linked list. I can switch to a QLIST if you still think it's better.
What do you mean by "make struct qemu_plugin_scoreboard useful"?

r~
Re: [PATCH v8 08/21] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()
On 31/1/24 11:13, Zhao Liu wrote:
> From: Zhao Liu
>
> In cpu_x86_cpuid(), there are many variables representing the cpu
> topology, e.g., topo_info, cs->nr_cores and cs->nr_threads.
>
> Since the names of cs->nr_cores/cs->nr_threads do not accurately
> represent their meaning, their use is prone to confusion and
> mistakes. And the structure X86CPUTopoInfo names its members clearly,
> thus the variable "topo_info" should be preferred.
>
> In addition, to use the topology variables uniformly in
> cpu_x86_cpuid(), replace env->dies with topo_info.dies_per_pkg as
> well.
>
> Suggested-by: Robert Hoo
> Tested-by: Yongwei Ma
> Signed-off-by: Zhao Liu
> Reviewed-by: Xiaoyao Li
> ---
> Changes since v7:
> * Renamed cpus_per_pkg to threads_per_pkg. (Xiaoyao)
> * Dropped Michael/Babu's Acked/Tested tags since the code changed.
> * Re-added Yongwei's Tested tag for his re-testing.
> * Added Xiaoyao's Reviewed tag.
>
> Changes since v3:
> * Fixed typo. (Babu)
>
> Changes since v1:
> * Extracted cores_per_socket from the code block and used it as a
>   local variable for cpu_x86_cpuid(). (Yanan)
> * Removed the vcpus_per_socket variable and used cpus_per_pkg
>   directly. (Yanan)
> * Replaced env->dies with topo_info.dies_per_pkg in cpu_x86_cpuid().
> ---
> target/i386/cpu.c | 31 ++-
> 1 file changed, 18 insertions(+), 13 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] docs/style: allow C99 mixed declarations
On 6/2/24 06:53, Markus Armbruster wrote:
> Daniel P. Berrangé writes:
>> On Mon, Feb 05, 2024 at 12:18:19PM -0500, Stefan Hajnoczi wrote:
>>> C99 mixed declarations support interleaving of local variable
>>> declarations and code. The coding style "generally" forbids C99
>>> mixed declarations with some exceptions to the rule. This rule is
>>> not checked by checkpatch.pl and naturally there are violations in
>>> the source tree.
>>>
>>> While contemplating adding another exception, I came to the
>>> conclusion that the best location for declarations depends on
>>> context. Let the programmer declare variables where it is best for
>>> legibility. Don't try to define all possible
>>> scenarios/exceptions.
>> ...
>> Even if the compiler does reliably warn, I think the code pattern
>> remains misleading to contributors, as the flow control flaw is very
>> non-obvious.
>
> Yup. Strong dislike.
>
>> Rather than accept the status quo and remove the coding guideline, I
>> think we should strengthen the guidelines, such that it is
>> explicitly forbidden in any method that uses 'goto'. Personally I'd
>> go all the way to -Werror=declaration-after-statement, as
>
> I support this.
>
>> while C99 mixed decl is appealing,
>
> Not to me.
>
> I much prefer declarations and statements to be visually distinct.
> Putting declarations first and separating them from statements with a
> blank line accomplishes that. Less necessary in languages where
> declarations are syntactically obvious.

But we already implicitly suggest C99, see commit ae7c80a7bd
("error: New macro ERRP_GUARD()"):

 * To use ERRP_GUARD(), add it right at the beginning of the function.
 * @errp can then be used without worrying about the argument being
 * NULL or &error_fatal.
#define ERRP_GUARD()                                            \
    g_auto(ErrorPropagator) _auto_errp_prop = {.errp = errp};   \
    do {                                                        \
        if (!errp || errp == &error_fatal) {                    \
            errp = &_auto_errp_prop.local_err;                  \
        }                                                       \
    } while (0)

Or commit 5626f8c6d4 ("rcu: Add automatically released rcu_read_lock
variants") with WITH_RCU_READ*:

util/aio-posix.c:540:5: error: mixing declarations and code is
incompatible with standards before C99
[-Werror,-Wdeclaration-after-statement]
    RCU_READ_LOCK_GUARD();
    ^
include/qemu/rcu.h:189:28: note: expanded from macro 'RCU_READ_LOCK_GUARD'
    g_autoptr(RCUReadAuto) _rcu_read_auto __attribute__((unused)) =
rcu_read_auto_lock()
    ^
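One direction discussed in this thread is making a guard macro expand into a single declaration, so it stays legal under -Werror=declaration-after-statement. Below is a minimal, self-contained sketch of that comma-expression trick. The names (`Propagator`, `ERRP_GUARD_DECL`, `set_answer`) and the `int *` error type are illustrative stand-ins, not QEMU's real `Error` machinery:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for QEMU's ErrorPropagator; fields are illustrative only. */
typedef struct {
    int *errp;       /* caller's errp (QEMU uses Error **) */
    int local_err;   /* storage used when the caller passed NULL */
} Propagator;

/*
 * Expands to a single declaration: the second initializer rewrites
 * `errp` to point at our local storage when it is NULL, using a comma
 * expression so the whole macro stays an initializer, not a statement.
 * Like QEMU's ERRP_GUARD(), it requires a variable literally named
 * `errp` in scope.
 */
#define ERRP_GUARD_DECL()                                             \
    Propagator _prop = {                                              \
        .errp = errp,                                                 \
        .local_err = ((!errp ? errp = &_prop.local_err : NULL), 0) }

static int set_answer(int *errp)
{
    ERRP_GUARD_DECL();
    int done = 1;   /* still fine with -Werror=declaration-after-statement */

    *errp = 42;     /* errp is never NULL past the guard */
    (void)done;
    return _prop.local_err;
}
```

The point of the sketch: because the macro is a declaration rather than a `do { } while (0)` statement, further declarations may follow it without tripping the warning.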
[PATCH] hw/char/pl011: Add support for loopback
This patch adds loopback for sent characters as well as modem-control
signals. Loopback of send and modem-control is often used for UART
self-tests in real hardware, but is missing from the current pl011
model, resulting in self-test failures when running in QEMU.

Signed-off-by: Tong Ho
Signed-off-by: Francisco Iglesias
---
 hw/char/pl011.c | 51 +++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/hw/char/pl011.c b/hw/char/pl011.c
index 855cb82d08..3c0e07aa35 100644
--- a/hw/char/pl011.c
+++ b/hw/char/pl011.c
@@ -121,6 +121,51 @@ static void pl011_update(PL011State *s)
     }
 }

+static void pl011_put_fifo(void *opaque, uint32_t value);
+
+static bool pl011_is_loopback(PL011State *s)
+{
+    return !!(s->cr & (1U << 7));
+}
+
+static void pl011_tx_loopback(PL011State *s, uint32_t value)
+{
+    if (pl011_is_loopback(s)) {
+        pl011_put_fifo(s, value);
+    }
+}
+
+static uint32_t pl011_cr_loopback(PL011State *s, bool update)
+{
+    uint32_t cr = s->cr;
+    uint32_t fr = s->flags;
+    uint32_t ri = 1 << 8, dcd = 1 << 2, dsr = 1 << 1, cts = 1 << 0;
+    uint32_t out2 = 1 << 13, out1 = 1 << 12, rts = 1 << 11, dtr = 1 << 10;
+
+    if (!pl011_is_loopback(s)) {
+        return fr;
+    }
+
+    fr &= ~(ri | dcd | dsr | cts);
+    fr |= (cr & out2) ? ri : 0;   /* FR.RI  <= CR.Out2 */
+    fr |= (cr & out1) ? dcd : 0;  /* FR.DCD <= CR.Out1 */
+    fr |= (cr & rts) ? cts : 0;   /* FR.CTS <= CR.RTS  */
+    fr |= (cr & dtr) ? dsr : 0;   /* FR.DSR <= CR.DTR  */
+
+    if (!update) {
+        return fr;
+    }
+
+    s->int_level &= ~(INT_DSR | INT_DCD | INT_CTS | INT_RI);
+    s->int_level |= (fr & dsr) ? INT_DSR : 0;
+    s->int_level |= (fr & dcd) ? INT_DCD : 0;
+    s->int_level |= (fr & cts) ? INT_CTS : 0;
+    s->int_level |= (fr & ri) ? INT_RI : 0;
+    pl011_update(s);
+
+    return fr;
+}
+
 static bool pl011_is_fifo_enabled(PL011State *s)
 {
     return (s->lcr & LCR_FEN) != 0;
@@ -172,7 +217,7 @@ static uint64_t pl011_read(void *opaque, hwaddr offset,
         r = s->rsr;
         break;
     case 6: /* UARTFR */
-        r = s->flags;
+        r = pl011_cr_loopback(s, false);
         break;
     case 8: /* UARTILPR */
         r = s->ilpr;
@@ -267,6 +312,7 @@ static void pl011_write(void *opaque, hwaddr offset,
          * qemu_chr_fe_write and background I/O callbacks */
         qemu_chr_fe_write_all(&s->chr, &ch, 1);
         s->int_level |= INT_TX;
+        pl011_tx_loopback(s, ch);
         pl011_update(s);
         break;
     case 1: /* UARTRSR/UARTECR */
@@ -300,8 +346,9 @@ static void pl011_write(void *opaque, hwaddr offset,
         pl011_set_read_trigger(s);
         break;
     case 12: /* UARTCR */
-        /* ??? Need to implement the enable and loopback bits. */
+        /* ??? Need to implement the enable bit. */
        s->cr = value;
+        pl011_cr_loopback(s, true);
         break;
     case 13: /* UARTIFS */
         s->ifl = value;
-- 
2.25.1
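The CR-to-FR bit routing done by pl011_cr_loopback() above can be isolated as a pure function, which makes the mapping easy to unit-test outside QEMU. This is only a sketch with the same bit positions as the patch; the `PL011State` plumbing and interrupt handling are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Flag-register and control-register bit positions, as in the patch. */
enum {
    FR_RI   = 1 << 8,  FR_DCD  = 1 << 2,  FR_DSR = 1 << 1,  FR_CTS = 1 << 0,
    CR_OUT2 = 1 << 13, CR_OUT1 = 1 << 12, CR_RTS = 1 << 11,
    CR_DTR  = 1 << 10, CR_LBE  = 1 << 7,
};

/* Returns FR with the modem-status bits rerouted from CR when loopback
 * (CR.LBE) is enabled; otherwise FR is returned unchanged. */
static uint32_t loopback_fr(uint32_t cr, uint32_t fr)
{
    if (!(cr & CR_LBE)) {
        return fr;
    }
    fr &= ~(FR_RI | FR_DCD | FR_DSR | FR_CTS);
    fr |= (cr & CR_OUT2) ? FR_RI  : 0;  /* FR.RI  <= CR.Out2 */
    fr |= (cr & CR_OUT1) ? FR_DCD : 0;  /* FR.DCD <= CR.Out1 */
    fr |= (cr & CR_RTS)  ? FR_CTS : 0;  /* FR.CTS <= CR.RTS  */
    fr |= (cr & CR_DTR)  ? FR_DSR : 0;  /* FR.DSR <= CR.DTR  */
    return fr;
}
```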
[PATCH] hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT
This changes the OHCI validation to not assert if invalid data is fed
to the OHCI controller. The poc suggested in
https://bugs.launchpad.net/qemu/+bug/1907042 and then migrated to bug
#303 does the following to feed it a SETUP pid and EndPt of 1:

uint32_t MaxPacket = 64;
uint32_t TDFormat = 0;
uint32_t Skip = 0;
uint32_t Speed = 0;
uint32_t Direction = 0; /* #define OHCI_TD_DIR_SETUP 0 */
uint32_t EndPt = 1;
uint32_t FuncAddress = 0;
ed->attr = (MaxPacket << 16) | (TDFormat << 15) | (Skip << 14) |
           (Speed << 13) | (Direction << 11) | (EndPt << 7) | FuncAddress;
ed->tailp = /*TDQTailPntr= */ 0;
ed->headp = ((/*TDQHeadPntr= */ [0]) & 0xfffffff0) |
            (/* ToggleCarry= */ 0 << 1);
ed->next_ed = (/* NextED= */ 0 & 0xfffffff0);

qemu-fuzz also caught the same issue in #1510. They are both fixed by
this patch.

The if (td.cbp > td.be) logic in ohci_service_td() causes an
ohci_die(). My understanding of OHCI spec 4.3.1.2 Table 4-2 allows
td.cbp to be one byte more than td.be to signal that the buffer has
zero length. The new check in QEMU appears to have been added since
qemu-4.2. This patch includes both fixes since they are located very
close to each other.
Signed-off-by: David Hubbard
---
 hw/usb/hcd-ohci.c   | 9 +++--
 hw/usb/trace-events | 2 ++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/usb/hcd-ohci.c b/hw/usb/hcd-ohci.c
index d73b53f33c..a53808126f 100644
--- a/hw/usb/hcd-ohci.c
+++ b/hw/usb/hcd-ohci.c
@@ -927,6 +927,11 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
     case OHCI_TD_DIR_SETUP:
         str = "setup";
         pid = USB_TOKEN_SETUP;
+        if (OHCI_BM(ed->flags, ED_EN) > 0) { /* setup only allowed to ep 0 */
+            trace_usb_ohci_td_bad_pid(str, ed->flags, td.flags);
+            ohci_die(ohci);
+            return 1;
+        }
         break;
     default:
         trace_usb_ohci_td_bad_direction(dir);
@@ -936,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
     if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) {
         len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
     } else {
-        if (td.cbp > td.be) {
-            trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
+        if (td.cbp > td.be + 1) {
+            trace_usb_ohci_td_bad_buf(td.cbp, td.be);
             ohci_die(ohci);
             return 1;
         }
diff --git a/hw/usb/trace-events b/hw/usb/trace-events
index ed7dc210d3..b47d082fa3 100644
--- a/hw/usb/trace-events
+++ b/hw/usb/trace-events
@@ -28,6 +28,8 @@ usb_ohci_iso_td_data_overrun(int ret, ssize_t len) "DataOverrun %d > %zu"
 usb_ohci_iso_td_data_underrun(int ret) "DataUnderrun %d"
 usb_ohci_iso_td_nak(int ret) "got NAK/STALL %d"
 usb_ohci_iso_td_bad_response(int ret) "Bad device response %d"
+usb_ohci_td_bad_buf(uint32_t cbp, uint32_t be) "Bad cbp = 0x%x > be = 0x%x"
+usb_ohci_td_bad_pid(const char *s, uint32_t edf, uint32_t tdf) "Bad pid %s: ed.flags 0x%x td.flags 0x%x"
 usb_ohci_port_attach(int index) "port #%d"
 usb_ohci_port_detach(int index) "port #%d"
 usb_ohci_port_wakeup(int index) "port #%d"
-- 
2.34.1
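The off-by-one question in the patch (cbp may legitimately be be + 1 to describe a zero-length buffer, per the patch author's reading of OHCI spec 4.3.1.2) can be captured in a small helper. This sketch mirrors the shape of the check in ohci_service_td() but is not QEMU's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the number of bytes described by a TD's CurrentBufferPointer
 * (cbp) and BufferEnd (be). Returns -1 for a malformed TD. A TD buffer
 * spans at most two 4 KiB pages; cbp == be + 1 encodes a zero-length
 * buffer.
 */
static int td_buffer_len(uint32_t cbp, uint32_t be, uint32_t *len)
{
    if ((cbp & 0xfffff000) != (be & 0xfffff000)) {
        /* Buffer crosses a page boundary. */
        *len = (be & 0xfff) + 0x1001 - (cbp & 0xfff);
    } else {
        if (cbp > be + 1) {
            return -1;        /* overrun: reject instead of asserting */
        }
        *len = be + 1 - cbp;  /* cbp == be + 1 yields 0 */
    }
    return 0;
}
```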
Re: [PATCH 4/6] migration/multifd: Zero page transmission on the multifd thread.
On Tue, Feb 06, 2024 at 11:19:06PM +0000, Hao Xiang wrote:
> This implements the zero page detection and handling on the multifd
> threads.
>
> Signed-off-by: Hao Xiang
> ---
> migration/multifd.c | 62 +
> migration/multifd.h | 5
> 2 files changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index a20d0ed10e..c031f947c7 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -11,6 +11,7 @@
>  */
>
> #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
> #include "qemu/rcu.h"
> #include "exec/target_page.h"
> #include "sysemu/sysemu.h"
> @@ -278,6 +279,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>
>         packet->offset[i] = cpu_to_be64(temp);
>     }
> +    for (i = 0; i < p->zero_num; i++) {
> +        /* there are architectures where ram_addr_t is 32 bit */
> +        uint64_t temp = p->zero[i];
> +
> +        packet->offset[p->normal_num + i] = cpu_to_be64(temp);
> +    }
> }

Please note that p->normal_num will be dropped very soon, see:

https://lore.kernel.org/all/20240202102857.110210-6-pet...@redhat.com/

Please use p->pages->num instead. This patch also relies on some
changes in the previous patch.. IMHO we can split the patches better in
this way:

- Patch 1: Add the new parameter "zero-page-detection", supporting
  "none" and "legacy". You'll need to implement "none" here: skip zero
  page detection by returning 0 in save_zero_page() if "none".
- Patch 2: Add the new "multifd" mode on top, implemented completely in
  the same patch.
- Patch 3: introduce ram_save_target_page_multifd()
- Patch 4: test case

If you want to add "zeros" accounting, that can be done as more patches
on top.

Thanks,

-- 
Peter Xu
Re: [PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
On Wed, Feb 07, 2024 at 12:13:10PM +0800, Peter Xu wrote:
> On Tue, Feb 06, 2024 at 11:19:04PM +0000, Hao Xiang wrote:
> > This change extends the MigrationStatus interface to track zero
> > pages and zero bytes counter.
> >
> > Signed-off-by: Hao Xiang
>
> Reviewed-by: Peter Xu

I'll need to scratch this, sorry.. The issue is I forgot we have
"duplicate", which is exactly "zero page"s.. See:

    info->ram->duplicate = stat64_get(&mig_stats.zero_pages);

If you think the name too confusing and want a replacement, maybe it's
fine and maybe we can do that. Then we can keep this zero page counter
introduced, reporting the same value as duplicates, then with a follow
up patch deprecate the "duplicate" parameter. See an example of how to
deprecate in 7b24d326348e1672.

One thing I'm not sure about is whether Libvirt will be fine with
losing "duplicates" after 2+ QEMU major releases. Copying Jiri for
this. My understanding is that Libvirt should be keeping an eye on the
deprecation list and react, but I'd like to double check..

Or we can keep using "duplicates", but I agree it just reads weird..

Thanks,

-- 
Peter Xu
Re: [PATCH 3/6] migration/multifd: Support for zero pages transmission in multifd format.
On Tue, Feb 06, 2024 at 11:19:05PM +0000, Hao Xiang wrote:
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 25cbc6dc6b..a20d0ed10e 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -264,6 +264,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>     packet->flags = cpu_to_be32(p->flags);
>     packet->pages_alloc = cpu_to_be32(p->pages->allocated);
>     packet->normal_pages = cpu_to_be32(p->normal_num);
> +    packet->zero_pages = cpu_to_be32(p->zero_num);

This doesn't look right.. If we only want to fill in the zero
accounting, we shouldn't be touching the multifd packet at all, since
multifd zero page detection is not yet supported. We should only
reference mig_stats.zero_pages.

>     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>     packet->packet_num = cpu_to_be64(p->packet_num);

-- 
Peter Xu
Re: [PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
On Tue, Feb 06, 2024 at 11:19:04PM +, Hao Xiang wrote: > This change extends the MigrationStatus interface to track zero pages > and zero bytes counter. > > Signed-off-by: Hao Xiang Reviewed-by: Peter Xu When post anything QAPI relevant, please always remember to copy QAPI maintainers too, thanks. $ ./scripts/get_maintainer.pl -f qapi/migration.json Eric Blake (supporter:QAPI Schema) Markus Armbruster (supporter:QAPI Schema) Peter Xu (maintainer:Migration) Fabiano Rosas (maintainer:Migration) qemu-devel@nongnu.org (open list:All patches CC here) -- Peter Xu
Re: [PATCH 1/6] migration/multifd: Add new migration option multifd-zero-page.
On Tue, Feb 06, 2024 at 11:19:03PM +0000, Hao Xiang wrote:
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 819708321d..ff033a0344 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -874,6 +874,11 @@
> # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> #     (Since 8.2)
> #
> +# @multifd-zero-page: Multifd zero page checking. If the parameter is
> +#     true, zero page checking is done on the multifd sender thread.
> +#     If the parameter is false, zero page checking is done on the
> +#     migration main thread. Default is set to true. (Since 9.0)

I replied somewhere before on this, but I can try again.. Do you think
it'll be better to introduce a generic parameter for zero page
detection?

- "none" if disabled,
- "legacy" for main thread,
- "multifd" for multifd (software-based).

A string could work, but maybe it's cleaner to introduce a
@MigrationZeroPageDetector enum? When you add more, you can keep
extending that with the single field ("multifd-dsa", etc.).

-- 
Peter Xu
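The three-way parameter Peter proposes would dispatch roughly as below. This is a sketch of the control flow only; the enum constants and `save_zero_page()` signature are hypothetical, and `buffer_is_zero()` here is a simplified stand-in for QEMU's real helper from qemu/cutils.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ZERO_PAGE_DETECTION_NONE,    /* never scan: send every page as data */
    ZERO_PAGE_DETECTION_LEGACY,  /* scan on the migration main thread   */
    ZERO_PAGE_DETECTION_MULTIFD, /* scan on the multifd sender threads  */
} ZeroPageDetection;

/* Simplified stand-in for QEMU's buffer_is_zero(). */
static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the main thread handled the page as a zero page. */
static int save_zero_page(ZeroPageDetection mode,
                          const uint8_t *page, size_t size)
{
    if (mode != ZERO_PAGE_DETECTION_LEGACY) {
        /* "none": skip detection; "multifd": sender threads do it. */
        return 0;
    }
    return buffer_is_zero(page, size);
}
```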
Re: [PATCH v3 08/17] plugins: add inline operation per vcpu
On 2/6/24 19:24, Pierrick Bouvier wrote:
> --- a/accel/tcg/plugin-gen.c
> +++ b/accel/tcg/plugin-gen.c
> @@ -442,6 +442,13 @@ static TCGOp *append_inline_cb(const struct qemu_plugin_dyn_cb *cb,
>     char *ptr = cb->userp;
>     size_t elem_size = 0;
>     size_t offset = 0;
> +    if (!ptr) {
> +        /* use inline entry */
> +        ptr = cb->inline_insn.entry.score->data->data;

This value will not survive the first resize. You need to add a pointer
dereference from the first "data".

r~
RE: [PATCH v0 1/2] aspeed: support uart controller both 0 and 1 base
> -----Original Message-----
> From: Cédric Le Goater
> Sent: Wednesday, February 7, 2024 1:00 AM
> To: Jamin Lin; Peter Maydell; Andrew Jeffery; Joel Stanley;
>     open list:ASPEED BMCs; open list:All patches CC here
> Cc: Troy Lee
> Subject: Re: [PATCH v0 1/2] aspeed: support uart controller both 0 and 1 base
>
> On 2/6/24 04:29, Jamin Lin wrote:
> >> The uart definitions on the AST2700 are different:
> >>
> >> https://github.com/AspeedTech-BMC/linux/blob/aspeed-master-v6.6/arch/arm64/boot/dts/aspeed/aspeed-g7.dtsi
> >>
> >>     serial0 = &uart0;
> >>     serial1 = &uart1;
> >>     serial2 = &uart2;
> >>     serial3 = &uart3;
> >>     serial4 = &uart4;
> >>     serial5 = &uart5;
> >>     serial6 = &uart6;
> >>     serial7 = &uart7;
> >>     serial8 = &uart8;
> >> ...
> >>
> >> I think the names in the DT (and consequently in the QEMU models)
> >> follow the IP names in the datasheet.
> >>
> >> I don't think we care in QEMU, so I would be inclined to change the
> >> indexing of the device names in QEMU and start at 0, which would
> >> introduce a discrepancy for the AST2400, AST2500 and AST2600 SoCs.
> >>
> >> Let's see what the other maintainers have to say.
> >>
> >> Thanks,
> >>
> >> C.
> >
> > Hi Cedric,
> >
> > Did you mean to change the naming of the uart devices to 0-based for
> > all ASPEED SoCs? If yes, it seems we need to make the following
> > changes:
> > 1. add ASPEED_DEV_UART0 in aspeed_soc.h
> > 2. re-define the uart memory map for ast2600, ast10x0, ast2500 and
> >    ast2400 (uart0 -> ASPEED_DEV_UART0)
> >
> > Take ast2600 for example:
> > static const hwaddr aspeed_soc_ast2600_memmap[] = {
> >     [ASPEED_DEV_UART1]  = 0x1E783000, ---> [ASPEED_DEV_UART0]
> >     [ASPEED_DEV_UART2]  = 0x1E78D000, ---> [ASPEED_DEV_UART1]
> >     [ASPEED_DEV_UART3]  = 0x1E78E000,
> >     [ASPEED_DEV_UART4]  = 0x1E78F000,
> >     [ASPEED_DEV_UART5]  = 0x1E784000,
> >     [ASPEED_DEV_UART6]  = 0x1E790000,
> >     [ASPEED_DEV_UART7]  = 0x1E790100,
> >     [ASPEED_DEV_UART8]  = 0x1E790200,
> >     [ASPEED_DEV_UART9]  = 0x1E790300,
> >     [ASPEED_DEV_UART10] = 0x1E790400,
> >     [ASPEED_DEV_UART11] = 0x1E790500,
> >     [ASPEED_DEV_UART12] = 0x1E790600,
> >     [ASPEED_DEV_UART13] = 0x1E790700, ---> [ASPEED_DEV_UART12]
> > };
> >
> > If no, could you please describe it in more detail? Then I can
> > change it and re-send this patch series.
>
> Let's keep the datasheet names. I had forgotten the reason initially,
> and from an HW POV it makes sense to keep them in sync. I will add
> some more comments to the patch.
>
> > By the way, I will send a new patch series to support AST2700 in two
> > weeks. We encountered GIC issues. It seems that QEMU supports GICv3,
> > but SPI is not supported yet.
> >
> > https://github.com/qemu/qemu/blob/master/hw/intc/arm_gicv3_dist.c#L383
> > https://github.com/AspeedTech-BMC/linux/blob/aspeed-master-v6.6/arch/arm64/boot/dts/aspeed/aspeed-g7.dtsi#L229
>
> If you did any hacks or workarounds in the QEMU models, please keep
> them separate from the other patches so that we can discuss.

Okay. Will do.
Thanks, Jamin

> I think that we can discuss it in a new AST2700 patch series.

Sure.

> Thanks,
>
> C.
RE: [v0 0/2] uart base and hardcode boot address 0
> -Original Message- > From: Cédric Le Goater > Sent: Wednesday, February 7, 2024 12:48 AM > To: Jamin Lin ; Peter Maydell > ; Andrew Jeffery ; > Joel Stanley ; open list:ASPEED BMCs > ; open list:All patches CC here > > Cc: Troy Lee > Subject: Re: [v0 0/2] uart base and hardcode boot address 0 > > On 2/5/24 10:14, Jamin Lin wrote: > > v0: > > usually we start at v1, so the next version would be a v2. Indexing again :) > Got it. Thanks-Jamin > > Thanks, > > C. > > > > > 1. support uart controller both 0 and 1 base 2. fix hardcode boot > > address 0 > > > > Jamin Lin (2): > >aspeed: support uart controller both 0 and 1 base > >aspeed: fix hardcode boot address 0 > > > > hw/arm/aspeed.c | 12 > > hw/arm/aspeed_ast10x0.c | 1 + > > hw/arm/aspeed_ast2400.c | 2 ++ > > hw/arm/aspeed_ast2600.c | 1 + > > hw/arm/aspeed_soc_common.c | 4 ++-- > > include/hw/arm/aspeed_soc.h | 1 + > > 6 files changed, 15 insertions(+), 6 deletions(-) > >
Re: [PATCH v3 07/17] plugins: implement inline operation relative to cpu_index
On 2/6/24 19:24, Pierrick Bouvier wrote:
> Instead of working on a fixed memory location, allow addressing it
> based on cpu_index, an element size and a given offset.
> Result address: ptr + offset + cpu_index * element_size.
>
> With this, we can target a member in a struct array from a base
> pointer.
>
> Current semantics are not modified, thus an inline operation still
> always targets the same memory location.
>
> Signed-off-by: Pierrick Bouvier
> ---
> plugins/plugin.h | 2 +-
> accel/tcg/plugin-gen.c | 65 +++---
> plugins/api.c | 3 +-
> plugins/core.c | 12 +---
> 4 files changed, 65 insertions(+), 17 deletions(-)
>
> diff --git a/plugins/plugin.h b/plugins/plugin.h
> index fd93a372803..77ed10689ca 100644
> --- a/plugins/plugin.h
> +++ b/plugins/plugin.h
> @@ -100,7 +100,7 @@ void plugin_register_vcpu_mem_cb(GArray **arr,
>                                  enum qemu_plugin_mem_rw rw,
>                                  void *udata);
>
> -void exec_inline_op(struct qemu_plugin_dyn_cb *cb);
> +void exec_inline_op(struct qemu_plugin_dyn_cb *cb, int cpu_index);
>
> int plugin_num_vcpus(void);
>
> diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
> index b37ce7683e6..68dee4c68d3 100644
> --- a/accel/tcg/plugin-gen.c
> +++ b/accel/tcg/plugin-gen.c
> @@ -132,16 +132,28 @@ static void gen_empty_udata_cb_no_rwg(void)
>  */
> static void gen_empty_inline_cb(void)
> {
> +    TCGv_i32 cpu_index = tcg_temp_ebb_new_i32();
> +    TCGv_ptr cpu_index_as_ptr = tcg_temp_ebb_new_ptr();
>     TCGv_i64 val = tcg_temp_ebb_new_i64();
>     TCGv_ptr ptr = tcg_temp_ebb_new_ptr();
>
> +    tcg_gen_ld_i32(cpu_index, tcg_env,
> +                   -offsetof(ArchCPU, env) + offsetof(CPUState, cpu_index));
> +    /* pass an immediate != 0 so that it doesn't get optimized away */
> +    tcg_gen_muli_i32(cpu_index, cpu_index, 0xdeadbeef);

You don't need a random immediate here. You can just as easily use

    tcg_gen_mul_i32(cpu_index, cpu_index, cpu_index);

with a similar comment about the true size being inserted later.

Otherwise,
Reviewed-by: Richard Henderson

r~
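The address computation stated in the commit message, ptr + offset + cpu_index * element_size, is easy to express in plain C and to check against a struct array. The `VcpuStats` type and names below are illustrative, not the plugin API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per vcpu; the inline op targets one u64 member of it. */
typedef struct {
    uint64_t insns;
    uint64_t mem_accesses;
} VcpuStats;

/* Result address: ptr + offset + cpu_index * element_size. */
static inline void *inline_target(void *ptr, size_t offset,
                                  unsigned int cpu_index, size_t elem_size)
{
    return (char *)ptr + offset + cpu_index * elem_size;
}
```

Passing `offset = 0` and `elem_size = 0` reproduces the old fixed-location behaviour, which is why current semantics are unchanged.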
Re: [PATCH 0/6] Introduce multifd zero page checking.
On Tue, Feb 06, 2024 at 11:19:02PM +0000, Hao Xiang wrote:
> This patchset is based on Juan Quintela's old series here
> https://lore.kernel.org/all/20220802063907.18882-1-quint...@redhat.com/
>
> In the multifd live migration model, there is a single migration main
> thread scanning the page map, queuing the pages to multiple multifd
> sender threads. The migration main thread runs zero page checking on
> every page before queuing the page to the sender threads. Zero page
> checking is a CPU intensive task, and hence having a single thread
> doing all that doesn't scale well. This change introduces a new
> function to run the zero page checking on the multifd sender threads.
> This patchset also lays the groundwork for future changes to offload
> the zero page checking task to accelerator hardware.
>
> Use two Intel 4th generation Xeon servers for testing.
>
> Architecture:        x86_64
> CPU(s):              192
> Thread(s) per core:  2
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               143
> Model name:          Intel(R) Xeon(R) Platinum 8457C
> Stepping:            8
> CPU MHz:             2538.624
> CPU max MHz:         3800.0000
> CPU min MHz:         800.0000
>
> Perform multifd live migration with the below setup:
> 1. VM has 100GB memory. All pages in the VM are zero pages.
> 2. Use tcp socket for live migration.
> 3. Use 4 multifd channels and zero page checking on the migration
>    main thread.
> 4. Use 1/2/4 multifd channels and zero page checking on the multifd
>    sender threads.
> 5. Record migration total time from the sender QEMU console's
>    "info migrate" command.
> 6. Calculate throughput with "100GB / total time".
>
> +--------------------+----------------+------------------+
> | zero-page-checking | total-time(ms) | throughput(GB/s) |
> +--------------------+----------------+------------------+
> | main-thread        | 9629           | 10.38GB/s        |
> | multifd-1-threads  | 6182           | 16.17GB/s        |
> | multifd-2-threads  | 4643           | 21.53GB/s        |
> | multifd-4-threads  | 4143           | 24.13GB/s        |
> +--------------------+----------------+------------------+

This "throughput" is slightly confusing; I was initially surprised to
see a large throughput for idle guests.
IMHO the "total-time" column already explains it. Feel free to drop the
throughput column if there's a repost.

Did you check why 4 channels already mostly reached the top line? Is it
because the main thread is already spinning at 100%?

Thanks,

-- 
Peter Xu
Re: [PATCH] target/riscv: Update $pc after linking to $ra in trans_cm_jalt()
You are right. I'll send patch v2 shortly. Thank you for the reply.

Richard Henderson wrote on Wednesday, February 7, 2024 at 4:24 AM:
> On 2/6/24 23:18, Jason Chien wrote:
> > The original implementation sets $pc to the address read from the
> > jump vector table first and links $ra with the address of the next
> > instruction after the updated $pc. After jumping to the updated $pc
> > and executing the next ret instruction, the program jumps to $ra,
> > which is in the same function currently executing, which results in
> > an infinite loop.
> > This commit reverses the two actions. Firstly, $ra is updated with
> > the address of the next instruction after $pc, and then $pc is set
> > to the address read from the jump vector table.
>
> This is unlikely to be correct in the case the vector table read
> faults, leaving $ra updated.
>
> I guess this got broken with CF_PCREL. Anyway, the solution is to use
> a temporary...
>
> > -/*
> > - * Update pc to current for the non-unwinding exception
> > - * that might come from cpu_ld*_code() in the helper.
> > - */
> > -gen_update_pc(ctx, 0);
> > -gen_helper_cm_jalt(cpu_pc, cpu_env, tcg_constant_i32(a->index));
>
> ... here and then ...
>
> > @@ -307,6 +300,13 @@ static bool trans_cm_jalt(DisasContext *ctx, arg_cm_jalt *a)
> >     gen_set_gpr(ctx, xRA, succ_pc);
> > }
>
> ... copy the temp to cpu_pc here.
>
> > tcg_gen_lookup_and_goto_ptr();
> > ctx->base.is_jmp = DISAS_NORETURN;
> > return true;
>
> r~
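The link/jump ordering under discussion can be modelled in plain C. The essential property of the fix: the jump-table read (which may fault) happens first, into a temporary, before $ra or $pc are touched, so a faulting read leaves both registers intact. This is only a model of the discussion, not the TCG implementation:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;
    uint32_t ra;
} CpuModel;

/*
 * cm.jalt: read the target from the jump vector table, link ra to the
 * next instruction, then jump. Reading into a temporary first mirrors
 * the suggested fix: if the table read faulted, neither ra nor pc
 * would have been modified yet.
 */
static void cm_jalt(CpuModel *cpu, const uint32_t *jvt,
                    unsigned int index, uint32_t insn_len)
{
    uint32_t target = jvt[index];  /* may fault: do this first */

    cpu->ra = cpu->pc + insn_len;  /* link to the next instruction */
    cpu->pc = target;              /* then jump */
}
```

The original buggy order (update pc, then link ra relative to the new pc) would have set ra inside the jumped-to function, producing the infinite ret loop described in the commit message.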
[PATCH] qapi/migration: Add missing tls-authz documentation
From: Peter Xu

As reported in Markus's recent enforcement series on qapi docs [1], we
accidentally missed one entry for tls-authz. Add it. Then we can drop
@MigrateSetParameters from documentation-exceptions safely later.

[1] https://lore.kernel.org/r/20240205074709.3613229-1-arm...@redhat.com

Cc: Daniel P. Berrangé
Cc: Fabiano Rosas
Reported-by: Markus Armbruster
Signed-off-by: Peter Xu
---
 qapi/migration.json | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/qapi/migration.json b/qapi/migration.json
index 819708321d..f4c5f59e01 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -980,6 +980,10 @@
 #     2.9) Previously (since 2.7), this was reported by omitting
 #     tls-hostname instead.
 #
+# @tls-authz: ID of the 'authz' object subclass that provides access
+#     control checking of the TLS x509 certificate distinguished name.
+#     (Since 4.0)
+#
 # @max-bandwidth: to set maximum speed for migration.  maximum speed
 #     in bytes per second.  (Since 2.8)
 #
-- 
2.43.0
Re: [PATCH v3 05/17] plugins: scoreboard API
On 2/6/24 19:24, Pierrick Bouvier wrote:
> We introduce a per-cpu local storage, automatically managed (and
> extended) by QEMU itself. Plugins allocate a scoreboard and don't
> have to deal with how many cpus are launched.
>
> This API will be used by new inline functions, but callbacks can
> benefit from this as well. This way, they can operate without a
> global lock for simple operations. At any point during execution, any
> scoreboard will be dimensioned with at least qemu_plugin_num_vcpus
> entries.
>
> New functions:
> - qemu_plugin_scoreboard_find
> - qemu_plugin_scoreboard_free
> - qemu_plugin_scoreboard_new
>
> In addition, we define a qemu_plugin_u64, a simple struct holding a
> pointer to a scoreboard and a given offset. This makes it possible to
> have a scoreboard containing structs, without having to carry the
> offset through all operations on a specific field. Since most plugins
> simply collect a sum of per-cpu values, qemu_plugin_u64 supports this
> operation directly as well.
>
> New functions:
> - qemu_plugin_u64_add
> - qemu_plugin_u64_get
> - qemu_plugin_u64_set
> - qemu_plugin_u64_sum
>
> New macros:
> - qemu_plugin_scoreboard_u64
> - qemu_plugin_scoreboard_u64_in_struct

I think the u64 stuff should be a second patch built upon the basic
scoreboard support.

> +/* A scoreboard is an array of values, indexed by vcpu_index */
> +struct qemu_plugin_scoreboard {
> +    GArray *data;
> +};

Unnecessary? Generates an extra pointer dereference for no apparent
benefit. Alternately, might be useful for other data structure
changes...

> +/**
> + * typedef qemu_plugin_u64 - uint64_t member of an entry in a scoreboard
> + *
> + * This field allows access to a specific uint64_t member in one
> + * given entry, located at a specified offset. Inline operations
> + * expect this as entry.
> + */
> +typedef struct {
> +    struct qemu_plugin_scoreboard *score;

Embed the struct instead?

> @@ -31,6 +31,9 @@ struct qemu_plugin_state {
>      * but with the HT we avoid adding a field to CPUState.
>      */
>     GHashTable *cpu_ht;
> +    /* Scoreboards, indexed by their addresses. */
> +    GHashTable *scoreboards;

Why a hash table? All you want is to be able to iterate through all,
and add/remove easily. Seems like a QLIST would be better, and the
QLIST_ENTRY member would make struct qemu_plugin_scoreboard useful.

r~
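A rough model of what the commit message describes: a scoreboard is a per-vcpu array of entries, and a qemu_plugin_u64 is a (scoreboard, offset) handle naming one uint64_t field of each entry. The names and layout below (`Scoreboard`, `HandleU64`, `Entry`) are illustrative sketches, not the actual plugin API, and resizing on vcpu hotplug is omitted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void *data;        /* num_vcpus * elem_size bytes, zero-initialized */
    size_t elem_size;
    size_t num_vcpus;
} Scoreboard;

typedef struct {
    Scoreboard *score;
    size_t offset;     /* byte offset of the u64 inside one entry */
} HandleU64;

/* An example per-vcpu entry a plugin might use. */
typedef struct {
    uint64_t insns;
    uint64_t bb;
} Entry;

static Scoreboard *scoreboard_new(size_t elem_size, size_t num_vcpus)
{
    Scoreboard *sb = malloc(sizeof(*sb));
    sb->data = calloc(num_vcpus, elem_size);
    sb->elem_size = elem_size;
    sb->num_vcpus = num_vcpus;
    return sb;
}

static uint64_t *u64_ptr(HandleU64 h, size_t vcpu_index)
{
    char *entry = (char *)h.score->data + vcpu_index * h.score->elem_size;
    return (uint64_t *)(entry + h.offset);
}

/* The common "sum over all vcpus" operation the commit message mentions. */
static uint64_t u64_sum(HandleU64 h)
{
    uint64_t total = 0;
    for (size_t i = 0; i < h.score->num_vcpus; i++) {
        total += *u64_ptr(h, i);
    }
    return total;
}
```

Each vcpu increments only its own entry, which is why no global lock is needed for simple counters.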
Re: [PATCH 00/15] qapi: Require member documentation (with loophole)
On Mon, Feb 05, 2024 at 08:46:54AM +0100, Markus Armbruster wrote: > qapi/migration.json > MigrateSetParameters 1 It's tls-authz. I'll send a patch for this one. Thanks, -- Peter Xu
Re: Re: [PATCH] vdpa-dev: Fix initialisation order to restore VDUSE compatibility
On Tue, Feb 6, 2024 at 4:31 PM Stefano Garzarella wrote:
> On Tue, Feb 06, 2024 at 10:47:40AM +0800, Jason Wang wrote:
> > On Mon, Feb 5, 2024 at 6:51 PM Stefano Garzarella wrote:
> >> On Fri, Feb 02, 2024 at 02:25:21PM +0100, Kevin Wolf wrote:
> >>> VDUSE requires that virtqueues are first enabled before the
> >>> DRIVER_OK status flag is set; with the current API of the kernel
> >>> module, it is impossible to enable the opposite order in our block
> >>> export code because userspace is not notified when a virtqueue is
> >>> enabled.
> >
> > Did this mean virtio-blk will enable a virtqueue after DRIVER_OK?
>
> It's not specific to virtio-blk, but to the generic vdpa device we
> have in QEMU (i.e. vhost-vdpa-device). Yep, after commit
> 6c4825476a4351530bcac17abab72295b75ffe98, virtqueues are enabled
> after DRIVER_OK.

Right.

> > Spec is not clear about this and that's why we introduce
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK.
>
> Ah, I didn't know about this new feature. So after commit
> 6c4825476a4351530bcac17abab72295b75ffe98 the vhost-vdpa-device is not
> complying with the specification, right?

Kind of, but as stated, it's just because the spec is unclear about
the behaviour. There's a chance that the spec will explicitly support
it in the future.

> >> Yeah, IMHO the VDUSE protocol is missing a VDUSE_SET_VQ_READY
> >> message,
> >
> > I think you meant when VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is
> > negotiated.
>
> At this point yes. But if VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is
> not negotiated, should we return an error in the vhost-vdpa kernel
> module if VHOST_VDPA_SET_VRING_ENABLE is called when DRIVER_OK is
> already set?

I'm not sure if this can break some setups or not. It might be better
to leave it as is?

Without VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we don't know if the
parent supports vq_ready after driver_ok. With
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we know the parent supports
vq_ready after driver_ok.
> > If this is true, it seems a little more complicated; for example,
> > does get_backend_features need to be forwarded to the userspace?
>
> I'm not understanding, don't we already have
> VHOST_GET_BACKEND_FEATURES for this? Or do you mean userspace on the
> VDUSE side?

Yes, since in this case the parent is in the userspace, there's no way
for VDUSE to know if user space supports vq_ready after driver_ok or
not. As you may have noticed, we don't have a message for vq_ready,
which implies that vq_ready after driver_ok can't be supported.

> > This seems suboptimal; better to implement this in the spec first
> > and then we can leverage the feature. Or we can have another
> > parameter for the ioctl that creates the vduse device.
>
> I got a little lost, though in vhost-user, the device can always
> expect a vring_enable/disable, so I thought it was not complicated in
> VDUSE.

Yes, the problem is that even assuming we have a message for vq_ready,
there could be a "legacy" userspace that doesn't support it. So in
that case, VDUSE needs to know whether the userspace parent can
support it or not.

> >> I'll start another thread about that, but in the meantime I agree
> >> that we should fix QEMU since we need to work properly with old
> >> kernels as well.
> >>
> >>> This requirement also matches the normal initialisation order as
> >>> done by the generic vhost code in QEMU. However, commit 6c482547
> >>> accidentally changed the order for vdpa-dev and broke access to
> >>> VDUSE devices with this.
> >>>
> >>> This changes vdpa-dev to use the normal order again and use the
> >>> standard vhost callback .vhost_set_vring_enable for this. VDUSE
> >>> devices can be used with vdpa-dev again after this fix.
> >>
> >> I like this approach and the patch LGTM, but I'm a bit worried about
> >> this function in hw/net/vhost_net.c:
> >>
> >> int vhost_set_vring_enable(NetClientState *nc, int enable)
> >> {
> >>     VHostNetState *net = get_vhost_net(nc);
> >>     const VhostOps *vhost_ops = net->dev.vhost_ops;
> >>
> >>     nc->vring_enable = enable;
> >>
> >>     if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
> >>         return vhost_ops->vhost_set_vring_enable(&net->dev, enable);
> >>     }
> >>
> >>     return 0;
> >> }
> >>
> >> @Eugenio, @Jason, should we change some things there if vhost-vdpa
> >> implements the vhost_set_vring_enable callback?
> >
> >Eugenio may know more, I remember we need to enable cvq first for
> >shadow virtqueue to restore some states.
> >
> >>
> >> Do you remember why we didn't implement it from the beginning?
> >
> >It seems the vrings parameter is introduced after vhost-vdpa is
> >implemented.
>
> Sorry, I mean why we didn't implement the vhost_set_vring_enable
> callback for vhost-vdpa from the beginning.

Adding Cindy, who wrote that code, for more thoughts.

Thanks

>
> Thanks,
> Stefano
>
Re: [PATCH v3 04/17] cpu: call plugin init hook asynchronously
On 2/6/24 19:24, Pierrick Bouvier wrote:

This ensures we run during a cpu_exec, which allows calling start/end
exclusive from this init hook (needed for the new scoreboard API
introduced later). Async work is run before any tb is
translated/executed, so we can guarantee plugin init will be called
before any other hook.

The previous change made sure that any idle/resume cb call will not be
done before initializing the plugin for a given vcpu.

Signed-off-by: Pierrick Bouvier
---
 hw/core/cpu-common.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson

r~
Re: [PATCH v3 03/17] plugins: fix order of init/idle/resume callback
On 2/6/24 19:24, Pierrick Bouvier wrote:

We found that vcpu_init_hook was called *after* the idle callback.
vcpu_init is called from cpu_realize_fn, while idle/resume cb are called
from qemu_wait_io_event (in the vcpu thread).

This change ensures we call idle and resume cb only once a plugin was
initialized for a given vcpu.

The next change in the series will run vcpu_init asynchronously, which
will make it run *after* the resume callback as well. So we fix this now.

Signed-off-by: Pierrick Bouvier
---
 plugins/core.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson

r~
[PATCH v3 3/6] target/arm: Adjust and validate mtedesc sizem1
When we added SVE_MTEDESC_SHIFT, we effectively limited the maximum size of MTEDESC. Adjust SIZEM1 to consume the remaining bits (32 - 10 - 5 - 12 == 5). Assert that the data to be stored fits within the field (expecting 8 * 4 - 1 == 31, exact fit). Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/internals.h | 2 +- target/arm/tcg/translate-sve.c | 7 --- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/target/arm/internals.h b/target/arm/internals.h index fc337fe40e..50bff44549 100644 --- a/target/arm/internals.h +++ b/target/arm/internals.h @@ -1278,7 +1278,7 @@ FIELD(MTEDESC, TBI, 4, 2) FIELD(MTEDESC, TCMA, 6, 2) FIELD(MTEDESC, WRITE, 8, 1) FIELD(MTEDESC, ALIGN, 9, 3) -FIELD(MTEDESC, SIZEM1, 12, SIMD_DATA_BITS - 12) /* size - 1 */ +FIELD(MTEDESC, SIZEM1, 12, SIMD_DATA_BITS - SVE_MTEDESC_SHIFT - 12) /* size - 1 */ bool mte_probe(CPUARMState *env, uint32_t desc, uint64_t ptr); uint64_t mte_check(CPUARMState *env, uint32_t desc, uint64_t ptr, uintptr_t ra); diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 7108938251..a88e523cba 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4443,17 +4443,18 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, { unsigned vsz = vec_full_reg_size(s); TCGv_ptr t_pg; +uint32_t sizem1; int desc = 0; assert(mte_n >= 1 && mte_n <= 4); +sizem1 = (mte_n << dtype_msz(dtype)) - 1; +assert(sizem1 <= R_MTEDESC_SIZEM1_MASK >> R_MTEDESC_SIZEM1_SHIFT); if (s->mte_active[0]) { -int msz = dtype_msz(dtype); - desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); desc = FIELD_DP32(desc, MTEDESC, TCMA, s->tcma); desc = FIELD_DP32(desc, MTEDESC, WRITE, is_write); -desc = FIELD_DP32(desc, MTEDESC, SIZEM1, (mte_n << msz) - 1); +desc = FIELD_DP32(desc, MTEDESC, SIZEM1, sizem1); desc <<= SVE_MTEDESC_SHIFT; } else { addr = clean_data_tbi(s, addr); -- 
2.34.1
[PATCH v3 5/6] target/arm: Handle mte in do_ldrq, do_ldro
These functions "use the standard load helpers", but fail to clean_data_tbi or populate mtedesc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/translate-sve.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 508f7b6bbd..ada05aa530 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4861,8 +4861,13 @@ static void do_ldrq(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) unsigned vsz = vec_full_reg_size(s); TCGv_ptr t_pg; int poff; +uint32_t desc; /* Load the first quadword using the normal predicated load helpers. */ +if (!s->mte_active[0]) { +addr = clean_data_tbi(s, addr); +} + poff = pred_full_reg_offset(s, pg); if (vsz > 16) { /* @@ -4886,7 +4891,8 @@ static void do_ldrq(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) gen_helper_gvec_mem *fn = ldr_fns[s->mte_active[0]][s->be_data == MO_BE][dtype][0]; -fn(tcg_env, t_pg, addr, tcg_constant_i32(simd_desc(16, 16, zt))); +desc = make_svemte_desc(s, 16, 1, dtype_msz(dtype), false, zt); +fn(tcg_env, t_pg, addr, tcg_constant_i32(desc)); /* Replicate that first quadword. */ if (vsz > 16) { @@ -4929,6 +4935,7 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) unsigned vsz_r32; TCGv_ptr t_pg; int poff, doff; +uint32_t desc; if (vsz < 32) { /* @@ -4941,6 +4948,9 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) } /* Load the first octaword using the normal predicated load helpers. 
*/ +if (!s->mte_active[0]) { +addr = clean_data_tbi(s, addr); +} poff = pred_full_reg_offset(s, pg); if (vsz > 32) { @@ -4965,7 +4975,8 @@ static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype) gen_helper_gvec_mem *fn = ldr_fns[s->mte_active[0]][s->be_data == MO_BE][dtype][0]; -fn(tcg_env, t_pg, addr, tcg_constant_i32(simd_desc(32, 32, zt))); +desc = make_svemte_desc(s, 32, 1, dtype_msz(dtype), false, zt); +fn(tcg_env, t_pg, addr, tcg_constant_i32(desc)); /* * Replicate that first octaword. -- 2.34.1
[PATCH v3 6/6] target/arm: Fix SVE/SME gross MTE suppression checks
The TBI and TCMA bits are located within mtedesc, not desc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/sme_helper.c | 8 target/arm/tcg/sve_helper.c | 12 ++-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c index 1ee2690ceb..904bfdac43 100644 --- a/target/arm/tcg/sme_helper.c +++ b/target/arm/tcg/sme_helper.c @@ -573,8 +573,8 @@ void sme_ld1_mte(CPUARMState *env, void *za, uint64_t *vg, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -750,8 +750,8 @@ void sme_st1_mte(CPUARMState *env, void *za, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c index bce4295d28..6853f58c19 100644 --- a/target/arm/tcg/sve_helper.c +++ b/target/arm/tcg/sve_helper.c @@ -5800,8 +5800,8 @@ void sve_ldN_r_mte(CPUARMState *env, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. 
*/ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -6156,8 +6156,8 @@ void sve_ldnfff1_r_mte(CPUARMState *env, void *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } @@ -6410,8 +6410,8 @@ void sve_stN_r_mte(CPUARMState *env, uint64_t *vg, target_ulong addr, desc = extract32(desc, 0, SIMD_DATA_SHIFT + SVE_MTEDESC_SHIFT); /* Perform gross MTE suppression early. */ -if (!tbi_check(desc, bit55) || -tcma_check(desc, bit55, allocation_tag_from_addr(addr))) { +if (!tbi_check(mtedesc, bit55) || +tcma_check(mtedesc, bit55, allocation_tag_from_addr(addr))) { mtedesc = 0; } -- 2.34.1
[PATCH v3 1/6] linux-user/aarch64: Choose SYNC as the preferred MTE mode
The API does not generate an error for setting ASYNC | SYNC; that merely constrains the selection vs the per-cpu default. For qemu linux-user, choose SYNC as the default. Cc: qemu-sta...@nongnu.org Reported-by: Gustavo Romero Signed-off-by: Richard Henderson --- linux-user/aarch64/target_prctl.h | 29 + 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/linux-user/aarch64/target_prctl.h b/linux-user/aarch64/target_prctl.h index 5067e7d731..aa8e203c15 100644 --- a/linux-user/aarch64/target_prctl.h +++ b/linux-user/aarch64/target_prctl.h @@ -173,21 +173,26 @@ static abi_long do_prctl_set_tagged_addr_ctrl(CPUArchState *env, abi_long arg2) env->tagged_addr_enable = arg2 & PR_TAGGED_ADDR_ENABLE; if (cpu_isar_feature(aa64_mte, cpu)) { -switch (arg2 & PR_MTE_TCF_MASK) { -case PR_MTE_TCF_NONE: -case PR_MTE_TCF_SYNC: -case PR_MTE_TCF_ASYNC: -break; -default: -return -EINVAL; -} - /* * Write PR_MTE_TCF to SCTLR_EL1[TCF0]. - * Note that the syscall values are consistent with hw. + * + * The kernel has a per-cpu configuration for the sysadmin, + * /sys/devices/system/cpu/cpu/mte_tcf_preferred, + * which qemu does not implement. + * + * Because there is no performance difference between the modes, and + * because SYNC is most useful for debugging MTE errors, choose SYNC + * as the preferred mode. With this preference, and the way the API + * uses only two bits, there is no way for the program to select + * ASYMM mode. */ -env->cp15.sctlr_el[1] = -deposit64(env->cp15.sctlr_el[1], 38, 2, arg2 >> PR_MTE_TCF_SHIFT); +unsigned tcf = 0; +if (arg2 & PR_MTE_TCF_SYNC) { +tcf = 1; +} else if (arg2 & PR_MTE_TCF_ASYNC) { +tcf = 2; +} +env->cp15.sctlr_el[1] = deposit64(env->cp15.sctlr_el[1], 38, 2, tcf); /* * Write PR_MTE_TAG to GCR_EL1[Exclude]. -- 2.34.1
[PATCH v3 2/6] target/arm: Fix nregs computation in do_{ld,st}_zpa
The field is encoded as [0-3], which is convenient for indexing our array of function pointers, but the true value is [1-4]. Adjust before calling do_mem_zpa. Add an assert, and move the comment re passing ZT to the helper back next to the relevant code. Cc: qemu-sta...@nongnu.org Fixes: 206adacfb8d ("target/arm: Add mte helpers for sve scalar + int loads") Signed-off-by: Richard Henderson --- target/arm/tcg/translate-sve.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index 296e7d1ce2..7108938251 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4445,11 +4445,7 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, TCGv_ptr t_pg; int desc = 0; -/* - * For e.g. LD4, there are not enough arguments to pass all 4 - * registers as pointers, so encode the regno into the data field. - * For consistency, do this even for LD1. - */ +assert(mte_n >= 1 && mte_n <= 4); if (s->mte_active[0]) { int msz = dtype_msz(dtype); @@ -4463,6 +4459,11 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, addr = clean_data_tbi(s, addr); } +/* + * For e.g. LD4, there are not enough arguments to pass all 4 + * registers as pointers, so encode the regno into the data field. + * For consistency, do this even for LD1. + */ desc = simd_desc(vsz, vsz, zt | desc); t_pg = tcg_temp_new_ptr(); @@ -4600,7 +4601,7 @@ static void do_ld_zpa(DisasContext *s, int zt, int pg, * accessible via the instruction encoding. 
*/ assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, dtype, nreg, false, fn); +do_mem_zpa(s, zt, pg, addr, dtype, nreg + 1, false, fn); } static bool trans_LD_zprr(DisasContext *s, arg_rprr_load *a) @@ -5168,14 +5169,13 @@ static void do_st_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, if (nreg == 0) { /* ST1 */ fn = fn_single[s->mte_active[0]][be][msz][esz]; -nreg = 1; } else { /* ST2, ST3, ST4 -- msz == esz, enforced by encoding */ assert(msz == esz); fn = fn_multiple[s->mte_active[0]][be][nreg - 1][msz]; } assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, msz_dtype(s, msz), nreg, true, fn); +do_mem_zpa(s, zt, pg, addr, msz_dtype(s, msz), nreg + 1, true, fn); } static bool trans_ST_zprr(DisasContext *s, arg_rprr_store *a) -- 2.34.1
[PATCH v3 0/6] target/arm: assorted mte fixes
Changes for v3: - As if /sys/devices/system/cpu/cpu/mte_tcf_preferred is "sync". - Fix do_st_zpa as well as do_ld_zpa. Oops. Because of the above, I dropped Gustavo's t-b. r~ Richard Henderson (6): linux-user/aarch64: Choose SYNC as the preferred MTE mode target/arm: Fix nregs computation in do_{ld,st}_zpa target/arm: Adjust and validate mtedesc sizem1 target/arm: Split out make_svemte_desc target/arm: Handle mte in do_ldrq, do_ldro target/arm: Fix SVE/SME gross MTE suppression checks linux-user/aarch64/target_prctl.h | 29 ++- target/arm/internals.h| 2 +- target/arm/tcg/translate-a64.h| 2 + target/arm/tcg/sme_helper.c | 8 +-- target/arm/tcg/sve_helper.c | 12 ++--- target/arm/tcg/translate-sme.c| 15 ++ target/arm/tcg/translate-sve.c| 83 ++- 7 files changed, 83 insertions(+), 68 deletions(-) -- 2.34.1
[PATCH v3 4/6] target/arm: Split out make_svemte_desc
Share code that creates mtedesc and embeds within simd_desc. Cc: qemu-sta...@nongnu.org Reviewed-by: Peter Maydell Signed-off-by: Richard Henderson --- target/arm/tcg/translate-a64.h | 2 ++ target/arm/tcg/translate-sme.c | 15 +++ target/arm/tcg/translate-sve.c | 47 ++ 3 files changed, 31 insertions(+), 33 deletions(-) diff --git a/target/arm/tcg/translate-a64.h b/target/arm/tcg/translate-a64.h index 96ba39b37e..7b811b8ac5 100644 --- a/target/arm/tcg/translate-a64.h +++ b/target/arm/tcg/translate-a64.h @@ -28,6 +28,8 @@ bool logic_imm_decode_wmask(uint64_t *result, unsigned int immn, bool sve_access_check(DisasContext *s); bool sme_enabled_check(DisasContext *s); bool sme_enabled_check_with_svcr(DisasContext *s, unsigned); +uint32_t make_svemte_desc(DisasContext *s, unsigned vsz, uint32_t nregs, + uint32_t msz, bool is_write, uint32_t data); /* This function corresponds to CheckStreamingSVEEnabled. */ static inline bool sme_sm_enabled_check(DisasContext *s) diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c index 8f0dfc884e..46c7fce8b4 100644 --- a/target/arm/tcg/translate-sme.c +++ b/target/arm/tcg/translate-sme.c @@ -206,7 +206,7 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a) TCGv_ptr t_za, t_pg; TCGv_i64 addr; -int svl, desc = 0; +uint32_t desc; bool be = s->be_data == MO_BE; bool mte = s->mte_active[0]; @@ -224,18 +224,11 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a) tcg_gen_shli_i64(addr, cpu_reg(s, a->rm), a->esz); tcg_gen_add_i64(addr, addr, cpu_reg_sp(s, a->rn)); -if (mte) { -desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); -desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); -desc = FIELD_DP32(desc, MTEDESC, TCMA, s->tcma); -desc = FIELD_DP32(desc, MTEDESC, WRITE, a->st); -desc = FIELD_DP32(desc, MTEDESC, SIZEM1, (1 << a->esz) - 1); -desc <<= SVE_MTEDESC_SHIFT; -} else { +if (!mte) { addr = clean_data_tbi(s, addr); } -svl = streaming_vec_reg_size(s); -desc = simd_desc(svl, svl, desc); + +desc = 
make_svemte_desc(s, streaming_vec_reg_size(s), 1, a->esz, a->st, 0); fns[a->esz][be][a->v][mte][a->st](tcg_env, t_za, t_pg, addr, tcg_constant_i32(desc)); diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c index a88e523cba..508f7b6bbd 100644 --- a/target/arm/tcg/translate-sve.c +++ b/target/arm/tcg/translate-sve.c @@ -4437,18 +4437,18 @@ static const uint8_t dtype_esz[16] = { 3, 2, 1, 3 }; -static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, - int dtype, uint32_t mte_n, bool is_write, - gen_helper_gvec_mem *fn) +uint32_t make_svemte_desc(DisasContext *s, unsigned vsz, uint32_t nregs, + uint32_t msz, bool is_write, uint32_t data) { -unsigned vsz = vec_full_reg_size(s); -TCGv_ptr t_pg; uint32_t sizem1; -int desc = 0; +uint32_t desc = 0; -assert(mte_n >= 1 && mte_n <= 4); -sizem1 = (mte_n << dtype_msz(dtype)) - 1; +/* Assert all of the data fits, with or without MTE enabled. */ +assert(nregs >= 1 && nregs <= 4); +sizem1 = (nregs << msz) - 1; assert(sizem1 <= R_MTEDESC_SIZEM1_MASK >> R_MTEDESC_SIZEM1_SHIFT); +assert(data < 1u << SVE_MTEDESC_SHIFT); + if (s->mte_active[0]) { desc = FIELD_DP32(desc, MTEDESC, MIDX, get_mem_index(s)); desc = FIELD_DP32(desc, MTEDESC, TBI, s->tbid); @@ -4456,7 +4456,18 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, desc = FIELD_DP32(desc, MTEDESC, WRITE, is_write); desc = FIELD_DP32(desc, MTEDESC, SIZEM1, sizem1); desc <<= SVE_MTEDESC_SHIFT; -} else { +} +return simd_desc(vsz, vsz, desc | data); +} + +static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, + int dtype, uint32_t nregs, bool is_write, + gen_helper_gvec_mem *fn) +{ +TCGv_ptr t_pg; +uint32_t desc; + +if (!s->mte_active[0]) { addr = clean_data_tbi(s, addr); } @@ -4465,7 +4476,8 @@ static void do_mem_zpa(DisasContext *s, int zt, int pg, TCGv_i64 addr, * registers as pointers, so encode the regno into the data field. * For consistency, do this even for LD1. 
*/ -desc = simd_desc(vsz, vsz, zt | desc); +desc = make_svemte_desc(s, vec_full_reg_size(s), nregs, +dtype_msz(dtype), is_write, zt); t_pg = tcg_temp_new_ptr(); tcg_gen_addi_ptr(t_pg, tcg_env, pred_full_reg_offset(s, pg)); @@ -5224,25 +5236,16 @@ static void do_mem_zpz(DisasContext *s, int zt, int pg, int zm, int scale, TCGv_i64 scalar, int msz, bool is_write, gen_helper_gvec_mem_scatter *fn) { -
Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote: > Based-on: 20240202102857.110210-1-pet...@redhat.com > [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups > https://lore.kernel.org/r/20240202102857.110210-1-pet...@redhat.com > > Hi, > > For v3 I fixed the refcounting issue spotted by Avihai. The situation > there is a bit clunky due to historical reasons. The gist is that we > have an assumption that channel creation never fails after p->c has > been set, so when 'p->c == NULL' we have to unref and when 'p->c != > NULL' the cleanup code will do the unref. Yes, this looks good to me. That's a good catch. I'll leave at least one more day for Avihai and/or Dan to have another look. My r-b persist as of now on patch 5. Actually I think the conditional unref is slightly tricky, but it's not its own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly abused. My understanding is we can avoid that conditional unref with below patch 1 as a cleanup (on top of this series). Then patch 2 comes all alongside. We don't need to rush on these, though, we should fix the thread race first because multiple of us hit it, and all cleanups can be done later. = >From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 7 Feb 2024 10:08:35 +0800 Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake") introduced a thread for TLS channels, which will resolve the issue on blocking the main thread. However in the same commit p->c is slightly abused just to be able to pass over the pointer "p" into the thread. That's the major reason we'll need to conditionally free the io channel in the fault paths. To clean it up, using a separate structure to pass over both "p" and "tioc" in the tls handshake thread. 
Then we can make it a rule that p->c will never be set until the channel is completely setup. With that, we can drop the tricky conditional unref of the io channel in the error path. Signed-off-by: Peter Xu --- migration/multifd.c | 37 +++-- 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index adfe8c9a0a..4a85a6b7b3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -873,16 +873,22 @@ out: static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); +typedef struct { +MultiFDSendParams *p; +QIOChannelTLS *tioc; +} MultiFDTLSThreadArgs; + static void *multifd_tls_handshake_thread(void *opaque) { -MultiFDSendParams *p = opaque; -QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c); +MultiFDTLSThreadArgs *args = opaque; -qio_channel_tls_handshake(tioc, +qio_channel_tls_handshake(args->tioc, multifd_new_send_channel_async, - p, + args->p, NULL, NULL); +g_free(args); + return NULL; } @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, { MigrationState *s = migrate_get_current(); const char *hostname = s->hostname; +MultiFDTLSThreadArgs *args; QIOChannelTLS *tioc; tioc = migration_tls_client_create(ioc, hostname, errp); @@ -906,11 +913,14 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, object_unref(OBJECT(ioc)); trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); -p->c = QIO_CHANNEL(tioc); + +args = g_new0(MultiFDTLSThreadArgs, 1); +args->tioc = tioc; +args->p = p; p->tls_thread_created = true; qemu_thread_create(>tls_thread, "multifd-tls-handshake-worker", - multifd_tls_handshake_thread, p, + multifd_tls_handshake_thread, args, QEMU_THREAD_JOINABLE); return true; } @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p, migration_ioc_register_yank(ioc); p->registered_yank = true; +/* Setup p->c only if the channel is completely setup */ p->c = ioc; 
p->thread_created = true; @@ -976,14 +987,12 @@ out: trace_multifd_new_send_channel_async_error(p->id, local_err); multifd_send_set_error(local_err); -if (!p->c) { -/* - * If no channel has been created, drop the initial - * reference. Otherwise cleanup happens at - * multifd_send_channel_destroy() - */ -object_unref(OBJECT(ioc)); -} +/* + * For error cases (TLS or non-TLS), IO channel is always freed here + * rather than when cleanup multifd: since p->c is not set, multifd + * cleanup code doesn't even know its existence. + */ +object_unref(OBJECT(ioc));
Re: [PATCH v2 1/6] linux-user/aarch64: Extend PR_SET_TAGGED_ADDR_CTRL for FEAT_MTE3
On 2/7/24 00:23, Peter Maydell wrote: +++ b/linux-user/aarch64/target_prctl.h @@ -173,21 +173,22 @@ static abi_long do_prctl_set_tagged_addr_ctrl(CPUArchState *env, abi_long arg2) env->tagged_addr_enable = arg2 & PR_TAGGED_ADDR_ENABLE; if (cpu_isar_feature(aa64_mte, cpu)) { -switch (arg2 & PR_MTE_TCF_MASK) { -case PR_MTE_TCF_NONE: -case PR_MTE_TCF_SYNC: -case PR_MTE_TCF_ASYNC: -break; -default: -return -EINVAL; -} We should probably check here and reject unknown bits being set in arg2, as set_tagged_addr_ctrl() does; but the old code didn't get that right either. This is done higher up in this function: if (arg2 & ~valid_mask) { return -TARGET_EINVAL; } The rejection of ASYNC | SYNC here was either a bug in my original implementation, or the kernel API changed since the initial implementation in June 2020 (not worth digging to find out). r~
[PATCH v2 2/3] ci: Remove tag dependency for build-previous-qemu
From: Peter Xu The new build-previous-qemu job relies on QEMU release tag being present, while that may not be always true for personal git repositories since by default tag is not pushed. The job can fail on those CI kicks, as reported by Peter Maydell. Fix it by fetching the tags remotely from the official repository, as suggested by Dan. [1] https://lore.kernel.org/r/zcc9sckj7vvqe...@redhat.com Reported-by: Peter Maydell Suggested-by: Daniel P. Berrangé Reviewed-by: Daniel P. Berrangé Signed-off-by: Peter Xu --- .gitlab-ci.d/buildtest.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml index 79bbc8585b..cfe95c1b17 100644 --- a/.gitlab-ci.d/buildtest.yml +++ b/.gitlab-ci.d/buildtest.yml @@ -189,6 +189,8 @@ build-previous-qemu: TARGETS: x86_64-softmmu aarch64-softmmu before_script: - export QEMU_PREV_VERSION="$(sed 's/\([0-9.]*\)\.[0-9]*/v\1.0/' VERSION)" +- git remote add upstream https://gitlab.com/qemu-project/qemu +- git fetch upstream $QEMU_PREV_VERSION - git checkout $QEMU_PREV_VERSION after_script: - mv build build-previous -- 2.43.0
[PATCH v2 3/3] ci: Update comment for migration-compat-aarch64
From: Peter Xu It turns out that we may not be able to enable this test even for the upcoming v9.0. Document what we're still missing. Reviewed-by: Daniel P. Berrangé Signed-off-by: Peter Xu --- .gitlab-ci.d/buildtest.yml | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml index cfe95c1b17..f56df59c94 100644 --- a/.gitlab-ci.d/buildtest.yml +++ b/.gitlab-ci.d/buildtest.yml @@ -219,9 +219,10 @@ build-previous-qemu: - QTEST_QEMU_BINARY_DST=./qemu-system-${TARGET} QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} ./tests/qtest/migration-test -# This job is disabled until we release 9.0. The existing -# migration-test in 8.2 is broken on aarch64. The fix was already -# commited, but it will only take effect once 9.0 is out. +# This job needs to be disabled until we can have an aarch64 CPU model that +# will both (1) support both KVM and TCG, and (2) provide a stable ABI. +# Currently only "-cpu max" can provide (1), however it doesn't guarantee +# (2). Mark this test skipped until later. migration-compat-aarch64: extends: .migration-compat-common variables: -- 2.43.0
[PATCH v2 0/3] ci: Fixes on the recent cross-binary test case
From: Peter Xu

v2:
- Fix a typo in patch 2 on QEMU_PREV_VERSION
- Added R-bs for Dan

Hi,

This small patchset updates the recent cross-binary migration test in a
few ways.

Patch 1 changes the aarch64 test GIC version to 3 rather than "max",
paving the way for enabling the test, even if the CPU model is not yet
ready.

Patch 2 removes the tag dependency of the new build-previous-qemu job,
so that in personal CI pipelines the job won't fail if the tag is
missing, as reported by Peter Maydell, with the solution suggested by
Dan.

Patch 3 updates the comment for aarch64 in the test to state the fact,
and what is missing. We no longer target supporting it for v9.0; it
will wait until we have a stable CPU model for aarch64 (if it is ever
possible to support both tcg and kvm).

Comments welcomed, thanks.

Peter Xu (3):
  tests/migration-test: Stick with gicv3 in aarch64 test
  ci: Remove tag dependency for build-previous-qemu
  ci: Update comment for migration-compat-aarch64

 tests/qtest/migration-test.c | 2 +-
 .gitlab-ci.d/buildtest.yml   | 9 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

-- 
2.43.0
[PATCH v2 1/3] tests/migration-test: Stick with gicv3 in aarch64 test
From: Peter Xu

Recently we introduced the cross-binary migration test. It has always
been desirable that migration-test uses a stable guest ABI for both QEMU
binaries in this case, so that both QEMU binaries will be compatible on
the migration stream with the cmdline specified.

Switch to the static gic version "3" rather than version "max", so that
GIC should now be stable across any future QEMU binaries for
migration-test. Here the version can actually be anything as long as the
ABI is stable. We choose "3" because it's the majority of what we
already use in QEMU while still new enough: "git grep gic-version=3"
shows 6 hits, while version 4 has no direct user yet besides "max".

Note that even with this change, aarch64 won't yet work with the
migration cross-binary test, but then the only missing piece will be the
stable CPU model.

Reviewed-by: Daniel P. Berrangé
Signed-off-by: Peter Xu
---
 tests/qtest/migration-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 7675519cfa..8a5bb1752e 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -819,7 +819,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     } else if (strcmp(arch, "aarch64") == 0) {
         memory_size = "150M";
         machine_alias = "virt";
-        machine_opts = "gic-version=max";
+        machine_opts = "gic-version=3";
         arch_opts = g_strdup_printf("-cpu max -kernel %s", bootpath);
         start_address = ARM_TEST_MEM_START;
         end_address = ARM_TEST_MEM_END;
-- 
2.43.0
Re: [PATCH v2 2/6] target/arm: Fix nregs computation in do_ld_zpa
On 2/7/24 00:46, Peter Maydell wrote: @@ -4600,7 +4601,7 @@ static void do_ld_zpa(DisasContext *s, int zt, int pg, * accessible via the instruction encoding. */ assert(fn != NULL); -do_mem_zpa(s, zt, pg, addr, dtype, nreg, false, fn); +do_mem_zpa(s, zt, pg, addr, dtype, nreg + 1, false, fn); } static bool trans_LD_zprr(DisasContext *s, arg_rprr_load *a) What about do_st_zpa() ? It's not obvious what the 'nreg' encoding is in the a->nreg field in arg_rprr_store, but it's definitely confusing that do_st_zpa() calls do_mem_zpa() passing "nreg" whereas do_ld_zpa() now passes it "nreg + 1". Can we make it so the handling in these two functions lines up? Yes, I think there may be a bug in store as well. Comparing the two is complicated by the cut outs for LDFF1, LDNF1, LD1R and PRF. r~
RE: [PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx
-Original Message- From: Nabih Estefan Sent: Wednesday, February 7, 2024 7:24 AM To: peter.mayd...@linaro.org Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; CS20 KFTing ; wuhao...@google.com; jasow...@redhat.com; IS20 Avi Fishman ; nabiheste...@google.com; CS20 KWLiu ; IS20 Tomer Maimon ; IN20 Hila Miranda-Kuzi Subject: [PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx Fixing the npcm_gmac-test.c file to run on a Nuvoton 7xx machine instead of 8xx. Also fixing comments referencing this and values expecting 8xx. Change-Id: I07b91e8be473e6a1ece65a2202608b52ed4025b8 Signed-Off-By: Nabih Estefan Reviewed-by: Tyrone Ting --- tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build index 39557d5ecb..2b89e8634b 100644 --- a/tests/qtest/meson.build +++ b/tests/qtest/meson.build @@ -192,7 +192,8 @@ qtests_npcm7xx = \ 'npcm7xx_sdhci-test', 'npcm7xx_smbus-test', 'npcm7xx_timer-test', - 'npcm7xx_watchdog_timer-test'] + \ + 'npcm7xx_watchdog_timer-test', + 'npcm_gmac-test'] + \ (slirp.found() ? ['npcm7xx_emc-test'] : []) qtests_aspeed = \ ['aspeed_hace-test', @@ -231,7 +232,6 @@ qtests_aarch64 = \ (config_all_devices.has_key('CONFIG_RASPI') ? ['bcm2835-dma-test'] : []) + \ (config_all_accel.has_key('CONFIG_TCG') and \ config_all_devices.has_key('CONFIG_TPM_TIS_I2C') ? ['tpm-tis-i2c-test'] : []) + \ - (config_all_devices.has_key('CONFIG_NPCM7XX') ?
qtests_npcm7xx : []) + \ ['arm-cpu-features', 'numa-test', 'boot-serial-test', diff --git a/tests/qtest/npcm_gmac-test.c b/tests/qtest/npcm_gmac-test.c index 9e58b15ca1..0d1bc8107b 100644 --- a/tests/qtest/npcm_gmac-test.c +++ b/tests/qtest/npcm_gmac-test.c @@ -36,7 +36,7 @@ typedef struct TestData { const GMACModule *module; } TestData; -/* Values extracted from hw/arm/npcm8xx.c */ +/* Values extracted from hw/arm/npcm7xx.c */ static const GMACModule gmac_module_list[] = { { .irq= 14, @@ -46,14 +46,6 @@ static const GMACModule gmac_module_list[] = { .irq= 15, .base_addr = 0xf0804000 }, -{ -.irq= 16, -.base_addr = 0xf0806000 -}, -{ -.irq= 17, -.base_addr = 0xf0808000 -} }; /* Returns the index of the GMAC module. */ @@ -196,7 +188,7 @@ static void test_init(gconstpointer test_data) { const TestData *td = test_data; const GMACModule *mod = td->module; -QTestState *qts = qtest_init("-machine npcm845-evb"); +QTestState *qts = qtest_init("-machine npcm750-evb"); #define CHECK_REG32(regno, value) \ do { \ -- 2.43.0.594.gd9cf4e227d-goog
Re: [PATCH v2 1/6] linux-user/aarch64: Extend PR_SET_TAGGED_ADDR_CTRL for FEAT_MTE3
On 2/7/24 00:23, Peter Maydell wrote: On Tue, 6 Feb 2024 at 03:06, Richard Henderson wrote: When MTE3 is supported, the kernel maps PR_MTE_TCF_ASYNC | PR_MTE_TCF_SYNC to MTE_CTRL_TCF_ASYMM and from there to SCTLR_EL1.TCF0 = 3 This depends on the setting of /sys/devices/system/cpu/cpu/mte_tcf_preferred : I think you only get asymm here if the sysadmin has set mte_tcf_preferred to 'asymm'; the default is 'async'. Hmm, I missed that somewhere in the rat's nest. I suspect this is over-engineered, such that no one will understand how to use it. For QEMU's implementation, are there any particular performance differences between sync, async and asymm ? I doubt it. Getting to the error path at all is the bulk of the work. I think "performance" in this case would be highly test-case-centric. Does the test "perform better" with async, which would allow the entire vector operation to finish in one go? I suspect that for debugging purposes, sync is always preferred. That might be the best setting for qemu. r~
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
Hello Alexander On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov wrote: > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD > routines are invoked much more rarely in normal use when most buffers > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra > frequency and voltage transition periods during which the CPU operates > at reduced performance, as described in > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html I would like to point out that frequency scaling is not currently an issue on AMD Zen4 Genoa CPUs, for example. The processor architecture is described here: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf Although the CPU frequency downscaling mentioned in the above document relates only to floating point operations, from other online discussions I gather that the data path for the integer registers in Zen4 is also 256 bits, which allows it to avoid frequency downscaling for FP and heavy instructions. Looking at the optimizations for AVX2 in your other patch, would unrolling the loop for AVX512 ops benefit from the same speedup, given that the data path has the same width? If the frequency downscaling is not observed on some CPUs, can AVX512 be maintained and used selectively for those CPUs? Thank you! > > > Signed-off-by: Mikhail Romanov > Signed-off-by: Alexander Monakov > --- > util/bufferiszero.c | 36 ++-- > 1 file changed, 2 insertions(+), 34 deletions(-) > > diff --git a/util/bufferiszero.c b/util/bufferiszero.c > index 01050694a6..c037d11d04 100644 > --- a/util/bufferiszero.c > +++ b/util/bufferiszero.c > @@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) > } > } > > -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || > defined(__SSE2__) > +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) > #include > > /* Note that each of these vectorized functions require len >= 64. 
*/ > @@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len) > } > #endif /* CONFIG_AVX2_OPT */ > > -#ifdef CONFIG_AVX512F_OPT > -static bool __attribute__((target("avx512f"))) > -buffer_zero_avx512(const void *buf, size_t len) > -{ > -/* Begin with an unaligned head of 64 bytes. */ > -__m512i t = _mm512_loadu_si512(buf); > -__m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64); > -__m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64); > - > -/* Loop over 64-byte aligned blocks of 256. */ > -while (p <= e) { > -__builtin_prefetch(p); > -if (unlikely(_mm512_test_epi64_mask(t, t))) { > -return false; > -} > -t = p[-4] | p[-3] | p[-2] | p[-1]; > -p += 4; > -} > - > -t |= _mm512_loadu_si512(buf + len - 4 * 64); > -t |= _mm512_loadu_si512(buf + len - 3 * 64); > -t |= _mm512_loadu_si512(buf + len - 2 * 64); > -t |= _mm512_loadu_si512(buf + len - 1 * 64); > - > -return !_mm512_test_epi64_mask(t, t); > - > -} > -#endif /* CONFIG_AVX512F_OPT */ > - > static unsigned __attribute__((noinline)) > select_accel_cpuinfo(unsigned info) > { > @@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info) > unsigned bit; > bool (*fn)(const void *, size_t); > } all[] = { > -#ifdef CONFIG_AVX512F_OPT > -{ CPUINFO_AVX512F, buffer_zero_avx512 }, > -#endif > #ifdef CONFIG_AVX2_OPT > { CPUINFO_AVX2,buffer_zero_avx2 }, > #endif > @@ -191,7 +159,7 @@ static unsigned used_accel > = 0; > #endif > > -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) > +#if defined(CONFIG_AVX2_OPT) > static void __attribute__((constructor)) init_accel(void) > { > used_accel = select_accel_cpuinfo(cpuinfo_init()); > -- > 2.32.0 > > > -- Elena
[PATCH 0/1] Sending small fix for NPCM GMAC test to properly test on Nuvoton 7xx
Nabih Estefan (1): tests/qtest: Fixing GMAC test to run in 7xx tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) -- 2.43.0.594.gd9cf4e227d-goog
[PATCH 1/1] tests/qtest: Fixing GMAC test to run in 7xx
Fixing the npcm_gmac-test.c file to run on a Nuvoton 7xx machine instead of 8xx. Also fixing comments referencing this and values expecting 8xx. Change-Id: I07b91e8be473e6a1ece65a2202608b52ed4025b8 Signed-off-by: Nabih Estefan --- tests/qtest/meson.build | 4 ++-- tests/qtest/npcm_gmac-test.c | 12 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build index 39557d5ecb..2b89e8634b 100644 --- a/tests/qtest/meson.build +++ b/tests/qtest/meson.build @@ -192,7 +192,8 @@ qtests_npcm7xx = \ 'npcm7xx_sdhci-test', 'npcm7xx_smbus-test', 'npcm7xx_timer-test', - 'npcm7xx_watchdog_timer-test'] + \ + 'npcm7xx_watchdog_timer-test', + 'npcm_gmac-test'] + \ (slirp.found() ? ['npcm7xx_emc-test'] : []) qtests_aspeed = \ ['aspeed_hace-test', @@ -231,7 +232,6 @@ qtests_aarch64 = \ (config_all_devices.has_key('CONFIG_RASPI') ? ['bcm2835-dma-test'] : []) + \ (config_all_accel.has_key('CONFIG_TCG') and \ config_all_devices.has_key('CONFIG_TPM_TIS_I2C') ? ['tpm-tis-i2c-test'] : []) + \ - (config_all_devices.has_key('CONFIG_NPCM7XX') ? qtests_npcm7xx : []) + \ ['arm-cpu-features', 'numa-test', 'boot-serial-test', diff --git a/tests/qtest/npcm_gmac-test.c b/tests/qtest/npcm_gmac-test.c index 9e58b15ca1..0d1bc8107b 100644 --- a/tests/qtest/npcm_gmac-test.c +++ b/tests/qtest/npcm_gmac-test.c @@ -36,7 +36,7 @@ typedef struct TestData { const GMACModule *module; } TestData; -/* Values extracted from hw/arm/npcm8xx.c */ +/* Values extracted from hw/arm/npcm7xx.c */ static const GMACModule gmac_module_list[] = { { .irq= 14, @@ -46,14 +46,6 @@ static const GMACModule gmac_module_list[] = { .irq= 15, .base_addr = 0xf0804000 }, -{ -.irq= 16, -.base_addr = 0xf0806000 -}, -{ -.irq= 17, -.base_addr = 0xf0808000 -} }; /* Returns the index of the GMAC module. 
*/ @@ -196,7 +188,7 @@ static void test_init(gconstpointer test_data) { const TestData *td = test_data; const GMACModule *mod = td->module; -QTestState *qts = qtest_init("-machine npcm845-evb"); +QTestState *qts = qtest_init("-machine npcm750-evb"); #define CHECK_REG32(regno, value) \ do { \ -- 2.43.0.594.gd9cf4e227d-goog
[PATCH 4/6] migration/multifd: Zero page transmission on the multifd thread.
This implements the zero page detection and handling on the multifd threads. Signed-off-by: Hao Xiang --- migration/multifd.c | 62 + migration/multifd.h | 5 2 files changed, 62 insertions(+), 5 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index a20d0ed10e..c031f947c7 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -11,6 +11,7 @@ */ #include "qemu/osdep.h" +#include "qemu/cutils.h" #include "qemu/rcu.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" @@ -278,6 +279,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) packet->offset[i] = cpu_to_be64(temp); } +for (i = 0; i < p->zero_num; i++) { +/* there are architectures where ram_addr_t is 32 bit */ +uint64_t temp = p->zero[i]; + +packet->offset[p->normal_num + i] = cpu_to_be64(temp); +} } static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) @@ -360,6 +367,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->normal[i] = offset; } +for (i = 0; i < p->zero_num; i++) { +uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]); + +if (offset > (p->block->used_length - p->page_size)) { +error_setg(errp, "multifd: offset too long %" PRIu64 + " (max " RAM_ADDR_FMT ")", + offset, p->block->used_length); +return -1; +} +p->zero[i] = offset; +} + return 0; } @@ -658,13 +677,37 @@ int multifd_send_sync_main(void) return 0; } +static void zero_page_check_send(MultiFDSendParams *p) +{ +/* + * QEMU versions older than 9.0 don't understand zero pages + * on the multifd channel. This switch is required to + * maintain backward compatibility. 
+ */ +bool use_multifd_zero_page = migrate_multifd_zero_page(); +RAMBlock *rb = p->pages->block; + +for (int i = 0; i < p->pages->num; i++) { +uint64_t offset = p->pages->offset[i]; +if (use_multifd_zero_page && +buffer_is_zero(rb->host + offset, p->page_size)) { +p->zero[p->zero_num] = offset; +p->zero_num++; +ram_release_page(rb->idstr, offset); +} else { +p->normal[p->normal_num] = offset; +p->normal_num++; +} +} +} + static void *multifd_send_thread(void *opaque) { MultiFDSendParams *p = opaque; MigrationThread *thread = NULL; Error *local_err = NULL; -int ret = 0; bool use_zero_copy_send = migrate_zero_copy_send(); +int ret = 0; thread = migration_threads_add(p->name, qemu_get_thread_id()); @@ -699,10 +742,7 @@ static void *multifd_send_thread(void *opaque) p->iovs_num = 1; } -for (int i = 0; i < p->pages->num; i++) { -p->normal[p->normal_num] = p->pages->offset[i]; -p->normal_num++; -} +zero_page_check_send(p); if (p->normal_num) { ret = multifd_send_state->ops->send_prepare(p, _err); @@ -1107,6 +1147,16 @@ void multifd_recv_sync_main(void) trace_multifd_recv_sync_main(multifd_recv_state->packet_num); } +static void zero_page_check_recv(MultiFDRecvParams *p) +{ +for (int i = 0; i < p->zero_num; i++) { +void *page = p->host + p->zero[i]; +if (!buffer_is_zero(page, p->page_size)) { +memset(page, 0, p->page_size); +} +} +} + static void *multifd_recv_thread(void *opaque) { MultiFDRecvParams *p = opaque; @@ -1153,6 +1203,8 @@ static void *multifd_recv_thread(void *opaque) } } +zero_page_check_recv(p); + if (flags & MULTIFD_FLAG_SYNC) { qemu_sem_post(_recv_state->sem_sync); qemu_sem_wait(>sem_sync); diff --git a/migration/multifd.h b/migration/multifd.h index 6be9b2f6c1..7448cb1aa9 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -53,6 +53,11 @@ typedef struct { uint32_t unused32[1];/* Reserved for future use */ uint64_t unused64[3];/* Reserved for future use */ char ramblock[256]; +/* + * This array contains the pointers to: + * - normal pages 
(initial normal_pages entries) + * - zero pages (following zero_pages entries) + */ uint64_t offset[]; } __attribute__((packed)) MultiFDPacket_t; -- 2.30.2
[PATCH 3/6] migration/multifd: Support for zero pages transmission in multifd format.
This change adds zero page counters and updates multifd send/receive tracing format to track the newly added counters. Signed-off-by: Hao Xiang --- migration/migration-hmp-cmds.c | 4 migration/multifd.c| 43 ++ migration/multifd.h| 17 +- migration/trace-events | 8 +++ 4 files changed, 57 insertions(+), 15 deletions(-) diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 8b0c205a41..2dd99b0509 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -111,6 +111,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) info->ram->normal); monitor_printf(mon, "normal bytes: %" PRIu64 " kbytes\n", info->ram->normal_bytes >> 10); +monitor_printf(mon, "zero: %" PRIu64 " pages\n", + info->ram->zero); +monitor_printf(mon, "zero bytes: %" PRIu64 " kbytes\n", + info->ram->zero_bytes >> 10); monitor_printf(mon, "dirty sync count: %" PRIu64 "\n", info->ram->dirty_sync_count); monitor_printf(mon, "page size: %" PRIu64 " kbytes\n", diff --git a/migration/multifd.c b/migration/multifd.c index 25cbc6dc6b..a20d0ed10e 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -264,6 +264,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) packet->flags = cpu_to_be32(p->flags); packet->pages_alloc = cpu_to_be32(p->pages->allocated); packet->normal_pages = cpu_to_be32(p->normal_num); +packet->zero_pages = cpu_to_be32(p->zero_num); packet->next_packet_size = cpu_to_be32(p->next_packet_size); packet->packet_num = cpu_to_be64(p->packet_num); @@ -317,18 +318,26 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->normal_num = be32_to_cpu(packet->normal_pages); if (p->normal_num > packet->pages_alloc) { error_setg(errp, "multifd: received packet " - "with %u pages and expected maximum pages are %u", + "with %u normal pages and expected maximum pages are %u", p->normal_num, packet->pages_alloc) ; return -1; } -p->next_packet_size = be32_to_cpu(packet->next_packet_size); -p->packet_num = 
be64_to_cpu(packet->packet_num); +p->zero_num = be32_to_cpu(packet->zero_pages); +if (p->zero_num > packet->pages_alloc - p->normal_num) { +error_setg(errp, "multifd: received packet " + "with %u zero pages and expected maximum zero pages are %u", + p->zero_num, packet->pages_alloc - p->normal_num) ; +return -1; +} -if (p->normal_num == 0) { +if (p->normal_num == 0 && p->zero_num == 0) { return 0; } +p->next_packet_size = be32_to_cpu(packet->next_packet_size); +p->packet_num = be64_to_cpu(packet->packet_num); + /* make sure that ramblock is 0 terminated */ packet->ramblock[255] = 0; p->block = qemu_ram_block_by_name(packet->ramblock); @@ -430,6 +439,7 @@ static int multifd_send_pages(void) p->packet_num = multifd_send_state->packet_num++; multifd_send_state->pages = p->pages; p->pages = pages; + qemu_mutex_unlock(>mutex); qemu_sem_post(>sem); @@ -551,6 +561,8 @@ void multifd_save_cleanup(void) p->iov = NULL; g_free(p->normal); p->normal = NULL; +g_free(p->zero); +p->zero = NULL; multifd_send_state->ops->send_cleanup(p, _err); if (local_err) { migrate_set_error(migrate_get_current(), local_err); @@ -679,6 +691,7 @@ static void *multifd_send_thread(void *opaque) uint64_t packet_num = p->packet_num; uint32_t flags; p->normal_num = 0; +p->zero_num = 0; if (use_zero_copy_send) { p->iovs_num = 0; @@ -703,12 +716,13 @@ static void *multifd_send_thread(void *opaque) p->flags = 0; p->num_packets++; p->total_normal_pages += p->normal_num; +p->total_zero_pages += p->zero_num; p->pages->num = 0; p->pages->block = NULL; qemu_mutex_unlock(>mutex); -trace_multifd_send(p->id, packet_num, p->normal_num, flags, - p->next_packet_size); +trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num, + flags, p->next_packet_size); if (use_zero_copy_send) { /* Send header first, without zerocopy */ @@ -731,6 +745,8 @@ static void *multifd_send_thread(void *opaque) stat64_add(_stats.multifd_bytes, p->next_packet_size + p->packet_len); +stat64_add(_stats.normal_pages, p->normal_num); 
+stat64_add(_stats.zero_pages, p->zero_num);
[PATCH 6/6] migration/multifd: Add a new migration test case for legacy zero page checking.
Now that zero page checking is done on the multifd sender threads by default, we still provide an option for backward compatibility. This change adds a qtest migration test case to set the multifd-zero-page option to false and run multifd migration with zero page checking on the migration main thread. Signed-off-by: Hao Xiang --- tests/qtest/migration-test.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c index 7675519cfa..2c13df04c3 100644 --- a/tests/qtest/migration-test.c +++ b/tests/qtest/migration-test.c @@ -2621,6 +2621,15 @@ test_migrate_precopy_tcp_multifd_start(QTestState *from, return test_migrate_precopy_tcp_multifd_start_common(from, to, "none"); } +static void * +test_migrate_precopy_tcp_multifd_start_zero_page_legacy(QTestState *from, +QTestState *to) +{ +test_migrate_precopy_tcp_multifd_start_common(from, to, "none"); +migrate_set_parameter_bool(from, "multifd-zero-page", false); +return NULL; +} + static void * test_migrate_precopy_tcp_multifd_zlib_start(QTestState *from, QTestState *to) @@ -2652,6 +2661,21 @@ static void test_multifd_tcp_none(void) test_precopy_common(); } +static void test_multifd_tcp_zero_page_legacy(void) +{ +MigrateCommon args = { +.listen_uri = "defer", +.start_hook = test_migrate_precopy_tcp_multifd_start_zero_page_legacy, +/* + * Multifd is more complicated than most of the features, it + * directly takes guest page buffers when sending, make sure + * everything will work alright even if guest page is changing. 
+ */ +.live = true, +}; +test_precopy_common(); +} + static void test_multifd_tcp_zlib(void) { MigrateCommon args = { @@ -3550,6 +3574,8 @@ int main(int argc, char **argv) } migration_test_add("/migration/multifd/tcp/plain/none", test_multifd_tcp_none); +migration_test_add("/migration/multifd/tcp/plain/zero_page_legacy", + test_multifd_tcp_zero_page_legacy); migration_test_add("/migration/multifd/tcp/plain/cancel", test_multifd_tcp_cancel); migration_test_add("/migration/multifd/tcp/plain/zlib", -- 2.30.2
[PATCH 5/6] migration/multifd: Enable zero page checking from multifd threads.
This change adds a dedicated handler for MigrationOps::ram_save_target_page in multifd live migration. Now zero page checking can be done in the multifd threads and this becomes the default configuration. We still provide backward compatibility where zero page checking is done from the migration main thread. Signed-off-by: Hao Xiang --- migration/multifd.c | 3 ++- migration/ram.c | 49 - 2 files changed, 42 insertions(+), 10 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index c031f947c7..c6833ccb07 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -13,6 +13,7 @@ #include "qemu/osdep.h" #include "qemu/cutils.h" #include "qemu/rcu.h" +#include "qemu/cutils.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" #include "exec/ramblock.h" @@ -458,7 +459,6 @@ static int multifd_send_pages(void) p->packet_num = multifd_send_state->packet_num++; multifd_send_state->pages = p->pages; p->pages = pages; - qemu_mutex_unlock(>mutex); qemu_sem_post(>sem); @@ -733,6 +733,7 @@ static void *multifd_send_thread(void *opaque) if (p->pending_job) { uint64_t packet_num = p->packet_num; uint32_t flags; + p->normal_num = 0; p->zero_num = 0; diff --git a/migration/ram.c b/migration/ram.c index d5b7cd5ac2..e6742c9593 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1252,6 +1252,10 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss) static int ram_save_multifd_page(RAMBlock *block, ram_addr_t offset) { +assert(migrate_multifd()); +assert(!migrate_compress()); +assert(!migration_in_postcopy()); + if (multifd_queue_page(block, offset) < 0) { return -1; } @@ -2043,7 +2047,6 @@ static bool save_compress_page(RAMState *rs, PageSearchStatus *pss, */ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss) { -RAMBlock *block = pss->block; ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS; int res; @@ -2059,17 +2062,40 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss) return 1; } 
+return ram_save_page(rs, pss); +} + +/** + * ram_save_target_page_multifd: save one target page + * + * Returns the number of pages written + * + * @rs: current RAM state + * @pss: data about the page we want to send + */ +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss) +{ +RAMBlock *block = pss->block; +ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS; + +/* Multifd is not compatible with old compression. */ +assert(!migrate_compress()); + +/* Multifd is not compatible with postcopy. */ +assert(!migration_in_postcopy()); + /* - * Do not use multifd in postcopy as one whole host page should be - * placed. Meanwhile postcopy requires atomic update of pages, so even - * if host page size == guest page size the dest guest during run may - * still see partially copied pages which is data corruption. + * Backward compatibility support. While using multifd live + * migration, we still need to handle zero page checking on the + * migration main thread. */ -if (migrate_multifd() && !migration_in_postcopy()) { -return ram_save_multifd_page(block, offset); +if (!migrate_multifd_zero_page()) { +if (save_zero_page(rs, pss, offset)) { +return 1; +} } -return ram_save_page(rs, pss); +return ram_save_multifd_page(block, offset); } /* Should be called before sending a host page */ @@ -2981,7 +3007,12 @@ static int ram_save_setup(QEMUFile *f, void *opaque) } migration_ops = g_malloc0(sizeof(MigrationOps)); + +if (migrate_multifd()) { +migration_ops->ram_save_target_page = ram_save_target_page_multifd; +} else { +migration_ops->ram_save_target_page = ram_save_target_page_legacy; +} bql_unlock(); ret = multifd_send_sync_main(); -- 2.30.2
[PATCH 2/6] migration/multifd: Add zero pages and zero bytes counter to migration status interface.
This change extends the MigrationStatus interface to track zero pages and zero bytes counter. Signed-off-by: Hao Xiang --- qapi/migration.json | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/qapi/migration.json b/qapi/migration.json index ff033a0344..69366fe3f4 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -63,6 +63,10 @@ # between 0 and @dirty-sync-count * @multifd-channels. (since # 7.1) # +# @zero: number of zero pages (since 9.0) +# +# @zero-bytes: number of zero bytes sent (since 9.0) +# # Features: # # @deprecated: Member @skipped is always zero since 1.5.3 @@ -81,7 +85,8 @@ 'multifd-bytes': 'uint64', 'pages-per-second': 'uint64', 'precopy-bytes': 'uint64', 'downtime-bytes': 'uint64', 'postcopy-bytes': 'uint64', - 'dirty-sync-missed-zero-copy': 'uint64' } } + 'dirty-sync-missed-zero-copy': 'uint64', + 'zero': 'int', 'zero-bytes': 'int' } } ## # @XBZRLECacheStats: @@ -332,6 +337,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # } # } @@ -358,6 +365,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # } # } @@ -379,6 +388,8 @@ # "duplicate":123, # "normal":123, # "normal-bytes":123456, +# "zero":123, +# "zero-bytes":123456, # "dirty-sync-count":15 # }, # "disk":{ @@ -405,6 +416,8 @@ # "duplicate":10, # "normal":, # "normal-bytes":3412992, +# "zero":, +# "zero-bytes":3412992, # "dirty-sync-count":15 # }, # "xbzrle-cache":{ -- 2.30.2
[PATCH 1/6] migration/multifd: Add new migration option multifd-zero-page.
This new parameter controls where the zero page checking is running. If this parameter is set to true, zero page checking is done in the multifd sender threads. If this parameter is set to false, zero page checking is done in the migration main thread. Signed-off-by: Hao Xiang --- migration/migration-hmp-cmds.c | 7 +++ migration/options.c| 20 migration/options.h| 1 + qapi/migration.json| 24 +--- 4 files changed, 49 insertions(+), 3 deletions(-) diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 99b49df5dd..8b0c205a41 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -344,6 +344,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict) monitor_printf(mon, "%s: %s\n", MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION), MultiFDCompression_str(params->multifd_compression)); +monitor_printf(mon, "%s: %s\n", +MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_ZERO_PAGE), +params->multifd_zero_page ? "on" : "off"); monitor_printf(mon, "%s: %" PRIu64 " bytes\n", MigrationParameter_str(MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE), params->xbzrle_cache_size); @@ -634,6 +637,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict) p->has_multifd_zstd_level = true; visit_type_uint8(v, param, >multifd_zstd_level, ); break; +case MIGRATION_PARAMETER_MULTIFD_ZERO_PAGE: +p->has_multifd_zero_page = true; +visit_type_bool(v, param, >multifd_zero_page, ); +break; case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE: p->has_xbzrle_cache_size = true; if (!visit_type_size(v, param, _size, )) { diff --git a/migration/options.c b/migration/options.c index 3e3e0b93b4..cb18a41267 100644 --- a/migration/options.c +++ b/migration/options.c @@ -179,6 +179,8 @@ Property migration_properties[] = { DEFINE_PROP_MIG_MODE("mode", MigrationState, parameters.mode, MIG_MODE_NORMAL), +DEFINE_PROP_BOOL("multifd-zero-page", MigrationState, + parameters.multifd_zero_page, true), /* Migration capabilities */ 
DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE), @@ -903,6 +905,13 @@ uint64_t migrate_xbzrle_cache_size(void) return s->parameters.xbzrle_cache_size; } +bool migrate_multifd_zero_page(void) +{ +MigrationState *s = migrate_get_current(); + +return s->parameters.multifd_zero_page; +} + /* parameter setters */ void migrate_set_block_incremental(bool value) @@ -1013,6 +1022,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp) params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit; params->has_mode = true; params->mode = s->parameters.mode; +params->has_multifd_zero_page = true; +params->multifd_zero_page = s->parameters.multifd_zero_page; return params; } @@ -1049,6 +1060,7 @@ void migrate_params_init(MigrationParameters *params) params->has_x_vcpu_dirty_limit_period = true; params->has_vcpu_dirty_limit = true; params->has_mode = true; +params->has_multifd_zero_page = true; } /* @@ -1350,6 +1362,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params, if (params->has_mode) { dest->mode = params->mode; } + +if (params->has_multifd_zero_page) { +dest->multifd_zero_page = params->multifd_zero_page; +} } static void migrate_params_apply(MigrateSetParameters *params, Error **errp) @@ -1494,6 +1510,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp) if (params->has_mode) { s->parameters.mode = params->mode; } + +if (params->has_multifd_zero_page) { +s->parameters.multifd_zero_page = params->multifd_zero_page; +} } void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp) diff --git a/migration/options.h b/migration/options.h index 246c160aee..c080a6ba18 100644 --- a/migration/options.h +++ b/migration/options.h @@ -93,6 +93,7 @@ const char *migrate_tls_authz(void); const char *migrate_tls_creds(void); const char *migrate_tls_hostname(void); uint64_t migrate_xbzrle_cache_size(void); +bool migrate_multifd_zero_page(void); /* parameters setters */ diff --git a/qapi/migration.json 
b/qapi/migration.json index 819708321d..ff033a0344 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -874,6 +874,11 @@ # @mode: Migration mode. See description in @MigMode. Default is 'normal'. #(Since 8.2) # +# @multifd-zero-page: Multifd zero page checking. If the parameter is true, +# zero page checking is done on the multifd sender thread. If the parameter +# is false, zero page checking is done on the migration main thread. Default +# is set
[PATCH 0/6] Introduce multifd zero page checking.
This patchset is based on Juan Quintela's old series here https://lore.kernel.org/all/20220802063907.18882-1-quint...@redhat.com/

In the multifd live migration model, there is a single migration main thread scanning the page map, queuing the pages to multiple multifd sender threads. The migration main thread runs zero page checking on every page before queuing the page to the sender threads. Zero page checking is a CPU intensive task and hence having a single thread doing all that doesn't scale well. This change introduces a new function to run the zero page checking on the multifd sender threads. This patchset also lays the groundwork for future changes to offload the zero page checking task to accelerator hardware.

Use two Intel 4th generation Xeon servers for testing.

Architecture:        x86_64
CPU(s):              192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8457C
Stepping:            8
CPU MHz:             2538.624
CPU max MHz:         3800.
CPU min MHz:         800.

Perform multifd live migration with the below setup:
1. VM has 100GB memory. All pages in the VM are zero pages.
2. Use tcp socket for live migration.
3. Use 4 multifd channels and zero page checking on the migration main thread.
4. Use 1/2/4 multifd channels and zero page checking on the multifd sender threads.
5. Record migration total time from the sender QEMU console's "info migrate" command.
6. Calculate throughput with "100GB / total time".

+--------------------+----------------+------------------+
| zero-page-checking | total-time(ms) | throughput(GB/s) |
+--------------------+----------------+------------------+
| main-thread        | 9629           | 10.38GB/s        |
+--------------------+----------------+------------------+
| multifd-1-threads  | 6182           | 16.17GB/s        |
+--------------------+----------------+------------------+
| multifd-2-threads  | 4643           | 21.53GB/s        |
+--------------------+----------------+------------------+
| multifd-4-threads  | 4143           | 24.13GB/s        |
+--------------------+----------------+------------------+

Apply this patchset on top of commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440

Hao Xiang (6): migration/multifd: Add new migration option multifd-zero-page. migration/multifd: Add zero pages and zero bytes counter to migration status interface. 
migration/multifd: Support for zero pages transmission in multifd format. migration/multifd: Zero page transmission on the multifd thread. migration/multifd: Enable zero page checking from multifd threads. migration/multifd: Add a new migration test case for legacy zero page checking. migration/migration-hmp-cmds.c | 11 migration/multifd.c| 106 - migration/multifd.h| 22 ++- migration/options.c| 20 +++ migration/options.h| 1 + migration/ram.c| 49 --- migration/trace-events | 8 +-- qapi/migration.json| 39 ++-- tests/qtest/migration-test.c | 26 9 files changed, 249 insertions(+), 33 deletions(-) -- 2.30.2
Re: [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
On 2/7/24 06:48, Alexander Monakov wrote: Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Ah, that answers my question re 128 vs 256 byte minimum. So as far as this patch goes, Reviewed-by: Richard Henderson r~
Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
On 2/7/24 08:34, Richard Henderson wrote: On 2/7/24 06:48, Alexander Monakov wrote: - /* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple + /* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ - uint64_t t = ldq_he_p(buf); - const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); - const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); + uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); + typedef uint64_t uint64_a __attribute__((may_alias)); + const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); + const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); You appear to be optimizing this routine for x86, which is not the primary consumer. This is going to perform very poorly on hosts that do not support unaligned accesses (e.g. Sparc and some RISC-V). I beg your pardon, I mis-read this. You're only replacing the byte loops, which will be more-or-less identical, modulo unrolling, when unaligned access is not supported. But will be much improved if some unaligned access support is available (e.g. MIPS LWL+LWR). Reviewed-by: Richard Henderson r~
Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
On 2/7/24 06:48, Alexander Monakov wrote: Make buffer_is_zero a 'static inline' function that tests up to three bytes from the buffer before handing off to an unrolled loop. This eliminates call overhead for most non-zero buffers, and allows to optimize out length checks when it is known at compile time (which is often the case in Qemu). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- include/qemu/cutils.h | 28 +++- util/bufferiszero.c | 76 --- 2 files changed, 47 insertions(+), 57 deletions(-) diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 92c927a6a3..62b153e603 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); /* used to print char* safely */ #define STR_OR_NULL(str) ((str) ? (str) : "null") -bool buffer_is_zero(const void *buf, size_t len); +bool buffer_is_zero_len_4_plus(const void *, size_t); +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); Why 256, when the avx2 routine can handle size 128, and you're about to remove avx512? You appear to have missed that select_accel_fn() resolves directly to buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an indirect function call. I think you should not attempt to expose the 4 vs larger implementation detail here in the inline function. Presumably the bulk of the benefit in avoiding the function call is already realized via the three byte spot checks. r~
Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
On 2/7/24 06:48, Alexander Monakov wrote: -/* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple +/* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ -uint64_t t = ldq_he_p(buf); -const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); -const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); +uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); +typedef uint64_t uint64_a __attribute__((may_alias)); +const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); +const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); You appear to be optimizing this routine for x86, which is not the primary consumer. This is going to perform very poorly on hosts that do not support unaligned accesses (e.g. Sparc and some RISC-V). r~
Re: [PATCH] hw/net/tulip: add chip status register values
On 2/5/24 20:47, Sven Schnelle wrote:
NetBSD isn't able to detect a link on the emulated tulip card. That's because NetBSD reads the Chip Status Register of the PHY (address 0x14). The default PHY data in the QEMU tulip driver is all zero, which means no link is established and autonegotiation isn't complete. Therefore set the register to 0x3b40, which means: Link is up, Autonegotiation complete, Full Duplex, 100MBit/s Link speed. Also clear the mask because this register is read-only.
Signed-off-by: Sven Schnelle

Reviewed-by: Helge Deller
Tested-by: Helge Deller

Can be easily tested without installation:
Download: wget https://cdn.netbsd.org/pub/NetBSD/NetBSD-9.3/iso/NetBSD-9.3-hppa.iso
Run: ./qemu-system-hppa -cdrom NetBSD-9.3-hppa.iso -nographic
-> a) Installation on English
-> e) Utility Menu
-> c) configure network

Helge

---
 hw/net/tulip.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/net/tulip.c b/hw/net/tulip.c
index 6d4fb06dad..1f2ef20977 100644
--- a/hw/net/tulip.c
+++ b/hw/net/tulip.c
@@ -421,7 +421,7 @@ static uint16_t tulip_mdi_default[] = {
 /* MDI Registers 8 - 15 */
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 /* MDI Registers 16 - 31 */
-0x0003, 0x, 0x0001, 0x, 0x, 0x, 0x, 0x,
+0x0003, 0x, 0x0001, 0x, 0x3b40, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 };
@@ -429,7 +429,7 @@ static uint16_t tulip_mdi_default[] = {
 static const uint16_t tulip_mdi_mask[] = {
 0x, 0x, 0x, 0x, 0xc01f, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
-0x0fff, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
+0x0fff, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 0x, 0x, 0x, 0x, 0x, 0x, 0x, 0x,
 };
Re: [PATCH v3 4/6] util/bufferiszero: remove useless prefetches
On 2/7/24 06:48, Alexander Monakov wrote: Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 3 --- 1 file changed, 3 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
On 2/7/24 06:48, Alexander Monakov wrote: Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov Signed-off-by: Alexander Monakov --- util/bufferiszero.c | 36 ++-- 1 file changed, 2 insertions(+), 34 deletions(-) Reviewed-by: Richard Henderson Although I think this patch should be ordered second. r~
Re: [PATCH v3 5/9] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
On Wed, Jan 24, 2024 at 03:47:21PM +, Jonathan Cameron wrote: > On Tue, 7 Nov 2023 10:07:09 -0800 > nifan@gmail.com wrote: > > > From: Fan Ni > > > > Add (file/memory backed) host backend, all the dynamic capacity regions > > will share a single, large enough host backend. Set up address space for > > DC regions to support read/write operations to dynamic capacity for DCD. > > > > With the change, following supports are added: > > 1. Add a new property to type3 device "nonvolatile-dc-memdev" to point to > > host > >memory backend for dynamic capacity. Currently, all dc regions share one > >one host backend. > > 2. Add namespace for dynamic capacity for read/write support; > > 3. Create cdat entries for each dynamic capacity region; > > 4. Fix dvsec range registers to include DC regions. > > > > Signed-off-by: Fan Ni > Some minor comments inline, mostly suggesting pulling refactors out before > you do the new stuff. > > Thanks, > > Jonathan Hi Jonathan, One question about DVSEC setting inline. 
Please search ""QUESTION:" > > > --- > > hw/cxl/cxl-mailbox-utils.c | 16 ++- > > hw/mem/cxl_type3.c | 198 +--- > > include/hw/cxl/cxl_device.h | 4 + > > 3 files changed, 179 insertions(+), 39 deletions(-) > > > > > > > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > > index 2d67d2015c..152a51306d 100644 > > --- a/hw/mem/cxl_type3.c > > +++ b/hw/mem/cxl_type3.c > > @@ -31,6 +31,7 @@ > > #include "hw/pci/spdm.h" > > > > #define DWORD_BYTE 4 > > +#define CXL_CAPACITY_MULTIPLIER (256 * MiB) > > > > /* Default CDAT entries for a memory region */ > > enum { > > @@ -44,8 +45,9 @@ enum { > > }; > > > > static int ct3_build_cdat_entries_for_mr(CDATSubHeader **cdat_table, > > - int dsmad_handle, MemoryRegion > > *mr, > > - bool is_pmem, uint64_t dpa_base) > > + int dsmad_handle, uint64_t size, > > + bool is_pmem, bool is_dynamic, > > + uint64_t dpa_base) > > { > > g_autofree CDATDsmas *dsmas = NULL; > > g_autofree CDATDslbis *dslbis0 = NULL; > > @@ -64,9 +66,10 @@ static int ct3_build_cdat_entries_for_mr(CDATSubHeader > > **cdat_table, > > .length = sizeof(*dsmas), > > }, > > .DSMADhandle = dsmad_handle, > > -.flags = is_pmem ? CDAT_DSMAS_FLAG_NV : 0, > > +.flags = (is_pmem ? CDAT_DSMAS_FLAG_NV : 0) | > > +(is_dynamic ? CDAT_DSMAS_FLAG_DYNAMIC_CAP : 0), > > .DPA_base = dpa_base, > > -.DPA_length = memory_region_size(mr), > > +.DPA_length = size, > > }; > > > > /* For now, no memory side cache, plausiblish numbers */ > > @@ -150,7 +153,7 @@ static int ct3_build_cdat_entries_for_mr(CDATSubHeader > > **cdat_table, > > */ > > .EFI_memory_type_attr = is_pmem ? 2 : 1, > > .DPA_offset = 0, > > -.DPA_length = memory_region_size(mr), > > +.DPA_length = size, > > }; > > Might be better to make the change to this function as a precursor patch > before > you introduce the new users. Will separate the DC bits out from the rest. 
> > > > > /* Header always at start of structure */ > > @@ -169,21 +172,28 @@ static int ct3_build_cdat_table(CDATSubHeader > > ***cdat_table, void *priv) > > g_autofree CDATSubHeader **table = NULL; > > CXLType3Dev *ct3d = priv; > > MemoryRegion *volatile_mr = NULL, *nonvolatile_mr = NULL; > > +MemoryRegion *dc_mr = NULL; > > int dsmad_handle = 0; > > int cur_ent = 0; > > int len = 0; > > int rc, i; > > +uint64_t vmr_size = 0, pmr_size = 0; > > Put these next to the memory region definitions above given they are > referring to the > same regions. > > > > > -if (!ct3d->hostpmem && !ct3d->hostvmem) { > > +if (!ct3d->hostpmem && !ct3d->hostvmem && !ct3d->dc.num_regions) { > > return 0; > > } > > > > +if (ct3d->hostpmem && ct3d->hostvmem && ct3d->dc.host_dc) { > > +warn_report("The device has static ram and pmem and dynamic > > capacity"); > > This is the whole how many DVSEC ranges question? > I hope we resolved that so we don't care about this... > > > +} > > + > > if (ct3d->hostvmem) { > > volatile_mr = host_memory_backend_get_memory(ct3d->hostvmem); > > if (!volatile_mr) { > > return -EINVAL; > > } > > len += CT3_CDAT_NUM_ENTRIES; > > +vmr_size = memory_region_size(volatile_mr); > > } > > > > if (ct3d->hostpmem) { > > > > > @@ -210,14 +233,38 @@ static int ct3_build_cdat_table(CDATSubHeader > > ***cdat_table, void *priv) > > } > > > > if (nonvolatile_mr) { > > -uint64_t base = volatile_mr ? memory_region_size(volatile_mr) : 0; > > rc =
Re: [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
On 2/7/24 06:48, Alexander Monakov wrote: The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from dispatch standpoint. Hence, the use of PTEST brings no benefit from throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 29 - 1 file changed, 29 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 03/13] target/arm: Add Cortex-R52 IMPDEF sysregs
On 2/6/24 23:29, Peter Maydell wrote: Add the Cortex-R52 IMPDEF sysregs, by defining them here and also by enabling the AUXCR feature which defines the ACTLR and HACTLR registers. As is our usual practice, we make these simple reads-as-zero stubs for now. Signed-off-by: Peter Maydell --- target/arm/tcg/cpu32.c | 108 + 1 file changed, 108 insertions(+) Reviewed-by: Richard Henderson r~
[PATCH v3 6/6] migration/multifd: Add a synchronization point for channel creation
It is possible that one of the multifd channels fails to be created at multifd_new_send_channel_async() while the rest of the channel creation tasks are still in flight. This could lead to multifd_save_cleanup() executing the qemu_thread_join() loop too early and not waiting for the threads which haven't been created yet, leading to the freeing of resources that the newly created threads will try to access and crash. Add a synchronization point after which there will be no attempts at thread creation and therefore calling multifd_save_cleanup() past that point will ensure it properly waits for the threads. A note about performance: Prior to this patch, if a channel took too long to be established, other channels could finish connecting first and already start taking load. Now we're bounded by the slowest-connecting channel. Reported-by: Avihai Horon Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 339f2428f3..ee77047031 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -62,6 +62,11 @@ struct { * Make it easy for now. */ uintptr_t packet_num; +/* + * Synchronization point past which no more channels will be + * created. + */ +QemuSemaphore channels_created; /* send channels ready */ QemuSemaphore channels_ready; /* @@ -622,10 +627,6 @@ static void multifd_send_terminate_threads(void) /* * Finally recycle all the threads. - * - * TODO: p->running is still buggy, e.g. we can reach here without the - * corresponding multifd_new_send_channel_async() get invoked yet, - * then a new thread can even be created after this function returns. 
*/ for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDSendParams *p = _send_state->params[i]; @@ -670,6 +671,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp) static void multifd_send_cleanup_state(void) { +qemu_sem_destroy(_send_state->channels_created); qemu_sem_destroy(_send_state->channels_ready); g_free(multifd_send_state->params); multifd_send_state->params = NULL; @@ -954,18 +956,26 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) if (migrate_channel_requires_tls_upgrade(ioc)) { ret = multifd_tls_channel_connect(p, ioc, _err); +if (ret) { +return; +} } else { ret = multifd_channel_connect(p, ioc, _err); } +out: +/* + * Here we're not interested whether creation succeeded, only that + * it happened at all. + */ +qemu_sem_post(_send_state->channels_created); + if (ret) { return; } -out: trace_multifd_new_send_channel_async_error(p->id, local_err); multifd_send_set_error(local_err); -multifd_send_kick_main(p); if (!p->c) { /* * If no channel has been created, drop the initial @@ -998,6 +1008,7 @@ bool multifd_send_setup(void) multifd_send_state = g_malloc0(sizeof(*multifd_send_state)); multifd_send_state->params = g_new0(MultiFDSendParams, thread_count); multifd_send_state->pages = multifd_pages_init(page_count); +qemu_sem_init(_send_state->channels_created, 0); qemu_sem_init(_send_state->channels_ready, 0); qatomic_set(_send_state->exiting, 0); multifd_send_state->ops = multifd_ops[migrate_multifd_compression()]; @@ -1023,6 +1034,15 @@ bool multifd_send_setup(void) multifd_new_send_channel_create(p); } +/* + * Wait until channel creation has started for all channels. The + * creation can still fail, but no more channels will be created + * past this point. + */ +for (i = 0; i < thread_count; i++) { +qemu_sem_wait(_send_state->channels_created); +} + for (i = 0; i < thread_count; i++) { MultiFDSendParams *p = _send_state->params[i]; -- 2.35.3
[PATCH v3 4/6] migration/multifd: Move multifd_send_setup into migration thread
We currently have an unfavorable situation around multifd channels creation and the migration thread execution. We create the multifd channels with qio_channel_socket_connect_async -> qio_task_run_in_thread, but only connect them at the multifd_new_send_channel_async callback, called from qio_task_complete, which is registered as a glib event. So at multifd_send_setup() we create the channels, but they will only be actually usable after the whole multifd_send_setup() calling stack returns back to the main loop. Which means that the migration thread is already up and running without any possibility for the multifd channels to be ready on time. We currently rely on the channels-ready semaphore blocking multifd_send_sync_main() until channels start to come up and release it. However there have been bugs recently found when a channel's creation fails and multifd_send_cleanup() is allowed to run while other channels are still being created. Let's start to organize this situation by moving the multifd_send_setup() call into the migration thread. That way we unblock the main-loop to dispatch the completion callbacks and actually have a chance of getting the multifd channels ready for when the migration thread needs them. The next patches will deal with the synchronization aspects. Note that this takes multifd_send_setup() out of the BQL. 
Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/migration.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 2942f8cf42..0675e12c64 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3315,6 +3315,10 @@ static void *migration_thread(void *opaque) object_ref(OBJECT(s)); update_iteration_initial_status(s); +if (!multifd_send_setup()) { +goto out; +} + bql_lock(); qemu_savevm_state_header(s->to_dst_file); bql_unlock(); @@ -3386,6 +3390,7 @@ static void *migration_thread(void *opaque) urgent = migration_rate_limit(); } +out: trace_migration_thread_after_loop(); migration_iteration_finish(s); object_unref(OBJECT(s)); @@ -3623,11 +3628,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } -if (!multifd_send_setup()) { -migrate_fd_cleanup(s); -return; -} - if (migrate_background_snapshot()) { qemu_thread_create(>thread, "bg_snapshot", bg_migration_thread, s, QEMU_THREAD_JOINABLE); -- 2.35.3
[PATCH v3 5/6] migration/multifd: Unify multifd and TLS connection paths
During multifd channel creation (multifd_new_send_channel_async) when TLS is enabled, the multifd_channel_connect function is called twice, once to create the TLS handshake thread and another time after the asynchronous TLS handshake has finished. This creates a slightly confusing call stack where multifd_channel_connect() is called more times than the number of channels. It also splits error handling between the two callers of multifd_channel_connect(), causing some code duplication. Lastly, it gets in the way of having a single point to determine whether all channel creation tasks have been initiated. Refactor the code to move the reentrancy one level up, to the multifd_new_send_channel_async() level, de-duplicating the error handling and allowing the next patch to introduce a synchronization point common to all the multifd channel creation, regardless of TLS. Note that the previous code would never fail once p->c had been set. This patch changes that assumption, which affects refcounting, so add comments around object_unref to explain the situation.
Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 83 ++--- 1 file changed, 40 insertions(+), 43 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index cc10be2c3f..339f2428f3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -869,30 +869,7 @@ out: return NULL; } -static bool multifd_channel_connect(MultiFDSendParams *p, -QIOChannel *ioc, -Error **errp); - -static void multifd_tls_outgoing_handshake(QIOTask *task, - gpointer opaque) -{ -MultiFDSendParams *p = opaque; -QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task)); -Error *err = NULL; - -if (!qio_task_propagate_error(task, )) { -trace_multifd_tls_outgoing_handshake_complete(ioc); -if (multifd_channel_connect(p, ioc, )) { -return; -} -} - -trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err)); - -multifd_send_set_error(err); -multifd_send_kick_main(p); -error_free(err); -} +static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); static void *multifd_tls_handshake_thread(void *opaque) { @@ -900,7 +877,7 @@ static void *multifd_tls_handshake_thread(void *opaque) QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c); qio_channel_tls_handshake(tioc, - multifd_tls_outgoing_handshake, + multifd_new_send_channel_async, p, NULL, NULL); @@ -920,6 +897,10 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, return false; } +/* + * Ownership of the socket channel now transfers to the newly + * created TLS channel, which has already taken a reference. 
+ */ object_unref(OBJECT(ioc)); trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); @@ -936,18 +917,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc, Error **errp) { -trace_multifd_set_outgoing_channel( -ioc, object_get_typename(OBJECT(ioc)), -migrate_get_current()->hostname); - -if (migrate_channel_requires_tls_upgrade(ioc)) { -/* - * tls_channel_connect will call back to this - * function after the TLS handshake, - * so we mustn't call multifd_send_thread until then - */ -return multifd_tls_channel_connect(p, ioc, errp); -} +qio_channel_set_delay(ioc, false); migration_ioc_register_yank(ioc); p->registered_yank = true; @@ -959,24 +929,51 @@ static bool multifd_channel_connect(MultiFDSendParams *p, return true; } +/* + * When TLS is enabled this function is called once to establish the + * TLS connection and a second time after the TLS handshake to create + * the multifd channel. Without TLS it goes straight into the channel + * creation. + */ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) { MultiFDSendParams *p = opaque; QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task)); Error *local_err = NULL; +bool ret; trace_multifd_new_send_channel_async(p->id); -if (!qio_task_propagate_error(task, _err)) { -qio_channel_set_delay(ioc, false); -if (multifd_channel_connect(p, ioc, _err)) { -return; -} + +if (qio_task_propagate_error(task, _err)) { +ret = false; +goto out; +} + +trace_multifd_set_outgoing_channel(ioc, object_get_typename(OBJECT(ioc)), + migrate_get_current()->hostname);
[PATCH v3 2/6] migration/multifd: Remove p->running
We currently only need p->running to avoid calling qemu_thread_join() on a non-existent thread if the thread has never been created. However, there are at least two bugs in this logic:

1) On the sending side, p->running is set too early and qemu_thread_create() can be skipped due to an error during TLS handshake, leaving the flag set and leading to a crash when multifd_send_cleanup() calls qemu_thread_join().

2) During exit, the multifd thread clears the flag while holding the channel lock. The counterpart at multifd_send_cleanup() reads the flag outside of the lock and might free the mutex while the multifd thread still has it locked.

Fix the first issue by setting the flag right before creating the thread. Rename it from p->running to p->thread_created to clarify its usage. Fix the second issue by not clearing the flag at the multifd thread exit. We don't have any use for that.

Note that these bugs are straightforward logic issues and not race conditions. There is still a gap for races to affect this code due to multifd_send_cleanup() being allowed to run concurrently with the thread creation loop. This issue is solved in the next patches.
Cc: qemu-stable Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake") Reported-by: Avihai Horon Reported-by: Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 27 --- migration/multifd.h | 7 ++- 2 files changed, 14 insertions(+), 20 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 8195c1daf3..515d88e04b 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -634,7 +634,7 @@ static void multifd_send_terminate_threads(void) qemu_thread_join(>tls_thread); } -if (p->running) { +if (p->thread_created) { qemu_thread_join(>thread); } } @@ -862,7 +862,6 @@ out: error_free(local_err); } -p->running = false; rcu_unregister_thread(); migration_threads_remove(thread); trace_multifd_send_thread_end(p->id, p->packets_sent, p->total_normal_pages); @@ -953,6 +952,8 @@ static bool multifd_channel_connect(MultiFDSendParams *p, migration_ioc_register_yank(ioc); p->registered_yank = true; p->c = ioc; + +p->thread_created = true; qemu_thread_create(>thread, p->name, multifd_send_thread, p, QEMU_THREAD_JOINABLE); return true; @@ -967,7 +968,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) trace_multifd_new_send_channel_async(p->id); if (!qio_task_propagate_error(task, _err)) { qio_channel_set_delay(ioc, false); -p->running = true; if (multifd_channel_connect(p, ioc, _err)) { return; } @@ -1128,15 +1128,15 @@ void multifd_recv_cleanup(void) for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDRecvParams *p = _recv_state->params[i]; -if (p->running) { -/* - * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code, - * however try to wakeup it without harm in cleanup phase. - */ -qemu_sem_post(>sem_sync); -} +/* + * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code, + * however try to wakeup it without harm in cleanup phase. 
+ */ +qemu_sem_post(>sem_sync); -qemu_thread_join(>thread); +if (p->thread_created) { +qemu_thread_join(>thread); +} } for (i = 0; i < migrate_multifd_channels(); i++) { multifd_recv_cleanup_channel(_recv_state->params[i]); @@ -1222,9 +1222,6 @@ static void *multifd_recv_thread(void *opaque) multifd_recv_terminate_threads(local_err); error_free(local_err); } -qemu_mutex_lock(>mutex); -p->running = false; -qemu_mutex_unlock(>mutex); rcu_unregister_thread(); trace_multifd_recv_thread_end(p->id, p->packets_recved, p->total_normal_pages); @@ -1330,7 +1327,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp) p->c = ioc; object_ref(OBJECT(ioc)); -p->running = true; +p->thread_created = true; qemu_thread_create(>thread, p->name, multifd_recv_thread, p, QEMU_THREAD_JOINABLE); qatomic_inc(_recv_state->count); diff --git a/migration/multifd.h b/migration/multifd.h index 720c9d50db..7881980ee6 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -73,6 +73,7 @@ typedef struct { char *name; /* channel thread id */ QemuThread thread; +bool thread_created; QemuThread tls_thread; bool tls_thread_created; /* communication channel */ @@ -93,8 +94,6 @@ typedef struct { /* syncs main thread and channels */ QemuSemaphore sem_sync; -/* is this channel thread running */ -bool running; /* multifd flags for each packet */ uint32_t
[PATCH v3 3/6] migration/multifd: Move multifd_send_setup error handling in to the function
Hide the error handling inside multifd_send_setup to make it cleaner for the next patch to move the function around. Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/migration.c | 6 +- migration/multifd.c | 24 +--- migration/multifd.h | 2 +- 3 files changed, 19 insertions(+), 13 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index ba99772e76..2942f8cf42 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3623,11 +3623,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } -if (multifd_send_setup(_err) != 0) { -migrate_set_error(s, local_err); -error_report_err(local_err); -migrate_set_state(>state, MIGRATION_STATUS_SETUP, - MIGRATION_STATUS_FAILED); +if (!multifd_send_setup()) { migrate_fd_cleanup(s); return; } diff --git a/migration/multifd.c b/migration/multifd.c index 515d88e04b..cc10be2c3f 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -985,14 +985,16 @@ static void multifd_new_send_channel_create(gpointer opaque) socket_send_channel_create(multifd_new_send_channel_async, opaque); } -int multifd_send_setup(Error **errp) +bool multifd_send_setup(void) { -int thread_count; +MigrationState *s = migrate_get_current(); +Error *local_err = NULL; +int thread_count, ret = 0; uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size(); uint8_t i; if (!migrate_multifd()) { -return 0; +return true; } thread_count = migrate_multifd_channels(); @@ -1026,14 +1028,22 @@ int multifd_send_setup(Error **errp) for (i = 0; i < thread_count; i++) { MultiFDSendParams *p = _send_state->params[i]; -int ret; -ret = multifd_send_state->ops->send_setup(p, errp); +ret = multifd_send_state->ops->send_setup(p, _err); if (ret) { -return ret; +break; } } -return 0; + +if (ret) { +migrate_set_error(s, local_err); +error_report_err(local_err); +migrate_set_state(>state, MIGRATION_STATUS_SETUP, + MIGRATION_STATUS_FAILED); +return false; +} + +return true; } struct { diff --git a/migration/multifd.h 
b/migration/multifd.h index 7881980ee6..8a1cad0996 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -13,7 +13,7 @@ #ifndef QEMU_MIGRATION_MULTIFD_H #define QEMU_MIGRATION_MULTIFD_H -int multifd_send_setup(Error **errp); +bool multifd_send_setup(void); void multifd_send_shutdown(void); int multifd_recv_setup(Error **errp); void multifd_recv_cleanup(void); -- 2.35.3
[PATCH v3 1/6] migration/multifd: Join the TLS thread
We're currently leaking the resources of the TLS thread by not joining it and also overwriting the p->thread pointer altogether. Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake") Cc: qemu-stable Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas --- migration/multifd.c | 8 +++- migration/multifd.h | 2 ++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/migration/multifd.c b/migration/multifd.c index ef13e2e781..8195c1daf3 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void) for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDSendParams *p = _send_state->params[i]; +if (p->tls_thread_created) { +qemu_thread_join(>tls_thread); +} + if (p->running) { qemu_thread_join(>thread); } @@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); p->c = QIO_CHANNEL(tioc); -qemu_thread_create(>thread, "multifd-tls-handshake-worker", + +p->tls_thread_created = true; +qemu_thread_create(>tls_thread, "multifd-tls-handshake-worker", multifd_tls_handshake_thread, p, QEMU_THREAD_JOINABLE); return true; diff --git a/migration/multifd.h b/migration/multifd.h index 78a2317263..720c9d50db 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -73,6 +73,8 @@ typedef struct { char *name; /* channel thread id */ QemuThread thread; +QemuThread tls_thread; +bool tls_thread_created; /* communication channel */ QIOChannel *c; /* is the yank function registered */ -- 2.35.3
[PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
Based-on: 20240202102857.110210-1-pet...@redhat.com
[PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
https://lore.kernel.org/r/20240202102857.110210-1-pet...@redhat.com

Hi,

For v3 I fixed the refcounting issue spotted by Avihai. The situation there is a bit clunky due to historical reasons. The gist is that we have an assumption that channel creation never fails after p->c has been set, so when 'p->c == NULL' we have to unref and when 'p->c != NULL' the cleanup code will do the unref.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1166889341

v2: https://lore.kernel.org/r/20240205194929.28963-1-faro...@suse.de

In this v2 I made sure NO channel is created after the semaphores are posted. Feel free to call me out if that's not the case. Not much changes, except that now both TLS and non-TLS go through the same code, so there's a centralized place to do error handling and release the semaphore.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1165206107
based on Peter's code: https://gitlab.com/farosas/qemu/-/pipelines/1165303276

v1: https://lore.kernel.org/r/20240202191128.1901-1-faro...@suse.de

This contains 2 patches from my previous series addressing the p->running misuse and the TLS thread leak, and 3 new patches to fix the cleanup-while-creating-threads race.

For the p->running I'm keeping the idea from the other series to remove p->running and use a more narrow p->thread_created flag. This flag is used only to inform whether the thread has been created so we can join it.

For the cleanup race I have moved some code around and added a semaphore to make multifd_save_setup() only return once all channel creation tasks have started. The idea is that after multifd_save_setup() returns, no new creations are in flight and the p->thread_created flags will never change again, so they're enough to cause the cleanup code to wait for the threads to join.
CI run: https://gitlab.com/farosas/qemu/-/pipelines/1162798843 @Peter: I can rebase this on top of your series once we decide about it. Fabiano Rosas (6): migration/multifd: Join the TLS thread migration/multifd: Remove p->running migration/multifd: Move multifd_send_setup error handling in to the function migration/multifd: Move multifd_send_setup into migration thread migration/multifd: Unify multifd and TLS connection paths migration/multifd: Add a synchronization point for channel creation migration/migration.c | 14 ++-- migration/multifd.c | 168 +- migration/multifd.h | 11 ++- 3 files changed, 109 insertions(+), 84 deletions(-) -- 2.35.3
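The synchronization point described above — multifd_save_setup() returning only once every channel-creation task has started — can be sketched with POSIX semaphores standing in for QemuSemaphore (the names and channel count here are hypothetical):

```c
#include <pthread.h>
#include <semaphore.h>

enum { NUM_CHANNELS = 4 };

static sem_t channels_created;

static void *channel_create_task(void *opaque)
{
    /* ... channel setup, which may succeed or fail ... */
    /* Post exactly once on both paths so setup can never hang. */
    sem_post(&channels_created);
    return NULL;
}

/* Analogue of multifd_save_setup(): after the sem_wait() loop, every
 * creation task has started and posted, so per-channel flags such as
 * thread_created can no longer change under the cleanup code. */
static void save_setup(pthread_t threads[NUM_CHANNELS])
{
    sem_init(&channels_created, 0, 0);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        pthread_create(&threads[i], NULL, channel_create_task, NULL);
    }
    for (int i = 0; i < NUM_CHANNELS; i++) {
        sem_wait(&channels_created);
    }
}
```

The key design point from the cover letter is that the semaphore is posted on both the success and failure paths, so the setup function can centralize error handling without risking a hang.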
Re: [PATCH 07/13] hw/misc/mps2-scc: Make changes needed for AN536 FPGA image
On 2/6/24 23:29, Peter Maydell wrote: The MPS2 SCC device is broadly the same for all FPGA images, but has minor differences in the behaviour of the CFG registers depending on the image. In many cases we don't really care about the functionality controlled by these registers and a reads-as-written or similar behaviour is sufficient for the moment. For the AN536 the required behaviour is: * A_CFG0 has CPU reset and halt bits - implement as reads-as-written for the moment * A_CFG1 has flash or ATCM address 0 remap handling - QEMU doesn't model this; implement as reads-as-written * A_CFG2 has QSPI select (like AN524) - implemented (no behaviour, as with AN524) * A_CFG3 is MCC_MSB_ADDR "additional MCC addressing bits" - QEMU doesn't care about these, so use the existing RAZ behaviour for convenience * A_CFG4 is board rev (like all other images) - no change needed * A_CFG5 is ACLK frq in hz (like AN524) - implemented as reads-as-written, as for other boards * A_CFG6 is core 0 vector table base address - implemented as reads-as-written for the moment * A_CFG7 is core 1 vector table base address - implemented as reads-as-written for the moment Make the changes necessary for this; leave TODO comments where appropriate to indicate where we might want to come back and implement things like CPU reset. The other aspects of the device specific to this FPGA image (like the values of the board ID and similar registers) will be set via the device's qdev properties. Signed-off-by: Peter Maydell --- include/hw/misc/mps2-scc.h | 1 + hw/misc/mps2-scc.c | 101 + 2 files changed, 92 insertions(+), 10 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 06/13] hw/misc/mps2-scc: Factor out which-board conditionals
On 2/6/24 23:29, Peter Maydell wrote: The MPS SCC device has a lot of different flavours for the various different MPS FPGA images, which look mostly similar but have differences in how particular registers are handled. Currently we deal with this with a lot of open-coded checks on scc_partno(), but as we add more board types this is getting a bit hard to read. Factor out the conditions into some functions which we can give more descriptive names to. Signed-off-by: Peter Maydell --- hw/misc/mps2-scc.c | 45 +++-- 1 file changed, 31 insertions(+), 14 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 05/13] hw/misc/mps2-scc: Fix condition for CFG3 register
On 2/6/24 23:29, Peter Maydell wrote: We currently guard the CFG3 register read with (scc_partno(s) == 0x524 && scc_partno(s) == 0x547) which is clearly wrong as it is never true. This register is present on all board types except AN524 and AN527; correct the condition. Fixes: 6ac80818941829c0 ("hw/misc/mps2-scc: Implement changes for AN547") Signed-off-by: Peter Maydell --- hw/misc/mps2-scc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 04/13] target/arm: Allow access to SPSR_hyp from hyp mode
On 2/6/24 23:29, Peter Maydell wrote: Architecturally, the AArch32 MSR/MRS to/from banked register instructions are UNPREDICTABLE for attempts to access a banked register that the guest could access in a more direct way (e.g. using this insn to access r8_fiq when already in FIQ mode). QEMU has chosen to UNDEF on all of these. However, for the case of accessing SPSR_hyp from hyp mode, it turns out that real hardware permits this, with the same effect as if the guest had directly written to SPSR. Further, there is some guest code out there that assumes it can do this, because it happens to work on hardware: an example Cortex-R52 startup code fragment uses this, and it got copied into various other places, including Zephyr. Zephyr was fixed to not use this: https://github.com/zephyrproject-rtos/zephyr/issues/47330 but other examples are still out there, like the selftest binary for the MPS3-AN536. For convenience of being able to run guest code, permit this UNPREDICTABLE access instead of UNDEFing it. Signed-off-by: Peter Maydell --- Last time this came up I preferred the "keep QEMU behaviour as it is, try to get the guest code fixed" approach: https://www.mail-archive.com/qemu-devel@nongnu.org/msg899970.html but as this is the second time I lean a bit more towards behaving like the hardware. --- target/arm/tcg/op_helper.c | 43 ++ target/arm/tcg/translate.c | 19 +++-- 2 files changed, 43 insertions(+), 19 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PULL v2 00/39] tcg patch queue
On Tue, 6 Feb 2024 at 21:24, Peter Maydell wrote: > > On Tue, 6 Feb 2024 at 03:22, Richard Henderson > wrote: > > > > v2: Fix rebase error in patch 38 (tcg/s390x: Support TCG_COND_TST{EQ,NE}). > > > > > > r~ > > > > > > The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: > > > > Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into > > staging (2024-02-03 13:31:58 +0000) > > > > are available in the Git repository at: > > > > https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20240205-2 > > > > for you to fetch changes up to 23c5692abc3917151dee36c00d751cf5bc46ef19: > > > > tcg/tci: Support TCG_COND_TST{EQ,NE} (2024-02-05 22:45:41 +0000) > > > > > > tcg: Introduce TCG_COND_TST{EQ,NE} > > target/alpha: Use TCG_COND_TST{EQ,NE} > > target/m68k: Use TCG_COND_TST{EQ,NE} in gen_fcc_cond > > target/sparc: Use TCG_COND_TSTEQ in gen_op_mulscc > > target/s390x: Use TCG_COND_TSTNE for CC_OP_{TM,ICM} > > target/s390x: Improve general case of disas_jcc > > This really doesn't want to pass the ubuntu-20.04-s390x-all job: > > https://gitlab.com/qemu-project/qemu/-/jobs/6109442678 > https://gitlab.com/qemu-project/qemu/-/jobs/6108249863 > https://gitlab.com/qemu-project/qemu/-/jobs/6106928534 > https://gitlab.com/qemu-project/qemu/-/jobs/6105718495 > > Now, this has definitely been a flaky job recently, so maybe it's > not this pullreq's fault. > > This is a passing job from the last successful merge: > https://gitlab.com/qemu-project/qemu/-/jobs/6089342252 > That took 24 minutes to run, and all the failed jobs above > took 70 minutes plus. Ruling out anything about this particular merge attempt: This is a passing job from a recent successful merge: https://gitlab.com/qemu-project/qemu/-/jobs/6089089816 That took 37 minutes to run (21 mins in configure-n-compile). This is a failing job for the same commit: https://gitlab.com/qemu-project/qemu/-/jobs/6086439717 That took 58 minutes (26 mins in configure-n-compile).
So there's a lot of between-run variation, though in that case it was not as much as in some of these examples. -- PMM
Re: [PULL v2 00/39] tcg patch queue
On Tue, 6 Feb 2024 at 03:22, Richard Henderson wrote: > > v2: Fix rebase error in patch 38 (tcg/s390x: Support TCG_COND_TST{EQ,NE}). > > > r~ > > > The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: > > Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into > staging (2024-02-03 13:31:58 +0000) > > are available in the Git repository at: > > https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20240205-2 > > for you to fetch changes up to 23c5692abc3917151dee36c00d751cf5bc46ef19: > > tcg/tci: Support TCG_COND_TST{EQ,NE} (2024-02-05 22:45:41 +0000) > > > tcg: Introduce TCG_COND_TST{EQ,NE} > target/alpha: Use TCG_COND_TST{EQ,NE} > target/m68k: Use TCG_COND_TST{EQ,NE} in gen_fcc_cond > target/sparc: Use TCG_COND_TSTEQ in gen_op_mulscc > target/s390x: Use TCG_COND_TSTNE for CC_OP_{TM,ICM} > target/s390x: Improve general case of disas_jcc This really doesn't want to pass the ubuntu-20.04-s390x-all job: https://gitlab.com/qemu-project/qemu/-/jobs/6109442678 https://gitlab.com/qemu-project/qemu/-/jobs/6108249863 https://gitlab.com/qemu-project/qemu/-/jobs/6106928534 https://gitlab.com/qemu-project/qemu/-/jobs/6105718495 Now, this has definitely been a flaky job recently, so maybe it's not this pullreq's fault. This is a passing job from the last successful merge: https://gitlab.com/qemu-project/qemu/-/jobs/6089342252 That took 24 minutes to run, and all the failed jobs above took 70 minutes plus. TBH I think there is something weird with the runner. Looking at the timestamps in the log, it seems like the passing job completed its compile step in about 14 minutes, whereas one of the failing jobs took about 39 minutes. So the entire run of the job slowed down by more than 2.5x, which is enough to put it into the range where either the whole job or individual tests time out. thuth: any idea why that might happen? (I look in on the machine from time to time and it doesn't seem to be doing anything it shouldn't that would be eating CPU.)
Christian: this is on the s390x machine we have. Does the VM setup for that share IO or CPU with other VMs somehow? Is there some reason why it might have very variable performance over time? thanks -- PMM
Re: [PATCH 02/13] target/arm: The Cortex-R52 has a read-only CBAR
On Tue, 6 Feb 2024 at 20:38, Richard Henderson wrote: > > On 2/6/24 23:29, Peter Maydell wrote: > > The Cortex-R52 implements the Configuration Base Address Register > > (CBAR), as a read-only register. Add ARM_FEATURE_CBAR_RO to this CPU > > type, so that our implementation provides the register and the > > associated qdev property. > > > > Signed-off-by: Peter Maydell > > --- > > target/arm/tcg/cpu32.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/target/arm/tcg/cpu32.c b/target/arm/tcg/cpu32.c > > index 11253051156..311d654cdce 100644 > > --- a/target/arm/tcg/cpu32.c > > +++ b/target/arm/tcg/cpu32.c > > @@ -809,6 +809,7 @@ static void cortex_r52_initfn(Object *obj) > > set_feature(>env, ARM_FEATURE_PMSA); > > set_feature(>env, ARM_FEATURE_NEON); > > set_feature(>env, ARM_FEATURE_GENERIC_TIMER); > > +set_feature(>env, ARM_FEATURE_CBAR_RO); > > Reviewed-by: Richard Henderson > > I just noticed that arm_cpu_post_init can be simplified to not check CBAR_RO, > now that we > have arm_cpu_propagate_feature_implications. The other bit of CBAR cleanup I have is that cortex-a55, cortex-a76, neoverse-n1, neoverse-v1, neoverse-v2 and cortex-a710 have all cut-n-pasted the line that sets ARM_FEATURE_CBAR_RO, but none of them actually have a CBAR according to their TRM. The only reason I didn't throw in a patch fixing that is that I think it would be a migration compat break and I didn't feel like it was worth the effort to try to deal with that... -- PMM
Re: [PATCH 01/13] target/arm: Use new CBAR encoding for all v8 CPUs, not all aarch64 CPUs
On Tue, 6 Feb 2024 at 20:34, Richard Henderson wrote: > > On 2/6/24 23:29, Peter Maydell wrote: > > We support two different encodings for the AArch32 IMPDEF > > CBAR register -- older cores like the Cortex A9, A7, A15 > > have this at 4, c15, c0, 0; newer cores like the > > Cortex A35, A53, A57 and A72 have it at 1 c15 c0 0. > > > > When we implemented this we picked which encoding to > > use based on whether the CPU set ARM_FEATURE_AARCH64. > > However this isn't right for three cases: > > * the qemu-system-arm 'max' CPU, which is supposed to be > > a variant on a Cortex-A57; it ought to use the same > > encoding the A57 does and which the AArch64 'max' > > exposes to AArch32 guest code > > * the Cortex-R52, which is AArch32-only but has the CBAR > > at the newer encoding (and where we incorrectly are > > not yet setting ARM_FEATURE_CBAR_RO anyway) > > * any possible future support for other v8 AArch32 > > only CPUs, or for supporting "boot the CPU into > > AArch32 mode" on our existing cores like the A57 etc > > > > Make the decision of the encoding be based on whether > > the CPU implements the ARM_FEATURE_V8 flag instead. > > > > This changes the behaviour only for the qemu-system-arm > > '-cpu max'. We don't expect anybody to be relying on the > > old behaviour because: > > * it's not what the real hardware Cortex-A57 does > > (and that's what our ID register claims we are) > > Not even that, because max resets MIDR. qemu-system-aarch64 max does (in aarch64_max_tcg_initfn()), yes; but qemu-system-arm max is set up in arm_max_initfn() in cpu32.c, and that sets cpu->midr = 0x411fd070 (which is the same as A57's MIDR)... > Anyway, > Reviewed-by: Richard Henderson thanks -- PMM
Re: [PATCH 08/13] hw/arm/mps3r: Initial skeleton for mps3-an536 board
On Tue, 6 Feb 2024 at 19:21, Philippe Mathieu-Daudé wrote: > > Hi Peter, > > On 6/2/24 14:29, Peter Maydell wrote: > > The AN536 is another FPGA image for the MPS3 development board. Unlike > > the existing FPGA images we already model, this board uses a Cortex-R > > family CPU, and it does not use any equivalent to the M-profile > > "Subsystem for Embedded" SoC-equivalent that we model in hw/arm/armsse.c. > > It's therefore more convenient for us to model it as a completely > > separate C file. > > > > This commit adds the basic skeleton of the board model, and the > > code to create all the RAM and ROM. We assume that we're probably > > going to want to add more images in future, so use the same > > base class/subclass setup that mps2-tz.c uses, even though at > > the moment there's only a single subclass. > > > > Following commits will add the CPUs and the peripherals. > > > > Signed-off-by: Peter Maydell > > --- > > MAINTAINERS | 3 +- > > configs/devices/arm-softmmu/default.mak | 1 + > > hw/arm/mps3r.c | 239 > > hw/arm/Kconfig | 5 + > > hw/arm/meson.build | 1 + > > 5 files changed, 248 insertions(+), 1 deletion(-) > > create mode 100644 hw/arm/mps3r.c > > > > +static MemoryRegion *mr_for_raminfo(MPS3RMachineState *mms, > > +const RAMInfo *raminfo) > > +{ > > +/* Return an initialized MemoryRegion for the RAMInfo. */ > > +MemoryRegion *ram; > > + > > +if (raminfo->mrindex < 0) { > > +/* Means this RAMInfo is for QEMU's "system memory" */ > > +MachineState *machine = MACHINE(mms); > > +assert(!(raminfo->flags & IS_ROM)); > > +return machine->ram; > > +} > > + > > +assert(raminfo->mrindex < MPS3R_RAM_MAX); > > +ram = >ram[raminfo->mrindex]; > > + > > +memory_region_init_ram(ram, NULL, raminfo->name, > > You are not using the parent=mms, is that deliberate? > (as in: easier to migrate eventually?) No, I didn't have a particular reason for not setting the parent; I just copied this bit of code from mps2-tz.c, which also doesn't set the parent pointer... -- PMM
Re: [PATCH v2] hw: riscv: Allow large kernels to boot by moving the initrd further away in RAM
On Tue, Feb 6, 2024 at 9:39 PM Daniel Henrique Barboza wrote: > > > > On 2/6/24 12:40, Alexandre Ghiti wrote: > > Currently, the initrd is placed at 128MB, which overlaps with the kernel > > when it is large (for example syzbot kernels are). From the kernel side, > > there is no reason we could not push the initrd further away in memory > > to accommodate large kernels, so move the initrd at 512MB when possible. > > > > The ideal solution would have been to place the initrd based on the > > kernel size but we actually can't since the bss size is not known when > > the image is loaded by load_image_targphys_as() and the initrd would > > then overlap with this section. > > > > Signed-off-by: Alexandre Ghiti > > --- > > Reviewed-by: Daniel Henrique Barboza Thanks for your help! Alex > > > > > Changes in v2: > > - Fix typos in commit log (Daniel) and title > > - Added to the commit log why using the kernel size does not work > >(Daniel) > > > > hw/riscv/boot.c | 12 ++-- > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c > > index 0ffca05189..9a367af2fa 100644 > > --- a/hw/riscv/boot.c > > +++ b/hw/riscv/boot.c > > @@ -188,13 +188,13 @@ static void riscv_load_initrd(MachineState *machine, > > uint64_t kernel_entry) > >* kernel is uncompressed it will not clobber the initrd. However > >* on boards without much RAM we must ensure that we still leave > >* enough room for a decent sized initrd, and on boards with large > > - * amounts of RAM we must avoid the initrd being so far up in RAM > > - * that it is outside lowmem and inaccessible to the kernel. > > - * So for boards with less than 256MB of RAM we put the initrd > > - * halfway into RAM, and for boards with 256MB of RAM or more we put > > - * the initrd at 128MB. > > + * amounts of RAM, we put the initrd at 512MB to allow large kernels > > + * to boot. 
> > + * So for boards with less than 1GB of RAM we put the initrd > > + * halfway into RAM, and for boards with 1GB of RAM or more we put > > + * the initrd at 512MB. > >*/ > > -start = kernel_entry + MIN(mem_size / 2, 128 * MiB); > > +start = kernel_entry + MIN(mem_size / 2, 512 * MiB); > > > > size = load_ramdisk(filename, start, mem_size - start); > > if (size == -1) {
[PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch standpoint. Hence, the use of PTEST brings no benefit from a throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index 3e6a5dfd63..f5a3634f9a 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len) } #ifdef CONFIG_AVX2_OPT -static bool __attribute__((target("sse4"))) -buffer_zero_sse4(const void *buf, size_t len) -{ -__m128i t = _mm_loadu_si128(buf); -__m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16); -__m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16); - -/* Loop over 16-byte aligned blocks of 64. */ -while (likely(p <= e)) { -__builtin_prefetch(p); -if (unlikely(!_mm_testz_si128(t, t))) { -return false; -} -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} - -/* Finish the aligned tail. */ -t |= e[-3]; -t |= e[-2]; -t |= e[-1]; - -/* Finish the unaligned tail. 
*/ -t |= _mm_loadu_si128(buf + len - 16); - -return _mm_testz_si128(t, t); -} - static bool __attribute__((target("avx2"))) buffer_zero_avx2(const void *buf, size_t len) { @@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info) #endif #ifdef CONFIG_AVX2_OPT { CPUINFO_AVX2,128, buffer_zero_avx2 }, -{ CPUINFO_SSE4, 64, buffer_zero_sse4 }, #endif { CPUINFO_SSE2, 64, buffer_zero_sse2 }, { CPUINFO_ALWAYS,0, buffer_zero_int }, -- 2.32.0
[PATCH v3 6/6] util/bufferiszero: improve scalar variant
Take into account that the inline wrapper ensures len >= 4. Use __attribute__((may_alias)) for accesses via non-char pointers. Avoid using out-of-bounds pointers in loop boundary conditions by reformulating the 'for' loop as 'if (...) do { ... } while (...)'. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 30 +++--- 1 file changed, 11 insertions(+), 19 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index d752edd8cc..1f4cbfaea4 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -29,35 +29,27 @@ bool buffer_is_zero_len_4_plus(const void *buf, size_t len) { -if (unlikely(len < 8)) { -/* For a very small buffer, simply accumulate all the bytes. */ -const unsigned char *p = buf; -const unsigned char *e = buf + len; -unsigned char t = 0; - -do { -t |= *p++; -} while (p < e); - -return t == 0; +if (unlikely(len <= 8)) { +/* Our caller ensures len >= 4. */ +return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0; } else { -/* Otherwise, use the unaligned memory access functions to - handle the beginning and end of the buffer, with a couple +/* Use unaligned memory access functions to handle + the beginning and end of the buffer, with a couple of loops handling the middle aligned section. */ -uint64_t t = ldq_he_p(buf); -const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8); -const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); +uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8); +typedef uint64_t uint64_a __attribute__((may_alias)); +const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8); +const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8); -for (; p + 8 <= e; p += 8) { +if (e - p >= 8) do { if (t) { return false; } t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7]; -} +} while ((p += 8) <= e - 8); while (p < e) { t |= *p++; } -t |= ldq_he_p(buf + len - 8); return t == 0; } -- 2.32.0
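The len <= 8 fast path in this patch relies on two possibly-overlapping 4-byte loads that together cover the whole buffer. A portable stand-alone illustration, with memcpy standing in for QEMU's ldl_he_p() unaligned-load helper:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for ldl_he_p(): a safe unaligned 32-bit load. */
static uint32_t load32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* For 4 <= len <= 8, two loads anchored at buf and buf + len - 4
 * overlap as needed and together cover every byte exactly once or
 * twice, so a single OR suffices to test the whole buffer. */
static bool small_is_zero(const void *buf, size_t len)
{
    return (load32(buf) | load32((const char *)buf + len - 4)) == 0;
}
```

ORing the two loads works regardless of host endianness, since only zero/non-zero matters.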
[PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
Make buffer_is_zero a 'static inline' function that tests up to three bytes from the buffer before handing off to an unrolled loop. This eliminates call overhead for most non-zero buffers, and allows length checks to be optimized out when the length is known at compile time (which is often the case in QEMU). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- include/qemu/cutils.h | 28 +++- util/bufferiszero.c | 76 --- 2 files changed, 47 insertions(+), 57 deletions(-) diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 92c927a6a3..62b153e603 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz); /* used to print char* safely */ #define STR_OR_NULL(str) ((str) ? (str) : "null") -bool buffer_is_zero(const void *buf, size_t len); +bool buffer_is_zero_len_4_plus(const void *, size_t); +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t); bool test_buffer_is_zero_next_accel(void); +/* + * Check if a buffer is all zeroes. + */ +static inline bool buffer_is_zero(const void *vbuf, size_t len) +{ +const char *buf = vbuf; + +if (len == 0) { +return true; +} +if (buf[0] || buf[len - 1] || buf[len / 2]) { +return false; +} +/* All bytes are covered for any len <= 3. */ +if (len <= 3) { +return true; +} + +if (len >= 256) { +return buffer_is_zero_len_256_plus(vbuf, len); +} else { +return buffer_is_zero_len_4_plus(vbuf, len); +} +} + /* * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128) * Input is limited to 14-bit numbers diff --git a/util/bufferiszero.c b/util/bufferiszero.c index f5a3634f9a..01050694a6 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -26,8 +26,8 @@ #include "qemu/bswap.h" #include "host/cpuinfo.h" -static bool -buffer_zero_int(const void *buf, size_t len) +bool +buffer_is_zero_len_4_plus(const void *buf, size_t len) { if (unlikely(len < 8)) { /* For a very small buffer, simply accumulate all the bytes. 
*/ @@ -157,57 +157,40 @@ buffer_zero_avx512(const void *buf, size_t len) } #endif /* CONFIG_AVX512F_OPT */ -/* - * Make sure that these variables are appropriately initialized when - * SSE2 is enabled on the compiler command-line, but the compiler is - * too old to support CONFIG_AVX2_OPT. - */ -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) -# define INIT_USED 0 -# define INIT_LENGTH 0 -# define INIT_ACCELbuffer_zero_int -#else -# ifndef __SSE2__ -# error "ISA selection confusion" -# endif -# define INIT_USED CPUINFO_SSE2 -# define INIT_LENGTH 64 -# define INIT_ACCELbuffer_zero_sse2 -#endif - -static unsigned used_accel = INIT_USED; -static unsigned length_to_accel = INIT_LENGTH; -static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL; - static unsigned __attribute__((noinline)) select_accel_cpuinfo(unsigned info) { /* Array is sorted in order of algorithm preference. */ static const struct { unsigned bit; -unsigned len; bool (*fn)(const void *, size_t); } all[] = { #ifdef CONFIG_AVX512F_OPT -{ CPUINFO_AVX512F, 256, buffer_zero_avx512 }, +{ CPUINFO_AVX512F, buffer_zero_avx512 }, #endif #ifdef CONFIG_AVX2_OPT -{ CPUINFO_AVX2,128, buffer_zero_avx2 }, +{ CPUINFO_AVX2,buffer_zero_avx2 }, #endif -{ CPUINFO_SSE2, 64, buffer_zero_sse2 }, -{ CPUINFO_ALWAYS,0, buffer_zero_int }, +{ CPUINFO_SSE2,buffer_zero_sse2 }, +{ CPUINFO_ALWAYS, buffer_is_zero_len_4_plus }, }; for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) { if (info & all[i].bit) { -length_to_accel = all[i].len; -buffer_accel = all[i].fn; +buffer_is_zero_len_256_plus = all[i].fn; return all[i].bit; } } return 0; } +static unsigned used_accel +#if defined(__SSE2__) += CPUINFO_SSE2; +#else += 0; +#endif + #if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) static void __attribute__((constructor)) init_accel(void) { @@ -227,35 +210,16 @@ bool test_buffer_is_zero_next_accel(void) return used; } -static bool select_accel_fn(const void *buf, size_t len) -{ -if (likely(len >= length_to_accel)) { 
-return buffer_accel(buf, len); -} -return buffer_zero_int(buf, len); -} - #else -#define select_accel_fn buffer_zero_int bool test_buffer_is_zero_next_accel(void) { return false; } #endif -/* - * Checks if a buffer is all zeroes - */ -bool buffer_is_zero(const void *buf, size_t len) -{ -if (unlikely(len == 0)) { -return true; -} - -/* Fetch the beginning of the buffer while we select the accelerator. */ -__builtin_prefetch(buf); - -/* Use an optimized zero check if possible. Note that this also - includes a
[PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov Signed-off-by: Alexander Monakov --- util/bufferiszero.c | 36 ++---------------------------------- 1 file changed, 2 insertions(+), 34 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index 01050694a6..c037d11d04 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) } } -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__) +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) #include <immintrin.h> /* Note that each of these vectorized functions require len >= 64. */ @@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len) } #endif /* CONFIG_AVX2_OPT */ -#ifdef CONFIG_AVX512F_OPT -static bool __attribute__((target("avx512f"))) -buffer_zero_avx512(const void *buf, size_t len) -{ -/* Begin with an unaligned head of 64 bytes. */ -__m512i t = _mm512_loadu_si512(buf); -__m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64); -__m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64); - -/* Loop over 64-byte aligned blocks of 256. 
*/ -while (p <= e) { -__builtin_prefetch(p); -if (unlikely(_mm512_test_epi64_mask(t, t))) { -return false; -} -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} - -t |= _mm512_loadu_si512(buf + len - 4 * 64); -t |= _mm512_loadu_si512(buf + len - 3 * 64); -t |= _mm512_loadu_si512(buf + len - 2 * 64); -t |= _mm512_loadu_si512(buf + len - 1 * 64); - -return !_mm512_test_epi64_mask(t, t); - -} -#endif /* CONFIG_AVX512F_OPT */ - static unsigned __attribute__((noinline)) select_accel_cpuinfo(unsigned info) { @@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info) unsigned bit; bool (*fn)(const void *, size_t); } all[] = { -#ifdef CONFIG_AVX512F_OPT -{ CPUINFO_AVX512F, buffer_zero_avx512 }, -#endif #ifdef CONFIG_AVX2_OPT { CPUINFO_AVX2,buffer_zero_avx2 }, #endif @@ -191,7 +159,7 @@ static unsigned used_accel = 0; #endif -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) +#if defined(CONFIG_AVX2_OPT) static void __attribute__((constructor)) init_accel(void) { used_accel = select_accel_cpuinfo(cpuinfo_init()); -- 2.32.0
[PATCH v3 4/6] util/bufferiszero: remove useless prefetches
Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index c037d11d04..cb3eb2543f 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -49,7 +49,6 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8); for (; p + 8 <= e; p += 8) { -__builtin_prefetch(p + 8); if (t) { return false; } @@ -79,7 +78,6 @@ buffer_zero_sse2(const void *buf, size_t len) /* Loop over 16-byte aligned blocks of 64. */ while (likely(p <= e)) { -__builtin_prefetch(p); t = _mm_cmpeq_epi8(t, zero); if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) { return false; } @@ -110,7 +108,6 @@ buffer_zero_avx2(const void *buf, size_t len) /* Loop over 32-byte aligned blocks of 128. */ while (p <= e) { -__builtin_prefetch(p); if (unlikely(!_mm256_testz_si256(t, t))) { return false; } -- 2.32.0
[PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Avoid using out-of-bounds pointers in loop boundary conditions. Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of PTEST, which is not profitable there (like in the removed SSE4 variant). Signed-off-by: Alexander Monakov Signed-off-by: Mikhail Romanov --- util/bufferiszero.c | 108 1 file changed, 69 insertions(+), 39 deletions(-) diff --git a/util/bufferiszero.c b/util/bufferiszero.c index cb3eb2543f..d752edd8cc 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -66,62 +66,92 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len) #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__) #include <immintrin.h> -/* Note that each of these vectorized functions require len >= 64. */ +/* Helper for preventing the compiler from reassociating + chains of binary vector operations. */ +#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1)) + +/* Note that these vectorized functions may assume len >= 256. */ static bool __attribute__((target("sse2"))) buffer_zero_sse2(const void *buf, size_t len) { -/* Unaligned loads at head/tail. */ wait
*/ +v |= e[-1]; w |= e[-2]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-3]; w |= e[-4]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-5]; w |= e[-6]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-7]; v |= w; + +/* Loop over complete 128-byte blocks. */ +for (; p < e - 7; p += 8) { +v = _mm_cmpeq_epi8(v, zero); +if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) { return false; } -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; +v = p[0]; w = p[1]; +SSE_REASSOC_BARRIER(v, w); +v |= p[2]; w |= p[3]; +SSE_REASSOC_BARRIER(v, w); +v |= p[4]; w |= p[5]; +SSE_REASSOC_BARRIER(v, w); +v |= p[6]; w |= p[7]; +SSE_REASSOC_BARRIER(v, w); +v |= w; } -/* Finish the aligned tail. */ -t |= e[-3]; -t |= e[-2]; -t |= e[-1]; - -/* Finish the unaligned tail. */ -t |= _mm_loadu_si128(buf + len - 16); - -return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF; +return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF; } #ifdef CONFIG_AVX2_OPT static bool __attribute__((target("avx2"))) buffer_zero_avx2(const void *buf, size_t len) { -/* Begin with an unaligned head of 32 bytes. */ -__m256i t = _mm256_loadu_si256(buf); -__m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32); -__m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32); - -/* Loop over 32-byte aligned blocks of 128. */ -while (p <= e) { -if (unlikely(!_mm256_testz_si256(t, t))) { +/* Unaligned loads at head/tail. */ +__m256i v = *(__m256i_u *)(buf); +__m256i w = *(__m256i_u *)(buf + len - 32); +/* Align head/tail to 32-byte boundaries. */ +__m256i *p = (void *)(((uintptr_t)buf + 32) & -32); +__m256i *e = (void *)(((uintptr_t)buf + len - 1) & -32); +__m256i zero = { 0 }; + +/* Collect a partial block at tail end. */ +v |= e[-1]; w |= e[-2]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-3]; w |= e[-4]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-5]; w |= e[-6]; +SSE_REASSOC_BARRIER(v, w); +v |= e[-7]; v |= w; + +/* Loop over complete 256-byte blocks. */ +for (; p < e - 7; p += 8) { +/* PTEST is not profitable here. 
*/ +v = _mm256_cmpeq_epi8(v, zero); +if (unlikely(_mm256_movemask_epi8(v) != 0x)) { return false; } -t = p[-4] | p[-3] | p[-2] | p[-1]; -p += 4; -} ; - -/* Finish the last block of 128 unaligned. */ -t |= _mm256_loadu_si256(buf + len - 4 * 32); -t |= _mm256_loadu_si256(buf + len - 3 * 32); -t |= _mm256_loadu_si256(buf + len - 2 * 32); -t |= _mm256_loadu_si256(buf + len - 1 * 32); +v = p[0]; w = p[1]; +SSE_REASSOC_BARRIER(v, w); +v |= p[2]; w |= p[3]; +SSE_REASSOC_BARRIER(v, w); +v |= p[4]; w |= p[5]; +SSE_REASSOC_BARRIER(v, w); +v |=
[PATCH v3 0/6] Optimize buffer_is_zero
I am posting a new revision of buffer_is_zero improvements (v2 can be found at https://patchew.org/QEMU/20231027143704.7060-1-mmroma...@ispras.ru/ ).

In our experiments buffer_is_zero took about 40%-50% of overall qemu-img run time, even though Glib I/O is not very efficient. Hence, it remains an important routine to optimize.

We substantially improve its performance in typical cases, mostly by introducing an inline wrapper that samples three bytes from the head/middle/tail of the buffer, avoiding call overhead when any of those is non-zero. We also provide improvements for the SIMD and portable scalar variants.

Changes for v3:
- separate into 6 patches
- fix an oversight which would break the build on non-x86 hosts
- properly avoid out-of-bounds pointers in the scalar variant

Alexander Monakov (6):
  util/bufferiszero: remove SSE4.1 variant
  util/bufferiszero: introduce an inline wrapper
  util/bufferiszero: remove AVX512 variant
  util/bufferiszero: remove useless prefetches
  util/bufferiszero: optimize SSE2 and AVX2 variants
  util/bufferiszero: improve scalar variant

 include/qemu/cutils.h |  28 -
 util/bufferiszero.c   | 280 +++---
 2 files changed, 128 insertions(+), 180 deletions(-)

-- 
2.32.0
Re: [PATCH v2] hw: riscv: Allow large kernels to boot by moving the initrd further away in RAM
On 2/6/24 12:40, Alexandre Ghiti wrote:

Currently, the initrd is placed at 128MB, which overlaps with the kernel when the kernel is large (as syzbot kernels are, for example). From the kernel side, there is no reason we could not push the initrd further away in memory to accommodate large kernels, so move the initrd to 512MB when possible.

The ideal solution would have been to place the initrd based on the kernel size, but we actually can't since the bss size is not known when the image is loaded by load_image_targphys_as(), and the initrd would then overlap with this section.

Signed-off-by: Alexandre Ghiti
---

Reviewed-by: Daniel Henrique Barboza

Changes in v2:
- Fix typos in commit log (Daniel) and title
- Added to the commit log why using the kernel size does not work (Daniel)

 hw/riscv/boot.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index 0ffca05189..9a367af2fa 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -188,13 +188,13 @@ static void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry)
      * kernel is uncompressed it will not clobber the initrd. However
      * on boards without much RAM we must ensure that we still leave
      * enough room for a decent sized initrd, and on boards with large
-     * amounts of RAM we must avoid the initrd being so far up in RAM
-     * that it is outside lowmem and inaccessible to the kernel.
-     * So for boards with less than 256MB of RAM we put the initrd
-     * halfway into RAM, and for boards with 256MB of RAM or more we put
-     * the initrd at 128MB.
+     * amounts of RAM, we put the initrd at 512MB to allow large kernels
+     * to boot.
+     * So for boards with less than 1GB of RAM we put the initrd
+     * halfway into RAM, and for boards with 1GB of RAM or more we put
+     * the initrd at 512MB.
      */
-    start = kernel_entry + MIN(mem_size / 2, 128 * MiB);
+    start = kernel_entry + MIN(mem_size / 2, 512 * MiB);
     size = load_ramdisk(filename, start, mem_size - start);
     if (size == -1) {
Re: [PATCH 02/13] target/arm: The Cortex-R52 has a read-only CBAR
On 2/6/24 23:29, Peter Maydell wrote:

The Cortex-R52 implements the Configuration Base Address Register (CBAR) as a read-only register. Add ARM_FEATURE_CBAR_RO to this CPU type, so that our implementation provides the register and the associated qdev property.

Signed-off-by: Peter Maydell
---
 target/arm/tcg/cpu32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/arm/tcg/cpu32.c b/target/arm/tcg/cpu32.c
index 11253051156..311d654cdce 100644
--- a/target/arm/tcg/cpu32.c
+++ b/target/arm/tcg/cpu32.c
@@ -809,6 +809,7 @@ static void cortex_r52_initfn(Object *obj)
     set_feature(&cpu->env, ARM_FEATURE_PMSA);
     set_feature(&cpu->env, ARM_FEATURE_NEON);
     set_feature(&cpu->env, ARM_FEATURE_GENERIC_TIMER);
+    set_feature(&cpu->env, ARM_FEATURE_CBAR_RO);

Reviewed-by: Richard Henderson

I just noticed that arm_cpu_post_init can be simplified to not check CBAR_RO, now that we have arm_cpu_propagate_feature_implications.

r~
Re: [PATCH 01/13] target/arm: Use new CBAR encoding for all v8 CPUs, not all aarch64 CPUs
On 2/6/24 23:29, Peter Maydell wrote:

We support two different encodings for the AArch32 IMPDEF CBAR register -- older cores like the Cortex A9, A7, A15 have this at 4, c15, c0, 0; newer cores like the Cortex A35, A53, A57 and A72 have it at 1, c15, c0, 0. When we implemented this we picked which encoding to use based on whether the CPU set ARM_FEATURE_AARCH64. However this isn't right for three cases:

* the qemu-system-arm 'max' CPU, which is supposed to be a variant on a Cortex-A57; it ought to use the same encoding the A57 does and which the AArch64 'max' exposes to AArch32 guest code
* the Cortex-R52, which is AArch32-only but has the CBAR at the newer encoding (and where we incorrectly are not yet setting ARM_FEATURE_CBAR_RO anyway)
* any possible future support for other v8 AArch32-only CPUs, or for supporting "boot the CPU into AArch32 mode" on our existing cores like the A57 etc

Make the decision of the encoding be based on whether the CPU implements the ARM_FEATURE_V8 flag instead. This changes the behaviour only for the qemu-system-arm '-cpu max'. We don't expect anybody to be relying on the old behaviour because:

* it's not what the real hardware Cortex-A57 does (and that's what our ID register claims we are)

Not even that, because max resets MIDR. Anyway,

Reviewed-by: Richard Henderson

r~
Re: [PATCH] target/riscv: Update $pc after linking to $ra in trans_cm_jalt()
On 2/6/24 23:18, Jason Chien wrote:

The original implementation sets $pc to the address read from the jump vector table first and links $ra with the address of the next instruction after the updated $pc. After jumping to the updated $pc and executing the next ret instruction, the program jumps to $ra, which is in the same function currently executing, resulting in an infinite loop. This commit reverses the two actions: firstly, $ra is updated with the address of the next instruction after $pc, and then $pc is set to the address read from the jump vector table.

This is unlikely to be correct in the case the vector table read faults, leaving $ra updated. I guess this got broken with CF_PCREL.

Anyway, the solution is to use a temporary...

-    /*
-     * Update pc to current for the non-unwinding exception
-     * that might come from cpu_ld*_code() in the helper.
-     */
-    gen_update_pc(ctx, 0);
-    gen_helper_cm_jalt(cpu_pc, cpu_env, tcg_constant_i32(a->index));

... here and then ...

@@ -307,6 +300,13 @@ static bool trans_cm_jalt(DisasContext *ctx, arg_cm_jalt *a)
         gen_set_gpr(ctx, xRA, succ_pc);
     }

... copy the temp to cpu_pc here.

     tcg_gen_lookup_and_goto_ptr();
     ctx->base.is_jmp = DISAS_NORETURN;
     return true;

r~