Re: Proposal for a regular upstream performance testing

2020-11-30 Thread Lukáš Doktor

On 30. 11. 20 at 14:23, Stefan Hajnoczi wrote:

On Thu, Nov 26, 2020 at 09:43:38AM +, Daniel P. Berrangé wrote:

On Thu, Nov 26, 2020 at 09:10:14AM +0100, Lukáš Doktor wrote:

Ideally the community should have a way to also issue their custom builds
in order to verify their patches, so they can debug and address issues
better than just committing to qemu-master.


Allowing community builds certainly adds an extra dimension of complexity
to the problem, since you need some kind of permissions control: you can't
allow any arbitrary user on the web to trigger jobs with arbitrary code,
as that is a significant security risk to your infra.


syzkaller and other upstream CI/fuzzing systems do this, so it may be
hard but not impossible.



Sure, not impossible, but it is not something I can offer at this point. I can't 
promise anything, but maybe this can change in the future; or, in solution 2, 
someone else might resolve the permission issues and I would only assist with the 
setup (if needed).


I think I'd just suggest providing a mechanism for the user to easily spin
up performance test jobs on their own hardware. This could be as simple
as providing a docker container recipe that users can deploy on some
arbitrary machine of their choosing that contains the test rig. All they
should need to do is provide a git ref, and then launching the container and
running jobs should be a single command. They can simply run the tests
twice, with and without the patch series in question.
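
(As an illustration only: the image name and the rig's options below are made
up, they just sketch the kind of single-command interface being suggested here.)

  # build or pull a container image that bundles the test rig, then point it
  # at a git ref; run once with and once without the series under test
  docker run --rm --device /dev/kvm qemu-perf-rig --ref my-branch-with-patches
  docker run --rm --device /dev/kvm qemu-perf-rig --ref master
  # compare the two result directories afterwards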


As soon as developers need to recreate an environment it becomes
time-consuming and there is a risk that the issue won't be reproduced.
That doesn't mean the system is useless - big regressions will still be
tackled - but I think it's too much friction and we should aim to run
community builds.



I do understand, but unfortunately at this point I cannot provide that.


The problem with those is that we cannot simply use travis/gitlab/...
machines for running those tests, because we are measuring actual
in-guest performance.


As mentioned above - distinguish between the CI framework, and the
actual test runner.


Does the CI framework or the test runner handle detecting regressions
and providing historical data? I ask because I'm not sure if GitLab CI
provides any of this functionality or whether we'd need to write a
custom CI tool to track and report regressions.



Currently I am using Jenkins, which allows publishing results (number of failures 
and total checks) and storing artifacts. I am storing the pbench JSON results 
with metadata (a few MBs) and an HTML report (also a few MBs). Each HTML report 
contains a timeline of usually 14 previous builds, using them as a reference.

Provided GitLab can do something similar, we should be able to see the number of 
tests run/failed somewhere and then browse the builds' HTML reports. Last but 
not least, we can fetch the pbench JSON results and issue another comparison, 
cherry-picking individual results (internally I have a pipeline to do that for 
me; I could add a helper to do that via cmdline/container for others as well, 
roughly as sketched below).
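
(A rough sketch of what such a cmdline helper could look like; the script name,
its options and the CI URLs are hypothetical, only the pbench JSON artifacts
are assumed to exist.)

  # fetch the pbench result JSON from two builds and compare them
  wget -O ref.json https://ci.example.org/job/qemu-perf/140/artifact/result.json
  wget -O new.json https://ci.example.org/job/qemu-perf/141/artifact/result.json
  ./compare-pbench-results.py --reference ref.json --target new.json --html report.html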

Regards,
Lukáš


Stefan






Re: [RFC PATCH v2 4/5] virtio-net: Added eBPF RSS to virtio-net.

2020-11-30 Thread Yuri Benditovich
On Tue, Nov 24, 2020 at 10:49 AM Jason Wang  wrote:

>
> > On 2020/11/19 7:13 PM, Andrew Melnychenko wrote:
> > From: Andrew 
> >
> > When RSS is enabled the device tries to load the eBPF program
> > to select RX virtqueue in the TUN. If eBPF can be loaded
> > the RSS will function also with vhost (works with kernel 5.8 and later).
> > Software RSS is used as a fallback with vhost=off when eBPF can't be
> loaded
> > or when hash population requested by the guest.
> >
> > Signed-off-by: Yuri Benditovich 
> > Signed-off-by: Andrew Melnychenko 
> > ---
> >   hw/net/vhost_net.c |   2 +
> >   hw/net/virtio-net.c| 120 +++--
> >   include/hw/virtio/virtio-net.h |   4 ++
> >   net/vhost-vdpa.c   |   2 +
> >   4 files changed, 124 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > index 24d555e764..16124f99c3 100644
> > --- a/hw/net/vhost_net.c
> > +++ b/hw/net/vhost_net.c
> > @@ -71,6 +71,8 @@ static const int user_feature_bits[] = {
> >   VIRTIO_NET_F_MTU,
> >   VIRTIO_F_IOMMU_PLATFORM,
> >   VIRTIO_F_RING_PACKED,
> > +VIRTIO_NET_F_RSS,
> > +VIRTIO_NET_F_HASH_REPORT,
> >
> >   /* This bit implies RARP isn't sent by QEMU out of band */
> >   VIRTIO_NET_F_GUEST_ANNOUNCE,
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index 277289d56e..afcc3032ec 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -698,6 +698,19 @@ static void virtio_net_set_queues(VirtIONet *n)
> >
> >   static void virtio_net_set_multiqueue(VirtIONet *n, int multiqueue);
> >
> > +static uint64_t fix_ebpf_vhost_features(uint64_t features)
> > +{
> > +/* If vhost=on & CONFIG_EBPF doesn't set - disable RSS feature */
> > +uint64_t ret = features;
> > +#ifndef CONFIG_EBPF
> > +virtio_clear_feature(&ret, VIRTIO_NET_F_RSS);
> > +#endif
> > +/* for now, there is no solution for populating the hash from eBPF
> */
> > +virtio_clear_feature(&ret, VIRTIO_NET_F_HASH_REPORT);
>
>
> I think there's still some misunderstanding here.
>
> When "rss" is enabled via command line, qemu can't not turn it off
> silently, otherwise it may break migration. Instead, qemu should disable
> vhost-net if eBPF can't be loaded.
>
> When "hash_report" is enabled via command line, qemu should disable
> vhost-net unconditionally.
>
>
I agree in general with this requirement and I'm preparing an
implementation of such a fallback.

The problem is that qemu already uses a mechanism of silently turning off host
features if they are not supported by the current vhost in the kernel:
https://github.com/qemu/qemu/blob/b0f8c22d6d4d07f3bd2307bcc62e1660ef965472/hw/virtio/vhost.c#L1526

Can you please comment on it and let me know how it should be modified in the
future?
I plan to use it in the next work (implementing hash report in the kernel).
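
(For illustration only, the fallback Jason describes might look roughly like the
sketch below. This is not code from the series: the vhost_disabled field is made
up, and ebpf_rss_load() is assumed to be the loader helper introduced here.)

static void virtio_net_decide_vhost_fallback(VirtIONet *n)
{
    if (virtio_has_feature(n->host_features, VIRTIO_NET_F_HASH_REPORT)) {
        /* hash_report requested on the command line: no vhost support for
         * it yet, so unconditionally fall back to the userspace datapath */
        n->vhost_disabled = true;               /* hypothetical field */
        return;
    }
    if (virtio_has_feature(n->host_features, VIRTIO_NET_F_RSS) &&
        !ebpf_rss_load(&n->ebpf_rss)) {         /* assumed loader helper */
        /* RSS stays advertised (so migration is not silently broken), but
         * the device falls back to software RSS without vhost-net */
        n->vhost_disabled = true;               /* hypothetical field */
    }
}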


>
> > +
> > +return ret;
> > +}
> > +
> >   static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t
> features,
> >   Error **errp)
> >   {
> > @@ -732,9 +745,9 @@ static uint64_t virtio_net_get_features(VirtIODevice
> *vdev, uint64_t features,
> >   return features;
> >   }
> >
> > -virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
> > -virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
> > -features = vhost_net_get_features(get_vhost_net(nc->peer),
> features);
> > +features = fix_ebpf_vhost_features(
> > +vhost_net_get_features(get_vhost_net(nc->peer), features));
> > +
> >   vdev->backend_features = features;
> >
> >   if (n->mtu_bypass_backend &&
> > @@ -1169,12 +1182,75 @@ static int virtio_net_handle_announce(VirtIONet
> *n, uint8_t cmd,
> >   }
> >   }
> >
> > +static void virtio_net_unload_epbf_rss(VirtIONet *n);
> > +
> >   static void virtio_net_disable_rss(VirtIONet *n)
> >   {
> >   if (n->rss_data.enabled) {
> >   trace_virtio_net_rss_disable();
> >   }
> >   n->rss_data.enabled = false;
> > +
> > +if (!n->rss_data.enabled_software_rss &&
> ebpf_rss_is_loaded(&n->ebpf_rss)) {
> > +virtio_net_unload_epbf_rss(n);
> > +}
> > +}
> > +
> > +static bool virtio_net_attach_steering_ebpf(NICState *nic, int prog_fd)
> > +{
> > +NetClientState *nc = qemu_get_peer(qemu_get_queue(nic), 0);
> > +if (nc == NULL || nc->info->set_steering_ebpf == NULL) {
> > +return false;
> > +}
> > +
> > +return nc->info->set_steering_ebpf(nc, prog_fd);
> > +}
> > +
> > +static void rss_data_to_rss_config(struct VirtioNetRssData *data,
> > +   struct EBPFRSSConfig *config)
> > +{
> > +config->redirect = data->redirect;
> > +config->populate_hash = data->populate_hash;
> > +config->hash_types = data->hash_types;
> > +config->indirections_len = data->indirections_len;
> > +config->default_queue = data->default_queue;
> > +}
> > +
> > +static bool virtio_net_load_epbf

Re: [PATCH] python 3.5 compatibility

2020-11-30 Thread Marc-André Lureau
Hi

On Mon, Nov 30, 2020 at 10:29 PM Enrico Weigelt, metux IT consult <
l...@metux.net> wrote:

> On 27.11.20 20:15, Peter Maydell wrote:
>
> Hi,
>
> > Could you say which "stable distros" you have in mind, and whether
> > they are covered by our "supported build platforms" policy
> > https://www.qemu.org/docs/master/system/build-platforms.html  ?
>
> I'm running on Devuan Ascii.
>
> And packaging python-3.6 just for a few pieces of one application (qemu)
> is far too much work.
>

It might be necessary to keep that distro alive, as python 3.5 is EOL:

https://www.python.org/downloads/release/python-3510/

(fwiw, upcoming meson version will also depend on >= 3.6)


> --mtx
>
> --
> ---
> Notice: unencrypted e-mails can easily be intercepted and manipulated!
> For confidential communication, please send your GPG/PGP key.
> ---
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> i...@metux.net -- +49-151-27565287
>
>

-- 
Marc-André Lureau


Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots

2020-11-30 Thread Peter Krempa
On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.

Hi,

I gave this a try when attempting to implement the libvirt code for this
feature. I ran into a problem of migration failing right away. The
VM's CPUs were running at that point.

QEMU logged the following to stdout/err:

2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: 
start=7f2007fff000 len=33554432000 mode=2 errno=22
2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() 
failed: restoring initial memory state
2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: 
start=7f2007fff000 len=33554432000 mode=0 errno=2
2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() 
failed: start=7f2007fff000 len=33554432000 errno=22


The migration was started by the following QMP conversation:


QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 
buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": 
"libvirt-14"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 
buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": 
"libvirt-15"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 
buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"}

QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": 
"libvirt-16"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 
buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"}

QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 
1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": 
"setup"}}
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": 
"libvirt-17"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 
1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": 
{"pass": 1}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 
1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": 
"active"}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 
1606805733, "microseconds": 966488}, "event": "STOP"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 
1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": 
"failed"}}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 
buf={"execute":"query-migrate","id":"libvirt-18"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": 
"failed"}, "id": "libvirt-18"}
qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly 
failed

 $ uname -r
5.8.18-300.fc33.x86_64


The VM was created by libvirt with the following patchset applied:

https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot

git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot

Start the snapshot via:

virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no 
--diskspec sda,snapshot=no --no-metadata upstream

Note you can omit --diskspec if you have a diskless VM.

The patches are VERY work in progress as I need to figure out the proper
sequencing to ensure a consistent snapshot.

Note that in cases when qemu can't guarantee that the
background_snapshot feature will work it should not advertise it. We
need a way to check whether it's possible to use it, so we can replace
the existing --live flag with it rather than adding a new one and
shifting the problem of checking whether the feature works to the user.
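
(For reference, a management application can already probe whether the new
capability is present via query-migrate-capabilities; illustrative qmp-shell
excerpt, assuming QEMU only advertises the capability when it can actually be
used.)

(QEMU) query-migrate-capabilities
{"return": [{"state": false, "capability": "xbzrle"},
            ...,
            {"state": false, "capability": "background-snapshot"}]}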




Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

2020-11-30 Thread Shenming Lu
On 2020/12/1 1:03, Alex Williamson wrote:
> On Thu, 26 Nov 2020 14:56:17 +0800
> Shenming Lu  wrote:
> 
>> Hi,
>>
>> After reading everyone's opinions, we have a rough idea for this issue.
>>
>> One key point is whether it is necessary to setup the config space before
>> the device can accept further migration data. I think it is decided by
>> the vendor driver, so we can simply ask the vendor driver about it in
>> .save_setup, which could avoid a lot of unnecessary copies and settings.
>> Once we have known the need, we can iterate the config space (before)
>> along with the device migration data in .save_live_iterate and
>> .save_live_complete_precopy, and if not needed, we can only migrate the
>> config space in .save_state.
>>
>> Another key point is that the interrupt enabling should be after the
>> restoring of the interrupt controller (might not only interrupts).
>> My solution is to add a subflag at the beginning of the config data
>> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
>> actions on the dst (such as whether to enable interrupts).
>>
>> Below is its workflow.
>>
>> On the save path:
>>  In vfio_save_setup():
>>  Ask the vendor driver if it needs the config space setup before it
>>  can accept further migration data.
> 
> How does "ask the vendor driver" actually work?

Maybe via an ioctl?
Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup)
and send the config data (before) along with the device migration data
every time?...

> 
>>  |
>>  In vfio_save_iterate() (pre-copy):
>>  If *needed*, save the config space which would be setup on the dst
>>  before the migration data, but send with a subflag to instruct not
>>  to (such as) enable interrupts.
> 
> If not for triggering things like MSI/X configuration, isn't config
> space almost entirely virtual?  What visibility does the vendor driver
> have to the VM machine dependencies regarding device interrupt versus
> interrupt controller migration?

My thought is that the vendor driver only decides the order of the config
space setup and the receiving of the migration data, but leaves the VM
machine dependencies to QEMU.

> 
>>  |
>>  In vfio_save_complete_precopy() (stop-and-copy, iterable process):
>>  The same as that in vfio_save_iterate().
>>  |
>>  In .save_state (stop-and-copy, non-iterable process):
>>  If *needed*, only send a subflag to instruct to enable interrupts.
>>  If *not needed*, save the config space and setup everything on the dst.
> 
> Again, how does the vendor driver have visibility to know when the VM
> machine can enable interrupts?

It seems troubling if the vendor driver needs the interrupts to be enabled
first...

> 
>>
>> Besides the above idea, we might be able to choose to let the vendor driver do
>> more: qemu just sends and writes the config data (before) along with the device
>> migration data every time, and it's up to the vendor driver to filter out/buffer
>> the received data or reorder the settings...
> 
> There is no vendor driver in QEMU though, so are you suggesting that
> QEMU follows a standard protocol and the vendor driver chooses when to
> enable specific features?  For instance, QEMU would call SET_IRQS and
> the driver would return success, but defer that setup if necessary?
> That seems quite troubling as we then have ioctls that behave
> differently depending on the device state and we have no error path to
> userspace should that setup fail later.  The vendor driver does have
> its own data stream for migration, so the vendor driver could tell the
> destination version of itself what type of interrupt to use, which
> might be sufficient if we were to ignore the latency if QEMU were to
> defer interrupt setup until stop-and-copy.

Did you mean that we could only enable MSI-X during the iterable phase, but
leave the setup of these unmasked vectors to the non-iterable phase?
It looks good to me.

> 
> Is the question of when to setup device interrupts versus the interrupt
> controller state largely a machine issue within QEMU?  If so, shouldn't
> it be at QEMU's determination when to act on the config space
> information on the target?

I think it would be simpler to ensure the proper calling order in QEMU...

> IOW, if a vendor driver has a dependency on
> interrupt configuration, they need to include it in their own pre-copy
> data stream and decouple that dependency from userspace interrupt
> configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,
> 

I don't understand what decoupling that dependency from userspace
interrupt configuration means...

Thanks,
Shenming

> Alex
> 
> .
> 



Re: [PATCH v4] qga: Correct loop count in qmp_guest_get_vcpus()

2020-11-30 Thread Marc-André Lureau
Hi

On Tue, Dec 1, 2020 at 10:14 AM Lin Ma  wrote:

> The guest-get-vcpus returns incorrect vcpu info in case we hotunplug
> vcpus(not
> the last one).
> e.g.:
> A VM has 4 VCPUs: cpu0 + 3 hotunpluggable online vcpus(cpu1, cpu2 and
> cpu3).
> Hotunplug cpu2,  Now only cpu0, cpu1 and cpu3 are present & online.
>
> ./qmp-shell /tmp/qmp-monitor.sock
> (QEMU) query-hotpluggable-cpus
> {"return": [
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 3}, "vcpus-count": 1,
>  "qom-path": "/machine/peripheral/cpu3", "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 2}, "vcpus-count": 1,
>  "qom-path": "/machine/peripheral/cpu2", "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, "vcpus-count": 1,
>  "qom-path": "/machine/peripheral/cpu1", "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0}, "vcpus-count": 1,
>  "qom-path": "/machine/unattached/device[0]", "type": "host-x86_64-cpu"}
> ]}
>
> (QEMU) device_del id=cpu2
> {"return": {}}
>
> (QEMU) query-hotpluggable-cpus
> {"return": [
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 3}, "vcpus-count": 1,
>  "qom-path": "/machine/peripheral/cpu3", "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 2}, "vcpus-count": 1,
>  "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, "vcpus-count": 1,
>  "qom-path": "/machine/peripheral/cpu1", "type": "host-x86_64-cpu"},
> {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0}, "vcpus-count": 1,
>  "qom-path": "/machine/unattached/device[0]", "type": "host-x86_64-cpu"}
> ]}
>
> Before:
> ./qmp-shell -N /tmp/qmp-ga.sock
> Welcome to the QMP low-level shell!
> Connected
> (QEMU) guest-get-vcpus
> {"return": [
> {"online": true, "can-offline": false, "logical-id": 0},
> {"online": true, "can-offline": true, "logical-id": 1}]}
>
> After:
> ./qmp-shell -N /tmp/qmp-ga.sock
> Welcome to the QMP low-level shell!
> Connected
> (QEMU) guest-get-vcpus
> {"return": [
> {"online": true, "can-offline": false, "logical-id": 0},
> {"online": true, "can-offline": true, "logical-id": 1},
> {"online": true, "can-offline": true, "logical-id": 3}]}
>
> Signed-off-by: Lin Ma 
> ---
>  qga/commands-posix.c | 44 +++-
>  1 file changed, 15 insertions(+), 29 deletions(-)
>
> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
> index c089e38120..48dcdae239 100644
> --- a/qga/commands-posix.c
> +++ b/qga/commands-posix.c
> @@ -2376,24 +2376,6 @@ error:
>  return NULL;
>  }
>
> -#define SYSCONF_EXACT(name, errp) sysconf_exact((name), #name, (errp))
> -
> -static long sysconf_exact(int name, const char *name_str, Error **errp)
> -{
> -long ret;
> -
> -errno = 0;
> -ret = sysconf(name);
> -if (ret == -1) {
> -if (errno == 0) {
> -error_setg(errp, "sysconf(%s): value indefinite", name_str);
> -} else {
> -error_setg_errno(errp, errno, "sysconf(%s)", name_str);
> -}
> -}
> -return ret;
> -}
> -
>  /* Transfer online/offline status between @vcpu and the guest system.
>   *
>   * On input either @errp or *@errp must be NULL.
> @@ -2464,24 +2446,28 @@ static void transfer_vcpu(GuestLogicalProcessor
> *vcpu, bool sys2vcpu,
>
>  GuestLogicalProcessorList *qmp_guest_get_vcpus(Error **errp)
>  {
> -int64_t current;
>  GuestLogicalProcessorList *head, **link;
> -long sc_max;
> +const char *cpu_dir = "/sys/devices/system/cpu";
> +const gchar *line;
> +GDir *cpu_gdir = NULL;
>  Error *local_err = NULL;
>
> -current = 0;
>  head = NULL;
>  link = &head;
> -sc_max = SYSCONF_EXACT(_SC_NPROCESSORS_CONF, &local_err);
> +cpu_gdir = g_dir_open(cpu_dir, 0, NULL);
> +
> +if (cpu_gdir == NULL) {
> +error_setg_errno(errp, errno, "failed to list entries: %s",
> cpu_dir);
> +return NULL;
> +}
>
> -while (local_err == NULL && current < sc_max) {
> +while (local_err == NULL && (line = g_dir_read_name(cpu_gdir)) !=
> NULL) {
>  GuestLogicalProcessor *vcpu;
>  GuestLogicalProcessorList *entry;
> -int64_t id = current++;
> -char *path = g_strdup_printf("/sys/devices/system/cpu/cpu%"
> PRId64 "/",
> - id);
> -
> -if (g_file_test(path, G_FILE_TEST_EXISTS)) {
> +int64_t id;
> +if (sscanf(line, "cpu%ld", &id)) {
> +g_autofree char *path =
> g_strdup_printf("/sys/devices/system/cpu/"
> +"cpu%" PRId64 "/",
> id);
>  vcpu = g_malloc0(sizeof *vcpu);
>  vcpu->logical_id = id;
>  vcpu->has_can_offline = true; /* lolspeak ftw */
> @@ -2491,8 +2477,8 @@ GuestLogicalProcessorList *qmp_guest_get_vcpus(Error
> **errp)
>  *link = entry;
>  link = &entry->next;
>  }
> -g_free(path);
>  }
> +

Re: [PATCH] python 3.5 compatibility

2020-11-30 Thread Markus Armbruster
"Enrico Weigelt, metux IT consult"  writes:

> On 27.11.20 20:15, Peter Maydell wrote:
>
> Hi,
>
>> Could you say which "stable distros" you have in mind, and whether
>> they are covered by our "supported build platforms" policy
>> https://www.qemu.org/docs/master/system/build-platforms.html  ?
>
> I'm running on Devuan Ascii.

Which has oldstable status.  Good for running the old and stable
software packaged by it (such as QEMU 2.8), and old (and hopefully
stable) software of similar vintage.

Devuan ASCII appears to be a derivative of Debian 9 Stretch, which we no
longer support, because it's EOL.  docs/system/build-platforms.rst:

For distributions with long-lifetime releases, the project will aim to
support the most recent major version at all times. Support for the
previous major version will be dropped 2 years after the new major
version is released, or when it reaches "end of life". For the purposes
of identifying supported software versions, the project will look at
RHEL, Debian, Ubuntu LTS, and SLES distros. Other long-lifetime distros
will be assumed to ship similar software versions.

This policy balances cost and benefit of keeping bleeding-edge QEMU work
on a wide variety of hosts.  Years of balancing.

> And packaging python-3.6 just for a few pieces of one application (qemu)
> is far too much work.

Have you considered upgrading to stable?




[PATCH] qemu-nbd: Fix a memleak in nbd_client_thread()

2020-11-30 Thread Alex Chen
When qio_channel_socket_connect_sync() fails we should go to the
'out_socket' label to free 'sioc' instead of the 'out' label.
In addition, the 'out' label is now useless, so delete it.

Reported-by: Euler Robot 
Signed-off-by: Alex Chen 
---
 qemu-nbd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/qemu-nbd.c b/qemu-nbd.c
index 47587a709e..643b0777c0 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -275,7 +275,7 @@ static void *nbd_client_thread(void *arg)
 saddr,
 &local_error) < 0) {
 error_report_err(local_error);
-goto out;
+goto out_socket;
 }
 
 ret = nbd_receive_negotiate(NULL, QIO_CHANNEL(sioc),
@@ -325,7 +325,6 @@ out_fd:
 close(fd);
 out_socket:
 object_unref(OBJECT(sioc));
-out:
 g_free(info.name);
 kill(getpid(), SIGTERM);
 return (void *) EXIT_FAILURE;
-- 
2.19.1




[PATCH v4] qga: Correct loop count in qmp_guest_get_vcpus()

2020-11-30 Thread Lin Ma
The guest-get-vcpus returns incorrect vcpu info in case we hotunplug vcpus(not
the last one).
e.g.:
A VM has 4 VCPUs: cpu0 + 3 hotunpluggable online vcpus(cpu1, cpu2 and cpu3).
Hotunplug cpu2,  Now only cpu0, cpu1 and cpu3 are present & online.

./qmp-shell /tmp/qmp-monitor.sock
(QEMU) query-hotpluggable-cpus
{"return": [
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 3}, "vcpus-count": 1,
 "qom-path": "/machine/peripheral/cpu3", "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 2}, "vcpus-count": 1,
 "qom-path": "/machine/peripheral/cpu2", "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, "vcpus-count": 1,
 "qom-path": "/machine/peripheral/cpu1", "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 0}, "vcpus-count": 1,
 "qom-path": "/machine/unattached/device[0]", "type": "host-x86_64-cpu"}
]}

(QEMU) device_del id=cpu2
{"return": {}}

(QEMU) query-hotpluggable-cpus
{"return": [
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 3}, "vcpus-count": 1,
 "qom-path": "/machine/peripheral/cpu3", "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 2}, "vcpus-count": 1,
 "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, "vcpus-count": 1,
 "qom-path": "/machine/peripheral/cpu1", "type": "host-x86_64-cpu"},
{"props": {"core-id": 0, "thread-id": 0, "socket-id": 0}, "vcpus-count": 1,
 "qom-path": "/machine/unattached/device[0]", "type": "host-x86_64-cpu"}
]}

Before:
./qmp-shell -N /tmp/qmp-ga.sock
Welcome to the QMP low-level shell!
Connected
(QEMU) guest-get-vcpus
{"return": [
{"online": true, "can-offline": false, "logical-id": 0},
{"online": true, "can-offline": true, "logical-id": 1}]}

After:
./qmp-shell -N /tmp/qmp-ga.sock
Welcome to the QMP low-level shell!
Connected
(QEMU) guest-get-vcpus
{"return": [
{"online": true, "can-offline": false, "logical-id": 0},
{"online": true, "can-offline": true, "logical-id": 1},
{"online": true, "can-offline": true, "logical-id": 3}]}

Signed-off-by: Lin Ma 
---
 qga/commands-posix.c | 44 +++-
 1 file changed, 15 insertions(+), 29 deletions(-)

diff --git a/qga/commands-posix.c b/qga/commands-posix.c
index c089e38120..48dcdae239 100644
--- a/qga/commands-posix.c
+++ b/qga/commands-posix.c
@@ -2376,24 +2376,6 @@ error:
 return NULL;
 }
 
-#define SYSCONF_EXACT(name, errp) sysconf_exact((name), #name, (errp))
-
-static long sysconf_exact(int name, const char *name_str, Error **errp)
-{
-long ret;
-
-errno = 0;
-ret = sysconf(name);
-if (ret == -1) {
-if (errno == 0) {
-error_setg(errp, "sysconf(%s): value indefinite", name_str);
-} else {
-error_setg_errno(errp, errno, "sysconf(%s)", name_str);
-}
-}
-return ret;
-}
-
 /* Transfer online/offline status between @vcpu and the guest system.
  *
  * On input either @errp or *@errp must be NULL.
@@ -2464,24 +2446,28 @@ static void transfer_vcpu(GuestLogicalProcessor *vcpu, 
bool sys2vcpu,
 
 GuestLogicalProcessorList *qmp_guest_get_vcpus(Error **errp)
 {
-int64_t current;
 GuestLogicalProcessorList *head, **link;
-long sc_max;
+const char *cpu_dir = "/sys/devices/system/cpu";
+const gchar *line;
+GDir *cpu_gdir = NULL;
 Error *local_err = NULL;
 
-current = 0;
 head = NULL;
 link = &head;
-sc_max = SYSCONF_EXACT(_SC_NPROCESSORS_CONF, &local_err);
+cpu_gdir = g_dir_open(cpu_dir, 0, NULL);
+
+if (cpu_gdir == NULL) {
+error_setg_errno(errp, errno, "failed to list entries: %s", cpu_dir);
+return NULL;
+}
 
-while (local_err == NULL && current < sc_max) {
+while (local_err == NULL && (line = g_dir_read_name(cpu_gdir)) != NULL) {
 GuestLogicalProcessor *vcpu;
 GuestLogicalProcessorList *entry;
-int64_t id = current++;
-char *path = g_strdup_printf("/sys/devices/system/cpu/cpu%" PRId64 "/",
- id);
-
-if (g_file_test(path, G_FILE_TEST_EXISTS)) {
+int64_t id;
+if (sscanf(line, "cpu%ld", &id)) {
+g_autofree char *path = g_strdup_printf("/sys/devices/system/cpu/"
+"cpu%" PRId64 "/", id);
 vcpu = g_malloc0(sizeof *vcpu);
 vcpu->logical_id = id;
 vcpu->has_can_offline = true; /* lolspeak ftw */
@@ -2491,8 +2477,8 @@ GuestLogicalProcessorList *qmp_guest_get_vcpus(Error 
**errp)
 *link = entry;
 link = &entry->next;
 }
-g_free(path);
 }
+g_dir_close(cpu_gdir);
 
 if (local_err == NULL) {
 /* there's no guest with zero VCPUs */
-- 
2.26.0




Re: [PATCH 1/6] migration: Add multi-thread compress method

2020-11-30 Thread Zeyu Jin
On 2020/11/30 16:35, Markus Armbruster wrote:
> Zeyu Jin  writes:
> 
>> On 2020/11/27 17:48, Markus Armbruster wrote:
>>> Kevin, Max, suggest to skip right to Qcow2CompressionType.
>>>
>>> Zeyu Jin  writes:
>>>
 A multi-thread compress method parameter is added to hold the method we
 are going to use. By default the 'zlib' method is used to maintain the
 compatibility as before.

 Signed-off-by: Zeyu Jin 
 Signed-off-by: Ying Fang 
>>> [...]
 diff --git a/qapi/migration.json b/qapi/migration.json
 index 3c75820527..2ed6a55b92 100644
 --- a/qapi/migration.json
 +++ b/qapi/migration.json
 @@ -525,6 +525,19 @@
'data': [ 'none', 'zlib',
  { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
  
 +##
 +# @CompressMethod:
 +#
 +# An enumeration of multi-thread compression methods.
 +#
 +# @zlib: use zlib compression method.
 +#
 +# Since: 6.0
 +#
 +##
 +{ 'enum': 'CompressMethod',
 +  'data': [ 'zlib' ] }
 +
  ##
  # @BitmapMigrationBitmapAlias:
  #
 @@ -599,6 +612,9 @@
  #  compression, so set the decompress-threads to the 
 number about 1/4
  #  of compress-threads is adequate.
  #
 +# @compress-method: Set compression method to use in multi-thread 
 compression.
 +#   Defaults to zlib. (Since 6.0)
>>>
>>> We already have @multifd-compression.  Why do we need to control the two
>>> separately?  Can you give a use case for different settings?
>>>
>>
>> Generally, mulit-thread compression deals with the situation
>> where network bandwith is limited but cpu resource is adequate. Multifd
>> instead aims to situation where single fd cannot take full advantage of
>> network bandwith. So compression based on multifd cannot fully cover the
>> cases for multi-thread compression.
>>
>> For example, for migration with a bandwith limitation of 10M
>> bytes/second, single fd is enough for data delivery. This is the case
>> for multi-thread compression.
> 
> Let me rephrase my question.
> 
> According to query-migrate-parameters, we default to
> 
> "compress-level": 1
> "compress-threads": 8
> "compress-wait-thread": true
> "decompress-threads": 2
> "multifd-channels": 2
> "multifd-compression": "none"
> "multifd-zlib-level": 1
> "multifd-zstd-level": 1
> 
> Your patch adds
> 
> "compress-method": "zlib"
> 
> I have several basic questions I can't answer from the documentation:
> 
> 1. We appear to have two distinct sets of compression parameters:
> 
>* Traditional: compress-level, compress-threads,
>  compress-wait-thread, decompress-threads.
> 
>  These parameters all apply to the same compression.  Correct?
> 
>  What data is being compressed by it?
> 
>* Multi-fd: multifd-channels, multifd-compression,
>  multifd-zlib-level, multifd-std-level
> 
>  These parameters all apply to the same compression.  Correct?
> 
>  What data is being compressed by it?
> 
>* Why do we want *two*?  I understand why multi-fd is optional, but
>  why do we need the capability to compress differently there?  Use
>  case?
> 
>All of these questions predate your patch.  David, Juan?
>

I see. The problem is that the parameter sets seem to be redundant and
maybe there is an overlap between these two compression capabilities.

As you said, the questions predate my patch, so maybe we can have a
discussion here. What do you think, David, Juan?

> 2. Does compress-method belong to "traditional"?
>

Yes.
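
(For illustration, the "traditional" set would then be driven roughly like this:
compress-level/compress-threads/decompress-threads and the "compress" capability
already exist, while compress-method is the parameter proposed by this series,
with "zstd" added in patch 4.)

{"execute": "migrate-set-capabilities",
 "arguments": {"capabilities": [{"capability": "compress", "state": true}]}}
{"execute": "migrate-set-parameters",
 "arguments": {"compress-level": 1, "compress-threads": 8,
               "decompress-threads": 2, "compress-method": "zstd"}}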

>>> If we do want two parameters: the names @compress-method and
>>> @multifd-compression are inconsistent.  According to your comment,
>>> @compress-method applies only to multi-thread compression.  That leads
>>> me to suggest renaming it to @multi-thread-compression.
>>>
>>
>> For the names, my original idea is to make 'CompressMethod' consistent
>> with other multi-thread compression parameters, like 'compress-threads'
>> and 'compress-level'. There is truly some inconsistency here, between
>> multifd-compression params and old multi-thread compression params.
> 
> I see.
> 
>> For now, I agree with you that 'multi-thread-compression' is better. It
>> is more specific and clear. I will rename the params in next version.
>> Thanks for the suggestion.
> 
> Please wait until we've sorted out the documentation mess.
> 
>>> After PATCH 4, CompressMethod is almost the same as MultiFDCompression:
>>>
>>>{ 'enum': 'CompressMethod',
>>>  'data': [ 'zlib' ] }
>>>  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
>>>
>>>{ 'enum': 'MultiFDCompression',
>>>  'data': [ 'none', 'zlib',
>>>{ 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
>>>
>>> The difference is member 'none'.  Why does compression 'none' make sense
>>> for multi-fd, but not for multi-thread?
>>>
>>
>> When you set 'none'in multi-

Re: [PATCH v2 4/6] migration: Add zstd support in multi-thread compression

2020-11-30 Thread Zeyu Jin
On 2020/12/1 0:43, Eric Blake wrote:
> On 11/27/20 3:32 AM, Zeyu Jin wrote:
>> This patch enables zstd option in multi-thread compression.
>>
>> Signed-off-by: Zeyu Jin 
>> Signed-off-by: Ying Fang 
>> ---
> 
>> +++ b/qapi/migration.json
>> @@ -536,7 +536,7 @@
>>  #
>>  ##
>>  { 'enum': 'CompressMethod',
>> -  'data': [ 'zlib' ] }
>> +  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
> 
> Missing documentation of the new value, including a '(since 6.0)' marker.
> 

Yes, will fix it.



Re: [PATCH v2 0/6] migration: Multi-thread compression method support

2020-11-30 Thread Zeyu Jin
On 2020/12/1 0:42, Eric Blake wrote:
> On 11/27/20 3:36 AM, Zeyu Jin wrote:
> 
> Meta-comment: you appear to be having problems threading your series;
> I've now seen three separate cover letters (RFC v1, v2 with no subject,
> v2 with subject) and two series where each patch was a separate thread.
> It is difficult to follow which messages are related when reading in a
> mail client that sorts by most-recently-active thread first.  You may
> want to investigate why your threading is not working, although I'd wait
> to send v3 until you have actual changes to incorporate.
> 

Thank you for noticing that. It's my mistake when sending patches.
Everything will be fine in v3.

>> Currently we have both multi-thread compression and multifd to optimize
>> live migration in Qemu. Mulit-thread compression deals with the situation
>> where network bandwith is limited but cpu resource adequate. Multifd instead
> 
> Not that typos in the cover letter matter, but this should be 'bandwidth'
>

Yes, I will fix that.

>> aims to take full advantage of network bandwith. Moreover it supports both
>> zlib and zstd compression on each channel.
>>
>> In this patch series, we did some code refactoring on multi-thread 
>> compression
>> live migration and bring zstd compression method support for it.
>>
>> Below is the test result of multi-thread compression live migration
>> with different compress methods. Test result shows that zstd outperforms
>> zlib by about 70%.
>>
> 




Re: [PATCH] hw/net/dp8393x: fix integer underflow in dp8393x_do_transmit_packets()

2020-11-30 Thread Jason Wang



On 2020/11/30 8:11 PM, Mauro Matteo Cascella wrote:

On Mon, Nov 30, 2020 at 11:44 AM Philippe Mathieu-Daudé  wrote:

+Laurent/Finn

On 11/24/20 10:24 AM, Mauro Matteo Cascella wrote:

An integer underflow could occur during packet transmission due to 'tx_len' not
being updated if SONIC_TFC register is set to zero. Check for negative 'tx_len'
when removing existing FCS.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1899722
Signed-off-by: Mauro Matteo Cascella 
Reported-by: Gaoning Pan 
---
  hw/net/dp8393x.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 674b04b354..205c0decc5 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -495,6 +495,10 @@ static void dp8393x_do_transmit_packets(dp8393xState *s)
  } else {
  /* Remove existing FCS */
  tx_len -= 4;
+if (tx_len < 0) {
+SONIC_ERROR("tx_len is %d\n", tx_len);
+break;
+}
  }

  if (s->regs[SONIC_RCR] & (SONIC_RCR_LB1 | SONIC_RCR_LB0)) {


Doesn't it make more sense to check 'tx_len >= 4'
and skip tx/rx when 'tx_len == 0'?


Yes, it makes sense. I thought that skipping tx/rx in case of a null
'tx_len' could lead to potential inconsistencies when writing the
status or reading the footer of the packet, but I'm not really sure. I
can send a new version of the patch if needed; otherwise feel free to
apply your changes. Thank you.



I think we can go with this patch first and tweak on top, considering it's
near the release. So:


Acked-by: Jason Wang 

Peter, do you want to merge this patch?

Thanks





-- >8 --
@@ -488,25 +488,29 @@ static void
dp8393x_do_transmit_packets(dp8393xState *s)
  }
  }

-/* Handle Ethernet checksum */
-if (!(s->regs[SONIC_TCR] & SONIC_TCR_CRCI)) {
-/* Don't append FCS there, to look like slirp packets
- * which don't have one */
-} else {
-/* Remove existing FCS */
-tx_len -= 4;
+if (tx_len >= 4) {
+/* Handle Ethernet checksum */
+if (!(s->regs[SONIC_TCR] & SONIC_TCR_CRCI)) {
+/* Don't append FCS there, to look like slirp packets
+ * which don't have one */
+} else {
+/* Remove existing FCS */
+tx_len -= 4;
+}
  }

-if (s->regs[SONIC_RCR] & (SONIC_RCR_LB1 | SONIC_RCR_LB0)) {
-/* Loopback */
-s->regs[SONIC_TCR] |= SONIC_TCR_CRSL;
-if (nc->info->can_receive(nc)) {
-s->loopback_packet = 1;
-nc->info->receive(nc, s->tx_buffer, tx_len);
+if (tx_len > 0) {
+if (s->regs[SONIC_RCR] & (SONIC_RCR_LB1 | SONIC_RCR_LB0)) {
+/* Loopback */
+s->regs[SONIC_TCR] |= SONIC_TCR_CRSL;
+if (nc->info->can_receive(nc)) {
+s->loopback_packet = 1;
+nc->info->receive(nc, s->tx_buffer, tx_len);
+}
+} else {
+/* Transmit packet */
+qemu_send_packet(nc, s->tx_buffer, tx_len);
  }
-} else {
-/* Transmit packet */
-qemu_send_packet(nc, s->tx_buffer, tx_len);
  }
  s->regs[SONIC_TCR] |= SONIC_TCR_PTX;

---


Regards,





Re: [PATCH v2] net/e1000e_core: adjust count if RDH exceeds RDT in e1000e_ring_advance()

2020-11-30 Thread Jason Wang



On 2020/11/30 10:12 PM, Mauro Matteo Cascella wrote:

On Mon, Nov 30, 2020 at 3:58 AM Jason Wang  wrote:


On 2020/11/27 10:49 PM, Mauro Matteo Cascella wrote:

On Fri, Nov 27, 2020 at 6:21 AM Jason Wang  wrote:

On 2020/11/24 5:30 AM, Mauro Matteo Cascella wrote:

On Thu, Nov 19, 2020 at 6:57 AM Jason Wang  wrote:

On 2020/11/18 4:53 PM, Mauro Matteo Cascella wrote:

On Wed, Nov 18, 2020 at 4:56 AM Jason Wang  wrote:

On 2020/11/13 6:31 PM, Mauro Matteo Cascella wrote:

The e1000e_write_packet_to_guest() function iterates over a set of
receive descriptors by advancing rx descriptor head register (RDH) from
its initial value to rx descriptor tail register (RDT). The check in
e1000e_ring_empty() is responsible for detecting whether RDH has reached
RDT, terminating the loop if that's the case. Additional checks have
been added in the past to deal with bogus values submitted by the guest
to prevent possible infinite loop. This is done by "wrapping around" RDH
at some point and detecting whether it assumes the original value during
the loop.

However, when e1000e is configured to use the packet split feature, RDH is
incremented by two instead of one, as the packet split descriptors are
32 bytes while regular descriptors are 16 bytes. A malicious or buggy
guest may set RDT to an odd value and transmit only null RX descriptors.
This corner case would prevent RDH from ever matching RDT, leading to an
infinite loop. This patch adds a check in e1000e_ring_advance() to make sure
RDH does not exceed RDT in a single incremental step, adjusting the count
value accordingly.

Can this patch solve this issue in another way?

https://patchew.org/QEMU/2020130636.2208620-1-ppan...@redhat.com/

Thanks


Yes, it does work nicely. Still, I think this patch is useful to avoid
possible inconsistent state in e1000e_ring_advance() when count > 1.

So if RDT is odd, it looks to me like the following code in
e1000e_write_packet_to_guest() needs to be fixed as well.


base = e1000e_ring_head_descr(core, rxi);

pci_dma_read(d, base, &desc, core->rx_desc_len);

Otherwise e1000e may try to read out of descriptor ring.

Sorry, I'm not sure I understand what you mean. Isn't the base address
computed from RDH? How can e1000e read out of the descriptor ring if
RDT is odd?


Thanks


Sorry, I meant RDH actually. When a packet split descriptor is used, it
doesn't check whether DH exceeds DLEN?


When the packet split feature is used (i.e., count > 1) this patch
basically sets RDH=RDT in case the increment would exceed RDT.
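
(Roughly, as a sketch of that clamping idea only; this is not the actual patch
and the function signature is simplified.)

/* advance the head by 'count' descriptors, but never step past the tail */
static inline void ring_advance_clamped(uint32_t *head, uint32_t tail,
                                        uint32_t num_descs, uint32_t count)
{
    if (*head < tail && *head + count > tail) {
        count = tail - *head;   /* e.g. RDH would otherwise jump over an odd RDT */
    }
    *head = (*head + count) % num_descs;
}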


Can software set RDH to an odd value? If not, I think we are probably fine.

Thanks


Honestly I don't know the answer to your question; my guess is that it
may be possible for a malicious/bogus guest to set RDH the same way as
RDT.

Thank you,
--
Mauro Matteo Cascella
Red Hat Product Security
PGP-Key ID: BB3410B0




Then I think we should fix that.

Thanks




RE: [EXTERNAL] Re: [PATCH] WHPX: support for the kernel-irqchip on/off

2020-11-30 Thread Sunil Muthuswamy

> It's close enough to the release that you can resend (or if no rebasing
> is needed, just tell me).
> 
> Paolo

Thank you. No rebasing needed; the same patch applies on the current master.



Re: [RFC PATCH-for-5.2] gitlab-ci: Do not automatically run Avocado integration tests anymore

2020-11-30 Thread Cleber Rosa
On Fri, Nov 27, 2020 at 06:41:10PM +0100, Philippe Mathieu-Daudé wrote:
> We lately realized that the Avocado framework was not designed
> to be regularly run on CI environments. Therefore, as of 5.2

Hi Phil,

First of all, let me say that I understand your overall goal, and
although I don't agree with the strategy, I believe we're in agreement
wrt the destination.

The main issue that you seem to address here is the fact that some CI
tests may fail more often than others, which will lead to jobs that
will fail more than others, which will ultimately taint the overall CI
status.  Does that sound like an appropriate overall grasp of your
motivation?

Assuming I got it right, let me say that having an "always green CI"
is a noble goal, but it can also be extremely frustrating if one
doesn't apply some safeguarding measures.  More on that later.

As you certainly know, I'm also interested in understanding the "not
designed to be regularly run on CI environments" part.  The best
correlation I could make was to link that to these two points you
raised elsewhere:

 1) Failing randomly
 2) Images hardcoded in tests are being removed from public servers

With regards to point #1, this is probably unavoidable as a whole.
I've had some experience running dedicated test jobs for close to a
decade, and maybe the only way to get close to avoid random failures
on integration tests, is to run close to nothing in those jobs.
Running "/bin/true" has a very low chance of failing randomly.

In my own experience, the only way to address point #1, is to
babysit jobs.  That means:

 a) assume they will produce some messy stuff at no particular time
 b) act as quickly and effectively as possible
 c) be compassionate, that is, waive the unavoidable mess incidents

Building on the previous analogy, if you decide to not have a baby,
but a plant, you'll probably need to do a lot less of those.
If you get a pet, then some more.  Now a human baby will probably
(I guess) require a whole lot more of those.  And as those age
and reach maturity, they'll (hopefully) require less babysitting,
but they can still mess up at any given time.

Analogies and jokes aside, the urgent *action item* here has been
discussed both publicly, and internally at Red Hat. It consists of
having an "always on" maintainer for those jobs.  In the specific case
of the "Acceptance" jobs, Willian Rampazzo has volunteered to,
initially, be this person.  He'll manage all related information
on the jobs' issues.

We're still discussing the tools to use to give the visibility that
the QEMU projects needs.  I personally would be happy enough to start
with a publicly accessible spreadsheet that builds upon the
information produced by GitLab.  A proper application is also being
considered.  A sample of the requirements include:

   I) waive failures (say a job and tests failed because of a
  network outage)
  II) build trends (show how stable all executions of test "foo"
  were during the last: week, month, eternity, etc).
 III) keep a list of the known issues and relate them to waivers
  and currently skipped tests

Getting back to point #2, I have two main points about it.  First is
that I've had a lot of experience with tests having copies of images,
both on local filesystems and on nearby NFS servers.  Local
filesystems would fail at provisioning/sync time. And NFS based ones
would fail every now and then for various network or NFS server
issues.  It goes back to my point about not being able to escape the
babysitting ever.

Second is that this is somehow related to features and improvements
that could/should be added to whatever supporting code (aka
framework) we use.  Right now, we have some specific features
scheduled to be introduced in Avocado 84.0 (due in ~2 weeks).  They
can be seen with the "customer:QEMU" label:

  
https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+is%3Aopen+label%3Acustomer%3AQEMU

A number of other features have already landed on previous versions,
but I was unable to send patches in time for 5.2, so my expectation is
to bundle more of them and bump Avocado to 84.0 at once (instead of
82.0 or 83.0).

> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> current users, it is possible to keep the current behavior by
> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> (see [*]).
> From now on, using these jobs (or adding new tests to them)
> is strongly discouraged.
>

These jobs run `make check-acceptance`, which will pick up new tests.
So how do you suggest *not adding new tests* to those jobs?  Are
you suggesting that no new acceptance test be added?

> Tests based on Avocado will be ported to new job schemes during
> the next releases, with better documentation and templates.
>

This is a very good approach to move forward.  For what it's worth,
Avocado has invested in an API specifically for that, the Job API.
The goal is to have smarter jobs for different purposes that behave
ap

[PATCH 1/1] target-riscv: support QMP dump-guest-memory

2020-11-30 Thread Yifei Jiang
Add the support needed for creating prstatus elf notes. For now the elf notes
only contain user_regs. This allows us to use QMP dump-guest-memory.

Signed-off-by: Yifei Jiang 
Signed-off-by: Mingwang Li 
---
 target/riscv/arch_dump.c | 189 +++
 target/riscv/cpu.c   |   2 +
 target/riscv/cpu.h   |   4 +
 target/riscv/cpu_bits.h  |   1 +
 target/riscv/meson.build |   1 +
 5 files changed, 197 insertions(+)
 create mode 100644 target/riscv/arch_dump.c

diff --git a/target/riscv/arch_dump.c b/target/riscv/arch_dump.c
new file mode 100644
index 00..b89ddf18c7
--- /dev/null
+++ b/target/riscv/arch_dump.c
@@ -0,0 +1,189 @@
+/* Support for writing ELF notes for RISC-V architectures
+ *
+ * Copyright (C) 2020 Huawei Technologies Co., Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2 or later, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see .
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include "elf.h"
+#include "sysemu/dump.h"
+
+/* struct user_regs_struct from arch/riscv/include/uapi/asm/ptrace.h */
+struct riscv_user_regs {
+target_ulong pc;
+target_ulong regs[31];
+} QEMU_PACKED;
+
+/* struct elf_prstatus from include/uapi/linux/elfcore.h */
+struct riscv64_elf_prstatus {
+char pad1[32]; /* 32 == offsetof(struct elf_prstatus, pr_pid) */
+uint32_t pr_pid;
+char pad2[76]; /* 76 == offsetof(struct elf_prstatus, pr_reg) -
+offsetof(struct elf_prstatus, pr_ppid) */
+struct riscv_user_regs pr_reg;
+char pad3[8];
+} QEMU_PACKED;
+
+struct riscv64_note {
+Elf64_Nhdr hdr;
+char name[8]; /* align_up(sizeof("CORE"), 4) */
+struct riscv64_elf_prstatus prstatus;
+} QEMU_PACKED;
+
+#define RISCV64_NOTE_HEADER_SIZE offsetof(struct riscv64_note, prstatus)
+#define RISCV64_PRSTATUS_NOTE_SIZE \
+(RISCV64_NOTE_HEADER_SIZE + sizeof(struct riscv64_elf_prstatus))
+
+static void riscv64_note_init(struct riscv64_note *note, DumpState *s,
+  const char *name, Elf64_Word namesz,
+  Elf64_Word type, Elf64_Word descsz)
+{
+memset(note, 0, sizeof(*note));
+
+note->hdr.n_namesz = cpu_to_dump32(s, namesz);
+note->hdr.n_descsz = cpu_to_dump32(s, descsz);
+note->hdr.n_type = cpu_to_dump32(s, type);
+
+memcpy(note->name, name, namesz);
+}
+
+int riscv_cpu_write_elf64_note(WriteCoreDumpFunction f, CPUState *cs,
+   int cpuid, void *opaque)
+{
+struct riscv64_note note;
+RISCVCPU *cpu = RISCV_CPU(cs);
+CPURISCVState *env = &cpu->env;
+DumpState *s = opaque;
+int ret, i = 0;
+const char name[] = "CORE";
+
+riscv64_note_init(¬e, s, name, sizeof(name),
+  NT_PRSTATUS, sizeof(note.prstatus));
+
+note.prstatus.pr_pid = cpu_to_dump32(s, cpuid);
+
+note.prstatus.pr_reg.pc = cpu_to_dump64(s, env->pc);
+
+for (i = 0; i < 31; i++) {
+note.prstatus.pr_reg.regs[i] = cpu_to_dump64(s, env->gpr[i + 1]);
+}
+
+ret = f(¬e, RISCV64_PRSTATUS_NOTE_SIZE, s);
+if (ret < 0) {
+return -1;
+}
+
+return ret;
+}
+
+struct riscv32_elf_prstatus {
+char pad1[24]; /* 24 == offsetof(struct elf_prstatus, pr_pid) */
+uint32_t pr_pid;
+char pad2[44]; /* 44 == offsetof(struct elf_prstatus, pr_reg) -
+offsetof(struct elf_prstatus, pr_ppid) */
+struct riscv_user_regs pr_reg;
+char pad3[4];
+} QEMU_PACKED;
+
+struct riscv32_note {
+Elf32_Nhdr hdr;
+char name[8]; /* align_up(sizeof("CORE"), 4) */
+struct riscv32_elf_prstatus prstatus;
+} QEMU_PACKED;
+
+#define RISCV32_NOTE_HEADER_SIZE offsetof(struct riscv32_note, prstatus)
+#define RISCV32_PRSTATUS_NOTE_SIZE \
+(RISCV32_NOTE_HEADER_SIZE + sizeof(struct riscv32_elf_prstatus))
+
+static void riscv32_note_init(struct riscv32_note *note, DumpState *s,
+  const char *name, Elf32_Word namesz,
+  Elf32_Word type, Elf32_Word descsz)
+{
+memset(note, 0, sizeof(*note));
+
+note->hdr.n_namesz = cpu_to_dump32(s, namesz);
+note->hdr.n_descsz = cpu_to_dump32(s, descsz);
+note->hdr.n_type = cpu_to_dump32(s, type);
+
+memcpy(note->name, name, namesz);
+}
+
+int riscv_cpu_write_elf32_note(WriteCoreDumpFunction f, CPUState *cs,
+   int cpuid, void *opaque)
+{
+struct riscv32_note note;
+RISCVCPU *cpu = RISCV_CPU(cs);
+CPURISCVState *env = &c

[PATCH 0/1] target-riscv: support QMP dump-guest-memory

2020-11-30 Thread Yifei Jiang
Hi,

This patch supports QMP dump-guest-memory in RISC-V. We tested this feature by
using the following command: dump-guest-memory guest.memory.

Then we used the gdb tool to debug guest.memory: gdb vmlinux guest.memory.
The test result is as follows:
1. info registers
ra 0xffe0008cb83c   0xffe0008cb83c 
<_raw_spin_lock_irqsave+28>
sp 0xffe0012c3f70   0xffe0012c3f70
gp 0xffe0010d6048   0xffe0010d6048 
<__compound_literal.109>
tp 0xffe00127f200   0xffe00127f200
t0 0x1f8504
t1 0x0  0
t2 0x3fd9bf5c3c 274236136508
fp 0xffe0012c3f80   0xffe0012c3f80
s1 0xffe0010d7228   -137421295064
a0 0x1  1
a1 0xffe00127f200   -137419558400
a2 0xffe00110a0b8   -137421086536
a3 0x3af32000   989011968
a4 0x35b2   13746
a5 0xffe03af6b880   -136449705856
a6 0x1c5d09af00 12182000
a7 0x54494d45   1414090053
s2 0x1  1
s3 0xffe0010d73f0   -137421294608
s4 0x0  0
s5 0x0  0
s6 0x0  0
s7 0xc  12
s8 0x2000   8192
s9 0x80015708   2147571464
s100x0  0
s110x0  0
t3 0x2257d7136011377
t4 0x0  0
t5 0x3ab003061538352
t6 0x3fffefb3a0 274876838816
pc 0xffe000201478   0xffe000201478 

2. x/1024ux 0x8000
0x8000: 0x00050433  0x000584b3  0x00060933  0x62c000ef
0x8010: 0x00050833  0x00040533  0x000485b3  0x00090633
0x8020: 0x046358fd  0x1d630118  0x08171305  0x0813
0x8030: 0x48854868  0x0118282f  0x12081463  0x0297
0x8040: 0x48428293  0x0317  0xfbc30313  0x0062b023
...

Yifei Jiang (1):
  target-riscv: support QMP dump-guest-memory

 target/riscv/arch_dump.c | 189 +++
 target/riscv/cpu.c   |   2 +
 target/riscv/cpu.h   |   4 +
 target/riscv/cpu_bits.h  |   1 +
 target/riscv/meson.build |   1 +
 5 files changed, 197 insertions(+)
 create mode 100644 target/riscv/arch_dump.c

-- 
2.19.1




Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Frank Yang
On Mon, Nov 30, 2020 at 2:10 PM Peter Maydell 
wrote:

> On Mon, 30 Nov 2020 at 20:56, Frank Yang  wrote:
> > We'd actually like to contribute upstream too :) We do want to maintain
> > our own downstream though; Android Emulator codebase needs to work
> > solidly on macos and windows which has made keeping up with upstream
> difficult
>
> One of the main reasons why OSX and Windows support upstream is
> not so great is because very few people are helping to develop,
> test and support it upstream. The way to fix that IMHO is for more
> people who do care about those platforms to actively engage
> with us upstream to help in making those platforms move closer to
> being first class citizens. If you stay on a downstream fork
> forever then I don't think you'll ever see things improve.
>
> thanks
> -- PMM
>

That's a really good point. I'll definitely be more active about sending
comments upstream in the future :)

Frank


Re: [PATCH] target/riscv: Fix definition of MSTATUS_TW and MSTATUS_TSR

2020-11-30 Thread Alistair Francis
On Mon, Nov 30, 2020 at 9:07 AM Alex Richardson
 wrote:
>
> The TW and TSR fields should be bits 21 and 22 and not 30/29.
> This was found while comparing QEMU behaviour against the sail formal
> model (https://github.com/rems-project/sail-riscv/).
>
> Signed-off-by: Alex Richardson 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu_bits.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
> index 24b24c69c5..92147332c6 100644
> --- a/target/riscv/cpu_bits.h
> +++ b/target/riscv/cpu_bits.h
> @@ -379,8 +379,8 @@
>  #define MSTATUS_MXR 0x00080000
>  #define MSTATUS_VM  0x1F000000 /* until: priv-1.9.1 */
>  #define MSTATUS_TVM 0x00100000 /* since: priv-1.10 */
> -#define MSTATUS_TW  0x20000000 /* since: priv-1.10 */
> -#define MSTATUS_TSR 0x40000000 /* since: priv-1.10 */
> +#define MSTATUS_TW  0x00200000 /* since: priv-1.10 */
> +#define MSTATUS_TSR 0x00400000 /* since: priv-1.10 */
>  #define MSTATUS_GVA 0x4000000000ULL
>  #define MSTATUS_MPV 0x8000000000ULL
>
> --
> 2.29.2
>
>
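
As a quick standalone cross-check (not part of the patch), the corrected
masks match the bit positions named in the commit message; in C11:

/* Sanity check only: TW and TSR are mstatus bits 21 and 22. */
_Static_assert((1u << 21) == 0x00200000, "MSTATUS_TW  is mstatus bit 21");
_Static_assert((1u << 22) == 0x00400000, "MSTATUS_TSR is mstatus bit 22");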



Re: [PATCH] target/riscv: Fix the bug of HLVX/HLV/HSV

2020-11-30 Thread Alistair Francis
On Sun, Nov 29, 2020 at 5:37 PM Yifei Jiang  wrote:
>
> We found that the hypervisor virtual-machine load and store instructions,
> including HLVX/HLV/HSV, couldn't access guest userspace memory.
>
> In the riscv-privileged spec, HLVX/HLV/HSV is defined as follow:
> "As usual when V=1, two-stage address translation is applied, and
> the HS-level sstatus.SUM is ignored."
>
> But get_physical_address() doesn't ignore sstatus.SUM, when HLVX/HLV/HSV
> accesses guest userspace memory. So this patch fixes it.
>
> Signed-off-by: Yifei Jiang 
> Signed-off-by: Yipeng Yin 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu_helper.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index a2787b1d48..7274f971a4 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -367,7 +367,8 @@ static int get_physical_address(CPURISCVState *env, 
> hwaddr *physical,
>  vm = get_field(env->hgatp, HGATP_MODE);
>  widened = 2;
>  }
> -sum = get_field(env->mstatus, MSTATUS_SUM);
> +/* status.SUM will be ignored if execute on background */
> +sum = get_field(env->mstatus, MSTATUS_SUM) || use_background;
>  switch (vm) {
>  case VM_1_10_SV32:
>levels = 2; ptidxbits = 10; ptesize = 4; break;
> --
> 2.19.1
>
>



Re: [PATCH-for-5.2? 1/4] hw/arm/fsl-imx: Add SD bus QOM alias on the SoC

2020-11-30 Thread Alistair Francis
On Tue, Nov 24, 2020 at 1:54 AM Philippe Mathieu-Daudé  wrote:
>
> To be able to select a particular SD bus from the command
> line, add a QOM alias on the SoC (using a unique name).
>
> Buglink: https://bugs.launchpad.net/qemu/+bug/1895895
> Reported-by: David Aghaian 
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  hw/arm/fsl-imx25.c | 6 ++
>  hw/arm/fsl-imx6.c  | 6 ++
>  2 files changed, 12 insertions(+)
>
> diff --git a/hw/arm/fsl-imx25.c b/hw/arm/fsl-imx25.c
> index 08a98f828fc..6e66ae742af 100644
> --- a/hw/arm/fsl-imx25.c
> +++ b/hw/arm/fsl-imx25.c
> @@ -239,6 +239,7 @@ static void fsl_imx25_realize(DeviceState *dev, Error 
> **errp)
>  { FSL_IMX25_ESDHC1_ADDR, FSL_IMX25_ESDHC1_IRQ },
>  { FSL_IMX25_ESDHC2_ADDR, FSL_IMX25_ESDHC2_IRQ },
>  };
> +g_autofree char *bus_name = NULL;
>
>  object_property_set_uint(OBJECT(&s->esdhc[i]), "sd-spec-version", 2,
>   &error_abort);
> @@ -253,6 +254,11 @@ static void fsl_imx25_realize(DeviceState *dev, Error 
> **errp)
>  sysbus_connect_irq(SYS_BUS_DEVICE(&s->esdhc[i]), 0,
> qdev_get_gpio_in(DEVICE(&s->avic),
>  esdhc_table[i].irq));
> +
> +/* Alias controller SD bus to the SoC itself */
> +bus_name = g_strdup_printf("sd-bus%d", i);
> +object_property_add_alias(OBJECT(s), bus_name,
> +  OBJECT(&s->esdhc[i]), "sd-bus");
>  }
>
>  /* USB */
> diff --git a/hw/arm/fsl-imx6.c b/hw/arm/fsl-imx6.c
> index 00dafe3f62d..144bcdcaf6c 100644
> --- a/hw/arm/fsl-imx6.c
> +++ b/hw/arm/fsl-imx6.c
> @@ -314,6 +314,7 @@ static void fsl_imx6_realize(DeviceState *dev, Error 
> **errp)
>  { FSL_IMX6_uSDHC3_ADDR, FSL_IMX6_uSDHC3_IRQ },
>  { FSL_IMX6_uSDHC4_ADDR, FSL_IMX6_uSDHC4_IRQ },
>  };
> +g_autofree char *bus_name = NULL;
>
>  /* UHS-I SDIO3.0 SDR104 1.8V ADMA */
>  object_property_set_uint(OBJECT(&s->esdhc[i]), "sd-spec-version", 3,
> @@ -329,6 +330,11 @@ static void fsl_imx6_realize(DeviceState *dev, Error 
> **errp)
>  sysbus_connect_irq(SYS_BUS_DEVICE(&s->esdhc[i]), 0,
> qdev_get_gpio_in(DEVICE(&s->a9mpcore),
>  esdhc_table[i].irq));
> +
> +/* Alias controller SD bus to the SoC itself */
> +bus_name = g_strdup_printf("sd-bus%d", i);
> +object_property_add_alias(OBJECT(s), bus_name,
> +  OBJECT(&s->esdhc[i]), "sd-bus");
>  }
>
>  /* USB */
> --
> 2.26.2
>
>



Re: [PATCH-for-5.2? 4/4] hw/arm/xilinx_zynq: Add SD bus QOM alias on the machine

2020-11-30 Thread Alistair Francis
On Tue, Nov 24, 2020 at 1:50 AM Philippe Mathieu-Daudé  wrote:
>
> To be able to select a particular SD bus from the command
> line, add a QOM alias on the machine (using a unique name).
>
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  hw/arm/xilinx_zynq.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/hw/arm/xilinx_zynq.c b/hw/arm/xilinx_zynq.c
> index b72772bc824..6a4a1de88cf 100644
> --- a/hw/arm/xilinx_zynq.c
> +++ b/hw/arm/xilinx_zynq.c
> @@ -286,6 +286,7 @@ static void zynq_init(MachineState *machine)
>  DriveInfo *di;
>  BlockBackend *blk;
>  DeviceState *carddev;
> +g_autofree char *bus_name = NULL;
>
>  /* Compatible with:
>   * - SD Host Controller Specification Version 2.0 Part A2
> @@ -299,6 +300,11 @@ static void zynq_init(MachineState *machine)
>  sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, hci_addr);
>  sysbus_connect_irq(SYS_BUS_DEVICE(dev), 0, pic[hci_irq - 
> IRQ_OFFSET]);
>
> +/* Alias controller SD bus to the machine itself */
> +bus_name = g_strdup_printf("sd-bus%d", n);
> +object_property_add_alias(OBJECT(machine), bus_name,
> +  OBJECT(dev), "sd-bus");
> +
>  di = drive_get_next(IF_SD);
>  blk = di ? blk_by_legacy_dinfo(di) : NULL;
>  carddev = qdev_new(TYPE_SD_CARD);
> --
> 2.26.2
>
>



Re: [PATCH-for-5.2? 3/4] hw/arm/xlnx-versal: Add SD bus QOM alias on the SoC

2020-11-30 Thread Alistair Francis
On Tue, Nov 24, 2020 at 1:51 AM Philippe Mathieu-Daudé  wrote:
>
> To be able to select a particular SD bus from the command
> line, add a QOM alias on the SoC (using a unique name).
>
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  hw/arm/xlnx-versal.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
> index 12ba6c4ebae..da3ee24a5b9 100644
> --- a/hw/arm/xlnx-versal.c
> +++ b/hw/arm/xlnx-versal.c
> @@ -210,6 +210,7 @@ static void versal_create_sds(Versal *s, qemu_irq *pic)
>  int i;
>
>  for (i = 0; i < ARRAY_SIZE(s->pmc.iou.sd); i++) {
> +g_autofree char *bus_name = NULL;
>  DeviceState *dev;
>  MemoryRegion *mr;
>
> @@ -224,6 +225,10 @@ static void versal_create_sds(Versal *s, qemu_irq *pic)
>  object_property_set_uint(OBJECT(dev), "uhs", UHS_I, &error_fatal);
>  sysbus_realize(SYS_BUS_DEVICE(dev), &error_fatal);
>
> +/* Alias controller SD bus to the SoC itself */
> +bus_name = g_strdup_printf("sd-bus%d", i);
> +object_property_add_alias(OBJECT(s), bus_name, OBJECT(dev), 
> "sd-bus");
> +
>  mr = sysbus_mmio_get_region(SYS_BUS_DEVICE(dev), 0);
>  memory_region_add_subregion(&s->mr_ps,
>  MM_PMC_SD0 + i * MM_PMC_SD0_SIZE, mr);
> --
> 2.26.2
>
>



Re: [PATCH 2/2] hw/ssi: imx_spi: Disable chip selects in imx_spi_reset()

2020-11-30 Thread Alistair Francis
On Sun, Nov 29, 2020 at 8:05 PM Bin Meng  wrote:
>
> From: Xuzhou Cheng 
>
> When the ECSPI_CONREG register is written to disable the SPI controller,
> imx_spi_reset() is called to reset the controller, during which the CS
> lines should be disabled; otherwise the state machine of any device
> (e.g. a SPI flash) connected to the SPI master is stuck in its last
> state and responds incorrectly to any follow-up commands.
>
> Fixes c906a3a01582: ("i.MX: Add the Freescale SPI Controller")
> Signed-off-by: Xuzhou Cheng 
> Signed-off-by: Bin Meng 

Acked-by: Alistair Francis 

Alistair

>
> ---
>
>  hw/ssi/imx_spi.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/ssi/imx_spi.c b/hw/ssi/imx_spi.c
> index e605049..85c172e 100644
> --- a/hw/ssi/imx_spi.c
> +++ b/hw/ssi/imx_spi.c
> @@ -231,6 +231,7 @@ static void imx_spi_flush_txfifo(IMXSPIState *s)
>  static void imx_spi_reset(DeviceState *dev)
>  {
>  IMXSPIState *s = IMX_SPI(dev);
> +int i;
>
>  DPRINTF("\n");
>
> @@ -243,6 +244,10 @@ static void imx_spi_reset(DeviceState *dev)
>
>  imx_spi_update_irq(s);
>
> +for (i = 0; i < ECSPI_NUM_CS; i++) {
> +qemu_set_irq(s->cs_lines[i], 1);
> +}
> +
>  s->burst_length = 0;
>  }
>
> --
> 2.7.4
>
>



Re: [PATCH 1/2] hw/ssi: imx_spi: Use a macro for number of chip selects supported

2020-11-30 Thread Alistair Francis
On Sun, Nov 29, 2020 at 8:06 PM Bin Meng  wrote:
>
> From: Bin Meng 
>
> Avoid using a magic number (4) everywhere for the number of chip
> selects supported.
>
> Signed-off-by: Bin Meng 

Reviewed-by: Alistair Francis 

Alistair

> ---
>
>  hw/ssi/imx_spi.c | 4 ++--
>  include/hw/ssi/imx_spi.h | 5 -
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/hw/ssi/imx_spi.c b/hw/ssi/imx_spi.c
> index d8885ae..e605049 100644
> --- a/hw/ssi/imx_spi.c
> +++ b/hw/ssi/imx_spi.c
> @@ -361,7 +361,7 @@ static void imx_spi_write(void *opaque, hwaddr offset, 
> uint64_t value,
>
>  /* We are in master mode */
>
> -for (i = 0; i < 4; i++) {
> +for (i = 0; i < ECSPI_NUM_CS; i++) {
>  qemu_set_irq(s->cs_lines[i],
>   i == imx_spi_selected_channel(s) ? 0 : 1);
>  }
> @@ -424,7 +424,7 @@ static void imx_spi_realize(DeviceState *dev, Error 
> **errp)
>  sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->iomem);
>  sysbus_init_irq(SYS_BUS_DEVICE(dev), &s->irq);
>
> -for (i = 0; i < 4; ++i) {
> +for (i = 0; i < ECSPI_NUM_CS; ++i) {
>  sysbus_init_irq(SYS_BUS_DEVICE(dev), &s->cs_lines[i]);
>  }
>
> diff --git a/include/hw/ssi/imx_spi.h b/include/hw/ssi/imx_spi.h
> index b82b17f..eeaf49b 100644
> --- a/include/hw/ssi/imx_spi.h
> +++ b/include/hw/ssi/imx_spi.h
> @@ -77,6 +77,9 @@
>
>  #define EXTRACT(value, name) extract32(value, name##_SHIFT, name##_LENGTH)
>
> +/* number of chip selects supported */
> +#define ECSPI_NUM_CS 4
> +
>  #define TYPE_IMX_SPI "imx.spi"
>  OBJECT_DECLARE_SIMPLE_TYPE(IMXSPIState, IMX_SPI)
>
> @@ -89,7 +92,7 @@ struct IMXSPIState {
>
>  qemu_irq irq;
>
> -qemu_irq cs_lines[4];
> +qemu_irq cs_lines[ECSPI_NUM_CS];
>
>  SSIBus *bus;
>
> --
> 2.7.4
>
>



Re: [PATCH-for-5.2? 2/4] hw/arm/exynos4210: Add SD bus QOM alias on the SoC

2020-11-30 Thread Alistair Francis
On Tue, Nov 24, 2020 at 1:55 AM Philippe Mathieu-Daudé  wrote:
>
> To be able to select a particular SD bus from the command
> line, add a QOM alias on the SoC (using a unique name).
>
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  hw/arm/exynos4210.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/arm/exynos4210.c b/hw/arm/exynos4210.c
> index ced2769b102..a60f08d372a 100644
> --- a/hw/arm/exynos4210.c
> +++ b/hw/arm/exynos4210.c
> @@ -408,6 +408,7 @@ static void exynos4210_realize(DeviceState *socdev, Error 
> **errp)
>
>  /*** SD/MMC host controllers ***/
>  for (n = 0; n < EXYNOS4210_SDHCI_NUMBER; n++) {
> +g_autofree char *bus_name = NULL;
>  DeviceState *carddev;
>  BlockBackend *blk;
>  DriveInfo *di;
> @@ -432,6 +433,10 @@ static void exynos4210_realize(DeviceState *socdev, 
> Error **errp)
>  sysbus_mmio_map(busdev, 0, EXYNOS4210_SDHCI_ADDR(n));
>  sysbus_connect_irq(busdev, 0, s->irq_table[exynos4210_get_irq(29, 
> n)]);
>
> +/* Alias controller SD bus to the SoC itself */
> +bus_name = g_strdup_printf("sd-bus%d", n);
> +object_property_add_alias(OBJECT(s), bus_name, OBJECT(dev), 
> "sd-bus");
> +
>  di = drive_get(IF_SD, 0, n);
>  blk = di ? blk_by_legacy_dinfo(di) : NULL;
>  carddev = qdev_new(TYPE_SD_CARD);
> --
> 2.26.2
>
>



[Bug 1906156] Re: Host OS Reboot Required, for Guest kext to Load (Fully)

2020-11-30 Thread Russell Morris
Sure, will do (upstream version). Is there a preferred way to do it?
Meaning ... build locally, or install from some repository?

Thanks!

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1906156

Title:
  Host OS Reboot Required, for Guest kext to Load (Fully)

Status in QEMU:
  Incomplete

Bug description:
  Hi,

  Finding this one a bit odd, but I am loading a driver (kext) in a
  macOS guest ... and it works, on the first VM (domain) startup after a
  full / clean host OS boot (or reboot). However, if I even reboot the
  guest OS, then the driver load fails => can be "corrected" by a full
  host OS reboot (which seems very extreme).

  Is this a known issue, and/or is there a workaround?

  FYI, running,
  QEMU emulator version 5.0.0 (Debian 1:5.0-5ubuntu9.1)
  Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers

  This is for a macOS guest, on a Linux host.

  Thanks!




Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Roman Bolshakov
On Mon, Nov 30, 2020 at 10:40:49PM +0100, Alexander Graf wrote:
> Hi Peter,
> 
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:
> > > 
> > > 
> > > On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:
> > > > Hi Frank,
> > > > 
> > > > Thanks for the update :). Your previous email nudged me into the right 
> > > > direction. I previously had implemented WFI through the internal timer 
> > > > framework which performed way worse.
> > > Cool, glad it's helping. Also, Peter found out that the main thing 
> > > keeping us from just using cntpct_el0 on the host directly and compare 
> > > with cval is that if we sleep, cval is going to be much < cntpct_el0 by 
> > > the sleep time. If we can get either the architecture or macos to read 
> > > out the sleep time then we might be able to not have to use a poll 
> > > interval either!
> > > > Along the way, I stumbled over a few issues though. For starters, the 
> > > > signal mask for SIG_IPI was not set correctly, so while pselect() would 
> > > > exit, the signal would never get delivered to the thread! For a fix, 
> > > > check out
> > > > 
> > > >
> > > > https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/
> > > > 
> > > Thanks, we'll take a look :)
> > > 
> > > > Please also have a look at my latest stab at WFI emulation. It doesn't 
> > > > handle WFE (that's only relevant in overcommitted scenarios). But it 
> > > > does handle WFI and even does something similar to hlt polling, albeit 
> > > > not with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> > 
> > +/* Set cpu->hvf->sleeping so that we get a
> > SIG_IPI signal. */
> > +cpu->hvf->sleeping = true;
> > +smp_mb();
> > +
> > +/* Bail out if we received an IRQ meanwhile */
> > +if (cpu->thread_kicked || (cpu->interrupt_request &
> > +(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +cpu->hvf->sleeping = false;
> > +break;
> > +}
> > +
> > +/* nanosleep returns on signal, so we wake up on kick. 
> > */
> > +nanosleep(ts, NULL);
> > 
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
> 
> 
> Thanks a bunch for the comment! So the trick I was using here is to modify
> the timespec from the kick function before sending the IPI signal. That way,
> we know that either we are inside the sleep (where the signal wakes it up)
> or we are outside the sleep (where timespec={} will make it return
> immediately).
> 
> The only race I can think of is if nanosleep does calculations based on the
> timespec and we happen to send the signal right there and then.
> 
> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal is
> never delivered, how can pselect differentiate "signal from last time is
> still pending" from "new signal because I got an IPI"?
> 
> 

Hi Alex,

There was a patch for x86 HVF that implements CPU kick and it wasn't
merged (mostly because of my laziness). It has some changes like you
introduced in the series and VMX-specific handling of the preemption
timer to guarantee interrupt delivery without kick loss:

https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolsha...@yadro.com/

I wonder if it'd be possible to have common handling of kicks for both
x86 and arm (given that arch-specific bits are wrapped)?

Thanks,
Roman



Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Alexander Graf



On 01.12.20 01:00, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf  wrote:


On 01.12.20 00:01, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf  wrote:

Hi Peter,

On 30.11.20 22:08, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:

On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:

Hi Frank,

Thanks for the update :). Your previous email nudged me into the right 
direction. I previously had implemented WFI through the internal timer 
framework which performed way worse.

Cool, glad it's helping. Also, Peter found out that the main thing keeping us from 
just using cntpct_el0 on the host directly and compare with cval is that if we 
sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get 
either the architecture or macos to read out the sleep time then we might be able 
to not have to use a poll interval either!

Along the way, I stumbled over a few issues though. For starters, the signal 
mask for SIG_IPI was not set correctly, so while pselect() would exit, the 
signal would never get delivered to the thread! For a fix, check out

 
https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/


Thanks, we'll take a look :)


Please also have a look at my latest stab at WFI emulation. It doesn't handle 
WFE (that's only relevant in overcommitted scenarios). But it does handle WFI 
and even does something similar to hlt polling, albeit not with an adaptive 
threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+/* Set cpu->hvf->sleeping so that we get a
SIG_IPI signal. */
+cpu->hvf->sleeping = true;
+smp_mb();
+
+/* Bail out if we received an IRQ meanwhile */
+if (cpu->thread_kicked || (cpu->interrupt_request &
+(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+cpu->hvf->sleeping = false;
+break;
+}
+
+/* nanosleep returns on signal, so we wake up on kick. */
+nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.

Thanks a bunch for the comment! So the trick I was using here is to
modify the timespec from the kick function before sending the IPI
signal. That way, we know that either we are inside the sleep (where the
signal wakes it up) or we are outside the sleep (where timespec={} will
make it return immediately).

The only race I can think of is if nanosleep does calculations based on
the timespec and we happen to send the signal right there and then.

Yes that's the race I was thinking of. Admittedly it's a small window
but it's theoretically possible and part of the reason why pselect was
created.


The problem with blocking IPIs is basically what Frank was describing
earlier: How do you unset the IPI signal pending status? If the signal
is never delivered, how can pselect differentiate "signal from last time
is still pending" from "new signal because I got an IPI"?

In this case we would take the additional wakeup which should be
harmless since we will take the WFx exit again and put us in the
correct state. But that's a lot better than busy looping.


I'm not sure I follow. I'm thinking of the following scenario:

- trap into WFI handler
- go to sleep with blocked SIG_IPI
- SIG_IPI arrives, pselect() exits
- signal is still pending because it's blocked
- enter guest
- trap into WFI handler
- run pselect(), but it immediately exits because SIG_IPI is still pending

This was the loop I was seeing when running with SIG_IPI blocked. That's
part of the reason why I switched to a different model.

What I observe is that when returning from a pending signal pselect
consumes the signal (which is also consistent with my understanding of
what pselect does). That means that it doesn't matter if we take a
second WFx exit because once we reach the pselect in the second WFx
exit the signal will have been consumed by the pselect in the first
exit and we will just wait for the next one.

I don't know why things may have been going wrong in your
implementation but it may be related to the issue with
mach_absolute_time() which I posted about separately and was also
causing busy loops for us in so

Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Peter Collingbourne
On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf  wrote:
>
>
> On 01.12.20 00:01, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf  wrote:
> >> Hi Peter,
> >>
> >> On 30.11.20 22:08, Peter Collingbourne wrote:
> >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:
> 
>  On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:
> > Hi Frank,
> >
> > Thanks for the update :). Your previous email nudged me into the right 
> > direction. I previously had implemented WFI through the internal timer 
> > framework which performed way worse.
>  Cool, glad it's helping. Also, Peter found out that the main thing 
>  keeping us from just using cntpct_el0 on the host directly and compare 
>  with cval is that if we sleep, cval is going to be much < cntpct_el0 by 
>  the sleep time. If we can get either the architecture or macos to read 
>  out the sleep time then we might be able to not have to use a poll 
>  interval either!
> > Along the way, I stumbled over a few issues though. For starters, the 
> > signal mask for SIG_IPI was not set correctly, so while pselect() would 
> > exit, the signal would never get delivered to the thread! For a fix, 
> > check out
> >
> > 
> > https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/
> >
>  Thanks, we'll take a look :)
> 
> > Please also have a look at my latest stab at WFI emulation. It doesn't 
> > handle WFE (that's only relevant in overcommitted scenarios). But it 
> > does handle WFI and even does something similar to hlt polling, albeit 
> > not with an adaptive threshold.
> >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> >>> I'll reply to your patch here. You have:
> >>>
> >>> +/* Set cpu->hvf->sleeping so that we get a
> >>> SIG_IPI signal. */
> >>> +cpu->hvf->sleeping = true;
> >>> +smp_mb();
> >>> +
> >>> +/* Bail out if we received an IRQ meanwhile */
> >>> +if (cpu->thread_kicked || (cpu->interrupt_request &
> >>> +(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>> +cpu->hvf->sleeping = false;
> >>> +break;
> >>> +}
> >>> +
> >>> +/* nanosleep returns on signal, so we wake up on 
> >>> kick. */
> >>> +nanosleep(ts, NULL);
> >>>
> >>> and then send the signal conditional on whether sleeping is true, but
> >>> I think this is racy. If the signal is sent after sleeping is set to
> >>> true but before entering nanosleep then I think it will be ignored and
> >>> we will miss the wakeup. That's why in my implementation I block IPI
> >>> on the CPU thread at startup and then use pselect to atomically
> >>> unblock and begin sleeping. The signal is sent unconditionally so
> >>> there's no need to worry about races between actually sleeping and the
> >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> >>> that's better than missing it entirely.
> >>
> >> Thanks a bunch for the comment! So the trick I was using here is to
> >> modify the timespec from the kick function before sending the IPI
> >> signal. That way, we know that either we are inside the sleep (where the
> >> signal wakes it up) or we are outside the sleep (where timespec={} will
> >> make it return immediately).
> >>
> >> The only race I can think of is if nanosleep does calculations based on
> >> the timespec and we happen to send the signal right there and then.
> > Yes that's the race I was thinking of. Admittedly it's a small window
> > but it's theoretically possible and part of the reason why pselect was
> > created.
> >
> >> The problem with blocking IPIs is basically what Frank was describing
> >> earlier: How do you unset the IPI signal pending status? If the signal
> >> is never delivered, how can pselect differentiate "signal from last time
> >> is still pending" from "new signal because I got an IPI"?
> > In this case we would take the additional wakeup which should be
> > harmless since we will take the WFx exit again and put us in the
> > correct state. But that's a lot better than busy looping.
>
>
> I'm not sure I follow. I'm thinking of the following scenario:
>
>- trap into WFI handler
>- go to sleep with blocked SIG_IPI
>- SIG_IPI arrives, pselect() exits
>- signal is still pending because it's blocked
>- enter guest
>- trap into WFI handler
>- run pselect(), but it immediate exits because SIG_IPI is still pending
>
> This was the loop I was seeing when running with SIG_IPI blocked. That's
> part of the reason why I switched to a different model.

What I observe is that when returning from a pending signal pselect
consumes the signal (which is also consistent with my understanding of
what pselect does

Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Alexander Graf



On 01.12.20 00:01, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf  wrote:

Hi Peter,

On 30.11.20 22:08, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:


On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:

Hi Frank,

Thanks for the update :). Your previous email nudged me into the right 
direction. I previously had implemented WFI through the internal timer 
framework which performed way worse.

Cool, glad it's helping. Also, Peter found out that the main thing keeping us from 
just using cntpct_el0 on the host directly and compare with cval is that if we 
sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get 
either the architecture or macos to read out the sleep time then we might be able 
to not have to use a poll interval either!

Along the way, I stumbled over a few issues though. For starters, the signal 
mask for SIG_IPI was not set correctly, so while pselect() would exit, the 
signal would never get delivered to the thread! For a fix, check out


https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/


Thanks, we'll take a look :)


Please also have a look at my latest stab at WFI emulation. It doesn't handle 
WFE (that's only relevant in overcommitted scenarios). But it does handle WFI 
and even does something similar to hlt polling, albeit not with an adaptive 
threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+/* Set cpu->hvf->sleeping so that we get a
SIG_IPI signal. */
+cpu->hvf->sleeping = true;
+smp_mb();
+
+/* Bail out if we received an IRQ meanwhile */
+if (cpu->thread_kicked || (cpu->interrupt_request &
+(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+cpu->hvf->sleeping = false;
+break;
+}
+
+/* nanosleep returns on signal, so we wake up on kick. */
+nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.


Thanks a bunch for the comment! So the trick I was using here is to
modify the timespec from the kick function before sending the IPI
signal. That way, we know that either we are inside the sleep (where the
signal wakes it up) or we are outside the sleep (where timespec={} will
make it return immediately).

The only race I can think of is if nanosleep does calculations based on
the timespec and we happen to send the signal right there and then.

Yes that's the race I was thinking of. Admittedly it's a small window
but it's theoretically possible and part of the reason why pselect was
created.


The problem with blocking IPIs is basically what Frank was describing
earlier: How do you unset the IPI signal pending status? If the signal
is never delivered, how can pselect differentiate "signal from last time
is still pending" from "new signal because I got an IPI"?

In this case we would take the additional wakeup which should be
harmless since we will take the WFx exit again and put us in the
correct state. But that's a lot better than busy looping.



I'm not sure I follow. I'm thinking of the following scenario:

  - trap into WFI handler
  - go to sleep with blocked SIG_IPI
  - SIG_IPI arrives, pselect() exits
  - signal is still pending because it's blocked
  - enter guest
  - trap into WFI handler
  - run pselect(), but it immediately exits because SIG_IPI is still pending

This was the loop I was seeing when running with SIG_IPI blocked. That's 
part of the reason why I switched to a different model.




I reckon that you could improve things a little by unblocking the
signal and then reblocking it before unlocking iothread (e.g. with a
pselect with zero time interval), which would flush any pending
signals. Since any such signal would correspond to a signal from last
time (because we still have the iothread lock) we know that any future
signals should correspond to new IPIs.



Yeah, I think you actually *have* to do exactly that, because otherwise 
pselect() will always return after 0ns because the signal is still pending.


And yes, I agree that that starts to sound a bit less racy now. But it 
means we can probably also just do


  - WFI handler
  - block SIG_IPI
  - set hvf->sleeping = true
  - c

Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Peter Collingbourne
On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf  wrote:
>
> Hi Peter,
>
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:
> >>
> >>
> >> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:
> >>> Hi Frank,
> >>>
> >>> Thanks for the update :). Your previous email nudged me into the right 
> >>> direction. I previously had implemented WFI through the internal timer 
> >>> framework which performed way worse.
> >> Cool, glad it's helping. Also, Peter found out that the main thing keeping 
> >> us from just using cntpct_el0 on the host directly and compare with cval 
> >> is that if we sleep, cval is going to be much < cntpct_el0 by the sleep 
> >> time. If we can get either the architecture or macos to read out the sleep 
> >> time then we might be able to not have to use a poll interval either!
> >>> Along the way, I stumbled over a few issues though. For starters, the 
> >>> signal mask for SIG_IPI was not set correctly, so while pselect() would 
> >>> exit, the signal would never get delivered to the thread! For a fix, 
> >>> check out
> >>>
> >>>
> >>> https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/
> >>>
> >> Thanks, we'll take a look :)
> >>
> >>> Please also have a look at my latest stab at WFI emulation. It doesn't 
> >>> handle WFE (that's only relevant in overcommitted scenarios). But it does 
> >>> handle WFI and even does something similar to hlt polling, albeit not 
> >>> with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> >
> > +/* Set cpu->hvf->sleeping so that we get a
> > SIG_IPI signal. */
> > +cpu->hvf->sleeping = true;
> > +smp_mb();
> > +
> > +/* Bail out if we received an IRQ meanwhile */
> > +if (cpu->thread_kicked || (cpu->interrupt_request &
> > +(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +cpu->hvf->sleeping = false;
> > +break;
> > +}
> > +
> > +/* nanosleep returns on signal, so we wake up on kick. 
> > */
> > +nanosleep(ts, NULL);
> >
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
>
>
> Thanks a bunch for the comment! So the trick I was using here is to
> modify the timespec from the kick function before sending the IPI
> signal. That way, we know that either we are inside the sleep (where the
> signal wakes it up) or we are outside the sleep (where timespec={} will
> make it return immediately).
>
> The only race I can think of is if nanosleep does calculations based on
> the timespec and we happen to send the signal right there and then.

Yes that's the race I was thinking of. Admittedly it's a small window
but it's theoretically possible and part of the reason why pselect was
created.

> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal
> is never delivered, how can pselect differentiate "signal from last time
> is still pending" from "new signal because I got an IPI"?

In this case we would take the additional wakeup which should be
harmless since we will take the WFx exit again and put us in the
correct state. But that's a lot better than busy looping.

I reckon that you could improve things a little by unblocking the
signal and then reblocking it before unlocking iothread (e.g. with a
pselect with zero time interval), which would flush any pending
signals. Since any such signal would correspond to a signal from last
time (because we still have the iothread lock) we know that any future
signals should correspond to new IPIs.

Peter
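
A minimal self-contained sketch of the block-then-pselect() pattern described
above, for readers following the thread. It is illustrative only (not code
from either patch series) and assumes the usual QEMU convention of mapping
SIG_IPI to SIGUSR1; the function names are made up:

#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>

#ifndef SIG_IPI
#define SIG_IPI SIGUSR1            /* assumption: SIG_IPI maps to SIGUSR1 */
#endif

static sigset_t ipi_unblocked;     /* thread mask with SIG_IPI allowed */

static void sig_ipi_handler(int sig)
{
    (void)sig;                     /* delivery alone interrupts pselect() */
}

/* Run once per vCPU thread: keep SIG_IPI blocked except inside pselect(). */
static void vcpu_ipi_init(void)
{
    struct sigaction sa;
    sigset_t block;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig_ipi_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIG_IPI, &sa, NULL);

    sigemptyset(&block);
    sigaddset(&block, SIG_IPI);
    pthread_sigmask(SIG_BLOCK, &block, &ipi_unblocked);
    sigdelset(&ipi_unblocked, SIG_IPI);
}

/*
 * Wait for a kick or a timeout.  Because SIG_IPI stays blocked outside this
 * call, a kick sent after the caller's last interrupt check is either already
 * pending (pselect() returns immediately and consumes it) or is delivered
 * during the wait; it cannot be lost in between.
 */
static void vcpu_wait_for_ipi(const struct timespec *timeout)
{
    pselect(0, NULL, NULL, NULL, timeout, &ipi_unblocked);
}

The kicking side can then simply pthread_kill() the vCPU thread with SIG_IPI
unconditionally, at the cost of an occasional spurious wakeup.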



Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Peter Collingbourne
On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me into the right 
>> direction. I previously had implemented WFI through the internal timer 
>> framework which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us 
> from just using cntpct_el0 on the host directly and compare with cval is that 
> if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we 
> can get either the architecture or macos to read out the sleep time then we 
> might be able to not have to use a poll interval either!

We tracked down the discrepancies between CNTPCT_EL0 on the guest vs
on the host to the fact that CNTPCT_EL0 on the guest does not
increment while the system is asleep and as such corresponds to
mach_absolute_time() on the host (if you read the XNU sources you will
see that mach_absolute_time() is implemented as CNTPCT_EL0 plus a
constant representing the time spent asleep) while CNTPCT_EL0 on the
host does increment while asleep. This patch switches the
implementation over to using mach_absolute_time() instead of reading
CNTPCT_EL0 directly:

https://android-review.googlesource.com/c/platform/external/qemu/+/1514870

Peter

>>
>> Along the way, I stumbled over a few issues though. For starters, the signal 
>> mask for SIG_IPI was not set correctly, so while pselect() would exit, the 
>> signal would never get delivered to the thread! For a fix, check out
>>
>>   
>> https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't 
>> handle WFE (that's only relevant in overcommitted scenarios). But it does 
>> handle WFI and even does something similar to hlt polling, albeit not with 
>> an adaptive threshold.
>>
>> Also, is there a particular reason you're working on this super interesting 
>> and useful code in a random downstream fork of QEMU? Wouldn't it be more 
>> helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our 
> own downstream though; Android Emulator codebase needs to work solidly on 
> macos and windows which has made keeping up with upstream difficult, and 
> staying on a previous version (2.12) with known quirks easier. (there's also
> some android related customization relating to Qt Ui + different set of 
> virtual devices and snapshot support (incl. snapshots of graphics devices 
> with OpenGLES state tracking), which we hope to separate into other 
> libraries/processes, but it's not insignificant)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But 
>> the high CPU usage seems to be mitigated by having a poll interval (like KVM 
>> does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by 
>> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 
>> which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and 
>> restart the vcpus precisely when CVAL passes, though.
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang  wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye 
>>> toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already 
>>> (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation 
>>> and hook into QEMU's software devices that previously assumed to only work 
>>> with TCG, or have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation btw; 
>>> our way of addressing this is to still allow highmem but not put pci high 
>>> mmio so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, which 
>>> might be worth looking into in Alexander's implementation as well:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>>
>>> Patches so far, FYI:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd5004

Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Peter Maydell
On Mon, 30 Nov 2020 at 20:56, Frank Yang  wrote:
> We'd actually like to contribute upstream too :) We do want to maintain
> our own downstream though; Android Emulator codebase needs to work
> solidly on macos and windows which has made keeping up with upstream difficult

One of the main reasons why OSX and Windows support upstream is
not so great is because very few people are helping to develop,
test and support it upstream. The way to fix that IMHO is for more
people who do care about those platforms to actively engage
with us upstream to help in making those platforms move closer to
being first class citizens. If you stay on a downstream fork
forever then I don't think you'll ever see things improve.

thanks
-- PMM



Re: [PATCH] qmp-shell: Sort by key when pretty-printing

2020-11-30 Thread David Edmondson
On Monday, 2020-11-30 at 15:56:51 -05, John Snow wrote:

> On 10/13/20 10:14 AM, David Edmondson wrote:
>> If the user selects pretty-printing (-p) the contents of any
>> dictionaries in the output are sorted by key.
>> 
>> Signed-off-by: David Edmondson 
>> ---
>>   scripts/qmp/qmp-shell | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/scripts/qmp/qmp-shell b/scripts/qmp/qmp-shell
>> index c5eef06f3f..b4d06096ab 100755
>> --- a/scripts/qmp/qmp-shell
>> +++ b/scripts/qmp/qmp-shell
>> @@ -260,7 +260,7 @@ class QMPShell(qmp.QEMUMonitorProtocol):
>>   indent = None
>>   if self._pretty:
>>   indent = 4
>> -jsobj = json.dumps(qmp, indent=indent)
>> +jsobj = json.dumps(qmp, indent=indent, sort_keys=self._pretty)
>>   print(str(jsobj))
>>   
>>   def _execute_cmd(self, cmdline):
>> 
>
> Hi, out of curiosity, what does this help you accomplish?

When dumping a dictionary with many values, visually finding a specific
one that is of interest is much quicker if they are sorted. Nothing more
than that.

> I've recently been overhauling a LOT of the Python utilities we have, so 
> I'm interested in hearing about how people use these tools and what 
> they'd like them to do.
>
> --js

dme.
-- 
I can't explain, you would not understand. This is not how I am.
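
As a small illustration with a made-up response, plain -p prints keys in
insertion order:

{
    "return": {
        "vcpus": 4,
        "arch": "x86_64",
        "running": true
    }
}

whereas with sort_keys=True the same response lists "arch", "running",
"vcpus" inside "return", which is what makes scanning a large dictionary
quicker.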



Re: [RFC PATCH-for-5.2] gitlab-ci: Do not automatically run Avocado integration tests anymore

2020-11-30 Thread Willian Rampazzo
On Fri, Nov 27, 2020 at 2:41 PM Philippe Mathieu-Daudé
 wrote:
>
> We lately realized that the Avocado framework was not designed
> to be regularly run on CI environments. Therefore, as of 5.2
> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> current users, it is possible to keep the current behavior by
> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> (see [*]).
> From now on, using these jobs (or adding new tests to them)
> is strongly discouraged.

When you say, "Avocado framework was not designed to be regularly run
on CI environments", I feel your pain. Avocado is a really nice test
framework, and I agree with you that running it locally is a little
easier than running inside a CI environment. Debugging a job failure
in the CI is not user-friendly; finding the command to reproduce a job
failure locally is not user-friendly. I understand why you would like
to remove the CI's acceptance tests, but I think your proposal is
missing some arguments and some planning.

If I read correctly, we share the same view that the CI and the
software tests are two different things. Here you are proposing that
we temporarily remove the CI's acceptance tests because it is not
user-friendly to the devs. This does not mean the tests will be lost.
It will be possible to run them locally or in the CI using the
QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable.

>
> Tests based on Avocado will be ported to new job schemes during
> the next releases, with better documentation and templates.

I understand you intend to make a more reliable and stable CI. Some
wording on why the new job scheme will be better than what we have now
and some planning on enabling the acceptance tests again in the CI may
help evaluate if it is feasible or just the same as what we have
today.

It would be nice to hear from other subsystem maintainers their pain
points about using the CI and how we can improve it. I hear you that
Avocado needs to improve its interface to be more user friendly. As an
Avocado developer, I would also like to hear from others where we can
improve Avocado to make it less painful for the QEMU developers.




Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Alexander Graf

Hi Peter,

On 30.11.20 22:08, Peter Collingbourne wrote:

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:



On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:

Hi Frank,

Thanks for the update :). Your previous email nudged me into the right 
direction. I previously had implemented WFI through the internal timer 
framework which performed way worse.

Cool, glad it's helping. Also, Peter found out that the main thing keeping us from 
just using cntpct_el0 on the host directly and compare with cval is that if we 
sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get 
either the architecture or macos to read out the sleep time then we might be able 
to not have to use a poll interval either!

Along the way, I stumbled over a few issues though. For starters, the signal 
mask for SIG_IPI was not set correctly, so while pselect() would exit, the 
signal would never get delivered to the thread! For a fix, check out

   
https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/


Thanks, we'll take a look :)


Please also have a look at my latest stab at WFI emulation. It doesn't handle 
WFE (that's only relevant in overcommitted scenarios). But it does handle WFI 
and even does something similar to hlt polling, albeit not with an adaptive 
threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+/* Set cpu->hvf->sleeping so that we get a
SIG_IPI signal. */
+cpu->hvf->sleeping = true;
+smp_mb();
+
+/* Bail out if we received an IRQ meanwhile */
+if (cpu->thread_kicked || (cpu->interrupt_request &
+(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+cpu->hvf->sleeping = false;
+break;
+}
+
+/* nanosleep returns on signal, so we wake up on kick. */
+nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.



Thanks a bunch for the comment! So the trick I was using here is to 
modify the timespec from the kick function before sending the IPI 
signal. That way, we know that either we are inside the sleep (where the 
signal wakes it up) or we are outside the sleep (where timespec={} will 
make it return immediately).


The only race I can think of is if nanosleep does calculations based on 
the timespec and we happen to send the signal right there and then.


The problem with blocking IPIs is basically what Frank was describing 
earlier: How do you unset the IPI signal pending status? If the signal 
is never delivered, how can pselect differentiate "signal from last time 
is still pending" from "new signal because I got an IPI"?



Alex




Re: [PATCH-for-6.0 v4 15/17] gitlab-ci: Add test for Xen (on CentOS 7)

2020-11-30 Thread Stefano Stabellini
On Fri, 27 Nov 2020, Anthony PERARD wrote:
> On Thu, Nov 26, 2020 at 12:45:59PM -0500, Eduardo Habkost wrote:
> > On Thu, Nov 26, 2020 at 05:38:24PM +, Anthony PERARD wrote:
> > > Is `make check` going to do something useful with the Xen support? Or is
> > > it going to need more work in order to test the Xen support of QEMU?
> > > (Like starting an actual Xen guest.)
> > 
> > I don't think it will test Xen support, but we still want to at
> > least check if --enable-xen doesn't break anything else.
> 
> That sound good.
> 
> > Is there any public CI system anywhere where Xen support is
> > tested today?
> 
> Yes, we have osstest which regularly tests Xen with QEMU from upstream.
> Results are sent to xen-devel. But that might not be very useful for
> qemu-devel.
> 
> We also have a GitLab CI which does some Xen tests, but I don't think
> QEMU is tested there.
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=automation/gitlab-ci/test.yaml;hb=HEAD
> https://gitlab.com/xen-project/xen/

QEMU (the version of QEMU picked by the Xen tools) is built but not used
in the Xen Project CI-loop yet.

I am extending the CI-loop with more tests [1], and I would like to have at
least one QEMU test at some point soon. Probably something based on Xen 9pfs.

[1] https://marc.info/?l=xen-devel&m=160627845825763 



Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Peter Collingbourne
On Mon, Nov 30, 2020 at 12:56 PM Frank Yang  wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf  wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me into the right 
>> direction. I previously had implemented WFI through the internal timer 
>> framework which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us 
> from just using cntpct_el0 on the host directly and compare with cval is that 
> if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we 
> can get either the architecture or macos to read out the sleep time then we 
> might be able to not have to use a poll interval either!
>>
>> Along the way, I stumbled over a few issues though. For starters, the signal 
>> mask for SIG_IPI was not set correctly, so while pselect() would exit, the 
>> signal would never get delivered to the thread! For a fix, check out
>>
>>   
>> https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't 
>> handle WFE (that's only relevant in overcommitted scenarios). But it does 
>> handle WFI and even does something similar to hlt polling, albeit not with 
>> an adaptive threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+/* Set cpu->hvf->sleeping so that we get a
SIG_IPI signal. */
+cpu->hvf->sleeping = true;
+smp_mb();
+
+/* Bail out if we received an IRQ meanwhile */
+if (cpu->thread_kicked || (cpu->interrupt_request &
+(CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+cpu->hvf->sleeping = false;
+break;
+}
+
+/* nanosleep returns on signal, so we wake up on kick. */
+nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.

Peter

>>
>> Also, is there a particular reason you're working on this super interesting 
>> and useful code in a random downstream fork of QEMU? Wouldn't it be more 
>> helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our 
> own downstream though; Android Emulator codebase needs to work solidly on 
> macos and windows which has made keeping up with upstream difficult, and 
> staying on a previous version (2.12) with known quirks easier. (there's also
> some android related customization relating to Qt Ui + different set of 
> virtual devices and snapshot support (incl. snapshots of graphics devices 
> with OpenGLES state tracking), which we hope to separate into other 
> libraries/processes, but it's not insignificant)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But 
>> the high CPU usage seems to be mitigated by having a poll interval (like KVM 
>> does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by 
>> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 
>> which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and 
>> restart the vcpus precisely when CVAL passes, though.
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang  wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye 
>>> toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already 
>>> (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation 
>>> and hook into QEMU's software devices that previously assumed to only work 
>>> with TCG, or have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation btw; 
>>> our way of addressing this is to still allow highmem but not put pci high 
>>> mmio so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, 

Re: [PATCH v2 07/12] tpm: put some tpm devices into the correct category

2020-11-30 Thread Stefan Berger

On 11/30/20 3:36 AM, Gan Qixin wrote:

Some TPM devices have no category; put them into the correct category.

Signed-off-by: Gan Qixin 
---
Cc: Stefan Berger 
---
  hw/tpm/tpm_tis_isa.c| 1 +
  hw/tpm/tpm_tis_sysbus.c | 1 +
  2 files changed, 2 insertions(+)

diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
index 6fd876eebf..10d8a14f19 100644
--- a/hw/tpm/tpm_tis_isa.c
+++ b/hw/tpm/tpm_tis_isa.c
@@ -150,6 +150,7 @@ static void tpm_tis_isa_class_init(ObjectClass *klass, void 
*data)
  dc->reset = tpm_tis_isa_reset;
  tc->request_completed = tpm_tis_isa_request_completed;
  tc->get_version = tpm_tis_isa_get_tpm_version;
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
  }

  static const TypeInfo tpm_tis_isa_info = {
diff --git a/hw/tpm/tpm_tis_sysbus.c b/hw/tpm/tpm_tis_sysbus.c
index 2c32aa7099..45e63efd63 100644
--- a/hw/tpm/tpm_tis_sysbus.c
+++ b/hw/tpm/tpm_tis_sysbus.c
@@ -139,6 +139,7 @@ static void tpm_tis_sysbus_class_init(ObjectClass *klass, 
void *data)
  dc->reset = tpm_tis_sysbus_reset;
  tc->request_completed = tpm_tis_sysbus_request_completed;
  tc->get_version = tpm_tis_sysbus_get_tpm_version;
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
  }

  static const TypeInfo tpm_tis_sysbus_info = {


Reviewed-by: Stefan Berger 





Re: [PATCH] qmp-shell: Sort by key when pretty-printing

2020-11-30 Thread John Snow

On 10/13/20 10:14 AM, David Edmondson wrote:

If the user selects pretty-printing (-p) the contents of any
dictionaries in the output are sorted by key.

Signed-off-by: David Edmondson 
---
  scripts/qmp/qmp-shell | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/qmp/qmp-shell b/scripts/qmp/qmp-shell
index c5eef06f3f..b4d06096ab 100755
--- a/scripts/qmp/qmp-shell
+++ b/scripts/qmp/qmp-shell
@@ -260,7 +260,7 @@ class QMPShell(qmp.QEMUMonitorProtocol):
  indent = None
  if self._pretty:
  indent = 4
-jsobj = json.dumps(qmp, indent=indent)
+jsobj = json.dumps(qmp, indent=indent, sort_keys=self._pretty)
  print(str(jsobj))
  
  def _execute_cmd(self, cmdline):




Hi, out of curiosity, what does this help you accomplish?

I've recently been overhauling a LOT of the Python utilities we have, so 
I'm interested in hearing about how people use these tools and what 
they'd like them to do.


--js




Re: [PATCH v2] hw/nvme: Move NVMe emulation out of hw/block/ directory

2020-11-30 Thread Klaus Jensen
On Nov 30 15:52, Philippe Mathieu-Daudé wrote:
> As IDE used to be, NVMe emulation is becoming an active
> subsystem. Move it into its own namespace.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> v2: Rebased after nvme-ns got merged in commit 8680d6e3646
> ---
>  meson.build   |   1 +
>  hw/{block/nvme.h => nvme/nvme-internal.h} |   4 +-
>  hw/{block => nvme}/nvme-ns.h  |   0
>  hw/{block/nvme.c => nvme/core.c}  |   2 +-
>  hw/{block => nvme}/nvme-ns.c  |   0
>  MAINTAINERS   |   2 +-
>  hw/Kconfig|   1 +
>  hw/block/Kconfig  |   5 -
>  hw/block/meson.build  |   1 -
>  hw/block/trace-events | 132 -
>  hw/meson.build|   1 +
>  hw/nvme/Kconfig   |   4 +
>  hw/nvme/meson.build   |   1 +
>  hw/nvme/trace-events  | 133 ++
>  14 files changed, 145 insertions(+), 142 deletions(-)
>  rename hw/{block/nvme.h => nvme/nvme-internal.h} (98%)
>  rename hw/{block => nvme}/nvme-ns.h (100%)
>  rename hw/{block/nvme.c => nvme/core.c} (99%)
>  rename hw/{block => nvme}/nvme-ns.c (100%)

Would we want to consider renaming nvme-ns.c to namespace.c? And maybe
also follow up with consolidating nvme-ns.h into nvme-internal.h?


signature.asc
Description: PGP signature


Re: [PATCH 2/8] hvf: Move common code out

2020-11-30 Thread Frank Yang
Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But
the high CPU usage seems to be mitigated by having a poll interval (like
KVM does) in handling WFI:

https://android-review.googlesource.com/c/platform/external/qemu/+/1512501

This is loosely inspired by
https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766
which does seem to specify a poll interval.

It would be cool if we could have a lightweight way to enter sleep and
restart the vcpus precisely when CVAL passes, though.
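
A rough sketch of the kind of computation this would involve (illustrative
only, not the Android patch; it assumes the backend can read CNTV_CVAL_EL0,
CNTVCT_EL0 and CNTFRQ_EL0, and uses QEMU's muldiv64()/NANOSECONDS_PER_SECOND
helpers from "qemu/osdep.h" and "qemu/timer.h"):

static uint64_t wfi_sleep_ns(uint64_t cval, uint64_t cntvct, uint64_t cntfrq)
{
    if (cval <= cntvct) {
        return 0;    /* timer condition already met, don't go to sleep */
    }
    /* remaining counter ticks converted to nanoseconds */
    return muldiv64(cval - cntvct, NANOSECONDS_PER_SECOND, cntfrq);
}

The vcpu could then sleep for at most that long (or until a signal arrives)
instead of polling.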

Frank


On Fri, Nov 27, 2020 at 3:30 PM Frank Yang  wrote:

> Hi all,
>
> +Peter Collingbourne 
>
> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>
> Peter and I have been working on an HVF Apple Silicon backend with an eye
> toward Android guests.
>
> We have gotten things to basically switch to Android userspace already
> (logcat/shell and graphics available at least)
>
> Our strategy so far has been to import logic from the KVM implementation
> and hook into QEMU's software devices that previously assumed to only work
> with TCG, or have KVM-specific paths.
>
> Thanks to Alexander for the tip on the 36-bit address space limitation
> btw; our way of addressing this is to still allow highmem but not put pci
> high mmio so high.
>
> Also, note we have a sleep/signal based mechanism to deal with WFx, which
> might be worth looking into in Alexander's implementation as well:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>
> Patches so far, FYI:
>
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>
>
> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>
> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>
> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>
> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>
> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>
> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>
> Peter's also noticed that there are extra steps needed for M1's to allow
> TCG to work, as it involves JIT:
>
>
> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>
> We'd appreciate any feedback/comments :)
>
> Best,
>
> Frank
>
> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf  wrote:
>
>>
>> On 27.11.20 21:00, Roman Bolshakov wrote:
>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>> >> Until now, Hypervisor.framework has only been available on x86_64
>> systems.
>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>> >> prepare for support for multiple architectures, let's move common code
>> out
>> >> into its own accel directory.
>> >>
>> >> Signed-off-by: Alexander Graf 
>> >> ---
>> >>   MAINTAINERS |   9 +-
>> >>   accel/hvf/hvf-all.c |  56 +
>> >>   accel/hvf/hvf-cpus.c| 468
>> 
>> >>   accel/hvf/meson.build   |   7 +
>> >>   accel/meson.build   |   1 +
>> >>   include/sysemu/hvf_int.h|  69 ++
>> >>   target/i386/hvf/hvf-cpus.c  | 131 --
>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>> >>   target/i386/hvf/hvf.c   | 360 +--
>> >>   target/i386/hvf/meson.build |   1 -
>> >>   target/i386/hvf/x86hvf.c|  11 +-
>> >>   target/i386/hvf/x86hvf.h|   2 -
>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>> >>   create mode 100644 accel/hvf/hvf-all.c
>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>> >>   create mode 100644 accel/hvf/meson.build
>> >>   create mode 100644 include/sysemu/hvf_int.h
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>> >>
>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>> >> index 68bc160f41..ca4b6d9279 100644
>> >> --- a/MAINTAINERS
>> >> +++ b/MAINTAINERS
>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani 
>> >>   M: Roman Bolshakov 
>> >>   W: https://wiki.qemu.org/Features/HVF
>> >>   S: Maintained
>> >> -F: accel/stubs/hvf-stub.c
>> > There was a patch for that in the RFC series from Claudio.
>>
>>
>> Yeah, I'm not worried about this hunk :).
>>
>>
>> >
>> >>   F: target/i386/hvf/
>> >> +
>> >> +HVF
>> >> +M: Cameron Esfahani 
>> >> +M: Roman Bolshakov 
>> >> +W: https://wiki.qemu.org/Features/HVF
>> >> +S: Maintained
>> 

[Bug 1906295] Re: Implementation of exclusive monitor in ARM

2020-11-30 Thread Peter Maydell
QEMU's load/store exclusive implementation is known to not be
architecturally compliant. Most notably, it works on virtual addresses
and the architecture specifies that exclusives are supposed to work on
physical addresses. We provide a "good enough for what real code (not
weird test cases) needs" implementation.

However, in this specific case, we're within the architectural
requirements. In the ARMv8A Arm ARM (DDI0487F.c) the pseudocode
AArch32.ExclusiveMonitorsPass() function has this comment:

// It is IMPLEMENTATION DEFINED whether the detection of memory aborts happens
// before or after the check on the local Exclusives monitor. As a result a
// failure of the local monitor can occur on some implementations even if the
// memory access would give a memory abort.

In other words, it's up to the implementation whether it detects and
reports the invalid address first. QEMU's implementation chooses not to.
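
For completeness, a hypothetical user-mode reproducer of the case the report
describes (this is not from QEMU's test suite; on QEMU linux-user it is
expected to print a status of 1 and continue, while some hardware would
deliver SIGSEGV for the bad address):

/* build with an AArch32 toolchain, e.g. arm-linux-gnueabihf-gcc */
#include <stdio.h>

int main(void)
{
    unsigned int status;
    unsigned int *bad = (unsigned int *)0;  /* illegal address, never LDREX'd */
    unsigned int val = 0;

    __asm__ volatile("strex %0, %2, [%1]"
                     : "=&r"(status)
                     : "r"(bad), "r"(val)
                     : "memory");
    printf("strex status = %u\n", status);
    return 0;
}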


** Changed in: qemu
   Status: New => Invalid

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1906295

Title:
  Implementation of exclusive monitor in ARM

Status in QEMU:
  Invalid

Bug description:
  Hi

  I refer to the implementation of the exclusive monitor in ARM32. For an
  instruction like STREX Rx,Ry,[Rz], we need to check whether the
  address [Rz] is in exclusive state. If not, we set the value Rx as 1
  without doing the store operation. However, I noticed that QEMU will
  not check whether the address that Rz points to is a legal address or
  not. If the value of Rz is 0x0 but it is not in exclusive state, QEMU
  will set Rx as 1 and continue to execute the following instructions.

  However, physical devices will check the value of Rz. If Rz is an illegal
  address (e.g., 0x0), a SIGSEGV signal will be raised even if the address is
  not in exclusive state. I searched a lot of documentation about ARM and it
  seems that the ARM specification manual does not specify the implementation
  of the exclusive monitor in detail. I am not sure which one is the right
  behavior.
  Should QEMU add this check? This might not be a mistake; however, wouldn't
  it be better if QEMU behaved the same as a physical device? Feel free to ask
  if you need a testcase. Many thanks

  Regards
  Muhui

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1906295/+subscriptions



Re: [RFC] ich9:cpuhp: add support for cpu hot-unplug with SMI broadcast enabled

2020-11-30 Thread Ankur Arora

On 2020-11-30 8:58 a.m., Laszlo Ersek wrote:

On 11/28/20 00:48, Ankur Arora wrote:


It is possible that there are CPUs with bits for both is_inserting and
is_removing. In that case QemuCpuhpCollectApicIds() would put them in the
PluggedApicIds array and the unplug eventually happens in the next
firmware invocation.

If a CPU has both is_inserting and fw_remove set, the firmware processes the
hotplug in that invocation and the unplug happens whenever the OSPM triggers
the firmware next.


If these corner cases will actually work (I'm somewhat doubtful), that
will be really great.


Heh, yeah. That's a big if. I'll see if I can hit it in my testing.

Thanks
Ankur



Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Paolo Bonzini

On 30/11/20 19:10, Kevin Wolf wrote:

Am 30.11.2020 um 17:57 hat Paolo Bonzini geschrieben:

The main problem is that it wouldn't extend well, if at all, to
machines and devices.  So those would still not be integrated into the
QAPI schema.


What do you think is the biggest difference there? Don't devices work
the same as user creatable objects in that you first set a bunch of
properties (which would now be done through QAPI instead) and then call
the completion/realize method that actually makes use of them?


For devices it's just the practical issue that there are too many to 
have something like this series.  For machine types the main issue is 
that the version-specific machine types would have to be defined in one 
more place.



The single struct doesn't bother me _too much_ actually.  What bothers me is
that it won't be a single source of all QOM objects, only those that happen
to be created by object-add.


But isn't it only natural that a list of these objects will exist in
some way, implicitly or explicitly? object-add must somehow decide which
object types it allows the user to create.


There's already one, it's objects that implement user creatable.  I 
don't think having it as a QAPI enum (or discriminator) is worthwhile, 
and it's one more thing that can go out of sync between the QAPI schema 
and the C code.



I'm also pretty sure that QOM as it exists now is not the right solution
for any of them because it has some major shortcomings. It's too easy to
get things wrong (like the writable properties after creation), its
introspection is rather weak and separated from the QAPI introspection,
it doesn't encourage documentation, and it involves quite a bit of
boilerplate and duplicated code between class implementations.

A modified QOM might be the right solution, though. I would like to
bring QAPI and QOM together because most of these weaknesses are
strengths of QAPI.


I agree wholeheartedly.  But your series goes to the opposite excess. 
Eduardo is doing work in QOM to mitigate the issues you point out, and 
you have to meet in the middle with him.  Starting with the user-visible 
parts focuses on the irrelevant parts.



Mostly because it's more or less the same issue that you have with
BlockdevOptions, with the extra complication that this series only deals
with the easy one of the four above categories.


I'm not sure which exact problem with BlockdevOptions you mean. The
reason why the block layer doesn't use BlockdevOptions everywhere is
-drive support.


You don't have a single BlockdevOptions* field in the structs of block/. 
 My interpretation of this is that BlockdevOptions* is a good schema 
for configuring things but not for run-time state.



Maybe if we don't want to commit to keeping the ObjectOptions schema,
the part that should wait is object-add and I should do the command line
options first? Then the schema remains an implementation detail for now
that is invisible in introspection.


I don't see much benefit in converting _any_ of the three actually.  The
only good reason I see for QAPIfying this is the documentation, and the
promise of deduplicating QOM boilerplate.  The latter is only a promise, but
documentation alone is a damn good reason and it's enough to get this work
into a mergeable shape as soon as possible!


I think having some input validation done by QAPI instead of in each QOM
object individually is nice, too. You get it after CLI, QMP and HMP all
go through QAPI.


But the right way to do that is to use Eduardo's field properties: they 
make QOM do the right thing, adding another layer of parsing is just 
adding more complexity.  Are there any validation bugs that you're 
fixing?  Is that something that cannot be fixed elsewhere, or are you 
papering over bad QOM coding?  (Again, I'm not debating that QOM 
properties are hard to write correctly).



But maybe, we could start in the opposite direction: start with the use QAPI
to eliminate QOM boilerplate.  Basing your work on Eduardo's field
properties series, you could add a new 'object' "kind" to QAPI that would
create an array of field properties (e.g. a macro expanding to a compound
literal?)


There is a very simple reason why I don't want to start with the QAPI
generator: John has multiple series pending that touch basically every
part of the QAPI generator. This means not only that I need to be
careful about merge conflict (or base my work on top of five other
series, which feels adventurous), but also that I would be competing
with John for Markus' reviewer capacity, further slowing things down.


That's something that you have to discuss with John and Markus.  It may 
be that John has to shelve part of his work or rebase on top of yours 
instead.


By adding an extra parsing layer you're adding complexity that may not 
be needed in the end, and we're very bad at removing it later.  So I'm 
worried that this series will become technical debt in just a few months.



Well, two reasons: Also bec

Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism

2020-11-30 Thread Andrey Gruzdev

On 30.11.2020 19:40, Peter Xu wrote:

On Mon, Nov 30, 2020 at 11:11:00AM +0300, Andrey Gruzdev wrote:

On 28.11.2020 01:28, Peter Xu wrote:

On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:

Since reading UFFD events and saving paged data are performed
from the same thread, write fault latencies are sensitive to
migration stream stalls. Limiting total page saving rate is a
method to reduce the amount of noticeable fault resolution latencies.

Migration bandwidth limiting is achieved via noticing cases of
out-of-threshold write fault latencies and temporarily disabling
(strictly speaking, severely throttling) saving non-faulting pages.


So have you done any measurements out of it, as we've talked in previous
version?  Thanks,



Sorry, not done yet.


So do you still plan to? :)

And if not, could you describe the rationale behind this patch?  For example,
what's the problem behind (e.g., guest hangs for xxx seconds, maybe?) and
what's the outcome (guest doesn't hang any more)?

Thanks,



Yes, I think providing a latency histogram with a description of the test 
case is the right thing.
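
For reference, a purely illustrative sketch of the throttling idea described
in the commit message (all names and thresholds below are invented, this is
not code from the series):

#include <stdbool.h>
#include <stdint.h>

#define FAULT_LATENCY_THRESHOLD_NS  (5 * 1000 * 1000)    /* 5 ms, arbitrary */
#define THROTTLE_PERIOD_NS          (20 * 1000 * 1000)   /* 20 ms, arbitrary */

static int64_t throttle_until_ns;

/* called once a write fault has been resolved */
static void note_fault_latency(int64_t fault_start_ns, int64_t now_ns)
{
    if (now_ns - fault_start_ns > FAULT_LATENCY_THRESHOLD_NS) {
        throttle_until_ns = now_ns + THROTTLE_PERIOD_NS;
    }
}

/* consulted before saving a non-faulting (background) page */
static bool may_save_background_page(int64_t now_ns)
{
    return now_ns >= throttle_until_ns;
}

A latency histogram with and without this throttle would show whether the
out-of-threshold tail actually shrinks.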


--
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
virtuzzo.com



Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()

2020-11-30 Thread Andrey Gruzdev

On 30.11.2020 19:32, Peter Xu wrote:

On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote:

+#ifdef CONFIG_LINUX
+/**
+ * ram_find_block_by_host_address: find RAM block containing host page
+ *
+ * Returns pointer to RAMBlock if found, NULL otherwise
+ *
+ * @rs: current RAM state
+ * @page_address: host page address
+ */
+static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr 
page_address)


Reuse qemu_ram_block_from_host() somehow?



Seems not very suitable here, since we use rs->last_seen_block to restart
the search.


The search logic is actually still the same, it's just about which block to
start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one).  So
an internal helper to pass in that information would be nice.  Though that'll
require the rcu lock to be taken beforehand to keep the ramblock ptr valid.

No worry - we can keep this and work on top too, I think.
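
Something along these lines, perhaps - an illustrative sketch only, not a
reviewed patch; the helper name is invented, the field/macro names follow
current QEMU conventions, and the caller is assumed to hold the RCU read
lock so the starting pointer stays valid:

static RAMBlock *ram_block_from_host_locked(void *host, RAMBlock *start)
{
    RAMBlock *b;

    /* from 'start' (or the list head) to the end of the list ... */
    for (b = start ? start : QLIST_FIRST_RCU(&ram_list.blocks);
         b; b = QLIST_NEXT_RCU(b, next)) {
        if ((uint8_t *)host >= b->host &&
            (uint8_t *)host < b->host + b->max_length) {
            return b;
        }
    }
    /* ... then wrap around and scan the blocks before 'start' once */
    for (b = QLIST_FIRST_RCU(&ram_list.blocks); b && b != start;
         b = QLIST_NEXT_RCU(b, next)) {
        if ((uint8_t *)host >= b->host &&
            (uint8_t *)host < b->host + b->max_length) {
            return b;
        }
    }
    return NULL;
}

qemu_ram_block_from_host(), the fault path and the last_seen_block case could
then all call this with their preferred starting block.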

[...]
  +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)

+{
+struct uffd_msg uffd_msg;
+hwaddr page_address;
+RAMBlock *bs;
+int res;
+
+if (!migrate_background_snapshot()) {
+return NULL;
+}
+
+res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
+if (res <= 0) {
+return NULL;
+}
+
+page_address = uffd_msg.arg.pagefault.address;
+bs = ram_find_block_by_host_address(rs, page_address);
+if (!bs) {
+/* In case we couldn't find respective block, just unprotect faulting 
page. */
+uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, 
false);
+error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
+page_address);


Looks ok to error_report() instead of assert(), but I'd suggest dropping the
call to uffd_protect_memory() at least.  The only reason not to use assert()
is because we try our best to avoid crashing the vm; however, I really doubt
whether uffd_protect_memory() is the right thing to do even if it happens - we
may end up trying to unprotect some strange pages that we don't even know
where they are...



IMHO it's better to unprotect these strange pages than to leave them protected
by UFFD, to avoid leaving the VM completely inoperable.
At least we know the page generated a wr-fault, maybe due to incorrect
write-tracking initialization, or the RAMBlock has somehow gone. Nevertheless,
if we leave the page as is, the VM would certainly lock up.


Yes, makes some sense too.  However, it's definitely a severe programming
error; even if the VM is unblocked, it can be in a very weird state...

Maybe we just assert? Then we can see how unlucky we'll be. :)



Yeah, seems assert is a right thing here :)



Hmm, I wonder about assert(). In QEMU it would do something in release
builds?


I guess yes, at least for now.  Because osdep.h has this:

/*
  * We have a lot of unaudited code that may fail in strange ways, or
  * even be a security risk during migration, if you disable assertions
  * at compile-time.  You may comment out these safety checks if you
  * absolutely want to disable assertion overhead, but it is not
  * supported upstream so the risk is all yours.  Meanwhile, please
  * submit patches to remove any side-effects inside an assertion, or
  * fixing error handling that should use Error instead of assert.
  */
#ifdef NDEBUG
#error building with NDEBUG is not supported
#endif
#ifdef G_DISABLE_ASSERT
#error building with G_DISABLE_ASSERT is not supported
#endif

[...]



Oops, didn't know that, thanks for the info!


+/**
+ * ram_save_host_page_post: ram_save_host_page() post-notifier
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ * @opaque: opaque context value
+ * @res_override: pointer to the return value of ram_save_host_page(),
+ *   overwritten in case of an error
+ */
+static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
+void *opaque, int *res_override)
+{
+/* Check if page is from UFFD-managed region. */
+if (pss->block->flags & RAM_UF_WRITEPROTECT) {
+#ifdef CONFIG_LINUX
+ram_addr_t page_from = (ram_addr_t) opaque;
+hwaddr page_address = (hwaddr) pss->block->host +
+  ((hwaddr) page_from << TARGET_PAGE_BITS);


I feel like most new uses of hwaddr is not correct...  As I also commented in
the other patch.  We should save a lot of castings if switched.



Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as
uintptr_t. In any case the UFFD interface relies on the u64 type.


For example, page_from should be a host address, we can use unsigned long, or
uint64_t, but ram_addr_t is not proper, which is only used in ramblock address
space.

I think it's fine that e.g. we use common types like uint64_t directly.



Now I also think we'd better use common types - mostly uint64_t 
directly, so as not to confuse anybody with those specific type descriptions.





+hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << 
TARGET_PAGE_BITS;
+int res;
+
+/* F

Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Peter Krempa
On Mon, Nov 30, 2020 at 13:25:20 +0100, Kevin Wolf wrote:
> This series adds a QAPI type for the properties of all user creatable
> QOM types and finally makes QMP object-add use the new ObjectOptions
> union so that QAPI introspection can be used for user creatable objects.

FYI, here's a libvirt patchset which (apart from refactors) adds
compatibility with the deprecated/removed 'props' sub-object and also
adds testing to see whether all objects used by libvirt conform to the
new schema:

https://www.redhat.com/archives/libvir-list/2020-November/msg01625.html

Spoiler alert: surprisingly they do match so no updates are necessary!




Re: [PATCH] qemu-nbd: Fix a memleak in qemu_nbd_client_list()

2020-11-30 Thread Eric Blake
On 11/30/20 6:36 AM, Alex Chen wrote:
> When qio_channel_socket_connect_sync() fails,
> we should goto the 'out' label to free 'sioc' instead of returning.
> 
> Reported-by: Euler Robot 
> Signed-off-by: Alex Chen 
> ---
>  qemu-nbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

A local leak, but the only caller of qemu_nbd_client_list() is main()
which returns the value and thereby exits the program, so it is
inconsequential in the bigger picture.  I'll defer this to 6.0, and pick
it up through my NBD tree once that opens.

> 
> diff --git a/qemu-nbd.c b/qemu-nbd.c
> index a7075c5419..47587a709e 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -181,7 +181,7 @@ static int qemu_nbd_client_list(SocketAddress *saddr, 
> QCryptoTLSCreds *tls,
>  sioc = qio_channel_socket_new();
>  if (qio_channel_socket_connect_sync(sioc, saddr, &err) < 0) {
>  error_report_err(err);
> -return EXIT_FAILURE;
> +goto out;
>  }
>  rc = nbd_receive_export_list(QIO_CHANNEL(sioc), tls, hostname, &list,
>   &err);
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[Bug 1906295] [NEW] Implementation of exclusive monitor in ARM

2020-11-30 Thread JIANG Muhui
Public bug reported:

Hi

I refer to the implementation of the exclusive monitor in ARM32. For an
instruction like STREX Rx,Ry,[Rz], we need to check whether the address
[Rz] is in exclusive state. If not, we set the value Rx as 1 without
doing the store operation. However, I noticed that QEMU will not check
whether the address that Rz points to is a legal address or not. If the
value of Rz is 0x0 but it is not in exclusive state, QEMU will set Rx as
1 and continue to execute the following instructions.

However, physical devices will check the value of Rz. If Rz is an illegal
address (e.g., 0x0), a SIGSEGV signal will be raised even if the address is
not in exclusive state. I searched a lot of documentation about ARM and it
seems that the ARM specification manual does not specify the implementation
of the exclusive monitor in detail. I am not sure which one is the right
behavior.
Should QEMU add this check? This might not be a mistake; however, wouldn't it
be better if QEMU behaved the same as a physical device? Feel free to ask if
you need a testcase. Many thanks

Regards
Muhui

** Affects: qemu
 Importance: Undecided
 Status: New

** Summary changed:

- Improper implementation of exclusive monitor in ARM
+ Implementation of exclusive monitor in ARM

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1906295

Title:
  Implementation of exclusive monitor in ARM

Status in QEMU:
  New

Bug description:
  Hi

  I refer to the implementation of the exclusive monitor in ARM32. For an
  instruction like STREX Rx,Ry,[Rz], we need to check whether the
  address [Rz] is in exclusive state. If not, we set the value Rx as 1
  without doing the store operation. However, I noticed that QEMU will
  not check whether the address that Rz points to is a legal address or
  not. If the value of Rz is 0x0 but it is not in exclusive state, QEMU
  will set Rx as 1 and continue to execute the following instructions.

  However, physical devices will check the value of Rz. If Rz is an illegal
  address (e.g., 0x0), a SIGSEGV signal will be raised even if the address is
  not in exclusive state. I searched a lot of documentation about ARM and it
  seems that the ARM specification manual does not specify the implementation
  of the exclusive monitor in detail. I am not sure which one is the right
  behavior.
  Should QEMU add this check? This might not be a mistake; however, wouldn't
  it be better if QEMU behaved the same as a physical device? Feel free to ask
  if you need a testcase. Many thanks

  Regards
  Muhui

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1906295/+subscriptions



Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers

2020-11-30 Thread Andrey Gruzdev

On 30.11.2020 18:34, Peter Xu wrote:

On Sun, Nov 29, 2020 at 11:12:10PM +0300, Andrey Gruzdev wrote:

+void ram_write_tracking_stop(void)
+{
+#ifdef CONFIG_LINUX
+RAMState *rs = ram_state;
+RAMBlock *bs;
+assert(rs->uffdio_fd >= 0);


Maybe too harsh - we can return if it's invalid.

Meanwhile, better rcu_read_lock(), as well?



Yep, RCU lock, I'll add. Why too harsh? Just a debug assertion.


I was afraid some special path could trigger ram_write_tracking_stop() being
called before ram_write_tracking_start(), then vm could crash.  If we can
guarantee that not happening, then it's also ok with assert().

[...]



It can't really happen, but I agree that assert() is a bit excessive here.
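
For illustration, the revised stop path could end up looking roughly like
this (a sketch only, not the actual patch; RAM_UF_WRITEPROTECT is the flag
introduced by this series and the unprotect/unregister details are elided):

void ram_write_tracking_stop(void)
{
#ifdef CONFIG_LINUX
    RAMState *rs = ram_state;
    RAMBlock *block;

    if (!rs || rs->uffdio_fd < 0) {
        return;                     /* tracking was never started */
    }

    WITH_RCU_READ_LOCK_GUARD() {
        RAMBLOCK_FOREACH(block) {
            if (block->flags & RAM_UF_WRITEPROTECT) {
                /* remove write protection and un-register the range here */
            }
        }
    }
#endif
}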


+/**
+ * uffd_poll_events: poll UFFD file descriptor for read
+ *
+ * Returns true if events are available for read, false otherwise
+ *
+ * @uffd: UFFD file descriptor
+ * @tmo: timeout in milliseconds, 0 for non-blocking operation,
+ *   negative value for infinite wait
+ */
+bool uffd_poll_events(int uffd, int tmo)


Shall we spell "tmo" out?

In the comment? I think it's ok.


I'd suggest to spell it out everywhere, especially in the codes.  But feel free
to take your own preference.  Thanks,



Thanks,

--
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
virtuzzo.com



Re: QMP and the 'id' parameter

2020-11-30 Thread John Snow

On 11/23/20 1:57 AM, Markus Armbruster wrote:

(Was it not possible to have the client send an ACK response that
doesn't indicate success, but just signals receipt? Semantically, it
would be the difference between do-x and start-x as a command.)

Feels quite possible to me.

If I read git-log correctly, the commands' design ignored the race with
shutdown / suspend, and 'success-response': false was a quick fix.

I think changing the commands from 'do-x' to 'start-x' would have been
better.  A bit more work, though, and back then there weren't any
'start-x' examples to follow, I guess.

I wish success-response didn't exist, but is getting rid of it worth
changing the interface?  I honestly don't know.



OK, I'll jot it down in the ideas book and we can come back to it later, 
after we've fried the other 2,385 fish in queue. I think simplifying the 
interface is going to be helpful long-term, but I don't have a good grip 
on the actual work estimate.


In terms of a simple library design, this edge case makes an async 
library quite a bit worse to support, because now it needs to understand 
that sometimes the target will never reply, and it's command dependent. 
Now the core "QMP" library, ostensibly the command-agnostic layer, needs 
to query to discover the semantics of each command it sends.



(1) Identify 'success-response: false' commands:

- guest-shutdown
- guest-suspend-disk
- guest-suspend-ram
- guest-suspend-hybrid


(2) Replace them with jobs, where a success response acknowledges 
receipt and start of the job. Errors encountered setting up the job can 
be returned synchronously; errors encountered past the point of no 
return can be queried via the job system.


(I am assuming that it is untenable to have a system where we 
acknowledge receipt, go past the point of no return and then encounter 
an error and have no way to report it, since we've already responded to 
the shutdown/suspend request. Therefore, the job system seems appropriate.)


((The onus is now on the job layer to understand that on job success, 
the server goes away and cannot report success, but I think this is 
easier to implement in a client at the logical layer than at the 
protocol layer.))



(3) Deprecate the old interfaces.


(4) Delete the old interfaces. Remove 'success-response: false' from the 
meta-schema. Remove 'success-response' from GuestAgentCommandInfo.



...but, not terribly important. Might be nice to help plumb some 
non-block jobs for sake of example for other areas where they might be 
useful. Other stuff to think about in the meantime, but thank you for 
the heads up. Maybe a good GSoC project, actually? It seems 
straightforward, just with a lot of plumbing.



I suppose for now, what can happen is the client (using the AQMP 
library) can simply await the results of execute() with a timeout. If 
the server closes the connection, we know it worked -- or, if there's a 
timeout and we used --no-shutdown, we can interrogate the VM status.





Re: [PATCH for-6.0 v2 1/3] spapr: Improve naming of some vCPU id related items

2020-11-30 Thread Greg Kurz
On Mon, 30 Nov 2020 18:32:27 +0100
Cédric Le Goater  wrote:

> On 11/30/20 5:52 PM, Greg Kurz wrote:
> > The machine tells the IC backend the number of vCPU ids it will be
> > exposed to, in order to:
> > - fill the "ibm,interrupt-server-ranges" property in the DT (XICS)
> > - size the VP block used by the in-kernel chip (XICS-on-XIVE, XIVE)
> > 
> > The current "nr_servers" and "spapr_max_server_number" naming can
> > mislead people into thinking it is about a quantity of CPUs. Make
> > it clear this is all about vCPU ids.
> 
> OK. This looks fine. 
> 
> But, XIVE does not care about vCPU ids. Only the count of vCPUs

XIVE does not care about vCPU ids indeed but KVM does. The KVM XIVE code
has been indexing VPs with vCPU ids from the start as visible in linux
commit 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE
interrupt controller") :

+int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
+struct kvm_vcpu *vcpu, u32 cpu)
+{
[...]
+   xc->vp_id = xive->vp_base + cpu;

This allows to have a quick conversion between vCPU id and VP id
without relying on an intermediate mapping structure. 

> is relevant. So, it would be nice to add a comment in the code 
> that we got it wrong at some point or that XICS imposed the use
> of max vCPU ids.
> 

There's more to it. The vCPU id spacing is a hack that was
introduced to work around the limitation of pre-POWER9 CPU
types that cannot have HW threads from the same core running
in different MMU contexts. This was probably not such a good
idea but we have to live with it now since vCPU ids are guest
visible. I don't think it brings much to complain once more
over vCPU spacing since this has been addressed with VSMT.

The machine now always has consecutive vCPU ids by default,
ie. number of vCPUs == number of vCPU ids. Only custom machine
setups that tweak VSMT to some other value can be affected
by the spacing.
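
For reference, a simplified sketch of where the spacing comes from (this
mirrors the logic of spapr_vcpu_id() and spapr_max_server_number() in
hw/ppc/spapr.c, it is not a verbatim copy):

static int example_vcpu_id(int cpu_index, int smp_threads, int vsmt)
{
    /* threads of one core are contiguous, cores are spaced 'vsmt' apart */
    return (cpu_index / smp_threads) * vsmt + cpu_index % smp_threads;
}

static int example_max_vcpu_ids(int max_cpus, int smp_threads, int vsmt)
{
    /* e.g. max_cpus = 8, smp_threads = 1, vsmt = 8 -> 64 ids for 8 vCPUs */
    return (max_cpus * vsmt + smp_threads - 1) / smp_threads;  /* round up */
}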

> C. 
> 
> 
> > Signed-off-by: Greg Kurz 
> > ---
> >  include/hw/ppc/spapr.h  |  2 +-
> >  include/hw/ppc/spapr_irq.h  |  8 
> >  include/hw/ppc/spapr_xive.h |  2 +-
> >  include/hw/ppc/xics_spapr.h |  2 +-
> >  hw/intc/spapr_xive.c|  8 
> >  hw/intc/spapr_xive_kvm.c|  4 ++--
> >  hw/intc/xics_kvm.c  |  4 ++--
> >  hw/intc/xics_spapr.c|  8 
> >  hw/ppc/spapr.c  |  8 
> >  hw/ppc/spapr_irq.c  | 18 +-
> >  10 files changed, 32 insertions(+), 32 deletions(-)
> > 
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index b7ced9faebf5..dc99d45e2852 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -849,7 +849,7 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> >  int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error 
> > **errp);
> >  void spapr_clear_pending_events(SpaprMachineState *spapr);
> >  void spapr_clear_pending_hotplug_events(SpaprMachineState *spapr);
> > -int spapr_max_server_number(SpaprMachineState *spapr);
> > +int spapr_max_vcpu_ids(SpaprMachineState *spapr);
> >  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
> >uint64_t pte0, uint64_t pte1);
> >  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
> > diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> > index c22a72c9e270..2e53fc9e6cbb 100644
> > --- a/include/hw/ppc/spapr_irq.h
> > +++ b/include/hw/ppc/spapr_irq.h
> > @@ -43,7 +43,7 @@ DECLARE_CLASS_CHECKERS(SpaprInterruptControllerClass, 
> > SPAPR_INTC,
> >  struct SpaprInterruptControllerClass {
> >  InterfaceClass parent;
> >  
> > -int (*activate)(SpaprInterruptController *intc, uint32_t nr_servers,
> > +int (*activate)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
> >  Error **errp);
> >  void (*deactivate)(SpaprInterruptController *intc);
> >  
> > @@ -62,7 +62,7 @@ struct SpaprInterruptControllerClass {
> >  /* These methods should only be called on the active intc */
> >  void (*set_irq)(SpaprInterruptController *intc, int irq, int val);
> >  void (*print_info)(SpaprInterruptController *intc, Monitor *mon);
> > -void (*dt)(SpaprInterruptController *intc, uint32_t nr_servers,
> > +void (*dt)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
> > void *fdt, uint32_t phandle);
> >  int (*post_load)(SpaprInterruptController *intc, int version_id);
> >  };
> > @@ -74,7 +74,7 @@ int spapr_irq_cpu_intc_create(struct SpaprMachineState 
> > *spapr,
> >  void spapr_irq_cpu_intc_reset(struct SpaprMachineState *spapr, PowerPCCPU 
> > *cpu);
> >  void spapr_irq_cpu_intc_destroy(struct SpaprMachineState *spapr, 
> > PowerPCCPU *cpu);
> >  void spapr_irq_print_info(struct SpaprMachineState *spapr, Monitor *mon);
> > -void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t nr_servers,
> > +void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t max_vcpu_ids,
> >void *fdt, u

Re: [PATCH 4/4] elf_ops.h: Be more verbose with ROM blob names

2020-11-30 Thread Richard Henderson
On 11/29/20 2:39 PM, Peter Maydell wrote:
> Instead of making the ROM blob name something like:
>   phdr #0: /home/petmay01/linaro/qemu-misc-tests/ldmia-fault.axf
> make it a little more self-explanatory for people who don't know
> ELF format details:
>   /home/petmay01/linaro/qemu-misc-tests/ldmia-fault.axf ELF program header 
> segment 0
> 
> Signed-off-by: Peter Maydell 
> ---

Reviewed-by: Richard Henderson 

r~




Re: [PATCH] python 3.5 compatibility

2020-11-30 Thread Enrico Weigelt, metux IT consult
On 30.11.20 10:44, Kevin Wolf wrote:

Hi,

> While type hints are valuable documentation, they are more than just
> that. They help to find and prevent avoidable bugs in the code. We are
> actively in the process of adding them to everything in the QAPI
> generator to improve maintainability rather than removing them.

IOW: I'll have to do lots of backporting work, just to keep it running :(


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287



Re: [PATCH] python 3.5 compatibility

2020-11-30 Thread Enrico Weigelt, metux IT consult
On 27.11.20 20:15, Peter Maydell wrote:

Hi,

> Could you say which "stable distros" you have in mind, and whether
> they are covered by our "supported build platforms" policy
> https://www.qemu.org/docs/master/system/build-platforms.html  ?

I'm running on Devuan Ascii.

And packaging python-3.6 just for a few pieces of one application (qemu)
is far too much work.

--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287



Re: [PATCH 3/4] elf_ops.h: Don't truncate name of the ROM blobs we create

2020-11-30 Thread Richard Henderson
On 11/29/20 2:39 PM, Peter Maydell wrote:
> Currently the load_elf code assembles the ROM blob name into a
> local 128 byte fixed-size array. Use g_strdup_printf() instead so
> that we don't truncate the pathname if it happens to be long.
> (This matters mostly for monitor 'info roms' output and for the
> error messages if ROM blobs overlap.)
> 
> Signed-off-by: Peter Maydell 
> ---
>  include/hw/elf_ops.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson 

r~




Re: [PATCH 2/4] hw/core/loader.c: Improve reporting of ROM overlap errors

2020-11-30 Thread Richard Henderson
On 11/29/20 2:39 PM, Peter Maydell wrote:
> In rom_check_and_register_reset() we report to the user if there is
> a "ROM region overlap". This has a couple of problems:
>  * the reported information is not very easy to interpret
>  * the function just prints the overlap to stderr (and relies on
>its single callsite in vl.c to do an error_report() and exit)
>  * only the first overlap encountered is diagnosed
> 
> Make this function use error_report() and error_printf() and
> report a more user-friendly report with all the overlaps
> diagnosed.
> 
> Sample old output:
> 
> rom: requested regions overlap (rom dtb. free=0x8000, 
> addr=0x)
> qemu-system-aarch64: rom check and register reset failed
> 
> Sample new output:
> 
> qemu-system-aarch64: Some ROM regions are overlapping
> These ROM regions might have been loaded by direct user request or by default.
> They could be BIOS/firmware images, a guest kernel, initrd or some other file 
> loaded into guest memory.
> Check whether you intended to load all this guest code, and whether it has 
> been built to load to the correct addresses.
> 
> The following two regions overlap (in the cpu-memory-0 address space):
>   phdr #0: /home/petmay01/linaro/qemu-misc-tests/ldmia-fault.axf (addresses 
> 0x - 0x8000)
>   dtb (addresses 0x - 0x0010)
> 
> The following two regions overlap (in the cpu-memory-0 address space):
>   phdr #1: /home/petmay01/linaro/qemu-misc-tests/bad-psci-call.axf (addresses 
> 0x4000 - 0x4010)
>   phdr #0: /home/petmay01/linaro/qemu-misc-tests/bp-test.elf (addresses 
> 0x4000 - 0x4020)
> 
> Signed-off-by: Peter Maydell 
> ---

Reviewed-by: Richard Henderson 

r~




Re: [PATCH 1/4] hw/core/loader.c: Track last-seen ROM in rom_check_and_register_reset()

2020-11-30 Thread Richard Henderson
On 11/29/20 2:39 PM, Peter Maydell wrote:
> In rom_check_and_register_reset() we detect overlaps by looking at
> whether the ROM blob we're currently examining is in the same address
> space and starts before the previous ROM blob ends.  (This works
> because the ROM list is kept sorted in order by AddressSpace and then
> by address.)
> 
> Instead of keeping the AddressSpace and last address of the previous ROM
> blob in local variables, just keep a pointer to it.
> 
> This will allow us to print more useful information when we do detect
> an overlap.
> 
> Signed-off-by: Peter Maydell 
> ---
>  hw/core/loader.c | 23 +++
>  1 file changed, 15 insertions(+), 8 deletions(-)

Reviewed-by: Richard Henderson 

r~




Re: [PATCH] target/mips: Allow executing MSA instructions on Loongson-3A4000

2020-11-30 Thread Richard Henderson
On 11/30/20 4:22 AM, Philippe Mathieu-Daudé wrote:
> The Loongson-3A4000 is a GS464V-based processor with MIPS MSA ASE:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg763059.html
> 
> Commit af868995e1b correctly set the 'MSA present' bit of Config3
> register, but forgot to allow the MSA instructions decoding in
> insn_flags, so executing them triggers a 'Reserved Instruction'.
> 
> Fix by adding the ASE_MSA mask to insn_flags.
> 
> Fixes: af868995e1b ("target/mips: Add Loongson-3 CPU definition")
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Buggy since 5.1, so probably not a big deal.
> ---
>  target/mips/translate_init.c.inc | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson 

r~



Re: [PATCH] decodetree: Allow use of hex/bin format for argument field values

2020-11-30 Thread Richard Henderson
On 11/30/20 6:26 AM, Philippe Mathieu-Daudé wrote:
>  # 'Foo=number' sets an argument field to a constant value
> -if re.fullmatch(re_C_ident + '=[+-]?[0-9]+', t):
> +if re.fullmatch(re_C_ident + '=[+-]?(0[bx])?[0-9]+', t):
>  (fname, value) = t.split('=')
> -value = int(value)
> +if re.fullmatch('[+-]?0b[0-9]+', value):
> +base = 2
> +elif re.fullmatch('[+-]?0x[0-9]+', value):
> +base = 16
> +else:
> +base = 10
> +value = int(value, base)
>  flds = add_field(lineno, flds, fname, ConstField(value))
>  continue

Well, the regexps are off.  No letters for the hex, and 9 accepted for binary.
I think with the right regexps, just trusting int(value, 0) is good enough.

So maybe something like

  re_C_ident + "=[+-]?([0-9]+|0x[0-9a-fA-F]+|0b[01]+)"


r~



Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Kevin Wolf
Am 30.11.2020 um 17:57 hat Paolo Bonzini geschrieben:
> On 30/11/20 16:46, Kevin Wolf wrote:
> > Am 30.11.2020 um 15:58 hat Paolo Bonzini geschrieben:
> > > With this series it's basically pointless to have QOM properties at
> > > all.
> > 
> > Not entirely, because there are still some writable properties that can
> > be changed later on.
> 
> Are there really any (that are not bugs like opened/loaded)? That's also why
> Eduardo and I discussed a class-wide allow_set function for his field
> properties series.

Yes. I don't really know most objects apart from what I had to learn for
this series, but I know at least two that actually allow this
intentionally: iothread and throttle-group.

> > So with this in mind, I think I'm in favour of completely leaving the
> > initialisation of properties on object creation to QAPI, and only
> > providing individual setters if we actually intend to allow property
> > changes after creation.
> 
> The main problem is that it wouldn't extend well, if at all, to
> machines and devices.  So those would still not be integrated into the
> QAPI schema.

What do you think is the biggest difference there? Don't devices work
the same as user creatable objects in that you first set a bunch of
properties (which would now be done through QAPI instead) and then call
the completion/realize method that actually makes use of them?

I must admit that I don't know how machine types work.

> > > So the question is, are we okay with shoveling half of QEMU's backend data
> > > model into a single struct?  If so, there are important consequences.
> > 
> > Yeah, the single struct bothers me a bit, both in the QAPI schema and in
> > the C source.
> 
> The single struct doesn't bother me _too much_ actually.  What bothers me is
> that it won't be a single source of all QOM objects, only those that happen
> to be created by object-add.

But isn't it only natural that a list of these objects will exist in
some way, implicitly or explicitly? object-add must somehow decide which
object types it allows the user to create.

Once we describe the object types in the schema (rather than only their
properties), which is required if we want to generate the QOM
boilerplate, this can possibly become implicit because then QAPI can
know which objects implement USER_CREATABLE.

> So I start to wonder if QOM as it exists now is the right solution for
> all kind of objects:
> 
> - backends and other object-add types (not per-target, relatively few
> classes and even fewer hierarchies)
> 
> - machine (per-target, many classes, no hierarchy)
> 
> - device (can be treated as not per-target, many many classes, very few
> hierarchies)
> 
> - accelerator (per-target, few classes, no hierarchy)
> 
> - chardev (ok those are the same as the first category)
> 
> If QOM is the right solution, this patch goes in the wrong direction.
> 
> If QOM is the wrong solution, this patch is okay but then we have another
> problem to solve. :)

I think the requirements for all of them are probably similar enough
that they can potentially be covered by a single thing.

I'm also pretty sure that QOM as it exists now is not the right solution
for any of them because it has some major shortcomings. It's too easy to
get things wrong (like the writable properties after creation), its
introspection is rather weak and separated from the QAPI introspection,
it doesn't encourage documentation, and it involves quite a bit of
boilerplate and duplicated code between class implementations.

A modified QOM might be the right solution, though. I would like to
bring QAPI and QOM together because most of these weaknesses are
strengths of QAPI.

I guess the real question is what aspects of QOM need to be changed to
make it the right solution.

> > > The problem with this series is that you are fine with deduplicating 
> > > things
> > > as a third step, but you cannot be sure that such deduplication is 
> > > possible
> > > at all.  So while I don't have any problems in principle with the
> > > ObjectOptions concept, I don't think it should be committed without a 
> > > clear
> > > idea of how to do the third step.
> > 
> > Do you have any specific concerns why the deduplication might not be
> > possible, or just because it's uncharted territory?
> 
> Mostly because it's more or less the same issue that you have with
> BlockdevOptions, with the extra complication that this series only deals
> with the easy one of the four above categories.

I'm not sure which exact problem with BlockdevOptions you mean. The
reason why the block layer doesn't use BlockdevOptions everywhere is
-drive support.

> > Maybe if we don't want to commit to keeping the ObjectOptions schema,
> > the part that should wait is object-add and I should do the command line
> > options first? Then the schema remains an implementation detail for now
> > that is invisible in introspection.
> 
> I don't see much benefit in converting _any_ of the three actually.  The
> only good reason I see f

Re: [PATCH for-6.0 v2 1/3] spapr: Improve naming of some vCPU id related items

2020-11-30 Thread Cédric Le Goater
On 11/30/20 6:32 PM, Cédric Le Goater wrote:
> On 11/30/20 5:52 PM, Greg Kurz wrote:
>> The machine tells the IC backend the number of vCPU ids it will be
>> exposed to, in order to:
>> - fill the "ibm,interrupt-server-ranges" property in the DT (XICS)
>> - size the VP block used by the in-kernel chip (XICS-on-XIVE, XIVE)
>>
>> The current "nr_servers" and "spapr_max_server_number" naming can
>> mislead people into thinking it is about a quantity of CPUs. Make
>> it clear this is all about vCPU ids.
> 
> OK. This looks fine. 
> 
> But, XIVE does not care about vCPU ids. Only the count of vCPUs
> is relevant. So, it would be nice to add a comment in the code 
> that we got it wrong at some point or that XICS imposed the use
> of max vCPU ids.

Which you do in the next patch,

Reviewed-by: Cédric Le Goater 

Thanks,

C. 
> 
>> Signed-off-by: Greg Kurz 
>> ---
>>  include/hw/ppc/spapr.h  |  2 +-
>>  include/hw/ppc/spapr_irq.h  |  8 
>>  include/hw/ppc/spapr_xive.h |  2 +-
>>  include/hw/ppc/xics_spapr.h |  2 +-
>>  hw/intc/spapr_xive.c|  8 
>>  hw/intc/spapr_xive_kvm.c|  4 ++--
>>  hw/intc/xics_kvm.c  |  4 ++--
>>  hw/intc/xics_spapr.c|  8 
>>  hw/ppc/spapr.c  |  8 
>>  hw/ppc/spapr_irq.c  | 18 +-
>>  10 files changed, 32 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index b7ced9faebf5..dc99d45e2852 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -849,7 +849,7 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
>>  int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error **errp);
>>  void spapr_clear_pending_events(SpaprMachineState *spapr);
>>  void spapr_clear_pending_hotplug_events(SpaprMachineState *spapr);
>> -int spapr_max_server_number(SpaprMachineState *spapr);
>> +int spapr_max_vcpu_ids(SpaprMachineState *spapr);
>>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>>uint64_t pte0, uint64_t pte1);
>>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>> index c22a72c9e270..2e53fc9e6cbb 100644
>> --- a/include/hw/ppc/spapr_irq.h
>> +++ b/include/hw/ppc/spapr_irq.h
>> @@ -43,7 +43,7 @@ DECLARE_CLASS_CHECKERS(SpaprInterruptControllerClass, 
>> SPAPR_INTC,
>>  struct SpaprInterruptControllerClass {
>>  InterfaceClass parent;
>>  
>> -int (*activate)(SpaprInterruptController *intc, uint32_t nr_servers,
>> +int (*activate)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
>>  Error **errp);
>>  void (*deactivate)(SpaprInterruptController *intc);
>>  
>> @@ -62,7 +62,7 @@ struct SpaprInterruptControllerClass {
>>  /* These methods should only be called on the active intc */
>>  void (*set_irq)(SpaprInterruptController *intc, int irq, int val);
>>  void (*print_info)(SpaprInterruptController *intc, Monitor *mon);
>> -void (*dt)(SpaprInterruptController *intc, uint32_t nr_servers,
>> +void (*dt)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
>> void *fdt, uint32_t phandle);
>>  int (*post_load)(SpaprInterruptController *intc, int version_id);
>>  };
>> @@ -74,7 +74,7 @@ int spapr_irq_cpu_intc_create(struct SpaprMachineState 
>> *spapr,
>>  void spapr_irq_cpu_intc_reset(struct SpaprMachineState *spapr, PowerPCCPU 
>> *cpu);
>>  void spapr_irq_cpu_intc_destroy(struct SpaprMachineState *spapr, PowerPCCPU 
>> *cpu);
>>  void spapr_irq_print_info(struct SpaprMachineState *spapr, Monitor *mon);
>> -void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t nr_servers,
>> +void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t max_vcpu_ids,
>>void *fdt, uint32_t phandle);
>>  
>>  uint32_t spapr_irq_nr_msis(struct SpaprMachineState *spapr);
>> @@ -105,7 +105,7 @@ typedef int 
>> (*SpaprInterruptControllerInitKvm)(SpaprInterruptController *,
>>  
>>  int spapr_irq_init_kvm(SpaprInterruptControllerInitKvm fn,
>> SpaprInterruptController *intc,
>> -   uint32_t nr_servers,
>> +   uint32_t max_vcpu_ids,
>> Error **errp);
>>  
>>  /*
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 26c8d90d7196..643129b13536 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -79,7 +79,7 @@ int spapr_xive_end_to_target(uint8_t end_blk, uint32_t 
>> end_idx,
>>  /*
>>   * KVM XIVE device helpers
>>   */
>> -int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t nr_servers,
>> +int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t 
>> max_vcpu_ids,
>>  Error **errp);
>>  void kvmppc_xive_disconnect(SpaprInterruptController *intc);
>>  void kvmppc_xive_reset(SpaprXive *xive, Error **errp);
>> diff --git a/include/hw/ppc/xics_spapr.

Re: [PATCH for-6.0 v2 2/3] spapr/xive: Fix size of END table and number of claimed IPIs

2020-11-30 Thread Cédric Le Goater
On 11/30/20 5:52 PM, Greg Kurz wrote:
> The sPAPR XIVE device has an internal ENDT table the size of
> which is configurable by the machine. This table is supposed
> to contain END structures for all possible vCPUs that may
> enter the guest. The machine must also claim IPIs for all
> possible vCPUs since this is expected by the guest.
> 
> spapr_irq_init() takes care of that under the assumption that
> spapr_max_vcpu_ids() returns the number of possible vCPUs.
> This happens to be the case when the VSMT mode is set to match
> the number of threads per core in the guest (default behavior).
> With non-default VSMT settings, this limit is > to the number
> of vCPUs. In the worst case, we can end up allocating an 8 times
> bigger ENDT and claiming 8 times more IPIs than needed. But more
> importantly, this creates a confusion between number of vCPUs and
> vCPU ids, which can lead to subtle bugs like [1].
> 
> Use smp.max_cpus instead of spapr_max_vcpu_ids() in
> spapr_irq_init() for the latest machine type. Older machine
> types continue to use spapr_max_vcpu_ids() since the size of
> the ENDT is migration visible.
> 
> [1] https://bugs.launchpad.net/qemu/+bug/1900241
> 
> Signed-off-by: Greg Kurz 


Reviewed-by: Cédric Le Goater 

Thanks,

C.

> ---
>  include/hw/ppc/spapr.h |  1 +
>  hw/ppc/spapr.c |  3 +++
>  hw/ppc/spapr_irq.c | 16 +---
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index dc99d45e2852..95bf210d0bf6 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -139,6 +139,7 @@ struct SpaprMachineClass {
>  hwaddr rma_limit;  /* clamp the RMA to this size */
>  bool pre_5_1_assoc_refpoints;
>  bool pre_5_2_numa_associativity;
> +bool pre_6_0_xive_max_cpus;
>  
>  bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>uint64_t *buid, hwaddr *pio, 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index ab59bfe941d0..227a926dfd48 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4530,8 +4530,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
>   */
>  static void spapr_machine_5_2_class_options(MachineClass *mc)
>  {
> +SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>  spapr_machine_6_0_class_options(mc);
>  compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
> +smc->pre_6_0_xive_max_cpus = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 552e30e93036..4d9ecd5d0af8 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -324,17 +324,27 @@ void spapr_irq_init(SpaprMachineState *spapr, Error 
> **errp)
>  }
>  
>  if (spapr->irq->xive) {
> -uint32_t max_vcpu_ids = spapr_max_vcpu_ids(spapr);
> +uint32_t max_cpus = MACHINE(spapr)->smp.max_cpus;
>  DeviceState *dev;
>  int i;
>  
> +/*
> + * Older machine types used to size the ENDT and IPI range
> + * according to the upper limit of vCPU ids, which can be
> + * higher than smp.max_cpus with custom VSMT settings. Keep
> + * the previous behavior for compatibility with such setups.
> + */
> +if (smc->pre_6_0_xive_max_cpus) {
> +max_cpus = spapr_max_vcpu_ids(spapr);
> +}
> +
>  dev = qdev_new(TYPE_SPAPR_XIVE);
>  qdev_prop_set_uint32(dev, "nr-irqs", smc->nr_xirqs + 
> SPAPR_XIRQ_BASE);
>  /*
>   * 8 XIVE END structures per CPU. One for each available
>   * priority
>   */
> -qdev_prop_set_uint32(dev, "nr-ends", max_vcpu_ids << 3);
> +qdev_prop_set_uint32(dev, "nr-ends", max_cpus << 3);
>  object_property_set_link(OBJECT(dev), "xive-fabric", OBJECT(spapr),
>   &error_abort);
>  sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal);
> @@ -342,7 +352,7 @@ void spapr_irq_init(SpaprMachineState *spapr, Error 
> **errp)
>  spapr->xive = SPAPR_XIVE(dev);
>  
>  /* Enable the CPU IPIs */
> -for (i = 0; i < max_vcpu_ids; ++i) {
> +for (i = 0; i < max_cpus; ++i) {
>  SpaprInterruptControllerClass *sicc
>  = SPAPR_INTC_GET_CLASS(spapr->xive);
>  
> 




[PATCH 2/3] tests/acceptance: verify s390x device detection

2020-11-30 Thread Cornelia Huck
The kernel/initrd combination does not provide the virtio-net
driver; therefore, simply check whether the presented device type
is indeed virtio-net for the two virtio-net-{ccw,pci} devices.

Signed-off-by: Cornelia Huck 
---
 tests/acceptance/machine_s390_ccw_virtio.py | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tests/acceptance/machine_s390_ccw_virtio.py 
b/tests/acceptance/machine_s390_ccw_virtio.py
index 683b6e0dac2e..e203ee304264 100644
--- a/tests/acceptance/machine_s390_ccw_virtio.py
+++ b/tests/acceptance/machine_s390_ccw_virtio.py
@@ -80,3 +80,14 @@ class S390CCWVirtioMachine(Test):
 exec_command_and_wait_for_pattern(self,
   'cat 
/sys/bus/ccw/devices/0.3.1234/virtio?/features',
   virtio_rng_features)
+# verify that we indeed have virtio-net devices (without having the
+# virtio-net driver handy)
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/ccw/devices/0.1./cutype',
+  '3832/01')
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/pci/devices/0005\:00\:00.0/subsystem_vendor',
+  '0x1af4')
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/pci/devices/0005\:00\:00.0/subsystem_device',
+  '0x0001')
-- 
2.26.2




[PATCH 1/3] tests/acceptance: test virtio-ccw revision handling

2020-11-30 Thread Cornelia Huck
The max_revision prop of virtio-ccw devices can be used to force
an older revision for compatibility handling. The easiest way to
check this is to force a device to revision 0, which turns off
virtio-1.

Signed-off-by: Cornelia Huck 
---
 tests/acceptance/machine_s390_ccw_virtio.py | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tests/acceptance/machine_s390_ccw_virtio.py 
b/tests/acceptance/machine_s390_ccw_virtio.py
index db6352c44434..683b6e0dac2e 100644
--- a/tests/acceptance/machine_s390_ccw_virtio.py
+++ b/tests/acceptance/machine_s390_ccw_virtio.py
@@ -51,6 +51,10 @@ class S390CCWVirtioMachine(Test):
  '-initrd', initrd_path,
  '-append', kernel_command_line,
  '-device', 'virtio-net-ccw,devno=fe.1.',
+ '-device',
+ 'virtio-rng-ccw,devno=fe.2.,max_revision=0',
+ '-device',
+ 'virtio-rng-ccw,devno=fe.3.1234,max_revision=2',
  '-device', 'zpci,uid=5,target=zzz',
  '-device', 'virtio-net-pci,id=zzz')
 self.vm.launch()
@@ -60,9 +64,19 @@ class S390CCWVirtioMachine(Test):
 # first debug shell is too early, we need to wait for device detection
 exec_command_and_wait_for_pattern(self, 'exit', shell_ready)
 
-ccw_bus_id="0.1."
+ccw_bus_ids="0.1.  0.2.  0.3.1234"
 pci_bus_id="0005:00:00.0"
 exec_command_and_wait_for_pattern(self, 'ls /sys/bus/ccw/devices/',
-  ccw_bus_id)
+  ccw_bus_ids)
 exec_command_and_wait_for_pattern(self, 'ls /sys/bus/pci/devices/',
   pci_bus_id)
+# check that the device at 0.2. is in legacy mode, while the
+# device at 0.3.1234 has the virtio-1 feature bit set
+
virtio_rng_features="11001000"
+
virtio_rng_features_legacy="1100"
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/ccw/devices/0.2./virtio?/features',
+  virtio_rng_features_legacy)
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/ccw/devices/0.3.1234/virtio?/features',
+  virtio_rng_features)
-- 
2.26.2




[PATCH 3/3] tests/acceptance: test s390x zpci fid propagation

2020-11-30 Thread Cornelia Huck
Verify that a fid specified on the command line shows up correctly
as the function_id in the guest.

Signed-off-by: Cornelia Huck 
---
 tests/acceptance/machine_s390_ccw_virtio.py | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tests/acceptance/machine_s390_ccw_virtio.py 
b/tests/acceptance/machine_s390_ccw_virtio.py
index e203ee304264..53b8484f8f9c 100644
--- a/tests/acceptance/machine_s390_ccw_virtio.py
+++ b/tests/acceptance/machine_s390_ccw_virtio.py
@@ -56,7 +56,9 @@ class S390CCWVirtioMachine(Test):
  '-device',
  'virtio-rng-ccw,devno=fe.3.1234,max_revision=2',
  '-device', 'zpci,uid=5,target=zzz',
- '-device', 'virtio-net-pci,id=zzz')
+ '-device', 'virtio-net-pci,id=zzz',
+ '-device', 'zpci,uid=0xa,fid=12,target=serial',
+ '-device', 'virtio-serial-pci,id=serial')
 self.vm.launch()
 
 shell_ready = "sh: can't access tty; job control turned off"
@@ -65,11 +67,11 @@ class S390CCWVirtioMachine(Test):
 exec_command_and_wait_for_pattern(self, 'exit', shell_ready)
 
 ccw_bus_ids="0.1.  0.2.  0.3.1234"
-pci_bus_id="0005:00:00.0"
+pci_bus_ids="0005:00:00.0  000a:00:00.0"
 exec_command_and_wait_for_pattern(self, 'ls /sys/bus/ccw/devices/',
   ccw_bus_ids)
 exec_command_and_wait_for_pattern(self, 'ls /sys/bus/pci/devices/',
-  pci_bus_id)
+  pci_bus_ids)
 # check that the device at 0.2. is in legacy mode, while the
 # device at 0.3.1234 has the virtio-1 feature bit set
 
virtio_rng_features="11001000"
@@ -91,3 +93,7 @@ class S390CCWVirtioMachine(Test):
 exec_command_and_wait_for_pattern(self,
   'cat 
/sys/bus/pci/devices/0005\:00\:00.0/subsystem_device',
   '0x0001')
+# check fid propagation
+exec_command_and_wait_for_pattern(self,
+  'cat 
/sys/bus/pci/devices/000a\:00\:00.0/function_id',
+  '0x000c')
-- 
2.26.2




[PATCH 0/3] tests/acceptance: enhance s390x devices test

2020-11-30 Thread Cornelia Huck
This series builds upon the new s390x acceptance test currently
queued on my s390-next branch.

Sadly, the kernel/initrd I'm using does not have the virtio-net driver,
so I cannot test things like mac address specification etc. Instead,
I added some quick checks regarding legacy virtio and propagation of
device type and fid properties.

Next up: maybe some device plug/unplug tests; but I still need to find
some inspiration there.

[And yes, I know that checkpatch moans about long lines -- hard to avoid
if you use a function with a very long name.]

Cornelia Huck (3):
  tests/acceptance: test virtio-ccw revision handling
  tests/acceptance: verify s390x device detection
  tests/acceptance: test s390x zpci fid propagation

 tests/acceptance/machine_s390_ccw_virtio.py | 41 ++---
 1 file changed, 36 insertions(+), 5 deletions(-)


base-commit: 875a99a0354211276b6daf635427b3c52a025790
-- 
2.26.2




Re: ImageInfo oddities regarding compression

2020-11-30 Thread Daniel P . Berrangé
On Fri, Nov 27, 2020 at 05:52:09PM +0100, Markus Armbruster wrote:
> Kevin Wolf  writes:
> 
> > Am 27.11.2020 um 13:21 hat Markus Armbruster geschrieben:
> >> >> I fell down this (thankfully shallow) rabbit hole because we also have
> >> >> 
> >> >> { 'enum': 'MultiFDCompression',
> >> >>   'data': [ 'none', 'zlib',
> >> >> { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
> >> >> 
> >> >> I wonder whether we could merge them into a common type.
> >> 
> >> Looks like we could: current code would never report the additional
> >> value 'none'.  Introspection would show it, though.  Seems unlikely to
> >> cause trouble.  Observation, not demand.
> >
> > Forgot to comment on this one...
> >
> > Technically we could probably, but would it make sense? Support for
> > compression formats has to be implemented separately for both cases, so
> > that they currently happen to support the same list is more of a
> > coincidence.
> >
> > If we ever add a third compression format to qcow2, would we add the
> > same format to migration, too, or would we split the schema into two
> > types again?
> 
> I figure if a compression method is worth adding to one, it's probably
> worth adding to the other.

I'm not entirely sure about that, as compression algorithms have different
tradeoffs.

For migration compression we're streaming large volumes of data in a live
session, both compression and decompression speed is very important.

For qcow2 we're handling relative small discrete blocks (a qcow2
cluster at a time) and decompression speed is much more important than
compression speed, since we generally compress once out of band, and
decompress many times while live.

So we could see new compression methods desired for migration that are
not relevant for qcow2, or vice versa.

> Having two separate enums isn't too bad.  A third has been proposed[*],
> but I hope we can reuse migration's existing enum there.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH for-6.0 v2 1/3] spapr: Improve naming of some vCPU id related items

2020-11-30 Thread Cédric Le Goater
On 11/30/20 5:52 PM, Greg Kurz wrote:
> The machine tells the IC backend the number of vCPU ids it will be
> exposed to, in order to:
> - fill the "ibm,interrupt-server-ranges" property in the DT (XICS)
> - size the VP block used by the in-kernel chip (XICS-on-XIVE, XIVE)
> 
> The current "nr_servers" and "spapr_max_server_number" naming can
> mislead people into thinking it is about a quantity of CPUs. Make
> it clear this is all about vCPU ids.

OK. This looks fine. 

But, XIVE does not care about vCPU ids. Only the count of vCPUs
is relevant. So, it would be nice to add a comment in the code 
that we got it wrong at some point or that XICS imposed the use
of max vCPU ids.

C. 


> Signed-off-by: Greg Kurz 
> ---
>  include/hw/ppc/spapr.h  |  2 +-
>  include/hw/ppc/spapr_irq.h  |  8 
>  include/hw/ppc/spapr_xive.h |  2 +-
>  include/hw/ppc/xics_spapr.h |  2 +-
>  hw/intc/spapr_xive.c|  8 
>  hw/intc/spapr_xive_kvm.c|  4 ++--
>  hw/intc/xics_kvm.c  |  4 ++--
>  hw/intc/xics_spapr.c|  8 
>  hw/ppc/spapr.c  |  8 
>  hw/ppc/spapr_irq.c  | 18 +-
>  10 files changed, 32 insertions(+), 32 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index b7ced9faebf5..dc99d45e2852 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -849,7 +849,7 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
>  int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error **errp);
>  void spapr_clear_pending_events(SpaprMachineState *spapr);
>  void spapr_clear_pending_hotplug_events(SpaprMachineState *spapr);
> -int spapr_max_server_number(SpaprMachineState *spapr);
> +int spapr_max_vcpu_ids(SpaprMachineState *spapr);
>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>uint64_t pte0, uint64_t pte1);
>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index c22a72c9e270..2e53fc9e6cbb 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -43,7 +43,7 @@ DECLARE_CLASS_CHECKERS(SpaprInterruptControllerClass, 
> SPAPR_INTC,
>  struct SpaprInterruptControllerClass {
>  InterfaceClass parent;
>  
> -int (*activate)(SpaprInterruptController *intc, uint32_t nr_servers,
> +int (*activate)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
>  Error **errp);
>  void (*deactivate)(SpaprInterruptController *intc);
>  
> @@ -62,7 +62,7 @@ struct SpaprInterruptControllerClass {
>  /* These methods should only be called on the active intc */
>  void (*set_irq)(SpaprInterruptController *intc, int irq, int val);
>  void (*print_info)(SpaprInterruptController *intc, Monitor *mon);
> -void (*dt)(SpaprInterruptController *intc, uint32_t nr_servers,
> +void (*dt)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
> void *fdt, uint32_t phandle);
>  int (*post_load)(SpaprInterruptController *intc, int version_id);
>  };
> @@ -74,7 +74,7 @@ int spapr_irq_cpu_intc_create(struct SpaprMachineState 
> *spapr,
>  void spapr_irq_cpu_intc_reset(struct SpaprMachineState *spapr, PowerPCCPU 
> *cpu);
>  void spapr_irq_cpu_intc_destroy(struct SpaprMachineState *spapr, PowerPCCPU 
> *cpu);
>  void spapr_irq_print_info(struct SpaprMachineState *spapr, Monitor *mon);
> -void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t nr_servers,
> +void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t max_vcpu_ids,
>void *fdt, uint32_t phandle);
>  
>  uint32_t spapr_irq_nr_msis(struct SpaprMachineState *spapr);
> @@ -105,7 +105,7 @@ typedef int 
> (*SpaprInterruptControllerInitKvm)(SpaprInterruptController *,
>  
>  int spapr_irq_init_kvm(SpaprInterruptControllerInitKvm fn,
> SpaprInterruptController *intc,
> -   uint32_t nr_servers,
> +   uint32_t max_vcpu_ids,
> Error **errp);
>  
>  /*
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 26c8d90d7196..643129b13536 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -79,7 +79,7 @@ int spapr_xive_end_to_target(uint8_t end_blk, uint32_t 
> end_idx,
>  /*
>   * KVM XIVE device helpers
>   */
> -int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t nr_servers,
> +int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t 
> max_vcpu_ids,
>  Error **errp);
>  void kvmppc_xive_disconnect(SpaprInterruptController *intc);
>  void kvmppc_xive_reset(SpaprXive *xive, Error **errp);
> diff --git a/include/hw/ppc/xics_spapr.h b/include/hw/ppc/xics_spapr.h
> index de752c0d2c7e..5c0e9430a964 100644
> --- a/include/hw/ppc/xics_spapr.h
> +++ b/include/hw/ppc/xics_spapr.h
> @@ -35,7 +35,7 @@
>  DECLARE_INSTANCE_CHECKER(ICSState, ICS_SPAPR,
>  

Re: [PATCH for-6.0 v2 0/3] spapr: Address the confusion between IPI numbers and vCPU ids

2020-11-30 Thread Cédric Le Goater
On 11/30/20 5:52 PM, Greg Kurz wrote:
> A regression was recently fixed in the sPAPR XIVE code for QEMU 5.2
> RC3 [1]. It boiled down to a confusion between IPI numbers and vCPU
> ids, which happen to be numerically equal in general, but are really
> different entities that can diverge in some setups. This was causing
> QEMU to misconfigure XIVE and to crash the guest.
> 
> The confusion comes from XICS actually. Interrupt presenters in XICS
> are identified by a "server number" which is a 1:1 mapping to vCPU
> ids. The range of these "server numbers" is exposed to the guest in
> the "ibm,interrupt-server-ranges" property. A xics_max_server_number()
> helper was introduced at some point to compute the upper limit of the
> range. When XIVE was added, commit 1a518e7693c9 renamed the helper to
> spapr_max_server_number(). It ended up being used to size a bunch of
> things in XIVE that are per-vCPU, such as internal END tables or
> IPI ranges presented to the guest. The problem is that the maximum
> "server number" can be much higher (up to 8 times) than the actual
> number of vCPUs when the VSMT mode doesn't match the number of threads
> per core in the guest:
> 
> DIV_ROUND_UP(ms->smp.max_cpus * spapr->vsmt, ms->smp.threads);
> 
> Since QEMU 4.2, the default behavior is to set spapr->vsmt to
> ms->smp.threads. Setups with custom VSMT settings will configure XIVE
> to use more HW resources than needed. This is a bit unfortunate but
> not extremely harmful, 

Indeed. The default usage case (without vsmt) has no impact since 
it does not fragment the XIVE VP space more than needed.

> unless maybe if a lot of guests are running on the host. 

We can run 4K (-2) KVM guests today on a P9 system. To reach the 
internal limits, each should have 32 vCPUs. It's possible with a 
lot of RAM but it's not a common scenario. 

C.


> The sizing of the IPI range is more problematic though
> as it eventually led to [1].
> 
> This series first does some renaming to make it clear when we're
> dealing with vCPU ids. It then fixes the machine code to pass
> smp.max_cpus to XIVE where appropriate. Since these changes are
> guest/migration visible, a machine property is added to keep the
> existing behavior for older machine types. The series is thus based
> on Connie's recent patch that introduces compat machines for
> QEMU 6.0.
> 
> Based-on: 20201109173928.1001764-1-coh...@redhat.com
> 
> Note that we still use spapr_max_vcpu_ids() when activating the
> in-kernel irqchip because this is what both XICS-on-XIVE and XIVE
> KVM devices expect.
> 
> [1] https://bugs.launchpad.net/qemu/+bug/1900241
> 
> v2: - comments on v1 highlighted that problems mostly come from
>   spapr_max_server_number() which got misused over the years.
>   Updated the cover letter accordingly.
> - completely new approach. Instead of messing with device properties,
>   pass the appropriate values to the IC backend handlers.
> - rename a few things using the "max_vcpu_ids" wording instead of
>   "nr_servers" and "max_server_number"
> 
> Greg Kurz (3):
>   spapr: Improve naming of some vCPU id related items
>   spapr/xive: Fix size of END table and number of claimed IPIs
>   spapr/xive: Fix the "ibm,xive-lisn-ranges" property
> 
>  include/hw/ppc/spapr.h  |  3 ++-
>  include/hw/ppc/spapr_irq.h  | 12 ++--
>  include/hw/ppc/spapr_xive.h |  2 +-
>  include/hw/ppc/xics_spapr.h |  2 +-
>  hw/intc/spapr_xive.c|  9 +
>  hw/intc/spapr_xive_kvm.c|  4 ++--
>  hw/intc/xics_kvm.c  |  4 ++--
>  hw/intc/xics_spapr.c| 11 ++-
>  hw/ppc/spapr.c  | 12 
>  hw/ppc/spapr_irq.c  | 34 --
>  10 files changed, 57 insertions(+), 36 deletions(-)
> 




Re: ImageInfo oddities regarding compression

2020-11-30 Thread Eric Blake
On 11/27/20 4:32 AM, Kevin Wolf wrote:

>>>##
>>># @Qcow2CompressionType:
>>>#
>>># Compression type used in qcow2 image file
>>>#
>>># @zlib: zlib compression, see 
>>># @zstd: zstd compression, see 
>>>#
>>># Since: 5.1
>>>##
>>>{ 'enum': 'Qcow2CompressionType',
>>>  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }
>>>
>>> Apparently, you can't have a qcow2 image without compression.  Correct?
>>>
>>> Can you imagine a use case for "without compression"?
>>
>> Yes & no. An image always has a compression type, either implicitly
>> zlib as the historical default, or explicitly as a type recorded in
>> the header.  This doesn't mean compression is used.
>>
>> There may be zero or more clusters actually using compression.
>> Typically compression will never be used, but there's still an
>> associated compression type regardless.
> 
> Right, so the correct answer to "is this image compressed?" is "unknown"
> for qcow2. Providing an answer would require checking all clusters in
> the image for the compression flag, which is not something we want to do
> for query-block. And even if we did that, it would still be unclear what
> to do with a partially compressed image.

If we truly need it, we could define three new autoclear bits in the
qcow2 spec:

bit 2: set if bits 3-4 are known to describe entire image
bit 3: set any time a compressed cluster is written to the image
bit 4: set any time an uncompressed cluster is written to the image

Any edit to the image by an older qemu will clear all three bits,
whereas new qemu would set bit 2 on image creation, and set the
appropriate bit 3 or 4 on any cluster write, so that we then have the
following knowledge:

234
===
000 - image was created or modified by older qemu; we have no idea if
clusters are written, or if compression was used, without expensive scan
001 - image contains at least one normal cluster, but no idea if it also
contains compressed clusters without expensive scan
010 - image contains at least one compressed cluster, but no idea if it
is fully compressed without expensive scan
011 - image contains mix of normal and compressed clusters
100 - image is newly created with no written clusters
101 - image contains only normal clusters; compression type could be
changed without risk
110 - image contains only compressed clusters
111 - image contains mix of normal and compressed clusters

But I'm not sure we need it.
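
A minimal sketch of how such bits could be decoded; note that these
autoclear bits are only the proposal above, not part of the current qcow2
spec, and the bit numbers are just the ones used in the table:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical autoclear bits, numbered as in the table above */
#define AUTOCLEAR_COMPRESSION_KNOWN  (1ULL << 2) /* bits 3-4 cover whole image */
#define AUTOCLEAR_HAS_COMPRESSED     (1ULL << 3)
#define AUTOCLEAR_HAS_UNCOMPRESSED   (1ULL << 4)

static const char *compression_state(uint64_t autoclear_features)
{
    if (!(autoclear_features & AUTOCLEAR_COMPRESSION_KNOWN)) {
        return "unknown without expensive scan (image touched by older QEMU)";
    }
    bool comp = autoclear_features & AUTOCLEAR_HAS_COMPRESSED;
    bool uncomp = autoclear_features & AUTOCLEAR_HAS_UNCOMPRESSED;

    if (comp && uncomp) {
        return "mix of normal and compressed clusters";
    } else if (comp) {
        return "only compressed clusters";
    } else if (uncomp) {
        return "only normal clusters (compression type could be changed safely)";
    }
    return "newly created, no written clusters";
}

int main(void)
{
    printf("%s\n", compression_state(AUTOCLEAR_COMPRESSION_KNOWN |
                                     AUTOCLEAR_HAS_UNCOMPRESSED));
    return 0;
}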

> 
> The @compression-type only tells you what format a compressed cluster
> uses if any compressed cluster exists in the image (or if a new
> compressed cluster is written to it). It doesn't mean that such clusters
> currently exist.
> 
> Kevin
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[Bug 498523] Re: Add on-line write compression support to qcow2

2020-11-30 Thread Max Reitz
The compression filter can be used e.g. with -drive
driver=compress,file.driver=qcow2,file.file.filename=foo.qcow2.
However, it shouldn’t be used lightly, as it will only do the right
thing in very specific circumstances, namely every cluster that’s
written to must not be allocated already.  So writing to the same
cluster twice will not work.  (Which is why I was hesitant to merge this
filter, but in the end I was contend with the fact that it’s at least
difficult enough to use that unsuspecting users hopefully won’t
accidentally enable it.)

(It should be noted that this is not a limitation of the compression
filter, though, but of qcow2’s implementation (VMDK isn’t any better).
So technically qemu has the feature now, but qcow2 is still missing it.)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/498523

Title:
  Add on-line write compression support to qcow2

Status in QEMU:
  Fix Released

Bug description:
  This is a wishlist item.  Launchpad really needs a way for the
  submitter to indicate this.

  It would be really cool if qemu were to support disk compression on-
  line for writes.

  I know this wouldn't be really easy.  Although most OS's use blocks,
  you can really only count on being able to compress 512-byte sectors,
  which doesn't give much room for a good compression ratio.  Moreover,
  the index indicating where in the image file each sector is located
  would be complex to manage, since the compressed blocks would be
  variable sized, and you'd be wanting to do some kind of best-fit
  allocation of space in the image file.  (If you were to make the image
  file compressed block size granularity, say, 64 bytes, you could
  probably do this best fit O(1).)  If you were to buffer enough writes,
  you could group arbitrary sequences of written sectors into blocks to
  compress (which with writeback could be sent to a helper thread on
  another CPU, so the throughput would be good).

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/498523/+subscriptions



Re: [PATCH v1 8/9] virtio-mem: Require only coordinated discards

2020-11-30 Thread Dr. David Alan Gilbert
* David Hildenbrand (da...@redhat.com) wrote:
> We implement the RamDiscardMgr interface and only require coordinated
> discarding of RAM to work.
> 
> Cc: Paolo Bonzini 
> Cc: "Michael S. Tsirkin" 
> Cc: Alex Williamson 
> Cc: Dr. David Alan Gilbert 
> Cc: Igor Mammedov 
> Cc: Pankaj Gupta 
> Cc: Peter Xu 
> Cc: Auger Eric 
> Cc: Wei Yang 
> Cc: teawater 
> Cc: Marek Kedzierski 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  hw/virtio/virtio-mem.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 93257b6c26..932d386c37 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -687,7 +687,7 @@ static void virtio_mem_device_realize(DeviceState *dev, 
> Error **errp)
>  return;
>  }
>  
> -if (ram_block_discard_require(true)) {
> +if (ram_block_coordinated_discard_require(true)) {
>  error_setg(errp, "Discarding RAM is disabled");
>  return;
>  }
> @@ -695,7 +695,7 @@ static void virtio_mem_device_realize(DeviceState *dev, 
> Error **errp)
>  ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
>  if (ret) {
>  error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> -ram_block_discard_require(false);
> +ram_block_coordinated_discard_require(false);
>  return;
>  }
>  
> @@ -738,7 +738,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>  virtio_del_queue(vdev, 0);
>  virtio_cleanup(vdev);
>  g_free(vmem->bitmap);
> -ram_block_discard_require(false);
> +ram_block_coordinated_discard_require(false);
>  }
>  
>  static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg,
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH for-6.0 1/6] qapi: Add query-accel command

2020-11-30 Thread Philippe Mathieu-Daudé
On 11/16/20 2:10 PM, Roman Bolshakov wrote:
> There's a problem for management applications to determine if certain
> accelerators are available. A generic QMP command should help with that.
> 
> Signed-off-by: Roman Bolshakov 
> ---
>  monitor/qmp-cmds.c | 15 +++
>  qapi/machine.json  | 19 +++
>  2 files changed, 34 insertions(+)
...
> +##
> +# @query-accel:
> +#
> +# Returns information about an accelerator
> +#
> +# Returns: @KvmInfo
> +#
> +# Since: 6.0.0
> +#
> +# Example:
> +#
> +# -> { "execute": "query-accel", "arguments": { "name": "kvm" } }
> +# <- { "return": { "enabled": true, "present": true } }

FWIW you can use 'qom-list-types' for that:

{ "execute": "qom-list-types", "arguments": { "implements": "accel" } }
{
"return": [
{
"name": "qtest-accel",
"parent": "accel"
},
{
"name": "tcg-accel",
"parent": "accel"
},
{
"name": "xen-accel",
"parent": "accel"
},
{
"name": "kvm-accel",
"parent": "accel"
},
{
"name": "accel",
"parent": "object"
}
]
}

Which is what I use for integration tests:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg675079.html
https://www.mail-archive.com/qemu-devel@nongnu.org/msg675085.html




[PATCH] target/riscv: Fix definition of MSTATUS_TW and MSTATUS_TSR

2020-11-30 Thread Alex Richardson
The TW and TSR fields should be bits 21 and 22 and not 30/29.
This was found while comparing QEMU behaviour against the sail formal
model (https://github.com/rems-project/sail-riscv/).

Signed-off-by: Alex Richardson 
---
 target/riscv/cpu_bits.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index 24b24c69c5..92147332c6 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -379,8 +379,8 @@
 #define MSTATUS_MXR 0x0008
 #define MSTATUS_VM  0x1F00 /* until: priv-1.9.1 */
 #define MSTATUS_TVM 0x0010 /* since: priv-1.10 */
-#define MSTATUS_TW  0x2000 /* since: priv-1.10 */
-#define MSTATUS_TSR 0x4000 /* since: priv-1.10 */
+#define MSTATUS_TW  0x0020 /* since: priv-1.10 */
+#define MSTATUS_TSR 0x0040 /* since: priv-1.10 */
 #define MSTATUS_GVA 0x40ULL
 #define MSTATUS_MPV 0x80ULL
 
-- 
2.29.2
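
As a standalone illustration of the bit positions named in the commit
message above (the full 32-bit constants here are derived from those bit
numbers, not quoted from the patch):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* bits 21 and 22, as used by the fix */
    assert((1u << 21) == 0x00200000u); /* MSTATUS_TW  */
    assert((1u << 22) == 0x00400000u); /* MSTATUS_TSR */

    /* the old values 0x20000000 and 0x40000000 sit at bits 29 and 30 */
    assert((1u << 29) == 0x20000000u);
    assert((1u << 30) == 0x40000000u);

    printf("TW/TSR live in bits 21/22 of mstatus\n");
    return 0;
}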




Re: [RFC PATCH 18/25] hw/cxl/device: Add a memory device (8.2.8.5)

2020-11-30 Thread Ben Widawsky
On 20-11-26 07:36:23, Markus Armbruster wrote:
> Ben Widawsky  writes:
> 
> > On 20-11-13 08:47:59, Markus Armbruster wrote:
> >> Eric Blake  writes:
> >> 
> >> > On 11/10/20 11:47 PM, Ben Widawsky wrote:
> >> >> A CXL memory device (AKA Type 3) is a CXL component that contains some
> >> >> combination of volatile and persistent memory. It also implements the
> >> >> previously defined mailbox interface as well as the memory device
> >> >> firmware interface.
> >> >> 
> >> >> The following example will create a 256M device in a 512M window:
> >> >> 
> >> >> -object 
> >> >> "memory-backend-file,id=cxl-mem1,share,mem-path=cxl-type3,size=512M"
> >> >> -device "cxl-type3,bus=rp0,memdev=cxl-mem1,id=cxl-pmem0,size=256M"
> >> >> 
> >> >> Signed-off-by: Ben Widawsky 
> >> >> ---
> >> >
> >> >> +++ b/qapi/machine.json
> >> >> @@ -1394,6 +1394,7 @@
> >> >>  { 'union': 'MemoryDeviceInfo',
> >> >>'data': { 'dimm': 'PCDIMMDeviceInfo',
> >> >>  'nvdimm': 'PCDIMMDeviceInfo',
> >> >> +'cxl': 'PCDIMMDeviceInfo',
> >> >>  'virtio-pmem': 'VirtioPMEMDeviceInfo',
> >> >>  'virtio-mem': 'VirtioMEMDeviceInfo'
> >> >>}
> >> >
> >> > Missing documentation of the new data type, and the fact that it will be
> >> > introduced in 6.0.
> >> 
> >> Old wish list item: improve the QAPI schema frontend to flag this.
> >> 
> >
> > "Introduced in 6.0" - quite the optimist. Kidding aside, this is the area 
> > where
> > I could use some feedback. CXL Type 3 memory devices can contain both 
> > volatile
> > and persistent memory at the same time. As such, I think I'll need a new 
> > type to
> > represent that, but I'd love to know how best to accomplish that.
> 
> We can help.  Tell us what information you want to provide in variant
> 'cxl'.  If it's a superset of an existing variant, give us just the
> delta.
> 

I'm not exactly sure what the best way to do this is in QEMU, so I'm not really
sure what to specify as the delta. A CXL memory device can have both volatile
and persistent memory. Currently a CXL memory device implements the
TYPE_MEMORY_DEVICE interface. So I believe the shortest path is a
MemoryDeviceInfo that can have two memory devices associated with it, but I
don't know if that's the best path.


> >> > Also, Markus has been trying to get rid of so-called
> >> > "simple unions" in favor of "flat unions" - every time we modify an
> >> > existing simple union, it is worth asking if it is time to first flatten
> >> > this.
> >> 
> >> 0. Simple unions can be transformed into flat unions.  The
> >> transformation can either preserve the nested wire format, or flatten
> >> it.  See docs/devel/qapi-code-gen.txt "A simple union can always be
> >> re-written as a flat union ..."
> >> 
> >> 1. No new simple unions.
> >> 
> >> 2. Existing simple unions that can be flattened without breaking
> >> backward compatibility have long been flattened.
> >> 
> >> 3. The remaining simple unions are part of QMP, where we need to
> >> preserve the wire format.  We could turn them into flat union preserving
> >> the wire format.  Only worthwhile if we kill simple unions and simplify
> >> scripts/qapi/.  Opportunity to make the flat union syntax less
> >> cumbersome.  Not done due to lack of time.
> >> 
> >> 4. Kevin and I have been experimenting with ways to provide both flat
> >> and nested wire format.  This would pave the way for orderly deprecation
> >> of the nested wire format.  May not be practical for QMP output.
> >> 
> >
> > So is there anything for me to do here?
> 
> No.  Extending an existing simple union is okay.
> 
> We should not add news ones.  We should think twice before we add new
> uses of existing ones.
> 
> 



Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

2020-11-30 Thread Alex Williamson
On Thu, 26 Nov 2020 14:56:17 +0800
Shenming Lu  wrote:

> Hi,
> 
> After reading everyone's opinions, we have a rough idea for this issue.
> 
> One key point is whether it is necessary to setup the config space before
> the device can accept further migration data. I think it is decided by
> the vendor driver, so we can simply ask the vendor driver about it in
> .save_setup, which could avoid a lot of unnecessary copies and settings.
> Once we have known the need, we can iterate the config space (before)
> along with the device migration data in .save_live_iterate and
> .save_live_complete_precopy, and if not needed, we can only migrate the
> config space in .save_state.
> 
> Another key point is that the interrupt enabling should be after the
> restoring of the interrupt controller (might not only interrupts).
> My solution is to add a subflag at the beginning of the config data
> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
> actions on the dst (such as whether to enable interrupts).
> 
> Below is it's workflow.
> 
> On the save path:
>   In vfio_save_setup():
>   Ask the vendor driver if it needs the config space setup before it
>   can accept further migration data.

How does "ask the vendor driver" actually work?

>   |
>   In vfio_save_iterate() (pre-copy):
>   If *needed*, save the config space which would be setup on the dst
>   before the migration data, but send with a subflag to instruct not
>   to (such as) enable interrupts.

If not for triggering things like MSI/X configuration, isn't config
space almost entirely virtual?  What visibility does the vendor driver
have to the VM machine dependencies regarding device interrupt versus
interrupt controller migration?

>   |
>   In vfio_save_complete_precopy() (stop-and-copy, iterable process):
>   The same as that in vfio_save_iterate().
>   |
>   In .save_state (stop-and-copy, non-iterable process):
>   If *needed*, only send a subflag to instruct to enable interrupts.
>   If *not needed*, save the config space and setup everything on the dst.

Again, how does the vendor driver have visibility to know when the VM
machine can enable interrupts?

> 
> Besides the above idea, we might be able to choose to let the vendor driver do
> more: qemu just sends and writes the config data (before) along with the 
> device
> migration data every time, and it's up to the vendor driver to filter 
> out/buffer
> the received data or reorder the settings...

There is no vendor driver in QEMU though, so are you suggesting that
QEMU follows a standard protocol and the vendor driver chooses when to
enable specific features?  For instance, QEMU would call SET_IRQS and
the driver would return success, but defer that setup if necessary?
That seems quite troubling as we then have ioctls that behave
differently depending on the device state and we have no error path to
userspace should that setup fail later.  The vendor driver does have
its own data stream for migration, so the vendor driver could tell the
destination version of itself what type of interrupt to use, which
might be sufficient if we were to ignore the latency if QEMU were to
defer interrupt setup until stop-and-copy.

Is the question of when to setup device interrupts versus the interrupt
controller state largely a machine issue within QEMU?  If so, shouldn't
it be at QEMU's determination when to act on the config space
information on the target?  IOW, if a vendor driver has a dependency on
interrupt configuration, they need to include it in their own pre-copy
data stream and decouple that dependency from userspace interrupt
configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,

Alex




Re: [RFC] ich9:cpuhp: add support for cpu hot-unplug with SMI broadcast enabled

2020-11-30 Thread Laszlo Ersek
On 11/28/20 01:43, Ankur Arora wrote:
> On 2020-11-27 7:19 a.m., Laszlo Ersek wrote:
>> On 11/27/20 05:10, Ankur Arora wrote:
>>
>>> Yeah I was wondering what would happen for simultaneous hot add and
>>> remove.
>>> I guess we would always do remove first and then the add, unless we hit
>>> the break due to max_cpus_per_pass and switch to hot-add mode.
>>
>> Considering the firmware only, I disagree with remove-then-add.
>>
>> EFI_SMM_CPU_SERVICE_PROTOCOL.AddProcessor() and
>> EFI_SMM_CPU_SERVICE_PROTOCOL.RemoveProcessor() (implemented in
>> SmmAddProcessor() and SmmRemoveProcessor() in
>> "UefiCpuPkg/PiSmmCpuDxeSmm/CpuService.c", respectively) only mark the
>> processors for addition/removal. The actual processing is done only
>> later, in BSPHandler() --> SmmCpuUpdate(), when "all SMI handlers are
>> finished" (see the comment in SmmRemoveProcessor()).
>>
>> Consequently, I would not suggest replacing a valid APIC ID in a
>> particular mCpuHotPlugData.ApicId[Index] slot with INVALID_APIC_ID
>> (corresponding to the unplug operation), and then possibly replacing
>> INVALID_APIC_ID in the *same slot* with the APIC ID of the newly plugged
>> CPU, in the exact same SMI invocation (= in the same execution of
>> CpuHotplugMmi()). That might cause some component in edk2 to see the
>> APIC ID in mCpuHotPlugData.ApicId[Index] change from one valid APIC
>> ID to another valid APIC ID, and I don't even want to think about what
>> kind of mess that could cause.
> 
> Shudders.
> 
>>
>> So no, please handle plugs first, for which unused slots in
>> mCpuHotPlugData.ApicId will be populated, and *then* handle removals (in
>> the same invocation of CpuHotplugMmi()).
> 
> Yeah, that ordering makes complete sense.
> 
>>
>> By the way, for unplug, you will not have to re-set
>> mCpuHotPlugData.ApicId[Index] to INVALID_APIC_ID, as
>> SmmRemoveProcessor() does that internally. You just have to locate the
>> Index for the APIC ID being removed, for calling
>> EFI_SMM_CPU_SERVICE_PROTOCOL.RemoveProcessor().
> 
> Right. The hotplug is more involved (given the need to pen the new CPU)
> but for the unplug, AFAICS all the actual handling for removal is in
> .RemoveProcessor() and at SMI exit in SmmCpuUpdate().

Yes, I got the same impression (without having tried to implement it, of
course).

Laszlo




Re: [RFC] ich9:cpuhp: add support for cpu hot-unplug with SMI broadcast enabled

2020-11-30 Thread Laszlo Ersek
On 11/28/20 00:48, Ankur Arora wrote:

> It is possible that there are CPUs with bits for both is_inserting and
> is_removing. In that case QemuCpuhpCollectApicIds() would put them in the
> PluggedApicIds array and the unplug eventually happens in the next
> firmware invocation.
> 
> If a CPU has both is_inserting and fw_remove set, the firmware processes
> the
> hotplug in that invocation and the unplug happens whenever the OSPM
> triggers
> the firmware next.

If these corner cases will actually work (I'm somewhat doubtful), that
will be really great.

Thanks!
Laszlo
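
Restated as a sketch, the corner-case handling described in the quoted
paragraphs above; the field and function names here are illustrative
only, not the actual QEMU/OVMF definitions:

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned apic_id;
    bool is_inserting;  /* hot-add pending in the CPU hotplug register block */
    bool is_removing;   /* hot-remove requested, OSPM not yet done with the CPU */
    bool fw_remove;     /* OSPM has handed the CPU to firmware for removal */
} CpuEvent;

/*
 * Returns true if the CPU goes into the "plugged" list in this SMI.
 * A CPU with is_inserting plus a pending removal is plugged now; the
 * removal is only processed in a later firmware invocation, once the
 * OSPM triggers the firmware again with fw_remove set.
 */
static bool plug_this_pass(const CpuEvent *e, bool *unplug_this_pass)
{
    if (e->is_inserting) {
        *unplug_this_pass = false;
        return true;
    }
    *unplug_this_pass = e->fw_remove;
    return false;
}

int main(void)
{
    CpuEvent cpu = { .apic_id = 4, .is_inserting = true, .is_removing = true };
    bool unplug;
    bool plug = plug_this_pass(&cpu, &unplug);
    printf("apic_id %u: plug=%d unplug=%d (unplug deferred to a later SMI)\n",
           cpu.apic_id, plug, unplug);
    return 0;
}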




Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Paolo Bonzini

On 30/11/20 16:46, Kevin Wolf wrote:

Am 30.11.2020 um 15:58 hat Paolo Bonzini geschrieben:

With this series it's basically pointless to have QOM properties at
all.


Not entirely, because there are still some writable properties that can
be changed later on.


Are there really any (that are not bugs like opened/loaded)? That's also 
why Eduardo and I discussed a class-wide allow_set function for his 
field properties series.



So with this in mind, I think I'm in favour of completely leaving the
initialisation of properties on object creation to QAPI, and only
providing individual setters if we actually intend to allow property
changes after creation.


The main problem is that it wouldn't extend well, if at all, to machines 
and devices.  So those would still not be integrated into the QAPI schema.



So the question is, are we okay with shoveling half of QEMU's backend data
model into a single struct?  If so, there are important consequences.


Yeah, the single struct bothers me a bit, both in the QAPI schema and in
the C source.


The single struct doesn't bother me _too much_ actually.  What bothers 
me is that it won't be a single source of all QOM objects, only those 
that happen to be created by object-add.  So I start to wonder if QOM as 
it exists now is the right solution for all kinds of objects:


- backends and other object-add types (not per-target, relatively few 
classes and even fewer hierarchies)


- machine (per-target, many classes, no hierarchy)

- device (can be treated as not per-target, many many classes, very few 
hierarchies)


- accelerator (per-target, few classes, no hierarchy)

- chardev (ok those are the same as the first category)

If QOM is the right solution, this patch goes in the wrong direction.

If QOM is the wrong solution, this patch is okay but then we have 
another problem to solve. :)



The problem with this series is that you are fine with deduplicating things
as a third step, but you cannot be sure that such deduplication is possible
at all.  So while I don't have any problems in principle with the
ObjectOptions concept, I don't think it should be committed without a clear
idea of how to do the third step.


Do you have any specific concerns why the deduplication might not
possible, or just because it's uncharted territory?


Mostly because it's more or less the same issue that you have with 
BlockdevOptions, with the extra complication that this series only deals 
with the easy one of the four above categories.



Maybe if we don't want to commit to keeping the ObjectOptions schema,
the part that should wait is object-add and I should do the command line
options first? Then the schema remains an implementation detail for now
that is invisible in introspection.


I don't see much benefit in converting _any_ of the three actually.  The 
only good reason I see for QAPIfying this is the documentation, and the 
promise of deduplicating QOM boilerplate.  The latter is only a promise, 
but documentation alone is a damn good reason and it's enough to get 
this work into a mergeable shape as soon as possible!


But maybe, we could start in the opposite direction: start with using
QAPI to eliminate QOM boilerplate.  Basing your work on Eduardo's field
properties series, you could add a new 'object' "kind" to QAPI that 
would create an array of field properties (e.g. a macro expanding to a 
compound literal?)

.  Something like


+{ 'object': 'InputBarrier',
+  'data': { 'name': 'str',
+'x-origin': 'int16',
+'y-origin': 'int16',
+'width': 'int16',
+'height': 'int16' },
+  'properties': { 'server': 'str',
+  'port': 'str' } }

would create a macro QOM_InputBarrier_FIELDS defining properties for the 
following fields of the InputBarrier struct:


gchar *name;
int16_t x_origin, y_origin;
int16_t width, height;

while server and port would only appear in the documentation (or 
alternatively you could leave them out completely, as you wish).

The advantages would be noticeable:

1) the information would be moved in the QAPI schema JSON from the 
beginning, decoupling the conflict-heavy part from the complex question 
of how to expose the QOM schema in the introspection data


2) there would not be any more duplication than before (there would be 
duplication between structs and QAPI schema, but not between structs and 
C code that defines properties).


3) it would be opt-in, so it doesn't put on you the burden of keeping 
the series in sync with new objects that are added (I have one for the 
qtest server for example).  At the same time it would be quite appealing 
for owners of QOM code to convert their objects to field properties and 
get documentation for free.


4) we could special-case 'object' definitions and generate them in the 
system manual as well, since they are also useful to document -object.


Yes it's a huge change but you have the boring part already done.  What 
do you think?
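
For illustration only, one possible shape of the generated output for the
'object' example above; the actual macros in Eduardo's field properties
series use different names, so everything below is a sketch:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* struct derived from the 'data' part of the 'object' definition above */
typedef struct InputBarrier {
    /* Object parent_obj; omitted in this standalone sketch */
    char *name;
    int16_t x_origin, y_origin;
    int16_t width, height;
} InputBarrier;

/* toy property descriptor standing in for the real field property type */
typedef struct {
    const char *name;
    enum { F_STR, F_INT16 } type;
    size_t offset;
} ExampleFieldProperty;

/* what a generated QOM_InputBarrier_FIELDS could expand to */
#define QOM_InputBarrier_FIELDS                                        \
    { "name",     F_STR,   offsetof(InputBarrier, name)     },         \
    { "x-origin", F_INT16, offsetof(InputBarrier, x_origin) },         \
    { "y-origin", F_INT16, offsetof(InputBarrier, y_origin) },         \
    { "width",    F_INT16, offsetof(InputBarrier, width)    },         \
    { "height",   F_INT16, offsetof(InputBarrier, height)   }

static const ExampleFieldProperty input_barrier_fields[] = {
    QOM_InputBarrier_FIELDS
};

int main(void)
{
    printf("%zu field properties\n",
           sizeof(input_barrier_fields) / sizeof(input_barrier_fields[0]));
    return 0;
}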

Re: [PATCH v2 0/4] [RfC] fix tracing for modules

2020-11-30 Thread Stefan Hajnoczi
On Tue, Nov 24, 2020 at 05:02:51PM +0100, Gerd Hoffmann wrote:
> First version that actually works.  Only qxl covered for this RfC, other
> modules will follow once the basics are hashed out.
> 
> More context:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1898700
>   https://bugzilla.redhat.com/show_bug.cgi?id=1869642
> 
> take care,
>   Gerd
> 
> Gerd Hoffmann (4):
>   meson: add trace_events_config[]
>   meson: move up hw subdir (specifically before trace subdir)
>   meson: add module_trace & module_trace_src
>   meson: move qxl trace events to separate file

Awesome, thank you for working on this!

I noticed an issue with simpletrace: the trace file does not contain
qxl_* TRACE_RECORD_TYPE_MAPPING records when ./configure
--enable-modules is used. This happens because st_write_event_mapping()
is called before the qxl module calls trace_event_register_group().

(The mapping records describe the integer ID to string name mapping used
in a simpletrace file.)

You can check this using "grep -a qxl_ trace-$LAST_QEMU_PID" after
running qemu --device qxl built with ./configure --enable-modules
--enable-trace-backend=simple.

Remove --enable-modules and the file will contain the qxl_ trace events.

This means the trace file is broken because the simpletrace records
cannot be mapped to a trace event name.

One way to solve this is by modifying trace_event_register_group() to
call into trace/simple.c (maybe with a TraceEventIter initialized to
iterate over the newly registered trace events group?).

Alternatively, simpletrace.c might be able to emit
TRACE_RECORD_TYPE_MAPPING on demand instead of ahead of time.

Stefan




[PATCH for-6.0 v2 1/3] spapr: Improve naming of some vCPU id related items

2020-11-30 Thread Greg Kurz
The machine tells the IC backend the number of vCPU ids it will be
exposed to, in order to:
- fill the "ibm,interrupt-server-ranges" property in the DT (XICS)
- size the VP block used by the in-kernel chip (XICS-on-XIVE, XIVE)

The current "nr_servers" and "spapr_max_server_number" naming can
mislead people info thinking it is about a quantity of CPUs. Make
it clear this is all about vCPU ids.

Signed-off-by: Greg Kurz 
---
 include/hw/ppc/spapr.h  |  2 +-
 include/hw/ppc/spapr_irq.h  |  8 
 include/hw/ppc/spapr_xive.h |  2 +-
 include/hw/ppc/xics_spapr.h |  2 +-
 hw/intc/spapr_xive.c|  8 
 hw/intc/spapr_xive_kvm.c|  4 ++--
 hw/intc/xics_kvm.c  |  4 ++--
 hw/intc/xics_spapr.c|  8 
 hw/ppc/spapr.c  |  8 
 hw/ppc/spapr_irq.c  | 18 +-
 10 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index b7ced9faebf5..dc99d45e2852 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -849,7 +849,7 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
 int spapr_reallocate_hpt(SpaprMachineState *spapr, int shift, Error **errp);
 void spapr_clear_pending_events(SpaprMachineState *spapr);
 void spapr_clear_pending_hotplug_events(SpaprMachineState *spapr);
-int spapr_max_server_number(SpaprMachineState *spapr);
+int spapr_max_vcpu_ids(SpaprMachineState *spapr);
 void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
   uint64_t pte0, uint64_t pte1);
 void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index c22a72c9e270..2e53fc9e6cbb 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -43,7 +43,7 @@ DECLARE_CLASS_CHECKERS(SpaprInterruptControllerClass, 
SPAPR_INTC,
 struct SpaprInterruptControllerClass {
 InterfaceClass parent;
 
-int (*activate)(SpaprInterruptController *intc, uint32_t nr_servers,
+int (*activate)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
 Error **errp);
 void (*deactivate)(SpaprInterruptController *intc);
 
@@ -62,7 +62,7 @@ struct SpaprInterruptControllerClass {
 /* These methods should only be called on the active intc */
 void (*set_irq)(SpaprInterruptController *intc, int irq, int val);
 void (*print_info)(SpaprInterruptController *intc, Monitor *mon);
-void (*dt)(SpaprInterruptController *intc, uint32_t nr_servers,
+void (*dt)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
void *fdt, uint32_t phandle);
 int (*post_load)(SpaprInterruptController *intc, int version_id);
 };
@@ -74,7 +74,7 @@ int spapr_irq_cpu_intc_create(struct SpaprMachineState *spapr,
 void spapr_irq_cpu_intc_reset(struct SpaprMachineState *spapr, PowerPCCPU 
*cpu);
 void spapr_irq_cpu_intc_destroy(struct SpaprMachineState *spapr, PowerPCCPU 
*cpu);
 void spapr_irq_print_info(struct SpaprMachineState *spapr, Monitor *mon);
-void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t nr_servers,
+void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t max_vcpu_ids,
   void *fdt, uint32_t phandle);
 
 uint32_t spapr_irq_nr_msis(struct SpaprMachineState *spapr);
@@ -105,7 +105,7 @@ typedef int 
(*SpaprInterruptControllerInitKvm)(SpaprInterruptController *,
 
 int spapr_irq_init_kvm(SpaprInterruptControllerInitKvm fn,
SpaprInterruptController *intc,
-   uint32_t nr_servers,
+   uint32_t max_vcpu_ids,
Error **errp);
 
 /*
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 26c8d90d7196..643129b13536 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -79,7 +79,7 @@ int spapr_xive_end_to_target(uint8_t end_blk, uint32_t 
end_idx,
 /*
  * KVM XIVE device helpers
  */
-int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t nr_servers,
+int kvmppc_xive_connect(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
 Error **errp);
 void kvmppc_xive_disconnect(SpaprInterruptController *intc);
 void kvmppc_xive_reset(SpaprXive *xive, Error **errp);
diff --git a/include/hw/ppc/xics_spapr.h b/include/hw/ppc/xics_spapr.h
index de752c0d2c7e..5c0e9430a964 100644
--- a/include/hw/ppc/xics_spapr.h
+++ b/include/hw/ppc/xics_spapr.h
@@ -35,7 +35,7 @@
 DECLARE_INSTANCE_CHECKER(ICSState, ICS_SPAPR,
  TYPE_ICS_SPAPR)
 
-int xics_kvm_connect(SpaprInterruptController *intc, uint32_t nr_servers,
+int xics_kvm_connect(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
  Error **errp);
 void xics_kvm_disconnect(SpaprInterruptController *intc);
 bool xics_kvm_has_broken_disconnect(void);
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 12dd6d3ce357..d0a0ca822367 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -669,7

[PATCH for-6.0 v2 2/3] spapr/xive: Fix size of END table and number of claimed IPIs

2020-11-30 Thread Greg Kurz
The sPAPR XIVE device has an internal ENDT table the size of
which is configurable by the machine. This table is supposed
to contain END structures for all possible vCPUs that may
enter the guest. The machine must also claim IPIs for all
possible vCPUs since this is expected by the guest.

spapr_irq_init() takes care of that under the assumption that
spapr_max_vcpu_ids() returns the number of possible vCPUs.
This happens to be the case when the VSMT mode is set to match
the number of threads per core in the guest (default behavior).
With non-default VSMT settings, this limit is higher than the number
of vCPUs. In the worst case, we can end up allocating an 8 times
bigger ENDT and claiming 8 times more IPIs than needed. But more
importantly, this creates a confusion between number of vCPUs and
vCPU ids, which can lead to subtle bugs like [1].

Use smp.max_cpus instead of spapr_max_vcpu_ids() in
spapr_irq_init() for the latest machine type. Older machine
types continue to use spapr_max_vcpu_ids() since the size of
the ENDT is migration visible.

[1] https://bugs.launchpad.net/qemu/+bug/1900241

Signed-off-by: Greg Kurz 
---
 include/hw/ppc/spapr.h |  1 +
 hw/ppc/spapr.c |  3 +++
 hw/ppc/spapr_irq.c | 16 +---
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index dc99d45e2852..95bf210d0bf6 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -139,6 +139,7 @@ struct SpaprMachineClass {
 hwaddr rma_limit;  /* clamp the RMA to this size */
 bool pre_5_1_assoc_refpoints;
 bool pre_5_2_numa_associativity;
+bool pre_6_0_xive_max_cpus;
 
 bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
   uint64_t *buid, hwaddr *pio, 
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ab59bfe941d0..227a926dfd48 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4530,8 +4530,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
  */
 static void spapr_machine_5_2_class_options(MachineClass *mc)
 {
+SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
 spapr_machine_6_0_class_options(mc);
 compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
+smc->pre_6_0_xive_max_cpus = true;
 }
 
 DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 552e30e93036..4d9ecd5d0af8 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -324,17 +324,27 @@ void spapr_irq_init(SpaprMachineState *spapr, Error 
**errp)
 }
 
 if (spapr->irq->xive) {
-uint32_t max_vcpu_ids = spapr_max_vcpu_ids(spapr);
+uint32_t max_cpus = MACHINE(spapr)->smp.max_cpus;
 DeviceState *dev;
 int i;
 
+/*
+ * Older machine types used to size the ENDT and IPI range
+ * according to the upper limit of vCPU ids, which can be
+ * higher than smp.max_cpus with custom VSMT settings. Keep
+ * the previous behavior for compatibility with such setups.
+ */
+if (smc->pre_6_0_xive_max_cpus) {
+max_cpus = spapr_max_vcpu_ids(spapr);
+}
+
 dev = qdev_new(TYPE_SPAPR_XIVE);
 qdev_prop_set_uint32(dev, "nr-irqs", smc->nr_xirqs + SPAPR_XIRQ_BASE);
 /*
  * 8 XIVE END structures per CPU. One for each available
  * priority
  */
-qdev_prop_set_uint32(dev, "nr-ends", max_vcpu_ids << 3);
+qdev_prop_set_uint32(dev, "nr-ends", max_cpus << 3);
 object_property_set_link(OBJECT(dev), "xive-fabric", OBJECT(spapr),
  &error_abort);
 sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal);
@@ -342,7 +352,7 @@ void spapr_irq_init(SpaprMachineState *spapr, Error **errp)
 spapr->xive = SPAPR_XIVE(dev);
 
 /* Enable the CPU IPIs */
-for (i = 0; i < max_vcpu_ids; ++i) {
+for (i = 0; i < max_cpus; ++i) {
 SpaprInterruptControllerClass *sicc
 = SPAPR_INTC_GET_CLASS(spapr->xive);
 
-- 
2.26.2




[PATCH for-6.0 v2 0/3] spapr: Address the confusion between IPI numbers and vCPU ids

2020-11-30 Thread Greg Kurz
A regression was recently fixed in the sPAPR XIVE code for QEMU 5.2
RC3 [1]. It boiled down to a confusion between IPI numbers and vCPU
ids, which happen to be numerically equal in general, but are really
different entities that can diverge in some setups. This was causing
QEMU to misconfigure XIVE and to crash the guest.

The confusion comes from XICS actually. Interrupt presenters in XICS
are identified by a "server number" which is a 1:1 mapping to vCPU
ids. The range of these "server numbers" is exposed to the guest in
the "ibm,interrupt-server-ranges" property. A xics_max_server_number()
helper was introduced at some point to compute the upper limit of the
range. When XIVE was added, commit 1a518e7693c9 renamed the helper to
spapr_max_server_number(). It ended up being used to size a bunch of
things in XIVE that are per-vCPU, such as internal END tables or
IPI ranges presented to the guest. The problem is that the maximum
"server number" can be much higher (up to 8 times) than the actual
number of vCPUs when the VSMT mode doesn't match the number of threads
per core in the guest:

DIV_ROUND_UP(ms->smp.max_cpus * spapr->vsmt, ms->smp.threads);

Since QEMU 4.2, the default behavior is to set spapr->vsmt to
ms->smp.threads. Setups with custom VSMT settings will configure XIVE
to use more HW resources than needed. This is a bit unfortunate but
not extremely harmful, unless maybe if a lot of guests are running
on the host. The sizing of the IPI range is more problematic though
as it eventually led to [1].
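
As a standalone illustration of that formula, with example numbers only:

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
    unsigned max_cpus = 4;   /* ms->smp.max_cpus, example value */
    unsigned threads  = 1;   /* ms->smp.threads, example value  */
    unsigned vsmt     = 8;   /* custom VSMT, not the default (threads) */

    unsigned max_vcpu_ids = DIV_ROUND_UP(max_cpus * vsmt, threads);

    /* prints 32: XIVE gets sized for 8 times more ENDs/IPIs than the
     * 4 vCPUs actually need */
    printf("max_cpus=%u -> max vCPU ids=%u\n", max_cpus, max_vcpu_ids);
    return 0;
}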

This series first does some renaming to make it clear when we're
dealing with vCPU ids. It then fixes the machine code to pass
smp.max_cpus to XIVE where appropriate. Since these changes are
guest/migration visible, a machine property is added to keep the
existing behavior for older machine types. The series is thus based
on Connie's recent patch that introduces compat machines for
QEMU 6.0.

Based-on: 20201109173928.1001764-1-coh...@redhat.com

Note that we still use spapr_max_vcpu_ids() when activating the
in-kernel irqchip because this is what both XICS-on-XIVE and XIVE
KVM devices expect.

[1] https://bugs.launchpad.net/qemu/+bug/1900241

v2: - comments on v1 highlighted that problems mostly come from
  spapr_max_server_number() which got misused over the years.
  Updated the cover letter accordingly.
- completely new approach. Instead of messing with device properties,
  pass the appropriate values to the IC backend handlers.
- rename a few things using the "max_vcpu_ids" wording instead of
  "nr_servers" and "max_server_number"

Greg Kurz (3):
  spapr: Improve naming of some vCPU id related items
  spapr/xive: Fix size of END table and number of claimed IPIs
  spapr/xive: Fix the "ibm,xive-lisn-ranges" property

 include/hw/ppc/spapr.h  |  3 ++-
 include/hw/ppc/spapr_irq.h  | 12 ++--
 include/hw/ppc/spapr_xive.h |  2 +-
 include/hw/ppc/xics_spapr.h |  2 +-
 hw/intc/spapr_xive.c|  9 +
 hw/intc/spapr_xive_kvm.c|  4 ++--
 hw/intc/xics_kvm.c  |  4 ++--
 hw/intc/xics_spapr.c| 11 ++-
 hw/ppc/spapr.c  | 12 
 hw/ppc/spapr_irq.c  | 34 --
 10 files changed, 57 insertions(+), 36 deletions(-)

-- 
2.26.2





Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Daniel P . Berrangé
On Mon, Nov 30, 2020 at 05:13:57PM +0100, Kevin Wolf wrote:
> Am 30.11.2020 um 16:30 hat Daniel P. Berrangé geschrieben:
> > On Mon, Nov 30, 2020 at 03:58:23PM +0100, Paolo Bonzini wrote:
> > > On 30/11/20 13:25, Kevin Wolf wrote:
> > > > This series adds a QAPI type for the properties of all user creatable
> > > > QOM types and finally makes QMP object-add use the new ObjectOptions
> > > > union so that QAPI introspection can be used for user creatable objects.
> > > > 
> > > > After this series, there is at least one obvious next step that needs to be
> > > > done: Change HMP and all of the command line parser to use
> > > > ObjectOptions, too, so that the QAPI schema is consistently enforced in
> > > > all external interfaces. I am planning to send another series to address
> > > > this.
> > > > 
> > > > In a third step, we can try to start deduplicating and integrating 
> > > > things
> > > > better between QAPI and the QOM implementation, e.g. by generating parts
> > > > of the QOM boilerplate from the QAPI schema.
> > > 
> > > With this series it's basically pointless to have QOM properties at all.
> > > Instead, you are basically having half of QEMU's backend data model into a
> > > single struct.
> > > 
> > > So the question is, are we okay with shoveling half of QEMU's backend data
> > > model into a single struct?  If so, there are important consequences.
> > 
> > In theory they should have the same set of options, but nothing in
> > this series will enforce that. So we're introducing the danger that
> > QMP object-add will miss some property, and thus be less functional
> > than the CLI -object.  If we convert CLI -object  to use the QAPI
> > parser too, we eliminate that danger, but we still have the struct
> > duplication.
> 
> I think converting the CLI is doable in the short term. I already have
> the patch for qemu-storage-daemon, but decided to keep it for a separate
> series.
> 
> The most difficult part is probably -readconfig, but with Paolo's RFC
> patches to move it away from QemuOpts, even that shouldn't be very hard.
> 
> > > 1) QOM basically does not need properties anymore except for devices and
> > > machines (accelerators could be converted to QAPI as well). All
> > > user-creatable objects can be changed to something like chardev's "get a
> > > struct and use it to fill in the fields", and only leave properties to 
> > > devices
> > > and machines.
> > > 
> > > 2) User-creatable objects can have a much more flexible schema.  This 
> > > means
> > > there's no reason to have block device creation as its own command and
> > > struct for example.
> > > 
> > > The problem with this series is that you are fine with deduplicating 
> > > things
> > > as a third step, but you cannot be sure that such deduplication is 
> > > possible
> > > at all.  So while I don't have any problems in principle with the
> > > ObjectOptions concept, I don't think it should be committed without a 
> > > clear
> > > idea of how to do the third step.
> > 
> > I feel like we should at least aim to kill the struct duplication, even if
> > we ignore the bigger QOM stuff like setters/getters/constructors/etc. The
> > generated structs are not far off being usable.
> > 
> > eg for the secret object we have the QAPI schema
> > 
> > { 'struct': 'SecretCommonProperties',
> >   'data': { '*loaded': { 'type': 'bool', 'features': ['deprecated'] },
> > '*format': 'QCryptoSecretFormat',
> > '*keyid': 'str',
> > '*iv': 'str' } }
> > 
> > { 'struct': 'SecretProperties',
> >   'base': 'SecretCommonProperties',
> >   'data': { '*data': 'str',
> > '*file': 'str' } }
> > 
> > IIUC this will result in a QAPI generated flattened struct:
> > 
> >   struct SecretProperties {
> > bool loaded;
> > QCryptoSecretFormat format;
> > char *keyid;
> > char *iv;
> > char *data;
> > char *file;
> >   };
> > 
> > vs the QOM manually written structs
> > 
> >   struct QCryptoSecretCommon {
> > Object parent_obj;
> > uint8_t *rawdata;
> > size_t rawlen;
> > QCryptoSecretFormat format;
> > char *keyid;
> > char *iv;
> >   };
> > 
> >   struct QCryptoSecret {
> > QCryptoSecretCommon parent_obj;
> > char *data;
> > char *file;
> >   };
> > 
> > The key differences
> > 
> >  - The parent struct is embedded, rather than flattened
> >  - The "loaded" property doesn't need to exist
> >  - Some extra fields are live state (rawdata, rawlen)
> > 
> > Lets pretend we just kill "loaded" entirely, so ignore that.
> > 
> > We could simply make QOM "Object" a well known built-in type, so
> > we can reference it as a "parent". Then any struct with "Object"
> > as a parent could use struct embedding rather flattening and thus
> > just work.
> > 
> > Can we invent a "state" field for fields that are internal
> > only, separate from the public "data" fields.
> > 
> > eg the secret QAPI def would only need a couple of changes:
> > 
> > { 'struct': 'QCrypto

[PATCH for-6.0 v2 3/3] spapr/xive: Fix the "ibm, xive-lisn-ranges" property

2020-11-30 Thread Greg Kurz
The dt() callback of the sPAPR IC class has a "nr_servers"
argument which is used by both XIVE and XICS to setup the
"interrupt-controller" node in the DT.

The machine currently passes spapr_max_server_number() to
spapr_irq_dt(). This is perfectly fine to populate the range
of vCPU ids in the "ibm,interrupt-server-ranges" property
for XICS. However, this doesn't make sense for XIVE's
"ibm,xive-lisn-ranges" property which really expects the
maximum number of vCPUs instead.

Add a new "max_cpus" argument to spapr_irq_dt() and the
dt() handler to convey the maximum number of vCPUs. Have
the latest machine type pass smp.max_cpus and sPAPR XIVE
use that for "ibm,xive-lisn-ranges". Older machine types
go on with the previous behavior since this is guest visible.

Signed-off-by: Greg Kurz 
---
 include/hw/ppc/spapr_irq.h | 4 ++--
 hw/intc/spapr_xive.c   | 3 ++-
 hw/intc/xics_spapr.c   | 3 ++-
 hw/ppc/spapr.c | 3 ++-
 hw/ppc/spapr_irq.c | 8 ++--
 5 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 2e53fc9e6cbb..1edf4851afa4 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -63,7 +63,7 @@ struct SpaprInterruptControllerClass {
 void (*set_irq)(SpaprInterruptController *intc, int irq, int val);
 void (*print_info)(SpaprInterruptController *intc, Monitor *mon);
 void (*dt)(SpaprInterruptController *intc, uint32_t max_vcpu_ids,
-   void *fdt, uint32_t phandle);
+   uint32_t max_cpus, void *fdt, uint32_t phandle);
 int (*post_load)(SpaprInterruptController *intc, int version_id);
 };
 
@@ -75,7 +75,7 @@ void spapr_irq_cpu_intc_reset(struct SpaprMachineState 
*spapr, PowerPCCPU *cpu);
 void spapr_irq_cpu_intc_destroy(struct SpaprMachineState *spapr, PowerPCCPU 
*cpu);
 void spapr_irq_print_info(struct SpaprMachineState *spapr, Monitor *mon);
 void spapr_irq_dt(struct SpaprMachineState *spapr, uint32_t max_vcpu_ids,
-  void *fdt, uint32_t phandle);
+  uint32_t max_cpus, void *fdt, uint32_t phandle);
 
 uint32_t spapr_irq_nr_msis(struct SpaprMachineState *spapr);
 int spapr_irq_msi_alloc(struct SpaprMachineState *spapr, uint32_t num, bool 
align,
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index d0a0ca822367..f9a563cd0a9b 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -670,6 +670,7 @@ static void spapr_xive_print_info(SpaprInterruptController 
*intc, Monitor *mon)
 }
 
 static void spapr_xive_dt(SpaprInterruptController *intc, uint32_t 
max_vcpu_ids,
+  uint32_t max_cpus,
   void *fdt, uint32_t phandle)
 {
 SpaprXive *xive = SPAPR_XIVE(intc);
@@ -678,7 +679,7 @@ static void spapr_xive_dt(SpaprInterruptController *intc, 
uint32_t max_vcpu_ids,
 /* Interrupt number ranges for the IPIs */
 uint32_t lisn_ranges[] = {
 cpu_to_be32(SPAPR_IRQ_IPI),
-cpu_to_be32(SPAPR_IRQ_IPI + max_vcpu_ids),
+cpu_to_be32(SPAPR_IRQ_IPI + max_cpus),
 };
 /*
  * EQ size - the sizes of pages supported by the system 4K, 64K,
diff --git a/hw/intc/xics_spapr.c b/hw/intc/xics_spapr.c
index 8f753a858cc2..d9f887dfd303 100644
--- a/hw/intc/xics_spapr.c
+++ b/hw/intc/xics_spapr.c
@@ -309,7 +309,8 @@ static void ics_spapr_realize(DeviceState *dev, Error 
**errp)
 }
 
 static void xics_spapr_dt(SpaprInterruptController *intc, uint32_t 
max_vcpu_ids,
-  void *fdt, uint32_t phandle)
+  uint32_t max_cpus, void *fdt,
+  uint32_t phandle)
 {
 uint32_t interrupt_server_ranges_prop[] = {
 0, cpu_to_be32(max_vcpu_ids),
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 227a926dfd48..be3b4b9119b7 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1164,7 +1164,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool 
reset, size_t space)
 _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
 /* /interrupt controller */
-spapr_irq_dt(spapr, spapr_max_vcpu_ids(spapr), fdt, PHANDLE_INTC);
+spapr_irq_dt(spapr, spapr_max_vcpu_ids(spapr), machine->smp.max_cpus,
+ fdt, PHANDLE_INTC);
 
 ret = spapr_dt_memory(spapr, fdt);
 if (ret < 0) {
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 4d9ecd5d0af8..e1fd777aff62 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -272,12 +272,16 @@ void spapr_irq_print_info(SpaprMachineState *spapr, 
Monitor *mon)
 }
 
 void spapr_irq_dt(SpaprMachineState *spapr, uint32_t max_vcpu_ids,
-  void *fdt, uint32_t phandle)
+  uint32_t max_cpus, void *fdt, uint32_t phandle)
 {
 SpaprInterruptControllerClass *sicc
 = SPAPR_INTC_GET_CLASS(spapr->active_intc);
 
-sicc->dt(spapr->active_intc, max_vcpu_ids, fdt, phandle);
+/* For older machine types in case they have an unusual VSMT setting */
+if (SPAPR_MACHINE_GET_CLASS(spapr

Re: [RFC PATCH-for-5.2] gitlab-ci: Do not automatically run Avocado integration tests anymore

2020-11-30 Thread Ademar Reis
On Mon, Nov 30, 2020 at 10:31:09AM +, Daniel P. Berrangé wrote:
> On Fri, Nov 27, 2020 at 07:29:31PM +0100, Thomas Huth wrote:
> > On 27/11/2020 18.57, Philippe Mathieu-Daudé wrote:
> > > On 11/27/20 6:47 PM, Thomas Huth wrote:
> > >> On 27/11/2020 18.41, Philippe Mathieu-Daudé wrote:
> > >>> We lately realized that the Avocado framework was not designed
> > >>> to be regularly run on CI environments. Therefore, as of 5.2
> > >>> we deprecate the gitlab-ci jobs using Avocado. To not disrupt
> > >>> current users, it is possible to keep the current behavior by
> > >>> setting the QEMU_CI_INTEGRATION_JOBS_PRE_5_2_RELEASE variable
> > >>> (see [*]).
> > >>> From now on, using these jobs (or adding new tests to them)
> > >>> is strongly discouraged.
> > >>>
> > >>> Tests based on Avocado will be ported to new job schemes during
> > >>> the next releases, with better documentation and templates.
> > >>
> > >> Why should we disable the jobs by default as long as there is no 
> > >> replacement
> > >> available yet?
> > > 
> > > Why keep it enabled if it is failing randomly
> > 
> > We can still disable single jobs if they are not stable, but that's no
> > reason to disable all of them by default, is it?
> 
> Agreed, the jobs which are known to be broken or unreliable should
> be unconditionally disabled in QEMU as a whole. This isn't specific
> to gitlab config - the qemu build makefiles/mesonfiles should disable
> the problem tests entirely, as we don't want developers wasting time
> running them locally either if they're known to be broken/unreliable.
> 

The problem is identifying when a test is broken/unreliable. No
complex test is 100% reliable: change the environment
(configuration, build options, platform, etc.) and any test complex
enough will start to fail. I've worked in projects orders of
magnitude simpler than QEMU and that was a golden rule. Testing QEMU
is *HARD*.

Which is why I advocate keeping test automation separate from CI:

 * Have a stable CI with tests cherry-picked by whoever is
   maintaining a particular CI runner (we shouldn't have orphan
   runners).

 * Have as many tests as possible in the git repo: maintained,
   reviewed and run (outside of a CI) by people who care about them.

I absolutely agree with you that maintainers and developers should
care and our goal should be a gating CI. The question is how to
create a strategy and a plan to get there. Forcing people to care
rarely works.

Thanks.
   - Ademar

-- 
Ademar Reis Jr
Red Hat





Re: [PATCH v2 4/6] migration: Add zstd support in multi-thread compression

2020-11-30 Thread Eric Blake
On 11/27/20 3:32 AM, Zeyu Jin wrote:
> This patch enables zstd option in multi-thread compression.
> 
> Signed-off-by: Zeyu Jin 
> Signed-off-by: Ying Fang 
> ---

> +++ b/qapi/migration.json
> @@ -536,7 +536,7 @@
>  #
>  ##
>  { 'enum': 'CompressMethod',
> -  'data': [ 'zlib' ] }
> +  'data': [ 'zlib', { 'name': 'zstd', 'if': 'defined(CONFIG_ZSTD)' } ] }

Missing documentation of the new value, including a '(since 6.0)' marker.
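
For instance, a documented entry along these lines would do (the wording here
is only a sketch, not a proposed final text):

    # @zlib: use zlib compression
    #
    # @zstd: use zstd compression (since 6.0)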

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 0/6] migration: Multi-thread compression method support

2020-11-30 Thread Eric Blake
On 11/27/20 3:36 AM, Zeyu Jin wrote:

Meta-comment: you appear to be having problems threading your series;
I've now seen three separate cover letters (RFC v1, v2 with no subject,
v2 with subject) and two series where each patch was a separate thread.
It is difficult to follow which messages are related when reading in a
mail client that sorts by most-recently-active thread first.  You may
want to investigate why your threading is not working, although I'd wait
to send v3 until you have actual changes to incorporate.

> Currently we have both multi-thread compression and multifd to optimize
> live migration in Qemu. Mulit-thread compression deals with the situation
> where network bandwith is limited but cpu resource adequate. Multifd instead

Not that typos in the cover letter matter, but this should be 'bandwidth'

> aims to take full advantage of network bandwith. Moreover it supports both
> zlib and zstd compression on each channel.
> 
> In this patch series, we did some code refactoring on multi-thread compression
> live migration and bring zstd compression method support for it.
> 
> Below is the test result of multi-thread compression live migration
> with different compress methods. Test result shows that zstd outperforms
> zlib by about 70%.
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism

2020-11-30 Thread Peter Xu
On Mon, Nov 30, 2020 at 11:11:00AM +0300, Andrey Gruzdev wrote:
> On 28.11.2020 01:28, Peter Xu wrote:
> > On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:
> > > Since reading UFFD events and saving paged data are performed
> > > from the same thread, write fault latencies are sensitive to
> > > migration stream stalls. Limiting total page saving rate is a
> > > method to reduce the amount of noticeable fault resolution latencies.
> > > 
> > > Migration bandwidth limiting is achieved via noticing cases of
> > > out-of-threshold write fault latencies and temporarily disabling
> > > (strictly speaking, severely throttling) saving non-faulting pages.
> > 
> > So have you done any measurements out of it, as we've talked in previous
> > version?  Thanks,
> > 
> 
> Sorry, not done yet.

So do you still plan to? :)

And if not, could you describe the rationale behind this patch?  For example,
what's the problem behind (e.g., guest hangs for xxx seconds, maybe?) and
what's the outcome (guest doesn't hang any more)?

Thanks,

-- 
Peter Xu




Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()

2020-11-30 Thread Peter Xu
On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote:
> > > +#ifdef CONFIG_LINUX
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns pointer to RAMBlock if found, NULL otherwise
> > > + *
> > > + * @rs: current RAM state
> > > + * @page_address: host page address
> > > + */
> > > +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr 
> > > page_address)
> > 
> > Reuse qemu_ram_block_from_host() somehow?
> > 
> 
> Seems not very suitable here, since we use rs->last_seen_block to restart
> search..

The search logic is actually still the same, it's just about which block to
start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one).  So
an internal helper to pass in that information would be nice.  Though that'll
require rcu lock taken before hand to keep the ramblock ptr valid.

No worry - we can keep this and work on top too, I think.
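
(To illustrate what such a helper could look like - a sketch only, with field
and macro names written from memory that would need checking against the tree,
and assuming the caller holds the RCU read lock:)

    /*
     * Sketch: find the RAMBlock containing host address @addr, starting the
     * walk at @start (rs->last_seen_block, ram_list.mru_block, or NULL for
     * the list head).  A real version would also wrap around past @start.
     */
    static RAMBlock *ram_block_from_host_addr(RAMBlock *start, uint64_t addr)
    {
        RAMBlock *block = start ? start : QLIST_FIRST_RCU(&ram_list.blocks);

        for (; block; block = QLIST_NEXT_RCU(block, next)) {
            if (addr - (uint64_t)(uintptr_t)block->host < block->used_length) {
                return block;
            }
        }
        return NULL;
    }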

[...]

> > > +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > > +{
> > > +struct uffd_msg uffd_msg;
> > > +hwaddr page_address;
> > > +RAMBlock *bs;
> > > +int res;
> > > +
> > > +if (!migrate_background_snapshot()) {
> > > +return NULL;
> > > +}
> > > +
> > > +res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> > > +if (res <= 0) {
> > > +return NULL;
> > > +}
> > > +
> > > +page_address = uffd_msg.arg.pagefault.address;
> > > +bs = ram_find_block_by_host_address(rs, page_address);
> > > +if (!bs) {
> > > +/* In case we couldn't find respective block, just unprotect 
> > > faulting page. */
> > > +uffd_protect_memory(rs->uffdio_fd, page_address, 
> > > TARGET_PAGE_SIZE, false);
> > > +error_report("ram_find_block_by_host_address() failed: 
> > > address=0x%0lx",
> > > +page_address);
> > 
> > Looks ok to error_report() instead of assert(), but I'll suggest drop the 
> > call
> > to uffd_protect_memory() at least.  The only reason to not use assert() is
> > because we try our best to avoid crashing the vm, however I really doubt
> > whether uffd_protect_memory() is the right thing to do even if it happens - 
> > we
> > may at last try to unprotect some strange pages that we don't even know 
> > where
> > it is...
> > 
> 
> IMHO better to unprotect these strange pages than to leave them protected by
> UFFD, to avoid getting the VM completely inoperable.
> At least we know the page generated wr-fault, maybe due to incorrect
> write-tracking initialization, or RAMBlock somehow has gone. Nevertheless if
> leave the page as is, VM would certainly lock.

Yes, makes some sense too.  However it's definitely a severe programming error,
even if the VM is unblocked, it can be in a very weird state...

Maybe we just assert? Then we can see how unlucky we'll be. :)
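
(Concretely, the assert variant would be nothing more than something like the
below, instead of the error_report() + unprotect branch - sketch only:)

    bs = ram_find_block_by_host_address(rs, page_address);
    /* A wr-fault outside any tracked RAMBlock would be a programming error. */
    assert(bs);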

> 
> Hmm, I wonder about assert(). In QEMU it would do something in release
> builds?

I guess yes, at least for now.  Because osdep.h has this:

/*
 * We have a lot of unaudited code that may fail in strange ways, or
 * even be a security risk during migration, if you disable assertions
 * at compile-time.  You may comment out these safety checks if you
 * absolutely want to disable assertion overhead, but it is not
 * supported upstream so the risk is all yours.  Meanwhile, please
 * submit patches to remove any side-effects inside an assertion, or
 * fixing error handling that should use Error instead of assert.
 */
#ifdef NDEBUG
#error building with NDEBUG is not supported
#endif
#ifdef G_DISABLE_ASSERT
#error building with G_DISABLE_ASSERT is not supported
#endif

[...]

> > > +/**
> > > + * ram_save_host_page_post: ram_save_host_page() post-notifier
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + * @opaque: opaque context value
> > > + * @res_override: pointer to the return value of ram_save_host_page(),
> > > + *   overwritten in case of an error
> > > + */
> > > +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > > +void *opaque, int *res_override)
> > > +{
> > > +/* Check if page is from UFFD-managed region. */
> > > +if (pss->block->flags & RAM_UF_WRITEPROTECT) {
> > > +#ifdef CONFIG_LINUX
> > > +ram_addr_t page_from = (ram_addr_t) opaque;
> > > +hwaddr page_address = (hwaddr) pss->block->host +
> > > +  ((hwaddr) page_from << TARGET_PAGE_BITS);
> > 
> > I feel like most new uses of hwaddr is not correct...  As I also commented 
> > in
> > the other patch.  We should save a lot of castings if switched.
> > 
> 
> Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t as
> uintptr_t. I any case UFFD interface relies on u64 type.

For example, page_from should be a host address; we can use unsigned long or
uint64_t, but ram_addr_t is not appropriate, since it is only used in the
ramblock address space.
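
(A tiny sketch of what I mean - purely illustrative, with made-up names:)

    /*
     * The raw fault address from UFFD stays a plain 64-bit host value;
     * ram_addr_t only appears for the offset inside the RAMBlock.
     */
    static ram_addr_t fault_addr_to_block_offset(const RAMBlock *block,
                                                 uint64_t fault_addr)
    {
        return (ram_addr_t)(fault_addr - (uintptr_t)block->host);
    }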

I think it's fine that e.g. we use common types like u

Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Paolo Bonzini

On 30/11/20 16:30, Daniel P. Berrangé wrote:

{ 'struct': 'QCryptoSecretCommon',
   'base': 'Object',
   'state': { 'rawdata': '*uint8_t',
  'rawlen': 'size_t' },
   'data': { '*format': 'QCryptoSecretFormat',
 '*keyid': 'str',
 '*iv': 'str' } }

{ 'struct': 'QCryptoSecret',
   'base': 'QCryptoSecretCommon',
   'data': { '*data': 'str',
 '*file': 'str' } }


No, please don't do this.  I want to keep writing C code, not a weird 
JSON syntax for C.


I'd much rather have a FooOptions field in my struct so that I can just 
do visit_FooOptions in the UserCreatable.complete() method, that's feasible.
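
For concreteness, roughly this shape (a sketch only: the "Foo" names are 
placeholders, visit_type_FooOptions_members() stands for the QAPI-generated 
helper, and today's UserCreatable.complete() does not actually take a Visitor, 
which is exactly the open design question):

    struct FooBackend {
        Object parent_obj;
        FooOptions opts;    /* QAPI-generated options struct, embedded as-is */
        /* ... live state that is not user-visible configuration ... */
    };

    static void foo_backend_complete(UserCreatable *uc, Visitor *v,
                                     Error **errp)
    {
        FooBackend *foo = FOO_BACKEND(uc);

        /* Fill the embedded QAPI struct straight from the input visitor. */
        visit_type_FooOptions_members(v, &foo->opts, errp);
    }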


Paolo




Re: [PATCH 00/18] qapi/qom: QAPIfy object-add

2020-11-30 Thread Kevin Wolf
Am 30.11.2020 um 16:30 hat Daniel P. Berrangé geschrieben:
> On Mon, Nov 30, 2020 at 03:58:23PM +0100, Paolo Bonzini wrote:
> > On 30/11/20 13:25, Kevin Wolf wrote:
> > > This series adds a QAPI type for the properties of all user creatable
> > > QOM types and finally makes QMP object-add use the new ObjectOptions
> > > union so that QAPI introspection can be used for user creatable objects.
> > > 
> > > After this series, there is at least one obvious next step that needs to be
> > > done: Change HMP and all of the command line parser to use
> > > ObjectOptions, too, so that the QAPI schema is consistently enforced in
> > > all external interfaces. I am planning to send another series to address
> > > this.
> > > 
> > > In a third step, we can try to start deduplicating and integrating things
> > > better between QAPI and the QOM implementation, e.g. by generating parts
> > > of the QOM boilerplate from the QAPI schema.
> > 
> > With this series it's basically pointless to have QOM properties at all.
> > Instead, you are basically having half of QEMU's backend data model into a
> > single struct.
> > 
> > So the question is, are we okay with shoveling half of QEMU's backend data
> > model into a single struct?  If so, there are important consequences.
> 
> In theory they should have the same set of options, but nothing in
> this series will enforce that. So we're introducing the danger that
> QMP object-add will miss some property, and thus be less functional
> than the CLI -object.  If we convert CLI -object  to use the QAPI
> parser too, we eliminate that danger, but we still have the struct
> duplication.

I think converting the CLI is doable in the short term. I already have
the patch for qemu-storage-daemon, but decided to keep it for a separate
series.

The most difficult part is probably -readconfig, but with Paolo's RFC
patches to move it away from QemuOpts, even that shouldn't be very hard.

> > 1) QOM basically does not need properties anymore except for devices and
> > machines (accelerators could be converted to QAPI as well). All
> > user-creatable objects can be changed to something like chardev's "get a
> > struct and use it to fill in the fields", and only leave properties to devices
> > and machines.
> > 
> > 2) User-creatable objects can have a much more flexible schema.  This means
> > there's no reason to have block device creation as its own command and
> > struct for example.
> > 
> > The problem with this series is that you are fine with deduplicating things
> > as a third step, but you cannot be sure that such deduplication is possible
> > at all.  So while I don't have any problems in principle with the
> > ObjectOptions concept, I don't think it should be committed without a clear
> > idea of how to do the third step.
> 
> I feel like we should at least aim to kill the struct duplication, even if
> we ignore the bigger QOM stuff like setters/getters/constructors/etc. The
> generated structs are not far off being usable.
> 
> eg for the secret object we have the QAPI schema
> 
> { 'struct': 'SecretCommonProperties',
>   'data': { '*loaded': { 'type': 'bool', 'features': ['deprecated'] },
> '*format': 'QCryptoSecretFormat',
> '*keyid': 'str',
> '*iv': 'str' } }
> 
> { 'struct': 'SecretProperties',
>   'base': 'SecretCommonProperties',
>   'data': { '*data': 'str',
> '*file': 'str' } }
> 
> IIUC this will result in a QAPI generated flattened struct:
> 
>   struct SecretProperties {
> bool loaded;
> QCryptoSecretFormat format;
> char *keyid;
> char *iv;
> char *data;
> char *file;
>   };
> 
> vs the QOM manually written structs
> 
>   struct QCryptoSecretCommon {
> Object parent_obj;
> uint8_t *rawdata;
> size_t rawlen;
> QCryptoSecretFormat format;
> char *keyid;
> char *iv;
>   };
> 
>   struct QCryptoSecret {
> QCryptoSecretCommon parent_obj;
> char *data;
> char *file;
>   };
> 
> The key differences
> 
>  - The parent struct is embedded, rather than flattened
>  - The "loaded" property doesn't need to exist
>  - Some extra fields are live state (rawdata, rawlen)
> 
> Lets pretend we just kill "loaded" entirely, so ignore that.
> 
> We could simply make QOM "Object" a well known built-in type, so
> we can reference it as a "parent". Then any struct with "Object"
> as a parent could use struct embedding rather flattening and thus
> just work.
> 
> Can we invent a "state" field for fields that are internal
> only, separate from the public "data" fields.
> 
> eg the secret QAPI def would only need a couple of changes:
> 
> { 'struct': 'QCryptoSecretCommon',
>   'base': 'Object',
>   'state': { 'rawdata': '*uint8_t',
>  'rawlen': 'size_t' },
>   'data': { '*format': 'QCryptoSecretFormat',
> '*keyid': 'str',
> '*iv': 'str' } }
> 
> { 'struct': 'QCryptoSecret',
>   'base': 'QCryptoSecretCommon',
>   'data': { '*data': 'str',
> '*file'

[PATCH v7] introduce vfio-user protocol specification

2020-11-30 Thread Thanos Makatos
This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Changed since v2:
  * fix whitespace

Changed since v3:
  * rename protocol to vfio-user
  * add table of contents
  * fix Unicode problems
  * fix typos and various reStructuredText issues
  * various stylistic improvements
  * add backend program conventions
  * rewrite part of intro, drop QEMU-specific stuff
  * drop QEMU-specific paragraph about implementation
  * explain that passing of FDs isn't necessary
  * minor improvements in the VFIO section
  * various text substitutions for the sake of consistency
  * drop paragraph about client and server, already explained in intro
  * drop device ID
  * drop type from version
  * elaborate on request concurrency
  * convert some inessential paragraphs into notes
  * explain why some existing VFIO defines cannot be reused
  * explain how to make changes to the protocol
  * improve text of DMA map
  * reword comment about existing VFIO commands
  * add reference to Version section
  * reset device on disconnection
  * reword live migration section
  * replace sys/vfio.h with linux/vfio.h
  * drop reference to iovec
  * use argz the same way it is used in VFIO
  * add type field in header for clarity

Changed since v4:
  * introduce support for live migration as defined in include/uapi/linux/vfio.h
  * introduce 'max_fds' and 'migration' capabilities:
  * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
  * fix minor typos and reworded some text for clarity

Changed since v5:
  * fix minor typos
  * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
  * clarify meaning of VFIO bitmap size field
  * move version major/minor outside JSON
  * client proposes version first
  * make Errno optional in message header
  * clarification about message ID uniqueness
  * clarify that server->client request can appear in between
client->server request/reply

Changed since v6:
  * put JSON strings in double quotes
  * clarify reply behavior on error
  * introduce max message size capability
  * clarify semantics when failing to map multiple DMA regions in a
single command

You can focus on v6 to v7 changes by cloning my fork
(https://github.com/tmakatos/qemu) and doing:

git diff refs/tags/vfio-user/v6 refs/heads/vfio-user/v7
---
 MAINTAINERS  |6 +
 docs/devel/index.rst |1 +
 docs/devel/vfio-user.rst | 1662 ++
 3 files changed, 1669 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 68bc160f41..6a4c662976 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1766,6 +1766,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+vfio-user
+M: John G Johnson 
+M: Thanos Makatos 
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index f10ed77e4c..2e2cba28c6 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -35,3 +35,4 @@ Contents:
clocks
qom
block-coroutine-wrapper
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 00..d15a228a1a
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1662 @@
+.. include:: 
+
+
+vfio-user Protocol Specification
+
+
+
+Version_ 0.1
+
+
+.. contents:: Table of Contents
+
+Introduction
+
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The `Linux VFIO ioctl interface 
`_
+has been chosen as the base for this protocol for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
+   largely reused.
+
+.. Note::
+   In a proof of concept implementation it has been demonstrated that using 
VFIO
+   over a UNIX domain socket is a viable option. vfio-user is designed with
+   QEMU in mind, however it could be used by other client applications. The
+   vfio-user protocol does not require that QEMU's VFIO client  im

Re: [PATCH for-6.0 6/6] qapi: Deprecate 'query-kvm'

2020-11-30 Thread Markus Armbruster
Peter Krempa  writes:

> On Mon, Nov 30, 2020 at 10:21:08 +0100, Markus Armbruster wrote:
>> Peter Krempa  writes:
>> 
>> > On Fri, Nov 27, 2020 at 16:44:05 +0100, Markus Armbruster wrote:
>> >> Peter Krempa  writes:
>
> [...]
>
> I know it's hard to enforce, but it's probably the cheapest in terms of
> the drawbacks any other solution would have.
>> 
>> We can and should try.  
>> 
>> Can we flag problematic interface changes automatically?  Semantic
>> changes, no.  What about changes visible in query-qmp-schema?
>
> I don't know the internals of qemu well enough, but from the perspective
> of a user of 'query-qmp-schema' it might be possible, but not without
> additional tooling.
>
> The output of query-qmp-schema is basically unreviewable when we are
> updating it. A small change in the schema may trigger a re-numbering of
> the internal type names so the result is a giant messy piece of JSON
> where it's impossible to differentiate changes from churn.

A structural comparison could fare better.

qapi-gen option -u might help:

  -u, --unmask-non-abi-names
expose non-ABI names in introspection

> I've played with generating/expanding all the possibilites and thus
> stripping the internal numbering for a tool which would simplify writing
> the query strings (a libvirtism for querying whether the QMP schema has
> something [1]). This tool could be used in this case to generate a fully
> expanded and sorted list which should in most cases be append only when
> new stuff is added. This could be then used to see whether something
> changed when we'd store the output and compare it against the new one.

Such an expansion is one way to compare structurally.  It reports
differences for "command C, argument A.B...".  Mapping to the QAPI
schema is straightforward.

> Unfortunately that would just make query-qmp-schema dumps more
> reviewable in libvirt though. A change in an interface would be noticed
> only after it hits qemu upstream.

Yes, implementing the comparison in the QEMU repository would be more
useful.

> [1] 
> https://gitlab.com/libvirt/libvirt/-/blob/08ae9e5f40f8bae0c3cf48f84181884ddd310fa0/src/qemu/qemu_qapi.c#L392
> 
> https://gitlab.com/libvirt/libvirt/-/blob/08ae9e5f40f8bae0c3cf48f84181884ddd310fa0/src/qemu/qemu_capabilities.c#L1512




Re: [PATCH 01/18] qapi/qom: Add ObjectOptions for iothread

2020-11-30 Thread Kevin Wolf
Am 30.11.2020 um 16:00 hat Paolo Bonzini geschrieben:
> On 30/11/20 13:25, Kevin Wolf wrote:
> > +##
> > +# @IothreadProperties:
> > +#
> > +# Properties for iothread objects.
> > +#
> > +# @poll-max-ns: the maximum number of nanoseconds to busy wait for events.
> > +#   0 means polling is disabled (default: 32768 on POSIX hosts,
> > +#   0 otherwise)
> > +#
> > +# @poll-grow: the multiplier used to increase the polling time when the
> > +# algorithm detects it is missing events due to not polling 
> > long
> > +# enough. 0 selects a default behaviour (default: 0)
> > +#
> > +# @poll-shrink: the divisor used to decrease the polling time when the
> > +#   algorithm detects it is spending too long polling without
> > +#   encountering events. 0 selects a default behaviour 
> > (default: 0)
> > +#
> > +# Since: 6.0
> > +##
> > +{ 'struct': 'IothreadProperties',
> > +  'data': { '*poll-max-ns': 'int',
> > +'*poll-grow': 'int',
> > +'*poll-shrink': 'int' } }
> > +
> 
> Documentation is the main advantage of the ObjectOptions concept. However,
> please use the version where each object and property was introduced for the
> "since" value.  Otherwise the documentation will appear to show that none of
> these objects was available before 6.0.
> 
> Yes, there is no documentation at all right now for QOM objects. However,
> wrong documentation sometimes is worse than non-existing documentation.

I think we generally use the version when the schema was introduced (so
blockdev-add has 2.9 for most things even if they existed before in
-drive and drive_add), but I agree that your suggestion is more useful.
And object-add isn't a new command, we're just giving it a schema now.

So let's first discuss the general direction, but if we agree on that,
using the version numbers where objects and properties were first
introduced makes sense.

Maybe if maintainers can include the right version numbers in the review
of the patch for "their" object, that would help me update the
patches.

Kevin




Re: [PATCH] virtio: reset device on bad guest index in virtio_load()

2020-11-30 Thread Michael S. Tsirkin


No, but how about sending the patch to me and the mailing list?
I didn't get it through either channel.

On Mon, Nov 30, 2020 at 03:51:13PM +, John Levon wrote:
> On Fri, Nov 20, 2020 at 06:51:07PM +, John Levon wrote:
> 
> > If we find a queue with an inconsistent guest index value, explicitly mark 
> > the
> > device as needing a reset - and broken - via virtio_error().
> > 
> > There's at least one driver implementation - the virtio-win NetKVM driver - 
> > that
> > is able to handle a VIRTIO_CONFIG_S_NEEDS_RESET notification and 
> > successfully
> > restore the device to a working state. Other implementations do not 
> > correctly
> > handle this, but as the VQ is not in a functional state anyway, this is 
> > still
> > worth doing.
> 
> Ping, anyone have issues with doing this?
> 
> cheers
> john
> 
> > Signed-off-by: John Levon 
> > ---
> >  hw/virtio/virtio.c | 15 +--
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index ceb58fda6c..eff35fab7c 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -3161,12 +3161,15 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, 
> > int version_id)
> >  nheads = vring_avail_idx(&vdev->vq[i]) - 
> > vdev->vq[i].last_avail_idx;
> >  /* Check it isn't doing strange things with descriptor 
> > numbers. */
> >  if (nheads > vdev->vq[i].vring.num) {
> > -qemu_log_mask(LOG_GUEST_ERROR,
> > -  "VQ %d size 0x%x Guest index 0x%x "
> > -  "inconsistent with Host index 0x%x: delta 
> > 0x%x",
> > -  i, vdev->vq[i].vring.num,
> > -  vring_avail_idx(&vdev->vq[i]),
> > -  vdev->vq[i].last_avail_idx, nheads);
> > +virtio_error(vdev, "VQ %d size 0x%x Guest index 0x%x "
> > + "inconsistent with Host index 0x%x: delta 
> > 0x%x",
> > + i, vdev->vq[i].vring.num,
> > + vring_avail_idx(&vdev->vq[i]),
> > + vdev->vq[i].last_avail_idx, nheads);
> > +vdev->vq[i].used_idx = 0;
> > +vdev->vq[i].shadow_avail_idx = 0;
> > +vdev->vq[i].inuse = 0;
> > +continue;
> >  }
> >  vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
> >  vdev->vq[i].shadow_avail_idx = vring_avail_idx(&vdev->vq[i]);



